PUBLIC OBJECT

A non-scientific survey of ZIP Metadata

Okio has an openZip() function that views a .zip file as a FileSystem. To expose file metadata in the file system API, Okio must extract it from the .zip file.

There are at least four ways to encode timestamps in a .zip file. I wanted to find out which of these formats are used in practice, so I can support them in Okio. I was also curious about what other metadata is used.

I used my Mac’s disk as a source of sample .zip files. I scanned 6.2 million files on my Mac, of which 50,861 successfully decoded as .zip files. My program attempted to decode every single file as a ZIP, regardless of its extension! I do lots of Android, JVM, and Kotlin work on this Mac so my sample is biased towards the ZIPs used in these ecosystems, including .jar, .apk, and .klib files.
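The scanning program itself isn't shown here. As a rough sketch of the idea, Python's standard zipfile module exposes each entry's raw extra-field bytes (as stored in the central directory), and those bytes can be walked as a series of records. This assumes every record is a 2-byte little-endian tag followed by a 2-byte little-endian length and then the payload; the helper names iter_extra_fields and tags_in_zip are made up for illustration.

```python
import struct
import zipfile


def iter_extra_fields(extra: bytes):
    """Yield (tag, payload) pairs from a ZIP entry's extra-field bytes."""
    offset = 0
    while offset + 4 <= len(extra):
        # Assumed record layout: uint16 tag, uint16 payload length, payload.
        tag, size = struct.unpack_from("<HH", extra, offset)
        offset += 4
        yield tag, extra[offset:offset + size]
        offset += size


def tags_in_zip(source):
    """Return the set of extra-field tags used by any entry in the archive.

    `source` is a path or a file-like object. Returns None when it does not
    decode as a ZIP, so any file can be tried regardless of its extension.
    """
    try:
        with zipfile.ZipFile(source) as archive:
            return {
                tag
                for info in archive.infolist()
                for tag, _ in iter_extra_fields(info.extra)
            }
    except (zipfile.BadZipFile, OSError):
        return None
```

Counting how many archives return each tag across a directory walk would give per-tag totals like the ones below.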

Here's what I found...

Timestamps

32-bit DOS timestamps (Built-in)

Every ZIP file includes a bit-packed last modified time. It spends 16 bits on the date and another 16 bits on the time:

  • 7-bit year
  • 4-bit month
  • 5-bit day
  • 5-bit hour
  • 6-bit minute
  • 5-bit second

Spending only 7 bits on the year means that this format can't express dates before 1980 or after 2107. Our industry is great at hiding time bombs in things!

Spending only 5 bits on the second means that this format rounds (or truncates?!) seconds to an even number: your file could be last modified at 11:45:30 or 11:45:32, but not at 11:45:31.
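Here's a minimal sketch of that bit layout in Python. The function names are hypothetical, and the encoder assumes the writer truncates odd seconds; some writers may round instead.

```python
from datetime import datetime


def decode_dos_datetime(dos_date: int, dos_time: int) -> datetime:
    """Decode a 16-bit DOS date and 16-bit DOS time into a naive datetime.

    Raises ValueError for impossible values such as month 0 or day 0.
    """
    year = 1980 + ((dos_date >> 9) & 0x7F)  # 7 bits: 1980..2107
    month = (dos_date >> 5) & 0x0F          # 4 bits
    day = dos_date & 0x1F                   # 5 bits
    hour = (dos_time >> 11) & 0x1F          # 5 bits
    minute = (dos_time >> 5) & 0x3F         # 6 bits
    second = (dos_time & 0x1F) * 2          # 5 bits, counted in 2-second steps
    return datetime(year, month, day, hour, minute, second)


def encode_dos_datetime(value: datetime) -> tuple:
    """Pack a datetime into (dos_date, dos_time), truncating odd seconds."""
    if not 1980 <= value.year <= 2107:
        raise ValueError("DOS timestamps only cover 1980 through 2107")
    dos_date = ((value.year - 1980) << 9) | (value.month << 5) | value.day
    dos_time = (value.hour << 11) | (value.minute << 5) | (value.second // 2)
    return dos_date, dos_time
```

Round-tripping 11:45:31 through these two functions yields 11:45:30, because the odd second has nowhere to go.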

No bits are spent on the time zone, and the convention is to encode the time in the writing computer's local time zone. When a person in Sydney sends GraysAlmanac.zip to a friend in San Francisco, the contents’ last-modified timestamps could be in the future!

Extended Timestamp (0x5455, 4,826 files)

This optional field holds seconds-since-epoch for last modified, last accessed, and created at timestamps. Each of these fields is optional.

They're stored as signed 32-bit integers and can store dates between 1901-12-13 and 2038-01-19. Library authors must act now to defuse this time bomb!
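A sketch of decoding this field, assuming the layout that Info-ZIP documents: one flags byte (bit 0 for modified, bit 1 for accessed, bit 2 for created) followed by one 4-byte little-endian value for each timestamp that is present, in that order. The function name is hypothetical. The copy in the central directory may carry fewer values than its flags suggest, so this parser simply stops when the bytes run out.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_extended_timestamp(payload: bytes) -> dict:
    """Parse the payload of a 0x5455 record (the bytes after tag and length)."""
    result = {}
    if not payload:
        return result
    flags = payload[0]
    offset = 1
    for bit, name in ((0x01, "modified"), (0x02, "accessed"), (0x04, "created")):
        # Skip fields whose flag is unset or whose bytes are missing.
        if flags & bit and offset + 4 <= len(payload):
            (seconds,) = struct.unpack_from("<i", payload, offset)
            result[name] = EPOCH + timedelta(seconds=seconds)
            offset += 4
    return result
```

Feeding it the largest and smallest signed 32-bit values reproduces the 2038-01-19 and 1901-12-13 limits mentioned above.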

NTFS Timestamp (0x000a, 5 files)

I found just 3 distinct files (plus 2 duplicates) with this metadata:

  • A file named github-vector-logo.zip that I downloaded from seeklogo.com in 2022.
  • Android Studio’s bundled ffmpeg and javacpp libraries.
  • A sample file for testing NTFS timestamp handling in rubyzip. (Today I learned that when I use a Ruby Gem, that library's test suite could be saved to my computer?!)

The NTFS timestamp format doesn’t have any time bombs in it.
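That's because NTFS timestamps are 64-bit counts of 100-nanosecond ticks since 1601-01-01 UTC. Here's a small conversion sketch; the function name is hypothetical, and Python's datetime only has microsecond precision and stops at year 9999, so it covers far less than the format itself can express.

```python
from datetime import datetime, timedelta, timezone

NTFS_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def ntfs_ticks_to_datetime(ticks: int) -> datetime:
    """Convert 100-nanosecond ticks since 1601-01-01 UTC to a datetime."""
    # Integer division avoids float rounding; the final 100 ns digit is dropped.
    return NTFS_EPOCH + timedelta(microseconds=ticks // 10)
```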

UNIX Timestamp (0x000d, 0 files)

This is similar to the extended timestamp metadata. This format also suffers from the 2038 problem.

Everything Else

Zip64 (0x0001, 2,031 files)

This metadata is present for .zips with more than 65,535 entries and also for .zips with any entry larger than 4 GiB.

Executable .jar Metadata (0xcafe, 3,196 files)

When creating a .zip with JarOutputStream, this metadata is added. Apparently it’s a hack from 1999 to support executable JAR files on Solaris.

Info-ZIP UNIX Metadata

I found three revisions of this field for storing UNIX metadata like the file’s owning user and group IDs:

  • 0x5855: 41 files
  • 0x7855: 12 files
  • 0x7875: 244 files
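As an illustration, here's a sketch of decoding the newest revision, 0x7875. It assumes the layout that Info-ZIP documents: a version byte (currently 1), a UID size byte, the UID as a little-endian integer of that size, a GID size byte, and then the GID. The function name is hypothetical, and the two older revisions use different layouts that this doesn't handle.

```python
def parse_unix_ids(payload: bytes) -> tuple:
    """Parse a 0x7875 payload (the bytes after tag and length) into (uid, gid)."""
    version = payload[0]
    if version != 1:
        raise ValueError(f"unsupported version: {version}")
    uid_size = payload[1]
    uid_end = 2 + uid_size
    uid = int.from_bytes(payload[2:uid_end], "little")
    gid_size = payload[uid_end]
    gid_start = uid_end + 1
    gid = int.from_bytes(payload[gid_start:gid_start + gid_size], "little")
    return uid, gid
```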

UTF-8 Path Name (0x7075, 16 files)

This is necessary for .zip entries whose file names are non-ASCII. Okio doesn't support these.

None of the .zips that used this field did so usefully! That is, they encoded a UTF-8 name that was identical to the ASCII name. (I’d probably get different results if I spoke more languages.)

It’s unclear to me whether this field specifies a full path or just a file name. I saw both approaches in the .zip files on my Mac. Yuck.

I assume that abusing UTF-8 paths in .zip files is a great way to find bugs and vulnerabilities!

Microsoft Open Packaging (0xa220, 7 files)

This metadata is used to include gaps after .zip entries so they can be updated in-place. I only saw this used in Microsoft Office .xlsx and .docx files.

Zipalign (0xd935, 9 files)

This is used by Android’s zipalign tool to word-align the start of entries.

Next Steps

I’m surprised that I only found 10 distinct metadata tags!

I think it’s prudent to support NTFS timestamps in ZIP libraries like Okio, to defend against the 2038 and 2107 time bombs. (I hope that Internet research confirms this is already the general consensus?!)

I’m going to skip UTF-8 support for now, as it breaks my files-are-implementation-details hypothesis. I’m open to having my mind changed on this!