Gzip decompressed size

If you have a large gzip file, say one whose uncompressed size is greater than 4G, try this:

$ gzip -l transactions.gz
  compressed uncompressed  ratio uncompressed_name
  3530291054   4085787831  13.5% transactions

Even though I know the actual uncompressed size is about 20G.

This is a well-known, long-standing issue. The root cause is that the gzip spec (RFC 1952) stores the uncompressed size in the last 4 bytes of the file (the ISIZE field) as a little-endian integer. The “little endian” part is not the problem; the real problem is that the field is a 32-bit “int” instead of a 64-bit “long”. The direct impact is that for any data larger than 4G, these 4 bytes can only hold the size modulo 2^32, i.e. the residual part. There is basically no easy way to recover the original uncompressed size without decompressing the file.
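To see exactly what gzip -l is reporting, you can read the ISIZE field yourself. Here is a minimal Scala sketch of the idea (gzipIsize is just a hypothetical helper name; transactions.gz is the file from the example above):

import java.io.RandomAccessFile
import java.nio.{ByteBuffer, ByteOrder}

// Read the ISIZE field: the last 4 bytes of the gzip file, which store the
// uncompressed size modulo 2^32 as a little-endian unsigned 32-bit integer.
def gzipIsize(path: String): Long = {
  val raf = new RandomAccessFile(path, "r")
  try {
    raf.seek(raf.length() - 4)
    val buf = new Array[Byte](4)
    raf.readFully(buf)
    ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN).getInt & 0xFFFFFFFFL
  } finally raf.close()
}

// For a file larger than 4G this prints the same truncated value that
// gzip -l shows, not the real size.
println(gzipIsize("transactions.gz"))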

The same issue is inherited by some Hadoop 1.x versions, and therefore by Spark binaries built on those versions of the Hadoop library. So you will get the error

java.io.IOException: stored gzip size doesn't match decompressed size

when you try to load a gzipped file through

val rdd = sparkContext.textFile("transactions.gz")

Switching to a Spark build with Hadoop 2.x solves the problem; I suspect that version doesn’t even bother to check the last 4 bytes of the gzipped file.
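If you do need the real uncompressed size, the only reliable way is to decompress and count the bytes. A minimal Scala sketch of that approach (decompressedSize is just a hypothetical helper name):

import java.io.{BufferedInputStream, FileInputStream}
import java.util.zip.GZIPInputStream

// Compute the true uncompressed size by streaming the whole file through
// GZIPInputStream and counting the bytes. Slow for a ~20G file, but exact.
def decompressedSize(path: String): Long = {
  val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(path)))
  val buf = new Array[Byte](1 << 20)
  var total = 0L
  try {
    var n = in.read(buf)
    while (n != -1) {
      total += n
      n = in.read(buf)
    }
  } finally in.close()
  total
}

println(decompressedSize("transactions.gz"))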
