Gzip decompressed size

If you have a large gzip file, say one whose uncompressed size is greater than 4G, try this:

$ gzip -l transactions.gz
  compressed uncompressed  ratio uncompressed_name
  3530291054   4085787831  13.5% transactions

Even though I know the actual uncompressed size is about 20G.

This is a well-known, long-standing issue. The root cause is that the gzip spec (RFC 1952) stores the uncompressed size in the last 4 bytes of the file (the ISIZE field) as a little-endian integer. The “little endian” part is not the problem; the real problem is that the field is a 32-bit “int” instead of a 64-bit “long”. The direct impact is that for any data larger than 4G, these 4 bytes can only hold the size modulo 2^32, i.e. the residual part. There is basically no easy way to recover the original uncompressed size without decompressing the file.
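To see exactly what gzip -l is reporting, you can read the ISIZE field yourself. Here is a minimal Scala sketch of the idea (gzipIsize is just a hypothetical helper name; transactions.gz is the file from the example above):

import java.io.RandomAccessFile
import java.nio.{ByteBuffer, ByteOrder}

// Read the ISIZE field: the last 4 bytes of the gzip file, which store the
// uncompressed size modulo 2^32 as a little-endian unsigned 32-bit integer.
def gzipIsize(path: String): Long = {
  val raf = new RandomAccessFile(path, "r")
  try {
    raf.seek(raf.length() - 4)
    val buf = new Array[Byte](4)
    raf.readFully(buf)
    ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN).getInt & 0xFFFFFFFFL
  } finally raf.close()
}

// For a file larger than 4G this prints the same truncated value that
// gzip -l shows, not the real size.
println(gzipIsize("transactions.gz"))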

The same issue is inherited by some Hadoop 1.x versions, and therefore by Spark binaries built on those versions of the Hadoop library. So you will get the error

java.io.IOException: stored gzip size doesn't match decompressed size

when you try to load a gzipped file through

val rdd = sparkContext.textFile("transactions.gz")

Switching to a Spark build with Hadoop 2.x solves the problem; I suspect that version doesn’t even bother to check the last 4 bytes of the gzipped file.
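If you do need the real uncompressed size, the only reliable way is to decompress and count the bytes. A minimal Scala sketch of that approach (decompressedSize is just a hypothetical helper name):

import java.io.{BufferedInputStream, FileInputStream}
import java.util.zip.GZIPInputStream

// Compute the true uncompressed size by streaming the whole file through
// GZIPInputStream and counting the bytes. Slow for a ~20G file, but exact.
def decompressedSize(path: String): Long = {
  val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(path)))
  val buf = new Array[Byte](1 << 20)
  var total = 0L
  try {
    var n = in.read(buf)
    while (n != -1) {
      total += n
      n = in.read(buf)
    }
  } finally in.close()
  total
}

println(decompressedSize("transactions.gz"))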
