If you have a large gzip file, say one whose uncompressed size is over 4G, try this:
$ gzip -l transactions.gz
         compressed        uncompressed  ratio uncompressed_name
         3530291054          4085787831  13.5% transactions
Yet I know my uncompressed file is about 20G.
This is a well-known, long-standing issue. The root cause is that in the gzip spec, the last 4 bytes of the file (the ISIZE field) store the uncompressed size in little-endian order. The "little endian" part is not the problem; the real problem is that the field is a 32-bit integer instead of a 64-bit one. The direct impact is that for any data larger than 4G, these 4 bytes can only store the size modulo 2^32 (here presumably 4085787831 + 4 × 2^32 ≈ 19.8G, which is consistent with my ~20G file). There is basically no easy way to recover the original uncompressed size without decompressing the file.
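You can see this for yourself by reading those 4 bytes directly. Below is a minimal sketch in Scala (matching the Spark example later in this post); the object name is just illustrative and the file name is the one from my example above.

import java.io.RandomAccessFile

// A rough sketch: read the ISIZE field, i.e. the last 4 bytes of a gzip file,
// which hold the uncompressed size modulo 2^32 in little-endian order.
object GzipIsize {
  def main(args: Array[String]): Unit = {
    val raf = new RandomAccessFile("transactions.gz", "r")
    try {
      raf.seek(raf.length - 4)
      val b = new Array[Byte](4)
      raf.readFully(b)
      // Assemble the little-endian unsigned 32-bit value.
      val isize = (b(0) & 0xFFL) |
                  ((b(1) & 0xFFL) << 8) |
                  ((b(2) & 0xFFL) << 16) |
                  ((b(3) & 0xFFL) << 24)
      // For my ~20G file this prints 4085787831, the size modulo 4G,
      // which is exactly what gzip -l reported above.
      println(isize)
    } finally raf.close()
  }
}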
The same issue was actually inherited by some Hadoop 1.x versions, and in turn by Spark binaries built on those versions of the Hadoop library. So you will get the error
java.io.IOException: stored gzip size doesn't match decompressed size
when you try to load a gzipped file through
val rdd = sparkContext.textFile("transactions.gz")
Switching to a Spark build based on Hadoop 2.x solves the problem; I suspect it doesn't even bother to check the last 4 bytes of the gzipped file.
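If switching builds is not an option and you just need the real uncompressed size, the only reliable way is to stream through the whole file, as noted above. A rough sketch under the same assumptions as before (slow for a ~20G file, but it gives the size that gzip -l cannot report):

import java.io.{BufferedInputStream, FileInputStream}
import java.util.zip.GZIPInputStream

// Count the decompressed bytes by streaming through the gzip file.
object TrueSize {
  def main(args: Array[String]): Unit = {
    val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream("transactions.gz")))
    val buf = new Array[Byte](1 << 20)
    var total = 0L
    var n = in.read(buf)
    while (n != -1) {
      total += n
      n = in.read(buf)
    }
    in.close()
    println(s"true uncompressed size: $total bytes")
  }
}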