Hive Gzip Compression splitting supported now?
Joined: May 21, 2009
Dec 19, 2012 18:41:28
Does Hadoop now automatically support splitting gzip files into blocks? I have read that splitting doesn't work for Hive tables that use gzip compression here:
From the above link: "in this case Hadoop will not be able to split your file into chunks/blocks and run multiple maps in parallel. This can cause under-utilization of your cluster's 'mapping' power."
However, when I load my table exactly as they describe, I notice that the .gz file I load is definitely split up into blocks in the directory where HDFS stores its data. It looks like this after doing the load:
[current]$ pwd
/foo/dev/hadoop/HDFS/dfs/data/current
[current]$ ls -la
-rw-r--r--. 1 foo bar      671 Dec 19 15:34 blk_105489922789526087_1136.meta
-rw-r--r--. 1 foo bar 67108864 Dec 19 14:26 blk_-1527019105370167199
-rw-r--r--. 1 foo bar   524295 Dec 19 14:26 blk_-1527019105370167199_1075.meta
-rw-r--r--. 1 foo bar 67108864 Dec 19 14:26 blk_-226975864542913836
-rw-r--r--. 1 foo bar   524295 Dec 19 14:26 blk_-226975864542913836_1075.meta
-rw-r--r--. 1 foo bar 67108864 Dec 19 14:26 blk_-2476094208541673790
It is clearly chopping the file up into 64 MB blocks (67108864 bytes each) during the load to HDFS.
Is this something they have added recently? I'm using Hadoop 1.0.4, r1393290, in pseudo-distributed mode.
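A possible way to reconcile the listing with the quoted statement: HDFS cuts every file into fixed-size blocks by byte count, whatever its format, while the quoted limitation is about whether a map task can start decoding at a block boundary. The sketch below is plain Java, not Hadoop code, and only illustrates that second point. The class name `GzipSplitDemo`, the sample rows, and `BLOCK_SIZE` (1 KB standing in for the 67108864-byte blocks above) are all made up for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSplitDemo {

    // Hypothetical block size for the demo (the listing above shows 67108864-byte blocks).
    static final int BLOCK_SIZE = 1024;

    // Build some made-up CSV-like rows so the compressed stream spans several blocks.
    static byte[] sampleRows() {
        Random rnd = new Random(42);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            sb.append(i).append(',').append(rnd.nextLong()).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compress the bytes into one gzip stream, as a single .gz file would be.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Return true if a reader that starts at the given byte offset can decode the rest as gzip.
    static boolean canDecompressFrom(byte[] compressed, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(compressed, offset, compressed.length - offset))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // Discard the output; we only care whether decoding succeeds.
            }
            return true;
        } catch (IOException e) {
            // Typically "Not in GZIP format": there is no gzip header at this offset.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] compressed = gzip(sampleRows());
        // Chopping by byte count always works, regardless of the file format.
        int blocks = (compressed.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        System.out.println("compressed bytes: " + compressed.length + ", blocks: " + blocks);
        for (int b = 0; b < blocks; b++) {
            int offset = b * BLOCK_SIZE;
            System.out.println("decode starting at offset " + offset + ": "
                    + canDecompressFrom(compressed, offset));
        }
    }
}
```

In this sketch only the reader that starts at offset 0 should be able to decode anything; readers that start at later block boundaries find no gzip header. That matches the quoted claim about a single map task handling the whole file even though the file is stored as many blocks.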