I am pretty new to HDFS and am looking for opinions on some conflicting answers I have gotten recently.
1. Is it a good idea to compress the stream when writing a file out to Hadoop? One person told me they got a 10x benefit from doing this. Another told me that compressing is bad because the MapReduce jobs that run on the file cannot be distributed when the file is compressed.
2. I read that MapReduce jobs on Hadoop work best with file sizes between 500 GB and terabyte-sized files. Someone else told me that it works better with smaller files.
According to Hadoop: The Definitive Guide, "All compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of smaller space savings."
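Hadoop supports several compression codecs, so you can pick one based on your needs, i.e. whether you want better performance, better space savings, or a balance of both. On the distribution concern: it depends on whether the format is splittable. A plain gzip file cannot be split, so a single map task has to read the whole file, while bzip2 files and block-compressed container formats such as SequenceFile can still be processed in parallel.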
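To get a feel for the space/time trade-off before picking a codec, here is a rough sketch in Python. It uses only the standard-library codecs (zlib, bz2, lzma) as stand-ins, not the Hadoop codec classes, and the sample data and the helper names `measure` and `compare_codecs` are made up for illustration. Timings will vary by machine and data, so treat the numbers as indicative only.

```python
import bz2
import lzma
import time
import zlib


def measure(name, compress, decompress, data):
    """Return (name, compressed size in bytes, seconds spent compressing)."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    # Round-trip check: decompression must give back the original bytes.
    assert decompress(packed) == data
    return name, len(packed), elapsed


def compare_codecs(data):
    """Run the same data through a few codecs and settings."""
    codecs = [
        ("zlib level 1", lambda d: zlib.compress(d, 1), zlib.decompress),
        ("zlib level 9", lambda d: zlib.compress(d, 9), zlib.decompress),
        ("bz2", bz2.compress, bz2.decompress),
        ("lzma", lzma.compress, lzma.decompress),
    ]
    return [measure(n, c, d, data) for n, c, d in codecs]


# Hypothetical sample data: repetitive log-style lines.
sample = b"".join(
    b"2024-01-01 INFO user=%d action=click\n" % i for i in range(20000)
)

if __name__ == "__main__":
    print(f"original: {len(sample)} bytes")
    for name, size, seconds in compare_codecs(sample):
        print(f"{name:13s} {size:8d} bytes  {seconds * 1000:7.1f} ms")
```

Running it on a sample of your own data is more telling than any general rule, since the ratio you get depends heavily on how repetitive the data is.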
Hadoop works well with large files. If you use Hadoop to store and process many small files, you run into two problems:
i) The load on the NameNode increases, because more small files means more metadata that the NameNode has to store and manage.
ii) Blocks are not used to capacity. Every small file gets its own block, and typically its own map task, no matter how little data it holds, so per-block and per-task overhead starts to dominate.
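A back-of-the-envelope sketch of the small-files cost, in Python. It assumes a 128 MB block size and the commonly quoted rule of thumb of roughly 150 bytes of NameNode memory per file or block object; both are assumptions you should replace with your cluster's actual settings, and the function name `estimate` is made up for this example.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # assumed HDFS block size (128 MB)
BYTES_PER_OBJECT = 150           # rough rule of thumb, not an exact figure


def estimate(total_bytes, file_size, block_size=BLOCK_SIZE):
    """Estimate file count, block count, and NameNode metadata bytes
    for total_bytes of data stored as equally sized files."""
    files = total_bytes // file_size
    # Ceiling division: even a tiny file needs one block entry.
    blocks_per_file = max(1, -(-file_size // block_size))
    blocks = files * blocks_per_file
    metadata_bytes = (files + blocks) * BYTES_PER_OBJECT
    return files, blocks, metadata_bytes


TB = 1024 ** 4
MB = 1024 ** 2
GB = 1024 ** 3

if __name__ == "__main__":
    for label, size in (("1 MB files", MB), ("1 GB files", GB)):
        files, blocks, meta = estimate(TB, size)
        print(f"1 TB as {label}: {files} files, {blocks} blocks, "
              f"~{meta / MB:.1f} MB of NameNode metadata")
```

Under these assumptions, 1 TB stored as 1 MB files comes to about a million files and a million blocks, against 1,024 files and 8,192 blocks when stored as 1 GB files. Since a job usually gets about one map task per block, the block count also hints at how many tasks would be launched.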