I have to process data in very large text files (around 5 TB in size). The processing logic uses SuperCSV to parse through the data and run some checks on it. Since the size is quite large, we planned on using Hadoop to take advantage of parallel computation. I installed Hadoop on my machine and started writing the mapper and reducer classes, but I am stuck: the map function requires a key/value pair, and I am not sure what the key and value should be when reading this text file. Can someone help me out with that?
My thought process is something like this (let me know if I am correct):
1) Read the file using SuperCSV and have Hadoop generate the SuperCSV beans for each chunk of the file in HDFS (I am assuming that Hadoop takes care of splitting the file).
2) For each of these SuperCSV beans, run my check logic.
The map-reduce approach requires the use of keys and values.
If you want to process tasks in parallel, you have to be able to split the work somehow, so you need to define a key and a value.
For example, each row of your CSV could have ...
id, user, type, etc, etc, etc
So you could use the id as the key and the entire row as the value.
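With the default TextInputFormat, Hadoop already hands the mapper a key/value pair for free: the key is the byte offset of the line and the value is the line of text, and your mapper can then re-key the data. Here is a minimal sketch of that idea (the class name is made up, I am assuming the id is the first column, and a plain split stands in for a full SuperCSV parse):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a mapper over a line-oriented CSV split by TextInputFormat.
// Input key = byte offset of the line, input value = the line itself.
public class CsvRowMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String row = line.toString();
        // Assumption: the id is the first column. A SuperCSV CsvBeanReader
        // could replace this naive split if the rows need real parsing.
        String id = row.split(",", 2)[0];

        // Emit the id as the key and the whole row as the value, so all
        // checks (or later grouping) for one id land in one place.
        context.write(new Text(id), new Text(row));
    }
}
```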
Thank you, Pablo. That helps. Another question which I have is:
I am sure that we can call a map-reduce job from a normal Java application. Now the map-reduce jobs in my case have to deal with files on HDFS and also files on another file system. Is it possible in Hadoop to access files from another file system while simultaneously using the files on HDFS?
So basically my intention is that I have one large file which I want to put in HDFS for parallel computing, and then compare the blocks of this file with some other files (which I do not want to put in HDFS, because they need to be accessed as full-length files all at once).
Nikhil Das Nomula wrote:Thank you, Pablo. That helps. Another question which I have is:
I am sure that we can call a map-reduce job from a normal Java application. Now the map-reduce jobs in my case have to deal with files on HDFS and also files on another file system. Is it possible in Hadoop to access files from another file system while simultaneously using the files on HDFS?
So basically my intention is that I have one large file which I want to put in HDFS for parallel computing, and then compare the blocks of this file with some other files (which I do not want to put in HDFS, because they need to be accessed as full-length files all at once).
The Hadoop architecture is oriented towards having several nodes working in parallel, and HDFS automatically replicates data when necessary, handles concurrent access, etc.
So, if you want to access your "external" file from Hadoop, you should model the access as if it were a database: a single entry point, managing concurrent access, etc.
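For what it is worth, Hadoop's FileSystem API resolves a path by its URI scheme, so a job can read HDFS input while also opening a file on another file system. A rough sketch (the file:// path is hypothetical and assumes the same mount is visible from every node):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of opening a non-HDFS file from inside a job. FileSystem.get()
// picks the implementation from the URI scheme, so the same API reads
// hdfs://, file://, and other supported schemes.
public class ExternalFileAccess {

    public static BufferedReader openExternal(Configuration conf) throws IOException {
        // Hypothetical path on a shared mount reachable from every node.
        URI external = URI.create("file:///mnt/shared/lookup.csv");
        FileSystem fs = FileSystem.get(external, conf);
        return new BufferedReader(new InputStreamReader(fs.open(new Path(external))));
    }
}
```

Keep in mind the single-entry-point caveat above: every mapper reading this file goes through the same external system, so that is also where the load concentrates.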
In a map-reduce application, you want to keep network IO to a minimum. Any grid-enabled application will have negative scalability if it is network IO bound. So, it's great that you are splitting one large file, because that ensures that only parts of the file go to each node, which reduces network IO. However, if each node then downloads another, equally large file, you start loading the network with that file being downloaded to all your nodes, and you lose the advantage you gained by splitting the first file. Also, the external file system that hosts the second file becomes your bottleneck and single point of failure.
You might want to think about your map operation doing the work of "merging" the two files while it is splitting the first file. Basically, as it reads each row from the first file, look up the related records in the second file and send the records from the first file along with the related records from the second file. You might want to pre-sort the second file, if possible, to make the lookups efficient.
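Concretely, a map-side merge along those lines might look like the sketch below. It assumes the lookup records fit in memory on each node and are keyed by their first column; the file name and the use of setup() to cache the records are just illustrative (the second file could, for example, be shipped to the nodes via the distributed cache):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side "merge": each mapper caches the lookup records once in setup(),
// then joins them to the rows of the big, split file as it maps.
public class MergingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Read the second file once per mapper (not once per row), e.g. from a
        // local copy placed in the working directory by the distributed cache.
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",", 2);
        if (parts.length < 2) {
            return; // skip malformed rows in this sketch
        }
        String related = lookup.get(parts[0]);
        if (related != null) {
            // Emit the row from the big file together with its related record,
            // so downstream checks see both sides at once.
            context.write(new Text(parts[0]), new Text(parts[1] + "," + related));
        }
    }
}
```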
Design decisions like these depend a lot on the relationship between the data in the two files. Before you decide to use a shared file system, you might want to consider the impact on scalability. Whatever you do, don't become network IO bound.