
Using Hadoop to process large text files along with CSV

 
Nikhil Das Nomula
Greenhorn
Posts: 26
I have to process data in very large text files (around 5 TB in size). The processing logic uses Super CSV to parse the data and run some checks on it. Since the files are so large, we plan to use Hadoop to take advantage of parallel computation. I installed Hadoop on my machine and started writing the mapper and reducer classes, but I am stuck. The map step requires a key/value pair, and I am not sure what the key and value should be when reading this text file. Can someone help me out with that?

My thought process is something like this (let me know if I am correct):
1) Read the file using Super CSV and have Hadoop generate the Super CSV beans for each chunk of the file in HDFS. (I am assuming that Hadoop takes care of splitting the file.)
2) For each of these beans, run my check logic.
 
Pablo Abbate
Ranch Hand
Posts: 30
Java Spring Ubuntu
The MapReduce approach requires keys and values.
To process tasks in parallel, you have to be able to split the work somehow, so you need to define a key and a value.

For example, each row of your CSV could have...
id, user, type, etc.

So you could use id as the key and the entire row as the value.
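
To make the suggestion above concrete, here is a minimal sketch in plain Java, not the Hadoop API. With Hadoop's default TextInputFormat, a mapper receives the byte offset of each line as the key and the line itself as the value, so the id would typically be extracted inside the map step and emitted as the output key. The class name RowKeyExtractor and the sample row are hypothetical, and the sketch assumes the id is the first column and is not quoted.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java stand-in for the body of a map step; no Hadoop classes are used.
class RowKeyExtractor {

    // Takes the text before the first comma as the key and keeps the whole
    // row as the value. Assumption: the id is the first, unquoted column.
    static Map.Entry<String, String> toKeyValue(String row) {
        int comma = row.indexOf(',');
        String id = (comma < 0) ? row : row.substring(0, comma);
        return new SimpleEntry<>(id.trim(), row);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = toKeyValue("42,alice,admin");
        // prints "42 -> 42,alice,admin"
        System.out.println(kv.getKey() + " -> " + kv.getValue());
    }
}
```

In a real job the row would more likely be parsed with a CSV library, since a naive split on commas breaks on quoted fields.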

Hope it helps you,
 
Nikhil Das Nomula
Greenhorn
Posts: 26
Thank you Pablo, that helps. I have another question.

I am sure that we can launch a MapReduce job from a normal Java application. In my case, the MapReduce jobs have to deal with files on HDFS and also with files on another file system. Is it possible in Hadoop to access files on another file system while also using files on HDFS?

Basically, I have one large file that I want to put in HDFS for parallel computing, and I want to compare the blocks of this file with some other files (which I do not want to put in HDFS because they need to be accessed as full-length files at once).
 
Pablo Abbate
Ranch Hand
Posts: 30
Java Spring Ubuntu
The Hadoop architecture is designed to have several nodes working in parallel. HDFS automatically replicates data when necessary, handles concurrent access, and so on.
So, if you want to access your "external" file from Hadoop, you should model the access as if it were a database: a single entry point that manages concurrent access, etc.
 
Jayesh A Lalwani
Rancher
Posts: 2756
32
Eclipse IDE Spring Tomcat Server
In a MapReduce application, you want to keep network IO to a minimum. Any grid-enabled application will have negative scalability if it is network IO bound. So it's great that you are splitting one large file, because only parts of the file go to each node, and that reduces network IO. However, if each node downloads another equally large file, you might start loading the network, because that whole file is being downloaded to all of your nodes. You lose all the advantage that you gained by splitting your first file. Also, the external file system that hosts the second file becomes your bottleneck and a single point of failure.
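
A rough back-of-the-envelope sketch of that trade-off, in plain Java. The 5 TB figure comes from the original question; the node count of 20 and the class name NetworkCost are assumptions made up for illustration. The split case is treated as an upper bound where each byte is shipped to at most one node; data locality in HDFS can make the real number lower.

```java
// Rough estimate of total network transfer, in gigabytes.
class NetworkCost {

    // A split file: each byte goes to at most one node, so the total
    // transfer is bounded by the file size itself.
    static long splitTransferGb(long fileGb) {
        return fileGb;
    }

    // A file fetched in full by every node: total transfer grows
    // linearly with the number of nodes.
    static long broadcastTransferGb(long fileGb, int nodes) {
        return Math.multiplyExact(fileGb, (long) nodes);
    }

    public static void main(String[] args) {
        long fileGb = 5_000; // 5 TB, as in the original question
        int nodes = 20;      // hypothetical cluster size
        System.out.println("split:     " + splitTransferGb(fileGb) + " GB");
        System.out.println("broadcast: " + broadcastTransferGb(fileGb, nodes) + " GB");
    }
}
```

Under these assumptions the second file alone would cost 20 times as much network traffic as the split file, and every one of those reads hits the same external file system.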

You might want to think about having your map operation do the work of "merging" the two files while it is splitting the first file. Basically, as it reads each row from the first file, it looks up the related records in the second file and sends the row from the first file along with those related records. If possible, pre-sort the second file to make the lookups efficient.
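
Here is a minimal sketch of the lookup part of that idea in plain Java, with no Hadoop classes. It assumes the second file has been pre-sorted by a key in its first column and fits in a list; the class name SortedLookup and the sample records are hypothetical. A binary search finds the first record with the key, and a short scan collects any duplicates that follow it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Looks up related records in a list that is already sorted by key.
class SortedLookup {

    // The key is the text before the first comma (assumed unquoted).
    static String keyOf(String record) {
        int comma = record.indexOf(',');
        return (comma < 0) ? record : record.substring(0, comma);
    }

    // Returns every record whose key equals the given key.
    static List<String> findRelated(List<String> sortedRecords, String key) {
        // Binary search for the first position whose key is >= the target.
        int lo = 0;
        int hi = sortedRecords.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (keyOf(sortedRecords.get(mid)).compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        // Collect the run of records that share the key.
        List<String> related = new ArrayList<>();
        for (int i = lo; i < sortedRecords.size()
                && keyOf(sortedRecords.get(i)).equals(key); i++) {
            related.add(sortedRecords.get(i));
        }
        return related;
    }

    public static void main(String[] args) {
        List<String> second = Arrays.asList("a1,x", "a2,y", "a2,z", "b1,w");
        String rowFromFirstFile = "a2,first";
        List<String> related = findRelated(second, keyOf(rowFromFirstFile));
        System.out.println(rowFromFirstFile + " + " + related);
    }
}
```

Each lookup costs a logarithmic number of comparisons instead of a full scan, which is why pre-sorting pays off when the first file has many rows.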

Design decisions like these depend a lot on the relationship between the data in the two files. Before you decide to use a shared file system, consider the impact on scalability. Whatever you do, don't become network IO bound.
 