The following data is extracted by the first map-reduce job:
country ; title ; sex ; units ; file location
Turkey ; Population ; Males ; Persons ; L/F/W/A/5/LFWA55MATRQ647N.csv
Turkey ; Population ; Males ; Persons ; L/F/W/A/5/LFWA55MATRA647N.csv
Turkey ; Population ; Males ; Persons ; L/F/W/A/5/LFWA55MATRQ647S.csv
Turkey ; Population ; Males ; Persons ; L/F/W/A/5/LFWA55MATRA647S.csv
Next, I am trying to set up a second map-reduce job that reads the CSV files listed in the file location column. Each CSV file has the following format:
year ; population
2004 ; 2130034
2005 ; 2239913
2006 ; 2437712
2007 ; 2210673
But I have no idea how to set up the second map-reduce job using the file location values produced by the first job. The final output should look like this:
country ; year ; population
Turkey ; 2004 ; 2130034
Turkey ; 2005 ; 2239913
Turkey ; 2006 ; 2437712
Turkey ; 2007 ; 2210673
As far as I know, the input file path can only be set in the driver class with the FileInputFormat.setInputPaths() method, but in my map-reduce job the file location is only available inside the map and reduce classes. How can I get the input file paths from the map and reduce classes into the driver class?
How can I pass a file location value to FileInputFormat.setInputPaths(), for example FileInputFormat.setInputPaths(job, new Path("L/F/W/A/5/LFWA55MATRQ647N.csv"));?
I need your advice. Thanks in advance for your help!
Are your CSV files on HDFS? How big is each file, i.e. how many rows of year ; population does it contain? If they are not on HDFS yet, you could copy them there first.
Then run a Pig script, which would automatically chain the required MR jobs to process the data.
The Pig script would roughly look like this (assuming the output of your first MR job is in a file):
1) Read the first MR job's output with the schema: country, title, sex, units, file location (or name).
2) If the CSV files are on HDFS, read them with the schema: file location (or name), year, population. [You may have to write your own loader function for this, since the file location has to be one of the output fields.]
3) Join 1 and 2 on "file location (or name)", which would give the desired output, i.e.
country, year, population
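To make the join in step 3 concrete, here is a minimal sketch in plain Java rather than Pig. It only illustrates the logic: the class name CountryJoin and both method names are hypothetical, and it assumes the " ; " separated layout shown in the question. In a real mapper the file location would have to come from the input split (for example via FileSplit.getPath()), which returns a full path that would probably need to be normalized before the lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the join on "file location".
class CountryJoin {

    // Builds a lookup table: file location -> country, from the first job's output lines.
    static Map<String, String> buildLookup(Iterable<String> firstJobLines) {
        Map<String, String> lookup = new HashMap<>();
        for (String line : firstJobLines) {
            String[] fields = line.split(";");
            if (fields.length < 5) {
                continue; // not a data row
            }
            String country = fields[0].trim();
            String location = fields[4].trim();
            if (country.equals("country")) {
                continue; // header row
            }
            lookup.put(location, country);
        }
        return lookup;
    }

    // Joins one "year ; population" row with the country of the file it came from.
    // Returns null for header rows, malformed rows, and unknown file locations.
    static String join(Map<String, String> lookup, String fileLocation, String csvRow) {
        String country = lookup.get(fileLocation);
        String[] fields = csvRow.split(";");
        if (country == null || fields.length != 2) {
            return null;
        }
        String year = fields[0].trim();
        if (!year.matches("\\d{4}")) {
            return null; // e.g. the "year ; population" header
        }
        return country + " ; " + year + " ; " + fields[1].trim();
    }
}
```

With the sample data from the question, joining the row "2004 ; 2130034" from L/F/W/A/5/LFWA55MATRQ647N.csv yields "Turkey ; 2004 ; 2130034".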
Of course, all of this can be done with plain MR as well, but you would have to chain the jobs together yourself. Whichever way you proceed, I believe the CSV files need to be on the HDFS cluster.
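For the plain MR route, one possible way to get the file locations into the driver is to let the driver read the first job's output after it finishes and collect the file location column before configuring the second job. The sketch below shows only that parsing step in plain Java. The class FileLocationExtractor and its methods are hypothetical names, and it assumes the file location is the fifth " ; " separated field, as in the question. Reading the output lines from HDFS is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: pulls the "file location" column out of the first job's output.
class FileLocationExtractor {

    // Returns the fifth ';'-separated field of every data row,
    // skipping the header row and any blank or malformed lines.
    static List<String> extractPaths(List<String> lines) {
        List<String> paths = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(";");
            if (fields.length < 5) {
                continue; // blank or malformed line
            }
            String location = fields[4].trim();
            if (location.isEmpty() || location.equals("file location")) {
                continue; // header row or empty column
            }
            paths.add(location);
        }
        return paths;
    }

    // Joins the paths with commas, the form accepted by the String overload
    // of FileInputFormat.setInputPaths.
    static String toCommaSeparated(List<String> paths) {
        return String.join(",", paths);
    }
}
```

In the driver, the result could then be passed on with something like FileInputFormat.setInputPaths(job2, FileLocationExtractor.toCommaSeparated(paths)), where job2 stands for the second job's Job object.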