I'm developing an algorithm that needs to run two sequential MapReduce jobs, where the second job takes as input both the input and the output of the first one. I found four ways to do it and I want to know which of these is the most efficient, or if there are other methods.
Distributed Cache
Merging all the reducer output into a single file and loading it into the Distributed Cache.
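A minimal sketch of this option, assuming the merged output was written to a hypothetical HDFS path `/user/me/job1-merged.txt` and that the lines use the default `key\tvalue` reducer format. Note this only works if the merged file fits in each map task's memory:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    // In the driver of the second job: register the merged file.
    // The "#job1out" fragment creates a symlink with that name in
    // each task's working directory.
    public static void configureSecondJob(Job job) throws Exception {
        job.addCacheFile(new URI("/user/me/job1-merged.txt#job1out"));
    }

    public static class SecondMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, String> firstJobOutput = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // Read the locally cached copy through its symlink.
            try (BufferedReader r = new BufferedReader(new FileReader("job1out"))) {
                String line;
                while ((line = r.readLine()) != null) {
                    String[] kv = line.split("\t", 2); // default reducer separator
                    firstJobOutput.put(kv[0], kv.length > 1 ? kv[1] : "");
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // 'value' comes from the original input; firstJobOutput holds
            // the first job's results for in-memory lookup.
        }
    }
}
```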
Adding it as a resource to the Configuration class
As before, I merge the output, save it in a String, and then set it as a property on the second job's Configuration.
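A sketch of this option, assuming the merged output has already been read into a String. The property name `first.job.output` is made up for illustration; since Configuration values are shipped to every task, this only makes sense for small outputs:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ConfExample {

    // In the driver: stash the merged first-job output in the Configuration.
    public static Job buildSecondJob(String mergedOutput) throws Exception {
        Configuration conf = new Configuration();
        conf.set("first.job.output", mergedOutput);
        return Job.getInstance(conf, "second job");
    }

    // In the second mapper's setup(Context context):
    //   String merged = context.getConfiguration().get("first.job.output");
    // and parse 'merged' into whatever structure the map() logic needs.
}
```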
Reading from HDFS
The second mapper reads the output files of the first job's reducers directly from HDFS.
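This option can be sketched as follows, assuming the first job wrote to a hypothetical directory `/user/me/job1-output`. Each map task opens the `part-*` files itself in `setup()`:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HdfsReadMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        // Scan the first job's output directory for reducer part files.
        for (FileStatus st : fs.listStatus(new Path("/user/me/job1-output"))) {
            if (!st.getPath().getName().startsWith("part-")) {
                continue; // skip _SUCCESS and other markers
            }
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(fs.open(st.getPath())))) {
                String line;
                while ((line = r.readLine()) != null) {
                    // parse and keep what map() needs
                }
            }
        }
    }
}
```

One caveat with this approach: every map task re-reads the whole output over the network, whereas the Distributed Cache copies it to each node once.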
Passing two values as input
I have found on this webpage this pseudocode, where they seem to pass two arguments as input to the second mapper, but I don't know how to do that.
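One standard way to feed two datasets into one job is Hadoop's `MultipleInputs`, which binds each input path to its own mapper class (a reduce-side join pattern). A sketch, with hypothetical paths and mapper names; both mappers must emit the same key/value types so the reducer sees the merged stream:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class TwoInputsExample {

    public static void configureInputs(Job job2) {
        // The original input, handled by one mapper...
        MultipleInputs.addInputPath(job2, new Path("/user/me/original-input"),
                TextInputFormat.class, OriginalInputMapper.class);
        // ...and the first job's output, handled by another.
        MultipleInputs.addInputPath(job2, new Path("/user/me/job1-output"),
                TextInputFormat.class, FirstOutputMapper.class);
        // In the reducer, records from both sources arrive grouped by key,
        // so each mapper typically tags its values to tell them apart.
    }
}
```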