Multiple inputs on a single mapper in Hadoop

 
Greenhorn
Posts: 3
I'm developing an algorithm that needs to run two sequential MapReduce jobs, where the second job takes as input both the input and the output of the first. I found four ways to do it, and I'd like to know which of these is the most efficient, or whether there are other methods.

Distributed Cache

Merging all the reducer output into a single file and loading it into the Distributed Cache.
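A minimal driver-side sketch of this approach, run between the two jobs (paths are placeholders, and the classic `DistributedCache` API is assumed; newer Hadoop versions use `Job.addCacheFile` instead):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CachePrep {
    // Merge the first job's part files into one HDFS file and register it
    // in the Distributed Cache so every task of the second job can read it.
    public static void prepare(Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path job1Out = new Path("/user/me/job1-output");          // placeholder
        Path merged  = new Path("/user/me/job1-merged/part-all"); // placeholder

        // Concatenates every file under job1Out into the single file 'merged'.
        FileUtil.copyMerge(fs, job1Out, fs, merged, false, conf, null);

        // Tasks of the second job can then read it locally, e.g. via
        // DistributedCache.getLocalCacheFiles(conf) in their setup() method.
        DistributedCache.addCacheFile(merged.toUri(), conf);
    }
}
```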



Adding it as a resource to the Configuration object

As before, I merge the output, save it into a String, and then:
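A sketch of this variant, assuming a hypothetical property name `job1.output`. Note that the whole string is serialized into the job configuration shipped to every task, so this only makes sense for very small outputs:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConfResource {
    // Read the merged first-job output from HDFS and store it in the second
    // job's Configuration; mappers read it back with conf.get("job1.output").
    public static void loadIntoConf(Configuration conf, Path merged) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in =
                new BufferedReader(new InputStreamReader(fs.open(merged)))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        conf.set("job1.output", sb.toString()); // property name is an assumption
    }
}
```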



Reading from HDFS

The second mapper reads the output files of the first job's reducers directly from HDFS.
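A sketch of what that mapper could look like, assuming the first job used the default `TextOutputFormat` (tab-separated key/value lines) and a hypothetical `job1.output.dir` property. Keep in mind every map task re-reads the files over the network, which can be expensive:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SecondMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> firstJobOutput = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        Path dir = new Path(conf.get("job1.output.dir")); // assumed property
        for (FileStatus status : fs.listStatus(dir)) {
            // Skip _SUCCESS and other non-data files.
            if (!status.getPath().getName().startsWith("part-")) continue;
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] kv = line.split("\t", 2); // TextOutputFormat default
                    firstJobOutput.put(kv[0], kv.length > 1 ? kv[1] : "");
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Combine each input record with firstJobOutput as the algorithm requires.
    }
}
```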

Passing two values as input

I found pseudocode on a webpage where it seems that they pass two datasets as input to the second mapper, but I don't know how to do that.
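One standard way to feed two datasets into a single job is `MultipleInputs`, which assigns a different mapper class to each input path; both mappers must emit the same key/value types, and the two streams then meet in the reducer. A driver sketch (class names and paths are placeholders, not from the original post):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SecondJobDriver {
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "second job");
        job.setJarByClass(SecondJobDriver.class);
        // The original input, handled by one mapper class...
        MultipleInputs.addInputPath(job, new Path("/user/me/input"),
                TextInputFormat.class, OriginalInputMapper.class);
        // ...and the first job's output, handled by another.
        MultipleInputs.addInputPath(job, new Path("/user/me/job1-output"),
                TextInputFormat.class, FirstOutputMapper.class);
        job.setReducerClass(JoinReducer.class); // sees both streams per key
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        return job;
    }
}
```

`OriginalInputMapper`, `FirstOutputMapper`, and `JoinReducer` are hypothetical classes standing in for the algorithm's own logic.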

 
Ranch Hand
Posts: 63
  • Report post to moderator
Loading the reducer output into the Distributed Cache will only work if that output is small, not very big.

The local.cache.size parameter controls the size of the Distributed Cache. By default, it is set to 10 GB.
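For reference, this is a cluster-side setting (value in bytes); a fragment for mapred-site.xml might look like:

```xml
<property>
  <name>local.cache.size</name>
  <!-- 10737418240 bytes = 10 GB, the default; raise if needed -->
  <value>10737418240</value>
</property>
```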
 
Ivan Zandon
Greenhorn
Posts: 3
What about performance (in time and space)? Is the method I use to merge all the outputs from the reducers correct, or are there better methods?
 
Rajesh Nagaraju
Ranch Hand
Posts: 63
Merging the reducer output can be done with a single command.
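Presumably the command meant here is `hadoop fs -getmerge`, which concatenates all part files of an HDFS directory into one local file (paths below are placeholders):

```shell
# Merge every part file of the first job's output into one local file.
hadoop fs -getmerge /user/me/job1-output /tmp/merged.txt

# Copy it back to HDFS if the second job needs it there.
hadoop fs -put /tmp/merged.txt /user/me/job1-merged/merged.txt
```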

Then we can read this as an input for the mapper: use the Distributed Cache if the file is small; if it is large, add another mapper to process this file and use MultipleInputs (this becomes a reducer-side join).

If you still want to do a map-side join, then you can use CompositeInputFormat.
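A sketch of the CompositeInputFormat setup, assuming the Hadoop 2.x `mapreduce` API (class and property names may differ across versions). A map-side join requires both inputs to be sorted and identically partitioned, which conveniently holds when both are reducer outputs produced with the same number of reducers:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;

public class MapSideJoinDriver {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Build a join expression over the two (sorted, equally partitioned)
        // inputs; "inner" keeps only keys present in both datasets.
        conf.set("mapreduce.join.expr", CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class,
                new Path("/user/me/input-a"),        // placeholder
                new Path("/user/me/job1-output")));  // placeholder
        Job job = Job.getInstance(conf, "map-side join");
        job.setInputFormatClass(CompositeInputFormat.class);
        // The mapper then receives a TupleWritable with one value per input.
        return job;
    }
}
```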

Lastly, to automate the whole process, you could use Oozie or Spring Batch.

I hope to hear others' views on this.
 
Ivan Zandon
Greenhorn
Posts: 3
How can I execute that command automatically when the first reducer finishes, before the second mapper starts?
Unfortunately I cannot use Oozie, because I don't have the rights to install it in the production environment.
 
Rajesh Nagaraju
Ranch Hand
Posts: 63
Spring Batch can be used, since only the jar file references need to be on the classpath. You can launch it from a shell script and use the Spring for Apache Hadoop package to run the Hadoop commands.
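Even without Spring Batch, a plain shell script can sequence the steps, since `hadoop jar` blocks until the job completes. A hypothetical wrapper (jar names, class names, and paths are placeholders):

```shell
#!/bin/sh
set -e  # abort the pipeline if any step fails

# Job 1: runs to completion before the script continues.
hadoop jar myjobs.jar com.example.FirstJob /data/input /data/job1-output

# Merge the first job's output and put it where job 2 expects it.
hadoop fs -getmerge /data/job1-output /tmp/merged.txt
hadoop fs -put -f /tmp/merged.txt /data/job1-merged/merged.txt

# Job 2: consumes both the original input and the merged first output.
hadoop jar myjobs.jar com.example.SecondJob \
    /data/input /data/job1-merged/merged.txt /data/final-output
```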
 