I have a question about the data flow in MapReduce jobs.
Data flow in a WordCount example (taken from a book):

The mapper produces key-value pairs like (car, 1),
possibly many such pairs with the same key: (car, 1), (car, 1), ...
The shuffle phase groups them into key-list-of-values pairs like (car, [1, 1, 1]).
The reducer sums up the list and produces a key-value pair like (car, 3).
If I want to use a combiner, I can use the reducer as the combiner (according to
the book I've been learning from).
But how is this possible? If the reducer is used as a combiner, doesn't there have to be a
shuffle phase before the combiner runs? Without a shuffle phase, there is no
list of 1's, so the combiner could not sum the values for a specific key.
Clearly I am missing something; can someone please explain it to me?
(The book actually suggests using the same code as the reducer for the combiner, rather than the reducer class itself.)
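For concreteness, here is a minimal WordCount sketch against the Hadoop 2.x Java API, with the reducer class registered as the combiner via setCombinerClass. The class names (WordCount, TokenizerMapper, IntSumReducer) are just the conventional example names, not anything required by the framework:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Emits (word, 1) for every token in the input line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Sums the values for one key. Applied to partial lists on the map
      // side (as a combiner), it produces the same final totals.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // reducer reused as combiner
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The reuse is safe here because integer addition is associative and commutative, and the reducer's input and output types match; summing partial lists on the map side does not change the final counts.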
When a mapper starts outputting data, it first writes into a circular in-memory buffer; when the buffer
reaches a configurable threshold, its contents are sorted by key and spilled to disk.
The combiner is not a reducer: it runs on the mapper side, and what it combines are the mapper's
spills. Because each spill is sorted by key, all values for a given key already sit next to each other,
so the combiner gets its list of 1's without any shuffle phase.
Also, the combiner is not guaranteed to run: it needs a minimum number of spill files (again configurable),
and it need not run exactly once; depending on the number of spill files it can run multiple times.
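If you want to experiment with when spills happen and when the combiner kicks in, here is a sketch of the relevant knobs. The property names are the MRv2 ones from Hadoop 2.x, and the defaults in the comments are my assumption from that version's mapred-default.xml, so check your release:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpillTuning {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // In-memory sort buffer for map output, in MB (default 100).
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // Fill fraction of the buffer at which spilling to disk begins (default 0.80).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        // Minimum number of spill files for the combiner to run again
        // during the final merge of spills (default 3).
        conf.setInt("mapreduce.map.combine.minspills", 3);
        Job job = Job.getInstance(conf, "word count with tuned spills");
        // ... set mapper/combiner/reducer classes and I/O paths as usual
      }
    }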
Hope that helps.