I have a question about the data flow in MapReduce jobs.
Data flow in a WordCount example (taken from a book):
The mapper produces key-value pairs like (car, 1),
possibly many pairs with the same key, e.g. (car, 1) over and over.
The shuffle phase produces key-list-of-values pairs like (car, [1, 1, 1]).
The reducer sums up the list and produces a key-value pair like (car, 3).
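For concreteness, a mapper and reducer along those lines would look roughly like this in Hadoop (my own sketch, not the book's code; the class names are made up):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Emits (word, 1) for every token in the input line.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);           // e.g. (car, 1)
                }
            }
        }

        // Receives (word, [1, 1, 1, ...]) after the shuffle and sums the list.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                result.set(sum);
                context.write(key, result);             // e.g. (car, 3)
            }
        }
    }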
According to the book I've been learning from, when I want to use a combiner, I can use the reducer as the combiner.
But how is this possible? If I use the reducer as a combiner, there would have to be a
shuffle phase before the combiner, right? Without a shuffle phase there is no
list of 1's, so the combiner could not sum up the values for a specific key.
Clearly I am missing something. Can someone please explain it to me?
What the book is really saying is that it is fine to use the same code for the combiner as for the reducer, not that the reducer itself runs at that point.
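In Hadoop that typically means plugging the same class into both slots in the driver, roughly like this (a sketch reusing the WordCount classes from the question; the framework still creates separate instances for the map-side combine and for the reduce side):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            // The reducer class doubles as the combiner. This only works because
            // the reducer's input and output types are identical (Text / IntWritable),
            // so its output can later be fed into the real reducer again.
            job.setCombinerClass(WordCount.IntSumReducer.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that this is only safe for operations like summing counts, where combining partial results and then combining again gives the same answer.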
When a mapper starts outputting data, it first stores it in a circular buffer in memory. When the buffer
reaches a (configurable) threshold, it starts to spill its contents to disk.
The combiner is not the reducer: it runs inside the map task itself, and what it does is combine the mapper's
spill output. The combiner is not guaranteed to run; it needs a minimum number of spill files (again configurable). It also does
not necessarily run just once: depending on the number of spill files, the combiner can run multiple times.
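For reference, the knobs mentioned above look roughly like this; a sketch using what I believe are the Hadoop 2.x / MRv2 property names with their usual defaults (older releases used io.sort.mb, io.sort.spill.percent and min.num.spills.for.combine, so double-check against your version):

    import org.apache.hadoop.conf.Configuration;

    public class MapSideTuning {
        public static Configuration tuned() {
            // Map-side buffer / spill / combiner tuning; the exact property
            // names are version-dependent, verify them for your Hadoop release.
            Configuration conf = new Configuration();
            conf.setInt("mapreduce.task.io.sort.mb", 100);            // size of the circular in-memory buffer (MB)
            conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // fill level at which spilling to disk starts
            conf.setInt("mapreduce.map.combine.minspills", 3);        // min. spill files before the combiner runs during the merge
            return conf;
        }
    }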