Monica Shiralkar wrote:
Mike Simmons wrote: Keys are just a way of organizing the work to be done, and dividing it up among multiple nodes
Thanks. But this dividing of work among multiple nodes is not done by keys. Instead, it is done simply by dividing the total input size by the block size (128 MB, or 64 MB on older versions). That gives the number of mappers, and these mappers are distributed among the available nodes. So how do keys come into the picture for this part?
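For concreteness, here is a minimal sketch of that calculation. The 10 GB input size is just an assumed figure, and the class name is made up for the demo; the point is only the ceiling division of total size by split size:

public class SplitCount {
    public static void main(String[] args) {
        long totalBytes = 10L * 1024 * 1024 * 1024; // assume a 10 GB input
        long splitBytes = 128L * 1024 * 1024;       // default HDFS block size: 128 MB
        // Ceiling division: a partial final block still needs its own mapper.
        long mappers = (totalBytes + splitBytes - 1) / splitBytes;
        System.out.println(mappers + " map tasks"); // prints "80 map tasks"
    }
}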
I don't understand how you think this establishes that dividing the work is not done by keys. You're describing how to determine the number of mappers, which is a part of the process, and that particular calculation does not use keys, ok... but what are all those mappers mapping? They convert input to key/value pairs. A partition function is applied to each key to determine which reducer(s) to send each key/value pair to. This also ensures that if another input at another mapper has the same key, the key/value pair for that input will be sent to the same reducer(s) that received the first pair. So all data for a given key will arrive at the same reducer. That way the reducer knows everything it needs to know about that particular key, so it can solve that particular part of the problem, along with other keys and their respective portions of the problem (but still only a fraction of all the keys present in the overall problem). So, keys are fundamental to how we determine which nodes work on which parts of a big data set.
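Here's a minimal sketch of that partitioning rule. It's plain Java rather than the Hadoop API, but it uses the same formula as Hadoop's default HashPartitioner; the reducer count of 4 and the sample keys are just assumptions for the demo:

public class PartitionDemo {
    // Same key in, same reducer index out, no matter which mapper calls this.
    static int partition(String key, int numReducers) {
        // Mask off the sign bit so a negative hashCode can't produce a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int reducers = 4; // assumed reducer count
        for (String key : new String[] {"apple", "banana", "apple"}) {
            System.out.println(key + " -> reducer " + partition(key, reducers));
        }
        // Both "apple" pairs print the same reducer index, so all data
        // for that key lands on one reducer.
    }
}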
This does not imply that using key/value pairs is necessarily the only way to handle distributed computing. It isn't. But it is how the MapReduce algorithm works. If you can identify a key that makes sense for a particular problem, then you may be able to make use of MapReduce. If not, well, you may need some other algorithm.
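To illustrate what "a key that makes sense" looks like: in the classic word-count problem, the word itself is the natural key, so every count for a given word ends up on one reducer, which can then sum them. This sketch simulates the map and reduce steps in one process, plain Java rather than the Hadoop API:

import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {
    public static void main(String[] args) {
        String input = "to be or not to be"; // assumed sample input
        Map<String, Integer> counts = new HashMap<>();
        for (String word : input.split("\\s+")) {
            // Map step: emit (word, 1). Reduce step: sum the values per key.
            counts.merge(word, 1, Integer::sum);
        }
        System.out.println(counts); // e.g. {be=2, not=1, or=1, to=2} (order varies)
    }
}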