Why was MapReduce designed to reduce data processed in parallel based on common keys?

 
Monica Shiralkar
Ranch Hand
In the MapReduce framework, the mappers perform their tasks in parallel and the reducers reduce the parallel-processed data based on common keys. Why was it designed to collect the data based on keys? Why was it not designed so that the mappers process data in parallel and the reducers just gather the data (not necessarily based on keys)? Thanks.
 
Mike Simmons
Master Rancher
Keys are just a way of organizing the work to be done and dividing it up among multiple nodes. We can't have all the data and processing on one node, so how do we decide which data and which processing go where? That will depend on the nature of the problem we're trying to solve... but whatever criteria we decide to use, we use them to define a key, which is then used to route the data.
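
For a concrete example, consider word counting, the classic MapReduce illustration: the word itself is the natural key. Below is a minimal, self-contained sketch of the map step (plain Java, with a stand-in emit() instead of the real Hadoop Mapper API, so treat the shape as illustrative):

// Word count sketch: the word is the key. Because routing is done by
// key, every occurrence of the same word, from any mapper, ends up at
// the same reducer, which can then sum the counts.
public class WordCountMapSketch {

    // Stand-in for a MapReduce framework's emit(key, value) call.
    static void emit(String key, int value) {
        System.out.println(key + "\t" + value);
    }

    // The "map" step: turn one line of input into (word, 1) pairs.
    static void map(String line) {
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                emit(word, 1);
            }
        }
    }

    public static void main(String[] args) {
        map("the quick brown fox jumps over the lazy dog");
    }
}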
 
Monica Shiralkar
Ranch Hand

Mike Simmons wrote:Keys are just a way of organizing the work to be done and dividing it up among multiple nodes

Thanks. But this division of the work among multiple nodes is not done by keys. Instead, it is done simply by dividing the total input size by the block size (128 MB or 64 MB). That gives the number of mappers, which are then distributed among the nodes. So how do keys come into the picture for this part?
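(For example, under this scheme a 1 GB input with the default 128 MB block size yields 8 input splits, and hence 8 map tasks.)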
 
Mike Simmons
Master Rancher

Monica Shiralkar wrote:

Mike Simmons wrote:Keys are just a way of organizing the work to be done and dividing it up among multiple nodes

Thanks. But this division of the work among multiple nodes is not done by keys. Instead, it is done simply by dividing the total input size by the block size (128 MB or 64 MB). That gives the number of mappers, which are then distributed among the nodes. So how do keys come into the picture for this part?

I don't understand how you think this establishes that dividing the work is not done by keys. You're describing how to determine the number of mappers, which is one part of the process, and that particular calculation does not use keys, OK... but what are all the mappers mapping? They convert input into key/value pairs. A partition function is then applied to each key to determine which reducer(s) each key/value pair is sent to. This also ensures that if another input at another mapper produces the same key, its key/value pair is sent to the same reducer(s) that received the first pair. So all data for a given key arrives at the same reducer. That way the reducer knows everything it needs to know about that particular key, and can solve that particular part of the problem - and likewise for the other keys it handles, which are still only a fraction of all the keys in the overall problem. So, keys are fundamental to how we determine which nodes work on which parts of a big data set.

This does not imply that using key/value pairs is necessarily the only way to handle distributed computing.  It isn't.  But it is how the MapReduce algorithm works.  If you can identify a key that makes sense for a particular problem, then you may be able to make use of MapReduce.  If not, well, you may need some other algorithm.
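
To make the routing concrete, here is a minimal sketch of such a partition function, modeled on Hadoop's default HashPartitioner (the real class lives in org.apache.hadoop.mapreduce.lib.partition; this standalone version is just an illustration):

// Decides which reducer receives a given key/value pair.
public class HashPartitionSketch<K, V> {

    // Masking with Integer.MAX_VALUE clears the sign bit, so the
    // result of % is a valid, non-negative partition index.
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Since the result depends only on the key (the value is ignored here), every mapper that emits a given key computes the same partition index, which is exactly the guarantee that all values for one key meet at one reducer.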
 
Monica Shiralkar
Ranch Hand
Thanks. I meant: instead of aggregating the values of similar keys together to gather the data computed in parallel, why not aggregate the entire data set together? Why not have the mapper emit just one key, so that all values are aggregated under that single key?
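
(To illustrate the question: such a mapper would emit one constant key for every input, as in the hypothetical sketch below. Note that since the partition function hashes the key, a single key means a single partition, i.e. one reducer receiving the entire data set.)

// Hypothetical mapper that funnels everything under one constant key.
public class SingleKeyMapSketch {

    // Stand-in for a MapReduce framework's emit(key, value) call.
    static void emit(String key, String value) {
        System.out.println(key + "\t" + value);
    }

    // Every input line is emitted under the same key, "ALL", so the
    // partitioner sends every pair to the same single reducer.
    static void map(String line) {
        emit("ALL", line);
    }

    public static void main(String[] args) {
        map("any input line");
    }
}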
 