Why was MapReduce developed to reduce the parallel-processed data based on common keys?

 
Monica Shiralkar
Ranch Hand
In the MapReduce framework, the mappers perform their tasks in parallel and the reducers reduce the parallel-processed data based on common keys. Why was it designed to collect the data based on keys? Why was it not designed so that the mappers process the data in parallel and the reducers simply gather it (not necessarily grouped by keys)? Thanks.
 
Mike Simmons
Master Rancher
Keys are just a way of organizing the work to be done, and dividing it up among multiple nodes.  We can't have all the data and processing on one node, so how do we decide which data and which processing goes where?  That will depend on the nature of the problem we're trying to solve... but whatever criteria we decide to use, we use that to define a key, which is used to route the data.
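As a toy illustration (plain Java, not Hadoop code): for a word-count style problem the word itself is a natural key, and grouping by that key is what decides which pieces of data end up being processed together. The class name and the input words below are made up for the example.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative sketch only: the chosen key (here, the word) determines how the
// data is grouped, and therefore how the work could be divided across nodes.
public class KeyChoiceSketch {
    public static void main(String[] args) {
        List<String> words = List.of("to", "be", "or", "not", "to", "be");

        // Group by the chosen key; each group is an independent unit of work.
        Map<String, Long> countsByKey = words.stream()
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));

        System.out.println(countsByKey);   // {be=2, not=1, or=1, to=2}
    }
}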
 
Monica Shiralkar
Ranch Hand

Mike Simmons wrote: Keys are just a way of organizing the work to be done, and dividing it up among multiple nodes

Thanks. But this dividing of the work across multiple nodes is not done by keys. Instead, it is done simply by taking the total input size divided by the block size (128 MB or 64 MB); that gives the number of mappers, and the mappers are then distributed among the nodes. So how do keys come into the picture for this part?
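As a rough sketch of that arithmetic (the 1000 MB input and 128 MB block size below are just example figures, not taken from a real cluster):

public class SplitCountSketch {
    public static void main(String[] args) {
        long totalInputBytes = 1000L * 1024 * 1024;   // say, 1000 MB of input
        long blockSizeBytes  = 128L  * 1024 * 1024;   // a common HDFS block size

        // One mapper per input split, i.e. ceil(totalInput / blockSize).
        long numMappers = (totalInputBytes + blockSizeBytes - 1) / blockSizeBytes;
        System.out.println("mappers: " + numMappers);  // mappers: 8
    }
}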
 
Mike Simmons
Master Rancher

Monica Shiralkar wrote:

Mike Simmons wrote: Keys are just a way of organizing the work to be done, and dividing it up among multiple nodes

Thanks. But this dividing of the work across multiple nodes is not done by keys. Instead, it is done simply by taking the total input size divided by the block size (128 MB or 64 MB); that gives the number of mappers, and the mappers are then distributed among the nodes. So how do keys come into the picture for this part?

I don't understand how you think this establishes that dividing the work is not done by keys.  You're describing how to determine the number of mappers, which is a part of the process, and that particular calculation does not use keys, ok... but what are all the mappers mapping?  They convert input into key/value pairs.  A partition function is applied to each key to determine which reducer(s) to send each key/value pair to.  This also ensures that if another input at another mapper has the same key, the key/value pair for that input will be sent to the same reducer(s) that received the first pair.  So all data for a given key will arrive at the same reducer.  That way the reducer knows everything it needs to know about that particular key, so it can solve that particular part of the problem - and likewise for the other keys it receives and their respective portions of the problem, which are still only a fraction of all the keys present in the overall problem.  So, keys are fundamental to how we determine which nodes work on which parts of a big data set.
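To make that routing concrete, here is a minimal, self-contained sketch of a hash-based partition function, along the lines of what Hadoop's default HashPartitioner does (the class and method below are illustrative, not the actual Hadoop API):

import java.util.List;

// Illustrative sketch only - not the real Hadoop Partitioner API.
public class KeyRoutingSketch {

    // Every pair with the same key maps to the same reducer index, so one
    // reducer ends up with all the data for that key.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 3;
        // Pretend these keys were emitted by mappers running on different nodes.
        List<String> mapperA = List.of("apple", "banana", "apple");
        List<String> mapperB = List.of("banana", "cherry", "apple");

        mapperA.forEach(k -> System.out.println("mapper A: " + k + " -> reducer " + partition(k, numReducers)));
        mapperB.forEach(k -> System.out.println("mapper B: " + k + " -> reducer " + partition(k, numReducers)));
        // "apple" always lands on the same reducer, no matter which mapper emitted it.
    }
}

The routing decision depends only on the key, never on which mapper produced the pair.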

This does not imply that using key/value pairs is necessarily the only way to handle distributed computing.  It isn't.  But it is how the MapReduce algorithm works.  If you can identify a key that makes sense for a particular problem, then you may be able to make use of MapReduce.  If not, well, you may need some other algorithm.
 
Monica Shiralkar
Ranch Hand
Thanks. I meant: instead of aggregating the values of matching keys together to gather the data computed in parallel, why not aggregate all of the data together? Why not have the mappers emit just one key, so that all values are aggregated under that single key?
 
Monica Shiralkar
Ranch Hand

Mike Simmons wrote: If you can identify a key that makes sense for a particular problem, then you may be able to make use of MapReduce.

Thanks. I think this line explains it all.

Nowadays, we also have Apache Spark and Apache Storm for processing big data (although the reason for choosing them is often streaming data). So if MapReduce is not suitable for solving a particular big data problem, we have other options too. I wonder what the options were for processing big data before Spark and Storm existed. As I understand it, at that time only Hadoop MapReduce was available for big data problems, so I wonder what people did if Hadoop MapReduce, with its key/value concept, was not found suitable for a particular big data problem.
 