Is it the same in Hadoop to have two data nodes of 50 GB each as to have one data node of 100 MB?

 
Ranch Hand
Is it the same in Hadoop to have two data nodes of 50 GB each as to have one data node of 100 MB? If not, which one is better (faster processing)?

thanks
 
Bartender
Hadoop is all about distributing your data and your processing across multiple cheap machines. The data is replicated so that there are, for example, 3 copies of each block of data, with different copies on different machines. If you have more nodes than replicas, e.g. 3 replicas across 6 nodes, then on average each node only contains half the total original data volume. Hadoop knows where your data is replicated, so it can decide to process different subsets of your data on different nodes at the same time. This is how Hadoop allows you to exploit the power of distributed processing.

If you only have two nodes, and your replication factor is 2 or more, then each node contains all your data anyway, so Hadoop cannot decide how to break up the processing in this way. And if you only have one node, then nothing is distributed at all.
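
The arithmetic behind the "3 replicas across 6 nodes" example can be sketched in a few lines of Python. This is only an illustration, not Hadoop code: the function name `average_fraction_per_node` is made up here, and it assumes that blocks are spread evenly and that a node never holds more than one replica of the same block.

```python
def average_fraction_per_node(num_nodes: int, replication: int) -> float:
    """Average share of the original data volume stored on each node.

    Assumption: a node holds at most one replica of any block, so the
    effective number of copies is capped at the number of nodes.
    """
    if num_nodes < 1 or replication < 1:
        raise ValueError("num_nodes and replication must both be >= 1")
    effective_copies = min(replication, num_nodes)
    return effective_copies / num_nodes


# The cases discussed in this post: (nodes, replication factor)
for nodes, replication in [(6, 3), (2, 2), (2, 3), (1, 3)]:
    share = average_fraction_per_node(nodes, replication)
    print(f"{nodes} node(s), replication {replication}: "
          f"each node holds about {share:.0%} of the data")
```

With 6 nodes and 3 replicas the share is one half, while with two nodes or one node every node ends up holding the full data set, which is the situation described in the paragraph above.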
 
Ranch Hand

chris webster wrote: Hadoop is all about distributing your data and your processing across multiple cheap machines. The data is replicated so that there are, for example, 3 copies of each block of data, with different copies on different machines. If you have more nodes than replicas, e.g. 3 replicas across 6 nodes, then on average each node only contains half the total original data volume. Hadoop knows where your data is replicated, so it can decide to process different subsets of your data on different nodes at the same time. This is how Hadoop allows you to exploit the power of distributed processing.

If you only have two nodes, and your replication factor is 2 or more, then each node contains all your data anyway, so Hadoop cannot decide how to break up the processing in this way. And if you only have one node, then nothing is distributed at all.



In the first case you mentioned, i.e. 3 replicas across 6 nodes, you said Hadoop can decide what to process where.

Whereas in your last example, i.e. two nodes with a replication factor of 2 or more, you said Hadoop cannot decide how to break up the processing.

My question is: why can't Hadoop decide in the second case? If the two nodes are deployed on two separate machines, and one machine is heavily loaded and has fewer free resources than the other, don't you think YARN will select the second machine to process the task?

Thanks.

Viki.
 
Java Cowboy
One node with 100 MB will be faster than two nodes with 50 GB, because in the first case there is 1000 times less data.

You probably meant 100 GB instead of 100 MB.
 
Monica Shiralkar
Ranch Hand

One node with 100 MB will be faster than two nodes with 50 GB, because in the first case there is 1000 times less data.

You probably meant 100 GB instead of 100 MB.



Yes, I meant 100 GB. So will one node of 100 GB be faster, or two nodes of 50 GB each?

thanks
 
Marshal
MS: please don't edit old threads like that: add a new post saying that “MB” was a misspelling.
 
Monica Shiralkar
Ranch Hand

Campbell Ritchie wrote:MS: please don't edit old threads like that: add a new post saying that “MB” was a misspelling.



Am I supposed to reply saying “MB” was a misspelling and then edit the subject, or only reply?
 
Campbell Ritchie
Marshal
Please simply say MB was a mistake. You have replied, and I think you have done everything needed.
 
Monica Shiralkar
Ranch Hand
I think that of these two options, two nodes of 50 GB each will be faster than one node of 100 GB, because two nodes give us more cores to process the data than a single node does.
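
A rough way to check this reasoning is a toy model in Python. Everything in it is an assumption made for illustration: the function name `estimate_scan_seconds`, the figure of 8 cores per node, and the 50 MB/s each core is taken to process are all hypothetical, and the model treats the job as perfectly parallel, ignoring network transfer, shuffle, and scheduling overhead.

```python
def estimate_scan_seconds(data_mb: int, nodes: int,
                          cores_per_node: int, mb_per_core_second: int) -> float:
    """Idealised time to scan data_mb when every core works in parallel.

    Assumption: work divides evenly across all cores in the cluster and
    there is no coordination or network cost.
    """
    if min(data_mb, nodes, cores_per_node, mb_per_core_second) <= 0:
        raise ValueError("all arguments must be positive")
    total_throughput = nodes * cores_per_node * mb_per_core_second
    return data_mb / total_throughput


DATA_MB = 100_000   # roughly 100 GB in total, in both setups
CORES = 8           # hypothetical cores per node
RATE = 50           # hypothetical MB/s handled by one core

one_node = estimate_scan_seconds(DATA_MB, nodes=1, cores_per_node=CORES,
                                 mb_per_core_second=RATE)
two_nodes = estimate_scan_seconds(DATA_MB, nodes=2, cores_per_node=CORES,
                                  mb_per_core_second=RATE)
print(f"one 100 GB node:  {one_node:.1f} s")
print(f"two 50 GB nodes:  {two_nodes:.1f} s")
```

Under these assumptions the two-node setup finishes in half the time, simply because it has twice as many cores. If the single 100 GB node had as many cores as the two smaller nodes combined, this model would show no difference, so the answer depends on the hardware and not only on the disk sizes.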
 