
Hadoop on Different Platforms

 
Mohamed El-Refaey
Ranch Hand
Posts: 119
Are there any references or benchmarks for Hadoop running on different platforms and operating systems?

Thanks,
Mohamed
 
Carlos Morillo
Ranch Hand
Posts: 221
Java Python Scala
Hadoop runs on Linux only.
Your mileage may vary.
Some customers get very good performance on Cisco UCS and HP DL380 servers, among others.
Hadoop uses the notion of data locality, so the closer the data is to the node where the task is running, the better the performance you get.
Depending on the application, SSDs might have a positive impact compared with classic HDDs.

The MapR Enterprise Grade Distribution for Hadoop holds several records for the most popular Hadoop benchmarks.
 
Mohamed El-Refaey
Ranch Hand
Posts: 119
Thanks, Carlos. So how much does data locality improve performance?
 
Garry Turkington
author
Greenhorn
Posts: 15
Mohamed,

The data locality optimization is one of the key techniques that allow Hadoop to scale and perform so well.

The basic idea is that if you have a cluster with large amounts of data, you really don't want to be moving that data around the cluster to be processed. So when a MapReduce job is being scheduled, the framework determines which pieces of data (blocks) need to be processed and on which machines they are located, and then starts tasks to process the data on those hosts. By default Hadoop keeps 3 copies of each block, so in the best case, if you have 10 blocks to be processed (usually it's far more), the framework will schedule tasks on the hosts where replicas of each block reside.
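
To make that concrete, here is a minimal sketch of the placement decision. This is not Hadoop's actual scheduler code, and the host names are made up; it only illustrates the idea of preferring a node that already stores a replica of the block and falling back to a remote node otherwise:

import java.util.List;
import java.util.Optional;
import java.util.Set;

public class LocalityAwarePlacement {

    // Pick a host to run the task that will process one block.
    // Prefer a host that already stores a replica of the block;
    // otherwise fall back to any host with a free slot (remote read).
    public static Optional<String> chooseHost(List<String> replicaHosts,
                                              Set<String> hostsWithFreeSlots) {
        for (String host : replicaHosts) {
            if (hostsWithFreeSlots.contains(host)) {
                return Optional.of(host);          // data-local: no network copy needed
            }
        }
        // No replica host is free: the block will have to be read over the network.
        return hostsWithFreeSlots.stream().findAny();
    }

    public static void main(String[] args) {
        // Hypothetical hosts, for illustration only.
        List<String> replicas = List.of("node3", "node7", "node12");
        Set<String> freeSlots = Set.of("node5", "node7", "node9");
        System.out.println(chooseHost(replicas, freeSlots).orElse("no host available"));
        // Prints "node7" because it both holds a replica and has a free slot.
    }
}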

Obviously, as the data size increases this becomes more difficult; if you have 50 machines but 20,000 blocks to process, scheduling becomes much more complex. But by aiming to process data where it resides, a lot of data transfer and I/O is avoided.
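
If you want to see the locality information the framework works from, the HDFS client API exposes it. A small sketch (the HDFS path is just an example, and the cluster configuration is assumed to be on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt");     // hypothetical HDFS path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block, listing the datanodes that hold a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset %d, length %d, hosts: %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}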

Garry
 
Mohamed El-Refaey
Ranch Hand
Posts: 119
Fantastic. Thanks, Garry, for the detailed explanation.
 