JavaRanch » Java Forums » Databases » Hadoop

Hadoop on Different Platforms

Mohamed El-Refaey
Ranch Hand

Joined: Dec 08, 2009
Posts: 119
Are there any references or benchmarks for Hadoop running on different platforms and OSes?


Best Regards, Mohamed El-Refaey
Carlos Morillo
Ranch Hand

Joined: Jun 06, 2009
Posts: 221

Hadoop runs on Linux only.
Your mileage may vary.
Some customers get really good performance on Cisco UCS and HP DL380 servers, among others.
Hadoop uses the notion of data locality: the closer the data is to the node where the task is running, the better the performance you get.
Depending on the application, SSDs might have a positive impact versus classic HDDs.

MapR Enterprise Grade Distribution for Hadoop has several records for the most popular Hadoop benchmarks.

SCSA, OCA, SCJP 5.0, SCJD, CCDH, CCAH http://www.linkedin.com/in/carlosamorillo
Mohamed El-Refaey
Ranch Hand

Joined: Dec 08, 2009
Posts: 119
Thanks Carlos. So, how much does data locality improve performance?
Garry Turkington

Joined: Apr 23, 2013
Posts: 15

The data locality optimization is one of the key techniques that allow Hadoop to scale and perform so well.

The basic idea is that if you have a cluster holding large amounts of data, you really don't want to be moving that data around the cluster to process it. So when a MapReduce job is being scheduled, the framework determines which pieces of data (blocks) need to be processed and on which machines they are located, and then starts tasks to process the data on those hosts. By default Hadoop keeps 3 copies of each block, so in the best case, if you have 10 blocks to process (usually it's many more), the framework will schedule each task on a host where a replica of its block resides.

Obviously, as the data size increases this becomes more difficult; if you have 50 machines but 20,000 blocks to process, scheduling becomes much more complex. But by aiming to process data where it resides, a lot of data transfer and I/O is avoided.
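To make the idea concrete, here's a toy simulation of that kind of locality-aware scheduling. It is a hypothetical sketch, not Hadoop's actual scheduler code: block IDs, node names, and the slot-counting scheme are all made up for illustration. Each block has 3 replica hosts (the default replication factor), and the scheduler prefers a host that already holds a replica, falling back to a remote host only when no local slot is free.

```java
import java.util.*;

// Toy model of locality-aware task scheduling (hypothetical example,
// not the real Hadoop scheduler). Each block has 3 replica hosts and
// the scheduler tries to assign the task to one of those hosts.
public class LocalityScheduler {

    // block id -> hosts holding a replica (replication factor 3)
    static final Map<Integer, List<String>> REPLICAS = Map.of(
        1, List.of("node1", "node2", "node3"),
        2, List.of("node2", "node4", "node5"),
        3, List.of("node1", "node4", "node5"));

    // Assign each block to a host: a replica host with a free task slot
    // if possible (data-local), otherwise any host with a free slot
    // (remote read -- the case locality scheduling tries to avoid).
    static Map<Integer, String> schedule(Map<String, Integer> freeSlots) {
        Map<Integer, String> assignment = new TreeMap<>();
        for (var entry : new TreeMap<>(REPLICAS).entrySet()) {
            String chosen = null;
            for (String host : entry.getValue()) {          // prefer data-local
                if (freeSlots.getOrDefault(host, 0) > 0) {
                    chosen = host;
                    break;
                }
            }
            if (chosen == null) {                           // fall back to remote
                for (var slot : freeSlots.entrySet()) {
                    if (slot.getValue() > 0) {
                        chosen = slot.getKey();
                        break;
                    }
                }
            }
            if (chosen != null) {
                freeSlots.merge(chosen, -1, Integer::sum);  // consume a slot
                assignment.put(entry.getKey(), chosen);
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Only node1, node2, and node4 have a free slot; every block
        // still lands on a host holding one of its replicas.
        Map<String, Integer> slots = new HashMap<>(
            Map.of("node1", 1, "node2", 1, "node4", 1));
        System.out.println(schedule(slots));
        // prints {1=node1, 2=node2, 3=node4}
    }
}
```

With 3 replicas per block the scheduler has three chances to find a local slot, which is why a higher replication factor makes data-local scheduling easier at the cost of storage.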

Mohamed El-Refaey
Ranch Hand

Joined: Dec 08, 2009
Posts: 119
Fantastic. Thanks Garry for the detailed explanation.