Hadoop on Different Platforms

Mohamed El-Refaey
Ranch Hand

Joined: Dec 08, 2009
Posts: 119
Are there any references or benchmarks for Hadoop running on different platforms and operating systems?

Thanks,
Mohamed


Best Regards, Mohamed El-Refaey
www.egyptcloudforum.com
Carlos Morillo
Ranch Hand

Joined: Jun 06, 2009
Posts: 221

Hadoop runs on Linux only.
Your mileage may vary.
Some customers get really good performance on Cisco UCS and HP DL380 servers, among others.
Hadoop uses the notion of data locality, so the closer the data is to the node where the task is running, the better the performance you get.
Depending on the application, SSDs might have a positive impact versus classic HDDs.

MapR Enterprise Grade Distribution for Hadoop has several records for the most popular Hadoop benchmarks.
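
For reference, the benchmarks most often quoted are TeraSort and TestDFSIO, which ship with the stock Hadoop distribution. Below is a minimal sketch (not from this thread) of kicking off TeraGen and then TeraSort from Java via ToolRunner; the paths and row count are made up, and in practice the same jobs are usually launched from the command line with the hadoop-mapreduce-examples jar.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.util.ToolRunner;

public class TeraSortBenchmark {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / mapred-site.xml from the classpath.
        Configuration conf = new Configuration();

        // Generate 10 million 100-byte rows (about 1 GB) of synthetic input.
        // The output paths are hypothetical.
        ToolRunner.run(conf, new TeraGen(),
                new String[] {"10000000", "/benchmarks/teragen"});

        // Sort the generated data; the wall-clock time of this job is the
        // number usually reported when comparing platforms.
        int rc = ToolRunner.run(conf, new TeraSort(),
                new String[] {"/benchmarks/teragen", "/benchmarks/terasort"});
        System.exit(rc);
    }
}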


SCSA, OCA, SCJP 5.0, SCJD http://www.linkedin.com/in/carlosamorillo
Mohamed El-Refaey
Ranch Hand

Joined: Dec 08, 2009
Posts: 119
Thanks, Carlos. So, how much does data locality improve performance?
Garry Turkington
author
Greenhorn

Joined: Apr 23, 2013
Posts: 15
Mohamed,

The data locality optimization is one of the key techniques that allow Hadoop to scale and be so performant.

The basic idea is that if you have a cluster with large amounts of data, you really don't want to be moving that data around the cluster to be processed. So when a MapReduce job is being scheduled, the framework determines which pieces of data (blocks) need to be processed and on which machines they are located, and then starts tasks to process the data on those hosts. By default Hadoop keeps 3 copies of each block, so in the best case, if you have 10 blocks to be processed (usually it's much higher), the framework will schedule tasks on the hosts where replicas of each block reside.

Obviously, as the data size increases this becomes more difficult; if you have 50 machines but 20,000 blocks to process, scheduling becomes much more complex. But by aiming to process data where it resides, a lot of data transfer and I/O is avoided.
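
To make that concrete, here is a small sketch (not from the post above) that asks the NameNode which hosts hold replicas of each block of a file, using the standard FileSystem API; this is the same placement information the scheduler consults when deciding where to run tasks. The file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocator {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/input.txt");   // hypothetical input file
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode which hosts hold replicas of each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset %d, length %d, hosts: %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}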

Garry
Mohamed El-Refaey
Ranch Hand

Joined: Dec 08, 2009
Posts: 119
Fantastic. Thanks Garry for the detailed explanation.
 