Hadoop runs on Linux only.
Your mileage may vary.
Some customers get really good performance on Cisco UCS and also HP DL380 among others.
Hadoop uses the notion of data locality, so the closer the data is to the node where the task is running the better performance you get.
Depending on the application SSDs might have a positive impact versus classic HDD.
The data locality optimization is one of the key techniques that allows Hadoop to scale and be so performant.
The basic idea is that if you have a cluster with large amounts of data you really don't want to be moving that data around the cluster to be processed. So when a MapReduce job is being scheduled the framework determines which pieces of data (blocks) need to be processed and on which machines they are located and then starts tasks to process the data on those hosts. By default Hadoop keeps 3 copies of each block. So in the best case if you have 10 blocks to be processed (usually its much higher) then the framework will schedule jobs on the host where replicas of each block reside.
Obviously as the data size increases this becomes more difficult; if you have 50 machines but 20000 blocks to process then scheduling becomes much more complex. But by aiming to process data where it resides a lot of data transfer and i/o is avoided.
Joined: Dec 08, 2009
Fantastic. Thanks Garry for the detailed explanation.
I’ve looked at a lot of different solutions, and in my humble opinion Aspose is the way to go. Here’s the link: http://aspose.com