I checked only the possiblity to use Hadoop on the cloud and I found some ec2 scripts which handles instance startups.
I'm not sure if it is possible to increase the size of a cluster dinamically. Currently I see some static configuration files which controls the number of nodes in the cluster.
Since the pricing model of EC2 instances are hourly based, I think it is a waste to launch a set of cluster nodes for a problem, then stop it. And for a new MapReduce job to start a new set of nodes. I would like to control the number of nodes more dinamically then be able to use that cluster size to many smaller and larger jobs in order to reduce the cost of instance prices. Is it possible in a straightforward way, or it is possible to implement (extend) some core parts of the hadoop infrastructure? Or it is absolutely not recommended?
I do have a whole chapter on running Hadoop on AWS/EC2. Unfortunately I must admit that I don't go much beyond the scripts included in Hadoop's distribution. The implications of using Hadoop on EC2 is quite subtle, so I devote most of the chapter to clarifying that instead.
As to Tibi's question about dynamically adjusting the size of the cluster, Hadoop is certainly designed to handle it, but I wouldn't recommend doing that with a Hadoop cluster in EC2. Hadoop assumes fairly stable clusters, and slowly "balances" the data across the cluster when you change its size. Just because you can add/remove nodes in EC2 fairly fast doesn't mean that Hadoop can respond equally fast.
You should also keep in mind that Hadoop doesn't necessarily execute just one job at a time. It's optimized for throughput so that if one job doesn't take up all the nodes in a cluster, it will (partially) start the next job in parallel to keep any node from being idle.
Joined: Jun 11, 2009
Thank you, Chuck.
The adjusting of the size of cluster would not be done with the intent to pull out an instance while running some tasks. Instead the main intent it would be to increase the number of instances while already running the Hadoop cluster, at some decision point (in my hand also). And some kind of gracefull shutdowns of some nodes, after the huge amount of load is getting back to a much lower one. In such case a mechanism which will inform the Hadoop cluster manager to not start new tasks on some nodes and when the current task are done, to notify us, or self shutdown the instance in cause.
Knowing a distribution of the loads of our current jobs, I calculated that a resonable amount of resources (costs) could be optimized.
Because of multi job session, I have a situation when for one client I start EC2 several (50) instances, then later on it comes a very different job, which does not requires so much instances. If the biggest job finished, the smaller job will scale up so much that the time required for the remaining task will finish proportiately soon. But I know that not short enough to not pass the accounting hour limit and I get another 50 x instance hour on my bill, while I know that an amount of 10 instance could finish the job. The difference of 40 x instancehour price is lost. Frankly to say, that is not too much to be worst considering gracefull shutdown. At least for graceful shutdown not.
The possibility of enlarging the size of the cluster is more stringent, because while processing small mapreduce jobs it arrives on the pipeline a huge job, immediately I should start a completely new hadoop cluster with many instances, while the smaller jobs from the original scheduling could be finished in the spare time of the huger cluster (consider the losts when the huge task is ending). Also managing multiple clusters + hdfs maybe complicates too much the problem.
At the moment I feel that dynamically starting up new nodes it would be significant. Graceful shutdown is not so stringent, because scaling out factor of map reduce algorithms.