As someone who doesn't use Hadoop, at least not yet, it seems to me that to really get a feel for setting up, managing, and testing an implementation of Hadoop you need to have a multiple machine setup. You can't mimic real world use cases if you're running it on one machine. Arguably it's not even helpful to set it up on only two or three.
For a company like Yahoo, or even one with a few hundred employees, this is not an issue. Set up a dozen servers, put a small team on the project, and you're off to the races.
For a company that's smaller, and already tossing a lot of money at attempts to make their existing data management (like a traditional RDBMS running on servers with a RAID array of SAS drives) work, it's a harder sell. The hardware budget is already stretched, and you have to divide time between patching up the existing setup and research on potential replacements.
For an individual hoping to learn Hadoop in their spare time, it's even more difficult. I have one desktop and one laptop at home, and I suppose if I wanted to push my luck I could run virtual machines on each to emulate a test bed of 4-6 VMs.
So my question to Chuck Lam, or anyone else who is familiar with Hadoop, is: how did you set up your initial test bed at your job or in your house? What kind of starter projects did you tackle with it?
You only need one or two additional machines to test real-world scenarios, either physical or virtual, and they need very few resources for prototyping/sanity-checking. Shouldn't be difficult at all. I think HBase and HDFS projects are the easiest to get started with.
If you don't have enough spare computing power or enough machines available, take a look at the Amazon EC2 service. You can rent instances there, and Hadoop has helper scripts for installing on EC2.
But you can also run it locally. I run my local tests on a simple laptop, with the Hadoop cluster nodes running as Xen domU instances.
A combination of the two would be to run Amazon images on a local Xen setup, so that you don't spend too much time on tests that don't really need the computing power. That's a further subject; maybe I'll study it a little later.
Hadoop can be run on a single machine. In fact, that's what a development set-up is usually like. You deploy your program to a cluster of machines only after it's been fully debugged in your development set-up. It's kind of like Web development. Even though a cluster of machines is used in a production environment, one machine is sufficient for development.
To learn programming in Hadoop/MapReduce, working under a single-machine set-up will get you very far. To actually get a taste of it running on a cluster, you can use Amazon Web Services (e.g., EC2). That's a fairly typical set-up for universities teaching students how to use Hadoop.
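To make the single-machine point concrete, here is a minimal sketch of the map, shuffle, and reduce steps of a word count in plain Python. It is not Hadoop's API; the function names (map_words, shuffle, reduce_counts, run_local_job) are made up for illustration, and the only assumption is that the input fits in memory on one machine.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def run_local_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map -> shuffle -> reduce in one process (hypothetical helper)."""
    emitted = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(emitted)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Made-up sample input, just to exercise the three steps.
    sample = [
        "Hadoop runs on one machine",
        "one machine is enough for development",
    ]
    for word, count in sorted(run_local_job(sample).items()):
        print(word, count)
```

The idea is the same as in a real development set-up: get the map and reduce logic right on small input locally, and only then worry about how it behaves on a cluster.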
While I understand that you can run Hadoop in a single instance, I'm thinking of testing its scalability with respect to the specific problems my job tries to tackle. In that respect, I think EC2 is a fantastic solution because it makes the test comparisons more uniform. I can deploy on one instance and run a bunch of benchmarks, then deploy on three, four, six, and ten to see what factor of improvement I get at different scales.
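For what it's worth, the comparison I have in mind could be sketched roughly like this in Python. The function name scaling_report is my own, and the runtimes in the example are placeholders, not measurements; the real numbers would come from timing the same benchmark at each cluster size.

```python
from typing import Dict, List, Tuple


def scaling_report(runtimes: Dict[int, float]) -> List[Tuple[int, float, float]]:
    """Return (instances, speedup, efficiency) rows.

    Speedup is measured against the smallest cluster in the input.
    Efficiency is speedup divided by the growth in instance count,
    so 1.0 means perfectly linear scaling.
    """
    if not runtimes:
        raise ValueError("need at least one measurement")
    base_n = min(runtimes)
    base_t = runtimes[base_n]
    rows = []
    for n in sorted(runtimes):
        speedup = base_t / runtimes[n]
        efficiency = speedup / (n / base_n)
        rows.append((n, speedup, efficiency))
    return rows


if __name__ == "__main__":
    # Placeholder wall-clock times in seconds, keyed by instance count.
    # These are NOT real benchmark results.
    placeholder = {1: 100.0, 2: 50.0, 4: 40.0}
    for n, speedup, efficiency in scaling_report(placeholder):
        print(f"{n} instances: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

If efficiency drops off quickly as instances are added, that would suggest the job is bottlenecked on something other than raw node count.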
In retrospect, I should have figured that out on my own. Still, I am grateful for your help.