
Chuck Lam

author
since Aug 09, 2010

Recent posts by Chuck Lam

I do have a whole chapter on running Hadoop on AWS/EC2. Unfortunately I must admit that I don't go much beyond the scripts included in Hadoop's distribution. The implications of using Hadoop on EC2 are quite subtle, so I devote most of the chapter to clarifying them instead.

As to Tibi's question about dynamically adjusting the size of the cluster, Hadoop is certainly designed to handle it, but I wouldn't recommend doing that with a Hadoop cluster in EC2. Hadoop assumes fairly stable clusters, and slowly "balances" the data across the cluster when you change its size. Just because you can add/remove nodes in EC2 fairly fast doesn't mean that Hadoop can respond equally fast.

You should also keep in mind that Hadoop doesn't necessarily execute just one job at a time. It's optimized for throughput so that if one job doesn't take up all the nodes in a cluster, it will (partially) start the next job in parallel to keep any node from being idle.
11 years ago
In fact, the last chapter of my book has a whole case study on how IBM uses Hadoop to implement its intranet search.

Long story short, Hadoop can be helpful in enterprise search when you need to implement search in a distributed system. And the main reasons for needing a distributed system in search are scale and complexity. When you're indexing lots of data (IBM's intranet is quite huge), using Lucene/Solr on a single machine would be too slow. Similarly, if you need to do any complex indexing, such as natural language processing, you will easily outgrow the capability of a single machine.
11 years ago
I've seen a number of university courses where students are expected to get up to speed on Hadoop in about 2-4 weeks. My memory is a bit vague on this one, but I do remember that in one of them a mid-term homework assignment was to implement PageRank over Wikipedia articles using Hadoop. I would certainly consider that a "comfortable" level.

Of course, your learning curve will vary depending on your background and available resources. The courses I referred to above almost always require "distributed systems" as a prerequisite. The classes also usually have a test cluster already set up. If you're setting one up yourself, factor in some time for learning systems administration.
11 years ago
Hadoop is based on a couple of research papers published by Google explaining Google's data processing model. So the *conceptual model* can be considered the same. Of course, the details are very different.

To start out, Google's MapReduce programs are generally written in C/C++, while Hadoop's are generally Java-based. Given that both models have evolved separately over the years to target different communities, it shouldn't be surprising that the details are very different. Having said that, Google engineers I've talked to claim that learning Hadoop is relatively easy for them.
11 years ago
The word "invention" is a bit loaded so I'll sidestep it a little bit. Certainly the concept of map and reduce functions have been around for a long time, but MapReduce is a new framework for processing data sets in a scalable way. It's not just putting map and reduce function together.

Similarly, you can argue that Google did not invent link analysis, but that misses the point. Their PageRank algorithm applied link analysis in a scalable way to a new problem domain (the Web), so it certainly stands on its own as a useful invention.
11 years ago
Hadoop can be run on a single machine. In fact, that's what a development set-up is usually like. You deploy your program to a cluster of machines only after it's been fully debugged in your development set-up. It's kind of like Web development. Even though a cluster of machines is used in a production environment, one machine is sufficient for development.

To learn programming in Hadoop/MapReduce, working under a single-machine set-up will get you very far. To actually get a taste of it running on a cluster, you can use Amazon Web Services (e.g., EC2). That's a fairly typical set-up for universities teaching students how to use Hadoop.
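If it helps to see what a single-machine run looks like in code, here's a minimal sketch (my own example, not a listing from the book) of a driver forced into standalone mode while developing. The property names are from the 0.20/1.x-era configuration; later releases renamed them (fs.defaultFS, mapreduce.framework.name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalRun {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "file:///");   // read/write the local filesystem, not HDFS
    conf.set("mapred.job.tracker", "local");   // map/reduce tasks run in this JVM, no JobTracker

    Job job = new Job(conf, "local smoke test");
    job.setJarByClass(LocalRun.class);
    // No mapper or reducer set: Hadoop's identity classes simply pass the input through,
    // which is enough to check that the set-up works before you write real jobs.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once that runs against a small local file, the same code can be pointed at a real cluster (or an EC2 cluster) just by changing the configuration.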
11 years ago
Yes, you have to rewrite your program to make it work over a cluster (distributed computing). Hadoop is a framework that makes the rewriting easier.

In my book, I gave an example of writing a word counting program. Writing such a program to run on a single machine is easy. Writing it to run on a cluster of machines introduces a lot of complexity. The Hadoop framework eliminates much of that complexity, but the program will have to be architected differently.
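To make that concrete, here is roughly what word counting looks like under Hadoop's mapreduce API (a sketch of the standard example from memory, not the exact listing in the book): the mapper emits a (word, 1) pair per token, the framework groups the pairs by word across the cluster, and the reducer sums the counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for each line of input, emit (word, 1) for every token on the line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: the framework has already grouped the pairs by word; just sum the counts.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure the job and point it at the input and output paths.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The interesting part is what you don't write: splitting the input, shipping the code to the data, sorting and grouping the intermediate pairs, and retrying failed tasks are all handled by the framework.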
11 years ago
Hadoop is targeted at developing programs to process large data sets. It's useful whenever you have a lot of data to process or analyze. The first Hadoop application for many web companies is analyzing log data. For example, you can look at log data to see how many unique viewers you have and where they tend to come from. Another popular use is to analyze user data to understand your users better.
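To give a feel for the log-analysis case (the log format below is made up purely for illustration), a mapper can emit a (page, visitor) pair for each log line and a reducer can then count the distinct visitors per page:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class UniqueViewers {

  // Mapper: each log line (assumed "visitorId<TAB>pageUrl" for this sketch) becomes a (page, visitor) pair.
  public static class PageVisitorMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text page = new Text();
    private final Text visitor = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length < 2) {
        return;                 // skip malformed lines
      }
      visitor.set(fields[0]);
      page.set(fields[1]);
      context.write(page, visitor);
    }
  }

  // Reducer: all visitors for one page arrive together; count how many of them are distinct.
  public static class DistinctCountReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    public void reduce(Text page, Iterable<Text> visitors, Context context)
        throws IOException, InterruptedException {
      Set<String> distinct = new HashSet<String>();
      for (Text v : visitors) {
        distinct.add(v.toString());
      }
      context.write(page, new IntWritable(distinct.size()));
    }
  }
}

The driver wiring is the same as for any other job. For pages with a huge number of distinct visitors you'd want something smarter than an in-memory set (a secondary sort, say), but this is the general shape of that kind of analysis.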

One of the original inspirations for Hadoop/MapReduce is crawling and indexing for search engines. In fact, Doug Cutting, the inventor of Lucene, is also the inventor of Hadoop.
11 years ago
It's definitely possible to install Hadoop on a Mac. In fact, almost every developer you see at a Hadoop conference is carrying a Mac :P

To be more specific, Hadoop is targeted at running on Unix and has several modes of operation. In production ("fully distributed mode"), it runs on a cluster of Unix machines, which are usually cheap Linux boxes. In development ("standalone mode"), you run it on a single machine to have quick development cycles. It's very popular to use a Mac when running Hadoop in development mode. I don't know of anyone using Hadoop on a cluster of Macs, though, and I wouldn't be surprised if you run into trouble doing that.

Having said all that, just follow the standard instructions in my book or in any Hadoop tutorial to install it on the Mac. You'll have to configure the JAVA_HOME property in the Hadoop configuration correctly for your set-up, but otherwise everything should be the same.
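For example (a common set-up, not a listing from the book), in the hadoop-env.sh file under Hadoop's conf directory you can let the Mac locate the JDK for you:

# conf/hadoop-env.sh -- the exact path can vary between Hadoop releases
export JAVA_HOME=$(/usr/libexec/java_home)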

Have fun!
11 years ago
One thing I kept in mind in writing the book was to have many realistic examples using MapReduce, so you can see various ways of applying it. MapReduce is a different way of structuring problems, so it does take some time to get used to. Just as an event-based framework helps a lot in writing GUIs, the MapReduce framework helps in writing large data-processing programs. The hurdle is getting used to a different view on this class of problems.
11 years ago
Yes. I wrote the book because I heard the same frustrations from many people. Hadoop has a steep learning curve not because it's complicated, but because it's novel. Also, like many open source projects, a lot of the documentation is organized for reference rather than for learning. I intend my book for the general Java programmer with no background in distributed computing or data processing.
11 years ago
Hi everyone,

I'm the author of Hadoop in Action. Look forward to discussing the book and Hadoop with you guys!
11 years ago