With NoSQL (and Hadoop in particular) being one of the darlings of the "Big Data" era of the last few years, one of the first questions I have when choosing a NoSQL option is: "Should I be choosing a NoSQL option at all, or is a relational database the correct option for me?" I know that at one of the DC "BigData" meetups, or TechDC meetups, Living Social talked about their strategy of saving all data into their NoSQL solution and determining its usage later. Does your book cover the question of when it is appropriate to use Hadoop, and when your solution may in fact be better implemented in an RDBMS? Also, do you feel that most things that are implemented in RDBMS systems could just as efficiently (or more efficiently) be translated to a Hadoop solution?
And once you have made the decision to go down the NoSQL route, many of us still have to determine the technology.
"Facebook created Cassandra, Google created BigTable and MapReduce, Amazon created SimpleDB and LinkedIn created Project Voldemort", etc. We know there are quite a few solutions out there, and I'm sure that the strength of the ecosystem plays a large part in what the correct choice is. Do you cover the topic of choosing a NoSQL solution and argue the case for why Hadoop is the correct choice, or is the book targeted at people who have already made that decision?
Thanks for your time and knowledge!
All very good questions! Yes, it's tricky these days to pick a data storage system. A few years ago everything would automatically get stuck into a relational database, as that was all that was widely available. Traditional databases still have their place, and you'll still find a lot of technology companies using sharded relational databases (with some memcached-like fronting system) to service web requests. Ultimately it comes down to your particular application: you need to map out how you expect your data to be accessed, and whether you need things like transactions. Hadoop in its current form is a batch-based system, so you wouldn't want to use it (MapReduce) for serving any real-time data access use cases. HBase and friends, on the other hand, are suited to real-time access and scale very well too, but it's important to understand their limitations (such as how well they handle searching).
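To make the "sharded relational database with a memcached-like front" pattern a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: in-memory SQLite connections stand in for the shards, a plain dict stands in for the cache tier, and the names `NUM_SHARDS`, `shard_for`, `put_user` and `get_user` are hypothetical, not taken from any particular system.

```python
import sqlite3
import zlib

NUM_SHARDS = 4  # hypothetical shard count, chosen for illustration

# Each shard is an independent relational database (in-memory SQLite here).
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT)")

# A plain dict stands in for a memcached-like cache tier (assumption).
cache = {}


def shard_for(user_id: str) -> sqlite3.Connection:
    """Pick a shard using a stable hash of the key."""
    return shards[zlib.crc32(user_id.encode("utf-8")) % NUM_SHARDS]


def put_user(user_id: str, name: str) -> None:
    """Write to the owning shard, then invalidate the cached copy."""
    db = shard_for(user_id)
    with db:  # commits on success; transactional within this one shard only
        db.execute(
            "INSERT OR REPLACE INTO users (id, name) VALUES (?, ?)",
            (user_id, name),
        )
    cache.pop(user_id, None)


def get_user(user_id: str):
    """Cache-aside read: try the cache first, fall back to the shard."""
    if user_id in cache:
        return cache[user_id]
    row = shard_for(user_id).execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None
    cache[user_id] = row[0]
    return row[0]
```

Note the trade-off this makes visible: a write is transactional only within the single shard that owns the key, so anything that needs a transaction spanning several keys has to be designed around that limit.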
I don't go into NoSQL at length in my book, apart from covering some HBase and Hadoop integration use cases. My book should, however, give you a good sense of the various use cases that work well for Hadoop, and hopefully that'll give you a sense of how it can be leveraged and whether that would be sufficient for your needs.
Thanks for the reply! The question of whether a transaction is needed or not is a very important one! My usage thus far has been for situations where a transaction isn't required - but now you have me curious to investigate transactional support for other use cases.
I'm hoping to explore Hadoop more for a pet project that I'm about to start, and am looking forward to finally getting some practical knowledge of it myself. Thanks, and congrats on the book!