Garry Turkington

author

Recent posts by Garry Turkington

You can also download virtual machines with the software pre-installed and configured. For example, from Cloudera:

https://ccp.cloudera.com/display/SUPPORT/Cloudera+QuickStart+VM


10 years ago
Mohamed,

The data locality optimization is one of the key techniques that allows Hadoop to scale and be so performant.

The basic idea is that if you have a cluster with large amounts of data, you really don't want to be moving that data around the cluster to be processed. So when a MapReduce job is being scheduled, the framework determines which pieces of data (blocks) need to be processed and on which machines they are located, and then starts tasks to process the data on those hosts. By default Hadoop keeps 3 copies of each block, so in the best case, if you have 10 blocks to be processed (usually it's much higher), the framework will schedule tasks on the hosts where replicas of each block reside.

Obviously as the data size increases this becomes more difficult; if you have 50 machines but 20,000 blocks to process then scheduling becomes much more complex. But by aiming to process data where it resides, a lot of data transfer and I/O is avoided.
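If you want to see the locality information the scheduler works from, the HDFS client API exposes it directly. Below is a minimal sketch, assuming a cluster configured via core-site.xml on the classpath; the file path is hypothetical.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // Uses whatever fs.defaultFS is configured on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical input file; replace with a real HDFS path.
        Path file = new Path("/data/input/logs.txt");
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation lists the hosts holding a replica of that block --
        // the same information the framework uses to place tasks near the data.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    Arrays.toString(block.getHosts()));
        }
    }
}
```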

Garry
10 years ago
Can you clarify what you mean?

Garry
10 years ago
If you are already an experienced Java developer then the next step is to grab a download and start playing with Hadoop. You can get the software from one of the distribution homepages, or even a pre-configured VM with everything already set up, from Cloudera at least and possibly Hortonworks too.

If you want to understand more of the underlying theory, read the original Google papers on MapReduce and the Google File System (GFS); they are quite accessible for academic papers and have become absolute classics.

Garry
10 years ago
Do you mean something more specific here? In general if you have a problem along the lines of "I've a large volume of data and I want to extract statistics and run some deep analysis against it" then in my experience Hadoop is almost always worth considering.

Garry
10 years ago
That's an interesting problem. So if I understand you correctly, the key gain here is to have a single source of the arrangements-to-pay data that can be queried by any of the providers, obviating the need for this data to be kept in multiple different RDBMSs and easing the data-sharing problem by not requiring all files to be pushed to all partners?

If so, then this would be a great fit for Hive. You could possibly take the existing data files, push them into Hive and then use its SQL-like syntax to run reports against the data.
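As a rough illustration of that querying side, reports can be run over standard JDBC against HiveServer2. This is only a sketch, not a recommendation of a specific setup; the host, credentials, table and column names are all hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PaymentReport {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host and database are assumptions.
        String url = "jdbc:hive2://hive-host:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement()) {

            // Report-style query over a hypothetical 'payments' table --
            // the kind of workload Hive handles well.
            ResultSet rs = stmt.executeQuery(
                "SELECT customer_id, SUM(amount) AS total " +
                "FROM payments WHERE pay_date = '2013-04-01' " +
                "GROUP BY customer_id");

            while (rs.next()) {
                System.out.println(rs.getString("customer_id")
                        + " " + rs.getDouble("total"));
            }
        }
    }
}
```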

I see two possible wrinkles that would need more detailed thought:

1. If the query load is lots of small queries (e.g. a query per customer) then Hive, having higher latency than a transactional RDBMS, will give poor performance. But if the workload is more report-type queries like "select <payment records> from <table> where date = <date> and customer_id in <my customers>" then it would work well.
2. If you also wanted to hold the customer data and account info in Hadoop then that's more of an HBase-type use case, where ease of updates and low-latency query response times are more important. So you could potentially hold customer data in HBase and payment data in Hive. You'd still have the benefit of a single shared system.

Or in other words, my kneejerk response is to say it could be a good fit, and it's certainly worth some exploration if you are looking to do some rationalization/process streamlining.

Garry
10 years ago
Another interesting project is Storm -- often viewed as an alternative but there's a compelling argument that it can be effectively used as a streaming front-end to Hadoop.

http://stormproject.org
10 years ago
Low latency is really what it's all about. In particular, latency that is low enough that you could potentially use it to directly back applications serving end users.

But if your interest here, and in the other question re realtime, is not just low latency but true 'hard' realtime systems with all their consequent requirements, then that's likely not a good fit for Hadoop or any of the related projects. Indeed, when you take into account the basic mechanics of a distributed system, adding hard realtime requirements would put you into a very specialised niche that most Hadoop use cases don't have to worry about.

Garry
10 years ago
Regarding where it shines, it really is the classic situation: if you have large volumes of structured or semi-structured data and analytics that need to touch a lot of that data, then it's possibly a good fit.

I suspect I'll make this point multiple times this week -- I view Hadoop as one component of the data processing systems I build, and I use it alongside traditional databases and data warehouses. If your use case requires you to pull specific items from a well-structured data set then odds are you'll be much better off with a traditional RDBMS. Can you do it in Hadoop? Sure, but pick the best tool for the job. If your queries on the RDBMS turn into table scans because of how much data you need to process to generate your results, then in that case I'd consider Hadoop.

I find the Java APIs in Hadoop very well designed and easy to pick up. The biggest learning curve is more conceptual: learning how to take a particular problem and express it as a series of MapReduce jobs. You can find yourself with a series of MR jobs where the code for each is literally only a few lines in each map and reduce method, but put together in the MapReduce framework the processing chain can do extremely sophisticated things. This is where the real experience will need to develop.
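To give a sense of scale, here is a sketch of the well-worn word count example against the org.apache.hadoop.mapreduce API (driver class omitted); each method really is only a few lines.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts emitted for each word.
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```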

Garry
10 years ago
Hadoop using MapReduce is a batch processing framework. Typically you churn through a lot of data in queries that take seconds, minutes or longer.

HBase and Accumulo offer something more like a database, modelled on the Google BigTable paper. These can service low-latency, end-user-facing queries. Accumulo has a number of extensions over HBase, in particular much finer-grained security labelling and the ability to efficiently run server-side functions.
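To illustrate the difference in access pattern, an HBase read is a point lookup by row key rather than a query that churns through the whole data set. The sketch below uses the classic HBase client API of that era; the table, column family, and qualifier names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CustomerLookup {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the ZooKeeper quorum etc.
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "customers");   // hypothetical table
        try {
            // Point lookup by row key: low latency, no full scan involved.
            Get get = new Get(Bytes.toBytes("customer-12345"));
            get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));

            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(name == null ? "not found" : Bytes.toString(name));
        } finally {
            table.close();
        }
    }
}
```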
Garry
10 years ago
So true realtime isn't a core Hadoop strength at this point. The processing model really doesn't lend itself to that sort of requirement.

Looking ahead, though, two interesting developments are YARN and Tez. The former treats the Hadoop platform as a more generic processing framework, and Tez (just starting) will be building processing models that differ significantly from what we are used to in MapReduce today. Those seem like areas of major innovation to me, and I suspect we'll see a whole raft of different types of processing happening on Hadoop in the future.

Garry
10 years ago
I really tend to avoid recommending a particular distribution as I think they all have a place. But if you take your list of ideal requirements, then given the current state of the underlying Apache projects it's clear that MapR is probably the best fit.

Let's be honest: prior to Hadoop 2.0, HA (particularly for the NameNode) has always been compromised to a degree. The system is near bullet-proof when most things fail, but have your NN go down and you are in trouble. Hadoop 2.0 improves that greatly, and it'll be pretty cool to have all the major distributions offering out-of-the-box HA for both the NN and JT.

But I'd also caution that DR is absolutely about more than the choice of distribution, and I think you touch on that. Whatever setup you choose, fate will always find a failure scenario that causes some sort of operational crisis. Lightning strikes are particularly good at highlighting these. And if you do need things like complete cross-site redundancy, I suspect you'll end up building so much plumbing to make it all work that the choice of distribution and its particular features becomes less relevant.

I think it's true to say that this sort of high-end cross-site DR is another area in which Hadoop will continue to mature, but I'd also say, based on previous experience, that implementing other technologies that supposedly do have that level of DR is never as simple as the vendor says; this sort of thing is just fundamentally hard.

Garry
10 years ago
I also talk about EMR in the Hadoop Beginner's Guide, particularly in the earlier chapters, where there are examples of running code on a local cluster and then doing the same on EMR.

My later chapters on topics such as cluster management, Sqoop and Flume don't really touch on EMR, so if that is your interest, look elsewhere.

Garry
10 years ago
Yes, but you need to understand what you want from your data warehouse. If you need to ingest largely structured data and run SQL-type queries against it, this is the core Hive use case and is probably what most people think of when considering a DW and Hadoop.

If you need more custom analytics then it's possible, but at that point your integration with the front-end, user-facing query tools will also become more specialised.

I also recommend thinking about what Hadoop is good for and what workloads may make sense in a traditional DW. At my current company I've got both, and I see it as a case of complementary and not necessarily competing technologies.

In either case, remember that Hadoop has a core trade-off of optimizing for throughput at the cost of latency. In a true DW-type scenario this shouldn't be a concern, as almost by definition no one expects DW queries to have sub-second response times.

What is really nice is that with Hadoop you can have the best of both worlds: use Hive for more traditional heavy analytic queries, and then perhaps HBase for more user-facing, lighter queries. And to look ahead a little, both Cloudera's Impala and Apache Drill will offer a much lower-latency SQL interface to data residing on HDFS, so the fusion will become greater.

Garry
10 years ago
The nice thing is that you can download and try a free version of each of them. The base Apache distro is good when you're just learning and getting started but you wouldn't deploy it in production.

The real advantage of the bundled distributions such as Cloudera and Hortonworks is that you don't have to do the juggling to get version x of Hive working with version y of Hadoop and version z of HBase. They also come with better tools for deployment and management in an operational environment.

MapR gives you that, but as the previous poster mentioned, it also has a number of unique extensions such as its NameNode-free HA architecture and the NFS integration. If you have existing legacy systems that push data to NFS, this is a very nice option.

So it's really down to the combination of the products in the bundle, the tools you need to run it in production, and how much you are willing to pay for the commercial aspects of CDH and MapR in particular. But as I said, try them all!

Garry
10 years ago