Two different solutions to improve developing efficiency in Hadoop

David Li Xing
Greenhorn

Joined: Oct 09, 2013
Posts: 11

Hadoop is an outstanding big data solution. On one hand, its low cost and high scalability make it popular; on the other hand, its low development efficiency draws complaints from users.

Hadoop is based on the MapReduce framework for big data development and computation. Everything seems fine when the computing task is simple. However, problems appear as soon as the computation gets a little more complex, and the poor development efficiency hurts more and more as the difficulty of the problem grows. One of the most common such computations is the "associative computation" (a join).

For example, suppose there are two files in HDFS holding the customer data and the order data respectively, with customerID as the field that associates them. How do we perform the join that adds the customer name to each row of the order list?

The normal method is to read the two source files and process each row in the Map phase according to the file it came from: if the row is from Order, tag the foreign key with "O" to form a combined key; if it is from Customer, tag it with "C". After the Map phase, the data is partitioned on the join key, then grouped and sorted on the combined keys. Finally, the Reduce phase merges the two sides and outputs the result. Code following this pattern is quite common.
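As a rough illustration of the tagging idea (a plain-Java sketch, without the Hadoop classes; the data, field layout, and method names are invented for this example), the map/shuffle/reduce steps can be simulated like this:

```java
import java.util.*;

public class TaggedJoinSketch {
    // Reduce-side join sketch: tag each record with its source file
    // ("C" for Customer, "O" for Order), shuffle by customerID, then
    // merge the two sides inside the "reduce" step.
    public static List<String> join(String[] customers, String[] orders) {
        // "Map" + shuffle: collect tagged values under the join key.
        Map<String, List<String>> shuffled = new TreeMap<>();
        for (String line : customers) {              // format: customerID,name
            String[] f = line.split(",");
            shuffled.computeIfAbsent(f[0], k -> new ArrayList<>()).add("C:" + f[1]);
        }
        for (String line : orders) {                 // format: orderID,customerID,amount
            String[] f = line.split(",");
            shuffled.computeIfAbsent(f[1], k -> new ArrayList<>()).add("O:" + f[0] + "," + f[2]);
        }
        // "Reduce": within each customerID group, find the customer name
        // via its "C" tag and attach it to every "O" order row.
        List<String> result = new ArrayList<>();
        for (List<String> group : shuffled.values()) {
            String name = null;
            List<String> orderRows = new ArrayList<>();
            for (String v : group) {
                if (v.startsWith("C:")) name = v.substring(2);
                else orderRows.add(v.substring(2));
            }
            for (String row : orderRows) result.add(row + "," + name);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] customers = { "1,Alice", "2,Bob" };
        String[] orders = { "100,1,25.0", "101,2,30.0", "102,1,12.5" };
        for (String row : join(customers, orders)) System.out.println(row);
    }
}
```

In a real Hadoop job, the tagging happens in a Mapper, the grouping is done by the framework's partitioner and sort comparators, and the merge runs in a Reducer, which is exactly the extra machinery the text complains about.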

As can be seen, to implement the join, programmers cannot use the raw data directly: they have to write complex code to handle the tags, work around the MapReduce framework, and design the association between the datasets at the bottom layer. Obviously, handling such computations this way demands strong programming skills; it is also time-consuming, and there is no guarantee of computational efficiency. And this is just the simplest kind of join. As you can imagine, using MapReduce for multi-table joins or for joins wrapped in complex business logic makes the complexity grow geometrically, until both the difficulty and the development efficiency become nearly unbearable.

In fact, the join itself is common and by no means complex. The apparent difficulty arises because MapReduce, for all its generality, is not specialized for this kind of work. Similarly, developing with MapReduce is quite inefficient for order-related computations such as year-on-year comparison and median calculation, and for alignment or enumeration grouping.
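To see why these are awkward in MapReduce, consider the median: it needs the whole group collected and sorted before a single value can be emitted, which does not fit the one-record-at-a-time Map model. A small plain-Java sketch of the computation itself (sample data and names invented for illustration):

```java
import java.util.*;

public class GroupMedianSketch {
    // Median of one group's values: requires holding and sorting the
    // entire group in memory, which is what a single MapReduce pass
    // cannot express naturally.
    static double median(List<Double> values) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        return (n % 2 == 1)
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    public static void main(String[] args) {
        // Per-year sales figures; a year-on-year comparison would likewise
        // need the groups kept in order, another ordered computation.
        Map<String, List<Double>> salesByYear = new TreeMap<>();
        salesByYear.put("2012", Arrays.asList(10.0, 30.0, 20.0));
        salesByYear.put("2013", Arrays.asList(15.0, 45.0));
        for (Map.Entry<String, List<Double>> e : salesByYear.entrySet()) {
            System.out.println(e.getKey() + " median = " + median(e.getValue()));
        }
    }
}
```

Three lines of ordinary Java per group; expressing the same thing as Mappers and Reducers takes far more scaffolding.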

Although Hadoop packages Hive, Pig, and other higher-level tools on top of MapReduce, these tools are, on one hand, not powerful enough, and on the other hand, offer only rather simple, basic queries. For business logic involving complex procedures, hand-coding is still unavoidable.

So what can we do to boost development efficiency on Hadoop? esProc is quite a good choice!

esProc is a pure Java parallel computing framework that focuses on enhancing the capabilities of Hadoop and improving the development efficiency of Hadoop programmers.

For the same example above, the esProc solution is shown below:
Main program:

Sub program:


The above are two different ways to implement the same task; you can choose either one depending on the nature of your problem.


David Li, a technical consultant on database performance optimization and big data processing solutions on Hadoop