Two different solutions to improve development efficiency in Hadoop

 
David Li Xing
Greenhorn
Posts: 11
Hadoop is an outstanding big data solution. On one hand, its low cost and high scalability increase its popularity; on the other hand, its low development efficiency draws user complaints.

Hadoop is based on the MapReduce framework for big data development and computation. Everything seems fine while the computing task is simple, but problems appear as soon as the computation becomes a little more complex, and the poor development efficiency hurts more and more as the problem grows harder. One of the most common of these computations is the join, or "associative computation".

For example, suppose there are two files in HDFS holding the customer data and the order data respectively, with customerID as the field that associates them. How do we perform the join that adds the customer name to each order record?

The normal method is to read the two source files and, in the Map phase, process each row according to the file it came from: if the row is from Order, tag its foreign key with "O" to form a combined key; if it is from Customer, tag it with "C". After the Map phase, the data is partitioned on the key, then grouped and sorted on the combined key. Lastly, the Reduce phase combines the results and writes the output. Code along the following lines is quite common:
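A minimal sketch of such a tagged reduce-side join with the standard Hadoop MapReduce API might look like this (the file names and field layouts are assumptions, and for brevity the reducer buffers the orders in memory rather than doing a full secondary sort on the combined key):

// Assumed record layouts (not shown in the original post):
//   customers.txt: customerID,customerName
//   orders.txt:    orderID,customerID,amount
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomerOrderJoin {

    // Map: tag every record with its source ("C" or "O") so the
    // reducer can tell customers and orders apart.
    public static class JoinMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
            String[] f = value.toString().split(",");
            if (file.startsWith("customers")) {
                ctx.write(new Text(f[0]), new Text("C#" + f[1]));
            } else {
                ctx.write(new Text(f[1]), new Text("O#" + f[0] + "," + f[2]));
            }
        }
    }

    // Reduce: for each customerID, attach the customer name to every order.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String name = null;
            List<String> orders = new ArrayList<>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("C#")) {
                    name = s.substring(2);
                } else {
                    orders.add(s.substring(2));
                }
            }
            if (name == null) return;   // order without a matching customer
            for (String o : orders) {
                ctx.write(key, new Text(name + "," + o));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "customer-order join");
        job.setJarByClass(CustomerOrderJoin.class);
        job.setMapperClass(JoinMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));  // customers.txt
        FileInputFormat.addInputPath(job, new Path(args[1]));  // orders.txt
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}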

As can be seen above, to implement the join, programmers cannot use the raw data directly; they have to write extra code to handle the tags, work around the MapReduce framework itself, and design and compute the association between the data sets at the bottom layer. Handling such computations this way obviously requires programmers with strong programming skills; it is also quite time-consuming, and there is no guarantee of computational efficiency. And the above case is only the simplest kind of join. As you can imagine, using MapReduce for multi-table joins, or for joins with complex business logic, makes the complexity grow geometrically, and the difficulty and the development cost become nearly unbearable.

In fact, a join is a common operation and by no means complex in itself. The apparent difficulty arises because MapReduce, for all its generality, is not specialized for this kind of work. Similarly, development with MapReduce is quite inefficient for order-related computations such as year-on-year comparisons and medians, and for alignment or enumeration grouping.

Although Hadoop packages Hive, Pig, and other higher-level solutions on top of MapReduce, these are not powerful enough on the one hand, and on the other they offer only rather simple, basic queries. To implement business logic involving a complex procedure, hand coding is still unavoidable.
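To be fair, the simple join above is exactly the kind of query Hive handles well. A sketch along these lines, submitted through the standard Hive JDBC driver (the connection URL, credentials, and table names are assumptions), replaces the whole MapReduce job:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJoinSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; URL and credentials are assumptions.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = conn.createStatement();
        // The tagged-key MapReduce job collapses into a single query.
        ResultSet rs = stmt.executeQuery(
                "SELECT o.orderID, c.customerName, o.amount "
                + "FROM orders o JOIN customers c "
                + "ON o.customerID = c.customerID");
        while (rs.next()) {
            System.out.println(rs.getString(1) + ","
                    + rs.getString(2) + "," + rs.getString(3));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}

The limitation shows up one step later: as soon as the logic needs a multi-step procedure rather than a single query, Hive pushes you back to UDFs or hand-written MapReduce.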

So what can we do to boost development efficiency on Hadoop? esProc is quite a good choice!

esProc is a pure Java parallel computation framework focused on extending the capabilities of Hadoop and improving the development efficiency of Hadoop programmers.

For the same example as above, the esProc solution is shown below:
Main program:

Sub program:


The above shows two methods of implementing the same task; you can choose either one according to the nature of your problem.