
Mapreduce using Java

Madhumitha Baskaran
Ranch Hand

Joined: Feb 27, 2010
Posts: 66
Hi all,

Is it necessary to use Hadoop to implement MapReduce programs in Java? Is it possible to implement it without Hadoop, using plain Java classes alone?

Please help me.

Thanks in advance,
Madhu
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 532
Of course it's possible.
MapReduce is just a technique: break the input into smaller chunks, process each chunk to get a partial result, and finally aggregate all those partial results into a final result.
It lends itself to parallelism, because each chunk can be processed by a different thread, core, processor, or machine, then collected at a central point and aggregated.
But such a simplistic implementation may not scale well, or may simply become too time-consuming, for large datasets.
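To make that concrete, here is a minimal single-JVM sketch of the idea in plain Java (the class name and the word-counting task are just illustrative): the input lines are split into chunks, each chunk is mapped to a partial count on a thread pool, and the partial counts are then reduced to a total.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SimpleMapReduce {
    public static void main(String[] args) throws Exception {
        List<String> lines = Arrays.asList("a b", "b c", "a a", "c");
        int chunkSize = 2;
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<Integer>> partials = new ArrayList<Future<Integer>>();

        // Map phase: each chunk of lines is counted by a separate task.
        for (int i = 0; i < lines.size(); i += chunkSize) {
            final List<String> chunk =
                lines.subList(i, Math.min(i + chunkSize, lines.size()));
            partials.add(pool.submit(new Callable<Integer>() {
                public Integer call() {
                    int words = 0;
                    for (String line : chunk) {
                        words += line.split("\\s+").length;
                    }
                    return words;
                }
            }));
        }

        // Reduce phase: aggregate the per-chunk results at a central point.
        int total = 0;
        for (Future<Integer> partial : partials) {
            total += partial.get();
        }
        pool.shutdown();
        System.out.println("Total words: " + total); // prints 7
    }
}

Each Callable plays the role of a map task and the final loop plays the role of the reduce step; Hadoop does essentially the same thing, but across processes and machines rather than threads.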

What Hadoop brings to the table is the infrastructure and physical architecture to perform distributed MapReduce on a large scale using a cluster of machines, with features like centralized job tracking and supervision.
Since it stores chunk results in a distributed file system, it's also fault tolerant. It's suitable when datasets are hundreds of MBs and above in size.
If your problem doesn't require that level of scalability or fault tolerance, or doesn't involve large datasets, you don't need Hadoop.

What's the nature of your problem?
Madhumitha Baskaran
Ranch Hand

Joined: Feb 27, 2010
Posts: 66
Thanks. Your answer was helpful.

I am working on a project to implement distributed grep and distributed sorting using MapReduce. I have an ordinary Core i5 laptop and I don't have a distributed environment to work on, so I am thinking I can take a simplistic approach to implement it.

If I use threads to implement the same, will it be possible to have each thread use one core and execute simultaneously? Please help.

Thanks,
Madhu
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 532
Madhumitha Baskaran wrote: I am working on a project to implement distributed grep and distributed sorting using MapReduce. I have an ordinary Core i5 laptop and I don't have a distributed environment to work on, so I am thinking I can take a simplistic approach to implement it.

From your description of the problem, it looks like the intended system should indeed be distributed across machines at some point ("distributed grep and distributed sorting"), and even the strategy to do so has been decided as MapReduce (presumably using Hadoop).

The only problem seems to be that for your development purposes you don't have a cluster of machines at the moment.

I don't think the solution should be decided by the non-availability of development resources. Rather, it should be decided by how the system will finally be deployed in production.
You can start off by installing Hadoop in single-node mode on your laptop (the Hadoop site has a tutorial that explains how; it's very easy to do).
You can later simulate a cluster of machines on your laptop by installing a virtualization product like VirtualBox, creating at least one virtual machine (your host machine and the virtual machine will play the roles of master and worker nodes, i.e. JobTracker/NameNode and TaskTracker/DataNode), installing Hadoop on both of them, and running your jobs on this "virtual cluster". There is a learning curve involved here, but it'll be well worth it.
If at a later point you have access to more machines, you can very easily include them in your Hadoop setup. The grep and sort logic (Hadoop already supports sorted aggregation out of the box) will remain the same regardless of whether Hadoop runs on a single machine or a cluster; see the sketch below.
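To give a flavour of it, here is a rough sketch of what a distributed grep job could look like with Hadoop's mapreduce API. The class names GrepMapper/CountReducer and the "grep.pattern" configuration key are just my own illustration, not something built into Hadoop.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistributedGrep {

    // Emits (matching line, 1) for every input line containing the pattern.
    public static class GrepMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private String pattern;

        @Override
        protected void setup(Context context) {
            pattern = context.getConfiguration().get("grep.pattern");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().contains(pattern)) {
                context.write(value, ONE);
            }
        }
    }

    // Sums the counts for each distinct matching line.
    public static class CountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("grep.pattern", args[2]); // pattern passed on the command line
        Job job = new Job(conf, "distributed grep");
        job.setJarByClass(DistributedGrep.class);
        job.setMapperClass(GrepMapper.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the Hadoop distribution already ships a ready-made grep example in its examples jar, so you can also study that one before writing your own.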

Madhumitha Baskaran wrote: If I use threads to implement the same, will it be possible to have each thread use one core and execute simultaneously? Please help.

How each thread is scheduled and assigned to a core depends on how the JVM is implemented, how the underlying OS in turn schedules threads, what other applications are occupying the processor, and so on. It's rather emergent behaviour. Java has no explicit parallelism capability: you can't tell Java "I have 4 cores and I want this thread to run on this core and that thread to run on that core". You just implement multithreading using the Java APIs (Java 7, for example, introduces the fork/join API that makes tasks like yours easier), hope for the best, measure performance, and see if any code-level optimizations are possible to utilize threads better.
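As an illustration, here is a small sketch of how the Java 7 fork/join API could be applied to an in-memory grep (the class name and threshold are just for the example): the task recursively splits the array of lines, and the pool's worker threads, one per core by default, grep the halves in parallel.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Recursively splits the line array and greps each half in parallel.
public class ForkJoinGrep extends RecursiveTask<List<String>> {
    private static final int THRESHOLD = 1000; // process small slices directly
    private final String[] lines;
    private final int start, end;
    private final String pattern;

    ForkJoinGrep(String[] lines, int start, int end, String pattern) {
        this.lines = lines;
        this.start = start;
        this.end = end;
        this.pattern = pattern;
    }

    @Override
    protected List<String> compute() {
        if (end - start <= THRESHOLD) {
            List<String> matches = new ArrayList<String>();
            for (int i = start; i < end; i++) {
                if (lines[i].contains(pattern)) {
                    matches.add(lines[i]);
                }
            }
            return matches;
        }
        int mid = (start + end) / 2;
        ForkJoinGrep left = new ForkJoinGrep(lines, start, mid, pattern);
        ForkJoinGrep right = new ForkJoinGrep(lines, mid, end, pattern);
        right.fork();                         // schedule the right half asynchronously
        List<String> result = left.compute(); // process the left half in this thread
        result.addAll(right.join());          // wait for the right half and merge
        return result;
    }

    public static void main(String[] args) {
        String[] lines = { "hadoop rocks", "plain java", "hadoop cluster" };
        ForkJoinPool pool = new ForkJoinPool(); // defaults to one worker per core
        List<String> matches =
            pool.invoke(new ForkJoinGrep(lines, 0, lines.length, "hadoop"));
        System.out.println(matches); // [hadoop rocks, hadoop cluster]
    }
}

Even then, the OS and JVM still decide which core each worker thread actually runs on; fork/join only ensures the work is divided so that all cores can be kept busy.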

From the description of your problem, I think you should stick with Hadoop instead of going this route, since you need distribution. Going the threading route means you'll have to roll your own distribution logic later on, using RMI or something like that. Hadoop already has all of that and requires much less coding. You can concentrate on the core analysis logic from the start, instead of on the infrastructure to run that logic.
Madhumitha Baskaran
Ranch Hand

Joined: Feb 27, 2010
Posts: 66
Thanks a lot. Your reply is extremely helpful. I will go with Hadoop itself, because I might lose points if I do a simple implementation using Java threads alone. I am hoping that getting familiar with Hadoop will be a manageable task.
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 532
"lose points"?? Is this an academic project? Hadoop is easy to learn - no worries there.
Madhumitha Baskaran
Ranch Hand

Joined: Feb 27, 2010
Posts: 66
Yes, it is a project for my graduate studies. Thanks a lot for your help; otherwise I would have ended up doing a plain implementation using threads.
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 532
No problem. Good luck!
Satyaprakash Joshii
Ranch Hand

Joined: Jun 18, 2012
Posts: 139
I want to know: what is in Hadoop MapReduce that was not in Google's MapReduce?
Srinivas Mupparapu
Greenhorn

Joined: Feb 12, 2004
Posts: 14

Satyaprakash Joshii wrote: I want to know: what is in Hadoop MapReduce that was not in Google's MapReduce?


The difference is that Hadoop is open-source Apache software, whereas Google's implementation is not. Hadoop was built based on the MapReduce white paper that Google published. Look at Hadoop's history for more info.
 