I am working with Hadoop MapReduce to get a performance benefit, but when I run my program on Hadoop it takes about 37 minutes, whereas a simple C++ program doing the same task takes only about 5 minutes.
Please tell us the details. What is your application doing? Where is it spending most of its time?
Priyanka Suresh Shinde
The input file contains a number of records, one per line. I have written a simple program to print those lines in which three words are common. In the map function I pass the word as the key and the record as the value, and I compare those records in the reduce function.
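The logic described above can be sketched in plain Java, without Hadoop, by simulating the shuffle that groups records under each word. This is only an illustrative reconstruction of the poster's description — the class name, key separator, and three-word threshold are assumptions, not their actual code:

```java
import java.util.*;

public class CommonWordRecords {
    // Map step: emit (word, record) for every distinct word in each record,
    // then group by word, the way Hadoop's shuffle would.
    static Map<String, List<String>> map(List<String> records) {
        Map<String, List<String>> grouped = new HashMap<>();
        for (String record : records)
            for (String word : new LinkedHashSet<>(Arrays.asList(record.split("\\s+"))))
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(record);
        return grouped;
    }

    // Reduce step: records grouped under the same word are candidates;
    // a pair of records that appears together under three or more
    // different words shares at least three words.
    static Set<String> reduce(Map<String, List<String>> grouped) {
        Map<String, Integer> sharedWordCount = new HashMap<>();
        for (List<String> recs : grouped.values())
            for (int i = 0; i < recs.size(); i++)
                for (int j = i + 1; j < recs.size(); j++)
                    sharedWordCount.merge(recs.get(i) + "\u0000" + recs.get(j), 1, Integer::sum);
        Set<String> matches = new TreeSet<>();
        for (Map.Entry<String, Integer> e : sharedWordCount.entrySet())
            if (e.getValue() >= 3)
                Collections.addAll(matches, e.getKey().split("\u0000"));
        return matches;
    }
}
```

For example, with the records "the quick brown fox", "the quick brown dog", and "a lazy cat", the first two share three words ("the", "quick", "brown") and would both be printed, while the third matches nothing.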
Parallel processing is not a silver bullet that instantly makes every program run X times faster. It adds a lot of overhead for creating all the workers, distributing work to them, and then collecting and aggregating their results. If I understand your description correctly, there isn't any real processing going on - your workers do almost nothing.
Imagine you need to do a project that will take a man-year of work. You can do it yourself in a year, or you can hire ten developers, distribute the work among them, manage them, and deliver the project in, perhaps, three months. You might expect the project to be finished in five or six weeks, given that there are now ten people working on it, but that won't be the case. The developers won't spend all their time coding; they will need to meet and coordinate their work, which isn't needed if just one person does it all.
And now imagine that you hired ten developers to write a 20-line "Hello, world!" application. They'd probably spend much, much more time doing so than if you whipped up the program yourself. Each of them would, in theory, write just two lines of code, but the overhead of coordinating their work in this case is so big that it exceeds any benefit of having multiple people working on it several times over.
Your program is similar - individual workers have very little work to do, but the amount of work needed to coordinate them is the same as if they were working hard. This simple program won't work well with Hadoop. Only programs whose Map and Reduce functions do a substantial amount of real work can experience any speedup at all. Hadoop is best suited for cases where you can distribute a lot of work among a lot of workers.
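The trade-off above can be put in rough numbers. If each worker pays a fixed coordination cost per job, splitting a tiny task across many workers can end up slower than doing it alone. A minimal sketch of that back-of-the-envelope model (the numbers and the model itself are hypothetical, chosen only to illustrate the point):

```java
public class ParallelOverhead {
    // Simplified model: total wall-clock "time" when workUnits of work
    // are split evenly across workers, each run paying a fixed
    // overheadUnits of coordination cost (startup, shuffle, aggregation).
    static double parallelTime(double workUnits, int workers, double overheadUnits) {
        return workUnits / workers + overheadUnits;
    }

    public static void main(String[] args) {
        // Big job: 3700 units of work across 100 workers, 30 units of overhead.
        // Parallelism pays off handsomely: 3700/100 + 30 = 67 units.
        System.out.println(parallelTime(3700, 100, 30));

        // Tiny job: only 5 units of work. The overhead dominates:
        // 5/100 + 30 = 30.05 units - slower than just doing the 5 units serially.
        System.out.println(parallelTime(5, 100, 30));
    }
}
```

In the second case the job is six times slower parallelized than done serially, which mirrors what happens when Hadoop's per-job machinery wraps a trivially small computation.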