Parallel programming and process modelling

Andrew Hogendijk
Greenhorn

Joined: May 19, 2010
Posts: 12
Hi Sergey,

I am new to parallel programming and have two questions regarding this potentially advantageous field:

1/ Is there an established method or process for breaking a task into steps so that it can be done in parallel (i.e., a way to see how it would look and decide whether it is worth pursuing)?

2/ Is there a difference in approach (not necessarily in the code) between using 'standard' CPUs to do the job and using GPUs to do the same?

I have been trying to figure this out for a while now in my spare time, but I am getting more confused by the information I find. I would have thought that the method of solving a problem would be the same and only the code implementation would be specific. Can you confirm this? Is there a better way? I am looking at this area with video signal processing in mind.

Cheers and Thanks

The Frog
Sergey Babkin
author
Ranch Hand

Joined: Apr 05, 2010
Posts: 50
It kind of depends on the task. I haven't done video signal processing myself, so I can't say for sure. But the two basic approaches are breaking the work lengthwise (pipelining, where each stage does one step and hands its output to the next) or crosswise (partitioning the data into subsets and having a separate thread, process, or machine work on each subset). The guideline for partitioning is to look for longish computations on one subset of the data that don't touch the other subsets much. The problem is in the synchronization, so to get a win the time spent computing should be a few times longer than the time spent synchronizing. There is no "universal" solution, since each task is different. The only universal part is to look for the semi-independent parts of the algorithm and separate them into their own threads.
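
To make the partitioning idea concrete, here is a minimal Java sketch. It assumes a toy task (summing the squares of an int array); the class name PartitionedSum and the method sumOfSquares are made up for this illustration. Each worker gets its own contiguous slice and a local accumulator, and the only synchronization is collecting one result per worker at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class: partitions an array across a fixed pool of workers.
class PartitionedSum {

    // Sums the squares of data[] by giving each worker its own contiguous slice.
    static long sumOfSquares(int[] data, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = (data.length + workers - 1) / workers; // ceiling division
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(data.length, w * chunk);
                final int to = Math.min(data.length, from + chunk);
                // Each task reads only data[from..to) and writes only its local sum.
                parts.add(pool.submit(() -> {
                    long local = 0;
                    for (int i = from; i < to; i++) {
                        local += (long) data[i] * data[i];
                    }
                    return local;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // the only synchronization point per worker
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i % 100;
        }
        System.out.println(sumOfSquares(data, 4));
    }
}
```

Whether this wins anything depends on the ratio described above: for a loop this cheap, the cost of starting the pool and collecting the results may well be larger than the computation itself.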

Another thing to consider is the memory issue. In SMP (symmetric multiprocessing with 'standard' CPUs) memory is the bottleneck. It can be sort of kept at bay by caching, which works best when each processor works mostly within its own memory range and doesn't dirty the other processors' caches. The architectures with GPUs (like in the Sony PS3) try to solve this problem by giving each GPU its own private memory. However, the downside is that this memory is small, and transfers to and from it are done in big blocks and become expensive. So the extra limitation for GPU algorithms is that your thread's data chunk must fit into the GPU's private memory.
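
The "chunk must fit" constraint usually shows up in the code as block-wise processing: copy a block in, work on it locally, copy it back. The Java sketch below only illustrates that structure. It runs on an ordinary JVM, the local array merely stands in for a private memory, and LOCAL_CAPACITY is an assumed figure, not the size of any real device.

```java
// Hypothetical example class: processes a large array in blocks that fit a small buffer.
class BlockedProcessing {

    // Assumed capacity of the fast private memory, in elements (illustrative only).
    static final int LOCAL_CAPACITY = 4096;

    // Multiplies every element of data by factor, one block at a time.
    static void scaleInBlocks(float[] data, float factor) {
        float[] local = new float[LOCAL_CAPACITY]; // stands in for the private memory
        for (int start = 0; start < data.length; start += LOCAL_CAPACITY) {
            int len = Math.min(LOCAL_CAPACITY, data.length - start);
            System.arraycopy(data, start, local, 0, len); // one big transfer in
            for (int i = 0; i < len; i++) {
                local[i] *= factor;                       // touch only the local copy
            }
            System.arraycopy(local, 0, data, start, len); // one big transfer out
        }
    }

    public static void main(String[] args) {
        float[] data = new float[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        scaleInBlocks(data, 2.0f);
        System.out.println(data[9_999]);
    }
}
```

The point of the shape is that the two transfers per block are the expensive part, so the work done between them has to be large enough to pay for them.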

Sorry about the very general answer, but I don't think there is a better one until a concrete task is considered.
Suddha Satwa Roy
Greenhorn

Joined: Apr 24, 2010
Posts: 3
Here is a link that may help you: http://www.cs.rit.edu/~ark/lectures/pj05/workshop.pdf. It's very interesting.
Andrew Hogendijk
Greenhorn

Joined: May 19, 2010
Posts: 12
Thank you both for your answers, I appreciate your time on this.

If I understand your answer correctly, Sergey, then the task itself really is the determining factor in deciding how to approach parallelization. There is no one-size-fits-all approach in that respect. That makes a lot of sense. You mentioned that the hardware is a point of consideration, specifically the movement of data to and from the CPU's cache or the GPU's private memory. In your experience, how serious is the impact when a block of data to be processed does not fit into the cache and must be moved from normal RAM to the CPU and back again? I know this is again a little vague for a question, but perhaps there is a way of justifying larger hardware costs based on specific CPU / GPU parameters for a given task. My guess is that the design of the hardware, and not horsepower alone, will greatly affect the operational result. Can you shed some light on that?

Cheers and thanks

The Frog
Sergey Babkin
author
Ranch Hand

Joined: Apr 05, 2010
Posts: 50
There are 2 aspects to it:

First, data that doesn't fit into the cache causes cache thrashing, and that certainly causes a slow-down. A few years ago I wrote a simple program that allocates a large amount of memory and then keeps accessing it serially in a loop. On an SMP machine, running it with multiple threads on multiple CPUs didn't help at all: all the threads together made the same amount of progress as one thread did on its own. The memory really is a bottleneck. You can repeat this experiment and see how your machine fares :-) Some NUMA architectures might do better.
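
For anyone who wants to repeat that kind of experiment, here is a minimal Java sketch. It is not the original program: the class name MemoryScan, the default array size (16M longs, about 128 MB), and the thread counts are all assumptions chosen for illustration. Each thread scans its own slice of one large array, and main times the same total amount of work with 1, 2, and 4 threads.

```java
import java.util.Arrays;

// Hypothetical example class: sequential scan of a large array by several threads.
class MemoryScan {

    // Sums data[] using the given number of threads, one contiguous slice each.
    static long scan(long[] data, int threads) {
        long[] partial = new long[threads];
        Thread[] workers = new Thread[threads];
        int chunk = (data.length + threads - 1) / threads; // ceiling division
        for (int t = 0; t < threads; t++) {
            final int id = t;
            final int from = Math.min(data.length, t * chunk);
            final int to = Math.min(data.length, from + chunk);
            workers[t] = new Thread(() -> {
                long sum = 0;
                for (int i = from; i < to; i++) {
                    sum += data[i];
                }
                partial[id] = sum; // a single shared write per thread
            });
            workers[t].start();
        }
        long total = 0;
        for (int t = 0; t < threads; t++) {
            try {
                workers[t].join(); // join makes the thread's write visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
            total += partial[t];
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed default: 16M longs, which needs about 128 MB of heap.
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 1 << 24;
        long[] data = new long[n];
        Arrays.fill(data, 1L);
        for (int threads = 1; threads <= 4; threads *= 2) {
            long start = System.nanoTime();
            long checksum = 0;
            for (int pass = 0; pass < 10; pass++) {
                checksum += scan(data, threads);
            }
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " thread(s): " + ms + " ms, checksum " + checksum);
        }
    }
}
```

If the times barely drop as the thread count goes up, the loop is limited by memory bandwidth rather than by the CPUs. The numbers will differ from machine to machine, and JIT warm-up affects the first passes, so treat a single run as a rough indication only.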

Even overflowing the TLB may be a big issue. I've seen a switch from 4KB to 2MB pages on x86_64/Solaris improve the performance of a large-memory application by some 20%, if I remember correctly.

The second aspect is when multiple processors try to modify data in the same cache line, even if the total amount of memory is small. This causes the cache line to bounce across the system bus between the processors, and even though that is not quite as bad as going to memory, it is still pretty bad. The chapter on locks in my book describes this situation and what can be done to improve it.
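
A small Java sketch of the cache line effect, under stated assumptions: two threads each increment their own counter in an AtomicLongArray, first with the counters in adjacent slots and then 16 slots (128 bytes) apart. The class name CacheLineBouncing is made up, a cache line of 64 bytes is assumed, and the JVM does not guarantee how the array is aligned in memory, so this is an experiment to try rather than a guaranteed demonstration. It is also not the technique from the book chapter, just the simplest spacing trick.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical example class: two counters, either adjacent or spaced apart.
class CacheLineBouncing {

    // Runs two threads that each increment their own counter `iterations` times.
    // `stride` is the distance between the two counters in array slots (>= 1).
    // Returns { counter A, counter B, elapsed nanoseconds }.
    static long[] run(int stride, long iterations) {
        AtomicLongArray counters = new AtomicLongArray(stride + 1);
        Thread a = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                counters.incrementAndGet(0);
            }
        });
        Thread b = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                counters.incrementAndGet(stride);
            }
        });
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        long elapsed = System.nanoTime() - start;
        return new long[] { counters.get(0), counters.get(stride), elapsed };
    }

    public static void main(String[] args) {
        long n = 50_000_000L;
        // Stride 1: both counters very likely share a cache line.
        System.out.println("adjacent: " + run(1, n)[2] / 1_000_000 + " ms");
        // Stride 16: 128 bytes apart, so they should land on different lines.
        System.out.println("spaced:   " + run(16, n)[2] / 1_000_000 + " ms");
    }
}
```

The two threads never touch each other's counter, so any difference between the two timings comes from the hardware moving the cache line back and forth, not from the program's logic.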
Andrew Hogendijk
Greenhorn

Joined: May 19, 2010
Posts: 12
Thank you once again, Sergey. I am beginning to feel that processing tasks in parallel is very much an art, one where the efficiency of the result is directly related to the implementation methodology and the hardware that is used. This leads me to another question: how portable are parallel processing solutions across different platforms? For example, if I develop in Java, the software may be run on any number of platforms and hardware, and, if I understand you correctly, the set-up of the OS and hardware can have a dramatic effect (more horsepower does not necessarily mean more performance in this case). I suppose what I would like to know is whether you believe it is necessary to provide operational specifications for the underlying system, to the extent that software built this way doesn't really work well without them. Or, put another way, is a solution hardware-specific?

Cheers

The Frog
Sergey Babkin
author
Ranch Hand

Joined: Apr 05, 2010
Posts: 50
Any kind of optimization works like this: you try different ideas and see how they work. Quite often they don't work quite the way you expected, and sometimes you learn some amazing things about the libraries you use.

Some general rules of thumb described in the book, like "don't clump many mutexes in a dense array", work on any machine, but how much difference they make depends on the machine. To give any particular performance guarantees you always need to try it on the target hardware and see how it works.
Andrew Hogendijk
Greenhorn

Joined: May 19, 2010
Posts: 12
Thank you, Sergey, I appreciate your time. Looks like I have some solid work and learning ahead of me.

All success with the book

Cheers

The Frog
Sergey Babkin
author
Ranch Hand

Joined: Apr 05, 2010
Posts: 50
Thanks for your interest! :-)
 