I am working on a critical requirement. I need to develop a batch process that will handle a large amount of data (possibly millions of records). We are thinking of installing multiple JVMs on one server and assigning each a chunk of the data to process. Can anybody help with this?
You might also consider doing the processing in the database (I assume the millions of records come from a database; if not, this is moot of course). This saves the time needed to transfer the huge amount of data over the network, and clever use of SQL can often reduce the amount of work dramatically.
You'd use the stored procedure mechanism of your database. Some databases (well, I know only of Oracle) would even allow you to code stored procedures in Java, therefore allowing you to choose the language you know better.
Database processing might be the answer, but not for every application. Applications that are pure MapReduce will work better if you do the processing on the grid. If you can batch the input data into chunks, distribute them on the grid, and then reduce them to a small set of results, you are better off doing the processing outside of the database. OTOH, if you need to retrieve large amounts of data from the database to do the processing, you are better off doing the processing in the database. It all depends on whether you are CPU/memory bound or network bound. CPU and memory scale up nicely on a grid, so CPU/memory-bound work belongs there. Network bandwidth doesn't scale on the grid, so keep network-bound processing in the database.
GridGain is another one that you can use. We use GigaSpaces, which is quite good but costly. We have used Flux too, but I don't like it.
Bringing up this old post. The OP mentioned using multiple JVMs on a single server for performance.
Here we are looking at using multiple JVMs on a single server. Distributed computing (like Hadoop) is used in clusters, so that tasks can be distributed to different servers/computers and make use of their combined computation power. Here we are talking about applying a distributed computing approach on a single computer (using multiple JVMs). How would using multiple JVMs on a single server result in better performance than multi-threading?
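To make the chunk / distribute / reduce idea concrete, here is a minimal sketch in plain Java. It assumes the records are already in memory as an int array, and it uses a local thread pool as a stand-in for grid nodes. The class name ChunkedSum, the process method, and the doubling "work" are all hypothetical placeholders, not anything from a particular grid product.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkedSum {
    // Hypothetical per-record work; replace with the real processing.
    static long process(int record) {
        return (long) record * 2;
    }

    // Split the records into chunks, process each chunk on a worker
    // thread (the "map" step), then combine the partial results
    // (the "reduce" step).
    static long run(int[] records, int chunkSize, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (int start = 0; start < records.length; start += chunkSize) {
                final int from = start;
                final int to = Math.min(start + chunkSize, records.length);
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int i = from; i < to; i++) {
                        sum += process(records[i]);
                    }
                    return sum;
                };
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // blocks until that chunk is done
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("chunk processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i % 100;
        }
        System.out.println(run(data, 50_000, 4));
    }
}
```

Only the small partial sums travel back to the caller, which is the property that makes this shape a good fit for a grid when the work is CPU bound.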
How would using multiple JVMs on a single server result in better performance than multi-threading?
It is hard to imagine how... possibly if you had a machine with really huge memory - much more than any one JVM could use, and the job required really large data in memory.
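One way to check whether a single JVM is actually the limit is to ask it what it can see. The sketch below uses only the standard Runtime API; the class name JvmCapacity and its helper methods are hypothetical names for illustration.

```java
class JvmCapacity {
    // Number of processors this JVM can schedule threads on;
    // a common starting point for sizing a worker thread pool.
    static int suggestedThreads() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    // Maximum heap this JVM will try to use, in bytes
    // (governed by -Xmx or the JVM's default).
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        System.out.println("threads: " + suggestedThreads());
        System.out.println("max heap MB: " + maxHeapBytes() / (1024 * 1024));
    }
}
```

If the reported heap is far below the machine's physical memory and cannot be raised to fit the job, that is the kind of situation where a second JVM might help; otherwise threads inside one JVM can already use every processor it reports.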
Whenever you are designing something for large data, it's always good to explore the possibility of splitting the processing across multiple threads running in multiple JVMs, because it lets your application scale easily to a grid. If you constrain yourself to a single JVM, you will sooner or later reach a point where the number of threads or the amount of data is limited by the bounds of that JVM. You might not end up implementing a multi-process architecture in your first iteration, but at least thinking about it during the design phase helps you avoid the pitfalls that would make it problematic to move to a multi-process design later.
Sudarshan Devardekar wrote: Apart from exploiting huge memory, is there any other advantage of using multiple JVMs on a single server?
I don't know, but it's possible that on a multi-processor server you might be able to have a JVM run on a dedicated processor. Back in the days of Sun, high-end SPARC-based boxes were very good at this, but I suspect they've been superseded now.
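A rough illustration of keeping that door open: if each unit of work is described only by a range of record IDs, the same worker code can run on a thread today and in a separate JVM later. Everything here is an assumption for the sake of the example: the class name RangeWorker, the idea that records are addressed by contiguous numeric IDs, and the summing placeholder for the real work.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class RangeWorker {
    // Split [0, total) into `parts` contiguous half-open ranges {from, to}.
    static List<long[]> split(long total, int parts) {
        List<long[]> ranges = new ArrayList<>();
        long base = total / parts;
        long extra = total % parts; // the first `extra` ranges get one more ID
        long start = 0;
        for (int i = 0; i < parts; i++) {
            long size = base + (i < extra ? 1 : 0);
            ranges.add(new long[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    // Hypothetical work on one range; depends only on its arguments,
    // so it does not care which thread or process calls it.
    static long processRange(long from, long to) {
        long sum = 0;
        for (long id = from; id < to; id++) {
            sum += id;
        }
        return sum;
    }

    // Command line that would run one range in a child JVM,
    // reusing the current JVM's java binary and classpath.
    static List<String> childCommand(long from, long to) {
        String java = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<>();
        cmd.add(java);
        cmd.add("-cp");
        cmd.add(System.getProperty("java.class.path"));
        cmd.add("RangeWorker");
        cmd.add(Long.toString(from));
        cmd.add(Long.toString(to));
        return cmd;
    }

    // Entry point for the child JVM: java RangeWorker <from> <to>
    public static void main(String[] args) {
        long from = Long.parseLong(args[0]);
        long to = Long.parseLong(args[1]);
        System.out.println(processRange(from, to));
    }
}
```

In a first iteration the ranges from split could simply be handed to threads. Moving to multiple JVMs would then mean passing each childCommand(from, to) to new ProcessBuilder(...).start() and collecting the printed results, without touching processRange.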