Processing huge data volume

nitinram agarwal
Ranch Hand

Joined: Jan 29, 2009
Posts: 83
I am working on an interface that does DB interaction between two systems. As part of my work, I am supposed to query the source DB, invoke a procedure, get the data back in a reference cursor, and populate the destination DB. Since the data volume can be huge, I am using multi-threading on the destination DB to invoke the procedure. For example, if a total of 1 million entries is to be loaded, the procedure on the destination DB is invoked, say, 10 times with 100K records each. This arrangement works fine except when the data volume at the source DB is very large (for example, more than 2 million entries). I have set around 20 GB of heap space for processing the records, but my program is failing with a heap memory error. I want to know if there is a way to query the data from the source DB in parallel mode: for example, assuming a total of 2 million records is returned by the source stored procedure, my program should first fetch a subset of those records, then move on to the next subset, and so on. One solution I have proposed is to have the DB side send the records in this manner, but I want to know if there is a better alternative. Please suggest.
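
[Editor's note: a minimal sketch of the "fetch a subset, then move on" loop described above. It assumes the source can also be read with a plain keyed query; the table and column names are made up, and a ref-cursor-only source would need the chunking done inside the procedure instead.]

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ChunkedFetch {

    private static final int CHUNK_SIZE = 100_000;

    public static void main(String[] args) throws Exception {
        try (Connection src = DriverManager.getConnection(
                "jdbc:oracle:thin:@//source-host:1521/SRC", "user", "pass")) {
            long lastId = 0L;
            boolean more = true;
            while (more) {
                // Keyset pagination: ask only for rows strictly after the last
                // key we saw, so each round trip carries at most CHUNK_SIZE rows.
                try (PreparedStatement ps = src.prepareStatement(
                        "SELECT id, payload FROM source_table "
                        + "WHERE id > ? ORDER BY id "
                        + "FETCH FIRST " + CHUNK_SIZE + " ROWS ONLY")) {
                    ps.setLong(1, lastId);
                    int fetched = 0;
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastId = rs.getLong("id");
                            fetched++;
                            // hand the row off to the destination loader here
                        }
                    }
                    more = (fetched == CHUNK_SIZE); // short chunk means we're done
                }
            }
        }
    }
}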
Deepak Bala
Bartender

Joined: Feb 24, 2006
Posts: 6662
    

Can you provide some more context for this question?

  • Is the data move part of an ETL job?
  • Where was the multi-threading achieved? Through multiple parallel JDBC calls to a procedure?
  • Where was the 20 GB of heap allocated? In a Java program?


nitinram agarwal
Ranch Hand

Joined: Jan 29, 2009
Posts: 83

Is the data move part of an ETL job?
>> Yes, but the current system does not use any ETL tool.
Where was the multi-threading achieved? Through multiple parallel JDBC calls to a procedure?
>> Yes.
Where was the 20 GB of heap allocated? In a Java program?
>> Yes.
Deepak Bala
Bartender

Joined: Feb 24, 2006
Posts: 6662
        

If this is part of an ETL job, you are much better off using an ETL tool. The tool will have already solved many problems that you face now, and ones that you have not even begun to consider. There are open source / free ETL tools out there. Use one of them to extract the data instead of a custom Java program.

I've used ETL tools before to load millions of rows into tables, and I've never once worried about RAM.
nitinram agarwal
Ranch Hand

Joined: Jan 29, 2009
Posts: 83
Hello Bala,
Thanks for your response. Unfortunately the client does not have the budget for ETL products, so for the moment we have to continue with the Java-based program.
Deepak Bala
Bartender

Joined: Feb 24, 2006
Posts: 6662
        

nitinram agarwal wrote:Hello Bala,
Thanks for your response. Unfortunately the client does not have the budget for ETL products, so for the moment we have to continue with the Java-based program.


Some of these ETL products are open source. You can use them for free and pay for support when you need to. If you want to stick with the Java program and not run out of memory, I'd recommend limiting the threads through a thread pool, and keeping an eye on memory (and what consumes the most of it) through a profiler. Running out of 20 GB of heap space is abnormal even for a program that loads tons of rows into a database.
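
[Editor's note: a minimal sketch of the thread-pool idea. The chunk count is made up, and loadChunk stands in for whatever invokes the destination procedure.]

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedLoad {

    public static void main(String[] args) throws InterruptedException {
        // A fixed-size pool caps how many chunk loads run at once; extra
        // tasks wait in the queue instead of all competing for heap together.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int chunk = 0; chunk < 20; chunk++) {
            final int chunkNo = chunk;
            pool.submit(() -> loadChunk(chunkNo));
        }
        pool.shutdown();                          // accept no new tasks
        pool.awaitTermination(2, TimeUnit.HOURS); // wait for the queue to drain
    }

    private static void loadChunk(int chunkNo) {
        // invoke the destination procedure for this chunk here
        System.out.println("loading chunk " + chunkNo);
    }
}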
nitinram agarwal
Ranch Hand

Joined: Jan 29, 2009
Posts: 83
The issue of running out of memory happens when multiple jobs of a similar nature are invoked at the same time (more than 10). I have profiled my code to ensure that there is no memory leak. On my desktop I have a heap size of 1 GB, and some of these jobs fail with a heap error. The jobs cannot be executed one after another, as some of the processing is SLA-bound. I agree that the process might need a relook, but I was trying to see if something is technically possible for fetching the data in parallel, and also for some control over invoking the multiple jobs.
Martin Vajsar
Sheriff

Joined: Aug 22, 2010
Posts: 3610
        

Why do you keep the records in memory? If the process just copies records from one connection to another (possibly manipulating some records on the way), there should be no need to keep them in RAM. Such a process should need next to nothing in memory terms, even if it uses several threads at once.

(Sometimes the JDBC driver does some caching. Make sure you're using forward-only, read-only resultsets.)
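
[Editor's note: a minimal sketch of the forward-only, read-only streaming read suggested above. Table and column names are made up, and exact streaming behaviour varies by JDBC driver.]

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingRead {

    static void stream(Connection src) throws Exception {
        // Forward-only, read-only statements let the driver stream rows in
        // small fetch buffers instead of caching the whole result in the heap.
        try (Statement stmt = src.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(1_000); // at most ~1000 rows buffered at a time
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT id, payload FROM source_table")) {
                while (rs.next()) {
                    // write each row to the destination as you go;
                    // nothing accumulates on the Java side
                    handOff(rs.getLong(1), rs.getString(2));
                }
            }
        }
    }

    private static void handOff(long id, String payload) {
        // placeholder for the insert into the destination DB
    }
}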
ejaz khan
Greenhorn

Joined: Apr 08, 2013
Posts: 5
  • Why can't you use a batch size of less than 10K? (A bounded-batch insert is sketched after this list.)
  • Is it a 32-bit or 64-bit system? On a 32-bit JVM you will not be able to allocate that much memory, and you risk swap-space errors.
  • Is it on Unix or Windows?
  • Stack size tuning can help resolve memory issues.
  • Can you implement a check on the source DB so that it provides the data in smaller chunks?
  • Is your code scalable, so that you can run multiple Java processes?
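
[Editor's note: on the batch-size point, a sketch of a bounded-batch insert with made-up table and column names. Flushing and committing every thousand rows keeps the driver's statement buffer small and steady.]

import java.sql.Connection;
import java.sql.PreparedStatement;

public class BatchedInsert {

    /** Minimal row holder for the sketch. */
    static class Row {
        final long id;
        final String payload;
        Row(long id, String payload) { this.id = id; this.payload = payload; }
    }

    static void insertAll(Connection dest, Iterable<Row> rows) throws Exception {
        dest.setAutoCommit(false);
        try (PreparedStatement ps = dest.prepareStatement(
                "INSERT INTO target_table (id, payload) VALUES (?, ?)")) {
            int pending = 0;
            for (Row row : rows) {
                ps.setLong(1, row.id);
                ps.setString(2, row.payload);
                ps.addBatch();
                if (++pending == 1_000) { // flush well below the 10K mark
                    ps.executeBatch();
                    dest.commit();
                    pending = 0;
                }
            }
            if (pending > 0) {            // flush the tail
                ps.executeBatch();
                dest.commit();
            }
        }
    }
}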
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8008
        

nitinram agarwal wrote:Where was the 20 GB of heap allocated? In a Java program?
>> Yes

Then, unless the units of data you're processing are huge, it sounds to me like you do have a memory leak. Even allowing for some overhead, running out of 20 GB of memory on 2 million entries suggests that each one is using around 10 KB of space, and I've never seen a table with rows that big (and I've worked on some pretty big databases).

How big is the database itself?

Winston

     