
Processing huge data volume

 
nitinram agarwal
Ranch Hand
I am working on an interface which does DB interaction for a system. As part of my work, I am supposed to query the source DB, invoke a procedure, get the data back in a reference cursor, and populate the destination DB. As the data volume can be huge, I am using multi-threading on the destination DB side to invoke the procedure. For example, if the total number of entries to be loaded is 1 million, then on the destination DB the procedure is invoked, say, 10 times with 100K records each.

This arrangement is working fine except when the data volume at the source DB is huge (for example, more than 2 million entries). I have set around 20 GB of heap space for processing the records, but my program is failing with a heap memory error. I want to know if there is a way to query the data from the source DB in parallel (for example, assuming a total of 2 million records is fetched from the source stored procedure, my program should first fetch a subset of these records and then move on to the next subset, or something like that). One of the solutions I have proposed is to have the DB side send the records in this manner, but I want to know if there is a better alternative. Please suggest.
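Something like this rough sketch is what I have in mind for fetching subsets. SOURCE_RECORDS, its columns, and the FETCH FIRST syntax are just placeholders here, since the real system goes through a stored procedure and a reference cursor:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ChunkedSourceReader {

    private static final int CHUNK_SIZE = 100_000;

    // Read the source rows one chunk at a time instead of pulling all 2 million at once.
    // Keyset pagination: remember the last ID seen and ask only for the next CHUNK_SIZE rows.
    public static void readInChunks(Connection source) throws SQLException {
        String sql = "SELECT id, payload FROM source_records "
                   + "WHERE id > ? ORDER BY id FETCH FIRST " + CHUNK_SIZE + " ROWS ONLY";
        try (PreparedStatement ps = source.prepareStatement(
                sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            ps.setFetchSize(1000);           // stream rows instead of buffering a whole chunk
            long lastId = 0;
            boolean more = true;
            while (more) {
                ps.setLong(1, lastId);
                more = false;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        lastId = rs.getLong("id");
                        // hand this row (or the accumulated chunk) to the destination loader here
                        more = true;
                    }
                }
            }
        }
    }
}

(FETCH FIRST n ROWS ONLY is the SQL:2008 syntax; on older Oracle versions the same idea would need ROWNUM instead.)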
 
Deepak Bala
Bartender
Can you provide some more context for this question?

  • Is the data move part of an ETL job?
  • Where was the multi-threading achieved? Through multiple parallel JDBC calls to a procedure?
  • Where was the 20GB of heap allocated? In a Java program?
nitinram agarwal
Ranch Hand

Is the data move part of an ETL job?
>> Yes, but the current system does not use any ETL tool.
Where was the multi-threading achieved? Through multiple parallel JDBC calls to a procedure?
>> Yes.
Where was the 20GB of heap allocated? In a Java program?
>> Yes.
     
Deepak Bala
Bartender
If this is part of an ETL job, you are much better off using an ETL tool. The tool will already have solved many of the problems you face now, and ones that you have not even begun to consider. There are open source / free ETL tools out there. Use one of them to extract the data instead of a custom Java program.

I've used ETL tools before to load millions of rows into tables, and I've never once worried about RAM.
     
nitinram agarwal
Ranch Hand
Hello Bala,
Thanks for your response. Unfortunately the client does not have the budget for ETL products, so for the moment we have to continue with the Java-based program.
     
Deepak Bala
Bartender
nitinram agarwal wrote:
Hello Bala,
Thanks for your response. Unfortunately the client does not have the budget for ETL products, so for the moment we have to continue with the Java-based program.

Some of these ETL products are open source. You can use them for free and pay for support when you need to. If you want to stick with the Java program and not run out of memory, I'd recommend limiting the threads through a thread pool and keeping an eye, with a profiler, on memory and on what consumes the most of it. Running out of 20 GB of heap space is abnormal even for a program that loads tons of rows into a database.
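To illustrate the thread pool idea, here is a rough sketch. The LOAD_CHUNK procedure name and the ID-range parameters are invented; the point is only that a fixed-size pool caps how many loads run at once, no matter how many jobs submit work:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import javax.sql.DataSource;

public class BoundedChunkLoader {

    // At most 4 chunks load concurrently, regardless of how many are submitted.
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final DataSource destination;

    public BoundedChunkLoader(DataSource destination) {
        this.destination = destination;
    }

    // Ask the destination to load one ID range; LOAD_CHUNK is a placeholder name.
    public void submitChunk(long fromId, long toId) {
        pool.submit(() -> {
            try (Connection con = destination.getConnection();
                 CallableStatement cs = con.prepareCall("{call LOAD_CHUNK(?, ?)}")) {
                cs.setLong(1, fromId);
                cs.setLong(2, toId);
                cs.execute();
            } catch (Exception e) {
                e.printStackTrace();   // real code should log this and fail/retry the job
            }
        });
    }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

Note that newFixedThreadPool queues extra tasks on an unbounded queue, so if the reading side produces chunks faster than the destination can load them, you may also want a bounded queue or some other form of backpressure.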
     
nitinram agarwal
Ranch Hand
The issue of running out of memory happens when multiple jobs of a similar nature are invoked at the same time (more than 10). I have profiled my code to make sure the memory leaks are fixed. On my desktop I have a heap size of 1 GB, and some of these jobs fail with a heap error. The jobs cannot be executed one after another because some of the processing is SLA-bound. I agree that the process might need a relook, but I was trying to see if something is possible technically for fetching the data in parallel, along with some control over how the multiple jobs are invoked.
     
Martin Vajsar
Sheriff
Why do you keep the records in memory? If the process just copies records from one connection to another (possibly manipulating some of them on the way), there should be no need to keep them in RAM. Such a process should need next to nothing in memory terms, even if it uses several threads at once.

(Sometimes the JDBC driver does some caching. Make sure you're using forward-only, read-only resultsets.)
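To make that concrete, a bare-bones sketch of a streaming copy. The table and column names are invented; the essentials are the forward-only, read-only result set, a modest fetch size, and flushing inserts in small batches so nothing accumulates on the heap:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamingCopy {

    // Copy rows from source to destination without ever materializing them all in memory.
    public static void copy(Connection src, Connection dst) throws SQLException {
        dst.setAutoCommit(false);
        try (Statement read = src.createStatement(
                 ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
             PreparedStatement write = dst.prepareStatement(
                 "INSERT INTO target_records (id, payload) VALUES (?, ?)")) {

            read.setFetchSize(1000);   // hint: fetch ~1000 rows per round trip (exact behaviour is driver-specific)

            try (ResultSet rs = read.executeQuery(
                     "SELECT id, payload FROM source_records")) {
                int pending = 0;
                while (rs.next()) {
                    write.setLong(1, rs.getLong("id"));
                    write.setString(2, rs.getString("payload"));
                    write.addBatch();
                    if (++pending == 1000) {   // flush small batches; never hold 100K rows at once
                        write.executeBatch();
                        dst.commit();
                        pending = 0;
                    }
                }
                if (pending > 0) {
                    write.executeBatch();
                    dst.commit();
                }
            }
        }
    }
}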
     
ejaz khan
Greenhorn
  • Why can't you use a batch size of less than 10K?
  • Is it a 32-bit or 64-bit system? On 32-bit you will not be able to allocate that much memory without running into swap space errors (a quick check is sketched below).
  • Is it on Unix or Windows?
  • Stack size tuning can help resolve memory issues.
  • Can you implement something on the source DB so that it provides the data in smaller chunks?
  • Is your code scalable enough that you can run multiple Java processes?
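For the 32-bit/64-bit and heap questions, something like this shows what the JVM actually got (sun.arch.data.model is HotSpot-specific, so treat it only as a hint):

public class JvmCheck {
    public static void main(String[] args) {
        // Maximum heap the JVM will actually use (what -Xmx resolved to).
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxHeapMb + " MB");

        // 32 vs 64 bit: os.arch is standard; sun.arch.data.model is vendor-specific.
        System.out.println("os.arch: " + System.getProperty("os.arch"));
        System.out.println("JVM data model: " + System.getProperty("sun.arch.data.model", "unknown"));
    }
}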
     
Winston Gutkowski
Bartender
nitinram agarwal wrote:
Where was the 20GB of heap allocated? In a Java program?
>> yes

Then, unless the units of data you're processing are huge, it sounds to me like you do have a memory leak. Even allowing for some overhead, running out of 20 GB of memory on 2 million entries suggests that each one is using around 10 KB of space, and I've never seen a table with rows that big (and I've worked on some pretty big databases).

How big is the database itself?

Winston
     