File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
The moose likes Performance and the fly likes Handle Huge Data Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login

Win a copy of Head First Android this week in the Android forum!
JavaRanch » Java Forums » Java » Performance
Bookmark "Handle Huge Data" Watch "Handle Huge Data" New topic

Handle Huge Data

Gandhi M Subramanian

Joined: Nov 03, 2005
Posts: 3

1. Read huge data(1.1 million records) and write to a flat CSV File.

Implementation Existing:
1. I have DAOImpl class that connects to database and get the records by executing a query which is stored in a List.
2. I iterate thro the list and write it to a flat CSV file.

Issue in Existing System:
1. Since the records i am trying to store in List is huge, the system fails and error out with OutOfMemory.

Proposed Solution I Fetched the records as chunk size and write it to file and it took 24 minutes to write one report of 850000 records to CSV file)

Code Implementation:

1. I created two methods as follows by storing one chunk of data from resultset into a list and process the list to write to file( chunk size could 1 to 100000) and the system worked fine with writing a huge data(850000 records) into flat CSV file in 24 minutes(CSV output file size is 45MB). here is the method i used to implement this scenario.

Client Class Implementation for Proposed Scenario I:

Question :
1. If i modify my getReportDataChunk method to get the fetchSize of resultSet and then set the rs.fetchSize(100000), the same report is generated in 2:13 Minutes. What is happening in the resultset fetch and also when i did a debug of the resultset while writing it to the file. once the initial fetchsize of 100000 is written to the file still i see the m_rows of resultset implementation holding the previous rows in memory?? Am i wrong in my implementation. If you guys could give me a solution to achieve better performance results that would help me a lot.


[added [code] tags - Jim]
[ November 11, 2005: Message edited by: Jim Yingst ]
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

thank you, for using code-tags; that's much better readable.

(updated, to say 'thank you'.)

[ November 10, 2005: Message edited by: Stefan Wagner ]
[ November 11, 2005: Message edited by: Stefan Wagner ]
Jeffrey Hunter
Ranch Hand

Joined: Apr 16, 2004
Posts: 305
Of course this may be besides the point, but, why would you want to store >1 million records in a flat file?

Also, you might look into multiple, smaller files and multi-threading to speed up your queries.

Also, you might want to look into a CachedRowset implementation in order to free up your database connection while processing the results.
Joe Ess

Joined: Oct 29, 2001
Posts: 9126

Originally posted by Jeffrey Hunter:
. . . multi-threading to speed up your queries.

Threads are useful for making a program appear to do two things at once, like update a progress bar while processing data. Unless one is on a multiple CPU system, these two threads of execution must share the CPU and due to the overhead of bookeeping and context swapping, will actually run SLOWER than a single threaded app.
Typical Database Bottlenecks:
1. Memory = In order to execute a query, a database must load all the records in all the tables in a query into memory. This can cause quite a lot of churn as the contents of physical memory are swapped out to disk to make room for more records. If your server doesn't have enough memory it is wasting time swapping when it can be searching.
2. Other Hardware = A database server is only as fast as its hardware (CPU, disk array and so on). If you aren't using enterprise-class hardware, forget about enterprise-class performance.
3. Network = if the database is on a remote server, the amount of data you can receive is limited to network throughput. If everybody is pummeling your 10mbps network downloading wares, forget about getting those 1 million records anytime soon.
4. Local Database = If the database is on the same machine as your app, you can have the DB and application fighting for resources. This would multiply the effects of the problems above unless you have multiple CPU's and many gigs of memory.
Gandhi, what you need to do is figure out where your bottleneck is and address the problem. Our guessing isn't going to help you. The
Java Platform Performance book is a good start on how to measure and improve performance but unfortunately it doesn't cover JDBC.
[ November 11, 2005: Message edited by: Joe Ess ]

[How To Ask Questions On JavaRanch]
William Brogden
Author and all-around good cowpoke

Joined: Mar 22, 2000
Posts: 12982
Seems to me that multithreading could help if either the database or the file being written is local to the machine doing the work since the network and local file system requests can run independently. If both processes need network bandwith, it may be another story.
Real measurements will be required.
Jeffrey Hunter
Ranch Hand

Joined: Apr 16, 2004
Posts: 305
I've just implemented a multi-threaded solution to speed up a slow query. Had a query running against time series data, which contained >55 million rows. Our final data set, once retrieved, was about 20MB.

I found that splitting the query up into three threads, each charged with retrieving an interval of the overall time series, cut down the query time by 30%. Haven't done substantial testing on this, but I think with some more tweeks, we can cut it down even more. Of course eventually the more threads used, the less likely you will save time, and it may end up costing you time, but in this case, three threads shaved off about 30%.
Virag Saksena
Ranch Hand

Joined: Nov 27, 2005
Posts: 71
As you have already discovered by setting a larger fetchsize, that most of your time was being spent in the network round-trips between the java app and the database (you reduced it from 24 minutes to 2 minutes).

What database are you using and is the resultset type scrollable ?

<a href="" target="_blank" rel="nofollow">The Peak of Performance</a>
Don't get me started about those stupid light bulbs.
subject: Handle Huge Data
It's not a secret anymore!