This week's book giveaway is in the OCAJP 8 forum. We're giving away four copies of OCA Java SE 8 Programmer I Study Guide and have Edward Finegan & Robert Liguori on-line! See this thread for details.
1. Read huge data(1.1 million records) and write to a flat CSV File.
Implementation Existing: 1. I have DAOImpl class that connects to database and get the records by executing a query which is stored in a List. 2. I iterate thro the list and write it to a flat CSV file.
Issue in Existing System: 1. Since the records i am trying to store in List is huge, the system fails and error out with OutOfMemory.
Proposed Solution I Fetched the records as chunk size and write it to file and it took 24 minutes to write one report of 850000 records to CSV file)
1. I created two methods as follows by storing one chunk of data from resultset into a list and process the list to write to file( chunk size could 1 to 100000) and the system worked fine with writing a huge data(850000 records) into flat CSV file in 24 minutes(CSV output file size is 45MB). here is the method i used to implement this scenario.
Client Class Implementation for Proposed Scenario I:
Question : 1. If i modify my getReportDataChunk method to get the fetchSize of resultSet and then set the rs.fetchSize(100000), the same report is generated in 2:13 Minutes. What is happening in the resultset fetch and also when i did a debug of the resultset while writing it to the file. once the initial fetchsize of 100000 is written to the file still i see the m_rows of resultset implementation holding the previous rows in memory?? Am i wrong in my implementation. If you guys could give me a solution to achieve better performance results that would help me a lot.
[added [code] tags - Jim] [ November 11, 2005: Message edited by: Jim Yingst ]
Originally posted by Jeffrey Hunter: . . . multi-threading to speed up your queries.
Threads are useful for making a program appear to do two things at once, like update a progress bar while processing data. Unless one is on a multiple CPU system, these two threads of execution must share the CPU and due to the overhead of bookeeping and context swapping, will actually run SLOWER than a single threaded app. Typical Database Bottlenecks: 1. Memory = In order to execute a query, a database must load all the records in all the tables in a query into memory. This can cause quite a lot of churn as the contents of physical memory are swapped out to disk to make room for more records. If your server doesn't have enough memory it is wasting time swapping when it can be searching. 2. Other Hardware = A database server is only as fast as its hardware (CPU, disk array and so on). If you aren't using enterprise-class hardware, forget about enterprise-class performance. 3. Network = if the database is on a remote server, the amount of data you can receive is limited to network throughput. If everybody is pummeling your 10mbps network downloading wares, forget about getting those 1 million records anytime soon. 4. Local Database = If the database is on the same machine as your app, you can have the DB and application fighting for resources. This would multiply the effects of the problems above unless you have multiple CPU's and many gigs of memory. Gandhi, what you need to do is figure out where your bottleneck is and address the problem. Our guessing isn't going to help you. The Java Platform Performance book is a good start on how to measure and improve performance but unfortunately it doesn't cover JDBC. [ November 11, 2005: Message edited by: Joe Ess ]
Seems to me that multithreading could help if either the database or the file being written is local to the machine doing the work since the network and local file system requests can run independently. If both processes need network bandwith, it may be another story. Real measurements will be required. Bill
Joined: Apr 16, 2004
I've just implemented a multi-threaded solution to speed up a slow query. Had a query running against time series data, which contained >55 million rows. Our final data set, once retrieved, was about 20MB.
I found that splitting the query up into three threads, each charged with retrieving an interval of the overall time series, cut down the query time by 30%. Haven't done substantial testing on this, but I think with some more tweeks, we can cut it down even more. Of course eventually the more threads used, the less likely you will save time, and it may end up costing you time, but in this case, three threads shaved off about 30%.
As you have already discovered by setting a larger fetchsize, that most of your time was being spent in the network round-trips between the java app and the database (you reduced it from 24 minutes to 2 minutes).
What database are you using and is the resultset type scrollable ?
<a href="http://www.auptyma.com" target="_blank" rel="nofollow">The Peak of Performance</a>