JavaRanch » Java Forums » Java » Performance
Processing 2 million rows

Lisa Modglin
Ranch Hand

Joined: Oct 28, 2003
Posts: 46
I need some help. I have written an application that processes three text files: I parse the data in each file and insert it line by line into a SQL Server database. Two of the files have 200,000+ lines each, and it takes my application 8 minutes to parse and insert their data. The third file has 2 million rows and takes 70 minutes to parse and insert.

I then have to retrieve the data from the database and format it as XML. That process is going to take an extremely long time. The catch: I need to get all of this done in two hours. Is that possible?

My application uses a BufferedReader to read the text files. I read one line at a time, create an object, and pass the object to a method that inserts its attributes into the database. I took out the object creation and went straight from parsing to inserting, and got only a one-second improvement.

Any suggestions?
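The loop described above (read a line, parse it, insert it) can be sketched like this. The tab-delimited format and the `insert` callback are assumptions, since the post doesn't show the code; the callback stands in for the per-row JDBC insert:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.function.Consumer;

public class LineLoader {
    // Reads the flat file one line at a time and hands each parsed record to
    // `insert`. One INSERT per line (as in the original design) means every
    // statement round-trips to the server and auto-commits, which is the
    // likely bottleneck.
    public static int load(BufferedReader in, Consumer<String[]> insert) throws IOException {
        int rows = 0;
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            insert.accept(line.split("\t")); // tab-delimited is an assumption
            rows++;
        }
        return rows;
    }
}
```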

steve souza
Ranch Hand

Joined: Jun 26, 2002
Posts: 852
You will have to profile your app to see where it is running slow. Post your code and we can look at it.

A couple of suggestions. Batch your inserts so that each row doesn't get its own implicit commit; I haven't used the addBatch method myself, but it may help. You could also begin a transaction, insert, say, 10,000 rows (or some other number), and commit. This often has a big performance impact.
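A minimal sketch of that idea, with the JDBC calls noted in comments so the batching logic itself stays runnable. The batch size and the `flush` callback are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchedInsert {
    // Collects rows into batches of `batchSize` and hands each batch to
    // `flush`. In the real application, `flush` would call
    // PreparedStatement.addBatch() once per row, then executeBatch() and
    // Connection.commit() once per batch (with autoCommit turned off),
    // instead of paying for a round trip and commit on every row.
    public static int loadInBatches(Iterable<String> rows, int batchSize,
                                    Consumer<List<String>> flush) {
        List<String> batch = new ArrayList<>();
        int batches = 0;
        for (String row : rows) {
            batch.add(row);
            if (batch.size() == batchSize) {
                flush.accept(batch);
                batches++;
                batch = new ArrayList<>();
            }
        }
        if (!batch.isEmpty()) { flush.accept(batch); batches++; } // final partial batch
        return batches;
    }
}
```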

You could also consider using your rdbms's bulk load capability.

But your first order of business is to see what is running slow.
Lisa Modglin
Ranch Hand

Joined: Oct 28, 2003
Posts: 46
I saw another post on multithreading. I've never done that, and two responders said it was "easy". Did I mention I have 10 days to get this done, plus another Web application to finish that sits and listens for requests so that I know what data to use when creating the XML file?

What is the quickest reference for figuring out how to quickly turn my app into a multithreaded app?
steve souza
Ranch Hand

Joined: Jun 26, 2002
Posts: 852
If you have only 10 days, it is even more important to figure out where your program is running slow.
fred rosenberger
lowercase baba

Joined: Oct 02, 2003
Posts: 10909

Most folks will tell you that it's generally pointless to look at your code and say "oh, I think THIS is where the bottleneck is, so I'll refactor it". Almost without exception, you're going to guess wrong. Sure, you might see some improvement, but you're almost never going to find the problem by hand (unless you're REALLY experienced at looking for it).

Get a profiler, and use it to figure out where your code is spending all its time. Then figure out what changes will give you the most improvement for the least amount of development time.

I wouldn't consider multithreading until I knew it would make a difference. If 99% of the time is spent crunching some extremely expensive calculations, having a second thread won't really gain you that much time.

Balazs Borbely
Ranch Hand

Joined: Oct 11, 2004
Posts: 33
As was said before, determine which part of the code is slow. You could use a profiler.

A few tips:
1. use batching for the inserts
2. commit after every 1,000 inserts (the number can vary)
3. use threads, at least two:
- one thread reads the file, processes the data and puts it into a queue
- the second thread reads from the queue and executes the inserts into the db (the standard producer-consumer pattern)
4. use as few indexes as possible on that table, because every insert has to update each index
5. try to generate the XML using the db's XML features (I don't know about MSSQL, but Oracle has a rich XML feature set)
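The producer-consumer split in tip 3 can be sketched with a BlockingQueue. This is an assumption about how the threads would be wired up, not code from the thread; the "insert" here just appends to a list, where a real consumer would run the batched JDBC inserts:

```java
import java.util.*;
import java.util.concurrent.*;

public class ProducerConsumerLoad {
    private static final String POISON = "\u0000EOF"; // sentinel marking end of input

    // Producer "parses" lines and queues them; the consumer drains the queue
    // and performs the inserts (simulated here by appending to a list).
    public static List<String> run(List<String> lines) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        List<String> inserted = Collections.synchronizedList(new ArrayList<>());

        Thread producer = new Thread(() -> {
            try {
                for (String line : lines) queue.put(line.trim()); // stand-in for parsing
                queue.put(POISON);                                // signal end of input
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (String row = queue.take(); !row.equals(POISON); row = queue.take()) {
                    inserted.add(row); // real code would addBatch()/executeBatch() here
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        producer.start(); consumer.start();
        producer.join(); consumer.join();
        return inserted;
    }
}
```

The bounded queue also throttles the reader if the database falls behind, so memory stays flat even on the 2-million-row file.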

Scott Johnson
Ranch Hand

Joined: Aug 24, 2005
Posts: 518
If you are not processing the data, but simply inserting it into the database, you might consider using one of SQL Server's import tools like BULK INSERT or bcp.

If you're interested in speed, it will be hard to beat the speed of these utilities.

Even if you are processing the data, you could write another file containing the altered data and import that.
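For reference, a BULK INSERT statement issued from Java might look like the sketch below. The table name, file path, and WITH options are illustrative (check the SQL Server documentation for the options your file format needs); only the string building is shown here, since executing it needs a live connection:

```java
public class BulkLoad {
    // Builds a SQL Server BULK INSERT statement for a tab-delimited flat file.
    // In the application it would be executed over JDBC with
    // Statement.execute(bulkInsertSql(...)); here we only build the string.
    public static String bulkInsertSql(String table, String path) {
        return "BULK INSERT " + table
             + " FROM '" + path + "'"
             + " WITH (FIELDTERMINATOR = '\\t', ROWTERMINATOR = '\\n', BATCHSIZE = 10000)";
    }
}
```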
Raghubir Bose
Ranch Hand

Joined: Feb 18, 2006
Posts: 88
Adding to the last response, you can look at this link.

It has an example of how to do it. You can have a batch file that calls this, and, just to add the Java flavour, write a Java program that calls the batch file.

John Dunn
Ranch Hand

Joined: Jan 30, 2003
Posts: 1108
Use a buffer to capture your timing log and write it out at the end of a cycle. That way you can see where time is being spent. New bottlenecks will always arise, so it is handy to have a simple, relatively low-cost way of tracking time; you can then periodically check your logs to see the typical timings.

For example, allocate a StringBuffer of, say, 4000 characters, and append some text and the elapsed time at various set points. Write the total elapsed time at the end. You will be able to see where the trouble spots are and what code you can ignore.
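That idea, a preallocated buffer that collects labelled elapsed times and is written out once, might look like this. Class and label names are made up for the example:

```java
public class PhaseTimer {
    private final StringBuilder log = new StringBuilder(4000); // preallocated buffer
    private final long start = System.nanoTime();
    private long last = start;

    // Records the time elapsed since the previous checkpoint under `label`.
    public void checkpoint(String label) {
        long now = System.nanoTime();
        log.append(label).append(": ").append((now - last) / 1000000).append(" ms\n");
        last = now;
    }

    // Appends the total and returns the whole log, to be written once per cycle.
    public String report() {
        long totalMs = (System.nanoTime() - start) / 1000000;
        return log.toString() + "total: " + totalMs + " ms\n";
    }
}
```

Usage: call `checkpoint("parse")`, `checkpoint("insert")`, etc. at the set points, then log `report()` at the end of each cycle.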

dema rogatkin
Ranch Hand

Joined: Oct 09, 2002
Posts: 294
Wow. Your code is incredibly fast. Inserting 2M records takes me 5 hours; I do batch inserts with a commit after every 1000 rows, which is the optimum value for memory and speed in my case. Anyway, your real problem is retrieving the data back and generating the XML. Maybe you can generate the XML at insertion time and avoid reading the database back at all. In that case you will certainly fit within 2 hours.
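Generating the XML in the same pass as the insert could look like this sketch: one element written per parsed line, so no second read-back query is needed. The element names and the two-field record layout are assumptions:

```java
import java.io.IOException;
import java.io.Writer;

public class XmlWhileLoading {
    // Emits one <row> element per parsed record as it is read, so the XML
    // file is produced during the load instead of by querying the database
    // back afterwards.
    public static void writeRow(Writer out, String[] fields) throws IOException {
        out.write("<row><id>" + escape(fields[0]) + "</id><name>"
                  + escape(fields[1]) + "</name></row>\n");
    }

    // Minimal escaping for text content; attribute values would also need quotes.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```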

Lisa Modglin
Ranch Hand

Joined: Oct 28, 2003
Posts: 46

It's been a couple of months and I can breathe now. Thanks for all the suggestions for solving my problem.

We went with a pure SQL Server solution, so I know it's not Java related, but...

We import the data using DTS packages (three of them, actually), then use SQL Server's bcp to export XML produced by FOR XML EXPLICIT queries. There were still some roadblocks, but we ended up importing the flat files and exporting the XML in 45 minutes. Really, really fast considering one of the tables has 2 million rows. Some of that time is deliberate wait time so the steps don't get in each other's way.

I'm still using my Java code, which is slower, for retrieving the information for one person and for sending course information, which is medium-sized.