Processing 2 million rows

 
Ranch Hand
Posts: 46
I need some help. I have written an application that processes three text files. I parse the data in each text file and insert it line by line into a SQL Server database. Two of the files have 200,000+ lines each, and it is taking my application 8 minutes to parse and insert their data. The third text file has 2 million rows and is taking 70 minutes to parse and insert.

I then have to retrieve the data from the database and format it as XML. That process is going to take an extremely long time. The catch: I need to get all of this done in two hours. Is it possible?

My application uses a BufferedReader to read the text files. I read one line at a time, create an object, and pass the object to a method that inserts its attributes into the database. I took out the creation of the object and went straight from parsing to inserting, and got only a one-second improvement.
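A minimal sketch of the read-and-parse half of that loop, assuming a tab-delimited format (the class and field layout here are made up for illustration; the insert call would go where each parsed row comes out):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class LineParser {
    // Read one line at a time and split it into fields; each String[] is one row.
    static List<String[]> parse(Reader source) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                rows.add(line.split("\t", -1)); // -1 keeps trailing empty fields
            }
        }
        return rows;
    }
}
```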

Any suggestions?

Lisa
 
Ranch Hand
Posts: 862
You will have to profile your app to see where it is running slow. Post your code and we can look at it.

A couple of suggestions. Batch your inserts so each row doesn't get an implicit commit. I haven't used the addBatch method, but it may help. You could also begin a transaction, insert, say, 10,000 rows (or some other number), and commit. This can often have a big performance impact.

You could also consider using your RDBMS's bulk load capability.

But your first order of business is to see what is running slow.
 
Lisa Modglin
Ranch Hand
Posts: 46
I saw another post on multithreading. I've never done that. Two responders said it was "easy". Did I mention I have 10 days to get this done, plus another Web application to complete that sits and listens for requests so that I know what data to use when creating the XML file?

What is the quickest reference for figuring out how to quickly turn my app into a multithreaded app?
 
steve souza
Ranch Hand
Posts: 862
If you have only 10 days, it is even more important to figure out where your program is running slow.
 
lowercase baba
Posts: 13089
Most folks will tell you that it's generally pointless to look at your code and say "oh, i think THIS is where the bottleneck is, so i'll refactor it". almost without exception, you're going to guess wrong. sure, you might see some improvement, but you're almost never going to find the problem by hand (unless you're REALLY experienced at looking for it).

get a profiler, and use it to figure out where your code is spending all its time. figure out what changes will give you the most improvement with the least amount of development time.

I wouldn't consider multi-threading until i knew it would make a difference. if 99% of the time is spent crunching some extremely expensive calculations, having a second thread won't really gain you that much time.
 
Ranch Hand
Posts: 33
As was said before, determine what part of the code is slow. You could use a profiler.

A few tips.
1. Use batching for the inserts.
2. Commit after every 1,000 inserts (the number can vary).
3. Use threads, at least two:
- one thread reads the file, processes the info, and puts it into a queue
- the second thread reads from the queue and executes the inserts into the db (the standard producer-consumer pattern)
4. Use as few indexes as possible on that table, because the indexes are updated on every insert.
5. Try to generate the XML using the db's XML features (I don't know about MSSQL, but Oracle has a rich XML feature set).
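Tip 3, the producer-consumer pattern, can be sketched with a `BlockingQueue`. In this toy version the consumer just collects into a list where the real app would run the batched JDBC inserts, and `toUpperCase()` stands in for the per-line processing:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineDemo {
    static final String POISON = "\u0000EOF"; // sentinel telling the consumer to stop

    static List<String> run(List<String> lines) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024); // bounded: reader can't outrun inserter
        List<String> inserted = new ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                for (String line : lines) queue.put(line.toUpperCase()); // "process" then enqueue
                queue.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread consumer = new Thread(() -> {
            try {
                String item;
                while (!(item = queue.take()).equals(POISON)) inserted.add(item); // real app: insert into db
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        producer.start(); consumer.start();
        producer.join(); consumer.join();
        return inserted;
    }
}
```

The bounded queue also acts as backpressure: if the database is the slow side, the file reader blocks instead of filling memory.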
 
Ranch Hand
Posts: 518
If you are not processing the data, but simply inserting it into the database, you might consider using one of SQL Server's import tools, like BULK INSERT or bcp.

If you're interested in speed, it will be hard to beat the speed of these utilities.

Even if you are processing the data, you could write another file containing the altered data and import that.
 
Ranch Hand
Posts: 88
Adding to the last response, you can look at this link:

http://www.15seconds.com/issue/011106.htm

It has an example of how to do it. You could have a batch file that calls this, and, to add the Java flavour, write a Java program that calls the batch file.
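Launching bcp (or a batch file wrapping it) from Java is a `ProcessBuilder` call. The database, table, file, and server names below are placeholders; `-T` requests a trusted connection and `-c` character data mode:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class BcpLauncher {
    // Assemble the bcp command line as a list of arguments.
    static List<String> bcpCommand(String table, String dataFile, String server) {
        return Arrays.asList("bcp", table, "in", dataFile,
                "-S", server,  // target server
                "-T",          // trusted (Windows) authentication
                "-c");         // character data mode
    }

    // Launch the command and wait for it; stdout/stderr go to the console.
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```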
 
slicker
Posts: 1108
Use a buffer to capture your logging and write it at the end of a cycle. That way you can see where time is being spent. New bottlenecks will always arise, so it is handy to have a simple and relatively low-cost way of tracking time. You can then periodically check your logs to see the typical timings.

For example, allocate a StringBuffer of, let's say, 4000 characters, and write some text and the elapsed time at various set points. Write the total elapsed time at the end. You will be able to see where the trouble spots are and what code you can ignore.
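A small sketch of that idea (a single-threaded StringBuilder here; the checkpoint labels are whatever set points you choose):

```java
public class TimingLog {
    private final StringBuilder buf = new StringBuilder(4000); // one buffer per cycle
    private final long start = System.nanoTime();
    private long last = start;

    // Record elapsed time since the previous checkpoint under a label.
    void checkpoint(String label) {
        long now = System.nanoTime();
        buf.append(label).append(": ").append((now - last) / 1_000_000).append(" ms\n");
        last = now;
    }

    // Append the total elapsed time and hand back the whole report in one write.
    String report() {
        buf.append("total: ").append((System.nanoTime() - start) / 1_000_000).append(" ms\n");
        return buf.toString();
    }
}
```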
 
Ranch Hand
Posts: 294
Wow. Your code is incredibly fast. It takes me 5 hours to insert 2M records. I do batch inserts with a commit after every 1,000; that's the optimum value for memory and speed. Anyway, your real problem is retrieving the data back and generating the XML. Maybe you can generate the XML at insertion time and not read the database back at all. In that case you will certainly fit in 2 hours.
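Generating the XML at insertion time could be as simple as streaming an element per record with the standard `javax.xml.stream` API while the same record is handed to the database layer. The element and attribute names here are invented for the sketch:

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class XmlWhileLoading {
    // Emit one <row> element per record as it is processed; in the real app
    // this loop body would also run the insert for the same record.
    static String toXml(List<String[]> rows) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newFactory().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("rows");
        for (String[] row : rows) {
            xml.writeStartElement("row");
            xml.writeAttribute("id", row[0]);
            xml.writeCharacters(row[1]); // escaping handled by the writer
            xml.writeEndElement();
        }
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        return out.toString();
    }
}
```

For 2 million rows you would stream to a `FileWriter` rather than a `StringWriter`, so nothing is held in memory.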
 
Lisa Modglin
Ranch Hand
Posts: 46
FOLLOW UP:

It's been a couple of months and I can breathe now. Thanks for all the suggestions for solving my problem.

We ended up with a completely SQL Server-based solution, so I know it's not Java related, but...

We import the data using a DTS package (actually three of them), then we use SQL Server's bcp to create XML using FOR XML EXPLICIT queries. There were still some roadblocks, but we ended up importing from the flat files and exporting the XML in 45 minutes. Really, really fast considering there are 2 million rows in one of the tables. Some of the time in the process is wait time, so the steps don't get in each other's way.

I'm still using my Java code, which is slower, for retrieving the information for one person and for sending course information, which is medium-sized.
 