I need some help. I have written an application that processes three text files. I parse the data in each file and insert it line by line into a SQL Server database. Two of the files have 200,000+ lines each, and it takes my application 8 minutes to parse and insert their data. The third file has 2 million rows and takes 70 minutes to parse and insert.
I then have to retrieve the data from the database and format it into XML, and that process is going to take an extremely long time too. The catch: I need to get all of this done in two hours. Is it possible?
My application uses a BufferedReader to read the text files. I read one line at a time, create an object, and pass the object to a method that inserts its attributes into the database. I took out the object creation and went straight from parsing to inserting, and got only a one-second improvement.
You will have to profile your app to see where it is running slow. Post your code and we can look at it.
A couple of suggestions. Batch your inserts so that each row doesn't get its own implicit commit. I haven't used the addBatch method myself, but it may help. You could also begin a transaction, insert say 10,000 rows (or some other number), and commit. This often has a big performance impact.
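Something like this, as an untested sketch (the table and column names here are made up; the row-splitting into a small pure helper is just so the batch boundary logic is easy to see):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchedLoader {
    // Flush and commit every BATCH_SIZE rows instead of once per insert.
    static final int BATCH_SIZE = 10_000;

    // True when we've just filled another batch (pure helper, shown for clarity).
    static boolean atBatchBoundary(int rowsSoFar) {
        return rowsSoFar > 0 && rowsSoFar % BATCH_SIZE == 0;
    }

    // Hypothetical table and columns; adjust the SQL to your own schema.
    static void load(Connection conn, Iterable<String[]> rows) throws SQLException {
        conn.setAutoCommit(false);  // one commit per batch, not per row
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO my_table (col_a, col_b) VALUES (?, ?)")) {
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();              // queue the row instead of executing it
                count++;
                if (atBatchBoundary(count)) {
                    ps.executeBatch();      // one round trip for the whole batch
                    conn.commit();
                }
            }
            ps.executeBatch();              // flush whatever is left in the tail
            conn.commit();
        }
    }
}
```

Tune BATCH_SIZE to your memory and your driver; 1,000 to 10,000 is a common range.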
You could also consider using your RDBMS's bulk-load capability.
But your first order of business is to see what is running slow.
I saw another post on multithreading. I've never done that, though two responders said it was "easy". Did I mention I have 10 days to get this done, plus another web application to complete that sits and listens for requests so that I know what data to use when creating the XML file?
What is the quickest reference for figuring out how to quickly turn my app into a multithreaded app?
If you have only 10 days, it is even more important to figure out where your program is running slow.
Most folks will tell you that it's generally pointless to look at your code and say "oh, I think THIS is where the bottleneck is, so I'll refactor it". Almost without exception, you're going to guess wrong. Sure, you might see some improvement, but you're almost never going to find the problem by hand (unless you're REALLY experienced at looking for it).
Get a profiler, and use it to figure out where your code is spending all its time. Then figure out what changes will give you the most improvement for the least amount of development time.
I wouldn't consider multithreading until I knew it would make a difference. If 99% of the time is spent crunching some extremely expensive calculations, having a second thread won't really gain you that much time.
There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
As it was said before, determine what part of the code is slow. You could use a profiler.
A few tips:
1. use batches for your inserts
2. commit after every 1000 inserts (the number can vary)
3. use threads, at least two: one thread reads the file, processes the info, and puts it into a queue; the second thread reads from the queue and executes the inserts into the db (the standard producer-consumer pattern)
4. use as few indexes as possible on that table, because every insert has to update the indexes
5. try to generate the XML using the db's own XML features (I don't know about MSSQL, but Oracle has a rich XML feature set)
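The producer-consumer idea in tip 3 can be sketched with a BlockingQueue. This is a self-contained toy version: the "insert" is simulated by a counter so it runs anywhere, but in the real app the consumer would do the batched JDBC inserts.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class ProducerConsumerSketch {
    private static final String POISON = "__END__";  // sentinel telling the consumer to stop

    // One thread feeds parsed lines into a bounded queue; a second drains it.
    static int run(Iterable<String> lines, int queueCapacity) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicInteger inserted = new AtomicInteger();

        Thread producer = new Thread(() -> {
            try {
                for (String line : lines) queue.put(line);  // blocks when the queue is full
                queue.put(POISON);                           // signal end of input
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String line = queue.take();              // blocks when the queue is empty
                    if (line.equals(POISON)) break;
                    inserted.incrementAndGet();              // stand-in for the DB insert
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        producer.start(); consumer.start();
        producer.join(); consumer.join();
        return inserted.get();
    }
}
```

The bounded queue matters: it keeps the file reader from racing ahead and filling memory when the database is the slow side.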
'Make everything as simple as possible, but not simpler.' --Albert Einstein
Use a buffer to capture your timing log and write it out at the end of a cycle. That way you can see where the time is being spent. New bottlenecks will always arise, so it is handy to have a simple and relatively low-cost way of tracking time; you can then periodically check your logs to see the typical timings.
For example, allocate a StringBuffer of, let's say, 4000 characters, and append some text and the elapsed time at various set points. Write the total elapsed time at the end. You will be able to see where the trouble spots are and what code you can ignore.
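A minimal version of that idea might look like this (names are my own invention; the point is just buffering the timing lines and writing them once):

```java
public class PhaseTimer {
    private final StringBuilder log = new StringBuilder(4000);
    private final long start = System.nanoTime();
    private long lastMark = start;

    // Record the elapsed time since the previous checkpoint under a label.
    public void mark(String label) {
        long now = System.nanoTime();
        log.append(label).append(": ")
           .append((now - lastMark) / 1_000_000).append(" ms\n");
        lastMark = now;
    }

    // Append the overall total and hand back the whole log, ready to write in one go.
    public String finish() {
        log.append("total: ").append((System.nanoTime() - start) / 1_000_000).append(" ms\n");
        return log.toString();
    }
}
```

Call `mark("parse")`, `mark("insert")`, and so on at your set points, then write the result of `finish()` once at the end of the cycle.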
"No one appreciates the very special genius of your conversation as the dog does."
Wow, your code is incredibly fast. Inserting 2M records takes me 5 hours; I do batched inserts with a commit after every 1000 rows, which is the optimum value for memory and speed in my case. Anyway, your real problem is retrieving the data back and generating the XML. Maybe you can generate the XML at insertion time and avoid reading the database back at all. In that case you would certainly fit in 2 hours.
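To sketch the "write the XML in the same pass" idea: the JDK's StAX writer can stream elements as each line is parsed, so nothing has to be read back from the database. The element names here are invented for illustration.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class StreamingXml {
    // Emits one <record> element per parsed row, in the same pass that would
    // also feed the batched inserts -- no second read of the database.
    static String toXml(Iterable<String[]> rows) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("records");
        for (String[] row : rows) {
            xml.writeStartElement("record");
            xml.writeStartElement("id");   xml.writeCharacters(row[0]); xml.writeEndElement();
            xml.writeStartElement("name"); xml.writeCharacters(row[1]); xml.writeEndElement();
            xml.writeEndElement();
        }
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        return out.toString();
    }
}
```

For 2 million rows you would write to a FileWriter instead of a StringWriter, but the streaming structure is the same.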
It's been a couple of months and I can breathe now. Thanks for all the suggestions for solving my problem.
We ended up with a pure SQL Server solution, so I know it's not Java related, but...
We import the data using a DTS package (actually three of them), then we use SQL Server's BCP to export XML produced by FOR XML EXPLICIT queries. There were still some roadblocks, but we ended up importing the flat files and exporting the XML in 45 minutes. Really, really fast, considering there are 2 million rows in one of the tables. Some of the time in the process is deliberate wait time so the steps don't get in each other's way.
I'm still using my Java code, even though it's slower, for retrieving the information for a single person and for sending course information, which is medium-sized.