Requirement : Read data from a flat file and insert into database.
The total number of rows in the data file are around 1 lakh ( each line contains 700 characters). We need to dump those data into databse.
Method: Read the data using buffered input stream and store it in arraylist and insert into database. Since we cannot directly store all 1 lakh lines of data into arraylist at the same time, i am doing it block by block. Storing 25000 lines at a time and insert into database then clear arraylist then again start the process.
This way it takes 12 min to insert 1 lakh record into database.
I am not very sure how much time will it take really?
Are you using the SQL Statement addBatch() to batch up a bunch of inserts? You should be able to read a line and add a SQL statement without ever putting the lines into a collection like an ArrayList. Add, execute and commit some number of lines at a time. Give yourself a command line argument or configuration option to set the number of lines you read between commits.
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
When i looked at the executions, i don't see any delay in storing data into the collections classes and retreiving it.
Let me check it without using collection classes and just directly inserting the data into database.
Joined: Dec 06, 2005
I removed the collection classes and tried a direct insertion of data into database. Also i noticed database insert timing alone. I calculated start time before database action starts and ends. It took more time than when i used collection classes. For hundred thousand records it's taking 14 mins.
when i test database insertion alone, i am not reading the data from the file and inserting it...just a database insertion (hard coded value in the statment values) alone takes more time.
When I used to read data from flat file, I use to store it in collection classes (arraylist) and insert into database. In the mean time, when the ararylist size reaches 15000, I will clear it and start the iteration again. So database conenction also get refreshed during this time.
Is it because of single database connection inserting hundred thousand records takes me more time (This test is a direct database insertion wihtout using collection classes).
Where you store the data in memory on the client side will have very little effect. From the client's point of view, reading the records and writing them directly to the database should be the most effective. Anything else (such as storing them in collections) is overhead. The database doesn't care that you did all that.
How often are you committing these changes to the database? Are you using batches? Those are the relevant questions.
Hundred thousand records! It is interesting indeed...
For the database inserts are you using Stored Procedure? If not try it with Stored Procedure (making callable statements instead of Statement/Prepared Statement). It should make it much faster than the normal insert. However you may want to check the buffer limit in the database (since hundred thousand is huge).
Time taken depends upon various parameters like: -- What is DB(MySql,Postgres,Oracle,HSQLDB,ClouseScape,etc) -- What type of JDBC driver you are using ? -- What type of file system and related question (for HDD related problems) -- What is OS/Memory/CPU-Load while you are testing ? -- Are you using PrepareStatement & addBatch -- Did you set autocommit off before processing the commit
PS: I got 4500 records/sec in MySQL(for record size of 200 bytes) I got 6000 records/sec in PostgreSQL(for record size of 200 bytes) I got 36000 records/sec in HSQLDB(for record size of 200 bytes) my taking care of above parameters
Originally posted by Deepak Shankar: For the database inserts are you using Stored Procedure? If not try it with Stored Procedure (making callable statements instead of Statement/Prepared Statement). It should make it much faster than the normal insert. However you may want to check the buffer limit in the database (since hundred thousand is huge).
This sounds interesting. Why would using a callable statement+stored procedure be faster than a prepared statement? If anything I would have expected there to be additional overhead for pulling in the interpreter for the database vendor's more-than-SQL language processor.
Different question for Sathish - is there a particular vendor's database you are using? You may have additional options depending which vendor it is.
For maximum performance, consider your DBMS' bulk loader utility (e.g., SQLLDR in Oracle). They temporarily disable index updates & commits, which speeds things up dramatically. These utilities typically have a throughput of several hundred thousand rows per minute.
Is it a hard requirement that you must do it through a Java program? Else its is always better to use the DB vendors bulk-loading utilities. This can be invoked through some shell scripts/batch files to make things easier. This also makes more sense if you are not doing any additional processing with the data before upload.
I think the delay you found in your second attempt was probably because of commit operations happening more frequently than the earlier case. You need to look at the batch size in each case.