Spring batch - 1 million rows, one column - too much time

Mustafa Garhi
Ranch Hand

Joined: Nov 05, 2008
Posts: 111
Hi,

I am new to Spring Batch and have done some reading over the past couple of days.

We have a Spring Batch process that is supposed to read close to a million rows, each with only one column, from a table and write them to a text file.

It points to an Oracle DB, and the fetch size is the default (10).

The driver used is oracle.jdbc.driver.OracleDriver.

The ItemReader we use is Spring's JdbcCursorItemReader, and our writer implements ItemWriter, ItemStream, StepExecutionListener and InitializingBean.

We also have a RowMapper.

The whole read/write process takes 5 hours every day, which is not acceptable to the client.

When I check the logs, I see that for each record only 1 or 2 milliseconds elapse between the row being mapped and it being written.

However, the next row is mapped only after another 15 milliseconds. What is Spring doing in between? We want to optimize this.

Where should I start?

Please suggest; I would be grateful!
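
(For illustration, a minimal sketch of the kind of setup described above; the class name, query, and file path are hypothetical stand-ins, not the actual code:)

import java.sql.ResultSet;
import java.sql.SQLException;

import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
import org.springframework.core.io.FileSystemResource;
import org.springframework.jdbc.core.RowMapper;

public class SingleColumnExportSketch {

    // Reader: streams the single column from Oracle through a JDBC cursor.
    public JdbcCursorItemReader<String> reader(DataSource dataSource) {
        JdbcCursorItemReader<String> reader = new JdbcCursorItemReader<String>();
        reader.setDataSource(dataSource);
        reader.setSql("SELECT some_column FROM some_table"); // hypothetical query
        reader.setRowMapper(new RowMapper<String>() {        // hand-written mapper for the one column
            public String mapRow(ResultSet rs, int rowNum) throws SQLException {
                return rs.getString(1);
            }
        });
        return reader;
    }

    // Writer: one value per line in a plain text file.
    public FlatFileItemWriter<String> writer() {
        FlatFileItemWriter<String> writer = new FlatFileItemWriter<String>();
        writer.setResource(new FileSystemResource("/tmp/export.txt")); // hypothetical path
        writer.setLineAggregator(new PassThroughLineAggregator<String>());
        return writer;
    }
}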
Saifuddin Merchant
Ranch Hand

Joined: Feb 08, 2009
Posts: 606

What's your commit-interval? Why don't you post your config?
It shouldn't take that long; I've seen batches write close to a million records to the database in under 30 minutes, and a read operation should be a lot faster.


Cheers - Sam.
Twisters - The new age Java Quiz || My Blog
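
(For reference, the commit-interval Sam is asking about is the chunk size of the step; in the XML configuration it is the commit-interval attribute on the <chunk> element. A minimal sketch of the same setting using Spring Batch's Java configuration, with illustrative names, assuming a version that provides StepBuilderFactory:)

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

public class ExportStepConfig {

    // Chunk-oriented step: items are read (and processed) one by one, then
    // written and committed in groups of 1000 - the commit interval.
    public Step exportStep(StepBuilderFactory stepBuilderFactory,
                           ItemReader<String> reader,
                           ItemWriter<String> writer) {
        return stepBuilderFactory.get("exportStep")
                .<String, String>chunk(1000)
                .reader(reader)
                .writer(writer)
                .build();
    }
}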
Mustafa Garhi
Ranch Hand

Joined: Nov 05, 2008
Posts: 111
Sam, I am sorry but I wouldn't be able to post the config.

However, I realized that my understanding of commitInterval was completely wrong, as I thought it related only to DB commits.

After your post I tried setting the commitInterval, and it has definitely made the process faster. By how much, I am still measuring; I will let you know.

I am assuming that if I set the commitInterval too high, it will keep too many items in memory, which could mean heap issues. Right?

Thanks for that great help.
Saifuddin Merchant
Ranch Hand

Joined: Feb 08, 2009
Posts: 606

A commit interval in the range of 100-1000 should be good.
Going as high as 10,000 should still give you great performance without any issues.
Mustafa Garhi
Ranch Hand

Joined: Nov 05, 2008
Posts: 111
All right Sam. Thanks again.

One more question: if I am fetching only one column from the DB, do I even need a RowMapper?

I think I can save some time if I can get rid of the mapper somehow.

I tried removing it, and Tomcat throws an error saying the mapper is mandatory.

I am digging into this, but would appreciate any ideas.
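
(Side note: a RowMapper is mandatory for JdbcCursorItemReader, which is why removing it fails at startup. One option, sketched here with illustrative names, is Spring's built-in SingleColumnRowMapper in place of a hand-written mapper:)

import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.jdbc.core.SingleColumnRowMapper;

public class SingleColumnReaderFactory {

    // The reader still gets a RowMapper, but it is the framework-provided one
    // that simply returns the first (and only) column of each row.
    public static JdbcCursorItemReader<String> reader(DataSource dataSource, String sql) {
        JdbcCursorItemReader<String> reader = new JdbcCursorItemReader<String>();
        reader.setDataSource(dataSource);
        reader.setSql(sql);
        reader.setRowMapper(new SingleColumnRowMapper<String>(String.class));
        return reader;
    }
}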
Saifuddin Merchant
Ranch Hand

Joined: Feb 08, 2009
Posts: 606

Hard to say without really knowing what you are doing.
I normally disagree with micro-optimization.

Personally, I don't think you would get better performance by doing away with the RowMapper. Framework classes usually have optimized settings (e.g. database buffer size).
Mark Spritzler
ranger
Sheriff

Joined: Feb 05, 2001
Posts: 17257

You can implement your own ItemReader and just use JdbcTemplate, calling one of the queryFor methods. If an int is returned, call queryForInt; if it is a String, call queryForObject with String.class.

If you do that, you don't need the RowMapper. While I doubt you will see a great increase in performance, it is still very quick to write.
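
(A minimal sketch of that idea, with illustrative names rather than anything from this thread. Note that queryForList pulls the whole result set into memory at once, which is worth weighing against the cursor-based reader at a million rows:)

import java.util.Iterator;
import java.util.List;

import javax.sql.DataSource;

import org.springframework.batch.item.ItemReader;
import org.springframework.jdbc.core.JdbcTemplate;

public class SingleColumnJdbcTemplateReader implements ItemReader<String> {

    private final JdbcTemplate jdbcTemplate;
    private final String sql;
    private Iterator<String> values;

    public SingleColumnJdbcTemplateReader(DataSource dataSource, String sql) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
        this.sql = sql;
    }

    public String read() {
        if (values == null) {
            // No RowMapper: the element type tells JdbcTemplate how to map the single column.
            List<String> all = jdbcTemplate.queryForList(sql, String.class);
            values = all.iterator();
        }
        return values.hasNext() ? values.next() : null; // null signals end of input
    }
}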

Mustafa, as for the commitInterval: the way Spring Batch works is called chunk processing. In the read-and-process part of a step, items are handled one by one; the write portion then gets a collection whose size is the commit interval (the chunk size). If I set that to 1000, it will read 1000 rows one by one, passing each row to the RowMapper; after each read/process, that item is placed into the collection, and once the collection holds 1000 items it is sent to the writer. The writer writes the 1000 in a batch and then commits the transaction. Then the reading/processing starts again for the next 1000 items. The reader does some caching of data until the transaction is committed.

So with your Oracle fetch size of 10 and a commit interval of 1000, the reader would contact the database 100 times per chunk. I would make the chunk size and the database fetch size the same number; 1000 will be good, and 10000 will be fine too.

Mark


Perfect World Programming, LLC - Two Laptop Bag - Tube Organizer
How to Ask Questions the Smart Way FAQ
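
(To put that fetch-size point in code: with the default Oracle fetch size of 10 and a commit interval of 1000, each chunk costs roughly 100 round trips to the database; matching the fetch size to the chunk size cuts that to about one. A small sketch, with an illustrative helper name:)

import org.springframework.batch.item.database.JdbcCursorItemReader;

public class FetchSizeTuning {

    // Align the JDBC fetch size with the commit interval (chunk size), so a
    // chunk of, say, 1000 items needs roughly one round trip instead of 100.
    public static void alignFetchSize(JdbcCursorItemReader<?> reader, int chunkSize) {
        reader.setFetchSize(chunkSize);
    }
}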
Saifuddin Merchant
Ranch Hand

Joined: Feb 08, 2009
Posts: 606

I'd like to add a little to what Mark said.

The reader is buffered, and the number of records read from the database in one go is most likely independent of the commit count; it depends on the driver, the ResultSet settings, and other JDBC-related settings.

So why does a commit interval of one take longer? There are two reasons:

a) As Mark explained above, Spring follows a chunk-oriented approach, so at the end of every commit interval it saves the resource being written to (saves the file, commits the database transaction). This takes some time.

b) Spring also saves metadata at the end of every commit interval, namely the STEP_EXECUTION_CONTEXT and the number of rows read and processed. It needs to do this for restartability of the batch job. This is what takes the major share of the time in your case.

Commit intervals up to 1,000 are safe; beyond that you hardly notice any further performance gain. The Spring folks recommend a commit interval in the range of 100.
Jesus Angeles
Ranch Hand

Joined: Feb 26, 2005
Posts: 2057
Hi, just want to add a possible reason.

Maybe you can check whether the SQL query itself is optimal. Try executing it in Toad and see how long it takes.

Some SQL clauses like ORDER BY, and heavier functions, can make your query crawl.
Mustafa Garhi
Ranch Hand

Joined: Nov 05, 2008
Posts: 111
Thanks everyone. The posts have really helped me understand batch better.

I added the commitInterval to the code.

After looking a little more into our Tomcat's memory settings (1 GB max), and given that this process runs every night at 2:00 AM when the server is under less load, I deduced that I did not need to worry much about a high commitInterval overflowing memory, and set the value to 10,000.

The process completed in 44 minutes.

Should I lower this value? (Sorry, I hadn't read Sam's latest post about the ideal commitInterval value carefully.)

Thanks again mates.


Mark Spritzler
ranger
Sheriff

Joined: Feb 05, 2001
Posts: 17257

You can always test it out. Can you set up a staging-like server with the same settings talking to a temp database, then try different intervals and see how each affects the run time?

Mark
Saifuddin Merchant
Ranch Hand

Joined: Feb 08, 2009
Posts: 606

Mark Spritzler wrote:You can always test it out. Can you set up a staging-like server with the same settings talking to a temp database, then try different intervals and see how each affects the run time?

Mark


Been there, done that! 1,000 or 10,000 makes no big difference to the time. Anything over 10,000 might slow things down (memory trade-off).
Mustafa Garhi
Ranch Hand

Joined: Nov 05, 2008
Posts: 111
For one million records I found only a few seconds' difference between 1,000 and 10,000.

We decided to stick to 10,000.

CommitInterval has done the trick.

Thanks everybody !!
Mark Spritzler
ranger
Sheriff

Joined: Feb 05, 2001
Posts: 17257

Mustafa Garhi wrote:For one million records I found only a few seconds' difference between 1,000 and 10,000.

We decided to stick to 10,000.

CommitInterval has done the trick.

Thanks everybody !!


Cool. Also consider this, if it has any effect: if a step fails and you restart it, the job will restart at the chunk that caused the failure and has to redo that entire chunk. That would be 10,000 or 1,000 items, depending on your chunk size, so there is a trade-off between higher and lower values. Just pick the happy medium between successful completion time and restart cost.

Mark