JavaRanch » Java Forums » Java » EJB and other Java EE Technologies

Bulk data processing on J2EE container

Solo Chiang
Greenhorn

Joined: Sep 14, 2005
Posts: 18
Hi

My project has a high-priority requirement for bulk data processing. To be more specific, we need to parse an Excel file with 500,000 rows using POI and insert those 500,000 rows into a database. Since this would be a time-consuming task for the application server, my strategy is:

1) Use a JMS queue for the JDBC work: once the file upload completes, an event is sent to a listener, which puts the database operation on the queue
2) Use JDBC batch updates instead of regular row-by-row JDBC operations when actually updating the database
3) Send an email to the user when the update is complete
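Point 2, batch updates flushed in fixed-size chunks, might look roughly like the sketch below. The table and column names (IMPORT_ROW, COL_A, COL_B) and the two-column row shape are invented for illustration; the chunk size is a tuning knob, not a recommendation.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Sketch of JDBC batch inserts flushed in fixed-size chunks so the
// driver never buffers all 500,000 rows at once.
public class BatchInserter {

    static final int BATCH_SIZE = 1000;

    // Pure helper: number of executeBatch() calls needed for `total` rows.
    static int batchCount(int total, int batchSize) {
        return (total + batchSize - 1) / batchSize;
    }

    static void insertRows(Connection con, List<String[]> rows) throws SQLException {
        String sql = "INSERT INTO IMPORT_ROW (COL_A, COL_B) VALUES (?, ?)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            int pending = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++pending == BATCH_SIZE) {   // flush a full chunk
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {                   // flush the remainder
                ps.executeBatch();
            }
        }
    }
}
```

With a 1,000-row chunk, 500,000 rows means 500 round trips to the database instead of 500,000 individual statements.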

The questions I have are:
1) Will POI 3.2 be able to handle 500,000 rows at a time?
2) More importantly, is there any alternative design you could recommend for processing bulk data from Excel into a database?

Thank you

Solo
Koen Aerts
Ranch Hand

Joined: Feb 07, 2012
Posts: 344

In a somewhat similar situation I had to parse and load large XML files and store certain values in an Oracle database, roughly 10 million rows per night. This was in a JBoss ESB environment. Some files were small, several KB, but some of them were large, for instance 20GB and over. Instead of reading all that data and sticking it in the JMS queue, I simply had the process "lock" the XML file (by renaming it) and sent only the file name on the queue. The receiving action class would then pick up that file, parse it with Woodstox, and use Hibernate to do the DB inserts. This worked quite well and very fast, and with multiple threads (multiple files at once), which was configurable in the ESB settings.
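The streaming-parse part of this approach can be sketched with the StAX API that ships with the JDK (Woodstox is a drop-in implementation of the same javax.xml.stream interfaces). The element name "value" below is a made-up placeholder for whatever element actually carries the data.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Streaming (pull) parse: only the current event lives in memory,
// so a 20GB file needs no more heap than a 20KB one.
public class StreamingXmlLoader {

    static List<String> extractValues(Reader xml) throws XMLStreamException {
        List<String> values = new ArrayList<>();
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "value".equals(r.getLocalName())) {
                    // getElementText() advances past the matching end tag
                    values.add(r.getElementText());
                    // in the real job, each value would be handed to the
                    // persistence layer here instead of collected in a list
                }
            }
        } finally {
            r.close();
        }
        return values;
    }
}
```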
Jayesh A Lalwani
Bartender

Joined: Jan 17, 2008
Posts: 2274

Does the data have to be in Excel format? Can't it be in CSV? Most systems that generate large amounts of data have an easier time generating CSV because it's plain text. Also, most databases provide utilities for bulk-loading CSVs, and the database-provided utility is generally orders of magnitude faster than doing JDBC batch update operations yourself. You can make your JDBC batch updates as fast as the database utility, but it would take quite a bit of reinventing the wheel to get there. By using the database utility you are tying yourself to the database; however, IMO, it's worth it.

If you can push back on the requirement and go with CSVs instead of Excel, you might not need background processing at all. We routinely load 300-400K records using sqlldr within 10s. It's not a good idea to run sqlldr on the web server in the long run; it would be better to offload that processing to another node. However, if you are looking for something quick and dirty, or doing a proof of concept, you can run sqlldr on the web server today and build the logic to run it on a different node later.
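If you do end up shelling out to sqlldr from the Java tier, the invocation might look roughly like this. The connect string and file names are placeholders, and it assumes a pre-written SQL*Loader control file that maps CSV columns to table columns.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Sketch of invoking sqlldr as an external process from the app tier.
public class SqlLoaderRunner {

    // Pure helper so the command line can be inspected before running.
    static List<String> command(String userid, String controlFile, String dataFile) {
        return Arrays.asList("sqlldr",
                "userid=" + userid,
                "control=" + controlFile,
                "data=" + dataFile,
                "errors=0");          // abort on the first bad row
    }

    static int run(String userid, String controlFile, String dataFile)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command(userid, controlFile, dataFile))
                .inheritIO()          // surface sqlldr's output in our log
                .start();
        return p.waitFor();           // exit code 0 means success
    }
}
```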
Koen Aerts
Ranch Hand

Joined: Feb 07, 2012
Posts: 344

Yes, it would indeed be a good idea to see which loading or inserting tools come with your database. You may also be able to tune your database further to perform better on bulk inserts. For Oracle, for instance, there are big performance differences between insert methods. Check this article out: http://dba-oracle.com/t_optimize_insert_sql_performance.htm
Solo Chiang
Greenhorn

Joined: Sep 14, 2005
Posts: 18
In the project I am working on, the Excel file is the only choice they have at this point, and the database is managed by another company, so there would be political obstacles to overcome in order to run SQL*Loader directly on the Oracle server. In this case, the solution has to be constrained to processing the Excel file on the J2EE container itself. Is there any choice other than a JMS queue?

Solo

Jayesh A Lalwani
Bartender

Joined: Jan 17, 2008
Posts: 2274

If you are constrained to a pure Java implementation, you need to be worried about your memory footprint. You might want to look at the streaming API provided by POI.
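As background for why a streaming API helps here: an .xlsx file is just a zip archive of XML parts, and POI's streaming reader walks the sheet XML as a stream of events instead of building the whole workbook in memory. A JDK-only illustration of that underlying idea follows; it is deliberately simplified (real sheets store text cells as indexes into a shared-strings part, which this skips, and the sheet path is assumed).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Stream raw <v> cell values out of the first worksheet's XML part
// without ever holding the whole spreadsheet in memory.
public class XlsxCellStreamer {

    static List<String> rawCellValues(InputStream xlsx)
            throws IOException, XMLStreamException {
        List<String> cells = new ArrayList<>();
        ZipInputStream zip = new ZipInputStream(xlsx);
        for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
            if (!"xl/worksheets/sheet1.xml".equals(e.getName())) continue;
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(zip);
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "v".equals(r.getLocalName())) {
                    cells.add(r.getElementText());   // one cell at a time
                }
            }
            r.close();   // does not close the underlying zip stream
        }
        zip.close();
        return cells;
    }
}
```

With 500,000 rows this kind of event-by-event processing keeps the heap flat, which is exactly what POI's streaming reader gives you without hand-rolling the zip and XML handling.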
 