| Author |
Huge file processing
|
Karan Jain
Ranch Hand
Joined: May 30, 2007
Posts: 82
|
|
Hi,
I have a requirement where I have a property file (name/value pairs) containing 300,000 records, with a size of, say, 3 GB. I need to persist it in the database.
What approach can be taken to persist it in the database effectively?
I mean:
1. How to read the file?
2. What collections can be used?
3. How to effectively handle the transaction?
Any views are appreciated!
Thanks in advance!
|
 |
Adam Michalik
Ranch Hand
Joined: Feb 18, 2008
Posts: 128
|
|
|
As the file is that big, reading it all into memory is a bad idea. I'd recommend using a BufferedReader and reading your file line by line. Then, for each line, you have to extract the name and the value and persist them in the database. No collection is necessary for that approach. As for the transactions, it depends on your goal: you may want to open and commit a transaction for each line, or one transaction for the whole file (the all-or-nothing approach).
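The line-by-line approach described above might look like the following minimal sketch. It assumes each line has the form name=value. The class name, the RecordHandler callback and the file name mentioned in the comment are hypothetical; the callback stands in for whatever code actually writes to the database.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class LineByLineSketch {

    /** Hypothetical callback; a real implementation would persist the pair here. */
    interface RecordHandler {
        void handle(String name, String value);
    }

    /**
     * Reads one line at a time, so memory use does not grow with the file size.
     * Returns the number of records passed to the handler.
     */
    static int process(BufferedReader in, RecordHandler handler) throws IOException {
        int count = 0;
        String line;
        while ((line = in.readLine()) != null) {
            int eq = line.indexOf('=');   // assumed name/value separator
            if (eq < 0) {
                continue;                 // skip blank or malformed lines
            }
            handler.handle(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // For a real file: new BufferedReader(new FileReader("huge.properties"))
        try (BufferedReader in = new BufferedReader(new StringReader("a=1\nb=2\n"))) {
            int n = process(in, (name, value) -> System.out.println(name + " -> " + value));
            System.out.println(n + " records");
        }
    }
}
```

Only one line is held in memory at a time, which is why no collection is needed.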
|
 |
Rob Spoor
Sheriff
Joined: Oct 27, 2005
Posts: 19216
|
|
If you use a PreparedStatement, you can reuse that one single PreparedStatement for all records.
Or perhaps you can try the addBatch and executeBatch methods. I don't know where those will cache the insert queries though, so it may be as bad as reading the entire file into memory.
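One way to combine both suggestions is sketched below: a single PreparedStatement is reused for every record, and executeBatch is called every batchSize records so that the number of pending inserts stays bounded however the driver stores them. This is only a sketch under stated assumptions: the table name properties and its columns name and value are hypothetical, the input is assumed to be name=value lines, and the caller is assumed to have called setAutoCommit(false) on the connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchInsertSketch {

    // Hypothetical table and column names.
    static final String INSERT_SQL = "INSERT INTO properties (name, value) VALUES (?, ?)";

    /**
     * Inserts every name=value line using one PreparedStatement.
     * Assumes con.setAutoCommit(false) was called by the caller.
     * Returns the number of records added.
     */
    static int insertAll(Connection con, BufferedReader in, int batchSize)
            throws IOException, SQLException {
        int total = 0;
        int pending = 0;
        try (PreparedStatement ps = con.prepareStatement(INSERT_SQL)) {
            String line;
            while ((line = in.readLine()) != null) {
                int eq = line.indexOf('=');          // assumed separator
                if (eq < 0) {
                    continue;                        // skip malformed lines
                }
                ps.setString(1, line.substring(0, eq).trim());
                ps.setString(2, line.substring(eq + 1).trim());
                ps.addBatch();
                total++;
                if (++pending == batchSize) {
                    ps.executeBatch();               // send the pending inserts
                    con.commit();                    // one transaction per batch
                    pending = 0;
                }
            }
            if (pending > 0) {                       // flush the final partial batch
                ps.executeBatch();
                con.commit();
            }
        }
        return total;
    }
}
```

Committing once per batch is a middle ground between a transaction per line and a single transaction for the whole file. For all-or-nothing behaviour, the commit calls inside the method would be dropped and the caller would commit once at the end.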
|
SCJP 1.4 - SCJP 6 - SCWCD 5
|
 |
Young Choi
Greenhorn
Joined: Aug 22, 2009
Posts: 4
|
|
BufferedReader is good for reading a huge file that contains multiple records.
But if it is a huge file of "one record" (multiple records concatenated into one line) of around 3 GB, even BufferedReader with the -Xms1g -Xmx3g command line options won't be able to process the data.
This is my situation: I have to process a 4.76 GB file consisting of one huge line (the file provider has sent us the file concatenated into one line!).
Maybe I will have to split the file into smaller pieces to process it, and then merge them back into one file to send to the next system.
Is there a better approach than splitting it into pieces?
|
 |
Campbell Ritchie
Sheriff
Joined: Oct 13, 2005
Posts: 32675
|
|
Welcome to JavaRanch
You might have done better to start a new thread rather than reopening an old topic.
You will have to get details of the format of the file from whoever supplied it. Is there anything that starts or finishes a record and is distinguishable from everything else? If there is, can you match it with a regular expression and use a Scanner to read the file?
Is there a record number which increases from record to record?
Are the records in the file of a uniform length? In which case can you read a certain number of characters and call them a line?
I am sure other people will be able to suggest other strategies for parsing your file. If you can't get any of them to work, can you tell the file supplier off for giving you an impossible task?
I would agree with previous comments that it is better to try handling the file one record at a time than trying to handle the whole thing.
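The Scanner suggestion above could be sketched as follows for a text file. It assumes that some recognisable marker ends each record; the marker ";;" used here is purely hypothetical, and the real one has to come from the file supplier. Note that java.util.Scanner only exists from Java 5 onwards.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class ScannerRecordSketch {

    // Hypothetical end-of-record marker.
    static final Pattern RECORD_END = Pattern.compile(";;");

    /**
     * Returns at most 'limit' records from the source. The Scanner only reads
     * as far as it needs to find the next marker, so the whole input is never
     * held in memory. The caller owns and closes the Reader.
     */
    static List<String> firstRecords(Reader source, int limit) {
        List<String> records = new ArrayList<>();
        Scanner scanner = new Scanner(source).useDelimiter(RECORD_END);
        while (records.size() < limit && scanner.hasNext()) {
            records.add(scanner.next());
        }
        return records;
    }

    public static void main(String[] args) {
        // For a real file: new BufferedReader(new FileReader("oneline.txt"))
        Reader source = new StringReader("rec1;;rec2;;rec3");
        for (String record : firstRecords(source, 2)) {
            System.out.println(record);
        }
    }
}
```

In a real program each record would be handled as soon as scanner.next() returns it rather than collected into a list; the limit is only there to keep the sketch bounded.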
|
 |
Young Choi
Greenhorn
Joined: Aug 22, 2009
Posts: 4
|
|
Thank you for your attention!
Yes, the file's logical record has a fixed format, and each record is 215 bytes long.
In that case, can I read the 4.76 GB file using the Scanner class? By the way, I did not know about the Scanner class until I saw your comment here, as I still use SDK 1.4.
If so, could you post a sample of Scanner usage for handling a huge file of one big line like mine?
Thanks again.
|
 |
Campbell Ritchie
Sheriff
Joined: Oct 13, 2005
Posts: 32675
|
|
|
If your file is in bytes, then Scanner probably won't work, I am afraid; it only works on text files.
|
 |
William Brogden
Author and all-around good cowpoke
Rancher
Joined: Mar 22, 2000
Posts: 12268
|
|
If the file is fixed binary format, then the obvious approach is to do a binary read into a byte[] of the record size, then pass the byte[] to a method that knows how to unpack it. Let the file system take care of buffering, just work with one record at a time.
Do NOT use a Reader because Readers try to do a character conversion. Scanner also assumes you are working with a text String in a given character set.
You will need a complete record layout to work out how to pick Java values out of the byte[].
Bill
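A minimal sketch of the record-at-a-time binary read described above, using the 215-byte record size mentioned in this thread. The class and method names are illustrative, and the layout used in unpackKey (an ASCII key in the first 10 bytes) is hypothetical, since the real record layout is not given here.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FixedRecordReaderSketch {

    static final int RECORD_SIZE = 215;   // record length given in this thread

    /**
     * Fills buf with the next record. Returns false at a clean end of stream
     * and throws EOFException if the stream ends in the middle of a record.
     */
    static boolean readRecord(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                if (off == 0) {
                    return false;
                }
                throw new EOFException("Truncated record: " + off + " of " + buf.length + " bytes");
            }
            off += n;
        }
        return true;
    }

    /** Hypothetical layout: the first 10 bytes hold an ASCII key. */
    static String unpackKey(byte[] record) {
        return new String(record, 0, 10, StandardCharsets.US_ASCII).trim();
    }

    public static void main(String[] args) throws IOException {
        // Two space-filled demo records; for a real file use new FileInputStream("huge.dat").
        byte[] data = new byte[RECORD_SIZE * 2];
        Arrays.fill(data, (byte) ' ');
        System.arraycopy("KEY0000001".getBytes(StandardCharsets.US_ASCII), 0, data, 0, 10);
        System.arraycopy("KEY0000002".getBytes(StandardCharsets.US_ASCII), 0, data, RECORD_SIZE, 10);

        byte[] record = new byte[RECORD_SIZE];   // reused for every record
        try (InputStream in = new ByteArrayInputStream(data)) {
            while (readRecord(in, record)) {
                System.out.println(unpackKey(record));
            }
        }
    }
}
```

The same byte[] is reused for every record, so memory use stays constant however large the file is. DataInputStream.readFully does a similar job, but it throws EOFException at end of file without distinguishing a clean end from a truncated record.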
|
Java Resources at www.wbrogden.com
|
 |
Young Choi
Greenhorn
Joined: Aug 22, 2009
Posts: 4
|
|
Thanks a lot Bill!
The way you suggested was exactly right.
I used DataInputStream to read the file into a byte[] with a fixed (or calculated) length for each chunk, and it works beautifully.
So reading a file of one huge record (or line) is no longer a problem in Java. The one remaining small issue is the time it takes to write an output file of 4.76 GB in 215-byte records, but I suppose there is no quick way to do that.
I really appreciate it again!
|
 |
Campbell Ritchie
Sheriff
Joined: Oct 13, 2005
Posts: 32675
|
|
|
Writing to a file is just like reading, only backwards. If you used readers before, use writers (FileWriter has a constructor which takes a boolean, allowing you to append the text to the end of the file). If you used an XYZInputStream before, try an XYZOutputStream. You should find initial hints in the Java™ Tutorials.
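As a small illustration of the FileWriter constructor mentioned above, the sketch below appends lines to a file by passing true as the second argument. The class name and the temporary file are only for demonstration.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class AppendSketch {

    /** Appends one line of text to the end of the file, creating it if needed. */
    static void appendLine(File file, String text) throws IOException {
        // The second constructor argument (true) means "append" instead of "overwrite".
        try (BufferedWriter out = new BufferedWriter(new FileWriter(file, true))) {
            out.write(text);
            out.newLine();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("append-sketch", ".txt");
        appendLine(file, "first");
        appendLine(file, "second");   // lands after "first", not on top of it
        System.out.println(file.length() + " bytes written to " + file);
        file.delete();
    }
}
```

For binary records the stream equivalent is FileOutputStream, which has a matching constructor that takes an append flag.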
|
 |
William Brogden
Author and all-around good cowpoke
Rancher
Joined: Mar 22, 2000
Posts: 12268
|
|
OK - IF you are both reading and writing records, it may be profitable to add your own buffering. We don't want the physical disk to be chasing back and forth between the reading and writing areas.
See java.io.BufferedOutputStream - note you can construct a huge output buffer, much larger than the default operating system one. Be sure to make it a multiple of the record size.
The java.io package presents many elegant demonstrations of the "Decorator" design pattern, making it easy to do some time trials with and without a huge output buffer.
Do some time trials and let us know how much difference it makes.
Bill
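The buffering idea above might be tried with something like the following sketch, which copies 215-byte records through a BufferedOutputStream whose buffer is a whole multiple of the record size. The class name, method name and recordsPerBuffer parameter are hypothetical, and the in-memory streams in main only stand in for real file streams.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class BufferedRecordCopySketch {

    static final int RECORD_SIZE = 215;   // record length given in this thread

    /** Copies whole records from in to rawOut and returns how many were copied. */
    static long copyRecords(InputStream in, OutputStream rawOut, int recordsPerBuffer)
            throws IOException {
        // Output buffer sized as a multiple of the record size.
        BufferedOutputStream out =
                new BufferedOutputStream(rawOut, RECORD_SIZE * recordsPerBuffer);
        byte[] record = new byte[RECORD_SIZE];
        long count = 0;
        while (true) {
            int off = 0;
            int n;
            // Keep reading until the record is full or the stream ends.
            while (off < RECORD_SIZE && (n = in.read(record, off, RECORD_SIZE - off)) > 0) {
                off += n;
            }
            if (off == 0) {
                break;                    // clean end of input
            }
            if (off < RECORD_SIZE) {
                throw new EOFException("Truncated record after " + count + " records");
            }
            // A real program would transform the record here before writing it.
            out.write(record, 0, RECORD_SIZE);
            count++;
        }
        out.flush();                      // push out whatever is still buffered
        return count;
    }

    public static void main(String[] args) throws IOException {
        // In-memory demo; for a real run use FileInputStream and FileOutputStream.
        byte[] data = new byte[RECORD_SIZE * 4];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copyRecords(new ByteArrayInputStream(data), sink, 1000);
        System.out.println(copied + " records, " + sink.size() + " bytes");
    }
}
```

For a time trial, the same method could be run against real files with different recordsPerBuffer values, measuring each run with System.nanoTime().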
|
 |