Best performance: read a huge file

Edward Chen
Ranch Hand

Joined: Dec 23, 2003
Posts: 798
I am thinking about a performance issue.

Say I have a huge text file (about 4GB) that looks like this:

name1,address1,date1......\n
name2,address2,date2......\n
.....

I need to read it, parse it, do some conversion (e.g., change the data format), and then save it to a database.

Which approach would give the best performance? Which collection type (Vector, ArrayList?) should we use?

So far, the one thing I am sure of is that we need to use a buffer. Anything else? Any web links are welcome.

Thanks.
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24184

Originally posted by Edward Chen:
Which collection type (Vector, ArrayList?) should we use?


Most likely you'd want to use no collection at all; once loaded, the 4GB file would most likely take even more than 4GB of RAM, so it's unlikely you could hold the whole thing in the Java heap (unless you have a 64-bit JVM, and even then, storing the whole thing wastes space). It would make more sense to read, say, 50 records, batch up the JDBC inserts, and then commit them; then go back and do 50 more.

As far as file I/O goes, be sure to use a BufferedReader.
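
A minimal sketch of the "read 50, hand them off, repeat" loop described above. The class name BatchReader and the sink callback are made up for illustration; in a real loader the sink would be whatever code performs the JDBC inserts and the commit.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper (not from the thread): reads lines in fixed-size
// batches so that only one batch of records is on the heap at a time.
class BatchReader {

    // Reads every line from source, passing each full batch (and the final
    // partial batch, if any) to sink. Returns the number of lines read.
    static long forEachBatch(Reader source, int batchSize,
                             Consumer<List<String>> sink) throws IOException {
        long total = 0;
        List<String> batch = new ArrayList<>(batchSize);
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                batch.add(line);
                total++;
                if (batch.size() == batchSize) {
                    sink.accept(batch);
                    // Start a fresh list in case the sink kept a reference.
                    batch = new ArrayList<>(batchSize);
                }
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // leftover records at end of file
        }
        return total;
    }
}
```

With a real file this might be called as BatchReader.forEachBatch(new FileReader("people.txt"), 50, batch -> ...), where people.txt is a placeholder name. Memory use stays bounded by the batch size rather than the file size.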


Peter Chase
Ranch Hand

Joined: Oct 30, 2001
Posts: 1970
If the conversions only apply to the data within one single line (i.e. lines are not related), then you should just read the file line-by-line and convert each line. You could then write that converted line to the database.

You certainly don't want to read the whole file into a Collection!

I'm no JDBC expert, but I suspect you could improve performance by holding off on writing to the database until you have read a few lines, and then writing them as a batch. Presumably, you would also use pre-compiled statements (PreparedStatement) for the writes.
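
One way the batched, pre-compiled write could look in JDBC, as a sketch only. The table name, column names, and the assumption that every record has exactly three comma-separated text fields are all invented for illustration; real code would also convert types and handle failed batches.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical loader (not from the thread) showing addBatch/executeBatch.
class BatchInserter {

    // Placeholder table and columns; adjust to the real schema.
    static final String SQL =
        "INSERT INTO people (name, address, joined) VALUES (?, ?, ?)";

    // Inserts each batch with one executeBatch() call and one commit.
    // Returns the number of rows queued for insertion.
    static int load(Connection conn, Iterable<List<String>> batches)
            throws SQLException {
        conn.setAutoCommit(false); // commit once per batch, not once per row
        int queued = 0;
        // The statement is prepared once and reused for every row.
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (List<String> batch : batches) {
                for (String line : batch) {
                    String[] f = line.split(",", -1);
                    if (f.length < 3) {
                        continue; // skip malformed lines in this sketch
                    }
                    ps.setString(1, f[0]);
                    ps.setString(2, f[1]);
                    ps.setString(3, f[2]);
                    ps.addBatch();
                    queued++;
                }
                ps.executeBatch();
                conn.commit();
            }
        }
        return queued;
    }
}
```

How much batching helps depends on the driver and the database, so the batch size is worth measuring rather than guessing.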


Edward Chen
Ranch Hand

Joined: Dec 23, 2003
Posts: 798
Thanks for your reply.

I am wondering whether there is a performance comparison between Java, C++ and C# based on:
1. the same database
2. files of the same size (4GB)
3. the same job: read, parse, convert, and save to the database
4. any technology (e.g., NIO) being allowed, including third-party libraries.

Any clues?

Thanks.
Tony Morris
Ranch Hand

Joined: Sep 24, 2003
Posts: 1608
You can read it into some type, but none of the collections, since they are poorly designed. Instead you'll have to write your own type that is "lazily evaluated" (I have written many such types released under the CPL) - since actually, how you read your file is dependent entirely on what you do with that file.

Haskell is a lazily evaluated pure FP language. Take a look at its readFile function: http://haskell.org/ghc/docs/latest/html/libraries/base/Prelude.html#v%3AreadFile
The readFile function reads a file and returns the contents of the file as a string. The file is read lazily, on demand, as with getContents.

Does this mean that the entire String exists in memory when this function is itself evaluated? Absolutely not. You can replicate exactly this behaviour in Java and in some ways, the core API has done so even if not explicitly stated and more often than not, in a horribly contrived manner.

In short, you'll have to provide the case for what you are actually going to do with the file to provide a more thorough answer, but until then, the answer is "read it into a lazily evaluated structure (of course!)".
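
For comparison, here is a rough Java counterpart to a lazily read file, assuming a Java 8 or later runtime. BufferedReader.lines() returns a Stream that pulls lines from the reader only as a terminal operation asks for them. The class name LazyRecords and the comma-splitting are illustrative assumptions, not one of the types mentioned above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.stream.Stream;

// Hypothetical helper: exposes a file's records as a lazily populated stream.
class LazyRecords {

    // Nothing is read when this method returns; lines are pulled from the
    // reader (through its internal buffer) only as the stream is consumed.
    static Stream<String[]> records(Reader source) {
        BufferedReader in = new BufferedReader(source);
        return in.lines()
                 .map(line -> line.split(",", -1))
                 .onClose(() -> {
                     try {
                         in.close();
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
    }
}
```

A caller that only wants the first few records, for example records(reader).limit(10), never causes the rest of the file to be read. Using the stream in a try-with-resources block makes sure the underlying reader is closed.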


Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24184

Edward:

As far as comparing Java and C++: if the code in both languages is well-written, you'll see no difference at all. The performance of your disk I/O (i.e., the OS itself) and database access (i.e., the database engine, and communications with it) will totally swamp any computational overhead.
Edward Chen
Ranch Hand

Joined: Dec 23, 2003
Posts: 798
Ernest,

1. Where can I find the kind of "well-written" read-write example you mean?

2. I am wondering whether there is any way to cut a 4GB file into sub-files (say 4 x 1GB) so that we could use multiple threads to read it. Does NIO have an API for this?

Thanks.
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24184

As far as a "well-written" example goes, there are plenty of them out there; just the simple ones in Sun's I/O tutorial are fine. There are a few simple principles to adhere to:

- Use buffered I/O. Just wrapping a FileInputStream in a BufferedInputStream makes an enormous difference.

- Don't read just one byte at a time, but rather a decent-sized array full.

- Don't read 4GB using BufferedReader.readLine(), because creating all those Strings will kill you! Instead, try to process the data without creating any objects at all, if you can.
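
The three principles above can be sketched roughly as follows: a buffered stream, reads into a decent-sized byte array, and a scan over the raw bytes that never builds a String. The example only counts records and fields, and it assumes an ASCII-compatible encoding such as UTF-8, where the bytes for ',' and '\n' never occur inside another character. The class name ByteScanner is made up.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical scanner: walks the raw bytes and allocates nothing per record.
class ByteScanner {

    // Returns {recordCount, fieldCount}. Blank lines are ignored.
    static long[] scan(InputStream raw) throws IOException {
        long records = 0;
        long fields = 0;
        boolean inRecord = false;      // true once the current line has content
        byte[] buf = new byte[64 * 1024]; // reused for every read
        try (InputStream in = new BufferedInputStream(raw)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    byte b = buf[i];
                    if (b == '\n') {
                        if (inRecord) {
                            records++;
                            fields++;  // the last field has no trailing comma
                            inRecord = false;
                        }
                    } else if (b == ',') {
                        fields++;
                        inRecord = true;
                    } else if (b != '\r') {
                        inRecord = true;
                    }
                }
            }
        }
        if (inRecord) {                // final line without a trailing newline
            records++;
            fields++;
        }
        return new long[] { records, fields };
    }
}
```

Real parsing would track where each field starts and ends within the buffer and convert values in place, for example accumulating digits into an int, instead of just counting.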

As far as splitting up the file: if that's a possibility, then it might be worth a try; multiple threads might be able to process data while others are waiting on I/O. You could simply use RandomAccessFile and start from N different locations within the file; figuring out which start points are valid might be tricky, since an arbitrary offset will usually land in the middle of a record. As you say, NIO's asynchronous I/O capabilities are another possible option, although "let's use NIO" is not the magic speed bullet many people seem to think it is -- remember that all the FileReader/FileInputStream/RandomAccessFile/etc. classes have been reimplemented on top of NIO in recent JDKs.
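
One possible way to find valid start points, sketched under the assumption that records are separated by '\n' and that no record contains an embedded newline: pick evenly spaced offsets, then move each one forward to just past the next newline. The class and method names are invented for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper: computes record-aligned slice boundaries for a file.
class SplitPoints {

    // Returns parts + 1 offsets; slice i covers bytes [result[i], result[i + 1]).
    // Every interior offset sits immediately after a '\n' (or at end of file).
    static long[] find(String path, int parts) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            long len = f.length();
            long[] starts = new long[parts + 1];
            starts[0] = 0;
            starts[parts] = len;
            for (int i = 1; i < parts; i++) {
                // Evenly spaced guess, never before the previous boundary.
                long guess = Math.max(len * i / parts, starts[i - 1]);
                f.seek(guess);
                // Skip forward to the end of the record the guess landed in.
                // Unbuffered single-byte reads are tolerable here because only
                // part of one record is read per boundary.
                int b;
                while ((b = f.read()) != -1 && b != '\n') {
                    // keep scanning
                }
                starts[i] = f.getFilePointer();
            }
            return starts;
        }
    }
}
```

Each worker thread could then open its own RandomAccessFile, seek to its start offset, and stop at its end offset. Whether this beats a single sequential reader depends on the disk and on how expensive the per-record processing is.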
Mike Himstead
Ranch Hand

Joined: Apr 12, 2006
Posts: 178
Originally posted by Ernest Friedman-Hill:

- Don't read 4GB using BufferedReader.readLine(), because creating all those Strings will kill you! Instead, try to process the data without creating any objects at all, if you can.


Could you give a hint on how to achieve this?
 