Best performance: read a huge file

 
Edward Chen
Ranch Hand
Posts: 798
I am thinking about a performance issue.

If I have a huge text file (say, 4GB), like

name1,address1,date1......\n
name2,address2,date2......\n
.....

I need to read it, parse it, make some conversions (i.e., change the data format) and then save it to a database.

Which way would give us the best performance? What collection type (Vector, ArrayList?) should we use?

So far, one thing I am sure of is that we need to use a buffer. Anything else? Any web links are welcome.

Thanks.
 
Ernest Friedman-Hill
author and iconoclast
Posts: 24207

Originally posted by Edward Chen:
What collection type (Vector, ArrayList?) should we use?

Most likely you'd want to use no collection at all; the 4GB file would take even more RAM than that once loaded, so it's unlikely you could hold the whole thing in the Java heap (unless you have a 64-bit JVM, and even then storing the whole thing is a waste of space). It would make more sense to read, say, 50 records, batch up the JDBC inserts, and commit them; then go back and do 50 more.

As far as file I/O: be sure to use a BufferedReader.
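
For instance, a minimal version of the reading loop might look like this (the file name and the comma-splitting are assumptions based on the sample data above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadHugeFile {
    public static void main(String[] args) throws IOException {
        // BufferedReader pulls the file in large chunks, so each readLine()
        // call rarely touches the disk directly.
        BufferedReader in = new BufferedReader(new FileReader("huge.txt"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(",");  // name, address, date, ...
                // convert the fields and queue them for a batched JDBC insert
            }
        } finally {
            in.close();
        }
    }
}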
 
Ranch Hand
Posts: 1970
If the conversions only apply to the data within one single line (i.e. lines are not related), then you should just read the file line-by-line and convert each line. You could then write that converted line to the database.

You certainly don't want to read the whole file into a Collection!

I'm no JDBC expert, but I suspect you could improve performance by holding off writing to the database until you have read a few lines, then writing a batch. Presumably you would also use precompiled statements (PreparedStatement) for the writes; see the sketch below.
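
For example, the batching could be wrapped in a small helper like this sketch (the table and column names are invented for illustration):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchWriter {
    private static final int BATCH_SIZE = 50;
    private final Connection con;
    private final PreparedStatement ps;
    private int pending = 0;

    public BatchWriter(Connection con) throws SQLException {
        this.con = con;
        con.setAutoCommit(false);      // commit once per batch, not once per row
        ps = con.prepareStatement(
                "INSERT INTO records (name, address, date_col) VALUES (?, ?, ?)");
    }

    // Queue one converted record; flush to the database every 50 rows.
    public void write(String name, String address, String date)
            throws SQLException {
        ps.setString(1, name);
        ps.setString(2, address);
        ps.setString(3, date);
        ps.addBatch();
        if (++pending == BATCH_SIZE) {
            flush();
        }
    }

    // Send the accumulated rows in one round-trip and commit.
    public void flush() throws SQLException {
        ps.executeBatch();
        con.commit();
        pending = 0;
    }
}

The caller would invoke write() once per converted line, and call flush() once more at the end for the final partial batch.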
 
Edward Chen
Ranch Hand
Posts: 798
Thanks for your reply.

I am wondering: do we have a performance comparison between Java, C++ and C#, based on
1. the same database
2. the same size of file, 4GB
3. the same job: read, parse, convert and save into the database
4. any technology (e.g., NIO) may be used, including third-party libraries?

Any clues?

Thanks.
 
Ranch Hand
Posts: 1608
You can read it into some type, but not one of the collections, since they are poorly designed for this. Instead you'll have to write your own type that is "lazily evaluated" (I have written many such types, released under the CPL), since how you read your file depends entirely on what you do with that file.

Haskell is a lazily evaluated pure FP language. Take a look at its readFile function: http://haskell.org/ghc/docs/latest/html/libraries/base/Prelude.html#v%3AreadFile
The readFile function reads a file and returns the contents of the file as a string. The file is read lazily, on demand, as with getContents.

Does this mean that the entire String exists in memory when this function is evaluated? Absolutely not. You can replicate exactly this behaviour in Java; in some ways the core API has already done so, even if not explicitly stated, and more often than not in a horribly contrived manner.

In short, you'll have to spell out what you are actually going to do with the file before anyone can give a more thorough answer, but until then, the answer is "read it into a lazily evaluated structure (of course!)".
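
A rough Java sketch of the idea, assuming a single forward pass is all that's needed (the class name is just for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;

// Hands out lines one at a time; the whole file is never in memory at once.
public class LazyLines implements Iterator<String> {
    private final BufferedReader in;
    private String next;

    public LazyLines(String path) throws IOException {
        in = new BufferedReader(new FileReader(path));
        next = in.readLine();          // only the first line is read eagerly
    }

    public boolean hasNext() {
        return next != null;
    }

    public String next() {
        String current = next;
        try {
            next = in.readLine();      // pull the following line on demand
            if (next == null) {
                in.close();
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return current;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}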
 
Ernest Friedman-Hill
author and iconoclast
Posts: 24207
Edward:

As far as comparing Java and C++: if the code in both languages is well-written, you'll see no difference at all. The performance of your disk I/O (i.e., the OS itself) and database access (i.e., the database engine, and communications with it) will totally swamp any computational overhead.
 
Edward Chen
Ranch Hand
Posts: 798
Ernest,

1. Where can I find the "well-written" read-write examples you mean?

2. I am wondering, is there any way to cut a 4GB file into sub-files (say, 4 x 1GB) so that we could use multiple threads to read it? Does NIO have an API for this?

Thanks.
 
Ernest Friedman-Hill
author and iconoclast
Posts: 24207
As far as a "well-written" example goes, there are plenty of them out there; just the simple ones in Sun's I/O tutorial are fine. There are a few simple principles to adhere to:

- Use buffered I/O. Just wrapping a FileInputStream in a BufferedInputStream makes an enormous difference.

- Don't read just one byte at a time, but rather a decent-sized array's worth.

- Don't read 4GB using BufferedReader.readLine(), because creating all those Strings will kill you! Instead, try to process the data without creating any objects at all, if you can.
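
For the last point, one possible approach is to read into a reusable char array and scan it in place, something like the sketch below (it glosses over records that straddle the buffer boundary, which a real version would have to carry over):

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class ScanWithoutStrings {
    public static void main(String[] args) throws IOException {
        Reader in = new FileReader("huge.txt");
        char[] buf = new char[64 * 1024];   // one decent-sized, reusable buffer
        int len;
        while ((len = in.read(buf)) != -1) {
            int start = 0;
            for (int i = 0; i < len; i++) {
                if (buf[i] == '\n') {
                    process(buf, start, i - start);  // one record, no new objects
                    start = i + 1;
                }
            }
            // NOTE: buf[start..len) may hold the front of a record that
            // continues in the next read; a real version must preserve it.
        }
        in.close();
    }

    static void process(char[] buf, int off, int len) {
        // Parse the fields in place, e.g. by scanning for ',' within the range.
    }
}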

As far as splitting up the file: if that's a possibility, then it might be worth a try; some threads might be able to process data while others are waiting on I/O. You could simply use RandomAccessFile and start from N different locations within the file; figuring out which start points are valid might be tricky. As you say, NIO's asynchronous I/O capabilities are another possible option, although "let's use NIO" is not the magic speed bullet many people seem to think it is; remember that all the FileReader/FileInputStream/RandomAccessFile/etc. classes have been reimplemented on top of NIO in recent JDKs.
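
As a sketch of how those start points might be found with RandomAccessFile (untested, and the per-thread reading logic is omitted):

import java.io.IOException;
import java.io.RandomAccessFile;

public class SplitPoints {
    // Returns byte offsets at which each of n workers could start reading,
    // each aligned to the beginning of a line.
    static long[] splitPoints(String path, int n) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        long length = raf.length();
        long[] starts = new long[n];
        starts[0] = 0;                         // first worker starts at the top
        for (int i = 1; i < n; i++) {
            raf.seek(i * (length / n));        // jump to the rough boundary
            raf.readLine();                    // skip the partial line there
            starts[i] = raf.getFilePointer();  // the next line is a valid record
        }
        raf.close();
        return starts;
    }
}

Each worker would then read from its own offset up to the next worker's offset (or to the end of the file).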
 
Ranch Hand
Posts: 178

Originally posted by Ernest Friedman-Hill:
Don't read 4GB using BufferedReader.readLine(), because creating all those Strings will kill you! Instead, try to process the data without creating any objects at all, if you can.

Could you give a hint on how to achieve this?
 