
What is the best way to work with a large amount of data?

William Tan
Greenhorn

Joined: Feb 12, 2007
Posts: 7
Assume you have a big text file that you need to store and work with. What would be the best way to do that?

I will be adding / removing a lot of elements so I really want to stay away from array resizing overhead.

I don't want to step on any toes here, but would binary trees / linked lists in C++ be a better solution?
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Whether you use Java or C++ is somewhat irrelevant here. Java has binary trees and linked lists too. You can use the standard libraries, or make your own. I'd probably use whichever language you and/or your team are most comfortable with. For the remainder of this post I'll talk in terms of Java, though you could do similar things in other languages.

LinkedList would probably be my first approach, assuming that the text file is small enough (or RAM is large enough) that the whole thing can be comfortably kept in memory all at once. With LinkedList you'd usually want to access the list using an Iterator, not the List itself. Avoid any method where you specify an index. If that's not possible, then LinkedList may not be a good fit.
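For instance, a minimal sketch of iterator-based removal (the line contents and the filter condition are just placeholders):

import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

public class IteratorScan {
    public static void main(String[] args) {
        List<String> lines = new LinkedList<>();
        lines.add("keep: alpha");
        lines.add("drop: beta");
        lines.add("keep: gamma");

        // Iterator.remove() is O(1) at the current position;
        // lines.remove(int) would have to walk from the head each time.
        Iterator<String> it = lines.iterator();
        while (it.hasNext()) {
            if (it.next().startsWith("drop")) {
                it.remove();
            }
        }
        System.out.println(lines); // [keep: alpha, keep: gamma]
    }
}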

A binary tree (e.g. TreeSet) might well be a good option if the file is normally supposed to be sorted in some way, e.g. alphabetically or numerically.
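A tiny sketch, assuming the lines are plain strings and natural ordering is what you want:

import java.util.TreeSet;

public class SortedLines {
    public static void main(String[] args) {
        TreeSet<String> sorted = new TreeSet<>();
        sorted.add("banana");
        sorted.add("apple");
        sorted.add("cherry");
        // A TreeSet keeps its elements in sorted order as they are inserted,
        // so iteration always visits them in order: apple, banana, cherry
        for (String s : sorted) {
            System.out.println(s);
        }
    }
}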

Either way, if the file is too big to fit in memory, one option would be to split it into smaller segments of a more appropriate size, then process the segments one at a time by reading them into a LinkedList or whatever. Or you might just use a database instead if the operations you need to perform start becoming complex.
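A rough sketch of the chunk-at-a-time approach, using a BufferedReader so that only the current segment is in memory (the file name and segment size are made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;

public class SegmentReader {
    public static void main(String[] args) throws IOException {
        final int SEGMENT = 10000; // lines per segment; tune to taste
        try (BufferedReader in = new BufferedReader(new FileReader("big.txt"))) {
            List<String> segment = new LinkedList<>();
            String line;
            while ((line = in.readLine()) != null) {
                segment.add(line);
                if (segment.size() == SEGMENT) {
                    process(segment);
                    segment.clear(); // release the chunk before reading the next one
                }
            }
            if (!segment.isEmpty()) {
                process(segment);
            }
        }
    }

    static void process(List<String> segment) {
        // placeholder for whatever work each segment needs
        System.out.println("processed " + segment.size() + " lines");
    }
}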


"I'm not back." - Bill Harding, Twister
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12789
IF (big if) your text is plain ASCII, you might be able to process it as bytes, thus avoiding the time- and memory-consuming conversion to characters.
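For example, a sketch that counts lines by scanning raw bytes and never converting to characters (plain ASCII assumed; the file name is hypothetical):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ByteScan {
    public static void main(String[] args) throws IOException {
        long lineCount = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream("big.txt"))) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') { // safe because ASCII byte values match char codes
                        lineCount++;
                    }
                }
            }
        }
        System.out.println(lineCount + " lines");
    }
}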

There is one very important question to ask when thinking about the process:
can it be done in one straight-through pass, looking at only a small chunk of text at any one time, or do the later contents affect the treatment of the early contents?

Bill
William Tan
Greenhorn

Joined: Feb 12, 2007
Posts: 7
The file will be a 1M+ line ASCII text file of instructions. Each line would need to be tokenized (it's basically CSV) and then executed sequentially.
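A minimal sketch of that per-line tokenizing step, assuming simple comma separation with no quoted fields (the instruction format here is invented):

public class Tokenize {
    public static void main(String[] args) {
        String line = "MOVE,12,7"; // hypothetical instruction line
        String[] tokens = line.split(",");
        String opcode = tokens[0];
        // execute(opcode, tokens) would go here; for real CSV with quoted
        // fields, a proper CSV parser is safer than split().
        System.out.println("opcode=" + opcode + ", args=" + (tokens.length - 1));
    }
}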
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
OK, sounds good. After you process one line, do you even need to remember it? It sounds like you may not need a List at all here. What sort of result do you expect to get, and what do you ultimately need to do with it? Write a new file? Put data in a database? Something else? Maybe you can simply write new lines to the output file (or insert new rows into the database) as you go, instead of first putting them in one big List and then reading the List into a file or DB. That way you would avoid needing the memory to store everything in a List at once.
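A rough sketch of that write-as-you-go shape, reading, transforming, and writing one line at a time (the file names and the transform are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class StreamThrough {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("in.txt"));
             PrintWriter out = new PrintWriter(new FileWriter("out.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // transform one line and write it immediately, so at most
                // one line is held in memory at a time
                out.println(line.toUpperCase());
            }
        }
    }
}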

Also, before you spend a lot of time trying to find the fastest, most efficient code possible, it would be a good idea to just write some code that does what you want, as simply as possible, and find out how fast it is. It may well be that it's already fast enough - and if not, you can make measurements to find out which parts, specifically, are making it slow. It's usually much better to improve performance using hard data rather than speculation.
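Even a crude timing run beats speculation; a minimal sketch (the workload is a placeholder):

public class Timing {
    public static void main(String[] args) {
        long start = System.nanoTime();
        doTheWork(); // the code you suspect is slow
        long elapsed = System.nanoTime() - start;
        System.out.println("took " + (elapsed / 1000000) + " ms");
    }

    static void doTheWork() {
        // placeholder workload
        for (int i = 0; i < 1000000; i++) {
            Math.sqrt(i);
        }
    }
}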
William Tan
Greenhorn

Joined: Feb 12, 2007
Posts: 7
All data would need to stay in memory. There will not be that many entries, but their fields will need to be modified and searched very often.
Justin Chu
Ranch Hand

Joined: Apr 19, 2002
Posts: 209
The best way to work with a massive amount of data is to use stream processing, or to process it one buffered block at a time.

However, since the entire file needs to (and can) be placed in memory, maybe it is not that large after all. It is better to focus on designing a data structure that optimizes the speed of the operations you'll need to perform on this data.

With the right data structures, array resizing overhead is just a small concern.
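For example, if the entries are searched by some key field, a HashMap keyed on that field gives constant-time lookup and in-place modification (the Entry class and its fields are invented for illustration):

import java.util.HashMap;
import java.util.Map;

public class KeyedLookup {
    static class Entry {
        String name;
        int value;
        Entry(String name, int value) { this.name = name; this.value = value; }
    }

    public static void main(String[] args) {
        Map<String, Entry> byName = new HashMap<>();
        byName.put("alpha", new Entry("alpha", 1));
        byName.put("beta", new Entry("beta", 2));

        // O(1) search and in-place modification, no array shuffling involved
        Entry e = byName.get("beta");
        e.value = 42;
        System.out.println(e.name + " = " + e.value);
    }
}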
[ March 01, 2007: Message edited by: Chu Tan ]
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12789
William Tan wrote: "There will not be that many entries, but their fields will need to be modified and searched very often."

Is this a place for a custom class, where each instance represents the content of a single line, with methods to aid the modification and searching?
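A minimal sketch of what such a class might look like, with the field names and CSV format purely invented:

public class InstructionLine {
    private String opcode;
    private int operand;

    public InstructionLine(String rawLine) {
        // parse one CSV line into typed fields (format assumed)
        String[] t = rawLine.split(",");
        this.opcode = t[0];
        this.operand = Integer.parseInt(t[1]);
    }

    // searching and modification live next to the data they touch
    public boolean matches(String op) { return opcode.equals(op); }
    public void setOperand(int v) { operand = v; }
    public String toCsv() { return opcode + "," + operand; }

    public static void main(String[] args) {
        InstructionLine line = new InstructionLine("MOVE,12");
        if (line.matches("MOVE")) {
            line.setOperand(99);
        }
        System.out.println(line.toCsv()); // MOVE,99
    }
}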

Bill
M Easter
Ranch Hand

Joined: Feb 11, 2007
Posts: 133
You may want to consider memory-mapped files in the java.nio package.

Bruce Eckel's "Thinking in Java" has examples (it's free online).

Btw, I have no affiliation with Bruce! LOL... I have been recommending his stuff ever since an early edition of Thinking in C++.
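A bare-bones sketch of mapping a file with NIO (the file name is hypothetical, and a single mapping is limited to 2 GB):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedScan {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("big.txt", "r");
             FileChannel ch = raf.getChannel()) {
            // The OS pages the file in on demand; no explicit read loop needed.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long newlines = 0;
            while (buf.hasRemaining()) {
                if (buf.get() == '\n') {
                    newlines++;
                }
            }
            System.out.println(newlines + " lines");
        }
    }
}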


M Easter
Software Composer - http://codetojoy.blogspot.com
William Tan
Greenhorn

Joined: Feb 12, 2007
Posts: 7
I've decided to go ahead with a vector of objects that is also linked by a vector of pointers in C++, but the compiler isn't being nice with vectors of non-primitive data types.

My main reason for not going with Java is my greater familiarity with C++ and pointers, and the overhead of Java's Vector.
 