

JavaRanch » Java Forums » Java » Performance

Best way to store a huge string array/list

Wei Lau
Greenhorn

Joined: Jan 13, 2010
Posts: 2
I am running a Java program that has to read in about 250 million lines of strings and hold them in memory.
Each string contains 3 characters; I did the math, and in theory the memory requirement of holding all these strings
is about 250M × 3 chars × 2 bytes = 1.5 GB.

I first attempted to read in each line sequentially as a String and put it in an ArrayList, but I was surprised
that every 10 million records consume about 450 MB of heap, so I can get nowhere near 50 million records.

I know Java Strings carry quite a bit of overhead, so I also tried to make my own "String" class (wrapped around a char[]),
but this still costs about 320 MB of heap per 10 million records.

I started to suspect my use of ArrayList, but I saw somewhere that ArrayList has very little overhead,
and if this is true, I can't see any way of reading in such a huge amount of data.

A few days ago I managed to use a *super* computer with a total of 32 GB of RAM to test the program, with a massive
java -Xmx28000m Temp.java
It was aborted at about 100 million records, and the log shows that the maximum virtual memory used was 27 GB!

I may not totally understand how Java manages memory, but I was surprised by the overhead...
So here I am, asking if someone knows a solution to:

store the above-mentioned data within 24 GB (preferably even less than 12 GB, as I don't always have access to that *super* computer).

And yes, I have to hold them all in memory at once. I know very few people do that, but I need to generate a minimal perfect hash
for all the keys (the Strings), so every single key has to be in memory to process. Oh yes, no two Strings in my case are equal.

Thanks in advance!

p.s. I attached my test code here if it helps:



[NK: Added code tags. Use Codetags while posting code.]
Andrei Matyas
Greenhorn

Joined: Apr 15, 2007
Posts: 25
Hello,

You could implement a StringWrapper class (with a constructor taking a String argument) that converts the given String into a byte[] using the "ISO-8859-1" encoding (only 256 characters, so 8 bits instead of 16 bits per char). Then you can use your wrapper in the list/map instead of Strings.
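A minimal sketch of such a wrapper, assuming the input really is Latin-1 encodable (class and method names are illustrative, not from any attached code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative wrapper: stores the characters as Latin-1 bytes,
// halving the per-character cost versus String's UTF-16 chars.
final class StringWrapper {
    private final byte[] data;

    StringWrapper(String s) {
        data = s.getBytes(StandardCharsets.ISO_8859_1);
    }

    @Override
    public String toString() {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof StringWrapper
                && Arrays.equals(data, ((StringWrapper) o).data);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(data);
    }
}
```

This halves the character storage, though each wrapper is still a separate object with its own header and reference cost.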
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12806
    
If this were my problem I would consider working entirely with bytes. If the characters are all ASCII, that is entirely feasible.

In any case I would not try to have an object for each String: get the whole monster into memory and have access methods that calculate the location of the three characters of line N. If they can be bytes, so much the better.
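That could be sketched like this, assuming fixed-width 3-byte ASCII records (the class name is hypothetical):

```java
import java.nio.charset.StandardCharsets;

// Sketch: one flat byte[] of fixed-width 3-byte records; locating
// string n is plain arithmetic, with no per-string object at all.
final class FixedWidthStore {
    private static final int WIDTH = 3;
    private final byte[] data;
    private int count;

    FixedWidthStore(int capacity) {
        data = new byte[capacity * WIDTH];
    }

    void add(String s) {               // assumes exactly 3 ASCII chars
        for (int i = 0; i < WIDTH; i++) {
            data[count * WIDTH + i] = (byte) s.charAt(i);
        }
        count++;
    }

    String get(int n) {                // the three characters of line n
        return new String(data, n * WIDTH, WIDTH, StandardCharsets.US_ASCII);
    }
}
```

At 3 bytes per record, 250 million records take 750 MB in a single array, comfortably under the 2^31 - 1 array-length limit.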

Anyway the above is a wild guess because I don't know: What has to happen to the data next?

Fascinating problem anyway, let us know what you come up with.

Bill
Andrei Matyas
Greenhorn

Joined: Apr 15, 2007
Posts: 25
Yes, indeed. If you don't need high-level methods you can forget the wrapper object (a useless reference costing 2 × 4 bytes on a 32-bit OS and 2 × 8 bytes on a 64-bit OS).
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 16145
    

As a very rough guess, I'd estimate the fixed overhead for any Java object, whether it's String or build-your-own, is going to be a minimum of 12 bytes, so the actual 3 bytes per string is paltry by comparison.

You can get rid of that by putting all the characters into one big character array, since the overhead for a single character array object is the overhead for a single object. But then the task of tracking where the individual strings are within that array becomes your job, not Java's. And if you want to support variable-length strings, there'd be additional overhead for tracking the start/end points of the strings. Plus you'd either have to size the array worst-case or manage cases where the array turns out to be too small and must be replaced by an array with more capacity.
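For the variable-length case, one common arrangement is a single big char[] plus a parallel int[] of start offsets, so that string n occupies chars[offsets[n] .. offsets[n+1]). A hypothetical sketch, with worst-case sizing and no growth handling:

```java
// Hypothetical sketch: all characters in one array, with an offset
// table instead of one object per string.
final class PackedStrings {
    private final char[] chars;
    private final int[] offsets;   // offsets[count] is the used length
    private int count;
    private int used;

    PackedStrings(int maxChars, int maxStrings) {
        chars = new char[maxChars];
        offsets = new int[maxStrings + 1];
    }

    void add(String s) {
        s.getChars(0, s.length(), chars, used);
        used += s.length();
        offsets[++count] = used;
    }

    String get(int n) {
        return new String(chars, offsets[n], offsets[n + 1] - offsets[n]);
    }
}
```

The offset table itself costs 4 bytes per string, which is the extra tracking overhead mentioned above.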

For data sets of this magnitude, it's often better to process them as serial data streams or database objects instead of trying to keep everything in RAM. Of course, not everything works well done that way, so you have to tune the solution to match the problem.


Customer surveys are for companies who didn't pay proper attention to begin with.
R van Vliet
Ranch Hand

Joined: Nov 10, 2007
Posts: 144
Every 3-character String (even a UTF-16 one) can be converted to a 64-bit signed integer that uniquely represents that string (in most practical cases, such as alphanumeric strings, this can be done with a 32-bit int as well). That reduces your memory consumption to 8 bytes × 250 million = ~2 GB as a single long[] (or half that as an int[]). An added bonus is extremely fast searches (sort + binary search) on one of the most performance-efficient data types. Personally I would not use byte arrays for this specific problem unless you need to manipulate or read individual characters within these strings. It's not a huge difference, but an array of per-string byte[] objects is significantly less efficient memory-wise, and a single byte[] is only preferable in specific situations like those mentioned above.
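A sketch of that encoding, packing the three 16-bit chars of a string into one long (the class name is illustrative):

```java
// Sketch: three 16-bit chars fit in 48 bits of a long, so each
// 3-char string becomes a unique, sortable primitive value.
final class ThreeCharCodec {
    static long encode(String s) {       // assumes s.length() == 3
        return ((long) s.charAt(0) << 32)
             | ((long) s.charAt(1) << 16)
             |  (long) s.charAt(2);
    }

    static String decode(long code) {
        return new String(new char[] {
            (char) (code >>> 32), (char) (code >>> 16), (char) code });
    }
}
```

After filling a long[] with these codes, Arrays.sort(codes) followed by Arrays.binarySearch(codes, key) gives fast lookups with no per-string objects at all.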

By the way, ArrayList is not your problem here (since you use the ArrayList(int capacity) constructor it never has to be resized, so it basically takes about 2 GB from the start on 64-bit systems, with a relatively tiny bit of overhead). Your issue is that each char[] is actually an Object instance which in and of itself requires at least 16 bytes (12, but rounded up to the nearest multiple of 8), and I vaguely remember the footprint for array types being slightly larger still in some VMs. Regardless of the solution you go for, you can safely assume that you must avoid creating an Object instance for every single string. That in turn means you are required to use a single array, or a limited number of arrays, to represent your data.

As others have mentioned it really depends on what you have to do with the data. There are very few scenarios where keeping this large a data set in memory is an absolute requirement on functional or performance grounds. Interesting stuff though, can you give us some insight into what exactly this is for?

Wouter Oet
Saloon Keeper

Joined: Oct 25, 2008
Posts: 2700

Just a thought:

2^16 * 3 is about 200.000 combination. 200.000 * 3 * 16 / 8 is 1.2 MB
Why the 200K? Because the String pool should cache these Strings.

Or am I completely wrong? Of course there is an added cost of 250M * cost of a reference.


"Any fool can write code that a computer can understand. Good programmers write code that humans can understand." --- Martin Fowler
Please correct my English.
R van Vliet
Ranch Hand

Joined: Nov 10, 2007
Posts: 144
Wouter Oet wrote:Just a thought:

2^16 * 3 is about 200.000 combination. 200.000 * 3 * 16 / 8 is 1.2 MB
Why the 200K? Because the String pool should cache these Strings.

Or am I completely wrong? Of course there is an added cost of 250M * cost of a reference.


It's not 2^16 * 3 but (2^16)^3 to get all possible combinations. So yes, I'm afraid you're a bit wrong ;)

Either way, it's pointless to store the strings themselves, as I explained before. Simply generate a unique id for every possible combination and generate the actual strings on the fly when you need them. This saves both having to create object instances on the heap and storing references to those instances, which are at least 8 bytes in size.

Wouter Oet
Saloon Keeper

Joined: Oct 25, 2008
Posts: 2700

You're right. I already agreed with your solution but it was, just like I said, just a thought.

It's also relevant what the range is. Let's say the range is a-z, A-Z, 0-9 and some other signs;
then the number of possibilities drops to about 1M, and you could store the unique id in an int.
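A sketch of that idea for the a-z A-Z 0-9 case, treating each string as a base-62 number (names are illustrative):

```java
// Sketch: with a known alphabet, each 3-char string maps to a small
// int id (62^3 = 238,328 ids for a-z A-Z 0-9), decodable on the fly.
final class AlphabetCodec {
    static final String ALPHABET =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    static int encode(String s) {        // assumes all chars in ALPHABET
        int id = 0;
        for (int i = 0; i < s.length(); i++) {
            id = id * ALPHABET.length() + ALPHABET.indexOf(s.charAt(i));
        }
        return id;
    }

    static String decode(int id, int len) {
        char[] out = new char[len];
        for (int i = len - 1; i >= 0; i--) {
            out[i] = ALPHABET.charAt(id % ALPHABET.length());
            id /= ALPHABET.length();
        }
        return new String(out);
    }
}
```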
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12806
    
We still are missing the most important information - what has to happen to the data when it is in memory??

Is it going to be used for some sort of look-up, sorted, or what?

The answer to that question will indicate the storage approach.

SO - Wei Lau - why do you have to have this in memory?

Bill
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4659
    

William Brogden wrote:SO - Wei Lau - why do you have to have this in memory?

Bill is on to something critical here.

The obvious optimization is to not store all the strings in memory. Use a structure such as a skip list to keep pointers for fast access in memory, and use the disk drive to actually hold the data. Use the data structure to get close to where you expect the data, read a few blocks in, and search those. Even a sequential search will be fast, and more sensible than having all this data in memory.

This is also a case where a good working set cache will do wonders. In all real world cases, the access is clustered around some smaller working set than the whole dictionary.

Another idea is to not store the characters even as bytes, if they belong to a restricted alphabet. Seven bits per character works great for US-ASCII; six bits per character works well for upper-case letters and all the usual punctuation, numbers, etc. Radix-50 works for many character sets.
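A sketch of the six-bits-per-character idea, using the 64-symbol Base64 alphabet purely for illustration (any 64-symbol set would do):

```java
// Sketch: with a 64-symbol alphabet each character needs only 6 bits,
// so a 3-char string fits in 18 bits of an int.
final class SixBitPack {
    static final String SYMBOLS =      // illustrative 64-symbol alphabet
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static int pack(String s) {        // assumes 3 chars, all in SYMBOLS
        return (SYMBOLS.indexOf(s.charAt(0)) << 12)
             | (SYMBOLS.indexOf(s.charAt(1)) << 6)
             |  SYMBOLS.indexOf(s.charAt(2));
    }

    static String unpack(int p) {
        return new String(new char[] {
            SYMBOLS.charAt((p >> 12) & 63),
            SYMBOLS.charAt((p >> 6) & 63),
            SYMBOLS.charAt(p & 63) });
    }
}
```

Packed this way, 250 million 3-char strings need under 600 MB even if each 18-bit code is stored in its own int.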
 