aspose file tools
The moose likes Performance and the fly likes What would be the best datastructure to hold huge data Big Moose Saloon
  Search | Java FAQ | Recent Topics
Register / Login


JavaRanch » Java Forums » Java » Performance
Reply Bookmark "What would be the best datastructure to hold huge data" Watch "What would be the best datastructure to hold huge data" New topic
Author

What would be the best datastructure to hold huge data

Mohanraj Naidu
Greenhorn

Joined: Jun 29, 2003
Posts: 10
Hi All,
I want to design a datastructure in Java to hold lot of integer values. What would be the best datastructure for this purpose?
My requirement is:
I want to accumulate the primary keys of all the tables in some sort of map arranged by table name. There are around 600-700 tables and each tabble has goot huge number of records (30000-40000). I want to use some thing simillar to HashMap. But I doubt if it'll give me out of memory problem.
Can HashMap hold this amount of data? Is there any way to find out the limit? is there any better datastructure u can suggest?
Thanks in advance.
Mohan
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
What do you want to do with that "map"?


The soul is dyed the color of its thoughts. Think only on those things that are in line with your principles and can bear the light of day. The content of your character is your choice. Day by day, what you do is who you become. Your integrity is your destiny - it is the light that guides your way. - Heraclitus
Marc Haberkorn
Greenhorn

Joined: Aug 15, 2002
Posts: 10
Well, whether or not you run out of memory has to do with the size of your Java heap, so the answer to this question is dependent upon your runtime environment.
To answer the question about a HashMap holding a large amount of information...you can specify the load ratio when you initialize the object. This limits the amount of unused space in the underlying hash table. Of course, restricting this unused space causes a performance hit, as you will have more collisions in the hash table, but depending on what you're doing, that may or may not be an issue.
Jeroen Wenting
Ranch Hand

Joined: Oct 12, 2000
Posts: 5093
Given the amount of data I'd be more concerned about lookup times than storage space (at least if the system it is intended to run on has ample RAM).
To increase performance on that it might be a better idea to store the keys for each table in their own Map and do lookups on that, or to provide a staged approach in which you break up the keys into smaller subsets and first determine in some way in which of these subsets you should do a lookup (maybe store a key to these subsets in its own Map for that).


42
John Smith
Ranch Hand

Joined: Oct 08, 2001
Posts: 2937
To increase performance on that it might be a better idea to store the keys for each table in their own Map and do lookups on that, or to provide a staged approach in which you break up the keys into smaller subsets and first determine in some way in which of these subsets you should do a lookup
The lookup operation is O(1) on the map, i.e. constant time no matter how large the collection is, assuming that the key objects have a decent hashing algorithm.
Ouaknin lionel
Greenhorn

Joined: Sep 12, 2003
Posts: 14
i will advise you to create your own data structure. We will discuss it further along but first let me say why you should use your own Structures:
- HashMap and HashTable can't store mere int, they have to store an Integer. And this is bad for performance because you have to do:
HashTable hash;
hash.put(new Integer(0));
and
int result=((Integer) hash.get(key)).intValue();
and this is bad for performance and memory footprint.
What i advise you to do is to create specific hashtable for int using underlying plain array. To know how to do that, take a closer look at the source code of Hashtable.


"Nobody will ever need more than 256 kb of ram" -Bill Gates
Ouaknin lionel
Greenhorn

Joined: Sep 12, 2003
Posts: 14
If you do your own data structure right, there's no way to do something faster. Take a look at hashtable. You might take a look at this Bestseller two: Java Performance Tuning. JacK Shirazi.
Amazing book....
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24081
    
  15

Hi "HWBBH,"
Welcome to Java Ranch! We don't have many rules around these parts, but we do have a naming policy. Please check it out, then change your display name to something with a few more vowels. Thanks, and see you around the Ranch!


[Jess in Action][AskingGoodQuestions]
John Smith
Ranch Hand

Joined: Oct 08, 2001
Posts: 2937
I want to accumulate the primary keys of all the tables in some sort of map arranged by table name. There are around 600-700 tables and each tabble has goot huge number of records (30000-40000). I want to use some thing simillar to HashMap. But I doubt if it'll give me out of memory problem.
Ok, let's see the memory utilization if we used the Map. Actually, in the implementation bellow, I used HashSets to store the info (one HashSet per table), and the HashMap to store all of the HashSets. So, it's a HashMap of HashSets. Also, I specified only 400 records per table (as opposed to 40000 records per table). Another assumption that I made was that the primary key would be a String, and the size of the key is 5 characters long. I ran this test on my Winows98 machine, JDK1.4.2, with the following options -Xms128M -Xmx128M (to prevent the heap from resizing).
Here is the complete code:

Output:
total memory used = 22 Mb
Since the number of entries in all the HashSets was 700 * 400 = 280,000 and the original post was about storing 700 * 40000 = 28,000,000 entries, the expected memory usage would be around 22Mb * 100 ~ 2.2Gb
Time to order some memory chips!
Ouaknin lionel
Greenhorn

Joined: Sep 12, 2003
Posts: 14
sorry for the wacky alias. this is a real name.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18670

Hah, you too! People sometimes look at me strangely when I put repeated GC's in front of a measurement like this, but the fact is that I really have observed different results depending on whether I GC 1, 2, or even 3 times. Never obeserved any difference past 3, so far, but it wouldn't surprise me. "When control returns from the method call, the Java Virtual Machine has made a best effort to reclaim space from all discarded objects." Yeah, right. :roll:
Yeah, I know that elsewhere it's clear that there aren't any guarantees for GC, but my position is that if you can do better by repeating something, then what you did the first time wasn't really your best effort.


"I'm not back." - Bill Harding, Twister
John Smith
Ranch Hand

Joined: Oct 08, 2001
Posts: 2937
Hah, you too! People sometimes look at me strangely when I put repeated GC's in front of a measurement like this, but the fact is that I really have observed different results depending on whether I GC 1, 2, or even 3 times. Never obeserved any difference past 3, so far, but it wouldn't surprise me.
I've seen GC kicking in as much as 5 times before it reclaimed everything. Generational garbage collectors became too smart.
The best way you can observe it is to start the server process with the "-verbosegc" option and invoke System.gc() in a loop, periodically. What you would normally see, depending on the JVM version and the startup parameters, is that as you push GC to do its job, over, and over, it takes more and more time, for each invocation in the loop. Furthermore, the initial invocations will typically release only a small portion of the unreferenced objects, for performance reasons. On the console, you will see something like [GC: blah-blah]. But if you keep pushing, the full garbage collection kicks in, as indicated by the [Full GC: blah-blah] in the console. This full GC may take 3 to 5 seconds or even longer.
Kirk Pepperdine
Author
Ranch Hand

Joined: Jan 17, 2001
Posts: 71
Originally posted by Eugene Kononov:
Hah, you too! People sometimes look at me strangely when I put repeated GC's in front of a measurement like this, but the fact is that I really have observed different results depending on whether I GC 1, 2, or even 3 times. Never obeserved any difference past 3, so far, but it wouldn't surprise me.
I've seen GC kicking in as much as 5 times before it reclaimed everything. Generational garbage collectors became too smart.
This full GC may take 3 to 5 seconds or even longer.

I always add a sleep after the GC. As you said, a full GC may take quite some time to complete and you do want GC to be completed before taking any timings. In some most cases, I've found this to be very effective, more effective than calling gc multiple times.


Author of <a href="http://www.amazon.com/exec/obidos/ASIN/0672324261/ref=jranch-20" target="_blank" rel="nofollow">Ant Developer's Handbook</a>
Frank Carver
Sheriff

Joined: Jan 07, 1999
Posts: 6919
want to design a datastructure in Java to hold lot of integer values. What would be the best datastructure for this purpose?
...
I want to accumulate the primary keys of all the tables in some sort of map arranged by table name. There are around 600-700 tables and each tabble has goot huge number of records (30000-40000).
...
is there any better datastructure u can suggest?

Is there already any pattern to which keys have been allocated? Are there a lot of "runs" of keys (e.g. 1234,1235,1236,1237,1238,...) in each table?
I can think of several ways to do this with much reduced memory usage, but the most important question is "What do you plan to do with these keys once you have them?"
Just loading a bunch of data into memory serves no useful purpose. For it to be a worthwhile effort you need to use the loaded data somehow. My guess is that you might plan to use it when you generate new primary keys to ensure that they are not already used, but you may have an entirely different use in mind.
If you just want to be able to generate a new unique key for a given table, all you really need to do is read all the keys of the table, and keep track of the largest one as you go.

Then, when you need a new key, just increment max[] for that table and use the new top value. Total memory use, one integer for each table !
If you really need to reuse "gaps" in the key sequence, then you might consider "run length encoding": For each table, store an array of (start,length) pairs. Updating this array as you read in each key is more complex than the above, but not impossible:
  • if the key is both at the start of one run and at the end of another, merge the two runs (use the lowest start, and add the two lengths + 1)
  • if the key is at the start or end of an existing run, add the key by decreasing the start or increasing the length of the run
  • if the key is not adjacent to any runs, create a new run of length 1


  • The bottom line for all of these solutions is that the need you describe sounds much more like a Set than a Map. Each key doesn't actually connect with another object, it just exists (or not) in the collection.
    For good examples in "thinking outside the box" like this, I recommend the "Programming Pearls" books and articles by Jon Bentley.
    Did any of this help, or have I misssed the point?
    [ September 28, 2003: Message edited by: Frank Carver ]

    Read about me at frankcarver.me ~ Raspberry Alpha Omega ~ Frank's Punchbarrel Blog
     
    I agree. Here's the link: http://aspose.com/file-tools
     
    subject: What would be the best datastructure to hold huge data
     
    Similar Threads
    Maximum number of elements in a Treemap...
    array with elements of different types?
    parse large xmlfile and display using swing
    Datastructure using Java
    Files with data range