
Problem with handling huge data structures

Satya Maheshwari
Ranch Hand

Joined: Jan 01, 2007
Posts: 368
Hi

I have a requirement wherein my Java program has to deal with huge data structures that sometimes exceed the memory available to the Java process. I have set the heap size to the maximum the OS permits, but it is still not enough.
To solve this, I am thinking along the lines of having the data structure overflow to the file system. But considering the time we have to get this done, is there an already available solution that tackles such scenarios? Something like a disk-based cache? Please help.

Regards
Kiaamaa Liammes
Ranch Hand

Joined: Oct 03, 2009
Posts: 32

Consider using an in-memory database; I remember HSQLDB being used on one of my projects.
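A minimal sketch of that idea, assuming the HSQLDB jar is on the classpath (the table name, column sizes, and file name here are made up for illustration). A CACHED table keeps its rows on disk rather than in the heap, and a jdbc:hsqldb:file: URL makes the database persistent; jdbc:hsqldb:mem: would keep everything in memory instead:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HsqlLookup {
    public static void main(String[] args) throws Exception {
        Class.forName("org.hsqldb.jdbcDriver"); // needed on older JDBC versions

        // file: URL = disk-backed database that can grow past the heap
        Connection con = DriverManager.getConnection(
                "jdbc:hsqldb:file:replacements", "SA", "");

        // CACHED tables store their rows on disk, not in memory
        con.createStatement().execute(
                "CREATE CACHED TABLE mapping (k VARCHAR(64) PRIMARY KEY, v VARCHAR(64))");

        PreparedStatement put = con.prepareStatement("INSERT INTO mapping VALUES (?, ?)");
        put.setString(1, "12345");
        put.setString(2, "67890");
        put.executeUpdate();

        PreparedStatement get = con.prepareStatement("SELECT v FROM mapping WHERE k = ?");
        get.setString(1, "12345");
        ResultSet rs = get.executeQuery();
        if (rs.next()) {
            System.out.println(rs.getString(1)); // prints 67890
        }

        // SHUTDOWN flushes the data files before the JVM exits
        con.createStatement().execute("SHUTDOWN");
    }
}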


SCJP, SCWCD
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12769
A lot would depend on exactly how your program uses these "huge data structures": random access or sequential access? Does your program have to modify and save these structures?

Explain in more detail please if you want better advice.

Are we talking about Java objects here? If so, can you create a more compact representation based on the real characteristics of the data items?

Are you on a network where some work can be farmed out to another machine?

Bill
Satya Maheshwari
Ranch Hand

Joined: Jan 01, 2007
Posts: 368
William Brogden wrote:A lot would depend on exactly how your program uses these "huge data structures": random access or sequential access? Does your program have to modify and save these structures?

Explain in more detail please if you want better advice.

Are we talking about Java objects here? If so, can you create a more compact representation based on the real characteristics of the data items?

Are you on a network where some work can be farmed out to another machine?

Bill


OK, let me explain the problem more specifically.
Basically I have a huge hash map with string-to-string mappings. This hash map is used for making string-to-string replacements in several files, so it is random access. Once this map is built up, it is only read, not modified.

Yes, putting this on another machine on the network is an option (similar to putting it in a file), though it may reduce performance. Is using a LinkedHashMap as an LRU cache a good idea?
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12769
Obviously the very best performance would be an all-in-memory HashMap.

Assuming that is impossible, let's consider some more questions.

Can lookups be batched? If you are going to have to go to some sort of file or network lookup, the bigger the batch you can create, the better.

More questions:

Do these strings have to be Unicode, or are the characters all in the ASCII set?

Do you have to check every single word for a possible substitution?

What is the likely frequency of having to replace a string? It may make a big difference if there are only a few replacements per document versus wholesale replacement.

I can't see how a linked hash map would help, but a cache is a good idea, since words tend to repeat in documents. Note that I am assuming typical text documents, because that's the kind of thing I used to work with.

Bill

David O'Meara
Rancher

Joined: Mar 06, 2001
Posts: 13459

Are you using substring on the file contents? That method can cause memory leaks: you think you're keeping only a part of the file, but because substring shares the original String's backing array, you retain the whole file in memory. Maybe not, but then again maybe you don't actually need as much memory as you think.
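A tiny sketch of the effect described here, for JDKs of this era (before Java 7u6, String.substring shares the original String's backing char[] array); the file name and the naive reader are just illustrative:

public class SubstringLeakDemo {
    public static void main(String[] args) throws Exception {
        String wholeFile = readFile("big.xml"); // the whole file as one String

        // This 10-character substring still references the whole file's
        // backing char[] array, keeping all of it reachable as long as
        // 'leaky' is alive.
        String leaky = wholeFile.substring(0, 10);

        // Copying the characters into a fresh String drops that reference,
        // so the large char[] can be garbage collected.
        String compact = new String(wholeFile.substring(0, 10));

        System.out.println(leaky + " / " + compact);
    }

    // Naive helper that reads a whole file into one String.
    private static String readFile(String path) throws Exception {
        java.io.BufferedReader in =
                new java.io.BufferedReader(new java.io.FileReader(path));
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            sb.append((char) c);
        }
        in.close();
        return sb.toString();
    }
}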
Satya Maheshwari
Ranch Hand

Joined: Jan 01, 2007
Posts: 368
William Brogden wrote:Obviously the very best performance would be an all-in-memory HashMap.

Assuming that is impossible, let's consider some more questions.

Can lookups be batched? If you are going to have to go to some sort of file or network lookup, the bigger the batch you can create, the better.


Yes, I think batching them would save on the network lookups. Actually I have been thinking along parallel lines about storing this in the file system, as you suggested for the network, and then maintaining an LRU cache that keeps the most recently used entries in memory in a LinkedHashMap; a LinkedHashMap because it has inherent support for acting as an LRU cache.
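For reference, the usual LinkedHashMap idiom looks something like this (a minimal sketch; the initial capacity and load factor are just the defaults). Passing true for accessOrder makes get() move an entry to the tail, and overriding removeEldestEntry evicts the least recently used entry once the map outgrows the limit:

import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = order entries by access, not insertion
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry when over capacity
        return size() > maxEntries;
    }
}

On a cache miss you would fall back to the file or network copy of the map and put the result into the cache.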

William Brogden wrote:
More questions:

Do these strings have to be Unicode, or are the characters all in the ASCII set?

ASCII. In fact, all these strings are numbers.

William Brogden wrote:
Do you have to check every single word for a possible substitution?

I think so. I do pattern matching to find a number and then look it up in the map. Is there any other way?
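The pattern-matching part would typically look something like this with java.util.regex: find each run of digits and use appendReplacement/appendTail to splice in the mapped value. A sketch only; \d+ is an assumed stand-in for whatever number pattern is actually in use:

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberReplacer {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Replaces every number that has a mapping; leaves the rest unchanged.
    public static String replaceAll(String text, Map<String, String> mapping) {
        Matcher m = NUMBER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = mapping.get(m.group());
            // quoteReplacement guards against '$' and '\' in the value
            m.appendReplacement(out, Matcher.quoteReplacement(
                    replacement != null ? replacement : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }
}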

William Brogden wrote:
What is the likely frequency of having to replace a string? It may make a big difference if there are only a few replacements per document versus wholesale replacement.

The frequency varies, but I can predict it beforehand, as it depends on the type of document.

William Brogden wrote:
I can't see how a linked hash map would help, but a cache is a good idea, since words tend to repeat in documents. Note that I am assuming typical text documents, because that's the kind of thing I used to work with.


A LinkedHashMap because it can easily be extended into an LRU cache. These are not typical text documents, but mostly XMLs and a few FLVs.


Satya Maheshwari
Ranch Hand

Joined: Jan 01, 2007
Posts: 368
David O'Meara wrote:Are you using substring on the file contents? That method can cause memory leaks: you think you're keeping only a part of the file, but because substring shares the original String's backing array, you retain the whole file in memory. Maybe not, but then again maybe you don't actually need as much memory as you think.


No, I am not using substring on the file contents. But yes, now that you mention memory leaks due to the way Strings work, I am thinking of reviewing the code to find any such culprits.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12769
With all-ASCII replacement text you could save considerable memory in a HashMap by storing the replacement text as byte[] rather than String, since String uses 16-bit Unicode characters.

If the input numbers and replacement numbers fit in a Java Integer or Long, you could save even more and maybe keep the whole HashMap in memory. So: what is the range of these numbers?
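A rough sketch of that suggestion, assuming every key and value is a decimal string (the class and method names are just illustrative). Boxed Long keys plus ASCII byte[] values cost far less per entry than two UTF-16 Strings:

import java.io.UnsupportedEncodingException;
import java.util.HashMap;
import java.util.Map;

public class CompactMapping {
    // Keys become boxed Longs; values become ASCII byte arrays,
    // roughly halving the per-character cost of UTF-16 Strings.
    private final Map<Long, byte[]> map = new HashMap<Long, byte[]>();

    public void put(String key, String value) throws UnsupportedEncodingException {
        map.put(Long.parseLong(key), value.getBytes("US-ASCII"));
    }

    public String get(String key) throws UnsupportedEncodingException {
        byte[] v = map.get(Long.parseLong(key));
        return v == null ? null : new String(v, "US-ASCII");
    }
}

If the values fit in a long as well, a Map<Long, Long> would be smaller still.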

Bill

Satya Maheshwari
Ranch Hand

Joined: Jan 01, 2007
Posts: 368
William Brogden wrote:With all-ASCII replacement text you could save considerable memory in a HashMap by storing the replacement text as byte[] rather than String, since String uses 16-bit Unicode characters.

If the input numbers and replacement numbers fit in a Java Integer or Long, you could save even more and maybe keep the whole HashMap in memory. So: what is the range of these numbers?

Bill



Thanks for the information. The numbers would fit in a Long, though not in an Integer.
 