
Possible UDT for an Integer type

David Galea
Greenhorn

Joined: May 21, 2013
Posts: 6
Hi guys,

I'm working on a project that requires a very large double array (say A), n x n where n ~ 100,000. The values in A represent integers with an absolute maximum of 1000, i.e. max(A[i][j]) = 1000. Consequently I am currently using the short type, i.e.

short[][] A = new short[n][m];

which works comfortably for n, m up to 30,000.

I was wondering whether it is possible to create another integer type with self-defined bounds, or whether anyone knows of a person/organisation/etc. that has implemented such types?

Thanks in Advance,

David
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38903
    
There is no such thing as a double array. What you are describing is an array of arrays.
Of course you can create an IntUpTo1000 class. But it will occupy more memory than using short[][]s.
Your 100,000² array of shorts is about 20GB, so you are going to have RAM problems; can you get a computer with enough RAM to hold 20GB?
You would do well to create a Row class, and put it into a database. Databases are good at handling that amount of data. You may have performance problems if you have to fetch rows frequently, however.
Piet Souris
Ranch Hand

Joined: Mar 08, 2009
Posts: 552
    
Another consideration is: do you always fill up the complete array?
If you only use a tiny fraction of the 100,000² possible locations, then you might
want to use something other than an array. A HashMap may be a possibility.

For some interesting reading, look here:
http://en.wikipedia.org/wiki/Sparse_matrix#Storing_a_sparse_matrix
David Galea
Greenhorn

Joined: May 21, 2013
Posts: 6
Hey guys,

Thanks for your responses (my apologies for the "double array"; I did mean an array of doubles).

The computer I'm working with has 16GB of RAM, with Ubuntu 11.00 64-bit as my OS, and I'm using Eclipse as the dev environment, so it seems that I won't be able to run the full 100,000 x 100,000 array. The density of the array is generally 25%, so I'm looking at a sparse matrix representation using a HashMap, as follows:

HashMap<Integer, HashMap<Integer, Integer>>

And see how I go

Thanks again for your help,

Cheers,

David
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    

There will be higher per-entry overhead in a Map than in an array, so even at only 25% occupancy, a simple HashMap might not be any smaller than an array. Almost certainly HashMap<Integer, HashMap<Integer, Integer>> won't be.

And I thought you said it would be an array of doubles, but now you've got Integers.

If you're going to use a Map to represent a matrix, and you're trying to save memory, your first step would be to make it a "1D" hashmap. Don't nest the maps. Let the key encode both the row and the column. For instance, you could let keys 0-99,999 indicate row 0, 100,000-199,999 indicate row 1, etc. Or you could use a long key, where the high-order int is the row and the low-order int is the column.

Because of the higher per-entry overhead with a Map, that still might not be enough. So you might have to look into something like run-length encoding to reduce the overhead even further.

Or you could try using a database, as suggested, or maybe even a flat file.

Without knowing more about what your access patterns will be, it's impossible to evaluate the relative merits of different approaches.
Piet Souris
Ranch Hand

Joined: Mar 08, 2009
Posts: 552
    
Intriguing problem...

Yes, 25% density would mean 2.5 billion occupied locations, or 10GB of memory, let alone all the pointers for the Integer objects in the HashMap.

One way to reduce some space is to use the fact that the values are 1000 max. Assuming no
negative values, we need only 10 bits to store a value. Storing 100,000 values in a row
would therefore mean 1,000,000 bits, or 125,000 bytes.
Since 1023 is not used, we could use this number to indicate 'not yet used'.

A data type that works directly with bits is BitSet. So, we might use a BitSet of 1,000,000 bits for each of our
rows. Assuming that on average we only use 50,000 locations per row, that would mean we're wasting
half of this capacity, but I could not come up with a solution for that. Of course, BitSets are not
meant to be used for calculation purposes, but it's easy enough to use them this way.

On average, we'd have 50,000 rows' worth of data, so that would still give us 6.25GB of total memory. Too much for
an average PC, I'm afraid. So, what we can do is use some old tricks from the time we used floppies.
In those days, whenever we had fixed-length records, we could open the file and play around
with the file pointer; the Basic I used had syntax to seek straight to a given record.
I just looked it up in the API, and we have a class called FileChannel that seems to be
able to do just this. It would mean that we only need to reserve memory for one
1,000,000-bit BitSet, reading the required record for a given row index, and have free space
on the HD of at least 13GB.

Well, this is what I could come up with. It does not require expensive databases or exceptional
tricks. However, I give no guarantees that it really DOES work, nor what performance
it would have. The biggest BitSet I've ever used was only 11 bits long.




Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7838
    

David Galea wrote:I was wondering whether it is possible to create another integer type with self-defined bounds, or whether anyone knows of a person/organisation/etc. that has implemented such types?

Well, I'm pretty sure that Google have done a lot of work on huge "arrays".

However, like the others, I reckon that a database is your best option for storing the actual data, since handling huge amounts of it is precisely what databases were designed for.

A possible wrinkle you could try, though, would be a cache, implemented as a LinkedHashMap.

For example, you could easily set up a LinkedHashMap<Long, Short>, with a key set up as Jeff described, that holds, say, 25 million entries (or some power of two, multiplied by 0.75), and use it as a "most recently used" cache buffer. If there is any pattern to your access, you may find a size at which you get lots of "hits" from such a cache, which saves you the business of having to call the database, which will be a LOT slower than retrieving from an LHM.

Good luck, and good hunting.

Winston

Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
 