JavaRanch » Java Forums » Java » Java in General
Author

Files vs Arrays?

Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
Hi.

I have written a program which takes in data from text files that can be up to 5 MB. At the moment, the program is slow and I am getting heap-memory errors, and I think it's because I'm storing a lot of the data in arrays. I was wondering, would it be better to store data in files that can then be read, rather than storing everything in arrays, which means it all has to be kept in memory at the same time?

Many thanks


Moosey knows best
Michael Morris
Ranch Hand

Joined: Jan 30, 2002
Posts: 3451
...would it be better to store data in files that can then be read rather than storing everything in arrays...


I doubt the arrays are what's causing the problem. My guess is you have an inefficient algorithm that is retrieving data from the arrays. File I/O for the most part is going to be much slower than anything stored on the heap and it seems somewhat strange to read the data in from a file, store it into another file and retrieve it again (if that's what you are implying).

How much heap are you using? Are the arrays multi-dimensional?

If your problem is an inefficient algorithm, you may want to consider using Maps instead of arrays, but make sure that the hashes for your keys are efficient.


Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius - and a lot of courage - to move in the opposite direction. - Ernst F. Schumacher
Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
Yes, the arrays become multidimensional. Using one of the input files, the arrays would be 20000 by 20000.

Having put System.out.println("Test") statements in my code I find that the program crashes when declaring a 2D array of that size...

How do I work out what heap I am using?
[ May 27, 2007: Message edited by: Sam Bluesman ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Um, yes. A byte[][] of that size would take up 400 MB or so, and other data types would take up more memory. So yes, that's likely to give you memory problems, and a file structure could well be better for you. Or maybe there's another way to handle your data, such as with sparse arrays (since there were only 5 MB of data initially). What can you tell us about the data in the file? What does it represent?


"I'm not back." - Bill Harding, Twister
Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
Hi there. Basically, the file has the following format:

Object1 Object2 Stat

and so would have contents similar to:

AAAA BAAA 0.1
AAAA CAAA 0.2
AAAA DAAA 0.3
BAAA CAAA 0.4
BAAA DAAA 0.5
CAAA DAAA 0.6

...except imagine thousands of text lines like this

This is then put into a 2d array as follows


...except this would obviously be bigger than the data given above

BTW, what are sparse arrays?
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
So, the file lines you're showing there look like they take about 14-15 bytes each, assuming 1 byte per character. More if the decimal values are longer. So a 5 Mb file would have a little under 40,000 lines, at most. A 20,000 x 20,000 grid has 400,000,000 positions in it. So about 99.99% of the grid is filled with zeroes, the default value. Is that right? A conventional 2-D array (or array of arrays, in Java) is taking up a large amount of memory that isn't really being used for anything.

This is exactly the sort of problem sparse arrays are designed for. Basically, they're alternate storage structures designed to handle arrays (or matrices) in which the majority of the elements are zero (or some other default value). However there is no one implementation strategy for these; there are several. You could spend some time googling "sparse arrays" or maybe "java sparse arrays" for more info and examples. You may also want to replace "arrays" with "matrices".

One relatively simple way to implement a sparse array in Java is to create a Map<Point,Double>. Each set of x-y coordinates with a nonzero value would get its own Point object, and then the value would be put in a Double object, and the pair would be put in the Map. That may sound inefficient, but it's way less inefficient than what you have now. At least in terms of memory. Whether or not it works well for you is another question, which may depend on what sort of processing you need to do. For some types of processing, you may benefit from having a TreeMap or NavigableMap rather than HashMap. But that's hard to say without a lot more analysis. I think a HashMap<Point, Double> would not be too difficult to make. You could try it and see how well it works, then analyze further if improvements are necessary.
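A minimal sketch of that idea, using java.awt.Point keys as described (class and method names here are illustrative, not from the thread):

```java
import java.awt.Point;
import java.util.HashMap;
import java.util.Map;

// Sparse 2-D matrix of doubles: only nonzero cells are stored.
public class SparseMatrix {
    private final Map<Point, Double> cells = new HashMap<Point, Double>();

    public void set(int row, int col, double value) {
        if (value != 0.0) {
            cells.put(new Point(row, col), Double.valueOf(value));
        }
    }

    // Missing entries default to 0.0, so a 20000 x 20000 grid costs
    // memory only for the cells that were actually written.
    public double get(int row, int col) {
        Double v = cells.get(new Point(row, col));
        return (v == null) ? 0.0 : v.doubleValue();
    }

    public int nonZeroCount() {
        return cells.size();
    }
}
```

Memory now scales with the number of nonzero cells rather than with the full grid, since java.awt.Point already implements equals() and hashCode() on its coordinates.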

Or you could make one big file, read with a RandomAccessFile. Or FileChannel could be faster if you're willing to do extra work to figure out how to use it effectively here. Again, it shouldn't be too difficult to try something like this with RAF, and see how fast it is. If it's not fast enough, you may well be able to improve performance by reading a chunk of the file into memory all at once, processing all that, and then moving on to the next chunk.

Can you tell us more about the type of processing you need to do with this data? What does the data represent? Can you consider each data element one at a time, or do the calculations involve several neighboring elements at once? Here is where there may be opportunities for clever tricks to avoid needing to have everything in memory at once. Maybe.
[ May 27, 2007: Message edited by: Jim Yingst ]
Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
Thanks for the advice. I've just learnt (from my superiors) that the format of the file will NOT be as stated above!

It will be:

AAAA BAAA DataValue DataValue DataValue DataValue 0.1
AAAA CAAA DataValue DataValue DataValue DataValue 0.2
AAAA DAAA DataValue DataValue DataValue DataValue 0.3
BAAA CAAA DataValue DataValue DataValue DataValue 0.4
BAAA DAAA DataValue DataValue DataValue DataValue 0.5
CAAA DAAA DataValue DataValue DataValue DataValue 0.6

The above matrix that needs to be produced still stands.

As a general background and purpose of this program...

The file contains object identifiers and, as well as other data values, a correlation statistic between them.

Taking the file, the data should be represented as a matrix of correlations. It's almost like having a set of places (instead of objects) and the distances between them (instead of the correlation stats), and the idea is to put them into something similar to a SQUARE road-distance chart. This MUST be placed in a text file so that it can be put into the statistical program R... or at worst an Excel worksheet, but I have absolutely no idea how to do this.

Thanks again
[ May 27, 2007: Message edited by: Sam Bluesman ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
hm. Well, so far I don't really have any idea what these values mean, or what the processing you're doing is like. You should probably just ignore my comments about sparse arrays unless you can determine that the vast majority of entries in the array are zero.

Is there a way that you can read the original file data into memory, and then generate the 20000x20000 array one row at a time? (Or even, one element at a time.) Is there any reason you need the whole thing in memory at once? If it's possible to calculate each value one at a time, I'd do that, and write the results to a file as you go.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12835
    
    5
Is there a way that you can read the original file data into memory,


That sounds like the right track to me. Your 5MB file will only take 10MB as a char[]. One pass through that will let you locate all the line starts as an int[] of pointers which will enable you to recover the contents of any line at memory speed. As Jim suggests you can then work on the problem in chunks of a reasonable size.

Bill
Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
Sorry guys... I'm not exactly sure what you're trying to say...
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
What's being said is that perhaps there's a way to process the data in smaller chunks so you don't have to read in the entire set of data into an array at once; this is a bit like encoding/decoding an audio file: you do it block by block, not altogether. Could you elaborate more on the algorithm you're using to process the data? If you can be more specific (e.g. by giving an outline of the algorithm), it might help.

Alternatively, is there any reason why your file needs to be in text format? If it was a binary structure instead, you could use a RandomAccessFile to read/write and seek directly in the file, possibly also using a RAM buffer (array) to increase intermediate calculation performance. Also, a binary file would mean you don't have to parse the textual contents into primitive numbers (which is an expensive operation), and your file sizes would also be much reduced.
[ May 28, 2007: Message edited by: Charles Lyons ]

Charles Lyons (SCJP 1.4, April 2003; SCJP 5, Dec 2006; SCWCD 1.4b, April 2004)
Author of OCEJWCD Study Companion for Oracle Exam 1Z0-899 (ISBN 0955160340 / Amazon Amazon UK )
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
We are suggesting that it may be possible to solve this without ever creating a 20000 x 20000 array. Something like this:

I have no idea if this is possible for your problem, because I have no idea how you are supposed to calculate your results. But many problems are possible to solve in a manner similar to this. If you don't need to create the 20000 x 20000 array, then don't.
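The line-at-a-time idea might be sketched like this; the per-line calculation shown (1 minus the statistic, as mentioned later in the thread) is only a stand-in for whatever the real computation turns out to be:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

// Stream the input a line at a time: parse, compute, write, discard.
// Nothing but the current line is ever held in memory.
public class LineAtATime {
    public static void process(String inFile, String outFile) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(inFile));
        BufferedWriter out = new BufferedWriter(new FileWriter(outFile));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.trim().split("\\s+");
                double stat = Double.parseDouble(cols[cols.length - 1]);
                // Placeholder per-line calculation: 1 - correlation.
                // Swap in whatever the real rule is.
                out.write(cols[0] + " " + cols[1] + " " + (1.0 - stat));
                out.newLine();
            }
        } finally {
            in.close();
            out.close();
        }
    }
}
```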
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
[Charles]: Alternatively, is there any reason why your file needs to be in text format? If it was a binary structure instead, you could use a RandomAccessFile to read/write and seek directly in the file, possibly also using a RAM buffer (array) to increase intermediate calculations.

Yes, when I talked about using RandomAccessFile in place of the 20000 x 20000 array earlier, I should have said you'd write and read the values using binary methods like writeDouble() and readDouble(), not printing the results as text. Even if you eventually need a text file, you could use a large binary file as an intermediate format.
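A disk-backed matrix along those lines might look like this (a sketch, not Jim's code; the 8-bytes-per-double offset arithmetic is the key point):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Use a RandomAccessFile as a disk-backed matrix of doubles:
// writeDouble()/readDouble() at a computed offset replace array indexing.
public class DiskMatrix {
    private final RandomAccessFile raf;
    private final int n;  // matrix width

    public DiskMatrix(File f, int n) throws IOException {
        this.raf = new RandomAccessFile(f, "rw");
        this.n = n;
        raf.setLength(8L * n * n);  // pre-size so unwritten cells read as 0.0
    }

    // Cell (i, j) in an n-wide grid lives at byte 8*(i*n + j).
    private long offset(int i, int j) { return 8L * ((long) i * n + j); }

    public void set(int i, int j, double v) throws IOException {
        raf.seek(offset(i, j));
        raf.writeDouble(v);
    }

    public double get(int i, int j) throws IOException {
        raf.seek(offset(i, j));
        return raf.readDouble();
    }

    public void close() throws IOException { raf.close(); }
}
```

The binary file never needs parsing, and it can be walked row by row at the end to emit the text table.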

Sam: it's very difficult to give good advice here without a better understanding of what the data in the input file means, and what the output file means. What do AAAA, BAAA refer to? What are the 0.1, 0.2, etc? And what in the world are the "DataValue" things?
[ May 28, 2007: Message edited by: Jim Yingst ]
Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
OK Guys. I can see that you're trying to help but don't have much to go on: So..... here goes:

Background:

Single Nucleotide Polymorphisms (SNPs) are single base variants within our DNA. For example, where some might have an 'A' at a particular location in their DNA, others may have a 'G'

The file I want to take in has a list of SNPs and the correlations between other SNPs in a region of DNA. It contains the locations of the SNPs, the population from which the data came (e.g. USA), the actual names of the SNPs, and the correlation between the two SNPs in that row; it has the following format:

I have chosen to use the locations of the two SNPs as their identifiers purely arbitrarily; of course I could have chosen the names... but there we go.

As an example, here is some REAL data, but VERY much condensed - there are actually thousands of different SNPs in each file.


I would like to be able to get this into the following form:

Note 1) The correlations have to be taken away from 1, so what would have been 0.959 would be 0.041
Note 2) The matrix above is represented neatly for your viewing. A space between each SNP location, statistic etc. would be the aim.

Here is an outline of my program:


[ May 28, 2007: Message edited by: Sam Bluesman ]
[ May 28, 2007: Message edited by: Sam Bluesman ]
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
Can I just clarify a couple of things?

(1) Only columns 1, 2 and 7 (the last) are important in this data:

(2) In this required output: the produced matrix is symmetric with 0s on the diagonal? This will be a massive optimisation, because you'll only need n(n-1)/2 pieces of data to create the matrix, whereas before you required n^2 - that's a huge saving for large n!

(3) Finally, how many lines are there in the file?
[ May 28, 2007: Message edited by: Charles Lyons ]
Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
Hi Charles.

1) Correct. 1,2 and 7 are the ONLY columns I need.

2) "The produced matrix is symmetric with 0s on the diagonal?"
Absolutely!

3) The number of lines will vary anywhere from 10,000 to 80,000. This will depend on the number of SNPs being examined.

Many thanks
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12835
    
    5
Here is what I was trying to suggest -

1. now that we know you are dealing with plain ASCII we can use a byte[] and not lose any data
2. using the File object, get the length of the file and create a matching byte[]
3. read the entire file into the byte[] buffer
4. scan for line ends, recording the matching line starts as an ArrayList of Integer objects
5. convert the ArrayList data to an int[]

Now instead of reading the file repeatedly you can grab any arbitrary line as a String, making repeated passes through the data simple. You can now calculate and format your table one line at a time.
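A sketch of those five steps, assuming plain ASCII and '\n' line ends (a real version would also want to cope with '\r\n'):

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Load the whole file once, then index line starts so any line can be
// re-read from memory without touching the disk again.
public class LineIndex {
    private final byte[] buf;
    private final int[] starts;

    public LineIndex(String filename) throws IOException {
        File f = new File(filename);
        buf = new byte[(int) f.length()];               // step 2
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
            in.readFully(buf);                          // step 3
        } finally {
            in.close();
        }
        List<Integer> offs = new ArrayList<Integer>();  // step 4
        offs.add(Integer.valueOf(0));
        for (int i = 0; i < buf.length; i++) {
            if (buf[i] == '\n' && i + 1 < buf.length) {
                offs.add(Integer.valueOf(i + 1));
            }
        }
        starts = new int[offs.size()];                  // step 5
        for (int i = 0; i < starts.length; i++) {
            starts[i] = offs.get(i).intValue();
        }
    }

    public int lineCount() { return starts.length; }

    // Grab any arbitrary line as a String, at memory speed.
    public String line(int n) {
        int from = starts[n];
        int to = (n + 1 < starts.length) ? starts[n + 1] : buf.length;
        while (to > from && (buf[to - 1] == '\n' || buf[to - 1] == '\r')) to--;
        return new String(buf, from, to - from);
    }
}
```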

Bill
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
Presumably also the columns and rows in the output will need to be sorted so the headings are in ascending numerical order?
Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
Well they come in numerical order in the file already. I don't think it would matter too much if they were not in the output matrix.
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
2. using the File object, get the length of the file and create a matching byte[]
3. read the entire file into the byte[] buffer
But most of the data is redundant, so why store it at all?
4. scan for line ends, recording the matching line starts as an ArrayList of Integer objects
More efficient to scan with BufferedReader's readLine() - it handles line-terminating characters correctly, and obtains a String we'll want to parse anyway.
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
Well I was intrigued by this problem, so for a bit of fun, here's my draft solution:
It may well be full of compilation errors and I deliberately haven't finished it to give you something to work with. In particular, I think it will only output the upper (or lower) triangle of the matrix, so you'll need to think about how to get the diagonals and other triangle out. However, I think (after bug fixing) it should handle reading and sorting entries to provide a convenient way to obtain that data. You'll have to write the output code yourself! The only storage here is a TreeMap containing (n(n-1)/2) DataPoint and Double objects.

If you don't understand something or I've done something stupid, please post back your questions/comments...
[ May 28, 2007: Message edited by: Charles Lyons ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
OK, from what you've said, the largest file you have is 80,000 lines, which would create nonzero entries at 160,000 positions in the matrix, which has 400,000,000 positions total. That's a very sparse matrix. Because of the symmetry it could be done as a triangular matrix, but that only saves about half the space. It's still huge. You can use much less.

I would use my earlier HashMap suggestion. Whenever you have a new line from the input file, take the two SNP locations and store them as a Pair:

Note the constructor makes sure that whatever order a and b are in, x is the lower number, and y is the higher number. Since the line for "21876767 21880326" also represents the correlation for "21880326 21876767", there's no need to store the latter one as well. This way we use one object to represent both.
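The original listing didn't survive in this archive, but a Pair along the lines Jim describes might look like this (a reconstruction, not his exact code):

```java
// Unordered pair of SNP locations: (a, b) and (b, a) map to the same key,
// so the symmetric matrix needs only one entry per correlation.
public class Pair {
    final int x;  // always the lower location
    final int y;  // always the higher location

    public Pair(int a, int b) {
        x = Math.min(a, b);
        y = Math.max(a, b);
    }

    public boolean equals(Object o) {
        if (!(o instanceof Pair)) return false;
        Pair p = (Pair) o;
        return x == p.x && y == p.y;
    }

    public int hashCode() {
        return 31 * x + y;  // consistent with equals, as HashMap requires
    }
}
```

The equals()/hashCode() pair is what lets HashMap treat both orderings as the same key.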

Then just take each Pair and put it in a HashMap with the value:

Then you can just print out your table directly, without ever putting everything in a big, mostly-empty array:

Note that we don't need anything at all in memory except for the HashMap and its contents, plus a few local variables. And the HashMap is just a little bigger than your three arrays were (which we no longer need), and much, much smaller than a 20000 x 20000 matrix.
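As a self-contained sketch of the whole approach (a reconstruction; `locs` stands in for the sorted array of SNP locations, and the nested Pair repeats the unordered-key idea):

```java
import java.util.HashMap;
import java.util.Map;

// Build the map from parsed lines, then print the square table row by row,
// looking each cell up in the map instead of in a 20000 x 20000 array.
public class CorrelationTable {

    // One unordered key serves both (a,b) and (b,a).
    static final class Pair {
        final int x, y;
        Pair(int a, int b) { x = Math.min(a, b); y = Math.max(a, b); }
        public boolean equals(Object o) {
            return o instanceof Pair && ((Pair) o).x == x && ((Pair) o).y == y;
        }
        public int hashCode() { return 31 * x + y; }
    }

    final Map<Pair, Double> corr = new HashMap<Pair, Double>();

    void add(int a, int b, double stat) {
        corr.put(new Pair(a, b), Double.valueOf(stat));
    }

    // The diagonal is 0, and off-diagonal cells missing from the map
    // default to 0 as well.
    String render(int[] locs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < locs.length; i++) {
            for (int j = 0; j < locs.length; j++) {
                double v = 0.0;
                if (i != j) {
                    Double d = corr.get(new Pair(locs[i], locs[j]));
                    if (d != null) v = d.doubleValue();
                }
                sb.append(v);
                sb.append(j + 1 < locs.length ? " " : "\n");
            }
        }
        return sb.toString();
    }
}
```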
[ May 28, 2007: Message edited by: Jim Yingst ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Well, same basic idea. I don't see a purpose to using TreeMap rather than HashMap, since the ordering we need in the output is achieved by the loops. HashMap ought to be faster. But I might be missing something.
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
I don't see a purpose to using TreeMap rather than HashMap, since the ordering we need in the output is achieved by the loops. HashMap ought to be faster. But I might be missing something.
I don't know how the SNPLocation headings are being derived: numbers like 21876767 followed by 21880326 and 21884299 may not necessarily be in the appropriate order in the input file - and I'm not sure how you loop those from 1 to 20000 without knowing what they are in advance (there is no common difference between those three for instance)? Unless there's a pattern/formula I'm missing which determines them, it seemed sensible to order on the values of the input SNPLocations, rather than trust the file to be in the correct order already. If there is a pattern to exploit, HashMap is probably faster, though I would think TreeMap only has a performance overhead when adding the data (not retrieving it). Otherwise I think we have basically identical approaches!
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
[Charles]: The only storage here is a TreeMap containing (n(n-1)/2) DataPoint and Double objects.

Doesn't your map have only n entries? Same as mine?
[ May 28, 2007: Message edited by: Jim Yingst ]
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
Doesn't your map have only n entries? Same as mine?
I was interpreting n as the size of the resulting matrix (i.e. n x n matrix), while in fact if N is the number of lines in the input file (ignoring the first two), then we should have:

N = n(n-1)/2

So we're both correct, for different definitions of n or N! The map itself shouldn't contain any more than N (number of input data lines) entries.
Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
Hi Guys.

Thank you very much for trying to work this out. But there is a lot of code there I don't understand, so it's going to take me a while to work through it, as I have never used any type of '-Map' before. I'll let you know of any questions I am sure to have about this.

Thanks again
[ May 28, 2007: Message edited by: Sam Bluesman ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
[Charles]: I don't know how the SNPLocation headings are being derived: numbers like 21876767 followed by 21880326 and 21884299 may not necessarily be in the appropriate order in the input file - and I'm not sure how you loop those from 1 to 20000 without knowing what they are in advance (there is no common difference between those three for instance)?

I missed that the numbers were larger than his stated range of 20000, implying that the list of SNP locations has many gaps in it. OK, so just take the lists of SNP locations he built in the first place, and sort them. Well, better to combine them and eliminate any duplicates, I guess, so put them all in a single TreeSet and do a toArray() on the result:

Now you've got a complete list of SNP locations, in order. The final loops become:

I added a comment about null map values, which should be extremely common since the size of the input file is so small compared to the output table.
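That merge-and-sort step might be sketched as follows, assuming the two location columns were collected into int arrays while reading the file:

```java
import java.util.TreeSet;

// Merge both SNP-location columns, drop duplicates, and come out sorted:
// a TreeSet does all three at once.
public class SortedLocations {
    public static int[] merge(int[] snp1, int[] snp2) {
        TreeSet<Integer> all = new TreeSet<Integer>();
        for (int v : snp1) all.add(Integer.valueOf(v));
        for (int v : snp2) all.add(Integer.valueOf(v));
        int[] locs = new int[all.size()];
        int i = 0;
        for (Integer v : all) locs[i++] = v.intValue();
        return locs;
    }
}
```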

[Charles]: I would think TreeMap only has a performance overhead when adding the data (not retrieving it).

It's O(log(n)) for get() as well as put(). Unless you're iterating the entrySet(), but I don't see how that would work here.

[Charles]: I was interpreting n as the size of the resulting matrix (i.e. n x n matrix),

Ah, makes sense.
[ May 28, 2007: Message edited by: Jim Yingst ]
Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
Hi Guys.

I'm trying to implement a TreeMap as shown by Charles, but when I compile I get the following error:

type TreeMap does not take parameters

Any ideas?

Cheers
Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
Sorry. I've just figured out that I forgot an import statement - after hours of reading!!!
Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
Right. I've been doing my reading and examining the code used by Charles. I am printing to the screen the values held in the TreeMap. But I am a bit lost as to where the values of the SNP locations have gone.

Could you please clear this up for me?

MANY thanks
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
Using either of the solutions given previously, the keys for the map hold both the snp1 and snp2 values... You can obtain the pairs using the entrySet() method of TreeMap and iterating through that (possibly marginally more efficient than iterating through the keys and using get() on TreeMap to retrieve the value).

By the way, how's the running time? If it's a problem, Jim's solution is quite possibly slightly faster, though requires separate sorting code.
Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
Hello there.

OK. I've tried that using the following code:


However, I get the following in my XP console:

[FileInProgram$DataPoint@457=0.5, FileInProgram$DataPoint@457=0.7, FileInProgram$DataPoint@8ae=0.9][FileInProgram$DataPoint@457=0.5, FileInProgram$DataPoint@457=0.7, FileInProgram$DataPoint@8ae=0.9][FileInProgram$DataPoint@457=0.5, FileInProgram$DataPoint@457=0.7, FileInProgram$DataPoint@8ae=0.9]

...having used the following small input file:

I think I'm not doing something right, as I can't see the SNP locations listed.
[ May 30, 2007: Message edited by: Sam Bluesman ]
Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
BTW, performance is excellent.
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
References like "FileInProgram$DataPoint@457" are the default implementation of toString() - which is invoked to output the DataPoint as a String.

Since you don't actually want toString() anyway, try the following (uses Java 5 syntax for the iterating):

The keys should be in ascending order of SNP1. Now that you've got the data in the map and an output set, you'll need to figure out how to output a matrix grid of the required values.
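The iteration Charles describes might look like this; String keys stand in for the thread's DataPoint class purely to keep the sketch self-contained:

```java
import java.util.Map;
import java.util.TreeMap;

// Walk the map's entrySet() with the Java 5 for-each loop: each Entry
// hands back the key and value together, avoiding a second lookup per key.
public class EntrySetDemo {
    public static String dump(Map<String, Double> map) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : map.entrySet()) {
            sb.append(e.getKey()).append(" -> ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> map = new TreeMap<String, Double>();
        map.put("21876767 21880326", Double.valueOf(0.5));
        map.put("21876767 21884299", Double.valueOf(0.7));
        System.out.print(dump(map));  // TreeMap iterates in key order
    }
}
```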
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
You're printing the entire entrySet() three times. And by using print() rather than println(), it's more confusing because there's no line separator between entries. Try something like this:

The real problem, though, is that the DataPoint class doesn't have a toString() method that prints the point in a helpful manner. Neither did my Pair. Oops. However, you can add one, overriding the toString() method in DataPoint so that instead of "FileInProgram$DataPoint@457" you get something like "[1111,2222]".

Note that with a good toString() in place, you'd have gotten readable output just by writing
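For example, with a reconstructed Pair carrying such a toString() (hypothetical names, not the thread's exact code), printing a key, or the whole map, becomes readable:

```java
// A Pair with toString() overridden, so printing yields "[1111,2222]"
// instead of the default "ClassName@hashcode" form.
public class Pair {
    final int x, y;

    public Pair(int a, int b) {
        x = Math.min(a, b);
        y = Math.max(a, b);
    }

    public boolean equals(Object o) {
        return o instanceof Pair && ((Pair) o).x == x && ((Pair) o).y == y;
    }

    public int hashCode() { return 31 * x + y; }

    public String toString() { return "[" + x + "," + y + "]"; }
}
```

System.out.println(new Pair(2222, 1111)) would then print [1111,2222], and printing the map would show every key in that form.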

[ May 30, 2007: Message edited by: Jim Yingst ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
[Charles]: Since you don't actually want toString() anyway

Well, not for the final result, in this case. But it's useful for being able to print something out to see what data is there. In my response I'm assuming there will be more work to format the output later. The toString() is just for debugging.
Sam Bluesman
Ranch Hand

Joined: Nov 21, 2004
Posts: 191
Ah. That seemed to work guys.

Thank you very much.
 