I have a program in which I need to store approximately 11 million words (about 8 MB) in an ArrayList. My problem is that the ArrayList will only hold 1 million String objects. Even when I explicitly declare a size, like this
or use the ensureCapacity method like this
the ArrayList will still only hold 1 million String objects. Should I be using another data structure? I would really like to use ArrayList, unless it is impossible to store this much data in it. Thanks.
No, the maximum capacity of an ArrayList (or any kind of list whatsoever) is limited only by the amount of memory the JVM has available.
Your estimate of the amount of memory required for your data is surely wrong; if 11 million words really required 8 million bytes, then each word, on average, would require less than one byte. In reality a String object requires something like 40 bytes, minimum.
So you may want to look at the possibility of giving your JVM more memory to work with.
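To check the claim that the list itself has no 1-million cap, here is a minimal sketch that adds 2 million entries to an ArrayList. The class name BigListSketch and the method fill are made up for illustration, and the count of 2 million is an arbitrary choice that should fit in a default heap because every entry points at the same String literal.

```java
import java.util.ArrayList;
import java.util.List;

class BigListSketch {
    // Adds `count` references to a pre-sized ArrayList and returns its size.
    static int fill(int count) {
        List<String> words = new ArrayList<>(count); // pre-sized, so no regrowth
        for (int i = 0; i < count; i++) {
            // The same literal is reused, so only references consume memory here.
            words.add("word");
        }
        return words.size();
    }

    public static void main(String[] args) {
        System.out.println(fill(2_000_000)); // prints 2000000
    }
}
```

If a real program stops at some element count, the cause is more likely the loop feeding the list, or the heap running out, than the list itself.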
An ArrayList can easily hold 11 million String references, provided there's sufficient heap space available to hold all those String objects.
How did you arrive at a maximum of 1 million String/Object references? Did you encounter some sort of error message?
Edit: Refresh the page whydontcha...
The theoretical limit for ArrayList capacity is Integer.MAX_VALUE, a.k.a. 2^31 - 1, a.k.a. 2,147,483,647. But you'll probably get an OutOfMemoryError long before that time because, well, you run out of memory.
An ArrayList containing 11 million references should be about 42MB, assuming a reference is 4 bytes long.
That's 42MB in references alone. The actual Strings aren't even considered yet.
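The arithmetic behind that figure can be sketched as follows. The 4-byte reference size is the assumption stated above (it depends on the JVM), and the class and method names are only illustrative.

```java
class ReferenceArraySize {
    // Bytes needed for the references alone, ignoring the String objects.
    // ASSUMPTION: bytesPerReference is 4, as stated in the post; it varies by JVM.
    static long referenceBytes(long elementCount, int bytesPerReference) {
        return elementCount * bytesPerReference;
    }

    public static void main(String[] args) {
        long bytes = referenceBytes(11_000_000L, 4);
        double mib = bytes / (1024.0 * 1024.0);
        System.out.printf("%d bytes, about %.1f MiB%n", bytes, mib);
    }
}
```

11,000,000 references at 4 bytes each is 44,000,000 bytes, which is roughly 42 when divided by 1024 * 1024.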
Aaron Ravi Jakobovits
Joined: Jul 27, 2010
Okay, so the file is 10.288 MB (10,288 KB) and contains 10,534,015 Strings, to be precise. I think my problem is that the ArrayList is declared within my main method. Would declarations and definitions within a main method be stored dynamically on the heap or stack?
If you do the math, 10,288 KB * 1000 = 10,288,000 bytes, so yes, a little less than 1 byte per string on average; but how can that be? I just googled the size of a string in Java and it is at least 4 bytes, right? Now I'm totally confused. I can post the code and add the file as an attachment if anyone would like to check this, but I don't know where I could have made an error.
Aaron Ravi Jakobovits
Joined: Jul 27, 2010
Okay, I made a mistake, big time. I'm using a BufferedInputStream object and the .read() method associated with it to iterate through the file. The .read() method apparently reads one byte at a time, not one token, so I was counting bytes rather than words; my mistake. So my question: is the definition of the ArrayList inside the main method of the program limiting the number of elements the ArrayList can hold, and why?
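The mix-up between bytes and tokens can be reproduced with a small sketch. It uses an in-memory sample instead of the real file, and the class and method names are hypothetical; Scanner is just one possible way to split on whitespace, not necessarily what the original program should use.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class BytesVersusTokens {
    // Counts how many single bytes read() delivers before end of stream (-1).
    static int countBytes(InputStream in) throws IOException {
        int count = 0;
        try (BufferedInputStream buffered = new BufferedInputStream(in)) {
            while (buffered.read() != -1) {
                count++;
            }
        }
        return count;
    }

    // Counts whitespace-separated tokens, which is closer to "words".
    static int countTokens(InputStream in) {
        int count = 0;
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            while (scanner.hasNext()) {
                scanner.next();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "the quick brown fox".getBytes(StandardCharsets.UTF_8);
        System.out.println(countBytes(new ByteArrayInputStream(sample)));  // prints 19
        System.out.println(countTokens(new ByteArrayInputStream(sample))); // prints 4
    }
}
```

A loop over read() therefore reports the file size in bytes, which explains a "string count" that is almost identical to the file size.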
Aaron Ravi Jakobovits wrote: If you do the math, 10,288 KB * 1000 = 10,288,000 bytes, so yes, a little less than 1 byte per string on average; but how can that be?
Well, the most likely scenarios are:
(1) You are mistaken, and the file is much bigger than 10,288kb.
(2) You are mistaken, and the file contains far fewer than 10,534,015 Strings.
(3) You are right, and almost all of the "strings" are exactly one byte in length, or maybe zero, and you neglected to tell us the magic formula by which you determine how long a single "string" (or maybe "line") is. Maybe all "strings" are exactly one character? Otherwise it seems like you need to allocate some more bytes to tell us how long each string is.
Aaron Ravi Jakobovits wrote:I just googled the size of a string in Java and it is at least 4 bytes, right? Now I'm totally confused. I can post the code and add the file as an attachment if anyone would like to check this, but I don't know where I could have made an error.
Hmmm, 4 bytes still seems an underestimate, but whatever. We don't know where you could have made an error either, but it seems that posting the code may be the best course of action.
But how did you come to the conclusion that there is a limit of 1 million strings in an ArrayList? Did you get an error while you tried to run your program? Perhaps an OutOfMemoryError?
If you really need to hold all those 10.5 million strings in memory at once, you might want to give the JVM some more memory by using the -Xmx command line switch. For example:
java -Xmx512m com.mypackage.MyProgram
to give the JVM max. 512 MB memory to work with. The default for the max. amount of memory when using a 32-bit JVM on Windows is quite low, I think 64 MB. If you have 10.5 million strings I can imagine that you'd easily be using more than 64 MB memory.
Note that a 32-bit JVM on Windows allows you to go up to 1.5GB, not more. That's a Windows limitation. If you ever need to use more than 1.5GB inside a JVM you'll need to start using a 64-bit JVM, which of course also requires a 64-bit Windows.
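To see what limit your JVM is actually running with, a minimal sketch like the following prints the maximum heap size. Runtime.maxMemory() is a standard API; the class name HeapLimitSketch is made up for this example.

```java
class HeapLimitSketch {
    // Maximum amount of memory the JVM will attempt to use, in bytes.
    // Returns Long.MAX_VALUE if there is no inherent limit.
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        long max = maxHeapBytes();
        System.out.println("Max heap: " + (max / (1024 * 1024)) + " MB");
    }
}
```

Running it once without any switch and once with -Xmx512m should show the difference; the exact numbers reported depend on the JVM and platform.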
Aaron Ravi Jakobovits wrote: If you do the math, 10,288 KB * 1000 = 10,288,000 bytes, so yes, a little less than 1 byte per string on average; but how can that be? I just googled the size of a string in Java and it is at least 4 bytes, right?
But those two numbers are only loosely related. The first thing to realize is that a character in Java requires two bytes of memory (a char is a 16-bit Unicode code unit). So if those bytes in your file are all ASCII characters, you need at least twice as much as 10,288 KB to store them as string data. Second, a String is implemented as an object containing an array of characters and some other control information. The estimates I have seen for this say it's more like 40 bytes than 4 bytes. So roughly speaking you need 40 bytes per String as overhead plus 2 bytes for each character in the data. And then there are the references to those Strings which you store in that list. Those are what take 4 bytes, so there's another 4 bytes of overhead for each String.
And don't take those numbers as precise information. They are estimates with various degrees of accuracy and might depend on your environment. For example a 64-bit Java might require more memory to store references than a 32-bit Java. Or not... I don't really know.
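Putting those rough figures together, here is a back-of-the-envelope sketch. The constants (40 bytes overhead, 2 bytes per character, 4 bytes per reference) are the estimates from this thread, not measured values, and the average word length of 6 characters in main is purely an assumption for illustration.

```java
class StringMemoryEstimate {
    // Rough estimates taken from the discussion; real values depend on the JVM.
    static final int OVERHEAD_PER_STRING = 40; // object plus char array bookkeeping
    static final int BYTES_PER_CHAR = 2;       // 16-bit char
    static final int BYTES_PER_REFERENCE = 4;  // reference held by the list

    // Estimated heap bytes for `stringCount` Strings holding `totalChars` characters.
    static long estimateBytes(long stringCount, long totalChars) {
        return stringCount * (OVERHEAD_PER_STRING + BYTES_PER_REFERENCE)
                + totalChars * BYTES_PER_CHAR;
    }

    public static void main(String[] args) {
        long words = 11_000_000L;     // word count from the original question
        long averageLength = 6;       // ASSUMPTION: hypothetical average word length
        long bytes = estimateBytes(words, words * averageLength);
        System.out.println(bytes + " bytes"); // prints 616000000 bytes
    }
}
```

Under these assumptions the data needs several hundred megabytes, far more than the size of the file on disk, which is why raising the heap limit matters.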