During analysis of a heap dump I found that the value in my main HashMap is not a substring as put there, but the complete string and an index pointing at the substring. Is it smart enough to reference the same string or does it clone the string, which would result in excess copies of the same string in memory?
I know from my analysis that the HashMap in question takes up over 15 MB of the heap. A similar thing happens with the key, and it comes from a different string for every different value (approx. 85 different values). By my calculations it should contain less than 5 MB of data in keys and values, so where does the remaining 10 MB come from?
Philip Grove wrote: During analysis of a heap dump I found that the value in my main HashMap is not a substring as put there, but the complete string and an index pointing at the substring. Is it smart enough to reference the same string or does it clone the string, which would result in excess copies of the same string in memory?
Actually, your question has little to do with HashMap and more to do with whether the result of substring() is a reference to the same String; and the answer is: not quite.
A substring is a separate String object, but (and I'm almost certain of this, but I'm happy to be corrected if anyone knows better) it shares the character array of the original String. Thus, it will take whatever space overhead is associated with an object (≈16 bytes I think), plus internal indexes (2 or 3 ints; I forget), plus the reference to the array itself (4/8 bytes). The catch is that the whole of the original array stays reachable for as long as any substring of it does.
PS: It's also worth noting that Java characters take two bytes, not one. Not sure if you took that into account in your calculations.
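To put rough numbers on that, here is a back-of-envelope sketch. The constants are the guesses from this post (16-byte object header, three ints, an 8-byte array reference), and the two lengths in main are made-up figures, so treat the result as an order-of-magnitude estimate rather than what your JVM actually allocates. The class and method names are hypothetical.

```java
// Back-of-envelope sizes only; real numbers depend on the JVM, pointer size and alignment.
class SubstringFootprint {
    static final int OBJECT_HEADER_BYTES = 16; // rough guess from the post above
    static final int INT_FIELD_BYTES = 3 * 4;  // assuming three int fields (offset, count, hash)
    static final int ARRAY_REF_BYTES = 8;      // 4 on a 32-bit or compressed-pointer JVM
    static final int BYTES_PER_CHAR = 2;       // a Java char is 16 bits

    /** Approximate size of the String object itself, excluding its char data. */
    static long stringObjectBytes() {
        return OBJECT_HEADER_BYTES + INT_FIELD_BYTES + ARRAY_REF_BYTES;
    }

    /** Approximate size of the character data for the given number of chars. */
    static long charDataBytes(int chars) {
        return (long) BYTES_PER_CHAR * chars;
    }

    public static void main(String[] args) {
        int parentChars = 10_000; // hypothetical length of the original string
        int subChars = 20;        // hypothetical length of the substring kept in the map

        // If the substring shares the parent's array, the whole array stays reachable.
        long sharing = stringObjectBytes() + charDataBytes(parentChars);
        // If the substring owns a right-sized copy, only its own chars are kept.
        long copied = stringObjectBytes() + charDataBytes(subChars);

        System.out.println("kept alive when sharing the parent array: " + sharing + " bytes");
        System.out.println("kept alive with a private copy:           " + copied + " bytes");
    }
}
```

With numbers like these, a few dozen long parent strings held alive by short substrings could plausibly account for a multi-megabyte gap, but only a heap dump will tell you for sure.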
I think Winston got it - now I recall hitting the same problem. Here is the Java 6 substring code:
Note that "value" here is the existing array, so the new String object keeps a reference to the big array it was derived from!
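To make the sharing concrete, here is a rough sketch of the mechanism. MiniString is a made-up class, not the JDK source; it only mirrors the value/offset/count layout described above so the effect can be seen in a few lines.

```java
// MiniString is a hypothetical class for illustration; it is NOT the JDK source.
final class MiniString {
    final char[] value; // backing array, possibly shared with other instances
    final int offset;   // index of the first character used in value
    final int count;    // number of characters used

    MiniString(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private MiniString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    /** Java 6 style substring: new object, same backing array, no copying. */
    MiniString substring(int beginIndex, int endIndex) {
        if (beginIndex < 0 || endIndex > count || beginIndex > endIndex) {
            throw new StringIndexOutOfBoundsException();
        }
        return new MiniString(value, offset + beginIndex, endIndex - beginIndex);
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }

    public static void main(String[] args) {
        MiniString big = new MiniString("a very long line read from some file");
        MiniString small = big.substring(2, 6);

        System.out.println(small);                    // very
        System.out.println(small.value == big.value); // true: same array, so "big" data stays alive
    }
}
```

As long as small is reachable, the garbage collector cannot reclaim the array that big was built on, even after big itself is gone.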
However, note that the following constructor checks for this situation and makes a new copy of the substring characters:
So, to get rid of the reference to the big String, it looks like you need something like:
String s = new String(bigstring.substring(......));
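Applied to a map, that idiom might look like the sketch below. The "key=value" line format and the names TrimmedMapEntries and putTrimmed are assumptions for illustration only, not Philip's actual code. As far as I know, from Java 7 update 6 onwards substring() copies the characters itself, so on those JVMs the extra new String(...) is merely a redundant copy.

```java
import java.util.HashMap;
import java.util.Map;

class TrimmedMapEntries {
    /** Splits a "key=value" line and stores private copies of both halves. */
    static void putTrimmed(Map<String, String> map, String line) {
        int eq = line.indexOf('=');
        if (eq < 0) {
            return; // not a key=value line, skip it
        }
        // new String(...) forces a right-sized copy on JVMs where substring()
        // shares the parent's char array; elsewhere it is just an extra copy.
        String key = new String(line.substring(0, eq));
        String value = new String(line.substring(eq + 1));
        map.put(key, value);
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<String, String>();
        putTrimmed(map, "colour=blue");
        System.out.println(map.get("colour")); // blue
    }
}
```

The cost is one extra small allocation per entry, which is usually a good trade when the parent strings are much longer than the pieces you keep.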
Pat Farrell wrote: Unless the characters are from a language that uses 3 or 4 byte code points.
Which doesn't alter the fact that a Java character is a 16-bit unsigned number; a code point outside the Basic Multilingual Plane simply takes two of them (a surrogate pair). I notice that UseCompressedStrings (which I've never tried) is defined as a 'performance' option, but I wonder if it actually saves anything except space (one article I read suggested that it's 5-10% slower). It's also likely to make space estimation more complex for anything but pure ASCII text.
Well, I've been doing quite a lot of profiling with compressed strings and haven't noticed any difference in latency from the "compression", though I could well believe there is some. My main reason for using it is that we are very sensitive to GC and memory usage, and that's the key performance issue according to the stats. So I think it is a performance option, but you need to know what your application's string usage profile is and how your garbage collector is configured.
I think, as with all performance work, it's profile first, then optimize.