We are facing an intermittent OutOfMemoryError in production; it shows up under any load and in any function of the application.
I know that doesn't sound new to anyone. Still, we need advice on how to solve the problem, so here is a list of questions.
1. Using a heap dump and several analysis tools, we were not able to find any memory leak. We have enough memory, with -Xms512m -Xmx784m. Nevertheless, the dump file is only about 115 MB. Is that comparison meaningful? Is the dump size a clue that the problem lies elsewhere and that the heap is not actually being fully consumed?
2. Now, if the above holds true and there really is unused memory (I wonder how we can be totally sure of that), how come the JVM throws the exception?
3. I read something about the different memory "sections" that hold objects of different ages (the generational GC model). Is it possible that some of those sections cannot grow even though there is memory available? How can we fine-tune those sections, if there are settings for that?
4. Furthermore, we are working with this configuration: JBoss 4.0.1SP1, JVM 1.4.2_12, operating system W2K3 SP2. Could it be an old-JVM/new-OS conflict problem?
5. Unexpected Errors. Does this mean an error captured by the JVM? That is, the exception is not thrown during normal execution, like creating an object or going through a loop. Our guess is that it is thrown while the GC itself is running. How can we be sure of that? If that is correct, does it mean it is a JVM problem?
For most of these, I don't know, really. But here are a couple of possible answers:
2. It's possible for the JVM to throw this error even though there is plenty of unused memory, if a particular line required much more memory than was available at once; for example, a single huge array allocation.
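A hypothetical sketch of such a line (the class name and sizes are made up for illustration, not taken from the asker's code). It deliberately requests an array whose backing storage exceeds the configured maximum heap, so the single `new` fails even though the rest of the heap is nearly empty:

```java
/** Illustration: one oversized allocation can throw OutOfMemoryError
 *  while the heap is otherwise almost unused. */
public class BigAllocation {

    static String tryHugeAllocation() {
        long max = Runtime.getRuntime().maxMemory();
        // Request an int[] whose storage (4 bytes per element) is about
        // twice the entire configured heap, capped at the array size limit.
        int len = (int) Math.min(Integer.MAX_VALUE - 8, max / 2);
        try {
            int[] big = new int[len];   // the one line that fails
            return "allocated " + big.length;
        } catch (OutOfMemoryError e) {
            return "OOME from a single allocation";
        }
    }

    public static void main(String[] args) {
        System.out.println(tryHugeAllocation());
    }
}
```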
Such a line might fail even though there's more than enough memory for everything else the application is doing. Of course, an allocation that big is usually pretty obvious, so your problem is probably something else.
5. "Unexpected Error" was apparently something in the RMI code. It looks like the OOME occurred on the server, but you're looking at it from an RMI client: the RMI server caught the OOME, wrapped it in a ServerError, and serialized it to the RMI client. To learn more about exactly what was happening, look for a stack trace. Sometimes an OOME has none, and the log will say "stack trace unavailable"; otherwise, find the trace. If you don't see "unavailable", then some code is probably printing the error but omitting the trace, which is usually a mistake. The trace can help you answer questions about what was going on at the time of the error.
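For reference, `java.rmi.ServerError` carries the server-side Error in its public `detail` field, so the original OOME is recoverable on the client. A minimal sketch of the wrapping the RMI runtime performs (simulated here directly, with no actual remote call):

```java
import java.rmi.ServerError;

/** Sketch: how an Error that escapes a remote call on the server
 *  reaches the RMI client, wrapped in a ServerError. */
public class ServerErrorDemo {

    // Simulates what the RMI runtime does on the server side.
    static ServerError wrap(Error serverSideError) {
        return new ServerError("Error occurred in server thread", serverSideError);
    }

    public static void main(String[] args) {
        ServerError seenByClient = wrap(new OutOfMemoryError("Java heap space"));
        // On the client, the original error is available via the public
        // `detail` field (and via getCause() on modern JVMs).
        System.out.println(seenByClient.detail);    // java.lang.OutOfMemoryError: Java heap space
        System.out.println(seenByClient.getCause() == seenByClient.detail);  // true
    }
}
```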
If the stack trace is unavailable, you can try adding a lot of logging statements to the server code (possibly using AOP). Then you can look at the server logs and see what the server was doing just prior to the OOME.
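One lightweight way to get that blanket method-entry/exit logging without a full AOP framework is a JDK dynamic proxy around the service interfaces. A hypothetical sketch (`OrderService` is made up for illustration, and it uses a lambda for brevity; on a 1.4 JVM you'd write an anonymous inner class instead):

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

interface OrderService {
    String process(String id);
}

public class LoggingProxyDemo {

    /** Wraps any implementation of iface so every call is logged. */
    static <T> T withLogging(final T target, Class<T> iface) {
        InvocationHandler handler = (proxy, method, args) -> {
            System.out.println("ENTER " + method.getName());
            try {
                return method.invoke(target, args);
            } finally {
                System.out.println("EXIT  " + method.getName());
            }
        };
        return iface.cast(Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] { iface }, handler));
    }

    public static void main(String[] args) {
        OrderService real = id -> "processed " + id;
        OrderService logged = withLogging(real, OrderService.class);
        System.out.println(logged.process("42"));  // logs ENTER/EXIT, prints "processed 42"
    }
}
```

The advantage over hand-inserted log statements is that the last ENTER line without a matching EXIT in the log points directly at the call that was in flight when the OOME hit.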
We had a similar kind of situation where memory was at 50% use but we still ran into OutOfMemoryError. It turned out that when the JVM allocates memory for objects, short-lived objects go into the young generation and long-lived ones go into the old generation. Even though overall memory was at 50% usage, we ran out of young generation. The way I determined that was by enabling GC logging: as soon as we got the OutOfMemoryError, we looked at gc.log and saw that we were at 100% of the young generation (the default is 64 MB). You can set the young-generation size and GC logging options in run.bat on Windows or run.sh on Linux.
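A sketch of what those JBoss launcher settings might look like (the sizes here are illustrative, not the values from our setup; the flags themselves are valid on a 1.4.2 HotSpot JVM):

```shell
# In run.sh (Linux); in run.bat use: set JAVA_OPTS=%JAVA_OPTS% ... instead.
# -XX:NewSize/-XX:MaxNewSize enlarge the young generation beyond the ~64 MB default;
# -verbose:gc and -Xloggc record GC activity so you can inspect it after an OOME.
JAVA_OPTS="$JAVA_OPTS -Xms512m -Xmx784m -XX:NewSize=128m -XX:MaxNewSize=128m -verbose:gc -Xloggc:gc.log"
```

After the next OOME, the tail of gc.log shows whether the young or the old generation was exhausted at the time of the failure.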