I am seeing quite a strange behaviour from the garbage collector in one of our applications.
The application is a standalone Java app; it receives requests over a TCP socket and responds to them. It is basically a kind of cache in front of the database. There is a constant request flow of 10-100 requests per second. Average response time is below 50ms. It is running on a multiprocessor Sun machine, JDK 1.4.1_02.
The problem arises after the app has been running for a while. Depending on the allowed maximum memory size - with 256mb after about half a day, with 512mb after about a day - the garbage collector seems to be running most of the time. We turned on GC logging, and from the log I can see that initially, for a long period, only the DefNew collector runs; those collections take ~0.05 sec. Then after some time old generation collections start taking place, and these take 3-4 seconds each. The old gen collections then become more and more frequent until all that is going on is one collection after another: about 3-4 seconds for the collection, then after another 3-4 seconds the same thing again.
Can anyone please help in understanding what can cause this kind of behaviour? I am quite sure that the app is stable in the sense that it doesn't start to eat up extra memory at some point.
But I would advise you to try the -server switch when starting the application. This optimizes the JVM for long-running applications - as far as I know it also uses a different GC strategy.
The fact that collecting the old generation takes up to 5 seconds is not so strange. That's why this kind of garbage collection is postponed as long as possible. The only reason for the GC to collect the old generation is that normal (young generation) collection is no longer sufficient. In general, this is an indication that your application can't handle the load, or has a small leak in it. For your kind of app - one that handles a continuous flow of requests and doesn't deal with request bursts - this is not normal behaviour.
You should check your app again. Assigning null to references, in particular to collections, that are no longer needed is good practice. Check your application for global collections that might keep some sort of hidden handle to your data structures, or something like that.
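To illustrate the kind of "hidden handle" meant here, a minimal sketch (class and method names are made up for the example, and it uses modern Java syntax rather than JDK 1.4's): a static map that requests are added to but never removed from will keep every payload reachable forever, so no collection can reclaim them.

```java
import java.util.HashMap;
import java.util.Map;

public class RequestRegistry {
    // Global collection: every request handled is remembered here forever,
    // so the payloads can never be garbage collected.
    private static final Map<String, byte[]> handled = new HashMap<String, byte[]>();

    public static void remember(String requestId, byte[] payload) {
        handled.put(requestId, payload);
    }

    // The fix: drop the entry once it is no longer needed, so the
    // payload becomes unreachable and collectable.
    public static void forget(String requestId) {
        handled.remove(requestId);
    }

    public static int size() {
        return handled.size();
    }
}
```

A profiler will show such a map growing steadily over time, which is usually the quickest way to spot this pattern.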
If you are absolutely sure you cannot improve your code any more, try thinking about the load. I don't know how you handle your requests. Do you start a new thread for every request? Do you use a threadpool with a queue, or just one thread with a queue? Or one thread without a queue? In any case, your server might be overloaded. If you send more requests than your server can handle, and you don't give the server time to catch up, the queue of incoming requests only grows and grows until you get an OOM. Maybe you want to cluster your program and divide the requests among different machines. It might be a bit of programming work, but I think it shouldn't take too long...
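One common way to keep the incoming-request queue from growing without bound is a worker pool with a bounded queue that rejects work when it is full. A sketch using java.util.concurrent (note this package only arrived in JDK 5 - on the poster's JDK 1.4 you would need Doug Lea's util.concurrent library instead; pool and queue sizes here are arbitrary):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedServer {
    // Fixed number of workers plus a bounded queue: when the queue is
    // full, new requests are rejected instead of piling up in memory
    // until the JVM throws an OutOfMemoryError.
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            4, 4,                                   // core and max pool size
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<Runnable>(100),  // bounded request queue
            new ThreadPoolExecutor.AbortPolicy());  // reject when overloaded

    public boolean submit(Runnable request) {
        try {
            pool.execute(request);
            return true;
        } catch (RejectedExecutionException overloaded) {
            // Caller can answer the client with a "server busy" response.
            return false;
        }
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```

Rejecting excess requests at the door trades availability for stability: clients see an explicit "busy" answer instead of the whole server eventually stalling in back-to-back collections.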
Good luck, Kris
Originally posted by Kris Philippaerts: You should check your app again. Assigning null values to objects, in particular collections, that are no longer needed is good practice.
I disagree in the general case. There are only very rare cases in which explicitly setting a reference to null makes any sense.
Originally posted by Kris Philippaerts: You should check your app again.
This is exactly what I ended up with myself. So I did a recheck and also downloaded an evaluation version of a profiler which showed me what I didn't believe in the beginning - we have a leak. After some monitoring with the profiler I found the reason. It wasn't so much a leak but rather a silly bug where one of the cache invalidation algorithms had a minus sign in the wrong place and basically the cache was never cleared. Problem solved now.
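The post doesn't show the actual bug, but a guess at the kind of flipped minus sign described, as a minimal sketch (names and the TTL scheme are invented for illustration): with the operands of the age calculation swapped, the age of a past entry comes out negative, the expiry test never fires, and the cache never shrinks - exactly the slow fill-up that ends in back-to-back old-gen collections.

```java
public class CacheEntry {
    final long createdAt; // millis when the entry was cached

    CacheEntry(long createdAt) {
        this.createdAt = createdAt;
    }

    // Buggy version: (createdAt - now) is negative for any past entry,
    // so it can never exceed the TTL and nothing ever expires.
    boolean isExpiredBuggy(long now, long ttlMillis) {
        return createdAt - now > ttlMillis;
    }

    // Corrected version: age is (now - createdAt).
    boolean isExpired(long now, long ttlMillis) {
        return now - createdAt > ttlMillis;
    }
}
```

A heap profiler makes this class of bug visible quickly, since the cache's backing collection just keeps growing instead of plateauing.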