Occasionally, my stand-alone Java application will just go nuts and drag down the entire server. It's just a Java app running as a Windows Service (using jsl.exe), running on Win 2000, using Sun JRE 1.5.0_17.
So I have this application that listens on a couple of ports with a ServerSocket and an SSLServerSocket. Those listening threads, in turn, fork a thread for each socket request that comes in. There are also a couple of threads that go to sleep and wake up periodically to do logging or back-end MQ integration.
So anyway, out of 1000+ of these servers, every once in a while this will happen (2-5 servers in a month). In our QA Lab, we haven't been able to reproduce the problem, but I noticed that there seem to be 'dangling threads' that never go away, even after a few hours. I'm trying to figure out whether this is a quirk of the Sun JRE or whether I'm doing something wrong. There's no pattern to these threads. That is, the dangling threads are spread across different pieces of code, so it doesn't look like an infinite loop in a specific place. Furthermore, memory usage doesn't seem to grow much either.
So I'm implementing a Windows Scheduled Task to restart the Java app every 24 hours, as a safeguard. The theory is that these dangling threads add up, and after several weeks or months they start to drag down the JVM, just in terms of the bookkeeping cost of keeping them around. Not the greatest theory, but the best I've got right now.
Does anyone know of the JRE 1.5 having a known issue close to this, or have any other theories? Can I blame this on the thread-handling of the JVM, or should I keep hunting for a flaw in my code?
BTW, I don't know the thread state at this time, but I'm doing the thing where I get the list of threads and print out the CPU time for each one. I'm noticing that there are threads still on the list several hours after they should have finished and been garbage-collected, despite a fair amount of server activity.
Example log output: (Threads 2-11 are my server/JVM threads)
Thread ID=1809, CPU Time=0
Thread ID=1787, CPU Time=0
Thread ID=1758, CPU Time=0
Thread ID=1752, CPU Time=0
Thread ID=738, CPU Time=0
Thread ID=734, CPU Time=0
Thread ID=724, CPU Time=0
Thread ID=11, CPU Time=578125000
Thread ID=10, CPU Time=0
Thread ID=9, CPU Time=0
Thread ID=8, CPU Time=2109375000
Thread ID=7, CPU Time=171875000
Thread ID=4, CPU Time=0
Thread ID=3, CPU Time=156250000
Thread ID=2, CPU Time=46875000
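Since the log above has no thread state, here is a minimal sketch of how the same listing could also record each thread's name, state and top stack frame. It only uses the java.lang.management API that ships with JRE 1.5; the class name ThreadReport and the output format are my own invention, not the poster's code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one report line per live thread.
final class ThreadReport {

    static List<String> snapshot() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = mx.isThreadCpuTimeSupported();
        long[] ids = mx.getAllThreadIds();
        // Depth 1: only the top frame, enough to see where a thread is parked.
        ThreadInfo[] infos = mx.getThreadInfo(ids, 1);

        List<String> lines = new ArrayList<String>();
        for (int i = 0; i < ids.length; i++) {
            ThreadInfo info = infos[i];
            if (info == null) {
                continue; // thread ended between the two calls
            }
            // CPU time is in nanoseconds; -1 if unsupported, disabled, or thread gone.
            long cpu = cpuSupported ? mx.getThreadCpuTime(ids[i]) : -1L;
            StackTraceElement[] stack = info.getStackTrace();
            String top = stack.length > 0 ? stack[0].toString() : "(no frames)";
            lines.add("Thread ID=" + ids[i]
                    + ", Name=" + info.getThreadName()
                    + ", State=" + info.getThreadState()
                    + ", CPU Time=" + cpu
                    + ", Top=" + top);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : snapshot()) {
            System.out.println(line);
        }
    }
}
```

A dangling thread that shows up as RUNNABLE with a socket read as its top frame points at blocked I/O, while WAITING or BLOCKED points at a lock or a wait() that is never notified.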
I am not sure about the solution to your problem but possibly you should look at these questions to better understand your problem:
1. After a period of time, do you see the majority of the threads stuck around a particular piece of code? If so, there could be a bug in that area. Maybe the threads do not end as expected.
2. Are you sure there are no deadlocks happening?
3. What is preventing you from using a publicly available server instead of forking threads yourself?
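For question 2, the JVM can be asked directly instead of reasoning about it. This is a rough sketch using ThreadMXBean.findMonitorDeadlockedThreads(), which is available from JRE 1.5; the class name DeadlockCheck and the report text are made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: reports threads deadlocked on object monitors.
final class DeadlockCheck {

    static String report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Returns null when no threads are deadlocked on monitors.
        long[] ids = mx.findMonitorDeadlockedThreads();
        if (ids == null) {
            return "no monitor deadlock detected";
        }
        StringBuilder sb = new StringBuilder("deadlocked threads:");
        for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
            if (info == null) {
                continue; // thread ended after detection
            }
            sb.append("\n  ").append(info.getThreadName())
              .append(" waiting on ").append(info.getLockName())
              .append(" held by ").append(info.getLockOwnerName());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Note that this only covers deadlocks on synchronized monitors. Threads stuck on file locks or on blocking I/O will not be reported by it.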
Thanks and Regards
The threads that are dangling are really just spread out seemingly randomly across different parts of code.
No deadlock possible because there are no shared resources except files and those are accessed in the same order by all threads. I'm good about not using static class variables in the code.
Project constraints prevent me from using a real server or server software. I had to do the multi-threading myself, but it's not like I did anything special. I used the sample code from the Sun Java Tutorials on how to open a socket connection. It's a very simple run() method that runs, does some stuff, and finishes, that's it! I just don't get what's wrong here, and I'm catching all exceptions and writing them to logs, but nothing there, either.
Presumably, from the stack traces you can dump, you should be able to see what in the code is failing to terminate the thread, e.g. a boolean flag not being set.
I.e., what condition terminates your code? Should a timer have fired? Is it I/O not being interruptible on Windows?
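On the blocking I/O point: a read on a plain socket does not return just because the thread is interrupted, so a handler thread can sit in read() for as long as the peer stays silent. One common guard is a read timeout via Socket.setSoTimeout(). The sketch below is only an illustration under that assumption; TimedHandler, handle() and the 200 ms value are invented here and are not taken from the poster's code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical per-connection handler with a read timeout.
final class TimedHandler {

    /** Reads one byte with a timeout and reports how the read ended. */
    static String handle(Socket socket, int timeoutMillis) throws IOException {
        try {
            socket.setSoTimeout(timeoutMillis); // read() now throws after this long
            InputStream in = socket.getInputStream();
            int b = in.read();
            return b < 0 ? "closed by peer" : "got data";
        } catch (SocketTimeoutException e) {
            return "timed out"; // handler can end instead of blocking forever
        } finally {
            socket.close();
        }
    }

    /** Demo on the loopback interface: the client connects but never sends. */
    static String demo() {
        try {
            ServerSocket server = new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"));
            try {
                Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
                try {
                    return handle(server.accept(), 200);
                } finally {
                    client.close();
                }
            } finally {
                server.close();
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If the dangling threads turn out to be parked in a socket read, a timeout like this would at least let them finish and be collected.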
As it's so intermittent, and the threads are obviously doing something if they consume CPU (?), and it's Windows so spurious thread wake-ups shouldn't be an issue (as much)... could it be a failure of happens-before ordering on a stop condition? Do the machines it fails on have more CPUs? ;-)
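To show what I mean by happens-before on a stop condition: if a loop polls a plain boolean field that another thread sets, the Java memory model does not guarantee the loop ever sees the change, and multi-CPU boxes make that more likely to bite. Declaring the flag volatile fixes the visibility. A minimal sketch, with StoppableWorker being a made-up class and not the poster's code:

```java
// Hypothetical worker whose loop ends when another thread calls stop().
final class StoppableWorker implements Runnable {

    // volatile: the write in stop() is guaranteed to become visible to run().
    // Without it, the JIT may legally hoist the read out of the loop.
    private volatile boolean running = true;

    public void run() {
        while (running) {
            Thread.yield(); // placeholder for real work
        }
    }

    void stop() {
        running = false;
    }

    /** Starts a worker, asks it to stop, and reports whether it ended. */
    static boolean demo() {
        StoppableWorker worker = new StoppableWorker();
        Thread t = new Thread(worker, "worker");
        t.start();
        worker.stop();
        try {
            t.join(2000); // wait up to 2 seconds for the loop to exit
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + demo());
    }
}
```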
We'd really need some idea of the structure of the code.
"Eagles may soar but weasels don't get sucked into jet engines" SCJP 1.6, SCWCD 1.4, SCJD 1.5,SCBCD 5