I have two Tomcats here working in parallel on separate servers. Both have exactly the same configuration and receive requests from a load balancer. My Struts webapp is deployed on each Tomcat, and both webapps access the same database. And both suddenly stopped working without any kind of error message.
What happened yesterday in detail: suddenly server A stopped working. The logs stopped at 09:55am. No error message, nothing. When I tried to connect to the server, my connection timed out. Half an hour later server B also stopped working, in exactly the same way: no connection possible, no error in the log. After restarting server A at 10:40am, the other server (B) suddenly continued working. There were some minor errors in the log caused by lost connections... but that's all. Server B just kept working as if nothing had happened.
My first suspicion was a deadlock situation in the database... but during the 'pause' there weren't any waiting connections. Besides, even if the database isn't available, Tomcat should still do something, shouldn't it? Like accepting the connection and doing some logging (I log every time a user connects, regardless of whether they access data from the DB or not). Additionally, the database connection in Tomcat runs into a timeout after 10s.
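For reference, that pool-side timeout can be made explicit in Tomcat's JNDI DataSource definition. The attribute names below are from the Commons DBCP pool that ships with older Tomcats; the resource name, host, and credentials are made-up placeholders, and the validation settings are an assumption about how to let the pool recover cleanly from lost connections:

```xml
<!-- context.xml: hypothetical DataSource with explicit timeouts -->
<Resource name="jdbc/myDB" auth="Container"
          type="javax.sql.DataSource"
          driverClassName="com.mysql.jdbc.Driver"
          url="jdbc:mysql://dbhost:3306/mydb"
          username="appuser" password="secret"
          maxActive="50" maxIdle="10"
          maxWait="10000"
          validationQuery="SELECT 1"
          testOnBorrow="true"/>
```

`maxWait="10000"` matches the 10s timeout mentioned above; `validationQuery` with `testOnBorrow` makes the pool test each connection before handing it out, so stale connections after a DB hiccup get discarded instead of causing silent hangs.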
I keep coming back to the DB, since it's the only thing connecting the two servers. There is no interaction between them except via the database.
Since the restart of Tomcat A, everything has been working perfectly again.
Does anyone have an idea what could have happened here?
There's another possibility. There might be some other, unrelated process running on those servers that's stealing all the CPU, disk, and/or network resources from the Tomcat server and its apps. In a worst-case scenario, your servers might have been "pwned", and are serving (unbeknownst to you) as nodes in a botnet.
If this problem recurs, you should monitor the systems as a whole to see if you can detect unusual loads.
Fortunately, both machines run on VMs and aren't connected to the world wide web, so I think (and hope) we didn't get pwned...
The underlying machine was being monitored and didn't show any performance problems...
Contrary to what the server administrators said at first, there was an enormous CPU usage peak at the time of the error.
There seems to be something going wrong in the application's source... I'll put massive logging on it to be able to reproduce it the next time it occurs (which hopefully isn't during the next 20 years ;) )
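Alongside the extra logging, it may be worth capturing thread dumps programmatically the next time the app stops responding; a dump shows whether request threads are stuck waiting on the DB, a lock, or a socket. A minimal sketch using the standard `ThreadMXBean` (the class name `ThreadDumper` is made up, and in practice you'd call it from a watchdog thread or an admin servlet rather than `main`):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: produces roughly the same information as
// `kill -3 <pid>` / jstack, but from inside the JVM, so it can be
// triggered by the application itself when it detects a stall.
public class ThreadDumper {

    public static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // true, true -> include locked monitors and synchronizers,
        // which is what reveals deadlocks or threads blocked on a DB read
        for (ThreadInfo ti : mx.dumpAllThreads(true, true)) {
            sb.append(ti.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

Logging such a dump once a minute while the problem is happening would show exactly where every thread was stuck during the "pause".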
Thanks a lot for your help