I am having a problem that is strange and very hard for me to track down. I have several servers that are communicating with each other and with external callers. I'm not sure where to start with describing the problem but I'll try.
I have a socket server that does a lookup of an IP address, determines where the user is coming from and passes back a String[] of IP data (ISO country, city, state and whether the IP is anonymous, in which case I block the user). That code has been running fine for a while. The caller calls the client, which connects to the server; the server looks up the data and sends back the array. It's great and runs smoothly.
We decided to switch all of our code from ColdFusion over to Java, so the iframes our customers are displaying (which were straight HTML rendered by CF) are now HTML calling a different Java RMI server that I created. Since this is new code, we rolled it out for only one client. The new server gets all kinds of data from our DB and passes all the display info to the HTML presentation layer. To do so, the RMI server (com.cpawins.offerplatform.RemoteServerImpl) has to call the Geo IP socket (com.cpawins.IPValidation.CPAWinsGeoClient talks to the Geo server), which was previously called exclusively by ColdFusion (for months with no bugs). We turned on our RMI server and all was great. Then we increased traffic to about 2 to 4 hits per second. Suddenly, everything became erratic.
What happens is that the RMI server runs along for a while and then suddenly freezes. Sometimes it freezes and shows the exception below. Other times it just freezes. It may freeze within a minute after it comes up or it may freeze a day after it is started, but it's guaranteed to stop working at some point. What's worse is that this RMI server is taking care of only ONE customer right now. All other customers are using the old CF code calling the Geo IP server. When the RMI server freezes up, it kills our entire web site (all customers are down, and our main site, which calls absolutely none of this code, goes down too). It appears that IIS is just waiting for a response and, since it does not get one, it just kills everyone. Kill the new RMI server and everything, including all the servers that were not restarted, works again (except the customer's iframe that relies on the RMI server). Start the RMI server up again and everything is fine for a little while.
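For illustration, this is how a socket read behaves when the other end never answers, and how a read timeout bounds the wait. It is a minimal sketch, not the production code: the class name ReadTimeoutSketch, the silent test server and the 500 ms value are all made up for the example, and I am only assuming that a blocking read is what the frozen callers are stuck in.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {

    /** Returns true if the read gave up after timeoutMs instead of blocking forever. */
    public static boolean readTimesOut(int port, int timeoutMs) throws IOException {
        try (Socket socket = new Socket()) {
            // Bound the connect as well as the read.
            socket.connect(new InetSocketAddress("127.0.0.1", port), timeoutMs);
            // Without this call, read() below could block indefinitely.
            socket.setSoTimeout(timeoutMs);
            InputStream in = socket.getInputStream();
            try {
                in.read(); // waits for data, end of stream, or the timeout
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical silent peer: a listening socket that never sends a reply.
        try (ServerSocket silent = new ServerSocket(0)) {
            System.out.println("timed out: " + readTimesOut(silent.getLocalPort(), 500));
        }
    }
}
```

With no timeout set, the same read would simply hang, and every caller waiting on it would hang with it.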
Since this is new code, it's limited in what it is handling right now. It makes no sense. The same process runs over and over, sometimes many tens of thousands of times, and then suddenly just stops working, sometimes throwing the following exception:
I can't reproduce this error. The only time I've ever seen it is when I have not ordered socket in/out streams properly - one stream tries to talk when something is not open. I don't see how this can be the case here as everything runs fine for quite a while without error - and half the time it freezes, there's no error at all.
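To show what I mean by stream ordering, here is a minimal sketch of a lookup exchange. It assumes Java object streams over a plain TCP socket, which may not match the real client; the class name GeoLookupSketch, the fake one-shot server and the sample values are hypothetical. The ObjectInputStream constructor blocks until it has read the header written by the peer's ObjectOutputStream, so each side creates and flushes its output stream first. If both sides opened their input stream first, both would wait forever.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class GeoLookupSketch {

    // Hypothetical stand-in for the Geo server: answers one request, then exits.
    public static Thread startFakeServer(ServerSocket listener) {
        Thread t = new Thread(() -> {
            try (Socket s = listener.accept();
                 ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream())) {
                out.flush(); // send the stream header before opening the input side
                ObjectInputStream in = new ObjectInputStream(s.getInputStream());
                in.readObject(); // the IP address sent by the client (ignored here)
                // Placeholder reply: ISO country, city, state, anonymous flag.
                out.writeObject(new String[] {"US", "Austin", "TX", "false"});
                out.flush();
            } catch (IOException | ClassNotFoundException e) {
                e.printStackTrace();
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    // Client side: output stream first, then input stream, with bounded waits.
    public static String[] lookup(String host, int port, String ip, int timeoutMs)
            throws IOException, ClassNotFoundException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs); // reads fail instead of hanging
            ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream());
            out.flush();
            ObjectInputStream in = new ObjectInputStream(socket.getInputStream());
            out.writeObject(ip);
            out.flush();
            return (String[]) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        try (ServerSocket listener = new ServerSocket(0)) {
            startFakeServer(listener);
            String[] data = lookup("127.0.0.1", listener.getLocalPort(), "203.0.113.7", 2000);
            System.out.println(String.join(",", data));
        }
    }
}
```

The try-with-resources block closes the socket on every path, including the exception paths, so a failed lookup does not leave a connection open.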
The code where the error is taking place is below (but this code works fine until I start calling it through the new RMI server - when it's just CF, no crashes are happening... ColdFusion calls the exact same code).
The Socket Client com.cpawins.IPValidation.CPAWinsGeoClient.getIPData(CPAWinsGeoClient.java:32)
For about 1 year, this client has run fine when called like this:
The Client is still being called that way. But, when the client is called as follows it errors or freezes intermittently:
RMI Server Object com.cpawins.offerplatform.geotarget.UserGeo.fillCountryData(UserGeo.java:61)
BTW, I have a printout happening on my RMI server console. When the RMI server stops working (it's not the Geo IP server that stops; it is the code referring to it that has problems), I get a flood of messages on screen from all the users that were trying to hit the RMI server while it was frozen. I would have expected all of those IP address requests to be gone from memory by then. But it appears that they just keep building up in the web server, and once I kill the RMI server and restart it, the web server sends through a bunch of data for users who long ago got a timeout error.
Aside from the general question of why this is happening, I am curious... do you think that if I switched the Geo IP socket to an RMI server that these problems may subside? I'm just shooting in the dark here.
Thanks for any help you can give.
Al