OK, I have opened a PMR with IBM but figured it can't hurt to try my question here too. To make a long story short, I'm getting JMS-related errors during listener port startup for my MDBs. I believe the errors are related to a previously failed global transaction.
Here is the rundown... I am implementing clustering support for my company's product for WAS 126.96.36.199. We run on Solaris/SPARC; WAS 188.8.131.52 ND and WAS 184.108.40.206 BASE machines make up the cluster. Global security is enabled, Java 2 Security is disabled, and we use an external Sun ONE LDAP server to store credentials. The problem exists with embedded JMS; I have not tested against full MQSeries yet. I have a Topic for public events and an "internal" Queue for product-specific events that only our app needs to listen for. We have several MDBs that listen to the Topic and the Queue. I have the NDM and a federated Node on one machine, and a second federated Node on another machine. Each Node has its own JMS/JDBC resources. We use JGroups to build logical clusters and for event logging failover.
Anyway, my integration team manager and I were testing some new code inside the MDB. We launched our test from the "master" Node (the one with both the NDM and a federated Node). The master server started kicking off the work and handling the request. However, when an event was put on the Queue, the "clone" Node's MDB was the one that picked it up. Because of an oversight in the code, the MDB threw an uncaught NullPointerException, causing the MDB to stop and lots of other bad stuff.
My integration team manager goes back to the code and fixes the problem area and sends me the new ear. I redeploy and all that, reset all the servers, and then this is what I see:
Immediately it looks like a transaction problem to me. So I shut down everything: the Servers, the Node Agents, and the Deployment Manager. I then erased all the tranlogs and brought everything back up. Guess what... still there!! It's only on the "clone" Node, too. The "master" starts up fine. Also note that the "clone" server's listener ports do start up, but after a minute or so these errors get thrown. Then the listener ports restart themselves and throw the error again. It just keeps looping.
I tried deleting the Cluster, all resources, basically did an uninstall and reinstall. The problem is still there. My best guess is when that null pointer was thrown in the MDB the first time something went really wrong with the XA transaction and that it is trying to finish itself. However, I have no clue how to manually finish it or cancel it besides deleting the folders found in <WAS_ROOT>/tranlog.
I did some tracing with the following strings but saw nothing obvious (of course, I don't really know what I'm looking for).
If anyone has ANY ideas please let me know. This problem is driving me nuts!