I have a communications project between two devices talking over a network.
Something will occasionally suspend all active threads in my program for up to 250ms. For lack of a better word, I am calling this time of zero execution "dead time". Network data is NOT lost during dead time. As soon as the network monitor thread gets a chance to execute, 100% of the data that arrived during the dead time will be delivered. But other threads that are waiting for (parsed) messages from the network monitor might have timeouts shorter than the dead time period. Of course, they don't actually time out until their threads get some execution time. If the network monitor thread happens to execute first, and its time slice is big enough to parse the data and publish it to the other threads, there will be no problem. But if the network monitor is NOT executing when the dead time occurs, one of these other threads is likely to time out as soon as execution resumes. It will exit with an error just before the network monitor can deliver the data.
As it turns out, my protocol is robust against lost or corrupted network data. But data that arrives horribly late, and just EXACTLY at the moment of time out, can cause the two devices to get out of sync.
I don't know for sure what causes these dead time delays. My theory is that the Garbage Collector has temporarily suspended all threads until such time as it can do its work. I don't want to micromanage my GC. The people who design those things know way more than I do. But is there any way to give it a hint? That is, is there any way to say "Hey there GC, this would be a good time for you to do your business, I am not doing anything important right now". Or perhaps the other way, "Hey there GC, until I tell you otherwise, try to keep 'dead time' delays to no more than 100ms". Or maybe I could just poll the GC after the fact when I detect the symptoms of dead time.
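For reference, the JVM does expose pieces of each of those three ideas, though all of them are hints rather than guarantees: System.gc() is a request the JVM may ignore, pause-time goals are startup flags (e.g. -XX:MaxGCPauseMillis=100 for the G1 collector), and "polling after the fact" is possible through the management API, which reports cumulative collection counts and times. A minimal sketch of that last idea (the class name here is my own, not a library one):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcProbe {
    // Returns total accumulated GC time (ms) across all collectors.
    // Sampling this before and after a suspected "dead time" window
    // shows whether the GC was actually responsible for the pause.
    public static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcTimeMillis();
        System.gc(); // only a hint; the JVM may ignore it entirely
        long after = totalGcTimeMillis();
        // The counter is cumulative, so it can never decrease.
        System.out.println(after >= before);
    }
}
```

If the sampled GC time does not grow across a dead-time episode, the pause came from somewhere else (OS scheduling, paging, etc.) and GC tuning would be wasted effort.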
If the GC is completely out of my reach, is there some way for a thread to say "Sleep until every other thread that is currently awaiting a chance to execute has a chance to execute".
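As far as I know, Java has no primitive that means "block until every other runnable thread has actually run"; the closest things are scheduler hints, which promise nothing. A sketch of both, just to show their (weak) semantics:

```java
public class YieldHint {
    public static void main(String[] args) throws InterruptedException {
        // Thread.yield() merely hints that the scheduler may run other
        // threads now; it gives no guarantee that any of them actually did.
        Thread.yield();

        // A short sleep makes the current thread non-runnable for the
        // interval, which in practice lets waiting threads proceed, but
        // it is still not a guarantee that *all* of them got CPU time.
        Thread.sleep(1);

        System.out.println("resumed");
    }
}
```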
Perhaps there is some way for one of my threads to determine how long the network monitor thread has been suspended. If it detects that the network monitor thread has been suspended for longer than my timeout, then I can extend the timeout until the network monitor thread has had a chance to execute.
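One way to approximate this without touching the GC at all is a heartbeat: the network monitor stamps a "last alive" time on every loop iteration, and a consumer whose wait just expired checks whether the monitor itself was starved before declaring the message dead. This is only a sketch under that assumption, and every name in it is hypothetical:

```java
import java.util.concurrent.TimeUnit;

public class MonitorHeartbeat {
    private volatile long lastAliveNanos = System.nanoTime();

    // Called by the network monitor thread on every loop iteration.
    public void beat() {
        lastAliveNanos = System.nanoTime();
    }

    // Called by a consumer whose timed wait just expired. Returns true
    // if the monitor has not run for longer than the consumer's timeout,
    // i.e. the timeout is suspect and should be extended, not trusted.
    public boolean monitorWasStarved(long timeout, TimeUnit unit) {
        long gap = System.nanoTime() - lastAliveNanos;
        return gap > unit.toNanos(timeout);
    }

    public static void main(String[] args) throws InterruptedException {
        MonitorHeartbeat hb = new MonitorHeartbeat();
        hb.beat();
        Thread.sleep(50); // simulate 50 ms with no monitor activity
        // With a 10 ms timeout, the 50 ms gap marks the timeout as suspect.
        System.out.println(hb.monitorWasStarved(10, TimeUnit.MILLISECONDS));
    }
}
```

Note this detects that the monitor was starved, not why; a dead time that freezes the consumer and the monitor together still needs the consumer to re-check the heartbeat after it wakes, which is exactly when this check runs.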
Ultimately I may just need to amend the protocol somehow. But distributing the new protocol is kind of tricky.
Derik Davenport wrote:As it turns out, my protocol is robust against lost or corrupted network data. But data that arrives horribly late, and just EXACTLY at the moment of time out, can cause the two devices to get out of sync.
That is, is there anyway to say "Hey there GC, this would be a good time for you to do your business...
Well, first, I would be sure that it is in fact the GC that's causing the problems; you don't want to write a lot of brittle get-around code just to solve a problem that doesn't exist.
Second, if your protocol is robust against lost data, why not write a robust receiver that regards data as either "received in full" or "lost"?
Third, since Threads are involved, is it not possible that the problem could be solved with proper synchronization?
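One concrete form that synchronization could take (a sketch, not a prescription for this particular program) is waiting against an absolute deadline over a BlockingQueue rather than a single relative timeout, so early returns lead to a re-poll instead of a premature "message is dead" verdict:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class DeadlineWait {
    // Waits for a message until an absolute deadline, re-polling after
    // early returns; null signals a genuine timeout. Extending the
    // deadline (e.g. after detecting monitor starvation) just means
    // passing a later deadlineNanos.
    static String awaitMessage(BlockingQueue<String> queue, long deadlineNanos)
            throws InterruptedException {
        long remaining;
        while ((remaining = deadlineNanos - System.nanoTime()) > 0) {
            String msg = queue.poll(remaining, TimeUnit.NANOSECONDS);
            if (msg != null) {
                return msg;
            }
        }
        return null; // genuine timeout
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.put("hello");
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(100);
        System.out.println(awaitMessage(queue, deadline));
    }
}
```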
Only suggestions, but I would definitely look at other solutions before you start gc-tweaking, because anything on that front is likely to be (a) error-prone and (b) arcane.
Others may disagree, of course.
I agree that writing code that targets the GC is probably a bad idea, especially when I don't even know that the GC is responsible. What I really need is some way for a thread, on detecting a missing-data event, to allow other waiting threads to execute.
why not write a robust receiver that regards data as either "received in full" or "lost"
In fact, I do have such a routine, but it is possible for it to get tricked. I know the detailed explanation above is hard to read, so here is the short version.
The thread that is actually going to process the data only waits for so long. It has to time out eventually or my system would become non-responsive every time a packet got lost on the network. Question: What happens if the thread that times out does so at the precise moment that the message arrives? Answer: As it times out, it declares the message it is waiting for to be dead. That is, it makes the erroneous assumption that it will never arrive. It then requests a second copy. Unfortunately, that first message is not really dead. It is held in limbo by the dead time. As soon as the network monitor thread starts, that limbo message gets pulled out of limbo and dropped on the message queue. But by that time, the first thread has already sent a request for a second copy, and it confuses that limbo-late message with the second copy that it was requesting. In fact, due to an oversight in the protocol, it has no way to distinguish the limbo-late message from the second message that it requested. It thinks all is well, but in reality there is one more message on the message queue (which is, by the way, fully synchronized) than it thinks is on there. Presto! I am out of sync.
The correct solution to this problem involves a change in the protocol. If I serialized every message, I could tell whether the message I was looking at was the actual message I was waiting for, or one that arrived very late due to a dead time detour to limbo land. But the correct solution can't be easily implemented on existing devices. So I am looking for a workaround that will let me support legacy devices until a new protocol can be deployed.
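For whatever it is worth when the protocol does get revised, the serialization fix is small: tag each request with a sequence number, have the reply echo it, and drop any reply whose number doesn't match the request currently outstanding. This is a sketch of the idea only, with hypothetical names, and it ignores the real work of coordinating the change between devices:

```java
import java.util.concurrent.atomic.AtomicLong;

public class SequencedProtocol {
    private final AtomicLong nextSeq = new AtomicLong();
    private long awaitedSeq = -1; // single consumer assumed in this sketch

    // Issue a new request; remember which reply we now expect.
    public long newRequest() {
        awaitedSeq = nextSeq.incrementAndGet();
        return awaitedSeq;
    }

    // A stale (limbo-late) reply carries a sequence number older than
    // the one we are waiting for, so it can be recognized and dropped
    // instead of being mistaken for the re-requested copy.
    public boolean accept(long replySeq) {
        return replySeq == awaitedSeq;
    }

    public static void main(String[] args) {
        SequencedProtocol p = new SequencedProtocol();
        long first = p.newRequest();   // seq 1 times out during dead time...
        long second = p.newRequest();  // ...so we request a second copy (seq 2)
        System.out.println(p.accept(first));  // limbo-late reply: rejected
        System.out.println(p.accept(second)); // the copy we asked for: accepted
    }
}
```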