Hi all,
I have an issue with a Windows Service I wrote. This issue only comes up on one particular client's server and roughly once a day. The service is installed on about 40 servers, but this is the only place it happens :-S
The issue: The service simply stops doing what it's supposed to be doing. It doesn't write anything to the eventlog - even though I've made it so any exceptions will be written there.
Below is the main clump of code that this service consists of. It's quite simple. To break it down, the service starts a few threads each minute that measure disk space, available memory and so on. A lot of the writing to the eventlog is debugging output I added while trying to track the issue down.
    private static void StartProbing()
    {
        int updateCount = 2;
        bool error = false;
        while (!Stop)
        {
            updateCount++;
            try
            {
                if (error)
                    EventLog.WriteEntry("MyService", "Trying again. updateCount = " + updateCount, EventLogEntryType.Information);
                if (updateCount >= 2)
                {
                    EventLog.WriteEntry("MyService", "Trying to get config.", EventLogEntryType.Information);
                    GetConfig();
                    updateCount = 0;
                    if (error)
                    {
                        error = false;
                        EventLog.WriteEntry("MyService", "Successfully connected to server.", EventLogEntryType.Information);
                    }
                }
                foreach (Task task in tasks)
                {
                    Thread t = null;
                    switch (task.Type.ToLower())
                    {
                        case "ping":
                            t = new Thread(new ParameterizedThreadStart(_Ping));
                            break;
                        case "diskcheck":
                            t = new Thread(new ParameterizedThreadStart(_DiskCheck));
                            break;
                        case "cpucheck":
                            t = new Thread(new ParameterizedThreadStart(_CpuCheck));
                            break;
                        case "memcheck":
                            t = new Thread(new ParameterizedThreadStart(_MemCheck));
                            break;
                        case "proccheck":
                            t = new Thread(new ParameterizedThreadStart(_ProcCheck));
                            break;
                    }
                    if (t != null)
                        t.Start(task);
                }
                if (Stop)
                    break;
                Thread.Sleep(1000 * 60);
            }
            catch (Exception x)
            {
                error = true;
                EventLog.WriteEntry("MyService", "Probing loop: " + x.Message, EventLogEntryType.Error);
                Thread.Sleep(5000);
            }
        }
        if (Stop)
            EventLog.WriteEntry("MyService", "Stopping probing. updateCount = " + updateCount, EventLogEntryType.Error);
    }

6 answers
Well, that's telling us that a connection failure is occurring in the GetConfig() method. On the third and fourth updates, the failure is being caught as an exception in the StartProbing() method but on the fifth update something is happening which is not being caught and is hanging the service.
Assuming you agree with this analysis, I think you're going to have to include some more debugging code to identify on which line the failure is occurring.
On the face of it, the code is progressing past this line in GetConfig() on the fifth update otherwise an exception would be thrown again:
but I'd check that and also whether it's getting past these lines as well:
My gut feeling is that this is a connection problem rather than a 'shared state' threading problem with your Task objects because, if it were the latter, you'd be seeing similar problems on the other servers.
When you examine the logs for the 'successful' servers, do you see any evidence of repeated connection failures there?
answered one year ago by:
17279
2499
Every once in a while, a connection to another (random) service will fail and then immediately succeed. For some reason, this server stays down, so I think you're right about it being a connection failure. I'll add some logging to the GetConfig, SendData and ReceiveData methods and we'll have more for the next crash :-P
2499
It's also worth noting that there was a connection failure in the log on this server just before it died that went up to 20 retries and then succeeded...
17279
I wondered whether there might be some significance in 5 retries but evidently there isn't.
2499
update: this morning I noticed that the log stopped when the service was trying to receive data. I therefore added a receive-timeout of 5 seconds to make sure it doesn't hang there forever. I'll let it run (with logging) and see where we're at
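For readers following along, here is a minimal sketch of what a receive timeout like that could look like. The thread does not show the actual ReceiveData method, so everything here is an assumption: it presumes the connection is a TcpClient, and the class and method names (ReceiveTimeoutSketch, TryReceive) are made up for illustration.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

static class ReceiveTimeoutSketch
{
    // Hypothetical helper, not the poster's code.
    // Returns false if nothing arrives within timeoutMs instead of blocking forever.
    public static bool TryReceive(TcpClient client, int timeoutMs, out string data)
    {
        // Without a timeout, a blocking Read can wait indefinitely
        // when the remote end goes silent without closing the connection.
        client.ReceiveTimeout = timeoutMs;

        NetworkStream stream = client.GetStream();
        byte[] buffer = new byte[4096];
        try
        {
            int read = stream.Read(buffer, 0, buffer.Length);
            data = Encoding.UTF8.GetString(buffer, 0, read);
            return true;
        }
        catch (IOException)
        {
            // NetworkStream.Read reports an elapsed receive timeout as an IOException.
            data = "";
            return false;
        }
    }
}
```

In a loop like StartProbing, a false result (or simply letting the IOException propagate) would send the code down the existing retry path rather than leaving the thread stuck in the read.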
17279
Sounds promising. If that is the problem, then it's easy to fix :)
2499
The service hasn't failed for about a week now :) so this seems to have solved it :D
17279
Thank goodness for that :)
This is the sort of problem where you can look at the code until you're blue in the face but still not find anything!
If nothing's being written to the event log, then my guess is that a deadlock is occurring between two (or more) of your threads which is causing the service to hang. There may be some aspect of the server on which this occurs which makes it more susceptible to deadlocks than the others though there's always the chance that it may eventually occur on one of the other servers as well.
If the threads are accessing any shared state, I'd have a look at your locks and which objects they're synchronized on to see if you can think of any circumstances where a deadlock might occur, i.e. where threads are waiting for each other to release a lock.
EDIT
No sign of any shared state or locks in that code :)
The only thing I can think of which could conceivably cause the service to hang is when you create a new PerformanceCounter. These, of course, use underlying system resources and so if, for some reason, the OS is unable to satisfy the resource allocation immediately, the thread might hang.
I'd try writing something to the log just before the PerformanceCounter is 'newed' and then immediately afterwards.
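A rough sketch of that kind of bracketing, in case it helps. The helper name LogAround and the class CreationLogging are invented for illustration; the logging sink is passed in so the same helper could write to the event log or anywhere else.

```csharp
using System;

static class CreationLogging
{
    // Hypothetical helper: logs immediately before and after a call that might block.
    // If the log ends on a "Before" line with no matching "After", the call never returned.
    public static T LogAround<T>(string label, Func<T> create, Action<string> log)
    {
        log("Before: " + label);
        try
        {
            T result = create();
            log("After: " + label);
            return result;
        }
        catch (Exception x)
        {
            // Record the failure, then let the caller's own handling see the exception.
            log("Failed: " + label + ": " + x.Message);
            throw;
        }
    }
}
```

In the service, the create argument would be something like () => new PerformanceCounter(...) with whatever category and counter names the real code uses, and the log argument a lambda that calls EventLog.WriteEntry("MyService", ...).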
answered one year ago by:
17279
The threads are (or should be) quite independent. This is an example of the code that starts a cpucheck:
Any ideas for further debugging? :)
answered one year ago by:
2499
17279
Please see my edit for a further idea.
I've just installed a version with that kind of logging, now we just have to wait for it to fail :-P
answered one year ago by:
2499
17279
Good luck with that!
It seems my threads were sharing some data after all, though only with the main thread. This is the workflow of my service:
Every minute: Start some threads to do work. These take in an object (instance of the Task class) as a parameter.
Every other minute: Ask the server for some work. This updates the collection of tasks.
I think that while the main thread was updating my tasks collection, another thread was trying to read an object in that collection. This is apparently a no-go, so I made this change:
And the service hasn't failed since :D
Thanks for the help :)
answered one year ago by:
2499
17279
Yes, List<Task>, if that's what you're using, is not thread-safe. I still don't understand why this problem should only manifest itself on one particular server but, no matter, if it's working OK now:)
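To illustrate the general idea (this is not the change the poster actually made, which isn't shown in the thread), one common way to keep a shared list safe is to replace it wholesale under a lock and hand out private copies to readers. ProbeTask and TaskStore below are hypothetical stand-ins for the service's own Task class and task collection.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the service's Task class.
class ProbeTask
{
    public string Type = "";
}

static class TaskStore
{
    private static readonly object Sync = new object();
    private static List<ProbeTask> tasks = new List<ProbeTask>();

    // Called by the main loop after fetching new config.
    // The list is swapped as a whole rather than edited in place,
    // so readers never see a half-updated collection.
    public static void Replace(IEnumerable<ProbeTask> fresh)
    {
        var copy = new List<ProbeTask>(fresh);
        lock (Sync)
        {
            tasks = copy;
        }
    }

    // Returns a private copy that can be enumerated without racing Replace.
    public static List<ProbeTask> Snapshot()
    {
        lock (Sync)
        {
            return new List<ProbeTask>(tasks);
        }
    }
}
```

With something like this, the probing loop would iterate over TaskStore.Snapshot() when starting worker threads, and the config refresh would build fresh task objects and pass them to Replace, so a worker that is still running keeps its own unchanged object.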
2499
That's a mystery to me too. However, it seems that this particular server is rather overloaded so I guess that could be why :)
2499
blast, the damn thing just failed again. Any more ideas? :-S
17279
Can you tell from the log where the code is failing?
2499
Unfortunately, in my glee I removed the extensive logging. I've just re-enabled it so I should get some better ideas next time this thing crashes and burns.
This is the code I'm currently using (The whole damn thing :D)
This is the output from my log when it went down last night:
And that is the end of the log ... it just stops :-/
answered one year ago by:
2499
2499
Oh, and the service (and thereby the .exe file) was still running this morning, apparently not doing anything.