Monitored (Watchdog) Asynchronous Process in C#
I need to know whether a long-running process is still alive. If a process is no longer alive (it has hung, crashed, or entered an infinite loop), it should be restarted. Eventually there will be multiple long-running processes, and they will need to run concurrently. This post covers one implementation that meets these requirements.
Final Utilization
To help clarify the requirements, let’s look at how I want to use the code. Here is a console application that utilizes a monitored process:
    using System;

    class Program
    {
        static Monitored1 monitored;

        static void Main(string[] args)
        {
            monitored = StartNewProcess();

            Console.WriteLine("Press any key to stop process...");
            Console.ReadKey();
            monitored.Stop();

            Console.WriteLine("Press any key to exit...");
            Console.ReadKey();
        }

        static void monitored_NoContactAlarm()
        {
            Console.WriteLine("No contact alarm!");
            Console.WriteLine("Restarting");
            monitored.Stop();
            monitored = StartNewProcess();
        }

        static void monitored_RunComplete()
        {
            Console.WriteLine("Task has completed.");
        }

        static Monitored1 StartNewProcess()
        {
            monitored = new Monitored1();
            monitored.NoContactAlarm += new MonitoredAsyncRunBase.NoContactAlarmHandler(monitored_NoContactAlarm);
            monitored.RunComplete += new MonitoredAsyncRunBase.RunCompleteHandler(monitored_RunComplete);
            monitored.Start();
            return monitored;
        }
    }
Notice that in the Main method I create an instance of Monitored1. Monitored1 is a concrete implementation of a monitored asynchronous process. In StartNewProcess, I do three things: construct a new monitored object, connect callbacks to two events (covered later), and start the long-running process. My particular process will run forever (it will be a service), hence the Stop method call in Main.
The two events that are received provide triggers for when the process has a “no contact” alarm (the process is no longer alive) and when the long running process completes. The NoContactAlarm event allows the process to be stopped and restarted. The RunComplete event allows post processing to occur. Let’s look at some implementation details.
Base Class Implementation:
First, let me introduce two interfaces that I will use in this implementation:

    public interface IMonitored
    {
        bool IsAlive { get; }
    }

    public interface IRunnable
    {
        bool Run();
    }
The IMonitored interface will be implemented to provide a property that indicates whether the process is alive or not. The IRunnable interface will provide the implementation for the long running process.
The intent is to create an abstract base class that provides the ability to start and monitor the long running process. Here is the declaration for the abstract base class:
    public abstract class MonitoredAsyncRunBase : IRunnable, IMonitored
    {
        public DateTime LastContact { get; set; }
        public TimeSpan MaxTimeExpectedBetweenContacts { get; set; }

        public MonitoredAsyncRunBase(TimeSpan maxTimeExpectedBetweenContacts)
        {
            MaxTimeExpectedBetweenContacts = maxTimeExpectedBetweenContacts;
        }

        public MonitoredAsyncRunBase()
        {
            MaxTimeExpectedBetweenContacts = new TimeSpan(0, 0, 10);
        }

        public bool IsAlive
        {
            get
            {
                TimeSpan span = DateTime.Now - LastContact;
                if (span > MaxTimeExpectedBetweenContacts)
                {
                    return false;
                }
                return true;
            }
        }

        public abstract bool Run();
    }
First notice that the abstract base class implements the IRunnable interface using an abstract method. This pushes the specific implementation of the Run method to the class that derives from this base class. The class provides two properties (LastContact, MaxTimeExpectedBetweenContacts) that are used to determine if the long running process is alive. The implementation of the IMonitored interface provides the logic to determine if this process is alive.
This implementation will rely on the long running process to periodically update the LastContact date. Essentially the long running process must send a ping (or a heartbeat) before the max time expected TimeSpan expires. The following method is provided to simplify that update:
    public void SendPing()
    {
        LastContact = DateTime.Now;
    }
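One caveat with the ping approach: LastContact is written by the worker thread and read by the timer thread, and DateTime.Now can jump when the system clock or daylight saving time changes. The following is a minimal sketch of an alternative, not the code used in this post. It assumes a hypothetical Heartbeat helper class that stores a Stopwatch timestamp and accesses it through Interlocked. Unlike the base class above, it treats construction as the first contact.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper (not part of the original post): a heartbeat that can be
// pinged from the worker thread and checked from the watchdog thread.
public sealed class Heartbeat
{
    private readonly long _maxSilenceTicks; // allowed silence, in Stopwatch ticks
    private long _lastPing;                 // Stopwatch timestamp of the last ping

    public Heartbeat(TimeSpan maxSilence)
    {
        _maxSilenceTicks = (long)(maxSilence.TotalSeconds * Stopwatch.Frequency);
        // Assumption: creating the heartbeat counts as the first contact.
        _lastPing = Stopwatch.GetTimestamp();
    }

    public void SendPing()
    {
        // Atomic 64-bit write, even on 32-bit platforms.
        Interlocked.Exchange(ref _lastPing, Stopwatch.GetTimestamp());
    }

    public bool IsAlive
    {
        get
        {
            long elapsed = Stopwatch.GetTimestamp() - Interlocked.Read(ref _lastPing);
            return elapsed <= _maxSilenceTicks;
        }
    }
}

public static class HeartbeatDemo
{
    public static void Main()
    {
        var heartbeat = new Heartbeat(TimeSpan.FromMilliseconds(200));
        Console.WriteLine(heartbeat.IsAlive); // just created, inside the window

        Thread.Sleep(400);
        Console.WriteLine(heartbeat.IsAlive); // silent for longer than the window

        heartbeat.SendPing();
        Console.WriteLine(heartbeat.IsAlive); // a ping resets the window
    }
}
```

The same idea could be folded into the base class by replacing the LastContact property with a private timestamp field. I have kept it separate here so the sketch runs on its own.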
We discussed the events that are provided to allow for callbacks to be triggered when the process finishes or is no longer alive. These are implemented with the following bit of code:
    public delegate void NoContactAlarmHandler();
    public event NoContactAlarmHandler NoContactAlarm;

    public delegate void RunCompleteHandler();
    public event RunCompleteHandler RunComplete;
To start the monitored process, the following code is used:
    private Thread _thread;
    private System.Timers.Timer _timer;

    // volatile: written by the thread that calls Stop, read by the worker thread.
    private volatile bool _stopThread = false;

    public bool Start()
    {
        if (_thread != null)
        {
            return false;
        }

        _timer = new System.Timers.Timer(MaxTimeExpectedBetweenContacts.TotalMilliseconds);
        _timer.Elapsed += new System.Timers.ElapsedEventHandler(_timer_Elapsed);
        _timer.Start();

        _stopThread = false;
        _thread = new Thread(new ThreadStart(ThreadProcess));
        _thread.Start();
        return true;
    }

The code first checks to ensure a thread does not currently exist. If one does, then a new one is not allowed to be started and the method returns false.
A System.Timers.Timer object is created with its interval set to the maximum time expected between contacts. The timer raises its Elapsed event on a thread from the thread pool, which lets a “watch dog” callback check the IsAlive property and raise events. The following timer callback raises the NoContactAlarm event if necessary:
    private void _timer_Elapsed(object sender, System.Timers.ElapsedEventArgs e)
    {
        if (!IsAlive)
        {
            if (NoContactAlarm != null)
            {
                NoContactAlarm();
            }
        }
    }
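With AutoReset left at its default of true, a System.Timers.Timer keeps raising Elapsed every interval, and handlers can overlap on different pool threads if one of them runs longer than the interval. A callback like the one above can therefore raise the alarm more than once for the same hung process. Here is a minimal sketch of one way to guard against that. It assumes a hypothetical OneShotWatchdog class (not part of this post's code) that takes the liveness check as a delegate and uses Interlocked.Exchange so that only the first failed check raises the alarm.

```csharp
using System;
using System.Threading;

// Hypothetical class (not part of the original post): raises NoContactAlarm
// at most once, even if several Elapsed callbacks see a dead process.
public sealed class OneShotWatchdog : IDisposable
{
    private readonly System.Timers.Timer _timer;
    private readonly Func<bool> _isAlive;
    private int _alarmRaised; // 0 = not raised yet, 1 = raised

    public event Action NoContactAlarm;

    public OneShotWatchdog(TimeSpan interval, Func<bool> isAlive)
    {
        _isAlive = isAlive;
        _timer = new System.Timers.Timer(interval.TotalMilliseconds);
        _timer.Elapsed += OnElapsed;
    }

    public void Start()
    {
        _timer.Start();
    }

    private void OnElapsed(object sender, System.Timers.ElapsedEventArgs e)
    {
        if (_isAlive())
        {
            return;
        }

        // Only the first callback to flip the flag from 0 to 1 continues.
        if (Interlocked.Exchange(ref _alarmRaised, 1) != 0)
        {
            return;
        }

        _timer.Stop();

        // Copy the delegate so a concurrent unsubscribe cannot null it mid-call.
        Action handler = NoContactAlarm;
        if (handler != null)
        {
            handler();
        }
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}

public static class WatchdogDemo
{
    public static void Main()
    {
        int alarms = 0;

        // The liveness check always reports "dead" to force the alarm path.
        using (var watchdog = new OneShotWatchdog(TimeSpan.FromMilliseconds(50), () => false))
        {
            watchdog.NoContactAlarm += () => Interlocked.Increment(ref alarms);
            watchdog.Start();
            Thread.Sleep(500); // long enough for several timer intervals
        }

        Console.WriteLine("Alarms raised: " + alarms);
    }
}
```

The demo lets several intervals elapse with a check that always fails and then reports how many times the alarm handler ran.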
The Start method then creates a new System.Threading.Thread object that is used to run the long running process. Long running processes should not be run on the thread pool. The following method implements the thread:
    private void ThreadProcess()
    {
        try
        {
            while (!_stopThread)
            {
                Run();
            }
        }
        catch
        {
            /* Provide logging... */
        }

        // Stop the timer.
        if (_timer != null)
        {
            _timer.Stop();
        }

        // Raise the run complete event.
        if (RunComplete != null)
        {
            RunComplete();
        }
    }

As you can see from this code, the IRunnable implementation provides an operation that is repeated over and over in a while loop. This operation provides a long-running “service”. The only requirement on the IRunnable implementation is that it calls the SendPing method before the no-contact time span expires. The service can be stopped after any completed Run cycle by setting the _stopThread boolean to true. The loop is wrapped in a try/catch block so that an unhandled exception in the Run code ends the loop instead of crashing the application. The final bit of code in the thread process stops the timer and then raises the RunComplete event.
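Note that in the listing above the try/catch surrounds the whole while loop, so the first exception thrown by Run ends the loop and RunComplete is raised. If the intent is instead to keep cycling after a failed Run, the try/catch can move inside the loop. The following is a minimal sketch of that variation with hypothetical names (ResilientLoopDemo, RunCycles) and a fixed number of cycles in place of the _stopThread flag, so that it runs on its own.

```csharp
using System;

public static class ResilientLoopDemo
{
    // Hypothetical helper: runs `cycle` maxCycles times. A cycle that throws
    // is logged and counted, and the loop carries on with the next cycle.
    public static int RunCycles(Func<int, bool> cycle, int maxCycles)
    {
        int failures = 0;
        for (int i = 1; i <= maxCycles; i++)
        {
            try
            {
                cycle(i);
            }
            catch (Exception ex)
            {
                failures++;
                Console.WriteLine("Cycle " + i + " failed: " + ex.Message);
            }
        }
        return failures;
    }

    public static void Main()
    {
        int failures = RunCycles(i =>
        {
            if (i == 2)
            {
                throw new InvalidOperationException("simulated fault");
            }
            Console.WriteLine("Round " + i);
            return true;
        }, 3);

        Console.WriteLine("Failures: " + failures);
        // Output:
        // Round 1
        // Cycle 2 failed: simulated fault
        // Round 3
        // Failures: 1
    }
}
```

Which behavior is right depends on the service: ending the loop on the first fault hands the decision to the RunComplete handler, while catching per cycle keeps the service going and relies on logging.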
Stopping the process is provided with the following method:
    public bool Stop()
    {
        if (_thread != null && _thread.IsAlive)
        {
            // Try to stop by allowing the thread to stop
            // on its own.
            _stopThread = true;
            if (!_thread.Join(3000))
            {
                // Tried to avoid this, but must abort. This
                // has the potential to leave objects that cannot
                // be reclaimed by the GC and/or objects in an
                // unknown state.
                Console.WriteLine("Thread.Abort called...ouch!");
                _thread.Abort();
            }
            _thread = null;
        }
        return true;
    }
This code attempts to stop the thread by setting the _stopThread boolean and waiting for the thread to finish. If this does not work, then the Abort method is called on the _thread object. I tried to avoid this because it has known issues. If anyone has a better implementation, please comment.
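One possible alternative is cooperative cancellation with a CancellationTokenSource. This matters on newer runtimes, because Thread.Abort is not supported on .NET Core and .NET 5 or later, where it throws PlatformNotSupportedException. The following is a minimal sketch and not a drop-in replacement for the base class. CancellableWorker is a hypothetical name, and the sketch assumes the work can check the token or wait on its handle between steps.

```csharp
using System;
using System.Threading;

// Hypothetical class (not part of the original post): a worker thread that
// is stopped by signaling a token instead of calling Thread.Abort.
public sealed class CancellableWorker
{
    private readonly CancellationTokenSource _cts = new CancellationTokenSource();
    private readonly Thread _thread;

    public CancellableWorker()
    {
        _thread = new Thread(Loop);
        // Background thread: a worker that never returns cannot keep the
        // application alive after Main exits.
        _thread.IsBackground = true;
    }

    public void Start()
    {
        _thread.Start();
    }

    private void Loop()
    {
        CancellationToken token = _cts.Token;
        while (!token.IsCancellationRequested)
        {
            Console.WriteLine("Working...");

            // Stands in for one second of work. Unlike Thread.Sleep, this
            // wait ends as soon as cancellation is requested.
            token.WaitHandle.WaitOne(TimeSpan.FromSeconds(1));
        }
    }

    // Returns true if the worker finished within the timeout.
    public bool Stop(TimeSpan timeout)
    {
        _cts.Cancel();
        return _thread.Join(timeout);
    }
}

public static class CancellableWorkerDemo
{
    public static void Main()
    {
        var worker = new CancellableWorker();
        worker.Start();
        Thread.Sleep(100);

        bool stopped = worker.Stop(TimeSpan.FromSeconds(3));
        Console.WriteLine("Stopped cleanly: " + stopped);
    }
}
```

This only helps when the work reaches a point where it observes the token. A thread stuck in a genuine infinite loop still cannot be stopped this way, which is the case where isolating the work in a separate process would be the more robust option.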
Concrete Implementation:
The following is a concrete (albeit very simple) class derived from MonitoredAsyncRunBase:

    using System;
    using System.Threading;

    public class Monitored1 : MonitoredAsyncRunBase
    {
        int i = 0;

        public override bool Run()
        {
            // Simulate a long running task that needs to be
            // monitored.
            i++;
            Console.WriteLine("Round " + i.ToString());
            Thread.Sleep(1000);

            // Ping only during the first two cycles, then go
            // silent to simulate a hung process.
            if (i <= 2)
            {
                SendPing();
            }
            return true;
        }
    }

The above code simply simulates a “service” whose cycle takes 1 second to complete. The first two cycles call SendPing to indicate that the process is still alive. Later cycles do not call SendPing, which simulates a hung thread. The following is the console output for this simple example:
Not super exciting, but here is what you can see:
- Each cycle (1 second) of the Run implementation writes a “Round X” line.
- The “watch dog” timer is being triggered every 10 rounds (10 seconds).
- The first “watch dog” event found the process still alive. This occurs because the process last called SendPing after Round 2. So at the first “watch dog” event, the time since the last contact was roughly 8 seconds (less than the 10 seconds required for the no contact event to be raised).
- The second “watch dog” event found the process not alive. The no contact event was raised. The thread was stopped and then restarted by the code in the no contact event handler.
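The arithmetic behind those two watchdog checks can be sketched as follows. This is an illustration only: it assumes the process starts at t = 0, the last ping lands at t = 2 seconds, and the timer fires at exactly 10 and 20 seconds, and it uses a hypothetical IsAliveAt helper that applies the same comparison as the IsAlive property.

```csharp
using System;

public static class WatchdogTimeline
{
    // Same rule as the IsAlive property: dead only when the gap since the
    // last contact is greater than the allowed maximum.
    public static bool IsAliveAt(TimeSpan now, TimeSpan lastContact, TimeSpan maxGap)
    {
        return (now - lastContact) <= maxGap;
    }

    public static void Main()
    {
        TimeSpan maxGap = TimeSpan.FromSeconds(10);
        TimeSpan lastPing = TimeSpan.FromSeconds(2); // assumed: ping after Round 2

        for (int tick = 1; tick <= 2; tick++)
        {
            TimeSpan now = TimeSpan.FromSeconds(10 * tick); // timer fires every 10 s
            double gap = (now - lastPing).TotalSeconds;
            Console.WriteLine("t=" + now.TotalSeconds + "s gap=" + gap + "s alive="
                + IsAliveAt(now, lastPing, maxGap));
        }
        // Output:
        // t=10s gap=8s alive=True
        // t=20s gap=18s alive=False
    }
}
```

In the real run the timings are only approximate, since each cycle takes slightly more than one second, but the gaps stay well clear of the 10 second threshold on both checks.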
Summary:
This code is far from production ready, but it provides a basis for wrapping long-running processes in a monitored environment capable of restarting a process if necessary. Having each long-running process responsible for reporting its own status seems logical. I am not sure an outside observer could ever determine whether a thread was stuck in an infinite loop or just busy processing. I have thought about running each long-running process in its own app domain, but at this point I don’t think the added complexity is worth it.
Nice, thanks! One small remark, though: _timer should be stopped first when calling Stop(); otherwise NoContactAlarm could easily be raised multiple times, starting the process multiple times.
Hi,
Nice article..
Can you please email me this code?
silverzen007@gmail.com
I want to monitor different wcf services, I have no idea where to start, any suggestions?
Hi. Your article is very interesting, but without source code ;-(
Can you please email me this code?