The Watchdog Concept for Kettle was presented at the Pentaho Community Event in Cascais, Portugal in September 2010. The idea came to me while I was creating the operational patterns for the Pentaho Data Integration for Database Developers training course: I combined them with solutions that electronics uses for these types of problems (e.g. reliably detecting software or hardware errors) and applied immune-aware development concepts.
What’s the idea behind this?
When you want to check whether your Kettle jobs and transformations execute correctly, you can use operational patterns such as:
- Health Checks for the JVM
- Health checks for clustered environments and failover of a master server
- Analyze log entries
- Use restartable solutions
- and others that are described in more detail in the Pentaho PDI training class
When you need a different approach for more safety and want to measure additional values, you can use the concept of a Watchdog and combine it with the normal patterns. One use case is building a highly critical system (imagine a rocket flying into space): you engage different software developers, use different measuring units and different computers to test that everything is correct. When one of the units fails, you can signal an event and, for example, abort the launch. The Kettle world is usually not that critical, but Kettle is used more and more in systems that need very high reliability and 24×7 operation with high, continuously guaranteed throughput.
With a Kettle solution you may need to check many components: the Data Integration Server, Carte and clustered environments, Kitchen/Pan jobs and transformations, and deadlocks, as well as external components such as the JVM (e.g. memory, CPU usage), the server (e.g. a network ping test) and the databases (up and running, capacity, etc.).
You also need signal-to-noise detection: define what is normal (noise), define what is unusual and notify on those exceptions (e.g. by setting thresholds that are absolute, or relative, possibly to an average), and define the events to trigger (e.g. actions, notifications and alerts). All these checks share one primary constraint: the measurements must have a minimal footprint and impact on the system.
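The threshold idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify`, its parameters `abs_max` and `rel_factor`, and the sample numbers are all hypothetical, and a real solution could express the same rule with Kettle steps.

```python
from statistics import mean

def classify(value, history, abs_max=None, rel_factor=None):
    """Return "alert" if value breaks a threshold, otherwise "normal".

    abs_max    -- hypothetical absolute upper limit for the measured value
    rel_factor -- hypothetical multiplier applied to the average of history
    """
    # Absolute threshold: a fixed upper limit.
    if abs_max is not None and value > abs_max:
        return "alert"
    # Relative threshold: compare against the average of earlier measurements.
    if rel_factor is not None and history:
        if value > rel_factor * mean(history):
            return "alert"
    return "normal"

# Made-up run times in seconds for earlier executions of one job.
previous_runs = [100, 110, 90]
print(classify(120, previous_runs, abs_max=500, rel_factor=1.5))
print(classify(200, previous_runs, abs_max=500, rel_factor=1.5))
```

Only values classified as "alert" would lead to a notification; everything else is treated as noise.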
Why can a Watchdog help here?
One solution for checking if everything is on track is to use the concept of a Watchdog with tasks and events:
A software crash might go undetected by conventional watchdog strategies when you have multitasking (e.g. many PDI jobs and cluster nodes). The success lies in weaving the watchdog into the fabric of all of the system’s tasks, which is much easier than it sounds:
- Build a watchdog task
- Create a data structure (database table) that has one entry per task
- When a task starts, it increments its entry in the structure. Tasks that start only once and stay active forever can increment the appropriate value each time through their main loops, e.g. every 10,000 rows.
- As the job or transformation runs, the count for each task advances.
- Infrequently but at regular intervals the watchdog runs.
- The watchdog scans the structure, checking that the count stored for each task is reasonable. One that runs often should have a high count; another which executes infrequently will produce a smaller value.
- If the counts are unreasonable, halt and let the watchdog time out and fire an event. If everything is OK, set all of the counts to zero and exit.
This concept is derived from Jack Ganssle (2004): Great Watchdogs (especially the section WDTs for Multitasking).
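The steps listed above can be sketched in Python with an in-memory SQLite table. This is a minimal sketch, not the Kettle implementation: the table `watchdog_task`, its columns, the function names and the task names are all assumptions made for illustration, and returning the list of suspicious tasks stands in for firing an event.

```python
import sqlite3

# Hypothetical schema: one row per task, holding its counter and the range
# of counts considered reasonable for one watchdog interval.
SCHEMA = """
CREATE TABLE watchdog_task (
    task_name TEXT PRIMARY KEY,
    heartbeat INTEGER NOT NULL DEFAULT 0,
    min_count INTEGER NOT NULL,
    max_count INTEGER NOT NULL
)
"""

def register_task(conn, task_name, min_count, max_count):
    """Create the entry for one task."""
    conn.execute(
        "INSERT INTO watchdog_task (task_name, min_count, max_count) "
        "VALUES (?, ?, ?)",
        (task_name, min_count, max_count),
    )

def beat(conn, task_name):
    """Called by a task when it starts, or once per pass of its main loop."""
    conn.execute(
        "UPDATE watchdog_task SET heartbeat = heartbeat + 1 "
        "WHERE task_name = ?",
        (task_name,),
    )

def run_watchdog(conn):
    """Return tasks with unreasonable counts; reset all counts if none."""
    suspicious = conn.execute(
        "SELECT task_name, heartbeat FROM watchdog_task "
        "WHERE heartbeat < min_count OR heartbeat > max_count "
        "ORDER BY task_name"
    ).fetchall()
    if not suspicious:
        # Everything is OK: set all of the counts to zero.
        conn.execute("UPDATE watchdog_task SET heartbeat = 0")
    return suspicious

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
register_task(conn, "load_orders", 5, 20)     # runs often
register_task(conn, "nightly_export", 0, 1)   # runs infrequently
for _ in range(8):
    beat(conn, "load_orders")
print(run_watchdog(conn))  # first interval: all counts in range
print(run_watchdog(conn))  # second interval: load_orders never reported in
```

In the second interval `load_orders` has not incremented its entry since the reset, so the watchdog reports it. A per-task range is one possible way to express "reasonable"; the original concept leaves that definition open.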
An example implementation with Kettle
Most of the Kettle health checks can be accomplished with this Watchdog concept. An example implementation with Kettle is task oriented, not server oriented. This means it checks whether the task (a transformation or job) is running as expected, independently of the environment it runs in (e.g. clustered or not).
Since Kettle has a wide variety of transformation steps and job entries, I often tend not to program in a classical programming language but to solve this solely with Kettle.
Here is an example of the event handling, which can be altered very flexibly by adding new events:
And here is an example of the Watchdog step logic described above. Sooner or later this could be simplified by building some Kettle steps that encapsulate this logic, but for now it just works: