watchdog_agent
Differences
This shows you the differences between two versions of the page.
| watchdog_agent [2019/06/10 18:02] – lmoore | watchdog_agent [2021/11/25 17:05] (current) – 10.5.5.89 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | **Watchdog Agent** | ||
| + | **Alarm** | ||
| + | |||
| + | CRITICAL - WatchdogAgent : errigalWatchdogStateApplicationFailureNotification - WatchdogApplicationFailureAlarm. | ||
| + | |||
| + | |||
| + | **Context** | ||
| + | |||
| + | This is one of the most important Watchdog alerts you will see. When a Watchdog cannot start, it generates the Watchdog Agent alert. The Watchdog Agent is part of a Watchdog. | ||
| + | |||
| + | |||
| + | **Decision** | ||
| + | |||
| + | When this alarm is received, __you must__: | ||
| + | * First, check that the Watchdog is running and, if it is not, take the relevant action. | ||
| + | * Because the Watchdog Agent is not currently set up to clear itself, you must manually clear the alarm. | ||
| + | * You will then have to select "Alarm Clear received" | ||
| + | * On the Node monitor, you can check the " | ||
| + | * Alternatively, you can clear the alarm via the database (this option is preferred in this circumstance). | ||
| + | |||
| + | Please use the following query to check that the alarm has cleared on Atlas: | ||
| + | |||
| + | |||
| + | < | ||
| + | |||
| + | or via the terminal: | ||
| + | |||
| + | < | ||
| + | |||
| + | |||
| + | **Consequences** | ||
| + | |||
| + | If a Watchdog Agent alarm is not actioned, we could miss an important alert. This could happen as follows: | ||
| + | |||
| + | * Watchdog is running but has an active alarm on the Watchdog Agent. | ||
| + | * Watchdog fails to start. | ||
| + | * The active alarm on the Watchdog Agent means we would not be alerted to Watchdog failing. | ||
| + | * No Watchdog running for the system in question could lead to Operations not being informed of a Critical system problem. | ||
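
The masking scenario in the list above can be sketched in Python. This is an illustrative model only, not the real alarm system: `AlarmManager`, `raise_alarm`, `clear_alarm` and `simulate` are hypothetical names, and the assumption that a repeat failure is suppressed while the same alarm is still active is inferred from the list rather than taken from the Watchdog implementation.

```python
class AlarmManager:
    """Hypothetical alarm store that notifies only on a clear-to-active transition."""

    def __init__(self):
        self.active = set()      # alarm keys that are currently active
        self.notifications = []  # alerts actually surfaced to Operations

    def raise_alarm(self, key):
        """Record a failure. Returns True if Operations would be notified."""
        if key in self.active:
            # Assumption: an alarm that is already active masks the new failure.
            return False
        self.active.add(key)
        self.notifications.append(key)
        return True

    def clear_alarm(self, key):
        """Manually clear an alarm (the step this page asks you to perform)."""
        self.active.discard(key)


def simulate(clear_first):
    """Replay the scenario: an earlier failure, an optional manual clear, a new failure."""
    manager = AlarmManager()
    manager.raise_alarm("WatchdogAgent")      # earlier failure raises the alarm
    if clear_first:
        manager.clear_alarm("WatchdogAgent")  # operator clears it manually
    # The Watchdog later fails to start; is Operations alerted?
    return manager.raise_alarm("WatchdogAgent")


if __name__ == "__main__":
    print("Alerted after manual clear:", simulate(clear_first=True))
    print("Alerted without manual clear:", simulate(clear_first=False))
```

Under these assumptions, the second failure is only reported when the earlier alarm was cleared, which is why the manual clear matters.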