Watchdog Agent
Alarm
CRITICAL - WatchdogAgent : errigalWatchdogStateApplicationFailureNotification - WatchdogApplicationFailureAlarm.
Context
This is one of the most important Watchdog alerts you will see. The Watchdog Agent alert is generated when a Watchdog cannot start. The Watchdog Agent is part of a Watchdog.
Decision
When this alarm is received you must:
- First, check that the Watchdog is running and, if it is not, take the relevant action.
- The Watchdog Agent is not currently set up to clear itself, so you must clear the alarm manually.
- You will then also have to select “Alarm Clear received” on any related Ticket(s).
- On the Node monitor, you can check “Review the logs” for Watchdog Agent alarms and clear them there.
- Alternatively, you can clear the alarm via the Database (this option is preferred in this circumstance).
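Use the following query on Atlas to check that the alarm has cleared. It lists Watchdog alarms that are still uncleared, so it should return no rows once the alarm has been cleared: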
select * from active_alarm where cleared = false and context like '%Watchdog%'
To clear the alarm from a terminal, run:
mysql -uroot -pozzrules -hatlas.err -e "update snmp_manager.active_alarm set cleared = True where cleared is False and context like '%Watchdog%'";
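The check and clear steps above can be combined into one small script. The following is a minimal sketch, not an established tool: the function names (`run_sql`, `count_active_watchdog_alarms`, `clear_watchdog_alarms`) are hypothetical, it assumes the check query runs against the same `snmp_manager.active_alarm` table as the update, and it assumes the password is supplied through the `MYSQL_PWD` environment variable rather than on the command line.

```shell
#!/bin/sh
# Sketch only: connection defaults are taken from the command above and
# can be overridden from the environment.
DB_HOST="${DB_HOST:-atlas.err}"
DB_USER="${DB_USER:-root}"

# Run one SQL statement. -N drops the column header and -B gives plain
# batch output, so a count comes back as a bare number.
# Assumption: the mysql client picks the password up from MYSQL_PWD.
run_sql() {
  mysql -u"$DB_USER" -h"$DB_HOST" -N -B -e "$1"
}

# Print how many Watchdog alarms are still uncleared.
count_active_watchdog_alarms() {
  run_sql "select count(*) from snmp_manager.active_alarm where cleared = false and context like '%Watchdog%'"
}

# Report the count, clear the alarms, then confirm nothing is left.
# Returns non-zero if any step fails or alarms remain uncleared.
clear_watchdog_alarms() {
  before=$(count_active_watchdog_alarms) || return 1
  echo "Uncleared Watchdog alarms before: $before"
  run_sql "update snmp_manager.active_alarm set cleared = True where cleared is False and context like '%Watchdog%'" || return 1
  after=$(count_active_watchdog_alarms) || return 1
  echo "Uncleared Watchdog alarms after: $after"
  [ "$after" -eq 0 ]
}
```

After sourcing the script, running `clear_watchdog_alarms` prints the count before and after the update, and its exit status indicates whether any Watchdog alarms remain uncleared. Only run it once you have confirmed the Watchdog itself is running, as described in the first step.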
Consequences
If a Watchdog Agent alarm is not actioned, we could miss an important alert. This could happen as follows:
- Watchdog is running but has an active alarm on the Watchdog Agent.
- Watchdog fails to start.
- The active alarm on the Watchdog Agent means we would not be alerted to the Watchdog failing.
- With no Watchdog running for the system in question, Operations might not be informed of a Critical system problem.
watchdog_agent.1624612196.txt.gz · Last modified: 2021/06/25 10:09 by 127.0.0.1