watchdog_agent

watchdog_agent [2020/01/08 12:13] lmoore
watchdog_agent [2021/11/25 17:05] (current) 10.5.5.89
 +**Watchdog Agent**
  
 +**Alarm**
 +
 +CRITICAL - WatchdogAgent : errigalWatchdogStateApplicationFailureNotification - WatchdogApplicationFailureAlarm.
 +
 + 
 +**Context**
 +
 +This is one of the most important Watchdog alerts you will see. The Watchdog Agent is part of a Watchdog, and when a Watchdog cannot start, it generates the Watchdog Agent alert.
 +
 +
 +**Decision**
 +
 +When this alarm is received __you must__:
 +  * First, check that the Watchdog is running and, if it is not, take the relevant action.
 +  * Because the Watchdog Agent is not currently set up to clear itself, you must clear the alarm manually.
 +  * You must then also select "Alarm Clear received" on any related ticket(s).
 +  * On the Node monitor, you can use "Review the logs" to find Watchdog Agent alarms and clear them there.
 +  * Alternatively, you can clear the alarm via the database (this option is preferred in this circumstance).
 +
 +Please use the following query to check whether the alarm has cleared on Atlas (any rows returned are Watchdog alarms that are still active):
 +
 +
 +<code>select * from snmp_manager.active_alarm where cleared = false and context like '%Watchdog%';</code>
 +
 +To clear the alarms from a terminal, run the following (''-p'' with no value makes ''mysql'' prompt for the password):
 +
 +<code>mysql -uroot -p -hatlas.err -e "update snmp_manager.active_alarm set cleared = True where cleared is False and context like '%Watchdog%';"</code>
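
The check, clear, and re-check sequence above can be sketched as a small script. This is only an illustration under stated assumptions: it uses Python's built-in ''sqlite3'' with an in-memory table as a stand-in for the real MySQL ''snmp_manager.active_alarm'' table, the ''id'' column and the sample rows are invented for the demo, and the helper names ''clear_watchdog_alarms'' and ''demo'' are hypothetical. Only the ''cleared'' and ''context'' columns come from the queries on this page.

```python
import sqlite3

# Same filters as the queries above. sqlite3 uses "?" placeholders;
# MySQL drivers for Python typically use "%s" instead.
CHECK_SQL = "SELECT COUNT(*) FROM active_alarm WHERE cleared = 0 AND context LIKE ?"
CLEAR_SQL = "UPDATE active_alarm SET cleared = 1 WHERE cleared = 0 AND context LIKE ?"


def clear_watchdog_alarms(conn, pattern="%Watchdog%"):
    """Count matching active alarms, clear them, and return (before, after)."""
    cur = conn.cursor()
    before = cur.execute(CHECK_SQL, (pattern,)).fetchone()[0]
    cur.execute(CLEAR_SQL, (pattern,))
    conn.commit()
    # Re-run the check so the operator can confirm nothing is left active.
    after = cur.execute(CHECK_SQL, (pattern,)).fetchone()[0]
    return before, after


def demo():
    """Build a throwaway table with invented rows and run the sequence on it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE active_alarm "
        "(id INTEGER PRIMARY KEY, context TEXT, cleared INTEGER)"
    )
    conn.executemany(
        "INSERT INTO active_alarm (context, cleared) VALUES (?, ?)",
        [
            ("WatchdogAgent : WatchdogApplicationFailureAlarm", 0),  # active
            ("WatchdogAgent : WatchdogApplicationFailureAlarm", 1),  # already cleared
            ("DiskUsage : threshold exceeded", 0),  # unrelated, must stay active
        ],
    )
    return conn, clear_watchdog_alarms(conn)


if __name__ == "__main__":
    _, (before, after) = demo()
    print(f"active Watchdog alarms before: {before}, after: {after}")
```

Counting before and after the update mirrors the manual procedure: the second count should be zero, and alarms that do not match the ''%Watchdog%'' pattern are left untouched.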
 +
 +
 +**Consequences**
 +
 +If a Watchdog Agent alarm is not actioned, we could miss an important alert. This could happen as follows:
 +
 +  * The Watchdog is running but still has an active alarm on the Watchdog Agent.
 +  * The Watchdog later fails to start.
 +  * Because an alarm is already active on the Watchdog Agent, we would not be alerted to the Watchdog failing.
 +  * With no Watchdog running for the system in question, Operations might not be informed of a Critical system problem.