Watchdog Agent

Alarm

CRITICAL - WatchdogAgent : errigalWatchdogStateApplicationFailureNotification - WatchdogApplicationFailureAlarm.

Context

This is one of the most important Watchdog alerts you will see. When a Watchdog cannot start, the Watchdog Agent alert is generated. The Watchdog Agent is part of a Watchdog.

Decision

When this alarm is received you must:

  • First, check that the Watchdog is running and, if it is not, take the relevant action.
  • Because the Watchdog Agent is not currently set up to clear itself, you must clear the alarm manually.
  • You will then also have to select “Alarm Clear received” on any related Ticket(s).
  • On the Node monitor, you can use “Review the logs” to find Watchdog Agent alarms and clear them there.
  • Alternatively, you can clear the alarm via the Database (this option is preferred in this circumstance).

Please use the following query to check that the alarm has cleared on Atlas (if it returns no rows, no uncleared Watchdog alarms remain):

select * from active_alarm where cleared = false and context like '%Watchdog%'

To clear the alarm via Terminal:

mysql -uroot -p(add password) -hatlas.err -e "update snmp_manager.active_alarm set cleared = True where cleared is False and context like '%Watchdog%'";
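
The check and the clear above can be combined into a single step. The following is a minimal sketch, not an established procedure. The function names are hypothetical, and it assumes that the unqualified active_alarm table in the check query is snmp_manager.active_alarm. It uses -p with no value so that the mysql client prompts for the password instead of taking it on the command line.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are hypothetical.
# Host, schema, table and filter are copied from the commands on this page.

DB_HOST="atlas.err"
ALARM_FILTER="context like '%Watchdog%'"

# Print the query that lists uncleared Watchdog alarms.
# Assumption: active_alarm lives in the snmp_manager schema.
watchdog_check_sql() {
  printf "select * from snmp_manager.active_alarm where cleared = false and %s" "$ALARM_FILTER"
}

# Print the statement that marks those alarms as cleared.
watchdog_clear_sql() {
  printf "update snmp_manager.active_alarm set cleared = True where cleared is False and %s" "$ALARM_FILTER"
}

# List what is outstanding, clear it, then list again to confirm.
# The second listing should come back empty if the clear worked.
clear_watchdog_alarms() {
  mysql -uroot -p -h"$DB_HOST" \
    -e "$(watchdog_check_sql); $(watchdog_clear_sql); $(watchdog_check_sql)"
}
```

After sourcing the file, run clear_watchdog_alarms and compare the two listings. Running the check before the update shows exactly which alarms are about to be cleared, which is useful because the filter matches every alarm whose context contains “Watchdog”.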

Consequences

If a Watchdog Agent alarm is not actioned, we could miss an important alert. This could happen as follows:

  • Watchdog is running but has an active alarm on the Watchdog Agent.
  • Watchdog fails to start.
  • The active alarm on the Watchdog Agent means we would not be alerted to Watchdog failing.
  • No Watchdog running for the system in question could lead to Operations not being informed of a Critical system problem.
watchdog_agent.txt · Last modified: 2021/11/25 17:05 by 10.5.5.89