AlarmsShouldBeCleared

Level:
+ 800 - MINOR
+ 1000 - MAJOR
+ 1200 - CRITICAL

Purpose:

Scenario: This watchdog checks the AlarmCache database to ensure that alarms which have been cleared in the SNMP Manager are also removed from the AlarmCache's active alarm table.

The query is:

SELECT count(*) FROM alarm_cache.active_alarm acaa JOIN snmp_manager.active_alarm smaa ON smaa.id = acaa.id WHERE smaa.cleared;

The severity/level of the alert depends on the count returned by this query, using the thresholds listed under Level.
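
The mapping from the query result to an alert level can be sketched as a small shell function. This is a minimal sketch, assuming each value under Level is an inclusive lower bound. The function name severity_for_count and the OK label for counts under 800 are illustrative and are not taken from the watchdog itself.

```shell
# Hypothetical helper: map the count returned by the query to an alert level.
# Assumption: each threshold under "Level" is an inclusive lower bound.
severity_for_count() {
  if [ "$1" -ge 1200 ]; then
    echo "CRITICAL"
  elif [ "$1" -ge 1000 ]; then
    echo "MAJOR"
  elif [ "$1" -ge 800 ]; then
    echo "MINOR"
  else
    echo "OK"   # illustrative label: below 800 the watchdog clears the alert
  fi
}
```

For example, severity_for_count 950 prints MINOR.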

Resolution:

Manual Action Steps: If the alert is only MINOR, first try to fix it by running the Alarm Cache Audit, which is available at:

/SnmpManager/alarmCacheAudit

Below are examples for ExteNet and ATC:

https://nocportal.extenetsystems.com/SnmpManager/alarmCacheAudit
https://atcwatchdog.com/SnmpManager/alarmCacheAudit
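
From a terminal, the audit endpoint could be requested roughly as follows. This is a sketch under assumptions: the helper names audit_url and run_alarm_cache_audit are hypothetical, and it assumes the endpoint answers a plain GET without further authentication. If the portal requires a login session, open the URL in a browser instead.

```shell
# Hypothetical helper: build the audit URL from a portal base URL.
audit_url() {
  # Strip one trailing slash so the path separator is not doubled.
  printf '%s/SnmpManager/alarmCacheAudit\n' "${1%/}"
}

# Hypothetical helper: request the audit and print the HTTP status code.
# Assumption: a plain GET is enough to trigger the audit.
run_alarm_cache_audit() {
  curl -sS -o /dev/null -w '%{http_code}\n' "$(audit_url "$1")"
}
```

Usage: run_alarm_cache_audit https://atcwatchdog.com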

If the alert is MAJOR or CRITICAL, or the first step has not worked, then the alarms must be cleared manually using a query:

DELETE FROM alarm_cache.active_alarm WHERE id IN (SELECT id FROM (SELECT acaa.id AS id FROM alarm_cache.active_alarm acaa JOIN snmp_manager.active_alarm smaa ON smaa.id = acaa.id WHERE smaa.cleared) AS tempTable);

This should be all you need. However, if the Alarm Cache is completely out of sync, we must stop the Alarm Cache app, purge its queue in RabbitMQ, and then start the app again.

In a terminal, navigate to your deployment playbooks repo and stop Alarm Cache (the example below uses the prodatc inventory):

ansible-playbook -i ../env-configuration/prodatc/hosts.ini --diff --vault-id @prompt alarmcache.yml -e "actions='stop'"

Once Alarm Cache has stopped, we must purge the queue in RabbitMQ. The management UI is typically at http://loadbalancer:15672. Examples:

http://extlb.ext:15672
http://atclb1.atc:15672
http://atcapps2.atc:15672

Sign-in credentials for RabbitMQ are in PWSafe. For EXT the username is admin. For ATC the username is rabbit.

Once you sign in, find the queue called alarm_cache_inbound_queue on the Queues page. If the queue was or is filling up, then AlarmCache was likely overwhelmed by a spike of many alarms firing and clearing. Select the queue, scroll down, and click 'Purge Messages'.
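
The purge can also be done without the web UI through the RabbitMQ management HTTP API, which is served on the same port 15672. The sketch below assumes the queue lives in the default virtual host "/" (URL-encoded as %2F) and that RABBIT_USER and RABBIT_PASS are environment variables you have set from PWSafe. The variable names and helper names are hypothetical.

```shell
# Hypothetical helper: management API URL for the contents of the inbound queue.
# Assumption: the queue is in the default vhost "/", URL-encoded as %2F.
purge_url() {
  printf '%s/api/queues/%%2F/alarm_cache_inbound_queue/contents\n' "${1%/}"
}

# Purge the queue: a DELETE on .../contents removes all messages from it.
# Assumption: RABBIT_USER and RABBIT_PASS hold the credentials from PWSafe.
purge_alarm_cache_queue() {
  curl -sS --fail -u "${RABBIT_USER}:${RABBIT_PASS}" -X DELETE "$(purge_url "$1")"
}
```

Usage: purge_alarm_cache_queue http://atclb1.atc:15672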

Now start AlarmCache again:

ansible-playbook -i ../env-configuration/prodatc/hosts.ini --diff --vault-id @prompt alarmcache.yml -e "actions='start'"

Monitor the queue for a while to see whether Alarm Cache is able to catch up with the queue and bring the numbers under control. Because some clears may have been missed while the queue was purged, you may need to repeat the query that deletes the cleared alarms.
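
Instead of refreshing the Queues page by hand, the queue depth can be sampled from the RabbitMQ management HTTP API. This is a minimal sketch, assuming the default virtual host "/", credentials in the hypothetical RABBIT_USER and RABBIT_PASS environment variables, and jq being installed. The helper names and the 30-second interval are illustrative.

```shell
# Hypothetical helper: print the current message count of the inbound queue.
# Assumptions: default vhost "/" (%2F), credentials in RABBIT_USER / RABBIT_PASS.
queue_depth() {
  body=$(curl -sS --fail -u "${RABBIT_USER}:${RABBIT_PASS}" \
    "${1%/}/api/queues/%2F/alarm_cache_inbound_queue") || return 1
  # -e makes jq exit non-zero if the "messages" field is missing (null).
  printf '%s' "$body" | jq -e '.messages'
}

# Hypothetical helper: succeed if the newer sample ($2) is below the older one ($1).
is_draining() {
  [ "$2" -lt "$1" ]
}

# Sample the queue every 30 seconds and report the trend until interrupted.
watch_queue() {
  previous=$(queue_depth "$1") || return 1
  while sleep 30; do
    current=$(queue_depth "$1") || return 1
    if is_draining "$previous" "$current"; then
      echo "shrinking: $previous -> $current"
    else
      echo "not shrinking: $previous -> $current"
    fi
    previous=$current
  done
}
```

Usage: watch_queue http://atclb1.atc:15672 (stop it with Ctrl-C once the depth stays low).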

Finally, repeat the first step and run the Alarm Cache Audit.

Auto Clear: If the count returned by the query drops below 800, the watchdog will clear the alert.