AlarmsShouldBeCleared
Level:
- 800 - MINOR
- 1000 - MAJOR
- 1200 - CRITICAL
Purpose:
Scenario: This watchdog checks the AlarmCache database to ensure that alarms cleared in the SNMP Manager have also been removed from the AlarmCache's active alarm table.
The query is:
SELECT count(*) FROM alarm_cache.active_alarm acaa JOIN snmp_manager.active_alarm smaa ON smaa.id = acaa.id WHERE smaa.cleared;
The severity/level of the alert depends on the count returned, per the thresholds listed under Level.
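The mapping from the returned count to a level can be sketched as a small shell helper. This is only an illustration: the function name `severity_for_count` is hypothetical, and it assumes each threshold is an inclusive lower bound, which this page does not state explicitly.

```shell
# Hypothetical helper: map the count returned by the watchdog query to a level.
# Assumption: each threshold is an inclusive lower bound (800, 1000, 1200);
# the watchdog's actual comparison is not shown on this page.
severity_for_count() {
  count="$1"
  if [ "$count" -ge 1200 ]; then
    echo "CRITICAL"
  elif [ "$count" -ge 1000 ]; then
    echo "MAJOR"
  elif [ "$count" -ge 800 ]; then
    echo "MINOR"
  else
    # Below the lowest threshold no alert is expected.
    echo "OK"
  fi
}
```

For example, `severity_for_count 1050` falls in the MAJOR band under these assumptions.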
Resolution:
Manual Action Steps: First, if the alert is only MINOR, we can attempt to fix it by running the Alarm Cache Audit. It can be found by hitting
/SnmpManager/alarmCacheAudit
Below are examples for ExteNet and ATC
https://nocportal.extenetsystems.com/SnmpManager/alarmCacheAudit
https://atcwatchdog.com/SnmpManager/alarmCacheAudit
If the alert is MAJOR or CRITICAL, or the first step hasn't worked, then we must clear the alarms manually using a query:
DELETE FROM alarm_cache.active_alarm WHERE id IN (SELECT id FROM (SELECT acaa.id AS id FROM alarm_cache.active_alarm acaa JOIN snmp_manager.active_alarm smaa ON smaa.id = acaa.id WHERE smaa.cleared) AS tempTable);
This should be all you need. However, if the Alarm Cache is completely out of sync, we must stop the alarm cache app, purge its queue in RabbitMQ, and then start the app again.
In a terminal, navigate to your deployment playbooks repo.
ansible-playbook -i ../env-configuration/prodatc/hosts.ini --diff --vault-id @prompt alarmcache.yml -e "actions='stop'"
Once Alarm Cache has stopped, we must purge the queue from RabbitMQ. The RabbitMQ management page is typically at http://loadbalancer:15672. Examples:
http://extlb.ext:15672
http://atclb1.atc:15672
http://atcapps2.atc:15672
Sign-in credentials for RabbitMQ are in PWSafe
For EXT the username is admin
For ATC the username is rabbit
Once you sign in, open the Queues page and find the queue called alarm_cache_inbound_queue
If the queue was/is filling up, then AlarmCache was likely overwhelmed by a spike of many alarms firing and clearing.
Scroll down and click 'Purge Messages'
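As an alternative to the button, the same purge can be sketched against the RabbitMQ management HTTP API, which is served on the same port as the management page. This is a hedged sketch, not a tested procedure for these environments: the function names are hypothetical, the queue is assumed to live in the default vhost `/` (URL-encoded as `%2F`), and `RABBIT_USER` / `RABBIT_PASS` are assumed environment variables holding the PWSafe credentials.

```shell
# Sketch only: purge alarm_cache_inbound_queue via the management HTTP API.
# Assumptions: default vhost "/" (encoded as %2F); credentials are supplied
# in the RABBIT_USER and RABBIT_PASS environment variables.

# Build the "queue contents" endpoint for a given management base URL.
purge_url() {
  base="${1%/}"                              # strip one trailing slash, if any
  queue="${2:-alarm_cache_inbound_queue}"    # default to the Alarm Cache queue
  printf '%s/api/queues/%%2F/%s/contents\n' "$base" "$queue"
}

# Send the DELETE request; curl -f turns HTTP errors into a non-zero exit status.
purge_queue() {
  curl -sS -f -u "${RABBIT_USER}:${RABBIT_PASS}" -X DELETE "$(purge_url "$1" "$2")"
}
```

Usage would look like `purge_queue http://atclb1.atc:15672`. Check the URL printed by `purge_url` first if the vhost is not the default one.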
Now start AlarmCache again
ansible-playbook -i ../env-configuration/prodatc/hosts.ini --diff --vault-id @prompt alarmcache.yml -e "actions='start'"
Monitor the queue yourself for a bit and see whether Alarm Cache is able to catch up with the queue and bring the numbers under control. Since some clears may have been missed while the queue was purged, you may need to run the delete query for cleared alarms again.
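Monitoring can also be done from a terminal by polling the management HTTP API for the queue's message count. The sketch below is illustrative only: the function names are hypothetical, the default vhost `/` is assumed, credentials are assumed to be in `RABBIT_USER` / `RABBIT_PASS`, and the 10-second interval is an arbitrary choice.

```shell
# Hypothetical helpers for watching queue depth via the management HTTP API.
# Assumptions: default vhost "/" (encoded as %2F); credentials are supplied
# in the RABBIT_USER and RABBIT_PASS environment variables.

# Read queue JSON on stdin and print the number from the first "messages": N pair.
extract_messages() {
  grep -o '"messages"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | head -n 1 \
    | grep -o '[0-9][0-9]*$'
}

# Print the current depth of alarm_cache_inbound_queue once.
queue_depth() {
  curl -sS -f -u "${RABBIT_USER}:${RABBIT_PASS}" \
    "${1%/}/api/queues/%2F/alarm_cache_inbound_queue" | extract_messages
}

# Print a timestamped depth every 10 seconds until interrupted with Ctrl-C.
watch_queue() {
  while true; do
    printf '%s %s\n' "$(date '+%H:%M:%S')" "$(queue_depth "$1")"
    sleep 10
  done
}
```

Running `watch_queue http://atclb1.atc:15672` should show the count trending down if Alarm Cache is keeping up; a steadily growing number suggests it is still overwhelmed.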
Finally, repeat the first step and run the Alarm Cache Audit again.
Auto Clear: If the count returned by the query drops below 800, the watchdog will clear the alert.