Differences

This shows you the differences between two versions of the page.

--- resolution_area:watchdog_resolutions:res-w9107 [2021/06/25 10:09] – external edit 127.0.0.1
+++ resolution_area:watchdog_resolutions:res-w9107 [2021/07/08 15:03] (current) – wflaherty
@@ Line 1: / Line 1: @@
 ===== AlarmsShouldBeCleared=====
-**Level:**
+**Level:** \\
++ 800 - MINOR\\
++ 1000 - MAJOR\\
++ 1200 - CRITICAL\\
 **Purpose:**
 **Scenario:**
+This watchdog checks the AlarmCache database to ensure that alarms which have been cleared in the SNMP Manager are also removed from the AlarmCache's active alarm table.
+The query is:
+<code>
+SELECT count(*) FROM alarm_cache.active_alarm acaa JOIN snmp_manager.active_alarm smaa ON smaa.id = acaa.id WHERE smaa.cleared;
+</code>
+The severity level of the alert depends on the count returned by this query.
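As a rough illustration of how the returned count could map to a level, here is a minimal Python sketch. It assumes the numbers in the Level list (800, 1000, 1200) are lower-bound thresholds on the count; the page does not state this explicitly, and the function name `severity_for` is hypothetical.

```python
# Assumption: the values in the Level list are minimum counts for each level.
# The real watchdog configuration may differ.
THRESHOLDS = [(1200, "CRITICAL"), (1000, "MAJOR"), (800, "MINOR")]


def severity_for(count):
    """Return the assumed alert level for a count, or None if below all thresholds."""
    for minimum, level in THRESHOLDS:
        if count >= minimum:
            return level
    return None
```

Under these assumptions, a count below 800 yields no alert, which is consistent with the Auto Clear behaviour described on this page.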
 **Resolution:**
 **Manual Action Steps:**
+First, if the alert is only MINOR, we can attempt to fix it by running the Alarm Cache Audit.
+It can be reached at the following path:
+<code>
+/SnmpManager/alarmCacheAudit
+</code>
+Below are examples for ExteNet and ATC:
+<code>
+https://nocportal.extenetsystems.com/SnmpManager/alarmCacheAudit
+https://atcwatchdog.com/SnmpManager/alarmCacheAudit
+</code>
+If the alert is MAJOR or CRITICAL, or the first step hasn't worked, then we must clear the alarms manually using a query:
+<code>
+DELETE FROM alarm_cache.active_alarm WHERE id IN (SELECT id FROM (SELECT acaa.id AS id FROM alarm_cache.active_alarm acaa JOIN snmp_manager.active_alarm smaa ON smaa.id = acaa.id WHERE smaa.cleared) AS tempTable);
+</code>
+This should be all you need. However, if the Alarm Cache is completely out of sync, we must stop the Alarm Cache app, purge its queue in RabbitMQ, and then start the app again.
+In a terminal, navigate to your deployment playbooks repo and stop Alarm Cache:
+<code>
+ansible-playbook -i ../env-configuration/prodatc/hosts.ini --diff --vault-id @prompt alarmcache.yml -e "actions='stop'"
+</code>
+Once Alarm Cache has stopped, we must purge the queue in RabbitMQ.
+The RabbitMQ management page is typically at http://loadbalancer:15672
+Examples:
+<code>
+http://extlb.ext:15672
+http://atclb1.atc:15672
+http://atcapps2.atc:15672
+</code>
+Sign-in credentials for RabbitMQ are in PWSafe.
+For EXT the username is ''admin''.
+For ATC the username is ''rabbit''.
+Once you sign in, find the queue called ''alarm_cache_inbound_queue'' on the Queues page.
+If the queue was/is filling up, then AlarmCache was likely overwhelmed by a spike of many alarms firing and clearing.
+Scroll down and click 'Purge Messages'
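The same purge can also be scripted against the RabbitMQ management HTTP API, which exposes `DELETE /api/queues/{vhost}/{queue}/contents`. The sketch below only builds the request. It assumes the queue lives in the default vhost `/` and that the management API is reachable on the same host and port as the UI; the helper name `build_purge_request` and the password value are placeholders.

```python
import base64
import urllib.parse
import urllib.request


def build_purge_request(base_url, queue, user, password, vhost="/"):
    """Build (but do not send) a DELETE request that purges a queue's messages."""
    # The vhost and queue name must be percent-encoded; "/" becomes "%2F".
    path = "/api/queues/{}/{}/contents".format(
        urllib.parse.quote(vhost, safe=""),
        urllib.parse.quote(queue, safe=""),
    )
    req = urllib.request.Request(base_url.rstrip("/") + path, method="DELETE")
    # The management API accepts HTTP basic authentication.
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    return req
```

Passing the returned request to `urllib.request.urlopen` would perform the purge, so only do that after Alarm Cache has been stopped.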
+Now start AlarmCache again:
+<code>
+ansible-playbook -i ../env-configuration/prodatc/hosts.ini --diff --vault-id @prompt alarmcache.yml -e "actions='start'"
+</code>
+Monitor the queue for a while to see whether Alarm Cache is able to catch up with it and bring the message count under control.
+Because some clears may have been missed during this process, you may need to rerun the query that deletes the cleared alarms.
+Finally, repeat the first step and run the Alarm Cache Audit.
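The monitoring step above could be sketched as follows. The RabbitMQ management API returns a JSON document per queue from `GET /api/queues/{vhost}/{queue}`, with the total message count in its `messages` field. The helper names and the definition of "draining" used here are assumptions for illustration, not part of the documented procedure.

```python
import json


def queue_depth(body):
    """Extract the total message count from a management API queue document."""
    return json.loads(body)["messages"]


def is_draining(depths):
    """Return True if the sampled depths never increase and end lower than they began."""
    if len(depths) < 2:
        return False  # a single sample says nothing about the trend
    non_increasing = all(b <= a for a, b in zip(depths, depths[1:]))
    return non_increasing and depths[-1] < depths[0]
```

Sampling the depth every few seconds and passing the list to `is_draining` gives a simple signal for whether Alarm Cache is catching up.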
 **Auto Clear:**
+If the difference in the number of active alarms falls below 800, the watchdog will clear the alert.
