Differences

This shows you the differences between two versions of the page.

--- development:applications:alarmcache:troubleshooting [2019/10/18 12:37] – [When Alarm Cache is very out of Sync] ccarew
+++ development:applications:alarmcache:troubleshooting [2021/07/21 17:05] (current) – [Sync Up Alarms] wflaherty
@@ Line 1: / Line 1: @@
+====== Troubleshooting Alarm Cache ======
+===== Sync Up Alarms =====
+At any given time the following two queries should match in count:
+<code>
+select count(*) from alarm_cache.active_alarm;
+select count(*) from snmp_manager.active_alarm where !cleared;
+</code>
+If a restart is ever required, running the above two queries afterwards is recommended to confirm everything is still in sync. It should be, since any messages received while the application is down will wait on the queue until it is back up.
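The count comparison above can be scripted. The following is a minimal sketch, assuming a DB-API 2.0 cursor that can see both schemas; the function name `alarm_count_difference` is hypothetical, and `not cleared` is used in place of `!cleared` only so the same text also runs on other SQL engines.

```python
# Sketch only: accepts any DB-API 2.0 cursor that can query both schemas.
CACHE_COUNT_SQL = "select count(*) from alarm_cache.active_alarm"
# "not cleared" is assumed to be equivalent to the "!cleared" used on this page.
SNMP_COUNT_SQL = "select count(*) from snmp_manager.active_alarm where not cleared"


def alarm_count_difference(cursor):
    """Return (cache_count, snmp_count, cache_count - snmp_count)."""
    cursor.execute(CACHE_COUNT_SQL)
    cache_count = cursor.fetchone()[0]
    cursor.execute(SNMP_COUNT_SQL)
    snmp_count = cursor.fetchone()[0]
    return cache_count, snmp_count, cache_count - snmp_count
```

A difference of zero means the counts agree; it does not by itself prove that the same alarm IDs are present on both sides.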
+If the two counts do not match, something has gone wrong with the process. To counteract this, two hidden controllers were built in (one in SNMP Manager and one in AlarmCache).
+The URLs for these controllers are:
+<code>
+...../SnmpManager/alarmCacheAudit
+...../alarmCache/alarmCacheAudit
+</code>
+The SNMP Manager controller generates a JSON create message for every active alarm in its DB that is not cleared and passes these on to the alarm cache, which creates any alarms it does not already have in its DB.
+The AlarmCache controller covers the opposite case, where the alarm cache still holds alarms that have already cleared in the SNMP Manager DB. It fires a single JSON message containing the IDs of all alarms present in its DB to the SNMP Manager. The SNMP Manager then processes this list and replies, letting the alarm cache know which of those alarms, if any, have cleared.
+The AlarmCache alarmCacheAudit can also be given a specific list of IDs. The following query is useful for finding alarms in the alarm cache that are cleared in the SNMP Manager:
+<code>
+select acaa.id from alarm_cache.active_alarm acaa left join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.cleared;
+</code>
+The results of this query can be passed to the alarmCacheAudit so that only those IDs are sent to the SNMP Manager for processing. Use a URL of the following form, where alarmIds is a comma-separated list of the IDs returned by the above query (the values shown are only an example):
+<code>
+https://DOMAIN_NAME/alarmCache/alarmCacheAudit?alarmIds=1,2,3,4,5,6
+</code>
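Building the audit URL from the query results can be sketched as follows. This is an illustration only: the helper name `build_audit_urls` and the `batch_size` parameter are hypothetical, and splitting long ID lists into several requests is an assumption made to keep URLs short, not something the controller is known to require.

```python
def build_audit_urls(domain, alarm_ids, batch_size=200):
    """Return alarmCacheAudit URLs, each carrying up to batch_size IDs.

    batch_size is a hypothetical limit chosen for this sketch.
    """
    # int() rejects anything that is not a plain numeric ID.
    ids = [str(int(alarm_id)) for alarm_id in alarm_ids]
    base = f"https://{domain}/alarmCache/alarmCacheAudit?alarmIds="
    return [
        base + ",".join(ids[start:start + batch_size])
        for start in range(0, len(ids), batch_size)
    ]
```

An empty ID list yields no URLs, so nothing is requested when the query returns no rows.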
+===== ClearWithNoAlarmsAuditJob =====
+  * Runs via the scheduler every 10 minutes in alarm_cache -> ClearWithNoAlarmsAuditJob.groovy
+  * Looks up entries with findAllByDateProcessedByCacheLessThan(30 minutes in the past)
+  * Publishes an update to EMS via the exchange ems_push_notification_topic -> binding = AlarmCacheUpdate
+  * The update is processed in the EMS front-end via RabbitMQFrontendConnector.java -> an event is fired to update the cache using websockets.
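The cutoff used by the finder in the list above can be illustrated with a small sketch. It is not the job's Groovy code: the function name and the dictionary key `date_processed_by_cache` are assumptions derived from the finder name, and the entries are plain dictionaries rather than domain objects.

```python
from datetime import datetime, timedelta

# Assumption: 30 minutes, matching the cutoff described for the finder.
MAX_AGE = timedelta(minutes=30)


def entries_older_than_cutoff(entries, now, max_age=MAX_AGE):
    """Return entries processed by the cache strictly before (now - max_age)."""
    cutoff = now - max_age
    return [e for e in entries if e["date_processed_by_cache"] < cutoff]
```

Because the comparison is "less than", an entry processed exactly 30 minutes ago is not selected.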
+===== When Alarm Cache is very out of Sync =====
+  * In general we should **not** purge all the data in alarm cache unless it is very old, i.e. do not clear out the tables fully.
+  * If there is a sync issue (a large difference between the open active alarm counts) we should run **/SnmpManager/alarmCacheAudit**. This fires create active alarm messages to the alarm cache: if the cache already has an alarm it does nothing, and if it does not it creates the alarm (which is a lot slower than just checking whether it is there).
+  * If the queue “alarm_cache_inbound_queue” is getting backed up (messages in the thousands), alarm cache is receiving too many creates and you may need to do the following:
+  - Stop alarm cache.
+  - Purge the queue from the RabbitMQ UI.
+  - Restart alarm cache.
+  - Wait for the cache to catch up with the queue (new alarms will continually be fired into this queue), then run **/SnmpManager/alarmCacheAudit** once the queue is stable and has a low number of messages in it.
+  - This may take several tries (shutdown, purge, startup) until the alarm counts in alarm cache and SNMP Manager get close to one another.
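Whether the queue is backed up can also be checked without the UI. The sketch below assumes the RabbitMQ management plugin is enabled and reachable on its default port 15672; the host name, the default vhost "/", and the threshold of 1000 messages are assumptions, and the helper names are hypothetical. It only builds the management API URL and interprets a response body; fetching it (with credentials) is left out.

```python
import json
from urllib.parse import quote

QUEUE_NAME = "alarm_cache_inbound_queue"


def queue_api_url(host, vhost="/", port=15672):
    """Build the management API URL for the inbound queue."""
    # The vhost must be percent-encoded, so the default "/" becomes "%2F".
    return f"http://{host}:{port}/api/queues/{quote(vhost, safe='')}/{QUEUE_NAME}"


def is_backed_up(queue_json, threshold=1000):
    """Return True if the queue's total message count reaches the threshold."""
    depth = json.loads(queue_json).get("messages", 0)
    return depth >= threshold
```

The threshold is a judgment call; "in the thousands" is the only guidance given here.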
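+You will then want to delete any alarms that are cleared in SNMP Manager but still present in alarm cache, as well as any alarms in alarm cache that do not exist in SNMP Manager. These arise because purging the queue means some creates and clears are missed.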
+<code>
+-- Deletes for alarms that have cleared
+delete from alarm_cache.cluster_alarm where alarm_id in (select acaa.id from alarm_cache.active_alarm acaa join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.cleared);
+delete from alarm_cache.section_alarm where alarm_id in (select acaa.id from alarm_cache.active_alarm acaa join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.cleared);
+delete from alarm_cache.carrier_alarm where alarm_id in (select acaa.id from alarm_cache.active_alarm acaa join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.cleared);
+delete from alarm_cache.active_alarm where id in (select id from (select acaa.id as id from alarm_cache.active_alarm acaa join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.cleared) as tempTable);
+delete from alarm_cache.clear_with_no_alarm;
+-- Deletes for alarms in alarm cache that are not in SNMP Manager
+delete from alarm_cache.cluster_alarm where alarm_id in (select acaa.id from alarm_cache.active_alarm acaa left join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.id is null);
+delete from alarm_cache.section_alarm where alarm_id in (select acaa.id from alarm_cache.active_alarm acaa left join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.id is null);
+delete from alarm_cache.carrier_alarm where alarm_id in (select acaa.id from alarm_cache.active_alarm acaa left join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.id is null);
+delete from alarm_cache.active_alarm where id in (select id from (select acaa.id from alarm_cache.active_alarm acaa left join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.id is null) as tempTable);
+</code>
+We can then check what differences remain between alarm cache and SNMP Manager:
+<code>
+-- Alarms not in alarm cache
+select smaa.id, smaa.created_date from snmp_manager.active_alarm smaa where !smaa.cleared and not (smaa.id in (select id from alarm_cache.active_alarm));
+-- Clears still in alarm cache
+select acaa.id, acaa.create_date from alarm_cache.active_alarm acaa join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.cleared;
+-- Alarms in alarm cache that are not in SNMP Manager
+select acaa.id, acaa.create_date from alarm_cache.active_alarm acaa left join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.id is null;
+</code>
+If there are still missing clears you can run the above deletes again. If there are still missing alarms you can run /SnmpManager/alarmCacheAudit again.
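The three difference checks can be expressed as set operations, which may help when comparing exported ID lists offline. This is a sketch under assumptions: the function and key names are hypothetical, and the inputs are plain lists of IDs taken from the two databases.

```python
def classify_differences(cache_ids, snmp_active_ids, snmp_cleared_ids):
    """Group alarm IDs by the kind of mismatch between the two databases."""
    cache = set(cache_ids)
    active = set(snmp_active_ids)      # SNMP Manager alarms that are not cleared
    cleared = set(snmp_cleared_ids)    # SNMP Manager alarms that are cleared
    return {
        # Fixed by running /SnmpManager/alarmCacheAudit again.
        "missing_in_cache": sorted(active - cache),
        # Fixed by running the deletes for cleared alarms again.
        "cleared_still_in_cache": sorted(cache & cleared),
        # Fixed by running the deletes for alarms not in SNMP Manager again.
        "in_cache_not_in_snmp": sorted(cache - active - cleared),
    }
```

All three lists being empty corresponds to the three queries returning no rows.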

Internal Errigal Collaboration Wiki
