Troubleshooting Alarm Cache

Sync Up Alarms

At any given time, the following two queries should return the same count:

select count(*) from alarm_cache.active_alarm;
select count(*) from snmp_manager.active_alarm where !cleared;

If a restart is ever required, run the two queries above afterwards to confirm that everything is still in sync. It should be, since any messages received while the application is down wait on the queue until it is back up. If the counts do not match, something has gone wrong with the process. To counteract this, two hidden controllers were built in: one in SNMP Manager and one in AlarmCache.

The URLs for these controllers are:

...../SnmpManager/alarmCacheAudit

...../alarmCache/alarmCacheAudit

The SNMP Manager controller builds a JSON create message for every active alarm in its DB that isn't cleared and passes those on to the alarm cache, which then creates any alarms it doesn't already have in its own DB.

The alarm cache controller covers the opposite case, where the alarm cache holds alarms that have already cleared in the SNMP Manager DB. This controller fires a single JSON message containing the IDs of all alarms present in its DB and passes it to the SNMP Manager. The SNMP Manager then processes this list and replies, letting the alarm cache know whether any of those alarms have cleared.

The alarmCacheAudit on the alarm cache can also be passed a specific list of IDs. The following query is useful for finding alarms in alarm cache that are cleared in the SNMP Manager:

select acaa.id from alarm_cache.active_alarm acaa left join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.cleared;

The results of this query can be passed into the alarmCacheAudit so that only those IDs are sent to the SNMP Manager for processing, using the following URL:

https://DOMAIN_NAME/alarmCache/alarmCacheAudit?alarmIds=1,2,3,4,5,6

https://DOMAIN_NAME/alarmCache/alarmCacheAudit?alarmIds=

where the IDs are the ones returned by the query above.
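As an illustration, the URL can be assembled from the query results with a small helper. This is only a sketch: the function name build_audit_url is hypothetical, and DOMAIN_NAME remains a placeholder for the real host, as in the URLs above.

```python
def build_audit_url(domain, alarm_ids):
    """Return the alarm cache alarmCacheAudit URL for the given alarm IDs.

    Hypothetical helper; `domain` stands in for DOMAIN_NAME.
    """
    # int() rejects anything that is not a plain numeric ID, so nothing
    # unexpected ends up in the query string.
    ids = ",".join(str(int(alarm_id)) for alarm_id in alarm_ids)
    return f"https://{domain}/alarmCache/alarmCacheAudit?alarmIds={ids}"


if __name__ == "__main__":
    print(build_audit_url("DOMAIN_NAME", [1, 2, 3, 4, 5, 6]))
```

The IDs passed in would be the acaa.id values returned by the query above.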

ClearWithNoAlarmsAuditJob

When Alarm Cache Is Very Out of Sync

  1. Stop alarm cache.
  2. Purge the queue from the RabbitMQ UI.
  3. Restart alarm cache.
  4. Wait for the cache to catch up with the queue (new alarms are continually fired into this queue), then run /SnmpManager/alarmCacheAudit once the queue is stable and holds only a small number of messages.
  5. This may take several tries (shutdown, purge, startup) until the alarm counts on the two sides get close to one another.

You will then want to delete any alarms that are cleared in SNMP Manager but still present in alarm cache. These are left behind because purging the queue drops some create and clear messages.

-- Deletes for alarms that have cleared
delete from alarm_cache.cluster_alarm where alarm_id in (select acaa.id from alarm_cache.active_alarm acaa join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.cleared);

delete from alarm_cache.section_alarm where alarm_id in (select acaa.id from alarm_cache.active_alarm acaa join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.cleared);

delete from alarm_cache.carrier_alarm where alarm_id in (select acaa.id from alarm_cache.active_alarm acaa join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.cleared);

delete from alarm_cache.active_alarm where id in (select id from (select acaa.id as id from alarm_cache.active_alarm acaa join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.cleared) as tempTable);

delete from alarm_cache.clear_with_no_alarm;

-- Deletes for alarms in alarm cache that are not in SNMP Manager
delete from alarm_cache.cluster_alarm where alarm_id in (select acaa.id from alarm_cache.active_alarm acaa left join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.id is null);

delete from alarm_cache.section_alarm where alarm_id in (select acaa.id from alarm_cache.active_alarm acaa left join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.id is null);

delete from alarm_cache.carrier_alarm where alarm_id in (select acaa.id from alarm_cache.active_alarm acaa left join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.id is null);

delete from alarm_cache.active_alarm where id in (select id from (select acaa.id from alarm_cache.active_alarm acaa left join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.id is null) as tempTable);
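Both groups of deletes follow the same pattern. The three child tables are emptied first, because every subquery reads its IDs from alarm_cache.active_alarm and would find nothing if those rows were already gone. The final delete wraps the subquery in a derived table (tempTable), presumably because MySQL rejects a delete whose subquery reads directly from the table being deleted. The sketch below only generates the statements in that order. The names build_deletes, CHILD_TABLES and CLEARED_IDS are hypothetical and used purely for illustration.

```python
# Child tables that hold rows keyed by alarm_id (names taken from the deletes above).
CHILD_TABLES = ("cluster_alarm", "section_alarm", "carrier_alarm")

# IDs of alarms that are in alarm cache but already cleared in SNMP Manager.
CLEARED_IDS = (
    "select acaa.id as id from alarm_cache.active_alarm acaa "
    "join snmp_manager.active_alarm smaa on smaa.id = acaa.id "
    "where smaa.cleared"
)


def build_deletes(id_subquery):
    """Return the delete statements for one ID subquery, children first."""
    statements = [
        f"delete from alarm_cache.{table} where alarm_id in ({id_subquery});"
        for table in CHILD_TABLES
    ]
    # active_alarm goes last, with the subquery wrapped in a derived table.
    statements.append(
        "delete from alarm_cache.active_alarm where id in "
        f"(select id from ({id_subquery}) as tempTable);"
    )
    return statements


if __name__ == "__main__":
    for statement in build_deletes(CLEARED_IDS):
        print(statement)
```

Review the generated statements before running them against the database; nothing here executes them.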

We can then check what differences remain between alarm cache and SNMP Manager:

-- Alarms not in alarm cache
select smaa.id, smaa.created_date from snmp_manager.active_alarm smaa where !smaa.cleared and not (smaa.id in (select id from alarm_cache.active_alarm));

-- Clears still in alarm cache
select acaa.id, acaa.create_date from alarm_cache.active_alarm acaa join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.cleared;

-- Alarms in alarm cache that are not in SNMP Manager
select acaa.id, acaa.create_date from alarm_cache.active_alarm acaa left join snmp_manager.active_alarm smaa on smaa.id = acaa.id where smaa.id is null;
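The three checks above amount to set comparisons between the two tables. The sketch below shows the same logic on IDs that have already been fetched. It is not part of either application: the function name classify_differences and its input shapes are assumptions made for illustration.

```python
def classify_differences(snmp_alarms, cache_ids):
    """Group alarm IDs the same way the three queries above do.

    snmp_alarms: dict of alarm id -> cleared flag, as read from
                 snmp_manager.active_alarm (assumed input shape).
    cache_ids:   iterable of ids read from alarm_cache.active_alarm.
    """
    cache_ids = set(cache_ids)
    snmp_ids = set(snmp_alarms)
    active_in_snmp = {i for i, cleared in snmp_alarms.items() if not cleared}
    cleared_in_snmp = snmp_ids - active_in_snmp
    return {
        # Uncleared in SNMP Manager but absent from alarm cache.
        "alarms_not_in_cache": sorted(active_in_snmp - cache_ids),
        # Present in alarm cache although SNMP Manager has cleared them.
        "clears_still_in_cache": sorted(cleared_in_snmp & cache_ids),
        # Present in alarm cache with no matching SNMP Manager row.
        "in_cache_not_in_snmp": sorted(cache_ids - snmp_ids),
    }


if __name__ == "__main__":
    print(classify_differences({1: False, 2: True, 3: False}, [1, 2, 9]))
```

When the two sides are fully in sync, all three lists are empty.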

If there are still missing clears, you can run the above deletes again. If there are still missing alarms, you can run /SnmpManager/alarmCacheAudit again.