watchdogs:informing_the_customer [2017/04/18 14:32] – created cokeeffe

watchdogs:informing_the_customer [2021/06/25 10:09] (current) – external edit 127.0.0.1
=====What alarms do customers need to be informed of?=====

In general we don't have to inform customers of every little problem we deal with. What we do have to inform them of is anything that's classed as an "outage". It's also good practice to inform them of any issues that they're likely to run into themselves anyway. If we tell them before they run into it and have to tell us, it looks better on us and saves face.

===Watchdogs===

Some watchdog alerts go to the customer as well. Always check the email distribution of the watchdog. If the customer is on it, we should reply to the alert straight away, at least to say we're investigating the issue.

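The distribution check above could be sketched roughly as follows. This is only an illustration: the internal domain, the example addresses and both function names are assumptions made up for the sketch, not part of any existing tooling.

```python
# Hypothetical sketch: the domain below is a placeholder, not real configuration.
INTERNAL_DOMAINS = {"example-internal.com"}  # assumed: our own mail domain(s)


def customer_recipients(distribution, internal_domains=INTERNAL_DOMAINS):
    """Return the addresses on a watchdog's distribution list that are not internal."""
    external = []
    for address in distribution:
        # Everything after the last "@" is treated as the mail domain.
        domain = address.rsplit("@", 1)[-1].strip().lower()
        if domain not in internal_domains:
            external.append(address)
    return external


def needs_customer_reply(distribution, internal_domains=INTERNAL_DOMAINS):
    """True if at least one non-internal address receives this watchdog alert."""
    return bool(customer_recipients(distribution, internal_domains))
```

If `needs_customer_reply` returns true for a watchdog's list, the alert email has reached the customer and a quick "we're investigating" reply is due.
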
===Outages===

An outage is anything that brings down an application, stops core functionality or prevents users from accessing or using the applications properly.

Anything that affects the core functioning of the applications is an outage. If the SNMP Manager is not processing alarms, the Ticketer is not creating tickets, the NOC Portal is not displaying alarms or the Reporting Manager is not emailing out scheduled reports, then the customer should be informed immediately.

Examples of alerts that could indicate a core process has failed:

OutOfMemoryErrorFound, PermGenErrorFound, errigalApplicationProcessInactiveAlarm, JVMHeapSizeHigh, TheURLisUnavailable: these indicate that an application may have died or could be about to die. The first thing you should do is try to access the application on the affected server to see if it's still up. If it's down, inform the customer immediately and start investigating.

ActiveAlarmCreation, handlerFailover, generalTrapSummaryCreationInactive, TrapCountCheck, TrapParsingCheck, remoteTicketCreationInactive: these indicate that new traps, alarms or tickets are not being created. This can sometimes just be due to a quiet period and not actually a problem. But if there is a real issue with trap creation, the customer must be informed immediately while we work on a fix.

Any problem that requires a restart of one of the handlers is an outage, and the customer should be notified of it. If possible, they should be notified before the restart.

===Issues visible to the customer===

ClientAlarmPollerInactive: this indicates that the Client Active Alarm Poller has stopped or slowed to a crawl, which means the alarm list that users see in the Node Monitor is not being updated. Traps may still be processed and alarms may still be created, but the users won't be able to see them. This alarm has often been the first thing to show up when the entire application is having issues, and it could be the canary that indicates a much larger problem.

Anything else that is visible to the customer: if it's something that the customer is going to notice, they should hear it from us before they bring it to us. Examples: tickets in a particular workflow can't be moved, a carrier is not displaying any elements in the Node Monitor, or the NOC Portal throws an error when selecting a cluster.

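As a quick reference, the alert groups described on this page could be captured in a small lookup like the one below. The alert names are the ones listed above; the category labels, the `triage` function and the wording of the first steps are illustrative assumptions, not an existing script.

```python
# Alert names come from this page; category labels and wording are illustrative assumptions.
POSSIBLE_APP_DOWN = {
    "OutOfMemoryErrorFound",
    "PermGenErrorFound",
    "errigalApplicationProcessInactiveAlarm",
    "JVMHeapSizeHigh",
    "TheURLisUnavailable",
}
POSSIBLE_CREATION_STALLED = {
    "ActiveAlarmCreation",
    "handlerFailover",
    "generalTrapSummaryCreationInactive",
    "TrapCountCheck",
    "TrapParsingCheck",
    "remoteTicketCreationInactive",
}
CUSTOMER_VISIBLE = {"ClientAlarmPollerInactive"}


def triage(alert_name):
    """Return a (category, first_step) pair for an alert name."""
    if alert_name in POSSIBLE_APP_DOWN:
        return ("possible outage",
                "Check the application on the affected server; if it is down, inform the customer immediately.")
    if alert_name in POSSIBLE_CREATION_STALLED:
        return ("possible outage",
                "Rule out a quiet period; if creation has really stopped, inform the customer immediately.")
    if alert_name in CUSTOMER_VISIBLE:
        return ("customer-visible",
                "Tell the customer before they notice, and check for a wider application problem.")
    # Anything not listed needs human judgement: would the customer notice it?
    return ("unclassified",
            "Inform the customer if it is an outage or something they are going to notice.")
```

For example, `triage("TrapCountCheck")` falls into the "possible outage" group with the quiet-period check as the first step, while an unknown alert name falls through to the judgement call in the last branch.
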
===Feel free to ask===
If you're not sure about something, ask! During office hours you can just ask someone else in person. If you're on weekend on-call, remember the WhatsApp group is there.