watchdogs:informing_the_customer
Differences
This shows you the differences between two versions of the page.
| watchdogs:informing_the_customer [2017/04/18 14:32] – created cokeeffe | watchdogs:informing_the_customer [2021/06/25 10:09] (current) – external edit 127.0.0.1 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | =====What alarms do customers need to be informed of?===== | ||
| + | In general we don't have to inform the customers of every little problem we deal with. What we have to inform them of is anything that's classed as an " | ||
| + | |||
| + | |||
| + | ===Watchdogs=== | ||
| + | |||
| + | Some watchdog alerts go to the customer as well. Always check the email distribution of the watchdog. If the customer is on it, we should reply straight away to at least say we're investigating the issue. | ||
| + | |||
| + | ===Outages=== | ||
| + | |||
| + | An outage is anything that brings down an application, | ||
| + | |||
| + | Anything that affects the core functioning of the applications is an outage. If the SNMP Manager is not processing alarms, the Ticketer is not creating tickets, the NOC Portal is not displaying alarms, or the Reporting Manager is not emailing out scheduled reports, the customer should be informed immediately. | ||
| + | |||
| + | Examples of alerts that could indicate a core process has failed: | ||
| + | OutOfMemoryErrorFound, | ||
| + | |||
| + | ActiveAlarmCreation, | ||
| + | |||
| + | Any problem that requires a restart of one of the handlers is an outage and the customer should be notified of it. If possible they should be notified before restarting. | ||
| + | |||
| + | ===Issues visible to the customer=== | ||
| + | |||
| + | ClientAlarmPollerInactive - This indicates that the Client Active Alarm Poller has stopped or slowed to a crawl, so the alarm list that users see in the Node Monitor is not being updated. Traps may still be processed and alarms may still be created, but the users won't be able to see them. This alarm has often been the first thing to show up when the entire application is having issues, and it could be the canary that indicates a much larger-scale problem. | ||
| + | |||
| + | |||
| + | Anything else that is visible to the customer - If it's something that the customer is going to notice, they should hear it from us before they bring it to us. Examples: tickets in a particular workflow can't be moved, a carrier is not displaying any elements in the Node Monitor, or the NOC Portal throws an error when selecting a cluster. | ||
| + | |||
| + | |||
| + | ===Feel free to ask=== | ||
| + | If you're not sure about something, ask! During office hours you can just ask someone in person. If you're on weekend on-call, remember the WhatsApp group is there. | ||
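
The rules on this page can be read as a small checklist. Below is a rough sketch of that checklist in Python, for illustration only. It is not an existing tool. The `Alert` class, the `customer_actions` function and the `customer_domain` parameter are all made-up names, and the alert sets contain only the names that appear on this page, so they are partial.

```python
from dataclasses import dataclass, field

# Alert names copied from this page. The page's own lists are longer,
# so treat these sets as partial examples, not a complete catalogue.
CORE_PROCESS_ALERTS = {"OutOfMemoryErrorFound", "ActiveAlarmCreation"}
CUSTOMER_VISIBLE_ALERTS = {"ClientAlarmPollerInactive"}


@dataclass
class Alert:
    """Hypothetical record for a watchdog alert."""
    name: str
    recipients: list = field(default_factory=list)  # watchdog email distribution
    needs_handler_restart: bool = False


def customer_actions(alert, customer_domain):
    """Return the customer-facing actions this alert calls for (may be empty)."""
    actions = []
    suffix = "@" + customer_domain.lower()
    # Watchdogs: if the customer is on the distribution, reply straight away.
    if any(addr.lower().endswith(suffix) for addr in alert.recipients):
        actions.append("reply to the watchdog email to say we are investigating")
    # Outages: alerts that could indicate a core process has failed.
    if alert.name in CORE_PROCESS_ALERTS:
        actions.append("inform the customer immediately: possible outage")
    # Issues the customer will notice themselves.
    if alert.name in CUSTOMER_VISIBLE_ALERTS:
        actions.append("inform the customer: the issue is visible to them")
    # Handler restarts count as outages; notify first where possible.
    if alert.needs_handler_restart:
        actions.append("notify the customer, before restarting if possible")
    return actions
```

An empty list only means none of the rules sketched here matched. It does not mean the customer can be left uninformed, so if in doubt, ask.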