Table of Contents
Watchdogs / Prometheus
Prerequisites to being on-call / watchdog rota
Errigal Watchdog
Watchdog Updated Process
Watchdog Alarm Summary Report
Current watchdog alarms
Watchdog On Call Process
What alarms do customers need to be informed of?
Outage Process Backups
Tunneling
General
Installing Watchdog on a server
Configuring Watchdog Texts
Upgrade Watchdog Install on a Server
Looking at the logs for key indicators of potential issues
Linux - ping, traceroute and tcpdump diagnostic utilities
Clickatell - Watchdog Text Messages
Geb Smoke Tests
Geb Sanity Checks
ATC NOC Portal Alarm Email
TicketerEmailFailedDelivery - RabbitMQ
RemoteTicketFailedToCreate - RabbitMQ
Prometheus / Grafana
Creating a New Metric
Creating a New Alert
Common Watchdogs
mysqlSlaveReplicationFailure
Resolving Replication Timeout Issue
QuartzJobsBlocked
Watchdog Agent
Watchdog Resolution Area