| onboarding:introduction:watchdog [2016/08/11 12:58] – Initial Drafting ejoy | onboarding:introduction:watchdog [2021/06/25 10:09] (current) – external edit 127.0.0.1 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ====== Errigal Watchdog Overview ====== | ||
| + | Author: Eoin Joy | ||
| + | |||
| + | |||
| + | In order to familiarise yourself with Watchdogs, please watch the following presentation on the topic: [[https:// | ||
| + | |||
| + | Also, please find the associated presentation that accompanies the video: [[https:// | ||
| + | |||
| + | |||
| + | The Errigal Watchdog application, | ||
| + | |||
| + | |||
| + | ---- | ||
| + | |||
| + | |||
| + | ===== What is the Watchdog? ===== | ||
| + | |||
| + | The Watchdog is an application, | ||
| + | |||
| + | |||
| + | ---- | ||
| + | |||
| + | |||
| + | ===== Where is it deployed? ===== | ||
| + | |||
| + | It is currently deployed to the clusters of: | ||
| + | * ExteNet Production | ||
| + | And also to individual servers in the rest of the network, such as: | ||
| + | * atlas | ||
| + | * olympus | ||
| + | * mobipcs | ||
| + | * iris | ||
| + | * melpomene | ||
| + | * scottypro | ||
| + | |||
| + | ====== How do I Check the Status of Watchdog Alarms? ====== | ||
| + | |||
| + | The status of Watchdog Alarms is found both locally on the server running the watchdog and also on the central monitoring server, Atlas. | ||
| + | |||
| + | ===== Atlas ===== | ||
| + | |||
| + | Atlas is the server that accepts traps sent by watchdogs in production environments.\\ | ||
| + | These watchdog traps are processed and created as tickets in the Ticketer also running on atlas.\\ | ||
| + | The team at Errigal is notified via email (and also SMS if it is a critical alarm) when such tickets are created on atlas. | ||
| + | |||
| + | The Atlas Node Monitor can be accessed publicly from https:// | ||
| + | The Atlas Ticketer can currently only be accessed from the VPN at http:// | ||
| + | |||
| + | **Note: DO NOT LOG IN WITH THE ADMIN ACCOUNT - PLEASE USE YOUR OWN ACCOUNT PROVIDED** | ||
| + | |||
| + | ==== Node Monitor ==== | ||
| + | |||
| + | In the Node Monitor on atlas, every active watchdog alarm that successfully reached atlas is shown under its server (represented as a Parent/Hub element), with the resource (explained below) that has alarmed represented as a Child/Node element. | ||
| + | |||
| + | ==== Ticketer ==== | ||
| + | |||
| + | The Ticketer on atlas manages the tickets created when watchdog traps are received or cleared. The Ticketer also sends the notification emails and SMS messages to the Errigal team when the status of a ticket changes.\\ | ||
| + | The Ticketer can also be searched for historical data on watchdogs received. | ||
| + | |||
| + | ====== Running the Watchdog ====== | ||
| + | |||
| + | The Watchdog must always be run as a user with sufficient permissions to reserve port 162, the standard port for SNMP traps.\\ | ||
| + | In any given Watchdog installation, | ||
| + | |||
| + | To manually run the Watchdog, the following is the syntax:\\ ''/ | ||
| + | ===== Cron ===== | ||
| + | |||
| + | The automatic execution and frequency of the Watchdog process is determined by the Cron process.\\ http:// | ||
| + | |||
| + | Like the manual execution, the automatic execution is performed as a user with permissions to use port 162.\\ | ||
| + | To edit the scheduling of the watchdog for a server, run the following to access the cron table for the root user:\\ '' | ||
| + | |||
| + | ===== Logging ===== | ||
| + | |||
| + | The watchdog logs to the '' | ||
| + | |||
| + | ====== Creating Watchdogs ====== | ||
| + | |||
| + | The Watchdog does not monitor every possible situation on a server; it must be told what to look for and where. | ||
| + | |||
| + | ===== The Resource Config File ===== | ||
| + | |||
| + | The data concerning what the watchdog checks for each time it is run, where to send its resulting traps, and how to identify itself is defined in the Resource Config File.\\ | ||
| + | This file is called '' | ||
| + | This file contains several structures: the remoteNms, the localSystem, | ||
| + | |||
| + | ===== Remote NMS ===== | ||
| + | |||
| + | The remoteNms portion of this file contains the IP address of the receiving SNMP Manager and the port to send the trap on. | ||
| + | |||
| + | <code groovy> | ||
| + | remoteNms { | ||
| + | ipAddress = ' | ||
| + | snmpPort = 162 | ||
| + | }</code> | ||
| + | |||
| + | ===== Database URL ===== | ||
| + | While not contained within the localSystem object in the file, this setting defines where the watchdog looks locally for the database it requires to run. | ||
| + | <code groovy> | ||
| + | |||
| + | ===== Local System ===== | ||
| + | |||
| + | The localSystem contains the customer name, host name, and list of active resources.\\ | ||
| + | The customerName determines which carrier the system is placed under on the remote system.\\ | ||
| + | The hostName determines what this server will be called in the remote system, also determining what will go into the email to the Errigal team as the host name.\\ | ||
| + | The resources determine which of the Resources described in the file are actively being run when the watchdog is executed. Because this config file is a Groovy script, you can use objects such as Date to determine the system time and only run certain resources at certain times or on certain days. | ||
| + | |||
| + | <code groovy> | ||
| + | localSystem { | ||
| + | customerName = " | ||
| + | hostName = " | ||
| + | resources { | ||
| + | a = '/' | ||
| + | d = '/ | ||
| + | e = '/ | ||
| + | f = '/ | ||
| + | ... | ||
| + | } | ||
| + | ... | ||
| + | } | ||
| + | </code> | ||
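| + | |||
| + | As a rough sketch of the time-based activation described above, the fragment below only activates one resource on weekdays. It assumes the file is evaluated as an ordinary Groovy config script in which ''if'' statements are allowed inside the ''resources'' block; the key ''w'' and the resource name ''hypotheticalWeekdayLog'' are invented for illustration and do not come from any real configuration. ''java.util.Calendar'' is used here in place of ''Date'' to read the day of the week. | ||
| + | |||
| + | <code groovy> | ||
| + | localSystem { | ||
| + |     resources { | ||
| + |         a = '/'   // always active | ||
| + |         // Read the current day of the week from the system clock | ||
| + |         def today = Calendar.getInstance().get(Calendar.DAY_OF_WEEK) | ||
| + |         if (today != Calendar.SATURDAY && today != Calendar.SUNDAY) { | ||
| + |             // Hypothetical resource, only active Monday to Friday | ||
| + |             w = 'hypotheticalWeekdayLog' | ||
| + |         } | ||
| + |     } | ||
| + | } | ||
| + | </code> | ||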
| + | |||
| + | ==== Resources ==== | ||
| + | |||
| + | Each resource is defined by a type, a set of thresholds, and a set of parameters.\\ | ||
| + | These resources are contained within the localSystem object.\\ | ||
| + | <code groovy> | ||
| + | localSystem { | ||
| + | ... | ||
| + | ' | ||
| + | type = ' | ||
| + | thresholds { | ||
| + | a = [name: ' | ||
| + | } | ||
| + | parameters { // expected parameters: logFileLocation | ||
| + | logFileLocation = '/ | ||
| + | renameLog = false | ||
| + | withTime = true | ||
| + | dateFormat = " | ||
| + | minutesBack = 5 | ||
| + | } | ||
| + | } | ||
| + | ... | ||
| + | } | ||
| + | </code> | ||
| + | |||
| + | === Type === | ||
| + | |||
| + | The type helps to determine what kind of check the watchdog needs to perform, and also specifically what parameters will be required.\\ | ||
| + | The most common types are '' | ||
| + | '' | ||
| + | '' | ||
| + | '' | ||
| + | |||
| + | === Thresholds === | ||
| + | |||
| + | A threshold is a map of values that is used to determine what will be compared at the time that the watchdog runs.\\ | ||
| + | The name determines exactly which test to perform.\\ | ||
| + | The type can be '' | ||
| + | The value is what the test is comparing against, i.e. the string being looked for in the log.\\ | ||
| + | The level determines the severity of the trap sent, in order: '' | ||
| + | The alarmId determines the alarmIdentifier of the trap and what will become the alarm identifier of the watchdog alarm. | ||
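| + | |||
| + | Putting these fields together, a threshold entry might look like the sketch below. Only the field names and the ''MINOR'' level are taken from this page; the ''name'', ''type'', ''value'' and ''alarmId'' values are placeholders invented for illustration, and writing the level as a quoted string is also an assumption. Replace them with values the watchdog actually supports. | ||
| + | |||
| + | <code groovy> | ||
| + | thresholds { | ||
| + |     // Placeholder values: substitute a real test name, type and alarm identifier | ||
| + |     a = [name: 'hypotheticalTestName', type: 'hypotheticalType', value: 'OutOfMemoryError', level: 'MINOR', alarmId: 'hypotheticalAlarmId'] | ||
| + | } | ||
| + | </code> | ||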
| + | |||
| + | === Parameters === | ||
| + | |||
| + | Depending on the '' | ||
| + | Some of the optional parameters available alter how the resource performs its check.\\ | ||
| + | Of particular interest is '' | ||
| + | This has the benefit of allowing an alarm to clear if an issue has abated, and also allowing it to alarm again later that day in the event that a legitimate clear is received in the meantime. | ||
| + | |||
| + | ===== Deploying the Config File ===== | ||
| + | |||
| + | To deploy the updated ResourceConfig.groovy file, branch the [[ https:// | ||
| + | |||
| + | ===== Testing a New or Edited Watchdog Resource ===== | ||
| + | |||
| + | **Please note:** Take care when testing new or modified watchdogs. If possible, test in the morning (Irish time) so that you have enough time to revert changes if necessary. Also be sure to reply to the email chain of the triggered watchdog to note that it was a test. | ||
| + | |||
| + | There are some steps that should be taken when testing the implementation of a new or edited watchdog resource. | ||
| + | - Make a backup of the current '' | ||
| + | - Disable the Watchdog in the cron job by editing the cron table with\\ '' | ||
| + | - Run the watchdog manually with\\ '' | ||
| + | - If adding a new resource to the configuration, | ||
| + | - Create a resource with that name containing the thresholds and parameters required in the '' | ||
| + | - Or if editing an existing resource, perform the required changes to the '' | ||
| + | - Run the watchdog manually to ensure that with your changes, there are still no exceptions during the process. | ||
| + | - Let the team know that you will be sending a test watchdog. | ||
| + | - Trigger your new watchdog by altering a threshold slightly or by some other non-destructive method and run the watchdog again\\ '' | ||
| + | - Ensure the alarm is created on the Atlas Node Monitor | ||
| + | - Ensure an email is sent to < | ||
| + | - Ensure an SMS message is sent if applicable to the normal recipients. | ||
| + | - Restore your threshold, or by some other means create the situation that will produce a clearing trap. | ||
| + | - Ensure the alarm clears, and that the clearing email and clearing SMS are sent. | ||
| + | - Re-enable the watchdog in the cron table\\ '' | ||
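| + | |||
| + | Before running the watchdog manually, a quick parse check can catch syntax errors in an edited config file. The sketch below assumes the file can be read by Groovy's standard ''ConfigSlurper'' and that resource definitions sit inside ''localSystem'' keyed by name, as in the examples on this page; neither is confirmed as how the watchdog itself loads the file, and the path is a placeholder. | ||
| + | |||
| + | <code groovy> | ||
| + | // Placeholder path: substitute the real location of the config file | ||
| + | def configFile = new File('/path/to/ResourceConfig.groovy') | ||
| + | |||
| + | // Throws an exception here if the edited file has a syntax error | ||
| + | def config = new ConfigSlurper().parse(configFile.toURI().toURL()) | ||
| + | |||
| + | // For each active resource, report whether a matching definition exists | ||
| + | config.localSystem.resources.each { key, resourceName -> | ||
| + |     def defined = config.localSystem.containsKey(resourceName) | ||
| + |     println "${key} -> ${resourceName}: ${defined ? 'defined' : 'MISSING definition'}" | ||
| + | } | ||
| + | </code> | ||
| + | |||
| + | Because parsing executes the script, run this only against a copy of the file if the config has side effects. | ||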
| + | |||
| + | ====== Further Resources ====== | ||
| + | ===== Useful Documents ===== | ||
| + | Watchdog Resource Config How To:\\ https:// | ||
| + | |||
| + | ===== Errigal Onboarding Video ===== | ||
| + | The Errigal Onboarding Video (2016-January-25) for the Watchdog was recorded on Webex and presented by Eoin Joy (you will be asked to install a Webex browser plugin to view it):\\ https:// | ||
| + | |||
| + | ====== Assessment ====== | ||
| + | Please submit to Eoin Joy for correction. | ||
| + | |||
| + | - Find historical watchdog data for the ccicerrigalapps1 server | ||
| + | - Determine what string is searched for in the '' | ||
| + | - Write a watchdog threshold that sends a MINOR alarm when a particular string occurs in the Ticketer log in the past 30 minutes | ||
| + | - Find out what watchdog traps, if any, were sent in the last week from the qaerrigaldb1 server (not just received, but those sent) | ||
| + | - Write a watchdog threshold that sends a MINOR alarm when there is no active Trap Forwarder in the thread_config table in the snmp_manager database | ||