Author: Eoin Joy
In order to familiarise yourself with Watchdogs, please watch the following presentation on the topic: https://errigal.webex.com/errigal/ldr.php?RCID=62ba3b868fa32d08eb22dbb0b41252e9
Also, please find the associated presentation that accompanies the video: https://docs.google.com/presentation/d/1lJH0qciK2ql72h55eDma8Ugdd5gW1kjRc9w7mQX7OcY/edit#slide=id.p4
The Errigal Watchdog application, or Watchdog, is a tool used on our own and our customers' servers to monitor our applications and the systems that our applications rely on, e.g. database connections.
The Watchdog is an application, written in Groovy, that runs on a regular schedule. Should it detect a problem, it notifies the central monitoring server, which in turn lets the team at Errigal, and sometimes the customer themselves, know when a problem has occurred.
It is currently deployed to the clusters of:
And also to individual servers in the rest of the network, such as:
The status of Watchdog alarms can be found both locally on the server running the watchdog and on the central monitoring server, Atlas.
Atlas is the server that accepts traps sent by watchdogs in production environments.
These watchdog traps are processed and created as tickets in the Ticketer also running on atlas.
The team at Errigal is notified via email (and also SMS if it is a critical alarm) when such tickets are created on atlas.
The Atlas Node Monitor can be accessed publicly at https://www.errigalnoc.com:8080/SnmpManager/nodeMonitor or from the VPN at https://atlas.err:8080/SnmpManager/nodeMonitor
The Atlas Ticketer can currently only be accessed from the VPN at http://atlas.err:8083/Ticketer/
Note: DO NOT LOG IN WITH THE ADMIN ACCOUNT - PLEASE USE YOUR OWN ACCOUNT PROVIDED
In the Node Monitor on atlas, every active watchdog alarm that successfully reached atlas is shown under its server (represented as a Parent/Hub element), with the resource (explained below) that has alarmed represented as a Child/Node element.
The Ticketer on atlas manages the tickets created when watchdog traps are received or cleared. It also sends the notification emails and SMS messages to the Errigal team when the status of a ticket changes.
The Ticketer can also be searched for historical data on watchdogs received.
The Watchdog must always be run as a user with sufficient permissions to bind to port 162, the port used for SNMP traps.
In any given Watchdog installation, the script used to run the watchdog is located in the watchdog/script/ directory.
To run the Watchdog manually, use the following command from that directory:
/home/watchdog/script] sudo ./run.sh
The automatic execution of the Watchdog, and how often it runs, is controlled by cron.
http://man7.org/linux/man-pages/man5/crontab.5.html
Like the manual execution, the automatic execution is performed as a user with permissions to use port 162.
To edit the scheduling of the watchdog for a server, run the following to access the cron table for the root user:
sudo crontab -e
The watchdog logs to the watchdog/logs/ directory, in files with names like Watchdog.2016-08-15. These logs contain the same output that is produced when the watchdog is run manually.
The Watchdog does not monitor every possible situation on a server; it must be told what to look for and where.
The data concerning what the watchdog checks for each time it is run, where to send its resulting traps, and how to identify itself is defined in the Resource Config File.
This file is called ResourceConfig.groovy and is located in the watchdog/resources/ directory.
This file contains several structures: the remoteNms, the localSystem, and the resources.
The remoteNms portion of this file contains the IP address of the receiving SNMP Manager and the port to send the trap to.
remoteNms {
    ipAddress = '10.91.100.101'
    snmpPort = 162
}
Although it sits outside the localSystem object in the file, the databaseURL setting defines where the watchdog looks locally for the database it requires to run:
databaseURL = "jdbc:mysql://localhost:3306/watchdog?user=root&password=TYPEMYSQLPASSWORDHERE"
The localSystem contains the customer name, host name, and list of active resources.
The customerName determines which carrier the system is placed under on the remote system.
The hostName determines what this server will be called in the remote system, also determining what will go into the email to the Errigal team as the host name.
The resources determine which of the Resources described in the file are actively being run when the watchdog is executed. Because this config file is a groovy script, you can use objects such as Date to determine the system time and only run certain resources at certain times or on certain days.
localSystem {
    customerName = "Errigal" // This value is sent with certain traps to help with element management at NMS side
    hostName = "EXTAPPS1"
    resources {
        a = '/'
        d = '/home'
        e = '/boot'
        f = '/dev/shm'
        ...
    }
    ...
}
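Since the config file is an ordinary Groovy script, the time-based selection of resources mentioned above could look roughly like the sketch below. It is an illustration only: it assumes the file is evaluated top to bottom each time the watchdog runs, that entries in resources refer to resource blocks by name, and that the keys (a, b) and the weekday/business-hours schedule are made up for the example.

```groovy
// Sketch only: the schedule and the resource keys below are illustrative assumptions.
def now = Calendar.getInstance()
def isWeekday = !(now.get(Calendar.DAY_OF_WEEK) in [Calendar.SATURDAY, Calendar.SUNDAY])
def isBusinessHours = now.get(Calendar.HOUR_OF_DAY) in 8..17

localSystem {
    customerName = "Errigal"
    hostName = "EXTAPPS1"
    resources {
        a = '/'                        // always checked
        if (isWeekday && isBusinessHours) {
            b = 'TrapParsingCheck'     // only checked Monday to Friday, 08:00 to 17:59
        }
    }
}
```

Declaring the helper values with def keeps them as local script variables, so they should not end up as extra entries in the resulting configuration.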
Each resource is defined by a type, a set of thresholds, and a set of parameters.
These resources are contained within the localSystem object.
localSystem {
    ...
    'TrapParsingCheck' {
        type = 'LOG'
        thresholds {
            a = [name: 'checkLog', type: 'NOT_EQUAL', value: ['GeneralTrapSummaryWrapper - Created Active Alarm:'], level: ['CRITICAL'], alarmId: 'TrapParsingInactive']
        }
        parameters {
            // expected parameters: logFileLocation
            logFileLocation = '/home/scotty/logs/grails/SnmpManager.log'
            renameLog = false
            withTime = true
            dateFormat = "yyyy-MM-dd HH:mm:ss"
            minutesBack = 5
        }
    }
    ...
}
The type helps to determine what kind of check the watchdog needs to perform, and also specifically what parameters will be required.
The most common types are LOG, MYSQLDB, and HDD. There are other types that you will see less frequently.
LOG types look in a text file for a string.
MYSQLDB types perform a query and look at the result.
HDD types look at the space used on a particular partition.
A threshold is a map of values that is used to determine what will be compared at the time that the watchdog runs.
The name determines exactly which test to perform.
The type can be EQUAL, NOT_EQUAL, MIN, or MAX. This determines, for instance, whether finding a string of text is an alarming situation or a clearing situation.
The value is what the test compares against, e.g. the string being looked for in the log.
The level determines the severity of the trap sent, in order: INFORMATION, WARNING, MINOR, MAJOR, and CRITICAL.
The alarmId determines the alarmIdentifier of the trap and what will become the alarm identifier of the watchdog alarm.
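To make the four threshold types more concrete, here is a minimal Groovy sketch of how such a comparison could be evaluated. It is not the Watchdog's own code: the function name shouldAlarm is hypothetical, a LOG check is modelled simply as "was the string found", and the MIN/MAX behaviour (alarm below a floor, alarm above a ceiling) is an assumption based on the names.

```groovy
// Hypothetical helper, not part of the Watchdog: returns true when the
// observed reading should raise an alarm for the given threshold type.
boolean shouldAlarm(String type, def observed, def limit) {
    switch (type) {
        case 'EQUAL':     return observed == limit   // alarm when the reading matches
        case 'NOT_EQUAL': return observed != limit   // alarm when the reading does not match
        case 'MIN':       return observed < limit    // assumption: alarm below a floor
        case 'MAX':       return observed > limit    // assumption: alarm above a ceiling
        default:
            throw new IllegalArgumentException("Unknown threshold type: " + type)
    }
}

// LOG-style check: the expected string was not found, so NOT_EQUAL alarms.
assert shouldAlarm('NOT_EQUAL', false, true)
// Numeric check: 95 against a limit of 90 alarms for MAX but not for MIN.
assert shouldAlarm('MAX', 95, 90)
assert !shouldAlarm('MIN', 95, 90)
```

This matches the TrapParsingCheck example, where NOT_EQUAL is used so that the alarm is raised when the expected log line is missing.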
Depending on the type defined for the resource, some parameters may be required (logFileLocation for LOG types; driver, host, port, database, user, and password for a MYSQLDB type) whereas some parameters are optional.
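As an illustration of the required MYSQLDB parameters, a resource block might be laid out as in the sketch below. Only the parameter names come from the list above; the resource name, the threshold test name, the alarmId, and all values are placeholders invented for the example, and the parameter that carries the query itself is not documented here, so it is left out.

```groovy
localSystem {
    // ... other settings and resources ...
    'ExampleDatabaseCheck' {              // hypothetical resource name
        type = 'MYSQLDB'
        thresholds {
            // 'exampleQueryTest' and 'ExampleQueryAlarm' are placeholders,
            // not confirmed Watchdog test names or alarm identifiers.
            a = [name: 'exampleQueryTest', type: 'MAX', value: [100], level: ['MAJOR'], alarmId: 'ExampleQueryAlarm']
        }
        parameters {
            // required parameters for a MYSQLDB resource
            driver = 'com.mysql.jdbc.Driver'   // assumed MySQL JDBC driver class
            host = 'localhost'
            port = 3306
            database = 'watchdog'
            user = 'exampleuser'
            password = 'TYPEMYSQLPASSWORDHERE'
        }
    }
}
```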
Some of the optional parameters available alter how the resource performs its check.
Of particular interest is withTime for LOG type resources. If withTime is set to true, you must also include dateFormat and minutesBack parameters. With these parameters populated, the log check now looks at all lines of the file beginning with a timestamp in the format defined in dateFormat, looking back minutesBack minutes, instead of looking through the whole log.
This has the benefit of allowing an alarm to clear if an issue has abated, and also allowing it to alarm again later that day in the event that a legitimate clear is received in the meantime.
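The effect of withTime can be pictured with a short Groovy sketch. This is a simplified illustration under stated assumptions rather than the Watchdog's actual implementation: the helper name recentLines is hypothetical, and lines that do not start with a parseable timestamp are simply skipped.

```groovy
import java.text.ParsePosition
import java.text.SimpleDateFormat

// Hypothetical helper, not the Watchdog's own code: keeps only the lines whose
// leading timestamp falls within the last `minutesBack` minutes before `now`.
List<String> recentLines(List<String> lines, String dateFormat, int minutesBack, Date now) {
    def fmt = new SimpleDateFormat(dateFormat)
    long cutoff = now.time - minutesBack * 60000L
    lines.findAll { String line ->
        // parse() with a ParsePosition returns null instead of throwing
        // when the line does not begin with a timestamp in this format
        Date stamp = fmt.parse(line, new ParsePosition(0))
        stamp != null && stamp.time >= cutoff && stamp.time <= now.time
    }
}

def format = "yyyy-MM-dd HH:mm:ss"
def now = new SimpleDateFormat(format).parse("2016-08-15 12:00:00")
def lines = [
    "2016-08-15 11:50:00 INFO entry outside the window",
    "2016-08-15 11:57:30 INFO GeneralTrapSummaryWrapper - Created Active Alarm: example",
    "continuation line without a timestamp"
]
// Only the 11:57:30 line is within the last 5 minutes.
assert recentLines(lines, format, 5, now).size() == 1
```

With a window like this, a string that last appeared ten minutes ago no longer counts, which is what lets an alarm clear and later fire again on the same day.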
To deploy the updated ResourceConfig.groovy file, branch the watchdog-resource-config-backup repository and edit the files in the resource_config_files directory, or create new files using the format resource_config_files/{env}/{handler}/ResourceConfig.groovy. Once the changes are complete, merge the branch into master and run the watchdog_resource_deployment Jenkins pipeline, which uses Ansible to deploy to the server. The pipeline checks whether the file has been modified directly on the server and emails support if it has.
Please note: Take care when testing out new or modified watchdogs. Ensure you test this in the morning (Irish time) if possible so you are allowed enough time to revert changes if necessary. Also be sure to reply to the email chain of the triggered watchdog to denote it was a test.
There are some steps that should be taken when testing the implementation of a new or edited watchdog resource.
… ResourceConfig.groovy file. … sudo crontab -e … # character at the start of the line. … sudo ./run.sh … resources closure. … ResourceConfig.groovy file. … ResourceConfig.groovy file. … sudo ./run.sh … sudo crontab -e.
Watchdog Resource Config How To:
https://docs.google.com/document/d/12rxXTgmQwYRlA7_AfMjHK-hWm6Z9z8frt2ePDmuCj1s
The Errigal Onboarding Video (2016-January-25) for the Watchdog was recorded on Webex and Presented by Eoin Joy (Will request to install a Webex browser plugin to view):
https://errigal.webex.com/errigal/ldr.php?RCID=62ba3b868fa32d08eb22dbb0b41252e9
Please submit to Eoin Joy for correction.
SnmpManager.log resource on cccierrigalapps1