Errigal Watchdog Overview
Author: Eoin Joy
To familiarise yourself with Watchdogs, please watch the following presentation on the topic: https://errigal.webex.com/errigal/ldr.php?RCID=62ba3b868fa32d08eb22dbb0b41252e9
The slides that accompany the video are available here: https://docs.google.com/presentation/d/1lJH0qciK2ql72h55eDma8Ugdd5gW1kjRc9w7mQX7OcY/edit#slide=id.p4
The Errigal Watchdog application, or Watchdog, is a tool used on our own and our customers' servers to monitor our applications and the systems those applications rely on, e.g. database connections.
What is the Watchdog?
The Watchdog is an application, written in Groovy, that runs on a regular schedule. Should it detect a problem, it notifies the central monitoring server, which in turn lets the team at Errigal, and sometimes the customer themselves, know when a problem has occurred.
Where is it deployed?
It is currently deployed to the clusters of:
- ExteNet Production
And also to individual servers in the rest of the network, such as:
- atlas
- olympus
- mobipcs
- iris
- melpomene
- scottypro
How do I Check the Status of Watchdog Alarms?
The status of Watchdog Alarms is found both locally on the server running the watchdog and also on the central monitoring server: Atlas
Atlas
Atlas is the server that accepts traps sent by watchdogs in production environments.
These watchdog traps are processed and created as tickets in the Ticketer also running on atlas.
The team at Errigal is notified via email (and also SMS if it is a critical alarm) when such tickets are created on atlas.
The Atlas Node Monitor can be accessed publicly from https://www.errigalnoc.com:8080/SnmpManager/nodeMonitor or from the VPN at https://atlas.err:8080/SnmpManager/nodeMonitor
The Atlas Ticketer can currently only be accessed from the VPN at http://atlas.err:8083/Ticketer/
Note: DO NOT LOG IN WITH THE ADMIN ACCOUNT - PLEASE USE YOUR OWN ACCOUNT PROVIDED
Node Monitor
In the Node Monitor on atlas, every active watchdog alarm that successfully reached atlas is shown under its server (represented as a Parent/Hub element), with the resource (explained below) that has alarmed represented as a Child/Node element.
Ticketer
The Ticketer on atlas manages the tickets created when watchdog traps are received or cleared. It also sends the notification emails and SMS messages to the Errigal team when the status of a ticket changes.
The Ticketer can also be searched for historical data on watchdogs received.
Running the Watchdog
The Watchdog must always be run as a user with sufficient permissions to bind to port 162, the port used for SNMP traps.
In any given Watchdog installation, the script used to run the watchdog is located in the watchdog/script/ directory.
To run the Watchdog manually, change to that directory (for example /home/watchdog/script) and run:
sudo ./run.sh
Cron
Cron schedules the automatic execution of the Watchdog and determines how often it runs. The crontab format is documented at:
http://man7.org/linux/man-pages/man5/crontab.5.html
Like the manual execution, the automatic execution is performed as a user with permissions to use port 162.
To edit the scheduling of the watchdog for a server, run the following to access the cron table for the root user:
sudo crontab -e
Logging
The watchdog logs to the watchdog/logs/ directory, in files named like Watchdog.2016-08-15. These logs contain the same output that is produced when the watchdog is run manually.
Creating Watchdogs
The Watchdog does not monitor every possible situation on a server; it must be told what to look for and where.
The Resource Config File
The data concerning what the watchdog checks for each time it is run, where to send its resulting traps, and how to identify itself is defined in the Resource Config File.
This file is called ResourceConfig.groovy and is located in the watchdog/resources/ directory.
This file contains several structures: the remoteNms, the localSystem, and the resources.
Remote NMS
The remoteNms portion of this file contains the IP address of the receiving SNMP Manager and the port to send the trap on.
remoteNms {
    ipAddress = '10.91.100.101'
    snmpPort = 162
}
Database URL
The databaseURL setting sits outside the localSystem object. It defines where the watchdog looks locally for the database it requires to run.
databaseURL = "jdbc:mysql://localhost:3306/watchdog?user=root&password=TYPEMYSQLPASSWORDHERE"
Local System
The localSystem contains the customer name, host name, and list of active resources.
The customerName determines which carrier the system is placed under on the remote system.
The hostName determines what this server will be called in the remote system, also determining what will go into the email to the Errigal team as the host name.
The resources determine which of the Resources described in the file are actively run when the watchdog is executed. Because this config file is a Groovy script, you can use objects such as Date to determine the system time and only run certain resources at certain times or on certain days.
localSystem {
    customerName = "Errigal" // This value is sent with certain traps to help with element management at NMS side
    hostName = "EXTAPPS1"
    resources {
        a = '/'
        d = '/home'
        e = '/boot'
        f = '/dev/shm'
        ...
    }
    ...
}
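As a minimal sketch of the time-based selection described above, the fragment below adds one extra entry to the resources closure only during a night-time window. It assumes that the file is evaluated as ordinary Groovy code and that each entry's value names a resource defined elsewhere in the file. The key n, the resource name NightlyBackupCheck, and the hour window are hypothetical, and java.util.Calendar is used here to read the hour of the current time.

```groovy
localSystem {
    resources {
        a = '/'
        d = '/home'

        // Hypothetical: activate this resource only between 01:00 and 04:59 server time.
        def hour = Calendar.getInstance().get(Calendar.HOUR_OF_DAY)
        if (hour >= 1 && hour < 5) {
            n = 'NightlyBackupCheck' // hypothetical resource name, defined elsewhere in the file
        }
    }
}
```

Because the condition is evaluated each time the watchdog loads the file, a resource gated this way is simply absent from runs that fall outside the window.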
Resources
Each resource is defined by a type, a set of thresholds, and a set of parameters.
These resources are contained within the localSystem object.
localSystem {
    ...
    'TrapParsingCheck' {
        type = 'LOG'
        thresholds {
            a = [name: 'checkLog', type: 'NOT_EQUAL', value: ['GeneralTrapSummaryWrapper - Created Active Alarm:'], level: ['CRITICAL'], alarmId: 'TrapParsingInactive']
        }
        parameters {
            // expected parameters: logFileLocation
            logFileLocation = '/home/scotty/logs/grails/SnmpManager.log'
            renameLog = false
            withTime = true
            dateFormat = "yyyy-MM-dd HH:mm:ss"
            minutesBack = 5
        }
    }
    ...
}
Type
The type helps to determine what kind of check the watchdog needs to perform, and also specifically what parameters will be required.
The most common types are LOG, MYSQLDB, and HDD. Other types exist but are seen less frequently.
LOG types look in a text file for a string.
MYSQLDB types perform a query and look at the result.
HDD types look at the space used on a particular partition.
Thresholds
A threshold is a map of values that is used to determine what will be compared at the time that the watchdog runs.
The name determines exactly which test to perform.
The type can be EQUAL, NOT_EQUAL, MIN, or MAX. This determines, for instance, whether finding a string of text is an alarming situation or a clearing situation.
The value is what the test compares against, e.g. the string being looked for in the log.
The level determines the severity of the trap sent. In ascending order, the levels are INFORMATION, WARNING, MINOR, MAJOR, and CRITICAL.
The alarmId determines the alarmIdentifier of the trap, which becomes the alarm identifier of the watchdog alarm.
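To make the five keys concrete, here is a hedged sketch of a single LOG threshold. It reuses the checkLog test name from the TrapParsingCheck example above. The search string and the alarmId are hypothetical, and the reading of EQUAL as "a match raises the alarm" is an assumption inferred from the NOT_EQUAL example rather than something this document states.

```groovy
thresholds {
    // Hypothetical threshold: raise a MAJOR alarm when the string IS found in the log.
    a = [name: 'checkLog',                // test name, as used in the TrapParsingCheck example
         type: 'EQUAL',                   // assumed: finding the string is the alarming condition
         value: ['OutOfMemoryError'],     // hypothetical search string
         level: ['MAJOR'],                // severity of the trap sent
         alarmId: 'ExampleOutOfMemory']   // hypothetical alarm identifier
}
```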
Parameters
Depending on the type of the resource, some parameters may be required (logFileLocation for LOG types; driver, host, port, database, user, and password for a MYSQLDB type) whereas other parameters are optional.
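As an illustration of the required MYSQLDB parameters listed above, the fragment below fills in each of the six names with placeholder values. The resource name ExampleDbCheck is hypothetical, the value types (for example an integer port) are assumptions, and the thresholds and the query itself are left out because their names for MYSQLDB resources are not given in this document.

```groovy
'ExampleDbCheck' {                          // hypothetical resource name
    type = 'MYSQLDB'
    // thresholds { ... } omitted: the test names available for MYSQLDB are not listed here
    parameters {
        // The six required parameters named in the text; every value is a placeholder.
        driver   = 'com.mysql.jdbc.Driver' // MySQL Connector/J driver class
        host     = 'localhost'
        port     = 3306
        database = 'watchdog'
        user     = 'root'
        password = 'TYPEMYSQLPASSWORDHERE'
    }
}
```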
Some of the optional parameters available alter how the resource performs its check.
Of particular interest is withTime for LOG type resources. If withTime is set to true, you must also include dateFormat and minutesBack parameters. With these parameters populated, the log check now looks at all lines of the file beginning with a timestamp in the format defined in dateFormat, looking back minutesBack minutes, instead of looking through the whole log.
This has the benefit of allowing an alarm to clear if an issue has abated, and also allowing it to alarm again later that day in the event that a legitimate clear is received in the meantime.
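The time-window behaviour described above might be configured as in the following sketch. The parameter names come from the TrapParsingCheck example. The log path and the 15-minute window are hypothetical, and the note on the timestamp layout assumes that dateFormat follows Java's SimpleDateFormat pattern letters.

```groovy
parameters {
    logFileLocation = '/home/scotty/logs/grails/Ticketer.log' // hypothetical path
    withTime = true
    // Assumed to match lines that begin with a timestamp such as: 2016-08-15 14:03:27
    dateFormat = "yyyy-MM-dd HH:mm:ss"
    minutesBack = 15 // only lines stamped within the last 15 minutes are checked
}
```

With a window like this, an error logged an hour ago no longer matches on the next run, which is what allows the alarm to clear and later re-raise.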
Deploying the Config File
To deploy the updated ResourceConfig.groovy file, branch the watchdog-resource-config-backup repository and edit the files in the resource_config_files directory, or create new files using the format resource_config_files/{env}/{handler}/ResourceConfig.groovy. Once the changes are complete, merge the branch into master and run the watchdog_resource_deployment Jenkins pipeline, which uses Ansible to deploy to the server. The pipeline checks whether the file has been edited directly on the server and emails support if it has.
Testing a New or Edited Watchdog Resource
Please note: Take care when testing new or modified watchdogs. Test in the morning (Irish time) if possible, so that you have enough time to revert changes if necessary. Also be sure to reply to the email chain of the triggered watchdog to note that it was a test.
There are some steps that should be taken when testing the implementation of a new or edited watchdog resource.
- Make a backup of the current ResourceConfig.groovy file.
- Disable the Watchdog in the cron job by editing the cron table with sudo crontab -e and commenting the line for the watchdog with a # character at the start of the line.
- Run the watchdog manually with sudo ./run.sh and ensure the current configuration does not cause any exceptions in the execution of the watchdog.
- If adding a new resource to the configuration, add an entry with the desired name to the resources closure.
- Create a resource with that name containing the thresholds and parameters required in the ResourceConfig.groovy file.
- Or if editing an existing resource, perform the required changes to the ResourceConfig.groovy file.
- Run the watchdog manually to ensure that with your changes, there are still no exceptions during the process.
- Let the team know that you will be sending a test watchdog.
- Trigger your new watchdog by altering a threshold slightly or by some other non-destructive method and run the watchdog again with sudo ./run.sh.
- Ensure the alarm is created on the Atlas Node Monitor.
- Ensure an email is sent to developers@errigal.com.
- Ensure an SMS message is sent if applicable to the normal recipients.
- Restore your threshold or by some other means create the situation that will create a clearing trap.
- Ensure the alarm clears, and that the clearing email and clearing SMS are sent.
- Re-enable the watchdog in the cron table with sudo crontab -e.
Further Resources
Useful Documents
Watchdog Resource Config How To:
https://docs.google.com/document/d/12rxXTgmQwYRlA7_AfMjHK-hWm6Z9z8frt2ePDmuCj1s
Errigal Onboarding Video
The Errigal Onboarding Video for the Watchdog (2016-January-25) was recorded on Webex and presented by Eoin Joy. Viewing it may prompt you to install a Webex browser plugin:
https://errigal.webex.com/errigal/ldr.php?RCID=62ba3b868fa32d08eb22dbb0b41252e9
Assessment
Please submit to Eoin Joy for correction.
- Find historical watchdog data for the ccicerrigalapps1 server
- Determine what string is searched for in the SnmpManager.log resource on ccicerrigalapps1
- Write a watchdog threshold that sends a MINOR alarm when a particular string occurs in the Ticketer log in the past 30 minutes
- Find out what watchdog traps, if any, were sent in the last week from the qaerrigaldb1 server (not just received, but those sent)
- Write a watchdog threshold that sends a MINOR alarm when there is no active Trap Forwarder in the thread_config table in the snmp_manager database