Errigal Watchdog Overview

Author: Eoin Joy

In order to familiarise yourself with Watchdogs, please watch the following presentation on the topic: https://errigal.webex.com/errigal/ldr.php?RCID=62ba3b868fa32d08eb22dbb0b41252e9

Also, please find the associated presentation that accompanies the video: https://docs.google.com/presentation/d/1lJH0qciK2ql72h55eDma8Ugdd5gW1kjRc9w7mQX7OcY/edit#slide=id.p4

The Errigal Watchdog application, or Watchdog, is a tool that runs on our own and our customers' servers to monitor our applications and the systems those applications rely on, e.g. database connections.

What is the Watchdog?

The Watchdog is an application, written in Groovy, that runs on a regular schedule. Should it detect a problem, it notifies the central monitoring server, which in turn lets the team at Errigal, and sometimes the customer themselves, know when a problem has occurred.

Where is it deployed?

It is currently deployed to the clusters of:

ExteNet Production

And also to individual servers in the rest of the network, such as:

atlas
olympus
mobipcs
iris
melpomene
scottypro

How do I Check the Status of Watchdog Alarms?

The status of Watchdog Alarms can be found both locally on the server running the watchdog and on the central monitoring server, Atlas.

Atlas

Atlas is the server that accepts traps sent by watchdogs in production environments.
These watchdog traps are processed and created as tickets in the Ticketer also running on atlas.
The team at Errigal is notified via email (and also SMS if it is a critical alarm) when such tickets are created on atlas.

The Atlas Node Monitor can be accessed publicly from https://www.errigalnoc.com:8080/SnmpManager/nodeMonitor or from the VPN at https://atlas.err:8080/SnmpManager/nodeMonitor
The Atlas Ticketer can currently only be accessed from the VPN at http://atlas.err:8083/Ticketer/

Note: DO NOT LOG IN WITH THE ADMIN ACCOUNT - PLEASE USE YOUR OWN ACCOUNT PROVIDED

Node Monitor

The Node Monitor on atlas shows all active watchdog alarms that successfully reached atlas. Each server is represented as a Parent/Hub element, and each resource (explained below) that has alarmed is represented as a Child/Node element under its server.

Ticketer

The Ticketer on atlas manages the tickets created when watchdog traps are received or cleared. The Ticketer also sends the notification emails and SMS messages to the Errigal team when the status of a ticket changes.
The Ticketer can also be searched for historical data on watchdogs received.

Running the Watchdog

The Watchdog must always be run as a user with sufficient permissions to bind port 162, the port used for SNMP traps.
In any given Watchdog installation, the script used to run the watchdog is located in the watchdog/script/ directory.

To run the Watchdog manually, use the following from that directory:
/home/watchdog/script] sudo ./run.sh

Cron

The Watchdog is executed automatically by cron, which also determines how often it runs. See the crontab manual page:
http://man7.org/linux/man-pages/man5/crontab.5.html

Like the manual execution, the automatic execution is performed as a user with permissions to use port 162.
To edit the scheduling of the watchdog for a server, run the following to access the cron table for the root user:
sudo crontab -e

Logging

The watchdog logs to the watchdog/logs/ directory, in files with names like Watchdog.2016-08-15. These logs contain the same output that is produced when the watchdog is run manually.

Creating Watchdogs

The Watchdog does not monitor every possible situation on a server; it must be told what to look for and where.

The Resource Config File

The Resource Config File defines what the watchdog checks each time it runs, where it sends the resulting traps, and how it identifies itself.
This file is called ResourceConfig.groovy and is located in the watchdog/resources/ directory.
It contains several structures: the remoteNms, the localSystem, and the resources.
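The block syntax used in this file resembles what Groovy's standard ConfigSlurper class reads. As a rough sketch only, and assuming the file is ConfigSlurper-compatible (the Watchdog's own loader is not described here and may differ), a snippet can be parsed and inspected like this. The values are copied from the examples in this document.

```groovy
// Sketch: parse a config snippet with Groovy's built-in ConfigSlurper.
// Assumption: the Watchdog's real loader may work differently.
def text = '''
remoteNms {
  ipAddress = '10.91.100.101'
  snmpPort = 162
}
localSystem {
  customerName = "Errigal"
  hostName = "EXTAPPS1"
  resources {
    a = '/'
    d = '/home'
  }
}
'''

def config = new ConfigSlurper().parse(text)

// Print the values the watchdog would use to send and label its traps.
println "Traps go to ${config.remoteNms.ipAddress}:${config.remoteNms.snmpPort}"
println "Host name: ${config.localSystem.hostName}"
println "Active resources: ${config.localSystem.resources.values()}"
```

To try this against a real file rather than an inline string, ConfigSlurper also accepts a URL, e.g. new ConfigSlurper().parse(new File('ResourceConfig.groovy').toURI().toURL()). A syntax error in the file shows up as an exception at parse time.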

Remote NMS

The remoteNms portion of this file contains the IP address of the receiving SNMP Manager and the port to send the trap to.

remoteNms {
  ipAddress = '10.91.100.101'
  snmpPort = 162
}

Database URL

The databaseURL is defined outside the localSystem object. It tells the watchdog where to find the local database it requires to run.

databaseURL = "jdbc:mysql://localhost:3306/watchdog?user=root&password=TYPEMYSQLPASSWORDHERE"

Local System

The localSystem contains the customer name, host name, and list of active resources.
The customerName determines which carrier the system is placed under on the remote system.
The hostName determines what this server will be called in the remote system, also determining what will go into the email to the Errigal team as the host name.
The resources determine which of the Resources described in the file are actively run when the watchdog is executed. Because this config file is a Groovy script, you can use objects such as Date to determine the system time and only run certain resources at certain times or on certain days.

localSystem {
  customerName = "Errigal" // This value is sent with certain traps to help with element management at NMS side
  hostName = "EXTAPPS1"
  resources {
    a = '/'
    d = '/home'
    e = '/boot'
    f = '/dev/shm'
    ...
  }
  ...
}
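As an illustration of time-dependent resources, the fragment below activates a resource only during certain hours. It is a sketch, not a tested configuration: the key b, the 08:00 to 18:00 window, and the choice of TrapParsingCheck are hypothetical, and it assumes ordinary Groovy statements such as if are evaluated inside the resources closure.

```groovy
// Hypothetical fragment of ResourceConfig.groovy.
// java.util.Calendar is used here; a Date-based check would work similarly.
def hour = Calendar.getInstance().get(Calendar.HOUR_OF_DAY)

localSystem {
  customerName = "Errigal"
  hostName = "EXTAPPS1"
  resources {
    a = '/'                      // always active
    if (hour >= 8 && hour < 18) {
      b = 'TrapParsingCheck'     // hypothetical: only active between 08:00 and 18:00
    }
  }
}
```

Because the condition is evaluated every time the watchdog runs, a resource dropped from the list outside its window is simply not checked on that run.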

Resources

Each resource is defined by a type, a set of thresholds, and a set of parameters.
These resources are contained within the localSystem object.

localSystem {
  ...
  'TrapParsingCheck' {
    type = 'LOG'
    thresholds {
      a = [name: 'checkLog', type: 'NOT_EQUAL', value: ['GeneralTrapSummaryWrapper  - Created Active Alarm:'], level: ['CRITICAL'], alarmId: 'TrapParsingInactive']
    }
    parameters { // expected parameters: logFileLocation
      logFileLocation = '/home/scotty/logs/grails/SnmpManager.log'
      renameLog = false
      withTime = true
      dateFormat = "yyyy-MM-dd HH:mm:ss"
      minutesBack = 5
    }
  }
  ...
}

Type

The type helps to determine what kind of check the watchdog needs to perform, and also specifically what parameters will be required.
The most common types are LOG, MYSQLDB, and HDD. Other types exist, but these are the ones you will see most frequently.
LOG types look in a text file for a string.
MYSQLDB types perform a query and look at the result.
HDD types look at the space used on a particular partition.
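For comparison with the LOG example above, a MYSQLDB resource might be laid out roughly as follows. This is a skeleton only. The parameter names are the required ones listed in the Parameters section below, but the resource name and every value are hypothetical, and the threshold names and query parameter for MYSQLDB are not covered in this document, so they are left out rather than guessed.

```groovy
localSystem {
  'ExampleDatabaseCheck' {              // hypothetical resource name
    type = 'MYSQLDB'
    thresholds {
      // Intentionally empty: copy the threshold format from an existing
      // MYSQLDB resource rather than from this sketch.
    }
    parameters { // required for MYSQLDB: driver, host, port, database, user, password
      driver = 'com.mysql.jdbc.Driver'  // assumption: MySQL Connector/J driver class
      host = 'localhost'                // hypothetical
      port = 3306                       // hypothetical (MySQL default)
      database = 'snmp_manager'         // hypothetical choice of database
      user = 'TYPEMYSQLUSERHERE'
      password = 'TYPEMYSQLPASSWORDHERE'
    }
  }
}
```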

Thresholds

A threshold is a map of values that is used to determine what will be compared at the time that the watchdog runs.
The name determines exactly which test to perform.
The type can be EQUAL, NOT_EQUAL, MIN, or MAX. This determines, for instance, whether finding a string of text is an alarming situation or a clearing situation.
The value is what the test is comparing against, e.g. the string being looked for in the log.
The level determines the severity of the trap sent. In increasing order, the levels are INFORMATION, WARNING, MINOR, MAJOR, and CRITICAL.
The alarmId determines the alarmIdentifier of the trap, which becomes the alarm identifier of the watchdog alarm.
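To make the fields concrete, here is a sketch of a single threshold for a LOG resource, modelled on the TrapParsingCheck example earlier in this document. The searched string and the alarmId are hypothetical. It also assumes that EQUAL raises the alarm when the string is found, i.e. the mirror image of the NOT_EQUAL example.

```groovy
thresholds {
  a = [
    name: 'checkLog',               // which test to perform (same test as the LOG example)
    type: 'EQUAL',                  // assumption: alarm when the value IS found
    value: ['OutOfMemoryError'],    // hypothetical string to search for
    level: ['MAJOR'],               // severity of the trap sent
    alarmId: 'ExampleOutOfMemory'   // hypothetical alarm identifier
  ]
}
```

Choosing a distinct alarmId per threshold keeps alarms separate in the Node Monitor and makes them easier to find in the Ticketer later.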

Parameters

Depending on the type of the resource, some parameters are required (logFileLocation for LOG types; driver, host, port, database, user, and password for MYSQLDB types), whereas others are optional.
Some of the optional parameters available alter how the resource performs its check.
Of particular interest is withTime for LOG type resources. If withTime is set to true, you must also include dateFormat and minutesBack parameters. With these parameters populated, the log check now looks at all lines of the file beginning with a timestamp in the format defined in dateFormat, looking back minutesBack minutes, instead of looking through the whole log.
This has the benefit of allowing an alarm to clear if an issue has abated, and also allowing it to alarm again later that day in the event that a legitimate clear is received in the meantime.
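The effect of withTime can be pictured with a small standalone Groovy script. This is an illustration of the idea only, not the Watchdog's actual implementation. The recentLines helper and the sample log lines are made up for the example.

```groovy
import java.text.ParsePosition
import java.text.SimpleDateFormat

// Hypothetical helper: keep only lines that begin with a timestamp in
// dateFormat and fall within the last minutesBack minutes before 'now'.
List<String> recentLines(List<String> lines, String dateFormat, int minutesBack, Date now) {
    def parser = new SimpleDateFormat(dateFormat)
    long cutoff = now.time - minutesBack * 60000L
    return lines.findAll { String line ->
        // parse() returns null when the line does not start with a timestamp
        Date stamp = parser.parse(line, new ParsePosition(0))
        stamp != null && stamp.time >= cutoff && stamp.time <= now.time
    }
}

def fmt = 'yyyy-MM-dd HH:mm:ss'
def now = new SimpleDateFormat(fmt).parse('2016-08-15 12:00:00')
def log = [
    '2016-08-15 11:50:00 INFO  old entry, outside the window',
    '2016-08-15 11:57:30 INFO  GeneralTrapSummaryWrapper  - Created Active Alarm: example',
    '    at some.stack.Frame(Frame.java:1)'   // no timestamp, so it is skipped
]

// With minutesBack = 5, only the 11:57:30 line is considered by the check.
recentLines(log, fmt, 5, now).each { println it }
```

Only the lines that survive this filter would be searched for the threshold value, which is why an old occurrence of a string eventually stops counting and the alarm can clear.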

Deploying the Config File

To deploy an updated ResourceConfig.groovy file, create a branch of the watchdog-resource-config-backup repository and edit the files in the resource_config_files directory, or create new files using the format resource_config_files/{env}/{handler}/ResourceConfig.groovy. Once the changes are complete, merge the branch into master and run the watchdog_resource_deployment Jenkins pipeline, which uses Ansible to deploy to the server. The pipeline checks whether the file has been modified directly on the server and emails support if it has.

Testing a New or Edited Watchdog Resource

Please note: Take care when testing out new or modified watchdogs. Ensure you test this in the morning (Irish time) if possible so you are allowed enough time to revert changes if necessary. Also be sure to reply to the email chain of the triggered watchdog to denote it was a test.

There are some steps that should be taken when testing the implementation of a new or edited watchdog resource.

1. Make a backup of the current ResourceConfig.groovy file.
2. Disable the Watchdog in the cron job by editing the cron table with
   sudo crontab -e
   and commenting out the line for the watchdog by adding a # character at the start of the line.
3. Run the watchdog manually with
   sudo ./run.sh
   and ensure the current configuration does not cause any exceptions in the execution of the watchdog.
4. If adding a new resource, add an entry with the desired name to the resources closure, then create a resource with that name containing the required thresholds and parameters in the ResourceConfig.groovy file. If editing an existing resource, make the required changes to the ResourceConfig.groovy file.
5. Run the watchdog manually again to ensure that, with your changes, there are still no exceptions during the process.
6. Let the team know that you will be sending a test watchdog.
7. Trigger your new watchdog by altering a threshold slightly, or by some other non-destructive method, and run the watchdog again with
   sudo ./run.sh
8. Ensure the alarm is created on the Atlas Node Monitor.
9. Ensure an email is sent to developers@errigal.com.
10. Ensure an SMS message is sent to the normal recipients, if applicable.
11. Restore your threshold, or by some other means create the situation that will produce a clearing trap.
12. Ensure the alarm clears, and that the clearing email and clearing SMS are sent.
13. Re-enable the watchdog in the cron table by running
    sudo crontab -e
    and uncommenting the watchdog line.

Further Resources

Useful Documents

Watchdog Resource Config How To:
https://docs.google.com/document/d/12rxXTgmQwYRlA7_AfMjHK-hWm6Z9z8frt2ePDmuCj1s

Errigal Onboarding Video

The Errigal Onboarding Video (2016-January-25) for the Watchdog was recorded on Webex and presented by Eoin Joy. Viewing it may prompt you to install a Webex browser plugin:
https://errigal.webex.com/errigal/ldr.php?RCID=62ba3b868fa32d08eb22dbb0b41252e9

Assessment

Please submit to Eoin Joy for correction.

Find historical watchdog data for the ccicerrigalapps1 server
Determine what string is searched for in the SnmpManager.log resource on ccicerrigalapps1
Write a watchdog threshold that sends a MINOR alarm when a particular string occurs in the Ticketer log in the past 30 minutes
Find out what watchdog traps if any were sent in the last week from the qaerrigaldb1 server (not just received, but those sent)
Write a watchdog threshold that sends a MINOR alarm when there is no active Trap Forwarder in the thread_config table in the snmp_manager database
