====== Errigal Watchdog Overview ======

Author: Eoin Joy

To familiarise yourself with Watchdogs, please watch the following presentation on the topic: [[https://errigal.webex.com/errigal/ldr.php?RCID=62ba3b868fa32d08eb22dbb0b41252e9]]

The slides that accompany the video are available here: [[https://docs.google.com/presentation/d/1lJH0qciK2ql72h55eDma8Ugdd5gW1kjRc9w7mQX7OcY/edit#slide=id.p4]]

The Errigal Watchdog application, or Watchdog, is a tool that runs on our own and our customers' servers to monitor our applications and the systems those applications rely on, e.g. database connections.

----

===== What is the Watchdog? =====

The Watchdog is an application, written in Groovy, that runs on a regular schedule. When it detects a problem, it notifies the central monitoring server, which in turn alerts the team at Errigal and, in some cases, the customer.

----

===== Where is it deployed? =====

It is currently deployed to the clusters of:
  * ExteNet Production
It is also deployed to individual servers in the rest of the network, such as:
  * atlas
  * olympus
  * mobipcs
  * iris
  * melpomene
  * scottypro

====== How do I Check the Status of Watchdog Alarms? ======

The status of Watchdog alarms can be found both locally on the server running the Watchdog and on the central monitoring server, Atlas.

===== Atlas =====

Atlas is the server that accepts traps sent by watchdogs in production environments.\\
These watchdog traps are processed and turned into tickets in the Ticketer, which also runs on atlas.\\
The team at Errigal is notified by email (and also by SMS for a critical alarm) when such tickets are created on atlas.

The Atlas Node Monitor can be accessed publicly at https://www.errigalnoc.com:8080/SnmpManager/nodeMonitor or from the VPN at https://atlas.err:8080/SnmpManager/nodeMonitor \\
The Atlas Ticketer can currently only be accessed from the VPN at http://atlas.err:8083/Ticketer/

**Note: DO NOT LOG IN WITH THE ADMIN ACCOUNT - PLEASE USE YOUR OWN ACCOUNT PROVIDED**

==== Node Monitor ====

In the Node Monitor on atlas, every active watchdog alarm that successfully reached atlas is shown under its server (represented as a Parent/Hub element) against the resource (explained below) that has alarmed (represented as a Child/Node element).

==== Ticketer ====

The Ticketer on atlas manages the tickets created when watchdog traps are received or cleared. It also sends the notification emails and SMS messages to the Errigal team when the status of a ticket changes.\\
The Ticketer can also be searched for historical data on the watchdog traps received.

====== Running the Watchdog ======

The Watchdog must always be run as a user with sufficient permissions to bind to port 162, the port used for SNMP traps.\\
In any given Watchdog installation, the script used to run the Watchdog is located in the ''watchdog/script/'' directory.

To run the Watchdog manually, use:\\ ''/home/watchdog/script] sudo ./run.sh''

===== Cron =====

The automatic execution of the Watchdog, and how often it runs, is controlled by cron.\\ http://man7.org/linux/man-pages/man5/crontab.5.html

Like manual execution, automatic execution is performed as a user with permission to use port 162.\\
To edit the Watchdog schedule for a server, run the following to open the cron table for the root user:\\ ''sudo crontab -e''

===== Logging =====

The Watchdog writes its logs to the ''watchdog/logs/'' directory, in files with names like ''Watchdog.2016-08-15''. These logs contain the same output that is produced when the Watchdog is run manually.

====== Creating Watchdogs ======

The Watchdog does not monitor every possible situation on a server; it must be told what to look for and where.

===== The Resource Config File =====

The Resource Config File defines what the Watchdog checks each time it runs, where it sends the resulting traps, and how it identifies itself.\\
This file is called ''ResourceConfig.groovy'' and is located in the ''watchdog/resources/'' directory.\\
It contains several structures: the remoteNms, the localSystem, and the resources.

===== Remote NMS =====

The remoteNms portion of this file contains the IP address of the receiving SNMP Manager and the port to send the trap on.

<code groovy>remoteNms {
  ipAddress = '10.91.100.101'
  snmpPort = 162
}</code>

===== Database URL =====

The databaseURL setting, which sits outside the localSystem object in the file, defines where the Watchdog looks locally for the database it requires to run.

<code groovy>databaseURL = "jdbc:mysql://localhost:3306/watchdog?user=root&password=TYPEMYSQLPASSWORDHERE"</code>

===== Local System =====

The localSystem contains the customer name, host name, and list of active resources.\\
The customerName determines which carrier the system is placed under on the remote system.\\
The hostName determines what this server is called in the remote system, and therefore the host name that appears in the email to the Errigal team.\\
The resources determine which of the Resources described in the file are actually run when the Watchdog is executed. Because this config file is a Groovy script, you can use objects such as Date to read the system time and run certain resources only at certain times or on certain days.

<code groovy>
localSystem {
  customerName = "Errigal" // This value is sent with certain traps to help with element management at NMS side
  hostName = "EXTAPPS1"
  resources {
    a = '/'
    d = '/home'
    e = '/boot'
    f = '/dev/shm'
    ...
  }
  ...
}
</code>

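The conditional use of Date mentioned above can be sketched as follows. This is a minimal illustration, not taken from a deployed configuration: it assumes that ordinary Groovy statements are allowed at the top of ''ResourceConfig.groovy'' and inside its closures (as they would be if the file is parsed in a ConfigSlurper-like way), and the key ''g'' and the decision to run ''TrapParsingCheck'' on weekdays only are purely hypothetical.

<code groovy>
// Assumption: plain Groovy is permitted in ResourceConfig.groovy.
def calendar = Calendar.getInstance()
calendar.setTime(new Date())                       // current system time
def dayOfWeek = calendar.get(Calendar.DAY_OF_WEEK)
def isWeekday = !(dayOfWeek in [Calendar.SATURDAY, Calendar.SUNDAY])

localSystem {
  resources {
    a = '/'                      // always checked
    if (isWeekday) {
      g = 'TrapParsingCheck'     // hypothetical: only checked Monday to Friday
    }
  }
}
</code>

Computing the flag once at the top of the file keeps the ''resources'' closure readable when several resources share the same schedule.
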
==== Resources ====

Each resource is defined by a type, a set of thresholds, and a set of parameters.\\
These resources are contained within the localSystem object.\\
<code groovy>
localSystem {
  ...
  'TrapParsingCheck' {
    type = 'LOG'
    thresholds {
      a = [name: 'checkLog', type: 'NOT_EQUAL', value: ['GeneralTrapSummaryWrapper  - Created Active Alarm:'], level: ['CRITICAL'], alarmId: 'TrapParsingInactive']
    }
    parameters { // expected parameters: logFileLocation
      logFileLocation = '/home/scotty/logs/grails/SnmpManager.log'
      renameLog = false
      withTime = true
      dateFormat = "yyyy-MM-dd HH:mm:ss"
      minutesBack = 5
    }
  }
  ...
}
</code>

=== Type ===

The type determines what kind of check the Watchdog performs, and therefore which parameters are required.\\
The most common types are ''LOG'', ''MYSQLDB'', and ''HDD''. Other types exist, but you will see them less often.\\
''LOG'' types look in a text file for a string.\\
''MYSQLDB'' types perform a query and look at the result.\\
''HDD'' types look at the space used on a particular partition.

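As a rough illustration of how the type drives the required parameters, a ''MYSQLDB'' resource might carry a ''parameters'' block like the one below. Only the parameter names come from this page (they are listed under Parameters); the resource name, every value, and the driver class are placeholders or assumptions. The thresholds and the query itself are left elided because their exact form is not described here.

<code groovy>
localSystem {
  ...
  'ExampleDatabaseCheck' {                // hypothetical resource name
    type = 'MYSQLDB'
    thresholds {
      ...                                 // elided: not described on this page
    }
    parameters {
      driver = 'com.mysql.jdbc.Driver'    // assumption: MySQL Connector/J driver class
      host = 'localhost'                  // placeholder
      port = 3306                         // placeholder; may need to be a string
      database = 'snmp_manager'           // placeholder
      user = 'EXAMPLEUSER'                // placeholder
      password = 'TYPEMYSQLPASSWORDHERE'  // placeholder
    }
  }
  ...
}
</code>
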
=== Thresholds ===

A threshold is a map of values that determines what is compared when the Watchdog runs.\\
The name determines exactly which test to perform.\\
The type can be ''EQUAL'', ''NOT_EQUAL'', ''MIN'', or ''MAX''. This determines, for instance, whether finding a string of text is an alarming situation or a clearing one.\\
The value is what the test compares against, e.g. the string being looked for in the log.\\
The level determines the severity of the trap sent, in ascending order: ''INFORMATION'', ''WARNING'', ''MINOR'', ''MAJOR'', and ''CRITICAL''.\\
The alarmId determines the alarmIdentifier of the trap, which becomes the alarm identifier of the watchdog alarm.

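For contrast with the ''NOT_EQUAL'' threshold shown in the Resources example, the sketch below treats finding a string as the alarming situation. It assumes that ''EQUAL'' is the mirror image of ''NOT_EQUAL'' for the ''checkLog'' test, which this page implies but does not state outright. The searched string and the ''alarmId'' are made up for illustration.

<code groovy>
thresholds {
  // Hypothetical: raise a MAJOR alarm when the string IS present in the log.
  a = [name: 'checkLog', type: 'EQUAL', value: ['OutOfMemoryError'], level: ['MAJOR'], alarmId: 'ExampleOutOfMemory']
}
</code>
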
=== Parameters ===

Depending on the ''type'' of the resource, some parameters are required (''logFileLocation'' for ''LOG'' types; ''driver'', ''host'', ''port'', ''database'', ''user'', and ''password'' for a ''MYSQLDB'' type), whereas others are optional.\\
Some of the optional parameters alter how the resource performs its check.\\
Of particular interest is ''withTime'' for ''LOG'' type resources. If ''withTime'' is set to ''true'', you must also include the ''dateFormat'' and ''minutesBack'' parameters. With these parameters populated, the log check considers only the lines of the file that begin with a timestamp in the format defined in ''dateFormat'' and fall within the last ''minutesBack'' minutes, instead of looking through the whole log.\\
This allows an alarm to clear once an issue has abated, and also allows it to alarm again later that day if a legitimate clear was received in the meantime.

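The effect of ''withTime'' can be pictured with the standalone Groovy script below. It is an illustration of the idea only and is not the Watchdog's own code: the closure name ''recentLines'', the sample log lines, and the way the timestamp is cut from the start of each line are all assumptions made for this sketch.

<code groovy>
import java.text.ParseException
import java.text.SimpleDateFormat

// Illustration only: this is NOT the Watchdog's implementation.
def dateFormat = 'yyyy-MM-dd HH:mm:ss'
def minutesBack = 5
def searchString = 'Created Active Alarm:'

def parser = new SimpleDateFormat(dateFormat)
def now = new Date()
def cutoff = new Date(now.time - minutesBack * 60 * 1000L)

// Keep only lines that begin with a parseable timestamp no older than the cutoff.
// Cutting dateFormat.length() characters works for this pattern because it
// formats to exactly as many characters as the pattern itself contains.
def recentLines = { List<String> lines ->
  lines.findAll { String line ->
    if (line.length() < dateFormat.length()) {
      return false
    }
    try {
      def stamp = parser.parse(line.substring(0, dateFormat.length()))
      return !stamp.before(cutoff)
    } catch (ParseException ignored) {
      return false    // no leading timestamp, e.g. a stack trace line
    }
  }
}

// Made-up sample: one line from a minute ago, one from an hour ago, one without a timestamp.
def sample = [
  parser.format(new Date(now.time - 60 * 1000L)) + ' INFO GeneralTrapSummaryWrapper  - Created Active Alarm: 1',
  parser.format(new Date(now.time - 60 * 60 * 1000L)) + ' INFO GeneralTrapSummaryWrapper  - Created Active Alarm: 2',
  '    at com.example.Foo.bar(Foo.groovy:1)'
]

def found = recentLines(sample).any { it.contains(searchString) }
println "Search string seen in the last ${minutesBack} minutes: ${found}"
</code>

Only the first sample line should survive the filter, so an entry from an hour ago can no longer hold an alarm open or keep it cleared.
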
===== Deploying the Config File =====

To deploy the updated ''ResourceConfig.groovy'' file, branch the [[ https://bitbucket.org/errigal/watchdog-resource-config-backup/src/master/ | watchdog-resource-config-backup ]] repository and either edit the files in the ''resource_config_files'' directory or create new files using the format ''resource_config_files/{env}/{handler}/ResourceConfig.groovy''. Once the changes are complete, merge the branch into master and run the [[ http://moros.err:8080/job/application_component-deployment/job/watchdog_resource_deployment/ | watchdog_resource_deployment ]] Jenkins pipeline, which uses Ansible to deploy to the server. The pipeline checks whether the file has been modified directly on the server and emails support if it has.

===== Testing a New or Edited Watchdog Resource =====

**Please note:** Take care when testing new or modified watchdogs. Test in the morning (Irish time) if possible, so that you have enough time to revert changes if necessary. Also be sure to reply to the email chain of the triggered watchdog to indicate that it was a test.

Take the following steps when testing a new or edited watchdog resource.
  - Make a backup of the current ''ResourceConfig.groovy'' file.
  - Disable the Watchdog in the cron job by editing the cron table with\\ ''sudo crontab -e''\\ and commenting out the line for the watchdog with a ''#'' character at the start of the line.
  - Run the watchdog manually with\\ ''sudo ./run.sh''\\ and ensure the current configuration does not cause any exceptions in the execution of the watchdog.
  - If adding a new resource to the configuration, add an entry with the desired name to the ''resources'' closure.
  - Create a resource with that name, containing the required thresholds and parameters, in the ''ResourceConfig.groovy'' file.
  - Or, if editing an existing resource, make the required changes to the ''ResourceConfig.groovy'' file.
  - Run the watchdog manually to ensure that, with your changes, there are still no exceptions during the process.
  - Let the team know that you will be sending a test watchdog.
  - Trigger your new watchdog by altering a threshold slightly, or by some other non-destructive method, and run the watchdog again\\ ''sudo ./run.sh''
  - Ensure the alarm is created on the Atlas Node Monitor.
  - Ensure an email is sent to <nowiki>developers@errigal.com</nowiki>.
  - Ensure an SMS message is sent, if applicable, to the normal recipients.
  - Restore your threshold, or by some other means create the situation that produces a clearing trap.
  - Ensure the alarm clears, and that the clearing email and clearing SMS are sent.
  - Re-enable the watchdog in the cron table\\ ''sudo crontab -e''.

====== Further Resources ======

===== Useful Documents =====
Watchdog Resource Config How To:\\ https://docs.google.com/document/d/12rxXTgmQwYRlA7_AfMjHK-hWm6Z9z8frt2ePDmuCj1s

===== Errigal Onboarding Video =====
The Errigal Onboarding Video (2016-January-25) for the Watchdog was recorded on Webex and presented by Eoin Joy (you will be asked to install a Webex browser plugin to view it):\\ https://errigal.webex.com/errigal/ldr.php?RCID=62ba3b868fa32d08eb22dbb0b41252e9

====== Assessment ======
Please submit to Eoin Joy for correction.

  - Find historical watchdog data for the ccicerrigalapps1 server
  - Determine what string is searched for in the ''SnmpManager.log'' resource on ccicerrigalapps1
  - Write a watchdog threshold that sends a MINOR alarm when a particular string occurs in the Ticketer log in the past 30 minutes
  - Find out what watchdog traps if any were sent in the last week from the qaerrigaldb1 server (not just received, but those sent)
  - Write a watchdog threshold that sends a MINOR alarm when there is no active Trap Forwarder in the thread_config table in the snmp_manager database