====== Watchdog Alarm Identifiers ======

Here is an in-progress list of the current watchdog alarms, what they mean, and what you should do if they fire.

===== UnableToRunCommand =====

The UnableToRunCommand alarms are a special case that can appear with any alarm ID. They mean that, rather than the alarm being triggered because a threshold was breached, the threshold check failed to run at all. This is serious, as it means watchdog is currently broken on that server. Possible causes: a database can't be reached, a log file or directory is missing, or the watchdog process doesn't have the permissions it needs.

==== Potential Exceptions ====

=== No space left on device ===

== Exception ==
<code>
java.io.IOException: No space left on device
</code>

== Resolution ==
Remove some unused files to gain free space:
  - Run **df -h**
  - Check if any file system usage is around 100%
  - Go to the **/tmp** folder. If it's a root partition issue (e.g. //lv_root//), **sudo su** first.
  - Delete the //*.tmp// files. That will prove sufficient in the vast majority of cases. If the problem still persists, proceed at your own discretion when deleting files.
=== Log file missing ===

== Exception ==
<code>
java.io.FileNotFoundException: <log file path> (No such file or directory)
</code>

== Resolution ==
This is a known issue with the ebonding application (ExteNet only). Occasionally a log file is not created on log rollover.
The solution is to restart the ebonding application on the affected handler.

===== Low Disc Space Alarms =====

These are pretty self-explanatory: a partition on the server is full or nearly full.

The alarm ID in this case will be the name of the affected partition, e.g. "/", "/tmp" or "/var".

**What to do:** Check in on the server. Run **df -h** to see current free space on all partitions. Check for any obviously large or unneeded files that can be safely removed.

For a database partition this is more serious and may mean we need a master swap. Binlogs can be (carefully!) cleared out to free up some space to buy a bit more time.

===== General Application Alarms =====

These are checks carried out against each application to ensure it is still running OK.

==== NocPortal-catalina.log / ReportingManager-catalina.out / SnmpManager-catalina.out / Ticketer-catalina.out / Cas-catalina.out - "PermGen" ====

A PermGen error has been spotted in the application's catalina.out log.

==== NocPortal.log / ReportingManager.log / SnmpManager.log / Ticketer.log / Cas.log - "OutOfMemory" ====

An out-of-memory error has been spotted in the application's log.

==== NocPortalApplicationTomcatServerProcess / ReportingManagerApplicationTomcatServerProcess / SnmpManagerApplicationTomcatServerProcess / TicketerApplicationTomcatServerProcess / CasApplicationTomcatServerProcess ====

The application's process wasn't found on the server's process list.

==== NocPortalJvm / ReportingManagerJvm / SnmpManagerJvm / TicketerJvm / CasJvm ====

The application heap size is too high!

==== NocPortalBackups / ReportingManagerBackups / SnmpManagerBackups / TicketerBackups ====

The automated backup process for this application hasn't completed.

==== NocPortalURLCheck / ReportingManagerURLCheck / SnmpManagerURLCheck / TicketerURLCheck ====

The watchdog server can't reach the application's URL.

==== NocPortal_Database / ReportingManager_Database / SnmpManager_Database / Ticketer_Database ====

A check against the application's database has failed; investigate database connectivity.

===== SNMP Manager Alarms =====

==== ActiveAlarmCreation ====

No new active_alarm entries have been created in the database in the last ten minutes.

==== apps1Failover / apps2Failover / handlerFailover ====

The SNMP Manager distributor has logged that one of the handlers is not responding in a timely manner and has taken it off the trap distribution list.

==== autoDiscoveryPollCheck / alarmAuditStuck ====

The AutodiscoveryAlarmJob has been running for more than 16 hours and is probably stuck.

| + | |||
| + | ==== ClientAlarmPollerCheck / ClientAlarmPollerInactive ==== | ||
| + | |||
| + | The client active alarm poller im the SNMP Manager has stopped working. This means that the alarm list the users see in the Node Monitor isn't updating. Trap processing and alarm creation may still be happening. We have also seen this as the first canary to die when the entire SNMP Manager is slowing to a crawl and stopping. | ||
| + | |||
| + | Troubleshooting: | ||
| + | Be sure that you are in the handler where the issue occurred, go to the remote discovery script' | ||
| + | |||
| + | |||
| + | import com.errigal.snmpmanager.threads.pollers.* | ||
| + | ClientActiveAlarmPoller.CLIENT_ACTIVE_ALARM_POLLER.run() | ||
| + | |||
| + | If the above doesn' | ||
==== ConnectionCountCheck / ConnectionCountHigh ====

A temporary watchdog set up for Crown for CCSUPPORT-2282. A cronjob checks the current DB connection count logged by the SNMP Manager and, if it gets too high, puts the word ALARM into a text file that is checked by this job. The connections were being eaten up by the searchable plugin and take a long time to release.

Start monitoring the number of connections with **tail -f** on the connection-count log under **~/**.

==== databaseClipLog / ClipProcessNotEnded ====

The database clip script that runs on CCICERRIGALDB2 did not finish today.

==== GeneralTrapSummaryCreation / generalTrapSummaryCreationInactive ====

No new general_trap_summary entries have been created in the database in the last five minutes.

==== HubAuditPoll / HubAuditNoEndDate ====

The Hub Audit has been running for more than twelve hours and is probably stuck.

==== LinkPollerLogCheck / LinkPollerInactive ====

The link poller has either stopped or is not able to reach any controllers. Restart the IP connectivity to see if the link poller process activates. Tail the logs and check the link poller via:

<code>
https://...
</code>

==== PotentialIDMSFailureImminent / BothHandlerThreadsInactive ====

Checks back 15 minutes in the database to see if the tab_modification_log table has entries for __both__ app handler threads showing inactive. The internal thread for both app handlers being inactive would indicate a potential outage for the application.

**NOTE:** the check relies on failure in the same minute period in the thread table (i.e. it groups by Day, Hour, Minute). A failure at 10:58 and one at 11:02 will be missed by this query at present.

**Temporary Solution** for Both App Handlers Dead, which is indicated by PotentialIDMSFailureImminent or BothHandlerThreadsInactive (apps1Failover and apps2Failover from the LB).
Only do this if the alarm doesn't clear on its own: restart the scheduler by hitting the restartScheduler endpoints:

<code>
http://...
http://...
http://...
http://...
</code>

Then grep for ''HANDLER is'' in the LB's SnmpManager.log to make sure both of the handlers show ''alive''.

Use port 8442 for ATC.

If the handlers are not up, then restart SnmpManager on both handlers.
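The grep step can be wrapped so it succeeds only when the two most recent handler lines both report alive. The log path and exact line format here follow the description above and should be verified against the actual log.

```shell
# Exit 0 only if the last two "HANDLER is" lines in the LB's SnmpManager.log
# both say alive; nonzero otherwise.
handlers_alive() {   # usage: handlers_alive /path/to/SnmpManager.log
  [ "$(grep 'HANDLER is' "$1" | tail -n 2 | grep -c 'alive')" -eq 2 ]
}
```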
==== SNMPManager_Alarm_Audit_Job ====

The alarm audit job has been running for more than 16 hours.

==== SNMPManager_Trap_Count_Threshold_Job ====

The trap count threshold job has been running for more than 30 minutes.

==== SNMPManager_Zinwave_Job ====

The Zinwave job has been running for more than two and a half hours. It's probably stuck.

==== TrapCountCheck / TrapCountBreached ====

No new trap entries in the database in the last five minutes.

Check the SNMP Manager load balancer and handler logs for errors. Something is probably wrong with trap processing. A restart may be required.

==== TrapForwarderLogCheck / TrapForwarderInactive ====

The trap forwarder hasn't run in half an hour.

==== TrapParsingCheck / TrapParsingInactive ====

The expected trap-parsing string hasn't appeared in the SNMP Manager log recently, suggesting trap parsing has stopped.

===== Ticketer Alarms =====

==== RemoteTicketCreation / remoteTicketCreationInactive ====

It looks like new tickets are not being opened. This could be a serious issue. Verify that alarms are still opening and verify that they should have been ticketing.

===== Alarm Sync =====

==== AutoSyncProcess ====

Fires as AlarmSyncTasksNotRunning when the auto sync process has stopped running sync tasks.

==== SlowAutoSyncSchedule ====

This checks to see if there have been more than 10 auto sync entries in the general_discovery_sync_history table within the last day that have taken more than 12 hours to complete. This could indicate that the sync scheduler is being slowed down by too many tasks or by very slow ones.

===== Orchestrator =====

==== MultiTechModemNotSyncing ====

Multitech modems on KLA are polled every 2 minutes to get performance parameters. If the tracker log line stating discoveredName = multitechModem is missing, the modems are not being polled and synced.

==== ElasticSearchFieldLimitReached ====

The logs show an error stating that the total fields have reached their limit in Elasticsearch. The error will display the current number of fields, and Elasticsearch can be checked for the value of "index.mapping.total_fields.limit" on the affected index.

===== Right of Way Alarms =====

==== ROWRabbitMQCheck ====

Fires as ROWCannotReachRabbitMQ when the Right of Way application cannot reach RabbitMQ.

==== ROWSubmissionOrphaned ====

Checks the database to see if there are any Right of Way submissions that have become orphaned in the last 10 minutes.

===== Other =====

==== ApacheLoadBalancerServerProcess / ApacheLoadBalancerServerProcessInactive ====

The httpd process on the load balancer has stopped. If this wasn't intentional, please investigate and restart httpd.

==== zinwaveRelay / Relay Reset Failed ====

A check of the URL https://[ip address]/... on the Zinwave relay has failed.

==== HighCPU ====

The following is a good link describing the difference between load average CPU checks and % CPU checks (both of which are implemented on ExteNet currently):

http://...

==== NewerHighCPUCheck ====

Checks the average CPU workload (uses **mpstat**). Every ten seconds the current average CPU workload is written to the CPU usage history file, and every minute (via crontab) it is checked by the acceptable CPU usage script. The result (**CPU HIGH** or **CPU NORMAL**) is recorded to the acceptable CPU usage log file, which is later checked by the watchdog **run.sh** script.

=== Setup Instructions ===

  - Comment out the watchdog crontab entry.
  - Retrieve the latest version of the CPU usage tracking scripts from the watchdog repository (scripts folder).
  - Put the CPU usage tracking scripts in the watchdog directory under **~/**.
  - **sudo chmod 776** the script files.
  - Check that the file and directory paths in the scripts reflect what is on the server.
  - Set the required threshold (default: 50% of CPU power) in the acceptable_cpu_usage.sh file.
  - **touch cpu_check_pass.log** and **sudo chmod 776 cpu_check_pass.log**
  - **touch cpu_usage_history.log** and **sudo chmod 776 cpu_usage_history.log**
  - **touch cpu_usage.log** and **sudo chmod 776 cpu_usage.log**
  - Add the following entry to the watchdog config under **~/** (elided values below follow the original page):

<code>
'NewerHighCPUCheck' {
  type = '...'
  thresholds {
    a = [name: '...']
  }
  parameters { // expected parameters: logFileLocation
    logFileLocation = '/...'
    renameLog = false
  }
}
</code>

  - Start the CPU usage tracker: **nohup ./**...
  - Add the following crontab entry: ''*/1 * * * * ~/...''
  - Verify that the cpu_usage.log file is being updated every 10 seconds.
  - Run the acceptable_cpu_usage.sh script: **bash acceptable_cpu_usage.sh**
  - Verify that the cpu_check_pass.log file is no longer empty.
  - Uncomment the watchdog crontab entry.
  - Run the watchdog script manually (**sudo bash ~/**...) to verify the new check.

==== ThreadCountCheck ====

Checks to see how many threads are being used by the user ''scotty''.

Command to check the max thread limit on a server: ''ulimit -u''

Command to check the current number of threads in use: ''top -b -H -u scotty -n 1 | wc -l''

==== mNetOpenVPNClientService ====

ExteNet only.

Checks if the OpenVPN client is running on the load balancer. The OpenVPN client is used to receive mNET traps.

Starting the OpenVPN client will clear this watchdog:

<code>
sudo service openvpn start
</code>

==== SLA_Groovlet_Changed_Master ====

ATC only.

This watchdog came about due to a weekend where an unknown process swapped groovlets from a QA env to a Prod env and the Prod one to a QA env. When the testing changes were on production, the resolution was to get the script from QA and return it to production. This happened twice, both times outside of work hours. Only one script had modifications on QA at the time.

It's unlikely this will happen again, but it might be related to a reinit. The watchdog simply ensures that the script details are hashed and the hash matches a previously taken hash. If any updates are made to the SLA groovlet, this watchdog will need to be updated with a new hash.

==== NetworkElementsWithNullSystemTypes ====

This is for ATC billing.
It was noted that if a Network Element has a null system type then it is not counted in the billing process. To avoid these mistakes, a watchdog was added that fires until these are all cleaned up. If anything new comes up, the watchdog will fire again. There is a report for this too. This is not a concern for Ops; Patrick or someone else will handle these.
==== MibDeactivation ====

ExteNet only.
This checks for the mib ma_events_2_26.mib and whether its active status has been set to NO.

To fix, find the mib named ma_events_2_26 in the snmp_manager database and set its active field from NO to YES, for example:

<code>
SELECT * FROM snmp_manager.mib m WHERE m.name LIKE "%ma_events_2_26%";
-- suggested completion of the documented fix; verify the row first
UPDATE snmp_manager.mib SET active = 'YES' WHERE name LIKE "%ma_events_2_26%";
</code>
==== MibDeletion ====

ExteNet only.
After an investigation into the cause of MibDeactivations, it was found that everything was behaving normally. However, if the mib existed on the server but didn't exist in the database, the application would add the mib and set its active status to NO. This watchdog will hopefully fire ahead of time to catch a MibDeactivation before it happens.

It is unlikely to happen again, but this is a precaution.
==== KeyStatsGenerationFailed ====

Checks that the Key Stats application has run correctly; the application is run via a cron job.

The watchdog checks the log for an error entry indicating the run failed.

Log location: **/**...

If changes are made to resolve the watchdog, run Key Stats manually to clear it; find the command in the crontab:

<code>
sudo crontab -e
</code>

==== BubbleAppConnectionError ====

This watchdog checks the Bubble App for connection timeouts.

The watchdog checks the log for a connection timeout error.

The Bubble App URL is ...com/...

A restart of the Bubble App will be required if a spinning gif is displayed and the application does not load.

After the restart, verify that the application loads correctly and the watchdog clears.