Differences

This shows you the differences between two versions of the page.

--- watchdogs:watchdog_alarms [2019/12/16 16:00] – [PotentialIDMSFailureImminent / BothHandlerThreadsInactive] aryan
+++ watchdogs:watchdog_alarms [2021/06/25 10:09] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
+====== Watchdog Alarm Identifiers ======
+Here is an in-progress list of all the current watchdog alarms, what they mean and what you should do if we get them.
+===== UnableToRunCommand =====
+The UnableToRunCommand alarms are a special case that can show up with any alarm ID. It means rather than the alarm being triggered because the threshold was breached the threshold check failed to run correctly at all. This is serious as it means watchdog is currently broken on that server. Possible causes: A database can't be reached, a log file or directory is missing, or the watchdog process doesn't have sufficient privileges to run the command it's trying to.
+==== Potential Exceptions ====
+=== No space left on device ===
+== Exception ==
+<code>2018-02-23 15:01:20,811 [main] ERROR c.o.watchdog.HardwareQuerier - Error while running command :
+java.io.IOException: No space left on device</code>
+== Resolution ==
+Remove some unused files to gain free space:
+  - Run **df -h**
+  - Check if any file system usage is around 100%
+  - Go the **/tmp** folder. If it's a root partition issue (e.g. *lv_root*), *sudo su* first.
+  - Delete the *.tmp* files. That will prove sufficient in the vast majority of the cases. If the problem still persists, proceed at your own discretion deleting files.
+  -
+=== Log file missing ===
+== Exception ==
+<code>2020-12-21 09:00:25,436 [main]  ERROR c.o.watchdog.HardwareQuerier - Error while running command :
+java.io.FileNotFoundException: /home/scotty/logs/grails/ebonding.log (No such file or directory)</code>
+== Resolution ==
+This is a known issue with the ebonding application (ExteNet only). Occasionally a log file is not created on log rollover.
+Solution is to restart the ebonding application on the affected handler.
+===== Low Disc Space Alarms =====
+These are pretty self explanatory - a partition on the server is full or nearly full.
+The alarm ID in this case will be the name of the partition being affected. Examples might be: "/backup", "/export/home", "/var/lib/mysql". They will also have "errigalServerHddLowSpaceAlarm" and "freeSpacePercentLow" in the email title.
+**What to do:** Check in on server. Run "df -h" to see current free space on all partition. Check for any obvious large/temporary files that shouldn't be there. DB dumps, heap dumps, large tars or a bloated temp or backup directory may be to blame.
+For a database partition this is more serious may mean we need a master swap. Binlogs can be (carefully!) cleared out to free up some space to buy a bit more time.
+===== General Application Alarms =====
+These are checks that are carried out against each application to ensure they are still running ok.
+==== NocPortal-catalina.log / ReportingManager-catalina.out / SnmpManager-catalina.out / Ticketer-catalina.out / Cas-catalina.out - "PermGenErrorFound" ====
+A PermGen error has been spotted in the applications catalina.out log.
+==== NocPortal.log / ReportingManager.log / SnmpManager.log / Ticketer.log / Cas.log "OutOfMemoryErrorFound" ====
+An out of memory error error has been spotted in the application's log file.
+==== NocPortalApplicationTomcatServerProcess / ReportingManagerApplicationTomcatServerProcess / SnmpManagerApplicationTomcatServerProcess / TicketerApplicationTomcatServerProcess / CasApplicationTomcatServerProcess - "errigalApplicationProcessInactiveAlarm" ====
+The application wasn't found on the server's process list - "ps -edf". If it wasn't shut down intentionally try to figure out why it stopped immediately and restart it.
+==== NocPortalJvm / ReportingManagerJvm / SnmpManagerJvm / TicketerJvm / CasJvm - "JVMHeapSizeHigh" ====
+The application heap size is too high!
+==== NocPortalBackups / ReportingManagerBackups / SnmpManagerBackups / TicketerBackups - "ExteNetBackupsNotRunning" ====
+The automated backup process for this application hasn't completed.
+==== NocPortalURLCheck / ReportingManagerURLCheck / SnmpManagerURLCheck / TicketerURLCheck - "TheURLisUnavailable" ====
+The watchdog server can't reach the application's public URL. Log into the application yourself to make sure that it's still running and accessible. If so check the server the watchdog came from for connection issues.
+==== NocPortal_Database/ ReportingManager_Database/ SnmpManager_Database/ Ticketer_Database - "QuartzJobsBlocked" or "ConnectAttemptFailed" ====
+If it's a "QuartzJobsBlocked" then one of the quartz jobs on the application has been running for too long and may be stuck. If it's "ConnectAttemptFailed" then the application database couldn't be reached from the server. If the database is inaccessible from the application servers the apps will not run properly!
+===== SNMP Manager Alarms =====
+==== ActiveAlarmCreation ====
+No new active_alarm entries have ben created in the database in the last ten minutes.
+==== apps1Failover / apps2Failover / handlerFailover ====
+The SNMP Manager distributer has logged that one of the handlers is not responding in a timely manner and has taken it off the trap distribution list.
+==== autoDiscoveryPollCheck / alarmAuditStuck ====
+The AutodiscoveryAlarmJob has been running for more than 16 hours and is probably stuck.
+==== ClientAlarmPollerCheck / ClientAlarmPollerInactive ====
+The client active alarm poller im the SNMP Manager has stopped working. This means that the alarm list the users see in the Node Monitor isn't updating. Trap processing and alarm creation may still be happening. We have also seen this as the first canary to die when the entire SNMP Manager is slowing to a crawl and stopping.
+Troubleshooting:
+   Be sure that you are in the handler where the issue occurred, go to the remote discovery script's controller, replace the content of the script with the code below and execute it
+import com.errigal.snmpmanager.threads.pollers.*
+ClientActiveAlarmPoller.CLIENT_ACTIVE_ALARM_POLLER.run()
+If the above doesn't work, SNMP manager restart is problably required
+==== ConnectionCountCheck / ConnectionCountHigh ====
+A temporary watchdog set up for Crown for CCSUPPORT-2282. A cronjob checks the current DB connection count logged by the SNMP Manager and if it gets too high puts the word ALARM into a text file that is checked by this job. The connections were being eatten up by the searchable plugin and take a long time to release.
+Start monitoring the number of connections "tail -f ~/logs/grails/SnmpManager.log | grep 'Active Connection : '". If it grows to hit the maxActive connections defined in SnmpManagerConfig.groovy the entire application will lock up. If this happens the SNMP Manager instance will need to be restarted. If it starts approaching this CALL THE NOC and try to get them to stop doing what's driving it up for a while. Could be caused by loading lots and lots of excel sheets. This should no longer be needed after the 2.18 release.
+==== databaseClipLog / ClipProcessNotEnded ====
+The database clip script that runs on CCICERRIGALDB2 did not finish today.
+==== GeneralTrapSummaryCreation / generalTrapSummaryCreationInactive ====
+No new general_trap_summary entries have been created in the database in the last five minutes.
+==== HubAuditPoll / HubAuditNoEndDate ====
+The Hub Audit has been running for more than twelve hours and is probably stuck.
+==== LinkPollerLogCheck / LinkPollerInactive ====
+The link poller has either stopped or is not able to reach any controllers. Restart the IP connectivity to see if the link poller process activates. Tail the logs, click on 'Start/Stop IP Connectivity Poller' found in the SNMP managers link poller configuration page and check the logs to see if the process stops. Click it again to start the process. If the process doesn't stop, log into the handler that is showing the error and repeat the process.
+<code>https://atclb1.atc:8442/SnmpManager/auth/login?fallback=true
+https://atclb1.atc:8442/SnmpManager/linkPollerConfiguration/list</code>
+==== PotentialIDMSFailureImminent / BothHandlerThreadsInactive ====
+Checks back 15 minutes in the database to see if the tab_modification_log table has entries for __both__ app handler threads showing inactive. Internal thread for both apps handlers being inactive would show a potential outage for the application.
+**NOTE** the check relies on failure in the same minute period in the thread table (ie it groups by Day, Hour, Minute). A failure at 10:58 and 11:02 will be missed by this query at present.
+**Temporary Solution** for Both App Handlers Dead, which is indicated by PotentialIDMSFailureImminent or BothHandlerThreadsInactive(apps1Failover and apps2Failover from LB)
+Only do this if the alarm doesn't clear within 5mins, because it will clear all the scheduled task.
+restartScheduler by hitting "SnmpManager/configuration/restartScheduler" on *each of handlers*, examples for EXTQA:
+  http://qaextapps1.ext:8082/SnmpManager/auth/login?fallback=true
+  http://qaextapps1.ext:8082/SnmpManager/configuration/restartScheduler
+  http://qaextapps2.ext:8082/SnmpManager/auth/login?fallback=true
+  http://qaextapps2.ext:8082/SnmpManager/configuration/restartScheduler
+  grep `HANDLER is` in LB's SnmpManager.log to make sure both of the Handlers show `alive`
+Use port 8442 for ATC
+If the handlers are not up, then need to restart SnmpManager on both handlers.
+==== SNMPManager_Alarm_Audit_Job ====
+The alarm audit job has been running for more than 16 hours.
+==== SNMPManager_Trap_Count_Threshold_Job ====
+The trap count threshold job has been running for more than 30 minutes
+==== SNMPManager_Zinwave_Job ====
+The Zinwave job has been running for more than two and a half hours. It's probably stuck.
+==== TrapCountCheck / TrapCountBreached ====
+No new trap entries in the database in the last five minutes.
+Check the SNMP Manager load balancer and handler logs for errors. Something is probably wrong with trap processing. A restart may be required.
+==== TrapForwarderLogCheck / TrapForwarderInactive ====
+Trap forwarder hasn't run in half an hour.
+==== TrapParsingCheck / TrapParsingInactive ====
+The string "GeneralTrapSummaryWrapper  - Created Active Alarm:" hasn't appeared in the SNMP Manager log on this server in the last five minutes.
+=====Ticketer Alarms=====
+==== RemoteTicketCreation / remoteTicketCreationInactive ====
+Looks like new tickets are not being opened. Could be a serious issue. Verify that alarms are still opening and verify that they should have been ticketing.
+=====Alarm Sync=====
+====AutoSyncProcess====
+AlarmSyncTasksNotRunning, AlarmSeveritySyncTasksNotRunning, PerformanceDiscoveryTasksNotRunning, and ConfigSyncTasksNotRunning only runs at 02:00 everyday and checks the previous days logs to see if the relevant task has ran e.g. inbound.AlarmSyncTask. If the logs do not contain this log line, the watchdog will trigger. This has been caused by auto sync tasks taking too long to complete and the network_element_sync_setting table can be checked to see if any syncs are taking longer than expected.
+====SlowAutoSyncSchedule====
+This checks to see if there has been more than 10 auto sync entries in the general_discovery_sync_history table within the last day that have taken more than 12 hours to complete. This could indicate that the sync scheduler is being slowed down by too many tasks or very slow ones.
+=====Orchestrator=====
+====MultiTechModemNotSyncing====
+Multitech modems on KLA are polled every 2 minutes to get performance parameters. If the tracker log line is missing stating discoveredName = multitechModem, this watchdog will trigger, signifying that the process is not running and will require investigation.
+====ElasticSearchFieldLimitReached====
+The logs show an error stating that the total fields have reached their limit in elastic search. This error will display the current fields and elastic search can be checked to see the value "index.mapping.total_fields.limit" is too low and requires increasing.
+=====Right of Way Alarms=====
+====ROWRabbitMQCheck====
+ROWCannotReachRabbitMQ/ROWTimeout - Checks to see if the Right of Way application can reach RabbitMQ.
+====ROWSubmissionOrphaned====
+Checks the database to see if there are any Right of Way submissions that have become orphaned in the last 10 minutes.
+=====Other=====
+==== ApacheLoadBalancerServerProcess / ApacheLoadBalancerServerProcessInactive====
+The httpd process on the load balanced has stopped. If this wasn't intentional please investigate and restart httpd.
+==== zinwaveRelay / Relay Reset Failed ====
+A check of the URL https://[ip address]/V4_20_6/cgi-bin/ClearFaultRelay.cgi for one of the Zinwave hubs did not return a response code of 302
+==== HighCPU ====
+The following is a good link describing the difference between load average CPU checks and % CPU checks (both of which are implemented on ExteNet currently)
+http://www.linuxjournal.com/article/9001?page=0,0
+==== NewerHighCPUCheck ====
+Checks the average CPU workload (uses **mpstat**). Every ten seconds the current average CPU workload is written to the CPU usage history file and then every minute (crontab) it is checked by the acceptable CPU usage script and the result (**CPU HIGH** or **CPU NORMAL**) is recorded to the acceptable CPU usage log file which is later checked by the Watchdog **run.sh** script.
+=== Setup Instructions ===
+Comment the watchdog crontab entry
+Retrieve the latest version of the CPU usage tracking scripts from the watchdog repository (scripts folder).
+Put the CPU usage tracking scripts in the ~/script/cpu_usage_tracking/ folder.
+sudo chmod 776 on script files.
+Check that the files and directories paths are in the script reflect what is on the server.
+Set the required threshold (default - 50% of CPU power) in the acceptable_cpu_usage.sh file.
+touch cpu_check_pass.log
+sudo chmod 776 cpu_check_pass.log
+touch cpu_usage_history.log
+sudo chmod 776 cpu_usage_history.log
+touch cpu_usage.log
+sudo chmod 776 cpu_usage.log
+Add the following watchdog (~/watchdog/resources/ResourceConfig.groovy) (do not forget to list the watchdog in the localSystem { at the start of the file):
+<code>
+'NewerHighCPUCheck' {
+                type = 'LOG'
+                thresholds {
+                        a = [name: 'checkLog', type: 'EQUAL', value: ['CPU HIGH'], level: ['CRITICAL'], alarmId: 'HighCPU']
+                }
+                parameters { // expected parameters: logFileLocation
+                        logFileLocation = '/export/home/scotty/script/cpu_usage_tracking/cpu_check_pass.log'
+                        renameLog = false
+                }
+}
+</code>
+Start the CPU usage tracker: nohup ./cpu_usage_tracker.sh & (nohup: ignoring input and appending output to `nohup.out' message is acceptable, it is not an indication of an error)
+Add the following crontab - */1 * * * * ~/script/cpu_usage_tracking/acceptable_cpu_usage.sh (non-sudo crontab)
+Verify that the cpu_usage.log file is being updated every 10 seconds
+Run the acceptable_cpu_usage.sh script: bash acceptable_cpu_usage.sh
+Verify that the cpu_check_pass.log file is not empty anymore
+Comment out the watchdog crontab entry
+Run the watchdog script manually (sudo bash ~/watchdog/script/run.sh) and verify that no exceptions are thrown
+==== ThreadCountCheck ====
+Checks to see how many threads are being used by the user 'Scotty'. A 'Minor' alarm is triggered if the alarms reach 80% of the maximum limit and a 'Critical' is triggered if the alarms reach 90% of the allowed maximum limit.
+Command to check the max thread limit on a server: ulimit -u
+Command to check the current amount of threads in use: top -b -H -u scotty -n 1 | wc -l
+==== mNetOpenVPNClientService ====
+ExteNet only.
+Checks if OpenVPN client is running on the loadbalancer. OpenVPN client is used to receive mNET traps.
+Starting OpenVPN client will clear this watchdog.
+  sudo service openvpn start
+==== SLA_Groovlet_Changed_Master ====
+ATC only.
+This watchdog came about due to a weekend where an unknown process swapped groovlets from a QA env to a Prod env and the Prod even to a QA env. When the testing changes were on production the resolution was to get the script from QA and return it to production. This happened twice, both times outside of work hours. Only one script had modifications on QA at the time.
+Its unlikely this will happen again but it might be related to a reinit. The watchdog just ensured that the details are hashed and the hash matches a previously taken hash. If there are any updates done to the SLA groovlet, this watchdog will need to be updated with a new hash.
+==== NetworkElementsWithNullSystemTypes ====
+This is for ATC billing.
+It was noted that if a Network Element has a null system type then it would not be counted in the billing process. In order to avoid these mistakes, there was a watchdog added that would fire off until these were all cleaned up. If anything new comes up then the watchdog will fore again. There is a report for this too. Not a concern for Ops, Patrick or someone else would handle these.
+==== MibDeactivation ====
+ExteNet only.
+This checks for the mib ma_events_2_26.mib and its 'active' status to be 'NO'. If there is a match we know that the mib has been deactivated by the application. This seems mostly happens after a deploy but not always. Possibly after a reinit but nothing seems to cause it to happen there either.
+To fix find the mib in the snmp_manager database and find the mib with the name ma_events_2_26. Set its active field from NO to YES.
+  SELECT * FROM snmp_manager.mib m WHERE m.name LIKE "ma_events_2_26%";
+==== MibDeletion ====
+ExteNet only.
+After an investigation into the cause of MibDeactivations. It was found that everything is behaving normally. However, if the mib existed on the server and didn't exist on the database, the applications would add the mib and set its active status to NO. This watchdog will hopefully fire ahead of time to catch a MibDeactivation before it happens.
+It is unlikely to happen again but this is a precaution.
+==== KeyStatsGenerationFailed ====
+Checks that Key Stats application has run correctly; application is run via a cron job
+Watchdog checks for an "ERROR" in the Key Stats logs.
+Log location: /export/home/scotty/script/key_stats/key_stat_extractor.out
+If changes are made to resolve watchdog run Key Stats manually to clear; find command in crontab
+  sudo crontab -e
+==== BubbleAppConnectionError ====
+This watchdog checks the Bubble App for connection timeouts.
+Watchdog checks for an "Timeout: Pool empty. Unable to fetch a connection" in the Bubble App logs.
+Bubble App URL is ...com/bubble-app
+A restart of  Bubble App will be required if a spinning gif is displayed and application does not load.
+After restart verify that application loads correctly and watchdog clears.

Sidebar

Internal Errigal Collaboration Wiki

Differences

Sidebar

Internal Errigal Collaboration Wiki

User Tools

Site Tools

Differences

Page Tools