Level: Critical
Purpose: Notify Errigal Operations staff that the elastalert application on prometheus prod has not completed a scrape of the DF-80 (80% disk full) or DF-60 (60% disk full) ElastAlert rules in the last 2 minutes.
Resolution: Elastalert runs in a Docker container on prometheus prod. It monitors NUC disk space in KLA's network by checking the Elasticsearch data that holds the operating system stats. If these rules do not catch log files filling up the HDD, the NUC will go offline.
Manual Action Steps:
sudo docker logs --tail 100 elastalert --follow
sudo docker restart elastalert
A restart should fix the problem if an error caused the application to disable the rules.
Observe the logs and confirm that elastalert is running the query for both DF-80 and DF-60 and that neither rule is disabled. These two lines should be present:
INFO:elastalert:Queried rule df-80 from 2024-08-06 19:41 UTC to 2024-08-06 19:52 UTC: 0 / 0 hits
INFO:elastalert:Queried rule df-60 from 2024-08-06 19:41 UTC to 2024-08-06 19:53 UTC: 0 / 0 hits
The line below shows that df-80 is disabled. If you see it, restart the application to re-enable the rule.
INFO:elastalert:Disabled rules are: ['df-80']
After the restart you should see:
INFO:elastalert:Disabled rules are: []
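The log check above can also be scripted. The following is a minimal sketch, assuming the log lines match the samples shown in this runbook; the function name check_elastalert_log is hypothetical and is not part of elastalert. It reads log text on stdin, so it can be fed from docker logs.

```shell
# check_elastalert_log: hypothetical helper, not part of elastalert.
# Reads elastalert log lines on stdin, prints OK or a list of problems,
# and returns non-zero if a problem was found.
# Assumes the log format matches the sample lines in this runbook.
check_elastalert_log() {
    log=$(cat)
    status=0
    # Each rule must have at least one "Queried rule" line.
    for rule in df-80 df-60; do
        case "$log" in
            *"Queried rule $rule "*) ;;
            *)
                echo "MISSING: no query logged for $rule"
                status=1
                ;;
        esac
    done
    # Only the most recent "Disabled rules" line reflects the current state.
    last=$(printf '%s\n' "$log" | sed -n '/Disabled rules are:/p' | tail -n 1)
    case "$last" in
        *"'df-80'"*|*"'df-60'"*)
            echo "DISABLED: $last"
            status=1
            ;;
    esac
    if [ "$status" -eq 0 ]; then
        echo "OK"
    fi
    return "$status"
}
```

Possible usage, without --follow so the command terminates (2>&1 is included in case the container writes its log to stderr):
sudo docker logs --tail 100 elastalert 2>&1 | check_elastalert_log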
Another thing to check is the rule YAML files:
cd /home/scotty/elastalert_config/rules
sudo vi df-60.yaml
sudo vi df-80.yaml
Ensure the setting is_enabled is set to true.
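The is_enabled check can be done without opening an editor. This is a minimal sketch, assuming each rule file sets is_enabled as a top-level key on its own line; the function name find_disabled_rules is hypothetical. A file with no is_enabled line is treated as enabled, which matches ElastAlert's documented default.

```shell
# find_disabled_rules: hypothetical helper, not part of elastalert.
# Prints the path of each given rule file that sets is_enabled to false.
# Commented-out lines and files without an is_enabled line are ignored.
find_disabled_rules() {
    for file in "$@"; do
        if grep -Eiq '^[[:space:]]*is_enabled:[[:space:]]*false' "$file"; then
            echo "$file"
        fi
    done
}
```

Possible usage (run from a root shell if the files are not readable by your user):
find_disabled_rules /home/scotty/elastalert_config/rules/df-60.yaml /home/scotty/elastalert_config/rules/df-80.yaml
No output means neither file disables its rule.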
Auto Clear: Clears automatically once the elastalert_scrapes_total counter increases again.