resolution_area:prometheus_resolutions:res-p1109

Previous revision: resolution_area:prometheus_resolutions:res-p1109 [2021/06/24 14:25] btobin
Current revision: resolution_area:prometheus_resolutions:res-p1109 [2021/12/17 13:09] wflaherty
Line 1: Line 1:
 +===== SystemdServiceCrashed =====
  
 +**Level:** Warning
 +
 +**Purpose:**
 +To ensure that services that are supposed to be running stay running.
 +
 +**Scenario:** A systemd service has been in a crashed state for 5 minutes.
 +
 +**Resolution:**
 +The service is running in a stable state.
 +
 +**Manual Action Steps:**
 +  - SSH onto the affected server.
 +  - Use `ps aux | grep <app>` to see if the application is still running.
 +  - Use `systemctl status <app>` to check the status of the application.
 +    - If the application is restarting over and over, open `/etc/systemd/system/<app>.service`.
 +    - Set the `Restart=` line to `no` rather than `on-failure` or `always`, then run `sudo systemctl daemon-reload` so systemd picks up the change.
 +  - Use `sudo journalctl -ex` to see the server logs after attempting to restart the application. Adding `-u <app>` limits the output to that service.
 +  - Third-party applications (those not written by Errigal) have failed in the past because the system user they run as was missing.
 +    - Elasticsearch, Logstash, Kibana and MySQL require an elasticsearch, logstash, kibana and mysql user respectively.
 +  - Sometimes fully shutting down the service with `sudo systemctl stop <app>` for a minute before starting it again with `sudo systemctl start <app>` can help the application recover.
 +  - If the bash prompt is behaving strangely, the server is likely running low on RAM. Check with `free -h`.
 +  - Check the disk space with `df -h`.
 +  - If this is an Errigal app, check Kibana at moros.err:5601/app/kibana for the application logs from before the service died.
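
The checks above can be gathered into one pass. The following is a minimal sketch, not an established Errigal script: the function names `full_filesystems` and `triage` and the 90% disk threshold are assumptions made for illustration, and the unit name is passed in as an argument.

```shell
#!/bin/sh
# Hypothetical triage helpers; names and the 90% threshold are assumptions.

# Print mount points whose usage is at or above a threshold (default 90%).
# Reads `df -P` formatted output on stdin, so it can be checked offline.
full_filesystems() {
  awk -v t="${1:-90}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the trailing percent sign
    if (use + 0 >= t) print $6   # column 6 is the mount point
  }'
}

# Run the manual checks for one unit. Call as: triage <app>
triage() {
  app="$1"
  systemctl status "$app" --no-pager                 # current unit state
  sudo journalctl -u "$app" -x -n 50 --no-pager      # last 50 log lines for the unit
  df -P | full_filesystems 90                        # nearly full filesystems
  free -h                                            # memory headroom
}
```

Nothing runs until `triage` is called, so the file can be sourced safely. `full_filesystems` takes its input on stdin so the disk check can be tried against saved `df -P` output as well as a live server.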
 +
 +**Auto Clear:**
 +It's entirely possible that the service will recover automatically.
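
To confirm that a service has come back, whether by itself or after the manual steps, it can be polled rather than checked once. This is a rough sketch: `wait_until_ok` is a hypothetical helper, and the attempt count and delay in the usage comment are arbitrary choices, not values taken from the alert definition.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Arguments: <attempts> <delay-seconds> <command...>
wait_until_ok() {
  attempts="$1"
  delay="$2"
  shift 2
  while [ "$attempts" -gt 0 ]; do
    if "$@"; then
      return 0                      # command succeeded
    fi
    attempts=$((attempts - 1))
    if [ "$attempts" -gt 0 ]; then
      sleep "$delay"                # wait before the next attempt
    fi
  done
  return 1                          # never succeeded
}

# Example (assumed values): poll every 30 seconds, up to 10 times.
# wait_until_ok 10 30 systemctl is-active --quiet <app>
```

`systemctl is-active --quiet` exits with status 0 only while the unit is active, which makes it suitable as the polled command.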