===== RabbitmqTooManyMessagesInQueue =====

This page could still use more additions.

**Level:** __Critical__

**Purpose:**
This alert exists so that RabbitMQ doesn't fill up. It usually indicates that something is wrong with message processing in one of the applications that consumes the queue.

**Scenario:** There have been more than 1000 messages in the queue for 2 minutes.
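The Prometheus rule behind this alert is not reproduced on this page, but a rule matching the scenario above would look roughly like the following. This is only a sketch: the actual metric name and labels depend on the exporter in use (`rabbitmq_queue_messages` is the name used by the common rabbitmq_exporter), so check the live rule files rather than trusting this fragment.

<code>
- alert: RabbitmqTooManyMessagesInQueue
  expr: rabbitmq_queue_messages > 1000
  for: 2m
  labels:
    severity: critical
</code>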

**Resolution:**
There isn't a single direct fix, but it is important to ensure everything is operating smoothly. Sometimes the queue may suddenly grow because of a burst of incoming messages that the applications can't process fast enough. In that case, just confirm everything is still functional and that the queue isn't continuing to grow.
**Manual Action Steps:**
Connect to the RabbitMQ management interface, typically hosted on port 15672 on a handler or loadbalancer.

For example, to connect to RabbitMQ on ExteNet: \\
https://extlb.ext:15672/#/queues
or
http://extapps2.ext:15672/#/queues

You can check which server RabbitMQ runs on for a given environment in this spreadsheet:
https://docs.google.com/spreadsheets/d/1Ebj0kWPl63Q4L3f_oo_kvWoh5ZRq2vFI8FmrWbjoPlo/edit?usp=sharing
Scroll across to see what RabbitMQ is running on. This information should, however, also be in the alert received.
The authentication credentials for the various environments are in pwsafe.

Once you have signed in, check the queue in question and confirm that it isn't increasing. If it is still growing, it is critical to alert the rest of the team, as a restart may be required.
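If you prefer the command line to the management UI, the same check can be done with rabbitmqctl. A sketch, using the hypothetical Ironman host from the examples on this page; substitute the host and queue name from the alert:

<code>
# Sample the queue depth twice, ~30s apart; if the count keeps
# rising, escalate. Host and queue filter are examples only.
ssh scotty@ironmanlb1.err "rabbitmqctl list_queues name messages | grep 'trap'"
sleep 30
ssh scotty@ironmanlb1.err "rabbitmqctl list_queues name messages | grep 'trap'"
</code>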

-----

One of the first things to check is whether the application is actually running.
There are a few ways to do this.

Using the Application Version Monitor you can see the running status. Check how recently the monitor was run, as you may have to run it again and wait for it to finish.
http://opsjenkins.errigal.com:8080/job/application-version-monitor/
https://docs.google.com/spreadsheets/d/1Ebj0kWPl63Q4L3f_oo_kvWoh5ZRq2vFI8FmrWbjoPlo/edit#gid=1762968851

If you want something more immediate, you may have to SSH into a server.
To check a server quickly you can use the following or similar:

For non-service applications:
<code>
ps aux | grep 'java' | grep -i 'snmp'
ps aux | grep 'java' | grep -i 'reporting'
ps aux | grep 'elastic'
</code>

With Ironman as an example:
<code>
ssh scotty@ironmanlb1.err "ps aux | grep 'java' | grep -i 'snmp'"
ssh scotty@ironmanapps1.err "ps aux | grep 'java' | grep -i 'snmp'"
</code>

For services:
<code>
service mysql_exporter status
systemctl status elasticsearch
rabbitmqctl status
</code>

<code>
ssh scotty@ironmanlb1.err "rabbitmqctl status"
ssh scotty@ironmanapps1.err "service mysql_exporter status"
</code>

To check many servers in an environment at once you can use the Script Runner:
http://opsjenkins.errigal.com:8080/job/universal_script_runner/build
You can run many commands at once with it too. Just remember that the output goes to spreadsheet cells that may split on newlines and spaces, so use grep and awk to help with formatting.
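For example, to keep the output down to one short token per process, trim the ps output with awk (a sketch; the `[j]ava` bracket trick stops grep from matching its own process):

<code>
# Print just the PID of each matching java process, one per line,
# so a Script Runner cell doesn't get split mid-row.
ps aux | grep '[j]ava' | grep -i 'snmp' | awk '{print $2}'
</code>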

-----

If the Trap Queue is increasing, it is likely that trap parsing isn't working correctly for the SNMP Manager.
One possible problem is the connection between the loadbalancer and the handlers. If the loadbalancer doesn't get a timely response, it may temporarily consider a handler dead and send traffic to the other.

A quick check for this can be done with the following:
<code>
ssh scotty@server "cat logs/grails/SnmpManager.log | grep -i 'handler is'"
</code>

With Ironman as an example:
<code>
ssh scotty@ironmanapps1.err "cat logs/grails/SnmpManager.log | grep -i 'handler is'"
ssh scotty@ironmanlb1.err "cat logs/grails/SnmpManager.log | grep -i 'handler is'"
</code>

If there are many 'HANDLER is DEAD' messages, there is trouble connecting the loadbalancer to the handlers. The output should be mostly 'HANDLER is Alive'.
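Eyeballing the raw grep output can be misleading on a busy log, so counting each status is a quicker sanity check (a sketch, using the same hypothetical host and log path as above):

<code>
# Count occurrences of each status; the DEAD count should be near zero.
ssh scotty@ironmanlb1.err "grep -ic 'handler is dead' logs/grails/SnmpManager.log"
ssh scotty@ironmanlb1.err "grep -ic 'handler is alive' logs/grails/SnmpManager.log"
</code>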

Check the thread configuration in the database with:
<code>
select * from thread_config tc;
</code>

-----

Another possibility is the scheduler. The scheduler for trap processing might be running slower than usual, which can happen for a number of reasons.
If this starts directly after a release, it is best to let the dev and ops teams know right away.

-----

If the slowdown is sudden, something might be slowing down the server or the database.

It is worth checking the database with:
<code>
SHOW PROCESSLIST;
</code>
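To cut the processlist down to just long-running statements, the Time column (field 6 in the default tab-separated output) can be filtered with awk. A sketch, assuming mysql client access and a 60-second threshold:

<code>
# Keep the header row plus any statement running longer than 60 seconds.
mysql -e 'SHOW FULL PROCESSLIST;' | awk 'NR==1 || $6 > 60'
</code>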

-----

In some cases a queue is fed and consumed by the same application: the application pushes messages onto the queue and takes them off again. Here it is important to ensure that the queue being fed is the same as the queue being consumed; on rare occasions a queue consumer may be expecting a different name from the queue that is being filled.

There is also a possibility that a queue was refactored in the codebase or a script but not updated in RabbitMQ. Check with the devs for this.
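To compare the queue names RabbitMQ actually has against what the producer and consumer are configured with, list them from the server (a sketch, reusing the hypothetical Ironman host from the earlier examples):

<code>
# List each queue with its depth and consumer count; a queue holding
# messages but with zero consumers suggests a name mismatch.
ssh scotty@ironmanlb1.err "rabbitmqctl list_queues name messages consumers"

# Print only queues that have messages but no consumers:
ssh scotty@ironmanlb1.err "rabbitmqctl list_queues name messages consumers" | awk '$2 > 0 && $3 == 0 {print $1}'
</code>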

**Ebonding**

If the alert is related to an eBonding queue:
  * eBonding_outbound_queue
  * eBonding_inbound_queue

Check if the soap and ebonding applications are running:

<code>
ps -edf | grep -i soap
ps -edf | grep -i ebonding
</code>

Verify that the logs are not frozen. The log location is: logs/grails

An application restart will be required if the ebonding & soap logs are frozen.
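A quick way to check for frozen logs is by modification time, a sketch assuming the logs/grails path relative to the application user's home directory and the hypothetical Ironman host from earlier examples:

<code>
# List log files not written to in the last 10 minutes; any hit
# here means the application has likely stalled.
ssh scotty@ironmanapps1.err "find logs/grails -name '*.log' -mmin +10"
</code>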

**Auto Clear:**

The alert will auto clear once there are fewer than 1000 messages in the queue.