RabbitmqTooManyMessagesInQueue
This could still use more additions
Level: Critical
Purpose: This alert exists to prevent RabbitMQ from filling up. A growing queue usually indicates a problem with message processing in one of the applications consuming the queue.
Scenario: There have been more than 1000 messages in the queue for 2 minutes.
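The scenario above maps naturally onto a Prometheus-style alerting rule. The rule below is an illustrative sketch only: the metric name, labels, and annotations are assumptions, not the deployed configuration.

```yaml
# Illustrative shape only; metric name and labels are assumptions
- alert: RabbitmqTooManyMessagesInQueue
  expr: rabbitmq_queue_messages > 1000
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Queue has had more than 1000 messages for 2 minutes"
```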
Resolution: There is no single direct fix, but it is important to confirm that everything is operating smoothly. The queue may spike due to a burst of incoming messages that the applications can't process fast enough. In that case, just ensure everything is still functional and that the queue is no longer increasing.
Manual Action Steps: Connect to the RabbitMQ management interface, typically hosted on port 15672 on a handler or loadbalancer.
For example to connect to RabbitMQ on ExteNet:
https://extlb.ext:15672/#/queues
or
http://extapps2.ext:15672/#/queues
You can check which server RabbitMQ is running on for a given environment in this spreadsheet: https://docs.google.com/spreadsheets/d/1Ebj0kWPl63Q4L3f_oo_kvWoh5ZRq2vFI8FmrWbjoPlo/edit?usp=sharing (scroll across to see what RabbitMQ is running on).
The authentication credentials are in pwsafe for the various environments.
Once you have signed in, check the queue in question and confirm whether it is still increasing. If it is still growing, it is critical to alert the rest of the team, as a restart may be required.
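If you prefer the command line to the web UI, the management plugin also exposes queue depths over HTTP. The host, credentials, vhost, and queue name in the comment below are assumptions; the parsing is demonstrated against a canned response of the same shape so it stays self-contained:

```shell
# Real check would look something like (host/credentials are assumptions):
#   curl -s -u user:pass http://extlb.ext:15672/api/queues/%2F/trap_queue
# Here we parse a canned response of the same JSON shape to extract the backlog.
sample='{"name":"trap_queue","messages":1542,"messages_ready":1540}'
count=$(printf '%s' "$sample" | grep -o '"messages":[0-9]*' | cut -d: -f2)
echo "messages in queue: $count"
```

Run it a couple of minutes apart: if the number keeps climbing, treat it the same as a growing graph in the UI.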
One of the first things to check is whether the application is actually running. There are a few ways to do this.
Using the Application Version Monitor you can see the running status. Check how recently the monitor was run; you may have to run it again and wait for it to finish. http://opsjenkins.errigal.com:8080/job/application-version-monitor/ https://docs.google.com/spreadsheets/d/1Ebj0kWPl63Q4L3f_oo_kvWoh5ZRq2vFI8FmrWbjoPlo/edit#gid=1762968851
If you want something more immediate then you may have to SSH into a server. To check a server quickly you can use the following or similar:
For non-service application
ps aux | grep 'java' | grep -i 'snmp'
ps aux | grep 'java' | grep -i 'reporting'
ps aux | grep 'elastic'
With Ironman as an example
ssh scotty@ironmanlb1.err "ps aux | grep 'java' | grep -i 'snmp'"
ssh scotty@ironmanapps1.err "ps aux | grep 'java' | grep -i 'snmp'"
For services
service mysql_exporter status
systemctl status elasticsearch
rabbitmqctl status
ssh scotty@ironmanlb1.err "rabbitmqctl status"
ssh scotty@ironmanapps1.err "service mysql_exporter status"
For checking many servers in an environment at once, you can use the Script Runner: http://opsjenkins.errigal.com:8080/job/universal_script_runner/build You can run many commands at once with it too. Just remember that the output goes into cells that may split on newlines and spaces, so use grep and awk to help with formatting.
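As an example of the grep/awk formatting mentioned above, this collapses a ps line down to just the PID and the jar name so it survives the cell splitting. The ps line itself is a fabricated sample:

```shell
# Fabricated ps output line; field 2 is the PID, the last field here is the jar
sample='scotty 4321 2.1 5.0 123456 7890 ? Sl 10:00 1:23 java -jar snmp-manager.jar'
out=$(printf '%s\n' "$sample" | awk '{print $2, $NF}')
echo "$out"
```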
If the Trap Queue is increasing, it is likely that Trap Parsing isn't working correctly in the SNMP Manager. One possible problem is the connection between the loadbalancer and the handlers: if the loadbalancer doesn't get a timely response, it may temporarily consider a handler dead and send traffic to the other.
A quick check for this can be done with the following:
ssh scotty@server "cat logs/grails/SnmpManager.log | grep -i 'handler is' "
With Ironman as an example
ssh scotty@ironmanapps1.err "cat logs/grails/SnmpManager.log | grep -i 'handler is'"
ssh scotty@ironmanlb1.err "cat logs/grails/SnmpManager.log | grep -i 'handler is'"
If there are many 'HANDLER is DEAD' messages, the loadbalancer is having trouble connecting to the handlers; the log should mostly contain 'HANDLER is Alive'.
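To quantify the ratio rather than eyeball it, you can count both message types. The log lines below are fabricated samples of the format described above; on a real server you would grep logs/grails/SnmpManager.log instead:

```shell
# Fabricated sample of SnmpManager.log lines standing in for the real log
printf 'HANDLER is Alive\nHANDLER is DEAD\nHANDLER is Alive\n' > /tmp/sample_snmp.log
dead=$(grep -c 'HANDLER is DEAD' /tmp/sample_snmp.log)
alive=$(grep -c 'HANDLER is Alive' /tmp/sample_snmp.log)
echo "dead=$dead alive=$alive"
```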
Another possibility is the scheduler. The scheduler for trap processing might be slower than usual. This can be for a number of reasons. If it is directly after a release then it is best to let the dev and ops teams know right away.
If this happened all of a sudden, something might be slowing down the server or the database.
It is worth checking the database with
SHOW PROCESSLIST;
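When the processlist is long, it helps to filter for long-running queries. The output below is a fabricated tab-separated sample; in practice you would pipe the output of `mysql -e "SHOW PROCESSLIST;"` through the same awk, and the 60-second threshold is an arbitrary assumption:

```shell
# Fabricated tab-separated processlist sample (Id, User, Time, State)
printf 'Id\tUser\tTime\tState\n12\tapp\t3\texecuting\n13\tapp\t845\tSending data\n' > /tmp/plist.tsv
# Print the Id and Time of anything running longer than 60 seconds
slow=$(awk -F'\t' 'NR>1 && $3 > 60 {print $1, $3}' /tmp/plist.tsv)
echo "$slow"
```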
In some cases a queue is both fed and consumed by the same application; that is, the application pushes messages onto the queue and also takes them off. Here it is important to ensure that the queue being fed is the same as the queue being consumed: on rare occasions, a consumer expects a different queue name than the one being filled.
There is also a possibility that a queue was refactored in the codebase or script but not updated in RabbitMQ. Check with the devs for this.
Ebonding
If the alert is related to an eBonding queue:
- eBonding_outbound_queue
- eBonding_inbound_queue
Check if the soap and ebonding applications are running:
ps -edf | grep -i soap
ps -edf | grep -i ebonding
Verify that the logs are not frozen. Log location: logs/grails
An application restart will be required if the ebonding and soap logs are frozen.
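A quick way to test whether a log is frozen is to check its modification time. The file below is a backdated stand-in for a real log under logs/grails, and the 10-minute threshold is an assumption:

```shell
# Stand-in log file, backdated so it reads as frozen
log=/tmp/ebonding_sample.log
touch -t 202401010000 "$log"
# "Frozen" = not modified in the last 10 minutes (threshold is an assumption)
if [ -n "$(find "$log" -mmin -10)" ]; then
  status="log is live"
else
  status="log is frozen"
fi
echo "$status"
```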
Auto Clear:
The alert will auto clear once there are fewer than 1000 messages in the queue.