Internal Environment issue

The purpose of this page is to gather a list of resolutions that anyone can use to recover an OpenStack environment and keep the system up.

As the environment is not monitored like a production environment, issues such as high disk-space usage can be alerted in the Slack channels but not acted upon in a timely manner.

Watchdog Internal Slack Channel
Prometheus Internal Slack Channel

Troubleshooting

Check disk space. Typically start with the IDMS Loadbalancer host and work your way through Apps1, Apps2, DB1 and DB2:

ssh scotty@hostlb1.err
sudo su -
cd /
du -hs * | sort -h

Example output 
1.2G	run
3.2G	root
3.3G	home
4.0G	usr
5.0G	swapfile
14G	var
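With var as the largest consumer in the output above, the same pattern can be repeated one level down to see which directory inside /var is holding the space:

```shell
# Show the size of each directory under /var, largest last
du -hs /var/* 2>/dev/null | sort -h
```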

RabbitMQ Space resolution - Internal Env only

NOTE: This will wipe all RabbitMQ data, so apply with care and only on the Internal environment.

The RMQ data is stored in /var/lib/rabbitmq, which accounts for the 14G we saw locked in the var folder above.

As this is an internal environment, we can reclaim space by removing the persistent store:

/var/lib/rabbitmq/mnesia/HOSTHERE/msg_stores/vhosts/UUIDFOLDER/msg_store_persistent

Find the largest folder store, and delete all files present
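A sketch of the cleanup, assuming RabbitMQ runs as the rabbitmq-server service; HOSTHERE and UUIDFOLDER are placeholders for your node name and vhost UUID, as in the path above:

```shell
# Rank the persistent message stores by size to find the largest
du -hs /var/lib/rabbitmq/mnesia/*/msg_stores/vhosts/*/msg_store_persistent 2>/dev/null | sort -rh

# Stop RabbitMQ before touching its store, then remove the files and restart
sudo service rabbitmq-server stop
sudo rm -rf /var/lib/rabbitmq/mnesia/HOSTHERE/msg_stores/vhosts/UUIDFOLDER/msg_store_persistent/*
sudo service rabbitmq-server start
```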

CAS / You do not have permission to access this.

When all the normal user profile issues have been checked (username, password, account active), checking the CAS log can be a useful starting point: logs/grails/cas.log

If the following is present:

[org.jasig.cas.CentralAuthenticationServiceImpl] - ServiceManagement: Unauthorized Service Access. Service [http://qascoapps1.err:8081/ReportingManager/shiro-cas] is not found in service registry.
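To check for this quickly, grep the log mentioned above (the path is relative to the application's working directory):

```shell
# Show recent unauthorized-service errors in the CAS log
grep "Unauthorized Service Access" logs/grails/cas.log | tail -n 5
```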

Verify the hostname is resolving with a simple ping qascoapps1.err

If this fails to resolve, then the CAS authentication cannot succeed, and it points to a DNS issue.
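If ping is unavailable or ICMP is blocked, name resolution can be checked directly; getent consults the same resolver path the applications use:

```shell
# Basic reachability / resolution check
ping -c 1 qascoapps1.err

# Check name resolution without relying on ICMP
getent hosts qascoapps1.err
```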

Configurations

Check the CAS services and make sure they contain the correct URLs. You'll find these on the handlers at /usr/local/conf/cas/services
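A quick way to see what is registered, assuming the service definitions under that directory are plain text files containing the service URLs:

```shell
# List the registered service definitions
ls -l /usr/local/conf/cas/services

# Search all of them for the host the client is actually sending
grep -ri "qascoapps1" /usr/local/conf/cas/services
```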

If you're seeing in the CAS logs that the service URL isn't matching the supplied service URL, as in the example below, it might be a configuration issue in HTTPD on the loadbalancer:

ERROR [org.jasig.cas.CentralAuthenticationServiceImpl] - Service ticket [ST-62-LcMoIN7pmxTjPxI9eNkb-qascoapps1] with service [https://sco.errigal.com/ReportingManager/shiro-cas] does not match supplied service [http://qascoapps1.err:8081/ReportingManager/shiro-cas]

As you can see, the supplied service is using qascoapps1.err but the registered service is for sco.errigal.com.

So, ssh into the loadbalancer and navigate to /etc/httpd/conf/

You'll need to check and potentially edit mod-jk.conf and workers.properties

mod-jk.conf

For the Grails applications, we don't use ProxyPass and ProxyPassReverse (those are for the Spring Boot applications).

Add the JkMount lines at the bottom for the relevant applications:

JkMount /ReportingManager/* ReportingManagerLoadBalancer
JkMount /ReportingManager ReportingManagerLoadBalancer

workers.properties

At the very top, make sure your application's LoadBalancer entry is in the worker list:

# Create virtual workers
worker.list=jkstatus,SnmpManagerLoadBalancer,NocPortalLoadBalancer,ReportingManagerLoadBalancer,SupportPageLoadBalancer,casLoadBalancer,rdfLoadBalancer,SnmpManagerEMSLoadBalancer,TicketerLoadBalancer

Next, add the lines to configure the loadbalancer instances

#Configure ReportingManager load balanced instances
worker.ReportingManagerLoadBalancer.type=lb
worker.ReportingManagerLoadBalancer.sticky_session=1

# Declare Tomcat server workers 1 through n

worker.ReportingManagerWorker1.reference=worker.ajptemplate
worker.ReportingManagerWorker1.host=qascoapps1.err
worker.ReportingManagerWorker1.port=8011
worker.ReportingManagerWorker1.reply_timeout=600000

Finally, at the end of the file, add those instances to the application's loadbalancer worker:

worker.ReportingManagerLoadBalancer.balance_workers=ReportingManagerWorker1

Save those files, and restart the httpd service

sudo service httpd restart
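It can be worth validating the configuration before restarting, so a typo in mod-jk.conf or workers.properties doesn't keep httpd from coming back up. httpd -t is the stock Apache syntax check; the service commands match the style used above:

```shell
# Check the configuration syntax first; prints "Syntax OK" on success
sudo httpd -t

# Restart and confirm the service is running
sudo service httpd restart
sudo service httpd status
```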