mTail is a simple tool that allows us to read over log files and look for certain patterns. The count of these patterns or lines can be put into metrics for Prometheus to use.
The servers that Prometheus checks for mtail metrics are defined in mtail-sd.yml in the files/file-sd/ directory of the prometheus-monitoring-config project. This YAML file tells Prometheus where to look for mtail metrics; it does not control where mtail itself is installed. As long as mtail is installed and running on a server, it will report metrics, and if that server's details are in mtail-sd.yml, Prometheus will read them.
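As an illustration, an entry in mtail-sd.yml follows Prometheus's file-based service discovery format, something like the snippet below (the hostnames here are hypothetical examples; 3903 is mtail's default listen port):

```yaml
# Hypothetical mtail-sd.yml entry (hostnames are examples only).
- targets:
    - 'app01.example.internal:3903'
    - 'app02.example.internal:3903'
  labels:
    job: 'mtail'
```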
In general, you shouldn't have to worry about installing mtail on a server for the first time, as that will already have been dealt with. However, if mtail is missing from a server and you want to install it, follow the instructions below.
Use the ops-playbooks repository. It contains a number of scripts and Ansible playbooks, one of which is the mtail.yml playbook used for installing mtail on a server.
ansible-playbook -i ../env-configuration/<HOST>/hosts.ini mtail.yml --vault-id @prompt --diff
This should run without too much trouble, and you will then have mtail on the new server. By default, it comes with a basic linecount.mtail in the files folder. For a fuller installation, you will have to use the update-mtail-progs.yml playbook back in the prometheus-monitoring-config repository.
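For reference, a minimal line-counting prog in mtail's language looks something like this (a sketch of what a basic linecount.mtail typically contains, not necessarily the exact file the playbook ships):

```
counter lines_total

# Match the end of every line, so every log line bumps the counter.
/$/ {
  lines_total++
}
```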
The update-mtail-progs.yml playbook in the prometheus-monitoring-config repository installs the current mtail configuration onto a server, with different rules per server role. This is because different servers do different things: we don't always want, say, a line check meant for a handler running on a loadbalancer. There is also a shared role for metrics that apply everywhere.
The playbook will first check that mtail is on the server in the first place. It will then copy the progs from that role's folder in the templates over to the server. After the files have been copied, it will stop the mtail service and start it up again.
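The steps above could be sketched as Ansible tasks along these lines (a hypothetical outline only, not the actual contents of update-mtail-progs.yml; the paths, binary location, and mtail_role variable are assumptions):

```yaml
# Hypothetical sketch of the update-mtail-progs.yml flow.
- name: Check that mtail is installed
  stat:
    path: /usr/bin/mtail        # assumed install location
  register: mtail_bin
  failed_when: not mtail_bin.stat.exists

- name: Copy the progs for this server's role
  copy:
    src: "templates/mtail/{{ mtail_role }}/"   # e.g. all, handlers, loadbalancer
    dest: "~/mtail/progs/"

- name: Restart mtail to pick up the new progs
  service:
    name: mtail
    state: restarted
```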
In the templates folder, under mtail, you will find a few folders. One is named all, and others may be named something like handlers or loadbalancer. These are the names of the server roles defined in the env-configuration under the hosts file, and they determine which progs get installed where.
If you want to add another role to the list, say rdf, you can modify update-mtail-progs.yml, copy one of the existing roles, and rename things for your new role. Say you copied loadbalancer and changed everything in it to rdf: you now have an install for the rdf role. Remember to also create and use an rdf template folder with all the progs you want for it.
To add a new prog file, simply create a file or clone another .mtail file. Then modify it.
counter out_of_memory_lines

/java.lang.OutOfMemoryError: unable to create new native thread/ {
  out_of_memory_lines++
}
In this example we have the CASOutOfMemory.mtail file. It defines a counter at the top: an out_of_memory_lines metric for Prometheus to look for.
From there we have the java.lang.OutOfMemoryError line that we want to search for in our log file. Each time mtail sees this line, the count is increased. This may mean that the alert will stay firing until the log file rotates.
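Before wiring up an alert, you can sanity-check the new metric in the Prometheus expression browser. For example (queries assume the counter defined above):

```
# Current value of the counter, per instance
out_of_memory_lines

# How many matching lines were logged in the last 10 minutes
increase(out_of_memory_lines[10m])
```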
Once you have saved your changes, feel free to test them out.
Run the update-mtail-progs.yml playbook for a testing env.
ansible-playbook -i ../env-configuration/<ENV>/hosts.ini update-mtail-progs.yml --vault-id @prompt
If you want, you can use the universal script runner at
http://opsjenkins.errigal.com:8080/job/universal_script_runner/
to run find ~/mtail/progs/ on as many servers as you like, to check which progs are where.
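The check itself is just a find over the progs directory. Here is a self-contained sketch, run against a temporary stand-in directory so it works anywhere; on a real server you would point it at ~/mtail/progs/ as described above:

```shell
# Create a stand-in progs directory with two example progs.
PROGS_DIR="$(mktemp -d)"
touch "$PROGS_DIR/linecount.mtail" "$PROGS_DIR/CASOutOfMemory.mtail"

# List the deployed progs, as the script runner job would on each server.
find "$PROGS_DIR" -type f -name '*.mtail'
```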
Once you have successfully created and deployed your new prog, you can make an alert. For this you will want to add an entry to the prometheus.rules section of the prometheus-monitoring-config.
Here is an example for the CASOutOfMemory alert.
- alert: CASOutOfMemory
  expr: out_of_memory_lines > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "CAS has reported it is out of memory on {{ $labels.instance }}"
    description: "CAS claims it is out of memory in the logs {{ $value }} times. On {{ $labels.instance }}."
    resolution: "http://wiki.err/doku.php?id=resolution_area:prometheus_resolutions:res-p1116"