Differences

This shows you the differences between two versions of the page.

--- toolsandtechnologies:mtail [2022/03/11 12:43] – created 10.91.120.28
+++ toolsandtechnologies:mtail [2022/06/15 11:38] (current) – 10.91.110.100
@@ Line 14: / Line 14: @@
 Use the ops-playbooks repository. This has a number of scripts and ansible playbooks in it and one of them is the mtail.yml playbook used for installing mtail on a server.
-''ansible-playbook ../env-configuration/host/hosts.ini mtail.yml --vault-id @prompt''
+''ansible-playbook -i ../env-configuration/<HOST>/hosts.ini mtail.yml --vault-id @prompt --diff''
 This should run without too much trouble and then you will have mtail on the new server.
-By default, this comes with a basic linecount.mtail in the files folder. For a fuller installation, you will have to use the update-mtail playbook back in the prometheus-monitoring-config repository.
+By default, this comes with a basic linecount.mtail in the files folder. For a fuller installation, you will have to use the update-mtail-progs.yml playbook back in the prometheus-monitoring-config repository.
+===== Updating mtail's configuration =====
+The ''update-mtail-progs.yml'' playbook in the prometheus-monitoring-config repository installs the current mtail configuration onto a server, with different rules for each prometheus role. This is because different servers do different things, and we don't always want a line check written for a handler to run on a loadbalancer. However, there is also a shared role for metrics that apply to every server.
+==== How updating works ====
+The playbook first checks whether mtail is installed on the server. It then copies the progs from that role's folder in the templates to the server. After the files have been copied, it stops the mtail service and starts it up again.
+In the templates folder, under mtail, you will find a few folders. One is named ''all'' and another may be named something like ''handlers'' or ''loadbalancer''. These are the names of the server roles defined in the hosts file in env-configuration. The names determine which servers each folder's progs are installed to.
+If you want to add another role to the list, say ''rdf'', you can modify update-mtail-progs.yml: copy one of the existing roles and rename everything in the copy for your new role. Say you copied loadbalancer and changed everything in it to rdf. You now have an install for the rdf role. Remember to also create and use an rdf template folder with all the progs you want for it.
+To add a new prog file, simply create a file or clone another .mtail file.
+Then modify it.
+===== Modifying files =====
+<code>counter out_of_memory_lines
+/java.lang.OutOfMemoryError: unable to create new native thread/ {
+  out_of_memory_lines++
+}</code>
+In this example we have the CASOutOfMemory.mtail file.
+This defines a counter at the top: an out_of_memory_lines metric for prometheus to look for.
+From there we have a java.lang.OutOfMemoryError line that we want to search for in our log file.
+If mtail sees this line we want to increase the count. This may mean that the alert will stay until the log file rotates.
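Before deploying, the pattern can be sanity-checked against a sample log. The following is a minimal sketch, not part of the playbooks: the sample log lines are made up for illustration, and ''grep -E'' is only an approximation of mtail's regex engine, although the two should agree for a simple pattern like this one.

```shell
#!/bin/sh
# Hypothetical sample log: the contents are invented for this example.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
2022-06-15 11:00:01 INFO  Starting CAS
2022-06-15 11:05:12 ERROR java.lang.OutOfMemoryError: unable to create new native thread
2022-06-15 11:05:13 WARN  Retrying request
2022-06-15 11:06:40 ERROR java.lang.OutOfMemoryError: unable to create new native thread
EOF

# Same pattern as in the prog, without the surrounding slashes.
pattern='java.lang.OutOfMemoryError: unable to create new native thread'

# grep -c prints the number of matching lines. This is the value the
# counter would have if mtail processed every line of this file.
out_of_memory_lines="$(grep -c -E "$pattern" "$sample_log")"
echo "out_of_memory_lines=$out_of_memory_lines"

# Remove the temporary sample log.
rm -f "$sample_log"
```

If the count is zero for a log that is known to contain the error, the pattern needs adjusting before the prog is deployed.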
+Once you have saved your changes, feel free to test them out.
+Run the update-mtail-progs.yml playbook for a testing env.
+''ansible-playbook -i ../env-configuration/<ENV>/hosts.ini update-mtail-progs.yml --vault-id @prompt''
+If you want you can use
+http://opsjenkins.errigal.com:8080/job/universal_script_runner/
+to run ''find ~/mtail/progs/'' on as many servers as you want to check to see what progs are where.
+===== Adding the Alert =====
+Once you have successfully created and deployed your new prog, you can now make an alert.
+For this you will want to add an entry into the prometheus.rules sections of the prometheus-monitoring-config.
+Here is an example for the CASOutOfMemory alert.
+<code>
+      - alert: CASOutOfMemory
+        expr: out_of_memory_lines > 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "CAS has reported it is out of memory on {{ $labels.instance }}"
+          description: "CAS claims it is out of memory in the logs {{ $value }} times. On {{ $labels.instance }}."
+          resolution: "http://wiki.err/doku.php?id=resolution_area:prometheus_resolutions:res-p1116"
+</code>
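A rule like the one above can be syntax-checked before it is committed. The following is a hedged sketch that assumes ''promtool'' (shipped with Prometheus) is available on the machine. The group name ''mtail-example'' and the temporary file are hypothetical, and the real rules file in prometheus-monitoring-config may be laid out differently.

```shell
#!/bin/sh
# Write the example alert into a temporary rules file.
# The group name "mtail-example" is made up for this sketch.
rules_file="$(mktemp)"
cat > "$rules_file" <<'EOF'
groups:
  - name: mtail-example
    rules:
      - alert: CASOutOfMemory
        expr: out_of_memory_lines > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CAS has reported it is out of memory on {{ $labels.instance }}"
          description: "CAS claims it is out of memory in the logs {{ $value }} times. On {{ $labels.instance }}."
EOF

# Run the check only if promtool is installed; otherwise skip it.
status=0
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$rules_file" || status=$?
else
  echo "promtool not found, skipping rule check"
fi
echo "check status: $status"

# Remove the temporary rules file.
rm -f "$rules_file"
```

A non-zero status means promtool rejected the file, typically because of a YAML indentation problem or an invalid expression.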
