MdcTasksFailing

Level: Major

Purpose: Notify operations that the percentage of failed MDC tasks has been above 60% for more than 1 hour. The percentage is the number of failed tasks in the orchestrator database divided by the total number of processed tasks (failed + completed) over the last hour.
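
The calculation described above can be sketched as follows. This is only an illustration of the arithmetic, not the code that produces the metric; the function name and the handling of an hour with no processed tasks are assumptions.

```python
def failed_percentage(failed: int, completed: int) -> float:
    """Return failed tasks as a percentage of all processed tasks."""
    total = failed + completed
    if total == 0:
        # Assumption: with no processed tasks, report 0 instead of dividing by zero.
        return 0.0
    return 100.0 * failed / total


THRESHOLD = 60.0  # from the alert definition: fires when the value is above 60%

# Example: 130 failed and 70 completed tasks in the last hour.
pct = failed_percentage(130, 70)
print(f"{pct:.1f}% failed, above threshold: {pct > THRESHOLD}")
```

The alert additionally requires the value to stay above the threshold for more than 1 hour, which this sketch does not model.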

Scenario: The RDF agent has become overloaded with tasks and cannot process them, so a growing number of them fail.

Resolution: Clear out the active_task table and restart the agent. If the issue persists, check the schedule_config table for schedules with a long timeout (> 200 seconds) that might be bottlenecking the agent, and check whether too many schedules start at the same time. Review the metrics in Prometheus to detect a pattern.
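
The two schedule checks can be sketched as below, assuming the rows of schedule_config have already been loaded. The field names (name, start_time, timeout_seconds) and the limit of 5 schedules per start time are hypothetical; the real column names are not given on this page and should be checked against the table.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Schedule:
    name: str             # hypothetical field
    start_time: str       # hypothetical field, e.g. "02:00"
    timeout_seconds: int  # hypothetical field


LONG_TIMEOUT_SECONDS = 200  # from the resolution text: > 200 seconds is "long"
MAX_SAME_START = 5          # assumption: illustrative limit, tune as needed


def long_timeout_schedules(schedules):
    """Names of schedules whose timeout exceeds the long-timeout limit."""
    return [s.name for s in schedules if s.timeout_seconds > LONG_TIMEOUT_SECONDS]


def crowded_start_times(schedules, limit=MAX_SAME_START):
    """Start times shared by more than `limit` schedules, with their counts."""
    counts = Counter(s.start_time for s in schedules)
    return {start: n for start, n in counts.items() if n > limit}
```

For example, passing the loaded rows to long_timeout_schedules() gives a short list of candidates to review before changing any timeouts.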

Manual Action Steps:

delete from orchestrator.active_task;

On the oat servers:

sudo docker restart rdfagent

In Prometheus, check the failed task percentage over the last day:

http://prometheusprod.err:9090/graph?g0.expr=mdc_agent_task_failed_percentage&g0.tab=0&g0.stacked=0&g0.range_input=1d
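
The same metric can also be read from a script through the standard Prometheus HTTP API (/api/v1/query), for example to confirm that the value has dropped after the restart. This is a minimal sketch: the host and metric name come from the link above, while the function names and the 10 second timeout are assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS = "http://prometheusprod.err:9090"  # host taken from the graph link
METRIC = "mdc_agent_task_failed_percentage"    # metric taken from the graph link


def query_url(base: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def parse_values(payload: dict) -> list:
    """Extract sample values from an instant-query (vector) response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result carries "value": [timestamp, "value as string"].
    return [float(sample["value"][1]) for sample in payload["data"]["result"]]


def current_failed_percentages(base: str = PROMETHEUS, expr: str = METRIC,
                               timeout: float = 10.0) -> list:
    """Fetch the current value of the metric for every returned series."""
    with urlopen(query_url(base, expr), timeout=timeout) as resp:
        return parse_values(json.load(resp))
```

Calling current_failed_percentages() from a host that can reach Prometheus returns one value per series; an empty list means the metric currently has no samples.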

Auto Clear: When the failed task percentage drops below 60%

resolution_area/prometheus_resolutions/res-p9130.txt · Last modified: 2024/09/19 09:40 by 10.91.120.100