MdcTasksFailing
Level: Major
Purpose: Notify operations that the percentage of failed MDC tasks has stayed above 60% for more than 1 hour. The percentage is the number of failed tasks in the orchestrator database divided by the total number of processed tasks (failed + completed) for the last hour.
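The percentage described above can be sketched as follows. This is a minimal illustration only, assuming the failed and completed task counts for the last hour have already been read from the orchestrator database; the function names and the constant are hypothetical and are not taken from the actual alert rule. It does not model the "for more than 1 hour" duration condition.

```python
THRESHOLD_PERCENT = 60.0  # threshold stated in the alert description


def failed_task_percentage(failed: int, completed: int) -> float:
    """Return failed tasks as a percentage of all processed tasks."""
    total = failed + completed
    if total == 0:
        # No tasks processed in the window: treat as 0% rather than divide by zero.
        return 0.0
    return 100.0 * failed / total


def exceeds_threshold(failed: int, completed: int) -> bool:
    """True when the failed percentage is strictly greater than the threshold."""
    return failed_task_percentage(failed, completed) > THRESHOLD_PERCENT
```

For example, 70 failed and 30 completed tasks give 70%, which is above the threshold, while 60 failed and 40 completed give exactly 60%, which is not.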
Scenario: The RDF agent has become overloaded with tasks and cannot process them, causing more of them to fail.
Resolution: Clear out the active_task table and restart the agent. If the issue persists, check the schedule_config table for schedules with a long timeout (> 200 seconds) that might be creating a bottleneck, and check whether too many schedules start at the same time. Review the metrics in Prometheus to detect a pattern.
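The two schedule checks above could be sketched like this. It is a rough illustration under stated assumptions: the rows are assumed to have already been fetched from schedule_config as dictionaries, and the keys `name`, `timeout_seconds` and `start_time`, as well as the `max_per_start` limit, are hypothetical and may not match the real table or what counts as "too many" in practice.

```python
from collections import Counter

LONG_TIMEOUT_SECONDS = 200  # "long timeout" limit from the resolution text


def long_timeout_schedules(schedules):
    """Return schedules whose timeout exceeds the long-timeout limit."""
    return [s for s in schedules if s["timeout_seconds"] > LONG_TIMEOUT_SECONDS]


def crowded_start_times(schedules, max_per_start=5):
    """Return {start_time: count} for start times shared by more than
    max_per_start schedules (max_per_start is an arbitrary example value)."""
    counts = Counter(s["start_time"] for s in schedules)
    return {start: n for start, n in counts.items() if n > max_per_start}


# Hypothetical sample rows, for illustration only.
sample = [
    {"name": "a", "timeout_seconds": 120, "start_time": "00:00"},
    {"name": "b", "timeout_seconds": 300, "start_time": "00:00"},
    {"name": "c", "timeout_seconds": 200, "start_time": "00:15"},
]
```

With the sample rows, only schedule "b" is flagged for a long timeout, and calling `crowded_start_times(sample, max_per_start=1)` flags the "00:00" start time, which two schedules share.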
Manual Action Steps:
delete from orchestrator.active_task;
On the oat servers:
sudo docker restart rdfagent
Auto Clear: When the failed task percentage drops below 60%