=====MdcTasksFailing=====

**Level:** Major

**Purpose:** Notify operations that the percentage of failed MDC tasks has stayed above 60% for more than 1 hour. The percentage is the number of failed tasks in the orchestrator database divided by the total number of processed tasks (failed + completed) over the last hour.

**Scenario:** The RDF agent has become overloaded with tasks and cannot process them, causing more of them to fail.

**Resolution:** Clear out the active_task table and restart the agent. If the issue persists, check the schedule_config table for schedules with a long timeout (> 200 seconds) that might be creating a bottleneck, and check whether too many schedules start at the same time. Check the metrics in Prometheus to detect a pattern.

**Manual Action Steps:**

  - Clear the active tasks: delete from orchestrator.active_task;
  - On the oat servers, restart the agent: sudo docker restart rdfagent
  - Check the metric in Prometheus: [[http://prometheusprod.err:9090/graph?g0.expr=mdc_agent_task_failed_percentage&g0.tab=0&g0.stacked=0&g0.range_input=1d]]

**Auto Clear:** When the failed task percentage drops below 60%.
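
**Worked example:** The failed-task percentage described under Purpose can be sketched as below. This is an illustration only, not the code behind the alert: the function names are hypothetical, the 0% result for an hour with no processed tasks is an assumption, and the "for more than 1 hour" duration check is not modelled.

```python
# Illustrative sketch only. Names are hypothetical and do not come from
# the orchestrator or the rdfagent code base.

FAILED_THRESHOLD_PERCENT = 60.0  # threshold stated in the alert description


def failed_task_percentage(failed: int, completed: int) -> float:
    """Return failed tasks as a percentage of processed tasks (failed + completed)."""
    if failed < 0 or completed < 0:
        raise ValueError("task counts must be non-negative")
    total = failed + completed
    if total == 0:
        # Assumption: with no processed tasks in the window, report 0%.
        return 0.0
    return 100.0 * failed / total


def exceeds_threshold(failed: int, completed: int) -> bool:
    """True when the failed percentage is strictly greater than the threshold."""
    return failed_task_percentage(failed, completed) > FAILED_THRESHOLD_PERCENT
```

For example, 31 failed and 19 completed tasks in the last hour give 62%, which is above the threshold, while 30 failed and 20 completed give exactly 60%, which is not.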