Dynatrace Training Platform

New Capability: Auto-Remediation

With Workflows in place (Module 13), auto-remediation becomes possible — detect a problem, analyze it, and fix it automatically. This wasn't feasible with Gen2 alerting profiles.

The Pattern

How It Works

Trigger: Davis detects a problem (CPU spike, service failure, etc.)
Context: DQL queries gather data (entity health, recent changes, history)
Decision: JavaScript analyzes the context and decides: monitor, investigate, remediate, roll back, or escalate
Action: HTTP request to restart service, scale up, or create ticket
Notify: Email/Slack with what happened and what was done
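
The five stages above can be sketched as a single data flow. This is a minimal illustration in plain JavaScript, not Dynatrace's API: in a real workflow each stage is its own task, and the names runRemediationFlow, gatherContext, decide, remediate, and notify are hypothetical stand-ins that are injected so the flow can be followed without a Dynatrace environment.

```javascript
// Sketch only: models how data moves between the stages of the pattern.
// All stage functions are hypothetical and supplied by the caller.
function runRemediationFlow(problem, stages) {
  // Context: gather supporting data for the problem (DQL queries in a real workflow).
  const context = stages.gatherContext(problem);

  // Decision: choose what to do based on the problem and its context.
  const decision = stages.decide(problem, context);

  // Action: only touch the system when the decision calls for remediation.
  const actionResult =
    decision === "remediate" ? stages.remediate(problem, context) : null;

  // Notify: always report what happened and what was done.
  const message = stages.notify({ problem, decision, actionResult });

  return { decision, actionResult, message };
}

// Usage with stubbed stages (the trigger is represented by the problem object).
const result = runRemediationFlow(
  { id: "P-1", title: "CPU saturation" },
  {
    gatherContext: () => ({ cpuPercent: 97, minutes: 20 }),
    decide: (_problem, context) =>
      context.cpuPercent > 90 && context.minutes > 15 ? "remediate" : "monitor",
    remediate: () => "restart requested",
    notify: ({ problem, decision, actionResult }) =>
      `${problem.id}: ${decision} (${actionResult ?? "no action taken"})`,
  }
);

console.log(result.message); // P-1: remediate (restart requested)
```

Keeping the notify stage unconditional means a human sees every decision, including the ones where nothing was changed.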

⚠️ Critical: NEVER set owner to a service user on a workflow — it permanently locks you out. Only set actor (who the tasks run as).

🛠 Try it: Create a workflow with a Davis problem trigger → Add a DQL task to query the problem details → Add a "Send email" task with the results → Deploy and wait for a problem to fire.

Decision Engine Logic

The JavaScript decision task analyzes context and chooses an action:

Condition                                Action              Why
──────────────────────────────────────   ──────────────────   ──────────────────────────
CPU > 90% for < 5 min                    Monitor              Might be a spike, wait
CPU > 90% for > 15 min                   Remediate            Sustained — restart service
Error rate > 10% + recent deployment     Rollback             Deployment caused it
Disk > 95%                               Remediate            Clean logs, expand volume
Problem seen 3+ times this week          Escalate             Recurring — needs root cause
Unknown problem type                     Escalate             Don't auto-fix what you don't understand
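
The table above could be expressed as an ordered rule list. The sketch below is one possible encoding, not the course's reference implementation. The context fields (type, percent, minutes, recentDeployment, occurrencesThisWeek) are hypothetical names, and checking the recurrence rule first is an assumption about precedence. The table does not say what to do for CPU above 90% lasting between 5 and 15 minutes, so that case falls through to the escalate default here.

```javascript
// Each rule mirrors one row of the decision table. Order matters: first match wins.
// Field names on the context object are assumptions made for this sketch.
const RULES = [
  {
    action: "escalate",
    why: "Recurring, needs root cause",
    when: (c) => c.occurrencesThisWeek >= 3,
  },
  {
    action: "monitor",
    why: "Might be a spike, wait",
    when: (c) => c.type === "cpu" && c.percent > 90 && c.minutes < 5,
  },
  {
    action: "remediate",
    why: "Sustained, restart service",
    when: (c) => c.type === "cpu" && c.percent > 90 && c.minutes > 15,
  },
  {
    action: "rollback",
    why: "Deployment caused it",
    when: (c) =>
      c.type === "error_rate" && c.percent > 10 && c.recentDeployment === true,
  },
  {
    action: "remediate",
    why: "Clean logs, expand volume",
    when: (c) => c.type === "disk" && c.percent > 95,
  },
];

// Anything no rule recognizes is escalated rather than auto-fixed.
function decide(context) {
  const rule = RULES.find((r) => r.when(context));
  return rule
    ? { action: rule.action, why: rule.why }
    : { action: "escalate", why: "No rule matched, do not auto-fix" };
}

console.log(decide({ type: "cpu", percent: 95, minutes: 20 }).action); // remediate
console.log(decide({ type: "network" }).action); // escalate
```

Keeping the rules as data makes it easy to review them next to the table and to add a row without touching the control flow.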

⚠️ Golden rule: Only auto-remediate problems you fully understand. For everything else, escalate to humans with rich context (DQL data, entity health, recent changes).

Service User Setup (Required for Production)

Workflows need a service user as the actor — this controls whose permissions the tasks execute under:

Create a service user in Account Management → Identity & access management
Create a policy with the required scopes (for example storage:events:read, storage:buckets:read, automation:workflows:run, email:emails:send)
Bind the policy to the service user's auto-created group
Set the service user as the workflow's actor

⚠️ NEVER set owner to a service user — only set actor. Setting owner permanently locks you out of the workflow because you can't authenticate as a service user.
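
One way to guard against the owner mistake is a small check run before a workflow definition is deployed. The sketch below assumes the definition is available as a plain object whose owner and actor fields hold user IDs, and that you maintain your own list of service user IDs. The function name and the object shape are hypothetical and not part of any Dynatrace SDK.

```javascript
// Returns a list of human-readable findings; an empty list means the check passed.
// The { owner, actor } shape and the ID list are assumptions made for this sketch.
function checkWorkflowIdentity(workflow, serviceUserIds) {
  const findings = [];

  // The owner must stay a person or group that can actually sign in.
  if (serviceUserIds.includes(workflow.owner)) {
    findings.push("owner is a service user: you would be locked out of the workflow");
  }

  // The actor should be the service user whose permissions the tasks run under.
  if (!serviceUserIds.includes(workflow.actor)) {
    findings.push("actor is not a known service user");
  }

  return findings;
}

const serviceUsers = ["svc-remediation"];

console.log(
  checkWorkflowIdentity({ owner: "alice", actor: "svc-remediation" }, serviceUsers)
); // []

console.log(
  checkWorkflowIdentity({ owner: "svc-remediation", actor: "svc-remediation" }, serviceUsers)
); // one finding about the owner
```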

Real-World Remediation Actions

Action                  How (Workflow Task)                    Risk Level
──────────────────────  ──────────────────────────────────────  ──────────
Restart service         HTTP request to orchestrator API        Medium
Scale up (K8s)          HTTP request to K8s API (HPA patch)     Low
Clear disk space        HTTP request to run cleanup script      Low
Rollback deployment     HTTP request to CI/CD API               High
Create Jira ticket      Jira connector task                     None
Page on-call            Email/Slack/PagerDuty connector         None

💡 Start with the no-risk actions (notify, create ticket). Once you trust the decision logic, add low-risk actions (scale up, clear disk space), then medium-risk ones (restart). Never start with rollback.
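
A gradual rollout like this could be enforced with a risk ceiling that the decision task consults before acting. The sketch below copies the risk levels from the table above. The function name actionsUpTo and the idea of a single configurable ceiling are assumptions made for illustration.

```javascript
// Risk levels from lowest to highest, matching the table's "Risk Level" column.
const RISK_ORDER = ["none", "low", "medium", "high"];

// Actions and risk levels as listed in the table.
const ACTIONS = [
  { name: "Restart service", risk: "medium" },
  { name: "Scale up (K8s)", risk: "low" },
  { name: "Clear disk space", risk: "low" },
  { name: "Rollback deployment", risk: "high" },
  { name: "Create Jira ticket", risk: "none" },
  { name: "Page on-call", risk: "none" },
];

// Returns the names of all actions at or below the given risk ceiling.
function actionsUpTo(maxRisk) {
  const limit = RISK_ORDER.indexOf(maxRisk);
  if (limit === -1) {
    throw new Error(`Unknown risk level: ${maxRisk}`);
  }
  return ACTIONS.filter((a) => RISK_ORDER.indexOf(a.risk) <= limit).map(
    (a) => a.name
  );
}

console.log(actionsUpTo("none")); // [ 'Create Jira ticket', 'Page on-call' ]
```

Raising the ceiling from "none" to "low" and later to "medium" then becomes a one-line, reviewable change instead of an edit to the decision logic.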
