Build Auto-Remediation Workflows
New Capability: Auto-Remediation
With Workflows in place (Module 13), auto-remediation becomes possible: detect a problem, analyze it, and fix it automatically. This wasn't feasible with Gen2 alerting profiles.
The Pattern
How It Works
- Trigger: Davis detects a problem (CPU spike, service failure, etc.)
- Context: DQL queries gather data (entity health, recent changes, history)
- Decision: JavaScript analyzes the context and decides: monitor, investigate, remediate, or escalate
- Action: HTTP request to restart service, scale up, or create ticket
- Notify: Email/Slack with what happened and what was done
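The five stages above can be sketched as one plain JavaScript function to show how data flows between them. This is only an illustration: in a real workflow each stage would be its own task, and `runRemediationFlow`, the `stages` object, and the stub values below are hypothetical names, not part of any Dynatrace API.

```javascript
// Hypothetical sketch of the trigger -> context -> decision -> action -> notify flow.
// Each stage is injected as a function so the wiring is easy to follow and to stub.
function runRemediationFlow(problem, stages) {
  const context = stages.gatherContext(problem);     // Context: what a DQL task would return
  const decision = stages.decide(problem, context);  // Decision: monitor / investigate / remediate / escalate
  // Action: only runs when the decision asks for remediation
  const result = decision === "remediate" ? stages.act(problem, context) : null;
  const message =
    `Problem ${problem.id}: decision=${decision}, action=${result ? result.summary : "none"}`;
  stages.notify(message);                            // Notify: always report what happened
  return { decision, result, message };
}

// Usage with stub stages (all values are made up for illustration)
const sent = [];
const outcome = runRemediationFlow(
  { id: "P-1", type: "cpu" },
  {
    gatherContext: () => ({ cpuPercent: 97, minutesElevated: 20 }),
    decide: (_problem, ctx) =>
      ctx.cpuPercent > 90 && ctx.minutesElevated > 15 ? "remediate" : "monitor",
    act: () => ({ summary: "restart requested" }),
    notify: (msg) => sent.push(msg),
  }
);
console.log(outcome.message);
```

Note that the notification step runs regardless of the decision, so a human always sees what the workflow did or chose not to do.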
⚠️ Critical: NEVER set owner to a service user on a workflow. Doing so permanently locks you out. Only set actor (who the tasks run as).
Try it: Create a workflow with a Davis problem trigger → Add a DQL task to query the problem details → Add a "Send email" task with the results → Deploy and wait for a problem to fire.
Decision Engine Logic
The JavaScript decision task analyzes context and chooses an action:
Condition                              Action      Why
─────────────────────────────────────  ──────────  ────────────────────────────────────────
CPU > 90% for < 5 min                  Monitor     Might be a spike, wait
CPU > 90% for > 15 min                 Remediate   Sustained: restart service
Error rate > 10% + recent deployment   Rollback    Deployment caused it
Disk > 95%                             Remediate   Clean logs, expand volume
Problem seen 3+ times this week        Escalate    Recurring: needs root cause
Unknown problem type                   Escalate    Don't auto-fix what you don't understand
⚠️ Golden rule: Only auto-remediate problems you fully understand. For everything else, escalate to humans with rich context (DQL data, entity health, recent changes).
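The decision table can be expressed as a small rule function. The sketch below is a minimal illustration under several assumptions: the `ctx` field names are hypothetical (not a Dynatrace schema), the escalation rules are checked first as a precedence choice the table does not state, and any case the table leaves open (for example CPU above 90% for between 5 and 15 minutes) falls back to "investigate".

```javascript
// Problem types this sketch knows how to handle; anything else is escalated.
const KNOWN_TYPES = new Set(["cpu", "error_rate", "disk"]);

// ctx fields (type, cpuPercent, minutesElevated, ...) are illustrative assumptions.
function decide(ctx) {
  // Escalation rules first (assumed precedence): never auto-fix the unknown or recurring
  if (!KNOWN_TYPES.has(ctx.type)) {
    return { action: "escalate", why: "unknown problem type" };
  }
  if ((ctx.occurrencesThisWeek ?? 0) >= 3) {
    return { action: "escalate", why: "recurring, needs root cause" };
  }
  if (ctx.type === "cpu" && ctx.cpuPercent > 90) {
    if (ctx.minutesElevated < 5) return { action: "monitor", why: "might be a spike, wait" };
    if (ctx.minutesElevated > 15) return { action: "remediate", why: "sustained, restart service" };
  }
  if (ctx.type === "error_rate" && ctx.errorRatePercent > 10 && ctx.recentDeployment) {
    return { action: "rollback", why: "deployment caused it" };
  }
  if (ctx.type === "disk" && ctx.diskPercent > 95) {
    return { action: "remediate", why: "clean logs, expand volume" };
  }
  // No table row matched: assumed default, hand the case to a human-paced path
  return { action: "investigate", why: "no rule matched" };
}

// Usage: a sustained CPU problem seen once this week
console.log(decide({ type: "cpu", cpuPercent: 95, minutesElevated: 20, occurrencesThisWeek: 1 }));
```

Returning the reason alongside the action keeps the later notification step informative without re-deriving why a branch was taken.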
Service User Setup (Required for Production)
Workflows need a service user as the actor. The actor determines whose permissions the tasks execute under:
- Create a service user in Account Management → Identity & access management
- Create a policy with required scopes (storage:read, automation:workflows:run, email:send)
- Bind the policy to the service user's auto-created group
- Set the service user as the workflow's actor
⚠️ NEVER set owner to a service user; only set actor. Setting owner permanently locks you out of the workflow because you can't authenticate as a service user.
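One way to guard against the owner mistake is a small pre-deployment check on the workflow definition. The sketch below assumes a workflow object with `owner` and `actor` fields holding user IDs; that shape, the function name `checkWorkflowIdentity`, and the example IDs are hypothetical and only meant to illustrate the rule.

```javascript
// Returns a list of identity problems; an empty list means the workflow passes.
// workflow: assumed shape { title, owner, actor } (illustrative, not an official schema)
// serviceUserIds: IDs of known service users
function checkWorkflowIdentity(workflow, serviceUserIds) {
  const problems = [];
  if (serviceUserIds.includes(workflow.owner)) {
    problems.push("owner is a service user: this would lock you out of the workflow");
  }
  if (!serviceUserIds.includes(workflow.actor)) {
    problems.push("actor is not a service user: tasks would run under a personal identity");
  }
  return problems;
}

// Usage with made-up IDs
const serviceUsers = ["svc-remediation"];
console.log(checkWorkflowIdentity(
  { title: "Auto-remediate CPU", owner: "alice", actor: "svc-remediation" },
  serviceUsers
));
```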
Real-World Remediation Actions
Action                How (Workflow Task)                   Risk Level
────────────────────  ────────────────────────────────────  ──────────
Restart service       HTTP request to orchestrator API      Medium
Scale up (K8s)        HTTP request to K8s API (HPA patch)   Low
Clear disk space      HTTP request to run cleanup script    Low
Rollback deployment   HTTP request to CI/CD API             High
Create Jira ticket    Jira connector task                   None
Page on-call          Email/Slack/PagerDuty connector       None
💡 Start with the lowest-risk actions (notify, create ticket). Once you trust the decision logic, gradually add higher-risk actions (scale, restart). Never start with rollback.
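The gradual rollout can be enforced with a risk ceiling that is raised over time. The following sketch maps the actions from the table to their risk levels and allows an action only if it does not exceed the current ceiling. The action keys and the `isActionAllowed` helper are hypothetical names chosen for illustration.

```javascript
// Risk levels in ascending order, matching the table's Risk Level column
const RISK_ORDER = ["none", "low", "medium", "high"];

// Action keys are illustrative; risk levels are taken from the table
const ACTION_RISK = new Map([
  ["create_ticket", "none"],
  ["page_on_call", "none"],
  ["scale_up", "low"],
  ["clear_disk", "low"],
  ["restart_service", "medium"],
  ["rollback_deployment", "high"],
]);

// True only if the action is known and its risk is at or below maxRisk
function isActionAllowed(action, maxRisk) {
  const risk = ACTION_RISK.get(action);
  if (risk === undefined) return false; // unknown actions never run automatically
  // An unrecognized maxRisk yields index -1, so nothing is allowed
  return RISK_ORDER.indexOf(risk) <= RISK_ORDER.indexOf(maxRisk);
}

// Usage: a new workflow starts with the ceiling at "none"
console.log(isActionAllowed("create_ticket", "none"));   // true
console.log(isActionAllowed("restart_service", "none")); // false
```

Raising the ceiling from "none" to "low" and later to "medium" is then a one-line change, and rollback stays blocked until the ceiling is explicitly set to "high".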