Build Auto-Remediation Workflows

Hands-on

New Capability: Auto-Remediation

With Workflows in place (Module 13), auto-remediation becomes possible: detect a problem, analyze it, and fix it automatically. This wasn't feasible with Gen2 alerting profiles.

The Pattern

Davis Problem → DQL Context → JS Decision → Act / Escalate → Notify

How It Works

  1. Trigger: Davis detects a problem (CPU spike, service failure, etc.)
  2. Context: DQL queries gather data (entity health, recent changes, history)
  3. Decision: JavaScript analyzes the context and decides: monitor, investigate, remediate, or escalate
  4. Action: HTTP request to restart service, scale up, or create ticket
  5. Notify: Email/Slack with what happened and what was done
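The five steps above can be sketched as a single function. This is a minimal illustration rather than Dynatrace code: the `steps` object and its method names (`gatherContext`, `decide`, `remediate`, `escalate`, `notify`) are hypothetical placeholders for the DQL, JavaScript, HTTP, and notification tasks you would wire together in a real workflow.

```javascript
// Sketch of the five-step pattern. The `steps` object and all of its
// method names are hypothetical placeholders, not Dynatrace APIs.
async function runRemediation(problem, steps) {
  // 1. Trigger: `problem` is the payload of the detected Davis problem.

  // 2. Context: gather data about the affected entity (DQL in a real workflow).
  const context = await steps.gatherContext(problem);

  // 3. Decision: one of "monitor", "investigate", "remediate", "escalate".
  const decision = steps.decide(problem, context);

  // 4. Action: act only on an explicit "remediate"; escalate when asked.
  //    "monitor" and "investigate" change nothing automatically.
  let outcome = "no automatic action";
  if (decision === "remediate") {
    outcome = await steps.remediate(problem, context);
  } else if (decision === "escalate") {
    outcome = await steps.escalate(problem, context);
  }

  // 5. Notify: always report what happened and what was done.
  const summary = { problemId: problem.id, decision, outcome };
  await steps.notify(summary);
  return summary;
}
```

Passing the steps in as arguments keeps the flow testable with stubs before any real remediation endpoint is connected.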

โš ๏ธ Critical: NEVER set owner to a service user on a workflow โ€” it permanently locks you out. Only set actor (who the tasks run as).

🛠 Try it: Create a workflow with a Davis problem trigger → Add a DQL task to query the problem details → Add a "Send email" task with the results → Deploy and wait for a problem to fire.

Decision Engine Logic

The JavaScript decision task analyzes context and chooses an action:

Condition                                Action       Why
───────────────────────────────────────  ───────────  ────────────────────────────────────────
CPU > 90% for < 5 min                    Monitor      Might be a spike, wait
CPU > 90% for > 15 min                   Remediate    Sustained: restart service
Error rate > 10% + recent deployment     Rollback     Deployment caused it
Disk > 95%                               Remediate    Clean logs, expand volume
Problem seen 3+ times this week          Escalate     Recurring: needs root cause
Unknown problem type                     Escalate     Don't auto-fix what you don't understand

โš ๏ธ Golden rule: Only auto-remediate problems you fully understand. For everything else, escalate to humans with rich context (DQL data, entity health, recent changes).

Service User Setup (Required for Production)

Workflows need a service user as the actor, which controls whose permissions the tasks execute under:

  1. Create a service user in Account Management → Identity & access management
  2. Create a policy with required scopes (storage:read, automation:workflows:run, email:send)
  3. Bind the policy to the service user's auto-created group
  4. Set the service user as the workflow's actor

โš ๏ธ NEVER set owner to a service user โ€” only set actor. Setting owner permanently locks you out of the workflow because you can't authenticate as a service user.

Real-World Remediation Actions

Action                  How (Workflow Task)                    Risk Level
──────────────────────  ─────────────────────────────────────  ──────────
Restart service         HTTP request to orchestrator API       Medium
Scale up (K8s)          HTTP request to K8s API (HPA patch)    Low
Clear disk space        HTTP request to run cleanup script     Low
Rollback deployment     HTTP request to CI/CD API              High
Create Jira ticket      Jira connector task                    None
Page on-call            Email/Slack/PagerDuty connector        None

💡 Start with the lowest-risk actions (notify, create ticket). Once you trust the decision logic, gradually add higher-risk actions (scale, restart). Never start with rollback.
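One way to enforce this staged rollout is a risk ceiling that the workflow checks before acting. The sketch below is illustrative only: the action keys (`createTicket`, `restartService`, and so on) and the `maxRisk` setting are hypothetical names, with risk levels copied from the table above. Anything above the ceiling, or not in the map at all, is downgraded to creating a ticket.

```javascript
// Risk levels from lowest to highest, matching the table's "Risk Level" column.
const RISK_ORDER = ["none", "low", "medium", "high"];

// Hypothetical action keys mapped to the risk levels listed in the table.
const ACTION_RISK = new Map([
  ["createTicket", "none"],
  ["pageOnCall", "none"],
  ["scaleUp", "low"],
  ["clearDisk", "low"],
  ["restartService", "medium"],
  ["rollbackDeployment", "high"],
]);

// True only if the action is known and its risk does not exceed the ceiling.
function isAllowed(action, maxRisk) {
  const risk = ACTION_RISK.get(action);
  const ceiling = RISK_ORDER.indexOf(maxRisk);
  if (risk === undefined || ceiling === -1) {
    return false; // unknown action or invalid ceiling: refuse
  }
  return RISK_ORDER.indexOf(risk) <= ceiling;
}

// Downgrade disallowed actions to a ticket so a human still sees the problem.
function resolveAction(requested, maxRisk) {
  return isAllowed(requested, maxRisk) ? requested : "createTicket";
}
```

Starting with `maxRisk` set to `"none"` means every decision ends in a ticket or a page; raising it to `"low"` and later `"medium"` unlocks scaling and restarts without touching the decision logic.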