
Daily Health Report

Hands-on

Daily Health Report Workflow

A workflow that runs every morning, queries key metrics, formats a report, and emails it to the team.

Architecture

Schedule (08:00 daily)
  → DQL: Host health
  → DQL: Service errors
  → DQL: Active problems
  → DQL: SLO status
  → JavaScript: Format report
  → Email: Send to team

DQL Queries for the Report

Host Health Summary

timeseries cpu = avg(dt.host.cpu.usage), by:{dt.entity.host}
| fieldsAdd current = arrayAvg(cpu)
| fields dt.entity.host, current

Service Error Summary (24h)

timeseries {
  total = sum(dt.service.request.count),
  failures = sum(dt.service.request.failure_count)
}, by:{dt.entity.service}, from:now()-24h
| fieldsAdd total_sum = arraySum(total), fail_sum = arraySum(failures)
| fieldsAdd error_pct = 100.0 * fail_sum / total_sum
| fields dt.entity.service, total_sum, fail_sum, error_pct

Active Problems

fetch events, from:now()-24h
| filter event.kind == "DAVIS_PROBLEM"
| filter event.status == "ACTIVE"
| fields display_id, event.name

Workflow Setup

  1. Create workflow with Schedule trigger (08:00 daily)
  2. Add DQL tasks for each query above
  3. Add JavaScript task to format results into a readable report
  4. Add Email task: {{ result("format_report").output }}
  5. Set a service user as actor
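
Step 3 can be sketched roughly as follows. This is a minimal illustration, not the production template: formatReport is a hypothetical helper, the sample records are invented, and the way a workflow JavaScript task reads the results of earlier tasks is left out, so the function simply takes three arrays of records. The field names match the fields commands in the queries above, and the function returns an object with an output property on the assumption that this is what {{ result("format_report").output }} in step 4 refers to.

```javascript
// Hypothetical helper: turns the records of the three DQL tasks into plain text.
// Field names follow the `fields` commands in the queries above.
function formatReport(hosts, services, problems) {
  const lines = ["Daily Health Report", ""];

  lines.push("Host CPU (avg %)");
  for (const h of hosts) {
    lines.push(`  ${h["dt.entity.host"]}: ${Number(h.current).toFixed(1)}`);
  }

  lines.push("", "Service errors (24h)");
  for (const s of services) {
    // Guard against a missing error_pct (for example, a service with no requests).
    const pct = s.error_pct == null ? "n/a" : `${Number(s.error_pct).toFixed(2)}%`;
    lines.push(`  ${s["dt.entity.service"]}: ${s.fail_sum}/${s.total_sum} failed (${pct})`);
  }

  lines.push("", `Active problems: ${problems.length}`);
  for (const p of problems) {
    lines.push(`  ${p.display_id}: ${p["event.name"]}`);
  }

  // Returned under `output` so that an expression ending in `.output` has text to resolve to.
  return { output: lines.join("\n") };
}

// Sample records, invented for illustration only.
const report = formatReport(
  [{ "dt.entity.host": "HOST-1", current: 42.345 }],
  [{ "dt.entity.service": "SERVICE-1", total_sum: 200, fail_sum: 3, error_pct: 1.5 }],
  [{ display_id: "P-1", "event.name": "High CPU" }]
);
console.log(report.output);
```

Keeping the formatting in a pure function like this makes it easy to test outside the workflow before pasting it into the task.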

💡 We have a production-tested 11-task executive daily report workflow template with sparklines, health scores, and trend arrows. See the dt-automation skill for the full JSON template.

🛠 Try it: Open Workflows → "+ Workflow" → add a Schedule trigger (daily 08:00) → add an "Execute DQL query" task with fetch dt.entity.host | summarize total=count(), healthy=countIf(state == "RUNNING") → add a "Send email" task. You just automated your morning health check.

SLO Error Budget & Burn Rate

SLOs track an error budget, the amount of "allowed" downtime before you breach your target. Burn rate measures how fast that budget is being consumed:

  • Burn rate 1.0 = consuming budget exactly at target pace
  • Burn rate > 1.0 = will breach the SLO before the evaluation period ends if that pace is sustained
  • Auto-alerting: the SLO app raises burn-rate events via Anomaly Detection
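
The burn-rate arithmetic behind these bullets can be sketched as below, using the common definition (observed error rate divided by the error rate the target allows). The function names, the 99.9% target, the 30-day period, and the request counts are assumptions chosen for illustration; this is not how the SLO app computes its own events.

```javascript
// Burn rate = observed error rate / error rate the SLO target allows.
// targetPct is the SLO target in percent, e.g. 99.9 (must be below 100).
function burnRate(total, failures, targetPct) {
  if (total === 0) return 0; // no traffic, so no budget consumed
  const observedErrorRate = failures / total;
  const allowedErrorRate = 1 - targetPct / 100;
  return observedErrorRate / allowedErrorRate;
}

// Days until the budget is exhausted if the burn rate holds from the start of the period.
function daysToExhaustion(periodDays, rate) {
  return rate > 0 ? periodDays / rate : Infinity;
}

// 200 failures in 100,000 requests is 0.2% errors against 0.1% allowed.
const rate = burnRate(100000, 200, 99.9);
console.log(rate.toFixed(2));                       // "2.00"
console.log(daysToExhaustion(30, rate).toFixed(1)); // "15.0"
```

A burn rate of 2 means a 30-day budget would be gone after roughly 15 days, which is why values above 1.0 are worth alerting on.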

Official SLI DQL pattern (service availability):

timeseries {
  total = sum(dt.service.request.count),
  failures = sum(dt.service.request.failure_count)
}, by:{dt.smartscape.service}
| fieldsAdd sli = (((total[] - failures[]) / total[]) * 100)

Key: the sli field must return an array of double values. Use timeseries for metrics, makeTimeseries for logs/spans.
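
To make the array requirement concrete, here is a rough JavaScript equivalent of what the total[] and failures[] expression does: it applies the formula to each time slot and yields one SLI value per slot. The sliSeries name and the sample numbers are invented, and returning null for slots without requests is an assumption of this sketch rather than a statement about how DQL handles them.

```javascript
// Element-wise equivalent of ((total[] - failures[]) / total[]) * 100.
// Both inputs hold one value per time slot and have the same length.
function sliSeries(total, failures) {
  return total.map((t, i) =>
    t === 0 ? null : ((t - failures[i]) / t) * 100 // null for empty slots (assumption)
  );
}

const sli = sliSeries([4, 200, 0], [1, 0, 0]);
console.log(sli); // [ 75, 100, null ]
```

The result keeps the same length as the input series, which is the shape the sli field is expected to have.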