
High Error Rate


Playbook: High Error Rate

Alert fires: "[P1] Error Rate Spike." Here's how to investigate.

Step 1: Which Service?

timeseries avg(dt.service.request.failure_rate), by:{dt.entity.service}

Step 2: Error Count Over Time

timeseries sum(dt.service.request.failure_count), by:{dt.entity.service}

When did it start? A sudden spike usually points to a deployment or a failing dependency. A gradual increase suggests resource exhaustion.
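The sudden-versus-gradual distinction can also be checked mechanically once the failure counts are exported from the chart. The following is a minimal Python sketch, not part of the playbook itself: the function name `classify_onset` and the `spike_factor` threshold are hypothetical and would need tuning for real traffic.

```python
from statistics import median


def classify_onset(counts, spike_factor=3.0):
    """Label a failure-count series as 'sudden', 'gradual', or 'flat'.

    counts: per-interval failure counts, oldest first.
    spike_factor: hypothetical threshold (an assumption, not a Dynatrace
    setting), expressed as a multiple of the baseline.
    """
    if len(counts) < 4:
        raise ValueError("need at least 4 intervals")
    # Baseline: median of the first half of the window, floored at 1
    # so that an all-zero baseline does not make every change a spike.
    baseline = max(median(counts[: len(counts) // 2]), 1.0)
    threshold = spike_factor * baseline
    # Sudden: a single interval-to-interval jump reaches the threshold.
    for prev, cur in zip(counts, counts[1:]):
        if cur - prev >= threshold:
            return "sudden"
    # Gradual: no single large jump, but the latest value is far above baseline.
    if counts[-1] >= threshold:
        return "gradual"
    return "flat"


print(classify_onset([2, 3, 2, 3, 2, 40, 42, 41]))       # sudden
print(classify_onset([2, 3, 4, 5, 7, 9, 11, 13, 15, 17]))  # gradual
```

A "sudden" result suggests going straight to the deployment and dependency checks. A "gradual" result suggests looking at resource metrics first.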

Step 3: What Errors?

fetch logs, from:now()-1h
| filter loglevel == "ERROR"
| summarize cnt=count(), by:{content}
| sort cnt desc
| limit 10

Step 4: Error Traces

fetch spans, from:now()-1h
| filter span.kind == "SERVER"
| filter status_code >= 400
| summarize cnt=count(), by:{service.name, status_code}
| sort cnt desc

Step 5: Recent Deployments?

fetch events, from:now()-24h
| filter event.type == "CUSTOM_DEPLOYMENT"
| fields timestamp, event.name
| sort timestamp desc
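To connect this step with the onset time found in Step 2, it can help to filter the deployment list down to events that landed shortly before the errors began. The sketch below assumes the query results have been exported as `(timestamp, name)` pairs. The helper `deployments_before` and the 30-minute default window are illustrative assumptions.

```python
from datetime import datetime, timedelta


def deployments_before(onset, deployments, window=timedelta(minutes=30)):
    """Return deployments that happened within `window` before `onset`.

    onset: datetime when the error rate started rising.
    deployments: iterable of (timestamp, name) pairs.
    Results are sorted newest first, matching the query's sort order.
    """
    hits = [(ts, name) for ts, name in deployments
            if onset - window <= ts <= onset]
    return sorted(hits, key=lambda d: d[0], reverse=True)


onset = datetime(2024, 1, 1, 12, 0)
events = [
    (datetime(2024, 1, 1, 9, 0), "checkout v41"),    # hypothetical names
    (datetime(2024, 1, 1, 11, 50), "checkout v42"),
]
suspects = deployments_before(onset, events)
print([name for _, name in suspects])  # ['checkout v42']
```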

Step 6: Dependency Health

// Check if a database or external service is failing
fetch spans, from:now()-1h
| filter span.kind == "CLIENT"
| filter status_code >= 400
| summarize cnt=count(), by:{service.name, span.name}
| sort cnt desc

Decision Tree

Recent deployment?         → Rollback or fix the deployment
  ↓ No
Database errors?           → Check DB health, connection pool, queries
  ↓ No
External API failing?      → Check third-party status, timeouts
  ↓ No
Resource exhaustion?       → Check CPU, memory, disk, connections
  ↓ No
Code-level exception?      → Check stack traces in error logs
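The tree is an ordered list of checks where the first match wins. As a rough illustration, it could be encoded like this in Python. The function name, the boolean parameters, and the fallback message for the case where nothing matches are assumptions added for the sketch. The action strings are taken from the tree.

```python
def next_action(recent_deployment=False, database_errors=False,
                external_api_failing=False, resource_exhaustion=False,
                code_exception=False):
    """Walk the decision tree top to bottom and return the first matching action."""
    checks = [
        (recent_deployment, "Rollback or fix the deployment"),
        (database_errors, "Check DB health, connection pool, queries"),
        (external_api_failing, "Check third-party status, timeouts"),
        (resource_exhaustion, "Check CPU, memory, disk, connections"),
        (code_exception, "Check stack traces in error logs"),
    ]
    for matched, action in checks:
        if matched:
            return action
    # Fallback is not part of the original tree (assumption).
    return "No cause identified; escalate"


# A deployment takes priority even when database errors are also present.
print(next_action(recent_deployment=True, database_errors=True))
```

The ordering matters: a recent deployment is checked first because it is the cheapest cause to confirm and to undo.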

🛠 Try it: Open a Notebook → run timeseries err=avg(dt.service.request.failure_rate), by:{dt.entity.service} → look for any service above 1%. Click through to the service detail page → "Failure rate" tab to see individual failed requests.

⚠️ HTTP error rate gotcha: status == "ERROR" on spans is unreliable for HTTP services. It can return 0 errors even with thousands of 5xx responses. Always use:

fetch spans
| summarize errors = countIf(http.response.status_code >= 500), total = count()
| fieldsAdd error_rate = 100.0 * errors / total
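The arithmetic behind that query is simple: count responses with a status of 500 or higher and divide by the total. The Python sketch below mirrors it on a plain list of status codes. It also returns 0.0 for an empty window, a guard that is an assumption of this sketch and not something the query above does.

```python
def http_error_rate(status_codes):
    """Percentage of responses with an HTTP status code of 500 or higher."""
    total = len(status_codes)
    if total == 0:
        # No traffic in the window: report 0.0 instead of dividing by zero.
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return 100.0 * errors / total


print(http_error_rate([200, 200, 500, 503]))  # 50.0
print(http_error_rate([200, 404, 200, 200]))  # 0.0 (4xx is not counted)
```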