Playbook: High Error Rate
Alert fires: "[P1] Error Rate Spike." Here's how to investigate.
Step 1: Which Service?
timeseries avg(dt.service.request.failure_rate), by:{dt.entity.service}
Step 2: Error Count Over Time
timeseries sum(dt.service.request.failure_count), by:{dt.entity.service}
When did it start? A sudden spike usually points to a deployment or a failing dependency. A gradual increase usually points to resource exhaustion.
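The sudden-versus-gradual distinction can be made mechanical once the failure counts are exported from the chart. The following is a minimal Python sketch, not part of the playbook's tooling: the function name `classify_onset` and the `jump_share` threshold of 0.5 are assumptions chosen for illustration, and the sample series are made up.

```python
def classify_onset(counts, jump_share=0.5):
    """Label a failure-count series as 'sudden', 'gradual', or 'flat'.

    counts: failure counts per time bucket, oldest first.
    jump_share: hypothetical threshold. If a single bucket-to-bucket
    step accounts for at least this share of the total rise, the
    onset is treated as sudden.
    """
    if len(counts) < 2:
        return "flat"
    # Total rise from the first bucket to the peak.
    total_rise = max(counts) - counts[0]
    if total_rise <= 0:
        return "flat"
    # Differences between consecutive buckets.
    steps = [b - a for a, b in zip(counts, counts[1:])]
    biggest_step = max(steps)
    return "sudden" if biggest_step / total_rise >= jump_share else "gradual"


# Illustrative series only, not real measurements.
print(classify_onset([2, 3, 2, 40, 42, 41]))  # one large step
print(classify_onset([2, 5, 9, 14, 20, 27]))  # many small steps
```

A "sudden" result suggests going to Step 5 (deployments) and Step 6 (dependencies) first. A "gradual" result suggests checking resource metrics first.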
Step 3: What Errors?
fetch logs, from:now()-1h
| filter loglevel == "ERROR"
| summarize cnt=count(), by:{content}
| sort cnt desc
| limit 10
Step 4: Error Traces
fetch spans, from:now()-1h
| filter span.kind == "SERVER"
| filter http.response.status_code >= 400
| summarize cnt=count(), by:{service.name, http.response.status_code}
| sort cnt desc
Step 5: Recent Deployments?
fetch events, from:now()-24h
| filter event.type == "CUSTOM_DEPLOYMENT"
| fields timestamp, event.name
| sort timestamp desc
Step 6: Dependency Health
// Check if a database or external service is failing
fetch spans, from:now()-1h
| filter span.kind == "CLIENT"
| filter http.response.status_code >= 400
| summarize cnt=count(), by:{service.name, span.name}
| sort cnt desc
Decision Tree
Recent deployment? → Roll back or fix the deployment
↓ No
Database errors? → Check DB health, connection pool, queries
↓ No
External API failing? → Check third-party status, timeouts
↓ No
Resource exhaustion? → Check CPU, memory, disk, connections
↓ No
Code-level exception? → Check stack traces in error logs
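The tree is an ordered list of checks in which the first "yes" decides the action. As a rough Python sketch of that ordering (the finding keys such as `recent_deployment` are hypothetical names, and returning `None` when nothing matches is an assumption, since the tree does not cover that case):

```python
# Ordered checks mirroring the decision tree: the first "yes" wins.
# The keys are hypothetical labels for what Steps 1-6 turned up.
DECISION_TREE = [
    ("recent_deployment", "Roll back or fix the deployment"),
    ("database_errors", "Check DB health, connection pool, queries"),
    ("external_api_failing", "Check third-party status, timeouts"),
    ("resource_exhaustion", "Check CPU, memory, disk, connections"),
    ("code_level_exception", "Check stack traces in error logs"),
]


def next_action(findings):
    """Return the action for the first check that is true, else None.

    findings: dict mapping a check key to True/False. Missing keys
    are treated as "No".
    """
    for key, action in DECISION_TREE:
        if findings.get(key, False):
            return action
    return None


# Example: no deployment, but client spans to the database are failing.
print(next_action({"recent_deployment": False, "database_errors": True}))
```

The order matters: a recent deployment is checked before anything else, so it takes priority even when database errors are also present.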
Try it: Open a Notebook → run timeseries err=avg(dt.service.request.failure_rate), by:{dt.entity.service} → look for any service above 1%. Click through to the service detail page → "Failure rate" tab to see individual failed requests.
⚠️ HTTP error rate gotcha: status == "ERROR" on spans is unreliable for HTTP services. It can return 0 errors even with thousands of 5xx responses. Always use:
fetch spans
| summarize errors = countIf(http.response.status_code >= 500), total = count()
| fieldsAdd error_rate = 100.0 * errors / total
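The arithmetic in that query is a 5xx count divided by the total span count, scaled to a percentage. Here is a small Python sketch of the same calculation for sanity-checking exported data. The function name `http_error_rate` and the sample status codes are made up for illustration, and the guard for an empty input is an addition of this sketch rather than something the query does.

```python
def http_error_rate(status_codes):
    """Return the percentage of responses with a 5xx status code.

    status_codes: iterable of integer HTTP response status codes,
    one per span. Returns None when there are no spans, so an empty
    time window is not reported as a 0% error rate.
    """
    codes = list(status_codes)
    total = len(codes)
    if total == 0:
        return None
    # Only 5xx counts as an error, matching the >= 500 condition above.
    errors = sum(1 for code in codes if code >= 500)
    return 100.0 * errors / total


# Illustrative data only: two successes, one 404, one 503.
print(http_error_rate([200, 200, 404, 503]))
```

With this threshold, 4xx responses count toward the total but not toward the errors.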