Playbook: Deployment Gone Wrong
A new version was deployed and things broke. Here's how to confirm and assess impact.
Step 1: Find the Deployment Event
fetch events, from:now()-24h
| filter event.type == "CUSTOM_DEPLOYMENT" OR matchesPhrase(event.type, "DEPLOYMENT")
| fields timestamp, event.name, event.type
| sort timestamp desc
Step 2: Before vs After - Response Time
// Last 6 hours: look for the inflection point
timeseries avg(dt.service.request.response_time), by:{dt.entity.service}, from:now()-6h
Step 3: Before vs After - Error Rate
timeseries avg(dt.service.request.failure_rate), by:{dt.entity.service}, from:now()-6h
Step 4: New Error Types
// Top error messages from the last 2 hours; start the time range at the deployment time to isolate errors that appeared AFTER the deployment
fetch logs, from:now()-2h
| filter loglevel == "ERROR"
| summarize cnt=count(), by:{content}
| sort cnt desc
| limit 10
Step 5: Davis Problems Since Deployment
fetch events, from:now()-6h
| filter event.kind == "DAVIS_PROBLEM"
| fields display_id, event.name, event.status, timestamp
| sort timestamp desc
Step 6: User Impact
// Did user sessions drop?
fetch user.sessions, from:now()-6h
| makeTimeseries sessions=count()
Decision Tree
Error rate spiked after deploy? → Rollback immediately
↓ No spike
Response time degraded? → Check new code paths, DB queries
↓ No degradation
New error types in logs? → Fix the specific errors
↓ No new errors
User sessions dropped? → Check frontend, CDN, DNS
↓ All normal
Deployment is fine → Monitor for 24h before closing
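The decision tree can also be read as an ordered chain of checks, where the first "yes" decides the action. Below is a minimal Python sketch of that reading. The `PostDeployChecks` class, its field names, and `next_action` are hypothetical names introduced for illustration; they are not part of any Dynatrace API, and the boolean answers are assumed to come from the queries in Steps 2 to 6.

```python
from dataclasses import dataclass


@dataclass
class PostDeployChecks:
    """Yes/no answers to the four questions in the decision tree (hypothetical names)."""
    error_rate_spiked: bool
    response_time_degraded: bool
    new_error_types: bool
    user_sessions_dropped: bool


def next_action(checks: PostDeployChecks) -> str:
    """Walk the tree top to bottom and return the action for the first 'yes'."""
    if checks.error_rate_spiked:
        return "Rollback immediately"
    if checks.response_time_degraded:
        return "Check new code paths, DB queries"
    if checks.new_error_types:
        return "Fix the specific errors"
    if checks.user_sessions_dropped:
        return "Check frontend, CDN, DNS"
    # Every check came back normal.
    return "Monitor for 24h before closing"


if __name__ == "__main__":
    # Example: latency degraded, but error rate is flat.
    print(next_action(PostDeployChecks(False, True, False, False)))
```

The order matters: an error-rate spike triggers a rollback even if other signals are also abnormal, which matches the top-down order of the tree.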
Try it: Open Ctrl+K → "Releases" to see recent deployments across your environment. Click any release to see which services were affected and whether error rates changed after deployment.
Regression Thresholds (Official)
A regression is confirmed when any threshold is exceeded:
Signal              Regression Threshold
──────────────────  ──────────────────────────────────────────────────
P95 response time   Increased >20% OR absolute >2s
Error rate          Increased >1 percentage point
Throughput          Dropped >20% (without corresponding traffic drop)
Compare 30-minute windows: before = [deploymentTime - 35min, deploymentTime - 5min], after = [deploymentTime + 5min, deploymentTime + 35min].
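As a worked example of the windows and thresholds, here is a minimal Python sketch. It rests on several assumptions not stated in the playbook: the aggregated values for each window are already available (for example, exported from the queries above), "absolute >2s" is read as the after-window P95 exceeding 2 seconds, and the "without corresponding traffic drop" condition is supplied by the caller as a flag. `WindowStats`, `comparison_windows`, and `regression_signals` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WindowStats:
    """Aggregates for one 30-minute window (hypothetical structure)."""
    p95_response_time_s: float  # P95 response time in seconds
    error_rate_pct: float       # error rate in percent (0-100)
    throughput: float           # requests per window, any consistent unit


def comparison_windows(deployment_time: datetime):
    """Return (before, after) windows, each 30 minutes with a 5-minute gap."""
    before = (deployment_time - timedelta(minutes=35),
              deployment_time - timedelta(minutes=5))
    after = (deployment_time + timedelta(minutes=5),
             deployment_time + timedelta(minutes=35))
    return before, after


def regression_signals(before: WindowStats, after: WindowStats,
                       traffic_dropped: bool = False) -> list:
    """List the signals whose regression threshold is exceeded."""
    signals = []
    # P95: relative increase above 20%, or (assumed reading) after-window P95 above 2 s.
    if (after.p95_response_time_s > before.p95_response_time_s * 1.20
            or after.p95_response_time_s > 2.0):
        signals.append("P95 response time")
    # Error rate: increase of more than 1 percentage point.
    if after.error_rate_pct - before.error_rate_pct > 1.0:
        signals.append("Error rate")
    # Throughput: drop above 20% that incoming traffic does not explain.
    if after.throughput < before.throughput * 0.80 and not traffic_dropped:
        signals.append("Throughput")
    return signals


if __name__ == "__main__":
    before_w, after_w = comparison_windows(datetime(2024, 1, 1, 12, 0))
    print("before:", before_w)
    print("after: ", after_w)
    print(regression_signals(WindowStats(0.5, 0.5, 1000),
                             WindowStats(0.7, 0.6, 990)))
```

A non-empty result means at least one threshold was exceeded, which the playbook treats as a confirmed regression. The 5-minute gap on each side of the deployment keeps rollout noise out of both windows.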