Deployment Gone Wrong

Playbook: Deployment Gone Wrong

A new version was deployed and something broke. The steps below confirm whether the deployment is the cause and assess the impact.

Step 1: Find the Deployment Event

fetch events, from:now()-24h
| filter event.type == "CUSTOM_DEPLOYMENT" OR matchesPhrase(event.type, "DEPLOYMENT")
| fields timestamp, event.name, event.type
| sort timestamp desc

Step 2: Before vs After - Response Time

// Last 6 hours: look for the inflection point
timeseries avg(dt.service.request.response_time), by:{dt.entity.service}, from:now()-6h

Step 3: Before vs After - Error Rate

timeseries avg(dt.service.request.failure_rate), by:{dt.entity.service}, from:now()-6h

Step 4: New Error Types

// Top error messages from the last 2 hours; set from: to the deployment time to see only errors logged AFTER the deployment
fetch logs, from:now()-2h
| filter loglevel == "ERROR"
| summarize cnt=count(), by:{content}
| sort cnt desc
| limit 10
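
The query above ranks error messages by count, but it does not show which ones are new. A minimal sketch of that comparison, assuming you have exported the `content` values from one window before the deployment and one window after it (the function name and inputs are hypothetical, not part of any Dynatrace API):

```python
from collections import Counter


def new_error_messages(before_messages, after_messages, top=10):
    """Return (message, count) pairs for errors seen only after the deployment.

    before_messages / after_messages are hypothetical lists of log `content`
    strings exported from a pre-deployment and a post-deployment window.
    """
    seen_before = set(before_messages)
    # Count only messages that never occurred in the "before" window.
    counts = Counter(m for m in after_messages if m not in seen_before)
    return counts.most_common(top)


before = ["timeout", "timeout"]
after = ["timeout", "NPE in CartService", "NPE in CartService", "disk full"]
print(new_error_messages(before, after))
```

Exact string matching is a simplification: real log lines often carry IDs or timestamps, so messages may need normalizing before the comparison is meaningful.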

Step 5: Davis Problems Since Deployment

fetch events, from:now()-6h
| filter event.kind == "DAVIS_PROBLEM"
| fields display_id, event.name, event.status, timestamp
| sort timestamp desc

Step 6: User Impact

// Did user sessions drop?
fetch user.sessions, from:now()-6h
| makeTimeseries sessions=count()

Decision Tree

Error rate spiked after deploy?    → Rollback immediately
  ↓ No spike
Response time degraded?            → Check new code paths, DB queries
  ↓ No degradation
New error types in logs?           → Fix the specific errors
  ↓ No new errors
User sessions dropped?             → Check frontend, CDN, DNS
  ↓ All normal
Deployment is fine                 → Monitor for 24h before closing
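
The decision tree is a first-match chain, so the order of the checks matters. A minimal sketch of the same logic, assuming the four signals have already been reduced to booleans by the earlier steps (the function and parameter names are hypothetical):

```python
def next_action(error_rate_spiked, response_time_degraded,
                new_error_types, sessions_dropped):
    """Return the first matching action from the decision tree, top to bottom."""
    if error_rate_spiked:
        return "Rollback immediately"
    if response_time_degraded:
        return "Check new code paths, DB queries"
    if new_error_types:
        return "Fix the specific errors"
    if sessions_dropped:
        return "Check frontend, CDN, DNS"
    # No signal fired: treat the deployment as fine, but keep watching.
    return "Monitor for 24h before closing"


print(next_action(False, True, True, False))
```

Because the error-rate check comes first, a spike leads to a rollback even when other signals are also abnormal.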

🛠 Try it: Open Ctrl+K → "Releases" to see recent deployments across your environment. Click any release to see which services were affected and whether error rates changed after deployment.

Regression Thresholds (Official)

A regression is confirmed when any threshold is exceeded:

Signal              Regression Threshold
──────────────────  ──────────────────────────────────────────────────
P95 response time   Increased >20% OR absolute >2s
Error rate          Increased >1 percentage point
Throughput          Dropped >20% (without corresponding traffic drop)

Compare 30-minute windows: before = [deploymentTime - 35min, deploymentTime - 5min], after = [deploymentTime + 5min, deploymentTime + 35min].
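
The window arithmetic and the three thresholds can be sketched as follows. This is an illustration under stated assumptions, not an official implementation: `WindowStats`, `comparison_windows`, and `regressions` are hypothetical names, "absolute >2s" is read as "P95 in the after window exceeds 2 seconds", and whether overall traffic dropped is passed in as a flag because the table does not define how to measure it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WindowStats:
    """Aggregates for one 30-minute window (hypothetical container)."""
    p95_response_s: float   # P95 response time in seconds
    error_rate_pct: float   # failed requests as a percentage (0-100)
    throughput_rpm: float   # requests per minute


def comparison_windows(deployment_time):
    """Return ((before_start, before_end), (after_start, after_end))."""
    before = (deployment_time - timedelta(minutes=35),
              deployment_time - timedelta(minutes=5))
    after = (deployment_time + timedelta(minutes=5),
             deployment_time + timedelta(minutes=35))
    return before, after


def regressions(before, after, traffic_dropped=False):
    """List the signals whose regression threshold is exceeded."""
    hits = []
    # P95 response time: increased >20% OR above 2s (assumed reading of "absolute >2s").
    if (after.p95_response_s > before.p95_response_s * 1.20
            or after.p95_response_s > 2.0):
        hits.append("p95_response_time")
    # Error rate: increased by more than 1 percentage point.
    if after.error_rate_pct - before.error_rate_pct > 1.0:
        hits.append("error_rate")
    # Throughput: dropped >20% without a corresponding traffic drop.
    if not traffic_dropped and after.throughput_rpm < before.throughput_rpm * 0.80:
        hits.append("throughput")
    return hits


before_win, after_win = comparison_windows(datetime(2024, 1, 1, 12, 0))
print(before_win, after_win)
print(regressions(WindowStats(0.8, 0.5, 1000.0), WindowStats(1.0, 2.0, 900.0)))
```

The 5-minute gap on each side of the deployment time keeps rollout noise, such as restarts and cold caches, out of both windows.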