

Playbook: Slow Application

User reports: "The app is slow." Here's the step-by-step investigation.

💡 Start with the Problems app. Open Ctrl+K → "Problems" — Davis may have already detected the root cause. If your team has created troubleshooting guides (notebooks/dashboards prefixed with [TSG]), Dynatrace Intelligence will auto-suggest relevant ones for the active problem.

Step 1: Check if Davis Already Found It

fetch events, from:now()-24h
| filter event.kind == "DAVIS_PROBLEM"
| filter event.status == "ACTIVE"
| fields display_id, event.name, affected_entity_ids

Step 2: Service Response Time

timeseries avg(dt.service.request.response_time), by:{dt.entity.service}

Look for spikes. Which service is slow? Note the entity ID.
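Once you have the entity ID, you can re-run the query scoped to just that service. A minimal sketch — the service ID below is a placeholder for the one you noted:

timeseries avg(dt.service.request.response_time),
  by:{dt.entity.service},
  filter:{dt.entity.service == "<service-entity-id>"}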

Step 3: Is It Throughput or Latency?

timeseries {
  rt = avg(dt.service.request.response_time),
  count = sum(dt.service.request.count)
}, by:{dt.entity.service}

If throughput dropped AND latency spiked → likely a backend issue. If throughput is normal but latency spiked → slow dependency.

Step 4: Check the Host

timeseries {
  cpu = avg(dt.host.cpu.usage),
  mem = avg(dt.host.memory.usage),
  disk = avg(dt.host.disk.used.percent)
}, by:{dt.entity.host}

CPU > 90%? Memory exhausted? Disk full? These cause application slowness.
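To list only hosts breaching these thresholds, filter on the aggregated series. A sketch — using arrayMax over the timeseries array to detect the peak is an assumption about how you want to flag breaches:

timeseries cpu = avg(dt.host.cpu.usage), by:{dt.entity.host}
| fieldsAdd peak_cpu = arrayMax(cpu)
| filter peak_cpu > 90
| fields dt.entity.host, peak_cpu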

Step 5: Find Slow Traces

fetch spans
| filter span.kind == "SERVER"
| filter duration > 5000000000  // 5 s, expressed in nanoseconds
| fields trace.id, service.name, span.name, duration
| sort duration desc
| limit 10
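To drill into one of those traces, fetch all of its spans in order. A sketch — replace the trace ID placeholder with one from the result above; span.parent_id is assumed as the parent-link field:

fetch spans
| filter trace.id == "<trace-id-from-above>"
| sort start_time asc
| fields span.name, service.name, duration, span.parent_id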

Step 6: Check Logs for Errors

fetch logs, from:now()-1h
| filter loglevel == "ERROR"
| fields timestamp, content, dt.entity.service
| sort timestamp desc
| limit 20
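If an error log carries a trace_id, you can pivot from the log line straight to its trace. A sketch, assuming the trace_id attribute is populated on your logs:

fetch logs, from:now()-1h
| filter loglevel == "ERROR" and isNotNull(trace_id)
| fields timestamp, trace_id, content, dt.entity.service
| sort timestamp desc
| limit 20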

Decision Tree

Davis found a problem?     → Follow Davis root cause
  ↓ No
Service RT spiked?         → Check host resources (Step 4)
  ↓ Host OK
Slow traces found?         → Drill into trace waterfall
  ↓ No traces
Error logs present?        → Fix the errors first
  ↓ No errors
Check external dependencies → DNS, network, third-party APIs

🛠 Try it: Open the Problems app (Ctrl+K → "Problems") and check for active problems. Click any problem to see Davis's root cause analysis — it automatically correlates across hosts, services, and traces.

Official Regression Thresholds

From Dynatrace's own troubleshooting playbooks — a regression is confirmed when:

Signal              Regression Threshold
──────────────────  ──────────────────────────────────────────────
P95 response time   Increased >20% OR absolute >2 s (>2,000,000,000 ns)
Error rate          Increased >1 percentage point
Throughput          Dropped >20% (without a corresponding traffic drop)
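The P95 signal can be computed directly in DQL. A sketch — compare the result against the same window a week earlier; a >20% increase or an absolute value above 2 s confirms the regression:

timeseries p95 = percentile(dt.service.request.response_time, 95),
  by:{dt.entity.service}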

Key Rules

โš ๏ธ ALWAYS start with Davis problems โ€” never do broad log searches. Use the Problems app first, then scope all queries to the problem's timeframe and affected entities.

  • Never query logs without context — broad log searches hit scan limits and return 0 results
  • Scope queries: from: problemStart - 5min, to: problemEnd + 5min
  • Look for trace_id in logs — correlate logs → traces for root cause
  • HTTP error rate: use countIf(http.response.status_code >= 500) — span status == "ERROR" is unreliable for HTTP services
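The last rule translates to a summarize over server spans. A sketch — the field names follow OpenTelemetry semantic conventions and may differ in your environment:

fetch spans
| filter span.kind == "SERVER"
| summarize {
    total = count(),
    errors = countIf(http.response.status_code >= 500)
  }, by:{service.name}
| fieldsAdd error_rate_pct = errors / total * 100
| sort error_rate_pct desc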