Playbook: Infrastructure Alert
An alert fires: "[P1] CPU Usage Critical" or "[P1] Disk Usage Critical". Work through the steps below to investigate.
Step 1: Which Host?
timeseries avg(dt.host.cpu.usage), by:{dt.entity.host}
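To narrow the result down to the worst offenders, the same metric can be ranked by its peak value. This is a sketch: the 2-hour window, the limit of 5, and the field names cpu, host_name, and peak_cpu are assumptions chosen for illustration.

```dql
// Sketch: rank hosts by peak CPU (window and limit are assumptions)
timeseries cpu = avg(dt.host.cpu.usage), by:{dt.entity.host}, from:now()-2h
// Resolve the host ID to a readable name and take the highest value in the series
| fieldsAdd host_name = entityName(dt.entity.host), peak_cpu = arrayMax(cpu)
| sort peak_cpu desc
| limit 5
```

The hosts at the top of the list are the ones to carry into the next steps.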
Step 2: What's Consuming Resources?
// List process groups and their technologies (for per-process CPU, check the Dynatrace UI: Host → Processes)
fetch dt.entity.process_group
| expand runs_on
| fields entity.name, softwareTechnologies
| limit 20
Step 3: Is It a Spike or Trend?
// Look at 24h to see if it's a spike or gradual increase
timeseries avg(dt.host.cpu.usage), by:{dt.entity.host}, from:now()-24h
A sudden spike usually means a runaway process. A gradual upward trend usually means a capacity issue.
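The spike-versus-trend distinction can also be estimated numerically instead of by eye. The sketch below assumes hourly buckets and introduces illustrative field names (peak_over_avg, drift); neither threshold nor naming comes from the playbook. The last bucket may be incomplete, so treat drift as a rough indicator only.

```dql
// Sketch: summarize the 24h CPU series per host (hourly interval is an assumption)
timeseries cpu = avg(dt.host.cpu.usage), by:{dt.entity.host}, from:now()-24h, interval:1h
| fieldsAdd avg_cpu = arrayAvg(cpu), peak_cpu = arrayMax(cpu)
// peak_over_avg: how far the highest hour sits above the daily average
// drift: difference between the last and the first hourly value
| fieldsAdd peak_over_avg = peak_cpu - avg_cpu, drift = arrayLast(cpu) - arrayFirst(cpu)
| sort peak_over_avg desc
```

A large peak_over_avg with little drift suggests a spike; a clearly positive drift with a modest peak_over_avg suggests a trend.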
Step 4: Memory Leak Check
timeseries avg(dt.host.memory.usage), by:{dt.entity.host}, from:now()-7d
Memory that climbs steadily over several days, without dropping back, usually indicates a memory leak in an application.
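Once a suspect host is known, scoping the query to it with daily buckets can make the climb easier to read. A minimal sketch, assuming "HOST-ID" is replaced with the real host entity ID and that a 1-day interval is coarse enough to smooth out normal daily variation:

```dql
// Sketch: one value per day for a single host ("HOST-ID" is a placeholder)
timeseries mem = avg(dt.host.memory.usage), from:now()-7d, interval:1d, filter:{dt.entity.host == "HOST-ID"}
```

If each daily value is higher than the one before it, that is consistent with a leak rather than with normal load variation.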
Step 5: Disk Full
timeseries avg(dt.host.disk.used.percent), by:{dt.entity.host}
Common causes: log files not rotated, temp files, database growth.
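A host-level average can hide one full volume behind several empty ones, so it may help to split the metric per disk. This sketch assumes the metric carries a dt.entity.disk dimension in your environment; the 24-hour window, the limit, and the field names used and peak_used are also assumptions.

```dql
// Sketch: find the fullest disks (assumes a dt.entity.disk dimension is available)
timeseries used = avg(dt.host.disk.used.percent), by:{dt.entity.host, dt.entity.disk}, from:now()-24h
| fieldsAdd peak_used = arrayMax(used)
| sort peak_used desc
| limit 10
```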
Step 6: Impact on Services
// Did service response times degrade in the same window? (not filtered by host; compare against the spike timestamp)
timeseries avg(dt.service.request.response_time), by:{dt.entity.service}
Quick Fixes
Problem              Quick Fix                 Long-term Fix
───────────────────  ────────────────────────  ──────────────────────────────
CPU spike            Restart runaway process   Profile and optimize code
Memory leak          Restart application       Fix the leak, add limits
Disk full            Clean logs/temp files     Log rotation, retention policy
Network saturation   Identify top talker       Load balancing, CDN
Try it: Press Ctrl+K → "Infrastructure & Operations" → click your host → check the CPU, memory, and disk charts. Hover over any spike to see the exact timestamp, then correlate it with events: fetch events | filter dt.entity.host == "HOST-ID" | sort timestamp desc.
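The correlation query from the tip above, laid out one command per line. The 2-hour window and the limit of 50 are assumptions added to keep the result small; "HOST-ID" remains a placeholder for the real host entity ID.

```dql
// Sketch: recent events for one host, newest first (window and limit are assumptions)
fetch events, from:now()-2h
| filter dt.entity.host == "HOST-ID"
| sort timestamp desc
| limit 50
```

Widen the window if the spike you hovered over is older than two hours.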