Dynatrace Training Platform

Playbook: Infrastructure Alert

Alert fires: "[P1] CPU Usage Critical" or "[P1] Disk Usage Critical." Here's the investigation.

Step 1: Which Host?

timeseries avg(dt.host.cpu.usage), by:{dt.entity.host}

Step 2: What's Consuming Resources?

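// List process groups and the hosts they run on.
// This query does not rank by CPU; for per-process CPU, check the Dynatrace UI: Host → Processes.
fetch dt.entity.process_group
| fieldsAdd host_id = runs_on[dt.entity.host]
| expand host_id
| fields entity.name, softwareTechnologies, host_id
| limit 20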

Step 3: Is It a Spike or Trend?

// Look at 24h to see if it's a spike or gradual increase
timeseries avg(dt.host.cpu.usage), by:{dt.entity.host}, from:now()-24h

A sudden spike usually points to a runaway process. A gradual upward trend usually points to a capacity issue.
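
The spike-versus-trend call can also be made on exported values rather than by eye. The following is a minimal sketch, assuming the 24h series has been exported as a plain list of CPU percentages, oldest first. The function name and both thresholds (spike_ratio, trend_delta) are illustrative assumptions, not Dynatrace settings.

```python
from statistics import median


def classify_cpu_pattern(samples, spike_ratio=1.5, trend_delta=10.0):
    """Label a time-ordered CPU series as 'trend', 'spike', or 'steady'.

    samples: CPU usage percentages, oldest first (for example 24 hourly averages).
    spike_ratio and trend_delta are illustrative thresholds, not Dynatrace defaults.
    """
    if len(samples) < 4:
        raise ValueError("need at least 4 samples")

    half = len(samples) // 2
    early = median(samples[:half])
    late = median(samples[half:])

    # Trend: the typical level in the later half sits well above the earlier half.
    # Medians are used so that a single short spike does not look like a trend.
    if late - early >= trend_delta:
        return "trend"

    # Spike: the peak towers over the typical level of the whole window.
    baseline = max(median(samples), 1.0)  # avoid a zero baseline on idle hosts
    if max(samples) >= spike_ratio * baseline:
        return "spike"

    return "steady"
```

Comparing medians of the two halves, rather than means, keeps a short burst from being mistaken for a capacity trend. Tune both thresholds to your own hosts before relying on the labels.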

Step 4: Memory Leak Check

timeseries avg(dt.host.memory.usage), by:{dt.entity.host}, from:now()-7d

Memory usage that climbs steadily over several days, without dropping back, usually indicates a memory leak in an application.
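
"Steadily increasing" can be made concrete with a slope check. This is a rough sketch, assuming the 7-day series has been reduced to one average memory percentage per day. The helper names and the thresholds (min_slope, min_rising_fraction) are hypothetical and should be adjusted to your environment.

```python
def slope_per_step(values):
    """Least-squares slope of values against their index (units per step)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 values")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(daily_memory_pct, min_slope=1.0, min_rising_fraction=0.8):
    """Heuristic: memory grows on average AND rises on most days.

    daily_memory_pct: one average memory usage value (percent) per day, oldest first.
    Both thresholds are illustrative assumptions, not Dynatrace defaults.
    """
    slope = slope_per_step(daily_memory_pct)
    steps = list(zip(daily_memory_pct, daily_memory_pct[1:]))
    rising = sum(1 for prev, cur in steps if cur >= prev)
    return slope >= min_slope and rising / len(steps) >= min_rising_fraction
```

For example, looks_like_leak([50, 52, 54, 56, 58, 60, 62]) returns True, while a series that oscillates between 50 and 70 returns False because its overall slope is zero.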

Step 5: Disk Full

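// Split by disk and use max so one full disk is not hidden by averaging across disks
timeseries max(dt.host.disk.used.percent), by:{dt.entity.host, dt.entity.disk}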

Common causes: log files not rotated, temp files, database growth.
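
Once the full disk is identified, the next question is which files are taking the space. The following is a small sketch meant to be run on the affected host itself, not in Dynatrace. The function name and the example path are assumptions for illustration.

```python
import heapq
import os


def largest_files(root, top_n=10):
    """Return the top_n (size_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return heapq.nlargest(top_n, sizes)


if __name__ == "__main__":
    # "/var/log" is only an example starting point; pick the mount that is full.
    for size, path in largest_files("/var/log", top_n=10):
        print(f"{size / 1_000_000:10.1f} MB  {path}")
```

Unrotated logs and leftover temp files tend to show up at the top of this list. Database growth usually appears as a few very large data files instead.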

Step 6: Impact on Services

// Response time per service. This is not filtered by host, so look for the services that run on the affected host.
timeseries avg(dt.service.request.response_time), by:{dt.entity.service}

Quick Fixes

Problem              Quick Fix                           Long-term Fix
───────────────────  ──────────────────────────────────  ──────────────────────────────
CPU spike            Restart runaway process             Profile and optimize code
Memory leak          Restart application                 Fix the leak, add limits
Disk full            Clean logs/temp files               Log rotation, retention policy
Network saturation   Identify top talker                 Load balancing, CDN

🛠 Try it: Press Ctrl+K, open "Infrastructure & Operations", and click your host. Check the CPU, memory, and disk charts, and hover over any spike to see its exact timestamp. Then correlate it with events on that host (replace HOST-ID with the host's entity ID): fetch events | filter dt.entity.host == "HOST-ID" | sort timestamp desc.
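
The correlation step amounts to keeping only the events that fall close to the spike. Here is a minimal sketch, assuming the event records have been exported as dictionaries with a datetime under the key "timestamp". That record shape, the function name, and the 15-minute default window are assumptions, not a Dynatrace export format.

```python
from datetime import datetime, timedelta


def events_near(events, spike_time, window_minutes=15):
    """Return events within +/- window_minutes of spike_time, newest first.

    events: list of dicts with a 'timestamp' (datetime) key; this shape is an
    assumption about how exported records might look.
    """
    window = timedelta(minutes=window_minutes)
    nearby = [e for e in events if abs(e["timestamp"] - spike_time) <= window]
    return sorted(nearby, key=lambda e: e["timestamp"], reverse=True)
```

A deployment or restart event a few minutes before the spike is a strong lead. If nothing lands in the window, widen it before ruling out a change-related cause.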