Alerting
Alerting is the bridge between monitoring and action. Dynatrace uses anomaly detectors: DQL-based rules that trigger events when their conditions are met.
Three Analyzer Models
Model              How It Works                            Best For
─────────────────  ──────────────────────────────────────  ─────────────────────
Static threshold   Fixed value (e.g., CPU > 90%)           Known limits
Auto-adaptive      Learns baseline, adapts over time       Growing workloads
Seasonal baseline  Detects patterns (daily/weekly cycles)  Traffic with patterns
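To make the static-vs-adaptive distinction concrete, here is a toy Python sketch. The adaptive rule below (mean plus a few standard deviations over recent samples) is a deliberately naive stand-in for "learns baseline, adapts over time"; Dynatrace's actual Davis models are far more sophisticated.

```python
import statistics

def static_breach(value, limit=90.0):
    """Static threshold: a fixed limit, e.g. CPU > 90%."""
    return value > limit

def adaptive_breach(history, value, k=3.0):
    """Naive adaptive baseline: mean + k * stdev of recent samples.
    Toy illustration only -- not Dynatrace's actual algorithm."""
    baseline = statistics.mean(history) + k * statistics.pstdev(history)
    return value > baseline

# A host that normally idles around 40-43% CPU:
history = [40, 42, 41, 43, 40, 42]

# 55% never trips the static 90% limit, but it is far above this
# host's learned baseline, so the adaptive rule flags it.
static_breach(55)              # False
adaptive_breach(history, 55)   # True
```

The practical takeaway: static thresholds need known limits up front, while adaptive detectors can catch "abnormal for this host" values well below any fixed limit.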
Creating an Anomaly Detector
- Ctrl+K → "Anomaly detectors" → Create new
- Choose an analyzer: Static threshold (simplest to start)
- Write a DQL timeseries query:
  timeseries avg(dt.host.cpu.usage), interval:1m
- Set the threshold (e.g., 90%) and condition (ABOVE)
- Configure the sliding window (default: 3 of 5 samples)
- Save and enable
⚠️ Anomaly detector queries MUST use interval:1m. Other intervals will fail.
Priority Pattern: P1 + P3
Best practice: create TWO alerts for every key metric โ a warning (P3) and a critical (P1):
Alert                        Priority  Threshold
───────────────────────────  ────────  ─────────
[P3] CPU Usage Warning       P3        > 70%
[P1] CPU Usage Critical      P1        > 90%
[P3] Response Time Warning   P3        > 500ms
[P1] Response Time Critical  P1        > 2000ms
[P3] Error Rate Warning      P3        > 1%
[P1] Error Rate Spike        P1        > 5%
💡 P3 warnings give early notice before an outage; P1 critical means immediate action. Without P3, every alert is critical, and teams learn to ignore them (alert fatigue).
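The pairing pattern above is mechanical enough to generate. This small Python helper is purely illustrative (not a Dynatrace API); it emits the two detector definitions the P1 + P3 pattern calls for, given one metric query and its two thresholds:

```python
def priority_pair(metric_title, query, warn, crit):
    """Build the [P3] warning + [P1] critical pair for one metric.

    Illustrative helper only: it returns plain dicts describing the
    two detectors you would then create in Dynatrace.
    """
    def detector(priority, label, threshold):
        return {
            "title": f"[{priority}] {metric_title} {label}",
            "priority": priority,
            "query": query,
            "condition": "ABOVE",
            "threshold": threshold,
        }
    return [detector("P3", "Warning", warn),
            detector("P1", "Critical", crit)]

alerts = priority_pair(
    "CPU Usage",
    "timeseries avg(dt.host.cpu.usage), interval:1m",
    warn=70, crit=90,
)
# alerts[0] -> "[P3] CPU Usage Warning" at > 70
# alerts[1] -> "[P1] CPU Usage Critical" at > 90
```

Keeping the P3 and P1 definitions generated from one source of truth also prevents the common drift where only the critical alert gets updated.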
Sliding Window
Parameter            Default  What It Means
───────────────────  ───────  ──────────────────────────────────
Violating samples    3        How many 1-min samples must breach
Sliding window       5        Over how many minutes
De-alerting samples  5        How many normal samples to close
Default: 3 out of 5 one-minute samples must violate the threshold to trigger the event; 5 consecutive normal samples close it again.
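The sliding-window behavior is easy to misread, so here is a short Python simulation of the documented defaults. This is a replay of the rule as described above (3-of-5 to open, 5 consecutive normal samples to close), not Dynatrace's internal implementation:

```python
from collections import deque

def replay(samples, threshold=90.0, violating=3, window=5, dealert=5):
    """Replay 1-minute samples through the sliding-window rule.

    Opens an alert when `violating` of the last `window` samples breach
    the threshold; closes it after `dealert` consecutive normal samples.
    Returns the (minute, state) transitions.
    """
    recent = deque(maxlen=window)   # breach flags for the last `window` minutes
    normal_streak = 0               # consecutive non-violating samples
    state = "OK"
    transitions = []
    for minute, value in enumerate(samples):
        breach = value > threshold
        recent.append(breach)
        normal_streak = 0 if breach else normal_streak + 1
        if state == "OK" and sum(recent) >= violating:
            state = "ALERTING"
            transitions.append((minute, state))
        elif state == "ALERTING" and normal_streak >= dealert:
            state = "OK"
            transitions.append((minute, state))
    return transitions

# Breaches at minutes 1, 2, and 4: the 3rd breach within the 5-minute
# window opens the alert at minute 4; five quiet minutes close it at 9.
replay([50, 95, 96, 50, 97, 50, 50, 50, 50, 50])
# -> [(4, "ALERTING"), (9, "OK")]
```

Note that isolated spikes never alert: a single 95% sample is only 1 violating sample out of 5, which is exactly the flapping the window is designed to suppress.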
Settings API
Anomaly detectors use the settings schema builtin:davis.anomaly-detectors. They can be managed via the Settings API, Terraform (dynatrace_davis_anomaly_detectors), or Monaco.
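As a sketch of the Settings API route, the Python snippet below builds a settings-object payload for that schema. The schemaId is from the docs and POST /api/v2/settings/objects is the standard Settings API 2.0 endpoint, but the field names inside "value" are illustrative assumptions; fetch the schema itself to see the real structure before using this.

```python
import json

# Illustrative payload for the Settings API 2.0 objects endpoint.
# The shape of "value" is an assumption -- inspect the real schema via:
#   GET /api/v2/settings/schemas/builtin:davis.anomaly-detectors
detector = {
    "schemaId": "builtin:davis.anomaly-detectors",
    "scope": "environment",
    "value": {
        # Hypothetical field names, for illustration only.
        "title": "[P1] CPU Usage Critical",
        "query": "timeseries avg(dt.host.cpu.usage), interval:1m",
        "threshold": 90,
        "condition": "ABOVE",
    },
}

# The objects endpoint accepts a JSON array of settings objects.
payload = json.dumps([detector])

# To actually create it (requires a token with settings.write scope):
# requests.post(f"{env_url}/api/v2/settings/objects",
#               headers={"Authorization": f"Api-Token {token}",
#                        "Content-Type": "application/json"},
#               data=payload)
```

The same object expressed in Terraform or Monaco lets you version-control the P1 + P3 pairs alongside the rest of your monitoring config.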
🎯 Try it: Open Ctrl+K → "Anomaly Detection" → click "+ Anomaly detector" → paste timeseries avg(dt.host.cpu.usage), interval:1m → choose Static threshold → set 90% → preview. You just built a production-grade CPU alert.
Log Alerting (3 strategies)
Method                    Speed    Best For
────────────────────────  ───────  ──────────────────────────────────────────
Log-based events          Fastest  Sparse patterns (1/week), instant response
Log-based metrics (rec.)  1 min    Threshold alerting, anomaly detection
DQL queries in alerts     1 min    Fallback when metrics/events won't work
Recommended: extract metrics from logs via OpenPipeline, then alert on the metric with an anomaly detector.