Home › 📊 Track 3: Monitor & Alert › Module 10 · 2 min read · 11/21

Alerting

Tutorial

Alerting is the bridge between monitoring and action. Dynatrace uses anomaly detectors: DQL-based rules that trigger events when their conditions are met.

Three Analyzer Models

Model                   How It Works                              Best For
──────────────────────  ────────────────────────────────────────  ─────────────────────
Static threshold        Fixed value (e.g., CPU > 90%)             Known limits
Auto-adaptive           Learns baseline, adapts over time         Growing workloads
Seasonal baseline       Detects patterns (daily/weekly cycles)    Traffic with patterns

Creating an Anomaly Detector

  1. Ctrl+K → "Anomaly detectors" → Create new
  2. Choose analyzer: Static threshold (simplest to start)
  3. Write a DQL timeseries query: timeseries avg(dt.host.cpu.usage), interval:1m
  4. Set threshold (e.g., 90%) and condition (ABOVE)
  5. Configure sliding window (default: 3 of 5 samples)
  6. Save and enable

⚠️ Anomaly detector queries MUST use interval:1m. Other intervals will fail.
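The query in step 3 can also be scoped to a subset of hosts with a filter. A minimal sketch, where the `"prod-*"` host-name pattern is purely illustrative:

```
// Average CPU usage per minute, restricted to (hypothetical) production hosts
timeseries avg(dt.host.cpu.usage), interval:1m,
  filter: { matchesValue(host.name, "prod-*") }
```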

Priority Pattern: P1 + P3

Best practice: create TWO alerts for every key metric, a warning (P3) and a critical (P1):

Alert                                   Priority  Threshold
──────────────────────────────────────  ────────  ─────────
[P3] CPU Usage Warning                  P3        > 70%
[P1] CPU Usage Critical                 P1        > 90%

[P3] Response Time Warning              P3        > 500ms
[P1] Response Time Critical             P1        > 2000ms

[P3] Error Rate Warning                 P3        > 1%
[P1] Error Rate Spike                   P1        > 5%

💡 P3 warnings give early notice before outages. P1 critical means immediate action. Without P3, every alert is critical and teams learn to ignore them (alert fatigue).
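A P1/P3 pair usually shares one query and differs only in threshold and event priority. A sketch for the error-rate pair, assuming an illustrative failure-rate metric key:

```
// Shared detector query (metric key is illustrative)
timeseries avg(dt.service.request.failure_rate), interval:1m

// [P3] Error Rate Warning:  static threshold 1,  condition ABOVE
// [P1] Error Rate Spike:    static threshold 5,  condition ABOVE
```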

Sliding Window

Parameter              Default   What It Means
─────────────────────  ────────  ──────────────────────────────────
Violating samples      3         How many 1-min samples must breach
Sliding window         5         Over how many minutes
De-alerting samples    5         How many normal samples to close

By default, 3 out of 5 one-minute samples must violate the threshold to trigger, and 5 consecutive normal samples close the event.

Settings API

Anomaly detectors use schema builtin:davis.anomaly-detectors. They can be managed via Settings API, Terraform (dynatrace_davis_anomaly_detectors), or Monaco.
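Since detectors are ordinary settings objects, they can be created with a POST to /api/v2/settings/objects. A minimal payload sketch; every key inside value is illustrative and should be checked against the builtin:davis.anomaly-detectors schema definition:

```
[
  {
    "schemaId": "builtin:davis.anomaly-detectors",
    "scope": "environment",
    "value": {
      "enabled": true,
      "title": "[P1] CPU Usage Critical"
    }
  }
]
```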

🛠 Try it: Open Ctrl+K → "Anomaly Detection" → click "+ Anomaly detector" → paste timeseries avg(dt.host.cpu.usage), interval:1m → choose Static threshold → set 90% → preview. You just built a production-grade CPU alert.

Log Alerting (3 strategies)

Method                    Speed     Best For
────────────────────────  ────────  ──────────────────────────────────────────
Log-based events          Fastest   Sparse patterns (1/week), instant response
Log-based metrics (rec.)  1 min     Threshold alerting, anomaly detection
DQL queries in alerts     1 min     Fallback when metrics/events won't work

Recommended: extract metrics from logs via OpenPipeline, then alert on the metric with an anomaly detector.
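The recommended flow end to end, with a hypothetical metric key (the actual key depends on your OpenPipeline metric-extraction rule):

```
// 1. OpenPipeline extracts a metric from matching error logs
//    (metric key below is hypothetical)
// 2. An anomaly detector then alerts on the extracted metric:
timeseries sum(log.app.error_count), interval:1m
```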