Home › 📊 Track 3: Monitor & Alert › Module 10 · 2 min read · 11/21

Alerting

Tutorial

Alerting is the bridge between monitoring and action. Dynatrace uses anomaly detectors: DQL-based rules that trigger events when their conditions are met.

Three Analyzer Models

Model                   How It Works                              Best For
──────────────────────  ────────────────────────────────────────  ─────────────────────
Static threshold        Fixed value (e.g., CPU > 90%)             Known limits
Auto-adaptive           Learns baseline, adapts over time         Growing workloads
Seasonal baseline       Detects patterns (daily/weekly cycles)    Traffic with patterns

Creating an Anomaly Detector

  1. Ctrl+K → "Anomaly detectors" → Create new
  2. Choose analyzer: Static threshold (simplest to start)
  3. Write a DQL timeseries query: timeseries avg(dt.host.cpu.usage), interval:1m
  4. Set threshold (e.g., 90%) and condition (ABOVE)
  5. Configure sliding window (default: 3 of 5 samples)
  6. Save and enable

⚠️ Anomaly detector queries MUST use interval:1m. Other intervals will fail.
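The query in step 3 can also be scoped to a subset of hosts with a filter. A minimal sketch, where the `"prod-*"` host-name pattern is purely illustrative:

```
// Average CPU usage per minute, restricted to (hypothetical) production hosts
timeseries avg(dt.host.cpu.usage), interval:1m,
  filter: { matchesValue(host.name, "prod-*") }
```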

Priority Pattern: P1 + P3

Best practice: create TWO alerts for every key metric, a warning (P3) and a critical (P1):

Alert                                   Priority  Threshold
──────────────────────────────────────  ────────  ─────────
[P3] CPU Usage Warning                  P3        > 70%
[P1] CPU Usage Critical                 P1        > 90%

[P3] Response Time Warning              P3        > 500ms
[P1] Response Time Critical             P1        > 2000ms

[P3] Error Rate Warning                 P3        > 1%
[P1] Error Rate Spike                   P1        > 5%

💡 P3 warnings give early notice before outages. P1 critical means immediate action. Without P3, every alert is critical and teams learn to ignore them (alert fatigue).
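A P1/P3 pair usually shares one query and differs only in threshold and event priority. A sketch for the error-rate pair, assuming an illustrative failure-rate metric key:

```
// Shared detector query (metric key is illustrative)
timeseries avg(dt.service.request.failure_rate), interval:1m

// [P3] Error Rate Warning:  static threshold 1,  condition ABOVE
// [P1] Error Rate Spike:    static threshold 5,  condition ABOVE
```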

Sliding Window

Parameter              Default   What It Means
─────────────────────  ────────  ──────────────────────────────────
Violating samples      3         How many 1-min samples must breach
Sliding window         5         Over how many minutes
De-alerting samples    5         How many normal samples to close

By default, 3 out of 5 one-minute samples must violate the threshold to trigger, and 5 consecutive normal samples close the event.

Settings API

Anomaly detectors use schema builtin:davis.anomaly-detectors. They can be managed via Settings API, Terraform (dynatrace_davis_anomaly_detectors), or Monaco.
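Since detectors are ordinary settings objects, they can be created with a POST to /api/v2/settings/objects. A minimal payload sketch; every key inside value is illustrative and should be checked against the builtin:davis.anomaly-detectors schema definition:

```
[
  {
    "schemaId": "builtin:davis.anomaly-detectors",
    "scope": "environment",
    "value": {
      "enabled": true,
      "title": "[P1] CPU Usage Critical"
    }
  }
]
```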

🛠 Try it: Open Ctrl+K → "Anomaly Detection" → click "+ Anomaly detector" → paste timeseries avg(dt.host.cpu.usage), interval:1m → choose Static threshold → set 90% → preview. You just built a production-grade CPU alert.

Log Alerting (3 strategies)

Method                    Speed     Best For
────────────────────────  ────────  ──────────────────────────────────────────
Log-based events          Fastest   Sparse patterns (1/week), instant response
Log-based metrics (rec.)  1 min     Threshold alerting, anomaly detection
DQL queries in alerts     1 min     Fallback when metrics/events won't work

Recommended: extract metrics from logs via OpenPipeline, then alert on the metric with an anomaly detector.
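The recommended flow end to end, with a hypothetical metric key (the actual key depends on your OpenPipeline metric-extraction rule):

```
// 1. OpenPipeline extracts a metric from matching error logs
//    (metric key below is hypothetical)
// 2. An anomaly detector then alerts on the extracted metric:
timeseries sum(log.app.error_count), interval:1m
```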