🚀 Production Skills — Module 13

Alert Creation

Hands-on

Creating Alerts for Extensions

An extension without alerts is just data collection. Alerts turn metrics into actionable problems that generate tickets and wake people up at 3 AM. Getting them right matters.

Alert Architecture

Dynatrace metric events (alerts) use the Settings API v2 with schema builtin:anomaly-detection.metric-events. Each alert defines:

  • What to watch: A metric key or selector
  • When to fire: Threshold + sliding window
  • What to create: Event type, title, description
  • Where to attach: Entity dimension key (for problem grouping)

The Sliding Window

Never alert on a single datapoint. Use a sliding window to avoid false positives:

model_properties:
  type: STATIC_THRESHOLD
  threshold: 90              # Fire when value exceeds 90
  alert_condition: ABOVE
  samples: 35                # Look at last 35 samples
  violating_samples: 3       # Fire if 3 of 35 exceed threshold
  dealerting_samples: 5      # Clear after 5 consecutive samples below threshold
  alert_on_no_data: false    # Don't alert when device is unreachable

With 1-minute polling, samples: 35 covers ~35 minutes. Requiring 3 violations means the condition must persist, not just spike once.
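The window logic can be sketched in a few lines. This is a hypothetical simulation of the evaluation rule, not Dynatrace's actual code:

```python
from collections import deque

def should_fire(values, threshold=90, violating_samples=3, samples=35):
    """Return True if at least `violating_samples` of the last `samples`
    datapoints exceed `threshold` (the ABOVE condition)."""
    window = deque(values, maxlen=samples)   # keep only the sliding window
    violations = sum(1 for v in window if v > threshold)
    return violations >= violating_samples

# A single spike does not fire the alert...
assert not should_fire([50] * 34 + [99])
# ...but three violations inside the window do.
assert should_fire([50] * 32 + [95, 96, 97])
```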

Priority Levels

Priority      | Event Type   | Typical Use
P1 (Severe)   | CUSTOM_ALERT | Service down, critical threshold (≥90%)
P2 (Critical) | CUSTOM_ALERT | High threshold (≥80%), degraded state
P3 (Warning)  | CUSTOM_ALERT | Warning threshold (≥70%), informational

Entity Dimension Key (CRITICAL)

The eventEntityDimensionKey determines which entity the problem is raised on. Always use the parent entity — this is where tickets get generated:

# CORRECT: Problem raised on the device (parent)
event_entity_dimension_key = "dt.entity.myext:device"

# WRONG: Problem raised on the interface (child) — tickets scattered
event_entity_dimension_key = "dt.entity.myext:interface"

Title Placeholders

Include the affected entity name in the alert title so operators know what's broken without opening the problem:

# For child entity alerts (interfaces, ports, etc.)
title = "[P2] {dims:dt.entity.myext:interface.name} - High Bandwidth Utilization"

# For parent entity alerts
title = "[P1] {dims:dt.entity.myext:device.name} - CPU Critical"

Alert Configuration Template

{
  "enabled": true,
  "summary": "[P2] Extension - High CPU Alert",
  "eventEntityDimensionKey": "dt.entity.myext:device",
  "eventTemplate": {
    "title": "[P2] {dims:dt.entity.myext:device.name} - High CPU",
    "description": "CPU usage exceeded 80% threshold",
    "eventType": "CUSTOM_ALERT",
    "davisMerge": false
  },
  "modelProperties": {
    "type": "STATIC_THRESHOLD",
    "threshold": 80,
    "alertCondition": "ABOVE",
    "alertOnNoData": false,
    "dealertingSamples": 5,
    "samples": 35,
    "violatingSamples": 3
  },
  "queryDefinition": {
    "type": "METRIC_KEY",
    "metricKey": "com.dynatrace.extension.myext.cpu",
    "aggregation": "AVG"
  }
}

func: Metrics in Alerts

For func: calculated metrics, the METRIC_KEY query type cannot be used. Use METRIC_SELECTOR instead:

"queryDefinition": {
  "type": "METRIC_SELECTOR",
  "metricSelector": "func:com.dynatrace.extension.myext.bandwidth_pct:splitBy(\"dt.entity.myext:interface\")"
}

Deployment via Settings API v2

curl -X POST "$BASE/api/v2/settings/objects" \
  -H "Authorization: Api-Token $TOKEN" \
  -H "Content-Type: application/json" \
  -d '[{
    "schemaId": "builtin:anomaly-detection.metric-events",
    "scope": "environment",
    "value": { ... alert config ... }
  }]'
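The same deployment can be scripted. A minimal sketch using only the Python standard library; the endpoint and payload shape match the curl call above, while `build_settings_object` and `post_settings` are hypothetical helper names:

```python
import json
import urllib.request

SCHEMA_ID = "builtin:anomaly-detection.metric-events"

def build_settings_object(alert_config: dict) -> list:
    """Wrap one alert config in the list-of-objects body that
    POST /api/v2/settings/objects expects."""
    return [{"schemaId": SCHEMA_ID, "scope": "environment", "value": alert_config}]

def post_settings(base: str, token: str, alert_config: dict) -> bytes:
    """POST the wrapped config; returns the raw API response body."""
    req = urllib.request.Request(
        f"{base}/api/v2/settings/objects",
        data=json.dumps(build_settings_object(alert_config)).encode(),
        headers={
            "Authorization": f"Api-Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Build the payload locally; call post_settings(base, token, config)
# against a live environment to actually deploy.
payload = build_settings_object({"enabled": True, "summary": "[P2] Extension - High CPU Alert"})
```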

Common Mistakes

  • alertOnNoData: true — fires when device is unreachable for maintenance. Always set to false.
  • davisMerge: true — merges your alert with Davis AI problems. Set to false to keep extension alerts separate.
  • Wrong entity dimension — attaching to child entity scatters problems across hundreds of interfaces instead of grouping on the device.
  • Missing {dims:} in title — operators see "High CPU" but don't know which device without opening the problem.
  • Threshold unit mismatch — querying a metric that returns bytes but setting threshold as if it's megabytes.
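Several of these mistakes are mechanical enough to catch before deployment. A hypothetical pre-deployment linter covering the first, second, and fourth items might look like:

```python
def lint_alert(config: dict) -> list:
    """Flag common alert misconfigurations (hypothetical checks,
    mirroring the pitfalls listed above)."""
    warnings = []
    model = config.get("modelProperties", {})
    template = config.get("eventTemplate", {})
    if model.get("alertOnNoData"):
        warnings.append("alertOnNoData is true: fires during maintenance windows")
    if template.get("davisMerge"):
        warnings.append("davisMerge is true: alert merges with Davis problems")
    if "{dims:" not in template.get("title", ""):
        warnings.append("title has no {dims:...} placeholder: entity is invisible")
    return warnings

bad = {
    "eventTemplate": {"title": "High CPU", "davisMerge": True},
    "modelProperties": {"alertOnNoData": True},
}
assert len(lint_alert(bad)) == 3
```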

Real Example: Customer Feedback

After deploying ASR alerts, the customer reported "too many CPU alerts." Root cause: 3 duplicate pre-existing alerts left behind by a previous team. We deleted the duplicates and kept our properly configured ones with the samples: 35 sliding window.

Another customer found the Catalyst uptime threshold was 360000 instead of 3600000 (6 minutes vs 60 minutes). A single missing zero caused false alerts for weeks.
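Unit bugs like this are cheap to prevent by deriving thresholds from named constants instead of hand-typing long numbers. A sketch, assuming the uptime metric reports milliseconds (the interpretation the numbers above use):

```python
MS_PER_MINUTE = 60 * 1000  # derive thresholds; never hand-type long literals

def uptime_threshold_ms(minutes: int) -> int:
    """Uptime alert threshold in milliseconds (assumed metric unit)."""
    return minutes * MS_PER_MINUTE

# The single-zero bug: 6 minutes instead of the intended 60 minutes.
assert uptime_threshold_ms(6) == 360_000     # the deployed (wrong) value
assert uptime_threshold_ms(60) == 3_600_000  # the intended value
```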

🛠 Hands-On Exercise

Edit the YAML in the editor, then click "Check My Work" to validate.

Alert Configuration

This extension monitors a network switch. Create the YAML metrics and topology needed to support these alerts:

  • CPU alert at P1 (≥90%), P2 (≥80%), P3 (≥70%)
  • Memory alert at the same thresholds
  • Uptime alert at P3 (< 1 hour = 3600000 timeticks)

Make sure:

  • The parent entity type has role: default (alerts attach here)
  • Metric keys are correct for the alert query type
  • All metrics have proper metadata