Service Level Objectives

SLOs define measurable reliability targets. Instead of a stated goal such as "99.9% availability" that nobody measures, an SLO tracks actual performance against the target using a DQL-based indicator.

Key Concepts

Term            What It Means                           Example
──────────────  ──────────────────────────────────────  ────────────────────────────────────
SLO             The target you commit to                99.5% availability over 7 days
SLI             The metric that measures it             Successful requests / total requests
Error budget    How much failure is allowed             0.5% = ~50 min downtime per week
Burn rate       How fast you're consuming the budget    2x = will exhaust in half the time
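The error budget example in the table can be checked with a little arithmetic. The sketch below is illustrative Python, not part of Dynatrace; the function name error_budget_minutes is hypothetical, and it assumes the whole budget is spent as full downtime.

```python
def error_budget_minutes(target_percent: float, window_days: float) -> float:
    """Allowed downtime in minutes for an availability target over a window."""
    window_minutes = window_days * 24 * 60
    # The budget is whatever the target leaves over, e.g. 100% - 99.5% = 0.5%.
    budget_fraction = (100.0 - target_percent) / 100.0
    return window_minutes * budget_fraction


# 99.5% over 7 days: 0.5% of 10,080 minutes.
print(round(error_budget_minutes(99.5, 7), 1))  # 50.4
```

This matches the "~50 min downtime per week" figure in the table.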

Creating an SLO

  1. Ctrl+K → "Service-Level Objectives" → Create new
  2. Choose a template (recommended) or write custom DQL
  3. Select entity (host, service) and set target
  4. Save. The SLO starts tracking immediately.

Built-in Templates

Template                              SLI (auto-generated DQL)
────────────────────────────────────  ──────────────────────────────────
Host CPU utilization                  timeseries sli=avg(dt.host.cpu.usage)
Service availability                  timeseries {total=sum(dt.service.request.count), failures=sum(dt.service.request.failure_count)}
Service performance                   timeseries total=avg(dt.service.request.response_time)
K8s cluster CPU/memory efficiency     timeseries sli=avg(dt.kubernetes.cluster.cpu_usage_percent)

💡 After creating from a template, click "Edit SLI" to see the generated DQL. It is a good way to learn SLO query patterns.

SLO Tiers (ACE Best Practice)

Tier      Target   Warning   Use Case
────────  ───────  ────────  ──────────────────────
High      99.0%    99.5%     Revenue-critical services
Medium    98.0%    99.0%     Internal business apps
Low       95.0%    98.0%     Dev/staging environments
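To show how the target and warning columns relate, here is a minimal Python sketch that classifies a current SLO value per tier. It assumes the warning threshold sits above the target, so an SLO passes through "warning" before it fails. The status names and the slo_status helper are hypothetical, not Dynatrace identifiers.

```python
# Tier -> (target %, warning %), copied from the table above.
TIERS = {
    "High": (99.0, 99.5),
    "Medium": (98.0, 99.0),
    "Low": (95.0, 98.0),
}


def slo_status(tier: str, current_percent: float) -> str:
    """Classify a current SLO value against its tier thresholds."""
    target, warning = TIERS[tier]
    if current_percent < target:
        return "failure"   # below the committed target
    if current_percent < warning:
        return "warning"   # still meeting the target, but close to it
    return "ok"


print(slo_status("High", 99.7))  # ok
print(slo_status("High", 99.2))  # warning
print(slo_status("High", 98.5))  # failure
```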

Burn Rate Alerting

Burn rate measures how fast you're consuming your error budget. Create an anomaly detector on the SLO burn rate metric to get alerted before the SLO target is breached.

Burn Rate    Meaning                               Action
───────────  ────────────────────────────────────  ────────────────
< 1          Under budget: healthy                 None
1-4          Slow burn: will miss SLO eventually   Investigate
4-10         Fast burn: urgent                     Page on-call
> 10         Critical: SLO will fail soon          Immediate action
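The table above can be read as a simple lookup. The Python sketch below is illustrative only; burn_rate_action is a hypothetical helper. Because the ranges in the table share their endpoints, it assumes a rate of exactly 4 belongs to the more urgent band and a rate of exactly 10 still counts as "Fast burn".

```python
def burn_rate_action(rate: float) -> str:
    """Map a burn rate to the action column of the table above."""
    if rate < 1:
        return "None"              # under budget
    if rate < 4:
        return "Investigate"       # slow burn
    if rate <= 10:
        return "Page on-call"      # fast burn
    return "Immediate action"      # critical


for rate in (0.5, 2, 4, 12):
    print(rate, "->", burn_rate_action(rate))
```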

🛠 Try it: Create an SLO from the "Service availability" template for your main service. Set target to 99.5%. Watch it track over the next few hours.

Error Budget & Burn Rate

  • Burn rate 1.0 = consuming budget at target pace
  • Burn rate > 1.0 = budget runs out before the period ends if the rate holds
  • Dynatrace auto-calculates burn rate and raises events via Anomaly Detection
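As a worked example of the bullets above, the Python sketch below uses one common definition of burn rate: the observed error rate divided by the error rate the target allows. Whether Dynatrace computes it in exactly this way is an assumption here, and both function names are hypothetical.

```python
def burn_rate(failed: int, total: int, target_percent: float) -> float:
    """Observed error rate divided by the error rate the target allows."""
    if total == 0:
        return 0.0  # no traffic, so no budget is consumed
    observed_error = failed / total
    allowed_error = (100.0 - target_percent) / 100.0
    return observed_error / allowed_error


def days_to_exhaustion(rate: float, window_days: float) -> float:
    """Days until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


rate = burn_rate(failed=10, total=1000, target_percent=99.5)
print(rate)                         # 2.0
print(days_to_exhaustion(rate, 7))  # 3.5
```

A 1% error rate against a 0.5% budget gives a burn rate of 2, so a 7-day budget lasts 3.5 days.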

A custom SLI query must produce a field named sli that holds an array of doubles. Use timeseries for metrics and makeTimeseries for logs and spans.
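To make the "array of doubles" requirement concrete, the Python sketch below mimics what an availability SLI derives from the total and failures series of the "Service availability" template: one percentage per time bucket. It illustrates the shape of the data and is not DQL; availability_sli is a hypothetical name, and returning None for buckets with no traffic is an assumption.

```python
from typing import List, Optional


def availability_sli(total: List[float], failures: List[float]) -> List[Optional[float]]:
    """Per-bucket success percentage; None where a bucket had no requests."""
    sli: List[Optional[float]] = []
    for t, f in zip(total, failures):
        if t > 0:
            sli.append(100.0 * (t - f) / t)
        else:
            sli.append(None)  # no requests, so no value for this bucket
    return sli


print(availability_sli([100, 200, 0], [1, 0, 0]))  # [99.0, 100.0, None]
```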