Kubernetes Issues

Kubernetes problems show up as pod crashes, OOM kills, scheduling failures, and resource exhaustion. Here's how to investigate each with DQL.

Pod Status Overview

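// Workload inventory across clusters (dt.entity.cloud_application is the workload entity; pods are dt.entity.cloud_application_instance)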
fetch dt.entity.cloud_application
| fields entity.name, id, k8s.cluster.name, k8s.namespace.name
| sort k8s.cluster.name asc

K8s Events (Warnings)

// Recent K8s warning events
fetch events, from:now()-2h
| filter event.kind == "K8S_EVENT" AND event.type == "Warning"
| fields timestamp, event.reason, event.message, k8s.namespace.name, k8s.pod.name
| sort timestamp desc
| limit 20
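
When the raw event list is noisy, grouping the warnings can show which problem dominates. The sketch below reuses only the fields from the query above; the two-hour window is carried over, and the alias `warnings` is just an illustrative name.

```dql
// Count warning events per reason and namespace, most frequent first
fetch events, from:now()-2h
| filter event.kind == "K8S_EVENT" AND event.type == "Warning"
| summarize warnings = count(), by:{event.reason, k8s.namespace.name}
| sort warnings desc
```

A reason that accounts for most of the warnings in one namespace is usually the best place to start the investigation.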

Container Resource Usage

// CPU usage vs limits
timeseries cpu=avg(dt.kubernetes.container.cpu_usage),
           limits=avg(dt.kubernetes.container.limits_cpu),
    by:{k8s.container.name, k8s.namespace.name}, from:now()-1h

// Memory usage vs limits (OOM detection); run as a separate query
timeseries mem=avg(dt.kubernetes.container.memory_working_set),
           limits=avg(dt.kubernetes.container.limits_memory),
    by:{k8s.container.name, k8s.namespace.name}, from:now()-1h
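
To turn the memory comparison into a shortlist of containers at risk, one option is to reduce both series to their average over the window and compute a utilization percentage. This is a minimal sketch: the field name `utilization_pct` and the 90% threshold are assumptions chosen for illustration, and containers without a memory limit produce no value and drop out of the result.

```dql
// Containers whose average working set is close to their memory limit
timeseries mem=avg(dt.kubernetes.container.memory_working_set),
           limits=avg(dt.kubernetes.container.limits_memory),
    by:{k8s.container.name, k8s.namespace.name}, from:now()-1h
| fieldsAdd utilization_pct = 100 * arrayAvg(mem) / arrayAvg(limits)
| filter utilization_pct > 90   // illustrative threshold, tune per workload
| sort utilization_pct desc
```

Averaging hides short spikes, so a container can still be OOMKilled while staying below the threshold; for spiky workloads, `arrayMax(mem)` may be the more useful numerator.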

Key K8s Metrics

Metric                                        What It Measures
──────────────────────────────────────────    ──────────────────────────
dt.kubernetes.container.cpu_usage             Container CPU usage
dt.kubernetes.container.memory_working_set    Container memory usage
dt.kubernetes.container.requests_cpu          CPU requests
dt.kubernetes.container.limits_cpu            CPU limits
dt.kubernetes.container.requests_memory       Memory requests
dt.kubernetes.container.limits_memory         Memory limits
dt.kubernetes.workload.conditions             Pod conditions
dt.kubernetes.pods                            Pod status
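
The request metrics in this table can be charted the same way as the limits. As a sketch that assumes the same dimensions as the usage queries above, the following compares CPU usage with CPU requests, which helps spot containers that request far more (or less) than they consume.

```dql
// CPU usage vs requests (right-sizing check)
timeseries cpu=avg(dt.kubernetes.container.cpu_usage),
           requests=avg(dt.kubernetes.container.requests_cpu),
    by:{k8s.container.name, k8s.namespace.name}, from:now()-1h
```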

Decision Tree

Pod CrashLoopBackOff?      → Check container logs for startup errors
  ↓ No
OOMKilled?                 → Memory limit too low, increase limits_memory
  ↓ No
ImagePullBackOff?          → Check image name, registry auth, network
  ↓ No
Pending (unschedulable)?   → Check node resources, taints, affinity rules
  ↓ No
Running but unhealthy?     → Check readiness/liveness probes, service mesh
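
For the first branch of the tree, the container logs can be pulled with DQL as well. This is a rough sketch: the namespace and pod names are hypothetical placeholders, and it assumes log ingestion is enabled for the cluster and that the application emits a log level that Dynatrace maps to `loglevel`.

```dql
// Recent error logs for one crashing pod (placeholder names)
fetch logs, from:now()-1h
| filter k8s.namespace.name == "my-namespace" AND k8s.pod.name == "my-pod-abc123"
| filter loglevel == "ERROR"
| fields timestamp, k8s.container.name, content
| sort timestamp desc
| limit 50
```

If nothing comes back, dropping the `loglevel` filter is a reasonable next step, since startup failures are often written without a recognizable level.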

Metadata Enrichment

Dynatrace Operator enriches ALL telemetry with K8s metadata. Use these fields for filtering:

  • k8s.cluster.name, k8s.namespace.name, k8s.pod.name
  • k8s.container.name, k8s.workload.name, k8s.workload.kind
  • dt.security_context: for ABAC boundaries based on K8s labels
  • dt.cost.costcenter: for cost allocation from K8s annotations
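
Because the same fields appear on every signal, a filter written for one data type carries over to the others. The sketch below applies them to logs; the cluster name `prod-cluster` is a hypothetical placeholder and the alias `log_lines` is illustrative.

```dql
// Log volume per workload in one cluster, using enriched K8s fields
fetch logs, from:now()-30m
| filter k8s.cluster.name == "prod-cluster"
| summarize log_lines = count(), by:{k8s.namespace.name, k8s.workload.name}
| sort log_lines desc
```
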
Knowledge Check

Q: A container is OOMKilled. Which metric should you check?

  • ❌ dt.kubernetes.container.cpu_usage
  • ✅ dt.kubernetes.container.memory_working_set vs limits_memory
  • ❌ dt.kubernetes.pods