Real-World Case Studies
These are real bugs from real production extensions. Every one of them made it past code review and into a customer environment before being caught.
Case Study 1: The ACI Cross-Table Bug (DED018)
The Problem
A Cisco ACI spine switch (10.250.11.51) reported DEVICE_CONNECTION_ERROR every 3 minutes. All table metrics (CPU, memory, interfaces) returned zero. Only the scalar sysUpTime worked.
Root Cause
The device had an SNMP agent bug on ifIndex 402718780. When the extension tried to walk ifDescr via GETBULK, the device hung for 180 seconds (the initial request plus 5 retries, each with a 30-second timeout). Because all table subgroups were in a single SNMP group, the interface hang blocked CPU and memory collection too.
The Fix (v0.0.5)
Split into two independent groups:
snmp:
  - group: Device Default        # CPU, Memory, PSU, Temp
    interval:
      minutes: 1
    subgroups:
      - subgroup: CPU and Memory
        featureSet: CPU and Memory
        ...
  - group: Interfaces            # Separate group: polls independently
    interval:
      minutes: 1
    subgroups:
      - subgroup: Interface Traffic
        featureSet: Interfaces
        ...
Now when interfaces hang, CPU/memory data still flows. This is fault isolation — the #1 architecture pattern for production SNMP extensions.
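The effect of the split can be illustrated with a small simulation. This is only a sketch of the idea, not how the extension framework schedules SNMP groups: the two poll functions, their return values, and the scaled-down timings are hypothetical stand-ins for the real table walks.

```python
import concurrent.futures
import time


def poll_cpu_memory():
    """Hypothetical stand-in for a fast, healthy table walk."""
    return {"cpu": 12.5, "memory": 63.0}


def poll_interfaces():
    """Hypothetical stand-in for a walk that hangs on a bad ifIndex."""
    time.sleep(1.0)  # scaled-down substitute for the 180-second hang
    return {"ifDescr": []}


def poll_groups(groups, deadline_s):
    """Run each group in its own worker and keep only results ready by the deadline."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(groups))
    futures = {name: pool.submit(fn) for name, fn in groups.items()}
    done, _ = concurrent.futures.wait(futures.values(), timeout=deadline_s)
    results = {
        name: (future.result() if future in done else None)
        for name, future in futures.items()
    }
    pool.shutdown(wait=False)  # do not wait for the hung group
    return results


results = poll_groups(
    {"Device Default": poll_cpu_memory, "Interfaces": poll_interfaces},
    deadline_s=0.2,
)
print(results)
```

The hung Interfaces group yields no result for this cycle, while the Device Default group still reports its values.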
Bonus Bug
The same extension had ipAdEntTable OIDs (1.3.6.1.2.1.4.20.1.*) mixed into the Interfaces subgroup alongside ifTable OIDs (1.3.6.1.2.1.2.2.1.*). These tables have different index spaces — ifTable uses ifIndex, ipAdEntTable uses IP address. Our lint tool catches this as DED018.
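A check of this kind can be approximated by grouping a subgroup's column OIDs by their table-entry prefix. The sketch below is an assumption about how such a rule might work, not the lint tool's actual implementation; the function names are hypothetical. Note that a prefix-only heuristic would also flag tables that legitimately share an index, so a real rule would need to know which tables are compatible.

```python
from collections import defaultdict


def table_entry(oid: str) -> str:
    """Return the table-entry prefix of a column OID by dropping the column arc."""
    return oid.strip(".").rsplit(".", 1)[0]


def find_mixed_tables(subgroup_oids):
    """Return OIDs grouped by table entry if more than one table is present."""
    by_entry = defaultdict(list)
    for oid in subgroup_oids:
        by_entry[table_entry(oid)].append(oid)
    return dict(by_entry) if len(by_entry) > 1 else {}


interfaces_subgroup = [
    "1.3.6.1.2.1.2.2.1.2",   # ifDescr (ifTable, indexed by ifIndex)
    "1.3.6.1.2.1.2.2.1.10",  # ifInOctets (ifTable, indexed by ifIndex)
    "1.3.6.1.2.1.4.20.1.1",  # ipAdEntAddr (ipAdEntTable, indexed by IP address)
]

mixed = find_mixed_tables(interfaces_subgroup)
for entry, oids in mixed.items():
    print(entry, oids)
```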
Case Study 2: The Nexus SNMPv2c Bug
The Problem
A Cisco Nexus Python extension had 50 SNMP metrics returning NO DATA. The 16 NX-API metrics worked fine.
Root Cause
Lines 106-115 of the Python code:
if version == "v1":
    self.auth = CommunityData(community, mpModel=0)
# v2 case MISSING: falls through to v3!
else:
    self.auth = UsmUserData(user, auth_key, priv_key, ...)
The customer used SNMPv2c. The code only handled v1 and v3. SNMPv2c fell through to the SNMPv3 handler, which tried USM authentication against a device expecting a community string. Every SNMP poll failed silently.
The Fix
if version == "v1":
    self.auth = CommunityData(community, mpModel=0)
elif version == "v2":
    self.auth = CommunityData(community, mpModel=1)  # Added!
else:
    self.auth = UsmUserData(user, auth_key, priv_key, ...)
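The fix still relies on an else branch, so any unrecognized version string would again land in the SNMPv3 path. One possible hardening, shown here as a sketch rather than the extension's actual code, is to match every version explicitly and reject anything else. CommunityAuth, UsmAuth, and build_auth are hypothetical stand-ins so the example runs without an SNMP library, and accepting both "v2" and "v2c" is an assumption about the configuration values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommunityAuth:
    """Hypothetical stand-in for community-based authentication."""
    community: str
    mp_model: int  # 0 = SNMPv1, 1 = SNMPv2c


@dataclass(frozen=True)
class UsmAuth:
    """Hypothetical stand-in for SNMPv3 USM authentication."""
    user: str
    auth_key: str
    priv_key: str


def build_auth(version, community=None, user=None, auth_key=None, priv_key=None):
    """Choose the auth object for a version string; unknown versions are rejected."""
    if version == "v1":
        return CommunityAuth(community, mp_model=0)
    if version in ("v2", "v2c"):
        return CommunityAuth(community, mp_model=1)
    if version == "v3":
        return UsmAuth(user, auth_key, priv_key)
    raise ValueError(f"Unsupported SNMP version: {version!r}")


print(build_auth("v2c", community="public"))
```

With this shape, a misconfigured or unhandled version fails loudly at startup instead of silently failing on every poll.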
Case Study 3: The APIC Triple Typo
The Problem
Three APIC metrics returned NO DATA despite correct API endpoints.
Root Cause
Three typos in the Python code:
- Line 87: identInt16 → should be identInst16 (wrong JSON field name)
- Line 133: totalEplast → should be totalEpLast (case sensitivity)
- Line 137: qptcapacity → should be eqptcapacity (wrong API class name)
All three were single-character or case errors in API field or class names. The code ran without errors; it parsed None from the JSON and silently reported nothing.
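A lookup that fails loudly turns this class of typo into an immediate, readable error. The helper below is a minimal sketch: require_field and the sample payload are hypothetical, and the real APIC response structure is not reproduced here.

```python
def require_field(attributes: dict, name: str):
    """Return attributes[name], or raise with the available keys if it is absent."""
    if name not in attributes:
        known = ", ".join(sorted(attributes))
        raise KeyError(f"field {name!r} not in response; available: {known}")
    return attributes[name]


sample = {"totalEpLast": "42"}  # hypothetical payload fragment

print(require_field(sample, "totalEpLast"))

try:
    require_field(sample, "totalEplast")  # wrong case, as in the bug
except KeyError as exc:
    print(exc)
```

Compared with dict.get, which returns None for a misspelled key, the error message names the bad field and lists the keys the response actually contains.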
Case Study 4: The Palo Alto Feature Set Trap
The Problem
Customer updated from Palo Alto extension v2.9.7 to v3.2.3 (Dynatrace-built commercial extension). After the update, CPU, power supply, and sensor metrics disappeared.
Root Cause
v3.2.3 reorganized metrics into new feature sets: control-plane and hardware. The existing monitoring configurations didn't have these feature sets enabled — they were new in v3.2.3. The extension was collecting data, but only for the default feature set.
Additionally, the metric key prefix changed from palo-alto-generic to palo-alto.generic (hyphen to dot) — a breaking change that silently broke all existing dashboards and alerts referencing the old keys.
The Fix
Add control-plane and hardware to the featureSets array in each monitoring configuration. Update all dashboard and alert metric references to use the new prefix.
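Both parts of the fix can be checked mechanically before an upgrade is rolled out. The sketch below assumes a monitoring configuration is available as a dictionary with a featureSets list and that metric references are plain strings; the function names and the example key suffix are hypothetical.

```python
OLD_PREFIX = "palo-alto-generic"
NEW_PREFIX = "palo-alto.generic"
REQUIRED_FEATURE_SETS = {"control-plane", "hardware"}


def missing_feature_sets(config: dict) -> set:
    """Return the required feature sets that the configuration does not enable."""
    return REQUIRED_FEATURE_SETS - set(config.get("featureSets", []))


def migrate_metric_key(key: str) -> str:
    """Rewrite the old key prefix to the new one; other keys pass through unchanged."""
    if key.startswith(OLD_PREFIX):
        return NEW_PREFIX + key[len(OLD_PREFIX):]
    return key


config = {"featureSets": ["default"]}  # assumed configuration shape
print(sorted(missing_feature_sets(config)))
print(migrate_metric_key("palo-alto-generic.cpu"))  # hypothetical metric key
```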
Case Study 5: The MSSQL Rounding Bug
The Problem
SQL Server extension returned unrounded percentage values, for example 33.333333333 instead of 33.33.
Root Cause
Line 4740: ROUND(value, ) — missing the precision parameter. Should be ROUND(value, 2). The SQL query executed without error but returned unrounded floats.
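A bug like this is easy to catch by asserting on real query output rather than reading the SQL. The following sketch counts decimal places in a returned value; decimal_places is a hypothetical helper and the two-decimal expectation is taken from the intended ROUND(value, 2).

```python
from decimal import Decimal


def decimal_places(value) -> int:
    """Count the digits after the decimal point in the value's shortest string form."""
    exponent = Decimal(str(value)).normalize().as_tuple().exponent
    return max(0, -exponent)


for sample in (33.33, 33.333333333, 50.0):
    status = "ok" if decimal_places(sample) <= 2 else "unrounded"
    print(sample, status)
```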
Lessons Learned
- Always use fault isolation — separate SNMP groups for independent data sources
- Test all SNMP versions — v1, v2c, and v3 code paths must all work
- API field names are case-sensitive — one wrong character = silent failure
- Feature set changes are breaking — document which feature sets are required
- Metric key changes are breaking — never rename metric keys in a minor version
- Validate with real data — code review alone misses runtime bugs
- Run the lint tool — catches OID errors that would take hours to debug in production