Capacity Planning
Use historical trends to predict when you'll run out of resources, before it becomes an incident.
CPU Trend (7-day)
timeseries avg(dt.host.cpu.usage), by:{dt.entity.host}, from:now()-7d
Is the average increasing day over day? If yes, you're approaching capacity.
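One way to answer "is it increasing?" objectively is to fit a least-squares slope to the daily averages exported from the query above. This is a sketch, not a Dynatrace API: the sample values and function name are illustrative.

```python
# Hypothetical daily CPU averages (%) exported from the timeseries query above.
daily_cpu_avg = [52.1, 53.0, 54.2, 54.8, 55.9, 57.1, 58.0]

def daily_slope(values):
    """Least-squares slope over equally spaced days: average change per day."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

slope = daily_slope(daily_cpu_avg)
if slope > 0:
    print(f"CPU rising ~{slope:.2f} percentage points per day")
```

A positive slope sustained across multiple weeks is a stronger signal than eyeballing the chart, and it feeds directly into the extrapolation below.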
Memory Trend (7-day)
timeseries avg(dt.host.memory.usage), by:{dt.entity.host}, from:now()-7d
Disk Growth Rate
timeseries avg(dt.host.disk.used.percent), by:{dt.entity.host}, from:now()-7d
If disk grows 1% per day and you're at 80%, you have ~20 days before it's full.
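That back-of-the-envelope estimate is a linear extrapolation, which can be sketched as a small helper (names are illustrative, not part of any Dynatrace SDK):

```python
def days_until_full(current_pct, growth_pct_per_day, limit_pct=100.0):
    """Linearly extrapolate disk usage; returns days remaining (inf if flat or shrinking)."""
    if growth_pct_per_day <= 0:
        return float("inf")
    return (limit_pct - current_pct) / growth_pct_per_day

print(days_until_full(80.0, 1.0))  # 20.0 days, matching the example above
```

Linear extrapolation understates risk for workloads with accelerating growth, so treat the result as an upper bound on your runway.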
Service Throughput Growth
timeseries sum(dt.service.request.count), by:{dt.entity.service}, from:now()-7d
Growing throughput on flat infrastructure means future performance degradation: the same hosts are serving more requests every week.
Week-over-Week Comparison
// This week's average CPU
timeseries this_week = avg(dt.host.cpu.usage, scalar:true), from:now()-7d, to:now()
// Compare with previous week manually by running:
timeseries last_week = avg(dt.host.cpu.usage, scalar:true), from:now()-14d, to:now()-7d
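Once you have the two scalar averages, the week-over-week change is a simple percentage. A minimal sketch, assuming you've pulled the two numbers from the queries above (the values here are made up):

```python
def wow_change(this_week, last_week):
    """Percent change, week over week."""
    return (this_week - last_week) / last_week * 100.0

# Illustrative scalar averages from the two queries above.
this_week, last_week = 63.4, 58.1
change = wow_change(this_week, last_week)
print(f"CPU average changed {change:+.1f}% week over week")
```

A single positive week can be noise; several consecutive positive weeks is the capacity signal the checklist below looks for.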
Capacity Planning Checklist
Metric       Warning Zone                Action
-----------  --------------------------  -----------------------------
CPU          avg > 70% (approaching)     Plan scaling or optimization
Memory       avg > 80% (approaching)     Check for leaks, plan upgrade
Disk         > 80% (critical)            Clean up or expand storage
Throughput   growing 10%+ weekly         Plan horizontal scaling
Error rate   trending up (degrading)     Investigate before it breaks
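The checklist can be automated as a threshold sweep. This is a sketch under the assumption that you've already extracted current values from the queries above; the metric names, thresholds, and sample numbers are all illustrative.

```python
# Illustrative thresholds mirroring the checklist above.
THRESHOLDS = {
    "cpu_avg_pct": 70.0,
    "memory_avg_pct": 80.0,
    "disk_used_pct": 80.0,
    "throughput_growth_pct_weekly": 10.0,
}

# Made-up current values, as if exported from the queries above.
current = {
    "cpu_avg_pct": 74.2,
    "memory_avg_pct": 61.0,
    "disk_used_pct": 83.5,
    "throughput_growth_pct_weekly": 4.0,
}

def capacity_warnings(values, thresholds):
    """Return the names of metrics whose current value exceeds its warning threshold."""
    return [name for name, limit in thresholds.items() if values.get(name, 0) > limit]

print(capacity_warnings(current, THRESHOLDS))  # ['cpu_avg_pct', 'disk_used_pct']
```

Running a check like this on a schedule turns the checklist from a periodic manual review into a standing early-warning report.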
💡 Set up a weekly trend report workflow that compares this week vs last week. If any metric is consistently growing, it's a capacity signal: act before it becomes an incident.
📝 Try it: Open a Notebook → run timeseries cpu=avg(dt.host.cpu.usage), by:{dt.entity.host}, from:now()-30d → switch to "Line chart" visualization. Look for upward trends: those hosts need capacity attention before they alert.