← Week 2: Metrics & Alerting

Day 10: Grafana Dashboards

Phase 6 · Sep 11, 2026

Agenda (2–3 hours)

  • Read (45 min): Grafana documentation — panels, variables, time range, annotations; Grafana dashboard-as-code (JSON model or Grafonnet)
  • Study (45 min): Design a service dashboard with 4 panels: traffic, error rate, P95 latency, and queue depth. What time range and refresh interval are appropriate for each?
  • Practice (45 min): Build the 4-panel dashboard in Grafana; add a template variable for $service to filter all panels simultaneously
  • Challenge (30 min): Export the dashboard as JSON; write a script to parameterize the service name so the same dashboard can be reused for 10 services

Dashboard Structure

Row: Service Health
├── Panel: Request Rate    (time series, req/s, 1h range)
├── Panel: Error Rate (%)  (time series, threshold red > 1%)
├── Panel: P95 Latency     (time series, ms, threshold > 200ms)
└── Panel: Queue Depth     (stat panel, current value)

Row: Downstream
├── Panel: DynamoDB latency by operation
└── Panel: SQS consumer lag
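
The Service Health row above can be written out as a dashboard JSON model. The sketch below is one possible shape, not an exported dashboard: the field names follow Grafana's panel JSON as commonly exported, while the 30s refresh, the grid sizes, and the metric names http_request_duration_seconds_bucket and queue_depth are assumptions made for illustration.

```python
import json

SEL = 'service="$service"'  # resolved by the $service template variable


def panel(title, panel_type, expr, unit, x, red_above=None):
    """Return a minimal panel dict; red_above adds a red threshold step."""
    steps = [{"color": "green", "value": None}]
    if red_above is not None:
        steps.append({"color": "red", "value": red_above})
    return {
        "title": title,
        "type": panel_type,
        "gridPos": {"h": 8, "w": 6, "x": x, "y": 1},
        "targets": [{"refId": "A", "expr": expr}],
        "fieldConfig": {
            "defaults": {
                "unit": unit,
                "thresholds": {"mode": "absolute", "steps": steps},
            }
        },
    }


def service_health_dashboard():
    request_rate = "sum(rate(http_requests_total{%s}[5m]))" % SEL
    error_pct = (
        'sum(rate(http_requests_total{%s,status=~"5.."}[5m])) / '
        "sum(rate(http_requests_total{%s}[5m])) * 100" % (SEL, SEL)
    )
    # Assumed histogram metric name; converted from seconds to milliseconds.
    p95_ms = (
        "histogram_quantile(0.95, sum by (le) "
        "(rate(http_request_duration_seconds_bucket{%s}[5m]))) * 1000" % SEL
    )
    # Assumed gauge metric name for the queue.
    queue = "sum(queue_depth{%s})" % SEL

    return {
        "title": "Service Health",
        "time": {"from": "now-1h", "to": "now"},
        "refresh": "30s",  # assumption; pick what suits the slowest panel
        "panels": [
            {"type": "row", "title": "Service Health",
             "gridPos": {"h": 1, "w": 24, "x": 0, "y": 0}},
            panel("Request Rate", "timeseries", request_rate, "reqps", 0),
            panel("Error Rate (%)", "timeseries", error_pct, "percent", 6, red_above=1),
            panel("P95 Latency", "timeseries", p95_ms, "ms", 12, red_above=200),
            panel("Queue Depth", "stat", queue, "short", 18),
        ],
    }


print(json.dumps(service_health_dashboard(), indent=2))
```

Keeping the panel layout in a small builder like this makes the thresholds (1% errors, 200 ms latency) reviewable in one place instead of buried in exported JSON.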

Template Variables

Variable: $service
Type: Query
Query: label_values(http_requests_total, service)
Refresh: On dashboard load
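
In dashboard-as-code form, the variable settings above live under templating.list in the dashboard JSON. The following is a rough sketch of that entry; the datasource uid "prometheus-main" is a made-up placeholder, and the exact datasource shape (plain name versus type/uid object) varies between Grafana versions, so compare against an export from your own instance.

```python
import json


def service_variable(datasource_uid="prometheus-main"):
    """Return a query-type template variable named 'service'.

    datasource_uid is a hypothetical placeholder, not a real uid.
    """
    return {
        "name": "service",
        "type": "query",
        "datasource": {"type": "prometheus", "uid": datasource_uid},
        "query": "label_values(http_requests_total, service)",
        "refresh": 1,  # 1 = on dashboard load, 2 = on time range change
        "multi": False,
        "includeAll": False,
    }


dashboard_fragment = {"templating": {"list": [service_variable()]}}
print(json.dumps(dashboard_fragment, indent=2))
```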

Panel query with variable:

sum(rate(http_requests_total{service="$service",status=~"5.."}[5m]))
/ sum(rate(http_requests_total{service="$service"}[5m])) * 100

One dashboard then serves every service: pick a value in the $service dropdown and all panels re-run their queries for it.
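
For the Challenge, an alternative to the dropdown is to stamp out one fixed dashboard per service from the exported JSON. The sketch below assumes the export uses $service or ${service} in its queries; the uid and title scheme and the service name auth-svc are invented for illustration (Grafana limits uid length, so long service names may need shortening).

```python
import json
import re

# Matches ${service} or $service, but not longer names such as $service_name.
_VAR = re.compile(r"\$\{service\}|\$service\b")


def _substitute(node, service):
    """Return a copy of node with the variable replaced in every string."""
    if isinstance(node, str):
        return _VAR.sub(lambda _match: service, node)
    if isinstance(node, list):
        return [_substitute(item, service) for item in node]
    if isinstance(node, dict):
        return {key: _substitute(value, service) for key, value in node.items()}
    return node


def render_for_service(exported, service):
    """Build a per-service dashboard dict from an exported dashboard dict."""
    dash = _substitute(exported, service)
    dash["id"] = None                      # let Grafana assign a new id
    dash["uid"] = f"svc-{service}"         # hypothetical uid scheme
    dash["title"] = f"{service} - Service Health"
    # The dropdown is no longer needed once the name is baked in.
    templating = dash.get("templating", {})
    templating["list"] = [
        var for var in templating.get("list", []) if var.get("name") != "service"
    ]
    dash["templating"] = templating
    return dash


if __name__ == "__main__":
    exported = {
        "title": "Service Health",
        "templating": {"list": [{"name": "service", "type": "query"}]},
        "panels": [{"targets": [
            {"expr": 'sum(rate(http_requests_total{service="$service"}[5m]))'}
        ]}],
    }
    for name in ["task-svc", "auth-svc"]:
        print(json.dumps(render_for_service(exported, name)))
```

The substitution walks the whole JSON tree rather than doing a text replace on the file, which keeps the output valid JSON even if a service name contains characters that would need escaping.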

Annotations

Link Grafana annotations to deployments:

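# Post-deploy annotation to Grafana ("time" is epoch milliseconds).
# Assumes $GRAFANA_TOKEN holds a token that is allowed to create annotations.
curl -X POST http://grafana:3000/api/annotations \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H 'Content-Type: application/json' \
  -d "{\"text\": \"Deployed $VERSION\", \"time\": $(date +%s)000}"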

Annotations appear as vertical lines on time-series panels, so a latency spike can be lined up against the deploy that preceded it.
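
The same call can be made from a deploy script. This is a minimal sketch using only the standard library; the "deploy" tag, the function names, and the bearer-token authentication are assumptions about your setup rather than something this lesson prescribes.

```python
import json
import time
import urllib.request


def build_annotation(version, epoch_seconds=None, tags=("deploy",)):
    """Build the request body; Grafana expects 'time' in epoch milliseconds."""
    if epoch_seconds is None:
        epoch_seconds = time.time()
    return {
        "text": f"Deployed {version}",
        "time": int(epoch_seconds * 1000),
        "tags": list(tags),
    }


def post_annotation(base_url, token, payload):
    """POST the annotation and return Grafana's decoded JSON response."""
    request = urllib.request.Request(
        url=f"{base_url}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    body = build_annotation("v1.4.2")  # hypothetical version string
    print(json.dumps(body))
    # post_annotation("http://grafana:3000", "<token>", body)
```

Building the body with json.dumps avoids the quoting problems of the shell version when the version string contains quotes or backslashes.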

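Grafana can also build annotations from a Prometheus query, for example one that returns a result whenever a Deployment's observed generation changes (this happens on any spec change, not only a new image):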

changes(kube_deployment_status_observed_generation{deployment="task-svc"}[1m]) > 0
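
To see why this query marks deploys, it helps to spell out what changes(...) > 0 computes: at each evaluation time it counts value changes among the samples in the trailing window and keeps the times where that count is positive. The sketch below imitates this on a hand-made series; the sample data is invented, and the window boundary handling is simplified compared with Prometheus itself.

```python
def changes(values):
    """Count how many times consecutive values differ."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)


def annotation_times(series, window_s):
    """Return timestamps where the trailing window saw at least one change.

    series is a list of (timestamp_seconds, value) pairs sorted by time.
    """
    marks = []
    for t, _ in series:
        in_window = [v for ts, v in series if t - window_s < ts <= t]
        if changes(in_window) > 0:
            marks.append(t)
    return marks


# Invented samples: observed generation goes from 3 to 4 at t=60.
samples = [(0, 3), (30, 3), (60, 4), (90, 4), (120, 4), (150, 4)]
print(annotation_times(samples, window_s=60))
```

With a 30-second scrape interval and a 1m window, the change is visible only briefly, which is what keeps the annotation to a narrow mark rather than a long band.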

Key Takeaways

  • Template variables ($service) make one dashboard reusable across services
  • Annotations put deploy events on metric charts, a quick way to tie a latency regression to a release
  • Export dashboards as JSON and store them in git; use Jsonnet (for example, the Grafonnet library) for parameterization
  • Stat panels for current values; time-series panels for trends; heatmaps for distributions

Tomorrow: Alertmanager — alert rules, routing, silencing, and PagerDuty integration.