Grafana Dashboards: A Complete Guide¶

Source Reference

Based on: Grafana dashboards: A complete guide

Original Author: Alexandre de Verteuil

Last Updated: Jan 9, 2023

Introduction¶

Dashboards are easy to create but difficult to maintain at scale. Without a clear strategy, organizations often face "dashboard sprawl." The following classification scheme helps organize dashboards based on business processes and engineering needs.

1. Methodologies: USE & REDS¶

Foundational dashboards for SREs. These should be visually simple, uniform, and composed primarily of time-series panels.

USE Method

Hardware / Resource Oriented

The USE Method (Utilization, Saturation, Errors) is designed for physical or virtual resources.

Target: Analyzing machine health and identifying root causes of infrastructure performance issues.
Key Metrics: CPU, Memory, Disk I/O, Network bandwidth.

REDS Method

Service / User Oriented

The REDS metrics (Requests, Errors, Duration, Saturation) derive from the Four Golden Signals (latency, traffic, errors, saturation): Requests correspond to traffic and Duration to latency.

Target: Microservices and API endpoints.
Usage: Primary candidates for alerting; these metrics serve as a proxy for the actual user experience.
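
As a rough illustration of how the four REDS signals relate to raw request data, the sketch below computes them from a small in-memory sample. The `Request` record, its fields, the `in_flight` and `capacity` parameters, and the choice of the 95th percentile are all assumptions made for this example; on a real dashboard these values would normally come from queries against the metrics backend.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    duration_s: float
    ok: bool


def reds_summary(requests, window_s, in_flight, capacity):
    """Summarize Requests, Errors, Duration, and Saturation for one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    durations = [r.duration_s for r in requests]
    if len(durations) >= 2:
        # 19 cut points for n=20; the last one is the 95th percentile.
        p95 = quantiles(durations, n=20, method="inclusive")[-1]
    else:
        p95 = durations[0] if durations else 0.0
    return {
        "request_rate": total / window_s,                 # Requests per second
        "error_ratio": errors / total if total else 0.0,  # Errors as a fraction
        "p95_duration_s": p95,                            # Duration
        "saturation": in_flight / capacity,               # Saturation (0.0 to 1.0)
    }
```

Expressing errors as a ratio rather than a raw count keeps the panel comparable across services with very different traffic levels.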

2. Navigation & Flows¶

Overview and Drill Down¶

Implemented using Dashboard Links or Data Links to create a hierarchy.

Overview: High-level aggregated metrics for the entire infrastructure or fleet.
Drill Down: Granular metrics focusing on a single component or instance.
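
One possible way to wire an overview panel to a drill-down dashboard is a data link that passes the clicked series' label as a dashboard variable. The sketch below builds such a link as a Python dictionary; the dashboard UID, the slug, and the `instance` label are hypothetical, and it assumes the target dashboard defines a variable with the same name as the label.

```python
def drill_down_link(dashboard_uid, slug, label="instance"):
    """Build a data link that opens a drill-down dashboard for the clicked series."""
    url = (
        f"/d/{dashboard_uid}/{slug}"
        # ${__field.labels.<label>} is filled in by Grafana when the link is clicked.
        f"?var-{label}=${{__field.labels.{label}}}"
        # Carry the current time range over to the target dashboard.
        "&${__url_time_range}"
    )
    return {"title": f"Drill down: {label}", "url": url}


def overview_panel_fragment(link):
    """Return the part of a panel's JSON model that holds its data links."""
    return {"fieldConfig": {"defaults": {"links": [link]}}}
```

Passing the time range along means the drill-down view opens on the same window the user was inspecting in the overview.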

Business Journey¶

Visualizations that track high-level business logic rather than technical metrics.

Scope: Customer acquisition funnels, supply chain logistics, or physical operations (e.g., IoT manufacturing lines).

The Home Dashboards¶

Custom entry points configured at the Organization, Team, or User level.

Strategy: Can range from simple row additions to "Enterprise" setups with team-specific layouts.
Components: Useful for displaying admin contact info and dynamic Dashboard Lists.

3. Development Workflows¶

Research & Development (R&D)¶

Sandboxes for iterative dashboard design and "work-in-progress" (WIP).

Organization: Isolate in folders like SRE R&D or AIOps Drafts.
Best Practice: Avoid using production tags to prevent drafts from polluting global search results.

Metrics Exploration¶

Abstract, reusable dashboards for browsing data when specific metric names are unknown.

Design Pattern: Use Variables to select metric prefixes, with panels repeating automatically.
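
The design pattern above could be sketched as a dashboard model with one multi-value variable and one panel that repeats per selected value. This is a minimal sketch, not a complete dashboard: the title is invented, the `metrics(...)` variable query assumes a Prometheus data source, and the exact variable schema differs between Grafana versions.

```python
def exploration_dashboard(prefix):
    """Build a minimal dashboard model whose single panel repeats for each selected metric."""
    variable = {
        "name": "metric",
        "type": "query",
        # Prometheus data source variable query: metric names matching a regex.
        "query": f"metrics({prefix}.*)",
        "multi": True,
        "includeAll": True,
    }
    panel = {
        "type": "timeseries",
        "title": "$metric",
        # Grafana renders one copy of this panel per selected value of the variable.
        "repeat": variable["name"],
        "repeatDirection": "h",
        "targets": [{"refId": "A", "expr": "avg without (instance) ($metric)"}],
    }
    return {
        "title": f"Metrics exploration: {prefix}",  # hypothetical title
        "templating": {"list": [variable]},
        "panels": [panel],
    }
```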

PromQL Aggregation Templates:

PromQL

# 1. Average Value
avg without (instance) ($metric)

# 2. Total Sum
sum without (instance) ($metric)

# 3. Average Per-Second Rates
avg without (instance) (rate($metric[$__rate_interval]))

# 4. Total Rate Sum
sum without (instance) (rate($metric[$__rate_interval]))
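
To preview what these templates expand to for a concrete metric, a small helper can substitute the metric name outside Grafana. The sketch below mirrors the four templates above; the dictionary keys and the metric name in the usage note are arbitrary examples, and `$__rate_interval` is left untouched because Grafana resolves it at query time.

```python
# The four aggregation templates, with {metric} standing in for Grafana's $metric.
TEMPLATES = {
    "average": "avg without (instance) ({metric})",
    "sum": "sum without (instance) ({metric})",
    "average_rate": "avg without (instance) (rate({metric}[$__rate_interval]))",
    "sum_rate": "sum without (instance) (rate({metric}[$__rate_interval]))",
}


def render_queries(metric):
    """Return each template with the given metric name substituted in."""
    return {name: template.format(metric=metric) for name, template in TEMPLATES.items()}
```

For example, `render_queries("node_network_receive_bytes_total")["sum_rate"]` yields `sum without (instance) (rate(node_network_receive_bytes_total[$__rate_interval]))`. The rate variants only make sense for counters; gauges should use the plain average or sum.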

4. Operational Views¶

Alerts Analysis¶

Visualizes the history and state of alerts, typically using the State Timeline panel.

Visualizing Alert States (State Timeline):

Map states to integers for visualization (e.g., 3 = Meta, here the AlwaysFiring alert; 2 = Firing; 1 = Pending).

PromQL

max by (alertname,alertstate) (
  3 * max_over_time(ALERTS{alertname="AlwaysFiring"}[$__interval])
  or
  2 * max_over_time(ALERTS{alertstate="firing"}[$__interval])
  or
  max_over_time(ALERTS{alertstate="pending"}[$__interval])
)
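
The integers produced by this query only become readable once the panel maps them back to state names. The sketch below generates value mappings in the shape used by recent Grafana panel JSON; the colors are an arbitrary choice, and the mapping schema is an assumption that should be checked against the Grafana version in use.

```python
# Integer codes as used in the query above; colors are illustrative only.
STATE_CODES = {
    1: ("Pending", "yellow"),
    2: ("Firing", "red"),
    3: ("Meta", "blue"),
}


def state_value_mappings():
    """Build value mappings that turn the integer codes into labeled, colored states."""
    options = {}
    for index, code in enumerate(sorted(STATE_CODES)):
        text, color = STATE_CODES[code]
        # Mapping keys are strings in the panel JSON, even for numeric values.
        options[str(code)] = {"text": text, "color": color, "index": index}
    return [{"type": "value", "options": options}]
```

The result would go under `fieldConfig.defaults.mappings` in the State Timeline panel's JSON model.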

Counting Firing Alerts (Status History):

PromQL

count by (alertname) (max_over_time(ALERTS{alertstate="firing"}[$__interval]))

Issue Dashboards¶

Ephemeral dashboards created for specific incident investigations.

Lifecycle: Temporary; should be archived or deleted after resolution.
Naming Convention: Include timestamp or Incident ID (e.g., 2023-10-Incident-Database-Lock).
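
A naming convention is easier to keep consistent when titles are generated rather than typed. The helper below is one possible sketch: the function name and the slug rules are assumptions, and it uses a full ISO date as the prefix.

```python
import re
from datetime import date


def issue_dashboard_title(opened, summary):
    """Return a title such as '2023-10-05-Incident-database-lock' for an issue dashboard."""
    # Collapse anything that is not a letter or digit into single hyphens.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", summary).strip("-")
    return f"{opened.isoformat()}-{slug}"
```

A date prefix keeps issue dashboards in chronological order when the folder is listed alphabetically.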

Meta-Monitoring¶

Observability for the observability stack itself (Prometheus, Loki, Alertmanager).

Access: Restricted to Platform Admins.
Purpose: Ensure the monitoring pipeline is healthy (e.g., scrape failures, rule evaluation times).

5. Visualization Formats¶

Big Screen (Wall TV)¶

Optimized for open office displays or NOCs.

UX Design: High contrast, large fonts, instant readability (Stat panels, Gauges).
Focus: Identifying what is broken immediately, rather than explaining why.

Reports¶

Requires Grafana Enterprise.

Layouts designed for PDF export and email distribution to stakeholders.

Constraints: Must be tuned for static rendering; interactive elements do not translate well to print/PDF.

6. Automation & Sourcing¶

Prebuilt Dashboards¶

Sources: Community Plugins, Cloud Integrations, and Mixins (Jsonnet bundles).
Repository: Grafana Dashboards Public Repo.

Dashboards as Code (IaC)¶

Managing dashboards via API or Terraform to ensure version control and reproducibility.

Automation Methods:

File Provisioning: Placing JSON files in the server's provisioning directory (CI/CD friendly).
HTTP API: Scripted interactions.
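
For the HTTP API route, a scripted upload could look roughly like the sketch below. It only builds the request for Grafana's `POST /api/dashboards/db` endpoint and does not send it; the base URL, token, and folder UID are placeholders, and `overwrite` is set to `True` on the assumption that the version-controlled copy is the source of truth.

```python
import json
import urllib.request


def build_dashboard_request(base_url, token, dashboard, folder_uid=None, message=""):
    """Prepare (but do not send) a request that creates or updates a dashboard."""
    payload = {
        "dashboard": dashboard,  # the dashboard JSON model
        "overwrite": True,       # replace an existing dashboard with the same UID
        "message": message,      # note stored in the dashboard's version history
    }
    if folder_uid:
        payload["folderUid"] = folder_uid
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the prepared request with `urllib.request.urlopen(request)` from a CI job would apply the change. Keeping request construction separate from sending makes the payload easy to inspect in tests.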

API Reference

Key endpoints for automation:

Tooling:

Grafonnet (Jsonnet library)
Grizzly (Dashboards as Code tool)
Terraform Provider

7. Organization Strategy¶

"The Watchtower" Structure¶

Recommended folder taxonomy for self-hosted instances to maintain order:

Folder	Purpose
Archive	Deprecated dashboards (kept for query recovery).
Issues	Investigation dashboards (Prefix: `yyyy-mm-dd`).
User Prod	Personal dashboards promoted to production use.
User R&D	Personal drafts and experiments.
Meta-monitoring	Stack health and Metrics Exploration.
General	Shared, stable production dashboards.
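
If the folder taxonomy is managed as code, the folders in the table could be described once and turned into payloads for Grafana's folder API (`POST /api/folders`). The sketch below only generates the payloads; the UID scheme is an assumption, and General is left out because it is Grafana's built-in default folder.

```python
import re

# Folder titles from the table above, excluding the built-in General folder.
WATCHTOWER_FOLDERS = ["Archive", "Issues", "User Prod", "User R&D", "Meta-monitoring"]


def folder_payloads(titles=WATCHTOWER_FOLDERS):
    """Return one folder-creation payload per title, with a slug-style UID."""
    payloads = []
    for title in titles:
        # Hypothetical UID scheme: lowercase, non-alphanumerics collapsed to hyphens.
        uid = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        payloads.append({"uid": uid, "title": title})
    return payloads
```

Stable UIDs let provisioned dashboards reference their folder by a predictable identifier across environments.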
