
Grafana Dashboards: A Complete Guide

Source Reference

Based on: Grafana dashboards: A complete guide

Original Author: Alexandre de Verteuil

Last Updated: Jan 9, 2023

Introduction

Dashboards are easy to create but difficult to maintain at scale. Without a clear strategy, organizations often face "dashboard sprawl." The following classification scheme helps organize dashboards based on business processes and engineering needs.


1. Methodologies: USE & REDS

Foundational dashboards for SREs. These should be visually simple, uniform, and composed primarily of time-series panels.

Hardware / Resource Oriented

The USE Method (Utilization, Saturation, Errors) is designed for physical or virtual resources.

  • Target: Analyzing machine health and identifying root causes of infrastructure performance issues.
  • Key Metrics: CPU, Memory, Disk I/O, Network bandwidth.

Service / User Oriented

The REDS metrics (Requests, Errors, Duration, Saturation) derive from the Four Golden Signals.

  • Target: Microservices and API endpoints.
  • Usage: Primary candidates for alerting; these metrics serve as a proxy for the actual user experience.
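
As a rough illustration of what the request-side signals measure, the sketch below summarises a window of request records into a request rate, an error ratio, and a mean duration. The `Request` record shape, the `red_summary` helper, and the choice to count HTTP 5xx responses as errors are assumptions made for this example. Saturation is left out because it describes resource headroom and cannot be derived from request records alone.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request, in seconds


def red_summary(requests: Sequence[Request], window_s: float) -> dict:
    """Summarise a window of requests as rate, error ratio, and mean duration."""
    total = len(requests)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "requests_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "mean_duration_s": sum(r.duration_s for r in requests) / total if total else 0.0,
    }


sample = [Request(200, 0.10), Request(200, 0.20), Request(500, 0.90), Request(200, 0.40)]
summary = red_summary(sample, window_s=60.0)
```

In a real dashboard these values would normally come from counters and histograms in the data source rather than from raw records; the sketch only shows what each signal represents.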

2. Navigation & Flows

Overview and Drill Down

Implemented using Dashboard Links or Data Links to create a hierarchy.

  • Overview: High-level aggregated metrics for the entire infrastructure or fleet.
  • Drill Down: Granular metrics focusing on a single component or instance.
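
One way to wire an overview panel to a drill-down dashboard is a data link that passes the clicked series' label as a dashboard variable. The sketch below builds such a link entry as a Python dict, intended for the `fieldConfig.defaults.links` list in a panel's JSON. The dashboard UID, slug, and variable name are hypothetical; `${__field.labels.<label>}` and `${__url_time_range}` are Grafana variables that are interpolated when the link is clicked.

```python
from urllib.parse import quote


def drill_down_link(title: str, dashboard_uid: str, slug: str, label: str) -> dict:
    """Build one data-link entry pointing at a per-instance dashboard."""
    # The ${...} placeholders are resolved by Grafana at click time,
    # so they are deliberately left unescaped in the URL.
    url = (
        f"/d/{quote(dashboard_uid)}/{quote(slug)}"
        f"?var-{label}=${{__field.labels.{label}}}&${{__url_time_range}}"
    )
    return {"title": title, "url": url, "targetBlank": False}


# Hypothetical target dashboard "node-detail" with a variable named "instance".
overview_link = drill_down_link("Instance details", "node-detail", "node-detail", "instance")
```

Carrying the time range in the link keeps the drill-down view aligned with the window the user was looking at in the overview.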

Business Journey

Visualizations that track high-level business logic rather than technical metrics.

  • Scope: Customer acquisition funnels, supply chain logistics, or physical operations (e.g., IoT manufacturing lines).

The Home Dashboards

Custom entry points configured at the Organization, Team, or User level.

  • Strategy: Can range from simple row additions to "Enterprise" setups with team-specific layouts.
  • Components: Useful for displaying admin contact info and dynamic Dashboard Lists.

3. Development Workflows

Research & Development (R&D)

Sandboxes for iterative dashboard design and "work-in-progress" (WIP).

  • Organization: Isolate in folders like SRE R&D or AIOps Drafts.
  • Best Practice: Avoid using production tags to prevent drafts from polluting global search results.

Metrics Exploration

Abstract, reusable dashboards for browsing data when specific metric names are unknown.

Design Pattern: Use Variables to select metric prefixes, with panels repeating automatically.

PromQL Aggregation Templates:

PromQL
# 1. Average Value
avg without (instance) ($metric)

# 2. Total Sum
sum without (instance) ($metric)

# 3. Average Per-Second Rates
avg without (instance) (rate($metric[$__rate_interval]))

# 4. Total Rate Sum
sum without (instance) (rate($metric[$__rate_interval]))
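
To make the substitution concrete, the following sketch mimics what Grafana does when `$metric` takes a value, using the four templates above. The `expand` helper and the `is_counter` switch are assumptions for illustration: rate-based templates are meaningful for counters, while the plain aggregations suit gauges. `$__rate_interval` is left untouched because Grafana resolves it at query time.

```python
AGGREGATION_TEMPLATES = {
    "average": "avg without (instance) ($metric)",
    "sum": "sum without (instance) ($metric)",
    "average_rate": "avg without (instance) (rate($metric[$__rate_interval]))",
    "sum_rate": "sum without (instance) (rate($metric[$__rate_interval]))",
}


def expand(metric: str, is_counter: bool) -> dict:
    """Return the templates that suit the metric type, with $metric filled in."""
    names = ("average_rate", "sum_rate") if is_counter else ("average", "sum")
    return {
        name: AGGREGATION_TEMPLATES[name].replace("$metric", metric)
        for name in names
    }


counter_queries = expand("node_network_receive_bytes_total", is_counter=True)
gauge_queries = expand("node_load1", is_counter=False)
```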

4. Operational Views

Alerts Analysis

Visualizes the history and state of alerts, typically using the State Timeline panel.

Visualizing Alert States (State Timeline):

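Map each alert state to an integer so the panel can assign it a color (e.g., 3 = Meta, 2 = Firing, 1 = Pending). Here "Meta" is the always-firing AlwaysFiring alert; it is matched first, so its value takes precedence over the plain firing value for that series.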

PromQL
max by (alertname,alertstate) (
  3 * max_over_time(ALERTS{alertname="AlwaysFiring"}[$__interval])
  or
  2 * max_over_time(ALERTS{alertstate="firing"}[$__interval])
  or
  max_over_time(ALERTS{alertstate="pending"}[$__interval])
)
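
The integers produced by this query still need readable labels and colors in the panel. The sketch below generates a value-mapping list of the kind found under `fieldConfig.defaults.mappings` in a panel's JSON. The color choices are arbitrary, and the exact mapping schema should be checked against a dashboard exported from your Grafana version; treat this as an assumption-laden starting point.

```python
# Integer encoding taken from the query above.
STATE_VALUES = {"pending": 1, "firing": 2, "meta": 3}

# Hypothetical color choices for this sketch.
STATE_COLORS = {"pending": "yellow", "firing": "red", "meta": "blue"}


def value_mappings() -> list:
    """Build one 'value' mapping that translates state integers to text and color."""
    ordered = sorted(STATE_VALUES.items(), key=lambda item: item[1])
    options = {
        str(value): {
            "text": state.capitalize(),
            "color": STATE_COLORS[state],
            "index": position,
        }
        for position, (state, value) in enumerate(ordered)
    }
    return [{"type": "value", "options": options}]


mappings = value_mappings()
```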

Counting Firing Alerts (Status History):

PromQL
count by (alertname) (max_over_time(ALERTS{alertstate="firing"}[$__interval]))

Issue Dashboards

Ephemeral dashboards created for specific incident investigations.

  • Lifecycle: Temporary; should be archived or deleted after resolution.
  • Naming Convention: Include a date or incident ID (e.g., 2023-10-Incident-Database-Lock).
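
A small helper can keep these titles consistent so that issue dashboards sort chronologically. The function below is a sketch only; the `INC-1042` identifier and the exact title layout are hypothetical and should follow whatever your incident tooling uses.

```python
import re
from datetime import date


def issue_dashboard_title(opened: date, incident_id: str, summary: str) -> str:
    """Compose a sortable title: ISO date, incident ID, then a hyphenated summary."""
    # Collapse anything that is not a letter or digit into single hyphens.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", summary).strip("-")
    return f"{opened.isoformat()}-{incident_id}-{slug}"


title = issue_dashboard_title(date(2023, 10, 5), "INC-1042", "Database lock")
```

Putting the date first means an alphabetical folder listing doubles as a timeline, which makes later cleanup easier.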

Meta-Monitoring

Observability for the observability stack itself (Prometheus, Loki, Alertmanager).

  • Access: Restricted to Platform Admins.
  • Purpose: Ensure the monitoring pipeline is healthy (e.g., scrape failures, rule evaluation times).

5. Visualization Formats

Big Screen (Wall TV)

Optimized for open office displays or NOCs.

  • UX Design: High contrast, large fonts, instant readability (Stat panels, Gauges).
  • Focus: Identifying what is broken immediately, rather than explaining why.

Reports

Requires Grafana Enterprise.

Layouts designed for PDF export and email distribution to stakeholders.

  • Constraints: Must be tuned for static rendering; interactive elements do not translate well to print/PDF.

6. Automation & Sourcing

Prebuilt Dashboards

Dashboards as Code (IaC)

Managing dashboards via API or Terraform to ensure version control and reproducibility.

Automation Methods:

  • File Provisioning: Placing JSON files in the server's provisioning directory (CI/CD friendly).
  • HTTP API: Scripted interactions.
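
As a sketch of the scripted route, the snippet below prepares (but does not send) a request to Grafana's `POST /api/dashboards/db` endpoint, which creates or updates a dashboard from its JSON model. The base URL, token, folder UID, and dashboard contents are placeholders, and the helper name is made up for this example.

```python
import json
import urllib.request


def build_upsert_request(base_url: str, token: str, dashboard: dict,
                         folder_uid: str, message: str) -> urllib.request.Request:
    """Prepare a create-or-update request for a single dashboard."""
    body = {
        "dashboard": dashboard,
        "folderUid": folder_uid,
        "message": message,  # stored as the version note
        "overwrite": True,   # replace an existing dashboard with the same uid
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; nothing is sent until urllib.request.urlopen(request) is called.
request = build_upsert_request(
    base_url="https://grafana.example.com/",
    token="dummy-token",
    dashboard={"id": None, "uid": "fleet-overview", "title": "Fleet Overview", "panels": []},
    folder_uid="general-prod",
    message="Provisioned from CI",
)
```

Keeping the dashboard JSON in version control and pushing it from a pipeline gives the same reproducibility as file provisioning, with the trade-off that the script needs a token with write access.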

API Reference

Key endpoints for automation:

Tooling:


7. Organization Strategy

"The Watchtower" Structure

Recommended folder taxonomy for self-hosted instances to maintain order:

Folder             Purpose
Archive            Deprecated dashboards (kept for query recovery).
Issues             Investigation dashboards (Prefix: yyyy-mm-dd).
User Prod          Personal dashboards promoted to production use.
User R&D           Personal drafts and experiments.
Meta-monitoring    Stack health and Metrics Exploration.
General            Shared, stable production dashboards.
