# SLIs, SLOs, and SLAs

**Source Attribution**

The core concepts and examples in this document are extracted from Dark Mode Club's video. Watching the full explanation is highly recommended.
## 1. Planning
Service Level Objectives (SLOs) distinguish themselves from standard alerting and monitoring by focusing strictly on operations that matter to the user.
### The Planning Workflow
| Step | Description | Example |
|---|---|---|
| 1. Operation | A customer-facing operation important enough to warrant an SLO. | A new order is generated, submitted, and acknowledged. |
| 2. SLI (Indicator) | The measurable data needed from the system to compute the SLOs. | Latency in seconds from "order submitted" to "order acknowledged" (derived from logs/telemetry). |
| 3. Aggregation Window | A timeframe meaningful to the business to spot trends/hotspots. | 8 hours. |
| 4. Target Reliability | The percentage of success required. (Prod usually > 99%). | 90% (for this example). |
| 5. SLO (Objective) | The combination of the Indicator and the Target. | New orders are acknowledged within 1 second and achieve this level of service 90% of the time. |
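To make the plan concrete, here is a minimal sketch of the example SLO above expressed as a value. The `Slo` case class and its field names are illustrative, not from the source:

```scala
import scala.concurrent.duration._

// Illustrative only: the example SLO from the table, captured as data.
case class Slo(
  operation: String,          // the customer-facing operation
  indicator: String,          // what the SLI measures
  window: FiniteDuration,     // aggregation window
  threshold: FiniteDuration,  // latency bound per request
  target: Double              // required success ratio
)

val newOrderSlo = Slo(
  operation = "new order submitted and acknowledged",
  indicator = "seconds from \"order submitted\" to \"order acknowledged\"",
  window = 8.hours,
  threshold = 1.second,
  target = 0.90
)
```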
### Categories of SLIs
Based on "Database Reliability Engineering" by Laine Campbell & Charity Majors.
#### Latency

Measurement of delay in a communication flow, usually measured in units of time. Best practice is to measure boundaries (sketched below):
- Upper bound: "Created within 2 seconds."
- Range: "Created between 1 and 2 seconds."
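A minimal sketch of the two boundary styles as predicates, using the example thresholds above:

```scala
// Upper bound: "created within 2 seconds".
def withinUpperBound(latencyMs: Long): Boolean =
  latencyMs <= 2000

// Range: "created between 1 and 2 seconds".
def withinRange(latencyMs: Long): Boolean =
  1000 <= latencyMs && latencyMs <= 2000
```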
#### Availability

Is the service functional? A simple measurement of whether or not a service is generally available given a valid request.
- Example: HTTP 200 OK vs HTTP 500 Error.
#### Consistency

Quality of data: "Do users receive the data they expected?"
**The Caching Trap**

Techniques like caching improve Latency, but overdoing it might negatively impact Consistency (data freshness).
Hybrid Example (Consistency + Latency):
"Users see an online check deposit reflected in their transaction history within 2 seconds."
#### Throughput

Quantity of data sent/received within a timeframe.

- Example: "Transactions processed per second: 3."
Context matters: An overnight batch process measures throughput very differently from a real-time streaming service.
## 2. Implementing
Implementation is the delivery phase where SLOs impact development efforts.
- Basics: Ensure SLIs identified in planning are present in logs or time-series DBs.
- Architecture Impact:
    - Targeting 99%? Likely no major architectural changes.
    - Targeting 99.999%? Profound impact on infrastructure and design.
**Start Small**

Don't refactor your entire architecture on day one.
- Goal 1: Get actual reliability under measurement.
- Goal 2: Iteratively add SLIs to code and hook them into tools like Prometheus (see the sketch below).
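A minimal sketch of that hook, assuming the Prometheus Java simpleclient is on the classpath; the metric names and the `instrumented` wrapper are illustrative, not from the source:

```scala
import io.prometheus.client.{Counter, Histogram}

// Denominator for the SLO: every checkout attempt.
val checkoutRequests: Counter = Counter.build()
  .name("checkout_requests_total")
  .help("Total checkout requests.")
  .register()

// SLI: checkout latency in seconds, recorded as a histogram.
val checkoutLatency: Histogram = Histogram.build()
  .name("checkout_latency_seconds")
  .help("Time to process a valid order.")
  .register()

// Wrap an operation so every call feeds both metrics.
def instrumented[A](body: => A): A = {
  checkoutRequests.inc()
  val timer = checkoutLatency.startTimer()
  try body
  finally timer.observeDuration()
}
```

Wrapping a call like the `checkout` below in `instrumented { ... }` would feed both the request counter (the denominator) and the latency histogram.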
### Code Instrumentation Example
Scenario: User clicks "checkout".
SLI: Seconds to process a valid order (derived from logs).
We need 3 distinct log events:
- Request initiation.
- Success with latency calculation.
- Failure events.
```scala
@throws(classOf[ValidationException])
def checkout(cartItems: ShoppingCart,
             creditCardForm: CreditCardForm,
             shippingAddressForm: AddressForm): Order = {
  logger.info("checkout requested") // (1)
  try {
    val startTime = System.currentTimeMillis() // (2)
    val validCard = validateCreditCard(creditCardForm)
    val validAddress = validateShippingAddress(shippingAddressForm)
    val order = processOrder(validCard, validAddress, cartItems)
    val elapsedTime = System.currentTimeMillis() - startTime
    // Log success with the measured latency
    logger.info(s"order ${order.id} successfully took ${elapsedTime}ms") // (3)
    order
  } catch {
    case e: ValidationException =>
      // Log known validation errors (client-side errors)
      logger.info("checkout request failed due to a validation error") // (4)
      throw e
    case e: ServiceException =>
      // Log server-side errors
      logger.error("checkout request failed") // (5)
      throw e
  }
}
```
1. Total Requests Marker: Essential for the denominator in our formula.
2. Timer: Capture high-precision timestamps.
3. Success Marker: Capture the successful order ID and the elapsedTime.
4. Client Error: These often shouldn't count against your reliability score (depending on policy).
5. Server Error: These definitely count against reliability.
## 3. Operations
Once SLIs are implemented, we need visualization (Dashboards) and Reporting.
### The Tooling Stack
- Sloth: Generates Prometheus rules from simple SLO spec files. Saves time vs. hand-coding rules.
- Prometheus: Standard for metrics-based SLIs.
- Grafana: Visualizing the data (Supports Prometheus & Loki).
- Loki: Like Prometheus, but for logs. Great for legacy apps where you can't change code easily but have logs.
- Thanos: Solves Prometheus retention limits (Unlimited history).
- NOBL9: Turn-key platform for SLOs (if you want to buy vs build).
### Workflow: Logging to Dashboard (Loki path)
1. Promtail: Configure it to scrape the specific log files containing SLI data.
2. Extraction: Define parser configs (regex/JSON) in Promtail to isolate fields (latency, status).
3. Loki: Its query language (LogQL) extracts the metrics (see the sketch below).
4. Grafana: Visualize the query over the Aggregation Window.
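As a rough sketch of step 3, a LogQL query that parses the success log line from the checkout example and reports the 95th-percentile latency over the 8-hour window; the `job` label, the chosen percentile, and the exact log format are assumptions:

```logql
quantile_over_time(0.95,
  {job="checkout-service"}
    |= "successfully took"
    | regexp `took (?P<elapsed_ms>\d+)ms`
    | unwrap elapsed_ms [8h]
)
```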
## 4. The Maths
How to calculate the final percentage.
### Scenario A: Standard Calculation
Objective: New orders acknowledged within 1s.
Total Requests: 110 | Validation Errors: 10 | Service Errors: 10 | Slow Requests: 2
Breakdown:

- Valid Requests = (Total - Validation Errors) = \(110 - 10 = 100\)
- Failures = (Service Errors + Slow Requests) = \(10 + 2 = 12\)
- Successes = (Valid Requests - Failures) = \(100 - 12 = 88\)
- Attainment = (Successes / Valid Requests) = \(88 / 100 = 88\%\)
**Result**

88% < 90% Target. We missed the SLO.
### Scenario B: Filtering "Service Errors"
Sometimes, we only want to measure Latency performance, excluding crashes (which might be covered by an Availability SLO).
Formula adjustment: Remove Service Errors from the denominator.

Breakdown:

- Valid Requests = (Total - Validation Errors - Service Errors) = \(110 - 10 - 10 = 90\)
- Failures = (Slow Requests) = \(2\)
- Successes = (Valid Requests - Failures) = \(90 - 2 = 88\)
- Attainment = (Successes / Valid Requests) = \(88 / 90 \approx 97.8\%\)

**Result**

97.8% > 90% Target. We passed the Latency SLO.
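The same arithmetic as a minimal sketch; the `SliCounts` class and function names are illustrative, not from the source:

```scala
case class SliCounts(total: Int, validationErrors: Int, serviceErrors: Int, slowRequests: Int)

// Percentage of valid requests that met the objective.
// Validation errors are client mistakes, so they never enter the denominator;
// service errors can optionally be excluded too (Scenario B).
def attainment(c: SliCounts, countServiceErrors: Boolean = true): Double = {
  val valid =
    if (countServiceErrors) c.total - c.validationErrors
    else c.total - c.validationErrors - c.serviceErrors
  val failures = (if (countServiceErrors) c.serviceErrors else 0) + c.slowRequests
  (valid - failures).toDouble / valid * 100
}

val counts = SliCounts(total = 110, validationErrors = 10, serviceErrors = 10, slowRequests = 2)
attainment(counts)                             // 88.0  -> misses the 90% target
attainment(counts, countServiceErrors = false) // ~97.8 -> passes the latency-only SLO
```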
## 5. Understanding & Error Budgets
This phase involves periodically reviewing reliability (monthly/quarterly). Data-driven decisions beat gut feelings.
### The Burn Rate

If you target 90% but achieve 99.9%, you have a surplus of reliability. This surplus is your Error Budget, and the burn rate is how quickly you consume it.

- Budget exhausted (failing the objective): Freeze features, focus on reliability.
- Budget remaining (beating the objective): Spend it on risky changes, refactoring, or speed.
**Advanced: Calculating Error Budget**

Example for a 99.9% Availability Target:

Error Budget = 1 - Availability SLO = \(1 - 99.9\% = 0.1\%\)

Time Calculation (Monthly): \(0.1\%\) of 30 days \(= 43.2\) minutes
**Note:** Don't worry about this when starting out. Get the measurements working first.
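A one-function sketch of that time calculation; the 30-day month is the assumption from the example above:

```scala
// Minutes of allowed downtime for an availability target over a window.
def errorBudgetMinutes(target: Double, daysInWindow: Int = 30): Double =
  (1.0 - target) * daysInWindow * 24 * 60

errorBudgetMinutes(0.999) // = 43.2 minutes, matching the example above
```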
## Resources
**Books**

- *Implementing Service Level Objectives* by Alex Hidalgo
- *Database Reliability Engineering* by Laine Campbell & Charity Majors
- Google's SRE Book List