Nerd-World Problems: Capacitation and Commiseration

A couple of years ago I wrote a post that aimed to be a comprehensive reference to all the concepts I have gathered over my career about how SLI/O/As work. In many ways the goal was to help myself distil the various concepts surrounding this approach to observability into a cohesive framework, because part of the problem one encounters when trying to learn about SLI/O/As is that there doesn’t currently exist a common approach to them across the industry. So I attempted to present all of the related concepts, as I understand them, in a way that left room for all of the ways people have currently implemented them. That was a helpful exercise for myself, at least, if no one else.

However, few people have time to read a frankly dry treatise on all the ways the various concepts of Service Level Observability work together. Most just want to know how to make them happen in their organization. If that describes you, then hopefully this will not only help you get started, but will guide you from zero to full implementation in the most straightforward way possible.

Goals
Clarifications
Steps

Goals

To start, let’s quickly go over the outcomes we hope to get from implementing Service Level Observability.

Use SLOs to Prioritize Work

Service Level Objectives will provide an excellent means for us to know where we need to direct our engineering resources both today, and in the near future. They will tell us where the stability of our platform is at risk or outright failing.

We can and should use these on a regular or even daily cadence to know where an engineer should be tasked with investigating failures and possibly making adjustments to either the code or infrastructure to address or prevent incidents.

Use Burn Rate Alerts for Proactive Incident Response

Burn Rate Alerts are an invaluable tool to help distinguish occasional hiccoughs in a service from a developing incident.

In short, they measure the trajectory of the rollup of an SLI over time, to determine when a trend is emerging over either a short or a long time span. These should then be used to send alerts to on-call engineers so that they can begin to address the issue before the first customer report is made.

Use Internal SLAs as Inter-Team Contracts

At the end of the initiative to implement Service Level Observability across the organization, you should have internal SLAs for all of your critical systems. Other teams developing a dependency on one of those systems should then use its SLA to decide what realistic Objectives they can set for their own system, or whether they need to architect their system differently to compensate for, or avoid, a critical dependency on a system that does not provide the reliability their consumers will require.
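To make the arithmetic behind that decision concrete, here is a minimal Python sketch. It assumes every dependency is a hard, serial dependency and that failures are independent, which real systems rarely satisfy exactly, so treat the result as a rough ceiling rather than a prediction. The function name and the SLA values are made up for illustration.

```python
from math import prod


def availability_ceiling(dependency_slas: list[float]) -> float:
    """Rough upper bound on availability (as a fraction) when every
    dependency must succeed for a request to succeed.

    Assumes independent failures and no retries, caching, or fallbacks.
    """
    return prod(dependency_slas)


# Hypothetical internal SLAs published by three upstream systems.
upstream_slas = [0.999, 0.9995, 0.995]
ceiling = availability_ceiling(upstream_slas)

# A 99.9% Objective would not be realistic on top of these dependencies.
desired_objective = 0.999
objective_is_realistic = desired_objective <= ceiling
```

With these illustrative numbers the ceiling lands at roughly 99.35%, so a team wanting 99.9% would need to either negotiate stronger SLAs or design around the weakest dependency.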

Reconcile Internal SLOs/SLAs and Customer SLAs

Often organizations will write contracts or publish SLAs to their customers prior to having engaged in this exercise or having adopted Service Level Observability. After having gone through this exercise and defined your internal SLOs/SLAs, you may find that the published SLAs are either more or less aggressive than what your internal systems can support. You can then determine whether it is your SLA that needs refinement or if you need to plan for improvements to your internal systems so that they can properly support the customer SLAs.

Use SLIs as Deployment Signals

The exercise of defining SLIs will help you to better understand what your system looks like when it’s running properly. This in turn will give you an understanding of what sort of behaviors you should evaluate as part of your Continuous Deployment to ensure your changes are being properly tested for potential negative consumer impact.

Ideally you’ll want these SLIs converted into some sort of automated test that will indicate to you whether a change is ready to be promoted via your deployment pipeline to a higher environment or if it needs to be rolled back.
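As a rough illustration of what such an automated check could look like, here is a small Python sketch. The class, function, and threshold values are hypothetical, and how you collect the observed values will depend entirely on your own metrics tooling and pipeline.

```python
from dataclasses import dataclass


@dataclass
class SliCheck:
    name: str
    observed: float  # percentage observed in the candidate environment
    minimum: float   # lowest percentage we are willing to promote with


def promotion_decision(checks: list[SliCheck]) -> str:
    """Return "promote" when every SLI check passes, otherwise name the
    failing SLIs so the pipeline can roll back and report why."""
    failing = [check.name for check in checks if check.observed < check.minimum]
    if not failing:
        return "promote"
    return "rollback: " + ", ".join(failing)


# Hypothetical values gathered after deploying to a staging environment.
checks = [
    SliCheck(name="availability", observed=99.95, minimum=99.9),
    SliCheck(name="latency", observed=93.0, minimum=95.0),
]
decision = promotion_decision(checks)
```

Here the latency check falls short of its minimum, so the decision names it as the reason for rolling back.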

Define SLI/O/As Early in the Service Lifecycle

The exercise of defining SLIs and setting SLOs and SLAs helps bring into focus exactly what success looks like for the launch of a new service/system. As such you really want to be going through this exercise as early as possible in the development of a new service. As soon as you know what the use-cases and basic architecture will be, you’re ready to start defining SLIs and using those to ensure your service will be properly instrumented and monitored at launch.

Clarifications

Before we get into the actual steps for implementing SLOs, we should ensure that we’re clear on a few points to avoid confusion down the road.

SLO vs. SLOs

Service Level Observability is a term I use to refer to this whole model for evaluating how well a system or service is meeting the needs of its consumers.

Service Level Objectives are the targets for evaluation that are the killer feature of this model for monitoring our systems.

It’s quite common for people to simply refer to the whole model as SLOs, and so Service Level Observability is a bit of a backronym for SLO that can better capture the full range of things this model has to offer beyond just Objectives.

SLIs vs. SLOs

The fundamental building blocks of Service Level Observability are questions, answered by metrics & monitoring, about the ability of your system/service to meet the needs of its consumers. Some people call these questions SLOs while referring to the metrics that are used to answer the question as the SLIs. Others refer to the question as the SLI and either the desired range of answers or the percentage of time the answer to the question was positive as the SLO. This latter definition is the one I’m using here, for clarity, though in practice the distinction doesn’t have much effect on the outcome.

Golden Signals vs. SLIs

SLIs are a form of Signal, and thus are closely related to the Four Golden Signals: Latency, Traffic, Errors, & Saturation. The Golden Signals form an important industry-standard set of metrics that you should be monitoring about any system, and even after implementing Service Level Observability your observability story will be incomplete without proper Golden Signals monitoring. However, with the exception of Latency, the Golden Signals are focused on the vital signs of a system from the perspective of its operator more than the perspective of its consumers. Thus they sit as an important complement to Service Level Observability but are slightly different from what we will focus on here.

For Service Level Observability, focus on metrics that directly capture consumer-facing functionality, and use the Golden Signals to track changes in how the overall system is behaving in the eyes of the operator.

Internal SLI/O/As vs. Customer Facing SLI/O/As

The process I will recommend here begins by defining SLIs for all critical internal systems/services and then setting internal SLOs and SLAs for each of them. SLI/O/As can also be defined at the customer-facing boundary, where all internal systems are abstracted away and we are only visualizing the system as one (or sometimes a few) customer-facing applications whose discrete functionality is supplied by any number of internal systems of which the customer is completely unaware.

Usually it’s the case that at this level, SLAs have already been published and written into contracts, even if there does not yet exist any monitoring for our ability to meet those SLAs. And that’s something that we want to fix.

It can be tempting to start with already defined customer SLAs and work backwards through the system to figure out how to monitor the overall ecosystem and its ability to meet those SLAs. This is a valid approach, and might be the right solution for organizations whose primary goal in adopting Service Level Observability is to provide an executive-level summary of the health trends of the customer-facing product. However, it will not yield the sort of fine-grained understanding of consumer experience across the organization that this model makes possible. Starting with SLI definitions for all your internal systems (or at the very least the most critical systems) and working from there will result in far better observability coverage and faster incident response/root cause analysis.

And since these are the systems that are ultimately responsible for performing these actions on behalf of the end customers, it will normally be the case that your customer-facing SLAs will already be covered by the monitoring put in place for your internal SLI/O/As. If that’s not the case, you probably missed something when defining your internal SLIs.

Steps

Now that we’ve laid out what we want to accomplish by implementing SLOs, let’s go over the steps of this framework for how to actually implement them.

The steps are presented here in order, but you can parallelize doing this for each individual SLI or do them in batches per system. I’d recommend taking at least one SLI or system all or most of the way through this process to gain a better understanding of what lies ahead when doing the work for subsequent systems.

All steps will likely require iteration at some point, so don’t expect to get everything right the first time. Part of the value of this exercise is that it will help you think about your systems in new ways and you may eventually find that you want to go back and tweak things as you learn more.

Define System/Service Boundaries

Since SLIs are defined at the System Boundary, it’s important to start by ensuring you know where one system ends and another one starts.

Your system can be made up of many kinds of components: Kubernetes Workloads, Database Clusters, Serverless Functions, Gateways & Load Balancers, Queues, API Servers, PaaS Applications, etc. Identifying which components are part of a single system, and which are part of another that this system perhaps relies upon, is important because all your SLIs will be written as if everything sitting within the boundary is opaque to the consumer, even if that consumer is simply another internal system.

Where the boundary is ultimately drawn is somewhat arbitrary and should be wherever makes the most sense to your team/organization. Yet the choice is consequential to how you will write your SLI definitions and how you will report on overall system health. The following guidelines might help you decide where to draw the boundary and how to group components into systems.

Signs two components are likely part of the same system:

They are developed/configured by the same team.
They are developed in tandem, and deployed together.
If one provides some functionality to the other, no other systems (in other words, components outside this system) rely on it to provide the same functionality.
The organization already thinks of them as one unit.
You don’t want to report on their ability to meet consumer needs separately.

Enumerate Features and Functions

Having determined what components make up the system, you need to list what the system does on behalf of consumers. This should be written from the perspective of someone sitting outside the system boundary and listing what is expected of the system as a whole. Which components are actually responsible for providing that functionality (even if it’s just one) are opaque at this level, so all functionality listed here now belongs to the system as a whole.

Examples:

Stores new user audit records received
Serves existing user audit records when queried
Keeps records as long as requested when written (TTL)
Handles requests to forward all audit records for a single user for processing
Processes GDPR request queue to purge user records

Identify Potential SLI Types

Take a look at each of the pieces of functionality provided by the system, and compare them against well-known SLI types to see which, if any, seem applicable to a given piece of functionality.

Well-known SLI types include:

Availability
- For any system that handles external requests.
Latency
- For any system that handles external requests.
Quality
- Useful when a system’s responses or output can degrade gracefully, such as when the response data might be cached, or when encoding quality can degrade.
Freshness
- For measuring processing output either by ensuring data has recently been emitted or that the pipeline processing is keeping up.
Correctness
- For evaluating anything that emits output.
Coverage
- For evaluating systems that process available data or request saturation, to ensure they are keeping up with available data for consumption.
Durability
- For measuring storage reliability.

Multiple might be applicable to just one piece of functionality, such as with any functionality that handles external requests which will usually have at least an Availability and a Latency SLI.

It’s also possible that none of them fit, or you’re not sure which kind matches the sort of thing you want to measure. Don’t worry about this too much. These types are there as guides, not a limited set that you must choose from.

Write SLI Definitions

Now it’s time to actually write an SLI definition. These are just questions or assertions about the health of your system at a given point in time.

Examples:

The percentage of all requests returning a non-500 status.
The percentage of clients receiving a specific encoding quality for streaming video.
The percentage of responses returned in <500ms.
The percentage of unprocessed records older than 1 minute.
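To show how definitions like these translate into something computable, here is a minimal Python sketch of the first and third examples. The `Request` record is a hypothetical stand-in for whatever your metrics source provides. I am also assuming that "non-500" means any status outside the 5xx range, and that a period with no traffic counts as 100%; both are choices you may want to make differently.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    duration_ms: float


def percentage(good: int, total: int) -> float:
    # Assumption: with no traffic there is nothing to fail, so report 100%.
    return 100.0 if total == 0 else 100.0 * good / total


def availability_sli(requests: list[Request]) -> float:
    """The percentage of all requests returning a non-5xx status."""
    good = sum(1 for r in requests if not 500 <= r.status <= 599)
    return percentage(good, len(requests))


def latency_sli(requests: list[Request], limit_ms: float = 500.0) -> float:
    """The percentage of responses returned in under limit_ms."""
    good = sum(1 for r in requests if r.duration_ms < limit_ms)
    return percentage(good, len(requests))


# Hypothetical sample of five requests.
sample = [
    Request(200, 120), Request(200, 480), Request(503, 30),
    Request(200, 900), Request(200, 650),
]
availability = availability_sli(sample)  # 4 of 5 requests succeeded
latency = latency_sli(sample)            # 3 of 5 requests were fast enough
```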

To Timeslice or Not

The definitions you write can either be Timesliced SLIs or not. A Timesliced SLI is evaluated over a short time period against some value/threshold and is determined to be either healthy or unhealthy for that period. The SLO is then evaluated over a longer period made up of hundreds of these timeslices, and is the desired percentage of the time the SLI was evaluated as healthy.

A Non-Timesliced SLI has no desired threshold of its own, but is evaluated against the SLO’s threshold.

Timesliced SLI Definition: 95% of all requests returned a non-500 status over the last 5 minutes.

Non-Timesliced SLI Definition: The percentage of requests which returned a non-500 status.

Both of these approaches are valid and in use throughout the industry. However, in this example we will favor the Timesliced approach since it has the advantage of providing us with an answer about whether or not we consider a system to be healthy at any given point in time, as opposed to merely evaluating the system’s recent trends.
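The difference between the two approaches is easier to see with numbers. The following Python sketch rolls up the same hypothetical data both ways; each tuple is a (good requests, total requests) count for one timeslice, and treating an empty timeslice as healthy is my own assumption.

```python
def timesliced_rollup(slices: list[tuple[int, int]], slice_threshold: float) -> float:
    """Percentage of timeslices in which the SLI met its own threshold."""
    healthy = sum(
        1 for good, total in slices
        # Assumption: a timeslice with no traffic counts as healthy.
        if total == 0 or good * 100 >= slice_threshold * total
    )
    return 100.0 * healthy / len(slices)


def non_timesliced_rollup(slices: list[tuple[int, int]]) -> float:
    """Percentage of all requests in the window that were good."""
    good = sum(g for g, _ in slices)
    total = sum(t for _, t in slices)
    return 100.0 * good / total


# Hypothetical (good, total) counts for four consecutive timeslices.
window = [(100, 100), (98, 100), (90, 100), (1000, 1000)]

timesliced = timesliced_rollup(window, slice_threshold=95)  # 3 of 4 slices healthy
non_timesliced = non_timesliced_rollup(window)              # 1288 of 1300 requests good
```

With this data the Timesliced rollup is 75% while the Non-Timesliced rollup is about 99.08%, because the busy fourth timeslice dominates the request count but is only one slice among four.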

Write A Query Matching the SLI Definition

Once you have an SLI definition (or likely many) you will want to see if you have the necessary metrics to answer the question or assertion. This is simply done by going to your metrics aggregator and attempting to write a query that will provide the answer. For timesliced SLIs this query should likely be written the way your alerting system would query the metrics to determine whether the system is healthy at that given point in time (though you will not likely create an alert based on this query directly).

Instrument Missing Metrics

It’s very likely that as you go about this exercise you will discover that you are simply missing the required metrics to write such a query. This doesn’t mean you wrote a bad SLI definition. It means that your metrics observability needs work. Making these sorts of discoveries is one of the key benefits of going through this exercise.

The metrics needed for each of your SLIs will be either Probe-based or Telemetry-based. Probe-based metrics are produced by some external probe whose results are written back to your metrics aggregator, while Telemetry-based metrics are exported by your own code as it runs.
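Here is a small Python sketch of the two styles side by side. The in-memory `Counter` is only a stand-in for a real metrics aggregator, and the function and metric names are hypothetical; in practice you would use whatever client library your metrics system provides.

```python
from collections import Counter
from typing import Callable

# Stand-in for a metrics aggregator (an assumption for this sketch).
metrics: Counter = Counter()


def handle_request(handler: Callable[[], int]) -> int:
    """Telemetry-based: the service code records its own outcomes."""
    status = handler()
    metrics["requests_total"] += 1
    if status >= 500:
        metrics["requests_failed_total"] += 1
    return status


def run_probe(check: Callable[[], bool]) -> None:
    """Probe-based: an external checker exercises the system from the
    outside and writes the result back to the aggregator."""
    try:
        succeeded = check()
    except Exception:
        # A probe that cannot complete counts as a failed probe.
        succeeded = False
    metrics["probe_runs_total"] += 1
    if succeeded:
        metrics["probe_success_total"] += 1


# Simulated traffic and a single simulated probe run.
handle_request(lambda: 200)
handle_request(lambda: 503)
run_probe(lambda: True)
```

Telemetry reflects what real consumers experienced, while a probe keeps reporting even when there is no traffic, so the two tend to complement each other.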

Set SLOs

Now that you have the ability to write a query for an SLI, have a look at its historical behavior. This unfortunately won’t be possible if you only recently added the needed metrics. But since setting an SLO is always an iterative process, you can take an educated guess and adjust your SLO as you gain experience.

The SLO has both a threshold (such as 99.9%) against which the SLI is evaluated and a time window over which it is evaluated.

In the case of timesliced SLIs your SLO threshold will be the percentage of time that you expect the SLI to evaluate to healthy. The SLO window should be long enough to ensure that you have a significant number of timeslices included in your percentage rollup. Tuning these thresholds and windows will affect how your SLO evaluates, so you may need to make adjustments in the future if you find your SLO measurement isn’t as meaningful as you’d like.

In general I think a good place to start for most SLOs is:

5 Minute SLI Time Slices
30 day SLO rolling windows
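To get a feel for what those starting values imply, here is the arithmetic as a short Python sketch. The 99.9% threshold is just the example value used earlier, not a recommendation, and `Fraction` is used only to keep the numbers exact.

```python
import math
from fractions import Fraction

SLICE_MINUTES = 5
WINDOW_DAYS = 30
SLO_THRESHOLD_PERCENT = Fraction("99.9")  # example threshold, adjust to taste

# Number of 5 minute timeslices in a 30 day rolling window.
slices_per_window = WINDOW_DAYS * 24 * 60 // SLICE_MINUTES

# Share of those timeslices that may be unhealthy without missing the SLO.
budget_slices = (100 - SLO_THRESHOLD_PERCENT) / 100 * slices_per_window

# Only whole timeslices can actually be unhealthy.
whole_unhealthy_slices_allowed = math.floor(budget_slices)

# The same budget expressed as minutes.
budget_minutes = float(budget_slices * SLICE_MINUTES)
```

That works out to 8,640 timeslices per window and a budget of 8.64 timeslices (43.2 minutes), so in practice the ninth unhealthy timeslice in any 30 day window puts a 99.9% SLO in violation.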

Record SLIs/SLOs in an SLO Repository

Once you have your SLI and your SLO defined you should consider storing them in some place where they can be used as both a reference, and (ideally) configuration for your monitoring and dashboarding tools to automatically drive your dashboards and alerts.

This is a newer frontier in the SLO space, but a couple of ideas include:

Nobl9
- This aims to be a complete solution for SLOs, and acts as both an SLO repository as well as an abstraction layer for your monitoring that is specifically made to make your SLO monitoring and alerting easier.
Backstage
- This is a Service Catalog/Developer Portal that is rapidly gaining popularity in the industry. Currently there is an OpenSLO Project (which I help maintain) which is attempting to help Backstage natively understand the OpenSLO Format and provide dashboards which use your existing monitoring as a backend to see your SLO statuses within their Service/System Catalog.
Monitoring as Code
- If you use some sort of Monitoring as Code strategy, such as Terraform, you can leverage that to store your SLI/SLO definitions and have them directly drive your monitoring and dashboards.
Just store them in a spreadsheet
- This is a bit old fashioned, but it works.
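Whichever option you pick, the record you store tends to contain the same handful of fields. As a rough sketch, here is one way to model such a record in Python and serialize it to JSON. The field names are my own invention for illustration and do not follow the OpenSLO schema or any particular tool’s format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SloRecord:
    system: str
    sli_name: str
    sli_definition: str
    timeslice_minutes: int
    slo_threshold_percent: float
    slo_window_days: int


# Hypothetical record for the audit record system used in the earlier examples.
record = SloRecord(
    system="user-audit-records",
    sli_name="write-availability",
    sli_definition="95% of all requests returned a non-500 status over the last 5 minutes.",
    timeslice_minutes=5,
    slo_threshold_percent=99.9,
    slo_window_days=30,
)

# A plain JSON document can serve as both a reference and tool configuration.
serialized = json.dumps(asdict(record), indent=2)
```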

Set Up Monitoring

Each of your systems should have an SLO Dashboard. You will want something that displays:

The current status of each of your SLIs over the most recent timeslice.
The current value of your SLI rollup over the SLO’s window
Whether or not the SLI is meeting the SLO (as well as the SLA)
Optionally: The Error Budget Burn Rate (more on that in the next step)
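As a sketch of the data behind such a dashboard, the following Python computes the first three items from a history of timeslice results. The names and thresholds are hypothetical, and a real dashboard would pull these values from your monitoring backend rather than from an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class SloStatus:
    currently_healthy: bool  # status of the most recent timeslice
    rollup_percent: float    # SLI rollup over the SLO's window
    meeting_slo: bool
    meeting_sla: bool


def summarize(timeslices: list[bool], slo_percent: float, sla_percent: float) -> SloStatus:
    """Summarize a window of timeslice results (True means healthy)."""
    if not timeslices:
        raise ValueError("no timeslices recorded yet")
    rollup = 100.0 * sum(timeslices) / len(timeslices)
    return SloStatus(
        currently_healthy=timeslices[-1],
        rollup_percent=rollup,
        meeting_slo=rollup >= slo_percent,
        meeting_sla=rollup >= sla_percent,
    )


# Hypothetical window: 997 healthy timeslices followed by 3 unhealthy ones.
history = [True] * 997 + [False] * 3
status = summarize(history, slo_percent=99.9, sla_percent=99.5)
```

In this example the system is currently unhealthy and has fallen below its SLO, but it is still within its SLA.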

Set Up Alerting

While it is possible to alert whenever an SLI is currently failing or when the SLI fails to meet the SLO threshold, the former would likely be rather noisy and the latter would alert you later than you probably want.

The balance between these two is Error Budget Burn Rate Alerting. Follow the link for a deeper dive into how these work, but in short: they provide a means to alert when you are trending towards a likely SLO violation prior to it actually happening.

In the Service Level Observability model, your Error Budget is simply the inverse of your SLO’s lower boundary/threshold: 100 - 99.9 = 0.1 for an SLO threshold of 99.9. By calculating the rate at which you are consuming this Error Budget you can see if you are trending towards a likely violation of your SLO and respond quicker.
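A minimal Python sketch of that calculation for timesliced SLIs might look like the following. The function names, lookback sizes, and the alert threshold are all illustrative assumptions; your monitoring tool may compute burn rate differently.

```python
def burn_rate(unhealthy_slices: int, total_slices: int, slo_threshold_percent: float) -> float:
    """How fast the Error Budget is being consumed over a lookback period.

    A value of 1.0 means the budget would be used up exactly at the end of
    the SLO window; anything higher means it will run out early.
    """
    budget_fraction = (100.0 - slo_threshold_percent) / 100.0
    observed_unhealthy_fraction = unhealthy_slices / total_slices
    return observed_unhealthy_fraction / budget_fraction


def should_alert(rate: float, threshold: float) -> bool:
    return rate >= threshold


# Hypothetical SLO of 99% with 5 minute timeslices.
# Long lookback: 6 hours is 72 timeslices, 3 of which were unhealthy.
long_rate = burn_rate(unhealthy_slices=3, total_slices=72, slo_threshold_percent=99.0)

# Short lookback: 1 hour is 12 timeslices, 2 of which were unhealthy.
short_rate = burn_rate(unhealthy_slices=2, total_slices=12, slo_threshold_percent=99.0)

# Illustrative threshold only; tune it to how much budget you can afford to lose.
page_on_call = should_alert(short_rate, threshold=10.0)
```

Here the long lookback burns budget at roughly 4 times the sustainable rate and the short lookback at roughly 17 times, which is the kind of trend you want to hear about before the SLO itself is violated.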

Define/Refine SLAs

After you’ve had some time to live with a given SLO and are confident that you’ve both defined and instrumented it well, it’s time to set an SLA.

While an SLO (and especially its error budget burn rate) should tell you when your system is having issues, the SLA is a value given to your consumers as a guaranteed threshold you will not dip below.

For internal consumers this acts as a sort of promise of reliability that they can trust as they build systems that depend upon yours. But no SLA should be violated without consequence. So there should be an agreement (ideally across the organization) that failure to meet an SLA will require an immediate diversion of resources to address that failure and ensure it doesn’t happen again.

However, not all internal systems need to have an SLA. The lack of an SLA is sometimes called a “Best Effort” SLA, which means there should be assumed to be no guarantees made about a system’s reliability.

Iterate

Invariably, you will get things wrong when you first attempt to implement Service Level Observability, and that’s OK. Along the way you’ll have learned and discovered new things about your system and can now go back and refine anything from a System Boundary, to a Timeslice size, or an SLO threshold.