Alberto Peña Abril

Service level objectives

June 2025

Service level objectives, also known as SLOs, work as a bridge between the technical and business mindset of the organisation. They help balance the reliability and innovation of systems.

They help you track the performance of a service against agreed upon goals, making everyone in the organisation aware of the status of their critical systems.

They also help managing expectations with customers and stakeholders, as they can reduce the number of unfounded complaints.

They are essential for observability because they define what good performance means for a particular service.

With that in mind, let’s get on with it!

Why are SLOs effective

SLOs keep a balance between innovation and reliability, while keeping everyone in the organisation aware of the status of critical services.

Balance

A system that’s 100% reliable is a rigid one. Changes to it are costly and slow. For some particular applications, like regulated medical devices, this might be the desire goal, but for the standard web application, it is not. You want to be able to iterate fast to gain an advantage in the market, and that means that you need to create some space for mistakes.

That’s exactly what service level objectives give you. SLOs have an error budget that gives the teams permission to make mistakes. Everyone involved in the system, from developers to the CEO can have a say in how big that error budget is, and nobody is surprise when the error budget is used.

Awareness

Once your organisation decides what SLOs it needs, you can get a precise understanding of how your services are behaving.

This help in many ways, from deciding investment if an SLO needs improvement, to preventing misunderstandings on when a service is performing well or not.

How can you implement SLOs

SLOs have three basic components:

Service level indicator (SLI)

An SLI is a metric. As simple as that.

They tell you what’s happening in your system and you don’t have direct control over them.

Some examples can be latency, error rate, availability, etc.

Target value

Once you have your SLI, you want to agree on a target value. As explained before, if you set a target value of 100%, you are freezing your development, so you need to set a realistic target value that allows space for risk. That space for risk is called an error budget, and it’s the result of subtracting your target value from 100%.

Time window

You want to measure your SLOs for a meaningful period. That meaningful period is up to you and your organisation though. It all depends on how quickly you’re allow to make decisions. The shorter the time window, the more risk you can take if your system is reliable. If you chose one week as the window, and you burnt your error rate last week, you will be able to accept risk again this week when the window resets. The opposite is true for longer windows, since the reset of your error budget will take longer.

This is were the balance between reliability and innovation comes in.

If you want to follow the standard, 28 days usually works well.

One last point on this, it’s better to have rolling windows instead of hard re-starts. You don’t want to forget about your previous errors just because you are entering a new cycle.

Setting up alerts

One of the goals of having SLOs is to know when things go wrong. The standard recommendation is to create different alerts when the error budget is being burnt at a slow and fast pace.

This alerts will help your team to be proactive and identify the issues that are spending the error budget.

Conclusion

SLOs are a great tool to help organisations balance reliability and innovation, and they’re also great in aligning different parts of the business on what a good service is from the reliability point of view.

That’s all!