SETH VARGO: Hi there.
And welcome to the third video in our series
on SRE and DevOps.
My name is Seth, and I'm a Developer Advocate
at Google focused on infrastructure and operations.
LIZ FONG-JONES: And I'm Liz, an SRE focused
on teaching Google Cloud customers how to build
and operate reliable services.
SETH VARGO: In the previous video,
we discussed the differences between SLIs, SLOs, and SLAs.
SLIs are quantitative measurements, like latency.
SLOs are the amount of time that an SLI
can be out of specification.
And SLAs are business agreements with explicit consequences
for failing to deliver service.
But what stops teams from breaking their agreed-upon SLOs
and forcing SREs to work overtime?
It seems like a classic DevOps problem where
product teams want to ship new features,
but SREs need to maintain reliability.
Is there anything in the SRE program
that can help with this classic problem?
LIZ FONG-JONES: That's what we use error budgets for.
SETH VARGO: Error budgets?
LIZ FONG-JONES: Well, before we talk about error budgets,
let's talk about risk and availability.
As I mentioned in the previous episode, trying to go for 100%
availability just isn't a good idea because it's expensive.
It's technically complex.
And in a lot of cases, it winds up being the case
that users don't even see the benefits of it because
of unreliability somewhere else in the system.
SETH VARGO: I see.
That makes a lot of sense.
Because if my cellular network is only 99% reliable,
but my service is 99.9% reliable,
my users are never going to experience that additional 0.9%
of reliability because their cellular network
is likely to fail before my service does.
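[Illustration, not part of the spoken transcript.] Seth's arithmetic can be sketched in a few lines of Python. The sketch assumes that a request needs both the network and the service to work and that the two fail independently, which is a simplification. The function name is hypothetical.

```python
def end_to_end_availability(*availabilities: float) -> float:
    """Availability a user sees when every component in the chain must work.

    Assumes component failures are independent (a simplification).
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


network = 0.99   # cellular network, from the example above
service = 0.999  # the service itself

# The user-visible figure stays close to the weakest link in the chain.
print(f"{end_to_end_availability(network, service):.4%}")
```

Under these assumptions the user sees roughly 98.9% availability, so the service's extra reliability beyond the network's 99% is largely invisible to them.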
LIZ FONG-JONES: Yes.
That's exactly correct.
So while we want to reduce the risk of system failures,
we have to accept some degree of risk
in order to deliver these products and features.
SETH VARGO: But how do we determine
how much risk a service is willing to tolerate?
LIZ FONG-JONES: So that's a product decision.
So we have to work with the product management team
to figure out what is our explicit goal
for the availability target of our service.
And there are many things to think about,
like how much is it going to cost to add extra fault
tolerance, or to add extra testing time,
or to reduce our frequency of pushes,
or to increase how long it takes for us to decide
that a release is good compared to the benefits to the user
of increased reliability.
SETH VARGO: I see.
So the acceptable risk of a system dictates the SLO,
and the SLO mathematically defines the error budget.
If the service incurs too much downtime,
we have to reduce the risk to remain within the SLO, which
might mean halting deployments.
If service owners want to deliver
a lot of risky features, they have
to be willing to accept a much looser SLO.
Because if they were to choose a strict SLO,
they would quickly exceed their error budget, which
could halt future deployments.
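[Illustration, not part of the spoken transcript.] One way to see how an SLO "mathematically defines" the error budget is to convert an availability target into allowed downtime. This is a minimal sketch that assumes a time-based SLO over a 30-day window; the function name and the window length are assumptions, not something stated in the video.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability SLO over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


# A stricter SLO leaves far less room for risky releases.
for slo in (0.99, 0.999, 0.9999):
    print(f"SLO {slo:.2%}: {error_budget_minutes(slo):.1f} minutes of budget")
```

With these assumptions a 99.9% target allows about 43 minutes of downtime in 30 days, while a 99% target allows about 7 hours, which is why a team that wants to ship risky features needs the looser SLO.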
LIZ FONG-JONES: Exactly.
So the main benefit of an error budget
is that it's a quantitative measurement that's
shared between the product and SRE teams, which
means that we can balance innovation and stability
to an appropriate level.
SETH VARGO: So as long as the SLOs are met,
releases can continue.
But how do we know if an SLO breach is about to occur?
LIZ FONG-JONES: So when we defined earlier the expectation
of how much uptime a service is going to have
and how we're going to measure it,
well, we need to actually concretely implement
that using a neutral third party,
like a monitoring system.
SETH VARGO: Well, and the metrics
on that monitoring system, those are the SLIs, right?
LIZ FONG-JONES: Exactly.
And the difference between the actual uptime
and the calculated target uptime from our SLO
is the budget of how much unavailability
that we can tolerate for the system
to be stable over the entire window of the SLO.
So we call this the error budget.
If your SLIs are failing all the time,
then you're going to be burning through your error budget.
And then eventually, you need to stop your feature releases
in order to focus instead on making reliability improvements
and restructuring your application so that it can
meet your SLOs in the future.
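[Illustration, not part of the spoken transcript.] The policy Liz describes can be sketched as a simple check against the monitoring data. This assumes a request-based SLI (failed requests out of total requests in the SLO window) and a policy that pauses releases once the budget is fully spent. All names here are hypothetical, and real policies are usually more nuanced.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float   # failures the SLO permits in this window
    consumed_fraction: float  # share of the budget already spent
    releases_allowed: bool    # False once the budget is exhausted


def budget_status(slo: float, total_requests: int, failed_requests: int) -> BudgetStatus:
    """Compare observed failures with the failures the SLO allows."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed = (1.0 - slo) * total_requests
    consumed = failed_requests / allowed
    return BudgetStatus(allowed, consumed, consumed < 1.0)


status = budget_status(slo=0.999, total_requests=1_000_000, failed_requests=400)
print(f"budget consumed: {status.consumed_fraction:.0%}, "
      f"releases allowed: {status.releases_allowed}")
```

In this example the service has spent about 40% of its budget, so feature releases continue. At 1,000 or more failures the same check would tell the team to switch to reliability work.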
SETH VARGO: So who enforces those policies, though?
Because couldn't a product team just go over and break the SLO
and force the SREs to work overtime?
LIZ FONG-JONES: So this is why we
need to have executive buy-in for error budgets.
If the SRE teams don't have the ability
to enforce the error budgets, then the whole system
is going to break down.
So some teams just allow for a limited number
of tokens or silver bullets that you can hand out to a vice
president, for example.
So if a product team really wants
to get that critical feature out,
well, they're going to have to ask their vice
president for a one-time exception,
and they'll only get a certain number per year.
SETH VARGO: I see.
But what about things that are outside of the product team
that aren't necessarily buggy code for my developers,
like someone cuts an undersea cable,
or there's a catastrophic failure at a data center?
Those shouldn't impact my error budget.
It wasn't my fault.
LIZ FONG-JONES: So this is why it's
important to have error budgets from top to bottom
for everything in your stack.
That way you can figure out how much error budget you allocate
to your dependencies and how much error budget is reserved
for your developers to spend.
And this is another reason why targeting 100% availability
isn't realistic, because all of your dependencies
are not 100% available either.
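[Illustration, not part of the spoken transcript.] Splitting a budget between dependencies and developers can be sketched as follows. It assumes every dependency is a hard dependency (the service fails whenever one of them fails) and uses the approximation that small unavailabilities add up. The function name and the example figures are hypothetical.

```python
from typing import Sequence


def developer_budget(service_slo: float, dependency_slos: Sequence[float]) -> float:
    """Unavailability left for the team's own changes after dependencies.

    Uses the additive approximation for small unavailabilities. A negative
    result means the dependencies alone already exceed the service's budget.
    """
    total_budget = 1.0 - service_slo
    dependency_budget = sum(1.0 - slo for slo in dependency_slos)
    return total_budget - dependency_budget


# Hypothetical example: a 99.9% service on top of a 99.95% and a 99.99% dependency.
remaining = developer_budget(0.999, [0.9995, 0.9999])
print(f"budget left for developers: {remaining:.4%}")
```

Here the two dependencies take up about 0.06 of the 0.1 percentage points available, leaving roughly 0.04 for the developers. Setting the service SLO to 100% would leave a negative remainder as soon as any dependency is less than perfect.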
SETH VARGO: That makes a lot of sense.
But what about other things like restarting a failed service
or other kinds of manual tasks?
Are those considered part of the error budget?
LIZ FONG-JONES: Yeah.
So Seth, when you have to do manual action
to keep your system from failing, in the wait
before you actually do that manual action,
you'll start burning through your error budget.
But the actual act of doing that manual work,
we track that separately.
And that's a concept that we call toil.
So we'll talk about that more in detail in the next video.
SETH VARGO: Great.
So risk and error budgets are directly related
to many of the DevOps principles that we've
discussed in earlier episodes.
It clearly defines that accidents
are normal by quantifying accidents
and risk through error budgets.
It also enforces that change should be gradual
because a non-gradual change could quickly
burn through the error budget for a particular product,
breaking the SLO and preventing further deployment
for the quarter or for the year.
This has really helped a lot.
I think it's really clear why we say that class SRE implements
DevOps.
LIZ FONG-JONES: Thanks, everyone, for watching.
Check the description below for links and more information.
Don't forget to subscribe to the channel
and stay tuned for our next video
where we talk about toil budgets.