SETH VARGO: Hi there.
And welcome to our second video in the series
on SRE and DevOps.
My name is Seth, and I'm a developer advocate
at Google focused on infrastructure and operations.
LIZ FONG-JONES: And hi.
I'm Liz, a site reliability engineer, or SRE at Google,
where I teach Google Cloud customers
how to build and operate reliable services.
So Seth, have you ever run into the problem
where you're trying to talk about reliability
and availability, and you get so many different definitions it
makes your head want to spin?
SETH VARGO: It happens all the time.
LIZ FONG-JONES: And even worse, have
you ever run into the situation where the developers are trying
to push out new features and they're
breaking things more and more, and they just won't listen?
SETH VARGO: It's like you're in my head.
Have we worked together before?
My team is constantly putting out fires, many of which
end up being bugs in the developers' code.
But when we try to push back on the product teams
to focus on reliability, they don't agree
that reliability is an issue.
It's the classic DevOps problem.
Do you have any recommendations?
LIZ FONG-JONES: Yeah.
This is a really common problem in the relationship
between product developers and operators.
And it's a problem that Google was really worried
about in the early 2000s when we were first
building out Google web search.
And this is when we started defining the SRE discipline.
And it's really been a work in progress
that we've been improving on ever since.
SETH VARGO: Wow.
Google had these same problems in the early 2000s?
I had no idea.
But I still don't understand.
How can SRE help solve this apparently
very common problem?
LIZ FONG-JONES: So SRE tries to solve this problem in three
different ways.
First of all, we try to define what availability is.
Secondly, we try to define what an appropriate level
of availability is.
And third, we get everyone on the same page
about what we are going to do if we fail
to live up to those standards.
And we try to communicate this across the whole organization,
from product developers to SREs, and all
the way from individual contributors all the way up
to vice presidents.
That way, we have a shared sense of responsibility
for the service and what we're going
to do if we need to slow down.
And we do that by defining service level objectives
in collaboration with the product owners.
And by agreeing on these metrics in advance,
we make sure that there's less of a chance of confusion
and conflict in the future.
SETH VARGO: OK, so an SLO is just an agreement
among stakeholders about how reliable a service should be.
But shouldn't services just always be 100% reliable?
LIZ FONG-JONES: So the problem is
that the cost and the technical complexity of making services
more reliable gets higher and higher the closer to 100%
you try to get.
It winds up being the case that every application
has a unique set of requirements that dictate how reliable it
has to be before customers no longer notice
the difference.
And that means that we can make sure
that we have enough room for error and enough room
to roll out features reliably.
SETH VARGO: I see.
We should probably do another video
where we talk about why 100% availability isn't
a real target.
OK, Liz, I'm ready.
I've decided that I want my service to be 99.9% reliable.
So where do I get started?
Do I use Vim, Emacs, Nano?
What do I do?
LIZ FONG-JONES: So I'm a Nano user.
But first, you really have to define
what availability is in addition to defining
how available you want to be.
We need to make sure that we understand what availability
is in the [INAUDIBLE] server service,
and that we have clear numerical indicators for defining
that availability.
And the way that we go about doing that
is by defining not just service level objectives, but service
level indicators, or SLIs.
So SLIs are most often metrics over time,
such as request latency, the throughput
of requests per second in the case of a batch system,
or failures per total number of requests.
They're usually aggregated over time,
and we typically apply an aggregation function like a percentile,
such as the 99th percentile or the median.
And that way, we can get to a concrete threshold which
we can define to say, is this single number good or bad?
So for instance, a good service level indicator
might be saying, is the 99th percentile latency of requests
received in the past five minutes
less than 300 milliseconds?
Or alternatively, another service level indicator
might be, is the ratio of errors to total requests
received in the past five minutes less than 1%?
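The two indicators Liz describes can be sketched as a short check over a window of request samples. This is a minimal illustration, not code from the video; the `evaluate_slis` helper, its thresholds, and the sample data are all hypothetical:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a list of values."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered)) - 1
    return ordered[max(rank, 0)]

def evaluate_slis(requests, latency_threshold_ms=300, error_threshold=0.01):
    """Evaluate the two example SLIs over a window of (latency_ms, ok) samples."""
    latencies = [lat for lat, _ in requests]
    error_ratio = sum(1 for _, ok in requests if not ok) / len(requests)
    return {
        # SLI 1: is the 99th percentile latency under 300 ms?
        "p99_latency_ok": percentile(latencies, 99) < latency_threshold_ms,
        # SLI 2: is the ratio of errors to total requests under 1%?
        "error_ratio_ok": error_ratio < error_threshold,
    }

# 200 requests from the past five minutes: one failure, two slow responses.
window = [(100, True)] * 197 + [(350, True), (350, True), (280, False)]
print(evaluate_slis(window))  # both SLIs pass for this window
```

Each call reduces a window of raw measurements to yes/no answers, which is what makes an SLI usable as the building block for an SLO.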
SETH VARGO: OK.
Thank you for explaining that.
It's much clearer now.
But how does that SLI become an SLO?
LIZ FONG-JONES: So if you think back to your calculus lesson,
Seth--
I know this may have been a while ago.
When you have a service level indicator,
it says at any moment in time whether the service
was available or whether it was down.
So what we need to do is we need to add all that up
or integrate it over a longer period of time--
like a year, in your example of 99.9% over a year--
to see, is the total amount of downtime
that we've had more or less than the nine hours
that you were worried about?
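Liz's back-of-the-envelope integration can be written out as simple arithmetic: a 99.9% availability target over a year leaves roughly 8.76 hours of allowed downtime. A minimal sketch (the function name and the measured downtime figure are hypothetical, not from the video):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

def allowed_downtime_hours(slo, period_hours=HOURS_PER_YEAR):
    """Downtime budget implied by an availability SLO over a period."""
    return (1 - slo) * period_hours

budget = allowed_downtime_hours(0.999)  # 0.1% of a year, about 8.76 hours
measured_downtime = 6.5                 # hypothetical total from summing up the SLI
print(measured_downtime <= budget)      # True: within budget, so the SLO is met
```

The same function works for any window, e.g. a 28-day rolling period, by passing a different `period_hours`.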
SETH VARGO: But you should always beat your SLO, right?
LIZ FONG-JONES: No.
So the thing is that SLOs are both upper and lower bounds.
So this is for two reasons.
First of all, the fact is that if you
try to run your service much more reliably
than it needs to be, you're slowing down
the release of features that you might want to get out
that would make your customers happier
than having an extra femtosecond of uptime.
And then secondly, it sets an expectation
for your users--
that if you suddenly start breaking a lot
more often than they're used to because you start
running exactly at your SLO rather than doing
much better than your SLO, then your users
will be unhappily surprised if they're
trying to build other services on top of yours.
SETH VARGO: OK.
So this is all starting to make a little bit of sense now.
But what is an SLA then?
There are so many SL letter things.
And I remember at a previous job, I signed an SLA something.
What did I do?
LIZ FONG-JONES: So to spell it out first,
an SLA is a service level agreement.
And what it does is it says, here's
what I am going to do if I don't meet the level of reliability
that is expected.
It's more of a commercial agreement that
describes what remediation you're
going to take if your service is out of spec
according to the contract.
SETH VARGO: I see.
So the SLA is like a business agreement
associated with an SLO.
So they're exactly the same, right?
LIZ FONG-JONES: Not quite, because you really
want to make your SLA more lenient than your SLO.
So you get early warning before you
have to do things like field angry phone calls
from customers or have to pay them
lots of money for failing to deliver the services promised.
As SREs, we rarely work with SLAs.
And instead, we focus on meeting our SLOs with the understanding
that sales teams and business teams will think more
about the SLAs they build on top of our SLOs.
SETH VARGO: I see.
So SLAs describe the set of services and availability
promises that a provider is willing to make to a customer,
and then the ramifications associated
with failing to deliver on those promises.
Those ramifications might be things like money back
or free credits for failing to deliver the service
availability.
LIZ FONG-JONES: Yes, that's exactly correct.
SETH VARGO: So to summarize, SLIs
are service level indicators or metrics
over time, which inform about the health of a service.
SLOs are service level objectives,
which are agreed-upon bounds for how often those SLIs must
be met.
And finally, SLAs are business level agreements,
which define the service availability
for a customer and the penalties for failing
to deliver that availability.
LIZ FONG-JONES: Exactly.
So SLIs, SLOs, and SLAs hearken very much
to the DevOps principle that measurement is critical,
and that the easiest way to break down
the organizational barriers is to have
common language about what it means to be available.
And we give, with SLIs, a very well-defined numerical
measurement for what that is.
And with the SLOs, we collaborate between the product
owners and the SREs in order to make sure
that the service is running at an appropriate level
of reliability for customers.
It's a lot clearer to me now why we say
class SRE implements DevOps.
SETH VARGO: Awesome.
Thanks, Liz.
And thank you for watching.
Be sure to check out the description below
for more information.
And don't forget to subscribe to the channel
and stay tuned for our next video, where we
discuss risk and error budgets.