
SETH VARGO: Hi there. And welcome to our second video in the series on SRE and DevOps. My name is Seth, and I'm a developer advocate at Google focused on infrastructure and operations.

LIZ FONG-JONES: And hi. I'm Liz, a site reliability engineer, or SRE, at Google, where I teach Google Cloud customers how to build and operate reliable services. So Seth, have you ever run into the problem where you're trying to talk about reliability and availability, and you get so many different definitions it makes your head want to spin?

SETH VARGO: It happens all the time.

LIZ FONG-JONES: And even worse, have you ever run into the situation where the developers are trying to push out new features, and they're breaking things more and more, and they just won't listen?

SETH VARGO: It's like you're in my head. Have we worked together before? My team is constantly putting out fires, many of which end up being bugs in the developers' code. But when we try to push back on the product teams to focus on reliability, they don't agree that reliability is an issue. It's the classic DevOps problem. Do you have any recommendations?

LIZ FONG-JONES: Yeah. This is a really common problem in the relationship between product developers and operators. And it's a problem that Google was really worried about in the early 2000s, when we were first building out Google web search. That's when we started defining the SRE discipline, and it's been a work in progress that we've been improving ever since.

SETH VARGO: Wow. Google had these same problems in the early 2000s? I had no idea. But I still don't understand. How can SRE help solve this apparently very common problem?

LIZ FONG-JONES: So SRE tries to solve this problem in three different ways. First, we define what availability is. Second, we define what an appropriate level of availability is. And third, we get everyone on the same page about what we're going to do if we fail to live up to those standards. We communicate this across the whole organization, from product developers to SREs, and from individual contributors all the way up to vice presidents. That way, we have a shared sense of responsibility for the service and for what we're going to do if we need to slow down. We do that by defining service level objectives in collaboration with the product owners. And by agreeing on these metrics in advance, we make sure there's less chance of confusion and conflict in the future.

SETH VARGO: OK, so an SLO is just an agreement among stakeholders about how reliable a service should be. But shouldn't services just always be 100% reliable?

LIZ FONG-JONES: The problem is that the cost and technical complexity of making a service more reliable climb higher and higher the closer to 100% you try to get. Every application has a unique set of requirements that dictate how reliable it has to be before customers no longer notice the difference. And stopping there means we can make sure we have enough room for error and enough room to roll out features reliably.

SETH VARGO: I see. We should probably do another video where we talk about why 100% availability isn't a real target. OK, Liz, I'm ready. I've decided that I want my service to be 99.9% reliable. So where do I get started? Do I use Vim, Emacs, Nano? What do I do?

LIZ FONG-JONES: So I'm a Nano user. But first, you really have to define what availability is, in addition to defining how available you want to be. We need to make sure that we understand what availability means in the [INAUDIBLE] server service, and that we have clear numerical indicators for defining that availability. The way we go about doing that is by defining not just service level objectives, but service level indicators, or SLIs. SLIs are most often metrics over time, such as request latency, throughput of requests per second in the case of a batch system, or failures per total number of requests. They're usually aggregated over time, and we typically apply a function like a percentile, such as the 99th percentile or the median. That way, we can get to a concrete threshold that lets us say whether a single number is good or bad. So for instance, a good service level indicator might be: is the 99th percentile latency of requests received in the past five minutes less than 300 milliseconds? Or alternatively, another service level indicator might be: is the ratio of errors to total requests received in the past five minutes less than 1%?
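The two example SLIs Liz describes can be sketched in a few lines of Python. The request records here are hypothetical, and the percentile uses a simple nearest-rank approximation; a real service would pull these numbers from a monitoring pipeline rather than compute them in-process:

```python
# Hypothetical request records from the past five minutes:
# (latency in milliseconds, did the request fail?).
requests = [(120, False), (250, False), (310, True), (90, False), (180, False)]

latencies = sorted(latency for latency, _ in requests)
failures = sum(1 for _, failed in requests if failed)

# SLI 1: is the 99th percentile latency under 300 ms?
# Nearest-rank approximation of the 99th percentile.
p99 = latencies[int(0.99 * (len(latencies) - 1))]
latency_sli_ok = p99 < 300

# SLI 2: is the ratio of errors to total requests under 1%?
error_ratio = failures / len(requests)
error_sli_ok = error_ratio < 0.01

print(p99, latency_sli_ok)        # 250 True
print(error_ratio, error_sli_ok)  # 0.2 False
```

Both indicators reduce a window of raw measurements to a single number compared against a threshold, which is exactly what makes them usable as SLO inputs later.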

SETH VARGO: OK. Thank you for explaining that. It's much clearer now. But how does that SLI become an SLO?

LIZ FONG-JONES: So if you think back to your calculus lessons, Seth, and I know this may have been a while ago: a service level indicator tells you at any moment in time whether the service was available or down. What we need to do is add all that up, or integrate it, over a longer period of time, like a year in your example of 99.9% over a year, to see whether the total amount of downtime we've had is more or less than the roughly nine hours you were worried about.
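The arithmetic behind that nine-hour figure is simple to check:

```python
# Downtime budget implied by a 99.9% availability SLO over one year.
slo = 0.999
hours_per_year = 365 * 24  # 8760 hours

downtime_budget_hours = (1 - slo) * hours_per_year
print(round(downtime_budget_hours, 2))  # 8.76 hours, i.e. roughly nine hours
```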

SETH VARGO: But you should always beat your SLO, right?

LIZ FONG-JONES: No. The thing is that SLOs are both upper and lower bounds, and this is for two reasons. First, if you try to make your service much more reliable than it needs to be, you're slowing down the release of features that might make your customers happier than an extra femtosecond of uptime would. And second, it's an expectation that you're setting for your users: if you suddenly start breaking a lot more often than they're used to, because you start running exactly at your SLO rather than doing much better than it, then your users will be unhappily surprised, especially if they're trying to build other services on top of yours.

SETH VARGO: OK. So this is all starting to make a little bit of sense now. But what is an SLA, then? There are so many SL-something things. And I remember at a previous job, I signed some kind of SLA. What did I do?

LIZ FONG-JONES: So to spell it out first, an SLA is a service level agreement. What it does is say: here's what I'm going to do if I don't meet the level of reliability that's expected. It's more of a commercial agreement that describes what remediation you're going to take if your service is out of spec according to the contract.

SETH VARGO: I see. So the SLA is like a business agreement associated with an SLO. So they're exactly the same, right?

LIZ FONG-JONES: Not quite, because you really want to make your SLA more lenient than your SLO, so that you get early warning before you have to do things like field angry phone calls from customers or pay them lots of money for failing to deliver the services promised. As SREs, we rarely work with SLAs. Instead, we focus on meeting our SLOs, with the understanding that sales teams and business teams will think more about the SLAs they build on top of our SLOs.
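That buffer between the two targets can be sketched with hypothetical numbers: say a stricter internal 99.9% SLO and a more lenient contractual 99.5% SLA (both figures are illustrative assumptions, not figures from the video):

```python
# Hypothetical targets: the internal SLO is stricter than the
# contractual SLA, so an SLO breach warns us well before SLA
# penalties come into play.
internal_slo = 0.999  # assumed internal objective
external_sla = 0.995  # assumed contractual commitment

minutes_per_month = 30 * 24 * 60  # 43200, using a 30-day month

slo_budget = (1 - internal_slo) * minutes_per_month
sla_budget = (1 - external_sla) * minutes_per_month

# The gap between the two budgets is the early-warning margin.
margin = sla_budget - slo_budget
print(round(slo_budget, 1), round(sla_budget), round(margin, 1))  # 43.2 216 172.8
```

Burning through the ~43-minute SLO budget triggers internal alarms and a slowdown, while the contract still has well over two hours of headroom.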

SETH VARGO: I see. So SLAs describe the set of services and availability promises that a provider is willing to make to a customer, and the ramifications of failing to deliver on those promises. Those ramifications might be things like money back or free credits for failing to deliver the service availability.

LIZ FONG-JONES: Yes, that's exactly correct.

SETH VARGO: So to summarize: SLIs are service level indicators, or metrics over time, which inform us about the health of a service. SLOs are service level objectives, which are agreed-upon bounds for how often those SLIs must be met. And finally, SLAs are business-level agreements, which define the service availability for a customer and the penalties for failing to deliver that availability.

LIZ FONG-JONES: Exactly. So SLIs, SLOs, and SLAs hark back very much to the DevOps principle that measurement is critical, and that the easiest way to break down organizational barriers is to have a common language about what it means to be available. With SLIs, we give a well-defined numerical measurement of what that is. And with SLOs, we collaborate between the product owners and the SREs to make sure the service is running at an appropriate level of reliability for customers. It's a lot clearer to me now why we say "class SRE implements DevOps."

SETH VARGO: Awesome. Thanks, Liz. And thank you for watching. Be sure to check out the description below for more information. And don't forget to subscribe to the channel and stay tuned for our next video, where we discuss risk and error budgets.

