SETH VARGO: Hi there. And welcome to our second video in the series on SRE and DevOps. My name is Seth, and I'm a developer advocate at Google focused on infrastructure and operations. LIZ FONG-JONES: And hi. I'm Liz, a site reliability engineer, or SRE, at Google, where I teach Google Cloud customers how to build and operate reliable services. So Seth, have you ever run into the problem where you're trying to talk about reliability and availability, and you get so many different definitions it makes your head want to spin? SETH VARGO: It happens all the time. LIZ FONG-JONES: And even worse, have you ever run into the situation where the developers are trying to push out new features and they're breaking things more and more, and they just won't listen? SETH VARGO: It's like you're in my head. Have we worked together before? My team is constantly putting out fires, many of which turn out to be bugs in the developers' code. But when we try to push back on the product teams to focus on reliability, they don't agree that reliability is an issue. It's the classic DevOps problem. Do you have any recommendations? LIZ FONG-JONES: Yeah. This is a really common problem in the relationship between product developers and operators. And it's a problem that Google was really worried about in the early 2000s, when we were first building out Google web search. And this is when we started defining the SRE discipline. And it's been a work in progress that we've been improving on ever since. SETH VARGO: Wow. Google had these same problems in the early 2000s? I had no idea. But I still don't understand. How can SRE help solve this apparently very common problem? LIZ FONG-JONES: So SRE tries to solve this problem in three different ways. First of all, we try to define what availability is. Secondly, we try to define what an appropriate level of availability is.
And third, we get everyone on the same page about what we are going to do if we fail to live up to those standards. And we try to communicate this across the whole organization, from product developers to SREs, and from individual contributors all the way up to vice presidents. That way, we have a shared sense of responsibility for the service and for what we're going to do if we need to slow down. And we do that by defining service level objectives in collaboration with the product owners. And by agreeing on these metrics in advance, we make sure that there's less of a chance of confusion and conflict in the future. SETH VARGO: OK, so an SLO is just an agreement among stakeholders about how reliable a service should be. But shouldn't services just always be 100% reliable? LIZ FONG-JONES: So the problem is that the cost and the technical complexity of making services more reliable get higher and higher the closer to 100% you try to get. Every application has a unique set of requirements that dictate how reliable it has to be before customers no longer notice the difference. And that means that we can make sure that we have enough room for error and enough room to roll out features reliably. SETH VARGO: I see. We should probably do another video where we talk about why 100% availability isn't a real target. OK, Liz, I'm ready. I've decided that I want my service to be 99.9% reliable. So where do I get started? Do I use Vim, Emacs, Nano? What do I do? LIZ FONG-JONES: So I'm a Nano user. But first, you really have to define what availability is, in addition to defining how available you want to be. We need to make sure that we understand what availability means in the [INAUDIBLE] server service, and that we have clear numerical indicators for defining that availability. And the way that we go about doing that is by defining not just service level objectives, but service level indicators, or SLIs.
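As a back-of-the-envelope illustration of the 99.9% target Seth picks, the arithmetic works out to roughly nine hours of allowed downtime per year. (This sketch and its function name are ours, not from the video.)

```python
# Error-budget arithmetic for an availability target: the downtime an
# SLO permits over a window is simply (1 - target) * window length.

HOURS_PER_YEAR = 365.25 * 24  # average year, including leap years

def allowed_downtime_hours(target: float, window_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime an availability target permits over the window."""
    return (1.0 - target) * window_hours

print(round(allowed_downtime_hours(0.999), 2))  # about 8.77 hours per year
```

The same formula scales down to any window, e.g. a 30-day rolling window, which is why SLO windows matter as much as the target itself.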
So SLIs are most often metrics over time, such as request latency, the throughput of requests per second in the case of a batch system, or failures per total number of requests. They're usually aggregated over time, and we typically apply a function such as a percentile, like the 99th percentile, or a median. And that way, we can get to a concrete threshold which we can use to say, is this single number good or bad? So for instance, a good service level indicator might be, is the 99th percentile latency of requests received in the past five minutes less than 300 milliseconds? Or alternatively, another service level indicator might be, is the ratio of errors to total requests received in the past five minutes less than 1%? SETH VARGO: OK. Thank you for explaining that. It's much clearer now. But how does that SLI become an SLO? LIZ FONG-JONES: So if you think back to your calculus lessons, Seth-- I know this may have been a while ago. When you have a service level indicator, it says at any moment in time whether the service was available or whether it was down. So what we need to do is add all that up, or integrate it, over a longer period of time-- like a year, in your example of 99.9% over a year-- to see, is the total amount of downtime that we've had more or less than the nine hours that you were worried about? SETH VARGO: But you should always beat your SLO, right? LIZ FONG-JONES: No. So the thing is that SLOs are both upper and lower bounds. And this is for two reasons. First of all, if you try to run your service much more reliably than it needs to be, you're slowing down the release of features that you might want to get out-- features that would make your customers happier than having an extra femtosecond of uptime.
And then secondly, it's an expectation that you're setting for your users-- if you suddenly start breaking a lot more often than they're used to, because you start running exactly at your SLO rather than doing much better than your SLO, then your users will be unhappily surprised if they're trying to build other services on top of yours. SETH VARGO: OK. So this is all starting to make a little bit of sense now. But what is an SLA then? There are so many SL-letter things. And I remember at a previous job, I signed an SLA something. What did I do? LIZ FONG-JONES: So to spell it out first, an SLA is a service level agreement. And what it does is it says, here's what I am going to do if I don't meet the level of reliability that is expected. It's more of a commercial agreement that describes what remediation you're going to take if your service is out of spec according to the contract. SETH VARGO: I see. So the SLA is like a business agreement associated with an SLO. So they're exactly the same, right? LIZ FONG-JONES: Not quite, because you really want to make your SLA more lenient than your SLO. That way, you get early warning before you have to do things like field angry phone calls from customers or pay them lots of money for failing to deliver the services promised. As SREs, we rarely work with SLAs directly. Instead, we focus on meeting our SLOs, with the understanding that sales teams and business teams will think more about the SLAs they build on top of our SLOs. SETH VARGO: I see. So SLAs describe the set of services and availability promises that a provider is willing to make to a customer, and the ramifications associated with failing to deliver on those promises. Those ramifications might be things like money back or free credits for failing to deliver the service availability. LIZ FONG-JONES: Yes, that's exactly correct. SETH VARGO: So to summarize, SLIs are service level indicators, or metrics over time, which inform about the health of a service.
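One way to picture the "SLA more lenient than SLO" relationship Liz describes is a tiered check: the internal SLO trips first, as an early warning, before the contractual SLA is breached. The 99.9%/99.5% figures and the function below are illustrative assumptions, not from the video.

```python
# Sketch: an SLA target is deliberately more lenient than the internal SLO,
# so the SLO alarm fires well before any contractual remediation is owed.

INTERNAL_SLO = 0.999   # what the SRE team aims for (illustrative)
EXTERNAL_SLA = 0.995   # what the contract promises customers (illustrative)

def status(measured_availability: float) -> str:
    """Classify a measured availability against the SLO and SLA tiers."""
    if measured_availability < EXTERNAL_SLA:
        return "SLA breach: remediation owed (credits, refunds)"
    if measured_availability < INTERNAL_SLO:
        return "SLO miss: early warning, slow feature releases"
    return "healthy"

print(status(0.9985))  # SLO missed but SLA still met -> early warning
```

The gap between the two targets is the buffer in which the team can react before customers are owed anything.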
SLOs are service level objectives, which are agreed-upon bounds for how often those SLIs must be met. And finally, SLAs are business-level agreements, which define the service availability promised to a customer and the penalties for failing to deliver that availability. LIZ FONG-JONES: Exactly. So SLIs, SLOs, and SLAs hearken back to the DevOps principle that measurement is critical, and that the easiest way to break down organizational barriers is to have a common language for what it means to be available. With SLIs, we give a very well-defined numerical measurement of what that is. And with SLOs, we collaborate between the product owners and the SREs to make sure that the service is running at an appropriate level of reliability for customers. It's a lot clearer to me now why we say class SRE implements DevOps. SETH VARGO: Awesome. Thanks, Liz. And thank you for watching. Be sure to check out the description below for more information. And don't forget to subscribe to the channel and stay tuned for our next video, where we discuss risk and error budgets.
SLIs, SLOs, SLAs, oh my!