LIZ FONG-JONES: Hi there.
I'm Liz, a Site Reliability Engineer, or SRE, at Google.
And I teach Google Cloud customers
how to build and operate reliable services.
SETH VARGO: And I'm Seth, a Developer
Advocate at Google focused on infrastructure and operations.
And Liz and I are here to settle things once and for all.
Which is better, DevOps or SRE?
LIZ FONG-JONES: Whoa there, Seth.
Hold on a second.
I'm not sure you're really looking
at this in the right way.
But first of all, maybe we should clarify some things.
What do you think DevOps is?
SETH VARGO: So that's a great question, Liz.
Back in the day, operators and developers
had a lot of contention.
Developers used to throw their code
over the metaphorical wall, and operators
were responsible for keeping that code running
in production.
Operators had little understanding of the code
bases, and developers had little understanding
of operational practices.
But developers were concerned with shipping code,
and operators were concerned with reliability.
This misalignment often caused tension
within the organization.
LIZ FONG-JONES: So if I understand you correctly,
you're saying that the developers were
responsible for features, and the operators
were responsible for stability, meaning the developers wanted
to move faster to get their features out faster
and the operators wanted to move slower to keep things stable?
I could see how that would cause a lot of tension.
SETH VARGO: Exactly.
So DevOps is a set of practices and a culture designed
to break down those barriers between developers, operators,
and other parts of the organization.
I break DevOps down into five key areas.
First, reduce organizational silos.
By breaking down barriers across teams,
we can increase collaboration and throughput.
Second, accept failure as normal.
Computers are inherently unreliable,
so we can't expect perfection.
And when we introduce humans into the system,
we get even more imperfection.
Third, implement gradual change.
Not only are small, incremental changes easier to review,
but in the event that a gradual change does
introduce a bug in production, it allows
us to reduce our mean time to recover,
making it simple to roll back.
Fourth, we need to leverage tooling and automation.
And fifth, we need to measure everything.
Measurement is a critical gauge for success.
And without a way to measure if our first four pillars
were successful, we would have no way of knowing if they were.
So, Liz, you've been an SRE at Google for over 10 years now.
Does any of the way that I described DevOps align
with your experience as an SRE?
LIZ FONG-JONES: It's sounding very familiar.
Because, if you think about DevOps as a philosophy,
SRE is a prescriptive way of accomplishing that philosophy.
So if DevOps were an interface in a programming language,
you might almost say that SRE is a concrete class that
implements DevOps.
Let's take a look at how that is.
So, Seth, when you talked about eliminating
organizational silos, what I thought about
is the fact that we share ownership of production
with our developers.
And we use the same tooling in order
to make sure everyone has the same view and same approach
to working with production.
When you talked about accepting accidents and failure
as normal, what I thought about is the fact that--
similar to many DevOps practitioners--
we have blameless postmortems, where
we make sure that the failures that happen in our production
systems don't happen the exact same way more than once.
And we accept the failures as normal by encoding
a concept of an error budget of how much the system is allowed
to go out of spec.
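[Editor's sketch: the error budget Liz mentions can be illustrated with simple arithmetic. The 99.9% target, the 30-day window, and the function names below are assumptions chosen for illustration; the video does not give specific numbers.]

```typescript
// Illustrative sketch only: the SLO target and window are assumed values,
// not figures from the video.

/** Minutes of unavailability permitted by an availability SLO over a window. */
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    // A 100% target would leave no budget at all, so it is rejected here.
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

/** Fraction of the budget still unspent; a negative value means it is overspent. */
function budgetRemaining(budgetMinutes: number, downtimeMinutes: number): number {
  return (budgetMinutes - downtimeMinutes) / budgetMinutes;
}

// A 99.9% target over 30 days allows roughly 43.2 minutes out of spec.
const budget = errorBudgetMinutes(0.999, 30);
// After 10 minutes of downtime, roughly 77% of that budget is left.
const remaining = budgetRemaining(budget, 10);
console.log(budget.toFixed(1), remaining.toFixed(2));
```

[In this framing, a team with budget left can keep shipping changes, while a team that has spent its budget has a measurable reason to slow down.]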
And then third, we talked about making gradual changes.
And when you said that, I thought about the fact
that we canary things, that we roll
things out to a small percentage of the fleet
before we move them out for all users.
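[Editor's sketch: one possible way to pick "a small percentage of the fleet" for a canary is to hash each server's identifier into a stable bucket. This is an assumed approach with hypothetical names (`bucketFor`, `inCanary`, `server-N`) and an assumed 5% figure; the video does not describe how Google selects canaries.]

```typescript
// Hypothetical helpers for illustration; not taken from the video.

/** Map an identifier to a stable bucket in [0, 100) with an FNV-1a-style hash. */
function bucketFor(id: string): number {
  let hash = 0x811c9dc5; // 32-bit FNV offset basis
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i); // mixes in one UTF-16 code unit at a time
    hash = Math.imul(hash, 0x01000193); // 32-bit multiply by the FNV prime
  }
  return (hash >>> 0) % 100; // treat the hash as unsigned before reducing
}

/** True if this identifier falls inside the first `canaryPercent` buckets. */
function inCanary(id: string, canaryPercent: number): boolean {
  return bucketFor(id) < canaryPercent;
}

// Roll a change out to about 5% of a made-up fleet of 1000 servers.
const fleet = Array.from({ length: 1000 }, (_, i) => `server-${i}`);
const canaries = fleet.filter((id) => inCanary(id, 5));
console.log(`${canaries.length} of ${fleet.length} servers receive the change first`);
```

[Because each server's bucket never changes, raising the percentage only adds servers to the rollout; no server that already has the change drops back out.]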
And then fourth, when you talked about leveraging
tooling and automation, what I thought about
is the fact that we try to eliminate manual work
as much as possible.
So we measure how much toil we have,
and then we try to automate this year's job away.
And then fifth, when you talked about measuring everything,
I thought about exactly that:
measuring the amount of toil that we have
and measuring the reliability and health of our systems.
SETH VARGO: I really like that.
Class SRE implements DevOps.
We should get that on a shirt or something.
But just like a class in a programming language,
there might be additional functions or methods
that don't necessarily correspond to that interface.
Or the class might implement multiple interfaces.
Do you think SRE is like that?
LIZ FONG-JONES: I absolutely think
that's the case because of the fact
that SRE doesn't do things in the exact same way
that other people that implement DevOps elsewhere
might want to do.
So we'll talk a little bit more about those differences,
such as how exactly SLOs work, which
are a very specific concept that we implement in order
to make SRE successful.
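[Editor's sketch: the "class SRE implements DevOps" analogy can be written out literally. The method names are hypothetical, derived from Seth's five areas, and the returned strings paraphrase Liz's answers; this is an illustration of the analogy, not an official definition of either term.]

```typescript
// The five DevOps areas from the video, expressed as an interface.
// Method names are made up for this illustration.
interface DevOps {
  reduceOrganizationalSilos(): string;
  acceptFailureAsNormal(): string;
  implementGradualChange(): string;
  leverageToolingAndAutomation(): string;
  measureEverything(): string;
}

// SRE as one concrete, prescriptive implementation of that interface.
class SRE implements DevOps {
  reduceOrganizationalSilos(): string {
    return "share ownership of production with developers";
  }
  acceptFailureAsNormal(): string {
    return "blameless postmortems and error budgets";
  }
  implementGradualChange(): string {
    return "canary changes on a small percentage of the fleet";
  }
  leverageToolingAndAutomation(): string {
    return "measure toil and automate it away";
  }
  measureEverything(): string {
    return "measure toil, reliability, and system health";
  }

  // An extra method that the DevOps interface does not require,
  // standing in for the SRE-specific concept of SLOs.
  defineSLO(): string {
    return "set a service level objective for the system";
  }
}

// Any code that only needs "DevOps" can be handed an SRE.
const practice: DevOps = new SRE();
console.log(practice.implementGradualChange());
```

[The `defineSLO` method reflects Seth's point that a class can carry methods beyond its interface; other organizations could write a different class against the same `DevOps` interface.]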
SETH VARGO: Great.
Well, that settles it, then.
It turns out that DevOps and SRE aren't two competing methods,
but rather close friends designed
to help break down organizational barriers
to deliver better software faster.
Thank you, everyone, for watching.
Please be sure to check the description
below for more links, and don't forget
to subscribe to our channel.
Stay tuned for our next video, where
we will discuss the differences between SLIs, SLOs, and SLAs.