To SRE or Not to SRE: A Brief History of Operations Engineering

Ye Olde Enterprise (1970s - 1990)

In the traditional enterprise, DevOps wasn’t a thing, and SREs weren’t even born yet. You simply had engineers, QA, and support. In this world, support was the cost center, and problems took a long time to diagnose and fix because of the long gap between planning a new feature and getting feedback on how it was failing for end users.

In this world, there was no environment other than production, apart from maybe running a small piece of an application on a workstation. Even that was a luxury some places didn’t have if, for instance, most of the programming had to be done on a mainframe.

It’s shocking how many companies still operate this way, even after all these years. While I was at Citadel, the number of times I saw engineers editing code, configs, and scripts in live production environments just to fix things, roll out new software, or even just test ideas was mind-boggling.

The Height of the 90s

By the late 90s, technology corporations had dedicated ops teams, and a NOC (Network Operations Center) was common. Release engineering was its own discipline (as it still is in many places) and it was common to have a staging environment where new versions of code could be tested as a unit by dedicated testing teams. This was where you first started seeing the notion of "locking down production."

Time to resolve failures was reduced drastically with the introduction of a dedicated operations team, which was typically formed from "Unix Admins" - skilled technologists who may or may not have been software engineers, but who could generally work their way around distributed systems very well and troubleshoot problems. More often than not, the best of these engineers found themselves "fixing problems" in distributed systems by hacking production by whatever means necessary, since they rarely had access to the real source code but uptime was still their mandate.

This "throw it over the wall" style of having engineering and operations completely separate from each other, and all the problems that come with that, is what led to the DevOps movement. Of note, this was my life at Orbitz from 2005-2010. Every day, I’d come into a big room filled with monitors with alerts on the screen, most of which I couldn’t fix directly. So, I did what I could do - I worked with my team to create SOPs (Standard Operating Procedures), detailing runbooks for what to do when you saw certain problems, often restarting a service with different configs or moving things from one class of servers to another.

When engaging with the developers, ops team members were often surprised to find that the developers were so far ahead of what was in production that they had either forgotten how things worked or were so new they had never even seen the code currently running in production.

Releases were supposed to be scheduled for every two months but often they would take significantly longer, and deploying a new version of the platform was also a very lengthy, risky endeavor requiring a "down for maintenance" page all Friday night until early Saturday morning.

DevOps Circa 2010

One of my favorite talks on DevOps described it as "giving a damn about what you do." I love that definition because it cuts through a lot of the misunderstandings at the time - DevOps was never supposed to be a role or a title - it was supposed to be the philosophy that developers and operations engineers need to work more closely together to prevent the kind of dysfunctional scenarios I just described.

However, once the DevOps movement started gaining steam, "DevOps engineers" as a role and title nevertheless took off, and I even had this title for some time at Signal. To a lot of people, DevOps engineers are just the newer term for the "Admins" from the 90s, but to most of the people who embodied the philosophy, these were engineers who recognized themselves as coders first and foremost - solving operational-focused problems with automation and programmatic tooling.

This brought a tremendous improvement in the time to resolve problems, because now there was a focus on thinking about operations and reliability right in engineering. Done properly, "following the DevOps philosophy" meant that each rollout of new functionality was ushered through multiple environments with sophisticated tooling and automated testing, owned and maintained by the DevOps team.

Ideally, new features weren’t created in a vacuum anymore, either. A DevOps engineer might sit in on the initial planning and design of new initiatives to make sure scaling, resiliency and fault-tolerance are no longer afterthoughts but initial goals of features.

SREs and NoOps

Companies like Google, with their SREs, take the DevOps concept to its logical extreme - all software engineers are just that - software engineers, but SREs have a particular focus on infrastructure as code, resiliency engineering, and monitoring.

You could say that Google was just doing the DevOps thing long before anyone else was, because their insane scale forced them to. When otherwise normal systems start to reach Google-scale, brand new problems emerge that require a completely new way of thinking about the technology you build, and to solve them Google needed to double down on reliability engineering earlier than most.

First of all, there’s no way to manually maintain systems involving millions (or billions) of anything. Because of this simple truth, you have to throw out any preconceived notions of "manual support" or ops work, because if there’s a problem with a system of this scale, generally speaking the only way to solve it is to go back to the original code and fix the bug.

Second, the biggest lesson learned from the DevOps movement was that building processes and tooling to handle support problems manually, or with one-off "out of band" scripts and tools, actually slows down the velocity of new feature development. Docs, support processes, and manually-created tools are all forms of tech debt as well, and they require maintenance to keep functioning correctly just like any other code.

To really make this all work, the role of an SRE as a software engineer with a specialized focus becomes useful. In addition to being a member of the product-focused engineering team, they bring a distinct perspective on the engineering requirements, which gives the team a more fully-realized picture of what it takes to get a particular feature or product into production.

An example of how this works in practice is as follows:

Case Study A: The Traditional Approach

On a traditional software engineering team, you might have three engineers building a new application together. Let’s say it involves a database, a frontend, and a simple middleware app. After agreeing on a schema and API, one engineer focuses on the frontend app and the other two tackle the backend apps and manually update the database with the help of a DBA.

After a couple weeks’ worth of work, a prototype is deployed manually into a staging environment with the help of an ops team with the necessary access to the infra and dbs.
After that passes some manual testing with the business stakeholders, it is again manually put into production with the help of ops admins and from that point onward ops maintains the application.

Case Study B: The SRE-Advised Approach

At another company, a similar team might have four engineers, but there is no separate ops team whatsoever. One of these engineers is an SRE.

This time, the four engineers work together to design a schema and an API, and the SRE points out that the database in use will have a particular consistency model once deployed to production. This matters because the DB readily available to the team is actually Cassandra, and the users of this app will expect a consistent view of data.
Several discussions are had regarding query performance at 10, 1,000, and 1,000,000 rows in the result sets and some indexes are tried out before everyone agrees on the necessary queries for the product and resulting Cassandra tables.
A latency target is identified and pitched to product and only once product agrees this is acceptable does design work continue.
Integration tests are designed and written to describe and verify the behavior of the application at a high level and unit tests are spec’d out for individual classes/components/functions.
The SRE helps prepare build and deploy scripts for automating these tests and getting the new code into all the environments.
As a team of 4, everyone works on instrumentation and logging to have the necessary logs and metrics on the application to ensure it’s doing what it’s supposed to be doing in production within the expected latency bounds. It’s important to note here that the SRE is an advisor in this capacity - it’s not like this engineer is doing all the instrumentation work themselves, but they might be helping coordinate this work and informing on things like conformed tag or metric names.
As a team of 4, the application moves through all the environments till it gets into production, and all four members of the team enter on-call rotations to support the product in production.

The Alert Guidelines

The SRE can help define runbooks for the various alerts that might happen, with the expectation that every alert that gets through should meet the following criteria:

Actionable: Don’t page me if there’s nothing I can do about this.

Impactful: Don’t page me if this isn’t really a problem.

Requires human intervention: Why can’t this failure case be dealt with in an automated way?
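
One way to make these three criteria concrete is to treat them as a gate that every alert must pass before it pages anyone. The sketch below is a hypothetical illustration: the Alert fields and the should_page and triage helpers are assumed names, not part of any real alerting product, and in practice these flags would come from a human reviewing each alert definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    actionable: bool       # can the on-call engineer do something about it?
    impactful: bool        # is this really a problem?
    auto_remediable: bool  # could automation handle it without a human?


def should_page(alert):
    """Page only if the alert meets all three criteria."""
    return alert.actionable and alert.impactful and not alert.auto_remediable


def triage(alerts):
    """Split alerts into those that page and those that are suppressed."""
    pages = [a for a in alerts if should_page(a)]
    suppressed = [a for a in alerts if not should_page(a)]
    return pages, suppressed
```

For example, a full disk that a cleanup job can clear on its own would be marked auto_remediable and land in the suppressed list, which is a prompt to automate the fix rather than wake someone up.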