In Pursuit of the Art (and Business) of Site Reliability Engineering

Aug 26, 2021

Interview with Nick Thompson, Head of SaaS Operations @ Everbridge

Manipulated photo by John Maeda / Original via Unsplash

Nick Thompson is a classic “engineer’s engineer” who also earned his MBA along the way to becoming an expert infrastructure and site reliability engineer. Everbridge Technology Blog (ETB) had the opportunity to learn from Nick about our favorite topic of “resilience” in the context of computational machines that work at exponentially complex scales. One of ETB’s co-editors, John Maeda, says one of his favorite books is the Google Site Reliability Engineering book because it made him especially aware of the unique challenges facing the special folks who must be absolutely fluent in #howtospeakmachine. Let’s learn from one of the experts. Take it away, Nick!

ETB: What was your first computer?

Nick Thompson: I shared my first computer with my father. It was an IBM Personal Computer in the early 80s, and I was hooked once I heard the clackity-clack of the mechanical keyboard. Not so much with the dial-up connection speeds.

ETB: How did you find your way to the world of site reliability and platform engineering?

NT: It was a long arc from traditional operations. I have been fortunate to experience many evolutions and advancements as they were happening (virtual machines, cloud computing, containerization, serverless, software configuration management, infrastructure as code, container orchestration). I am not one to let the status quo dictate my decision-making process, and I have always looked for ways to improve on what I did yesterday so I can be just a little bit better tomorrow. These evolutions afforded me all kinds of opportunities to try them out and learn new ways to accomplish old tasks more efficiently, with big dividends. Traditional operations is about mitigating risk, reducing change, and valuing stability. This doesn’t leave much room for improvement. Site Reliability Engineering really resonated with me because it accepts that there is inherently some degree of unreliability in every system.

This slim margin of error allows us to think differently about risk, change, and innovation. It also creates an opportunity to embrace change, up to a point: systems can keep changing as long as the level of acceptable reliability, as measured by the customer (not stakeholders, and this is quite important), is always met. This new contract between product, development, operations, and customers allows us to discuss change based on the user experience, with a lens on reliability. A system that is reliable with a consistent experience is one that is trusted.
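The "slim margin of error" Nick describes is commonly called an error budget in SRE practice. The sketch below is only an illustration of that arithmetic, not something taken from the interview: the 99.9% target, the 30-day window, and the function names are all assumptions chosen for the example.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return how much unreliability the target tolerates over the window.

    slo_target is a fraction such as 0.999 (an assumed example value).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta, downtime: timedelta) -> float:
    """Return the fraction of the budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, window)
    if budget == timedelta(0):
        # A 100% target leaves no room for change at all.
        return 0.0 if downtime == timedelta(0) else float("-inf")
    return 1.0 - downtime / budget


# Hypothetical numbers: a 99.9% target over 30 days allows about 43 minutes.
budget = error_budget(0.999, timedelta(days=30))
remaining = budget_remaining(0.999, timedelta(days=30), timedelta(minutes=10))
```

While the remaining fraction is positive, there is room to ship changes; once it reaches zero, the reliability the customer expects is no longer being met and change should slow down.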

Platform Engineering, at least as I view it, is an extension of Site Reliability Engineering, but it is focused internally on engineering enablement and empowerment. How reliably can we create applications, systems, and infrastructure to ensure we get through the SDLC successfully? With that in mind, I view the goal of platform engineering as providing the tooling that empowers and enables engineers to align the “easy” path with the “right” path. If the platform can accomplish this one goal, it should complement the goals of reliability engineering and provide a successful platform for delivering value to customers with a very high degree of success while being a pleasure to work with.

Visit bestinenterpriseresilience.com to learn more about how CEM is enabling enterprise resilience at scale across digital and physical domains.

ETB: The acronym SLA is quite common, but I know you like to use SLO and SLI, too. Can you explain why these distinctions are important?

NT: An SLA, or service level agreement, is an external commitment to a customer that often involves financial consequences if breached. Think customer contract.

An SLO is an internal service level objective that marks the point at which a customer’s expectations of reliability are about to be breached. This point is a signal for engineering service owners to take action so that they can maintain the degree of reliability customers expect from the system. There are no external consequences for breaching an SLO. However, lots of alerts should be going off and people should be getting paged (possibly in the middle of the night) if an SLO is breached. The SLO protects the SLA. It is a tighter, stricter, internal warning light. SLOs set the reliability target for SLIs over a given time.

Service level indicators, or SLIs, are more specific, detailed metrics that indicate to internal engineers that an SLO might be in danger of being breached. SLIs are the quantifiable measures of service reliability.
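To make the relationship between the three terms concrete, here is a minimal sketch. It assumes an availability-style SLI (good requests divided by total requests) and made-up targets of 99.9% for the SLO and 99.5% for the SLA; none of these names or numbers come from Everbridge.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    SLO_BREACHED = "slo_breached"  # internal warning light: page the service owners
    SLA_BREACHED = "sla_breached"  # external commitment missed


@dataclass(frozen=True)
class ServiceLevels:
    slo_target: float  # internal objective, the stricter of the two
    sla_target: float  # external agreement with the customer

    def __post_init__(self) -> None:
        if self.slo_target < self.sla_target:
            raise ValueError("the SLO should be at least as strict as the SLA")


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def evaluate(sli: float, levels: ServiceLevels) -> Status:
    """Compare the measured SLI against the SLA first, then the SLO."""
    if sli < levels.sla_target:
        return Status.SLA_BREACHED
    if sli < levels.slo_target:
        return Status.SLO_BREACHED
    return Status.OK


levels = ServiceLevels(slo_target=0.999, sla_target=0.995)  # hypothetical targets
status = evaluate(availability_sli(9_970, 10_000), levels)
```

With these example numbers the SLI is 0.997, which falls between the two targets: the SLO has been breached and engineers should act, but the SLA is still intact.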

ETB: One case study in the Site Reliability Handbook covers the VALET Dashboard. What’s your favorite dashboard methodology?

NT: With the movement to microservices and containerization, I have started to think in terms of the service. The underlying infrastructure has become more of a commodity that gets replaced without ever really being noticed, so CPU utilization and memory consumption are no longer a major concern. What we need to focus on now is application behavior in terms of Rate, Errors, and Duration. This closely follows the four golden signals of monitoring that Google subscribes to but leaves out saturation.

The RED Method defines the three key metrics you should measure for every microservice in your architecture. Those metrics are:

(Request) Rate: the number of requests per second your services are serving.
(Request) Errors: the number of failed requests per second.
(Request) Duration: the distribution of the amount of time each request takes.
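As a rough illustration of the three metrics listed above, the following sketch computes them from a batch of request records collected over a fixed window. The `Request` record, the nearest-rank percentile helper, and the choice of p50 and p99 to summarize the duration distribution are assumptions made for this example; a real service would normally get these numbers from its metrics system.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape)."""
    duration_s: float
    failed: bool


def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = math.ceil(p / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


def red_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarize Rate, Errors, and Duration over a window of window_s seconds."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    durations = sorted(r.duration_s for r in requests)
    failures = sum(1 for r in requests if r.failed)
    return {
        "rate_per_s": len(requests) / window_s,
        "errors_per_s": failures / window_s,
        "duration_p50_s": percentile(durations, 50) if durations else 0.0,
        "duration_p99_s": percentile(durations, 99) if durations else 0.0,
    }


# Four made-up requests observed during a two-second window.
sample = [
    Request(0.1, False),
    Request(0.2, False),
    Request(0.3, True),
    Request(0.4, False),
]
summary = red_metrics(sample, window_s=2.0)
```

Duration is reported as percentiles rather than an average because the list above asks for a distribution, and a single mean would hide slow outliers.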

A nice aspect of the RED method is that it helps you think about how to build low-barrier-of-entry dashboards that provide good insight without having to know everything about the underlying service. You should bring these three metrics front and center for each service, and error rate should be expressed as a proportion of request rate. I think this gives a nice way to think about every single service we serve to customers. This high-level view can immediately identify, across all systems, whether one is behaving abnormally, and it reduces the cognitive load on engineers troubleshooting each unique service. This is different from, and less detailed than, SLOs and SLIs, which reflect a deeper, service-specific level of understanding.
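The idea of one uniform view across all services can be sketched as a small table of per-service rows, with errors expressed as a proportion of request rate. Everything here is illustrative: the service names, the row shape, and the thresholds (1% error ratio, 1 second p99) are assumptions, not values from the interview.

```python
from typing import NamedTuple


class RedRow(NamedTuple):
    """One dashboard row per service (hypothetical shape)."""
    rate_per_s: float
    errors_per_s: float
    duration_p99_s: float


def error_ratio(row: RedRow) -> float:
    """Errors as a proportion of request rate; 0.0 when there is no traffic."""
    return row.errors_per_s / row.rate_per_s if row.rate_per_s > 0 else 0.0


def abnormal_services(
    rows: dict[str, RedRow],
    max_error_ratio: float = 0.01,  # assumed threshold
    max_p99_s: float = 1.0,         # assumed threshold
) -> list[str]:
    """Return the names of services whose errors or latency look abnormal."""
    return sorted(
        name
        for name, row in rows.items()
        if error_ratio(row) > max_error_ratio or row.duration_p99_s > max_p99_s
    )


# Made-up numbers for three made-up services.
dashboard = {
    "checkout": RedRow(rate_per_s=100.0, errors_per_s=5.0, duration_p99_s=0.3),
    "search": RedRow(rate_per_s=200.0, errors_per_s=0.5, duration_p99_s=0.2),
    "login": RedRow(rate_per_s=50.0, errors_per_s=0.1, duration_p99_s=2.5),
}
flagged = abnormal_services(dashboard)
```

Because every service is judged on the same three numbers, an engineer can spot the odd one out without knowing how that service works internally.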

ETB: I noticed you have an MBA — how has that training been useful in your career journey so far?

NT: It has helped tremendously in adapting my technical perspective to a business perspective. My world has been technical operations for a long time, and generally that is identified as a cost center in a traditional enterprise. Understanding the balance sheet and profit and loss reporting has allowed me to think about the cost center more like a profit center. Which technical innovations in how we run the business will not only reduce cost but also enable product, engineering, sales, and marketing to innovate faster, better, and more reliably than the competition? My goal with operations has evolved over the years, and my MBA had a large influence on that. It comes down to giving technical operations the ability to adapt and change with the market. When we can meet those opportunities without major disruptions and pivots, we have created a truly agile organization with limitless potential for adding value to the entire CEM marketplace, because we can experiment reliably in a safe environment without affecting customers.

ETB: For folks who are just starting out in the SRE world who want to “grow up and be like Nick,” what advice do you give them?

NT: Experiment with lots of ideas. Don’t stop at failure. Learn from it and experiment again with the new knowledge of what didn’t work. Someone once asked me in an interview, maybe for Everbridge, “What has been your biggest failure?” My answer was that I haven’t failed at anything yet, because I have learned from everything that hasn’t gone quite as successfully as I expected. “Improvise, Adapt, and Overcome” has served me well since my days in the USMC.

ETB: Thanks for the time, Nick.

Everbridge is hiring! We have many engineering positions now open and available across the world — come join us at Everbridge to help keep people safe and organizations running. Faster.

About Everbridge

Everbridge, Inc. (NASDAQ: EVBG) is a global software company that provides enterprise software applications that automate and accelerate organizations’ operational response to critical events in order to Keep People Safe and Organizations Running™. During public safety threats such as active shooter situations, terrorist attacks or severe weather conditions, as well as critical business events including IT outages, cyber-attacks or other incidents such as product recalls or supply-chain interruptions, over 6,000 global customers rely on the Company’s Critical Event Management Platform to quickly and reliably aggregate and assess threat data, locate people at risk and responders able to assist, automate the execution of pre-defined communications processes through the secure delivery to over 100 different communication modalities, and track progress on executing response plans. Everbridge serves 8 of the 10 largest U.S. cities, 9 of the 10 largest U.S.-based investment banks, 47 of the 50 busiest North American airports, 9 of the 10 largest global consulting firms, 8 of the 10 largest global automakers, 9 of the 10 largest U.S.-based health care providers, and 7 of the 10 largest technology companies in the world. Everbridge is based in Boston with additional offices in 20 cities around the globe. For more information visit www.everbridge.com

Written by Everbridge Technology Blog