Who are you and what is your role at Everbridge?
👋 Hello! My name is Sean Rackley, and I’m a Senior Site Reliability Engineer on the Everbridge Suite SRE team. I have been with Everbridge for nine years. I earned my Bachelor’s degree from DeVry University with a specialty in network management. Prior to Everbridge, I worked as a quality assurance supervisor at the world’s largest producer of pistachios and almonds. A true nut job.
👋 Ahoy! My name is Mike Nikolai, and I’m also a Site Reliability Engineer on the Everbridge Suite SRE team. I have been part of the Everbridge crew for almost nine years, originally starting in the NOC before moving to an SRE role. I earned my Bachelor’s degree from California State University, Northridge with a focus in liberal arts. Before Everbridge, I worked as an Implementation Coordinator at Quest Diagnostics.
Why did we need this change?
Four years ago, we began a move from on-premises physical data centers to the cloud. This project, which was internally titled ATLAS (also known as Everbridge Automation Framework, or EAF), initially helped us achieve better resiliency and improved reliability by moving fully to the cloud. Unfortunately, and as expected with all ambitious efforts, the time constraints around getting the ATLAS project going forced upon us a constrained way of building and deploying code to our infrastructure. In fact, an in-house wrapper named Terrabridge was built around the process. Sadly, it was slow and error-prone out of the gate, and the effort needed to manage it hurt our team’s productivity. While we continued to suffer through painful code deployments using the scripted process, we soon realized it wasn’t going to scale if we wanted to make any real forward progress. We felt stuck and desperately needed a way out.
Fortunately, a new team named Platform Engineering (PDE) was formed and started working on a project to transition away from EAF. This new project, titled Virtual Machine Operations Platform (VMOP), uses native Terraform to build infrastructure while still leveraging a lot of the work and lessons learned from ATLAS. This is when we decided to migrate our entire stack over to this new platform.
What is VMOP?
VMOP facilitates infrastructure provisioning and deployment automation in AWS for environments composed, at least in part, of elements that reflect a monolithic application architecture running on discrete virtual hosts. It comprises six distinct but related systems, several of which are mutually dependent. Everbridge’s engineering teams consume these systems directly to access automation that provisions, configures, and deploys cloud-native infrastructure and environments tailored to the monolithic elements powering Everbridge Suite applications in the AWS cloud.
Currently, the following systems comprise VMOP:
- Base AMI System: a pipeline that routinely produces and distributes an updated selection of private base AMI artifacts that are security hardened and qualified, built over Open Source OS versions (in both minimal and full-server variants), configured with an array of encrypted EBS storage components, and available with certified FIPS 140-2 cryptographic modules installed. These are suitable for instantiating EBS role-specific EC2 nodes across all environment types.
- AMI Bakery System: a pipeline for producing user-specified bespoke AMI artifacts, starting from an available Base AMI (from the Base AMI System) and revising or updating its content or configuration according to user-specified directives.
- Terraform Modules Continuous Integration System: provides an opinionated repository file layout plan supported by an automatically triggered pipeline that runs static analysis and verification procedures to qualify Terraform modules submitted via pull requests to a shared collection of composable Terraform modules. Pipeline results are posted in the pull request conversation thread prior to the manual code reviews and approvals required for merging into the repo’s master branch. These module collections can be invoked to provision infrastructure resources appropriate to various Everbridge environments.
- Infrastructure/Applications Continuous Delivery System: provides an opinionated repository file layout plan supported by a pipeline composed of static analysis, Terraform plan, and Terraform apply stages for “live” provisioning of infrastructure and cloud-native resources to support Everbridge services and operations across the hierarchy of environment types: development, QA integration testing, stage, and production.
- Pipeline Monitoring System: logging and telemetry implemented for monitoring, alerting and alarming all VMOP pipeline processes, ensuring pipeline availability and reliability.
- Documentation System: a repository-based documents collection exemplifying Documentation as Code principles, which produces an automatically updated, purpose-built website hosting Platform Engineering documentation.
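The stage ordering described for the Continuous Delivery System (static analysis, then Terraform plan, then Terraform apply) can be sketched roughly as follows. This is only an illustration, not VMOP’s actual pipeline: the post does not say which static analysis tools are used, so `terraform fmt -check` and `terraform validate` are assumptions, and the names `STAGES`, `run_command`, and `run_pipeline` are hypothetical. The point is that a failure in a cheap early stage stops the run before the slower plan and apply stages.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# Hypothetical stage list. The Terraform CLI subcommands are real, but the
# choice of fmt/validate as the "static analysis" stage is an assumption.
STAGES: Tuple[Tuple[str, Tuple[str, ...]], ...] = (
    ("init", ("terraform", "init", "-input=false")),
    ("static-analysis", ("terraform", "fmt", "-check", "-recursive")),
    ("static-analysis", ("terraform", "validate")),
    ("plan", ("terraform", "plan", "-input=false", "-out=tfplan")),
    ("apply", ("terraform", "apply", "-input=false", "tfplan")),
)


def run_command(cmd: Sequence[str]) -> int:
    """Run one command in the current directory and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def run_pipeline(
    stages: Sequence[Tuple[str, Sequence[str]]] = STAGES,
    runner: Callable[[Sequence[str]], int] = run_command,
) -> List[str]:
    """Run stages in order and stop at the first non-zero exit code.

    Returns the commands that completed successfully, so a caller can see
    how far the pipeline got.
    """
    completed: List[str] = []
    for _stage_name, cmd in stages:
        if runner(cmd) != 0:
            break  # later stages (for example, apply) never run
        completed.append(" ".join(cmd))
    return completed


if __name__ == "__main__":
    # Requires the terraform binary and a Terraform working directory.
    for finished in run_pipeline():
        print("ok:", finished)
```

Passing the runner in as a parameter keeps the ordering logic testable without a Terraform installation; in a real pull-request workflow, the plan and apply steps would typically be triggered by the review tooling rather than by a local script.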
How has VMOP helped provision and deploy infrastructure for the EB Suite Product?
To give a bit more context, our releases previously relied on Jenkins to provision our infrastructure using declarative pipelines. These pipelines would use the Terrabridge wrapper to “regenerate” our entire Everbridge Suite application on each execution. Unfortunately, this meant any error in our Terraform files would be caught in the plan output, meaning that we’d have to re-run our pipeline. To put this into perspective, a typical regeneration stage would take about 45 minutes, since it compares running infrastructure to desired changes. As you can imagine, it was a very frustrating and time-consuming process.
With our migration to VMOP, the slow wrapper vanished and native Terraform took its place. This meant we had our very own repo, which uses Terraform modules created by the PDE team. It also meant we no longer had to worry about other teams’ changes to the previously shared repos impacting our services. We’re huge proponents of limiting the blast radius whenever possible.
For our particular use case, we also chose to implement AWS Auto Scaling groups (ASGs) for further resiliency and ease of use as part of our VMOP migration. By using these ASGs and native Terraform, we can provision infrastructure in a single pull request using Atlantis.io. Once provisioned, all it takes is one more pull request to cut over traffic for our deployments. Not only does this save a huge amount of time, but it has also eliminated the majority of human error. GitOps to the rescue!
Speaking of time saved, we can now provision services independently of one another instead of the giant monolith release structure we had with EAF. Because of this, our overall provisioning time has since drastically improved:
- Using EAF, provisioning for a full Everbridge Suite release averaged 3.5 hours with multiple SREs
- With VMOP, provisioning for a full Everbridge Suite release averages 1.5 hours with a single SRE
We can now deliver bug fixes, features, and releases faster than ever before. It’s no longer a daunting task to release code, whether minor or major.
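To put those two averages side by side, here is a small back-of-the-envelope calculation. The 3.5-hour and 1.5-hour figures come from the list above; the function names are hypothetical, and because the post only says “multiple SREs” for EAF, the engineer-hours comparison assumes a minimum of two people as a lower bound, not a measured headcount.

```python
# Averages quoted above for a full Everbridge Suite release.
EAF_HOURS = 3.5
VMOP_HOURS = 1.5

# Assumption: "multiple SREs" means at least two. The real number is not stated.
ASSUMED_MIN_EAF_ENGINEERS = 2


def hours_saved(before: float, after: float) -> float:
    """Wall-clock hours saved per release."""
    return before - after


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction in wall-clock time, as a percentage."""
    return (before - after) / before * 100


def engineer_hours(wall_clock_hours: float, engineers: int) -> float:
    """Total person-time if every engineer is busy for the whole run."""
    return wall_clock_hours * engineers


if __name__ == "__main__":
    print(f"Saved per release: {hours_saved(EAF_HOURS, VMOP_HOURS):.1f} hours")
    print(f"Reduction: {percent_reduction(EAF_HOURS, VMOP_HOURS):.1f}%")
    eaf = engineer_hours(EAF_HOURS, ASSUMED_MIN_EAF_ENGINEERS)
    vmop = engineer_hours(VMOP_HOURS, 1)
    print(f"Engineer-hours: at least {eaf:.1f} (EAF) vs {vmop:.1f} (VMOP)")
```

On wall-clock time alone, that is 2 hours saved per release, or roughly a 57% reduction. Under the two-engineer assumption, the person-time difference is larger still: at least 7 engineer-hours before versus 1.5 now.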
How have you adapted your way of working with this new tool, and what advantages does it provide?
Migrating to VMOP has greatly reduced the cognitive load on our entire team and empowers us all as engineers. We’re no longer afraid of making changes, because each Everbridge Suite application is now deployed with a discrete Terraform state file, which reduces the chance of contamination by unrelated changes. As a benefit of this new tool, we can also submit bug fixes and feature enhancements directly back to the Terraform modules themselves. This helps our use case as well as others across the organization who consume the module or need the feature or enhancement.
This also means we no longer have to babysit a Jenkins pipeline. We can kick off provisioning and enjoy a homemade sandwich while Atlantis, native Terraform, and SaltStack do the work for us.
Our new team motto is, “if it’s not in code or supported by the module or Terraform, it’s not being deployed!”
How has your collaboration and working relationship with the VMOP team changed?
As much as we’ve discussed the shortcomings of the old process and the benefits of this transition, none of this work would have been possible without our collaboration with the VMOP team. We worked extensively with them during multiple phases of the EAF to VMOP transition. Specifically, we’d like to recognize:
- Steven Lahouchuc for the initial implementation of AWS Auto Scaling groups and subsequent related infrastructure
- Ed Silva for the SaltStack module and related state and pillar conversion(s)
- Brandon Strohmeyer for the various module PRs and tag releases
In general, the documentation provided by the VMOP team was first-class in all aspects and enabled us to “hit the ground running.” We’re now self-sufficient, and both a consumer of and a contributor to a great unified platform.
Ultimately, the barriers and silos between our two teams were knocked down and cross-functional collaboration was renewed.
What did we learn?
While the VMOP tooling helped free us from the handcuffs formerly known as ATLAS/EAF, we really learned to think independently of what the business was requiring of us, and ultimately learned from our initial mistakes while improving our approach. Really, the important part is that we can focus even more time on the performance and optimization of Everbridge Suite, making sure to provide the best possible experience for our customers.
Where do we go from here?
While VMOP serves our current needs, Everbridge is always looking to improve deliverability and resiliency, and is actively building out and migrating services over to a little-known platform named Kubernetes. Kubernetes is an extensible, open-source platform for managing containerized workloads and services that facilitates both declarative configuration and automation.
If you are an engineer who is curious about enabling developer productivity and you want to know more about what we are doing at Everbridge to keep evolving this capability, check out our Careers page!
Everbridge is hiring! We have many engineering positions now open and available across the world — come join us at Everbridge to help keep people safe and organizations running. Faster.
Everbridge, Inc. (NASDAQ: EVBG) is a global software company that provides enterprise software applications that automate and accelerate organizations’ operational response to critical events in order to Keep People Safe and Organizations Running™. During public safety threats such as active shooter situations, terrorist attacks or severe weather conditions, as well as critical business events including IT outages, cyber-attacks or other incidents such as product recalls or supply-chain interruptions, over 5,700 global customers rely on the Company’s Critical Event Management Platform to quickly and reliably aggregate and assess threat data, locate people at risk and responders able to assist, automate the execution of pre-defined communications processes through the secure delivery to over 100 different communication modalities, and track progress on executing response plans. Everbridge serves 8 of the 10 largest U.S. cities, 9 of the 10 largest U.S.-based investment banks, 47 of the 50 busiest North American airports, 9 of the 10 largest global consulting firms, 8 of the 10 largest global automakers, 9 of the 10 largest U.S.-based health care providers, and 7 of the 10 largest technology companies in the world. Everbridge is based in Boston with additional offices in 20 cities around the globe. For more information visit www.everbridge.com