Introduction: DevOps & Site Reliability Engineering


Development and Operations

Throughout the 1980s and 1990s, Systems Administrators (or SysAdmins) wrote code to create, improve, and manage the computing systems under their domains, and it worked reasonably well for the environments and needs of the time.

As systems grew and became more complex, they required more and more moving parts (virtual or otherwise), and specializations were created and evolved to handle them.

DevOps is a methodology of continuous change control that streamlines processes from software development through testing and validation to deployment into production.

It is a lifecycle process of continuous improvements to ensure reliable changes.

DevOps Lifecycle
The same principles and processes were in use before the term was coined; the key difference is the automation now applied to speed up deployment and to scale.

Work that could take days, months, or even years (development, waiting on testing by other teams, and then repeating those cycles before anything reached production) could now be done much faster and more reliably.

This was the evolution and combination of software engineering, systems administration, and change control to scale.

Where software engineering was primarily concerned with feature sets, bugs, and getting product shipped, operations staff and systems administrators were more concerned with deployment, supportability, and reliability.

QA and testing were also sometimes overlooked, which resulted in security and feature flaws being released into production.

To address this, DevOps processes were adopted, and the new role of SRE evolved to provide experts in this field.

Introduction to SRE (Site Reliability Engineering)

Google vice president of engineering Ben Treynor Sloss coined the term SRE in the early 2000s. He described it this way: “It’s what happens when you ask a software engineer to design an operations function.”

Image source: Splunk + VictorOps

Site Reliability Engineering is a branch of engineering focused on the reliability of systems, services, and products. Uptime, resource utilization and forecasting, system reliability, change control, and systems integration are all primary concerns of SRE.

Site reliability engineers (SREs) bridge the gap between development and operations by applying the mindsets of both disciplines to ensure feature development with an appropriate level of security, reliability, scalability, and performance.

SREs take a holistic view, from software delivery to monitoring to incident response, with the aim of improving service resiliency without sacrificing development turnaround time.

What makes having an SRE team effective?
  • Automation is the ultimate goal for SREs. One important way to pursue it is by building self-service tools that reduce toil. Toil is the kind of work that tends to be manual, repetitive, and automatable, and that scales linearly as a service grows. Eliminating toil allows developers to focus exclusively on enhancements or on more services to automate.

  • Incident management focuses on the failure in the process or technology and on ways to improve the system. Blameless post-mortems treat incidents as opportunities to learn and to improve the strategy and structure of both the organization and the system. One example is improving the monitoring, alerting, and other tools used to maintain the system’s reliability.

  • A good way to measure a system’s reliability is through written Service Level Objectives (SLOs), the targets against which each service’s performance is measured. A Service Level Agreement (SLA), by contrast, is the contract between the customer and the team that acts on the SLOs.
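
As a rough illustration of the SLO point above, the sketch below compares a measured availability figure against a written objective. The names (SloReport, evaluate_slo), the request counts, and the 99.9% target are hypothetical and chosen only for illustration. The "error budget" (the amount of failure the SLO still tolerates) is a common SRE practice rather than something defined in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    """Result of comparing a measured availability figure with an SLO target."""
    availability: float      # fraction of requests that succeeded
    target: float            # SLO target as a fraction, e.g. 0.999
    meets_slo: bool          # True when availability >= target
    budget_remaining: float  # share of the error budget left; negative if overspent


def evaluate_slo(good_requests: int, total_requests: int, target: float) -> SloReport:
    """Compare measured availability against a written SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction strictly between 0 and 1")

    availability = good_requests / total_requests
    # The error budget is the number of failures the SLO still tolerates.
    allowed_failures = (1.0 - target) * total_requests
    actual_failures = total_requests - good_requests
    budget_remaining = 1.0 - actual_failures / allowed_failures
    return SloReport(availability, target, availability >= target, budget_remaining)


if __name__ == "__main__":
    # Hypothetical numbers: 100,000 requests in the window, 50 of them failed.
    report = evaluate_slo(good_requests=99_950, total_requests=100_000, target=0.999)
    print(f"availability={report.availability:.4%} meets_slo={report.meets_slo}")
    print(f"error budget remaining: {report.budget_remaining:.0%}")
```

In practice the request counts would come from a monitoring system, and the target from the written SLO for that service. A negative budget_remaining means the service has failed more often than the SLO allows for the window.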

What are some components of SRE?
Continuous Improvement
An SRE team seeks continuous improvement in both development and operations. It enhances system monitoring and performance and improves emergency response in order to achieve overall system resiliency. SREs are empowered to identify gaps in the system, to establish observability, and to implement service level indicators and objectives.
An SRE team needs to implement monitoring on its systems to maintain availability and to identify errors. It is important to choose intelligently what to monitor and how to monitor it effectively. A monitoring tool that shows both overall performance and the status of every component in the system helps identify the initial errors, which in turn helps avoid further service interruptions.
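
One way to picture a view of "overall performance as well as every component status" is a small roll-up that derives a system-wide status from per-component checks. This is a minimal sketch, not the API of any real monitoring tool: the Status levels, the component names, and the worst-status-wins rule are all assumptions made for illustration.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical health levels, ordered from best to worst."""
    OK = 0
    DEGRADED = 1
    DOWN = 2


def overall_status(components: dict[str, Status]) -> Status:
    """Worst-status-wins roll-up; a system with no components reports OK."""
    return max(components.values(), default=Status.OK)


def failing_components(components: dict[str, Status]) -> list[str]:
    """Names of components that are not OK, worst first, then alphabetical."""
    bad = [name for name, status in components.items() if status != Status.OK]
    return sorted(bad, key=lambda name: (-components[name], name))


if __name__ == "__main__":
    # Hypothetical snapshot of per-component checks.
    snapshot = {
        "api": Status.OK,
        "db": Status.DEGRADED,
        "cache": Status.DOWN,
        "queue": Status.DEGRADED,
    }
    print("overall:", overall_status(snapshot).name)
    print("needs attention:", failing_components(snapshot))
```

Listing the worst components first is one possible design choice: it points responders at the most likely source of the initial error before they look at components that are merely degraded.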

A prepared SRE team monitors the service’s health and responds effectively when problems occur. It needs resources that help the team understand the entire system, especially during troubleshooting. A well-defined incident management process, supported by dashboards and metrics, builds the foundation for a prepared team.

You can find out more about DevOps, SRE, and how Crest Data Systems can help bring those services into your organization from the following links.



Richard McIntosh is a Technical Operations and SRE Lead at Crest Data Systems with 25+ years of industry experience working on High Performance Computing, Cloud Computing, DevOps, and Technology & Security management. Before joining Crest, Richard worked for companies ranging from small businesses to large enterprises in the defense, entertainment, and semiconductor industries.