Throughout the 1980s and 1990s, systems administrators (or SysAdmins) wrote code to create, improve, and manage the computing systems under their control, and it worked reasonably well for the environments and needs of the time.
As systems grew more complex and required more and more moving parts (virtual or otherwise), specializations emerged and evolved to handle them.
DevOps is a methodology of continuous change control that streamlines the process from software development through testing and validation to deployment into production.
It is a lifecycle process of continuous improvements to ensure reliable changes.
The same principles and processes were in use before the term was coined; the key difference is the automation applied to speed up deployment and to scale.
What once took days, months, or even years of development, waiting on testing by other teams, and repeating those cycles before anything reached production could now be done much faster and more reliably.
This was the evolution and combination of software engineering, systems administration, and change control, applied at scale.
Where software engineering was primarily concerned with feature sets, bugs, and getting the product shipped, operations staff and systems administrators were more concerned with deployment, supportability, and reliability.
QA and testing were also sometimes overlooked, which resulted in security and feature flaws being released into production.
To address this, DevOps processes were adopted, and a new role, the site reliability engineer (SRE), evolved to provide expertise in this field.
Google vice president of engineering Ben Treynor Sloss coined the term SRE back in the early 2000s. He defined it as: “It’s what happens when you ask a software engineer to design an operations function.”
Site Reliability Engineering is a branch of engineering focused on the reliability of systems, services, and products. Uptime, resource utilization and forecasting, system reliability, change control, and systems integration are all primary concerns of SRE.
Site reliability engineers (SREs) bridge the gap between development and operations by applying the mindsets of both disciplines to ensure feature development with an appropriate level of security, reliability, scalability, and performance.
SREs take a holistic view, from software delivery to monitoring to incident response, that improves service resiliency without sacrificing development turnaround time.
An SRE team seeks continuous improvement in both development and operations, enhancing system monitoring and performance and improving emergency response to achieve overall system resiliency. SREs are empowered to identify gaps in the system, establish observability, and implement service level indicators (SLIs) and service level objectives (SLOs).
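To make service level indicators and objectives concrete, here is a minimal Python sketch of one common way to evaluate them. It assumes an availability-style indicator (the fraction of requests that succeeded) and an objective expressed as a target fraction such as 0.999. The names `SloReport` and `evaluate_slo` and the sample numbers are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of successful requests
    budget_remaining: float  # fraction of the error budget still unspent
    slo_met: bool            # True when the indicator meets the objective


def evaluate_slo(good_requests: int, total_requests: int, slo_target: float) -> SloReport:
    """Compare an availability indicator against its objective (illustrative only)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    # Indicator: the share of requests that were served successfully.
    sli = good_requests / total_requests

    # Error budget: the number of failures the objective tolerates.
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests

    # Goes negative once more failures occurred than the objective allows.
    budget_remaining = 1 - actual_failures / allowed_failures

    return SloReport(sli=sli, budget_remaining=budget_remaining, slo_met=sli >= slo_target)


if __name__ == "__main__":
    # Hypothetical month: 100,000 requests, 50 of which failed, against a 99.9% objective.
    print(evaluate_slo(good_requests=99_950, total_requests=100_000, slo_target=0.999))
```

In this hypothetical case the objective tolerates about 100 failed requests, so 50 failures leave roughly half of the error budget unspent. A team might use the remaining budget to decide how much release risk it can still take on.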
An SRE team needs to monitor its systems to maintain availability and identify errors. It is important to choose deliberately what to monitor and how to monitor it effectively. A monitoring tool that shows overall performance as well as the status of every component makes it possible to catch errors early, which helps avoid further service interruptions.
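The idea of viewing both the overall status and every component's status can be sketched in a few lines of Python. This is only an illustration, not a description of any particular monitoring tool: the function name `run_checks`, the status labels, and the component names are all assumptions, and each health check is simply a function that returns True when its component is healthy.

```python
from typing import Callable, Dict, List, Tuple

# A health check is any zero-argument function returning True when healthy.
HealthCheck = Callable[[], bool]


def run_checks(checks: Dict[str, HealthCheck]) -> Tuple[str, List[str]]:
    """Run every component check and summarize the overall status (illustrative only)."""
    failing: List[str] = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            # A check that raises is treated the same as a failed check.
            healthy = False
        if not healthy:
            failing.append(name)

    if not failing:
        overall = "healthy"
    elif len(failing) < len(checks):
        overall = "degraded"   # some, but not all, components are failing
    else:
        overall = "down"       # every component is failing
    return overall, failing


if __name__ == "__main__":
    # Hypothetical components; real checks would probe actual services.
    status, failing = run_checks({
        "database": lambda: True,
        "cache": lambda: False,
        "api": lambda: True,
    })
    print(status, failing)
```

Reporting the names of the failing components alongside the overall status points the team at where a problem started, rather than only signalling that something is wrong.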
A prepared SRE team monitors the service’s health and responds effectively when problems occur. It needs resources that help the team understand the entire system, especially during troubleshooting. A well-defined incident management process with dashboards and metrics builds the foundation for a prepared team.
You can find out more about DevOps, SRE, and how Crest Data Systems can help bring those services into your organization from the following links.
Crest Data Systems – DevOps