What is SRE and why should you love it?
SRE stands for Site Reliability Engineering, a term coined at Google in the early 2000s. It predates the introduction of the DevOps methodology by several years and is nearly as old as the Agile Manifesto. However, SRE does not contradict DevOps; in fact, the two work well together, complementing and supporting each other.
Many self-appointed “experts” proclaim that SRE is centered on the “you build it, you run it” motto coined by Amazon CTO Dr. Werner Vogels. According to these so-called “gurus”, this means that developers must learn to run systems in production so that they better understand how to write code that works well after release.
This is utter nonsense and could not be further from the truth. Ben Treynor, the head of Google's Site Reliability Engineering team, who coined the term SRE, has clearly outlined what SRE should be and how it should be used. According to him, “SRE is what happens when you ask a software engineer to design an operations team”. This is where most of the “experts” stop, so they end up with completely wrong assumptions about what was actually said.
What is SRE in simple words?
Actually, anyone who reads the book “Site Reliability Engineering: How Google Runs Production Systems” will see that the phrase means something quite different. The site reliability engineer must work on ensuring … site reliability, by treating the entirety of the applications and systems in production as a single application and managing it to ensure optimal performance. Therefore, SRE specialists must have experience with both cloud infrastructure management and software development, but they concentrate more on the Dev side of things, while DevOps concentrates more on the Ops side.
How does it differ from classic software development? Not so long ago, the Dev and Ops teams were separate (and quite often opposing) camps. The efficiency of the Dev team was measured by the number of features it successfully delivered, while the efficiency of the Ops team was measured by application uptime, and new feature releases quite often led to service downtime.
Thus, the goals of these two teams directly contradicted each other, and they had separate silos of tasks, tools and skills. The common approach to handling the code was “throw it over the wall and make it someone else’s problem”, which led to constant tension and a tug-of-war between the two departments in almost every company. Most importantly, this resulted in unpredictable software delivery schedules, immense customer frustration, financial losses due to post-release downtime, and a fear of innovating due to the risk of bearing the blame for failure.
What is DevOps then?
The daunting situation described above required drastic measures, so the DevOps approach was introduced. Much as with SRE, many “experts” assumed that to enable DevOps you should make Devs and Ops sit in one room and teach each other to code and to run infrastructure, so they share their skills, tools and tasks. This way, the “gurus” assume, the DevOps magic will happen and the teams will become fully interchangeable.
In fact, such an approach would be a direct path to disaster, as both teams would lose productivity. What is more, each of the specialists you employ has studied their chosen field for years to reach their professional level, and it would take years for them to teach their colleagues everything they know (and to learn from them) in order to form an interchangeable team.
DevOps is NOT and NEVER HAS BEEN a mix of Dev and Ops. It IS a paradigm centered on communication and collaboration between the teams, where Ops engineers sit at the head of the table, as they deal with the application 90% of the time, while it runs in production. So Ops engineers define how best to structure the future application (monolith or microservices), how to ensure timely and error-proof application updates (through automated testing and CI/CD pipelines) and how to manage and monitor the production environment cost-efficiently (through smart alerting and predictive analytics, instead of manual system monitoring).
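To make “smart alerting instead of manual system monitoring” more concrete, here is a minimal sketch, assuming a simple rule: raise an alert only when the error rate over the most recent requests exceeds a threshold. The class name `ErrorRateAlert` and the `window` and `threshold` parameters are hypothetical and chosen purely for illustration; real monitoring stacks express such rules in their own configuration languages.

```python
from collections import deque


class ErrorRateAlert:
    """Hypothetical rule: alert when the error rate over the last
    `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.threshold = threshold
        # A deque with maxlen silently drops the oldest outcome when full.
        self.outcomes = deque(maxlen=window)

    def record(self, ok: bool) -> bool:
        """Record one request outcome; return True if an alert should fire."""
        self.outcomes.append(ok)
        # Wait for a full window so one early failure does not page anyone.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        errors = sum(1 for outcome in self.outcomes if not outcome)
        return errors / len(self.outcomes) > self.threshold
```

With a rule like this, nobody has to watch a dashboard: the check runs on every request outcome and only speaks up when the failure rate is sustained rather than a one-off blip.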
The DEV part of DevOps relates to the fact that once the Dev and Ops teams have met and discussed the structure of the future app, the Ops engineers create AUTOMATED TOOLS to support Continuous Integration and Continuous Delivery of new code: Terraform and Kubernetes manifests that Devs can run with ease, without having to dive deep into infrastructure management. This way, the Devs can write code without wasting time asking the Ops engineers to build and configure testing environments for it or to prepare releases. Once the manifests are in place, development becomes much more predictable.
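As an illustration of the kind of automated tooling described above, here is a minimal sketch, assuming the Ops team wraps manifest creation in a small helper so that Devs only supply an application name and a container image. The function name `deployment_manifest`, the app name `demo-api` and the image URL are hypothetical; the dictionary layout follows the standard Kubernetes `apps/v1` Deployment structure.

```python
import json


def deployment_manifest(app: str, image: str, replicas: int = 2) -> dict:
    """Build a minimal Kubernetes apps/v1 Deployment as a plain dict."""
    labels = {"app": app}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": app, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical application name and image, for illustration only.
    manifest = deployment_manifest("demo-api", "registry.example.com/demo-api:1.0")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, which accepts JSON as well as YAML. The Dev never touches the infrastructure details; the Ops team encodes them once in the helper.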
Thus, the DevOps culture fosters collaboration between teams and individuals, as their goals are now aligned: they have to ensure the application runs reliably at all times while being incrementally improved without interrupting the end-user experience. They retain their skillsets and tools; they simply communicate freely to understand how to help each other work as productively as possible, without distracting each other with repetitive routine requests.
Most importantly, DevOps culture treats failure not as a sign of incompetence, but as an indicator that there is room for improvement in your product, infrastructure or workflows. It is also important that the IaC, CI and CD principles of DevOps help create and configure the required testing environments literally in seconds, so the cost of a failed experiment is close to zero. This blameless postmortem approach removes the tension and helps all parties be more innovative in their experiments, which helps deliver great new features and products faster.
Enter SRE — when Devs have the final say
What is the difference between DevOps and SRE then, and why would you need SRE at all, if DevOps is so good? Because SRE specialists can help improve both your applications and infrastructure as a whole, which is essential when operating infrastructures at scale. Most importantly, SRE does not contradict DevOps and is actually an important part of it.
How to obtain such experience then? Ben Treynor described 4 basic rules of SRE:
Conclusions: SREs are important for project success
To wrap it up, the SRE approach helps system engineers learn to manage the infrastructure and applications more efficiently, greatly increasing the reliability of operations and the predictability of software development. It does not contradict DevOps and is actually one of 7 core DevOps roles. If you need SRE services or consulting on how to implement SRE in your organization, IT Svit can help!