This text has been translated from Google Translate.
The SRE is a job title that has hidden a lot of things for a long time. Some took it literally, and those responsible for application availability became SREs. It turned out to be sysadmins with a new name.
SRE is a method, which comes as a specificity of DevOps
The method was formalized in a book by Google employees and quickly became popular. It is also interesting because it will make you think about the physical risks that an application can see: fires, power failure, network failure. Is it useful to be multi-AZ? A whole word that goes beyond the SLA (the percentage of time the application has been available).
The SRE is known to be transversal in the company. You have to communicate a lot internally, have a facilitating posture. Because a good DRP (Disaster Recovery Plan) or a good BCP (Business Continuity Plan) is not centralized. If the SRE takes all this responsibility, it becomes a SPOF (Single Point Of Failure).
So the right SRE distributes the responsibilities and ensures that everyone has the means to assume them. Otherwise they work to find them. And when you think you've thought of everything, the SRE continues to look for where the next problem might come from.
His favorite tool? The logs! The observability of systems is its primary focus. You have to know what is happening in the Cloud or in the Datacenter.
In terms of continuous improvement, our SRE will go through friction logs to find what could become a problem tomorrow or what can be improved today.
Behind an ad or an SRE job description there are many trades or possible variants. You really have to dig to find out more.
I think the one who best embodies this job is Gaëlle Acas and she talks about it really well in this interview on the WeLoveDevs Podcast ⬇️
If you've read this job description, you really need to read the DevOps one next!