Welcome to ACM SRE

Site reliability engineering (SRE) is a software engineering approach to IT operations. SRE teams use software as a tool to manage systems, solve problems, and automate operations tasks.

SRE takes tasks that operations teams have historically done, often manually, and gives them to engineers or ops teams who use software and automation to solve problems and manage production systems.

SRE is a valuable practice when creating scalable and highly reliable software systems. It helps you manage large systems through code, which is more scalable and sustainable for sysadmins managing thousands or hundreds of thousands of machines.

The concept of site reliability engineering comes from the Google engineering team and is credited to Ben Treynor Sloss.

SRE helps teams find a balance between releasing new features and making sure that they are reliable for users.

Standardization and automation are two important components of the SRE model. Site reliability engineers should always be looking for ways to enhance and automate operations tasks.

In this way, SRE helps to improve the reliability of a system today, while also improving it as it grows over time.

SRE supports teams who are moving from a traditional approach to IT operations to a cloud-native approach.

From: https://www.redhat.com/en/topics/devops/what-is-sre

The content here is a work in progress. Our ACM SRE team will share our work with the RHACM community, and feedback is always welcome!