SRE
Site Reliability Engineering (SRE) is a discipline that uses software engineering to solve infrastructure and operations problems, with the goal of building scalable, reliable software systems. It was developed at Google in the early 2000s to bridge development and operations and has since spread to many organizations.
A core SRE concept is aligning reliability with product goals through quantitative targets: service level indicators
Practices include monitoring and alerting, incident response, and blameless postmortems focused on learning. SREs perform capacity
In organizations, SREs are software engineers who collaborate with product teams to define SLOs and implement
Common metrics include uptime, P95 or P99 latency, and error rates. Adoption varies; some teams use platform
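
The metrics named above can be computed from raw request data in a few lines. The following is a minimal sketch in Python, not a reference implementation: the Request record, the function names, and the nearest-rank percentile method are all assumptions chosen for illustration. Production monitoring systems typically estimate percentiles from histograms or sketches rather than sorting every sample.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one served request (illustrative only)."""
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def error_rate(requests):
    """Fraction of requests that failed, between 0.0 and 1.0."""
    if not requests:
        raise ValueError("error_rate() needs at least one request")
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def uptime(total_seconds, downtime_seconds):
    """Fraction of the measurement window during which the service was up."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return 1 - downtime_seconds / total_seconds


# Synthetic sample: 100 requests with latencies of 1..100 ms, two of them failed.
sample = [Request(latency_ms=float(i), ok=(i % 50 != 0)) for i in range(1, 101)]
latencies = [r.latency_ms for r in sample]

p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
errors = error_rate(sample)
monthly_uptime = uptime(total_seconds=30 * 24 * 3600, downtime_seconds=2592)
```

P95 and P99 are usually reported alongside or instead of the mean because a small share of very slow requests can be invisible in an average while still affecting many users.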