SRE

Site Reliability Engineering (SRE) is a discipline that uses software engineering to solve infrastructure and operations problems, with the goal of building scalable, reliable software systems. It was developed at Google in the early 2000s to bridge development and operations and has since spread to many organizations.

A core SRE concept is aligning reliability with product goals through quantitative targets: service level indicators (SLIs) measure performance or availability, service level objectives (SLOs) specify target values, and error budgets cap how much unreliability is acceptable. The idea of toil (manual, repetitive operational work) drives automation and programmatic tooling to reduce incidents that consume time.

Practices include monitoring and alerting, incident response, and blameless postmortems focused on learning. SREs perform capacity planning, disaster recovery, and release engineering to ensure safe deployments and scalable operations. The aim is to automate routine tasks and enable developers to ship features without compromising reliability.

In organizations, SREs are software engineers who collaborate with product teams to define SLOs and implement reliability improvements. A blameless culture and data-driven retrospectives are central. SRE is often presented as a concrete implementation of DevOps principles, emphasizing engineering discipline and measurable reliability rather than process alone.

Common metrics include uptime, P95 or P99 latency, and error rates. Adoption varies; some teams use platform or reliability engineering groups in place of traditional operations. Practices adapt to organization size, product domain, and technology stack.
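
The quantitative ideas above (an SLI measured from request counts, an SLO target, the error budget the target implies, and P95/P99 latency) can be illustrated with a minimal Python sketch. The names `Slo`, `availability_sli`, `error_budget_remaining`, and `percentile` are hypothetical and do not come from any particular library. The sketch assumes a request-based availability SLI and the nearest-rank definition of a percentile, which is only one of several definitions in use.

```python
import math
from dataclasses import dataclass


@dataclass
class Slo:
    """Hypothetical SLO: target fraction of good events over some window."""
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def availability_sli(good: int, total: int) -> float:
    """SLI: fraction of events that were good (1.0 when there was no traffic)."""
    return good / total if total else 1.0


def error_budget_remaining(slo: Slo, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    The budget is the number of bad events the SLO tolerates:
    (1 - target) * total.
    """
    allowed_bad = (1.0 - slo.target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100], e.g. p=95 for P95 latency."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

For example, with a 99.9% target and 1,000,000 requests, the budget allows roughly 1,000 failed requests; if 500 have failed, `error_budget_remaining` reports that about half the budget is left. A team might use that figure to decide whether to keep shipping features or to prioritize reliability work.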