Site reliability engineering (SRE) is a job function,
a mind-set, and a set of engineering approaches for
making Web products and services run reliably. SREs
operate at the intersection of software development
and systems engineering to solve operational
problems and engineer solutions to design, build,
and run large-scale distributed systems
scalably, reliably, and efficiently.
SRE core functions include:
˲ Monitoring and metrics: Establishing desired service behavior, measuring how the service is actually behaving, and correcting discrepancies.
˲ Emergency response: Noticing and responding effectively to service failures in order to preserve the service’s conformance to its SLA (service-level agreement).
˲ Capacity planning: Projecting future demand and ensuring that a service has enough computing resources in appropriate locations to satisfy that demand.
˲ Service turn-up and turn-down: Deploying and removing computing resources for a service in a data center in a predictable fashion, often as a consequence of capacity planning.
˲ Change management: Altering the
behavior of a service while preserving
˲ Performance: Design, development, and engineering related to scalability, isolation, latency, throughput,
SREs focus on the life cycle of services—from inception and design,
through deployment, operation, refinement, and eventual decommissioning.
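To make the monitoring and emergency-response functions above more concrete, here is a minimal sketch of comparing measured service behavior against a desired target. It assumes availability is measured as the fraction of successful requests, which is only one of several common definitions. The function name `check_availability`, the `AvailabilityReport` type, and the 99.9% target in the usage note are hypothetical and chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class AvailabilityReport:
    availability: float      # fraction of requests that succeeded
    allowed_failures: int    # failures the target tolerates for this traffic volume
    remaining_failures: int  # allowed minus observed; negative means the target was missed
    meets_target: bool       # True while observed failures stay within the allowance


def check_availability(total_requests: int, failed_requests: int,
                       target: float) -> AvailabilityReport:
    """Compare measured request availability with a target (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in the range (0, 1]")

    availability = (total_requests - failed_requests) / total_requests
    # Round before flooring so floating-point noise in (1 - target)
    # cannot shift the allowance by one request.
    allowed = math.floor(round(total_requests * (1.0 - target), 6))
    remaining = allowed - failed_requests
    return AvailabilityReport(
        availability=availability,
        allowed_failures=allowed,
        remaining_failures=remaining,
        meets_target=failed_requests <= allowed,
    )
```

For example, `check_availability(1_000_000, 400, 0.999)` reports a service within its assumed target with room for 600 more failures, while a negative `remaining_failures` is the kind of discrepancy that would call for a response.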
Before services go live, SREs support them through activities such as
system design consulting, developing
software platforms and frameworks
and capacity plans, and conducting
Once services are live, SREs support
and maintain them by:
˲ Measuring and monitoring availability, latency, and overall system
How documentation enables SRE teams
to manage new and existing services.
BY SHYLAJA NUKALA AND VIVEK RAU