Bibliograph

Site reliability engineering, how Google runs production systems, edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy

Type

Label

Site reliability engineering, how Google runs production systems, edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy

Language

eng

Bibliography note

Includes bibliographical references (pages 501-512) and index

Illustrations

illustrations

Index

index present

Literary Form

non fiction

Main title

Site reliability engineering

Nature of contents

bibliography

OCLC number

930683030

Responsibility statement

edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy

Sub title

how Google runs production systems

Table Of Contents

Introduction. The production environment at Google, from the viewpoint of an SRE -- Principles. Embracing risk -- Service level objectives -- Eliminating toil -- Monitoring distributed systems -- The evolution of automation at Google -- Release engineering -- Simplicity -- Practices. Practical alerting from time-series data -- Being on-call -- Effective troubleshooting -- Emergency response -- Managing incidents -- Postmortem culture: learning from failure -- Tracking outages -- Testing for reliability -- Software engineering in SRE -- Load balancing at the frontend -- Load balancing in the datacenter -- Handling overload -- Addressing cascading failures -- Managing critical state: distributed consensus for reliability -- Distributed periodic scheduling with Cron -- Data processing pipelines -- Data integrity: what you read is what you wrote -- Reliable product launches at scale -- Management. Accelerating SREs to on-call and beyond -- Dealing with interrupts -- Embedding an SRE to recover from operational overload -- Communication and collaboration in SRE -- The evolving SRE engagement model -- Conclusions. Lessons learned from other industries

Classification