Tim Falzone’s Post

I am very happy to wrap up 2024 with an article that I co-wrote with Benjamin Treynor Sloss about our long-term effort to evolve Site Reliability Engineering at Google. We've been using Systems Theoretic Process Analysis (STPA) to anticipate and prevent problems in our complex systems with great success. This approach enables SREs to manage immense complexity and opens a whole new paradigm for Google's safety and reliability. Read more about our work over the past few years! https://lnkd.in/eWGwT3qk

Looking for "hazard states" reminds me of "impending doom" style alerts and all the back-and-forth arguments about cause-based vs. symptom-based alerting. Even the most hard-core symptom-based advocates admit that some cause-based alerts are useful, but these are often hard to identify and hard to tune thresholds for. Those cases always (nearly always? is there any other kind?) represent "hazard states", and the STPA approach gives you a systematic way of trying to find them. The next tricky part is tuning your alerting for them so that they get appropriately addressed without overwhelming people with too many alerts. The people dealing with these hazards are also part of the system, and pager fatigue is a very real failure mode. For tuning alert thresholds, I recommend a risk-analysis style approach that translates each hazard into the terms of your clearly defined SLOs. These "hazard states" represent a risk of an outage: how likely is an outage from this state, what is the average time until it happens, and how much of your SLO error budget will the outage cost? The cost divided by the duration is the effective SLO error budget burn rate of being in that state, so set your alert threshold and priority to reflect that.
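To make the cost/duration idea concrete, here is a rough Python sketch of one way it could be computed. Everything in it is an assumption for illustration: the `HazardState` fields, the 30-day SLO window, the choice to weight the outage cost by its probability, and the paging and ticketing thresholds are all hypothetical and not taken from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HazardState:
    """Hypothetical description of a hazard state; all fields are estimates."""
    name: str
    outage_probability: float      # chance that this state leads to an outage (0..1)
    mean_hours_to_outage: float    # average time from entering the state to the outage
    outage_budget_fraction: float  # share of the window's error budget one outage would cost


def burn_rate_multiple(hazard: HazardState, slo_window_hours: float = 30 * 24) -> float:
    """Effective error budget burn rate of sitting in a hazard state.

    Returned as a multiple of the "sustainable" rate, which is the rate that
    would use up exactly one full budget over the SLO window.
    """
    # Expected budget cost of the outage, weighted by how likely it is.
    expected_cost = hazard.outage_probability * hazard.outage_budget_fraction
    # Cost divided by duration, scaled by the window length so 1.0 means
    # "burning the budget exactly as fast as the SLO allows".
    return expected_cost * slo_window_hours / hazard.mean_hours_to_outage


def alert_priority(multiple: float, page_at: float = 10.0, ticket_at: float = 1.0) -> str:
    """Map a burn rate multiple to a priority; thresholds are illustrative only."""
    if multiple >= page_at:
        return "page"
    if multiple >= ticket_at:
        return "ticket"
    return "none"


if __name__ == "__main__":
    # Made-up hazards with made-up estimates, purely to show the arithmetic.
    hazards = [
        HazardState("disk_nearly_full", 0.5, 6.0, 0.25),
        HazardState("cert_expiring_soon", 0.2, 24.0, 0.5),
        HazardState("replica_lagging", 0.1, 48.0, 0.2),
    ]
    for h in hazards:
        m = burn_rate_multiple(h)
        print(f"{h.name}: {m:.1f}x sustainable burn -> {alert_priority(m)}")
```

Under these assumptions, a hazard that is likely to cause an expensive outage soon ranks as a page, while a slow or unlikely one drops to a ticket or no alert at all, which is one way to keep pager load proportional to risk.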

Lorin Hochstein

Staff Software Engineer

2mo

How did you come to learn about STAMP? The only tech company I was aware of that explicitly used STAMP was Akamai, a few years back.

John Thomas

Co-director of MIT's Engineering Systems Lab

2mo

This is excellent work, and it has the potential to impact reliability engineering across industries!

Ofir Cohen

CTO of Container Security @ Wiz | Public Speaker (CNCF, K8s)

2mo

I’m a huge fan of Google’s SRE culture and have recently finished reading the SRE book. I’ve got the other 2 books on my list. Happy to follow new content and updates :-)

Andrew Mallaband

Growth Engineering | Enabling Tech Leaders & Innovators Around The Globe To Achieve Exceptional Results

2mo

This is a great article and worthwhile reading for any practitioners (#platformengineering/#devops) who are thinking about their journey/roadmap of innovation as it relates to #SRE and #Observability. Vendors should pay attention to this and think about how their products could help end users implement these concepts.

Great article! Using the STAMP framework to approach reliability more comprehensively and proactively than standard incident response procedures allow looks to be a powerful step forward for #SRE. Anyone maintaining complex systems should seriously consider this control theory paradigm shift.

👋 Dale Harrison

Senior Director, Engineering at Google

2mo

Great to see this out in the open, and appreciated your continued push to make Maps safer. You're also pretty fun to work with, but I won't ever admit that in public.

John Lunney

Senior Staff Reliability SWE at Google

1mo

Good job on actually sticking to your plan.

Eloise Koullapis

Infrastructure Engineer, Resilience Engineering Champion

1mo

Tim Falzone I have been working on implementing this in my organisation. I would love to have a chat about some of the concepts :)
