I am very happy to wrap up 2024 with an article that I co-wrote with Benjamin Treynor Sloss about our long-term effort to evolve Site Reliability Engineering at Google. We've been using System-Theoretic Process Analysis (STPA) to anticipate and prevent problems in our complex systems with great success. This approach enables SREs to manage immense complexity and opens a whole new paradigm for Google's safety and reliability. Read more about our work over the past few years! https://lnkd.in/eWGwT3qk
How did you come to learn about STAMP? The only tech company I was aware of that explicitly used STAMP was Akamai, a few years back.
This is excellent work, and it has the potential to impact reliability engineering across industries!
I’m a huge fan of Google’s SRE culture and recently finished reading the SRE book. I’ve got the other two books on my list. Happy to follow new content and updates :-)
This is a great article and worthwhile reading for any practitioners (#platformengineering/#devops) who are thinking about their journey/roadmap of innovation as it relates to #SRE and #Observability. Vendors should pay attention to this and think about how their products could help end users implement these concepts.
Great article! Using the STAMP framework to approach reliability more comprehensively and proactively than standard incident response procedures allow looks to be a powerful step forward for #SRE. Anyone maintaining complex systems should seriously consider this control theory paradigm shift.
Great to see this out in the open, and appreciated your continued push to make Maps safer. You're also pretty fun to work with, but I won't ever admit that in public.
Good job on actually sticking to your plan.
Tim Falzone I have been working on implementing this in my organisation. I would love to have a chat about some of the concepts :)
Looking for "hazard states" reminds me of the "impending doom" style alerts, and all the arguments back and forth about cause-based vs symptom-based alerting. Even the most hard-core "symptom-based" people admit that sometimes other "cause-based" alerts are useful, but it is often hard to identify them and to tune their alert thresholds. These cases always (nearly always? is there any other kind?) represent "hazard states", and the STPA approach gives you a systematic way of trying to find them. The next tricky part is figuring out how to tune your alerting for them so that they get appropriately addressed without overwhelming people with too many alerts. The people dealing with these hazards are also part of the system, and pager fatigue is a very real failure mode. For tuning alert thresholds I recommend a risk-analysis style approach that translates the hazard into your clearly defined SLOs. These "hazard states" represent a risk of an outage: what is the probability of an outage from this state, what is the average time until it happens, and how much of your SLO error budget will the outage cost? The cost divided by that duration is the effective SLO error budget burn rate of being in that state, so set your alert threshold and priority to reflect that.
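To make the burn-rate idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the `HazardState` fields, the choice to weight the cost by the outage probability, and the `page_at` / `ticket_at` thresholds are hypothetical and would need to come from your own SLOs and incident history.

```python
from dataclasses import dataclass


@dataclass
class HazardState:
    """Hypothetical description of a hazard state; all fields are estimates."""
    name: str
    p_outage: float         # estimated probability the state leads to an outage
    hours_to_outage: float  # estimated average time from entering the state to outage
    budget_cost: float      # fraction of the SLO error budget the outage would consume


def burn_rate(hazard: HazardState) -> float:
    """Expected fraction of error budget burned per hour spent in the state.

    Assumption: the outage cost is weighted by its probability, then spread
    over the expected time until the outage (cost / duration).
    """
    if hazard.hours_to_outage <= 0:
        raise ValueError("hours_to_outage must be positive")
    return hazard.p_outage * hazard.budget_cost / hazard.hours_to_outage


def alert_priority(rate: float, page_at: float = 0.02, ticket_at: float = 0.002) -> str:
    """Map a burn rate to a priority; the thresholds are made-up examples."""
    if rate >= page_at:
        return "page"
    if rate >= ticket_at:
        return "ticket"
    return "none"


# Illustrative numbers only, not measurements from any real system.
hazards = [
    HazardState("disk_filling", p_outage=0.5, hours_to_outage=4, budget_cost=0.4),
    HazardState("cert_expiring", p_outage=0.9, hours_to_outage=240, budget_cost=0.6),
    HazardState("stale_replica", p_outage=0.05, hours_to_outage=100, budget_cost=0.1),
]

for h in hazards:
    rate = burn_rate(h)
    print(f"{h.name}: {rate:.5f} of budget per hour -> {alert_priority(rate)}")
```

The point of the sketch is only that a fast-approaching, expensive hazard ends up paging, while a slow or unlikely one becomes a ticket or nothing at all, which keeps pager load tied to the error budget rather than to raw cause-based signals.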