Tim Falzone’s Post

2mo

I am very happy to wrap up 2024 with an article that I co-wrote with Benjamin Treynor Sloss about our long-term effort to evolve Site Reliability Engineering at Google. We've been using Systems Theoretic Process Analysis (STPA) to anticipate and prevent problems in our complex systems with great success. This approach enables SREs to manage immense complexity and opens a whole new paradigm for Google's safety and reliability. Read more about our work over the past few years! https://lnkd.in/eWGwT3qk

The Evolution of SRE at Google

usenix.org

17 Comments

Donovan Baarda

2mo

Looking for "hazard states" reminds me of the "impending doom" style alerts, and all the back-and-forth arguments about cause-based vs symptom-based alerting. Even the most hard-core "symptom-based" people admit that some "cause-based" alerts are useful, but such alerts are often hard to identify and their thresholds are hard to tune. These cases always (nearly always? is there any other kind?) represent "hazard states", and the STPA approach gives you a systematic way of trying to find them. The next tricky part is figuring out how to tune your alerting for them so that they get appropriately addressed without overwhelming people with too many alerts. The people dealing with these hazards are also part of the system, and pager fatigue is a very real failure mode. For tuning alert thresholds I recommend a risk-analysis-style approach that translates each hazard into the terms of your clearly defined SLOs. These "hazard states" represent a risk of an outage: what is the probability of an outage, expressed as the average duration until one happens from this state, and how much of your SLO error budget will the outage cost? The cost divided by the duration is the effective SLO error budget burn rate of being in that state, so set your alert threshold and priority to reflect that.
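One way to read the cost/duration idea in this comment is sketched below in Python. It is only an illustration under stated assumptions: the `HazardState` class, the function names, the 30-day window, the example numbers, and the page/ticket thresholds are all hypothetical and do not come from the comment or the article.

```python
from dataclasses import dataclass


@dataclass
class HazardState:
    name: str
    mean_hours_to_outage: float     # assumed estimate: average time in this state before an outage
    expected_outage_minutes: float  # assumed estimate: how long the resulting outage lasts


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total downtime the SLO allows over the window, in minutes."""
    return (1.0 - slo) * window_days * 24 * 60


def hazard_burn_rate(hazard: HazardState, slo: float, window_days: int = 30) -> float:
    """Cost divided by duration, both expressed as fractions of the SLO window.

    A result of 1.0 means the hazard spends error budget exactly as fast
    as the SLO can sustain; larger values exhaust the budget early.
    """
    window_minutes = window_days * 24 * 60
    budget_fraction_spent = hazard.expected_outage_minutes / error_budget_minutes(slo, window_days)
    window_fraction_elapsed = (hazard.mean_hours_to_outage * 60) / window_minutes
    return budget_fraction_spent / window_fraction_elapsed


def alert_priority(burn_rate: float, page_at: float = 10.0, ticket_at: float = 1.0) -> str:
    """Map a burn rate to a priority; both thresholds are illustrative placeholders."""
    if burn_rate >= page_at:
        return "page"
    if burn_rate >= ticket_at:
        return "ticket"
    return "none"


# Hypothetical hazard: on average 48 hours until a 30-minute outage, against a 99.9% SLO.
hazard = HazardState("storage quorum at minimum", mean_hours_to_outage=48, expected_outage_minutes=30)
rate = hazard_burn_rate(hazard, slo=0.999)
print(f"{hazard.name}: burn rate {rate:.1f}x, priority {alert_priority(rate)}")
```

With these made-up inputs the hazard burns budget roughly ten times faster than the SLO can sustain, so it maps to a page. A hazard that typically takes weeks to turn into a short outage would fall below 1.0 and raise no alert.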

Lorin Hochstein

Staff Software Engineer

2mo

How did you come to learn about STAMP? The only tech company I was aware of that explicitly used STAMP was Akamai, a few years back.

4 Reactions

John Thomas

Co-director of MIT's Engineering Systems Lab

2mo

This is excellent work, and it has the potential to impact reliability engineering across industries!

3 Reactions

Ofir Cohen

CTO of Container Security @ Wiz | Public Speaker (CNCF, K8s)

2mo

I’m a huge fan of Google’s SRE culture and have recently finished reading the SRE book. I’ve got the other 2 books on my list. Happy to follow new content and updates :-)

Andrew Mallaband

Growth Engineering | Enabling Tech Leaders & Innovators Around The Globe To Achieve Exceptional Results

2mo

This is a great article and worthwhile reading for any practitioners (#platformengineering/#devops) who are thinking about their journey/roadmap of innovation as it relates to #SRE and #Observability. Vendors should pay attention to this and think about how their products could help end users implement these concepts.

1 Reaction

Michael Allan

CTO | Zai Payments

2mo

Great article! Using the STAMP framework to approach reliability more comprehensively and proactively than standard incident response procedures allow looks to be a powerful step forward for #SRE. Anyone maintaining complex systems should seriously consider this control theory paradigm shift.

👋 Dale Harrison

Senior Director, Engineering at Google

2mo

Great to see this out in the open, and appreciated your continued push to make Maps safer. You're also pretty fun to work with, but I won't ever admit that in public.

1 Reaction

John Lunney

Senior Staff Reliability SWE at Google

1mo

Good job on actually sticking to your plan.

Eloise Koullapis

Infrastructure Engineer, Resilience Engineering Champion

1mo

Tim Falzone I have been working on implementing this in my organisation. I would love to have a chat about some of the concepts :)

More Relevant Posts

Jeffrey Snover
2mo
Check out this excellent paper on the evolution of SRE at Google through the adoption of Safety Design principles (aka STPA and CAST).

Semion Akimtsev

Engineer 🛠️ Skier ⛷️ Father of twins 👯♀️
1mo
Back in my university days we studied a fairly wide range of fundamental disciplines related to control theory, covering analog and digital ways of solving various problems. I remember the feeling you get when you analyze a system with transfer and step functions and see how it would behave under certain conditions, without even building it. In other courses there were more practical ways to look at systems holistically and try to understand their dynamics, including statistical analysis. This article underscores a widespread attitude toward software and toward understanding how it works. Logging is largely considered a “just enough” way to collect data, sometimes without even using structured logging and correlation. I’ve also been in conversations along the lines of "oh, we have disabled tracing in production, because it costs." Then how would it be possible to find the root cause of an issue, or simply trace the stages of a process? In my opinion, scientific and mathematical methods are mostly not used, or are considered too sophisticated or irrelevant, because general software is not that complex and/or because it is built with a “we will figure it out later” mentality. I think developers should help themselves be ready to resolve complex production issues by reusing engineering techniques that have been known for a long time.

Andrew Mallaband

Growth Engineering | Enabling Tech Leaders & Innovators Around The Globe To Achieve Exceptional Results
2mo
If you are involved in #SRE and #Observability this is a worthwhile read. Vendors in this space should also digest this and think carefully about how their products might support these practices: https://lnkd.in/eS-U8ZUY

Salim Virji

Site Reliability Engineer
2mo
This article and its insights are 🔥

Shane Markstrum

Director of Software Engineering at Google
1mo
I’ve been working with Tim and his SRE team for many years at this point. They are true partners helping to ensure that we prevent outages and minimize risks in our Geo products. I can also attest that there is real power in STPA as a method for understanding control hazards in complex systems. There can be some high costs to the initial set up, but there are real upsides to actively preventing the worst scenarios by identifying areas of system decisions with inadequate controls or visibility. It is an empowering game changer for engineering teams to step away from post-mortem analysis as the primary driver for systemic changes.

Colin Breck

Principal Engineer (Add a note if you want to connect)
1mo
Interesting paper. This is the evolution in thinking I was anticipating when I wrote my essay “Observations on Observability”: less emphasis on the discrete and linear using inductive reasoning, more emphasis on the gestalt using more systemic approaches. My essay from 2019: https://lnkd.in/gwU48j6

Arseny Krasikov

IT QA | PM | System thinking / Engineering | Cybernetics
2mo Edited
#Cybernetics and #ControlTheory are good (only?) tools for technical systems reliability. Still, I was quite surprised to see direct references to Kalman and Ashby in Google SRE practices, in an article by Tim Falzone and Benjamin Treynor Sloss. Next, it will be interesting to see how the Law of Requisite Variety meets the complexity of modern distributed technical systems.

Venkatesh Sarivisetty

Lead SRE Cloud DevOps Engineer at TransUnion | Mentor | Blogger | ML and AI enthusiastic
3mo Edited
"What if I told you breaking your system could make it stronger?" Looks 🤪 Here we go ✍️

Chaos Engineering: Building Resilient Systems

🔧 What is Chaos Engineering?
A proactive practice to test system resilience by intentionally introducing failures in a controlled manner.

🎯 Why It Matters
💡 Discover weaknesses before they become outages.
🚀 Build confidence in system reliability.
🛡️ Ensure a seamless user experience during disruptions.

📋 Core Steps
1️⃣ Define Steady State: What does “normal” look like?
2️⃣ Create Hypotheses: Predict system behavior under failure.
3️⃣ Run Experiments: Simulate issues like: 🔥 Service crashes, 🌐 Network delays, 🧮 High CPU usage
4️⃣ Analyze & Learn: Use observability tools (e.g., Grafana, Datadog) to improve.

⚙️ Popular Tools
Chaos Monkey 🐵: Random instance failures.
Gremlin 🛠️: Controlled fault injection.
LitmusChaos ☸️: Kubernetes-specific experiments.

🏆 Real-World Wins
Netflix: Stream without interruptions.
Amazon: Handle Black Friday spikes.
Google: Avoid cascading failures.

🔍 Key Takeaway: Break it intentionally, fix it intelligently

"Liked this post? Follow me for more insights on engineering practices, cloud solutions, and building resilient systems!" Thanks Venkatesh Sarivisetty
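The four core steps in the chaos engineering post above (steady state, hypothesis, experiment, analysis) can be sketched as a toy, in-process Python simulation. Everything here is hypothetical: `Replica`, `Service`, `run_experiment`, and the 99.9% steady-state threshold are made-up names and values for illustration, and the code does not use or imitate Chaos Monkey, Gremlin, or LitmusChaos.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    healthy: bool = True

    def handle(self, request_id: int) -> bool:
        # A healthy replica serves the request; an unhealthy one fails it.
        return self.healthy


@dataclass
class Service:
    replicas: list

    def call(self, request_id: int) -> bool:
        # Simple failover: the request succeeds if any replica can serve it.
        return any(replica.handle(request_id) for replica in self.replicas)


def success_rate(service: Service, n_requests: int = 1000) -> float:
    """Fraction of simulated requests that the service answers."""
    return sum(service.call(i) for i in range(n_requests)) / n_requests


def run_experiment(service: Service, victim: Replica, steady_state_min: float = 0.999) -> dict:
    # Step 1: measure the steady state before touching anything.
    baseline = success_rate(service)
    # Step 2: hypothesis - success rate stays at or above steady_state_min with one replica down.
    # Step 3: inject the fault in a controlled way, and always roll it back.
    victim.healthy = False
    try:
        during_fault = success_rate(service)
    finally:
        victim.healthy = True
    # Step 4: compare the observation against the hypothesis.
    return {
        "baseline": baseline,
        "during_fault": during_fault,
        "hypothesis_held": during_fault >= steady_state_min,
    }


redundant = Service([Replica("a"), Replica("b")])
print(run_experiment(redundant, redundant.replicas[0]))

single = Service([Replica("only")])
print(run_experiment(single, single.replicas[0]))
```

In this sketch the two-replica service keeps its steady state when one replica is taken down, while the single-replica service does not, which is the kind of weakness such an experiment is meant to surface before a real outage does.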
Jeffrey Snover
2mo Edited
Fun facts: 1) Ben Treynor talked to me about STPA and that conversation was one of the reasons I decided to come to Google. (I didn’t join Google - I joined Ben. Had I known that I would also get to work with Tim, that would have been another reason to join!) 2) We’ve been working with John Thomas to apply/tweak/apply STPA to complicated software systems. He is fantastic! 3) STPA is NOT a ‘read the book and give it a shot’ sort of thing. If you want to get STPA going in your org, you should connect with John to figure out how.
John Thomas

Co-director of MIT's Engineering Systems Lab
2mo Edited

The Site Reliability Engineering (SRE) discipline was developed at Google 20 years ago to address the challenges of large-scale systems with high availability and reliability. The founder of SRE, Ben Treynor Sloss, adapted software engineering principles to drastically improve the reliability of large-scale operations. It's been widely successful, and is now adopted by nearly every leading tech company, including Netflix, Amazon, Microsoft, Google, Meta, Apple, and others. According to the founder, while SRE has excelled at reacting to loss events and ensuring similar losses are prevented, the harder challenge has been to anticipate what will go wrong before it happens. That's what Ben and his Google team have been doing in the last few years--integrating modern approaches into SRE to quickly and effectively anticipate the next outage before it happens. At the heart of SRE 2.0 is STPA, a systems approach used in aerospace, automotive, and other industries to anticipate hazardous interactions that can cause future losses. The Google SRE team has now evaluated STPA on dozens of software-intensive applications, and they are reporting that it quickly identifies subtle but catastrophic behaviors from both humans and software that were otherwise overlooked. For example, STPA revealed a critical feedback problem in automated software quota adjustments, resulting in easy fixes to prevent sudden drastic quota reductions that could delete data. They also used STPA to improve many engineering practices, like what happens when an SME is asked to review code but is unable to respond in time before the code is pushed to production. Read more about these and other findings from SRE's founder and Google's SRE team that have been integrating STPA into their practices. https://lnkd.in/e9Vgzr92 #SRE #software #STPA #reliability Tim Falzone