How many of your supposedly internal-only backends are serving critical end-user traffic? Probably more than you think! Most frontend services have many transitive dependencies, and understanding these dependencies becomes increasingly difficult as your system grows in scale and complexity. As a result, it becomes exceptionally tricky to ensure that internal-only backends are fully isolated from end users.
Google uses a common infrastructure platform for almost all of its services, internal and external, which provides a message-passing mechanism shared by all of them. One outcome is that systems engineers can build accurate dependency maps of services, and that we can ensure new services depend only on other services at the same level or higher in the dependency stack.
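The layering rule can be sketched as a simple check over a dependency map. This is a minimal illustration, not Google's implementation: the service names, levels, and edges below are invented for the example, with larger numbers meaning deeper in the stack.

```python
# Illustrative layer assignments: 0 = frontend, higher = deeper in the stack.
LEVELS = {"frontend": 0, "maps-api": 1, "tile-server": 2, "storage": 3}

# Dependency edges point from a service to one of its backends.
DEPENDENCIES = [
    ("frontend", "maps-api"),
    ("maps-api", "tile-server"),
    ("tile-server", "storage"),
]

def layering_violations(levels, deps):
    """Return edges where a service depends on a backend above it in the stack."""
    return [(svc, backend) for svc, backend in deps
            if levels[backend] < levels[svc]]

print(layering_violations(LEVELS, DEPENDENCIES))  # no violations in this graph
```

A back-edge such as `("storage", "frontend")` would be flagged, which is exactly the kind of dependency that creates cycles and puts the wrong services on the critical path.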
Risky dependencies — in this case, dependencies that place internal-only backends on the critical path for end users — can be a source of major production outages, because internal-only backends generally lack the necessary reliability and safety guarantees. For example, you might have an externally visible service that requires high availability and performance, yet has an indirect dependency on a backend with no availability or performance SLOs.
At Google, we have found success preventing major outages by clustering services into categories with distinct properties, clear perimeters, and a limited blast radius. The properties depend on the guarantees we want to enforce.
In this article, we use Google Maps to demonstrate how risky dependencies proliferate in complex systems, and then show how clustering Google Maps services into just two categories ("Internal" and "External") can prevent major serving outages. We also outline how we use OpenTelemetry to identify violations of this clustering, so they can be fixed before they cause an outage.
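The two-category perimeter check can be illustrated with a small sketch. Assume distributed traces (for example, exported OpenTelemetry spans) have already been reduced to caller→callee service pairs; the service names and category assignments below are hypothetical, and this is not the production tooling described in the talk.

```python
# Hypothetical category assignments for services observed in traces.
CATEGORY = {
    "maps-frontend": "External",
    "tile-server": "External",
    "batch-reindexer": "Internal",
    "experiments-db": "Internal",
}

def clustering_violations(edges, category):
    """Edges where an External (user-facing) service calls an Internal-only backend.

    Each such edge puts an internal-only backend on the end-user critical path.
    """
    return [(caller, callee) for caller, callee in edges
            if category[caller] == "External" and category[callee] == "Internal"]

# Edges recovered from traces; the second one is a perimeter violation.
observed = [("maps-frontend", "tile-server"), ("tile-server", "experiments-db")]
print(clustering_violations(observed, CATEGORY))
```

The point of deriving edges from traces rather than from declared dependencies is that traces capture what actually happens in production, including transitive calls nobody remembered to document.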
This article is based on a talk at SREcon EMEA, which you can see here: https://lnkd.in/e8i7pF_r.