Groot an Event-graph-based Approach for Root
arXiv:2108.00344v2 [cs.SE] 22 Sep 2021

§ Tao Xie is also affiliated with Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, China. Hanzhang Wang is the corresponding author.

Abstract—For large-scale distributed systems, it is crucial to efficiently diagnose the root causes of incidents to maintain high system availability. The recent development of microservice architecture brings three major challenges (i.e., complexities of operation, system scale, and monitoring) to root cause analysis (RCA) in industrial settings. To tackle these challenges, in this paper, we present GROOT, an event-graph-based approach for RCA. GROOT constructs a real-time causality graph based on events that summarize various types of metrics, logs, and activities in the system under analysis. Moreover, to incorporate domain knowledge from site reliability engineering (SRE) engineers, GROOT can be customized with user-defined events and domain-specific rules. Currently, GROOT supports RCA among 5,000 real production services and is actively used by the SRE teams at eBay, a global e-commerce system serving more than 159 million active buyers per year. Over 15 months, we collect a data set containing labeled root causes of 952 real production incidents for evaluation. The evaluation results show that GROOT is able to achieve 95% top-3 accuracy and 78% top-1 accuracy. To share our experience in deploying and adopting RCA in industrial settings, we conduct a survey to show that users of GROOT find it helpful and easy to use. We also share the lessons learned from deploying and adopting GROOT to solve RCA problems in production environments.

Index Terms—microservices, root cause analysis, AIOps, observability

I. INTRODUCTION

Since the emergence of microservice architecture [1], it has been quickly adopted by many large companies such as Amazon, Google, and Microsoft. Microservice architecture aims to improve the scalability, development agility, and reusability of these companies' business systems. Despite these undeniable benefits, different levels of components in such a system can go wrong due to the fast-evolving and large-scale nature of microservice architecture [1]. Even if there are minimal human-induced faults in code, the system might still be at risk due to anomalies in hardware, configurations, etc. Therefore, it is critical to detect anomalies and then efficiently analyze the root causes of the associated incidents, subsequently helping the site reliability engineering (SRE) team take further actions to bring the system back to normal.

In the process of recovering a system, it is critical to conduct accurate and efficient root cause analysis (RCA) [2], the second step of a three-step process. In the first step, anomalies are detected with alerting mechanisms [3]–[5] based on monitoring data such as logs [6]–[10], metrics/key performance indicators (KPIs) [11]–[15], or a combination thereof [16], [17]. In the second step, when the alerts are triggered, RCA is performed to analyze the root cause of these alerts and additional events, and to propose recovery actions from the associated incident [6], [18], [19]. RCA needs to consider multiple possible interpretations of potential causes for the incident, and these different interpretations could lead to different mitigation actions to be performed. In the last step, the SRE teams perform those mitigation actions and recover the system.

Based on our industrial SRE experiences, we find that RCA is difficult in industrial practice due to three complexities, particularly under microservice settings:

• Operational Complexity. For large-scale systems, there are typically centered (aka infrastructure) SRE and domain (aka embedded) SRE engineers [20]. Their communication is often ineffective or limited under the microservice scenarios due to a more diversified tech stack, granular services, and shorter life cycles than traditional systems. The knowledge gap between the centered SRE team and the domain SRE team gets further enlarged and makes RCA much more challenging. Centered SRE engineers have to learn from domain SRE engineers how new domain changes work in order to update the centralized RCA tools. Thus, adaptive and customizable RCA is required instead of one-size-fits-all solutions.
• Scale Complexity. There could be thousands of services simultaneously running in a large microservice system, resulting in a very high number of monitoring signals. A real incident could cause numerous alerts to be triggered across services. The inter-dependencies and incident triaging between the services are proportionally more complicated than in a traditional system [15]. To detect root causes that may be distributed and many steps away from an initially observed anomalous service, the RCA approach must be scalable and very efficient to digest high-volume signals.
• Monitoring Complexity. A large number of observability data types (metrics, logs, and activities) need to be monitored, stored, and processed, such as intra-service and inter-service metrics. Different services in a system may
produce different types of logs or metrics with different patterns. There are also various kinds of activities, such as code deployment or configuration changes. The RCA tools must be able to consume such highly diversified and unstructured data and make inferences.

To overcome the limited effectiveness of existing approaches [2], [3], [14], [16], [21]–[31] (as mentioned in Section II) in industrial settings due to the aforementioned complexities, we propose GROOT, an event-graph-based RCA approach. In particular, GROOT constructs an event causality graph, whose basic nodes are monitoring events such as performance-metric deviation events, status change events, and developer activity events. These events carry detailed information to enable accurate RCA. The events and the causalities between them are constructed using specified rules and heuristics (reflecting domain knowledge). In contrast to the existing fully learning-based approaches [3], [10], [23], GROOT provides better transparency and interpretability. Such interpretability is critical in our industrial settings because a graph-based approach can offer visualized reasoning with causality links to the root cause and details of every event instead of just listing the results. Besides, our approach can enable effective tracking of cases and targeted detailed improvements, e.g., by enhancing the rules and heuristics used to construct the graph.

GROOT has two salient advantages over existing graph-based approaches:

• Fine granularity (events as basic nodes). First, unlike existing graph-based approaches, which directly use services [25] or hosts (VMs) [30] as basic nodes, GROOT constructs the causality graph by using monitoring events as basic nodes. Graphs based on events from the services can provide more accurate results to address the monitoring complexity. Second, for the scale complexity, GROOT can dynamically create hidden events or additional dependencies based on the context, such as adding dependencies to the external service providers and their issues. Third, to construct the causality graph, GROOT takes the detailed contextual information of each event into consideration for analysis with more depth. Doing so also helps GROOT incorporate SRE insights with the context details of each event to address the operational complexity.
• High diversity (a wide range of event types supported). First, the causality graph in GROOT supports various event types such as performance metrics, status logs, and developer activities to address the monitoring complexity. This multi-scenario graph schema can directly boost the RCA coverage and precision. For example, GROOT is able to detect a specific configuration change on a service as the root cause instead of performance anomaly symptoms, thus reducing triaging efforts and time-to-recovery (TTR). Second, GROOT allows the SRE engineers to introduce different event types that are powered by different detection strategies or from different sources. For the rules that decide causality between events, we design a grammar that allows easy and fast implementations of domain-specific rules, narrowing the knowledge gap of the operational complexity. Third, GROOT provides a robust and transparent ranking algorithm that can digest diverse events, improve accuracy, and produce results interpretable by visualization.

To demonstrate the flexibility and effectiveness of GROOT, we evaluate it on eBay's production system that serves more than 159 million active users and features more than 5,000 services deployed over three data centers. We conduct experiments on a labeled and validated data set to show that GROOT achieves 95% top-3 accuracy and 78% top-1 accuracy for 952 real production incidents collected over 15 months. Furthermore, GROOT is deployed in production for real-time RCA and is used daily by both centered and domain SRE teams, achieving 73% top-1 accuracy in action. Finally, the end-to-end execution time of GROOT for each incident in our experiments is less than 5 seconds, demonstrating the high efficiency of GROOT.

We report our experiences and lessons learned when using GROOT to perform RCA in the industrial e-commerce system. We survey the SRE users and developers of GROOT, who find GROOT easy to use and helpful during the triage stage. Meanwhile, the developers also find that the GROOT design makes it easy to make changes and accommodate new requirements. We also share the lessons learned from adopting GROOT in production for SRE in terms of technology transfer and adoption.

In summary, this paper makes four main contributions:

• An event-graph-based approach named GROOT for root cause analysis tackling challenges in industrial settings.
• Implementation of GROOT in an RCA framework for allowing the SRE teams to instill domain knowledge.
• Evaluation performed in eBay's production environment with more than 5,000 services, for demonstrating GROOT's effectiveness and efficiency.
• Experiences and lessons learned when deploying and applying GROOT in production.

II. RELATED WORK

Anomaly Detection. Anomaly detection aims to detect potential issues in the system. Anomaly detection approaches using time series data can generally be categorized into three types: (1) batch-processing and historical analysis, such as Surus [32]; (2) machine-learning-based, such as Donut [12]; (3) use of adaptive concept drift, such as StepWise [33]. GROOT currently uses a combination of manually written thresholds, statistical models, and machine learning (ML) algorithms to detect anomalies. Since our approach is event-driven, as long as fairly accurate alerts are generated, GROOT is able to incorporate them.

Root Cause Analysis. Traditional RCA approaches (e.g., Adtributor [34] and HotSpot [35]) find the multi-dimensional combination of attribute values that would lead to certain quality of service (QoS) anomalies. These approaches are
effective on discrete static data. Once continuous data are introduced by time series information, these approaches would be much less effective.

To tackle these difficulties, there are two categories of approaches based on ML and graph, respectively.

ML-based RCA. Some ML-based approaches use features such as time series information [23], [30] and features extracted using textual and temporal information [3]. Some other approaches [12] conduct deep learning by first constructing the dependency graph of the system and then representing the graph in a neural network. However, these ML-based approaches face the challenge of lacking training data. Gan et al. [10] proposed Seer to make use of historical tracking data. Although Seer also focuses on the microservice scenario, it is designed to detect QoS violations while lacking support for other kinds of errors. There is also an effort to use unsupervised learning such as GAN [12], but it is generally hard to simulate large, complicated distributed systems to give meaningful data.

Graph-based RCA. A recent survey [2] on RCA approaches categorizes more than 20 RCA algorithms by more than 10 theoretical models to represent the relationships between components in a microservice system. Nguyen et al. [21] proposed FChain, which introduces time series information into the graph, but it still uses servers/VMs as nodes in the graph. Chen et al. [22] proposed CauseInfer, which constructs a two-layered hierarchical causality graph. It applies metrics as nodes that indicate service-level dependency. Schoenfisch et al. [24] proposed to use a Markov Logic Network to express conditional dependencies in first-order logic, but still build dependencies at the service level. Lin et al. [36] proposed MicroScope, which targets the microservice scenario. It builds the graph only on service-level metrics, so it cannot make full use of other information and lacks customization. Brandon et al. [25] proposed to build the system graph using metrics, logs, and anomalies, and then use pattern matching against a library to identify the root cause. However, it is difficult to update the system to accommodate changing requirements. Wu et al. [15] proposed MicroRCA, which models both services and machines in the graph and tracks the propagation among them. It would be hard to extend the graph from machines to the concept of other resources, such as databases, that we consider in this paper.

As mentioned in Section I, by using the event graph, GROOT mainly overcomes the limitations of existing graph-based approaches in two aspects: (1) building a more accurate and precise causality graph using the event-graph-based model; (2) allowing adaptive customization of link construction rules to incorporate domain knowledge in order to accommodate the rapid requirement changes in the microservice scenario.

Our GROOT approach uses a customized PageRank algorithm in the event ranking, and can also be seen as an unsupervised ML approach. Therefore, GROOT is complementary to other ML approaches as long as they can accept our event causality graph as a feature.

Settings and Scale. The challenges of operational, scale, and monitoring complexities are observed, and are especially substantial in industrial settings. Hence, we believe that the target RCA approach should be validated at the enterprise scale and against actual incidents for effectiveness.

TABLE I: The scale of experiments in existing RCA approaches' evaluations (QPS: queries per second)

Approach            Year  Scale                          Validated on Real Incidents?
FChain [21]         2013  <=10 VMs                       No
CauseInfer [22]     2014  20 services on 5 servers       No
MicroScope [36]     2018  36 services, ~5000 QPS         No
APG [30]            2018  <=20 services on 5 VMs         No
Seer [10]           2019  <=50 services on 20 servers    Partially
MicroRCA [15]       2020  13 services, ~600 QPS          No
RCA Graph [25]      2020  <=70 services on 8 VMs         No
Causality RCA [31]  2020  <=20 services                  No

Table I lists the experimental settings and scale in existing RCA approaches' evaluations. All the listed existing approaches are evaluated in a relatively small scenario. In contrast, our experiments are performed upon a system containing 5,000 production services on hundreds of thousands of VMs. On average, the sub-dependency graph (constructed in Section IV-A) of our service-based data set already contains 77.5 services, more than the total number in any of the listed evaluations. Moreover, 7 out of the 8 listed approaches are evaluated under simulated fault injection on top of existing benchmarks such as RUBiS, which cannot represent real-world incidents; Seer [10] collects real-world results but does not validate them. Our data set contains 952 actual incidents collected from real-world settings.

III. MOTIVATING EXAMPLES

In this section, we demonstrate the effectiveness of the event-based graph and adaptive customization strategies with two motivating examples.

Figure 1 shows an abstracted real incident example with the dependency graph and the corresponding causality graph constructed by GROOT. The Checkout service of our e-commerce system suddenly gets an additional latency spike due to a code deployment on Service-E. The service monitor is reporting API Call Timeout detected by the ML-based anomaly detection system. The simplified sub-dependency graph consisting of 6 services is shown in Figure 1a. The initial alert is triggered on the Checkout (entrance) service. The other nodes Service-* are the internal services that the Checkout service directly or indirectly depends on. The color of the nodes in Figure 1a indicates the severity/count of anomalies (alerts) reported on each service. We can see that Service-B is the most severe one as there are two related alerts on it. The traditional graph-based approach [25], [30] usually takes into account only the graph between services in addition to the severity information on each service. If the traditional approach were applied to Figure 1a, either Service-B, Service-D, or Service-E could be a potential root cause, and Service-B would have the highest possibility since it has two related alerts. Such results are not useful to the SRE teams.

GROOT constructs the event-based causality graph as shown in Figure 1b. The events in each service are used as the nodes here. We can see that the API Call Timeout issue in Checkout is possibly caused by API Call Timeout in Service-A, which is
(a) Dependency graph (b) Causality graph
Fig. 1: Motivating example of event causality graph
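The contrast between the service-level view of Figure 1a and the event-level view of Figure 1b can be illustrated with a minimal Python sketch. It is not GROOT's implementation: it uses a plain breadth-first traversal rather than the customized PageRank ranking, and several inputs are assumptions made only for illustration. The text states only that Service-B has two alerts, that the Checkout timeout is possibly caused by a timeout in Service-A, and that the incident stems from a code deployment on Service-E; the alert counts for Service-D and Service-E and the intermediate causal links are hypothetical.

```python
from collections import deque

# Service-level view: alert counts per service.
# Only Service-B's count (2) is stated in the text; the others are assumed.
ALERT_COUNT = {"Service-B": 2, "Service-D": 1, "Service-E": 1}

# Event-level view: each event maps to the events that possibly caused it.
# Only the Checkout -> Service-A link and the Service-E code deployment come
# from the text; the intermediate events and links are hypothetical.
CAUSED_BY = {
    ("Checkout", "API Call Timeout"): [("Service-A", "API Call Timeout")],
    ("Service-A", "API Call Timeout"): [("Service-B", "Latency Spike")],
    ("Service-B", "Latency Spike"): [("Service-E", "Latency Spike")],
    ("Service-E", "Latency Spike"): [("Service-E", "Code Deployment")],
}


def naive_service_root_cause(alert_count):
    """Pick the service with the most alerts (service-level heuristic)."""
    return max(alert_count, key=alert_count.get)


def event_root_cause_candidates(start, caused_by):
    """Follow causality links from the initial alert and return the
    reachable events that have no further recorded cause."""
    seen, queue, roots = {start}, deque([start]), []
    while queue:
        event = queue.popleft()
        causes = caused_by.get(event, [])
        if not causes:
            roots.append(event)
        for cause in causes:
            if cause not in seen:
                seen.add(cause)
                queue.append(cause)
    return roots


naive = naive_service_root_cause(ALERT_COUNT)
candidates = event_root_cause_candidates(("Checkout", "API Call Timeout"), CAUSED_BY)
print("service-level guess:", naive)
print("event-level candidates:", candidates)
```

Under these assumptions, the service-level heuristic points at Service-B because it has the most alerts, while following the event causality links from the initial Checkout alert ends at the code deployment event on Service-E.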