0% found this document useful (0 votes)
17 views

Groot an Event-graph-based Approach for Root

Uploaded by

Victor Andrade
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views

Groot an Event-graph-based Approach for Root

Uploaded by

Victor Andrade
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Groot: An Event-graph-based Approach for Root

Cause Analysis in Industrial Settings


Hanzhang Wang∗ , Zhengkai Wu† , Huai Jiang∗ , Yichao Huang∗ ,
Jiamu Wang∗ , Selcuk Kopru∗ , Tao Xie§
∗ eBay, † University
of Illinois at Urbana-Champaign, § Peking University
Email: {hanzwang,huajiang,yichhuang,jiamuwang,skopru}@ebay.com, [email protected], [email protected]

Abstract—For large-scale distributed systems, it is crucial to are detected with alerting mechanisms [3]–[5] based on mon-
efficiently diagnose the root causes of incidents to maintain itoring data such as logs [6]–[10], metrics/key performance
arXiv:2108.00344v2 [cs.SE] 22 Sep 2021

high system availability. The recent development of microservice indicators (KPIs) [11]–[15], or a combination thereof [16],
architecture brings three major challenges (i.e., complexities of
operation, system scale, and monitoring) to root cause analysis [17]. In the second step, when the alerts are triggered, RCA
(RCA) in industrial settings. To tackle these challenges, in this is performed to analyze the root cause of these alerts and
paper, we present G ROOT, an event-graph-based approach for additional events, and to propose recovery actions from the
RCA. G ROOT constructs a real-time causality graph based associated incident [6], [18], [19]. RCA needs to consider
on events that summarize various types of metrics, logs, and multiple possible interpretations of potential causes for the
activities in the system under analysis. Moreover, to incorporate
domain knowledge from site reliability engineering (SRE) engi- incident, and these different interpretations could lead to
neers, G ROOT can be customized with user-defined events and different mitigation actions to be performed. In the last step,
domain-specific rules. Currently, G ROOT supports RCA among the SRE teams perform those mitigation actions and recover
5,000 real production services and is actively used by the SRE the system.
teams in eBay, a global e-commerce system serving more than
159 million active buyers per year. Over 15 months, we collect Based on our industrial SRE experiences, we find that RCA
a data set containing labeled root causes of 952 real production is difficult in industrial practice due to three complexities,
incidents for evaluation. The evaluation results show that G ROOT particularly under microservice settings:
is able to achieve 95% top-3 accuracy and 78% top-1 accuracy. To
share our experience in deploying and adopting RCA in industrial • Operational Complexity. For large-scale systems, there
settings, we conduct a survey to show that users of G ROOT find are typically centered (aka infrastructure) SRE and do-
it helpful and easy to use. We also share the lessons learned
from deploying and adopting G ROOT to solve RCA problems in main (aka embedded) SRE engineers [20]. Their com-
production environments. munication is often ineffective or limited under the mi-
Index Terms—microservices, root cause analysis, AIOps, ob- croservice scenarios due to a more diversified tech stack,
servability granular services, and shorter life cycles than traditional
systems. The knowledge gap between the centered SRE
I. I NTRODUCTION team and the domain SRE team gets further enlarged
Since the emergence of microservice architecture [1], it has and makes RCA much more challenging. Centered SRE
been quickly adopted by many large companies such as Ama- engineers have to learn from domain SRE engineers
zon, Google, and Microsoft. Microservice architecture aims to on how the new domain changes work to update the
improve the scalability, development agility, and reusability of centralized RCA tools. Thus, adaptive and customizable
these companies’ business systems. Despite these undeniable RCA is required instead of one-size-fits-all solutions.
benefits, different levels of components in such a system • Scale Complexity. There could be thousands of ser-
can go wrong due to the fast-evolving and large-scale nature vices simultaneously running in a large microservice
of microservices architecture [1]. Even if there are minimal system, resulting in a very high number of monitoring
human-induced faults in code, the system might still be at risk signals. A real incident could cause numerous alerts to
due to anomalies in hardware, configurations, etc. Therefore, it be triggered across services. The inter-dependencies and
is critical to detect anomalies and then efficiently analyze the incident triaging between the services are proportionally
root causes of the associated incidents, subsequently helping more complicated than a traditional system [15]. To detect
the system reliability engineering (SRE) team take further root causes that may be distributed and many steps away
actions to bring the system back to normal. from an initially observed anomalous service, the RCA
In the process of recovering a system, it is critical to conduct approach must be scalable and very efficient to digest
accurate and efficient root cause analysis (RCA) [2], the high volume signals.
second one of a three-step process. In the first step, anomalies • Monitoring Complexity. A high quantity of observability
§ Tao
data types (metrics, logs, and activities) need to be mon-
Xie is also affiliated with Key Laboratory of High Confidence
Software Technologies (Peking University), Ministry of Education, China. itored, stored, and processed, such as intra-service and
Hanzhang Wang is the corresponding author. inter-service metrics. Different services in a system may
produce different types of logs or metrics with different rules that decide causality between events, we design
patterns. There are also various kinds of activities, such a grammar that allows easy and fast implementations
as code deployment or configuration changes. The RCA of domain-specific rules, narrowing the knowledge gap
tools must be able to consume such highly diversified and of the operational complexity. Third, G ROOT provides a
unstructured data and make inferences. robust and transparent ranking algorithm that can digest
diverse events, improve accuracy, and produce results
To overcome the limited effectiveness of existing ap-
interpretable by visualization.
proaches [2], [3], [14], [16], [21]–[31] (as mentioned in
Section II) in industrial settings due to the aforementioned To demonstrate the flexibility and effectiveness of G ROOT,
complexities, we propose G ROOT, an event-graph-based RCA we evaluate it on eBay’s production system that serves more
approach. In particular, G ROOT constructs an event causality than 159 million active users and features more than 5,000
graph, whose basic nodes are monitoring events such as services deployed over three data centers. We conduct experi-
performance-metric deviation events, status change events, ments on a labeled and validated data set to show that G ROOT
and developer activity events. These events carry detailed achieves 95% top-3 accuracy and 78% top-1 accuracy for 952
information to enable accurate RCA. The events and the real production incidents collected over 15 months. Further-
causalities between them are constructed using specified rules more, G ROOT is deployed in production for real-time RCA,
and heuristics (reflecting domain knowledge). In contrast to and is used daily by both centered and domain SRE teams,
the existing fully learning-based approaches [3], [10], [23], with the achievement of 73% top-1 accuracy in action. Finally,
G ROOT provides better transparency and interpretability. Such the end-to-end execution time of G ROOT for each incident in
interpretability is critical in our industrial settings because our experiments is less than 5 seconds, demonstrating the high
a graph-based approach can offer visualized reasoning with efficiency of G ROOT.
causality links to the root cause and details of every event We report our experiences and lessons learned when us-
instead of just listing the results. Besides, our approach can ing G ROOT to perform RCA in the industrial e-commerce
enable effective tracking of cases and targeted detailed im- system. We survey among the SRE users and developers of
provements, e.g., by enhancing the rules and heuristics used G ROOT, who find G ROOT easy to use and helpful during the
to construct the graph. triage stage. Meanwhile, the developers also find the G ROOT
design to be desirable to make changes and facilitate new
G ROOT has two salient advantages over existing graph-
requirements. We also share the lessons learned from adopting
based approaches:
G ROOT in production for SRE in terms of technology transfer
• Fine granularity (events as basic nodes). First, unlike and adoption.
existing graph-based approaches, which directly use ser- In summary, this paper makes four main contributions:
vices [25] or hosts (VMs) [30] as basic nodes, G ROOT • An event-graph-based approach named G ROOT for root
constructs the causality graph by using monitoring events cause analysis tackling challenges in industrial settings.
as basic nodes. Graphs based on events from the ser- • Implementation of G ROOT in an RCA framework for
vices can provide more accurate results to address the allowing the SRE teams to instill domain knowledge.
monitoring complexity. Second, for the scale complexity, • Evaluation performed in eBay’s production environ-
G ROOT can dynamically create hidden events or addi- ment with more than 5,000 services, for demonstrating
tional dependencies based on the context, such as adding G ROOT’s effectiveness and efficiency.
dependencies to the external service providers and their • Experiences and lessons learned when deploying and
issues. Third, to construct the causality graph, G ROOT applying G ROOT in production.
takes the detailed contextual information of each event
into consideration for analysis with more depth. Doing II. R ELATED W ORK
so also helps G ROOT incorporate SRE insights with the Anomaly Detection. Anomaly detection aims to detect po-
context details of each event to address the operational tential issues in the system. Anomaly detection approaches
complexity. using time series data can generally be categorized into three
• High diversity (a wide range of event types supported). types: (1) batch-processing and historical analysis such as
First, the causality graph in G ROOT supports various Surus [32]; (2) machine-learning-based, such as Donut [12];
event types such as performance metrics, status logs, and (3) usage of adaptive concept drift, such as StepWise [33].
developer activities to address the monitoring complexity. G ROOT currently uses a combination of manually written
This multi-scenario graph schema can directly boost the thresholds, statistical models, and machine learning (ML)
RCA coverage and precision. For example, G ROOT is algorithms to detect anomalies. Since our approach is event-
able to detect a specific configuration change on a service driven, as long as fairly accurate alerts are generated, G ROOT
as the root cause instead of performance anomaly symp- is able to incorporate them.
toms, thus reducing triaging efforts and time-to-recovery Root Cause Analysis. Traditional RCA approaches (e.g.,
(TTR). Second, G ROOT allows the SRE engineers to in- Adtributor [34] and HotSpot [35]) find the multi-dimensional
troduce different event types that are powered by different combination of attribute values that would lead to certain
detection strategies or from different sources. For the quality of service (QoS) anomalies. These approaches are
effective at discrete static data. Once there are continuous data TABLE I: The scale of experiments in existing RCA ap-
introduced by time series information, these approaches would proaches’ evaluations (QPS: Queries per second)
be much less effective. Approach Year Scale Validated on Real Incidents?
FChain [21] 2013 <= 10 VMs No
To tackle these difficulties, there are two categories of CauseInfer [22] 2014 20 services on 5 servers No
MicroScope [36] 2018 36 services, ∼5000 QPS No
approaches based on ML and graph, respectively. APG [30] 2018 <=20 services on 5 VMs No
Seer [10] 2019 <=50 services on 20 servers Partially
ML-based RCA. Some ML-based approaches use features MicroRCA [15] 2020 13 services, ∼600 QPS No
such as time series information [23], [30] and features ex- RCA Graph [25] 2020 <=70 services on 8 VMs No
Causality RCA [31] 2020 <=20 services No
tracted using textual and temporal information [3]. Some other
approaches [12] conduct deep learning by first constructing
the dependency graph of the system and then representing substantial in the industrial settings. Hence, we believe that
the graph in a neural network. However, these ML-based the target RCA approach should be validated at the enterprise
approaches face the challenge of lacking training data. Gan et scale and against actual incidents for effectiveness.
al. [10] proposed Seer to make use of historical tracking data. Table I lists the experimental settings and scale in ex-
Although Seer also focuses on the microservice scenario, it isting RCA approaches’ evaluations. All the listed existing
is designed to detect QoS violations while lacking support approaches are evaluated in a relatively small scenario. In
for other kinds of errors. There is also an effort to use contrast, our experiments are performed upon a system con-
unsupervised learning such as GAN [12], but it is generally taining 5,000 production services on hundreds of thousands of
hard to simulate large, complicated distributed systems to give VMs. On average, the sub-dependency graph (constructed in
meaningful data. Section IV-A) of our service-based data set is already 77.5
Graph-based RCA. A recent survey [2] on RCA approaches services, more than the total number in any of the listed
categorizes more than 20 RCA algorithms by more than evaluations. Moreover, 7 out of the 8 listed approaches are
10 theoretical models to represent the relationships between evaluated under simulative fault injection on top of existing
components in a microservice system. Nguyen et al. [21] benchmarks such as RUBiS, which cannot represent real-world
proposed FChain, which introduces time series information incidents; Seer [10] collects only the real-world results with
into the graph, but they still use server/VM as nodes in the no validations. Our data set contains 952 actual incidents
graph. Chen et al. [22] proposed CauseInfer, which constructs collected from real-world settings.
a two-layered hierarchical causality graph. It applies metrics
as nodes that indicate service-level dependency. Schoenfisch III. M OTIVATING E XAMPLES
et al. [24] proposed to use Markov Logic Network to express In this section, we demonstrate the effectiveness of event-
conditional dependencies in the first-order logic, but still build based graph and adaptive customization strategies with two
dependency on the service level. Lin et al. [36] proposed motivating examples.
Microscope, which targets the microservice scenario. It builds Figure 1 shows an abstracted real incident example with the
the graph only on service-level metrics so it cannot get full dependency graph and the corresponding causality graph con-
use of other information and lacks customization. Brandon et structed by G ROOT. The Checkout service of our e-commerce
al. [25] proposed to build the system graph using metrics, system suddenly gets an additional latency spike due to a code
logs, and anomalies, and then use pattern matching against a deployment on the Service-E. The service monitor is reporting
library to identify the root cause. However, it is difficult to API Call Timeout detected by the ML-based anomaly detection
update the system to facilitate the changing requirements. Wu system. The simplified sub-dependency graph consisting of 6
et al. [15] proposed MicroRCA, which models both services services is shown in Figure 1a. The initial alert is triggered
and machines in the graph and tracks the propagation among on the Checkout (entrance) service. The other nodes Service-*
them. It would be hard to extend the graph from machines to are the internal services that the Checkout service directly or
the concept of other resources such as databases in our paper. indirectly depends on. The color of the nodes in Figure 1a
As mentioned in Section I, by using the event graph, indicates the severity/count of anomalies (alerts) reported on
G ROOT mainly overcomes the limitations of existing graph- each service. We can see that Service-B is the most severe
based approaches in two aspects: (1) build a more accurate one as there are two related alerts on it. The traditional graph-
and precise causality graph use the event-graph-based model; based approach [25], [30] usually takes into account only the
(2) allow adaptive customization of link construction rules to graph between services in addition to the severity information
incorporate domain knowledge in order to facilitate the rapid on each service. If the traditional approach got applied on
requirement changes in the microservice scenario. Figure 1a, either Service-B, Service-D, or Service-E could be
Our G ROOT approach uses a customized page rank algo- a potential root cause, and Service-B would have the highest
rithm in the event ranking, and can also be seen as an unsu- possibility since it has two related alerts. Such results are not
pervised ML approach. Therefore, G ROOT is complementary useful to the SRE teams.
to other ML approaches as long as they can accept our event G ROOT constructs the event-based causality graph as shown
causality graph as a feature. in Figure 1b. The events in each service are used as the nodes
Settings and Scale. The challenges of operational, scale, here. We can see that the API Call Timeout issue in Checkout
and monitoring complexities are observed, especially being is possibly caused by API Call Timeout in Service-A, which is
(a) Dependency graph (b) Causality graph
Fig. 1: Motivating example of event causality graph

Fig. 2: Example of event type addition

Fig. 3: Example of event and rule update


further caused by Latency Spike in DataCenter-A of Service-
C. G ROOT further tracks back to find that it is likely caused by
Latency Spike in Service-E, which happens in the same data
center. Finally G ROOT figures out that the most probable root
cause is a recent Code Deployment event in Service-E. The
SRE teams then could quickly locate the root cause and roll
back this code deployment, followed by further investigations.
There are no casual links between events in Service-B and
Service-A, since no causal rules are matched. The API Call
Timeout event is less likely to depend on the event type High
CPU and High GC. Therefore, the inference can eliminate
Service-B from possible root causes. This elimination shows
the benefit of the event-based graph. Note that there is another
event Latency Spike in Service-D, but not connected to Latency Fig. 4: Workflow of G ROOT
Spike in Service-C in the causality graph. The reason is that
the Latency Spike event in Service-C happens in DataCenter-
A, not DataCenter-B.
Figures 2 and 3 show how SRE engineers can easily change direction, target event, and target service (self, upstream, and
G ROOT to adapt to new requirements, by updating the events downstream dependency).
and rules. In Figure 2, SRE engineers want to add a new Figure 3 shows a real-world example of how G ROOT is able
type of deployment activity, ML Model Deployment. Usually, to incorporate SRE insights and knowledge. More specifically,
the SRE engineers first need to select the anomaly detection SRE engineers would like to change the rules to allow G ROOT
model or set their own alerts and provide alert/activity data to distinguish the latency spikes from different data centers.
sources for the stored events. In this example, the event can be As an example in Figure 1b, Latency Spike events propagate
directly fetched from the ML model management system. Then only within the same data center. During G ROOT development,
G ROOT also requires related properties (e.g., the detection SRE engineers could easily add new property DataCenter to
time range) to be set for the new event type. Lastly, the the Latency Spike event. Then they add the corresponding
SRE engineers add the rules for building the causal links “conditional” rules to be differentiated with the “basic” rules
between the new event type and existing ones. The blue in Figure 3. In conditional rules, links are constructed only
box in Figure 2 shows the rule, which denotes the edge when the specified conditions are satisfied.
IV. A PPROACH TABLE II: List of example event types used in G ROOT
Type Event Type Detection Technique
Figure 4 shows the overall workflow of G ROOT. The High GC (Overhead) Rule-based
triggers for using G ROOT are usually alert(s) from automated High CPU Usage Rule-based
Latency Spike Statistical Model
anomaly detection, or sometimes an SRE engineer’s suspi- Performance Metrics
TPS Spike Statistical Model
Database Anomaly ML Model
cion. There are three major steps: constructing the service Business Metric Anomaly ML Model
dependency graph, constructing the event causality graph, and WebAPI Error Statistical Model
Internal Error Statistical Model
Status Logs
root cause ranking. The outputs are the root causes ranked by ServiceClient Error Statistical Model
Bad Host ML Model
the likelihood. To support fast human investigation experience, Code Deployment De Facto
we build an interactive UI as shown in Figure 8: the service Developer Activities Configuration Change De Facto
Execute URL De Facto
dependency, events with causal links and additional details
such as raw metrics or the developer contact (of a code
deployment event) are presented to the user for next steps. There are three major categories of events: performance
As an offline part of human investigation, we label/collect a metrics, status logs, and developer activities:
data set, perform validation, and summarize the knowledge for
• Performance metrics represent an anomaly of monitored
further improvement on all incidents on a daily basis.
time series metrics. For example, high CPU usage in-
A. Constructing Service Dependency Graph dicates that the service is causing high CPU usage on
The construction of the service dependency graph starts a certain machine. In this category, most events are
with the initial alerted or suspicious service(s), denoted as I. continuously and passively detected and stored.
• Status logs are caused by abnormal system status, such
For example, in Figure 1a, I = {Checkout}. I can contain
multiple services based on the range of the trigger alerts or as spike of HTTP error code metrics while accessing
suspicions. We maintain domain service lists where domain- other services’ endpoints. Different types of error metrics
level alerts can be triggered because there is no clear service- are important and supported in G ROOT, including third-
level indication. party APIs. For example, Bad Host indicates abnormal
At the back end, G ROOT maintains a global service depen- patterns on some machines running the service, and can
dency graph Gglobal via distributed tracing and log analysis. be detected by a clustering-based ML approach.
• Developer activities are the events generated when a
The directed edge from nodes A to B (two services or system
components) in the dependency graph indicates a service certain activity of developers is triggered, such as code
invocation or other forms of dependency. In Figure 1a, the deployment and config change.
black arrows indicate such edges. Bi-directional edges and In Groot, there are more than a dozen event types such
cycles between the services can be possible and exist. In this as Latency Spike as listed in the column 2 of Table II.
work, the global dependency graph is updated daily. Each event type is characterized by three aspects: N ame
The service dependency (sub)graph G is constructed using indicates the name of this event type; LookbackP eriod in-
Gglobal and I. An extended service list L is first constructed dicates the time range to look back (from the time when
by traversing each service in I over Gglobal for a radius range the use of G ROOT is triggered) for collecting events of
r. Each service u ∈ L can be traversed by at least one this event type; P ropertyT ype indicates the types of the
service v ∈ I within r steps: L = {u|∃v ∈ I, dist(u, v) ≤ properties that an event of this event type should hold.
r or dist(v, u) ≤ r}. Then, the service dependency subgraph P ropertT ype is characterized by a vector of pairs, each of
G is constructed by the nodes in L and the edges between which indicates the string type for a property’s name and
them in Gglobal . In our current implementation, r is set to 2, the primitive type for the property’s value such as string,
since this dependency graph may be dynamically extended in integer, and float. Formally, an event type is defined as a tuple:
the next steps based on events’ detail for longer issue chains ET =< N ame, LookbackP eriod, P ropertyT ype > where
or additional dependencies. P ropertyT ype =< (string, type1 ), ..., (string, typen ) > (n
is the number of properties that an event of this event type
B. Constructing Event Causality Graph holds).
In the second step, G ROOT collects all supported events Each event of a certain event type ET is characterized by
for each service in G and constructs the causal links between four aspects: Service indicates the service name that the event
events. belongs to; Type indicates ET ’s Name; StartTime indicates
1) Collecting Events: Table II presents some example event the time when the event happens; Properties indicates the
types and detection techniques for G ROOT’s production im- properties that the event holds. Formally, an event is defined
plementation. For detection techniques, “De Facto” indicates as a tuple: e =< Service, T ype, StartT ime, P roperties >
that the event can be directly collected via a specific API or where P roperties is an instantiation of ET ’s P ropertyT ype.
storage. The detection either runs passively in the back end For example, in Figure 1, the generated event for
to reduce delay and improve accuracy, or runs actively for Latency Spike in DataCenter-A in Service-C would be
only the services within the dependency graph range to save < “Service-C00 , “Latency Spike00 , 2021/08/01-12:36:04, <
resources. (“DataCenter00 , “DC-100 ), ... >>.
Condition and build only the causal link when the predicate
is true. A conditional rule overwrites the basic rule on the
same source-target event pair.
When constructing causal links, G ROOT first applies the
dynamic rules so that dynamic dependencies and events are
first created at once. Then for every event in the initial services
Fig. 5: Example of dynamic rule (denoted as I), if the rule conditions are satisfied, one or many
causal links are created from this event to other events from the
same or upstream/downstream services. When a causal link is
2) Constructing Causal Link: After collecting all events on created, the step is repeated recursively for the target event (as
all services in G, in this step, causal links between these events a new origin) to create new causal links. After no new causal
are constructed for RCA ranking. The causal links (red arrows) links are created, the construction of the event causality graph
in Figure 1b are such examples. A causal link represents that is finished.
the source event can possibly be caused by the target event.
C. Root Cause Ranking
SRE knowledge is engineered into rules and used to create
causal links between the pairs of events. Finally, G ROOT ranks and recommends the most probable
A rule for constructing a causal root causes from the event causality graph. Similar to how
link is defined as a tuple: Rule =< search engines infer the importance of pages by page links,
T arget-T ype, Source-Events, T arget-Events, Direction, we customize the PageRank [37] algorithm to calculate the
T arget-Service, Condition > (Condition can be optionally root cause ranking; the customized algorithm is named as
specified). T arget-T ype indicates the type of the rule, being GrootRank. The input is the event causality graph from the
either Static or Dynamic (explained further later). previous step. Each edge is associated with a weighted score
Source-Events indicates the type of the causal link’s source for weighted propagation. The default value is set as 1, and is
event (Source-Events are listed in the names of the rules set lower for alerts with high false-positive rates.
shown in Figures 2, 3 and 5). T arget-Events indicates the Based on the observation that dangling nodes are more
type of the causal link’s target event. Direction indicates likely to be the root cause, we customize the personalization
the direction of the casual link between the target event and vector as Pn = fn or Pd = 1, where Pd is the personalization
source event. T arget-Service indicates the service that the score for dangling nodes, and Pn is for the remaining nodes;
target event should belong to. Note that T arget-Service in and fn is a value smaller than 1 to enhance the propagation
Static rules can be Self , which indicates that the target between dangling nodes. In our work, the parameter setting is
event would be within the same service as the source event, fn = 0.5, α = 0.85, maxiter = 100 (which are parameters for
or Outgoing/Incoming, which indicates that the target event the PageRank algorithm). Figure 6 illustrates an example. The
would belong to the downstream/upstream services of the grey circles are the events collected from three services and
service that the source event belongs to in G. one database. The grey arrows are the dependency links and
There are two categories of special rules. The first cat- the red ones are the causal links with the weight of 1. Both
egory is dynamic rules (i.e., rules whose T arget-T ype is of the PageRank and GrootRank algorithms detect event5
set to Dynamic) to support dynamic dependencies. Here (DB issue) as the root cause, which is expected and correct.
T arget-Service does not indicate any of the three possible However, the PageRank algorithm ranks event4 higher than
options listed earlier but indicates the name of the target event3. But event3 of Service-C is more likely to be the
service that G ROOT would need to create. For example, live second most possible root cause (besides event5), because the
DB dependencies are not available due to different tech stacks scores on dangling nodes are propagated to all others equally
and high volume. In Figure 5, a DB issue (DB Markdown) is in each iteration. We can see that event3 is correctly ranked
shown in Service-A. Based on the listed dynamic rule, G ROOT as second using the GrootRank algorithm.
creates a new “service” DB-1 in G, a new event “Issues” that The second step of GrootRank is to break the tied results
belongs to DB-1, and a causal link between the two events. from the previous step. The tied results are due to the fact that
In practice, the SRE teams use dynamic rules to cover a the event graph can contain multiple disconnected sub-graphs
lot of third-party services and database issues since the live with the same shape. We design two techniques to untie the
dependencies are not easy to maintain. ranking:
The second category of special rules is conditional rules. 1) For each joint event, the access distance (sum) is calcu-
Conditional rules are used when some prerequisite conditions lated from the initial anomaly service(s) to the service
should be satisfied before a certain causal link is created. In where the event belongs to. If any “access” is not
these rules, Condition is specified with a boolean predicate. reachable, the distance is set as dm + 1 where dm is
As shown in Figure 3, the SRE teams believe Latency Spike the maximum possible distance. The one with shorter
events from different services are related only when both access distance (sum) would be ranked higher and vice
events happen within the same data center. Based on this versa. Figure 7 presents an example, where Service-A
observation, G ROOT would first evaluate the predicate in and Service-B are both initial anomaly services. Since
there are no conflicts, G ROOT just adds/updates edges between
the event types.
After all changes, G ROOT extracts the rules from the graphs
by converting each edge to a single rule. These rules are
automatically implemented, and then tested against our labeled
data set. The G ROOT users need to review the changes with
validation reports before the changes go online.
Fig. 6: Example of personalization vector customization V. E VALUATION
We evaluate G ROOT in two aspects: (1) effectiveness (ac-
curacy), which assesses how accurate G ROOT is in detecting
and ranking root causes, and (2) efficiency, which assesses how
long it takes for G ROOT to derive root causes and conduct end-
to-end analysis in action. Particularly, we intend to address the
following research questions:
• RQ1. What are the accuracy and efficiency of G ROOT
when applied on the collected dataset?
Fig. 7: Example of using access distance to untie the ranking
• RQ2. How does G ROOT compare with baseline ap-
results
proaches in terms of accuracy?
• RQ3. What are the accuracy and efficiency of G ROOT in
an end-to-end scenario?
G ROOT suspects that event2 is caused by either event3
or event1 with the same weight. The scores of event3 A. Evaluation Setup
and event1 are tied. Then, event3 has a score of 1 (i.e., To evaluate G ROOT in a real-world scenario, we deploy and
0 + 1) and event1 has a score of 2 (i.e., 0 + 2), since apply G ROOT in eBay’s e-commerce system that serves more
it is not reachable by Service-B). Therefore, event3 is than 159 million active buyers. In particular, we apply G ROOT
ranked first and logical. upon a microservice ecosystem that contains over 5,000 ser-
2) For the remaining joint results with the same access dis- vices on three data centers. These services are built on different
tances, G ROOT continues to untie by using the historical tech stacks with different programming languages, including
root cause frequency of the event types under the same Java, Python, Node.js, etc. Furthermore, these services interact
trigger conditions (e.g., checkout domain alerts). This with each other by using different types of service protocols,
frequency information is generated from the manually including HTTP, gRPC, and Message Queue. The distributed
labeled dataset. A more frequently occurred root cause tracing of the ecosystem generates 147B traces on average per
type is ranked higher. day.
1) Data Set: The SRE teams at eBay help collect a labeled
D. Rule Customization Management data set containing 952 incidents over 15 months (Jan 2020 -
While G ROOT users create or update the rules, there could Apr 2021). Each incident data contains the input required by
be overlaps, inconsistencies, or even conflicts being introduced G ROOT (e.g., dependency snapshot and events with details)
such as the example in Figure 3. G ROOT uses two graphs to and the root cause manually labeled by the SRE teams. These
manage the rule relationships and avoid conflicts for users. incidents are grouped into two categories:
One graph is to represent the link rules between events in • Business domain incidents. These incidents are detected
the same service (Same-Graph) while the other is to represent mainly due to their business impact. For example, end
links between different services (Diff-Graph). The nodes in users encounter failed interactions, and business or cus-
these two graphs are the event types defined in Section IV-B. tomer experience is impacted, similar to the example in
There are three statuses between each (directional) pair of Figure 1.
event types: (1) no rule, (2) only basic rule, and (3) conditional • Service-based incidents. These incidents are detected
rule (since it overwrites the basic rule). In Same-Graph, mainly due to their impact on the service level, similar
G ROOT does not allow self-loop as it does not build links to the example in Figure 5.
between an event and itself. An internal incident may get detected early, and then
When rule change happens, existing rules are enumerated likely get categorized as a service-based incident or even
to build edges in Same-Graph and Diff-Graph based on solved directly by owners without records. On the other
T arget-Events and T arget-Service. Based on the users’ hand, infrastructure-level issues or issues of external service
operation of (1) “remove a rule”, G ROOT removes the corre- providers (e.g., checkout and shipping services) may not get
sponding edge on the graphs; (2) “add/update a rule”, G ROOT detected until business impact is caused.
checks whether there are existing edges between the given There are 782 business domain incidents and 170 service-
event types, and then warns the users for possible overwrites. If based incidents in the data set. For each incident, the root cause
TABLE III: Accuracy of RCA by G ROOT and baselines TABLE IV: Comparison of G ROOT results on the dataset and
G ROOT Naive Non-adaptive end-to-end scenario
Top 3 Top 1 Top 3 Top 1 Top 3 Top 1
Service-based Business Domain
Service-based 92% 74% 25% 16% 84% 62%
Dataset End-to-End Dataset End-to-End
Business domain 96% 81% 2% 1% 28% 26%
Top-1 Accuracy 74% 73% 81% 73%
Combined 95% 78% 6% 3% 38% 33%
Top-3 Accuracy 92% 91% 96% 87%
Average Runtime Cost 1.06s 3.16s 0.98s 2.98s
is manually labeled, validated, and collected by the SRE teams, Maximum Runtime Cost 1.69s 4.56s 1.14s 3.61s
who handle the site incidents everyday. For a case with mul-
tiple interacting causes, only the most actionable/influential G ROOT ranks the root cause after top 1, the labeled root cause
event is labelled as the root cause for the case. These actual is just one of the multiple co-existing root causes. But for
root causes and incident contexts serve as the ground truth in fairness, the SRE teams label only a single root cause in each
our evaluation. case. According to the feedback from the SRE teams, G ROOT
2) G ROOT Setup: The G ROOT production system is de- still facilitates the RCA process for these cases.
ployed as three microservices and federated in three data Our results show that the runtime cost of applying G ROOT
centers with nine 8-core CPUs, 20GB RAM pods each on is relatively low. For a service-based incident, the average
Kubernetes. runtime cost of G ROOT is 1.06s while the maximum is 1.69s.
3) Baseline Approaches: In order to compare G ROOT with For a business domain incident, the average runtime cost is
other related approaches, we design and implement two base- 0.98s while the maximum is 1.14s.
line approaches for the evaluation: 2) RQ2: We additionally apply the baseline approaches
on the data set. Table III also shows the evaluation results.
• Naive Approach. This approach directly uses the con-
The results show that the accuracy of G ROOT is substantially
structed service dependency graph (Section IV-A). The higher than that of the baseline approaches. In terms of the
events are assigned a score by the severeness of the top-1 accuracy, G ROOT achieves 78% compared with 3% and
associated anomaly. Then a normalized score for each 33% of the naive and non-adaptive approaches, respectively. In
service is calculated summarizing all the events related terms of the top-3 accuracy, G ROOT achieves 95% compared
to the service. Lastly, the PageRank algorithm is used to with 6% and 38% of the naive and non-adaptive approaches,
calculate the root cause ranking. respectively.
• Non-adaptive Approach. This approach is not context-
The naive approach performs worst in all settings, because
aware. It replaces all special rules (i.e., conditional and it blindly propagates the score at service levels. The accuracy
dynamic ones) with their basic rule versions. Its other of the non-adaptive approach is much worse for business
parts are identical to G ROOT. domain incidents. The reason is that for a business domain
The non-adaptive approach can be seen as a baseline for incident, it often takes a longer propagation path since the
reflecting a group of graph-based approaches (e.g., Cau- incident is triggered by a group of services, and new dynamic
seInfer [22] and Microscope [36]). These approaches also dependencies may be introduced during the event collection,
specify certain service-level metrics but lack the context-aware causing more inaccuracy for the non-adaptive approach. There
capabilities of G ROOT. Because the tools for these approaches can be many non-critical or irrelevant error events in an actual
are not publicly available, we implement the non-adaptive production scenario, aka “soft” errors. We suspect that these
approach to approximate these approaches. non-critical or irrelevant events may be ranked higher by the
non-adaptive approach since they are similar to injected faults
B. Evaluation Results
and hard to be distinguished from the actual ones. G ROOT
1) RQ1: Table III shows the results of applying G ROOT uses dynamic and conditional rules to discover the actual
on the collected data set. We measure both top-1 and top-3 causal links, building fewer links related to such non-critical
accuracy. The top-1 and top-3 accuracy is calculated as the or irrelevant events for leading to higher accuracy.
percentage of cases where their ground-truth root cause is 3) RQ3: To evaluate G ROOT under an end-to-end scenario,
ranked within top 1 and top 3, respectively, in G ROOT’s results. we apply G ROOT upon actual incidents in action. Table IV
G ROOT achieves high accuracy on both incident categories. shows the results. The accuracy has a decrease of up to
For example, for business domain incidents, G ROOT achieves 9 percentage points in the end-to-end scenario, with some
96% top-3 accuracy. failures caused by production issues such as missing data
The unsuccessful cases that G ROOT ranks the root cause and service/storage failures. In addition, the runtime cost is
after top 3 are mostly caused by missing event(s). More than increased by up to nearly 3 seconds due to the time spent on
one-third of these unsuccessful cases have been addressed by fetching data from different data sources, e.g., querying the
adding necessary events and corresponding rules over time. events for a certain time period.
For example, initially, we had only an event type of general
error spike, which mixes different categories of errors and thus VI. E XPERIENCE
causes high false-positive rate. We then have designed different G ROOT currently supports daily SRE work. Figure 8 shows
event types for each category of the error metrics (including a live G ROOT’s “bird’s eye view” UI on an actual simple
various internal and client API errors). In many cases that checkout incident. Service C has the root cause (ErrorSpike)
Question 1 Question 1

Question 2 Question 2

Question 3 Question 3

Question 4 Question 4

Question 5 Question 5

0 1 2 3 4 0 1 2 3 4

(a) From 14 G ROOT users (b) From 6 G ROOT developers


Fig. 9: Survey results
time? (Detection and remediation time not included)
Fig. 8: G ROOT UI in production Answer choices: Lots Of Time Saved(4), Some Time
Saved(3), No Time Saved(2), Waste Time Instead(1).
and belongs to an external provider. Although the domain • Question 3. Based on your estimation, how much triage
service A also carries an error spike and gets impacted, G ROOT time G ROOT would save on average when it correctly
correctly ignores the irrelevant deployment event, which has locates the root cause? (Detection and remediation time
no critical impact. The events on C are virtually created based not included) Answer choices: More than 50%(4), 25-
on the dynamic rule. Note that all causal links (yellow) in the 50%(3), 10-25%(2), 0-10%(1), N/A(0).
UI indicate “is cause of”, being the opposite of “is caused by” • Question 4. When G ROOT correctly locates the root
as described in Section IV-B to provide more intuitive UI for cause, do you find that the result “graph” provided by
users to navigate through. G ROOT visualizes the dependency G ROOT helps you understand how and why the incident
and event causality graph with extra information such as an happens?
error message. The SRE teams can quickly comprehend the • Question 5. When G ROOT does not correctly locate the
incident context and derived root cause to investigate G ROOT root cause, does the result “graph” make it easier for your
further. A mouseover can trigger “event enrichment” based on investigation of the root cause?
the event type to present details such as raw metrics and other Figure 9a shows the results of the user survey. We can see
additional information. that most users find G ROOT very useful to locate the root
We next share two major kinds of experience: cause. The average score for Question 1 is 3.79, and 11 out
• Feedback from G ROOT users and developers, reflect- of 14 participants find G ROOT very helpful. As for Question
ing the general experience of two groups: (1) domain SRE 3, G ROOT saves the triage time by 25-50%. Even in cases
teams who use G ROOT to find the root cause, and (2) a that G ROOT cannot correctly locate the root cause, it is still
centered SRE team who maintains G ROOT to facilitate helpful to provide information for further investigation with
new requirements. an average score of 3.43 in Question 5.
• Lessons learned, representing the lessons learned from For the developer survey, we ask the 6 developers the
deploying and adopting G ROOT in production for the real- following 5 questions (Questions 2-5 have the same choices
world RCA process. as Question 1):
• Question 1. Overall, how convenient is it to change and
A. Feedback from G ROOT Users and Developers customize events/rules/domains while using G ROOT? An-
We invite the SRE members who use G ROOT for RCA in swer choices: Convenient(4), Somewhat Convenient(3),
their daily work to the user survey. We call them users in this Not Convenient(2), Difficult(1).
section. We also invite different SRE members responsible for • Question 2. How convenient is it to change/customize
maintaining G ROOT to the developer survey. We call them event models while using G ROOT?
developers in this section. In total, there are 14 users and 6 • Question 3. How convenient is it to add new domains
developers1 who respond to the surveys. while using G ROOT?
For the user survey, we ask 14 users the following 5 • Question 4. How convenient is it to change/customize
questions (Questions 4-5 have the same choices as Question causality rules while using G ROOT?
1): • Question 5. How convenient is it to change/customize

• Question 1. When G ROOT correctly locates the root


G ROOT compared to other SRE tools?
cause, how does it help with your triaging experience? Figure 9b shows the results of the developer survey. Overall,
Answer choices: Helpful(4), Somewhat Helpful(3), Not most developers find it convenient to make changes on and
Helpful(2), Misleading(1). customize events/rules/domains in G ROOT.
• Question 2. When G ROOT correctly locates the root B. Lessons learned
cause, how does it save/extend your or the team’s triaging
In this section, we share the lessons learned in terms of tech-
1 The G ROOT researchers and developers who are authors of this paper are nology transfer and adoption on using G ROOT in production
excluded. environments.
Embedded in Practice. To build a successful RCA tool in it is not possible to scale each model (e.g., deep-learning-
practice, it is important to embed the R&D efforts in the live based models), nor have it deeply customized for every metric
environment with SRE experts and users. We have a 30-minute at every level. Therefore, while selecting models, we must
routine meeting daily with an SRE team to manually test and make careful trade-off in aspects such as accuracy, scalability,
review every site incident. In addition, we actively reach out efficiency, effort, and robustness. In general, we first set
to the end users for feedback. For example, the users found different “acceptance” levels by analyzing each event’s impact
our initial UI hard to understand. Based on their suggestions, and frequency, and then test different models in staging and
we have introduced alert enrichment with the detailed context pick the one that is good enough. For example, a few alerts
of most events, raw metrics, and links to other tools for the such as “high thread usage” are defined by thresholds and work
next steps. We also make the UI interactive and build user just fine even without a model. Some alerts such as “service
guides, training videos, and sections. As a result, G ROOT has client error” are more stochastic and require coverage on every
become increasingly practical and well adopted in practice. We metric of every service, and thus we select fast and robust
believe that R&D work on observability should be incubated statistical models and actively conduct detection on the fly.
and grown within daily SRE environments. It is also vital Phased Incorporation of ML. In the current industrial
to bring developers with rich RCA experience into the R&D settings, ML-powered RCA products still require effective
team. knowledge engineering. Due to the higher complexity and
Vertical Enhancements. High-confidence and automated ver- lower “signal to noise ratio” of real production incidents, many
tical enhancements can empower great experiences. G ROOT is existing approaches cannot be applied in practice. We be-
enhanced and specialized in critical scenarios such as grouped lieve that the knowledge engineering capabilities can facilitate
related alerts across services or critical business domain issues, adoption of technologies such as AIOps. Therefore, G ROOT
and large-scale scenarios such as infrastructure changes or is designed to be highly customizable and easy to infuse SRE
database issues. Furthermore, the end-to-end automation is knowledge and to achieve high effectiveness and efficiency.
also built for integration and efficiency with anomaly detec- Moreover, a multi-scenario RCA tool requires various and
tion, RCA, and notification. For notification, domain business interpretable events from different detection strategies. Auto-
anomalies and diagnostic results are sent through communi- ML-based anomaly detection or unsupervised RCA for large
cation apps (e.g., slack and email) for better reachability and service ecosystems is not yet ready in such context. As for
experience. Within 18 months of R&D, G ROOT now supports the path of supervised learning, the training data is tricky to
18 business domains and sub-domains of the company. On label and vulnerable to potential cognitive bias. Lastly, the
average, G ROOT UI supports more than 50 active internal end users often require complete understanding to fully adopt
users, and the service sends thousands of results every month. new solutions, because there is no guarantee of correctness.
Most of these usages are around the vertical enhancements. Many recent ML algorithms (e.g., ensemble and deep learning)
Data and Tool Reliability. Reliability is critical to G ROOT lack interpretability. Via the knowledge engineering and graph
itself and requires a lot of attention and effort. For example, if capabilities, G ROOT is able to explain diversity and causality
a critical event is missing, G ROOT may infer a totally different between ML-model-driven and other types of events. Moving
root cause, which would mislead users. We estimate the alert forward, we are building a white-box deep learning approach
accuracy to be greater than 0.6 in order to be useful. Recall is with causal graph algorithms where the causal link weights
even more important since G ROOT can effectively eliminate are parameters and derivable.
false positive alerts based on the casual ranking. Since there
are hundreds of different metrics supported in G ROOT, we VII. C ONCLUSION
spend time to ensure a robust back end by adding partial
and dynamic retry logic and high-efficiency cache. G ROOT’s In this paper, we have presented our work around root
unsuccessful cases can be caused by imperfect data, flawed cause analysis (RCA) in industrial settings. To tackle three
algorithms, or simply code defects. To better trace the reason major RCA challenges (complexities of operation, system
behind each unsuccessful case, we add a tracing component. scale, and monitoring), we have proposed a novel event-graph-
Every G ROOT request can be traced back to atomic actions based approach named G ROOT that constructs a real-time
such as retrieving data, data cleaning, and anomaly detection causality graph for allowing adaptive customization. G ROOT
via algorithms. can handle diversified anomalies and activities from the system
Trade-off among Models. The accuracy and scalability trade- under analysis and is extensible to different approaches of
off among anomaly detection models should be carefully anomaly detection or RCA. We have integrated G ROOT into
considered and tested. In general, some algorithms such as eBay’s large-scale distributed system containing more than
deep-learning-based or ensemble models are more adaptive 5,000 microservices. Our evaluation of G ROOT on a data
and accurate than typical ones such as traditional ML or set consisting of 952 real production incidents shows that
statistical models. However, the former requires more com- G ROOT achieves high accuracy and efficiency across different
putation resources, operational efforts, and additional system scenarios and also largely outperforms baseline graph-based
complexities such as training or model fine-tuning. Due to the approaches. We also share the lessons learned from deploying
actual complexities and fast-evolving nature of our context, and adopting G ROOT in production environments.
R EFERENCES [19] G. Da Cunha Rodrigues, R. N. Calheiros, V. T. Guimaraes, G. L. d.
Santos, M. B. De Carvalho, L. Z. Granville, L. M. R. Tarouco, and
[1] A. Balalaie, A. Heydarnoori, and P. Jamshidi, “Microservices architec- R. Buyya, “Monitoring of cloud computing environments: Concepts,
ture enables DevOps: Migration to a cloud-native architecture,” IEEE solutions, trends, and future directions,” in Proceedings of the 31st
Software, vol. 33, no. 3, pp. 42–52, 2016. Annual ACM Symposium on Applied Computing. ACM, 2016, pp.
[2] M. Solé, V. Muntés-Mulero, A. I. Rana, and G. Estrada, “Survey 378–383.
on models and techniques for root-cause analysis,” arXiv preprint [20] How SRE teams are organized, and how to get started, Ac-
arXiv:1701.08546, 2017. cessed: 2020-12-10, https://cloud.google.com/blog/products/devops-sre/
[3] N. Zhao, P. Jin, L. Wang, X. Yang, R. Liu, W. Zhang, K. Sui, and how-sre-teams-are-organized-and-how-to-get-started/.
D. Pei, “Automatically and adaptively identifying severe alerts for online [21] H. Nguyen, Z. Shen, Y. Tan, and X. Gu, “Fchain: Toward black-box
service systems,” in Proceedings of 2020 IEEE Conference on Computer online fault localization for cloud systems,” in Proceedings of 2013
Communications. IEEE, 2020, pp. 2420–2429. IEEE 33rd International Conference on Distributed Computing Systems.
IEEE, 2013, pp. 21–30.
[4] J. Xu, Y. Wang, P. Chen, and P. Wang, “Lightweight and adaptive service
[22] P. Chen, Y. Qi, P. Zheng, and D. Hou, “Causeinfer: Automatic and
api performance monitoring in highly dynamic cloud environment,”
distributed performance diagnosis with hierarchical causality graph in
in Proceedings of 2017 IEEE International Conference on Services
large distributed systems,” in Proceedings of 2014 IEEE Conference on
Computing. IEEE, 2017, pp. 35–43.
Computer Communications. IEEE, 2014, pp. 1887–1895.
[5] L. Tang, T. Li, F. Pinel, L. Shwartz, and G. Grabarnik, “Optimizing sys-
[23] M. Ma, Z. Yin, S. Zhang, S. Wang, C. Zheng, X. Jiang, H. Hu, C. Luo,
tem monitoring configurations for non-actionable alerts,” in Proceedings
Y. Li, N. Qiu et al., “Diagnosing root causes of intermittent slow
of 2012 IEEE Network Operations and Management Symposium. IEEE,
queries in cloud databases,” Proceedings of the Very Large Data Base
2012, pp. 34–42.
Endowment, vol. 13, no. 8, pp. 1176–1189, 2020.
[6] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthi- [24] J. Schoenfisch, C. Meilicke, J. von Stülpnagel, J. Ortmann, and H. Stuck-
tacharoen, “Performance debugging for distributed systems of black enschmidt, “Root cause analysis in it infrastructures using ontologies and
boxes,” ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. abduction in markov logic networks,” Information Systems, vol. 74, pp.
74–89, 2003. 103–116, 2018.
[7] H. Zawawy, K. Kontogiannis, and J. Mylopoulos, “Log filtering and [25] Á. Brandón, M. Solé, A. Huélamo, D. Solans, M. S. Pérez, and
interpretation for root cause analysis,” in Proceedings of 2010 IEEE V. Muntés-Mulero, “Graph-based root cause analysis for service-oriented
International Conference on Software Maintenance. IEEE, 2010, pp. and microservice architectures,” Journal of Systems and Software, vol.
1–5. 159, p. 110432, 2020.
[8] V. Nair, A. Raul, S. Khanduja, V. Bahirwani, Q. Shao, S. Sellamanickam, [26] D. Y. Yoon, N. Niu, and B. Mozafari, “Dbsherlock: A performance
S. Keerthi, S. Herbert, and S. Dhulipalla, “Learning a hierarchical diagnostic tool for transactional databases,” in Proceedings of the 2016
monitoring system for detecting and diagnosing service issues,” in International Conference on Management of Data. ACM, 2016, pp.
Proceedings of the 21th ACM SIGKDD International Conference on 1599–1614.
Knowledge Discovery and Data Mining. ACM, 2015, pp. 2029–2038. [27] V. Jeyakumar, O. Madani, A. Parandeh, A. Kulshreshtha, W. Zeng, and
[9] S. Lu, B. Rao, X. Wei, B. Tak, L. Wang, and L. Wang, “Log- N. Yadav, “Explainit!–a declarative root-cause analysis engine for time
based abnormal task detection and root cause analysis for Spark,” in series data,” in Proceedings of the 2019 International Conference on
Proceedings of 2017 IEEE International Conference on Web Services. Management of Data. ACM, 2019, pp. 333–348.
IEEE, 2017, pp. 389–396. [28] H. Jayathilaka, C. Krintz, and R. Wolski, “Performance monitoring and
[10] Y. Gan, Y. Zhang, K. Hu, D. Cheng, Y. He, M. Pancholi, and C. Delim- root cause analysis for cloud-hosted web applications,” in Proceedings
itrou, “Seer: Leveraging big data to navigate the complexity of perfor- of the 26th International Conference on World Wide Web. ACM, 2017,
mance debugging in cloud microservices,” in Proceedings of the 24th pp. 469–478.
International Conference on Architectural Support for Programming [29] M. A. Marvasti, A. V. Poghosyan, A. N. Harutyunyan, and N. M.
Languages and Operating Systems. ACM, 2019, pp. 19–33. Grigoryan, “An anomaly event correlation engine: Identifying root
[11] J. Mace, R. Roelke, and R. Fonseca, “Pivot tracing: Dynamic causal causes, bottlenecks, and black swans in IT environments,” VMware
monitoring for distributed systems,” in Proceedings of the 25th ACM Technical Journal, vol. 2, no. 1, pp. 35–45, 2013.
Symposium on Operating Systems Principles. ACM, 2015, pp. 378– [30] J. Weng, J. H. Wang, J. Yang, and Y. Yang, “Root cause analysis of
393. anomalies of multitier services in public clouds,” IEEE/ACM Transac-
[12] H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei, tions on Networking, vol. 26, no. 4, pp. 1646–1659, 2018.
Y. Feng et al., “Unsupervised anomaly detection via variational auto- [31] J. Qiu, Q. Du, K. Yin, S.-L. Zhang, and C. Qian, “A causality mining and
encoder for seasonal KPIs in web applications,” in Proceedings of the knowledge graph based method of root cause diagnosis for performance
2018 World Wide Web Conference. ACM, 2018, pp. 187–196. anomaly in cloud applications,” Applied Sciences, vol. 10, no. 6, p. 2166,
[13] M. Ma, W. Lin, D. Pan, and P. Wang, “Ms-rank: Multi-metric and 2020.
self-adaptive root cause diagnosis for microservice applications,” in [32] Surus, Accessed: 2020-08-15, https://github.com/Netflix/Surus.
Proceedings of 2019 IEEE International Conference on Web Services. [33] M. Ma, S. Zhang, D. Pei, X. Huang, and H. Dai, “Robust and rapid
IEEE, 2019, pp. 60–67. adaption for concept drift in software system anomaly detection,” in
[14] Y. Meng, S. Zhang, Y. Sun, R. Zhang, Z. Hu, Y. Zhang, C. Jia, Proceedings of 2018 IEEE 29th International Symposium on Software
Z. Wang, and D. Pei, “Localizing failure root causes in a microservice Reliability Engineering. IEEE, 2018, pp. 13–24.
through causality inference,” in Proceedings of 2020 IEEE/ACM 28th [34] R. Bhagwan, R. Kumar, R. Ramjee, G. Varghese, S. Mohapatra,
International Symposium on Quality of Service. IEEE, 2020, pp. 1–10. H. Manoharan, and P. Shah, “Adtributor: Revenue debugging in advertis-
[15] L. Wu, J. Tordsson, E. Elmroth, and O. Kao, “Microrca: Root cause ing systems,” in Proceedings of 11th USENIX Symposium on Networked
localization of performance issues in microservices,” in Proceedings Systems Design and Implementation. USENIX, 2014, pp. 43–55.
of 2020 IEEE/IFIP Network Operations and Management Symposium. [35] Y. Sun, Y. Zhao, Y. Su, D. Liu, X. Nie, Y. Meng, S. Cheng, D. Pei,
IEEE, 2020, pp. 1–9. S. Zhang, X. Qu et al., “Hotspot: Anomaly localization for additive KPIs
[16] M. Kim, R. Sumbaly, and S. Shah, “Root cause detection in a service- with multi-dimensional attributes,” IEEE Access, vol. 6, pp. 10 909–
oriented architecture,” ACM SIGMETRICS Performance Evaluation Re- 10 923, 2018.
view, vol. 41, no. 1, pp. 93–104, 2013. [36] J. Lin, P. Chen, and Z. Zheng, “Microscope: Pinpoint performance issues
[17] H. Wang, P. Nguyen, J. Li, S. Kopru, G. Zhang, S. Katariya, and with causal graphs in micro-service environments,” in Proceedings of
S. Ben-Romdhane, “Grano: Interactive graph-based root cause analysis International Conference on Service-Oriented Computing. Springer,
for cloud-native distributed data platform,” Proceedings of the Very 2018, pp. 3–20.
Large Data Base Endowment, vol. 12, no. 12, pp. 1942–1945, 2019. [37] C. Manning, P. Raghavan, and H. Schütze, “Introduction to information
[18] H. Baek, A. Srivastava, and J. Van der Merwe, “Cloudsight: A tenant- retrieval,” Natural Language Engineering, vol. 16, no. 1, pp. 100–103,
oriented transparency framework for cross-layer cloud troubleshooting,” 2010.
in Proceedings of 2017 17th IEEE/ACM International Symposium on
Cluster, Cloud and Grid Computing. IEEE, 2017, pp. 268–273.

You might also like