100 Observability Demo Script
Introduction
Narration
In this demo, I’ll show how IBM Instana helps quickly identify, debug, and resolve an incident
in a microservices-based application.
To set the context, our application is called Stan's Robot Shop, and it is a modern, cloud-native
application with microservices leveraging various technologies such as Java, Python, and
MySQL, deployed in containers on top of a Kubernetes cluster.
Such applications create a serious challenge for managing application performance because
components are dynamic and loosely coupled. Because they use different technologies,
diagnosing them usually requires broad knowledge and multiple tools. Instana, with a single
agent deployed per host, automates the discovery process.
Application components are discovered and observed as they are deployed. Over 200
technologies are supported with zero or minimal configuration, freeing you from installing
and configuring multiple tools or plugins. The discovered components can be grouped into an
application perspective, giving the application owner an easy overview of key metrics (“golden
signals”) like traffic, errors, and latency on a single pane of glass.
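To make the “golden signals” idea concrete, here is a minimal Python sketch (illustrative only, not Instana internals) that computes traffic, error rate, and mean latency from a small batch of call records; the records and field names are assumptions made up for the example.

```python
# Illustrative only: compute "golden signals" from hypothetical call records.
from statistics import mean

# In Instana these records come from automatic tracing; here they are made up.
calls = [
    {"service": "discount", "duration_ms": 12.0, "erroneous": False},
    {"service": "discount", "duration_ms": 480.0, "erroneous": True},
    {"service": "cart", "duration_ms": 35.0, "erroneous": False},
]

window_seconds = 60
traffic = len(calls) / window_seconds                          # calls per second
error_rate = sum(c["erroneous"] for c in calls) / len(calls)   # share of erroneous calls
mean_latency = mean(c["duration_ms"] for c in calls)           # average latency in ms

print(f"traffic={traffic:.2f}/s error_rate={error_rate:.0%} latency={mean_latency:.1f}ms")
```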
This is a visualization of all the dependencies within the robot shop application. Instana
automatically discovered the relationships between the services and correlated them into this
dynamic graph. We can see how requests are moving through the application in real time.
Instana is able to do this because it tracks every request that flows through the application.
We can tell there are some problems with the application because several services are
highlighted in yellow and red.
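As a side note, the dynamic graph is derived from the same per-request trace data. Here is a rough, conceptual Python sketch (not Instana's implementation) of how traced calls could be rolled up into a service dependency graph; the call records and field names are hypothetical.

```python
# Conceptual sketch: roll hypothetical traced calls up into a dependency graph.
from collections import defaultdict

traced_calls = [
    {"source": "web", "destination": "cart"},
    {"source": "cart", "destination": "discount"},
    {"source": "discount", "destination": "mysql"},
    {"source": "web", "destination": "catalogue"},
]

graph = defaultdict(set)
for call in traced_calls:
    graph[call["source"]].add(call["destination"])

for service, downstream in sorted(graph.items()):
    print(f"{service} -> {', '.join(sorted(downstream))}")
```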
But you wouldn’t normally be looking at the dashboard when something like this happens, so
let me walk you through what it looks like from the SRE/IT operator’s point of view when an
incident occurs.
Action 0.1.1
• From the sidebar menu, click the Applications icon and choose RobotShop.
Action 0.1.2
• Click the Dependencies tab.
Narration
We’ve just gotten an alert from Instana that there has been a sudden increase in erroneous
calls on our ‘discount’ service, which is part of the robot shop application.
Although I don’t have it connected right now, the alert would show up via one of the
configurable alert channels, like PagerDuty, Microsoft Teams, Slack, and many others (full list).
It’s important to note here that you’re not getting alerts for just anything. Behind the scenes,
Instana is determining which events and issues are related, and it only sends alerts if a
problem is likely to affect end users.
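For teams that use a custom alert channel, Instana can also post notifications to a webhook. The receiver below is a minimal Python sketch of that idea; the payload field name ("issue") is an assumption for illustration, not Instana's exact schema.

```python
# Minimal sketch of a webhook receiver for alert notifications (illustrative).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # "issue" is a hypothetical field; a real receiver would map the actual payload.
        print("Alert received:", payload.get("issue", payload))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()
```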
Action 1.1.1
• Click the Events icon (triangle) on the sidebar menu.
Narration
Instana recognized that the sudden increase in the number of erroneous calls was something
important to alert on, so we did not have to do any configuration or set thresholds in order to
get this alert. We get key information right away when we come into this incident detail page.
There’s a timeline of the incident, the event that triggered Instana to create the incident, and
all of the related events.
Action 2.1.1
• Click the incident called Sudden increase in the number of erroneous calls on the
‘discount’ service.
Action 2.1.2
• You will see the Incident Timeline, Triggering Event, and Related Events.
Narration
It looks like the abnormal termination of the MySQL database caused the problem. It shows
how one data store issue rippled out to affect a number of directly and indirectly connected
services. Instana’s automatic root cause analysis uses the relationship information from the
dynamic graph to accurately collate the individual issues into one incident. This completely
eliminates alert storms, providing your DevOps engineers and SREs with a single notification of
actionable information to enable them to promptly restore normal service. Let’s look at some
related traces for this.
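To illustrate the idea of collating related events (this is a simplification, not Instana's actual algorithm), the Python sketch below starts from the triggering event's component and walks the dependency graph upstream, so that everything affected ends up in a single incident; the relationships and events are hypothetical.

```python
# Simplified illustration of collating related events into one incident.
from collections import deque

# Hypothetical upstream relationships: which services call each component.
callers_of = {
    "mysql": ["discount"],
    "discount": ["cart"],
    "cart": ["web"],
}

# Hypothetical open events keyed by component.
events = {
    "mysql": "Abnormal termination of MySQL",
    "discount": "Sudden increase in the number of erroneous calls",
    "cart": "Erroneous call rate above baseline",
}

def collate_incident(root):
    related, queue, seen = [], deque([root]), set()
    while queue:
        component = queue.popleft()
        if component in seen:
            continue
        seen.add(component)
        if component in events:
            related.append(f"{component}: {events[component]}")
        queue.extend(callers_of.get(component, []))
    return related

print(collate_incident("mysql"))  # one incident, with MySQL as the root cause
```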
Action 3.1.1
• Under Related Events, click the event that says Sudden increase in the number of
erroneous calls. Then, click Analyze Calls.
Narration
Now we've moved to the Analytics view. You can see how the Instana UI allows for easy
navigation between different views, keeping the time span and context. At the top, you can see
the filter that was applied to all collected traces. All filtered requests are grouped by endpoint
(in this case, it is the database CONNECT exposed by the MySQL server).
There is only one endpoint here, but if there were multiple, you’d see a list. Endpoints are
automatically discovered and mapped by Instana. We can go into the details for each
erroneous call to MySQL via this endpoint (CONNECT).
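Conceptually, the Analytics view is applying a filter to the collected calls and grouping the matches by endpoint, along the lines of this small Python sketch (the call records and field names are hypothetical):

```python
# Illustrative only: filter hypothetical call records and group them by endpoint.
from collections import Counter

calls = [
    {"service": "mysql", "endpoint": "CONNECT", "erroneous": True},
    {"service": "mysql", "endpoint": "CONNECT", "erroneous": True},
    {"service": "mysql", "endpoint": "SELECT", "erroneous": False},
]

filtered = [c for c in calls if c["service"] == "mysql" and c["erroneous"]]
by_endpoint = Counter(c["endpoint"] for c in filtered)
print(by_endpoint)  # Counter({'CONNECT': 2})
```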
Action 3.2.1
• Click the endpoint named CONNECT. Then, click the first call (also named CONNECT).
Narration
Clicking an individual call takes us to a view of the call in the context of the end-to-end trace.
We can see where the request began and each call that was made along the way.
Action 4.1.1
• Click the first call on the list.
Narration
In the call stack, we can click each span to see more information, including the complete stack
trace. We can see the source, in this case the ‘discount’ service, and [scroll down] the
destination, the CONNECT endpoint of MySQL.
It’s useful to have this context because we can easily see how the calls go from one service to
another, just by clicking them. We can also see how the error (red triangle) propagated up the
call stack, in this case beginning with the MySQL database.
So we can confirm that the root cause of the incident that affected the ‘discount’ service was
the MySQL database. The abnormal termination of the database caused a connection error,
which then flowed back through the application.
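To picture how such an error propagates, here is a minimal Python sketch of a hypothetical discount lookup (not the actual Robot Shop code; the connection settings and the mysql-connector-python dependency are assumptions): when the database is down, the failure surfaces at the CONNECT step and becomes an erroneous call on the ‘discount’ service, which its callers then see in turn.

```python
# Hypothetical sketch (not Robot Shop code); requires mysql-connector-python.
import mysql.connector
from mysql.connector import Error as MySQLError

def get_discount(user_id):
    try:
        # Made-up connection settings for illustration.
        conn = mysql.connector.connect(
            host="mysql", user="shop", password="shop", database="discounts"
        )
    except MySQLError as exc:
        # The CONNECT fails because the database terminated abnormally; this is
        # what shows up in the trace as an erroneous call on the service.
        return {"status": 500, "error": f"database unavailable: {exc}"}
    try:
        cur = conn.cursor()
        cur.execute("SELECT amount FROM discount WHERE user_id = %s", (user_id,))
        row = cur.fetchone()
        return {"status": 200, "discount": row[0] if row else 0}
    finally:
        conn.close()

print(get_discount("user-42"))
```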
Action 4.2.1
• Scroll down to the section labeled Calls.
5.1 - Observe that metrics for the robot shop have returned to normal
Narration
Now that MySQL is working again, we can go back and confirm that the problems with the robot
shop have been resolved.
You should see that the call volume has increased, the number of erroneous calls decreased,
and latency also decreased.
If you’re giving the demo in real time, the incident should have reset itself by the time you’re
done demoing. If not, this part can be skipped.
If you set the timeframe at the beginning of the demo, you can set it again to begin at 30
minutes past the hour and end at 45 minutes past the hour.
Action 5.1.1
• Navigate to Applications in the sidebar menu, choose Robot Shop, and click
the Summary tab.
Summary
Now, we can see that the metrics for the robot shop have returned to normal: the call volume
has increased again, the erroneous call rate has decreased, and latency has decreased.
We’ve fixed the problem with the robot shop and restored normal service!
Hopefully, you’ve seen that Instana can make the process of identifying problems and finding
the root cause of those problems frictionless. Since Instana automates so many of the manual
and labor-intensive aspects of the process, you can focus on getting other work done and not
worry about instrumenting observability or constantly monitoring for problems. And when
problems do arise, all the trace data is there at your fingertips to dig into.