Cloud ITIL

What is service management?

Cloud service management and operations refer to all the activities that an
organization does to plan, design, deliver, operate, and control the IT and
cloud services that it offers to customers.
Service management includes the operational aspects of your applications
and services. After an application is pushed to production, it must be
managed. Applications are monitored to ensure availability and performance
according to service level agreements (SLAs) or service level objectives
(SLOs).

Enterprises see the value of DevOps, things like development velocity,
agility, direct feedback from users, etc. Operations need to support these
new concepts as well and redefine processes, roles & responsibilities, and
tools. New concepts like Site Reliability Engineering take a fresh approach to
Service Management, allowing operations to sustain the increased volume
coming in from development while protecting performance and reliability of
the solution.
Ingo Averdunk, IBM Distinguished Engineer, Cloud Service Management and Operations

Transform to agile service management


As methods of developing, testing, and releasing new function become more
agile, service management must also transform to support this paradigm
shift. The transformation has implications in various areas:
 Organization: Instead of a discrete operations organization that is
distant from the development team, full lifecycle responsibility is provided
through small DevOps teams. Another approach is site reliability engineering
(SRE), which brings a strong engineering focus to operations. SRE
emphasizes automation to scale operations as load increases.
 Process: A key concept of DevOps is the automated and continuous
testing, deployment, and release of functions. Service management
processes, such as change management processes and the role of the
change advisory board, must change to support this notion.
 Tools: Because time is of the essence in restoring a service, incident
management tools must provide rapid access to the right information,
support automation, and instant collaboration with the right subject-matter
experts. The term ChatOps describes a collaborative way to perform
operations. Bot technology integrates service management and DevOps
tools into this collaboration.
 Culture: As with any transformation project, you must consider a few
cultural aspects. One example is the need for a blameless post-mortem
culture where the root cause of an incident is revealed and the organization
can learn from it.
IBM experts share their point of view
Join IBM experts, including Ingo Averdunk, Distinguished Engineer for Service Management in IBM Hybrid Cloud, as they discuss advances in enterprise IT management and operations driven by microservices adoption. Covering everything from the problems with traditional IT monitoring to new methodologies and practices, such as ChatOps, this is part one of a two-part session on keeping your entire microservices stack up and running.
Save on cost and increase efficiency
Enterprises that implement a cognitive service management architecture
can reduce costs and increase efficiency. In addition, a service management
architecture provides several other benefits:
 Maximize operational effectiveness. Ensure the availability and performance of applications that
run on the IBM Cloud platform, given the target SLA of 99.99% availability for applications.
 Increase operational reliability and agility by using event-driven guidance, automation, and
notification to prompt the right activity to resolve important issues.
 Improve operational efficiency by using real-time analytics to identify and resolve problems
faster.
 Reduce costs by creating a single view and central consolidation point for events and problem
reports from your operations environment.
 Establish and maintain consistency of the application's performance and its functional and
physical attributes with its requirements, design, and operational information.
 Manage and control operational risks and threats.
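The 99.99% availability target mentioned above translates into a concrete downtime budget. As a minimal sketch (the 30-day period is an illustrative assumption):

```python
# Allowed downtime for a given availability target -- a small
# illustration of the 99.99% SLA figure mentioned above.

def allowed_downtime_minutes(availability: float, period_minutes: int = 30 * 24 * 60) -> float:
    """Return the downtime budget (in minutes) for one period."""
    return period_minutes * (1.0 - availability)

# A 99.99% target over a 30-day month leaves roughly 4.3 minutes of downtime.
print(round(allowed_downtime_minutes(0.9999), 1))
```

This is why "four nines" requires automation: a human cannot reliably detect, diagnose, and restore a service inside a few minutes per month.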
Aspects of service management
Cloud service management and operations redefines traditional service
management to better fit the needs of cloud and DevOps patterns. At the
same time, it bridges traditional approaches to service management, such as
the IT Infrastructure Library (ITIL).
The service management reference architecture includes incident
management, problem management, change management, and operations.
Incident management
Incident management aims to restore the service as quickly as possible by
using a first-responder team that is equipped with automation and well-
defined runbooks. To maintain the best possible levels of service quality and
availability, the incident management team performs sophisticated
monitoring to detect issues early, before the service is affected. For complex
incidents, subject-matter experts collaborate on the investigation and
resolution. Stakeholders, such as the application owner, are continuously
informed about the status of the incident. Capabilities include event
correlation, monitoring, log monitoring, collaboration, notification,
dashboard, and runbooks.
Problem management
Problem management aims to resolve the root causes of incidents to
minimize their adverse impact and prevent recurrence. Capabilities include
root-cause analysis, incident analysis, dashboards, and collaboration.
Change management
The purpose of change management is to achieve the successful
introduction of changes to an IT system or environment. Success is
measured as a balance of the timeliness and completeness of change
implementation, the cost of implementation, and the minimization of
disruption that is caused in the target system or environment.
Operations
Operations describes the activities to deliver the right quantity of the right
set of services at competitive costs for customers. IT operations
management runs daily routine tasks that are related to the operation of
infrastructure components and applications. These tasks include application
and systems availability, health checks, compliance checks, performance
monitoring, backups, and capacity monitoring and management.

Site Reliability Engineering


Site Reliability Engineering (SRE) is an approach to operations that ensures
that continuously delivered applications run efficiently and reliably by using
software engineering and automation solutions. The key concept is
engineering, which includes a data-driven approach to operations, a culture
of automation to drive efficiency and reduce risk, and hypothesis-driven
methodology in incident, performance, and capacity tasks.
Another core principle is a focus on improving things. Site reliability
engineers don't only "do" automation and restore failed services. Their job is
to make sure that failures don't happen again. A blameless postmortem
identifies the root cause, or causes, of an incident and results in a balanced
action plan to address them.
Unlike traditional system administrators who are typically risk-averse, site
reliability engineers embrace risk in a controlled fashion. They use the
concept of error budget to determine acceptable risk and make informed
decisions about when changes should be made. The error budget is a limit
on how much time the system is allowed to be down, defined by the
contracted service-level agreement (SLA) or the intended service-level
objective (SLO). Many clients stop new releases when they might miss the
service-level agreement (SLA). Error budget goes a step further and
encourages testing and releasing only if downtime is left in the SLA.
If a system has been unstable, changes are restricted; if it's stable,
engineers can take the opportunity to innovate or upgrade.
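The error-budget gate described above can be sketched in a few lines. This is a minimal illustration, assuming a 99.9% SLO measured over a 30-day period; the function names are not from any particular tool:

```python
# A minimal error-budget check: release only while downtime budget remains.

def error_budget_remaining(slo: float, period_minutes: int, downtime_minutes: float) -> float:
    """Minutes of downtime still allowed before the SLO is breached."""
    budget = period_minutes * (1.0 - slo)
    return budget - downtime_minutes

def can_release(slo: float, period_minutes: int, downtime_minutes: float) -> bool:
    """Testing and releasing continue only if error budget is left."""
    return error_budget_remaining(slo, period_minutes, downtime_minutes) > 0

# 99.9% over 30 days allows ~43.2 minutes of downtime.
print(can_release(0.999, 30 * 24 * 60, 20))  # budget remains, so release
print(can_release(0.999, 30 * 24 * 60, 50))  # budget exhausted, so hold
```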
DevOps and SRE
Although SRE didn't emerge from DevOps, it is aligned with DevOps. DevOps
has the underlying philosophy of full lifecycle responsibility for DevOps
teams and of iterating based on that experience. "You build it, you run it."
While this approach has many upsides, it also has downsides. Developers
might now do operations although they might not be experts in that domain.
Not every service has developers who are assigned to it, but each service
must still be operated and maintained. Also, in an enterprise where you have
hundreds of DevOps teams, you might lack synergy across the teams. In that
case, an SRE team focuses on cross-domain areas, such as a monitoring
back end, a logging framework, and an automation framework. Your site
reliability engineers work with the DevOps teams on incident management
and postmortems.
As shown in the following diagram, site reliability engineers can spend up to
half of their time on operations-related work and the rest of their time on
development tasks. In their operations work, they address customer issues
and are on call. Because the applications that they oversee are expected to
be highly automated and self-healing, the engineers have time to do
development tasks, such as writing new features, scaling, or implementing
automation. The ideal site reliability engineer candidate is either a software
engineer with a good administration background or a highly skilled system
administrator with knowledge of coding and automation. Your site reliability
engineers might also work on eliminating performance bottlenecks, isolating
failures by using the circuit breaker and bulkhead patterns, creating
runbooks, and automating daily operations processes.
The goal of SRE is to make systems more reliable. Development and SRE
work together to deliver application performance and reliability by using the
same development CI/CD delivery pipelines and release processes, but they
each focus on their own metrics of success. Development focuses on the
speed of release of new functions, while operations teams focus on
maintaining reliability. Because site reliability engineers spend part of their
time on development tasks and are integrated with the DevOps team, the
overarching team's goal must include both the development and operations
goals.
SRE principles
When you're building your SRE skills and processes, consider these principles
to apply to how your team operates:
 Use automation to perform operations to scale with load.
 Cap the operational load: spend 50% of the time on toil and 50% of the
time on improvements.
 Share 5% of the operations work with the development team. Any
excess operations work overflows to the development team.
 Have an SLA or SLO for the service and measure against it.
 Create an error budget to control velocity. Balance effective self-
regulation of features against stability.
 Practice observability, including the four Golden Signals: Latency,
Traffic, Errors, and Saturation.
 Use actionable, symptom-based alerts. To govern actions, use
automated runbooks.
 Hold a blameless postmortem for every event.
 Hire only developers, and use a common staffing pool for SRE and
development.
Adopt some or all of those principles, and add your own to define how your
SRE team operates.
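The observability principle above names the four golden signals. A sketch of symptom-based alerting on them might look like this; the thresholds are illustrative assumptions, not recommended values:

```python
# Symptom-based checks on the four golden signals:
# latency, traffic, errors, and saturation.

GOLDEN_SIGNAL_THRESHOLDS = {
    "latency_ms": 500,      # p95 response time
    "traffic_rps": 10_000,  # requests per second capacity
    "error_rate": 0.01,     # fraction of failed requests
    "saturation": 0.80,     # fraction of capacity in use
}

def evaluate_signals(sample: dict) -> list:
    """Return the names of the signals that breach their thresholds."""
    return [name for name, limit in GOLDEN_SIGNAL_THRESHOLDS.items()
            if sample.get(name, 0) > limit]

# Only latency breaches in this sample, so only one actionable alert fires.
print(evaluate_signals({"latency_ms": 750, "error_rate": 0.002, "saturation": 0.65}))
```

Alerting on these user-visible symptoms, rather than on raw CPU or memory, keeps alerts actionable.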
The value of SRE
When you adopt SRE, you gain several benefits:
 Reduction in mean time to repair (MTTR) and mean time between
failures (MTBF)
 Faster rollout of version updates and bug fixes
 Reduction of risk through automation
 Enhanced resource retention by making operations jobs attractive and
interesting
 Alignment of development and operations through shared goals
 Separation of duties and compliance
 Balance between functional and nonfunctional requirements

Five principles of service management


To adopt microservices-based applications and consider the service
management and operational facets of their applications, your operations
team can follow five principles:
1. Operations
2. Monitoring
3. Eventing and alerting
4. Collaboration
5. Root-cause analysis

Managing microservices involves five key principles.
Many companies are transforming their monolithic applications to use a
microservices-based architecture. The microservices architecture provides
many advantages:
 Agility and speed in releasing new functionality
 Flexibility in changing the implementation
 Independence between functional units
 Scalability
However, in terms of management, these benefits come with a price. The
services management solution must deal with the microservices
architecture's inherent dynamics, dependencies, and complexities to ensure
that the application is available and performing. Unless the operations
management team is also shifting its paradigm, the microservices-based
application might behave worse than a monolithic application that was built
in the traditional fashion.
5 key principles of cloud service management and operations
In the Garage Method for Cloud, the Operate practices explain how to ensure
operational excellence. You follow the Operate practices to deliver a more
resilient environment:
 Implicit redundancy through application high-availability and scalability
 Fault tolerance by using concepts such as circuit breaker patterns
While these practices reduce the direct impact of an outage, they don't
relieve the DevOps team from detecting the incident and responding to it
with a sense of urgency.
Managing microservices involves five key principles. The principles assist the
operations team to adopt microservices-based applications. They also help
developers think about the operational facets of their application, as both
developers and operations share a common goal of services that are robust
and of high quality. For this to happen, it is important to engage operations
(or SREs) early to consider operational choices when designing and
developing the application. Refactoring these decisions later is error-prone,
cumbersome, and certainly less effective.
1: Operations

Several operational activities typically must be done in a production
environment: making sure that applications have the right capacity to cope
with the load, complying with corporate or governmental policies, and so
forth.
Some of these tasks are easier to achieve with the right support in
place. In capacity management, for example, microservices can support
elasticity where infrastructures can automatically scale depending on the
usage. The 12-factor app manifesto describes a methodology for building
applications that can scale without significant changes to tooling,
architecture, or development practices.
While microservices don't need to be implemented by using containers, it's
beneficial to use both technologies together. When you use containers,
everything that is required for the software to run is packaged into isolated,
lightweight bundles, making deployment and operations easier. Kubernetes
is an open-source system for automating the deployment, scaling, and
management of containerized applications. Kubernetes provides functionality
for many operational tasks:
 Placement of workload based on resource requirements
 Automated rollouts and rollbacks
 Service discovery and load balancing
 Horizontal scaling
 Self-healing
As you can see, many of the typical operational tasks can be done through
Kubernetes so that you can focus on other operational activities.
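The self-healing behavior in the list above comes from Kubernetes continually reconciling observed state toward desired state. This toy control loop mimics that idea for replica counts; it is an analogy for illustration, not the Kubernetes implementation:

```python
# A toy reconcile loop: drive the set of running instances toward the
# desired replica count, the core pattern behind Kubernetes self-healing.

def reconcile(desired_replicas: int, running: list) -> list:
    """Add or remove instances until the running set matches the desired count."""
    running = list(running)
    while len(running) < desired_replicas:
        running.append(f"pod-{len(running)}")  # schedule a replacement instance
    while len(running) > desired_replicas:
        running.pop()                          # scale down excess instances
    return running

# One pod crashed out of three; the loop restores the third replica.
print(len(reconcile(3, ["pod-0", "pod-1"])))
```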
One of those tasks is checking compliance. For example, you might need to
check for compliance with security advisories, corporate policies, or
standards that are enforced by industry or government. These checks should
already be run during the development and testing stages. It's also wise to
run the checks in production because many policies keep changing; for
example, due to a security exposure becoming public.
Another example is backup and archiving. To protect from disasters and
meet regulatory needs, backups must be done regularly based on RPO
(recovery point objective), which is the maximum targeted period in which
data might be lost, and RTO (recovery time objective), which is the targeted
duration of time and a service level within which a business process must be
restored after a disaster. A good microservices design externalizes storage-
related activities to explicit persistency services so that these tasks can be
limited to the services that deal with persistence. Needless to say, you must
verify that the backups are consistent and usable.
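The RPO check described above can be expressed simply: the newest usable backup must be younger than the RPO window. A minimal sketch, with an assumed 24-hour RPO:

```python
# Check backup freshness against a recovery point objective (RPO).
from datetime import datetime, timedelta

def rpo_satisfied(last_backup: datetime, rpo: timedelta, now: datetime) -> bool:
    """True if a failure right now would lose no more data than the RPO allows."""
    return now - last_backup <= rpo

now = datetime(2024, 1, 2, 12, 0)
print(rpo_satisfied(datetime(2024, 1, 2, 0, 0), timedelta(hours=24), now))    # fresh backup
print(rpo_satisfied(datetime(2023, 12, 31, 0, 0), timedelta(hours=24), now))  # stale backup
```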
2: Monitoring

Each microservice must be monitored. Before thinking of, let alone
implementing, a monitoring solution, be sure to define what to monitor. Your
guiding principle is the experience of the user of the service, which might be
a human (for front-facing services) or a system (for back-end services).
Therefore, the key metrics typically are availability, performance/response
time/latency, and error rate. Ideally, synthetic transactions are done from
multiple locations to ensure that the relevant functions of each service are
"exercised" and that the key metrics are evaluated against expected results.
The metrics are drastically different from a typical monitoring solution that
looks for CPU, memory, and disk space. Those parameters might still be
monitored, but with the move toward cloud-based operating models such as
IaaS and PaaS, the involvement of the application owner decreases.
Expose a HealthCheck API for each microservice. As developers know best
what the critical resources and checks for their services are, they should
implement HealthCheck.
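A HealthCheck endpoint commonly aggregates checks on the service's critical resources into a single status. The shape below is a sketch under assumed conventions; the check names and payload format are illustrative, not a standard:

```python
# A minimal HealthCheck payload: run each developer-defined check and
# roll the results up into an overall UP/DOWN status.

def health_check(checks: dict) -> dict:
    """Run each named check callable and report an overall status."""
    results = {name: bool(fn()) for name, fn in checks.items()}
    status = "UP" if all(results.values()) else "DOWN"
    return {"status": status, "checks": results}

report = health_check({
    "database": lambda: True,  # e.g. a ping to the backing store
    "queue": lambda: True,     # e.g. message-broker connectivity
})
print(report["status"])
```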
Prometheus is an open-source monitoring framework and Fluentd is a
logging data collector. Both tools work with Kubernetes. They provide some
level of HealthCheck API natively, easing the path for developers to take
advantage of it.
Another important element to monitor is application logs, as they provide
visibility into the behavior of a running app. A typical use case is the parsing
and investigation of logs during the diagnostic analysis of an incident. Critical
alerts might be exposed in logs as well, so a monitoring solution should look
for those patterns and alert the operations team. Most of the time, the logs
are streams, such as 12-factor app processes that write their unbuffered
event stream to stdout. Although microservices are loosely coupled, they
depend on each other to provide their logic. They have been developed in
different programming languages and their execution characteristics are
distributed and highly dynamic, so procedures must be in place to aggregate
logs to a central place and perform search and analysis from there.
One technique to stitch traces together is the use of correlation identifiers.
Using correlation IDs not only helps to identify the execution path for a given
transaction, but also supports visualization that adds the context of time,
such as latency; the hierarchy of the services that are involved; and the
serial or parallel nature of the process or task execution. OpenTracing is a
vendor-neutral open standard for distributed tracing.
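The correlation-ID technique can be sketched as follows: generate one ID at the edge, then include it in every log line the transaction touches. This is an illustration of the idea, not the OpenTracing API:

```python
# Propagate a correlation ID through log lines so all records for one
# transaction can be stitched together during analysis.
import uuid

def new_correlation_id() -> str:
    """Generate a unique ID at the entry point of a transaction."""
    return uuid.uuid4().hex

def log_line(correlation_id: str, service: str, message: str) -> str:
    """Each service includes the same ID so logs can be joined centrally."""
    return f"[{correlation_id}] {service}: {message}"

cid = new_correlation_id()
lines = [
    log_line(cid, "frontend", "received order request"),
    log_line(cid, "payments", "charge authorized"),
]
# Every line for this transaction shares one ID and can be grouped on it.
print(all(line.startswith(f"[{cid}]") for line in lines))
```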
3: Eventing and alerting

The monitoring solution detects problems with the services, but plenty of
alerts can still occur. In this meshed architecture, services depend on each
other, so a degraded performance in one service might result in cascading
failures in each of the dependent services. To avoid chasing symptoms
rather than causes, an event-management system integrates alerts from
various feeds—service monitoring, log monitoring, infrastructure monitoring
—and attempts to correlate those events. To do so, topology information and
deployment state information, such as how many instances of a service are
currently running, are required.
As this information changes rapidly, the data must be gathered at the time of
the correlation. Traditional approaches like Configuration Management
Databases (CMDBs) have a high risk of showing incomplete or stale
information. This dynamic data—for example, topology information—must be
retrieved directly from the (container) management system at the time of
the detection and used for correlation.
The results of this correlation are actionable alerts. Each event should be
associated with a runbook so that the First Responder team knows how to
respond to the alert and what mitigation action to do. Ideally, these runbooks
are codified in the form of scripts so that the event-management system can
automate the execution and surface only unique problems to a human.
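The event-to-runbook association described above can be sketched as a simple dispatch: known events run their codified runbook automatically, and anything without one is escalated. Event names and runbook actions here are illustrative assumptions:

```python
# Map correlated alerts to codified runbooks so routine events are
# automated and only unique problems reach a human First Responder.

RUNBOOKS = {
    "disk_full": lambda: "rotated logs and expanded volume",
    "pod_crashloop": lambda: "restarted deployment with last good image",
}

def handle_alert(event: str):
    """Automate events with a known runbook; escalate anything else."""
    runbook = RUNBOOKS.get(event)
    if runbook is not None:
        return ("automated", runbook())
    return ("escalated_to_first_responder", None)

print(handle_alert("disk_full")[0])      # known event, handled by runbook
print(handle_alert("novel_failure")[0])  # unknown event, surfaced to a human
```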
Of course, you don't want First Responders to waste time staring at consoles,
so the system instead notifies them at the receipt of a new alert.
Notifications can be sent through various channels: email, SMS text
message, or an alert in an instant messaging system. The notification system
also alerts other people if the response is not acknowledged by First
Responders within a defined SLA.
4: Collaboration
After the First Responder is notified about the incident, he or she starts with
the diagnostics. The first step is to isolate the component at fault. After it is
isolated, the investigation continues to see what exactly happened and what
can be done to restore the service as quickly as possible.
In an architecture where many services depend on each other, it's likely that
many people need to collaborate. Because one of the key concepts of
microservices is support for multiple languages and even multiple platforms,
the need to interact with subject-matter experts (SMEs), including developers,
is only increasing. The term ChatOps describes this process, where people use
an instant-messaging communication platform to collaborate among SMEs.
Through the ChatOps platform, all interaction is logged in a central place and
you can browse through the log to see what actions were taken.
ChatOps is not limited to humans interacting with each other. By using bot
technology, DevOps and service management tools can be integrated. Two
examples are a monitoring system that pushes a chart showing the
response-time distribution over the last 24 hours and a deployment system
that informs ChatOps about the recent deployment tasks.
In addition, improved visibility through dashboards can expedite restoration.
As the microservices-based application is dynamic in nature—continued
deployment, auto scaling, dynamic instantiation, circuit-breakers, and so
forth—having an accurate understanding of the application is a challenge.
Dashboards visualize topology, deployment activities, and the operational
state, showing availability and performance metrics. A dashboard should also
visualize the key service indicators from a user perspective.
5: Root-cause analysis

Through collaboration, the operations team eventually identifies the correct
mitigation and restores the service. To prevent the incident from
reappearing, the root cause must be assessed. Follow the 5 Whys approach,
as this method helps to surface the issue that was ultimately responsible for
an incident. This investigation must operate in a blameless culture; only
through that approach are people willing to share their insights and help
others learn from the experience.
After the root cause is known, appropriate steps are taken to address it. The
steps might range from changes to the application or its architecture to
changes to the infrastructure or the management system.
Following an agile approach, these changes are put into the backlog and are
ranked at the next iteration.
A continual challenge is that functional enhancements tend to be ranked
higher than the outage-related changes. For those changes to be
implemented, companies take different approaches. Some companies make
operations the responsibility of the DevOps teams. In this model, developers
have an intrinsic interest in addressing the reliability issues. Another
approach is to establish a Site Reliability Engineering (SRE) team. This team
is empowered to address reliability issues by spending at least 50% of their
time on engineering work. Examples are reducing toil through automation
and assisting the development team in implementing outage-related
changes.
Balancing short-term tactical improvements with longer-term strategic
implementations is an act that needs to be carefully managed.

Build a DevOps culture and squads


The IBM Garage Methodology, including DevOps, is a cultural movement; it's
all about people. An organization might adopt the most efficient processes or
automated tools possible, but they're useless without the people who run the
processes and use the tools. Therefore, building a culture is at the core of
adopting the Methodology.
Building culture
A DevOps culture is characterized by collaboration across roles, focus on
business instead of departmental objectives, trust, and value placed on
learning through experimentation. Building a culture isn't like adopting a
process or a tool. It requires the social engineering of a squad of people,
each with unique predispositions, experiences, and biases. This diversity can
make culture-building challenging.
Building a DevOps culture requires the leaders of the organization to work
with their squads to create an environment and culture of collaboration and
sharing. Leaders must remove any self-imposed barriers to cooperation.
Typically, operations professionals are rewarded for uptime and stability, and
developers are rewarded for delivering new features. As a result, those
groups are set against each other. For example, operations professionals
know that the best protection for production is to accept no changes, and
developers are encouraged to provide new functions quickly, sometimes at
the expense of quality. In a DevOps culture, squads have a shared
responsibility to deliver new capabilities quickly and safely.
The leaders of the organization must further encourage collaboration by
improving visibility. Establishing a common set of collaboration tools is
essential. Build trust and collaboration by giving all stakeholders visibility
into a project's goals through agile planning discussions and playback
meetings.
Sometimes, building a DevOps culture requires people to change. People
who are unwilling to change, that is, to adopt the DevOps culture, might
need to be reassigned.
Building a squad
The most important thing to remember when you build a squad is that a
squad in the most basic sense is a group of people who work together to
accomplish a common goal. In a DevOps squad, the goal is usually the
delivery of a product or a microservice that is part of a product.
What makes a squad successful? Think about the squads that you've worked
with. Several might stand out as favorites. Successful squads demonstrate a
few key behaviors.
First, successful squads communicate well. They don't always agree, but
everyone contributes. Each person respects the ideas and opinions of other
squad members, and when the squad gets together, squad members can
freely discuss ideas without regard to hierarchy.
Remember that squads are made of people. Pretending that life doesn't
impact work isn't realistic. Often, you spend more time with your co-workers
than with family and friends. Be sure to consider how your squad members
feel on a personal level. Although the squad has a goal to accomplish, squad
members shouldn't feel like cogs in a machine. On a successful squad, squad
members can discuss both good and bad topics and maybe even have fun
and develop friendships.
Finally, having a common goal is key to the success of a squad. Every
member of the squad must understand that goal. To understand your
squad's common goal, use Enterprise Design Thinking to define a minimum
viable product (MVP), the infrastructure to build the MVP, and the related
rank-ordered backlog of stories. By planning work from the backlog and
participating in playbacks, everyone stays in sync as the squad moves
toward its goal.
Squad characteristics
When you build a squad, several more characteristics are critical to its
success as it strives to adopt DevOps in the Method:
 Diversity
 Autonomy
 Colocation
 Productivity
 Transparency
 Blameless root-cause analysis
 Peer recognition
 Fun
Diversity
The following content is based on "Building diverse teams" by Adrian Cho.
Diversity is an essential characteristic of a healthy, resilient, high-
performance squad. People tend to hire and promote others who think like
they do; it's natural to want to be around people who are like you. However,
to successfully innovate at scale, squads must be able to run at full speed
and pivot without tripping. Without diversity, squads might sink into
groupthink and are less likely to pivot when they should. When you build a
squad, consider three aspects of diversity: diversity of skills, diversity of
style, and diversity of thinking.
A common way to think about diversity of skills is to think of multi-disciplined
squads that bring together designers, developers, testers, operations, and so
forth. Increasingly, though, many organizations are now building squads of
full-stack developers where each person must be multi-disciplined.
Achieving diversity of mindset and culture is harder than achieving diversity
of skill. In any squad activity, some people tend to move quickly. Others are
cautious, waiting for others to move first. If too many people in a squad rush
forward, the squad will take unnecessary risks. If most of the squad tends to
wait, the squad won't be competitive enough. A well-balanced squad must
have a mix of both styles.
When it comes to getting things done, some people are good at starting
tasks and others are good at finishing them. Without strong starters, a squad
is slow to build momentum. Without strong finishers, the squad might never
achieve its goals. A mix of both styles ensures a high degree of collective
productivity.
When someone on your squad thinks differently than everyone else, it can
be easy to reject that person's opinion. However, think of it this way: the
squad is a box. If thinking outside the box powers innovation, embracing the
outliers can be the best path to success.
While diversity is important, the size of a squad must remain small. Ideally, a
squad's size follows the two-pizza rule and is no larger than 10 people.
Autonomy
The following content is based on "Autonomous, colocated squads" by Scott
Will.
What does it mean to be an autonomous squad? It doesn't mean that a
squad can do whatever it wants to. Squads should always be working on
items that are in alignment with the overall goals of the project. For
example, when the project needs the squad to complete a microservice so
that the offering can go live, the squad doesn't get to say, "Let's build
another version of Solitaire instead." However, the squad does get to
determine the way that they will complete the work on the microservice.
In autonomous squads, "autonomous" means that the squads are
responsible to figure out how best to do the work that needs to be done.
Autonomous squads get to make these kinds of decisions for themselves:
 Should we adopt pair-programming, or use formal code reviews?
 Should two people pair together for an entire week, or should we
switch pair assignments daily?
 Should Sally work on the GUI, or should we let Steve, the new guy on
the squad, work on the GUI for this story?
 Should we all plan to show up at 8:30 AM so we can begin our day
together, or should people come in whenever they want to?
Colocation
Most of the time, it's preferable to have a face-to-face conversation with a
subject-matter expert rather than read an article. When squad members are
colocated, they spend less time writing and reading emails, sitting on long
calls trying to convey ideas, and worrying about network outages that keep
squad members from checking in code from remote locations. In colocated
squads, time-zone problems don't exist. Colocation improves both
communication and efficiency.
Colocated squads might be together in a shared, open space. Although
having squad members with their own offices in the same building is better
than having people scattered all over the place, the goal is to have everyone
in the same room so that the squad can build the synergy to improve
productivity and morale.
Overcoming the obstacles to colocation
Unfortunately, colocation is not always easy to accomplish. The following
problems are common obstacles that squads face.
Problem: The squad already has members who are remote. If remote squad
members are forced to move to another city, several might quit.
Suggestion: Typically, the reason why squad members are remote from
each other is because of skills issues: the right mix of skills is not available at
one location. To address this problem, the long-term goal is to create the
right skills mix at each location. Obviously, this goal takes time and involves
folks learning new skills. In the short-term, the remote employees with the
needed skills can train employees who are local. Don't forget: the remote
employees will also likely need to learn new skills so that they can become
part of a squad that is local to them.
Problem: There is not enough open space to bring squad members together
in one place.
Suggestion: Squads have solved this problem in different ways. One squad
took over a conference room. Another squad went to one of their other
buildings that had a modular floor plan and several unoccupied offices and
formed their own squad area. Another squad pursued funding to rent space
at a nearby office building that was looking for tenants.
Problem: One of the squad members lives in a city with no option to join
another squad; that member is the only employee who lives there. The squad
member doesn't want to move and is in a time zone that is 5 hours different
from the rest of the squad's.
Suggestion: In this case, the best solution depends on the squad. To reach
a decision, the squad might consider these suggestions:
 Try to find times to work with the remote member so that no one is
inconvenienced all the time. For example, if the squad has an 8:00 AM call,
move it to noon. Attending a call during your lunch hour is inconvenient, but
it's better than forcing the remote squad member to attend a call when it is
3:00 AM in that member's time zone.
 Experiment with remote pair programming.
 Depending on the budget, the remote employee can travel
semi-frequently to the location where the rest of the squad is.
If squads are motivated to gain the benefits that colocation offers, they will
come up with the ideas to make it happen.
Productivity
The following content is based on "Minimizing distractions" by Jan Acosta.
All too often, people reach the end of the day and realize that they haven't
accomplished even a small percentage of what they set out to do. Why?
Email, interruptions from coworkers, meetings, or a barrage of conference
calls are often to blame. Minimizing distractions is a key cultural principle
and one that many squads struggle with.
Minimizing distractions focuses on looking at how squads spend their time
and then empowering the squad members and their managers to act to
reduce the noise, thus allowing them to focus on the critical tasks at hand.
The benefits are enormous. Squads report higher job satisfaction because
they can focus on doing what they love to do. Productivity is higher because
developers can get much more done than they would if they were dealing
with constant interruptions. The quality of the code is higher as well
because a developer's train of thought doesn't get derailed, a common cause
of coding errors and omissions.
How do you minimize distractions? Consider these tips as a starting point:
 Decline meetings that are not essential.
 Limit meeting attendance to the smallest number of participants
possible.
 Block time on your calendar to do uninterrupted work. Start with a two-
or four-hour block each day. No email, no calls, no getting pulled into
discussions; this time is reserved so that you can focus on the tasks you
must accomplish that day.
 Schedule meetings for the minimum amount of time needed. If you
schedule a call for an hour, then you're more likely to spend the hour talking.
Challenge yourself and others to see whether you can cover the same
material in 30 minutes, or even better, 15 minutes.
 Designate an "interrupt pair" for the squad each day. This pair is
responsible for handling questions from other squads, attending meetings,
and handling any emergencies that might arise. Then, the rest of the pairs
on the squad are free to focus on their tasks.
 Seek support from your management. Managers want their squads to
be as effective, efficient, and happy as possible. If you're struggling with
interruptions, or things that take you away from your tasks, lean on your
manager for help with blocking out the noise so you can have more time
without interruptions.
Transparency
Transparency in the workplace means operating in a way that makes it easy
for everyone in your organization to see what's going on. Transparency
creates trust across virtual squads and dissolves the boundaries between
them. Make sure that your squads operate transparently in these areas:
 Code: Everyone in your extended squad must have access to source code.
When you define the scope of your organization, consider secure coding practices.
 Backlog: Everyone in your extended squad must have access to functional
and nonfunctional requirements and their prioritization. By providing details on your
decision-making process, you can get the support that you need from the members
of your extended squad.
 Metrics: Your extended squads must have access to availability and metrics
data. Depending on what you're working on, you might also need to ensure that the
consumers of your services can access that data.
 Incident investigation and problem management: Document information
about incidents and lessons learned. Make that information available so that your
squad and others can benefit from your experience.
Establishing trust through transparency is key to successful teamwork.
Blameless root-cause analysis
Things go wrong. People make mistakes. You need an environment where
squad members can share lessons learned to prevent others from making
the same mistakes. To create that environment, address any disincentives,
such as fear of punishment or reprimands.
Peer recognition
The following content is based on "Peer recognition" by Donna Fortune and
Carlton Mason.
People love recognition; it's part of human nature. Recognition builds self-
confidence and is an intrinsic workplace motivator.
Different people prefer being recognized in different ways—some people are
not fond of public recognition while others are. However, most everyone
appreciates peer recognition, and it often has more impact than recognition
from leaders or managers. Whether it is for creating outstanding code, fixing
a complex bug, or completing an otherwise undesirable challenge, knowing
that your peers recognize and appreciate your work feels good.
Peer recognition is a symptom of a healthy squad. Healthy squads are
typically far more engaged, collaborative, innovative, and productive.
Establishing a culture of meritocracy, where employees are recognized for
their achievements, dissolves the formal hierarchy of an organization.
Meritocracies thrive on trust, transparency, and consistency. The silos and
political barriers that prevent the sharing of information, balancing
workloads, and responding to inevitable failures are broken down. Peer
recognition promotes teaming, where once it was "us vs. them." Talent and
leadership are recognized within the squad, and the people who are
recognized naturally assume greater responsibility within the squad and for
the organization's success.
This emergent leadership is prevalent in many open source communities
where the engineers who are actively engaged in contributing to and
improving the project can gain committer status. The value of your
contributions, the mastery of your craft, and your technical credibility,
prowess, and knowledge are more important than a title or positional
authority.
When peer recognition is part of the culture, DevOps organizations thrive
instead of stalling, even when faced with disruptions. Fostering peer
recognition is an excellent way to enhance a squad's culture.
Fun
When employees have fun in the workplace, they enjoy their work and
produce better results. Managers in DevOps environments strive to create an
atmosphere that is challenging, creative, and fun for employees and for
themselves. For more information about how to create a great work
environment, see Fun in the workplace.
Creating a social contract
Many squads use a social contract to document their decisions about how to
behave and interact. A social contract is a squad-designed agreement for an
aspirational set of values, behaviors, and social norms. Think of it as a vision
for working on an incredibly safe and powerful squad.
The social contract helps the squad identify dysfunctional behaviors and
address them quickly to mitigate long-term damage. Anyone on the squad can
and must enforce the contract by calling out deviations, because people
invariably forget agreements over time.
Follow these tips to create a successful social contract for your squad:
 Gather all squad members to create the contract. Use a facilitator to
ensure that all perspectives are heard.
 Make sure that the facilitator asks many questions to encourage
conversation: What do we value? What's important? What would make this
squad powerful? What can we count on from one another? Think about
negative experiences you had on projects and identify ways to avoid those
problems in the future.
 Allow the participants to voice their thoughts by using sticky notes,
either literally or virtually. Give participants 15 - 20 minutes to record
their thoughts.
 Group similar ideas into an affinity-type map.
 Prioritize the top 5 - 10 groups and agree on a group label. Those
labels become the elements of your social contract.
Squad organization
Each autonomous squad must fit into a larger organization. Spotify defined
common terminology that is used in the industry to describe squads and
combinations of squads.
Tribes
While autonomous squads must be able to do their work their way, they also
fit in a larger organization. Typically, delivering an enterprise-grade
application requires work from multiple squads, each of which are
responsible for a microservice. When those microservices are combined, the
entire product is created.
Spotify uses the word tribe to describe a set of squads, and people from
disciplines such as Marketing and Finance, that are aligned around the goal
of delivering a product or service.
Guilds
Autonomous squads consist of diverse squad members who have a wide
variety of skills. Sometimes, it's important for people who share a common
skill to discuss ideas and solve problems within their specialty. Guilds gather
people from multiple squads around a common discipline.
For example, each squad has one or two people who are familiar with the
continuous delivery tools that are used to build, deploy, and manage the
product that the squad is delivering. A Continuous Delivery guild brings
together people who do that job from each squad. The guild drives best
practices in continuous delivery and acts as a forum where people who are
struggling with a problem can find answers from fellow guild members.
Squad leadership
In a self-managing, cross-functional squad, everyone is a leader of some sort
at some point. What does it mean to serve your squad as a leader? How do
you know whether you're a good leader? In an autonomous squad of 10,
each person has plenty of leadership opportunities.
 Product ownership: Each squad must have one person who is defined
as the product owner. This person is responsible for understanding the product
that is being delivered. The product owner must ensure that work is
represented in the rank-ordered backlog and must set the priorities so that
the squad knows what it needs to deliver.
 Technical leadership: On a 10-person squad that is responsible for
delivering a product, each person has a unique set of skills that he or she uses
to reach the common goal. Leadership in this respect is not reporting status,
but being the best at a skill and using that skill to help the squad succeed.
 Coordination and status reporting: The goal of each autonomous squad
is to spend as little time on coordination and status-reporting as possible.
However, those tasks still must be done. Squads can strive to minimize the
effort on those tasks by using playbacks to convey status to management
and by using the rank-ordered backlog to surface plans for upcoming work.
Dynamic leadership
The following content is based on "Self-organized teams" by Adrian Cho.
The concept of self-organized squads might be new in the business
enterprise, but many examples can be found in other domains. Unlike the
permanent leadership and well-defined hierarchy of a symphony orchestra, a
group of jazz musicians might start with one member picking the tunes and
counting off the tempo, but then the role of leader moves freely throughout
the group, in real time and typically with no explicit communication. How do
they do this while avoiding situations where people fight for leadership, or
even worse, where no one leads?
In jazz, anyone can take the initiative to explore new possibilities that can
lead to moments of wonderful creativity. The risk of failure exists, but the
same dynamic leadership means that someone is always there to help
preserve the stability. The musicians are adept at practicing leadership on
demand because they are equally comfortable leading, following, and
switching fluidly between the two. This mindset requires a willingness to let
others lead.
Software development squads must be willing to work beyond simple static
organizational structures. In a modern dynamic organization, it is common
for virtual squads to form for one purpose and disband after they accomplish
that purpose. Guilds that exist across multiple squads bring together people
of like interest or expertise. People often work simultaneously in many
squads, in multiple guilds, and in matrix reporting structures.
Static structures are often ill-equipped to respond to the constant change,
chaos, and confusion of the new business world. Where squads are trying to
design, build, and operate many microservices instead of a single monolithic
application, they must similarly organize into decentralized, independent,
loosely coupled squads. Otherwise, as Conway's Law suggests, a monolithic
organization is constrained to create monolithic software.
This delicate balance between individual and group performance is the
difference between a group that works in synergy, performing as more than
the sum of its individuals, and one that is just a group of high-performance
individuals. People with the squad-first mindset understand that their
individual contribution is vital to the squad's success. They also know that
without the rest of the squad, they alone cannot achieve the same success.
In software development, certain indicators of stability must be prized above
all else. These indicators include the health of the current build based on the
main stream of code, the health of running services with zero downtime as
the target, and the health of each squad.
Developers must put these things first to ensure that even as they work
independently, explore possibilities to innovate, and push boundaries of
personal productivity, the squad and its most prized assets are never
compromised. These indicators of stability must be quantified, treated as
actionable metrics, and shared widely throughout the squad. In many cases,
the squad can use tools and services to monitor and report such metrics as
build failures, code complexity and quality, uptime, incidents in production,
and more.
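Treating these indicators as actionable metrics lends itself to simple automation. The following is an illustrative Python sketch, not part of the Method itself: the `Window` data model, the field names, and the 99.9% SLO target are all hypothetical, standing in for whatever a squad's monitoring tools actually collect.

```python
from dataclasses import dataclass

@dataclass
class Window:
    total_minutes: int      # length of the reporting window
    downtime_minutes: int   # minutes the service was unavailable
    builds: int             # total CI builds in the window
    failed_builds: int      # builds that broke the main stream of code

def stability_report(w: Window, slo: float = 99.9) -> dict:
    """Summarize a squad's indicators of stability against a target SLO."""
    uptime = 100.0 * (w.total_minutes - w.downtime_minutes) / w.total_minutes
    # The error budget is the downtime the SLO permits in this window.
    budget_minutes = w.total_minutes * (100.0 - slo) / 100.0
    return {
        "uptime_pct": round(uptime, 3),
        "error_budget_left_min": round(budget_minutes - w.downtime_minutes, 1),
        "build_failure_rate": round(w.failed_builds / max(w.builds, 1), 3),
        "slo_met": uptime >= slo,
    }

# A 30-day window: 43,200 minutes, 20 minutes down, 400 builds, 12 failures.
print(stability_report(Window(43_200, 20, 400, 12)))
```

Shared widely, a report like this gives the whole squad the same view of whether its most prized assets are healthy.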
Summary
Your most important goal in building a squad is to ensure that the squad can
collaborate without adding a layer of bureaucracy—a development that
would defeat the purpose of adopting a new culture.
Align development and operations for success
DevOps, which is short for development and operations, is only a buzzword
to many people. Everyone talks about it, but not everyone knows what it is.
DevOps, along with Enterprise Design Thinking, agile methodology, and lean
methodology, is part of the Garage Method for Cloud. The Method broadens
DevOps and defines the practices, architectures, and toolchains to
continuously deliver apps to the cloud.
But, what is DevOps and what part does it play in the Method? DevOps is an
approach that is based on lean and agile principles in which business owners
and the development, operations, and quality assurance team members
collaborate to deliver software continuously. DevOps enables the business to
more quickly seize market opportunities and reduce the time to include
customer feedback.
Some people say that DevOps is for practitioners only; others say that it
revolves around the cloud. IBM takes a broad and holistic view and sees
DevOps as a business-driven approach to software delivery. In that
approach, a team takes a new or enhanced business capability from an idea
all the way to production, providing business value to customers and
capturing feedback as customers engage with the capability.
One goal of DevOps in the Method is to enable organizations to react and
make changes faster. In software delivery, this goal requires an organization
to quickly get feedback and learn from every action it takes. This principle
calls for organizations to create communication channels that all
stakeholders can access and use to act on feedback:
 Development acts by adjusting its project plans or priorities.
 Production acts by enhancing the production environments.
 The business acts by continuously modifying its release plans,
including frequency and functions.
All this activity works toward the goal of delivering continuously.
Adopting the Garage Method for Cloud to enable DevOps
Adopting the Method approach to DevOps requires a plan that spans people,
process, and technology. You can't succeed without all three aspects,
especially in an enterprise that has multiple, potentially distributed
stakeholders. Although the word DevOps suggests development-and-
operations-based capabilities, DevOps spans all stakeholders, including
customers and sponsor users, business owners, architecture, design,
development, quality assurance (QA), operations, security, partners, and
suppliers. Excluding any stakeholder—internal or external—leads to an
incomplete implementation of the Method.
Transforming a culture to a new way of doing things is complicated, but it is
worth the journey when you consider the rewards. Several steps are
involved:
 Identifying business objectives
 Finding bottlenecks in the delivery process
 Organizing to build a DevOps culture
Identifying business objectives
The first task in creating a culture is getting everyone moving in the same
direction and working toward the same goal, which means identifying common
business objectives for the team and the organization. This task is where
DevOps can be combined with Enterprise Design Thinking. Enterprise Design
Thinking identifies what the customer wants.
Be sure to encourage the entire team based on business outcomes instead of
creating conflicting individual team incentives. When people know what their
common goal is and how their progress toward that goal is measured, fewer
challenges exist from teams or practitioners that have their own priorities.
DevOps isn't the end goal. It helps you reach your goals of delivering what
your customers want, quickly and with high quality.
Practicing lean development
As you code, follow lean development practices. The term lean is rooted in
the lean manufacturing process that, in its simplest form, passed products
down an assembly line and incrementally built the whole product.
Over time, the software industry combined the principles of the lean and
agile methods to efficiently and incrementally deliver products that delight
customers:
 Reduce waste everywhere. Develop only what the customers want; no
more, no less.
 Strive for quality. Ensure that code is reviewed and tested before it is
integrated into the product. The goal is automation of testing and zero
defects after code is delivered.
 Learn as you go. Development is done incrementally. At every
increment or iteration, the team needs to examine what was done,
determine how it can be improved in the next iteration, and, most
importantly, commit to making the improvements.
 Defer commitment. When software releases were annual events, the
practice of delivering a product release specification was a fundamental
release deliverable. With the lean methodology, the goal is to design as you
go, committing to design and function only when they become the highest
priority.
 Deliver quickly. Each agile iteration should be deliverable. When you
plan the iterations, make sure that they contain only what the team can
commit to delivering. Overcommitting the team is a recipe for failure.
 Respect the team. A motivated team is a force to be reckoned with.
 Keep an eye on the end goal. Someone needs to oversee the entire
system to ensure that it all comes together.
The implementation of these practices requires a change in culture, team
mentality, and management style.
Finding bottlenecks in the delivery process
In the delivery process, you might be challenged with any or all of these
inefficiencies:
 Deploying an infrastructure that is necessary to deliver to production
 Unnecessary overhead, such as repeatedly communicating the same
information
 Unnecessary rework, such as uncovering defects in testing or
production and forcing assignments back to the development team
 Over-production, such as developing functions that weren't required
Those inefficiencies are addressed in the Method implementation of DevOps,
which increases the velocity of application delivery and puts pressure on the
infrastructure to respond more quickly. By working in software-defined
environments, you can capture infrastructure as a programmable, repeatable
pattern, accelerating deployments.
The introduction of an automated delivery pipeline, such as the one in the
IBM Cloud Continuous Delivery service, enables a repeatable process that
automates the build, testing, and continuous delivery of an application to
production.
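The fail-fast behavior of such a pipeline can be sketched in a few lines of Python. This is only an illustration of the idea, not the IBM Cloud Continuous Delivery service itself: the stage names are typical, and the commands are placeholder `echo` calls where a real pipeline would invoke its build, test, and deployment tooling.

```python
import subprocess
import sys

# Hypothetical stages; a real pipeline declares these in its own
# configuration format and runs real build/test/deploy commands.
STAGES = [
    ("build",  ["echo", "compiling application"]),
    ("test",   ["echo", "running automated tests"]),
    ("deploy", ["echo", "pushing to production"]),
]

def run_pipeline(stages) -> bool:
    """Run each stage in order, stopping at the first failure so that a
    broken build never reaches the deploy stage."""
    for name, cmd in stages:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"stage '{name}' failed: {result.stderr}", file=sys.stderr)
            return False
        print(f"stage '{name}' ok: {result.stdout.strip()}")
    return True

run_pipeline(STAGES)
```

The design point is the early return: every commit travels the same repeatable path, and a failure in any stage blocks everything downstream.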
To keep the team focused on the work that is important to the customer,
work from a customer-vetted, ranked backlog that is built based on ongoing
customer feedback and usage metrics.
Organizing to build a DevOps culture
All of the best practices in the world can't benefit an organization unless it is
willing and able to adopt the DevOps culture. How development and
operations are positioned in the organization is a critical key to success.
Over the past few years in software development, the pendulum swung widely
between development holistically owning operations and development
contracting all operations out to another group. As the pendulum settles
toward an equilibrium point, a few observations are clear.
The DevOps movement is popular because of a widespread realization:
having two teams that are measured on different metrics does not work well.
Operations is measured on uptime. Development is measured on speed of
delivery. Those two metrics are diametrically opposed. When you are
measured on uptime, you typically view change as bad because changes
often break things and cause outages. However, in development, you are
measured by how quickly you can induce change. You can't have one group
saying "change is bad" and the other saying "change is good;" that's
dysfunctional. You also can't fix the problem by making the Director of
Operations report through the same management chain as development
directors.
Including operations in your development team
Enterprise solution owners who are venturing to the cloud tend to outsource
all operations to another group because their teams can't deploy, maintain,
and monitor a live solution. For years, those responsibilities belonged to
the customers; shifting them to the development team was unthinkable because
the skills didn't exist there.
However, with outsourcing comes the risk of the previously mentioned
conflicting interests. Traditionally, operations teams are talent pools who are
measured on uptime and stability. Those metrics are the only metrics that
are common across the pool. Also, because they are talent pools, they might
not be dedicated or consistent resources, especially when you need them
the most. This model does not work well in a DevOps culture.
Providing operational leadership to development
At a minimum, an enterprise development team that is moving to the cloud
needs to build an operational leadership team. That operational leadership
team must include the director of operations (an executive with decision-
making authority), project and delivery management, and platform
architecture and design. If you don't have one of those skills, find it. If you
outsource the leadership to another group, you lose both control and
business-goal alignment.
Start by assigning an existing development director as the director of
operations, and fortify that person with knowledgeable architecture and
delivery project management leadership. Then, you have an executive who
understands how to develop and deliver products and who will naturally align
that knowledge with operational goals.
Giving ownership of the whole component to the development director
The development director owns the complete end-to-end health of his or her
component. "Complete" means not only design, development, build, and
test, but also deployment, management, monitoring, notifications, and
support. If something fails, it is the responsibility of the development director
who owns that component to fix it.
After you are organized to build the DevOps culture, you can focus on
building DevOps teams.
Summary
The Garage Method for Cloud is the natural evolution of DevOps. It broadens
the scope to include practices, architectures, and toolchains to build
applications, deliver continuously, and manage those applications within a
DevOps culture.
Roles in a DevOps organization
A common misconception about agile development is that everyone can do everything. In fact,
the opposite is true. To run a disciplined agile process, well-defined roles and responsibilities are
essential. 

Although the names of these roles might differ from what you call them, you'll likely find that
their descriptions match roles that you're familiar with. A key aspect of being agile is that people
are empowered to make decisions without extensive meetings and building consensus. This
aspect can be a significant challenge in some enterprise organizations. Project cadence with
playbacks to all interested parties is key to making the whole organization more comfortable
with decisions.

Core roles in an agile squad

In the most basic sense, a team is a group of people who work together to accomplish a common
goal. For a DevOps team, the goal is usually the delivery of a product or a microservice that is
part of a product. To achieve success, make sure that the key roles are filled.
Product manager (product owner)

The role of the product manager is critical. The product manager ensures that the team creates an
engaging product and delivers business value by meeting the needs of the markets. Product
managers must understand company strategy, market needs, the competitive landscape, and
which customer needs are unfulfilled. 

The product manager has these responsibilities:

 Owns the scope of the project, works with sponsor users, identifies personas, defines the
minimum viable product (MVP), and defines hypotheses
 Collaborates with designers on the overall user experience, attends playbacks with
sponsor users and stakeholders, and collaborates through user testing
 Defines, writes, and ranks the user stories that direct the design and development work
 Accepts user stories; that is, decides when a user story has delivered the MVP
 Decides when to go to production and owns or collaborates on the "Go To Market" plan

Larger projects might have more than one product manager. Ideally, the project is divided into
components or services that individual squads can work on, and each squad has a product
manager who provides direction.

The product managers must understand and coordinate requirements from other components of
an overall solution. Ultimately, one voice must direct the design and development work of a
squad.

Sponsor

The sponsor is typically an executive who has the vision and owns the overall delivery and
success of the project. Ideally, the squad has playbacks and brings any issues to the sponsor each
week. Part of the sponsor's job is to ensure that the squad has everything it needs to succeed and
to support it in its use of agile methods. 

User experience (UX) design lead

The UX design lead is responsible for all aspects of the project's user experience. The goal of
every project is to create experiences that delight users. Reaching that goal requires a strong and
experienced design leader who can oversee all aspects of the design, from the initial workshop
through visual design and user testing.

The UX design lead has these responsibilities:

 Leads design thinking practices, such as persona definition, empathy mapping, scenarios
mapping, ideation, and MVP definition.
 Creates great user experience concepts and produces wireframe sketches to communicate
the experience with the broader team.
 Drives collaboration with the extended design team, development team, and product
management team to maximize innovation.
 Collaborates with visual designers to deliver high-fidelity designs.
 Ensures a consistent user experience across all facets of an offering, working with UX
leaders in other squads when necessary.
 Plans and runs user testing to ensure that real-world feedback is injected in all phases of
the project. In some projects, a dedicated user researcher assumes this responsibility.

Larger projects might have a UX designer or dedicated user researcher who works with the UX
design lead. In those cases, the UX design lead is responsible for all aspects of the experience,
but the UX designer might create a few of the design artifacts.

Visual designer

The visual designer converts the user experience concepts into detailed designs that emotionally
connect with users.

The visual designer has these responsibilities:

 Uses his or her understanding of color, fonts, and visual hierarchy to convert conceptual
designs into detailed designs that developers can build
 Creates all the necessary visual artifacts, including images, illustrations, logos, and icons
 Ensures a consistent application of corporate styles and branding, or where appropriate,
suggests deviations from styles and branding
Sponsor user

A product that doesn't satisfy a real user need is a failed design. The best way to ensure that your
project meets users' needs is to involve users in the process. 

A sponsor user is a real user—someone from outside your organization—who will use the
product that is being built. A sponsor user must be carefully selected so that he or she accurately
represents the needs of the widest possible user base. In some cases, you might want more than
one sponsor user. 

The sponsor user has these responsibilities:

 Represents the needs of the user throughout the process
 Participates in all phases of the project, from the initial design workshop through design,
development, and user testing
 Ensures that all decisions reflect the needs of the user

In some cases, the sponsor user might be the person who demonstrates the design or prototypes
to other stakeholders. What better way to show the value of a design than to have a real user
show it and explain why it matters?
Developer

Developers build code by using core agile practices such as "keep it simple," test-driven
development (TDD), continuous integration, polyglot programming, and microservice design.
Agile development requires more collaboration than is required in the waterfall model.

As part of a modern, agile DevOps team, developers must be adept at rapidly learning and using
new technologies. In a DevOps approach, the developer role is merged with quality assurance
(QA), release management, and operations.

A developer has these responsibilities:

 Collaborates with the designers and product managers to ensure that the code that is
developed meets their vision
 Designs the solution to meet functional and nonfunctional requirements
 Writes automated tests, ideally before writing code
 Writes code
 Develops delivery pipelines and automated deployment scripts
 Configures services, such as databases and monitoring
 Fixes problems from the development phase through the production phase, which
requires being on call for production support

In order for developers to have such broad roles, they rely on cloud technologies to free them
from many—and sometimes even all—infrastructure tasks. In an ideal DevOps organization, the
developers take on operations fully and are on "pager duty" for production problems. In many
companies, including virtually all enterprises, an operations team provides the first response to
production issues. However, you can still have a DevOps culture if the second call is to a
developer to fix the problem. For more information about the operations roles that developers
take on, see Cloud service management and operations roles.

Teams might need developers with specialized skills, such as analytics or mobile. In those cases,
use pair programming to spread the skills throughout the team.

Anchor developer (technical team lead)

An anchor developer is an experienced developer who provides leadership on architecture and
design choices, such as which UI framework to use on a project. Even with the use of agile
tracking tools, sponsors and stakeholders often want a direct report on progress, issues, and key
technical decisions. The anchor developer is that technical focal point who also does
development work.

Agile coach

The agile coach leads the organization in agile methods. The coach must have the knowledge and
experience to recommend changes to various practices in response to unique circumstances
within the organization. The coach identifies problems and misapplications of the agile
principles and suggests corrections and continuous improvements.

One approach that works well is for the agile coach to be part of the squad that has delivery
responsibilities. In that way, the coach can provide guidance about the method and practices as
part of the day-to-day work. The agile coach usually has experience with agile projects and is
passionate about the process and practices. Alternatively, an agile coach can work across several
squads and mentor them on the process and practices. 

Depending on your squad structure and project size, the day-to-day responsibilities of facilitating
daily standup meetings, planning, and playbacks might be the responsibility of either the agile
coach or the anchor developer.

Cloud service management and operations roles

When you move to the cloud, the resulting culture change requires modifications to the structure
and roles of your project teams. Some DevOps team members can play more than one role, and
groups might be merged to create a cohesive, diverse squad. As you form the Ops side of your
squad, consider the addition of several new roles.

Service management introduces processes that teams must implement to manage the operational
aspects of their applications. This diagram illustrates the processes and the roles that are needed
to implement them:
Incident management roles

Incident management restores service as quickly as possible by using a first-responder team that
is equipped with automation and well-defined runbooks. The incident management team
members define the incident process and build a tool chain that implements ChatOps across the
organization. Incident management roles include the first responder, incident commander,
subject matter expert, and site reliability engineer.

First responder

The first responder solves problems by using runbooks and working with subject matter experts
(SMEs). This role has these responsibilities:

 Receives alerts through collaboration tools
 Researches to determine the nature of the problem
 Evaluates and adjusts the urgency and priority of the problem, if needed
 Contacts and communicates with the incident commander when a major incident occurs
 Reviews known issues to determine whether the problem is a known issue
 Tries to resolve the issue by using the prescribed runbooks, collaborating with SMEs, or
both
 Gains concurrence from the customer when the incident is resolved
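Runbooks lend themselves to automation. The following sketch (the step names, `RunbookStep` structure, and outcomes are illustrative assumptions, not from any specific tool) shows one way a first responder's prescribed runbook could be encoded so that actions run in order until the incident is resolved or must be escalated to an SME or the incident commander:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RunbookStep:
    """One prescribed action, paired with a check that reports success."""
    description: str
    action: Callable[[], bool]  # returns True if this step resolved the issue

def execute_runbook(steps: List[RunbookStep]) -> str:
    """Run the prescribed steps in order; stop at the first one that resolves the incident."""
    for step in steps:
        print(f"Running: {step.description}")
        if step.action():
            return "resolved"   # then gain concurrence from the customer
    return "escalate"           # hand off to an SME or the incident commander

# Hypothetical runbook for a "service not responding" alert.
runbook = [
    RunbookStep("Check the known issues list", lambda: False),
    RunbookStep("Restart the application process", lambda: True),
]
print(execute_runbook(runbook))  # resolved
```

In practice, the actions would call your automation tooling, and the escalation path would notify the incident commander through your ChatOps channel.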

Incident commander

The incident commander manages the investigation, communication, and resolution of major
incidents. This role has these responsibilities:

 Receives incident information and collaborates with SMEs to restore services as fast as
possible
 Updates key stakeholders with status and expected resolution times
 Seeks senior leadership support and endorsement, if needed
 Interfaces and works with vendors to isolate problems and drive resolution

Keeping stakeholders up to date with the status of an incident is a key responsibility of the
incident commander.

Subject matter expert (SME)

SMEs apply the deep technical skills that are needed to resolve application issues. Their skills
support either specialized application expertise or a specific technical field or domain, such as
database administration. An SME has these responsibilities:

 Investigates a problem by using monitoring tools to get more details
 Inspects logs
 Tests and verifies issues
 Recommends fixes if runbook instructions are missing for the first responder, or fixes the problem directly
 Proposes changes if they're needed and requests change management
 Implements changes
 Provides data for the site reliability engineer (SRE) review

SMEs use their skills to resolve problems as quickly as possible.

Site reliability engineer (SRE)

An SRE takes operational responsibility for supporting applications and services that run on a
global scale in the cloud by using a highly automated service management approach. The SRE
pays particular attention to removing toil, which is repetitive manual labor that doesn't add real
value to a project.

An SRE spends approximately 50% of his or her working time on engineering project
improvements. Fundamentally, the role is a combination of software engineering and operations
design.

Problem management roles

Problem management aims to resolve the root cause of an incident to minimize its adverse
impact and prevent recurrence. The problem owner and problem analyst ensure that problems are
fixed and not repeated.

Problem owner

The problem owner oversees the handling of a problem and is responsible for bringing it to
closure. As needed, this role enlists the help of analysts and specialists. The problem owner is
essentially the same as the traditional IT problem owner role. However, the tools that the role
uses to identify and solve a problem and ultimately provide a root cause analysis are different.
Typically, the problem owner has a strong personal interest in the resolution.

Problem analyst

The problem analyst discovers incident trends, identifies problems, and determines the root cause
of a problem. The analyst is an SME in one or more areas. This role is radically different from the
traditional IT role because business analytics, runbooks, and cognitive techniques play a major
part. However, human supervision and creative thinking are still important in this role. Typically,
an SRE takes this role.

Change management roles

The purpose of change management is to successfully introduce changes to an IT system or
environment. The roles that are associated with change management are change owner, change
manager, and change implementer.
Change owner

The change owner is the person who requests the change. The change owner has these
responsibilities:

 Raises the change within the change management tool
 Creates the business case that justifies the change
 Sets the priority
 Determines the urgency

Input from the change owner is used to rank the change against all the other work that the
DevOps squad does.

Change manager

The change manager completes the preliminary assessment of a change record to ensure that it
contains the correct information. That information includes an accurate description, correct
classification, and correct change type (standard, normal, or emergency).

Before the change is implemented, the change manager verifies that all the authorizations are
obtained. After implementation, this role reviews changes to ensure accuracy and quality.

Change implementer

This role implements changes. Typically, the change implementer is an SRE or associated SME.

Other possible roles

Your project might need additional roles such as architect, project manager, and user researcher.

Architect

In projects that have experienced developers and a strong anchor developer, those roles provide
"just-in-time" architecture. The use of "just-in-time" architecture is typical
for greenfield projects, which are not constrained by a preexisting architecture. However, if
enough complexity exists, you need an architect who is separate from the developers so that
development work isn't impacted. You might need an architect if a project is integrating with
existing systems by using a wide range of services, security, or movement of data, or if you're
coordinating with many other technical teams.

The architect works closely with the anchor developers across squads. This role also works with
other architects to ensure architectural consistency across an offering portfolio. The architect
creates only the architectural designs, diagrams, and documents that are actively used by the
squads to guide their development. As in all other roles, the focus must be on communication
and collaboration that is effective for the squads.
Project manager

The squads' day-to-day work is tracked through user stories and tracking software, but
dependencies often exist on groups outside the squad.

Project managers do a wide variety of tasks:

 Procure software
 Coordinate dependencies on systems that expose APIs
 Report summaries beyond the tracking software
 Manage issues
 Coordinate integration with third parties

If the dependencies are minimal, the product manager, anchor developer, or both might handle
the project management tasks. However, if those roles spend too much time on project
management tasks, a project manager is likely needed. Ideally, the project manager has
experience with agile teams and understands agile process and tracking.

User researcher

The user researcher validates all the aspects of a design with real users. Sometimes the UX
leader fulfills this role. However, where more extensive user research is part of the project,
include a user research expert as part of the team.

The user researcher has these responsibilities:

 Completes the initial research of users and their world to build personas, empathy maps,
and scenario maps
 Validates the problem statements and MVP with real users
 Plans and conducts usability tests throughout the project to get real-world feedback on all
ideas

Automated separation of duties

After you define your organizational roles, set up automated separation of duties (SoD) to
enforce them. SoD ensures that no single person can introduce fraudulent or malicious code or
data without detection. To set up SoD, follow these tips:

 Define SoD-related roles and access rights
 Use tools like IBM UrbanCode® Deploy, which provides role-based security and logging
 Use tools to automatically track not only what the change was but also when and who
made the change
 Clearly document who can and cannot have access to production, and use access control
to enforce your policy
 Add roles to the automated tool
 Use automated scripts to supplement tools; for example, generate a separation of duties
matrix
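As the last tip suggests, a script can generate a separation-of-duties view from role definitions. This minimal sketch (the roles, duties, and conflict pairs are illustrative assumptions, not from any specific tool) flags roles that combine duties that SoD requires to be kept separate:

```python
# Illustrative roles, duties, and conflict pairs -- not from any specific tool.
ROLE_DUTIES = {
    "developer": {"write_code", "run_tests"},
    "change_manager": {"approve_change"},
    "release_engineer": {"deploy_to_production"},
    "dev_ops_hybrid": {"write_code", "deploy_to_production"},
}
# Duty pairs that no single person may hold together.
CONFLICTS = [
    ("write_code", "approve_change"),
    ("write_code", "deploy_to_production"),
]

def sod_violations(role_duties, conflicts):
    """Return (role, duty_a, duty_b) tuples where a role combines conflicting duties."""
    violations = []
    for role, duties in role_duties.items():
        for a, b in conflicts:
            if a in duties and b in duties:
                violations.append((role, a, b))
    return violations

print(sod_violations(ROLE_DUTIES, CONFLICTS))
# [('dev_ops_hybrid', 'write_code', 'deploy_to_production')]
```

A real implementation would pull role and access data from your access control system rather than hard-coded mappings.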

Build effective squads


To meet the growing need for high-quality customer experiences and rapid
business concept introduction, development organizations must change.
Adopting the Garage Method for Cloud can help an organization make the
transformation. The Method uses Enterprise Design Thinking, Lean Startup,
and agile DevOps concepts to enable continuous design, delivery, and
validation of new function. A good way to begin to change is to organize your
teams into colocated, autonomous squads before you start to build code.
Many types of squads exist. In a large-scale cloud development project, you
can organize into many separate squads or create squads with combined
responsibilities.
A squad is a small independent team
A squad is a small, independent team made up of these roles:
 A squad leader, who acts as an anchor developer and agile coach for
the squad
 3 or 4 development pairs who practice pair programming
Each squad has an associated designer, an associated product owner, and
can have an associated application architect. These associated roles can
come from specialized support squads. An example of a specialized support
squad might be the content squad that handles overall design and UX
creation for your squads.
Although it's ideal to embed designers and UX content creators directly into
a squad, not every organization has enough of those skills to do so. You can
centralize that work, which is often related to the work of several squads
anyway, into one squad.
Squads that are responsible for developing application functions might be
called build squads.
A squad is responsible for chunks of function
Squads implement epics, which are groups of related user stories that
describe higher levels of system functions. A single squad can implement
one or more epics within a chunk, but an epic is the smallest element of
implementation responsibility for a squad. The user stories that define that
epic are added to a rank-ordered backlog of user stories. The organization
might choose to use Kanban or other tracking methods to manage and
maintain that backlog, but the backlog must be kept up-to-date constantly
and reprioritized daily.
One way to size user stories is to make them implementable by a single
development pair within a single day. Because the team’s implementation
speed can change over time, you might need to adjust the sizing. Stories can
be broken up or combined as necessary and added to the backlog.
Squads work by using best practices
A squad can benefit from implementing pair programming. When pairs write
all the code, it undergoes continuous code review, making it possible to
reduce or eliminate formal code reviews. Rotating programming pairs daily
spreads the knowledge of individual system elements across the entire
squad. The code is continually read, revisited, and revised as new user
stories are implemented. This practice has the added benefit of reducing the
dependence on any single person on the squad. Using pair rotation with test-
driven development makes it possible for any squad member to participate
in a pair with confidence.
Pair programming can be a key advantage of the squad model for large-scale
organizations. As a squad gains experience and maturity, you can divide it
into two squads. A senior squad member becomes the squad leader of the
new squad, which is made up of some original squad members plus some
newly trained developers. The new squad begins work on a new chunk. The
original squad also adds new developers to gain experience with the squad’s
code as they work on epics within the original chunk or in a related chunk. In
this way, a team can grow quickly.
This combination of practices, plus practices such as daily standup meetings,
speaks to the need for squads to be autonomous, to completely own an epic
or chunk from end to end, and to be colocated. It is difficult to separate pairs
across locations. While the overall project team can have squads in different
locations, the individual members of each squad should be colocated.
The role of testing in the squad model
In the squad model, the use of automated testing, test-driven development,
and pair programming means that you do not need a large, dedicated testing
staff embedded in the teams. The skills of the testing staff are needed, but
the people can take on different roles. Instead of acting only in a test role,
some testers become developers and others use their deep domain
knowledge as product owners.
Teams should follow the practice of test-driven development. Test-driven
development requires that you write the test before you write any code,
using the concept of tests as specifications. This practice ensures that any
member of a squad can understand the code. If developers can read a suite of
functional tests, they can understand how a particular code element is
implemented. The test suite that is developed through this practice must
encompass all of the major forms of testing that are required: functional
testing, user interface testing, and performance testing. Fully embracing
automated testing can dramatically improve the quality of code and can
markedly reduce the time spent running manual tests.
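A small sketch can illustrate the tests-as-specifications idea. The order-total function and its values are hypothetical; the point is that the test is written first and reads as the specification that the implementation must then satisfy:

```python
# Step 1: write the test first. It acts as the specification -- any squad
# member can read it to learn what compute_order_total must do.
def test_compute_order_total():
    assert compute_order_total([(2, 3.50), (1, 4.00)]) == 11.00       # sums line items
    assert compute_order_total([(1, 100.0)], discount=0.1) == 90.00   # applies a discount

# Step 2: write just enough code to make the test pass.
def compute_order_total(line_items, discount=0.0):
    """Sum (quantity, unit price) line items and apply an optional fractional discount."""
    subtotal = sum(qty * price for qty, price in line_items)
    return round(subtotal * (1 - discount), 2)

test_compute_order_total()
print("specification satisfied")
```

In day-to-day work, the test would live in the automated test suite so that every build verifies the specification.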
A need still exists for a smaller, more specialized testing squad to conduct
types of testing that require specialized skills, such as cross-device mobile
testing and end-to-end performance testing.
Summary
Adopting the squad model and adopting the principles of Garage Method for
Cloud can help your organization stay competitive in today's fast-paced
environment.

Build to manage
You've been there before: development throws its new code "over the wall"
and your operations team has to figure out how to deploy, monitor, and
manage it. Traditionally, your development team was measured on how fast
it updated features and released them into production. Your operations team
was measured on availability, which resulted in resistance to change. It's
easy to see that these goals are diametrically opposed. In the traditional
world, your team had time to build knowledge. Applications were
infrequently updated, and after they were deployed, the lifetime of the
application spanned years.
As you adopt practices that increase velocity and the speed of change,
operations can become a bottleneck, leading to long release times or
increased operational risk. To address this problem, create DevOps teams
with a broad set of skills and common goals. All the team members are
empowered to use their unique skills to drive the team towards overall
success. The knowledge of your skilled operations team members helps your
developers create more robust software.
Because continuous deployment is key in delivering cloud-based
applications, the Ops part of your DevOps team has much less time to build
and apply knowledge to prepare for each deployment. To address this
reality, you need a different approach to operational management: build to
manage.
As development and operations come closer
together, new practices arise to ease operations for
cloud-based applications.

Build to manage specifies a set of practices that developers can adopt to
instrument the application and provide manageability aspects as part of the
release. When you implement a build-to-manage approach, consider these
practices:
 Health check APIs
 Log format and catalog
 Deployment correlation
 Distributed tracing
 Topology information
 Event format and catalog
 Test cases and scripts
 Monitoring configuration
 Runbooks
 First Failure Data Capture
By adopting those practices, your organization achieves a more mature
operational level and faster velocity. Your DevOps team comes closer
together as it works toward the common goal of quickly releasing robust
functions that meet the required functional, availability, performance, and
security objectives.
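As an illustration of the first practice, a health check API typically aggregates probes of an application's dependencies into a single status payload that monitoring can poll. This is a minimal sketch; the dependency names and probe callables are assumptions for illustration:

```python
import json

def health_check(dependency_checks):
    """Aggregate dependency probes into one health payload for a health endpoint.

    dependency_checks maps a dependency name to a zero-argument callable
    that returns True when the dependency is reachable.
    """
    results = {name: ("UP" if check() else "DOWN")
               for name, check in dependency_checks.items()}
    overall = "UP" if all(v == "UP" for v in results.values()) else "DOWN"
    return json.dumps({"status": overall, "dependencies": results})

# Hypothetical probes for a database and a message queue.
print(health_check({"database": lambda: True, "queue": lambda: False}))
# {"status": "DOWN", "dependencies": {"database": "UP", "queue": "DOWN"}}
```

In a real service, this payload would be served over HTTP so that monitoring and load balancers can route around unhealthy instances.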

Automate continuous integration


The effort required to integrate a system increases exponentially with time.
By integrating the system more frequently, integration issues are identified
earlier, when they are easier to fix, and the overall integration effort is
reduced. The result is a higher quality product and more predictable delivery
schedules.
Activities
Continuous integration (CI) is implemented through the following activities:
 Changes are delivered and accepted by team members throughout the
development day.
 Developers deliver their changes and perform personal builds and unit
tests before making the changes available to the team.
 Change sets from all developers are integrated in a team workspace,
and then built and unit tested frequently. This should happen at least daily,
but ideally it happens any time a new change set is available.
The first activity ensures that any technical debt from conflicting changes is
resolved as the changes occur. The second activity identifies integration
issues early so that they can be corrected while the change is still fresh in
the developer's mind. The third activity ensures that individual developer
changes that are introduced to the team have a minimum level of validation
through the build and unit testing, and that the changes are made to a
configuration that is known to be good and tested before the new code is
available.
The ultimate goal of CI is to integrate and test the system on every change
to minimize the time between injecting a defect and correcting it.
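The second activity, a personal build with unit tests before delivery, can be sketched as a simple gate. The stage commands here are placeholders for a team's real unit test runner and static checks:

```python
import subprocess
import sys

# Stages of a personal build. Each command is a placeholder for a team's
# real unit test runner, linter, or compiler.
STAGES = [
    ("unit tests", [sys.executable, "-c", "print('tests ok')"]),
    ("static checks", [sys.executable, "-c", "print('lint ok')"]),
]

def personal_build(stages=STAGES):
    """Run each stage in order; deliver the change set only if all stages pass."""
    for name, cmd in stages:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"Stage '{name}' failed; fix it while the change is fresh")
            return False   # don't introduce errors into the team stream
        print(f"Stage '{name}' passed")
    return True

if personal_build():
    print("Safe to deliver the change set")
```

A CI server runs the same stages against the integrated team workspace, so a change that passes the personal build rarely breaks the team build.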
Benefits of CI
CI provides the following benefits:
 Improved feedback. CI shows constant and demonstrable progress.
 Improved error detection. CI can help you detect and address errors
early, often minutes after they've been injected into the product. Effective CI
requires automated unit testing with appropriate code coverage.
 Improved collaboration. CI enables team members to work together
safely. They know that they can make a change to their code, integrate the
system, and determine quickly whether or not their change conflicts with
others.
 Improved system integration. By integrating continuously throughout
your project, you know that you can actually build the system, thereby
mitigating integration surprises at the end of the lifecycle.
 Reduced number of parallel changes that need to be merged and
tested.
 Reduced number of errors found during system testing. All conflicts are
resolved before making new change sets available, and the resolution is
done by the person who is in the best position to resolve them.
 Reduced technical risk. You always have an up-to-date system to test
against.
 Reduced management risk. By continuously integrating your system,
you know exactly how much functionality you have built to date, thereby
improving your ability to predict when and if you are actually going to be
able to deliver the necessary functionality.
Getting started with CI
If the team is new to CI, it is best to start small and then incrementally add
practices. For example, start with a simple daily integration build and
incrementally add tests and automated inspections, such as code coverage,
to the build process. As the team begins to adopt the practices, increase the
build frequency. The following practices provide guidance in adopting CI.
Developer practices
As part of a CI approach, developers follow these practices:
 Make changes available frequently. For CI to be effective, code
changes need to be small, complete, cohesive, and available for integration.
Keep change sets small so that they can be completed and tested in a
relatively short time span.
 Don't introduce errors. Test your changes by using a private build and
unit testing before making your changes available.
 Fix broken builds immediately. When a problem is identified, fix it as
soon as possible, while it is still fresh in your mind. If the problem cannot be
quickly resolved, back out the changes instead of completing them.
Integration practices
A build is more than a compilation. A build consists of compilation, testing,
inspection, and deployment.
 Provide feedback as quickly and as often as possible.
 Automate the build process so that it is fast and repeatable. In this
way, issues are identified and conveyed to the appropriate person for
resolution as quickly as possible.
 Test with build. Include automated tests with the build process and
provide results immediately to the team.
Automation
Consider taking these measures to increase automation:
 Commit all of your application assets to the code management (CM)
repository so they are controlled and available to the rest of the team. The
assets include source code, data definition language source, API definitions,
and test scripts.
 Integrate and automate build, deploy, testing, and promotion. Do this
for both developer tests and integration tests. Tests must be repeatable and
fast.
 Automate feedback from the process to the originator, whether this is
the entire team or a developer. Process and resolve feedback, avoiding
excess formality, as a part of the backlog process.
 Commit your build scripts to the CM repository so that they are
controlled and available to the rest of the team. Both for private builds and
integration builds, use automated builds. Builds must be repeatable and fast.
 Invest in a CI server. The goal of CI is to integrate, build, and test the
software in a clean environment any time that there is a change to the
implementation. Although a dedicated CI server is not essential, it greatly
reduces the overhead that is required to integrate continuously and provides
the required reporting.
Common pitfalls
As you start to implement CI, remember these potential issues:
 A build process that doesn't identify problems. A build is more than a
simple compilation or its dynamic-language equivalent. Sound testing and
inspection practices, both developer testing and integration testing, must be
adopted to ensure the right amount of coverage.
 Integration builds that take too long to complete. The build process
must balance coverage with speed. You don't have to run every system-level
acceptance test to meet the intent of CI. Staged builds provide a useful
means to organize testing and get the right balance between coverage and
speed.
 Build server measures ignored. Most build servers provide dashboards
to measure build results. These results can also be delivered directly to the
individual users. Review them to identify trends in applications, components,
and architecture that provide an opportunity for improvement.
 Change sets that are too large. Developers must develop the discipline
and skills to organize their work into small, cohesive change sets. This
practice simplifies testing, debugging, and reporting. It also ensures that
changes are made available frequently enough to meet the intention of CI.
 Committing defects to the code management repository. Ensure
that developers perform adequate testing before they make change
sets available.

Implement high availability for on-premises applications
If your on-premises application fails, you can face significant impacts to your
business continuity. To be successful, you must implement a business
continuity plan that includes a high availability (HA) and disaster recovery
(DR) solution. But how do you select the optimal HADR topology for your
solution?
High availability versus disaster recovery
The terms high availability and disaster recovery are often used
interchangeably. However, they are two distinct concepts:
 High availability (HA) describes the ability of an application to
withstand all planned and unplanned outages (an example of a planned
outage is a system upgrade) and to provide continuous processing for
business-critical applications.
 Disaster recovery (DR) involves a set of policies, tools, and
procedures for returning a system, an application, or an entire data center to
full operation after a catastrophic interruption. It includes procedures for
copying and storing an installed system's essential data in a secure location,
and for recovering that data to restore normalcy of operation.
High availability is about avoiding single points of failure and ensuring that
the application will continue to process requests. Disaster recovery is about
policies and procedures for restoring a system or application to its normal
operating condition after the system or application suffered a catastrophic
failure or loss of availability of the entire data center.
Developing your HADR solution
To guide the development of a high availability disaster recovery (HADR)
solution for your on-premises application, you should consider business
challenges, functional requirements, and architecture principles.
Business challenges of your HADR solution
Your HADR solution should address these challenges:
 For business continuity, the application, and the business processes it
supports, should remain available and accessible without any interruption,
despite man-made or natural disasters. It should serve its intended function
seamlessly.
 For continuous availability, a well-designed HA solution maintains an
optimal customer experience with quick system response time and real-time
execution of transactions.
 The architecture must be capable of processing the additional
workload that results from a spike in business transactions and must mitigate
the risk of lost revenue opportunities.
 For operational flexibility, you should have a well-designed HA topology
with replication of code and data in a secondary site that is separated by
sufficient geographical distance. The application can be reconstituted and/or
activated at another location, processing the work after an unexpected
catastrophic failure at the primary site.
Functional requirements for your HADR solution
You should consider the following functional requirements for your HADR
solution:
 Minimize interruptions to the normal operations of the application. If
any application component has an availability issue, ensure the smooth and
rapid restoration of the application component back to normal operation.
 Restoration of the service of any application component must be
completely automated or must require only a single human action.
 Monitor availability of each application component of the application.
Alert in case of service level issues, such as slow response time or no
response from any application component. Activate rapid restoration by
automation or by a single action performed by a human specialist
responsible for the high availability of the application.
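These functional requirements can be sketched as a small check for a single component. The probe, threshold, and restoration action are illustrative assumptions; a real solution would use your monitoring and automation tooling:

```python
def monitor_component(probe, restore, slow_threshold_seconds=2.0):
    """Check one application component and react per the requirements above.

    probe() returns the component's response time in seconds, or raises
    an exception when there is no response. restore() is the single
    restoration action. Returns the outcome: "ok", "alert-slow", or "restored".
    """
    try:
        response_time = probe()
    except Exception:
        restore()                      # no response: activate rapid restoration
        return "restored"
    if response_time > slow_threshold_seconds:
        return "alert-slow"            # service level issue: alert the specialist
    return "ok"

# Hypothetical probes for illustration.
print(monitor_component(lambda: 0.2, lambda: None))   # ok
print(monitor_component(lambda: 5.0, lambda: None))   # alert-slow

def unreachable_probe():
    raise ConnectionError("component unreachable")

print(monitor_component(unreachable_probe, lambda: None))  # restored
```

In production, such checks run continuously for every component, and the alert path feeds the specialist or automation that owns the restoration.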
Architecture principles affecting your HADR solution
The events that can cause a software application to fail to process user or
other system requests can be divided into three categories. Each requires a
different technique to mitigate.
 Events that involve the unexpected failure of only one component of
the system, such as operating system process, a physical machine, or the
network link connecting members of the system.
 Events that involve simultaneous unexpected failure across many
components of the system. These events might be triggered by a natural
disaster, human error, or a combination of both.
 Events caused by human error that involve logical corruption of a
primary datastore by persisting incorrect or incoherent content into it.
Example: an on-premises B2B order application
In this example, an organization develops an on-premises application to
process online orders from its B2B customers. The B2B order application
uses multiple components providing specific services, such as user interface,
product catalog, order creation, workflow, decision and integration services,
and analytics. An ERP application stores the product data, such as price and
inventory, and orders. The catalog application manages the unstructured
data related to the product, such as images. The online order application
uses a NoSQL database for storing its product catalog and traditional RDBMS
for its analytics and ERP back-end application.
This is a high-level illustration of the components involved in the on-premises
order application.
Data center topology options for HADR
To achieve high availability, you can select from two deployment topology
options: a two data centers architecture and a three data centers
architecture. You would set up your B2B order application in the same highly
available cluster configuration in both the primary and the secondary data
centers.
Two data centers topology
You can configure a two data centers topology in either active-standby
mode or active-active mode. The simplest configuration is the active-
standby topology, where the B2B order application in the secondary data
center is in cold standby mode. In the active-active topology, the application
and the services it uses are active in both data centers.
Three data centers topology
The configuration for three data centers has two variants, active-active-
active and active-active-standby. In the active-active-standby
configuration, the application and services are in active mode in the primary
and secondary data centers, while the application is in standby mode in the
third data center.
Disaster recovery scenarios
When a disaster strikes, the topology and configuration choices you made
will determine how your application recovers. You need to understand the
costs and benefits associated with each to determine the optimal one for
your needs.
Disaster recovery with a two data centers topology
Active-active or active-standby are two possible configurations for this
scenario. In both cases, you must have continuous replication of data
between the two data centers.
Active-active configuration
This configuration provides higher availability with less human
involvement than the active-standby configuration. Requests are served
from both data centers. You should configure the edge services (load
balancer) with appropriate timeout and retry logic to automatically route the
request to the second data center if a failure occurs in the first data center
environment.
The benefits of this configuration are a reduced recovery time objective
(RTO) and recovery point objective (RPO). To meet the RPO requirement, data
synchronization between the two active data centers must be near real time
to allow seamless request flow.
Active-standby configuration
Requests are served from the active site. In the event of an outage or
application failure, preparatory work is performed to make the standby
data center ready to serve requests. Switching from the active to the
standby data center is a time-consuming operation. Both recovery time
objective (RTO) and recovery point objective (RPO) are higher compared to
the active-active configuration.
The standby data center can be either a hot or cold standby environment. In
the hot standby option, the order application and associated services are
deployed to both data centers, but the load balancer directs traffic only to
the application in the active data center. The benefit of this configuration is
that the hot standby data center is ready to be activated when the active
data center experiences a disaster. The DR procedure only requires
reconfiguring the load balancer to redirect the traffic to the newly activated
data center. The drawback of hot standby is that the second data center is
kept active and the application is kept up-to-date even though it is not used
to process customer requests. Software licenses apply to both data centers,
although only one is actively in use.
In the cold standby option, the order application and associated services are
deployed to both data centers, but are not started in the standby data
center. If the active data center experiences a disaster, the DR procedure
includes starting the application and services, and reconfiguring the load
balancer to redirect the traffic. This option is cost-effective in terms of the
software license cost and the data center operations cost, including the
personnel. However, the application availability might suffer, depending on
how quickly the cold standby data center and the order application can be
started and activated to process requests.
When the application in the primary data center is restored after the outage,
you can modify the edge service DNS to route user requests to the now
active application in the primary data center. The application in the
secondary data center can be switched back to standby mode.
Disaster recovery with a three data centers topology
In this era of Always On service with zero tolerance for downtime, customers
expect every business service to remain accessible around the clock
anywhere in the world. A cost-effective strategy for enterprises involves
architecting your infrastructure for continuous availability rather than
building disaster recovery infrastructures.
A three data centers topology provides greater resiliency and availability
than two data centers. It can offer better performance by spreading the load
more evenly across the data centers. A variant of this is to deploy two
application instances in one data center and the third instance in the second
data center, if the enterprise has only two data centers. Alternatively, you
can deploy the business logic and presentation layers in the 3-active topology
and deploy the data layer in the 2-active topology.
Two possible configurations are considered for this scenario, an active-
active-active (3-active) and active-active-standby configuration. In both
cases, a continuous replication of data is required between the data centers.
Active-active-active (3-active) configuration
Requests are served by the application running in any of the three active
data centers. A case study on the IBM.com website indicates that 3-active
requires only 50% of the compute, memory, and network capacity per
cluster, whereas 2-active requires 100% per cluster. The data layer is where
the cost difference stands out. For further details, read Always On: Assess,
Design, Implement, and Manage Continuous Availability.
Active-active-standby configuration
In this scenario, when either of the two active applications in the primary and
secondary data centers suffers an outage, the standby application in the
third data center is activated. The DR procedure described in the two data
centers scenario is followed for restoring normalcy to process customer
requests. The standby application in the third data center can be set up in
either a hot or a cold standby configuration.
Data replication across data centers
The procedure and technique for the continuous replication of data between
the databases in the three data centers should follow the standard,
established practices recommended by the vendors and the customer's
existing corporate IT standards and procedures.
Make use of database management tools, such as the IBM Db2® HADR
feature and Oracle Data Guard, to replicate the database contents to a
remote site.
 Replicate the SQL database using vendor-specific data-mirroring
technology to mirror analytics data from the primary site to the secondary
site.
 Replicate the NoSQL database so that the data is copied from the
primary data center site to the secondary data center site.
 Replicate the ERP database using vendor-specific data-mirroring
technology so that the order data is mirrored from the primary site to the
secondary site.
Selecting the optimal topology for your HADR solution
How you implement your HADR solution is an important architecture decision
that affects the continuous availability of the services provided by your on-
premises application. While an active-active-active configuration provides
the greatest resiliency, it is the most costly topology. An active-standby
configuration is the most cost-effective but can reduce application
availability. You should select the topology that best meets the needs of your
business continuity and operational flexibility.
See these solution architectures for the topologies discussed in this article:
 On-premises high availability disaster recovery: Active-active-standby
topology
 On-premises high availability disaster recovery: Active-active topology
 On-premises high availability disaster recovery: Active-standby
topology
Implement a high availability architecture
In today's global marketplace, websites are expected to be always available.
The Garage Method for Cloud website is no different; it has a service level
agreement (SLA) goal of being available 99.999% of the time. The website is
hosted on the IBM Cloud environment, and code is frequently being delivered
to production. As a result, opportunities for errors and downtime abound. In
this environment, it is critical to have a strategy to release new code into
production with zero downtime.
To meet the SLA goal, the development team took these actions:
 Implement a continuous delivery process by using IBM Cloud
Continuous Delivery.
o Implement a "Deploy to Test" stage.
o Implement blue-green deployment.
o Deploy the production website to multiple IBM Cloud data centers.
 Implement automated monitoring and outage notifications.
 Capture and maintain application log information to troubleshoot
outages.
 Write and maintain runbooks to troubleshoot operational issues.
 Surface SLA reports that clearly show daily, weekly, and monthly
outage data.
Implementing a continuous delivery process by using IBM Cloud
Continuous Delivery
To create and manage the build and deployment of the website, the team
adopted the Delivery Pipeline in IBM Cloud Continuous Delivery.
Implementing a "Deploy to Test" stage
To avoid disruptions that can occur when developers deploy directly to
production, the team created a delivery pipeline that includes a "Deploy to
Test" stage. The purpose of the stage is to isolate the team's developers
from the production website. In this stage, the team runs acceptance tests
by using Sauce Labs to validate that the website is ready to push to
production.
Implementing blue-green deployment
The Garage Method for Cloud website is continuously delivered—as often as
daily. To ensure that the transition to the upgraded version of the website
has zero downtime, the team implemented blue-green deployment. As new
function is pushed to production, it is deployed to an instance that isn't the
actual running instance. After the new application instance is validated, the
public URL is mapped to the new instance of the application.
Blue-green deployment involves these steps:
1. If the blue app exists, manually delete it before you restart.
2. Push a new version of the blue app.
3. Set environment variables for the blue app.
4. Create and bind services for the blue app.
5. Start the blue app.
6. Test the blue app.
7. Map traffic to the new version of the blue app by binding it to the
public host.
8. Delete the temporary route for the blue app that was used for testing.
9. Rename the green app to "green app backup." The backup application
continues to run so that active sessions are not terminated.
10. Rename the blue app to "green" app.
The team completes the blue-green deployment steps by using the built-in
Cloud Foundry command line interface.
Deploying the production website to multiple IBM Cloud data centers
The primary instance of the website runs in the IBM Cloud US South data
center. To ensure that the team can handle outages and maintain its SLA,
the team created failover sites in the IBM Cloud UK and Sydney data centers.
When the team set up the failover sites, the team had to create a space in
each data center and create a stage in the pipeline for each of the failover
sites.
After a new feature is placed into production in US South, it is then pushed to
the London and Sydney sites so that all the sites are consistent.
Requests to ibm.com/cloud/garage are routed by using Akamai. Akamai polls
the US South, UK, and Sydney health check URLs to determine whether they
are up and running by looking for an HTTP 200 response. If Akamai detects
that the application that is running on the US South data center is not
responding, requests are routed to the application that is running on the UK
data center.
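The failover decision that the global load balancer applies can be illustrated with a minimal sketch. The site names, their priority order, and the health representation are assumptions for illustration, not Akamai configuration:

```python
def pick_site(health):
    """Return the first healthy site in priority order, mimicking a
    global load balancer that polls health-check URLs for HTTP 200.
    `health` maps a site name to its last observed HTTP status code.
    Site names and priority order are hypothetical."""
    priority = ["us-south", "london", "sydney"]
    for site in priority:
        if health.get(site) == 200:
            return site
    return None  # total outage: no site is responding

# While US South returns 200, it serves traffic; when it stops
# responding with 200, requests shift to the next healthy site.
print(pick_site({"us-south": 200, "london": 200, "sydney": 200}))  # us-south
print(pick_site({"us-south": 503, "london": 200, "sydney": 200}))  # london
```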
Implementing automated monitoring and outage notifications
You can choose from among several tools to implement automated
monitoring and outage notifications: IBM Cloud Availability Monitoring, IBM
Alert Notification, PagerDuty, and New Relic. To measure outages on the
website, the team uses the New Relic IBM Cloud service. This service
continuously monitors the availability of the website by checking its
availability once per minute from nine locations around the world. If an
incident is detected, New Relic calls the PagerDuty service, which contacts
the operations staff responsible for fixing issues with the website.
PagerDuty notifies people in three ways:
 It sends a message to a Slack channel that the team uses to monitor outages.
 It pages a preconfigured list of operations and support personnel.
 After a preconfigured time of no acknowledgment, it escalates and notifies
operations management of the issue.
The IBM Cloud US South, London, and Sydney sites are also monitored
individually so that the operations team can be notified if any site is
unavailable or needs attention. Redundancy is strongest when all three sites
are available.
Writing and maintaining runbooks to resolve operational issues
Preventing downtime in applications is the best way to ensure high
availability. Unfortunately, failures can still occur. Runbooks are explicit
procedures for first responders to follow. When you have the appropriate
access and know exactly what to do, you can act quickly when failures
happen, and minimize the downtime that is associated with an application
instance. With tools like PagerDuty, teams can directly link a runbook to an
incident.
Capturing and maintaining application log information to
troubleshoot outages
When problems occur on the production website, whether due to
infrastructure outages or coding errors, the team that is tasked with getting
the website back online must have access to all of the critical error log files
and other troubleshooting information to identify and fix the problem. IBM
Cloud keeps only a limited amount of log information, which might not be
enough if you are troubleshooting an issue that occurred hours ago or if the
application produces so many log messages that the relevant messages are
overwritten.
To access older logs, teams need a log management service that supports
the 'syslog' protocol, such as SumoLogic or Splunk. Application logs should
be streamed to that service. For more information about how to register a
service with your application on IBM Cloud, see the IBM Cloud Docs.
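As a sketch, a Python application could stream its logs to a syslog-compatible service like this. The host, port, and app name are placeholders; substitute the endpoint issued by your log management provider:

```python
import logging
from logging.handlers import SysLogHandler

# Placeholder endpoint: replace with the host and port issued by your
# log management provider (for example, SumoLogic or Splunk).
handler = SysLogHandler(address=("localhost", 514))  # UDP by default
handler.setFormatter(
    logging.Formatter("myapp: %(levelname)s %(message)s"))

log = logging.getLogger("myapp")
log.setLevel(logging.INFO)
log.addHandler(handler)

# Each record is sent as a syslog datagram to the remote collector,
# so logs survive even when the local platform rotates its buffers.
log.info("order service started")
```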
Surfacing SLA reports that show daily, weekly, and monthly outage
data
The New Relic service provides reports that the team uses to determine
whether it is meeting the 99.999% availability goal. This information
provides critical insight into whether more actions are required to maintain
availability.
Over time, the website became much more reliable as the team added the
failover sites and site monitoring. In particular, implementing the failover
sites in the IBM Cloud environment was a critical part of meeting the
99.999% uptime goal. The website can now continue running while regularly
scheduled maintenance is done on the IBM Cloud infrastructure.
The Garage Method for Cloud website ecosystem
The following diagram shows the end-to-end ecosystem to develop, deliver,
run, and manage the website.
Continuous delivery
Continuous delivery (CD) is a practice by which you build and deploy your
software so that it can be released into production at any time. One of the
hallmarks of computer science is the shortening of various cycle times in the
development and operations process. In the early days of computers when
programs were entered in binary with switches and toggles, entering a
program was a time-consuming and error-prone effort. Later, editing was
instantaneous, but compile times measured in hours were common for large
systems. Within a few years, modern compilers and languages such as
Java™ and Ruby had made that a thing of the past, as your code was
compiled as quickly as you could save the source file.
Moving from continuous integration to continuous delivery
The move away from compilation waits merely shifted the focus of waiting.
For many years after that, it was normal for developers to code in isolation
on their own aspects of the system while an automated or semi-automated
build system ran each night to integrate all of that work. Developers lived in
fear of being "the one who broke the build," as a build failure couldn't be
resolved until the next night. Many teams had "trophies," such as cardboard
cutouts of obnoxious movie characters or funny hats, that were awarded to
the people who were responsible for build failure.
That approach began to change with the introduction of continuous
integration (CI) tools and practices. When you integrated your code more
frequently, the possibility of having a misunderstanding that might lead to a
build-breaking problem became less common. In addition, the consequences
of breaking a build with a faulty automated test became less severe. Again,
the focus shifted. Many teams have implemented CI tools but still do system
releases on a quarterly or twice-yearly basis. They still live with the pain of
multiple code branches: the code release that they sent to users 5 months
ago is still being patched as bugs are fixed, the new code base for the next
release is drifting farther away from it, and the possibility of missing new
bugs in the new release increases daily.
Frequent small releases: Releases become boring
Why is this true? Why do enterprises and commercial software companies
put themselves through the pain and anxiety of the "big bang" release?
Probably the biggest reason is inertia. Operations teams carefully defined
their operations environments and tweaked them in just the right way to
ensure that they are secure, perform well, and are reliable. But as a result,
they live in fear of change because change might wreck all of the work that
went into their carefully constructed environments.
In commercial software, Sales and Marketing teams are used to the twice-
per-year training seminars, which they plan their years around. In enterprise
development shops, the calendar revolves around pre-planned code freezes,
planned vacations to respect those code freezes, and the various audits and
checks that are the usual cause of the freezes. What if you turned all of that
calendar planning on its head? What if instead of two or four large, disruptive
releases, you had much more frequent and smaller releases? You would see
several advantages:
 If you change less with each release, the release can break fewer
things—that makes the release more predictable and probably easier to roll
back.
 If you release more frequently, you vastly reduce the time between
concept and rollout—in infrequent releases, the market forces that a feature
was designed to address often change by the time it is released.
 You save time, anxiety, and money by having fewer meetings to plan
the big-bang releases, less complexity to manage at the time of the releases,
and less time spent testing and verifying each release.
The benefits are huge: your team can be more productive, less stressed, and
more focused on feature delivery rather than dealing with big, unknown
potential changes. In fact, you can go so far as to say that when you do
releases often enough, they become predictable and even boring. However,
to take advantage of these benefits, you have to embrace a few principles of
CD.
Principles of continuous delivery
 Every change must be releasable: That goes almost without saying,
but it hides a deep set of practices that influence the way your development
and operations teams interact and join together. If every change is
releasable, it has to be entirely self-contained. That includes things like user
documentation, operations runbooks, and information about exactly what
changed and how for audits and traceability. No one gets to procrastinate.
 Code branches must be short-lived: A practice from CI that applies—
especially when you augment CI with CD—is the notion of short-lived code
branches. If you branch your code from the main trunk, that branch must live
for only a short period of time before it is merged back into the trunk in
preparation for the next release. If your releases are weekly or daily, the
amount of time that a developer or team can spend working in a branch is
limited greatly.
 Deliver through an automated pipeline: The real trick to achieving CD
is the use of an automated delivery pipeline. A well-constructed delivery
pipeline can ensure that all of your code releases are moved into your test
and production environments in a repeatable, orderly fashion.
 Automate almost everything: Just as the secret to CD is in assembling
a reliable delivery pipeline, the key to building a good delivery pipeline is to
automate nearly everything in your development process. Automate not only
builds and code deployments, but even the process of constructing new
development, test, and production environments. If you get to the point of
treating infrastructure as code, you can treat infrastructure changes as one
more type of code release that makes its way through the delivery pipeline.
 Aim for zero downtime: To ensure the availability of an application
during frequent updates, teams can implement blue-green deployments. In a
blue-green deployment, when a new function is pushed to production, it is
deployed to an instance that isn't the actual running instance. After the new
application instance is validated, the public URL is mapped to the new
instance of the application.
Automate tests for continuous delivery
In the context of continuous delivery, test automation is a requirement for
success. In order to have unattended automation from code commit to
production, squads need to deliver several levels of automated tests to
ensure the quality of what is being delivered, as well as to quickly
understand the state of the software.
Benefits of automated testing
An obvious benefit of automating testing, in contrast with manual testing, is
that testing can happen quickly, repeatably, and on demand. It becomes a
simple matter to verify that the software continues to run as it did before. In
addition, using the practices of test-driven development (TDD) and behavior-
driven development (BDD) to create test automation has been shown to
improve coding quality and design. In short, test automation has the
following advantages:
 Reduces time to delivery
 Ensures higher quality
 Supports continuous delivery
 Provides confidence in rapidly changing software
 Enables programmers to run automated tests to ensure their code
commits are stable
Getting started with test automation
In a DevOps continuous delivery environment, the first principle is that no
code is delivered without automated tests. But what automated tests,
exactly?
Determining where to invest in test automation requires a strategy. Consider
the test automation pyramid:
[Figure: the test automation pyramid]
Here, the largest numbers of tests are unit and API tests. Test-driven
development ensures that the squad creates unit tests and has a robust
framework that makes them easy to write, deliver, and run.
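As an illustration of the kind of test a squad writes first, TDD-style, consider this minimal example. The `order_total` function and its behavior are hypothetical, not from any real codebase:

```python
import unittest

def order_total(prices, discount=0.0):
    """Hypothetical function under test: sum prices, apply a discount."""
    return round(sum(prices) * (1 - discount), 2)

class OrderTotalTest(unittest.TestCase):
    # In TDD, these tests are written before the implementation and
    # drive its design.
    def test_no_discount(self):
        self.assertEqual(order_total([10.0, 5.0]), 15.0)

    def test_with_discount(self):
        self.assertEqual(order_total([10.0, 10.0], discount=0.25), 15.0)

# Run the suite programmatically, as a CI build stage would.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(OrderTotalTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```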
Adopting behavior-driven development creates a robust, maintainable test
automation framework for customer acceptance tests using the API or
Service layer. In fact, the combination of developers implementing BDD
scenarios, in conjunction with their code delivery, tends to ensure the
testability of the API or Service layer. This helps teams achieve the desired
structure of the pyramid. Typically, these tests are run in a deployed test
environment and include integration tests, which are sometimes called
"system" tests.
Finally, there is GUI test automation. This is typically the hardest to write and
maintain. As a best practice, if the GUI tests can simply verify that
everything is "hooked up," meaning that values entered though the UI are
passed correctly to the APIs that were robustly tested independently, then
this layer can indeed be even smaller than represented in the pyramid
above. The smaller the top portion of the pyramid, the better.
Across all layers of testing, it is important to consider how the tests will run
automatically. For unit tests, many industry-standard frameworks run with
the continuous integration build. For API or service and GUI tests, setting up
the production-like test environment is automated with the same
deployment automation that is used for delivering to production. These test
environments require deploying test tools, test scripts, and possibly test
data, into the production-like test environments to allow the tests to run
unattended. When a test automation framework is implemented, introducing
dependencies increases the complexity of automatically running the tests.
Avoid introducing dependencies, if possible.
Test automation can be built into a continuous delivery pipeline by
implementing stages for the different types of tests that are required. The
pipeline can run tests in one or more stages, and can be configured to stop
in any stage where a test case fails. This process ensures that broken code
never makes it to production.
The pipeline can include stages that run many types of tests in an
automated manner: unit, API, GUI, security, scalability, performance, and
globalization. These stages ensure that the application is production ready.
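The stop-on-failure behavior of such a pipeline can be sketched in a few lines. The stage names and pass/fail results below are illustrative:

```python
# Sketch of a pipeline that runs test stages in order and stops at the
# first failing stage, so broken code never reaches production.
def run_pipeline(stages):
    """`stages` is a list of (name, test_fn) pairs; test_fn returns
    True on pass. Returns (passed_stages, failed_stage_or_None)."""
    passed = []
    for name, test_fn in stages:
        if not test_fn():
            return passed, name  # stop: later stages never run
        passed.append(name)
    return passed, None          # all stages green: safe to deploy

stages = [
    ("unit", lambda: True),
    ("api", lambda: True),
    ("gui", lambda: False),          # a GUI test fails here
    ("performance", lambda: True),   # never reached
]
print(run_pipeline(stages))  # (['unit', 'api'], 'gui')
```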
Circuit breaker pattern
When a cloud application that uses many interdependent microservices is
being developed, a failure in one service impacts the entire application. In
the case of cloud services, the goal is to allow a cloud application to continue
to function to the degree possible when a service outage occurs in only part
of the application. Achieving this goal becomes more complicated as
complex applications are built by using many microservices.
The Circuit Breaker pattern provides graceful degradation of service rather
than a total service failure.
Why use the Circuit Breaker pattern?
When a development team uses the Circuit Breaker pattern, they can focus
on what to do when a dependency is unavailable, instead of simply detecting
and managing failures. For example, if a team is developing a website page
and retrieves content from ContentMicroservice for a single widget on the
page, they can make the page available but provide no content for the
widget when ContentMicroservice is unavailable. The page can continue to
function without the failing service.
Teams can use the pattern to minimize damage when only part of the
application is down. The pattern also helps teams decide which actions to
take for failures in dependent microservices.
Implementing the Circuit Breaker pattern
To implement the Circuit Breaker pattern, each invocation of a remote
service requires the caller to extend an abstract class. The class provides the
logic to manage the execution of the action and call a fallback method when
the service is unavailable.
The method can respond to the failure in several ways:
 Fail silently: Return null or an empty set. In the earlier example, this
method would work.
 Fail quickly: Throw an exception. For example, if an authentication
service is unavailable, no new users can log in to the application. However,
anyone who was already logged in can continue to use the application.
 Best effort: Return a close approximation of the requested data. For
example, if a cached copy of the requested data exists, that copy can be
used instead of data from the remote service if that service fails. In this way,
users can still proceed.
The Circuit Breaker framework monitors communications between the
services and provides quality of service analysis on each circuit through a
health monitor. Teams can define criteria to designate when outbound
requests will no longer go to a failing service but will instead be routed to the
fallback method. Criteria can include success/failure ratios, latent response
ratios, and pool size. The fallback method is called until the failing service is
functional again.
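A minimal sketch of the pattern in Python follows. The thresholds and names are illustrative; production frameworks track richer criteria such as success/failure ratios, latency, and pool size:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: after `max_failures`
    consecutive failures the circuit opens, and calls go straight to
    the fallback until `reset_timeout` seconds have elapsed."""
    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, action, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()       # circuit open: fail fast
            self.opened_at = None       # half-open: retry the service
            self.failures = 0
        try:
            result = action()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the circuit
            return fallback()
        self.failures = 0               # success resets the count
        return result

def flaky():
    raise RuntimeError("ContentMicroservice unavailable")

breaker = CircuitBreaker(max_failures=2)
# "Fail silently": the fallback returns an empty widget payload, so
# the page renders without the failing content widget.
for _ in range(3):
    print(breaker.call(flaky, fallback=dict))
```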
Auto scale applications
You developed an amazing cloud application. In fact, it's so successful that
you now must watch carefully to ensure that enough resource is allocated on
your hosting platform to keep up with demand. No one wants to spend time
manually watching usage and adjusting a system configuration based on
need. Enter auto scaling.
Application operators use auto scaling to ensure that enough resource is
available in the application at peak times and to reduce allocated resource
during low usage times. The benefits of auto scaling include customers who
are delighted that the application runs well, and savings in cost and
electricity. Auto scaling is responsive to actual usage patterns, so it can
handle unexpected traffic patterns, such as when a link goes viral.
Based on a policy, auto scaling automatically increases and decreases the
resources in your application infrastructure. You can define the policy based
on a review of the historical scaling history. Typically, you can adjust several
attributes through the auto-scaling policy:
 Number of application instances
 Heap size
 Memory
 Response time
If you use advanced policies, you can set application characteristics for time
periods. For example, if you know that your application is heavily used
between the hours of 9 AM and 9 PM, you can set a policy that is specific to
that period. You might be wondering, "Why review scaling history or set
time-based policies if the system can dynamically adjust to load?"
When not to use auto scaling
Auto scaling is a great way to ensure that you have the resources that you
need when you need them. However, limits exist on how much it can, and
should, adjust capacity. Increasing the number of instances to cope with
sustained heavy load can cause costs to increase. At some point, the
graceful degradation of service is preferable to runaway cost. Think about
what that point is and put appropriate upper limits in place. Also, set up
monitoring and alerting so that the system can ask for help if it is getting
near the limits or if a distributed denial-of-service (DDoS) attack is detected.
Auto scaling might not be able to handle sharp spikes. New service instances
need a warm-up time before they can handle requests. If the number of
services is low, it might take too long to start enough instances to cope with
a surge in demand. This issue is called "the cold start problem". For services
that run on dedicated hardware, virtual machines, or other systems where
the underlying system doesn't have much elasticity, trying to start too many
services can cause system stress.
If the system is struggling to cope with high demand, the last thing it needs
is to try to start several new services. By doing so, you can create a failure
cascade where the service instances are starved of resources and then crash
and can't restart.
To avoid this scenario, most auto-scaling services enforce a time window to
prevent haphazard scaling up or down. They adjust the number of instances
only if a few minutes have passed since the last scaling event. If you need a large
change in capacity, try to anticipate demand to some extent and start
instances in a phased way.
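The cooldown behavior can be sketched as follows; the window length and the injectable clock are illustrative choices:

```python
COOLDOWN_SECONDS = 300  # e.g. adjust at most once every five minutes

class Autoscaler:
    """Minimal sketch of a scaler that enforces a cooldown window."""

    def __init__(self, instances, clock):
        self.instances = instances
        self.clock = clock          # callable returning seconds; injectable for tests
        self.last_scaled = None

    def request_scale(self, target):
        now = self.clock()
        # Ignore the request if we scaled too recently.
        if self.last_scaled is not None and now - self.last_scaled < COOLDOWN_SECONDS:
            return False
        self.instances = target
        self.last_scaled = now
        return True
```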
Auto scaling doesn't apply to a few types of applications. If your application
isn't a web-based application, such as a "Worker" or a background job, auto
scaling might not help because response time and throughput metrics won't
be available. Likewise, auto scaling in IBM Cloud has a few features that
depend on the type of application that you're using.
For example, Liberty for Java™ applications support scaling rules for heap,
memory, response time, and throughput, but Node.js applications support
scaling rules for only heap, memory, and throughput. For a full list of
supported application types and scaling rule support, see the IBM Cloud
Docs.
Related practices
Auto scaling complements the microservices approach because in a
microservices architecture, you can make scalability decisions for each
service, as each service runs in its own container.
Chaotic testing
Earlier this year, a major restaurant chain suffered a significant software
system outage that caused many of their restaurants to close early. Those
that didn't close early simply gave food away to customers because they had
no way to charge the customers and accept payments. The outage was a
major topic across all the news outlets for at least a day. This is not the type
of "press" that companies typically desire.
So what's the point of bringing up this situation? It is doubtful that
customers, or employees for that matter, thought to themselves, "Well, it's
been three years since there's been a major outage like this, so it is not a
problem that the restaurants are closed down for the next 12 hours." The
incongruity of that statement helps us to think about the difference between
mean time to failure (MTTF) and mean time to recovery (MTTR). Historically,
the emphasis has always been on MTTF: working hard to extend the time
between system failures, with little emphasis on how fast a failure could be
corrected. "It's been three years since we've had an outage. Isn't that great!"
Getting started
In today's world, the emphasis needs to shift to MTTR, minimizing the time it
takes to recover from a failure. To illustrate the point, if the restaurant's
software system had gone down 100 times that day, but the recovery time
for each of the failures was on the order of microseconds, apart from the
restaurant's internal operations personnel, would anyone have actually
noticed? Would customers have been turned away? Would it have made the
news? No, no, and no.
Given the importance of being able to recover from software failures quickly,
how do organizations actually improve their mean time to recovery? This
article steps you through an approach that's gaining traction across the
software industry, and that you can use to put your organization on the road
to improving MTTR.
One of the first things to do, and this will likely sound counterintuitive, is to
purposefully crash a production software system. Yes, crash a production
software system on purpose. Once you stop and think about it, it does begin
to make sense. Here are a few reasons why:
 Typically, system failures occur unexpectedly; however, in this case,
the date and time of the failure are known beforehand. The specific failure
itself should not be known. Because the date and time are predetermined,
personnel are ready to immediately jump in and fix the problems when they
occur.
 There will also be a heightened focus on monitoring system data
before, during, and after the failure. This is obviously meant to help with the
recovery from the failure itself, but it also provides data for subsequent
analysis and improvement.
 When the system has been brought back online, and the subsequent
analysis occurs, new insights about the production system will come to light.
You should expect to hear comments like, "Wow! I didn't know that if this
failed, it would cause a problem over there!" or perhaps, "We gather a bunch
of logging and tracing data, but it didn't help us debug the problem." And the
one comment you really want to hear will hopefully sound something like
this, "If I had only added a check in my code, the downstream failure could
have been prevented and the impact of the crash could have been limited."
 With the last comment in mind, in my estimation, the biggest impact
will be increased awareness across the organization of the need to focus on
resiliency. There's nothing like messing with your production system to get
folks to pay attention.
Benefits of chaotic testing
Once you do these production system crashes a couple of times, and people
begin to take to heart the need to focus on resiliency in both the software
itself and in the operations environment, with improvements actually being
made after each episode, the level of confidence regarding MTTR capabilities
should increase significantly. That's when you can move to the next step,
chaotic testing.
At a high level, chaotic testing is simply creating the capability to
continuously, but randomly, cause failures in your production system. This
practice is meant to test the resiliency of the systems and the environment,
as well as determine MTTR. As you can imagine, as ongoing, random failures
are injected into the production system, and as improvements are
continuously made, the overall production system will become much more
stable and recovery times will be greatly reduced. Adopting chaotic testing
also ensures that no one gets complacent. Note that you can get creative
and generate targeted yet random failures, aimed at only a particular aspect
of your environment, such as degrading system performance, shutting down
access to part of the network, or killing off a microservice.
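At its simplest, a chaos injector is a loop that periodically picks a random failure mode and a random victim from a scoped target list. The sketch below is illustrative; real tools such as Chaos Monkey add scheduling, safety controls, and opt-outs:

```python
import random

# Hypothetical targets, scoped so that experiments stay contained.
TARGETS = {
    "kill-service": ["cart", "checkout", "catalog"],
    "network-partition": ["zone-a", "zone-b"],
}

def pick_experiment(rng):
    """Randomly choose a failure mode and a victim within its scope."""
    mode = rng.choice(sorted(TARGETS))
    victim = rng.choice(TARGETS[mode])
    return mode, victim

def run_experiment(rng, inject):
    """Pick an experiment and hand it to an injection callback."""
    mode, victim = pick_experiment(rng)
    inject(mode, victim)  # in real use: kill a pod, drop traffic, add latency, etc.
    return mode, victim
```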
Adopting chaotic testing will help improve your MTTR, improve organizational
confidence in the resiliency of your production environment, and it will also
keep you out of tomorrow's headlines.
Canary testing and feature toggles
Sometimes, you need to push features into production that aren't ready for
consumption by all users. Consider these examples:
 For testing purposes: "Will this feature perform reasonably in the production
environment?"
 To enable some other feature to be deployed that has a dependency on the
new feature: "The UI isn't ready, but you can call the API directly."
 Single stream development: "We always deploy the whole development
stream, but turn on features only when they're ready."
To address those examples and limit the impact of potential problems, use
canary testing, dark launch, and feature toggles.
Canary testing
A canary release rolls out a change to only a subset of users. Those users
test the new function. Canary testing reduces the risk of introducing a defect
to all users. When you're satisfied with the results, you can scale up the
canary release and roll it out to all users.
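A common way to implement the user split is deterministic hashing: a stable hash of the user ID decides whether a user is routed to the canary, so each user consistently sees one version. A sketch, with the percentage as an illustrative knob:

```python
import hashlib

def in_canary(user_id, percent):
    """Route roughly `percent`% of users to the canary, stably per user."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # stable bucket 0-99
    return bucket < percent

def route(user_id, percent=5):
    return "canary" if in_canary(user_id, percent) else "stable"
```

Because the bucket is derived from the user ID rather than chosen at random per request, a user never flips between versions mid-session.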
Dark launch
Deploying a feature but limiting access to it is commonly called dark
launching the feature. You can implement dark launching in any number of
ways. The best mechanism depends on the complexity and type of
application that you're building.
In a simple web application, the easiest way to dark launch a new feature on
a page is to duplicate the page with a different name; for example,
../gitpage.html → ../gitpage2.html. Then, you can update the new page and
push both. In this way, only the people who know the new page name can
access the new version.
The nice thing about this approach is that the new code is kept separate
from the old code. When it's time to make the page available to everyone,
you can delete the old page and rename the new one, "taking it out of dark
launch mode."
Unfortunately, the dark launch approach has a few downsides. Because the
two pages are separate, if both pages are kept "live" for a significant amount
of time, any common bugs must be fixed twice. Even worse, because the
new page has a new URL, any links from other pages in the app will point at
the old page, making it difficult to test flows that cross pages.
Feature toggles
To address these problems, consider a related notion, feature toggles. At its
most basic, a feature toggle is just a way to provide extra context to the
application, and then based on that context, to change the way that the
application behaves. For example, to dark launch a feature in a web app, you
can implement both the old behavior and the new feature. Then, you can
test a feature toggle on the page to choose which one to show the user. This
approach typically makes the code more complex (for old-school C
programmers, think #ifdef statements), but it also means that bugs in
common code need to be fixed only once. Additionally, the same URL now
provides either the old or the new behavior based on the state of the toggle.
Getting started
So, how do you implement feature toggles? A single best practice doesn't
exist, but you can use a number of different techniques in different
situations.
One possibility would be to use a query parameter on the URLs. So, for
example, you might use:
.../manage.html?darklaunch=true
However, if several pages are involved in the flow, then the parameter would
need to be passed from the page.
Another possibility is to test a value that is kept in the browser's local
storage. The code can check for something like this example:
if (localStorage.darkLaunch==="true")
You would turn on this feature flag by going into the browser's debug console
and running the following line:
localStorage.darkLaunch="true"
Because this behavior is controllable directly from the browser, this approach
is good for cases where you want to enable the new behavior for particular
users, or only while debugging.
Another case is enabling or disabling a particular feature across the whole
site, and switching the behavior with no downtime. In this case, you might
have a "feature flag service," perhaps something like this example:
.../home/services/...common.service.IFeatureFlagService
This returns flags for any configurable features, such as:
{"isNewCoolFeatureEnabled":true}
Even more complex variations are possible, such as having the service
return different values based on which user is accessing the application.
However, if you push this too far, you end up with pages that are difficult to
debug, because you can't understand an issue without knowing the exact
context at the time the page was invoked. For example, no two users might
get the same experience.
You also must carefully manage how you access the service, because it can
become a single point of failure, and the service itself, because you might
end up with outdated flags.
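One mitigation for the single-point-of-failure risk is a client that caches the last known flags and falls back to safe defaults when the flag service is unreachable. A sketch (the fetch callable is a stand-in, not a real API):

```python
DEFAULT_FLAGS = {"isNewCoolFeatureEnabled": False}  # safe defaults

class FlagClient:
    """Fetch flags, but never let a flag-service outage break the page."""

    def __init__(self, fetch):
        self.fetch = fetch               # callable returning a dict of flags
        self.cached = dict(DEFAULT_FLAGS)

    def flags(self):
        try:
            self.cached = {**DEFAULT_FLAGS, **self.fetch()}
        except Exception:
            pass                         # keep last known (or default) flags
        return self.cached
```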
Despite these potential issues, dark launching and feature flags are two
related, powerful methods to enable new features to be deployed quickly
and tested in the production environment.
Health check APIs
As you move your applications to the cloud and refactor to microservices,
you face new challenges to monitor your microservices in a scalable way.
Instead of having one application to monitor, you must monitor separate,
interrelated services and understand how your application behaves when
any of them becomes unavailable.
Because microservices generally have a small and clearly defined scope, the
service owner or developer is well positioned to understand all the
dependencies and how to validate their availability. To have a quick,
standardized way to validate the status of a service and its dependencies,
you can introduce a health check API endpoint to a RESTful (micro) service.
At minimum, a health check API is a separate REST service that is
implemented within a microservice component that quickly returns the
operational status of the service and an indication of its ability to connect to
downstream dependent services. As part of the returned service status, a
health check API can also include performance information, such as
component execution times or downstream service connection times.
Depending on the state of the dependencies, an appropriate HTTP return
code and JSON object are returned.
The derived requirements are as follows:
 The health check API must be a REST service in the microservice component.
 The health check API must return the operational status of the component
and its ability to connect to the downstream components that it depends on.
 An advanced health check API can be extended to return performance
information, such as connection times.
 The results must be returned as an HTTP status code with JSON data.
Get started with a simple health check
Start with a minimum viable health check API that can respond when you ask
it if a service is available. Depending on the state of the service's
dependencies, an appropriate HTTP return code and JSON object are
returned. The response must be explicitly defined as noncacheable to avoid
returning an incorrect status as a result of caching network equipment.
With a simple health check API, you can test whether the service is available.
Consider a microservice that is responsible for retrieving user information from
a database. The main dependency is the database where the user entries are
stored. In pseudocode, a health check API might look like this example:
Route.add('/api/v1/healthcheck', Method='GET') {
    If (userdb.connect() == "Successful") {
        headers = {'http_status': 200, 'cache-control': 'no-cache'}
        body = {'status': 'available'}
    }
    Else {
        headers = {'http_status': 500, 'cache-control': 'no-cache'}
        body = {'status': 'unavailable'}
    }
}
As you continue to enhance your microservices, you can include fall-back
mechanisms to deal with failures in the primary data store. The database
service might grow to include slower, less frequently updated tiers of storage
or cached results that can be exposed as an alternative response code, such
as 203 "Non-Authoritative Information".
In pseudocode, an implementation might look like this example:
Route.add('/api/v1/healthcheck', Method='GET') {
    If (userdb.connect() == "Successful") {
        headers = {'http_status': 200, 'cache-control': 'no-cache'}
        body = {'status': 'available'}
    }
    Elseif (backupdb.connect() == "Successful") {
        headers = {'http_status': 203, 'cache-control': 'no-cache'}
        body = {'status': 'backup-storage available'}
    }
    Else {
        headers = {'http_status': 500, 'cache-control': 'no-cache'}
        body = {'status': 'unavailable'}
    }
}
As you evolve your microservices, you can implement more advanced health
checks to automatically check all the dependent interfaces and report on
them as part of a service's health check.
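The same logic can be written as a framework-free Python function that any HTTP router can wrap; `userdb` and `backupdb` here stand in for your real connectors:

```python
def health_check(userdb, backupdb):
    """Return (status_code, headers, body) for the health check endpoint."""
    headers = {"Cache-Control": "no-cache"}  # never let caches mask an outage
    if userdb.connect() == "Successful":
        return 200, headers, {"status": "available"}
    if backupdb.connect() == "Successful":
        return 203, headers, {"status": "backup-storage available"}
    return 500, headers, {"status": "unavailable"}
```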
Capture diagnostic information by using First Failure Data Capture
To quickly resolve an incident, your first responder and DevOps teams need
immediate access to all information that can help them diagnose and resolve
the incident. First Failure Data Capture (FFDC) is a technology that teams
use to collect information about events and conditions that lead to a failure
so that they don’t need to re-create the failure.
The first step to implement FFDC is to gain a clear understanding of what
you need so that you can diagnose problems in your service. You might
capture this data:
 Configuration information
 Runtime statistics
 Function and stack traces
 Trace logs
 Message logs
 Dumps of in-memory data structures
The second step is to collect relevant information. You can manually collect
the information or run scripts to collect it. After you have the information,
you make it available to first responders and DevOps team members.
For advanced implementations of FFDC, the DevOps team instruments code
to dump valuable information when the service crashes. Without
instrumented binary files and libraries, the amount of valuable data in FFDC
is limited.
At a minimum, be sure to instrument the function entry and exit points and
the parameters that are passed between functions. You also need to manage
the tradeoff between capturing everything that happens in your system,
which can overwhelm first responders and bury useful information, and not
capturing enough data to determine the problem. To avoid information
overload, identify functions that are infrequently called or error prone and
stop logging those functions. Also, perform analyses to see what level of
tracing you need for each component of your service based on reliability and
quality.
Because a failure is likely to happen while the system is in an unattended
mode, capture data in such a way that it isn't overwritten before you can
gather it and send it to a support center or help desk. When you dump this
information into the file system, be sure to preserve the information by using
file names that aren't likely to be overwritten. Consider including timestamps
and process IDs as part of the file names. In addition, make sure that the
information is stored in a persistent location and not in a volatile location,
such as a disposable container.
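A small naming convention helps here: build the dump path from a UTC timestamp and the process ID so that successive failures never clobber each other. A sketch, with the directory and component names as assumptions:

```python
import os
import time

def ffdc_dump_path(base_dir="/var/log/ffdc", component="userservice"):
    """Unique, sortable dump name: <component>.<UTC timestamp>.<pid>.ffdc"""
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime())
    return os.path.join(base_dir, f"{component}.{stamp}.{os.getpid()}.ffdc")
```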
Avoid information overload by using in-memory log queues
When errors occur, the log should contain many details. Unfortunately, the
exact detail that led to an error is often unavailable after the error occurs. If
you're not logging everything, the log records before the error record might
not provide enough detail.
To solve this problem, create temporary, in-memory log queues. Throughout
the processing of a transaction, you can add verbose details about each step
to the in-memory log queue. If the transaction is completed, you can discard
the detailed in-memory log information and log a summary message. If an
error occurs, log the entire contents of the in-memory queue and the error.
This technique is useful when you're logging complex system interactions.
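A sketch of the pattern: accumulate verbose entries in memory per transaction, then either discard them on success or flush them all alongside the error:

```python
class TransactionLogger:
    """Buffer verbose log detail in memory; emit it only if the transaction fails."""

    def __init__(self, sink):
        self.sink = sink    # callable that writes a log line, e.g. logger.info
        self.queue = []

    def debug(self, message):
        self.queue.append(message)          # cheap: memory only, nothing written

    def commit(self):
        self.queue.clear()                  # success: drop the detail
        self.sink("transaction completed")

    def fail(self, error):
        for line in self.queue:             # failure: flush everything buffered
            self.sink(line)
        self.sink(f"transaction failed: {error}")
        self.queue.clear()
```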
Distributed tracing
Moving your applications from a monolithic design to a microservices-
oriented design introduces several advantages during development and in
operations. However, that move has a price. New challenges are introduced,
as traditional metrics and log information tend to be captured and recorded
in a component and machine-centric way. When your components are
spread across machines and physical locations and are subject to dynamic
horizontal scaling over transient compute units, traditional tools to capture
and analyze information become powerless.
Distributed tracing is a technique that addresses logging information in
microservice-based applications. A unique transaction ID is passed through
the call chain of each transaction in a distributed topology. One example of a
transaction is a user interaction with a website. The unique ID is generated
at the entry point of the transaction. The ID is then passed to each service
that is used to finish the job and written as part of each service's log
information. It's equally important to include timestamps in your log
messages along with the ID. The ID and timestamp are combined with the
action that a service is taking and the state of that action.
Unique identifiers, such as transaction IDs and user IDs, are helpful when you
gather analytics and debug. Unique IDs can point you to the exact
transaction that failed. Without them, you must look at all the information
that the entire application logged in the time frame when your problem
occurred. After you implement the generation and usage of the unique ID in
your logs, you can use the unique ID in several ways.
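The mechanics can be sketched in a few lines: generate the ID once at the entry point, pass it explicitly to each downstream call, and stamp every log entry with the ID and a timestamp. The service and function names are hypothetical:

```python
import time
import uuid

LOG = []  # stand-in for your log aggregator

def log(transaction_id, service, message):
    """Every line carries the transaction ID and a timestamp for correlation."""
    LOG.append({"ts": time.time(), "txid": transaction_id,
                "service": service, "msg": message})

def inventory_service(txid, item):
    # downstream service: reuses the caller's ID instead of minting its own
    log(txid, "inventory", f"reserving {item}")

def handle_order(item):
    txid = str(uuid.uuid4())          # generated once, at the entry point
    log(txid, "frontend", "order received")
    inventory_service(txid, item)     # the ID travels with the call
    return txid
```

Filtering the aggregated log on one `txid` then yields the distributed stack trace of that single transaction.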
Enable log correlation
By implementing the transaction ID and creating a timestamp of each log
entry that contains that ID, you can take advantage of log-aggregating tools,
such as IBM Cloud™ Log Analysis. Those tools provide a distributed stack
trace of the steps that led to the failure of a specific transaction.
Perform advanced latency analysis
Advanced latency analysis attaches timing information to the transaction.
You get an in-depth, end-to-end analysis of the total time it takes to process
one transaction. You can also explore the details to find bottlenecks in your
application.
Build a topology view
Correlate the captured transaction ID with application and service
identification and other metadata. Doing so makes it possible to build a
topology view that shows the dependency map of all the services in the
topology that are required to satisfy the transaction.
Change management
In a traditional model where you delivered a monolithic application with
limited releases, a change advisory board (CAB) supported a change
management team by prioritizing and approving requested changes.
As you implement continuous integration, testing, and deployment, the focus
of your CAB shifts. Instead of assessing individual changes, the CAB assesses
automation policies. As suggested by the Information Technology
Infrastructure Library (ITIL), you can make standard changes without the
involvement of your CAB. Your CAB now focuses on establishing policies,
such as required thresholds for code coverage and test results. Based on the
policies that the CAB defines, most changes can be automatically handled as
part of your continuous delivery pipeline.
Major changes might still require the involvement of the CAB, including
changes that affect many microservices or disrupt the application. Disruptive
changes must be reviewed and then carefully promoted into production.
To assess the criticality of a change, use DevOps analytics tools. With the
data in code repos, issue-tracking systems, and build systems, you can
deliver apps faster and with greater quality. Examples include tools that
explore these aspects:
 Developer insights: Examine key information on error-prone files,
commitments, issues, and lines of code that were changed.
 Team dynamics: Review the interaction of DevOps teams through the
code changes that are being made.
 Deployment risk analytics: Assess and enforce quality and control
across your delivery pipelines by using automated data collection, risk
analysis, and policy gates.
 Delivery insights: Filter deployment data by application, environment,
components, and date range to view applications and discover areas that
need more attention.
 Availability insights: To understand how your application runs in the
production environment, collect and review quality metrics, such as service
availability, customer impact, root-cause analysis action item closure, and
mean time to repair (MTTR).
The principles of modern service management
Modern service management and operations refers to all the activities that
an organization does to plan, design, deliver, operate, and control the
applications in an enterprise. It includes the people who do the work,
processes that define what work is needed and how it is done, and tools to
enable and support these activities. Applications are monitored to ensure
availability and performance according to service level agreements (SLAs) or
service level objectives (SLOs).
Operations should be as agile as development, with continuous delivery
practices and an emphasis on continuous improvement. Service
management must transform to support this paradigm shift. The
transformation has implications in various areas:
 Organization: Instead of a discrete operations organization that is
distinct from the development team, full lifecycle responsibility is provided
through small DevOps teams. Another approach is site reliability engineering
(SRE), which brings a strong engineering focus to operations. SRE
emphasizes automation to scale operations as load increases.
 Process: A key concept of DevOps is the automated and continuous
testing, deployment, and release of functions. Service management
processes, such as change management processes and the role of the
change advisory board, must change to support this notion.
 Tools: Because time is of the essence in restoring a service, incident
management tools must provide rapid access to the right information,
support automation, and provide instant collaboration with the right subject-
matter experts (SMEs). The term ChatOps describes a collaborative way to
perform operations. Bot technology integrates service management and
DevOps tools into this collaboration.
 Culture: As with any transformation project, you must consider a few
cultural aspects. One example is the need for a blameless postmortem
culture where the root cause of an incident is revealed and the organization
can learn from it.
Slow is the new down
One reason that service management paradigms are shifting is that
customers' expectations for services are shifting. Customers demand fast
service and the rapid delivery of new products and features. If your mobile
app or website is slow and does not perform, your site might as well be
down. Your customer will take their business elsewhere.
Benefits of service management
A well-designed service management architecture provides several benefits:
 Maximizes operational effectiveness. Ensures the availability and
performance of applications that run on IBM Cloud™, given the target SLA of
99.99% availability for applications.
 Increases operational reliability and agility by using event-driven
guidance, automation, and notification to prompt the right activity to resolve
important issues.
 Improves operational efficiency by using real-time analytics to identify
and resolve problems faster. Reduces costs by creating a single view and
central consolidation point for events and problem reports from your
operations environment.
 Establishes and maintains consistency of the application's performance
and its functional and physical attributes with its requirements, design, and
operational information.
 Manages and controls operational risks and threats.
Five principles of service management
To effectively manage modern applications and consider the service
management and operational facets of their applications, your operations
team can follow five principles:
 Operations: Services need management. Operations activities typically
include placement of workload based on resource requirements, rollouts and
rollbacks, service discovery and load balancing, horizontal scaling, and
recovery. Cloud platforms such as Kubernetes help with many of these
activities by providing functions for self-healing, dynamic discovery, and
automated rollbacks. Operational activities should also include compliance
checks (ideally automated, regular, and done in production) and data
backup.
 Monitoring: Services also need to be monitored. When you are deciding
what to monitor, be guided by the experience of the user of the service. The
user might be a human for front-facing services or a system for back-end
services. Key metrics typically are availability, performance, response time,
latency, and error rate. To ensure that you detect an issue before it causes
an outage, prioritize monitoring of the four golden signals: latency, traffic,
error rate, and saturation. Traditional metrics such as CPU, memory, and disk
space are less relevant in a cloud context.
 Eventing and alerting: What happens if the monitoring solution detects
a problem? An alerting system must notify first responders if a problem is
detected, either by using email, SMS text message, or an alert in an instant
messaging system. A single problem can cause a cascading failure across
multiple systems, so the alert system must be able to correlate related
events from different sources.
 Collaboration: A first responder is the first person, but probably not the
only person, who helps to resolve an incident. In an architecture where many
services depend on each other, expertise across multiple areas or systems is
likely needed. The term ChatOps describes the process of using an instant-
messaging communication platform to collaborate among SMEs and
automated tools. Through the ChatOps platform, all interaction is logged in a
central place and you can browse through the log to see what actions were
taken.
 Root-cause analysis: To prevent an incident from reappearing, the root
cause must be assessed. Follow the 5 Whys approach, which helps to surface
the issue that was ultimately responsible for an incident. This investigation
must be operated in a blameless culture; only through that approach are
people willing to share their insights and help others to learn from the
experience.
Observability
Observability is the ability to understand the running system by observing
external outputs (that is, without stopping it and taking it apart). Provide
observability by instrumenting the application and services. Extend the
management and monitoring of containers by using sidecars and a service
mesh framework such as Istio. Most importantly, monitor the service as it is
experienced by the user.
Service developers should expose a health check API. These health checks
should be tested automatically on every deployment.
Shift left and build to manage
Operations isn't just the responsibility of the Ops team. Developers should be
adding instrumentation to support observability. They are the ones who
know how to create runbooks and how to analyze logs and traces to identify
and solve issues. The development team should also use automation to test
and deploy applications as early in the development cycle as possible. (What
should be automated? Everything!)
Operational readiness
When your applications fail and it takes time to determine the root cause
and restore service, customers get frustrated. You want to ensure that your
customers are delighted. An assessment of your organization's operational
readiness answers three questions:
 What needs to change?
 How significant is the change?
 What are the expected benefits?
Your answers to these questions identify the gaps that you need to close.
Follow these three steps to get started:
1. Assess where you are. Engage in an operational readiness review to
examine all key operational processes and to determine the as-is versus the
to-be state.
2. Determine where you need to be. Cost and risk tradeoffs are inherent
in all processes. Assess each process to determine where you need to be.
3. Improve and assess continuously. Identify gaps where processes don't
meet minimum requirements. Put plans in place to address the gaps. As your
organization matures, repeat the whole process regularly.
As you adopt cloud technologies and move workloads to the cloud, be sure
to adapt your guidelines. Examine your readiness from two perspectives.
Operationalize the cloud
As your company adopts a new platform, its processes, roles, and
responsibilities must be revisited to determine whether they still apply. The
same is true when you adopt a public, dedicated, or private cloud.
Operationalizing the cloud starts with understanding the roles and
responsibilities of the cloud consumer and the cloud provider. Consider using
a RACI matrix to detail your operational activities and who is responsible,
accountable, consulted, and informed.
Operationalize application readiness
As you adopt a microservices approach for your applications, be sure to
establish guidelines and processes to keep your services robust and
serviceable. For more information, see Operationalizing your application
readiness.
RACI matrix
As you move to the cloud, it's important to understand your development
and operational processes and who is responsible for what. One way to
ensure that everyone understands the roles and responsibilities is to use a
responsible, accountable, consulted, and informed (RACI) matrix. A RACI
matrix is a common way to implement a decision-rights framework to clarify
the roles and responsibilities for key processes.
The matrix shows key activities as rows and participating parties as columns.
For each participating party, you indicate whether they're responsible,
accountable, consulted, or informed:
 Responsible: This role does the work to complete the activity. Only one role is
responsible, but other roles can help as needed.
 Accountable: This role approves the completion of the high-quality
deliverable to fulfill the activity. Only one party is accountable for each specific task
or deliverable.
 Consulted: This party is an individual or a group who is consulted to provide
opinions or technical expertise to complete an activity or deliverable. They are
typically subject-matter experts (SMEs) who are in communication with the people
who are responsible for activities.
 Informed: These parties are notified of progress, often only when a task or
deliverable is completed. One-way communication exists with these parties.
In this example, you can see part of a RACI matrix that shows the
relationships between the service consumer and service provider for the
tasks. RACI matrixes that detail operational activities can have hundreds of
rows.
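As a small illustration, a RACI matrix can also be captured as data and checked automatically against the rules above: exactly one responsible party and at most one accountable party per activity. The activities and parties in this Python sketch are hypothetical examples, not a prescribed set.

```python
# A RACI matrix as data: rows are activities, columns are parties.
# R = responsible, A = accountable, C = consulted, I = informed.
RACI = {
    "Patch guest operating system": {"Cloud provider": "R", "Cloud consumer": "A", "Security team": "C"},
    "Monitor application health": {"Cloud consumer": "R", "Cloud provider": "I", "Security team": "I"},
    "Maintain physical data center": {"Cloud provider": "R", "Cloud consumer": "I", "Security team": "C"},
}

def validate(matrix):
    """Return a list of rule violations: each activity needs exactly one
    responsible party and at most one accountable party."""
    problems = []
    for activity, roles in matrix.items():
        codes = list(roles.values())
        if codes.count("R") != 1:
            problems.append(f"{activity}: needs exactly one responsible party")
        if codes.count("A") > 1:
            problems.append(f"{activity}: has more than one accountable party")
    return problems

print(validate(RACI))  # an empty list means the matrix is well formed
```

Running a check like this whenever the matrix changes keeps the roles unambiguous as rows are added.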
As you define your RACI matrix, make sure to include this information:
 Who is responsible for what activities
 Gaps in your processes and plans to address the gaps to prevent service outages
 Areas where you might need assistance from SMEs who can help you improve and implement
your processes
As your organization and processes change, maintain your RACI matrix.
Update it to reflect new activities that might be necessary, remove activities
that are no longer completed, and update the roles and responsibilities. You
can also define specific teams or people that fulfill each role for the activities
in your matrix.
You can use a RACI matrix as a starting point for runbooks and operational
models. To be most effective, it must be a living document. You can expand
key use cases in your RACI matrix into a fully documented process that
provides more details to use in your runbook.
References
Weill, Peter, and Jeanne W. Ross. IT Governance: How Top Performers Manage
IT Decision Rights for Superior Results. Boston, MA: Harvard Business Review
Press, 2004.
Operationalize application readiness


As you move your applications to microservices, look for guidelines and
processes to ensure that your services are robust and serviceable.
Implement guidelines such as the Twelve-Factor App Manifesto, which
provides guidance for your developers as they build microservice-based
applications.
When you consider operational application readiness, go beyond the
technical questions to organizational, process, and cultural elements.
Developers must have a vested interest in building high-quality, robust
services and avoid favoring new functions over operational nonfunctional
aspects. To meet the production needs of a service, including security,
compliance, resilience, availability, and performance, you need the right
technical implementation, a high-performing organization, and agile,
integrated processes.
Even if you implement DevOps across your organization with teams that
have end-to-end responsibility, you need to govern at scale across your
organization. New operating models such as Site Reliability Engineering
(SRE) can enable an efficient operating culture. But even SRE organizations
can't operate a cloud-enabled landscape efficiently at scale unless your
development teams adhere to well-defined rules.
Establish guidelines and enforce them to ensure
that your applications are ready for production.
You must define, communicate, and enforce clear guidelines to help assess
whether a microservice or other cloud service is of high enough quality and
is operationally ready for production. To avoid ambiguity in manual
interpretations, make sure that your DevOps pipeline automatically validates
that the guidelines are adhered to. You also need to store the results of the
validation. Then, you can weigh the results by relevance, aggregate them
into an overall score, and publish that score on a scorecard and your
management dashboards.
Technical and nontechnical guidelines
What guidelines should your team follow? For a few examples of technical
and nontechnical guidelines, see the following table. These guidelines are a
sample. Be sure to define more guidelines to verify that your applications are
production ready.
Consider what action you might take if a service doesn't comply with the
guidelines. Also, think of the incentives that can be created for adhering to
the guidelines.

What is incident management?


Incident management is the practice of restoring a damaged service to
health as quickly as possible by using a first-responder team that is equipped
with automation and well-defined runbooks. To maintain the best possible
levels of service quality and availability, incident management leverages
sophisticated monitoring to detect issues early, before the service is
affected. For complex incidents, subject matter experts will collaborate on
the investigation and resolution. Stakeholders, such as the application
owner, are continuously informed about the status of the incident.
Developers do their best to write applications that perform well, are robust,
and provide a great user experience. But things can go wrong, resulting in the
unavailability or slowness of the application. This is where incident
management comes into play: the objective of incident management is to
restore the service as quickly as possible.

Many times, cloud brings higher expectations of availability, performance,
and reliability. Enterprises need new approaches to handle incidents:
redundancy, automation, collaboration, ... paired with organizational and
cultural changes.
Ingo Averdunk, IBM Distinguished Engineer, Cloud Service Management and Operations
Building the incident management toolchain
When you look at the incident management reference architecture and
toolchain, you might be overwhelmed by the number of functions. Although
the functions shown are the recommended capabilities of a robust incident
management solution, no organization has all of them at the beginning. The
following journey map suggests how to build the toolchain.
The core of the solution is monitoring to detect outages, performance
saturation, and more.
Because you cannot afford to have your staff continuously watch consoles,
the next critical element is notification to alert the right subject matter
expert (SME) when something is going wrong.
People often need to collaborate with subject matter experts to isolate the
issue and to define a mitigation strategy. Rather than relying on email and
telephones, you can use ChatOps where people collaborate using instant
messaging with each other, and potentially with tools and systems.
Your first toolchain looks like this:
In addition to monitoring and active probing of services and APIs,
monitoring of log files is an important function that you should add next.
This monitoring can help to identify issues before the service is impacted. It
can also expedite the incident identification and resolution phase.
As the load increases and the application landscape becomes more complex,
first responders start suffering from too many alerts. They receive alerts
related to symptoms in addition to causes. Some alerts may not be
actionable. Events may not provide sufficient context to take action on them
quickly (such as an SLA or impact data). This is where event management
is introduced to your toolchain. It correlates related events, removes noise to
only show actionable alerts, and enriches these events with additional
context.
Your enhanced toolchain now looks like this:
In order to respond to issues quickly, you next add runbooks and
automation. Runbooks can be invoked automatically, either to perform
diagnostic commands or to attempt to mitigate the issue. Runbooks can also
be executed manually by the first responder and incident resolver. To avoid
logging in to a system and risking mistyped commands, semi-automated
runbooks provide secure and consistent execution.
As you add more tools, you need visibility across the entire landscape. This is
not to replace existing product UIs, but rather to complement and to provide
a combined view of the environment in persona-specific dashboards.
Ideally, these views also show additional information such as deployment
activities or SLA information.
Finally, incident information is tracked persistently in ticketing tools, which
provide a source of truth for SLA calculation. Enterprises especially need to
maintain an audit trail for all incidents. The start and end of an incident are
tracked, as well as major updates. Integration across the entire toolchain
automates the population of this activity journal. This also enables you to
detect trends in the environment and to take the right countermeasures.
Your completed toolchain looks like this:
Incident management concepts
A sophisticated monitoring infrastructure detects deviation from normal
behavior, such as a decline of response time, and alerts the operations team
about these incidents. First responders, who are on call 24x7, act to
identify the component at fault and to restore the service as quickly as they
can. They do this by leveraging automation and runbooks to remove the
dependency and risk associated with manual execution of tasks.
While first responders use dashboards that provide visibility into the
application and its landscape, they do not stare at consoles waiting for
alarms to happen. Instead, they are notified about actionable alerts. These
alerts are aggregated by a variety of monitoring systems, and are already
correlated and enriched with relevant information, such as the application
name, impacted user community and stakeholders, and SLA information.
These are actionable alerts, ideally with a clear description of the mitigation.
Using call-rotation and on-call lists, the alarm is sent to the right first
responder to take the necessary action.
Alerts that cannot quickly be pinpointed to a mitigation require more
analysis. Subject matter experts across multiple domains collaborate to
isolate the incident and to identify an effective response. Technologies like
ChatOps help this collaboration. DevOps and service management tools are
also integrated through bot agents. The incident commander coordinates
these tasks and maintains transparent communications to the affected
stakeholders.
The objective of incident management is to restore the service. The team
does not waste time analyzing the root cause of the problem; this will be
done in the following step (problem management). Typical approaches
include restarting a microservice, reconfiguring the load balancer to ignore
the failing instance, or rolling back to the previous version. DevOps
principles like blue-green deployment (continuous delivery) ease the
implementation of these approaches.
Benefits of implementing incident management
 Improved availability and performance of applications and
services. Supports the need for high availability by proactive monitoring and
rapid restoration of services.
 Managed and controlled operational risks and threats. Effectively
manage change, mitigate new threats from interconnected services and
infrastructures, and ensure compliance.

Collaborate using ChatOps


When an issue occurs, collaboration is critical. ChatOps is part of the shift
toward a more collaborative way to perform operations. Instead of using a
traditional help ticket tool, DevOps subject-matter experts (SMEs), including
security, network, and infrastructure experts, use instant messaging tools to
communicate with each other and with the tools that they use to do their
jobs. ChatOps uses collaboration tools such as Slack and Hipchat to create an
environment where SMEs and other IT personnel can literally be "on the
same page" as it relates to an ongoing IT issue.
How to implement ChatOps
ChatOps can be implemented in phases as your team becomes more familiar
and reliant on its chosen collaboration tool. The first phase is simple
persistent communications between team members. Two or more team
members can start a conversation to get answers to pressing issues and
ensure that the team agrees before they provide a response.
In the second phase, you establish groups that must communicate with each
other to do their jobs. In addition to simple group messages, the team can
share screen captures, videos of problems, and files such as log files,
configuration files, or command output. The ChatOps tool stores messages
persistently, so anyone who is added to a conversation can see all previous
communications.
When a major incident is received, some ChatOps tools can automatically
create channels. You can also specify assignment lists so that when a
specific incident occurs, the right people are invited to join the conversation.
As the team evolves, the ChatOps platform can provide two-way
communication between the chat participants and the system or systems
that the team uses in their everyday work. For example, you can create a
group chat channel with tool integrations to see this information:
 The final result of a production build from your continuous delivery pipeline
 Notifications about application deployment failures in data centers around
the world from your monitoring tools
 Site usage metrics on a regular schedule from your analytics tools
 Related outages and incidents so that you can investigate a potential
correlation
 The most recent changes that were deployed to the server or the application
 The resource usage of the affected resource over the past 24 hours
The following images show a monitoring tool integration in the ChatOps
platform. The tool integration notified the team that a production site in a
data center has a problem. The message from the monitoring tool is written
to a dedicated channel with members who share an interest in monitoring
information.
In the channel, the team can react to the issue. They can chat with each
other and click links in the messages from the monitoring tool to get more
information and to resolve the issue.
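Integrations like this are typically implemented as webhooks: the monitoring tool POSTs a JSON payload to a channel URL. The following Python sketch shows the idea; the webhook URL, message format, and service names are illustrative assumptions, and real payloads vary by chat platform.

```python
import json
import urllib.request

def build_alert_message(service, region, status, details_url):
    """Format a monitoring alert for a dedicated chat channel."""
    return {"text": f"ALERT: {service} in {region} is {status}. Details: {details_url}"}

def post_to_channel(webhook_url, message):
    """POST the payload to the channel's incoming webhook (not called here)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(request)

# Hypothetical alert for a hypothetical service and details page.
message = build_alert_message("orders-api", "us-south", "DOWN",
                              "https://monitoring.example.com/incident")
print(message["text"])
```

Anyone in the channel sees the same message at the same time, which is the point of writing alerts to a shared channel rather than individual inboxes.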
The benefits of ChatOps
As your team integrates more tools into the ChatOps platform, the team
gains several benefits. First, the team receives fewer emails because
conversations are often held in chat conversations and are persisted in
context. The second benefit is time savings because the team no longer
must continuously switch between tools to find information. The more tools
that you integrate into the ChatOps platform, the greater the benefits are.
Advanced teams take things one step further. By bringing your tools into
your conversations and using a chatbot that is modified to work with key
plug-ins and scripts, teams can automate tasks and collaborate to work
better, cheaper, and faster. While in a chat room, team members can type
commands that the chatbot is configured to run through custom scripts and
plug-ins. These scripts and plug-ins range from code deployments to security
event responses to team member notifications. The entire team collaborates
as commands are run.
Your team can apply ChatOps to various disciplines of service management:
 Incident management: ChatOps is useful for incident analysis, isolation, and
investigation, especially as people join the process and must be onboarded quickly.
 Problem management: Your team can apply ChatOps in the 5 Hows technique
and use it to develop and prioritize a balanced action plan.
 Change management: Your team can use ChatOps to conduct a virtual
change advisory board to run a change impact assessment and schedule a change
to the application.
ChatOps enables collaborative communication between humans and tools
that reduces incident response time, eliminates repetitive requests for
information, and ensures that all DevOps team members have consistent
access to the information that they need to do their jobs.
Automate application monitoring
Automated monitoring is essential to every successful DevOps project.
Knowing that your application or service is available and functioning within
service level agreements (SLAs) is vital. Many applications strive to provide
99.999% availability, which allows less than 6 minutes of downtime over the
course of a year. Automated monitoring is the best way to ensure that
applications are always functioning.
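The downtime budget behind an availability target is simple arithmetic, assuming a 365-day year:

```python
def downtime_minutes_per_year(availability_percent):
    """Allowed downtime, in minutes per 365-day year, for an availability target."""
    return (1 - availability_percent / 100) * 365 * 24 * 60

# Five nines allows about 5.26 minutes of downtime per year,
# which is where "less than 6 minutes" comes from.
print(round(downtime_minutes_per_year(99.999), 2))  # → 5.26
```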
In an effort to guarantee this kind of availability, application owners build or
purchase monitoring tools that measure application response time every few
minutes from around the globe. Basic availability can be measured through
simple URL pings, proving that a URL can be resolved and returns an
expected result within an expected amount of time. More sophisticated
monitors might involve authentication and traversing several dialogs:
locating a form, entering a user name and password, and then validating the
results.
Monitoring can involve agent-based monitors or synthetic monitors. Agent-
based monitoring tools require that one or more agents be installed to
analyze the details of code, server, user activity, or other data. Synthetic
monitoring tools don't require the installation of an agent; instead, they
simulate user traffic so that you can determine whether your application or
site is performing correctly.
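The simplest synthetic monitor is a timed URL probe. This Python sketch, with illustrative thresholds, separates the probe itself from the pass/fail rule so the rule can be tested on its own; a real monitor would run such probes on a schedule from several locations.

```python
import time
import urllib.request

def evaluate_probe(status_code, elapsed_s, max_elapsed_s=2.0):
    """A probe passes when the URL returns HTTP 200 within the time budget.
    The 2-second budget is an illustrative default."""
    return status_code == 200 and elapsed_s <= max_elapsed_s

def probe(url, timeout=5.0):
    """Fetch the URL once and report (HTTP status, elapsed seconds)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, time.monotonic() - start

# Usage (requires network access):
#   status, elapsed = probe("https://example.com")
#   raise_alert = not evaluate_probe(status, elapsed)
```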
Often, applications are built from multiple microservices, such as an
authentication service or an entitlement service. Because all of these services
play a role in the aggregate responsiveness of your application, they too
should be targets of automated monitoring. Monitoring at both the aggregate
and individual levels also helps to isolate problems. The quick identification
of a failing component leads to faster resolution.
Practices for implementing automated monitoring
 Monitor your application and the things that it depends on. If you
establish monitors for each component of the application and the top-level
interfaces, you can identify bottlenecks faster.
 Start early. Learn how to monitor an application, what to monitor, and
how often. Use monitors to learn the trends of your application. Understand
the dependencies that your application has on other components or services.
Study the problems that are detected and those that aren't. Add monitoring
where needed to clarify the complexities and fill the gaps. Measure the time
it takes to find problems and restore services.
 Integrate automated monitoring with rich notification tooling. Many
teams use tools like PagerDuty to notify on-call personnel when a problem
occurs. If acknowledgements are not received, escalation policies can notify
additional members of the team.
 Collaborate. Automated monitoring eventually surfaces some form of
failure or performance degradation. Today's applications are complex and
rely on many services and subject-matter experts to resolve issues. Use
collaboration tools, such as Slack or Google Hangouts, to collectively solve
problems.
 Test your monitors. Test that the automated monitoring that you
established quickly detects downtime or poor performance. Simulate outages
and downgraded performance situations to see how each is reflected by the
tools.
 Investigate multiple monitoring tools. Many tools have features that
are uniquely useful for one situation or application type. Custom extensions
or programming skill level might be factors in determining which tool is the
best fit for you.
The benefits of automated monitoring
 Your website can become more reliable and provide a better user
experience.
 You can find problems before your users call or open a support ticket.
No one likes the inconvenience of a problem. People become frustrated with
downtime or poor performance and often go elsewhere. A sign of good
automated monitoring is being able to recognize trends that lead to a
problem. When you can fix problems before they happen, you've reached
monitoring nirvana.
How to get started
Getting started with automated monitoring is simple. IBM Cloud™ makes it
easy by providing a built-in logging mechanism that produces log files for
your apps as they run. The logs can show errors, warnings, and informational
messages and can also be configured to log custom messages from your
apps. To make sense of the logs and the availability of your apps, IBM Cloud
offers the Monitoring and Analytics service, which monitors your application's
performance and includes features for log analysis. The Monitoring and
Analytics service provides several advantages:
 Instant visibility and transparency into your application's performance
and health without the need to learn or deploy other tools
 Faster innovation, as you spend less time fixing bugs and addressing
performance issues and more time developing new features
 Quick identification of the root cause of an issue, through the use of
line-of-code diagnostics
 Faster time to resolution of your application's problems, as you can use
embedded analytics to search log and metric data
 Reduced maintenance costs, as you can keep your application running
with minimal effort
Outside of IBM Cloud, many tools in this space offer some form of free trial or
limited edition to help you get started. Sign up and begin monitoring today.
Simple monitoring can be configured in minutes. The number of monitors,
the number of locations that they can be run from, and the amount of
historical data might be limited. However, don't let those limitations stop you
from getting started.
To get started with automated monitoring, consider using one or more of
these tools:
 IBM Cloud Availability Monitoring
 New Relic
 Pingdom
 Datadog
 Uptime
 Sensu

Automate alert notifications for first responders


The faster your first responders find out about an issue, the faster they can
respond. You can't expect your first responders to constantly monitor a
dashboard for issues. They need to be notified about alerts that require their
attention.
The best way to notify first responders is to implement alert notification tools
and integrate them into your monitoring systems so that alerts can be
automatically generated when an issue is detected. Notification tools can
have some or all of these features:
 Send notifications through various channels, including email, text messages,
and apps such as Slack. In some cases, you can configure how each responder
receives notifications.
 Configure on-call and on-duty schedules to determine who is notified when.
 Override normal notifications for extreme or critical issues.
 Provide a way to specify escalation policies if an issue isn't resolved fast
enough.
If you use a sophisticated system, you can avoid alerting first responders
who are already involved in a high-severity incident so that they can
complete their work.
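An escalation policy like the one described can be sketched as a walk down an on-call list that stops at the first acknowledgment. The responder names and the ordering here are placeholders:

```python
def escalate(on_call, acknowledges):
    """Notify each responder in turn; stop at the first acknowledgment.
    `acknowledges` maps a responder to whether they acknowledge in time."""
    notified = []
    for responder in on_call:
        notified.append(responder)  # send the notification
        if acknowledges.get(responder, False):
            return notified, responder  # acknowledged: stop escalating
    return notified, None  # nobody acknowledged: everyone listed was paged

# The primary misses the page, so the policy escalates to the secondary.
notified, acknowledged_by = escalate(
    ["primary on-call", "secondary on-call", "duty manager"],
    {"secondary on-call": True},
)
print(notified, acknowledged_by)
```

A real notification tool adds timers between steps and per-responder channel preferences, but the stop-at-first-acknowledgment shape is the same.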
When the DevOps team delivered the Garage Method for Cloud site, the
team implemented automated monitoring and notification so that they were
immediately notified when issues occurred.
When you implement notifications, be careful not to notify your first
responders when it's not necessary. Make sure that notifications take place
only when application users are affected or are about to be affected.
Suppress notifications of noncritical alerts until the next day to keep the on-
call process effective. Configure your system so that you can avoid
overworking your first responders.
Event management and alert notification
One common problem with the model in which some monitoring tools
directly feed your notification system is that too many notifications are
generated. This problem results in a lack of trust in the solution. Alerts on
symptoms instead of causes, redundant information, or noncritical events
create noise in the system.
In this situation, use event management. Before an alert notifies someone,
an intelligent event management system can analyze the alert. The event
management system suppresses duplicates, correlates dependent events,
and enriches events with meaningful information, such as the affected
service, the associated service level agreement (SLA), and a link to expert
advice. The result is a clear, complete, and actionable alert that triggers the
notification to the right person who has the right information.
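A minimal sketch of that suppress-and-enrich step might look like the following; the enrichment catalog (service names, SLAs, advice links) is hypothetical.

```python
# Hypothetical enrichment catalog: maps a host to its service context.
CATALOG = {
    "db-01": {"service": "orders-api", "sla": "99.95%",
              "advice": "https://runbooks.example.com/disk-full"},
}

def process(events, catalog):
    """Suppress duplicate events, then enrich the survivors so that
    first responders see actionable alerts with context."""
    seen = set()
    actionable = []
    for event in events:
        key = (event["host"], event["check"])
        if key in seen:
            continue  # duplicate of an alert already raised: suppress it
        seen.add(key)
        actionable.append(dict(event, **catalog.get(event["host"], {})))
    return actionable

raw = [
    {"host": "db-01", "check": "disk-full"},
    {"host": "db-01", "check": "disk-full"},  # duplicate from a second poll
]
alerts = process(raw, CATALOG)
print(len(alerts), alerts[0]["service"])  # → 1 orders-api
```

Production event managers also correlate dependent events across hosts, but deduplication plus enrichment alone already cuts much of the noise.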

Use runbooks to automate operations


When an incident occurs, your subject-matter experts (SMEs) and operators
need to respond fast. They must tap into the expertise of all your experts,
not only the ones that are on call. A runbook provides standardized
procedures that explain how to address recurring IT tasks. Instead of having
your team spend time on problems that other people have solved before, a
runbook captures the optimal way to get tasks done. It's important for
developers to become accustomed to providing relevant runbooks as part of
delivering their code.
As your team and application mature, your runbooks can mature too. The
stages of runbook maturity are as follows:
 Ad hoc: This initial state is characterized by individual manual actions with no
documentation or consistency.
 Repeatable: Standard activities are documented and are consistent across
the organization. These activities are still done manually.
 Defined: Activities are enforced. Actions are made available as scripts and
tasks and are provided to the operator in context within management tools.
 Managed: The system suggests the right activity for an event. By using basic
if/then functions, the system automatically runs activities.
 Optimized: At the highest level, analytics are applied to identify when and
what to automate.
Consider the following example of how a problem is solved by using a
runbook. Often, your operators face the problem of a full Linux file system.
As part of defining a repeatable process to address the problem, you can
publish a set of instructions and associated commands that the operator
uses to address the problem:
1. Open a session to the host.
2. Enter the following command to go to the file system that is full:
cd
3. Enter the following command to identify core files in the file system:
find . -name core.*
4. Enter the following command to identify large files in the file system:
du -hsm . | sort -nr
If the problem commonly occurs, the next step is to create a script that
allows the user to specify any needed parameters as input and automatically
runs the commands.
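For example, the manual steps above can be wrapped in one script. This sketch reimplements the find and du checks in Python so the operator supplies only the file system path; the 100 MB size threshold is an illustrative default.

```python
import os

def diagnose_full_filesystem(path, large_file_mb=100):
    """Runbook step: list core files and unusually large files under `path`,
    as the manual `find` and `du` commands would."""
    core_files, large_files = [], []
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(full)
            except OSError:
                continue  # file vanished or is unreadable: skip it
            if name.startswith("core."):
                core_files.append(full)
            if size >= large_file_mb * 1024 * 1024:
                large_files.append((full, size))
    # Largest files first, mirroring `du ... | sort -nr`
    return core_files, sorted(large_files, key=lambda item: -item[1])
```

A first responder, or an event-triggered automation, calls `diagnose_full_filesystem("/var")` and attaches the result to the incident record.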
By using runbook tools that are integrated with event notifications, you can
define automated procedures to run when a specific event occurs. Consider
what happens when an event-management system receives an event that
indicates that a service failed. The system can complete automated actions
that are defined in a runbook. For instance, when the system receives a
service failure event, the runbook script suggests taking a snapshot of the
system. You might automate these actions:
 Using the system host name and credentials to log on to the server
 Obtaining the process list, memory, CPU utilization, version information, and
other information
 Pulling any related log and trace messages
The results of the actions can immediately be sent to the first responder as
soon as the action is completed. Over time, your team can become
comfortable enough to code scripts to automate problem fixes without
manual intervention.
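The wiring from events to automated runbook actions can be sketched as a dispatch table. The event type and diagnostic steps below are hypothetical placeholders:

```python
def collect_snapshot(host):
    """Diagnostic runbook: gather basic state from the failing host.
    (A real implementation would run these commands remotely.)"""
    return {"host": host,
            "steps": ["process list", "memory", "cpu", "recent logs"]}

# Dispatch table: which automated runbook handles which event type.
RUNBOOKS = {
    "service-failed": collect_snapshot,
}

def on_event(event):
    """Run the runbook registered for this event type and return the
    results so they can be sent to the first responder."""
    runbook = RUNBOOKS.get(event["type"])
    return runbook(event["host"]) if runbook else None

print(on_event({"type": "service-failed", "host": "web-07"})["host"])  # → web-07
```

Starting with diagnostic-only runbooks like this one keeps the automation safe; mitigation steps can be added once the team trusts the results.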
Summary
Use runbooks to define standardized procedures for your IT operators to use.
As your operators use the runbooks, you save on costs because each
operator no longer must spend time to create a procedure to fix a problem
that was solved before.

What is problem management?


Problem management is the practice of resolving the root cause of an
incident to minimize its adverse impact on the service and to prevent the
recurrence of similar incidents.
The objective of incident management is to restore the service as quickly as
possible, and it does so by finding an immediate tactical solution to the
incident. For critical incidents, follow with problem management to identify
and resolve the root cause. If you do not, you implicitly decide that it's
acceptable for the same incident to happen again.
It's important to dig deep and truly identify the root cause of the incident.
Otherwise, you fix only the symptoms and not the underlying cause, and
similar incidents might recur.
We need to minimize the adverse impact of incidents caused by errors, and
to prevent the recurrence of incidents related to these errors. A solution can
only be as good as the problem statement, and a blameless root cause
analysis is the problem statement for IT operations. A “postmortem”
debriefing should be considered first and foremost a learning opportunity,
not a fixing one.
Ingo Averdunk, IBM Distinguished Engineer, Cloud Service Management and Operations

Problem management techniques


5 Hows is an iterative interrogative technique that you can use to explore
the cause-and-effect relationships that underlie a particular problem. The
technique determines the root cause of a defect or problem by repeating the
question "How?" so that each answer forms the basis of the next question.
The goal of this root cause analysis (RCA) process is to identify why the
incident happened and to put sufficient measures in place so that a similar
incident doesn't occur again. These measures might target the application
itself, the application architecture, or the infrastructure and management
environment.
Following the RCA, hold a postmortem meeting. The purpose of the
postmortem is to find out what happened and to define actions to improve
the organization. It also can provide insights into how the team can better
respond to future incidents.
A culture of blameless postmortem is critical to allow the organization to
learn from past mistakes. Only when people are free from fear of punishment
and can openly share their mistakes will others learn from the experience
and be prevented from making the same mistakes.
Benefits of implementing problem management
 Reduction in incident volume. Effective use of problem
management techniques can stop an issue from occurring multiple times or
prevent the issue from happening in the future.
 Improved overall quality of IT services. Since people expect cloud
services to be constantly available, repeated problems can result in a loss of
confidence in the reliability of your application.
 Improved organization knowledge and learning. As the
organization implements a structured approach to problem management, it
can learn from previous mistakes and use that knowledge to prevent failures
or outages.
Problem analysis by using the 5 Hows
In the push to cloud adoption, organizations need problem analysis. As you
move to public, private, or hybrid clouds, broad collaboration and the ability
to connect with extended teams are essential. No longer does one team
provide the skills to restore service, ensure reliability, and maintain the
infrastructure.
People expect cloud services to always be available and to improve
continuously. Problem analysis can't be a part-time task; it's the driver to
eliminate repeated issues. You must fix the right issue the first time, as
repeated problems can lead to a loss of faith in the reliability of your
application.
To identify the contributing factors of a problem, use analysis techniques
such as the "5 Hows". The 5 Hows technique provides a simple, results-
focused approach. In this technique, the first responder or the site reliability
engineer uncovers the factors that led to an issue instead of determining
blame.
The 5 Hows technique
Problem analysis means understanding how a system degradation or
unavailability occurred. If your analysis is just another layer of managing
incidents, the reliability of your cloud applications will suffer. Be sure to
allocate the appropriate resources and time to complete a full analysis in
these situations:
 When issues occur more than once
 When an outage can affect many users
 When the system isn't functioning as designed
When an issue occurs and you try to understand how it happened rather
than assign blame, you create a positive and free-flowing informational
approach to problem solving. You can use the ChatOps process to invite
anyone who is affected by or who might have insight into the issue to
collaborate and ask the 5 Hows questions. This directed analysis process can
help you determine the action items that are needed to create change
requests.
Your 5 Hows might include fewer or more than five questions. The
technique works as follows:
1. State the issue and ensure that the group agrees with the problem
statement.
2. Share any data that was gathered and ask the first How.
3. The group responds. As it answers each question, the group must agree that
the answer is correct, that it supports the question, and that it leads to the next
question.
4. After you determine the contributing factors of the problem, create the
corrective actions as a change request, an enhancement, or a personnel change.
Whatever the result is, be sure that an action can be taken and that the action is
assigned.
This analysis produces entries in the backlog. The entries might be for the
application, the infrastructure, or the management environment. Following
the DevOps culture, the backlog entries must compete against functional
backlog elements. But because people act as a single team, an intrinsic
interest exists to address these nonfunctional requirements. The 5 Hows
technique is supported through ChatOps collaboration. Both techniques are
focused on reaching beyond typical team boundaries and finding the answer
through a collective working style.
Example
The following example shows 5 Hows questions and answers.
1. How did we go down?
Answer: The database became locked.
2. How did it become locked?
Answer: There were too many database writes.
3. How was it possible that we were doing too many database writes?
Answer: This scenario wasn't foreseen and it wasn't load tested.
4. How was it possible that the change load wasn't tested?
Answer: We don't have a development process set up for when we should
load test changes.
5. How was it possible that we don't have a development process for
when to load test?
Answer: We haven't done much load testing and are reaching new levels of
scale.
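The question-and-answer chain in the example lends itself to lightweight tooling: record each How with the answer the group agreed on, and end the analysis in an assigned action item. A minimal Python sketch follows; the class, field, and team names are illustrative assumptions, not part of any standard problem-management tool:

```python
from dataclasses import dataclass, field

@dataclass
class HowStep:
    """One question/answer pair in a 5 Hows analysis."""
    question: str
    answer: str

@dataclass
class FiveHowsAnalysis:
    """Records the agreed problem statement and the chain of Hows."""
    problem_statement: str
    steps: list = field(default_factory=list)

    def ask(self, question: str, answer: str) -> None:
        # The group must agree the answer is correct before it is recorded.
        self.steps.append(HowStep(question, answer))

    def action_item(self, description: str, owner: str) -> dict:
        # Every analysis ends in an actionable, assigned backlog entry.
        return {"action": description, "owner": owner,
                "contributing_factors": [s.answer for s in self.steps]}

analysis = FiveHowsAnalysis("Service outage: the database became locked")
analysis.ask("How did we go down?", "The database became locked.")
analysis.ask("How did it become locked?", "There were too many database writes.")
item = analysis.action_item(
    "Define a load-testing gate for write-heavy changes",
    owner="development-process team")
```

Keeping the chain as data makes the final backlog entry traceable back to every contributing factor the group agreed on.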
Identifying the contributing factors of a problem in a cloud
environment
Finding the contributing factors of on-premises issues might require being
methodical in your search. But the cloud can make it difficult to detect
factors because the system is orchestrated to proactively resolve itself. A
system might automatically move an application to another zone or restart
something to prevent an outage. This automation means that your analysis
must be proactive instead of reactive.
Because the system is designed and monitored to not exceed the service
level agreement thresholds, no alarms are raised. The only way to detect
and identify contributing factors is through analytics, such as trending or
anomaly detection. Problem management activities to support and maintain
the cloud need strong collaboration across the system resources and
developers to ensure that the right questions are followed by deep research
and analysis before you move to the next How.
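Trending and anomaly detection can start as simply as flagging samples that drift from the baseline without ever crossing an alert threshold. A minimal z-score sketch over latency samples follows; the 2-standard-deviation cutoff is an illustrative assumption, and production analytics would be far more sophisticated:

```python
import statistics

def find_anomalies(samples, z_threshold=2.0):
    """Flag samples that deviate more than z_threshold standard
    deviations from the mean -- a crude stand-in for the trending and
    anomaly-detection analytics that cloud problem analysis relies on."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    if stdev == 0:
        return []
    return [x for x in samples if abs(x - mean) / stdev > z_threshold]

# Latency in ms: mostly steady, with one spike that never raised an alarm.
latencies = [100, 102, 98, 101, 99, 100, 103, 97, 100, 180]
print(find_anomalies(latencies))  # prints [180]
```

The point is that the 180 ms sample surfaces from analytics, not from an alert, which is exactly the situation the self-healing cloud creates.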
Not all problems are worth solving. As incidents are the input to the analysis,
you must prioritize and classify records to ensure focus. This focus on
identifying and solving the right problems can have a positive impact on the
availability and reliability of the overall cloud infrastructure.
Define a strong problem management strategy that focuses on increasing IT
service availability while simultaneously increasing IT service quality and
decreasing problems. When the right problems are identified, deliver the
initial response in 24 hours and deliver the final findings within 5 days. By
using this type of service level agreement, you ensure correct focus on
problem activities and create a sense of urgency among the collaborating
groups and resources.
Postmortem
After you answer the 5 Hows and develop an action plan, document the
incident and share it openly. Include this information:
• What happened: The timeline of events and the effects on the users of
the service
• How it happened: The results of the 5 Hows analysis
• What the resolution was: The immediate action that was taken to
resolve the problem
• What countermeasures will be taken: The action plan to ensure that
the problem doesn't happen again
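One way to keep postmortems consistent is to treat the four sections above as a required template and refuse to publish an incomplete document. A minimal sketch follows; the section names mirror the list above, while the plain-text rendering and the helper name are assumptions:

```python
POSTMORTEM_SECTIONS = [
    "What happened",                       # timeline and user impact
    "How it happened",                     # results of the 5 Hows analysis
    "What the resolution was",             # immediate action taken
    "What countermeasures will be taken",  # plan to prevent recurrence
]

def render_postmortem(title: str, sections: dict) -> str:
    """Render a blameless postmortem as plain text, one heading per
    required section. An incomplete postmortem raises an error rather
    than being published with gaps."""
    missing = [s for s in POSTMORTEM_SECTIONS if s not in sections]
    if missing:
        raise ValueError(f"incomplete postmortem, missing: {missing}")
    lines = [title, "=" * len(title), ""]
    for heading in POSTMORTEM_SECTIONS:
        lines += [heading, "-" * len(heading), sections[heading], ""]
    return "\n".join(lines)
```

The same structured content can then be rendered formally for a business-to-business audience or informally as an email or blog post.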
In a business-to-business context, prepare formal documents to distribute.
For user services, share the information more informally in an email or a blog
post. You must be believed to be heard.
Critical success factors
When you conduct a 5 Hows analysis, remember these factors:
• Use and expand the roles that drive analysis (first responder and site
reliability engineer) to ensure timely and consistent participation from
incident management and across the infrastructure teams.
• Focus efforts on people change management as people adapt to their
roles in the 5 Hows problem technique and collaborate through ChatOps.
• Broadly collaborate across many application resources, departments,
organizations, and teams throughout the business to ensure a cohesive
analysis of the problem.
• Skillfully apply the 5 Hows technique to the standard process of
analysis.
• Interconnect with other processes, such as incident, configuration, and
change management.
• Encourage developers and operations personnel to be proactive in
thinking about causes as they deliver applications and code and enable the
infrastructure. If they work in this mindset, they can build support and
logging into the structure of their work and make them part of
the "blameless" analysis.
• Set service level agreement standards, such as an initial response in 24
hours and final findings within 5 days.
Summary
The 5 Hows technique is effective because it uses the ChatOps approach,
interviewing many people for answers and conducting those interviews in a
collaborative way. This approach ensures that no single-threaded response
or diagnosis occurs within a single team. The technique enables broad-
reaching collaboration to ensure that the right questions are asked. By using
these resources to drive the problem management processes for cloud
support, you follow a lean and focused approach to problem analysis.
Address the root cause of an incident while managing technical debt
After you complete a root-cause analysis, identify countermeasures to
prevent similar incidents. The goal is to prevent issues, but preventive
actions are typically expensive in terms of time and implementation. Your
action plans must include tactical short-term fixes and strategic
improvements.
Formulate a balanced action plan that consists of actions that fall into any of
these categories:
• Detective: Improve the monitoring and instrumentation components to detect
the issue faster. For example, you might add monitors with thresholds that support
early detection.
• Investigative: Provide improvements to isolate and diagnose issues faster.
For example, you might improve logging to document input and output of API calls.
• Corrective: Provide improvements to correct malfunctions faster. For
example, you might use runbooks to automatically reroute traffic.
• Preventive: Improve the underlying application code, architecture, or both.
For example, you might perform input validation.
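Two of these categories translate directly into code. The sketch below pairs an investigative improvement (logging the input and output of an API call) with a preventive one (validating input before it reaches the data store); the function, the required `id` field, and the size limit are illustrative assumptions:

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("api")

def logged(fn):
    """Investigative: record the input and output of every call so the
    next incident can be diagnosed from the logs."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.info("call %s args=%r kwargs=%r", fn.__name__, args, kwargs)
        result = fn(*args, **kwargs)
        log.info("return %s -> %r", fn.__name__, result)
        return result
    return wrapper

@logged
def write_record(payload: dict) -> bool:
    # Preventive: reject malformed or oversized input instead of letting
    # it reach the database (the 1 KB limit is illustrative).
    if "id" not in payload:
        raise ValueError("payload must contain an 'id' field")
    if len(str(payload)) > 1024:
        raise ValueError("payload too large")
    return True  # stand-in for the real database write
```

Detective and corrective actions usually live outside the application, in monitoring thresholds and runbooks, which is why they don't appear in the sketch.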
To make the solution more robust, balance the actions between the service
provider and the consumer of the failing service. The service provider must
take preventive measures so that the issue doesn't recur. The service
consumer must implement more fault tolerant measures so that the
consuming application is less affected by disruption of the dependent
service. As an example, the service consumer might implement the Circuit
breaker pattern.
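As a sketch of that consumer-side measure, a minimal circuit breaker opens after a configured number of consecutive failures and returns a fallback instead of repeatedly calling the broken dependency. The failure threshold and reset timeout below are illustrative assumptions, not prescriptive values:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open after `max_failures`
    consecutive failures; half-open again after `reset_timeout` seconds."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, fallback=None):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback  # open: fail fast, spare the broken service
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback
        self.failures = 0  # success closes the breaker again
        return result

breaker = CircuitBreaker(max_failures=2, reset_timeout=60.0)

def flaky_service():
    raise RuntimeError("dependency down")

breaker.call(flaky_service, fallback="cached")  # failure 1
breaker.call(flaky_service, fallback="cached")  # failure 2: breaker opens
breaker.call(flaky_service, fallback="cached")  # fails fast with fallback
```

Returning a cached or degraded response keeps the consuming application partially useful while the failing service recovers, which is the point of the pattern.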
After you identify the tactical and strategic actions to address the root cause
of your issue, add them to the backlog that you use to track your team's
work. For high-severity issues, these items take priority over any
functional features. If you give them a lower priority, you're implicitly
acknowledging that it's acceptable for the incident to happen again.
Balance new function against technical debt
One risk is that your team might address the tactical backlog items while
continually deprioritizing the strategic backlog items. Technical debt is a
concept that reflects the implied cost of extra rework that is caused by
choosing a solution that's easy to implement instead of a better approach
that takes longer.
Make sure that you measure and review the burn rate of incident-related
actions and move toward reducing the technical debt of those actions.
Executive buy-in and support are often required to resolve the conflicting
goals of velocity and reliability.