GUIDE TO CLOUD NATIVE DEVOPS
The New Stack
Guide to Cloud Native DevOps
Alex Williams, Founder & Editor-in-Chief
Core Team:
Bailey Math, AV Engineer
Benjamin Ball, Marketing Director
Chris Dawson, Technical Editor
Gabriel H. Dinh, Executive Producer
Joab Jackson, Managing Editor
Judy Williams, Copy Editor
Kiran Oliver, Podcast Producer
Lawrence Hecht, Research Director
Libby Clark, Ebook Editor, Editorial Director
Michelle Maher, Editorial Assistant
BUILD
Contributors
01 - Doing DevOps the Cloud Native Way
02 - Cloud Native DevOps Roles and Responsibilities
03 - Cloud Native DevOps Comes to Security and Networking
CloudBees: Cloud Native DevOps with Jenkins X
Bibliography
DEPLOY
Contributors
04 - Culture Change for Cloud Native DevOps
05 - Role of DevOps in Deployments: CI/CD
06 - Case Study: The Art of DevOps Communication, At Scale and On Call
Pulumi: Filling in the Dev and Ops Gap in Cloud Native Deployments
Bibliography
MANAGE
Contributors
07 - Creating Successful Feedback Loops With KPIs and Dashboards
08 - Testing in Cloud Native DevOps
09 - Effective Monitoring in a Cloud Native World
KubeCon + CloudNativeCon: CI/CD Gets Standardization and Governance
Bibliography
Disclosure
With scale in mind, it just made sense to focus on the cloud native DevOps
practices and workflows that practitioners are developing for the outer
dimensions of at-scale application architectures. But what exactly those “cloud
native DevOps” practices are wasn’t clear. In this third and final book in our
cloud native technologies series, we examine in detail what it means
to build, deploy and manage applications with a cloud native approach.
It’s a continual adaptation that DevOps practices help manage. Such practices
have been built upon and refined for over a decade in order to meet the deeply
complex challenge of managing applications at scale. And DevOps is now
undergoing another transformation, buoyed by the increasing automation and
transparency allowed through the rise of declarative infrastructure,
microservices and serverless architectures. This is cloud native DevOps: not a
tool or a new methodology, but an evolution of longstanding practices.
Practices have evolved quickly. There is a “new guard” of full stack developers
— in the words of our friends at LaunchDarkly — who all have some part in
developing and managing services. Developers now count on approaches that
treat the architecture as invisible, allowing them to program the resources
according to the workloads their team is managing. Similarly, the operations
story is quickly changing as the role of site reliability engineer (SRE) grows
and becomes more associated with overall services management.
Services are now at the core of how modern businesses work. All the auto-
navigating that a phone manages, the immediate payment from an application,
the secure connection back to the bank — at their technical depths all are the
result of automated and declarative technologies developed on distributed
architectures by teams of developers and software engineers.
Shortening the feedback cycle between developer and end user experience
speeds application development and provides critical, actionable business
information in a timely way. The operations role is also growing, as a result.
SREs complement those on operations teams, who together develop
infrastructure software, technologies and services. The service is the product
in today’s world, aligning their roles as both seek greater efficiency and
observability in the feedback cycle to improve the experience of the services
they provide.
The workflows that modern, cloud native teams adopt are increasingly defined
by the cycle of inner feedback loops and outer-loop management practices.
The path from code commit to production and beyond tells a story in itself of
how open software has largely been developed according to continuous
feedback cycles. Continuous integration platforms, continuous delivery
technologies, monitoring software — they and many other categories have
been built to continually drive the evolution of ever more accelerated
software development.
But there always have to be checks on workload behavior: tests and more
tests to find the answers. In the end there are always more options for what
can be tested and analyzed; performing every possible test is an impossible quest.
Gaining deeper views into error handling and overall incident management is
a hot touch point where the complexity of automated and declarative
infrastructure can have its own chaos. The only answer is knowing how to
manage it.
The people who define and evolve new practices and technologies for at-scale
application architectures once had no choice but to build their own tools to
manage unprecedented complexity. Today, the people who build open source
projects have a deeper and broader community who are familiar with the
technologies for at-scale architectures. New tools are plentiful, and open
source communities continue to produce more.
The connections in all of this complexity are the practices that people follow
to build the software that runs the internet. DevOps will continue to evolve
to meet the practices teams follow and the requirements of increasingly
different forms of workloads as cloud native technologies also evolve.
Workflows will continue to change, influenced by newer techniques, largely
developed in open source communities. The form that DevOps takes for any
organization is first about the team. The team and its trust-oriented
philosophies will determine the pace of automation and the iterative
improvements that come through testing, delivery and management of
services across distributed infrastructure.
Libby Clark
Ebook Editor, Editorial Director
Alex Williams
Founder and Editor-in-Chief, The New Stack
Pulumi provides a Cloud Native Development Platform. Get code to the cloud
quickly with productive tools and frameworks for both Dev and DevOps. Define
cloud services — from serverless to containers to virtual machines — using
code in your favorite languages.
SECTION 1
BUILD
Learn how containers, Kubernetes, microservices and serverless technologies
change developer and operations roles and responsibilities.
First there was the wheel, and you have to admit, the wheel was
cool. After that, you had the boat and the hamburger, and
technology was chugging right along with that whole evolution
thing. Then there was the Web, and you had to wonder, after
the wheel and the hamburger, how did things make such a sudden left turn
and get so messed up so quickly? Displaying all the symptoms of having spent
30 years in the technology news business, Scott Fulton (often known as Scott
M. Fulton, III, formerly known as D. F. Scott, sometimes known as that loud
guy in the corner making the hand gestures) has taken it upon himself to move
evolution back to a more sensible track. Stay in touch and see how far he gets.
“Cloud native” approaches use containers as units for processes, allowing the
running of sophisticated, fast and distributed infrastructure for developing,
deploying and managing at-scale applications. The philosophy and practice
of building and running cloud native production systems means building and
running scalable applications in public, private and hybrid cloud
environments. It embodies a “pan-cloud” approach that follows a service level
agreement (SLA), which allows for interoperability. It is exemplified by tools
built with this interoperability in mind.
Because applications are deployed to the cloud, as well as built, tested and
staged there using cloud-centric tools, it’s tempting to define “cloud native
DevOps” as DevOps practices applied in a public cloud environment.
However, the application architecture, tools and workflows — and the intent
with which they’re used — matter more than the location when it comes to
cloud native DevOps.
“A common misconception is that ‘cloud native’ means ‘on the cloud.’ It might
be a bit of a misnomer. Lots of organizations have ‘on prem’ data centers that
use Docker, Kubernetes, Helm, Istio, serverless and other cloud native
technologies,” Dan Garfield, chief technology evangelist for cloud native
continuous integration/continuous delivery (CI/CD) platform Codefresh, said.
“Cloud native is more of a mindset for how application definition and
architecture looks than a description of where those services are running.”
Adopting cloud native DevOps is a journey that all organizations should undertake, but one which will yield unique methods
for each one, and will result in the creation of pipelines and automation chains
that not only fit each organization exclusively, but which will grow and evolve
along the way.
DevOps — whether it’s cloud native, or not — is about your team and its
workflows. Maybe there are ways in which a provisioning system, similar to a
cloud native approach like Amazon’s, can help you create pipelines for part of
the automation chain, or, in a different context, for a pipeline that pertains to
a limited span of the application life cycle. But no cloud native chain can apply
to all parts of that life cycle, especially when an application artifact emerges
from testing. You may be jump-starting something with a cloud native, self-
provisioning pipeline, and you can’t exactly say that’s not a valuable thing, but
it’s not the entire toolchain, and thus, it’s not really Dev plus Ops. At the same
time, just because a tool or platform does not encompass DevOps end to end,
or because both departments use the same tool in different ways, does not
mean that DevOps itself is not end to end.
With containers, developers can focus on writing code without worrying about the system on
which their code will run. Specifications for building a container have become
remarkably straightforward to write, and this has increasingly led to
development teams writing these specifications. As a result, development and
operations teams work even more closely together to deploy these containers.
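As an illustration of how straightforward such a specification can be, here is a minimal container spec for a hypothetical Node.js service (the service name and file layout are illustrative, not from the text):

```dockerfile
# Minimal container specification for a hypothetical Node.js service.
FROM node:18-alpine
WORKDIR /app
# Install dependencies first so this layer is cached between builds.
COPY package*.json ./
RUN npm ci --omit=dev
# Copy the application source and declare how to run it.
COPY . .
EXPOSE 8080
CMD ["node", "server.js"]
```

A specification this small is why development teams increasingly own it themselves.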
But the real game changer has been the CNCF’s open source container
orchestration project, Kubernetes. As the de facto platform for containerized,
cloud native applications, Kubernetes not only lies at the center of the DevOps
transformation, but also enables it by abstracting away the details of the
underlying compute, storage and networking resources. In addition to improving
traditional DevOps processes, along with the speed, efficiency and resiliency
commonly recognized as benefits of DevOps, Kubernetes solves new problems
that arise with container and microservices-based application architectures.
The New Stack’s ebook, “CI/CD with Kubernetes,” goes into more depth on the
evolution of DevOps alongside cloud native application architectures.
Kubernetes also automates rolling updates, enabling teams to deliver
applications and features into users’ hands quickly.
Rolling updates and native rollback capabilities are just one example of how
Kubernetes has evolved and improved DevOps workflows. Before Kubernetes, if
you wanted to deploy something, a common deployment pattern involved the
server pulling in the newest application code and restarting your application.
The process was risky because some features weren’t backwards compatible
— if something went wrong during the deployment, the software became
unavailable. Kubernetes solves this problem with an automated deployment
rollback capability that eliminates large maintenance windows and anxiety
about downtime.
As a result of these and other benefits of Kubernetes, operations are now less
focused on the infrastructure and more on the applications that run light
workloads. The combined effect is increasingly automated processes that yield
faster, more reliable software delivery.
“This is not just about technology. It’s about business and people,” Corley told
attendees. “To me, this is the essence of what DevOps is: We’re all
participating in activities that can continuously improve all functions of
business, and involving all employees. You hear about the breaking down of
silos, the integration of business … By improving standardized activities and
processes, we can become more efficient and reduce waste.”
The cloud native aspect of DevOps becomes more practical for businesses,
Corley continued, once it has been freed from the constraints of traditional,
on-premises systems management. Theoretically, many of these
“undifferentiated heavy lifting” work processes may be automated, in order to
expedite them. But cloud native development, he asserted, advances the notion
that busy work may be eliminated altogether. That elimination would, in turn,
radically transform the requirements of DevOps, by way of changing the
definitions of what Dev teams do and what Ops teams do.
Organizations that have begun to further break down their microservices into
functions that they then run on Functions as a Service (FaaS) and serverless
platforms may come closest to realizing cloud native DevOps in its purest
form. Though enterprises are mostly still testing serverless platforms for
specific use cases, those tests are speeding the move to new organizational
structures in which the role of traditional IT departments will disappear, said
James Beswick, co-founder of Indevelo, which creates web and mobile
applications completely on serverless architectures. Chris Munns, principal
developer advocate for serverless at AWS, has gone so far as to predict that by
2025 most companies that are not cloud providers themselves will not need
operations roles. 6 This “NoOps” future is still far away, but their predictions
highlight just how closely cloud native DevOps and serverless delivery are
bundled together.
“Really, what you’re talking about at this point [with serverless] is a platform
that handles the release management, the life cycle management, the
telemetry, instrumentation and the security around that component,” JP
Morgenthal, chief technology officer (CTO) of DXC Technology, said. “It’s really
a PaaS [Platform as a Service], but more so.”
“It’s a very different model of computing than what we’ve been doing in the
past,” Morgenthal said, “and frankly, why would I want to then also have to go
and invest at least hundreds of thousands of dollars in setting up all of that
infrastructure to do the same exact thing?”
Viewed from this perspective, one could argue that, for a DevOps platform to
be completely effective, it actually must be cloud native — it should be
constructed and located in an environment that is, from the outset, accessible
to all, yet set apart from any one department’s silo or exclusive oversight. The
point isn’t to automate away operations roles, but to further blur the lines
between Devs and Ops with more cross-functional teams. Cloud native
technologies are changing workflows and redefining Dev and Ops roles by
increasing automation and bringing IT more closely aligned with end users
and business objectives. We cover these roles in more detail in the next chapter.
[FIG 1.1 plots five workflow automation use cases — communication in
distributed systems, business process automation, distributed transactions,
orchestration and decision automation — along two axes: the driver of the
automation (business vs. IT) and the duration of a workflow instance.]
Source: “5 Workflow Automation Use Cases You Might Not Have Considered,” Bernd Rücker, The New Stack, April 9, 2018. © 2019
FIG 1.1: Bernd Rücker, co-founder and developer advocate at Camunda, illustrates
how each use case ranks along two dimensions: whether business or IT is the main
driver of the automation, and the duration of a workflow instance.
2. Business Process Automation.
Business processes are often long running in nature. They might involve
straight-through processing/service orchestration; waiting for internal or external
messages or timers, such as the promised delivery date or other events; human
task management; order fulfillment; and more.
3. Distributed Transactions.
You cannot rely on atomicity, consistency, isolation and durability (ACID)
transactions in distributed scenarios. But some database providers, such as
Amazon DynamoDB and PingCAP’s TiDB, have started to offer ACID
capabilities. ACID is what you experience from working with a typical
relational database — begin transaction, do some stuff, commit or rollback.
Attempts like two-phase commit — XA transaction — bring ACID to
distributed scenarios, but are not really used much in real life as they do not
scale: You still have to solve the business requirements of having a one-or-
nothing semantic for multiple activities.
4. Orchestration.
Modern architectures are all about decomposition into microservices or
serverless functions. When you have many small components doing one thing
well you are forced to connect the dots to implement real use cases. This is
where orchestration plays a big role. It basically allows invoking components
— or services, activities or functions — in a certain sequence. Examples
include: one microservice invokes three others in a sequence, or multiple
serverless functions need to be executed in order.
5. Decision Automation.
Decision management is “the wingman” of workflow automation. Of course, it
is a discipline on its own, but from the workflow automation perspective it is a
great tool to extract business decisions and separate them from routing
decisions. Examples include: automated evaluation of eligibility or approval
rules, validation of data, fraud detection or risk rating, and more.
And for that reason, Frost argued, the whole DevOps transformation thing may
be a waste of time.
“If you spend the next three years implementing DevOps, will you actually
be out of date?” asked Frost rhetorically. “Because while you’re focusing on
DevOps, there’s a whole host of companies that are going to be disrupting
the market. We’re all aware of these sorts of disruptors … So what you’ve got
to ask yourself is, should we be spending our time on DevOps, or should we
be spending our time on getting business-critical functionality out as quickly
as we can?”
Each organization must make this decision based on its own needs, teams and
processes. Microservices and serverless applications come with additional
complexities that must also be considered. The user experience (UX) of the new
environments may make some use cases better suited than others for
developer teams and the workflows they follow. For this reason, taking an
incremental, case-by-case approach is advisable.
That list sounds dangerously close to a recipe for what some would call
“undifferentiated heavy lifting.” And this is at the heart of Capgemini’s
counterargument: Software development is evolving toward the ability to
produce a function or service based almost solely upon intent, without regard
to the requirements of its infrastructure. It’s a methodology that is, by design,
more Dev and less Ops. It may be the separation of underlying functions which
makes the serverless approach, and to some extent the microservices approach,
valuable and viable. Why bother, the counterargument asks, investing in the
time and effort needed to integrate tasks that are no longer relevant?
“The beautiful thing about being a software developer is, everything that I’m
reaching out towards is controllable by code in some form or fashion,” said R.
A cloud native DevOps platform could include idea creation, which may better
enable that idea to take root in the cloud. Thus it would be the ability for
people to devise ideas and to innovate that could not only preserve their jobs
but bring them into a closer-knit loop. Still, the focus of such a platform
should be on enabling fast, iterative development that’s consistent across
multiple environments, said Marc Holmes, former chief sales and marketing
officer at Pulumi. “The first step for a cloud native DevOps platform would be
the consistency of the model to provide inner loop productivity [for developers]
and outer loop management.”
That is not to say it should prune failed processes from the tree of development, but
rather suspend their development until circumstances change. One or more
bad ideas may converge into, or otherwise catalyze, one good one.
That sounds like the classic question of which tool lies at the center of the
ecosystem. The various open source vendors will likely agree to disagree on
which tool gets to play conductor. Yet, many have come together to form an
open source foundation for this purpose. Still, even if the role of conductor
remains for each organization to designate for itself, will it matter whether or
not that conductor is “cloud native?”
“I don’t think it matters,” Croy answered. “I’m comfortable stating this now:
There is not a single, successful business in the world right now that’s not
relying on cloud native technologies in some form. … To me, that ship has
sailed. I think the idea of owning everything is now passé, from a corporate
standpoint.”
What does matter, offers Pulumi’s Holmes, is a consistent model that works
across the entire CI/CD pipeline and addresses the need for general integration
between the various tools. “Working in concert isn’t possible without a
consistent model and leads to CI/CD headaches that aren’t solved by pipelines,”
he said.
And yet, for cloud deployments, automation does not necessarily replace any of
the roles of the DevOps team members. Even if the work is shoved off to a
cloud provider’s operations team by employing managed services, internal IT
teams are still critical for DevOps success. Dev teams are taking on some of the
tasks previously handled by more traditional operations roles such as
configuration and deployments, and thus reducing the Ops cost of every
deployment. Meanwhile, Ops roles change to manage the complexity of the
system as a whole. Automation can be difficult to set up and maintain — and
still requires considerable people power. So operators spend their time coding
compliance, security, performance and SLA metrics and making sure all
infrastructure is available via automated self-service.
In short, developer and operations roles and responsibilities are changing with
cloud native technologies. A decade ago, Lachhman had “no purview as a JEE
[Java Enterprise Edition] developer in 2008 to much of the build — much less
the networking and storage that was needed. Fast forward to today, and the
Dev team can be packaging up connectivity items such as Istio and defining
storage mounts inside a container,” Lachhman said. “The traditional system
engineering skills are moving into Dev, thanks to cloud native architecture.”
Despite the added complexity, addressing the challenges associated with cloud
native deployments and making the necessary changes in DevOps roles are not
necessarily harder to do compared to traditional environments, Mitchell
Hashimoto, founder and CTO of HashiCorp, said. However, the ability to deploy
more applications quickly and deliver more business value, which comes with
cloud adoption, also strains existing systems and teams.
FIG 2.1: A centralized IT team, cross-functional team, and dedicated DevOps team
are the three most common ways organizations structure DevOps now. But site
reliability engineering will be more common in the future.
Several of the experts The New Stack consulted agree that
SRE is the future of IT operations.
“SRE straddles the line between the ‘Dev’ and ‘Ops’ sides of DevOps teams,
both writing code and supporting existing IT systems,” writes Kieran Taylor,
senior director of product marketing at CA Technologies and author of
“DevOps for Digital Leaders.” 13 Traditional IT Ops teams take a “find and fix
faults” approach, waiting for an alert to fire and then dispatching experts to
troubleshoot the issue. SRE focuses on how to deliver the best experience
possible across every touchpoint of customer engagement.
Site Reliability Engineers (SREs) did exist a decade ago, but they were mostly
inside Google and a handful of other Silicon Valley innovators. Today, however,
the SRE role exists everywhere. From Uber to Goldman Sachs, everyone is now
in the business of keeping their sites online and stable. Microservices and SREs
evolved in parallel inside the world’s software companies. The SRE role
combines the skills of a developer with those of a system administrator,
producing an employee capable of debugging applications in production
environments when things go completely sideways. Some would argue that
SREs’ fuller perspective on the resources being managed doesn’t provide the
granular view that DevOps engineers need to manage individual
services. But in an organization and infrastructure as large as Google, it’s
impossible for an SRE to have a complete view. Still, SREs provide context that
a DevOps team working closer to the services may not have. 14
In the case of microservices, the development team can “be narrower in focus
than in the past,” Steve Herrod, former chief technology officer of VMware and
now managing director at General Catalyst, said. Each microservice team can
thus develop and deploy their code independently of the rest of the application,
he said. “In many cases, the site reliability engineering role has grown to
bridge the gaps and coordinate monitoring and debugging across the whole
application,” Herrod said. “Great SREs thus require great technical skills, but
In the serverless space, the Dev and Ops engineers still have different roles to
play as well. “Let us not forget, that doing serverless at scale entails multiple
layers of complex configuration,” Nuatu Tseggai, director of solutions
engineering for Stackery, said. “[It also requires] expertise that relates to
networking and distributed systems, of which Ops engineers have built up
significant experience implementing solutions for over the past few decades.”
However, at least semantically, DevOps does not define a team or a role, but
instead, represents a culture, or a set of practices. A “team does both
development and operations,” Marko Anastasov, co-founder of Semaphore,
said. It’s about fast, repeatable processes and continuous feedback loops to
bring the team and its outputs closer to the customer.
“Whatever definition you might have, you are not wrong because there is no
official definition of DevOps,” Rodriguez Pardo said. “This is quite a problem, actually,
because everybody is talking about it and everybody means something else.”
Everyone agrees DevOps combines Development and Operations, but that’s where clarity ends. Rodriguez Pardo’s favorite DevOps
definition comes from VersionOne: “DevOps is not a tool or a process, but a
practice that values continuous communication, collaboration, integration and
automation across an organization.”
FIG 2.2: CI/CD workflows vary by organization, but follow a similar pattern depicted
here, progressing from continuous integration to continuous delivery to
continuous deployment.
“It’s not about how continuous you are. It’s about what are you heading for,”
she said. “Continuous everything” can be too ambitious for most companies.
However, advancements, like virtualization and containerization, mean that
every developer can get a “pretty operational environment.” This consistency
between development and production environments acts as an important
DevOps accelerator.
Automation is not an option, agility consultant Rodriguez Pardo said. Just don’t
forget the human factor of that automation. A CI/CD pipeline has to include
customer feedback and consider the distance and connection points to them. A
particular process may need to be adapted for a customer, for example, when a
developer needs to be on premises with a customer. She calls this “manual
delivery,” which is the antonym of automation, but doesn’t negate its value,
since the whole point of DevOps is making customers happy.
[FIG 2.3 depicts a continuous loop spanning Dev and Ops: continuous design,
continuous integration and test, continuous release, continuous delivery,
continuous deployment, continuous monitoring and collection, and continuous
analysis, with automation increasing from legacy to automatic deployment types.]
Source: “DevOps and People: Where Automation Begins!” Almudena Rodriguez Pardo, Agile Tour London, 2018. © 2019
FIG 2.3: Successful cross-functional teams create many checkpoints for feedback
throughout the software development life cycle. Each stage is also subject to
continuous improvement.
“By bringing your software personally to your customer, you sit with him, you
hear his comments, you hear him bitching about something he doesn’t like.
You can have empathy and learn pain points,” Rodriguez Pardo said.
Whether Dev teams use Scrum, Kanban or something else, Rodriguez Pardo
recommends including operations on those teams in order to unite around
shared objectives. This helps create support and cross-functionality. The wider
you map that feedback loop, the more operations are included in hearing it.
And with everybody listening, you uncover sometimes absurd incongruities.
On one team, Devs had key performance indicators (KPIs) to make as few
mistakes as possible, while testing had KPIs to find as many mistakes as
possible. Department managers then did something “revolutionary,” Rodriguez
Pardo quipped: they talked to each other and decided on common goals.
FIG 2.4: This illustration of Spotify’s agile engineering culture emphasizes small,
cross-functional teams with autonomy to make decisions within the bounds of their
mission.
Source: Spotify Labs as presented in “DevOps and People: Where Automation Begins!” Almudena Rodriguez Pardo, Agile Tour London, 2018. © 2019
To achieve this, teams can start with small testing experiments for developers,
or give everyone some basic workshops in Jenkins to cross-pollinate
continuous integration. The purpose is to attract some interest and give them a
taste, applying ideas like communities of practice and other cross-company
guilds. The end goal is best illustrated by the Spotify diagram above. 18
“With the operators’ support, they are aligned. They are mature within the
teams and understand what they have, what they don’t have, and where they
can collaborate,” Rodriguez Pardo said, adding that everyone is autonomous,
secure and heading toward delighting the customer.
In the end, DevOps is just about people, both the customers and the teams.
When hiring for the cultural change of DevOps, you are no longer just hiring
the best in a certain language; you are hiring a personality, someone who can
adapt to change and work well not only with her own team, but with other
teams as you head toward cross-functional knowledge.
When HSBC shifted to cloud native deployments, the bank employed a new
hiring strategy which ultimately helped lead the DevOps transformation there
as well. Cheryl Razzell, global head of platform digital operations for HSBC
operations, described during DevOps World | Jenkins World 2018 in Nice,
France, how the banking firm’s DevOps embraced a new way of thinking.
“There’s been a massive cultural shift to adopt agile working DevOps, but I
think the bank really wants change so it’s embracing this change, so it’s open
to challenging new ideas,” Razzell said. “HSBC acquired people from outside of
the banking sector on purpose because they want to change their culture from
within and they want to bring out new ideas.”
HSBC recruited people from Google, Amazon and Microsoft, from startups, and
from different backgrounds and sectors outside of banking to complement the
teams that already understood how the bank’s
networks worked. “There’s synergy between the teams with new ways of
working versus some of the bank’s legacy products and some of the bank’s
processing,” Razzell said. “I think you need both to complement the journey.”
As stated previously, the facility and speed at which software can be deployed
on cloud native platforms require a shift in roles, so that operations-related
tasks do not get in the way of delivery cycles. Central to accomplishing this is
the cloud native concept of declarative infrastructure. Automating everything
is made much easier if your infrastructure is managed as code and
configuration by creating consistencies and parity across all of your
environments. Accomplishing this is dependent on your specific
implementation, but cloud native development and consulting company
Kenzan typically uses some sort of scripting, such as with Terraform, and also
employs dedicated infrastructure pipelines to automate the deployment. This
ensures that individuals are never hand-tweaking environments without the
proper checks and balances in place. 19 Next generation, cloud native
infrastructure-as-code tools such as Pulumi and HashiCorp’s Terraform aim to
improve the developer experience by creating an interface to write code and
configure it to work across multiple environments.
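The declarative idea can be sketched in a few lines of plain Python: desired state is data, current state is data, and the tool computes the actions needed to converge. This is an illustrative toy, not how Terraform or Pulumi are implemented, and the resource names are invented.

```python
# Minimal sketch of declarative infrastructure: the desired state is declared
# as data, and the "tool" computes the create/update/delete actions needed
# to make reality match it. Resource names are hypothetical.

def plan(desired: dict, actual: dict) -> list:
    """Return the actions needed to make `actual` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            actions.append(("update", name, spec))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name, None))
    return actions

desired = {"web": {"instances": 3}, "db": {"instances": 1}}
actual = {"web": {"instances": 2}, "cache": {"instances": 1}}

for action in plan(desired, actual):
    print(action)
```

Because the plan is computed rather than hand-typed, every environment built from the same declaration comes out the same, which is the parity the text describes.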
“It is the mission of modern IT operators to get out of the critical part of the
continuous release process. This requires adopting a declarative paradigm to
managing data center and cloud infrastructure, instead of manually
responding to ad hoc requirements,” EMA’s Volk said. “This declarative
management approach also requires IT operations to understand the basic
principles of coding, without becoming a pure-blooded software developer.”
To put this another way, from an article on the changing role of developers:
“If you’re not already releasing features to a small segment of your customer
base, you’re already behind,” writes David Hayes, former product leader at
PagerDuty.
“Those involved in cloud native development life cycle patterns and practices
assume they’re operating in the silo-less environment. But many organizations
are still trying to figure out how to properly integrate all of the teams within
DevOps,” Brian Dawson, a DevOps evangelist at CloudBees, said. “For your
standard application, there are operations and security [people], for example,
who haven’t yet figured out how to inject themselves into cloud native
development.”
The change in operations roles that comes with this new abstraction of
storage, network and compute resources isn’t always clear. While some
technology thought leaders have declared a “NoOps” future, the reality is a lot
more nuanced. The next section examines the operations role of cloud native
DevOps in more detail.
“The role of ‘Ops’ hasn’t been eliminated, but rather, it’s been shifted,”
Stackery’s Tseggai said. “Exactly how and where is a matter of the
engineering culture and vision within the organization. If an engineering org
were to foolishly rebrand DevOps into NoOps, they would be running a real
risk of alienating engineers whose skills are in high demand.”
The new challenges with cloud native architectures are leading to new ways of
thinking about storage, with persistence as the foremost concern. That means
there must be data backup and recovery solutions. Companies such as
Portworx and StorageOS are at the forefront in addressing these concerns with
their container backup approaches. What this really means is that the IT team
needs to take more time considering storage than it used to, when delivering
storage involved installing and formatting hard drives and telling the Ops
team how much storage was available.
First, a bridge call happens that just generates a lot of questions. The lead Dev
escalates it to the scrum master. It can’t be figured out. The questions get bigger
and bigger and now everyone is on the phone and on a growing ticket —
network engineers, business managers, app managers, lead developers, an SRE,
system administrators, middleware managers, the senior vice president (SVP),
chief of staff, two technical VPs, more middleware folks … the list goes on.
Finally, the company adds a customer engagement manager to the case “who
has to test it all instead of trusting SRE,” Edwards said.
When all is fixed, the next day the SVP wants to know:
● What happened?
● Whose fault is this?
● What processes and approvals can we add to keep this from happening
again?
But amid all of these shiny tools and cultural transformations, operations was
just kind of ignored, left with four familiar problems:
● Silos.
● Ticket queues.
● Toil.
● Low trust.
1. Cross-Company Silos
Edwards calls this just a different way of working where silos, or traditional
departmental divisions, are torn down and everyone in IT shares:
● A common backlog.
● Common tooling.
● A common context.
Ideally, everyone uses tools the same way and everyone shares a common set
of priorities. This way, when teams need to work together, they align on
capacity, context and process. This also enables feedback loops, learning and
higher quality work. Edwards continues that this isn’t just happening in
development and operations, but in the environment, network and on
customer teams.
2. Ticket Queues
“We create a ticket queue to solve silos,” Edwards said. He referred back to
the story above, saying “Then I wait for something. I don’t know what I’m
asking. I’m not a firewall engineer but I’m typing into a blank box trying to
explain what I need.”
He says queues not only slow down business processes and DevOps work;
they are also expensive. Donald G. Reinertsen’s talk on his book,
“The Principles of Product Development Flow”, lays out the problems that
queues create:
After all, the longer people have to wait for something, the more detached they
become. The end goal becomes split and obfuscated. Each goal becomes like a
snowflake — unique, brittle, technically acceptable, but not reproducible. This
makes it much harder to automate things.
“The only thing worse than automating things that are broken is automating
something that’s just a bit off,” Edwards said, pointing to ticket queues as a
huge contributor to bottlenecks.
Ticket queues are further aggravated — and aggravating — when they push
primary management focus to protecting team capacity and when operations
repeatedly says no. He says the latter is interpreted as Ops being afraid of
change, but often teams are just trying to protect capacity.
3. Toil
“Toil” is a more common term now because of the growing popularity of SRE,
especially in the DevOps world. First, let’s distinguish between toil and
overhead. Overhead is important work that doesn’t directly affect production
services. It may be anything from setting goals to human resources activities
to team meetings — important things that don’t necessarily affect the code.
On the other hand, toil typically includes things that are:
● Manual.
● Repetitive.
● Able to be automated.
● Not strategy or value driven.
● Repeatedly waking on-call developers up.
● Non-creative.
● Not very scalable.
“It may be necessary, but it should be viewed as something a little bit icky,”
Edwards said.
At Google and many companies, managers try to keep toil down to less than 50
percent of the SRE team’s work. It isn’t moving the company forward and it
can frankly be demotivating to your engineers. Google in particular warns
that greater toil risks SREs falling into a strictly Ops or strictly Dev role,
when they should be working across both.
Now, Edwards says you can’t get rid of ticket queues completely. Organizations
just have to be aware when they are being used as a general purpose work
management system.
“Tickets are really good at documenting true problems, issues, exceptions and
routing for necessary approvals,” Edwards said. “The idea is that you cut down
on all the interruptions.”
Overall, Edwards says successful DevOps is about shifting the ability to take
action leftward toward the build end of the pipeline, giving everyone the same
tooling and enablement for a safer pathway to do things.
“The whole idea behind the cloud is everything is self-service. So, whether
you automate it with a pull request or you click a button on a UI, it’s a way
everything should work now really, by no longer raising a ticket and waiting
three months for a VM [virtual machine] to show up,” James Strachan, senior
architect at CloudBees and the project lead on Jenkins X, said.
While many traditional operations tasks are “shifting left” to developers, the
NoOps concept is more about creating layers of abstraction and self-service
capabilities than removing the need for DevOps. Operations roles are still
critical for building and managing those systems that deliver agility and create
business value. While a monolith application might be broken down into
microservices and serverless functions, it requires new thinking and opens
the door to new problems.
“DevOps doesn’t just become ‘NoOps’ — that is entirely the wrong way to
think about it,” said Chad Arimura, vice president of serverless at Oracle. “In
fact, serverless makes DevOps even more important than ever.” 24
The purpose of DevOps is to break down the walls — both physical and
metaphorical — between software developers and the operations teams
running the infrastructure. When successful, DevOps allows building, testing
and deploying to happen in tandem so that reliable releases occur more
frequently and quickly. This leads to a cultural change as roles and processes
evolve and silos fade into cross-functional teams. DevOps also leverages
tooling to support cross-functional interoperability and automated workflows
that back continuous development, testing, deployment and integration.
DevOps practices now extend to other, more traditionally siloed IT roles as well, most notably in security and
networking, which require new tools and approaches to management in cloud
native environments. In a traditional three-tier application, the application
tier, business logic tier and data tier talk to each other via a load balancer, and
networking and security policies are administered from a central place —
typically the load balancer or a firewall, writes Jeroen van Rotterdam,
executive vice president of engineering at Citrix. 27 The distributed nature of
microservices architecture makes administering networking and security
policies a lot harder than it is in a monolith architecture, he said. Containers
appear, disappear and are moved around to different compute nodes far too
frequently to be assigned static IP addresses, much less be protected by
firewalls and IP tables at the network’s perimeter. And unlike monolith
architectures where modules communicate with each other within a single
image, either through in-process calls or interprocess communications (IPC),
with microservices these calls are made over the network. More services and
service instances potentially mean more failure points and more operational
complexity; there are a lot more service instances to load balance, the short
lifespan of service instances makes health checking these ephemeral
containers a challenge and the sheer number of service instances calls for a
high degree of automation for managing rolling upgrades at scale. Manual or
semi-manual processes don’t work when dealing with containers at scale.
Fortunately, microservices architecture is flexible in its design to enable
continuous integration and delivery, unlike the three-tier architecture.
Applying DevOps to networking and security — alongside new tools and best
practices — helps organizations manage the increased operational complexity
that comes with a containerized, microservice architecture. DevOps adoption
changes well-established security practices, for example, because its
emphasis on automation and monitoring assures that flaws are found more
quickly and patches and updates are released continually. The patterns and
practices which emerge when DevOps and security are combined are now
commonly called DevSecOps.
“Security has to be part of the CI/CD pipeline,” said Dr. Chenxi Wang, founder
and general partner of Rain Capital, an early stage cybersecurity-focused
venture fund. 28 “In the past with the major releases, you have your security
reviews and security testing all done. Everything else stops until you finish
that. That doesn’t work anymore, so vulnerability scanning has to be done in a
way that fits seamlessly into the CI/CD process.”
The shift to containers and Kubernetes has begun to change networking and
security roles in much the same way that it has changed developer and
operations roles. In fact, these two trends are connected at the hip: 40 percent
of network managers report fully converged, shared tools and processes with
the security group, with another 51 percent reporting formal collaboration,
according to EMA research on what they aptly call “NetSecOps.” 30
While not every organization has integrated security and networking into its
CI/CD pipelines, many are beginning to experiment with the approach.
“As things get pushed to Layer 7, things are also getting more modular in the
sense that application developers can focus on the application. They don’t
have to worry about networking, they don’t have to worry about security, and
security and networking functions are being encapsulated into sort of
manageable units, itself being microservice driven,” Dr. Wang said.
So, what’s the first step for network’s parlay into DevOps culture and
automation? Whaley says it’s all about making sure your routers and switches
are connected via APIs, allowing them to be programmable. Once that’s in
place, then you can add assurance, analytics and observability into the mix,
starting to apply DevOps processes and principles to the network.
NetDevOps is all very new, with companies starting to try it out.
FIG 3.1: Adoption of automation technologies among NetOps teams lags behind
DevOps but the gap is closing fast, according to a 2018 study by F5 Networks and Red
Hat.
Whaley agreed that there are still tooling gaps, like the testing piece.
“If you’re going to test a network change, you either have to have a test
network sitting around — most people don’t have that spare network — or a
simulated network,” she said. Python automation test systems (PyATS) and
the DevNet Sandbox for reserved hosting and testing are good places to start.
The security benefit of DevOps carries over as a strong motivator for network
adoption of DevOps practices. By tracking changes as code, network operators
have more visibility into the changes that have been made and can further
automate security.
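Tracked as code, a network change becomes a reviewable diff rather than an undocumented tweak. A minimal sketch in plain Python, with the device fields invented for the example:

```python
# Toy illustration: when network configuration lives in version control as
# data, any change between revisions can be diffed and reviewed like
# application code. Device and field names are invented for the example.

def config_diff(before: dict, after: dict) -> dict:
    """Return fields that changed between two config revisions,
    mapped to their (old, new) values."""
    keys = set(before) | set(after)
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}

rev1 = {"hostname": "edge-1", "vlan": 10, "acl": "allow-web"}
rev2 = {"hostname": "edge-1", "vlan": 20, "acl": "allow-web"}

print(config_diff(rev1, rev2))  # surfaces only the VLAN change
```

The same diff that gives operators visibility can also feed automated checks, for example refusing to apply a revision that loosens an ACL.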
The network also has some interesting data for application developers,
such as location collected from WiFi, heatmaps and usage information.
Business analytics will soon blend with NetDevOps to unlock a lot of data
that will improve a whole range of applications, from smart cities to customer
experience. This phenomenon is very likely to see a boost in adoption
alongside the growing popularity of the enterprise Internet of Things over
the next few years.
“It’s similar [to] DevOps [in] that it is culture plus tech change — not
necessarily a fit for everyone. But you can start with automating common
tasks and grow the practice from there,” she said. “If a strong DevOps
practice already exists, it’s natural to extend those ways of working” to
networking as well.
She pointed out that it brings a lot of possibilities of teams getting closer
together. So far, Whaley’s team has found that — like many things in the
enterprise space — it’s the culture that has to change first. DevOps had to
contend with literal walls between Devs and Ops who were often at least on the
same floor; the network team, just to start, may be at a different location entirely.
Another piece of NetDevOps that’s still evolving is who manages it all. So far,
Whaley sees some leadership coming from within the networking
organizations looking to link up with their DevOps counterparts because they
share in the problems of scale and speed, recognizing automation is the only
way to get there. At other organizations, DevOps is leading the way because
the patterns, practices and tools have reached a level of sophistication that the
natural next step is to look for ways to improve how the network is responding.
No matter which side of the aisle is pushing for the NetDevOps transformation
— networking or DevOps — Whaley says that the demand for speed, agility
and scale is coming from the chief information officer (CIO) level, and either
side is focusing on how to leverage the APIs on the other side.
DevSecOps
DevSecOps fits security into the DevOps process. Teams of DevOps engineers
and security analysts share priorities, processes, tools and, most
importantly, accountability, giving organizations a centralized view of
vulnerabilities and remediation actions, while also automating and
accelerating corrective actions. 35
applications, writes Twain Taylor, a technology analyst and guest writer for
Twistlock. 36 Companies are thinking more deeply about developing separate
pipelines for various workload types, among other emerging best practices.
The integration of DevOps and security has already started to take hold in
most organizations that have already adopted DevOps practices. A 2018 DevOps
survey of over 1,000 IT pros by Logz.io found that DevOps handles security at
54.5 percent of organizations surveyed.
FIG 3.2: The integration of DevOps and security has already started to take hold in
most organizations that have already adopted DevOps practices, according to a
2018 DevOps survey of over 1,000 IT pros by Logz.io.
[Survey responses on who handles security: DevOps 54.5%, security operations 41.3%, development 35.3%, system administration 28.4%, site reliability engineering 6.7%, consultant/MSSP 4.6%.]
The term DevSecOps “has always struck me like the last kid getting on the
bus and there’s no seat available. We are treating security as an afterthought.
Security has never been an afterthought with any customer I’ve dealt with —
in financial services or now at Amazon Web Services. I feel like the name
doesn’t reflect the importance,” Margo Cronin, senior solutions architect at
Amazon Web Services, said in her talk at the European DevOps Enterprise
Summit.
In fact, with the new European General Data Protection Regulation (GDPR),
Cronin says privacy by design and privacy by default are built right in. “It
nearly mandates you should be doing DevSecOps.”
Call it DevSecOps or Rugged IT, like agile and Kanban enthusiasts do, but how
AWS refers to it is pretty accurate: security automation at cloud scale.
DevSecOps replaces disconnected, reactive security efforts with a unified,
proactive CI/CD-based security solution for both cloud and on-premises
systems. A more cohesive team from a diversity of backgrounds works toward
a common goal: frequent, fast, zero-downtime, secure deployments. This goal
empowers both operations and security to analyze security events and data
with an eye toward reducing response times, optimizing security controls and
checking and correcting vulnerabilities at every stage of development and
deployment. Blurring the lines between the operations and security teams
brings greater visibility into any development or deployment changes
warranted, along with the potential impacts of those changes. 40
1. People make mistakes. Cronin set the scene: “[A] stakeholder telling you to get the service back online. You are on hour five
of a severity one call. You are on a Slack channel with 40 people, 38 of
whom are not really contributing. You are on cup of coffee number seven.
You make a change in production to resolve this issue.” She said that under
these often common circumstances, you are more prone to make an error
than in a business-as-usual scenario, and maybe you forget to document
the change and the next release overwrites the fix. “Humans make
mistakes, and when you’re under pressure you’re more likely to make
mistakes.”
2. People bend the rules. Then she shared a well-meaning common use case:
People bend the rules in an effort to be helpful and to collaborate, like
when you have scheduled a big release — and release party — and
everyone’s ready to celebrate and it’s almost there, so you say, “We’re just
going to get it out. We’ll do the release and fix that tomorrow.” People will
ask you to bend the rules from a place of goodness, but these requests create
gaps in your product landscape.
3. People act with malice. “While attacks like DDoS are automated, there is
invariably a human behind the scenes instigating that attack.”
Machines don’t make mistakes, bend the rules or act with malice, which is
why Cronin argues that automating security tasks must be your biggest
priority for successful DevOps. Yet only 34 percent of information security
professionals’ organizations have automated security testing in their software
release life cycle, according to Cybersecurity Insiders’ 2018 Application
Security Report. 41 Although most organizations have some DevSecOps
processes in place, in reality, automated security testing is not deployed in a
majority of CI/CD pipelines.
“The reality is that manual security doesn’t work in the cloud native age,”
writes John Morello, CTO of Twistlock on The New Stack. 42 “Environments
move too fast, and configurations change too quickly for your engineers to be
able to interpret security threats manually and react in a timely fashion. You,
therefore, need tools that can make informed data-based decisions about
threats for you, then take action to stop them before they cause damage.”
Several approaches have evolved to reduce risk across the clusters, pods and
nodes running on Kubernetes, or existing in some similar serverless approach,
without hiring more staff or burning out the existing team. The trend is for
greater observability across the system to allow for automated responses —
such as scaling, rollbacks and load balancing — as well as more informed
decision-making and feedback loops. There is no single, standard metric that
developers can rely upon to tell them whether their code is working or not.
Observability gives developers the ability to collect data from their application
and trace problems to the root cause, in order to debug code and relieve issues
affecting the application in production. 43 This is the shift-left approach
which gives developers more responsibility for securing their code. In addition,
automated security tests, such as container image scanning, vulnerability
scanning while code is still in development, and compliance tests, help enforce
security policies and practices at scale. And configuring network security
policies, monitoring for breaches and automating the response, helps ensure
network security at scale.
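One common automated test is a severity gate on image-scan results: the pipeline fails if the scan reports anything at or above a chosen threshold. A sketch under assumptions — the findings format below is invented, and real scanners emit much richer reports:

```python
# Sketch of a CI gate on scanner output: block the build if a container
# image scan reports vulnerabilities at or above a severity threshold.
# The findings format is invented for illustration.

SEVERITY = {"low": 1, "medium": 2, "high": 3, "critical": 4}

def gate(findings: list, threshold: str = "high") -> bool:
    """Return True if the build may proceed, False if it must be blocked."""
    limit = SEVERITY[threshold]
    blockers = [f for f in findings if SEVERITY[f["severity"]] >= limit]
    for f in blockers:
        print(f"BLOCKED by {f['id']} ({f['severity']})")
    return not blockers

findings = [
    {"id": "CVE-2019-0001", "severity": "medium"},
    {"id": "CVE-2019-0002", "severity": "critical"},
]
assert gate(findings) is False                                   # critical blocks
assert gate([{"id": "CVE-2019-0003", "severity": "low"}]) is True
```

Running such a gate on every commit is what lets scanning “fit seamlessly into the CI/CD process,” as Dr. Wang puts it, rather than being a release-time bottleneck.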
“It doesn’t matter where you are in the spectrum, you can still get the value of
the cloud, but the point of trust that you have is congruent to the amount of
automation you need to implement,” she said.
Cronin gave the example of transferring trust with more automation. A trust
level of zero could be a company deploying native Kubernetes, managing
the master nodes (scaling and distributed consensus), the worker nodes, and
all the security. On the other end of the spectrum at the right, she offered an
example where the customer with higher trust uses Amazon Elastic Container
Service for Kubernetes (Amazon EKS).
By putting more trust into Amazon EKS, she says, RBAC for
Kubernetes is automatically turned on, the master nodes are managed, and
there is native integration with AWS.
She says that “No matter where you are on the trust scale, plan to integrate
security automation, but remember that creating this automation will also
take the DevOps team time.”
This involves mapping the tooling based on where you are on the trust bar.
The lower the level of trust, the higher the level of security automation the
DevOps team needs to implement. The higher the level of trust, the more the
cloud provider can automatically manage and automate for you. This impacts
how quickly you are going to release your minimum viable product. Also, the
less trust, the more you have to plan your security ahead, such as ensuring
your RBAC is on.
2. Security by Design
Cronin contends that in DevSecOps, every team member feels the
responsibility of a security owner — it’s no longer a team in another building,
just a stakeholder to your project. Just like DevOps tears down the silos
between developers and operations, the same must happen for security.
With DevSecOps, sprints can be based on security needs, breaking epics down
to functional security stories. This security process used to take a couple of
months with on-premises hosting via complicated waterfall epics. Now she
says that, with any cloud service provider, you can spin up the web application
firewall in sections.
You then use the same dynamic CI/CD pipeline to roll out the security features
that you would with the rest of your application life cycle.
Git Secrets is an open source tool that scans commits to keep credentials out
of a repository; it can be leveraged along with the cloud in important CI/CD steps.
The security implications of automation are profound, writes Roy Feintuch,
CTO of Dome9 Security. 45 This is the flip side of agility: in the public cloud,
simple configuration changes can leave sensitive data and private servers
exposed to the world.
“Now, where your entire infrastructure is defined in a JSON file, the DevOps
folks have their hands on the keyboards. This puts the security folks in a weird
new place — instead of being a chokepoint, they now try to keep up with the
changes. Sometimes retroactively,” Feintuch said.
The InSpec tool from Chef, for example, enables compliance, security and
DevOps teams to more clearly define security and compliance tasks by writing
specific rules to automate them. Security teams can set the policies that
DevOps teams deploy against. Users can write custom compliance policies for
AWS and Microsoft Azure, or use pre-defined policies for regulations and
standards such as PCI, HIPAA and those of the Department of Defense. And they can validate cloud
configurations, covering virtual machines, security groups, block storage,
networking, identity and access management and log management, against
the policies. 46
“It provides a simpler level of abstraction so you can find out things like what
Docker containers do you have running, what packages do you have running,”
said Julian Dunn, former director of product marketing at Chef. “We want to
make sure our database doesn’t have the default database installed or the
default user installed and we have strong passwords and we can see what
systems are allowed to connect to this database server, [and] make sure
they’re using encryption.”
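InSpec expresses rules like Dunn’s in its own Ruby DSL; as a language-neutral sketch of the same policy-as-code idea, each policy below is just a named predicate evaluated against a snapshot of system state. The field names are invented for the example.

```python
# Language-neutral sketch of policy-as-code (InSpec itself uses a Ruby DSL):
# each policy is a named predicate run against a snapshot of system state,
# so security teams can define rules that pipelines evaluate automatically.
# Field names are invented for the example.

policies = {
    "no default user": lambda s: "admin" not in s["db_users"],
    "strong password required": lambda s: s["min_password_len"] >= 12,
    "encryption in transit": lambda s: s["tls_enabled"],
}

def audit(state: dict) -> dict:
    """Evaluate every policy against the state; True means compliant."""
    return {name: check(state) for name, check in policies.items()}

state = {"db_users": ["app"], "min_password_len": 8, "tls_enabled": True}
print(audit(state))  # flags the weak password policy
```

The point of the pattern is the separation of duties the text describes: security writes the predicates once, and DevOps teams deploy against them on every change.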
“If you’ve got a limit on, say, how many VMs you can spin up on AWS, Salt can
respond to that failure, and trigger an alternate orchestration. It could halt the
step, or roll it back to a previous state,” said Gary Richmond, senior technical
product manager at SaltStack.
This process of security automation, combined with the steps above, makes
your deployments highly immutable and reduces your blast radius — the extent of damage that a
compromised container can do to other containers on the same node, or how
much damage a compromised node can do to the rest of the cluster. 48
4. Automate Responses
For security automation to work, you need to know what you are doing based
on your log files. It all comes down to the questions your logs can answer.
Take the example of someone switching off an AWS service. This action can
send an automated event to your security team for them to look into the
environment. It allows you to decide whether the service was shut off by
someone whose privileges are too high, or whether it’s an event that needs
looking into and maybe servers need to be ring-fenced. Cronin pointed out how
powerful logging has become and how logging in the cloud prevents more
incidents.
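The automated triage Cronin describes can be sketched as a filter over audit-log events that routes sensitive actions to the security team. The event fields and action names here are invented for illustration, not an actual AWS schema:

```python
# Sketch of automated triage over audit-log events: route any
# service-disabling action to the security team for review.
# Event fields and action names are invented; a real pipeline would
# consume CloudTrail-style logs with a much richer schema.

SENSITIVE_ACTIONS = {"StopLogging", "DisableService", "DeleteTrail"}

def triage(events: list) -> list:
    """Return the events that should be escalated to the security team."""
    return [e for e in events if e["action"] in SENSITIVE_ACTIONS]

events = [
    {"user": "alice", "action": "DescribeInstances"},
    {"user": "bob", "action": "DisableService"},
]

for alert in triage(events):
    print(f"review: {alert['user']} ran {alert['action']}")
```

Because the filter runs on every event as it is logged, the review happens minutes after the action rather than during a post-incident audit, which is how logging in the cloud prevents incidents instead of merely explaining them.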
The abstraction of the security layer that comes with containers and
Kubernetes has many benefits to the overall security posture of the
organizations deploying applications on a cloud native architecture. It also
poses new risks and possible vulnerabilities. With their individual APIs,
microservices can be reconfigured and updated separately, without
interrupting an application that might rely on many microservices to run.
However, microservices also come with many separate APIs and ports per
application, thus exponentially increasing the attack surface by presenting
numerous doors for intruders to try to access within an application. While
their isolated and standalone structure within applications makes them easier
to defend, microservices bring unique security challenges. The attack surface
widens further for Kubernetes users because of the orchestrator’s
comprehensive reach in the container runtime environment. 49
Organizations have begun to take a “shift left” approach that embeds
security practices deeper into the development process, so that security teams
are more involved in engineering and vice versa. The shift left approach, which
gives developers more responsibility for application security, itself has two
sides: Most developers do not fully understand how applications are connected
across a network mesh. What they are looking for often are open end-points
— APIs to interconnect applications. That in itself creates a security gap and
an opportunity to isolate issues. At the same time, it allows teams to detect
anomalies faster through automated container image scanning — going
further left by baking security into the code itself. In the end, when looking to
automate security, it seems best to follow Cronin’s final words on the
importance of the Sec in DevSecOps:
“If security is your most important job, you should look at automating those
tasks and stories first, before anything else.”
SPONSOR RESOURCE
• Building Cloud Native Apps Painlessly
— The Prescriptive Guide to Kubernetes and Jenkins X
by CloudBees, 2019
In this paper, you’ll read about how the future of modern application
development can benefit from the powerful combination of Jenkins X and Kubernetes,
providing developers a seamless way to automate their continuous integration
(CI) and continuous delivery (CD) process.
10. “5 Workflow Automation Use Cases You Might Not Have Considered”
by Bernd Rücker, Co-founder of Camunda, The New Stack, April 9, 2018.
Workflow automation is so much more than human task management.
In this contributed post, Rücker elaborates on use cases for workflow
automation.
as they create, deploy and manage microservices for cloud native applications,
and lays the foundation for understanding serverless development and
operations.
SPONSOR RESOURCE
• Delivering Cloud Native Infrastructure as Code
by Pulumi, 2018
Find out how to deliver all cloud native infrastructure as code with a single
consistent programming model in this white paper from Pulumi.
20. “The Ever-Changing Roles of the Developer: How to Adapt and Thrive”
by David Hayes, former Director of Product Management at PagerDuty,
The New Stack, June 20, 2017.
Developers are now spending less time than ever writing code and
instead find themselves focused (and evaluated) on enhancing and
maintaining scalability, improving customer experience, boosting service
efficiencies and lowering costs, Hayes writes in this contributed article.
SPONSOR RESOURCE
• Continuous Delivery Summit
by Continuous Delivery Foundation, 2019
The Continuous Delivery Foundation will be hosting a Continuous Delivery
Summit (CDS) event on May 20 at KubeCon + CloudNativeCon Europe 2019 in
Barcelona, Spain. The Cloud Native Computing Foundation’s flagship
conference gathers adopters and technologists from leading open source and cloud
native communities including Kubernetes, Prometheus, Helm and many
others. Register to attend today!
automate them.
SECTION 2
DEPLOY
Adopting a cloud native DevOps culture means making changes that facilitate
speed and communication through increased autonomy, transparency and
automation.
disconnected from DevOps as the main culprit. With a DevOps culture and
processes in place, organizations are in a better position to take advantage of
the new processes and tooling around securing microservices applications. 1
The question is, why have most organizations not undergone a DevOps
transformation, and what makes cloud native DevOps any different? In prior
works, The New Stack has developed the premise that at-scale application
development, deployment and management are built on DevOps practices. The
tools needed to make that scale possible grew out of the belief that monolithic
technologies were built to manage systems of record, with storage and
networking technologies as the most significant cost factors.
Now the cost is moving to the creation and management of services on cloud
native architectures that are built to scale. Storage and networking have
different contexts. In monolithic systems, the application runs on the network,
configured to the machine. In cloud native architectures, storage is persistent
and the network is software that is application-centric, meaning the
application is not necessarily configured to a specific machine. The network is software
running on orchestration engines such as Docker swarm mode, Kubernetes
and Mesos. Underneath are containers and virtual machines (VMs). The
containers are portable and the VM plays a different role, albeit increasingly as
a supporting mechanism for containers. For example, Amazon Web Services’
AWS Fargate is built on containers but underneath is a micro-VM architecture.
What changes in cloud native DevOps is the construct for teams, their
workflows and the adoption of cloud native technologies that support the
organization’s business objectives and developer requirements. Kubernetes is
largely viewed as cloud native. But there are any number of other tools that
must be considered in relation to who is on the team, their level of experience
and the workflows that best suit them. The people who developed the first
at-scale architectures made their own tools. Luckily, in today’s world, there
is now a selection of tools largely based on the experiences of these
pioneering technologists. These include a new generation of graph databases,
continuous delivery tools and a host of services that allow for teams to
manage the persistence of their applications, the networking and,
increasingly, the communications between services, including events that
trigger alerts and notifications. Serverless technologies are emerging to also
fill the void, but largely are defined in newer application architectures that use
automation practices to manage the functions that historically required a
more manual approach.
The fact that DevOps is lacking in many organizations means that those that
do make the cultural shift are at a strong competitive advantage — at least for
now. This is because it can be assumed that most organizations have either not
fully optimized DevOps to support cloud native architectures, or in many
cases, have not yet adopted DevOps.
Cloud native architecture is based on discrete and loosely coupled units of code
— containerized microservices or serverless functions — built and managed
by largely autonomous teams. Application architecture development is a direct
result of how individuals and teams interact and communicate about their
own, and overlapping, orchestrations. As more developers are added to
achieve scale and the application gets more sophisticated, so does the overall
complexity of the architecture. Microservices make that scale more
manageable by breaking the monolith down into pieces that can be managed
separately by teams with responsibility for the full life cycle of that code.
This new mode of development, and the complexity that comes with it, requires
new tools, processes and team structures, and it’s precisely why DevOps
becomes critical to success. DevOps is a culture of transparency, openness to
change, shared responsibility and continuous improvement — a way of working
that eliminates barriers between teams and values communication. Such
transparency is necessary in order for teams to remain loosely coupled and
independent. Openness means a willingness to adapt and change course, a
desire to learn new skills as well as learn from mistakes, and a willingness to
share those mistakes and the lessons learned so that others do not repeat them.
Through open communication and clearly defined and automated processes,
change propagates quickly as product teams adapt to that feedback. All of this
is done for the benefit of the customer or end user, and thus, the business.
Platforms, like Cloud Foundry and Kubernetes, provide technical solutions that
make the developer experience more predictable and scalable, Chisara
Nwabara, a service and product specialist at Pivotal writes on The New
Stack. 3 But companies must also work on making their customer experience
more predictable and scalable by improving communication channels within
their own teams. Communication — alongside tooling and automation — is
key to DevOps transformation and to fully realizing the benefits of a cloud
native architecture.
organization. Someone from the IT department will likely initiate the concept
of applying DevOps as part of a shift to software development from a
monolithic application that runs on premises to microservices running on
containers and Kubernetes in the cloud. But without the business teams
participating, nothing much will happen, Brian Dawson, a DevOps evangelist
at CloudBees, said. “Business has to be involved. The whole idea is you want to
go from ideation or concept to customer.”
Chief technology officers (CTOs) and chief information officers (CIOs) have
usually bought into DevOps, but the downfall of a DevOps initiative often lies
one chain of command down: with the managers who oversee the success of
the product but don’t interact with the developer (Dev) and operations (Ops)
teams every day. These are the roles that need to get on board with DevOps early,
helping to form the product statement and drive toward actionable user
feedback, Ashley Hathaway, engagement director at Pivotal, said.
DevOps creates a shift left approach, not only for developers who take on
operations work, but for management and product development teams as well,
who must consider how the features they build will affect the reliability of the
application in production, writes Max Johnson, DevOps engineer at
Pypestream and a former Holberton School student. 4 What this means
practically, is that whenever an idea or new feature is suggested, all teams —
product, Dev, quality assurance (QA), and Ops — come together to discuss the
feasibility of the feature, what considerations should be kept in mind, and the
minimum expectations of the feature. Two concepts that enforce this are
behavior-driven development (BDD) and test-driven development (TDD). BDD
sets user behavior as the standard to define what success means for a feature.
TDD defines a number of unit tests which the feature must pass if it’s to be
considered acceptable. 5 In both of these approaches, the teams define what
success means for every feature right at the start, before it is built.
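As a hedged illustration of the TDD half, a team might encode a feature’s minimum expectations as tests before building it. The discount feature and its API below are invented for the example, not taken from the guide:

```python
# Hypothetical feature agreed on up front by product, Dev, QA and Ops:
# a discount calculator. In TDD, tests like these encode the feature's
# minimum expectations and are written before the implementation exists.

def apply_discount(price, percent):
    """Return the price after applying a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("discount must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_basic_discount():
    assert apply_discount(100.0, 20) == 80.0

def test_zero_discount_leaves_price_unchanged():
    assert apply_discount(59.99, 0) == 59.99

def test_invalid_discount_is_rejected():
    try:
        apply_discount(100.0, 150)
    except ValueError:
        return
    raise AssertionError("expected a ValueError for an invalid discount")

# The feature is "acceptable" only once all of these pass.
test_basic_discount()
test_zero_discount_leaves_price_unchanged()
test_invalid_discount_is_rejected()
```

A BDD version of the same idea would phrase the expectations in user-behavior terms (“given a 20% discount, when a customer checks out, then they pay 80”) before mapping them to tests like these.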
“There are many benefits to the shift left methodology. It makes quality a
priority for everyone, not just for QA and Ops. It saves cost and effort, and
prevents a bad user experience by ensuring bugs are caught early on in the
cycle,” Johnson said. “It forces product, Dev, QA, and Ops teams to work
together at the start. ... This helps break silos between them, as they use their
unique expertise to solve common problems.”
“The middle managers are tasked with ‘change the culture and make this
successful,’ but ‘oh, by the way, we need this yesterday’,” Hathaway said.
She gave the example of one line of business at one of her big banking clients
which was worth millions of dollars. The CTO came in and said, “I don’t care if
we ship late. I don’t care if things break. We have to change things from the
ground up and slowly, and it’s going to be painful way more than a win.”
That’s now the most innovative line of business at that large bank, and they
act as an example for the rest of the company, Hathaway said. They were OK
not hitting budgets or deadlines for a couple of months. Instead, they are now
successful, with lower turnover and happier employees.
It all comes down to middle and senior management buying in for the change,
and then setting expectations for a bumpy, but worthwhile ride. At the same
time, engineering teams must understand their role in creating business value.
Dawson said when working with an organization’s DevOps teams to help them
build their cloud native deployments to scale, he first identifies who the
developers, QA, operations and business team members are. “I make a point
that, ‘now, all of you are part of the business, because at the end of the day,
we’re doing what we do to deliver functionality to support the business,’”
Dawson said. “For your standard cloud native application, there will be
operations and security [people], for example, asking ‘okay, how do I inject
myself into cloud native development?’”
FIG 4.1: DevOps culture begins with a single team and expands to multiple teams as
organizations get farther on their DevOps journey.
So how does an organization start to adopt a DevOps culture with buy-in from
business leaders, Devs and Ops? How do you eat an elephant? One bite at a
time, Pivotal’s Nwabara quips. Start small, then scale. Collect your evidence
and acknowledge the wins. Most companies of any size are a complex network
of people, each with numerous communication touch points. There are many
teams that must collaborate to deliver a given success. And “pivoting” an
entire organization or even just a subset of departments does not happen
accidentally and in a single step. So start small and refine as you go, to
improve upon the way in which you solve customer concerns — one case at
a time. Each learning contributes to the overall shift to improved
She has worked with startups as small as three people, and with companies
with tech teams of about 100, where the current CTO might not yet have the
skills for running a larger organization.
Warner says her job involves a lot of talking to people and listening to their
answers to questions such as:
In the very traditional financial space, if even senior or middle managers aren’t
committed, Warner says you’re destined to hit a wall.
“Somebody will want to put controls in that slow you down and interfere with
your journey to rapid change, and you can end up in the worst of both worlds,”
she said. “Any company can change their culture if they’re committed.”
1. Transparency.
2. Autonomy.
3. Automation.
Transparency
In a transparent organization, decision-making happens in the open, and
performance results — both positive and negative — are freely shared. Taking
this idea even further toward radical transparency, weaknesses and mistakes
are also openly discussed and team members are encouraged to share their
opinions, regardless of their position or level of experience. 8 Warner called
companies that move toward radical transparency “really healthy.”
On one contract Warner worked on, only two people — “the priesthood of
sysadmin” — knew how key systems worked, for fear of reverse engineering.
This gatekeeper culture means no one can improve things, no one can point
out scenarios that won’t work, and, when those two sysadmins eventually
leave the company, it’s really in trouble.
Jira and Confluence are solid examples of project management tools that
promote information radiation in a wiki-like way, where anyone can comment
and make suggestions for improvements. But the process is more important
than the tools.
Autonomy
DevOps organizations need to hire the right mixture of people, and make sure
engineering teams understand the business. This includes making sure
engineers know regulatory constraints and what customers and partners are
expecting from the code. Make sure technical teams understand business
problems, then give them freedom to come up with solutions to those
problems.
Warner says part of the job is continuous learning. There is always a better tool
or library and so many different solutions to existing problems that you get
better results with more individuals who are continuously learning toward
collective objectives.
Automation
Finally, Warner sees that DevOps transformation requires a commitment to
automation. If you want to have loads of tiny safe changes in a process where
any engineer can trigger a code release, you need to make sure your
automation — which includes infrastructure as code (IaC), continuous
integration and test automation — is in excellent shape, and that you are using
automation at every stage, including production.
Just remember, Warner said, “If tech is driving for a collaborative culture but
the rest of the company isn’t interested, it’s not going to work. It has to be
across senior management and product.”
“For our clients, [pair programming] seems counterintuitive. ‘Now I’m cutting
my time in half.’ But it actually strengthens code checks. You write better code.
Then the domain knowledge doesn’t leave if someone is sick. Everyone deploys
on the same pipelines [and] writes their own pipelines with continuous
delivery,” Hathaway said.
DevOps is all about breaking down silos, but some silos are necessary,
especially in banking security. Hathaway says, with banking regulations, the
person who writes the code can’t ship the code. This is a bonus of Pivotal’s pair
programming policy as two people together can continuously ship and test the
code, with checks along the way.
One application (app) development team saw great success by simply talking to
an Ops team.
“Just the interaction of that one application group, those Devs giving their
feedback to the Ops team, was a lightbulb moment for them,” she said,
pointing out that Ops wasn’t talking to their developer users before.
Putting the Devs who are using the infrastructure, system rules and
integrations every day in front of the Ops team is a logical first DevOps
conversation. The same is true for increasing communications and breaking
down silos between IT and business management. What if middle management
won’t change? Try pair managing, says Hathaway: “Link up your weaker
people with your really strong performers to see what different thinking looks
like, because it is a practice in thinking. They need to know it’s OK from their
management.”
The company has since put out that fire, but the incident also drove home
that while software usually solves problems, sometimes the solution can be as
simple as changing people’s mindset. This adjustment can impact teams
greatly and result in a significantly positive outcome. You may already have
what you need to make a huge impact, and the cultural transformation is a
critical element for digital transformation. Everyone’s ultimate goal is to
ensure that the company provides recognizable value to customers through a
positive customer experience.
Here are five approaches that Pivotal is taking to improve communication and
build empathy between their research and development (R&D) and customer-
facing teams in order to deliver a better experience and value. We have already
discussed the first:
towards improving the ways in which teams work together to deliver customer
value, they must be aware of the fact that if their learnings are not directly
incorporated back into their practices, there was no point running the
experiment in the first place. What lessons from this experience make it more
sustainable and scalable across teams? Think about, and promote, running
experiments beyond the software or products a single team happens to build.
Role of DevOps in
Deployments: CI/CD
All along, the idea of the cloud has been one of abstraction. Moving
storage to the cloud not only means you can’t point to the physical
drive where your data is stored, but storage is theoretically limitless
and available on demand. Similarly, moving compute to the cloud means your
processing power can be increased exponentially, without the need to procure
and provision physical servers.
While the move to the cloud has solved some problems, other problems have
inevitably come to take their place. With cloud native technologies, we are
again seeing a new level of abstraction, where technologies previously reserved
for on-premises scenarios are moving to the cloud and experiencing the
increased capabilities and complexities that come with it. At its core, cloud
native DevOps means yet another transformation to undergo and a new set of
ideas to incorporate. In practice, this means new tools and workflows —
alongside the culture shift described in the previous chapter — which have
wide-ranging implications.
Emerging GitOps, SecOps and DevOps practices paired with a CI/CD pipeline on
top of Kubernetes will speed up your release life cycle, enabling you to release
multiple times a day, and allow for nimble teams to iterate quickly. With
Kubernetes and cloud native DevOps patterns, builds become a lot faster.
Instead of spinning up entirely new servers, your build process is quick,
lightweight and straightforward. Development speeds up when you don’t have
to worry about building and deploying a monolith in order to update
everything. By splitting a monolith into microservices, you can instead update
pieces — this service or that — and also encourage autonomy among
development teams who own the full life cycle of that service. 9 However,
complexity also increases as each piece now needs its own delivery pipeline as
well. Teams test their own code and deploy directly to production. It’s of
critical importance in production environments to have services that are
thoroughly tested at all stages of development.
The shift to cloud native can be a double-edged sword and adapting your CI/CD
tools and practices is mandatory to keep pace. Overall, cloud native offers
opportunities in terms of velocity and scale, but also increased complexity, as
teams move from handling single monolithic applications to multifaceted
microservices. The lines across the stack are increasingly blurred, with more
dependencies and more layers.
One way to handle the increased speed is through increased automation and a
culture of experimentation, which Tharisayi points to as critical for cloud
native CI/CD.
FIG 5.1: Tests and health checks can prevent bad code from reaching production.
As part of a rolling update, Kubernetes spins up separate new pods running your
application while the old ones are still running. When the new pods are healthy,
Kubernetes gets rid of the old ones.
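The rolling update behavior described above is configured declaratively in the Deployment manifest. A minimal sketch follows; the names, image and probe endpoint are placeholders invented for illustration:

```yaml
# Hypothetical Deployment sketch: the readiness probe gates the rolling
# update, so Kubernetes only retires old pods once new ones report healthy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app          # placeholder name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # at most one extra pod during the rollout
      maxUnavailable: 0      # never drop below the desired replica count
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: registry.example.com/example-app:v2   # placeholder image
        readinessProbe:
          httpGet:
            path: /healthz   # placeholder health endpoint
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
```

If the new pods never pass the readiness probe, the rollout stalls instead of replacing the healthy old pods, which is how bad code is kept out of production.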
section, let’s explore how DevOps patterns and processes change with this
cloud native approach.
GitOps is all about “pushing code, not containers,” said Alexis Richardson,
CEO of Weaveworks and chair of the Cloud Native Computing Foundation’s
Technical Oversight Committee, in a KubeCon + CloudNativeCon EU keynote.
The idea is to “make Git the center of control” of cloud native operations, for
both the developer and the system administrator, Richardson said. 12
The open source Git version control software makes perfect sense as a frontend
for cloud native computing. Given its near-universal usage by developers,
Git commands are the lingua franca of the open source world. The basic
idea is that all changes to a cloud native system can be done through Git. Once
a commit is made, it sets off an automated pipeline, perhaps using a tool such
as the Continuous Delivery Foundation’s (CDF) Jenkins X or Spinnaker, to
containerize and test the code and press it into production.
GitOps could work exactly the same way for infrastructure management as
well. Richardson calls this approach “declarative infrastructure.” Make
changes in configuration through a YAML file, and Kubernetes can detect
changes in the file and adjust the resources as necessary. Weaveworks itself
has released a number of tools, such as kubediff, for comparing the desired
state with actual state in such cases.
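The desired-versus-actual comparison at the heart of this approach can be sketched in a few lines. This is a toy illustration of the concept, not kubediff’s actual implementation, and the state dictionaries are invented:

```python
# Toy sketch of "declarative infrastructure": compare the desired state
# (checked into Git as YAML) against the actual state reported by the
# cluster, and report the drift that a reconciler would then correct.

def diff_state(desired, actual):
    """Return a list of (key, desired_value, actual_value) mismatches."""
    drift = []
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            drift.append((key, want, have))
    return drift

desired = {"replicas": 3, "image": "example-app:v2"}   # from Git
actual = {"replicas": 2, "image": "example-app:v1"}    # from the cluster

for key, want, have in diff_state(desired, actual):
    print(f"{key}: desired={want}, actual={have}")
```

A real GitOps operator runs this kind of comparison continuously and either alerts on the drift or applies changes to converge the cluster on what Git says it should be.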
organizations. Second, it’s not always best to choose a ‘pull’ approach to make
environment as code real; a ‘push’ approach can be more suitable in many
cases.” These issues crop up with organizational complexity, and this is where
simple implementations of GitOps fall short, he said.
One early adopter of the GitOps approach has been the content platform team
of the Financial Times, according to a KubeCon + CloudNativeCon EU keynote
talk from Sarah Wells, who is their technical director for operations and
reliability. The company’s content platform is built from 150 microservices. In
2015, the newspaper moved this platform over to Docker from virtual
machines. After migrating to containers, the Financial Times found the
number of releases it did increased from 12 a year to 2,200, an increase that
also came with a much lower failure rate. Instead of spinning up a new
virtual machine, all the changes are simply made to a YAML file. While she
didn’t mention “GitOps” by name, Wells did say developers push all their
changes through GitHub.
“If you just want the GitOps experience or the command-line interface (CLI)
experience, you can take your existing YAML files and deploy them as they are
without changing them at all. Maybe you want to convert one YAML file at a
time or convert as you go, that type of mixed environment — we call it
brownfield — is really important for customers with large investments in
Kubernetes,” Joe Duffy, co-founder and CEO of Pulumi, said. “Similarly, we
can take Helm charts and just deploy them. If I want to provision an S3 bucket
in a Kubernetes app, usually I have to use two different toolchains for that. It’s
astonishing to me the lengths people will go to get tools to work together that
were just not designed to work together.”
Regardless of how the CI/CD pipeline kicks off, all cloud native applications
put configuration into code. This puts how an application runs much closer to
the application developer than ever before, said Garfield. This has implications
not only for how CI/CD pipelines are built and automated, but for those who
manage that work as well.
“In the old world, operations would worry a lot about how an application
would run and how it would be deployed,” Garfield said. “One side-effect of
cloud native is that developers take more responsibility for how their
CI/CD also plays a key role in what DevOps teams have been able to achieve as
they shift away from on-premises and virtual machine environments to take
advantage of the stateless environments on offer on the cloud. As cloud native
DevOps matures, we will continue to see advances and improvements in the
production cycle in a number of ways. One such example is how DevOps can
leverage microservices to merge CI with CD.
“We found that to go fast, you need to get rid of these CI/CD blocks that we
often didn’t think about before,” James Strachan, senior architect at CloudBees,
the project lead on Jenkins X, said during DevOps World | Jenkins World 2018
in Nice, France.
Before Kubernetes, only the CI half of CI/CD was automated. CD was assembled
by hand with scripts, pipelines, metadata and configurations. Kubernetes
enables CD automation, and some tools make it simple to deploy on
Kubernetes. Merging CI and CD allows automation of the full software
development life cycle — “it’s automating the automation,” as CloudBees
co-founder and engineering manager Michael Neale has stated — and
essentially commoditizes CI/CD. Having consistent, end-to-end pipelines also
enables more feedback — and eventually more automated feedback — to
developers on how their code is performing in production. 17
“It’s that kind of extra feedback and intelligence that really helps us deliver
software. So I think we’re going to have more and more automation that’s like
smarter services that can analyze what’s happening in production and give you
pull requests to say ‘We recommend you make this code change now,’ or ‘We
recommend you revert that version that you’ve just gone live with because of
X, Y, and Z.’ And then you as a developer are more directing changes rather
than always having to change everything by hand,” Strachan said.
Merging CI with CD and supporting it all with DevOps also means better
security. “By lifting this artificial separation between CI and CD, IT operators
can centrally address continuous security and compliance as the central pain
points of today’s line of business,” Volk said. “This should happen through the
implementation of security and compliance as code to centrally define and
enforce requirements in terms of code and infrastructure configuration, data
handling, and overall deployment architecture.”
To enable this end-to-end automation, the cloud native CI/CD marketplace has
exploded over the past year. And a collaboration between industry leaders to
form the CDF aims to establish industry specifications around pipelines,
workflows and other components of CI/CD systems to help grow the ecosystem
of tools and services. While most of the solutions are CI tools that have been
extended with CD capability, some are purpose built for cloud native CI/CD,
with a few others that focus on CD only. Some popular tools include Atomist,
AWS CodePipeline, CircleCI, Codefresh, GitLab CI, Harness, Jenkins and Jenkins
X, Puppet (which acquired Distelli) and Spinnaker.
Hidden Bottlenecks
Putting development on hold as different teams along the production pipeline
do their work represents an area for improvement in the CI/CD development
cycle. For organizations with a limited number of testing environments, for
example, a non-essential update or a four-hour long test might prevent an
urgent code fix from being completed. This scenario serves as a concrete
example of how merging “CD with CI can remove these kinds of bottlenecks,”
CloudBees’ Strachan said.
Previously, teams would consider separate deployment tools for CI and CD,
Strachan said. When merging CI with CD, the processes are completed in
parallel, which “creates a dynamic preview environment for each pull request
that gets deployed into its own separate dynamic environment,” he said.
HSBC has relied, in part, on CI/CD tooling to “rebuild the bank from within” as
a way to revolutionize the way that HSBC delivers to their customers, Cheryl
Razzell, global head of platform digital operations for HSBC Operations, said.
“This required building many of the processes largely from scratch and
assembling the teams to build the infrastructure. We rebuilt some Jenkins
masters and realized that we want to stabilize the environment,” Razzell said.
“There was another tool in the CI/CD platform that was then failing so we had
to work our way through the stack to rebuild the stack so we entirely built a
new digital infrastructure for our CI/CD platform.”
“As there is more of a focus on CI/CD for cloud deployments, the developer’s
job also often involves testing, automation, deployment and monitoring,”
Nitzan Shapira, co-founder and CEO at Epsagon, said. “The developer is
actually the one now deploying and updating who has access to the system.
And if the design of the microservices is good, a small team of developers can
be in charge of the service. The developers are in charge of the production ...
making [them] more empowered and have more impact in the organization.”
“If you take over one of these build systems, even though it’s running in a
container, you could take over a network because these containers share IP
space,” said security researcher Tyler Welton in a talk about CI/CD hacking at
the DEF CON 25 conference. “Even if they might be on their own mesh
network of IPs, they still often have ports mapped to the hosts.”
In some cases, it’s even possible to exploit the trust relationship between these
servers and code repositories in order to make commits back to master,
compromising the code. At the very least, they can abuse the authorized SSH
keys that these services use.
“When you compromise one of these services, you haven’t compromised the
entire system, but dump some environment variables and you’ll probably be
able to pivot to some of the other systems,” Welton said.
Even though some modern CI/CD tools allow restricting privileges inside
containers, a lot of systems are configured to run services inside containers as
root. At first glance, this doesn’t seem to be a big deal, because any potential
attackers would only be able to perform actions inside those particular
containers, which are often short lived.
However, root access allows attackers to scan the entire IP space in order to
find other potentially exploitable services running on the host, and if the
container has internet access, it allows them to download and install
additional packages they need to launch further attacks.
There’s also an older CI/CD audit framework called Rotten Apple that was
created by Mozilla ethical hacker Jonathan Claudius. This can be used to
determine if the root user is being used to build projects and if attackers can
deploy malicious code to steal API keys, to pivot to private networks, to
authenticate using GitHub credentials, to create reverse shells, to exfiltrate data,
to access other projects on the same server or to steal SSH keys. The
framework also has an attack mode, which can be used for penetration testing.
Welton’s 2017 talk at DEF CON contains real-world CI hacks and a wealth of
information about different configuration issues. However, the risks posed by
CI/CD tools have been known in the security industry for years.
InfoSec expert Nikhil Mittal’s presentation at Black Hat Europe two years
earlier is also a great resource about insecure default configurations in CI
environments. 18 At the time, Mittal described CI tools as “an attacker’s best
The security of CI/CD deployments is even more important these days, in light
of a recent spike in software supply chain attacks where hackers break into
software development infrastructure in order to insert backdoors and
malicious code into resulting applications. This allows the hackers to
compromise a large number of end users by taking advantage of trusted
software distribution channels. It also makes developers a highly attractive
target.
CI/CD can have benefits for security. For one, it makes remediation and the
deployment of patches much faster. Also, splitting applications into
microservices helps reduce single points of failure and contain compromises,
if configured properly. However, having insecure CI/CD systems in your
infrastructure increases your attack surface and opens entry points for
hackers.
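One low-effort mitigation the sources above point to is refusing to build as root at all. As a minimal sketch (the function names and the error message are our own, not from any particular CI product), a pipeline could run a guard like this before any build step:

```python
import os
import sys

def is_root(euid: int) -> bool:
    """True when the effective UID is root (UID 0)."""
    return euid == 0

def guard_build() -> None:
    """Abort the CI job early when it is executing as root (POSIX only)."""
    euid = os.geteuid() if hasattr(os, "geteuid") else -1
    if is_root(euid):
        sys.exit("refusing to build as root: run the pipeline as an unprivileged user")
```

Calling `guard_build()` as the first line of a build script costs nothing on a correctly configured runner and fails fast on a misconfigured one.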
“The automated build systems, like the CI systems and the CI pipelines, are
checked less for security than the code in which they’re deploying,” Welton
said. “They sit in between the infrastructure components, which are being
tested through network penetration tests, and the application code which is
handled through application pen tests. But then you’ve got this quasi-
containerized environment that’s sitting on its own IP space, in its own
containers, on top of the infrastructure, but below the code, and it’s really not
being tested.”
We spoke to a team at Microsoft that is about a year into its DevOps transition,
and a marketing automation startup, Klaviyo, that has been DevOps-driven
from day one. No matter where in your DevOps process you are, you surely can
use the advice from these site reliability engineers (SREs) about
communication during key points in a DevOps transformation.
The first step is a take-home assignment, or what Stone calls a simulation for
what it’s like to work at Klaviyo. Candidates have to write a small CRUD (create,
read, update and delete) application that deals with weather data and sends
people a personalized email. The right candidates don’t have to be masters of
certain languages, but they must show they are eager to learn and that they
are thinking of the next person who has to use that code, including attention
to documentation, algorithms, readability and cleanliness.
“I love it when people write tests — it shows that it’ll be easier to use your
code in the future. I think a lot of people should test their code and document
their code and they just don’t do it,” Stone told The New Stack.
DevOps isn’t just about scaling a company, but scaling a code base.
Stone gave the example: “Let’s say you had 100 people signed up to this service
and 40 people are in Boston, do you make one API call for each or create it and
use it once and cache it?”
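Stone's example boils down to memoizing the expensive call per city rather than per user. A minimal sketch (the function and field names are illustrative, not Klaviyo's actual code):

```python
from typing import Callable, Dict, List

def send_weather_emails(users: List[dict],
                        fetch_forecast: Callable[[str], str]) -> int:
    """Email every user a forecast, fetching each city's forecast only once.

    Returns the number of upstream API calls made, so the saving is visible.
    """
    cache: Dict[str, str] = {}
    calls = 0
    for user in users:
        city = user["city"]
        if city not in cache:
            cache[city] = fetch_forecast(city)  # one call per distinct city
            calls += 1
        # ... render and send the personalized email using cache[city] ...
    return calls
```

For 100 users split across two cities, this makes two API calls instead of 100 — exactly the trade-off the interview question probes.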
Once you pass your first test, you come in for some collaborative coding.
Candidates are put in charge of refactoring and are allowed to ask as many
questions as they like about anything. The goal is to understand how they
work and look for candidates’ openness and willingness to admit when they
don’t know something.
The next part of the interview test is about DevOps ownership. They inform
the candidates that they now own this code, and ask how they would run,
monitor and maintain it.
Stone says they aren’t looking for candidates who know all the answers,
especially those fresh out of university, but they want to see signs of a desire
to want to be on-call and to own their service, from creating to maintaining it.
Similarly, they don’t expect precise answers, but for responses that indicate a
candidate’s ability to think ahead. They should suggest running their code on
a machine, not their computer, so they aren’t tying it to one person and
personal infrastructure. They should think about where automation can speed
up processes. Most importantly, they need signs of customer empathy and a
desire to make sure the solutions are as stable as possible.
Stone says they are looking for “people who are motivated to learn and who
are technically savvy and who can show empathy. Given the right structures
and resources in place, then they can be successful owning their service from
start to finish.”
When Stone joined, there was just one product team and an SRE team. Eighteen
months later, there were an additional four or five product engineering teams
focused on specific areas of the solution. They have a greater need to
concentrate and codify knowledge transfer as engineers can no longer know
the entire product and, in many cases, can no longer have deep relationships
with other engineers on the team.
Scaling DevOps all comes down to one question: How can communication flow
and how can people still have ownership?
One form of scalable knowledge transfer they use is mob programming. It’s
like the Agile Methodology’s pair programming exercise, but the whole team
is working on the same thing at the same time on the same computer. Klaviyo
did this to help train the team on Terraform when they adopted it to
automate their infrastructure. The SREs acted as mentors to the product
teams, Stone said.
Farrukh was part of the Microsoft social engagement (MSE) and market
insights team’s DevOps journey, which began a year ago. This small, nimble
team is working to minimize downtime for thousands of customers. In a
recent restructuring, a handful of SREs are now sharing infrastructure and
on-call responsibilities with developers.
“This generally increases the health of your monitoring system because the
people who are writing the code are fixing it and feeling the pain points too,”
explained Farrukh.
Each team member is on call for one week at a time. To start, each trainee has
a shadow week, acting as backup for an SRE or another fully trained developer.
Then, within a few weeks, the trainee takes on the role of primary
administrator on duty (AOD), and the more experienced person shadows. There are also
ample tutorials, documents and regular simulated outage exercises to assist.
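The shadow-then-primary rotation described above is easy to make explicit in a schedule. A sketch (team names and the pairing rule are our illustration of the pattern, not Microsoft's actual tooling):

```python
from typing import List, Tuple

def rotation(team: List[str], weeks: int) -> List[Tuple[str, str]]:
    """Weekly (primary, shadow) pairs for an on-call rotation.

    The shadow in week N becomes the primary in week N+1, so every
    trainee backs up an experienced AOD before taking the primary role.
    """
    pairs = []
    for week in range(weeks):
        primary = team[week % len(team)]
        shadow = team[(week + 1) % len(team)]
        pairs.append((primary, shadow))
    return pairs
```

Each person's shadow week immediately precedes their primary week, which mirrors the training sequence Farrukh describes.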
“People generally are very nervous when they go on call for a service for the
first time and they’ve never been on call for anything,” Farrukh said.
She continued that it isn’t about knowing a product inside out, it’s about
knowing where to find what you need:
● Do you have the tools to debug a new problem that comes up?
● Do you know where to find the answer in documentation?
● Do you know who can answer doubts and how to contact them?
On the MSE team, while there is some leeway to choose appropriate tools for
tasks, the infrastructure team works hard to keep standardization across their
stacks, with a limited number of logging systems, languages, libraries and
monitoring systems, so everyone shares a baseline knowledge.
“In the MSE team, as AOD you are first responder, not sole responder. You can
pull in anyone in the entire team to help you during an incident,” she said.
The engineering leads and SREs have even gone out of their way to volunteer
to respond any time.
“An AOD has the power to call anyone, but there is some sort of psychological
barrier so these people have come out and said ‘Please call me’,” Farrukh said.
She said software architects are usually a good first contact for issues within a
DevOps organization, because if they don’t know exactly the problem, they’ll
know who to call.
Farrukh opines that AODs can and should delegate tasks that act as
distractions for their main goal: debugging efficiently. Even looking for the
right contacts and then calling them can be a distraction.
She suggests two roles to help limit these interruptions: bridge manager and
communications (comms) person.
For smaller issues, the comms person and bridge manager may be the same
person, or even the AOD herself. In reality, on the MSE team,
the AOD often has five to six engineers helping her with larger problems,
including SREs for faster deployments and rollbacks.
“We tend to know how the infrastructure works and how to put in place
workarounds. We either advise the AOD or take responsibility for specific
tasks,” Farrukh explained.
As Craig Martin writes in The New Stack’s CI/CD with Kubernetes, DevOps is a
journey and not a destination. It means building cross-functional teams with
common goals, aligning the organization around the architecture and creating
a culture of continuous improvement. Hiring the right people to build your
team, or training an existing team to communicate openly in a DevOps
culture, will be key to maximizing cloud native technologies and unlocking the
speed and agility your organization needs to bring developers closer to your
customer needs.
“Folks who are first coming into this world see these sorts of
mountains of YAML and very deep and complex details of the
underlying platform. When they’re going to AWS, they’ve got huge
cloud formation templates that they’re trying to manage that describe
how to get their software into an actual running environment inside
the cloud,” Hoban said. “And this becomes a really critical part of their
application in terms of how they think about the application and its
delivery to its end users.”
SECTION 3
MANAGE
Once a cloud native application is deployed, DevOps best practices include next-
generation monitoring, dashboards and feedback loops to help ensure success.
Creating Successful
Feedback Loops With KPIs
and Dashboards
It wasn’t that long ago that software updates came in intervals of months
and years. Big companies built big pieces of software and if anything
went wrong from one version to the next, it was just something the user
had to deal with, while teams of developers went through monolithic piles of
code to troubleshoot where they went wrong along the way. Software bugs
aside, even unpopular features and customer complaints were met with
months, if not years, of wait time before new features could be released to
address the issue.
First, let’s take a look at how we got to where we are today, what it means and
how we can use these conditions to better serve our customers and ourselves.
Over time, he explains, we moved away from boxes of compact disks (CDs)
used to install software to constantly connected devices that updated
software in the background. Software was no longer installed on the user’s
machine, but instead run as Software as a Service (SaaS), that could be
updated and deployed in the cloud and immediately provided in rolling waves
to millions of users. And at the same time, technologies like Infrastructure as
a Service (IaaS) and containerization made it possible to quickly spin up
servers in a matter of minutes and seconds instead of hours and days. With all
of these conditions, the expectation for constant updates not only appeared,
but the ability to deliver on that promise also became available to software
teams both large and small, with continuous integration and continuous
delivery (CI/CD) systems.
“With SaaS and PaaS [Platform as a Service], we don’t need to wait on Ops
teams anymore to stand up a server,” Dawson said. “The cloud provided
developers and QA [quality assurance] teams rapid access to compute resources
and infrastructure in a way that revolutionized things.”
Throughout this evolution, teams have adapted their workflows, which really
is the DevOps way. The approach means that developers will always be
looking for platforms and tools to get deeper into the code, looking for better
optimization to make the infrastructure essentially invisible so that they may
adapt application architectures more effectively. This means a new generation
of platform tools that orchestrate the software architectures and the data
plane to allow more informed and automated approaches. The next wave of
innovation is now emerging with serverless technologies — part of a
narrative that will unfold with new practices that use frameworks for
managing streaming architectures for various workloads. Increasingly,
workloads are using Kubernetes and automated platforms that may label
themselves as serverless. HashiCorp, for example, offers a collection of
software tools that enable operations at this scale, and is continuing to find a
real market for its services. Developers are widely supportive of HashiCorp and how it
manages open source communities.
Demystifying DevOps
Much like agile, using the DevOps methodology isn’t as simple as hiring
someone or implementing a solution. Instead, DevOps is an approach to software
development that starts at ideation and continues through to deployment, using
monitoring along the way to identify where improvements can be made.
In other words, deploying new technologies and adopting new methods is only
as good as the final result. It can be easy to get wrapped up in meeting abstract
goals when employing a new methodology, but the ultimate goal of DevOps is
still to better serve your customers.
So what does all of this mean for the modern software development team
looking to increase performance through a cloud native DevOps approach?
Teams need reliable feedback in a way that’s easy to understand, access and
act upon. Through emerging observability practices, cross-functional teams
can set key performance indicators (KPIs), track progress against them
through monitoring and dashboards, and adjust course based on this feedback.
Observability is the evolution of monitoring for the cloud native era. It gives
engineers the information they need to adapt systems and application
architectures to be more stable and resilient. This, in turn, provides a feedback
loop to developers which allows for fast iteration and adaptation to changing
market conditions and customer needs. Without this data and feedback,
developers are flying blind and are more likely to break things. With data in
their hands, developers can move faster and with more confidence. 1 Such
data can also be used to unify Dev, Ops and management around common
goals and establish a definition of success for DevOps initiatives themselves.
The New Stack’s “CI/CD with Kubernetes” ebook has an in-depth discussion of
modern observability practices.
Among organizations in the top 10th percentile of the Alexa Internet
rankings, the 95th percentile of teams deployed 42 times per week.
Whatever initial measures you choose to observe, however, they are just the
beginning. Tharisayi cautions that the key to DevOps is to customize your
monitoring and tighten your feedback loops to focus on the key ingredients of
your particular success.
“It is really about visibility,” Dawson says. “We have a new level of visibility
and insight into how we are delivering software — and a new ability to share
and compare those insights — that provides us a new level of clarity when it
comes to optimizing what we do.”
A dashboard is a display of those KPIs that’s readily visible and that shows how
something is functioning or progressing. When used right — whether on a car,
aircraft or a DevOps laptop — a dashboard is simple to read at a glance and a
very powerful way to know you’re heading in the right direction.
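Two KPIs that recur throughout this section are deployment frequency and mean time to recovery (MTTR). A dashboard ultimately just renders numbers like these, which can be computed from event timestamps. A minimal sketch (the function names and data shapes are our own, assumed for illustration):

```python
from datetime import datetime, timedelta
from typing import List, Tuple

def deploys_per_week(deploy_times: List[datetime]) -> float:
    """Average deployment frequency over the observed window."""
    if len(deploy_times) < 2:
        return float(len(deploy_times))
    span = max(deploy_times) - min(deploy_times)
    weeks = max(span / timedelta(weeks=1), 1e-9)  # avoid division by zero
    return len(deploy_times) / weeks

def mttr(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Mean time to recovery across (opened, resolved) incident pairs."""
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)
```

Feeding these values into a chart, rather than a report, is what turns raw deploy and incident logs into the at-a-glance view a dashboard promises.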
In 2014, IBM developer advocate Steve Poole took over leadership of a more
than 100-person, somewhat distributed, European “Slow IT” team, as it moved
toward the much faster cloud. They moved from a two-year shipping cycle to
supporting SaaS products that shipped daily.
FIG 7.1: A management dashboard gives technical and non-technical teams a way to
assess an application’s business performance at a glance.
He shared how dashboards helped
close the communication gap among siloed teams with the Agile Tour London
conference and in a follow-up interview with The New Stack.
“I learned that IT and Dev don’t communicate. They just shout at each other in
their own language,” Poole said, because they don’t share a common
experience or perspective.
“IT teams tend to group themselves around the experts so you see a lot of mini
silos who are responsible for parts of it. And the team that I took over had lots
of small teams.”
Poole then pointed out how even tech conferences are usually very siloed
around certain languages, methodologies or roles. He saw a need for a new
“DevOps contract” between developers and operations that included self-
service assets like Platform as a Service and Infrastructure as a Service,
containers and testing, and it needed to cover new availability requirements
with an emphasis on speed to market and feedback loops.
“You stop looking at your email and reports in your system, and you start
looking at dashboards,” Poole said. “Prior to that, Dev teams would do the
work that they are planning and then the management team would come in
and say ‘We have a bug, who’s going to fix that?’ Now, it’s a pager look for the
whole team” — everyone on call with the same view.
All of the dashboards had one thing in common: to make sure the right people
were the first to know if something went wrong.
While middle management dashboards are all about seeing where intervention
is necessary, executives are desperate for more accurate data to guide which
levers to pull to redirect the organization.
For example, one of the things Dev teams were told to do is make better use of
cloud capacity, moving from on-premises infrastructure. Poole says to do that,
you have to understand your current on-premises status. With dashboards, the
large team went from numbers to a progress report, a line that goes up when
cloud is gaining and local is going down, with a projection for the future. Of
course, they needed to make some adjustments in order to make a direct
comparison. They classified servers by the type of workloads and on-premises
workloads in terms of cloud characteristics, such as central processing units
(CPUs), memory and multitenancy to bare metal.
With such visual displays, managers were able to justify budget increases more
effectively by showing how their existing capacity is being used. They could
also better predict the overall cost of the move to the cloud.
Everything had to be handcrafted for the executive dashboards, but then, since
it’s DevOps, the information retrieval was automated.
“When you come up with a common vision, you chuck all the titles away and
say, ‘Let’s just sit down and see what you want to do and what you want to get
out of it.’ You can educate the exec in the realities of the challenges and [they]
can come back and say ‘I accept that’ or ‘I don’t.’ The creation of the dashboard
[makes it] a two-way activity,” Poole said.
Like all dashboards and big data, a big challenge to DevOps automation is
cleaning up inaccurate, unrealistic and stale data. It took Poole’s team about a
month of conversations to understand what data was coming out of thousands
of machines, understanding purposes, use cases and the size of said data. Only
then could the data be cleaned, bringing usefulness to the dashboards.
If you really want to get Devs, Ops and support on a united front, you need to
offer transparent feedback and let them visibly see they are on a path together
toward shipping better code. It’s all about creating actionable insights by
making patterns easier to see in a graph. In one case Poole’s team was trying
to figure out how to get to a reasonable delivery every time. Using a
dashboard, they charted out how many lines of code they were shipping
versus the level of complication in that code. If there was a certain
combination of quantity and complexity, they determined that it should be
delayed from being shipped.
After initial success creating graphs, they followed this mindset with all the IT
dashboards, improving ticketing dashboards and measuring things like “How
long to close 80 percent of the tickets?” They also used dashboards to highlight
the different processes each team uses.
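A metric like “how long to close 80 percent of the tickets?” is just a percentile over ticket resolution times. As a sketch of the arithmetic behind that dashboard panel (using the simple nearest-rank method; real ticketing tools may interpolate differently):

```python
import math
from typing import List

def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least
    pct percent of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]
```

Given ticket close times in hours, `percentile(close_times, 80)` answers Poole's question directly, and tracking it week over week shows whether the team is getting faster.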
Once dashboards become useful within teams, it makes sense to begin sharing
them with end users as well. Sharing teams’ initial response times and time-
to-resolution figures with end users — whether those are the Devs being
served by Ops or sometimes even external customers — can help everyone
align around common goals and expectations.
FIG 7.2: Developer and operations teams should start the day by level-setting from
the same application performance dashboard, according to New Relic.
“To make that real, you have to have a conversation with your end user [about]
what it means to be down. What does red mean? Depending on what service
that you’re looking at it can be different.” Poole continued, “What we taught
the IT teams is to understand what it really meant for a service to be available,
so when it was unavailable it really meant unavailable.”
“You’re washing your dirty linen: ‘You do that? Why do you do that?’
Sometimes it’s a fair question and sometimes it’s just not understanding,” he
explained.
But once this dashboard infrastructure was up and running, the value was
clear.
“We don’t really need to put it down into dollars, it’s all about progress and
status,” Poole said.
Can dashboards really change culture? Of course not, but they act as a good
mirror to see if the culture has embraced transparency, autonomy and
automation.
“The important thing about this is it changed how people behaved. You go
from a team to where the body language is very closed, they don’t want to
talk, and they don’t want to share because they are afraid they’re being told
off,” Poole said.
Once they understood their customers loved what they were doing, he said,
their behavior changed. “We went from very negative teams to understanding
they were valued.”
The end goal is no longer for testers to check code after it’s written, but rather
to play a more strategic role in an organization looking for speed through
automation and responsiveness to production data. Testing code early and
often, in an automated and parallel fashion — called continuous testing —
eases testing bottlenecks, allows bugs to be found and fixed earlier and helps
speed code to production, writes Lubos Parobek, vice president of product at
Sauce Labs. 4 It helps provide security and compliance checkpoints,
improving the quality of code in production. And it ensures new application
features have their intended effect for customers, before such features are
rolled out across an entire customer base. Testing is now considered as part of
the planning and architecture discussion, before a single line of code is written.
“What I see in most enterprise customers is people trying to bring some level
of control into their pipelines,” said Alex Martins, chief technology officer for
continuous quality testing at CA Technologies. “The low hanging fruit towards
that is to use service virtualization as one technique that is quick and dirty to
isolate what you’re testing. If you’re testing on premises and it needs some
services in the cloud, then you isolate the services and don’t worry about the
cloud.”
It’s a DevOps approach, applied to testing. Like other DevOps processes, testing
is being increasingly automated as part of the CI/CD pipeline. And QA and
testing job roles have been adapting to agile development practices. Embedded
testing is an emerging role that digitally driven companies such as Facebook
and Moo are experimenting with as they look to strengthen software in a way
which is distinct from site reliability engineering.
Test engineers are embedded within tech teams in order to break down silos,
Bangser said. Her job isn’t about testing for the developers, but rather to help
the team identify what quality means for each service, by identifying suitable
requirements and using testing tools to support them.
“Our engineers are in charge of leveraging tools [for] their own monitoring,
writing their own service-level alerts, running their own pipelines including
through to production, deployment and monitoring,” she said.
Bangser looks at her role on the platform team — which she says combines
platform engineering, operations, and testing — as a good alternative to a site
reliability engineer (SRE) for smaller development teams. She says her role
focuses on infrastructure and provisioning, shared resources and the
observability of these services.
Bangser continued that her role involves a lot of requirements and user story work:
“You focus on user experience. I hear Devs talk about all the time that ‘they
don’t need UX because it’s an API.’ But if you are building software, you have
users — and as a platform team we absolutely are aware of our users — and
they [the users] are the dev teams,” she said.
She says the test engineer role helps the whole team focus on the impact of
any change to the end users, which means making sure documentation,
training, feature prioritization and update communication all stay front of
mind.
While the developers are responsible for writing their own tests, Snyder said,
“teams don’t have to reinvent the wheel to run and track their tests at the
right times. The testing infrastructure team builds shared systems for test
selection, execution, triage and reporting to support the full life cycle of tests
once they are written.”
At Facebook, each product team has three types of tests that are running
continuously on any change that will be committed to Facebook’s master
repository:
● Unit tests: very small and targeted tests. They write a lot of these cheaper
tests early on.
● Integration tests: larger chunks of dependent code.
● End-to-end tests: more expensive and longer to run.
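The three tiers above can be sketched with Python's built-in `unittest` module. The function and scenario here are hypothetical placeholders, chosen only to show the difference in scope between a cheap unit test and a broader integration test:

```python
import unittest

def apply_discount(price: float, pct: float) -> float:
    """Pure function under test: apply a percentage discount."""
    return round(price * (1 - pct / 100), 2)

class UnitTests(unittest.TestCase):
    """Very small, targeted tests of one function in isolation."""
    def test_discount(self):
        self.assertEqual(apply_discount(100.0, 15), 85.0)

class IntegrationTests(unittest.TestCase):
    """Exercise a larger chunk of dependent code together."""
    def test_checkout_total(self):
        cart = [(100.0, 15), (40.0, 0)]
        total = sum(apply_discount(p, d) for p, d in cart)
        self.assertEqual(total, 125.0)
```

The cost gradient Snyder describes shows up in practice: the unit tier runs in microseconds and pinpoints failures, while the integration and end-to-end tiers cost more per run and so run less often.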
Snyder said this is all done with the goal of signaling any anomalies as early as
possible to developers: “Either your test is broken by this change or something
is broken in the master [repo] so please come and see what’s going on.”
At scale, automation becomes critical for handling the level of testing each
application requires, alongside the efficient use of testing resources. Facebook
and its other applications have a greater need than many to test across
environments due to scale. For example, it must test different versions of the
application for specific operating systems: iOS, Android and KaiOS. This is why
Facebook created a resource pool for tests.
One World is used when a product developer would like to test the build of an
Android application but doesn’t want to set up a complicated Android emulator.
Or the tool can be used when someone wants to try something on Windows
but doesn’t have a Windows machine; they can use a virtual machine to test it.
Facebook even has a physical lab for testing devices like mobile phones.
Everything they are working with needs to focus on answering the same set of
questions about the trade-offs and cost of running each test.
Developers learn from production engineers how to evaluate the trade-offs and
the cost — time and capacity resources — of executing and running tests.
Facebook is working to “do more with less at every layer of testing. Infinitely
running tests won’t scale forever,” Snyder said.
“As technologists, we can fall into the trap of assuming that we know what is
best for the product, but that kind of thinking can be dangerous. Only by
getting customer feedback and by falling in love with the customer’s problems
will we be able to create the space for innovation and build remarkable
customer experiences,” Swann writes on The New Stack.
Once initial teams begin to thrive, roll out the changes to other teams.
StubHub teams think about testing and learning in two distinct ways.
The important part is making sure that the team is focused on getting
“signals” and learning as quickly as possible, Swann said. Rather than
implementing large, time-consuming features and then learning — large
effort with high risk — the team needs to get data as early as possible to learn
whether the hypothesis is correct — small efforts with small risk. When you
are right, keep going and perform another test to learn more. By keeping tests
small and focused you minimize the overall risk to the company.
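Getting a “signal” from a small experiment usually means comparing a variant against a control. As a sketch of one common approach (a two-proportion z-score; the names and thresholds here are generic statistics, not StubHub's actual method):

```python
import math

def conversion_signal(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-proportion z-score: how strongly variant B's conversion rate
    differs from control A's, in standard errors."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

A score near zero says keep collecting data; a score above roughly 2 says the variant is likely a real improvement, so keep going and run the next small test.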
Effective Monitoring in a
Cloud Native World
The transition to new cloud native technologies like Kubernetes has
changed the way applications are built and run. The monitoring tools we’re
used to simply can’t keep up with these more modern architectures. They can’t
provide the level of insight into our applications that we need to fully
understand what’s happening.
White box monitoring draws on data from inside the system itself. It can
surface problems before they become externally visible, and can provide
valuable information for in-depth debugging. Each of the approaches in the
following sections is a form of white box monitoring.
With cloud native infrastructure, containers and servers are ephemeral, and it
becomes more important than ever to ship logs to some kind of centralized
logging system. Components in a cloud native system also tend to change more
often with the increased use of continuous delivery (CD) techniques. 9
Elasticsearch, often deployed with Logstash and Kibana to make up an “ELK”
stack, has become one of the most popular open source solutions for
centralized logging. The components of an ELK stack, also known as the
Elastic Stack, combine to provide a very compelling set of open source tools
that simplify log storage, collection and visualization respectively.
Having all system and application logs in a single place can be an incredibly
powerful component of your monitoring system. When things go wrong,
centralized logging allows you to quickly see everything happening in your
system at that point in time, and filter through logs for specific applications,
labels or messages. Taking this idea further, Grafana Labs has released an
open source log aggregation tool for Kubernetes called Loki, which indexes the
log metadata, allowing for queries on where the logs came from, when they
were generated, on what host and what version of the software. 10
“Machine learning can help you find an anomalous and malicious activity, and
automate workflows so operations teams and developers can address the
issues faster. Traditional security approaches, by comparison, can’t keep up
with the speed or scale of these new architectures,” writes Rohan Tandon, a
statistician and member of the technical staff at StackRox. 11
Instrumenting applications to expose their own internal metrics yields data
that are a better measure of application health. These kinds of metrics can
provide much more precise information than the kind of metrics derived from
polling data from outside of the system.
Open source tools like Prometheus have transformed this space. At its core,
Prometheus is a monitoring and alerting toolkit that stores metrics with a
multidimensional time series database. Each time series is identified by a
metric name and a set of key-value label pairs, and tracks the value of that
metric over time. The simplicity of
this model enables the efficient collection of a wide variety of metrics.
Prometheus has become especially popular in the cloud native ecosystem, with
great Kubernetes integration. The ease of tracking new metrics with
Prometheus has resulted in many applications exposing a wide variety of
custom metrics for collection. These are usually well beyond the standard
resource utilization metrics we’d traditionally think of when it comes to
monitoring. As an example of what this could look like, the popular
Kubernetes nginx-ingress project exposes metrics such as upstream latency,
process connections, request duration, and request size. When Prometheus is
running in the same cluster, it can easily collect the metrics exposed by the
many applications like nginx-ingress that support Prometheus out of the box.
In addition to all the tools that have Prometheus support built in, it’s rather
straightforward to export custom metrics for your own application. Having
these kinds of custom metrics monitored for your application can provide a
great deal of insight into how your application is running, along with exposing
any potential problems before they become more outwardly visible.
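Prometheus scrapes metrics as plain text over HTTP, which is part of why exporting custom metrics is so straightforward. A real application would normally use an official client library (such as `prometheus_client` for Python), but the text exposition format is simple enough to sketch with the standard library alone; the metric names and values below are hypothetical:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application counters; a real app would update these as it runs.
METRICS = {
    ("orders_processed_total", (("status", "ok"),)): 42,
    ("orders_processed_total", (("status", "error"),)): 3,
}

def render_exposition(metrics):
    """Render counters in the Prometheus text exposition format."""
    lines = ["# TYPE orders_processed_total counter"]
    for (name, labels), value in sorted(metrics.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_exposition(METRICS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)

# To expose a scrape target on port 9100 (uncomment to run):
# HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

Once an endpoint like this exists, pointing Prometheus at it is a matter of scrape configuration rather than application changes.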
There are some great open source tools focused on request tracing, including
Jaeger and Zipkin. These tools allow you to see detailed information about all
requests that spawned from an initial request, providing end-to-end visibility
across your microservices. This kind of insight can be invaluable when trying
to diagnose any bottlenecks in your systems.
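Tools like Jaeger and Zipkin are built on the span model: each unit of work records its own ID plus the ID of the parent that caused it, all sharing one trace ID, so the full request tree can be reassembled afterward. A minimal sketch of that parent/child bookkeeping (not either tool's actual API) might look like:

```python
import time
import uuid

class Span:
    """Minimal span: one timed unit of work, linked to its parent."""

    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.trace_id = trace_id or uuid.uuid4().hex  # shared by the whole request
        self.span_id = uuid.uuid4().hex               # unique to this unit of work
        self.parent_id = parent_id
        self.start = time.monotonic()
        self.duration = None

    def child(self, name):
        # Downstream work inherits the trace ID and points back at us.
        return Span(name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self):
        self.duration = time.monotonic() - self.start
        return self

# One inbound request fans out to two downstream service calls:
root = Span("GET /checkout")
auth = root.child("auth-service").finish()
pay = root.child("payment-service").finish()
root.finish()
```

Because all three spans share one trace ID and carry parent pointers, a collector can rebuild the tree and show exactly where the request's time went — which is where bottlenecks become visible.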
It’s not just that AIOps can help with availability and performance monitoring,
event correlation and analysis, IT service management, help desk and customer
support, and infrastructure automation. It’s also part of the general ‘shift left’
DevOps trend, in which operations becomes an integrated part of application
development.
IT and operations is a natural home for machine learning and data science. If
there isn’t a data science team in your organization, the IT team will often
become the “center of excellence,” said Vivek Bhalla, who was until recently a
Gartner research director covering AIOps and is now director of product
management at Moogsoft.
By 2022, Gartner predicts, 40 percent of all large enterprises will use machine
learning to support, or even partly replace, monitoring, service desk and
automation processes. Today, that’s just starting to happen in small numbers.
“Look at the repetitive, low-level tasks that are ripe for automation to free up
the time of operations staff, lowering their stress levels and letting them use
that extra bandwidth to work smarter,” Bhalla said at Moogsoft’s AIOps
Symposium.
Visualization and statistical analysis of historical data are what Gartner views
as a reactive approach; you can look back and understand what has happened
using machine learning, either for general performance understanding or for
root cause analysis. As you move to the combination of historical and live data
with machine learning and causal analytics, operations teams can become
more proactive with predictive warning systems. If AI-powered systems are
going to predict problems and even automate fixes, they need to do more than
spot patterns; they need to understand them. For now, simply detecting which
alerts and errors come from the same event can be very valuable, reducing the
flood of noise to something useful.
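One of the simpler proactive techniques in this family is statistical baselining: learn the normal range of a metric from historical samples, then flag live values that fall far outside it. The toy z-score detector below is far cruder than what commercial AIOps tools do, but it illustrates the idea:

```python
import statistics

def anomalies(history, live, threshold=3.0):
    """Flag live samples more than `threshold` standard deviations
    away from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return [x for x in live if abs(x - mean) > threshold * stdev]

# Hypothetical request latencies in ms: a stable baseline, then a spike.
baseline = [98, 102, 101, 97, 103, 99, 100, 101, 98, 102]
incoming = [100, 104, 250, 99]
print(anomalies(baseline, incoming))  # the 250 ms spike stands out
```

Real systems add seasonality handling, drift-tolerant baselines and learned thresholds, but the core move — compare live data against a model of history — is the same.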
“IT systems generate vast quantities of self-describing data but the data
streams generated tend to be highly redundant,” Will Cappelli, chief
technology officer at Moogsoft, said. “Stripping out that redundancy turns
something that’s voluminous but information poor into something thinner,
information rich.”
Aggregating alerts makes them more manageable: correlating events created in
the same time period (while accounting for latency), using the physical and
application topology of the IT system, and comparing the text streams for
related text, with the option for customers to write their own rules. OpsQ and
BigPanda’s LØ do similar kinds of correlation and aggregation for visibility
and noise reduction.
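Two of those ingredients — a shared time window and textual similarity — can be sketched directly. This grouping is much cruder than what the commercial tools do (no topology, no latency modeling, no learned weights), but it shows how correlation collapses an alert flood into a few clusters:

```python
import difflib

def correlate(alerts, window=60, similarity=0.6):
    """Group (timestamp, text) alerts that arrive within `window` seconds
    of a cluster's first alert and whose text resembles that seed alert."""
    clusters = []
    for ts, text in sorted(alerts):
        for cluster in clusters:
            seed_ts, seed_text = cluster[0]
            close = ts - seed_ts <= window
            alike = difflib.SequenceMatcher(None, text, seed_text).ratio() >= similarity
            if close and alike:
                cluster.append((ts, text))
                break
        else:
            clusters.append([(ts, text)])
    return clusters

# Hypothetical alert stream: three near-identical failures plus one outlier.
alerts = [
    (0,  "db-1: connection refused"),
    (5,  "db-2: connection refused"),
    (12, "api: upstream timeout"),
    (15, "db-3: connection refused"),
]
groups = correlate(alerts)
# The "connection refused" alerts collapse into one cluster;
# the unrelated timeout stays separate.
```

Four raw alerts become two actionable items — the same noise reduction described above, in miniature.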
The next level is causal analysis, Cappelli explained. “You wind up with an
envelope of correlated data items that you have some reason to think are
related, and then we introduce causal analytics.” Some of that is done by
probabilistic root cause analysis using statistical machine learning, which is
common in AIOps tools from BigPanda, Elastic, IBM and Splunk.
“We look at packets of correlated data and we can structure this package
causally based on neural networks,” Cappelli said. “… new situation going down
a path that’s been proven to be fruitless, we’ll tell them, to guide them in a
different direction.”
It’s important therefore that modern monitoring methods are baked into the
deployment pipeline with the minimum of fuss, Waterhouse writes. In the
aforementioned Kubernetes cluster deployment, monitoring can be established
with the actual deployment itself, meaning no lengthy configuration and no
interruptions. In another example, we could increase observability by establishing
server-side application performance visibility with client-side response time
analysis during load testing — a neat way of pinpointing problem root cause as
we test at scale. Again, what makes it valuable isn’t just the innovation, it’s the
simple and straightforward application — ergo, it’s frictionless.
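As a concrete example of monitoring "established with the deployment itself": a common Kubernetes convention is to annotate pods so that a suitably configured Prometheus discovers and scrapes them automatically, with no separate monitoring setup step. The annotation names below follow the widely used `prometheus.io` convention; the application name, image and port are hypothetical, and whether the annotations take effect depends on the cluster's Prometheus scrape configuration:

```yaml
# Deployment fragment: the pod becomes a scrape target as soon as it is
# deployed, assuming Prometheus is configured to honor these annotations.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app          # hypothetical application
spec:
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: example-app
        image: example/app:1.0   # hypothetical image
        ports:
        - containerPort: 9100
```

Monitoring configuration ships in the same manifest as the workload — which is exactly the frictionless quality the paragraph above describes.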
It’ll still take a fair amount of cultural nudging and cajoling to get teams to do
the right things if systems are to become more observable, but beware of
dictatorial approaches and browbeating. To this end, DevOps-centric teams
will always be on the lookout for opportunities to demonstrate how to make
applications more observable and the value it delivers. That could be as easy as
perusing an application topology over coffee to determine latency blind-spots
and where instrumentation could help. Another opportunity could be after a
major incident or application release, but again the focus should be on
collective improvement and never finger pointing.
In all cases, the goal should be to train people on how to get better at making
their systems observable. That’ll involve delivering fast insights to get some
quick wins, but could quickly develop into a highly effective service providing
guidance on monitoring designs and improvement strategies. To this point,
many organizations have built out teams of “observability engineers.” Some go
even further and incorporate observability learnings and practices into their
new-hire training programs.
SPONSOR RESOURCE
• Building Cloud Native Apps Painlessly
— The Prescriptive Guide to Kubernetes and Jenkins X
by CloudBees, 2019
In this paper, you’ll read about how the future of modern application
development can benefit from the powerful combination of Jenkins X and Kubernetes,
providing developers a seamless way to automate their continuous integration
(CI) and continuous delivery (CD) process.
SPONSOR RESOURCE
• Delivering Cloud Native Infrastructure as Code
by Pulumi, 2018
Find out how to deliver all cloud native infrastructure as code with a single
consistent programming model in this white paper from Pulumi.
SPONSOR RESOURCE
• Continuous Delivery Summit
by Continuous Delivery Foundation, 2019
The Continuous Delivery Foundation will be hosting a Continuous Delivery
Summit (CDS) event on May 20 at KubeCon + CloudNativeCon Europe 2019 in
Barcelona, Spain. The Cloud Native Computing Foundation’s flagship
conference gathers adopters and technologists from leading open source and cloud
native communities including Kubernetes, Prometheus, Helm and many
others. Register to attend today!
14. “Monitoring and Observability — What’s the Difference and Why Does It
Matter?” by Peter Waterhouse, Senior Strategist at CA Technologies, The New
Stack, April 16, 2018. Waterhouse reviews monitoring basics and defines
modern observability practices.