The New Stack
CI/CD with Kubernetes
Alex Williams, Founder & Editor-in-Chief
Core Team:
Bailey Math, AV Engineer
Benjamin Ball, Marketing Director
Gabriel H. Dinh, Executive Producer
Judy Williams, Copy Editor
Kiran Oliver, Podcast Producer
Lawrence Hecht, Research Director
Libby Clark, Editorial Director
Norris Deajon, AV Engineer
Kubernetes “does not deploy source code and does not build your
application. Continuous Integration, Delivery, and Deployment (CI/CD)
workflows are determined by organization cultures and preferences
as well as technical requirements.”
This ebook, the third and final in The New Stack’s Kubernetes ecosystem
series, lays the foundation for understanding and building your team’s
practices and pipelines for delivering — and continuously improving —
applications on Kubernetes. How is that done? It’s not a set of rules. It’s a
set of practices that flow into the organization and affect how application
architectures are developed. This is DevOps, and its currents are now
deep inside organizations with modern application architectures,
manifested through continuous delivery.
Section Summaries
• Section 1: DevOps Patterns, by Rob Scott of ReactiveOps, explores
the history of DevOps, how it is affecting cloud-native architectures
and how Kubernetes is again transforming DevOps. This section traces
the history of Docker and container packaging to the emergence of
Kubernetes and how it is affecting application development and
deployment.
While the book ends with a focus on observability, it’s increasingly clear
that cloud-native monitoring is not an endpoint in the development life
cycle of an application. It is, instead, the process of granular data
collection and analysis that defines patterns and informs developers and
operations teams from start to finish, in a continual cycle of improvement
and delivery. Similarly, this book is intended as a reference throughout
the planning, development, release, management and improvement cycle.
Teams can focus less on the infrastructure and more on the applications that run light workloads.
The combined effect is a shaping of automated processes that yield
better efficiencies.
In fact, some would argue that an application isn’t truly cloud native
unless it has DevOps practices behind it, as cloud-native architectures are
built for web-scale computing. DevOps professionals are required to build,
deploy and manage declarative infrastructure that is secure, resilient and
high performing. Delivering these requirements just isn’t feasible with a
traditional siloed approach.
Therein lies the challenge: You must make sure your organization is
prepared to transform the way all members of the product team work.
Ultimately, DevOps is a story about why you want to do streamlined, lean
product development in the first place — the same reason that you’re
moving to a microservices architecture on top of Kubernetes.
Our author for this chapter is Rob Scott, a site reliability engineer at
ReactiveOps. Scott is an expert in DevOps practices, applying techniques
from his learnings to help customers run services that can scale on
Kubernetes architectures. His expertise in building scaled-out architectures
stems from years of experience that has made him a firsthand witness to:
Traditionally, the speed with which software was developed and deployed
didn’t allow a lot of time for collaboration between engineers and operations
staff, who worked on separate teams. Many organizations had embraced
lean product development practices and were under constant pressure to
release software quickly. Developers would build out their applications, and
the operations team would deploy them. Any conflict between the two
teams resulted from a core disconnect — the operations team was
unfamiliar with the applications being deployed, and the development team
was unfamiliar with how the applications were being deployed.
With that in mind, the industry started to move toward a pattern that
avoided making changes to existing servers: immutable infrastructure.
Virtual machines combined with cloud infrastructure to dramatically
simplify creating new servers for each application update. In this
workflow, a CI/CD pipeline would create machine images that included
the application, dependencies and base operating system (OS). These
machine images could then be used to create identical, immutable
servers to run the application. They could also be tested in a quality
assurance (QA) environment before being deployed to production.
The ability to test every bit of the image before it reached production
resulted in an incredible improvement in reliability for QA teams.
Unfortunately, the process of creating new machine images and then
running a whole new set of servers with them was also rather slow.
It was around this time that Docker started to gain popularity. Based on
Linux kernel features, cgroups and namespaces, Docker is an open source
project that automates the development, deployment and running of
applications inside isolated containers. Docker offered a lot of the same benefits as immutable machine images, with everything an application needs bundled into a single package. Before, each server would need to have all the OS-level dependencies to run a Ruby or Java application. The container changes that: it’s a thin wrapper — a single package — containing everything you
need to run an application. Let’s explore how modern DevOps practices
reflect the core value of containers.
Next, the way containers worked and behaved was largely undefined
when Docker first popularized the technology. Many organizations
wondered if containerization would really pay off, and some remain
skeptical.
Each container gets its own networking and its own isolated process tree, separate from the host.
You once had specialized servers and were worried about them falling
apart and having to replace them. Now servers are easily replaceable and
can be scaled up or down — all your server needs to be able to do is run
the container. It no longer matters which server is running your container,
or whether that server is on premises, in the public cloud or a hybrid of
both. You don’t need an application server, web server or different
specialized server for every application that’s running. And if you lose a
server, another server can run that same container. You can deploy any
number of applications using the same tools and the same servers.
Compartmentalization, consistency and standardized workflows have
transformed deployments.
What began as two disparate job functions with crossover has now
become its own job function. Operations teams are working with code
bases; developers are working to deploy applications and are getting
farther into the operational system. From an operational perspective,
developers can look backward and read the CI file and understand the
deployment processes. You can even look at Dockerfiles and see all the
dependencies your application needs. It’s simpler from an operational
perspective to understand the code base.
Introduction to Kubernetes
Kubernetes is a powerful, next generation, open source platform for
automating the deployment, scaling and management of application
containers across clusters of hosts. It can run any workload. Kubernetes
provides exceptional developer user experience (UX), and the rate of
innovation is phenomenal. From the start, Kubernetes’ infrastructure
promised to enable organizations to deploy applications rapidly at scale
and roll out new features easily while using only the resources needed.
With Kubernetes, organizations can have their own Heroku running in
their own public cloud or on-premises environment.
In years past, think about how often development teams wanted visibility
into operations deployments. Developers and operations teams have
always been nervous about deployments because maintenance windows
had a tendency to expand, causing downtime. Operations teams, in turn,
have traditionally guarded their territory so no one would interfere with
their ability to get their job done.
FIG 1.1: With Kubernetes, pods are distributed across servers with load balancing
and routing built in. Distributing application workloads in this way can dramatically
increase resource utilization.
Beyond tracking the services running in a cluster, or the endpoints that can be used to access them, Kubernetes can
help with configuration management. Kubernetes has a concept called
ConfigMap where you can define environment variables and configuration
files for your application. Similarly, objects called secrets contain sensitive
information and help define how your application will run. Secrets work
much like ConfigMaps, but are more obscure and less visible to end users.
Chapter 2 explores all of this in detail.
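As a minimal sketch of the idea — the names and values here are purely illustrative, not drawn from any real deployment — a ConfigMap and a Secret for a hypothetical API service might look like this:

    # Illustrative ConfigMap: plain configuration for a hypothetical "api" service.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: api-config
    data:
      LOG_LEVEL: "info"
      CACHE_TTL_SECONDS: "300"
    ---
    # Illustrative Secret: the same mechanism, but for sensitive values.
    apiVersion: v1
    kind: Secret
    metadata:
      name: api-credentials
    type: Opaque
    stringData:
      DATABASE_PASSWORD: "example-password"  # stored base64-encoded by Kubernetes

Pods can then consume both as environment variables or mounted files, keeping configuration out of the container image itself.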
Both probes are useful tools, and Kubernetes makes them easy to
configure.
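To illustrate how little configuration the liveness and readiness probes need, a container spec might declare them roughly as follows; the endpoint paths, port and timings are assumptions made for the example:

    # Illustrative pod spec fragment: the liveness probe restarts a wedged
    # container, while the readiness probe keeps traffic away until the
    # application reports that it can serve requests.
    containers:
    - name: api
      image: example.com/api:1.0        # hypothetical image
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 15
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5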
Simplified Monitoring
While on the surface it might seem that monitoring Kubernetes would be
quite complex, there has been a lot of development in this space.
Although Kubernetes and containers add some levels of complexity to
your infrastructure, they also ensure that all your applications are running
in consistent pods and deployments. This consistency enables monitoring
tools to be simpler in many ways.
Development speeds up when you don’t have to worry about building and
deploying a monolith in order to update everything. By splitting a
monolith into microservices, you can instead update pieces — this service
or that. Part of a good CI/CD workflow should also include a strong test
suite. While not unique to Kubernetes, a containerized approach can make
tests more straightforward to run. If your application tests depend on
other services, you can run your tests against those containers, simplifying
the testing process. A one-line command is usually all you need to update
a Kubernetes deployment.
In a CI/CD workflow, ideally you run many tests. If those tests fail, your
image will never be built, and you’ll never deploy that container.
FIG 1.2: Before Kubernetes shuts down existing pods, it will start spinning up new
ones. Only when the new ones are up and running correctly does it get rid of the
old, stable release. Such rolling updates and native rollback features are a game-
changer for DevOps.
In particular, the aforementioned tests and health checks can prevent bad
code from reaching production. As part of a rolling update, Kubernetes
spins up separate new pods running your application while the old ones
are still running. When the new pods are healthy, Kubernetes gets rid of
the old ones. It’s a smart, simple concept, and it’s one less thing you have
to worry about for each application and in your CI/CD workflow.
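To make the mechanics concrete, a Deployment can declare how aggressively Kubernetes is allowed to swap pods during a rollout. The sketch below uses illustrative values, not a recommendation:

    # Illustrative Deployment fragment: new pods are started before old ones
    # are removed, within the surge and unavailability bounds declared here.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: api
    spec:
      replicas: 4
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1         # at most one extra pod during the rollout
          maxUnavailable: 0   # never drop below the desired replica count
      selector:
        matchLabels:
          app: api
      template:
        metadata:
          labels:
            app: api
        spec:
          containers:
          - name: api
            image: example.com/api:1.1   # hypothetical new image

If the new pods never become healthy, the rollout stalls rather than replacing the stable release, and a single command (kubectl rollout undo) returns the Deployment to its previous revision.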
Complementary Tools
As part of the incredible momentum Kubernetes has seen, a number of
DevOps tools have emerged that are particularly helpful in developing CI/
CD workflows with Kubernetes. Bear in mind that CI/CD tools and
practices are still evolving with the advent of cloud-native deployments on
Kubernetes. No single tool yet offers the perfect solution for managing
cloud-native applications from build through deployment and continuous
delivery. Although there are far too many to mention here, it’s worth
highlighting a few DevOps tools that were purpose-built for cloud-native
applications:
• Skaffold: Similar to Draft, this is a new tool from Google that enables exciting new development workflows. In addition to supporting more standard CI/CD workflows, it has an option to automatically rebuild and redeploy code to a development cluster each time local changes are saved.
Extensions of DevOps
Continuous deployment of cloud-native applications has transformed the
way teams collaborate. Transparency and observability at every stage of
development and deployment are increasingly the norm. It’s probably no
surprise then that GitOps and SecOps, both enabled by cloud-native
architecture, are building on current DevOps practices by providing a
single source of truth for changes to the infrastructure and changes to
security policies and rules. The sections below highlight these
evolutionary developments.
GitOps
Git is becoming the standard for distributed version control, wherein a Git
repository contains the entire system: code, config, monitoring rules,
dashboards and a full audit trail. GitOps is an iteration of DevOps, wherein
Git is a single source of truth for the whole system, enabling rapid
application development on cloud-native systems, and using Kubernetes
in particular. “GitOps” is a term developed by Weaveworks to describe
DevOps best practices in the age of Kubernetes, and it strongly
emphasizes a declarative infrastructure.
The fundamental theorem of GitOps is that if you can describe it, you can
automate it. And if you can automate it, you can control and accelerate it.
The goal is to describe everything — policies, code, configuration and
monitoring — and then version control everything.
With GitOps, your code should represent the state of your infrastructure.
GitOps borrows DevOps logic:
• Configuration is code.
How it works: Code is committed and pushed to GitHub, and a CI/CD workflow listening on the other side picks up the change and applies it to the system configuration.
The key difference: Instead of engineers interacting with Kubernetes or the system configuration directly — say, using the Kubernetes CLI — they’re doing everything through Git, and the pipeline applies their changes to the cluster.
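As a rough sketch of that loop — using GitHub Actions syntax purely as an example, with a hypothetical manifests/ directory and pre-configured cluster credentials — the pipeline simply re-applies whatever is committed:

    # Illustrative GitOps-style workflow: every push to the main branch
    # re-applies the declarative manifests stored in the repository, so Git
    # remains the single source of truth for cluster state.
    name: apply-manifests
    on:
      push:
        branches: [main]
    jobs:
      deploy:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Apply committed configuration
            run: kubectl apply -f manifests/   # assumes kubectl is already authenticated to the target cluster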
SecOps
Security teams traditionally hand off security test results and vulnerability
scans to operations teams for review and implementation, often as an
application is being deployed. So long as an application is running and
performing as expected, security’s involvement in the process gets a green
light. However, information exchange and approval cycles can lead to delays
and slow what would otherwise be an agile DevOps workflow. It’s not
surprising that this occurs as the two teams are dealing with two very different
sets of goals. Operations tries to get the system running in as straightforward
and resilient a manner as possible. Security, on the other hand, seeks to
control the environment — the fewer things running, the better.
SecOps bridges the efforts of security and operations teams in the same
way that DevOps bridges the efforts of software developers and
operations teams. Just as a DevOps approach allows product developers
to strategically deploy, manage, monitor and secure their own
applications, a SecOps approach gives engineers a window into both
operations and security issues. It’s a transition from individual tacticians
— system administrators and database administrators — to more
strategic roles within the organization. These teams share priorities,
processes, tools and, most importantly, accountability, giving
organizations a centralized view of vulnerabilities and remediation actions
while also automating and accelerating corrective actions.
Conclusion
The story of DevOps and Kubernetes is one of continued, fast-paced
evolution. For example, in the current state of DevOps, technologies that
may be only a few years old can start to feel ancient. The industry is
changing on a dime.
Kubernetes is still very new and exciting, and there’s incredible demand in
the market for it. Organizations are leveraging DevOps to migrate to the
cloud, automate infrastructure and take Software as a Service (SaaS) and
web applications to the next level. Consider what you might accomplish
with higher availability, autoscaling and a richer feature set. Kubernetes
has vastly improved CI/CD workflows, allowing developers to do amazing
things they couldn’t do before. GitOps, in turn, offers a centralized
configuration capability that makes Kubernetes easy. With a transparent
and centralized configuration, changes aren’t being applied individually, willy-nilly, but are instead going through the same pipeline. DevOps and Kubernetes are the future. Together, they just make good
business sense.
Cloud-native technologies hold a strong appeal for developers. Their ranks are filled with people who are creating microservices that run on sophisticated underlying infrastructure technologies. Developers are drawn to the utility and packaging capabilities of containers, which improve efficiency and fit with modern application development practices.
In all of this, the container can be defined as a unit. Each unit holds code,
a payload that in more complex operations gets orchestrated across
distributed architectures and infrastructure managed by cloud services.
More developers are now testing how to optimize the resources available
on different types of infrastructure to take advantage of the benefits
offered by containers and virtualization.
FIG 2.1: Today’s modern application architectures bridge with monolithic legacy
systems.
Some of these microservices are stateless, while others are stateful and durable.
Certain parts of the application may run as batch processes. Code
snippets may be deployed as functions that respond to events and alerts.
The scalable layer runs stateless services that expose the API and user
experience. This layer can dynamically expand and shrink depending on
the usage at runtime. During the scale-out operation, where more
instances of the services are run, the underlying infrastructure may also
scale out to match the CPU and memory requirements, typically under the control of an autoscaling policy.
The durable layer has stateful services that are backed by polyglot
persistence. It is polyglot because of the variety of databases that may be
used for persistence. Stateful services rely on traditional relational
databases, NoSQL databases, graph databases and object storage. Each
service chooses an ideal datastore aligned with the structure of stored
data. These stateful services expose high-level APIs that are consumed by services from both the scalable and the durable layers.
Apart from stateless and stateful layers, there are scheduled jobs, batch
jobs and parallel jobs that are classified as the parallelizable layer. For
example, scheduled jobs may run extract, transform, load (ETL) tasks
once per day to extract the metadata from the data stored in object
storage and to populate a collection in the NoSQL database. For services
that need scientific computing to perform machine learning training, the
calculations are run in parallel. These jobs interface with the GPUs
exposed by the underlying infrastructure.
To trigger actions resulting from events and alerts raised by any service in
the platform, cloud-native applications may use a set of code snippets
deployed in the event-driven layer. Unlike other services, the code
running in this layer is not packaged as a container. Instead, functions
written in languages such as Node.js and Python are deployed directly.
This layer hosts stateless functions that are event driven.
Enterprises will embrace microservices for building API layers and user
interface (UI) frontends that will interoperate with existing applications. In
this scenario, microservices augment and extend the functionality of
existing applications. For example, they may have to talk to the relational
database that is powering a line-of-business application, while delivering
an elastic frontend deployed as a microservice.
A controller can create and manage multiple pods within the cluster,
handling replication that provides self-healing capabilities at cluster scope.
For example, if a node fails, the controller might automatically replace the
pod by scheduling an identical replacement on a different node.
If there are too many pods, the ReplicationController may terminate the
extra pods. If there are too few, the ReplicationController proceeds to
launch additional pods. Unlike manually created pods, the pods
maintained by a ReplicationController are automatically replaced if they
fail, are deleted or terminated. The pods are re-created on a node after
disruptive maintenance such as a kernel upgrade. For this reason, it is
recommended to use a ReplicationController even if the application
requires only a single pod.
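A minimal sketch of such a controller is shown below; the image name and replica count are placeholders:

    # Illustrative ReplicationController: Kubernetes keeps three copies of the
    # pod template running, replacing any pod that fails or is deleted.
    apiVersion: v1
    kind: ReplicationController
    metadata:
      name: worker
    spec:
      replicas: 3
      selector:
        app: worker
      template:
        metadata:
          labels:
            app: worker
        spec:
          containers:
          - name: worker
            image: example.com/worker:1.0   # hypothetical image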
StatefulSets are useful for workloads that require one or more of the following: stable and unique network identifiers, stable persistent storage, and ordered, graceful deployment, scaling and rolling updates.
Run to completion jobs are typically used for running processes that need
to perform an operation and exit. A big data workload that runs until the
data is processed is an example of such a job. Another example is a job
that processes each message in a queue until the queue becomes empty.
A Job is a controller that creates one or more pods and ensures that a
specified number of them successfully terminate. As pods successfully
complete, the Job tracks the successful completions. When a specified
number of successful completions is reached, the Job itself is complete.
Deleting a Job will clean up the pods it created.
A Job can also be used to run multiple pods in parallel, which makes it
ideal for machine learning training jobs. Jobs also support parallel
processing of a set of independent but related work items.
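A sketch of such a Job follows; the completion and parallelism counts, image and command are only an example:

    # Illustrative Job: process eight independent work items, four pods at a
    # time, and mark the Job complete once all eight succeed.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: etl-run
    spec:
      completions: 8
      parallelism: 4
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: etl
            image: example.com/etl:1.0          # hypothetical image
            command: ["python", "process.py"]   # hypothetical entry point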
The depiction below (Fig. 2.2) maps various layers to the cloud-native
application stack with Kubernetes primitives.
FIG 2.2: DevOps engineers can define the desired configuration state through a
declarative approach — each workload maps to a controller.
1. Never Deploy “Naked” Pods. Naked pods are pods that are not a
part of a ReplicaSet, ReplicationController or Deployment. Since it is
easy, a common practice for both developers and operations teams is
to package a container as a simple pod and deploy it in Kubernetes.
Naked pods suffer from single points of failure as Kubernetes will not
be able to reschedule them when a node fails. Always package pods
as a ReplicationController or a Deployment.
Conclusion
One of the reasons that Kubernetes has become successful is the flexibility
and control it offers to developers and operators. Developers stay focused
on shipping microservices without worrying about the deployment
environment. DevOps engineers take the software, map the layers to
appropriate primitives and deploy it in Kubernetes. This workflow and
decoupling of software design and deployment is what makes Kubernetes
unique.
In this chapter, The New Stack asked Craig Martin, senior vice president of
engineering at Kenzan, to discuss Spinnaker and how it illustrates the way
CI/CD practices are evolving for cloud-native architectures. Kenzan is a top
contributor to the open source Spinnaker project and has built its own
open source framework for Kubernetes deployments with it. Here, Martin
draws on his experience working with organizations to enact digital
transformations through building large-scale microservices applications
on top of Kubernetes and Spinnaker to explain:
• The benefits and drawbacks of this new tool, which has not yet gained
widespread adoption.
Enter Spinnaker
CD differs from CI. CI is a mechanism to merge and test code changes on
an ongoing basis, often achieved by a tool like Jenkins. CD is the attempt
to speed up and automate deployments, where an operator can push out
multiple deployments in a week across numerous services, and know the
exact condition of the applications and infrastructure in the course of the
deployments. What is truly required for continuous delivery, and what is not provided by CI tools, is a “state” machine. Such a state machine will have
the ability to take an environment from one state to the next until it makes
it all the way to production. The machine will move the environment, such
as Docker containers, through to production in an automated fashion, and
will even have the ability to do things like rollbacks, canary deployments
and scaling instances. This allows for the agile, push-button, automated
deployments that an ideal CD mindset drives towards.
Spinnaker Features
Using Jenkins alone to enable the fine-grained control of pipelines that
can do things like automate testing, rollbacks, visualization and templated
reuse would take quite a bit of custom Groovy code using Jenkins 2.0
pipelines. And after all this work, you would still not have a true “state”
management tool.
Spinnaker Architecture
Understanding the component architecture of Spinnaker is important to
seeing its strengths. A typical Spinnaker installation includes a number of
microservices that all work together to create and manage pipeline-based
deployments.
FIG 3.1: Each component in Spinnaker’s modular architecture has its own
responsibilities.
Gate (the API gateway), Orca (the orchestration engine), Clouddriver, Fiat (the authorization service) and Rosco (which creates machine images) are among the components shown.
• Gate: The API Gateway that is used to perform basic AuthN and routing
into the microservices.
FIG 3.2: A new image in the development repository triggers the Spinnaker develop-
ment (dev) pipeline to deploy the application to a Kubernetes dev cluster.
Development Environment
For example, in a development (dev) environment, you might set up a very
simple Spinnaker deployment which runs often. Triggering off a new image
being pushed to the container image repository, it deploys the application
to the dev Kubernetes cluster, then on post-deploy runs a number of curls
against the application’s REST API using a Newman script.
Staging Environment
In the staging environment, you might set up another pipeline that is a bit
more complex. Triggering off a new image pushed to your staging repository,
you first create some new test data using a database insert. The next few
stages deploy the application, run integration tests utilizing the test data,
perform security penetration tests and finally load tests to simulate peaks.
Production Environment
In the production (prod) environment, instead of triggering off of Jenkins,
we allow operations to start the pipeline manually. Using Spinnaker’s
FIG 3.3: A new image in the staging repository triggers the Spinnaker staging pipe-
line to deploy the application to a staging cluster.
FIG 3.4: The operations team manually triggers the prod pipeline to deploy an appli-
cation to production on Kubernetes.
All of the above pipelines were constructed inside the same Spinnaker UI.
Although scripting pipelines is possible for more complex operations, the
templates in the Spinnaker UI make stringing together pipeline stages
fairly easy, with most tools and options needed for robust deployments. In our experience, pipelines — and the concept of Pipeline Templates in particular — are one of the areas where Spinnaker really shines. Creating custom pipeline templates is an easy way to build reusable pipeline modules that can be shared across services and teams.
And while the above pipelines demonstrate what deployments might look
like across typical development, staging and production environments,
the ultimate focus of CD on fast deployments should cause you to
question having so many environments. Could all of the above stages be
accomplished in fewer environments? Deployment strategies, like canary,
combined with the automation and accuracy Spinnaker provides in its
pipeline stages, could very well provide a path towards having code pass
through fewer environments, making it to production in fewer steps.
Example Implementations
It’s helpful to look at a few actual implementations to see the flexibility
and power of Spinnaker. At Kenzan, we’ve helped implement many of the
following Spinnaker use cases with clients.
The outcome of the approach is that the organization can then spend
more time supporting applications and less time on app tooling.
These are goals that many organizations dream of, but are often incapable
of executing due to the complexity of replicating environments and the
cost of doing so.
You may be thinking, that’s great, but what about the cost of such
environments? You could unknowingly spin up a number of cloud
resources and leave them in place, only to find out later your usage bill has
tripled. To help solve this, a manual decision stage can be put at the end
of the Environment Creator pipeline, creating a manual pause that allows
adequate time for testing to occur, then finishing the pipeline by
destroying the entire environment. While there is no way to escape the full
cost of creating such disposable environments, Spinnaker can help out by
mitigating their time to live.
It is still early days for cloud-native continuous delivery tooling. We have seen other tools begin to evolve to include
cloud-native CD capabilities and more will follow. In addition, any
contender in the space will need to address the need to integrate with, or
incorporate, build tooling to allow for complete application lifecycle
management. Spinnaker is the closest we’ve seen that delivers the scaling
capabilities and features needed in a cloud-native, continuous delivery
tool but much development work is still needed to get it to a turnkey,
enterprise-ready state.
Several alternatives exist, like the aforementioned Jenkins 2.0. Most of the solutions are CI tools
that have been extended with CD capability, with a few others that focus
on CD only.
• Jenkins X is an exciting and new open source CI/CD toolset that was
released in March of 2018. It takes many of the core features from
Jenkins and enhances them to make them cloud native. It is built for
Kubernetes and is designed and optimized for deploying into that
environment. Jenkins X attempts to simplify the process of CI/CD by
automatically producing a number of predefined things: Jenkinsfiles,
Dockerfiles, Helm Charts, Kubernetes Clusters, namespaces and even
environments. It also uses predefined automation to trigger builds and
deploys from Git commits. While Jenkins X is showing a lot of promise
and adding functionality very quickly, it is still maturing as a tool. Most
notably it only runs from the command line, does not yet have a UI for
managing deployments, and is not easily used for managing the code
and infrastructure underlying the Kubernetes clusters (e.g. load
balancers, images and DNS).
The above tools include the most popular and relevant alternatives on the
market; there are a number of other CI/CD solutions that we have not
mentioned. What is important to realize is that tools that began as CI
solutions are now being adapted to do what Spinnaker has been built to
do from the beginning: focus on cloud-based deployments. While some CI
tools will likely provide a base level of capability for deployments, they
may not provide the scalability, extensibility and community support that
Spinnaker has proven itself with. And most importantly, it may take a long
time before these tools match Spinnaker’s capability as a true state
machine: allowing an operator to automate changes, perform canary
deployments, roll them back and scale instances, all while having a high-level view of the state of each environment.
1. Ensure Resilience
It is important to put as much emphasis in ensuring resiliency for your
Spinnaker stack as you would for any application in your infrastructure.
Ensuring uptime of your deployment tooling is as important as any feature
or application. Fortunately, Spinnaker is already built with resiliency in
mind, and it typically only requires configuration to achieve your specific
needs. We suggest the following setup:
• Data Replication: Depending on the data stores that you are using, configure replication so that Spinnaker’s state survives the loss of any single node.
2. Employ Namespacing
We typically put our Spinnaker installation in a separate namespace within
Kubernetes. This permits us the ability to size the Spinnaker resources at
the entire namespace level, and also prevents against Spinnaker resources
taking away from other namespaces or vice versa. Having partitioned
Spinnaker off, we can then closely monitor the needed resources for our
deployment microservices.
From within its own namespace, it should be noted that Spinnaker can
deploy into other specific namespaces. This is a very nice ability that
allows deployments to only target one namespace at a time.
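A minimal sketch of that partitioning appears below; the quota figures are placeholders rather than sizing guidance:

    # Illustrative namespace for the Spinnaker installation, with a quota so
    # its microservices cannot starve, or be starved by, other namespaces.
    apiVersion: v1
    kind: Namespace
    metadata:
      name: spinnaker
    ---
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: spinnaker-quota
      namespace: spinnaker
    spec:
      hard:
        requests.cpu: "8"
        requests.memory: 16Gi
        limits.cpu: "16"
        limits.memory: 32Gi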
3. Monitor Spinnaker
Monitoring Spinnaker is as important as monitoring any application in
your infrastructure. Spinnaker currently supports three major monitoring
systems — Datadog, Prometheus and Stackdriver — but others could be
added relatively easily. We recommend using the same monitoring tools
that you use for the rest of your infrastructure. The next chapter will go into monitoring in far more depth.

A canary deployment lets you roll out the feature and test it directly in production. We typically use some
form of automated canary analysis (ACA) to monitor the health of the
deployment and compare logs files of the newly deployed to the previous
deployment. Fortunately, this is getting even easier to accomplish now
that Google and Netflix have open sourced Kayenta, which integrates with
Spinnaker to monitor the quality of canary deployments. This was
released in April of 2018 and will make achieving continuous delivery —
and then continuous deployment — easier than ever.
1. Mindset Shift
The overall organization needs to support CD. This means that the
business, development and operations need to align their practices to
support continuous delivery. Over the years, we have seen several
common themes emerge that are key to this journey:
• Focus on Releasing Code and Not Building Code: This has been
mentioned a couple of times but it is probably worth emphasizing
again. It is important that everyone sees quality and speed of delivery
as the most important aspect of any application.
• You Build it You Own it: Creating ownership with the team that built
the code is paramount in our experience. Every piece of code —
microservice, pipeline and infrastructure — should have a clear owner,
and we find it best not to create separations between the teams that
manage and the teams that build the code. These types of separations
typically end up creating handoffs that halt or impede the speed and
automation CD pipelines are inherently designed for.
2. Infrastructure as Code
Automating everything is made much easier if your infrastructure is
managed as code and configuration. This will make it much easier to create
consistencies and parity across all of your environments. Accomplishing
this is dependent on your specific implementation, but we typically use
some sort of scripting, such as with Terraform, and also employ dedicated
infrastructure pipelines to automate the deployment. This ensures that
6. Start Small
You won’t achieve full CD in one week. The process will take time to shift
over all projects. When moving towards implementing CD in an
organization, we typically start very small with a single application or
group. This allows you to prove out the ideas and organizational changes
on a smaller scale and then roll them out to a larger group.
Spinnaker Tutorial
We hope this chapter has piqued your interest in Spinnaker as a CD
platform. If you’d like to get started with the tool, Kenzan has an open
source repository that we use to set up a fully functioning CD environment
using Spinnaker (with Jenkins) running within a Google Kubernetes Engine
(GKE) cluster. The tool uses some simple Terraform scripts to set up and configure
your environment, and is a great way to begin examining Spinnaker
hands-on. Check it out at: https://github.com/kenzanlabs/capstan
Monitoring has a different meaning today. In the past, developers built the
applications. Deployments were done by operations teams who managed
the applications in production. The health of services was mainly
determined based on customer feedback and hardware metrics such as
disk, memory or central processing unit (CPU) usage. In a cloud-native
environment, developers are increasingly involved in monitoring and
operational tasks. Monitoring tools have emerged that let developers set up their own markers and fine-tune application-level metrics to suit their interests. This, in turn, allows them to detect potential
performance bottlenecks sooner.
The topic of observability is fairly new, but highly pertinent. Our authors
are software engineers who have studied the new monitoring approaches
that are emerging with cloud-native architectures. Ian Crosby, Maarten
Hoogendoorn, Thijs Schnitger and Etienne Tremel are experts in
application deployment on Kubernetes for Container Solutions, a
consulting organization that provides support for clients who are doing
cloud migrations. These engineers have deep experience with monitoring
using Prometheus, which has become the most popular monitoring tool
for Kubernetes, along with Grafana as a visualization dashboard.
Observability is a measure of how well the internal states of a system can be inferred from its external outputs. For a modern SRE, this means the ability to understand
how a system is behaving by looking at the parameters it exposes through
metrics and logs. It can be seen as a superset of monitoring.
1. Logging.

2. Metrics.

3. Tracing.

4. Alerting.
The shift to distributed, cloud-native systems greatly increases the amount of data and metrics being logged. This increases
demand for storage and processing capacity when analyzing these data
and metrics. Both of these challenges lead to the use of time-series
databases, which are especially equipped to store data that is indexed by
timestamps. The use of these databases decreases processing times and
this leads to quicker results.
These large amounts of data also allow for gaining insights by applying
principles of artificial intelligence and machine learning. These techniques
can lead to increased performance, because they allow the system to
adapt the way it changes in response to the data it’s collecting by learning
from the effect of previous changes. This in turn leads to the rise of
predictive analytics, which uses data of past events to make predictions
for the future, thereby preventing errors and downtime.
Logging
Logging in the simplest sense is about recording discrete events. This is
the first form of monitoring which any new developer gets exposed to,
usually in the form of print statements. In a modern system, each
application or service will log events as they occur, be it to standard out,
syslog or a file. A log aggregation system will then centralize all logs to be
viewed or searched as needed. In our example of a 500 error occurring, this
would be visible by a service, or possibly multiple services, logging an
error which resulted in the 500 status code. This error can be deciphered
through an evaluation of the other three pillars.
Metrics
By contrast, metrics are a combination of data from measuring multiple
events. Cloud-native monitoring tools cater to different types of
measurements by having various metrics such as counters, gauges,
histograms and meters.
• Counter: A cumulative value that only ever increases (or resets to zero on restart), such as the total number of requests served.

• Gauge: A single numerical value that can go up and down, such as current memory usage or queue depth.

• Histogram: Samples observations, such as request durations, and counts them in configurable buckets to approximate a distribution.

• Meter: Measures the rate at which an event occurs. The rate can be measured over different time intervals. The mean rate spans the lifetime of your application, while one-, five- and fifteen-minute rates are generally more useful.
Tracing
Tracing is about recording and ordering connected events. All data
transactions, or events, are tied together by injecting a unique ID into an
initial request, and passing that ID to all further events through the
system. In a distributed system, a single call will end up passing through
multiple services. Tracing provides a complete picture at the application
level. Again, coming back to our example of a 500 error response, we can
see the entire flow of the specific request which resulted in a 500. By
seeing which services the request passed through we gain valuable
context, which will allow us to find the root cause.
Alerting
Alerting uses pattern detection mechanisms to discover anomalies that
may be potentially problematic. Alerts are made by creating events from
data collected through logging, metrics and tracing. Once engineers have
identified an event, or group of events, they can create and modify the
alerts according to how potentially problematic they may be. Returning to
our example: How do we start the process of debugging the 500 error?
Establish thresholds to define what constitutes an alert. In this case, the
threshold may be defined by the number of 500 errors over a certain
period of time. Ten errors in five minutes means an alert for operations
managed by Container Solutions. Alerts are sent to the appropriate team,
marking the start of the debugging and resolution process. Take into
consideration that what constitutes an alert also depends on what the
normal state of the system is intended to be.
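Expressed as a Prometheus alerting rule, the ten-errors-in-five-minutes threshold might look roughly like this; the metric name and labels are assumptions about how the application exposes its counters:

    # Illustrative alerting rule: fire when more than ten HTTP 500 responses
    # are observed within a five-minute window.
    groups:
    - name: http-errors
      rules:
      - alert: HighServerErrorCount
        expr: sum(increase(http_requests_total{status="500"}[5m])) > 10
        labels:
          severity: page
        annotations:
          summary: "More than ten 500 errors in the last five minutes"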
Monitoring Patterns
Of the four pillars, metrics provide the most insight into how an
application performs. Without metrics, it is impossible to tell if an
application behaves the way it should in order to meet service-level
objectives. There are different strategies used to collect and analyze
metrics in order to report the health of cloud-native systems, which is the
foremost concern.
From these metrics you can apply one of the following four methodologies
to determine how performant the system is:
• Golden signals: This method was promoted by the Google SRE team
and relies on four key metrics to determine the state of a system:
latency, throughput, errors and saturation.
In any given system, if the right metrics are collected, engineers — even if
they’re not aware of the entire architecture of the system they use — can still form a reliable picture of its health.
Anomaly Detection
In modern production systems, observability is a core feature which is
needed to detect and troubleshoot any kind of failure. It helps teams
make decisions on actionable items in order to return the system to its
normal state. All the steps taken to resolve a failure should be
meticulously recorded and shared through a post-mortem, which can be
used later on to speed up the resolution time of recurrent incidents.
Analytics can tell a lot about the behavior of a system. Based on historical
data it is possible to predict a potential trend before it becomes a
problem. That’s where machine learning comes into play. Machine
learning is a set of algorithms which progressively improves performance
on a specific task. It is useful to interpret the characteristics of a system
from observed behavior. With enough data, finding patterns that do not
conform to a model of “normal” behavior is an advantage, which can be
used to reduce false positive alerts and help decide on actions that will bring the system back to its normal state.
FIG 4.2: The Holt-Winters method has the potential to deliver accurate predictions
since it incorporates seasonal fluctuations to predict data points in a series over time.
The Holt-Winters method is one of the many methods that can be used to predict data points in a
series over time. Figure 4.2 provides an overview of what the Holt-Winters
method evaluates.
The push model works best for event-driven, time-series datasets. It’s
more accurate, as each event is sent when it’s triggered at the source.
With the push model, it takes some time to tell if a service is unhealthy,
as the instance health is based on the events it receives.
FIG 4.3: Many monitoring solutions expect to be handed data, which is known as the
push model. Others reach out to services and scrape data, which is known as the pull
model.
Each serves a different purpose. A pull model is a good fit for most use
cases, as it enforces convention by using a standard language, but it does
have some limitations. Pulling metrics from internet of things (IoT) devices
or browser events requires a lot of effort. Instead, the push model is a
better fit for this use case, but requires a fixed configuration to tell the
application where to send the data.
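For instance, in the pull model a Prometheus-style configuration simply lists what to scrape and how often; the job name and targets below are placeholders:

    # Illustrative pull-model configuration: the monitoring server reaches out
    # to each target on a fixed interval and scrapes the metrics it exposes.
    scrape_configs:
    - job_name: api              # hypothetical service
      scrape_interval: 15s
      static_configs:
      - targets:
        - 'api-1.example.com:8080'
        - 'api-2.example.com:8080'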
Monitoring at Scale
Observability plays an important role in any large distributed system. With
the rise of containers and microservices, what happens when you start
scraping so many containers that you need to scale out? How can you
make it highly available?
There are two ways to solve this problem of monitoring at scale. The first
is a technical solution to use a federated monitoring infrastructure.
Federation allows a monitoring instance to gather selected metrics from
other monitoring instances. The other option is an organizational
approach to improve monitoring by adopting a DevOps culture and
empowering teams by providing them with their own monitoring tools.
This reorganization could be further split into domains — frontend,
backend, database, etc. — or product. Splitting can help with isolation
and coupling issues that can arise when teams are split by role. By
deciding on roles ahead of time, you can prevent scenarios like, “I’m
going to ignore that frontend alert because I’m working on the backend
at the moment.” A third option, and the best yet, is a hybrid of both approaches.
Federation
A common approach when applications run across multiple data centers or air-gapped clusters is to run a single monitoring
instance for each data center. Having multiple servers requires a “global”
monitoring instance to aggregate all the metrics. This is called
hierarchical federation.
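A sketch of what the global instance’s scrape configuration might contain follows; the data center hostnames and match expression are placeholders:

    # Illustrative hierarchical federation: a global Prometheus scrapes the
    # /federate endpoint of each per-data-center Prometheus for selected series.
    scrape_configs:
    - job_name: federate
      honor_labels: true
      metrics_path: /federate
      params:
        'match[]':
        - '{job=~".+"}'        # which series to pull up to the global level
      static_configs:
      - targets:
        - 'prometheus-dc1.example.com:9090'
        - 'prometheus-dc2.example.com:9090'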
Much later, you might grow to the point where your scrapes are too slow
because the load on the system is too high. When this happens you can
enable sharding. Sharding consists of distributing data across multiple
servers in order to spread the load. This is only required when a
monitoring instance is handling thousands of instances. In general, it is
recommended to avoid this as it adds complication to the monitoring
system.
High Availability
High availability (HA) is a distributed setup which allows for the failure of
one or more services while keeping the service up and running at all
times. Some monitoring systems, like Prometheus, can be made highly
available by running two identical monitoring instances simultaneously. Each instance scrapes the same targets and stores metrics in its own database. If one goes down, the other is still available to scrape.
• Simplicity.
Alternatives to Prometheus
Grafana and Prometheus are the preferred monitoring tools among
Kubernetes users, according to the CNCF’s fall 2017 community survey.
The open source data visualization tool Grafana is used by 64 percent of
organizations that manage containers with Kubernetes, and Prometheus
follows closely behind at 59 percent. The two tools are complementary
and the user data shows that they are most often employed together:
Some 67 percent of Grafana users also use Prometheus, and 75 percent of
Prometheus users also use Grafana.
Grafana 64%; Prometheus 59%; InfluxDB 29%; Datadog 22%; Graphite 17%; Other 14%; Sysdig 12%; OpenTSDB 10%; Stackdriver 8%; Weaveworks 5%; Hawkular 5%.

Source: The New Stack analysis of Cloud Native Computing Foundation survey conducted in fall 2017. Q. What monitoring tools are you currently using? Please select all that apply. English n=489; Mandarin n=187. Only respondents managing containers with Kubernetes were included. © 2018
FIG 4.5: Grafana and Prometheus are the most commonly used monitoring tools,
with InfluxDB coming in third.
• New Relic is focused on the business side and has probably better
features than Nagios. Most features can be replicated with open
source equivalents, but New Relic is a paid product and has more
functionality than Prometheus alone can offer.
• Stores data.
FIG 4.6: Components outside of the Prometheus core provide complementary fea-
tures to scrape, aggregate and visualize data, or generate an alert.
Prometheus Concepts
Prometheus is a service especially well designed for containers, and it
provides perspective about the data intensiveness of this new, cloud-
native age. Even internet-scale companies have had to adapt their
monitoring tools and practices to handle the vast amounts of data
generated and processed by these systems. Running at such scale
creates the need to understand the dimensions of the data, scale the
data, have a query language and make it all manageable to prevent
servers from becoming overloaded and allow for increased observability
and continuous improvement.
Data Model
Prometheus stores all of the data it collects as time series, each of which represents a discrete measurement, or metric, with a timestamp. Each time series is uniquely identified by a metric name and a set of key-value pairs, also known as labels. For example, a series named http_requests_total with the labels {method="POST", status="500"} would count POST requests that returned a 500 error.
Prometheus Optimization
If used intensively, a Prometheus server can quickly become overloaded, depending on the number of rules it has to evaluate and the queries run against it. This happens when running it at scale, when many teams make use of query-heavy dashboards. There are a few ways to reduce the load on the server, however. The first step is to set up recording rules, which precompute frequently needed or expensive expressions and save the results as new time series. Another is to split scraping across several Prometheus instances, where each instance scrapes its own set of applications. Such a setup
can easily be transformed into a hierarchical federation architecture,
where a global Prometheus instance is used to scrape all the other
Prometheus instances and absorb the load of query-heavy dashboards
used by the business, without impacting the performance of the
primary scrapers.
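A recording rule — the first optimization mentioned above — precomputes an expensive expression on a schedule and stores the result as a new series, so dashboards read the stored value instead of recomputing it on every refresh. A minimal sketch, assuming a hypothetical http_requests_total counter:

    # Illustrative recording rule: evaluate the per-job request rate once,
    # centrally, rather than in every dashboard query.
    groups:
    - name: precomputed
      interval: 30s
      rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))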
Installing Prometheus
Installing Prometheus and its components is really simple. Each
component is a binary which can be installed on any popular operating
system, such as Unix and Windows. The most common way to install
Prometheus is to use Docker. The official image can be pulled from Docker
Hub as prom/prometheus. A step-by-step guide to installing Prometheus is
available on the Prometheus website.
Conclusion
Cloud-native systems are composed of small, independent services
intended to maximize resilience through predictable behaviors. Running
containers in a public cloud infrastructure and taking advantage of a
container orchestrator to automate some of the operational routine is just
the first step toward becoming cloud native.
Systems have evolved and bring new challenges that are more complex than those of decades past. Observability — which implies monitoring, logging,
tracing and alerting — plays an important role in overcoming the
challenges that arise with new cloud-native architectures, and shouldn’t
be ignored. Regardless of the monitoring solution you ultimately invest in,
it needs to have the characteristics of a cloud-native monitoring system
which enables observability and scalability, as well as standard
monitoring practices.
With great execution can come great results. But the scope has changed.
There are historical barriers to overcome that inhibit Kubernetes use,
namely the social issues that surface when people from different
backgrounds and company experiences enter an open source project and
work together. The Kubernetes community is maturing, and defining
values has become a priority as they work to strengthen the project’s
core. Still, the downsides to Kubernetes do have to be taken into context when thinking through longer-term business and technical goals. It is imperative to have trust in the Kubernetes project as it matures. There
will be conflicts and stubbornness. And it will all be deep in the project,
affecting testing and the ultimate delivery of updates to the Kubernetes
engine. It’s up to the open source communities to work through how the
committees and the Special Interest Groups align to move the project
forward. It’s a problem that won’t go away. Here, too, the feedback loop
becomes critical between users and the Kubernetes community.
Coming up next for The New Stack is a new approach to the way we
develop ebooks. Look for books on microservices and serverless this year
with corresponding podcasts, in-depth posts, and activities around the
world wherever pancakes are being served.
Alex Williams
Founder, Editor-in-Chief
The New Stack