
Pragmatic Evolution of Cloud-Native Application Infrastructure

This document outlines the agenda for a talk on the pragmatic evolution of cloud-native application infrastructure. The talk discusses interesting trends around tools like Terraform and AWS CloudFormation, defines application infrastructure, and explains how that infrastructure can be provisioned as code. It covers two use cases where the speaker took a pragmatic approach to the challenges of infrastructure evolution, then ties the concepts together and leaves time for Q&A.
Copyright © All Rights Reserved

Pragmatic evolution of cloud-native application infrastructure
Agenda
• Interesting trends

• Application Infrastructure as Code

• More trends

• Use case analysis (2)

• Tying it all together

• Q&A

This is the agenda for today’s talk. I will share a couple of interesting facts from Google Trends, try to define application infrastructure in today’s world, and talk about how we provision it at scale as code. I will then walk through a couple of use cases where we had challenges with infrastructure evolution and how we addressed them, and finally, hopefully, tie it all together in a meaningful way. I am also happy to take any questions at the end of the talk.
Terraform ? AWS CloudFormation (in? through?)

I guess what I really want to talk about is this! Terraform and CloudFormation are often compared against each other, and there is no shortage of blogs and videos about that comparison. So I wanted to look at the global trends for both on Google Trends.
Judging by search interest, Terraform is clearly the market leader in the IaC space. But I wanted to go a bit further and look at the trends per region.
Terraform has no boundaries! This is amazing: people across the globe are interested in Terraform. Before we look at why people love Terraform, let’s understand the complexity of today’s application infrastructure landscape.
Application Infrastructure
• VCS

• CI/CD System

• Cloud Infrastructure

• Logging + Auditing

• Observability

• Analytics

- Oftentimes, when people talk about cloud-native application infrastructure, they mean not just the underlying cloud infrastructure but also the third-party tooling that is critical to the application lifecycle, such as observability tooling, CI/CD systems, etc.

- Traditionally, these application stacks have been managed through point-and-click interfaces, manual operator configuration or ad-hoc scripts; however, these approaches are prone to human error. Is there a better way to manage this?

- What if we could decompose this complex system into modular components, apply best practices like reviews and versioning, and automate the execution?

- What if we could provision all of this as code?


Everything as Code with Terraform
• Open source

• Declarative

• Reproducible

• Allows Collaboration

• “Plan” & Predict before “Apply”

• Strong Community

• 1000+ Modules

• 200+ providers

• 25000 commits

We can do all of that good stuff with Terraform.

It has a simple, JSON-like declarative language, HCL (with expressions and loops for more power).

It makes builds reproducible by storing state remotely.

It allows collaboration through state locking in remote backends.

It lets us plan changes by surfacing the diff, giving visibility into what’s going to happen.

It has a great community backing it.

This is why we picked Terraform to build the building blocks of our cloud-native platform, and it was a great choice.
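To make the “plan before apply” idea concrete, here is a toy Python model of what a plan does: it compares the desired configuration with the recorded state and lists the create, update and delete actions before anything is touched. This is a deliberately simplified sketch, not how Terraform is implemented; real plans also refresh state, resolve dependencies and distinguish in-place updates from replacements, and the resource addresses and attributes used here are made up.

```python
from typing import Any, Dict, List, Tuple

# Toy model: resource address -> flat dict of attributes (hypothetical shapes).
State = Dict[str, Dict[str, Any]]
Action = Tuple[str, str]  # (action, address)


def plan(desired: State, current: State) -> List[Action]:
    """Compute the diff between config and recorded state without changing anything."""
    actions: List[Action] = []
    for address in sorted(desired.keys() | current.keys()):
        if address not in current:
            actions.append(("create", address))
        elif address not in desired:
            actions.append(("delete", address))
        elif desired[address] != current[address]:
            actions.append(("update", address))
    return actions


def apply(desired: State, current: State, actions: List[Action]) -> State:
    """Execute a previously reviewed plan and return the new recorded state."""
    new_state = {address: dict(attrs) for address, attrs in current.items()}
    for action, address in actions:
        if action == "delete":
            del new_state[address]
        else:  # "create" or "update"
            new_state[address] = dict(desired[address])
    return new_state
```

Running `plan` again after `apply` returns an empty list in this model, which corresponds to the “no changes” result reviewers look for after a successful apply.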
But how does Terraform compare against CloudFormation? OK, there is some interest in this comparison. But hey, I am from India, and interestingly Google Trends doesn’t have data from India. So let me tweak the search query slightly and see if we can get some data points for India.
Interestingly, when I searched for “terraform vs cloudFormation” instead of “CloudFormation vs terraform”, I could see some data points for India. But I live in Australia and I don’t see any data for Australia, so I tweaked the query again.
There we go! While this analysis is not rigorous, the underlying idea is sound: people are interested in putting Terraform and CloudFormation together, and that was interesting to me. When I first saw this I thought, “Why would I want to do that?” But then I remembered a blog post from HashiCorp about application delivery.
- principled yet pragmatic

While I loved everything mentioned in this post about the application delivery lifecycle, one thing that really resonated with me was “pragmatism”. That day I realised the value of staying principled but pragmatic, and that pragmatism is the underlying idea of what I want to talk about today. To be precise, I want to talk about a couple of scenarios where pragmatism helped me design and build resilient systems by using Terraform and CloudFormation together.
CloudFormation in Terraform

First use case: CloudFormation in Terraform.


Buildkite
• BYO Infrastructure

• Open source agent

• Artifact + Metadata support

• Scalable

• Dynamic Pipelines

• StatsD Metrics

• Hooks for customisation

• API Support

Before I talk about the specifics of Terraform and CloudFormation, I’ll try to provide some context around the problem we were trying to solve. The CI/CD system was a core and very important part of our cloud-native platform. We evaluated a bunch of providers and settled on Buildkite, as it fit our operational and risk model.

Buildkite’s architecture comprises a control plane, which they host, and a data plane of machines in our VPCs running their agents.
Elastic CI stack for AWS

To run these agents as a cluster and scale them, Buildkite provides an open source solution that creates an Auto Scaling group (ASG) and runs the agents on the instances in that ASG.
The challenge is that they only provide this solution as a CloudFormation stack.

Pros:

1. A single click / CloudFormation API call builds the entire cluster of build agents

Cons:

1. Not suitable for scale

2. 150+ AWS Accounts

3. Each Account has different CI Stacks for PLP and Isolation

4. No support for adding third-party tooling around it, Datadog for example.
resource + data source combination to the rescue

So instead of building our own provider for this or writing ad-hoc scripts, we decided to use the provided CloudFormation stack, provision it with Terraform’s CloudFormation stack resource, and use the matching data source to fetch the stack’s outputs into Terraform state.
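A minimal sketch of that resource + data source combination, written in Python that emits Terraform’s JSON configuration syntax (for a `main.tf.json` file). `aws_cloudformation_stack` exists in the Terraform AWS provider as both a resource and a data source, and the data source exposes the stack’s `outputs` map. The template URL, the stack parameter names (`BuildkiteAgentToken`, `MinSize`, `MaxSize`) and the capabilities list are assumptions for illustration; check them against the Elastic CI Stack’s own documentation.

```python
import json
from typing import Dict


def elastic_ci_stack_config(stack_name: str, template_url: str,
                            extra_parameters: Dict[str, str]) -> dict:
    """Terraform JSON config: provision a CloudFormation stack, then read its outputs."""
    parameters = {
        # Assumed parameter name; the agent token comes from a sensitive variable.
        "BuildkiteAgentToken": "${var.buildkite_agent_token}",
        **extra_parameters,
    }
    return {
        "variable": {
            "buildkite_agent_token": {"type": "string", "sensitive": True},
        },
        "resource": {
            "aws_cloudformation_stack": {
                "buildkite": {
                    "name": stack_name,
                    "template_url": template_url,
                    "parameters": parameters,
                    # Assumed: the official stack creates IAM resources and uses transforms.
                    "capabilities": [
                        "CAPABILITY_IAM",
                        "CAPABILITY_NAMED_IAM",
                        "CAPABILITY_AUTO_EXPAND",
                    ],
                }
            }
        },
        "data": {
            "aws_cloudformation_stack": {
                "buildkite": {
                    # Referencing the resource creates an implicit dependency,
                    # so the outputs are read only after the stack exists.
                    "name": "${aws_cloudformation_stack.buildkite.name}",
                }
            }
        },
        "output": {
            "buildkite_stack_outputs": {
                "value": "${data.aws_cloudformation_stack.buildkite.outputs}",
            }
        },
    }


if __name__ == "__main__":
    config = elastic_ci_stack_config(
        stack_name="buildkite-ci",
        # Placeholder URL: substitute the real template location.
        template_url="https://example.com/elastic-ci-stack/aws-stack.yml",
        extra_parameters={"MinSize": "0", "MaxSize": "10"},
    )
    print(json.dumps(config, indent=2))
```

Once the outputs are in Terraform, other tooling (a Datadog monitor, for instance) can consume them like any other Terraform value.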

Why is this better?

• Can consume official CloudFormation template.

• Safe Upgrade workflow

• Easier to compose with other tooling, e.g. Datadog

• Compliance & Governance tooling compatible

• Increased Flexibility & reduced dependency

• And all other good stuff from terraform community.

Why?

1. Easier to consume Buildkite’s official CloudFormation template.

2. Safer upgrade path: we tried creating a custom provider before and had critical issues during upgrades.

3. Easier to compose with other available tooling, like Datadog.

We could have been rigid with our choices, stuck to Terraform alone, and built and maintained our own provider. By being pragmatic and using the flexibility and power of Terraform’s data sources instead, we ended up with a much simpler system.
Terraform through CloudFormation

Next, I want to talk about a scenario where we used a CloudFormation resource as a proxy for Terraform.

But before I talk about the specifics, I want to provide some context and discuss some concepts that will help in understanding the problem we were trying to solve and the trade-offs we made along the way.
Terraform Modules
• A module is a container for multiple resources that are used together.

• Every Terraform configuration has at least one module, called the “root module”.

• A module can call other modules.

• Modules can be called multiple times.

Modules are my favourite feature in Terraform. They are an extremely powerful building block with which we can compose multiple resources into a single working system and save it as a whole for later reuse.
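As a small illustration of “modules can be called multiple times”, this Python sketch emits a Terraform JSON root module that calls the same child module once per environment. The module path `./modules/internal-service` and its inputs `name` and `instance_count` are hypothetical.

```python
import json
from typing import Dict


def root_module(instances_per_env: Dict[str, int]) -> dict:
    """Root module (Terraform JSON syntax) calling one child module per environment."""
    return {
        "module": {
            # Each key is a separate module call; all share the same source.
            f"internal_service_{env}": {
                "source": "./modules/internal-service",  # hypothetical path
                "name": f"orders-{env}",                  # hypothetical input
                "instance_count": count,                  # hypothetical input
            }
            for env, count in sorted(instances_per_env.items())
        }
    }


if __name__ == "__main__":
    print(json.dumps(root_module({"dev": 1, "prod": 3}), indent=2))
```

The resources are described once inside the child module; the root module only decides how many times, and with which inputs, that module is instantiated.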

“A good module should raise the level of abstraction by describing a new concept in your architecture that is constructed from resource types offered by providers”

–Terraform Docs

The Terraform docs mention that a good module…


Application Infrastructure Units

• Another layer of abstraction built on top of modules.

• Application centric approach

• Each AIU corresponds to an application workflow

• Examples: Internal Service, Data Lake, Static Website etc.

• Simple git workflows for developers.

We wanted to build and model our infrastructure systems on those guidelines, extending them to an application-level abstraction. The application is central to this approach. Examples include the infrastructure for an internal service, a static website with a CDN + S3, etc.
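One way to picture an AIU is as a named bundle of module calls with a small, validated set of inputs, so a developer only has to supply a few values in a git change. The Python sketch below assumes a hypothetical catalogue (the module paths, AIU names and input names are all made up) and renders the chosen AIU into a Terraform JSON `module` block.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Tuple


@dataclass(frozen=True)
class ModuleCall:
    source: str                       # path of a reusable Terraform module
    required_inputs: Tuple[str, ...]  # inputs the developer must supply


# Hypothetical catalogue: each AIU maps to the modules it is built from.
CATALOGUE: Dict[str, Dict[str, ModuleCall]] = {
    "static_website": {
        "bucket": ModuleCall("./modules/site-bucket", ("domain",)),
        "cdn": ModuleCall("./modules/cdn", ("domain",)),
    },
    "internal_service": {
        "service": ModuleCall("./modules/container-service", ("image", "port")),
        "load_balancer": ModuleCall("./modules/internal-alb", ("port",)),
    },
}


def render_aiu(aiu_type: str, app: str, inputs: Mapping[str, str]) -> dict:
    """Expand one AIU request into a Terraform JSON `module` block."""
    if aiu_type not in CATALOGUE:
        raise ValueError(f"unknown AIU type: {aiu_type}")
    modules = {}
    for label, call in CATALOGUE[aiu_type].items():
        missing = [name for name in call.required_inputs if name not in inputs]
        if missing:
            raise ValueError(f"{aiu_type}.{label} is missing inputs: {missing}")
        modules[f"{app}_{label}"] = {
            "source": call.source,
            **{name: inputs[name] for name in call.required_inputs},
        }
    return {"module": modules}
```

For example, `render_aiu("static_website", "docs", {"domain": "docs.example.com"})` yields two module calls, `docs_bucket` and `docs_cdn`, both driven by the single `domain` input.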
Open source example

And this is not new in the terraform ecosystem. There are plenty of open source examples available.
Challenges
• Security / Audit teams unhappy with git as a distribution system.

• No direct visibility into product usage

• How do we distribute these AIUs across all (150+) AWS accounts and all (4) AWS regions?

• How do we standardise these products?

• How do we enforce consistency and compliance?

• How do we limit access?

- Lines of business require multiple AWS accounts, and acquisitions took us to multiple regions.

- Security controls were not implemented consistently across these products, and misconfigurations happened.

- Distribution and RBAC were major challenges.

- Licensing and usage were another important concern.


Desired features
• Developer Autonomy

• Track metrics & licence contracts

• Cost management (e.g. automatic resource termination in AWS)

• RBAC

• one-stop shop

• Versioning & deployment automation

• Agile governance & Integration with existing tooling.

So we wanted a system that ticks most, if not all, of these boxes.


AWS Service Catalog

After evaluating a couple of different products, we decided to use AWS Service Catalog as our platform product catalog.
AWS Service Catalog - Concepts

• Distribution units are called “Products”.

• Products can be grouped together into “Portfolios”

• RBAC support at both Product and Portfolio level

• Portfolios can either be

• Imported into the account

• Shared by the master account to all spoke accounts

• “ProvisioningArtifact”: a versioned artefact of the product

A few concepts are important in AWS Service Catalog:

- Products are the units of distribution, each with versioned provisioning artefacts; they are similar to resources in Terraform.

- Portfolios are sets of products; they are similar to modules in Terraform.
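A toy Python model of these concepts can help when reasoning about access: a portfolio groups products and is shared with accounts, each product carries ordered provisioning-artifact versions, and access is checked at both levels. This is an illustrative data model under our own assumptions, not the Service Catalog API; in the real service, access is granted by associating IAM principals with portfolios, and all names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Product:
    name: str
    # Versions in release order, mirroring provisioning artifacts.
    provisioning_artifacts: List[str] = field(default_factory=list)
    # Optional product-level allow-list (empty means "inherit the portfolio's").
    allowed_principals: Set[str] = field(default_factory=set)

    def latest(self) -> str:
        if not self.provisioning_artifacts:
            raise ValueError(f"{self.name} has no versions")
        return self.provisioning_artifacts[-1]


@dataclass
class Portfolio:
    name: str
    products: Dict[str, Product] = field(default_factory=dict)
    allowed_principals: Set[str] = field(default_factory=set)
    shared_with_accounts: Set[str] = field(default_factory=set)


def can_launch(portfolio: Portfolio, product_name: str,
               principal: str, account_id: str) -> bool:
    """Toy RBAC check at both portfolio and product level."""
    product = portfolio.products.get(product_name)
    if product is None or account_id not in portfolio.shared_with_accounts:
        return False
    if principal not in portfolio.allowed_principals:
        return False
    return not product.allowed_principals or principal in product.allowed_principals
```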
Challenges
• Terraform AWS Provider supports only the creation of Portfolios.

• These portfolios are not useful as there is no IAM role integration.

• Hub Account cannot place products in these portfolios.

• Spoke Accounts cannot consume products.

• Issue / PR pending for almost 18 months

• So custom provider?

The first challenge in converting the application infrastructure units sitting in git into Service Catalog products was the lack of support for Service Catalog in the Terraform AWS provider. Maintaining a custom fork of the AWS provider had been challenging, but it is something we had done in the past, and thanks to Terraform’s extensibility it is not an impossible task. However, we did not go down that road, as we discovered another big challenge that a custom provider would not solve.
More challenges

• Info: The URL of the CloudFormation template in S3, in JSON format

Service Catalog can only provision from a CloudFormation template that is stored in S3 in JSON format. So there we were:

1. Heavily invested in terraform and quite happy with it on one hand

2. Executive-level guidance to use Service Catalog, which only likes CloudFormation, as our infra product catalog on the other.

It seemed like an impossible goal at that point: how could we execute Terraform configuration files through Service Catalog when it only likes CloudFormation?

Solution

• CloudFormation Proxy for Terraform

• Service Catalog Factory - with terraform

• Service Catalog Puppet - with terraform

But pragmatism, and the workflow-oriented solution design that we borrowed from Terraform, helped us find a sensible solution.

- We built a CloudFormation custom resource, which is just a Lambda function triggered by the Service Catalog product, that proxies these requests to a Terraform server.

- A factory service, provisioned with Terraform, creates these Service Catalog products and portfolios on demand.

- A puppet service, provisioned with our Terraform infrastructure units, copies these Service Catalog products to multiple accounts and regions.
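Because Service Catalog only accepts a CloudFormation template, each Terraform-backed product can be wrapped in a template containing a single custom resource. In CloudFormation, a resource type of the form `Custom::<Name>` with a `ServiceToken` property is delivered to the function that token points to. The Python sketch below builds such a template as JSON; the resource name `TerraformProxy`, the `Product`/`AppName` properties, the `RunId` attribute and the Lambda ARN are hypothetical.

```python
import json


def proxy_template(product: str, proxy_lambda_arn: str) -> dict:
    """CloudFormation template whose only resource hands the request to Terraform."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Service Catalog wrapper for the Terraform product {product}",
        "Parameters": {
            # Hypothetical input surfaced to the person launching the product.
            "AppName": {"Type": "String", "Description": "Application name"},
        },
        "Resources": {
            "TerraformProxy": {
                # Custom::<Name> resources are sent to the ServiceToken target.
                "Type": "Custom::TerraformProxy",
                "Properties": {
                    "ServiceToken": proxy_lambda_arn,
                    "Product": product,
                    "AppName": {"Ref": "AppName"},
                },
            }
        },
        "Outputs": {
            # Values come from the Data map the fulfilment server sends back.
            "RunId": {"Value": {"Fn::GetAtt": ["TerraformProxy", "RunId"]}},
        },
    }


def to_json(template: dict) -> str:
    """Serialise the template as JSON, ready to upload to S3."""
    return json.dumps(template, indent=2, sort_keys=True)
```

A factory service along these lines would upload the JSON to S3 and register it as a product version; that upload step is left out of the sketch.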
Terraform through CloudFormation

The architecture looks something like this.

1. Two AWS account types: a hub (fulfilment) account that hosts the Terraform servers, and spoke accounts that contain the Service Catalog products.

2. The Terraform server is just a thin HTTP wrapper around Terraform.

3. When a user provisions a Service Catalog product in a spoke account, the product’s CloudFormation stack invokes the Lambda, which places the request on a queue to be picked up by the Terraform fulfilment server.

4. The TF server parses the payload, identifies the Terraform configuration suitable for that payload, and creates the resources in the spoke account by assuming an IAM role in that account.

5. TF state and configuration are stored in the hub account. This reduces the attack surface and makes the system easier to monitor, secure and upgrade.
With that architecture we were able to easily expand it to multiple AWS accounts.
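Steps 3 and 4 can be sketched in Python under clearly stated assumptions. The event fields (`RequestType`, `ResponseURL`, `StackId`, `RequestId`, `LogicalResourceId`, `ResourceProperties`) and the response fields follow the standard CloudFormation custom resource contract; everything else is hypothetical: the queue is injected as a plain callable so the sketch runs without AWS (in Lambda it might wrap an SQS client), the role name is made up, and the Terraform run itself is omitted. The spoke account and region are read from the stack ARN, and the provider block uses the AWS provider’s `assume_role` setting.

```python
import json
from typing import Any, Callable, Dict, Optional, Tuple

# Injected queue sender (hypothetical); in Lambda it might wrap an SQS client.
SendMessage = Callable[[str], None]


def make_proxy_handler(send_message: SendMessage) -> Callable[[Dict[str, Any], Any], None]:
    """Lambda-style handler: forward the custom resource request to the queue."""
    def handler(event: Dict[str, Any], context: Any = None) -> None:
        message = {
            "request_type": event["RequestType"],  # Create, Update or Delete
            "response_url": event["ResponseURL"],  # pre-signed URL for the reply
            "stack_id": event["StackId"],
            "request_id": event["RequestId"],
            "logical_resource_id": event["LogicalResourceId"],
            "physical_resource_id": event.get("PhysicalResourceId"),  # absent on Create
            "properties": event.get("ResourceProperties", {}),
        }
        send_message(json.dumps(message))
        # Deliberately no reply here: the fulfilment server answers later.
    return handler


def spoke_target(stack_id: str) -> Tuple[str, str]:
    """(account_id, region) from arn:aws:cloudformation:<region>:<account>:stack/..."""
    parts = stack_id.split(":")
    return parts[4], parts[3]


def provider_config(account_id: str, region: str, role_name: str) -> dict:
    """Terraform JSON provider block that assumes a role in the spoke account."""
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    return {"provider": {"aws": {"region": region, "assume_role": {"role_arn": role_arn}}}}


def cfn_response(message: Dict[str, Any], status: str, physical_id: str,
                 data: Optional[Dict[str, str]] = None, reason: str = "") -> Dict[str, Any]:
    """Body the fulfilment server PUTs to message["response_url"] when Terraform finishes."""
    if status not in ("SUCCESS", "FAILED"):
        raise ValueError("status must be SUCCESS or FAILED")
    return {
        "Status": status,
        "Reason": reason,
        "PhysicalResourceId": physical_id,
        "StackId": message["stack_id"],
        "RequestId": message["request_id"],
        "LogicalResourceId": message["logical_resource_id"],
        "Data": data or {},
    }
```

Because the Lambda returns without answering CloudFormation, the stack keeps waiting until the fulfilment server sends the response body to the pre-signed `ResponseURL`, which is why the Terraform run has to finish within the CloudFormation timeout.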
Limitations

• CloudFormation Plan does not show resources created by Terraform

• Terraform execution time <= CloudFormation max stack timeout

• Interesting issues will arise without state locking, as the execution model is asynchronous.
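The locking limitation can be illustrated with a small in-process sketch: one lock per Terraform state key, acquired with a timeout so that a busy state is reported as a failure instead of silently outliving the CloudFormation wait. This is only a single-server illustration with a hypothetical state-key format; across several fulfilment servers you would rely on the state backend’s own locking (for example, the S3 backend with DynamoDB locking) rather than on Python locks.

```python
import threading
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class StateLocks:
    """One lock per Terraform state key, e.g. "<account>/<region>/<product>" (hypothetical)."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks: Dict[str, threading.Lock] = {}

    def lock_for(self, state_key: str) -> threading.Lock:
        # The guard makes lock creation itself thread-safe.
        with self._guard:
            return self._locks.setdefault(state_key, threading.Lock())

    def run_exclusive(self, state_key: str, run: Callable[[], T], timeout_s: float) -> T:
        """Run `run` while holding the state's lock; give up after timeout_s seconds."""
        lock = self.lock_for(state_key)
        if not lock.acquire(timeout=timeout_s):
            # Surfacing this lets the caller report FAILED instead of leaving the stack waiting.
            raise TimeoutError(f"state {state_key!r} is locked by another run")
        try:
            return run()
        finally:
            lock.release()
```

Two queued requests for the same provisioned product then run one after the other, while requests for different products proceed in parallel.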
Additional details

• Lambda proxy only covers consumption

• SC Factory service takes care of creation of Service Catalog portfolios and products

• YAML -> Factory -> SC Product

• SC Puppet service takes care of distribution

• YAML -> Puppet -> Copied SC Products in all accounts

• Distribution is based on an inventory service
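The Puppet side of “YAML -> Puppet -> Copied SC Products in all accounts” boils down to expanding product definitions against the inventory. The Python sketch below assumes a hypothetical product schema (`name`, `version`, optional `accounts`) and an inventory mapping account IDs to enabled regions; plain dicts stand in for the parsed YAML so the example needs only the standard library.

```python
from typing import Dict, List, Tuple

Assignment = Tuple[str, str, str, str]  # (account_id, region, product, version)


def distribution_plan(products: List[dict], inventory: Dict[str, List[str]]) -> List[Assignment]:
    """Expand product definitions into one copy per target account and region.

    `products` stands in for parsed YAML; a product without an "accounts" key
    goes to every account in the inventory (a hypothetical convention).
    """
    plan: List[Assignment] = []
    for product in products:
        targets = product.get("accounts") or list(inventory)
        for account_id in sorted(targets):
            if account_id not in inventory:
                raise KeyError(f"account {account_id} is not in the inventory")
            for region in sorted(inventory[account_id]):
                plan.append((account_id, region, product["name"], product["version"]))
    return plan
```

Keeping the plan as plain data makes it easy to review or diff before the Puppet service actually copies anything.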


The Tao of Terraform
Terraform has empowered us to

- Think in terms of workflows rather than limiting ourselves to a specific technology

- Build simple, modular and composable infra modules, which we extended to products

- Have immutable infrastructure across 150+ AWS Accounts with minimal complexity

- Power our cloud-native platform through built-in support for versioning and automation

- Make our systems resilient by persisting their state and facilitating reconciliation

- Stay pragmatic, since it provides mechanisms to extend it and to exclude it when it makes sense

So if we think about the two scenarios we just discussed and try to extract the underlying principles that guided the solution design for our cloud-native platform, we can see that Terraform has empowered us to

- Think in terms of workflows rather than limiting ourselves to a specific technology

- Build simple, modular and composable infra modules, which we extended to products

- Have immutable infrastructure across 150+ AWS Accounts with minimal complexity

- Power our cloud-native platform through built-in support for versioning and automation

- Make our systems resilient by persisting their state and facilitating reconciliation

- Stay pragmatic, since it provides mechanisms to extend it and to exclude it when it makes sense
And when a tool is built on such sensible values, values that make sense in any infrastructure platform evolution context, it is no wonder that it is loved across the globe, and the Google Trends data for Terraform reflects that.
Thank you
