Pragmatic Evolution of Cloud-Native Application Infrastructure
Agenda
• Interesting trends
• More trends
• Q&A
This is the agenda for today’s talk. I will share a couple of interesting facts from Google Trends, try to define application infrastructure in today’s world, and talk about
how we provision it at scale as code. I will talk through a couple of use cases where we had challenges with infrastructure evolution and how we addressed them, and
finally, hopefully, tie it all together in a meaningful way. I am also happy to take any questions at the end of the talk.
I guess what I really wanna talk about is this! Terraform and CloudFormation are often compared against each other, and there is no shortage of blogs and videos about
this comparison. So I wanted to analyse the global trends of their usage on Google Trends.
Looking at the trends, it is clear to me that Terraform is the market leader in the IaC space. But I wanted to go a bit further and see the trends per region.
Terraform has no boundaries! This is amazing, as we can see that people across the globe are interested in Terraform. Before we understand why people love Terraform,
let’s understand the complexity of today’s application infrastructure landscape.
Application Infrastructure
• VCS
• CI/CD System
• Cloud Infrastructure
• Logging + Auditing
• Observability
• Analytics
- Often, when people talk about cloud native application infrastructure, they mean not just the underlying cloud infrastructure but also the
third-party tooling that is so critical to the application lifecycle, such as observability tooling and CI/CD systems.
- Traditionally, the approaches for managing these application stacks have relied on point-and-click interfaces, manual operator configuration, or ad hoc scripts. However,
these approaches are prone to human error. Is there a better way of managing this?
- What if we could decompose this complex system into modular components, apply best practices like reviews and versioning, and automate the execution?
• Declarative
• Reproducible
• Allows Collaboration
• Strong Community
• 1000+ Modules
• 200+ providers
• 25000 commits
It has a simple, JSON-like declarative language (with expressions and loops for more power).
It allows us to plan changes by surfacing the diff, which gives visibility into what is going to happen before it happens.
This is why we picked Terraform to build the building blocks of our cloud native platform, and it was a great choice.
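To make the idea of "planning by surfacing the diff" concrete, here is a minimal Python sketch. It is only a conceptual illustration, not Terraform's actual algorithm: it compares a recorded state with a desired configuration and lists the actions needed to reconcile them. The resource addresses and attributes are made up for the example.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, object]


def plan(current: Dict[str, Resource],
         desired: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return (action, address) pairs that would move `current` to `desired`."""
    actions: List[Tuple[str, str]] = []
    # Anything declared but not yet recorded must be created;
    # anything recorded with different attributes must be updated.
    for address in sorted(desired):
        if address not in current:
            actions.append(("create", address))
        elif current[address] != desired[address]:
            actions.append(("update", address))
    # Anything recorded but no longer declared must be deleted.
    for address in sorted(current):
        if address not in desired:
            actions.append(("delete", address))
    return actions


# Illustrative data only: these addresses and attributes are assumptions.
current_state = {
    "aws_s3_bucket.site": {"bucket": "example-site", "versioning": False},
    "aws_sqs_queue.old": {"name": "legacy"},
}
desired_config = {
    "aws_s3_bucket.site": {"bucket": "example-site", "versioning": True},
    "aws_cloudfront_distribution.cdn": {"enabled": True},
}

changes = plan(current_state, desired_config)
for action, address in changes:
    print(f"{action:7} {address}")
```

The point of the sketch is that nothing is changed while planning: the diff is computed first and can be reviewed, and only then applied.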
But how does Terraform compare against CloudFormation? OK, there is some interest among people in this comparison. But hey, I am from India, and it’s interesting that
Google Trends doesn’t have data from India. So let me tweak the search query slightly and see if we can get some data points for India.
Interesting: when I searched for “terraform vs cloudFormation” instead of “CloudFormation vs terraform”, I could see some data points for India. But I live in Australia and I
don’t see any data for that, so I tweaked the query again.
There we go! While this analysis is not really useful, the underlying idea is sound. People are interested in putting Terraform and CloudFormation together, and that was
interesting to me. When I first saw this I was like “Why would I want to do that?” But then I remembered a blog from HashiCorp about application delivery.
- principled yet pragmatic
While I loved everything mentioned in this blog about the application delivery lifecycle, one thing that really resonated with me was “pragmatism”. I realised the value of
staying principled but pragmatic that day, and that pragmatism is the underlying idea of what I wanna talk about today. To be precise, I want to talk about a couple of
scenarios where pragmatism has helped me design and build resilient systems by using Terraform and CloudFormation together.
CloudFormation in Terraform
• Scalable
• Dynamic Pipelines
• StatsD Metrics
• API Support
Before I talk about the specifics of Terraform and CloudFormation, I’ll try to provide some context around the problem we were trying to solve. The CI/CD system was a core
and very important part of our cloud native platform. We evaluated a bunch of providers and settled on Buildkite, as it fits our operational and risk model.
Buildkite’s architecture comprises a control plane, which they host, and a data plane, which consists of machines in our VPCs running their agents.
Elastic CI stack for AWS
To run these agents as a cluster and to scale them, they provide an open source solution that basically creates an ASG and runs the agents on the machines in this ASG.
The challenge is that they only provide this solution as a CloudFormation stack.
Pros:
1. A single click or CloudFormation API call builds the entire cluster of build agents
Cons:
4. Doesn’t support adding third-party tooling around it, such as Datadog.
resource + data source combination to the rescue
So instead of building our own provider for this or creating ad hoc scripts, we decided to use the provided CloudFormation stack: we use Terraform’s CloudFormation
stack resource to provision the stack, and the matching data source to fetch the stack’s outputs into Terraform state.
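A rough sketch of that resource plus data source combination follows, written as a Python script that emits Terraform's JSON configuration syntax. It uses the AWS provider's `aws_cloudformation_stack` resource and data source; the stack name, template URL, parameter names, and the choice of capability are placeholders and assumptions, not the values we actually used.

```python
import json
from typing import Dict


def buildkite_stack_config(stack_name: str, template_url: str,
                           parameters: Dict[str, str]) -> dict:
    """Build a Terraform JSON config that provisions a CloudFormation stack
    and reads its outputs back through the matching data source."""
    return {
        "resource": {
            "aws_cloudformation_stack": {
                "buildkite": {
                    "name": stack_name,
                    "template_url": template_url,
                    "parameters": parameters,
                    # Assumption: the template creates named IAM resources.
                    "capabilities": ["CAPABILITY_NAMED_IAM"],
                }
            }
        },
        "data": {
            "aws_cloudformation_stack": {
                "buildkite": {
                    # Referencing the resource makes Terraform create the
                    # stack before it reads the outputs.
                    "name": "${aws_cloudformation_stack.buildkite.name}",
                }
            }
        },
        "output": {
            "buildkite_stack_outputs": {
                "value": "${data.aws_cloudformation_stack.buildkite.outputs}",
            }
        },
    }


# Placeholder values for illustration only.
config = buildkite_stack_config(
    stack_name="buildkite-agents",
    template_url="https://example.com/elastic-ci-stack.yml",
    parameters={"ExampleParameter": "example-value"},
)
rendered = json.dumps(config, indent=2)
print(rendered)  # could be saved as main.tf.json
```

Once the outputs are available in Terraform, other resources (for example monitoring tooling) can be composed around the stack in the same configuration.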
Why?
2. Safer upgrade path: we had tried creating a custom provider before and had critical issues during upgrades.
3. It became easier to compose this with other available tooling, like Datadog.
We could have been rigid with our choices, stuck to Terraform, and built and maintained our own provider. But by being pragmatic about it and using the flexibility and
power of Terraform’s data sources, we ended up building a much simpler system.
Terraform through CloudFormation
Next I want to talk about a scenario where we used a CloudFormation resource as a proxy for Terraform.
But before I talk about the specifics, I want to provide some context and discuss some concepts that will help in understanding the problem we were trying to solve
and the tradeoffs we made along the way.
Terraform Modules
• A module is a container for multiple resources that are
used together.
Modules are my favourite feature in Terraform. They are an extremely powerful building block with which we can compose multiple resources into a single working system and
save it as a whole for later reuse.
–Terraform Docs
We want to build and model our infrastructure systems based on those guidelines, extending them to an application-level abstraction. The application is central to this
approach. Examples include the infrastructure for an internal service, a static website with a CDN + S3, etc.
Open source example
And this is not new in the terraform ecosystem. There are plenty of open source examples available.
Challenges
• Security / Audit teams unhappy with git as a distribution
system.
- Lines of business require multiple AWS accounts, and acquisitions took us to multiple regions
- Security controls were not implemented consistently across these products, and misconfigurations happened
• RBAC
• one-stop shop
After evaluating a couple of different products, we decided to use AWS Service Catalog as our platform product catalog.
AWS Service Catalog - Concepts
A few concepts that are important in AWS Service Catalog:
- Portfolios are sets of products, and the products are similar to modules in Terraform
Challenges
• Terraform AWS Provider supports
creation of Portfolios alone.
• So custom provider?
The first challenge we had in converting the application infrastructure units sitting in git to Service Catalog products was the lack of support for Service
Catalog in the Terraform AWS provider. While we had challenges with maintaining a custom fork of the AWS provider, that is something we had done in the past, and thanks to
Terraform’s extensibility it is not an impossible task. However, we did not go down that road, as we discovered another big challenge which would not be solved by writing a
custom provider.
More challenges
Service Catalog can only provision through a CloudFormation template that is stored in S3 in JSON format. So here we are
2. An executive-level guidance to use Service Catalog as our infra product catalog, which only likes CloudFormation, on the other.
It seemed like an impossible goal at that point. How can we execute Terraform configuration files through Service Catalog when it only likes CloudFormation?
Solution
But that pragmatism and the workflow-oriented solution design that we borrowed from Terraform helped us find a sensible solution.
- We ended up building a custom CloudFormation resource, which is just a Lambda that gets triggered by the Service Catalog product and proxies these requests to a
Terraform server.
- A factory service, provisioned with Terraform, creates these Service Catalog products and portfolios on demand.
- A puppet service, provisioned with our Terraform infrastructure units, copies these Service Catalog products to multiple accounts and regions.
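As a rough illustration of the custom resource Lambda, here is a minimal Python sketch. The event fields it reads (`RequestType`, `StackId`, `RequestId`, `LogicalResourceId`, `ResponseURL`, `ResourceProperties`) are the standard CloudFormation custom resource request fields. Everything else is an assumption made for the example: the message shape, the `make_handler` helper, and the injected `enqueue` function, which stands in for whatever queue client is used.

```python
import json
from typing import Any, Callable, Dict


def build_fulfilment_message(event: Dict[str, Any]) -> str:
    """Turn a CloudFormation custom resource request into a queue message."""
    return json.dumps(
        {
            "action": event["RequestType"],  # Create, Update or Delete
            "stack_id": event["StackId"],
            "request_id": event["RequestId"],
            "logical_resource_id": event["LogicalResourceId"],
            # Whoever completes the work must report success or failure to
            # this pre-signed URL, otherwise the stack operation times out.
            "response_url": event["ResponseURL"],
            "properties": event.get("ResourceProperties", {}),
        },
        sort_keys=True,
    )


def make_handler(enqueue: Callable[[str], None]):
    """Build a Lambda-style handler that only forwards the request."""
    def handler(event: Dict[str, Any], context: Any) -> Dict[str, str]:
        enqueue(build_fulfilment_message(event))
        return {"status": "queued", "request_id": event["RequestId"]}
    return handler


# Local usage with an in-memory list standing in for the queue.
sent = []
handler = make_handler(sent.append)
sample_event = {
    "RequestType": "Create",
    "StackId": "arn:aws:cloudformation:example-stack",
    "RequestId": "req-123",
    "LogicalResourceId": "TerraformProxy",
    "ResponseURL": "https://example.com/presigned-response",
    "ResourceProperties": {"Product": "static-website"},
}
result = handler(sample_event, None)
print(result, sent[0])
```

Keeping the Lambda this thin means it holds no Terraform logic at all: it only translates and forwards, and the fulfilment side owns the actual provisioning.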
1. There are two AWS account types: a hub (fulfilment) account that hosts the Terraform servers, and a spoke account that contains the Service Catalog products.
2. The Terraform server is just a thin HTTP wrapper around the Terraform API.
3. When a user provisions a Service Catalog product in a spoke account, Service Catalog invokes a Lambda. The Lambda receives the request and places it in a queue, from which it is
picked up by the Terraform fulfilment server.
4. The Terraform server parses the payload, identifies the Terraform configuration suitable for that payload, and creates the resources in the spoke account by assuming an IAM role in
the spoke account.
5. Terraform state and configuration are stored in the hub account. This reduces the attack surface and makes it easier to monitor, secure, and upgrade.
With that architecture we were able to easily expand to multiple AWS accounts.
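The fulfilment step can be sketched in Python as follows. This is not our server's code, just an illustration under stated assumptions: the queue message shape, the `PRODUCT_CONFIGS` mapping, the directory path, and the `spoke_role_arn` variable name are all hypothetical. The Terraform CLI options used (`-chdir`, `-input=false`, `-auto-approve`, `-var`) are real, and the sketch only builds the command line rather than running it.

```python
import json
from typing import Dict, List

# Hypothetical mapping from product name to a Terraform configuration
# directory kept in the hub account.
PRODUCT_CONFIGS: Dict[str, str] = {
    "static-website": "/srv/terraform/products/static-website",
}


def terraform_command(message: str) -> List[str]:
    """Parse a queue message and build the Terraform CLI invocation for it."""
    payload = json.loads(message)
    props = payload["properties"]
    config_dir = PRODUCT_CONFIGS[props["Product"]]
    # A Delete request tears the resources down; Create and Update apply.
    subcommand = "destroy" if payload["action"] == "Delete" else "apply"
    argv = ["terraform", f"-chdir={config_dir}", subcommand,
            "-input=false", "-auto-approve"]
    # Assumption: the configuration takes the spoke role as a variable and
    # passes it to the AWS provider so resources land in the spoke account.
    argv.append(f"-var=spoke_role_arn={props['SpokeRoleArn']}")
    for name, value in sorted(props.get("Variables", {}).items()):
        argv.append(f"-var={name}={value}")
    return argv


# Illustrative message; every value here is a placeholder.
example_message = json.dumps({
    "action": "Create",
    "properties": {
        "Product": "static-website",
        "SpokeRoleArn": "arn:aws:iam::111111111111:role/example-fulfilment",
        "Variables": {"domain": "example.com"},
    },
})
command = terraform_command(example_message)
print(" ".join(command))
```

Because the configuration and state live with the server in the hub account, a spoke account only needs to expose the role that the server assumes.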
Limitations
So if we think about the two scenarios that we just discussed and try to extract the underlying principles that guided the solution design for our cloud native platform, we
can see that Terraform has:
- Enabled us to build simple, modular and composable infra modules, which we extended to products
- Made it possible for us to have immutable infrastructure across 150+ AWS accounts with minimal complexity
- Empowered our cloud native platform through its built-in support for versioning and automation
- Made our systems resilient by persisting the state of our systems and facilitating reconciliation
- Motivated us to be pragmatic by providing mechanisms to extend it, and to exclude it when that makes sense
And when we have a tool that is built upon such sensible values, values that make sense in any infrastructure platform evolution context, it is no wonder that it is
loved across the globe, and the data for Terraform reflects that.
Thank you