December 4, 2021

Day 4 - GWLB: Panacea for Cloud DMZ on AWS

By: Atif Siddiqui
Edited by: Jennifer Davis (@sigje)

Organizations aspire to apply the same security controls to ingress traffic in Cloud as they have on-premises, ideally taking advantage of Cloud value propositions to provide resiliency and scalability to traffic inspection appliances.

Within the AWS ecosystem, until last year, there wasn't an elegant solution. The most notable challenge this gap created, especially for regulated organizations, was designing the DMZ (demilitarized zone) pattern in AWS. It took two announcements to close it: VPC Ingress Routing and Gateway Load Balancer (GWLB).

Two years ago, AWS announced VPC Ingress Routing, which provided the capability to direct ingress traffic to an Elastic Network Interface (ENI). Last year, Amazon followed up with the complementary announcement of GWLB.

GWLB is AWS's fourth load balancer offering, following the Classic, Application, and Network Load Balancers. Unlike the first three types, GWLB solves a niche problem and is specifically targeted at partner appliances.

GWLB has a novel design with two distinct sides. The front end connects to a VPC endpoint service and its corresponding VPC endpoints, and acts as a Layer 3 gateway. The back end connects to third-party appliances, and acts as a Layer 4 load balancer. An oversimplified diagram of the traffic flow is shown:

Ingress traffic → GWLB endpoints → GWLB endpoint service → GWLB → 3rd party appliance

So how do you provision a GWLB?

There are four key resources that need to be provisioned, in order:

  • Target group (using the GENEVE protocol)
  • GWLB, using the above target group
  • VPC endpoint service, using the above GWLB as its load balancer
  • VPC endpoints bound to the above endpoint service

Target Group

As part of this announcement, AWS implemented the GENEVE protocol and added it as an option for target groups. If you are unfamiliar with this protocol, it is explained after the provisioning walkthrough.

To configure this as infrastructure as code (IaC), you could use a Terraform snippet such as the following:


    resource "aws_lb_target_group" "blog_gwlb_tgt_grp" {
        name      = "blog_gwlb_tgt_grp"
        port        = 6081
        protocol = "GENEVE"
        vpc_id   = aws_vpc.fw.id
      }
    

GWLB

As with Application Load Balancing, GWLB requires a target group to forward traffic; however, the target group must be created with the GENEVE protocol.

Health checks for TCP, HTTP and HTTPS are supported; however, it should be noted that health check packets are not GENEVE encapsulated.
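
For instance, a hedged boto3 sketch of a GENEVE target group with a plain TCP health check (the target group name, VPC ID, and health check port are placeholders, not values from the post) could look like:

import boto3

elbv2 = boto3.client("elbv2")

# The health check itself is plain TCP against the appliance's health port;
# it is not GENEVE encapsulated.
elbv2.create_target_group(
    Name="blog-gwlb-tgt-grp",        # hyphens only; underscores are rejected
    Protocol="GENEVE",
    Port=6081,                       # GENEVE's registered UDP port
    VpcId="vpc-0123456789abcdef0",   # placeholder
    HealthCheckProtocol="TCP",
    HealthCheckPort="80",            # placeholder appliance health port
)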

An example Terraform code snippet is as follows.


    resource "aws_lb" "blog_gwlb" {
        name                       = "blog_gwlb"
        load_balancer_type = "gateway"
        subnets                    = blog-gwlb-subnet.pvt.*.id
      
        tags = {
          Name                     = “blog-gwlb”,
          Environment              = "sandbox"
            }
      }
      

Endpoint Service

Prior to the GWLB announcement, the only load balancer type offered when creating an endpoint service was the Network Load Balancer (NLB). With GWLB's availability, gateway is now the second option. It should be noted that an endpoint service, whether it uses an NLB or a GWLB, relies on the underlying PrivateLink technology.

An example Terraform code snippet is as follows.


    resource "aws_vpc_endpoint_service" "blog-vpce-srvc" {
        acceptance_required              = false
        gateway_load_balancer_arns       = [aws_lb.blog-gwlb.arn]
      
        tags = {
          Name                           = “blog-gwlb”,
          Environment                    = "sandbox"
            }
      }
      

VPC Endpoint

The last key piece is provisioning the VPC endpoints, which bind to the endpoint service created in the prior step.


    resource "aws_vpc_endpoint" "blog_gwlbe" {
      count             = length(var.az)
      vpc_endpoint_type = "GatewayLoadBalancer" # required so the endpoint binds to a GWLB service
      service_name      = aws_vpc_endpoint_service.blog-vpce-srvc.service_name
      subnet_ids        = [var.blog-gwlb-subnets[count.index]]
      vpc_id            = aws_vpc.fw.id

      tags = {
        Name        = "blog-gwlb"
        Environment = "sandbox"
      }
    }
    

GENEVE

This is an encapsulation protocol standardized by the Internet Engineering Task Force (IETF). GENEVE stands for Generic Network Virtualization Encapsulation and uses UDP (port 6081) at the transport layer. This encapsulation is what achieves the transparent routing of packets to third-party appliances from vendors such as F5 (BIG-IP), Palo Alto Networks, and Aviatrix.

Special route table

The glue that binds the VPC Ingress Routing and GWLB features together is a special use of a route table.

Ingress traffic → GWLB endpoints → GWLB endpoint service → GWLB → 3rd party appliance, e.g. a marketplace subscription.

This route table does not have any explicit subnet association. It does, however, have the Internet Gateway (IGW) specified as an edge association.

Within its routes, quad zero (0.0.0.0/0) points to the network interfaces (ENIs) of the Gateway Load Balancer endpoints (GWLBe).

It is this routing rule that forces ingress traffic to the GWLBe, which in turn sends it to the GWLB (via the endpoint service), which then routes it to the appliances.
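
As a rough illustration, here is a minimal boto3 sketch of such an ingress route table (the resource IDs and the routed CIDR are hypothetical placeholders; in a real deployment this would sit alongside the Terraform above):

import boto3

ec2 = boto3.client("ec2")

# Hypothetical IDs -- substitute your own VPC, IGW, and GWLB endpoint.
VPC_ID = "vpc-0123456789abcdef0"
IGW_ID = "igw-0123456789abcdef0"
GWLBE_ID = "vpce-0123456789abcdef0"

# The "ingress" route table gets no subnet associations.
rtb_id = ec2.create_route_table(VpcId=VPC_ID)["RouteTable"]["RouteTableId"]

# Edge association: attach the route table to the Internet Gateway so that
# traffic entering the VPC through the IGW is evaluated against it.
ec2.associate_route_table(RouteTableId=rtb_id, GatewayId=IGW_ID)

# Steer inspected traffic to the Gateway Load Balancer endpoint. The
# destination here is a placeholder CIDR for the protected subnets;
# adjust it to match your own design.
ec2.create_route(
    RouteTableId=rtb_id,
    DestinationCidrBlock="10.0.1.0/24",
    VpcEndpointId=GWLBE_ID,
)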

Limitations

A target group using the GENEVE protocol does not support tags.

Cloud DMZ: Centralized Inspection Architecture

Conclusion

The pairing of VPC Ingress Routing and GWLB gives enterprises a much-sought-after security posture in which both ingress and egress traffic can undergo firewall inspection. This capability is especially notable when building a Cloud DMZ architecture.

Afterthought: AWS Network Firewall

It is always fascinating to me how AWS keeps vendors on their toes. There seems to be an aura of ineluctability where vendors strive to stay a step ahead of AWS's offerings. While customers can use marketplace subscriptions (e.g. firewalls) with GWLB, there is a competing service from Amazon named AWS Network Firewall. This is essentially firewall as a service: the VPC Ingress Routing primitive points traffic to AWS Network Firewall, which uses GWLB behind the scenes. It is easy to predict that AWS will push new products in this space that use GWLB under the hood.

Over time, options will grow, whether through new AWS products or through more vendors certifying their appliances with GWLB. This abundance will only benefit customers in their pursuit of a secure network architecture.

December 19, 2017

Day 19 - Infrastructure Testing: Sanity Checking VPC Routes

By: Dan Stark (@danstarkdevops)
Edited By: James Wen

Testing Infrastructure is hard; testing VPCs is even harder

Infrastructure testing is hard. My entire career I’ve tried to bring traditional development testing practices into operations. Linters. Rspec. Mock objects. These tools provide semantic and syntax checking, as well as unit and integration level coverage for infrastructure as code. Ideally, we would also test the system after the code is deployed. End-to-end infrastructure testing has always been a stretch goal – too time-consuming to implement from scratch. This is especially true of network level testing. I am not aware of any existing tools that provide self-contained, end-to-end tests to ensure VPCs, subnets, and route tables are properly configured. As a result, production network deployments can be incredibly anxiety-inducing. Recently, my coworkers and I set up an entire VPC (virtual private cloud) using infrastructure as code, but felt we needed a tool that could perform a VPC specification validation to catch bugs or typos before deployment. The goal was to sanity check our VPC routes using a real resource in every subnet.

Why is this necessary

A typical VPC architecture may contain multiple VPCs and include peering, internal/external subnets, NAT instances/gateways, and internet gateways. In the “Reliability Pillar” of their “Well-Architected Framework” whitepaper, AWS recommends designing your system based on your availability needs. At Element 84, we desired 99.9% reliability for EC2, which required 3 external and 3 internal subnets with CIDR blocks of /20 or smaller. In addition, we needed 9 of these redundant VPCs to provide the required network segregation. It took significant effort to carve out these VPCs, with their dependent rules and resources, across three availability zones.

Here is a hypothetical example of multiple VPCs (Dev, Stage, Prod) over two regions:

[Diagram: Dev, Stage, and Prod VPCs across two regions]

Extending this example with additional VPCs for bastion hosts, reporting, and Demilitarized Zones for contractors (DMZs), both NonProd and Prod:

[Diagram: the same environments extended with bastion, reporting, and DMZ VPCs for NonProd and Prod]

It’s too easy for a human to make a mistake, even with infrastructure as code.

Managing VPC infrastructure

Here’s one example of how we want a VPC to behave:

We want a Utility VPC peered to a Staging VPC. The Utility VPC contains bastion EC2 instances living in external subnets and the Staging VPC contains internally subnetted application EC2 instances. We want to test and ensure the connectivity between these resources. Also, we want to verify that every bastion EC2 can communicate with all the potential application EC2 instances, across all subnets. Additionally, we want to test connectivity to the external internet for the application EC2 instances, in this case via a NAT gateway.

These behaviors are well defined and should be tested. We decided to write a tool to help manage testing our VPCs and ensuring these kinds of behaviors. It contains:

  1. a maintainable top level DSL written in YAML to declare the VPC specification that sits above the VPC configuration code; and
  2. a mechanism to be able to test the network level connectivity between VPCs, subnets, IGW/NAT and report any problems.

Introducing: VpcSpecValidator

This project is “VpcSpecValidator,” a Python 3 library built on top of boto3.

There are a few requirements in how you deploy your VPCs to use this library:

  1. You must have deployed your VPCs with CloudFormation and have Outputs for each subnet whose names contain the strings “Private” or “Public” and “Subnet”, e.g. “DevPrivateSubnetAZ1A”, “DevPrivateSubnetAZ1B”, “DevPrivateSubnetAZ1C.”
  2. All VPCs should be tagged with ‘Name’ tags in your region(s).
  3. You must ensure that a security group attached to these instances allows SSH access between your VPC peers. This is not recommended for production, so you may want to remove these rules after testing.
  4. You must have permissions to create/destroy EC2 instances for complete setup and teardown. The destroy method has multiple guards to prevent you from accidentally deleting EC2 instances not created by this project.

You supply a YAML configuration file to outline your VPCs’ structure. Using our example above, this would look like:

project_name: mycompany
region: us-east-1
availability_zones:
  - us-east-1a
  - us-east-1b
  - us-east-1c

# Environment specification
dev:
  peers:
    - nonprod-util
    - nonprod-reporting
  us-east-1a:
    public: 172.16.0.0/23
    private: 172.16.6.0/23
  us-east-1b:
    public: 172.16.2.0/23
    private: 172.16.8.0/23
  us-east-1c:
    public: 172.16.4.0/23
    private: 172.16.10.0/23

stage:
  peers:
    - nonprod-util
    - nonprod-reporting
  us-east-1a:
    public: 172.17.0.0/23
    private: 172.17.6.0/23
  us-east-1b:
    public: 172.17.2.0/23
    private: 172.17.8.0/23
  us-east-1c:
    public: 172.17.4.0/23
    private: 172.17.10.0/23

nonprod-util:
  peers:
    - dev
    - stage
  us-east-1a:
    public: 172.19.0.0/23
    private: 172.19.6.0/23
  us-east-1b:
    public: 172.19.2.0/23
    private: 172.19.8.0/23
  us-east-1c:
    public: 172.19.4.0/23
    private: 172.19.10.0/23

nonprod-reporting:
  peers:
    - dev
    - stage
  us-east-1a:
    public: 172.20.0.0/23
    private: 172.20.6.0/23
  us-east-1b:
    public: 172.20.2.0/23
    private: 172.20.8.0/23
  us-east-1c:
    public: 172.20.4.0/23
    private: 172.20.10.0/23

prod:
  peers:
    - prod-util
  us-east-1a:
    public: 172.18.0.0/23
    private: 172.18.6.0/23
  us-east-1b:
    public: 172.18.2.0/23
    private: 172.18.8.0/23
  us-east-1c:
    public: 172.18.4.0/23
    private: 172.18.10.0/23

prod-util:
  peers:
    - prod
  us-east-1a:
    public: 172.19.208.0/23
    private: 172.19.214.0/23
  us-east-1b:
    public: 172.19.210.0/23
    private: 172.19.216.0/23
  us-east-1c:
    public: 172.19.212.0/23
    private: 172.19.218.0/23

prod-reporting:
  peers:
    - prod
  us-east-1a:
    public: 172.20.208.0/23
    private: 172.20.214.0/23
  us-east-1b:
    public: 172.20.210.0/23
    private: 172.20.216.0/23
  us-east-1c:
    public: 172.20.212.0/23
    private: 172.20.218.0/23

The code will:

  1. parse the YAML for a user-specified VPC,
  2. get the public or private subnets associated with each Availability Zone’s CIDR range in that VPC,
  3. launch an instance in those subnets,
  4. identify the peering VPC(s),
  5. create a list of the test instances in the peer’s subnets (public or private, depending on what was specified in step 2),
  6. attempt a TCP socket connection using the private IP and port 22 for each instance in this list (a minimal sketch of this check appears just below the list).
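
A hedged sketch of that connectivity check, with the peer IPs hard-coded as hypothetical placeholders (in the real library they come from the settings file written during bootstrapping), might look like:

import socket

def can_reach_ssh(private_ip, port=22, timeout=5):
    """Return True if a TCP connection to private_ip:port succeeds."""
    try:
        with socket.create_connection((private_ip, port), timeout=timeout):
            return True
    except (socket.timeout, OSError):
        return False

# In VpcSpecValidator these would be read from target_ip_list.settings.
for ip in ["172.19.6.10", "172.19.8.11"]:  # hypothetical peer instance IPs
    status = "OK" if can_reach_ssh(ip) else "FAILED"
    print("{}:22 -> {}".format(ip, status))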

Step 5 posed an interesting deployment challenge. We decided UserData was a good option to bootstrap and clone the repo on an EC2 instance, but did not know how to pass it the peered VPC's private IP addresses as SSH targets.

Given the entire specification is in one file and the CIDR ranges are available, we can cheat and look at the Outputs of the peer(s)’ CloudFormation stack and see if any instances created in Step 3 match.

def get_ip_of_peer_instances_and_write_to_settings_file(self):

    '''
    This is run on the source EC2 instance as part of UserData bootstrapping
    1) Look at the peer(s)' VPC CloudFormation Stack's Outputs for a list of subnets, public or private as defined in the constructor.
    2) Find instances in those subnets created by this library
    3) Get the Private IP address of target instances and write it to a local configuration file
    '''
        
    # Query for peer CloudFormation, get instances
    target_subnet_list = []
    target_ip_list = []
    with open(self.config_file_path, 'r') as ymlfile:
        cfg = yaml.safe_load(ymlfile)  # safe_load avoids executing arbitrary YAML tags
    
    for peer in self.peers_list:
        peer_stack_name = "{}-vpc-{}-{}".format(self.project_name, peer, cfg['region'])
    
        # Look at each peer's CloudFormation Stack Outputs and get a list of subnets (public or private)
        client = boto3.client('cloudformation')
        response = client.describe_stacks(StackName=peer_stack_name)
        response_outputs = response['Stacks'][0]['Outputs']
    
        for i in range(0,len(response_outputs)):
            if self.subnet_type == 'public':
                if 'Subnet' in response_outputs[i]['OutputKey'] and 'Public' in \
                        response_outputs[i]['OutputKey']:
                    subnet_id = response_outputs[i]['OutputValue']
                    target_subnet_list.append(subnet_id)
    
            else:
                if 'Subnet' in response_outputs[i]['OutputKey'] and 'Private' in \
                        response_outputs[i]['OutputKey']:
                    subnet_id = response_outputs[i]['OutputValue']
                    target_subnet_list.append(subnet_id)
    
    
        # Search the instances in the targeted subnets for a Name tag of VpcSpecValidator
        client = boto3.client('ec2')
        describe_response = client.describe_instances(
            Filters=[{
                'Name': 'tag:Name',
                'Values': ['VpcSpecValidator-test-runner-{}-*'.format(peer)]
            }]
        )
    
        # Get Private IP addresses of these instances and write them to target_ip_list.settings
    
        for i in range(0,len(describe_response['Reservations'])):
            target_ip_list.append(describe_response['Reservations'][i]['Instances'][0]['PrivateIpAddress'])
    
        # Write the list to a configuration file used at runtime for EC2 instance
        with open('config/env_settings/target_ip_list.settings', 'w') as settings_file:
            settings_file.write(str(target_ip_list))

There is also a friendly method to ensure that the YAML specification matches what is actually deployed via CloudFormation templates.

spec = VpcSpecValidator('dev', subnet_type='private')

spec.does_spec_match_cloudformation()

Next Steps

At Element 84, we believe our work benefits our world, and open source is one way to personify this value. We’re in the process of open sourcing this library at the moment. Please check back soon and we’ll update this blog post with a link. We will also post a link to the repo on our company Twitter.

Future features we would like to add:

  • Make the VpcSpecValidator integration/usage requirements less strict.
  • Add methods to test internet connectivity.
  • Dynamic EC2 keypair generation/destruction. The key pairs should be unique and thrown away after the test.
  • Compatibility with Terraform.
  • CI as a first-class citizen by aggregating results in JUnit-compatible format. Although I think it would be overkill to run these tests with every application code commit, it may make sense for infrastructure commits or running on a schedule.

Wrap Up - Testing VPCs is difficult but important

Many businesses use one big, poorly defined (default) VPC. There are a few problems with this:

Development resources can impact production

At a fundamental level, you want to have as many barriers between development and production environments as possible. This isn't necessarily to stop developers from caring about production. As an operator, I want to prevent my developers from being in a position where they might unintentionally impact production. In addition to security group restrictions, we want to make these potential mishaps impossible from a network perspective. To steal an idea from Google Cloud Platform, we want to establish “layers of security.” This tool helps to enforce these paradigms by validating VPC behavior prior to deployment.

Well-defined and well-tested architecture is necessary for scaling

This exercise forced our team to think about our architecture and its future. What are the dependencies as they sit today? How would we scale to multi-region? What about third party access? We would want them in a DMZ yet still able to get the information they need. How big do we expect these VPCs to scale?

It’s critical to catch these issues before anything is deployed

The best time to find typos, configuration, and logic errors is before the networking is in use. Once deployed, these are hard errors to troubleshoot because of the built-in redundancy. The goal is to prevent autoscaling events yielding a “how was this ever working” alarm at 3AM because one subnet route table is misconfigured. That’s why we feel a tool like this has a place in the community. Feel free to add comments and voice a +1 in support.

Happy SysAdvent!

December 14, 2017

Day 14 - Pets vs. Cattle Prods: The Silence of the Lambdas

By: Corey Quinn (@quinnypig)
Edited By: Scott Murphy (@ovsage)

“Mary had a little Lambda
S3 its source of truth
And every time that Lambda ran
Her bill went through the roof.”

Lambda is Amazon’s implementation of a concept more broadly known as “Functions as a Service,” or occasionally “Serverless.” The premise behind these technologies is to abstract away all of the infrastructure-like bits around your code, leaving the code itself the only thing you have to worry about. You provide code, Amazon handles the rest. If you’re a sysadmin, you might well see this as the thin end of a wedge that’s coming for your job. Fortunately, we have time; Lambda’s a glimpse into the future of computing in some ways, but it’s still fairly limited.

Today, the constraints around Lambda are somewhat severe.

  • You’re restricted to writing code in a relatively small selection of languages– there’s official support for Python, Node, .Net, Java, and (very soon) Go. However, you can shoehorn in shell scripts, PHP, Ruby, and others. More on this in a bit.
  • Amazon has solved the Halting Problem handily– after a certain number of seconds (hard capped at 300) your function will terminate.
  • Concurrency is tricky: it's as easy to have one Lambda running at a time as it is one thousand. If they each connect to a database, it's about to have a very bad day. (Lambda just introduced per-function concurrency limits, which smooths this somewhat.)
  • Workflows around building and deploying Lambdas are left as an exercise for the reader. This is how Amazon tells developers to go screw themselves without seeming rude about it.
  • At scale, the economics of Lambda are roughly 5x the cost of equivalent compute in EC2. That said, for jobs that only run intermittently, or are highly burstable, the economics are terrific. Lambdas are billed in Gigabyte-Seconds (of RAM).
  • Compute and IO scale linearly with the amount of RAM allocated to a function. Exactly what level maps to what is unpublished, and may change without notice.
  • Lambda functions run in containers. Those containers may be reused (“warm starts”) and be able to reuse things like database connections, or have to be spun up from scratch (“cold starts”). It’s a grand mystery, one your code will have to take into account.
  • There is a finite list of things that can trigger Lambda functions. Fortunately, cron-style schedules are now one of them.
  • The Lambda runs within an unprivileged user account inside of a container. The only place inside of this container where you can write data is /tmp, and it's limited to 500MB.
  • Your function must fit into a zip file that’s 50MB or smaller; decompressed, it must fit within 250MB– including dependencies.

Let’s focus on one particular Lambda use case: replacing the bane of sysadmin existence, cron jobs. Specifically, cron jobs that affect your environment beyond “the server they run on.” You still have to worry about server log rotation; sorry.

Picture being able to take your existing cron jobs, and no longer having to care about the system they run on. Think about jobs like “send out daily emails,” “perform maintenance on the databases,” “trigger a planned outage so you can look like a hero to your company,” etc.

If your cron job is written in one of the supported Lambda languages, great– you’re almost there. For the rest of us, we probably have a mashup of bash scripts. Rejoice, for hope is not lost! Simply wrap your terrible shell script (I’m making assumptions here– all of my shell scripts are objectively terrible) inside of a python or javascript caller that shells out to invoke your script. Bundle the calling function and the shell script together, and you’re there. As a bonus, if you’re used to running this inside of a cron job, you likely have already solved for the myriad shell environment variable issues that bash scripts can run into when they’re called by a non-interactive environment.
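
A hedged sketch of such a caller follows (the script name is a made-up placeholder; it just needs to be bundled in the same zip as the handler):

import subprocess

def handler(event, context):
    """Run the bundled shell script and surface its output in CloudWatch Logs."""
    # maintenance.sh is hypothetical -- any script packaged alongside this
    # handler ends up in the function's working directory.
    proc = subprocess.run(
        ["bash", "maintenance.sh"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    print(proc.stdout)  # goes to CloudWatch Logs
    if proc.returncode != 0:
        # Raising marks the invocation as failed in Lambda's metrics.
        raise RuntimeError("maintenance.sh exited with %d" % proc.returncode)
    return {"status": "ok"}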

Set your Lambda trigger to be a “CloudWatch Event - Scheduled” event, and you’re there. It accepts the same cron syntax we all used to hate but have come to love in a technical form of Stockholm Syndrome.
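
If you prefer to wire that trigger up from code rather than the console, a minimal boto3 sketch follows (the function name, rule name, account ID, and schedule are hypothetical placeholders):

import boto3

events = boto3.client("events")
awslambda = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:nightly-maintenance"

# CloudWatch Events cron has six fields (minute hour day-of-month month
# day-of-week year); one of the two day fields must be '?'.
rule = events.put_rule(
    Name="nightly-maintenance",
    ScheduleExpression="cron(0 3 * * ? *)",  # 03:00 UTC every day
    State="ENABLED",
)

events.put_targets(
    Rule="nightly-maintenance",
    Targets=[{"Id": "nightly-maintenance-lambda", "Arn": FUNCTION_ARN}],
)

# Allow CloudWatch Events to invoke the function.
awslambda.add_permission(
    FunctionName="nightly-maintenance",
    StatementId="allow-cloudwatch-events",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)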

This is of course a quick-and-dirty primer for getting up and running with Lambda in the shortest time possible– but it gives you a taste of what the system is capable of. More importantly, it gives you the chance to put “AWS Lambda” on your resume– and your resume should always be your most important project.

If you have previous experience with AWS Lambda and you’re anything like me, your first innocent foray into the console for AWS Lambda was filled with sadness, regret, confusion, and disbelief. It’s hard to wrap your head around what it is, how it works, and why you should care. It’s worth taking a look at if you’ve not used it– this type of offering and the design patterns that go along with it are likely to be with us for a while. Even if you’ve already taken a dive into Lambda, it’s worth taking a fresh look at– the interface was recently replaced, and the capabilities of this platform continue to grow.

December 9, 2017

Day 9 - Using Kubernetes for multi-provider, multi-region batch jobs

By: Eric Sigler (@esigler)
Edited By: Michelle Carroll (@miiiiiche)

Introduction

At some point you may find yourself wanting to run work on multiple infrastructure providers — for reliability against certain kinds of failures, to take advantage of lower costs in capacity between providers during certain times, or for any other reason specific to your infrastructure. This used to be a very frustrating problem, as you’d be restricted to a “lowest common denominator” set of tools, or have to build up your own infrastructure primitives across multiple providers. With Kubernetes, we have a new, more sophisticated set of tools to apply to this problem.

Today we’re going to walk through how to set up multiple Kubernetes clusters on different infrastructure providers (specifically Google Cloud Platform and Amazon Web Services), and then connect them together using federation. Then we’ll go over how you can submit a batch job task to this infrastructure, and have it run wherever there’s available capacity. Finally, we’ll wrap up with how to clean up from this tutorial.

Overview

Unfortunately, there isn’t a one-step “make me a bunch of federated Kubernetes clusters” button. Instead, we’ve got several parts we’ll need to take care of:

  1. Have all of the prerequisites in place.
  2. Create a work cluster in AWS.
  3. Create a work cluster in GCE.
  4. Create a host cluster for the federation control plane in AWS.
  5. Join the work clusters to the federation control plane.
  6. Configure all clusters to correctly process batch jobs.
  7. Submit an example batch job to test everything.

Disclaimers

  1. Kubecon is the first week of December, and Kubernetes 1.9.0 is likely to be released the second week of December, which means this tutorial may go stale quickly. I’ll try to call out what is likely to change, but if you’re reading this and it’s any time after December 2017, caveat emptor.
  2. This is not the only way to set up Kubernetes (and federation). One of the two work clusters could be used for the federation control plane, and having a Kubernetes cluster with only one node is bad for reliability. A final example is that kops is a fantastic tool for managing Kubernetes cluster state, but production infrastructure state management often has additional complexity.
  3. All of the various CLI tools involved (gcloud, aws, kube*, and kops) have really useful environment variables and configuration files that can decrease the verbosity needed to execute commands. I’m going to avoid many of those in favor of being more explicit in this tutorial, and initialize the rest at the beginning of the setup.
  4. This tutorial is based off information from the Kubernetes federation documentation and kops Getting Started documentation for AWS and GCE wherever possible. When in doubt, there’s always the source code on GitHub.
  5. The free tiers of each platform won’t cover all the costs of going through this tutorial, and there are instructions at the end for how to clean up so that you shouldn’t incur unplanned expense — but always double check your accounts to be sure!

Setting up federated Kubernetes clusters on AWS and GCE

Part 1: Take care of the prerequisites

  1. Sign up for accounts on AWS and GCE.
  2. Install the AWS Command Line Interface - brew install awscli.
  3. Install the Google Cloud SDK.
  4. Install the Kubernetes command line tools - brew install kubernetes-cli kubectl kops
  5. Install the kubefed binary from the appropriate tarball for your system.
  6. Make sure you have an SSH key, or generate a new one.
  7. Use credentials that have sufficient access to create resources in both AWS and GCE. You can use something like IAM accounts.
  8. Have appropriate domain names registered, and a DNS zone configured, for each provider you’re using (Route53 for AWS, Cloud DNS for GCP). I will use “example.com” below — note that you’ll need to keep track of the appropriate records.

Finally, you’ll need to pick a few unique names in order to run the below steps. Here are the environment variables that you will need to set beforehand:

export S3_BUCKET_NAME="put-your-unique-bucket-name-here"
export GS_BUCKET_NAME="put-your-unique-bucket-name-here"

Part 2: Set up the work cluster in AWS

To begin, you’ll need to set up the persistent storage that kops will use for the AWS work cluster:

aws s3api create-bucket --bucket $S3_BUCKET_NAME

Then, it’s time to create the configuration for the cluster:

kops create cluster \
 --name="aws.example.com" \
 --dns-zone="aws.example.com" \
 --zones="us-east-1a" \
 --master-size="t2.medium" \
 --node-size="t2.medium" \
 --node-count="1" \
 --state="s3://$S3_BUCKET_NAME" \
 --kubernetes-version="1.8.0" \
 --cloud=aws

If you want to review the configuration, use kops edit cluster aws.example.com --state="s3://$S3_BUCKET_NAME". When you’re ready to proceed, provision the AWS work cluster by running:

kops update cluster aws.example.com --yes --state="s3://$S3_BUCKET_NAME"

Wait until kubectl get nodes --show-labels --context=aws.example.com shows the NODE role as Ready (it should take 3–5 minutes). Congratulations, you have your first (of three) Kubernetes clusters ready!

Part 3: Set up the work cluster in GCE

OK, now we’re going to do a very similar set of steps for our second work cluster, this one on GCE. First though, we need to have a few extra environment variables set:

export PROJECT=`gcloud config get-value project`
export KOPS_FEATURE_FLAGS=AlphaAllowGCE

As the documentation points out, using kops with GCE is still considered alpha. To keep each cluster using vendor-specific tools, let’s set up state storage for the GCE work cluster using Google Storage:

gsutil mb gs://$GS_BUCKET_NAME/

Now it’s time to generate the configuration for the GCE work cluster:

kops create cluster \
 --name="gcp.example.com" \
 --dns-zone="gcp.example.com" \
 --zones="us-east1-b" \
 --state="gs://$GS_BUCKET_NAME/" \
 --project="$PROJECT" \
 --kubernetes-version="1.8.0" \
 --cloud=gce

As before, use kops edit cluster gcp.example.com --state="gs://$GS_BUCKET_NAME/" to peruse the configuration. When ready, provision the GCE work cluster by running:

kops update cluster gcp.example.com --yes --state="gs://$GS_BUCKET_NAME/"

And once kubectl get nodes --show-labels --context=gcp.example.com shows the NODE role as Ready, your second work cluster is complete!

Part 4: Set up the host cluster

It’s useful to have a separate cluster that hosts the federation control plane. In production, it’s better to have this isolation to be able to reason about failure modes for different components. In the context of this tutorial, it’s easier to reason about which cluster is doing what work.

In this case, we can use the existing S3 bucket we’ve previously created to hold the configuration for our second AWS cluster — no additional S3 bucket needed! Let’s generate the configuration for the host cluster, which will run the federation control plane:

kops create cluster \
 --name="host.example.com" \
 --dns-zone="host.example.com" \
 --zones=us-east-1b \
 --master-size="t2.medium" \
 --node-size="t2.medium" \
 --node-count="1" \
 --state="s3://$S3_BUCKET_NAME" \
 --kubernetes-version="1.8.0" \
 --cloud=aws

Once you’re ready, run this command to provision the cluster:

kops update cluster host.example.com --yes --state="s3://$S3_BUCKET_NAME"

And one last time, wait until kubectl get nodes --show-labels --context=host.example.com shows the NODE role as Ready.

Part 5: Set up the federation control plane

Now that we have all of the pieces we need to do work across multiple providers, let’s connect them together using federation. First, add aliases for each of the clusters:

kubectl config set-context aws --cluster=aws.example.com --user=aws.example.com
kubectl config set-context gcp --cluster=gcp.example.com --user=gcp.example.com
kubectl config set-context host --cluster=host.example.com --user=host.example.com

Next up, we use the kubefed command to initialize the control plane, and add itself as a member:

kubectl config use-context host
kubefed init fed --host-cluster-context=host --dns-provider=aws-route53 --dns-zone-name="example.com"

If the message “Waiting for federation control plane to come up” takes an unreasonably long amount of time to appear, you can check the underlying pods for any issues by running:

kubectl get all --context=host.example.com --namespace=federation-system
kubectl describe po/fed-controller-manager-EXAMPLE-ID --context=host.example.com --namespace=federation-system

Once you see “Federation API server is running,” we can join the work clusters to the federation control plane:

kubectl config use-context fed
kubefed join aws --host-cluster-context=host --cluster-context=aws
kubefed join gcp --host-cluster-context=host --cluster-context=gcp
kubectl --context=fed create namespace default

To confirm everything’s working, you should see the aws and gcp clusters when you run:

kubectl --context=fed get clusters

Part 6: Set up the batch job API

(Note: This is likely to change as Kubernetes evolves — this was tested on 1.8.0.) We’ll need to edit the federation API server in the control plane, and enable the batch job API. First, let’s edit the deployment for the fed-apiserver:

kubectl --context=host --namespace=federation-system edit deploy/fed-apiserver

And within the configuration in the federation-apiserver section, add a --runtime-config=batch/v1 line, like so:

  containers:
  - command:
    - /hyperkube
    - federation-apiserver
    - --admission-control=NamespaceLifecycle
    - --bind-address=0.0.0.0
    - --client-ca-file=/etc/federation/apiserver/ca.crt
    - --etcd-servers=http://localhost:2379
    - --secure-port=8443
    - --tls-cert-file=/etc/federation/apiserver/server.crt
    - --tls-private-key-file=/etc/federation/apiserver/server.key
  ... Add the line:
    - --runtime-config=batch/v1

Then restart the Federation API Server and Cluster Manager pods by rebooting the node running them. Watch kubectl get all --context=host --namespace=federation-system if you want to see the various components change state. You can verify the change applied by running the following Python code:

# Sample code from Kubernetes Python client
from kubernetes import client, config


def main():
    config.load_kube_config()

    print("Supported APIs (* is preferred version):")
    print("%-20s %s" %
          ("core", ",".join(client.CoreApi().get_api_versions().versions)))
    for api in client.ApisApi().get_api_versions().groups:
        versions = []
        for v in api.versions:
            name = ""
            if v.version == api.preferred_version.version and len(
                    api.versions) > 1:
                name += "*"
            name += v.version
            versions.append(name)
        print("%-40s %s" % (api.name, ",".join(versions)))

if __name__ == '__main__':
    main()      

You should see output from that Python script that looks something like:

> python api.py
Supported APIs (* is preferred version):
core                 v1
federation           v1beta1
extensions           v1beta1
batch                v1

Part 7: Submitting an example job

Following along from the Kubernetes batch job documentation, create a file, pi.yaml with the following contents:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: pi-
spec:
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4

This job spec:

  • Runs a single container to generate the first 2,000 digits of Pi.
  • Uses a generateName, so you can submit it multiple times (each time it will have a different name).
  • Sets restartPolicy: Never, but OnFailure is also allowed for batch jobs.
  • Sets backoffLimit. This generates a parse violation in 1.8, so we have to disable validation.

Now you can submit the job, and follow it across your federated set of Kubernetes clusters. First, at the federated control plane level, submit and see which work cluster it lands on:

kubectl --validate=false --context=fed create -f ./pi.yaml 
kubectl --context=fed get jobs
kubectl --context=fed describe job/<JOB IDENTIFIER>

Then (assuming it’s the AWS cluster — if not, switch the context below), dive in deeper to see how the job finished:

kubectl --context=aws get jobs
kubectl --context=aws describe job/<JOB IDENTIFIER>
kubectl --context=aws get pods
kubectl --context=aws describe pod/<POD IDENTIFIER>
kubectl --context=aws logs <POD IDENTIFIER>

If all went well, you should see the output from the job. Congratulations!

Cleaning up

Once you’re done trying out this demonstration cluster, you can clean up all of the resources you created by running:

kops delete cluster gcp.example.com --yes --state="gs://$GS_BUCKET_NAME/"
kops delete cluster aws.example.com --yes --state="s3://$S3_BUCKET_NAME"
kops delete cluster host.example.com --yes --state="s3://$S3_BUCKET_NAME"

Don’t forget to verify in the AWS and GCE console that everything was removed, to avoid any unexpected expenses.

Conclusion

Kubernetes provides a tremendous amount of infrastructure flexibility to everyone involved in developing and operating software, and the batch-job workflow shown here is only one of many applications for federated Kubernetes clusters.

Good luck to you in whatever your Kubernetes design patterns may be, and happy SysAdvent!

December 15, 2014

Day 15 - Cook your own packages: Getting more out of fpm

Written by: Mathias Lafeldt (@mlafeldt)
Edited by: Joseph Kern (@josephkern)

Introduction

When it comes to building packages, there is one particular tool that has grown in popularity over the last few years: fpm. fpm’s honorable goal is to make it as simple as possible to create native packages for multiple platforms, all without having to learn the intricacies of each distribution’s packaging format (.deb, .rpm, etc.) and tooling.

With a single command, fpm can build packages from a variety of sources including Ruby gems, Python modules, tarballs, and plain directories. Here’s a quick example showing you how to use the tool to create a Debian package of the AWS SDK for Ruby:

$ fpm -s gem -t deb aws-sdk
Created package {:path=>"rubygem-aws-sdk_1.59.0_all.deb"}

It is this simplicity that makes fpm so popular. Developers are able to easily distribute their software via platform-native packages. Businesses can manage their infrastructure on their own terms, independent of upstream vendors and their policies. All of this has been possible before, but never with this little effort.

In practice, however, things are often more complicated than the one-liner shown above. While it is absolutely possible to provision production systems with packages created by fpm, it will take some work to get there. The tool can only help you so far.

In this post we’ll take a look at several best practices covering: dependency resolution, reproducible builds, and infrastructure as code. All examples will be specific to Debian and Ruby, but the same lessons apply to other platforms/languages as well.

Resolving dependencies

Let’s get back to the AWS SDK package from the introduction. With a single command, fpm converts the aws-sdk Ruby gem to a Debian package named rubygem-aws-sdk. This is what happens when we actually try to install the package on a Debian system:

$ sudo dpkg --install rubygem-aws-sdk_1.59.0_all.deb
...
dpkg: dependency problems prevent configuration of rubygem-aws-sdk:
 rubygem-aws-sdk depends on rubygem-aws-sdk-v1 (= 1.59.0); however:
  Package rubygem-aws-sdk-v1 is not installed.
...

As we can see, our package can’t be installed due to a missing dependency (rubygem-aws-sdk-v1). Let’s take a closer look at the generated .deb file:

$ dpkg --info rubygem-aws-sdk_1.59.0_all.deb
 ...
 Package: rubygem-aws-sdk
 Version: 1.59.0
 License: Apache 2.0
 Vendor: Amazon Web Services
 Architecture: all
 Maintainer: <vagrant@wheezy-buildbox>
 Installed-Size: 5
 Depends: rubygem-aws-sdk-v1 (= 1.59.0)
 Provides: rubygem-aws-sdk
 Section: Languages/Development/Ruby
 Priority: extra
 Homepage: http://aws.amazon.com/sdkforruby
 Description: Version 1 of the AWS SDK for Ruby. Available as both `aws-sdk` and `aws-sdk-v1`.
  Use `aws-sdk-v1` if you want to load v1 and v2 of the Ruby SDK in the same
  application.

fpm did a great job at populating metadata fields such as package name, version, license, and description. It also made sure that the Depends field contains all required dependencies that have to be installed for our package to work properly. Here, there’s only one direct dependency – the one we’re missing.

While fpm goes to great lengths to provide proper dependency information – and this is not limited to Ruby gems – it does not automatically build those dependencies. That’s our job. We need to find a set of compatible dependencies and then tell fpm to build them for us.

Let’s build the missing rubygem-aws-sdk-v1 package with the exact version required and then observe the next dependency in the chain:

$ fpm -s gem -t deb -v 1.59.0 aws-sdk-v1
Created package {:path=>"rubygem-aws-sdk-v1_1.59.0_all.deb"}

$ dpkg --info rubygem-aws-sdk-v1_1.59.0_all.deb | grep Depends
 Depends: rubygem-nokogiri (>= 1.4.4), rubygem-json (>= 1.4), rubygem-json (<< 2.0)

Two more packages to take care of: rubygem-nokogiri and rubygem-json. By now, it should be clear that resolving package dependencies like this is no fun. There must be a better way.

In the Ruby world, Bundler is the tool of choice for managing and resolving gem dependencies. So let’s ask Bundler for the dependencies we need. For this, we create a Gemfile with the following content:

# Gemfile
source "https://rubygems.org"
gem "aws-sdk", "= 1.59.0"
gem "nokogiri", "~> 1.5.0" # use older version of Nokogiri

We then instruct Bundler to resolve all dependencies and store the resulting .gem files into a local folder:

$ bundle package
...
Updating files in vendor/cache
  * json-1.8.1.gem
  * nokogiri-1.5.11.gem
  * aws-sdk-v1-1.59.0.gem
  * aws-sdk-1.59.0.gem

We specifically asked Bundler to create .gem files because fpm can convert them into Debian packages in a matter of seconds:

$ find vendor/cache -name '*.gem' | xargs -n1 fpm -s gem -t deb
Created package {:path=>"rubygem-aws-sdk-v1_1.59.0_all.deb"}
Created package {:path=>"rubygem-aws-sdk_1.59.0_all.deb"}
Created package {:path=>"rubygem-json_1.8.1_amd64.deb"}
Created package {:path=>"rubygem-nokogiri_1.5.11_amd64.deb"}

As a final test, let’s install those packages…

$ sudo dpkg -i *.deb
...
Setting up rubygem-json (1.8.1) ...
Setting up rubygem-nokogiri (1.5.11) ...
Setting up rubygem-aws-sdk-v1 (1.59.0) ...
Setting up rubygem-aws-sdk (1.59.0) ...

…and verify that the AWS SDK actually can be used by Ruby:

$ ruby -e "require 'aws-sdk'; puts AWS::VERSION"
1.59.0

Win!

The purpose of this little exercise was to demonstrate one effective approach to resolving package dependencies for fpm. By using Bundler – the best tool for the job – we get fine control over all dependencies, including transitive ones (like Nokogiri, see Gemfile). Other languages provide similar dependency tools. We should make use of language specific tools whenever we can.

Build infrastructure

After learning how to build all packages that make up a piece of software, let’s consider how to integrate fpm into our build infrastructure. These days, with the rise of the DevOps movement, many teams have started to manage their own infrastructure. Even though each team is likely to have unique requirements, it still makes sense to share a company-wide build infrastructure, as opposed to reinventing the wheel each time someone wants to automate packaging.

Packaging is often only a small step in a longer series of build steps. In many cases, we first have to build the software itself. While fpm supports multiple source formats, it doesn’t know how to build the source code or determine dependencies required by the package. Again, that’s our job.

Creating a consistent build and release process for different projects across multiple teams is hard. Fortunately, there’s another tool that does most of the work for us: fpm-cookery. fpm-cookery sits on top of fpm and provides the missing pieces to create a reusable build infrastructure. Inspired by projects like Homebrew, fpm-cookery builds packages based on simple recipes written in Ruby.

Let’s turn our attention back to the AWS SDK. Remember how we initially converted the gem to a Debian package? As a warm up, let’s do the same with fpm-cookery. First, we have to create a recipe.rb file:

# recipe.rb
class AwsSdkGem < FPM::Cookery::RubyGemRecipe
  name    "aws-sdk"
  version "1.59.0"
end

Next, we pass the recipe to fpm-cook, the command-line tool that comes with fpm-cookery, and let it build the package for us:

$ fpm-cook package recipe.rb
===> Starting package creation for aws-sdk-1.59.0 (debian, deb)
===>
===> Verifying build_depends and depends with Puppet
===> All build_depends and depends packages installed
===> [FPM] Trying to download {"gem":"aws-sdk","version":"1.59.0"}
...
===> Created package: /home/vagrant/pkg/rubygem-aws-sdk_1.59.0_all.deb

To complete the exercise, we also need to write a recipe for each remaining gem dependency. This is what the final recipes look like:

# recipe.rb
class AwsSdkGem < FPM::Cookery::RubyGemRecipe
  name       "aws-sdk"
  version    "1.59.0"
  maintainer "Mathias Lafeldt <[email protected]>"

  chain_package true
  chain_recipes ["aws-sdk-v1", "json", "nokogiri"]
end

# aws-sdk-v1.rb
class AwsSdkV1Gem < FPM::Cookery::RubyGemRecipe
  name       "aws-sdk-v1"
  version    "1.59.0"
  maintainer "Mathias Lafeldt <[email protected]>"
end

# json.rb
class JsonGem < FPM::Cookery::RubyGemRecipe
  name       "json"
  version    "1.8.1"
  maintainer "Mathias Lafeldt <[email protected]>"
end

# nokogiri.rb
class NokogiriGem < FPM::Cookery::RubyGemRecipe
  name       "nokogiri"
  version    "1.5.11"
  maintainer "Mathias Lafeldt <[email protected]>"

  build_depends ["libxml2-dev", "libxslt1-dev"]
  depends       ["libxml2", "libxslt1.1"]
end

Running fpm-cook again will produce Debian packages that can be added to an APT repository and are ready for use in production.

Three things worth highlighting:

  • fpm-cookery is able to build multiple dependent packages in a row (configured by chain_* attributes), allowing us to build everything with a single invocation of fpm-cook.
  • We can use the attributes build_depends and depends to specify a package’s build and runtime dependencies. When running fpm-cook as root, the tool will automatically install missing dependencies for us.
  • I deliberately set the maintainer attribute in all recipes. It's important to take responsibility for the work that we do. We should make it as easy as possible for others to identify the person or team responsible for a package.

fpm-cookery provides many more attributes to configure all aspects of the build process. Among other things, it can download source code from GitHub before running custom build instructions (e.g. make install). The fpm-recipes repository is an excellent place to study some working examples. This final example, a recipe for chruby, is a foretaste of what fpm-cookery can actually do:

# recipe.rb
class Chruby < FPM::Cookery::Recipe
  description "Changes the current Ruby"

  name     "chruby"
  version  "0.3.8"
  homepage "https://github.com/postmodern/chruby"
  source   "https://github.com/postmodern/chruby/archive/v#{version}.tar.gz"
  sha256   "d980872cf2cd047bc9dba78c4b72684c046e246c0fca5ea6509cae7b1ada63be"

  maintainer "Jan Brauer <[email protected]>"

  section "development"

  config_files "/etc/profile.d/chruby.sh"

  def build
    # nothing to do here
  end

  def install
    make :install, "PREFIX" => prefix
    etc("profile.d").install workdir("chruby.sh")
  end
end

# chruby.sh
source /usr/share/chruby/chruby.sh

Wrapping up

fpm has changed the way we build packages. We can get even more out of fpm by using it in combination with other tools. Dedicated programs like Bundler can help us with resolving package dependencies, which is something fpm won’t do for us. fpm-cookery adds another missing piece: it allows us to describe our packages using simple recipes, which can be kept under version control, giving us the benefits of infrastructure as code: repeatability, automation, rollbacks, code reviews, etc.

Last but not least, it’s a good idea to pair fpm-cookery with Docker or Vagrant for fast, isolated package builds. This, however, is outside the scope of this article and left as an exercise for the reader.

December 1, 2014

Day 1 - Docker in Production: Reality, Not Hype

Written by: Bridget Kromhout (@bridgetkromhout)
Edited by: Christopher Webber (@cwebber)

Why Docker?

When I started talking with DramaFever in summer 2014 about joining their ops team, one of many appealing factors was that they’d already been running Docker in production since about October 2013 (well before it even went 1.0). Cutting (maybe bleeding) edge? Sounds fun!

But even before I joined and we were acquired by SoftBank (unrelated events! I am not an acquisition magnet, even if both startups I worked at in 2014 were acquired), DramaFever was already a successful startup, and important technology stack decisions are not made by running a Markov text generator against the front page of Hacker News.

So, why Docker? Simply put, it makes development more consistent and deployment more repeatable. Because developers are developing locally all using the same containers, integration is much easier when their code moves on to their EC2-based personal dev environment, the shared dev environment, QA, staging, and production. Because a production instance is serving code from a container, every new autoscaled instance that has any code at all is going to have the correct code.

As renowned infosec professional Taylor Swift says, “haters gonna hate”. And I’ve been guilty of “get off my lawn” snark about the recent hype, pointing out that containerization isn’t exactly new. We’ve had FreeBSD chroot jails and Solaris Zones (pour one out for Sun Microsystems) for ages. But the genius of Docker is that it provides just enough training wheels for LXC that everyone can use it (for rapidly increasing values of everyone).

Our Own Private Registry

We’re using a local copy of the registry backed by an S3 bucket accessible to those with developer IAM credentials. (If you don’t use AWS, that just means it uses a shared storage location that our devs can access without needing production keys.) Apparently people usually go with a centralized private registry; instead, we traded a SPOF for S3’s eventual consistency. We start the local registry on a host via upstart, and there are a few configuration items of interest:

# this goes in /etc/default/docker to control docker's upstart config
DOCKER_OPTS="--graph=/mnt/docker --insecure-registry=reg.example.com:5000"

Since on some instances we are pulling down multiple Docker images that can be hundreds of megabytes in size, and running or stopped containers also take up room on disk, we use --graph=/mnt/docker to set the root of the docker runtime to the ephemeral disk instead of to the default /var/lib/docker.

With Docker 1.3’s improved security requiring TLS, this means we need to allow our localhost-aliased non-TLS registry.

The docker registry upstart job (used on all EC2 instances) runs these commands:

docker pull public_registry_image
docker run -p 5000:5000 --name registry \
-v /etc/docker-reg:/registry-conf \
-e DOCKER_REGISTRY_CONFIG=/registry-conf/config.yml \
public_registry_image

That second command may require some explanation:

# this is the local port we'll run the registry on
docker run -p 5000:5000 \

# giving the container a name makes it easier to identify
--name registry \

# we're mounting in the directory holding a config file that specifies
# AWS credentials, etc. Unlike mount(1), this creates the directory it's
# mounting to.
-v /etc/docker-reg:/registry-conf \

# defining the config file location
-e DOCKER_REGISTRY_CONFIG=/registry-conf/config.yml \

# the publicly-registered image we're launching this local registry from
public_registry_image

To run locally, we pull the image and then run like this (with DFHOME being where we have the source code checked out):

docker run -d -p 5000:5000 --name docker-reg -v ${DFHOME}:${DFHOME} \
  -e DOCKER_REGISTRY_CONFIG=${DFHOME}/config/docker-registry/config.yml \
  public_registry_image

docker build; docker push

Weekly Jenkins jobs build a base container for the main django app and another that mimics our RDS environment with a containerized, all-data-fixtures-loaded MySQL.

We do trunk-based development with developers submitting pull requests. After being peer-reviewed and merged to master, the new code is available for Jenkins to poll GitHub and build. If all tests pass, then it’s time for exciting post-build action! (What? If you’ve gotten this far in a post about container strategy, then you probably agree with me that this stuff is exciting.)

While all our Go microservices are built essentially the same way, let’s focus on the main django app. Its Dockerfile starts from the weekly base build, as that speeds things up a bit:

FROM our-local-repo-alias:5000/www-base

We keep a number of Dockerfiles around, and in this case, since we have both a base build and a master build for www, we have multiple Dockerfiles in this github repository. Since it’s not possible to pass a file to docker build, it’s necessary to rename the file:

mv 'Dockerfile-www' Dockerfile; sudo docker build -t="67cd893" .

Jenkins builds the new layers for the www master image, tags it with the git SHA, then tests it. Only if it passes the tests do we retag it as dev and then push it to our private docker repository.

sudo docker push our-local-repo-alias:5000/www:'dev'

When we’re ready to cut a release, we build the www-QA job from the release branch. After testing, that same container is re-tagged for staging, then production, and new autoscaling instances will pick it up (giving us the flexibility to do blue/green deploys, which we’re just starting to explore).

Docker in a Mac-using Dev World

Before summer 2014, we were using Vagrant for local development. Building a new image with a local chef-solo provisioner took 17 minutes to install everything, and the local environment diverged enough from the production environment to be annoying. Moving all development into Docker containers proved very effective, especially as we worked through some of the inevitable gotchas and corner cases.

For the local developer environment, we’re using boot2docker, and we’re just about to move back to mainline from a fork that Tim Gross, our head of operations, wrote to get around VirtualBox shared folder mount issues present in previous versions of boot2docker.

One issue we’ve noticed using boot2docker on Mac laptops is that when they wake from sleep, the clock in the VM can be skewed. This means the local registry doesn’t work, since it relies on S3 and S3 expects a correct clock.

$ boot2docker ssh sudo date -u
Mon Nov 24 16:09:02 UTC 2014

$ date -u
Tue Nov 25 01:43:49 UTC 2014

$ docker pull our-local-repo-alias:5000/mysql
Pulling repository our-local-repo-alias:5000/mysql
2014/11/24 19:44:31 HTTP code: 500

Ry4an Brase, our head of back-end development, came up with this delightful incantation:

$ boot2docker ssh sudo date --set \"$(env TZ=UTC date '+%F %H:%M:%S')\"

Adding that to the utils sourced by all our various wrapper scripts (so that devs don’t need to remember a lot of docker syntax to go about their daily lives) seemed like a better alternative than having slackbot deliver it as a reply to all local registry questions.

Containerizing Front-End Dev

I created a container for front-end development which allows us to replicate a front-end environment on Jenkins, using angular, npm, grunt, and bower; you know, the sort of mysterious tools that are inordinately fond of $CWD and interactive prompts.

There are a number of Dockerfiles out there for this; here’s what I found helpful to know (for values of “know” that include “asking Ryan Provost, our head of front-end development” and “mashing buttons until it works”).

Although it defies all logic, node is already old enough to have legacy something. (Insert rant about you kids needing to get off my lawn with your skinny jeans and your fixies.)

RUN apt-get install -y nodejs nodejs-legacy npm

You need a global install of these three; they can’t come from your package.json:

RUN npm install -g [email protected]
RUN npm install -g [email protected]
RUN npm install -g [email protected]

And bower doesn’t want to be installed as root, and sometimes will ask questions that expect an interactive answer:

ADD bower.json /var/www/dependencies/bower.json
RUN cd /var/www/dependencies && bower install --allow-root --config.interactive=false --force

The nice thing about having this container is that it lets someone without all the right versions of the front-end tools installed try out those parts of the site locally. It also makes deployment more repeatable: the front-end build is genuinely the same in every environment, as opposed to the "it works on their laptop" fun we all know and love.

Getting the Logs Out

A certain Docker Orthodoxy treats a container as entirely apart from the host instance. Since we aren’t using containers for isolation, we approach this a little differently. Tim blogged about Docker logs when DramaFever first started using Docker.

On EC2 we want to ship logs to our ELK stack, so we mount a directory from the host instance into the container:

-v /var/log/containers:/var/log

On local developer machines we want to be able to use a container for active development, editing code locally and running it in the containerized environment. We use the -v flag to mount the developer’s checked-out code into the container, effectively replacing that directory as-shipped:

-v ${DFHOME}/www:/var/www

We still want logs, too, so we expose those for the dev here:

-v ${DFHOME}/www/run:/var/log
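
Put together, a dev wrapper ends up invoking something along these lines; the port mapping and the image tag are illustrative:

docker run -d \
  -v ${DFHOME}/www:/var/www \
  -v ${DFHOME}/www/run:/var/log \
  -p 8000:8000 \
  our-local-repo-alias:5000/www:dev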

Totally Weird Bugs for $1,000, Alex

On an instance where the docker runtime root disk filled up, the container images became corrupt. Even after a reboot, they started yielding inconsistent containers whose behavior would vary over time. For example, a running container (invoked with /bin/bash) would have the ls command, and then a few minutes later, it would not. Eventually, a docker run would lead to errors like these:

Error response from daemon: Unknown filesystem type on /dev/mapper/
docker-202:16-692241-81e4db1aaf5ea5ec70c2ef8542238e8877bbdb4b0
7b253f67b888e713a738dc2-init

Error response from daemon: mkdir /var/lib/docker/devicemapper/mnt/
9db80f229fdf9ebb75ed22d10443c90003741a6770f81db62 f86df881cfb12ae-init/
rootfs: input/output error

It’s likely that we’re seeing one of the devicemapper bugs that seem to plague docker. Replacing the local volume in question was a reasonable workaround.

About Those Race Conditions

A much more prevalent (and annoying) place we’ve run into docker race conditions is in the Jenkins builds. We’d been seeing builds fail with messages like this:

Removing intermediate container 4755dce8cfcc
Step 5 : ADD /example/file /example/file
2014/11/18 18:46:59 Error getting container init rootfs 
a226d3503180de091fde2a410e2b037fde94237dd2171d49a866d43ff03e724c from 
driver devicemapper: Error mounting '/dev/mapper/docker-9:127-14024705-
a226d3503180de091fde2a410e2b037fde94237dd2171d49a866d43ff03e724c-init'
on '/var/lib/docker/devicemapper/mnt/
a226d3503180de091fde2a410e2b037fde94237dd2171d49a866d43ff03e724c-
init': no such file or directory

I added the Naginator plugin so it would retry failed jobs if they’d failed with the most common strings we’d see:

(Cannot destroy container|Error getting container init rootfs)

While that’s an acceptable workaround, I still plan to change what gets reported to Slack, since it’s annoying to have to click on the broken build to find out if it’s just Docker again.

Cron Zombies

A few weeks ago, a developer noticed an unwelcome new message in interactive use on QA:

Error: Cannot start container appname: iptables failed: iptables -t 
nat -A DOCKER -p tcp -d 0/0 --dport 8500 ! -i docker0 -j DNAT --to-
destination 172.17.0.7:8500:  (fork/exec /sbin/iptables: cannot
allocate memory)

On specific instances (such as QA) that aren’t part of the production autoscaling groups, we run cron jobs that invoke a container and give it arguments. A look at the process table showed that multiple days of docker run commands started by cron and the python processes they’d spawned were still running. docker ps disagreed, though; the containers weren’t running anymore, so they weren’t getting cleaned up by these cron jobs:

# remove stopped containers
@daily docker rm `sudo docker ps -aq`
# remove images tagged "none" 
@daily docker rmi `sudo docker images | grep none | awk -F' +' '{print $3}'`

At the time, we were starting the cron containers with docker run -i -a stdout -a stderr.

Changing the container-invoking cron jobs to instead use docker run -it cleared it up. It appears a controlling tty was necessary for them to successfully signal their child processes to exit.
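
So a container-invoking cron entry now looks something like this; the schedule and the management command are placeholders:

0 * * * * docker run -it our-local-repo-alias:5000/www:dev python manage.py some_task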

Containerize All the Things?

We’re actually running just about everything in containers currently - including the more static bits of our infrastructure. Do Sentry, Jenkins, Graphite, and the ELK stack actually benefit from being in containers? Possibly not; at the time of containerizing all the things, it was the closest thing we had to a configuration management system.

But while it’s excellent for releasing software, it’s a giant hassle sifting through all the changes in a monolithic “config” repo to figure out why the graphite container no longer builds to a working state. Now that we’re using Chef and Packer to drive our AMI creation, we’ll likely move to using Chef cookbooks to manage our next iteration on those infrastructure pieces.

Just because it’s possible to run everything inside a container doesn’t mean it’s useful. While we are no longer using containers as our main method of capturing all configuration, we continue to see great value in using them for consistency throughout development and repeatability of deployments.

Not (Just) Hype

The core of devops is empathy, and it’s important to remember there are people behind all the software we’re discussing. Nathan LeClaire of Docker took to Twitter recently, talking about how it feels to have your project called “marketing BS”. (Let’s pause for a moment while I feel guilty about everything I’ve ever said about MongoDB. I don’t think I ever called it marketing BS, but I’ve definitely made “web scale” jokes.)

Given the recent announcements out of AWS re:Invent about EC2 Container Service, it’s safe to say that containers are about as mainstream as they’re going to get. Do I think ECS is going to be ready for prime time immediately? Anyone who read my sysadvent post from last year about HBase on EMR (Amazon’s training wheels for Hadoop) is saying “lolnope” right about now.

But containers are definitely not just for the Googles of the world anymore, and they’re increasingly no longer just for those of us who are willing to chase devicemapper bugs down a rabbit-hole into GitHub issue land. Docker is the real deal, it works in production, and if you’d like to go stream some dramas powered by it from our site or native apps, you can do that today. (If you’d like to read more about containers, stay tuned for an exciting post tomorrow…)

Sound Like Fun?

If this sounds like exactly the sort of fun you enjoy having at work, we’re hiring ops and dev folks at DramaFever. We’re remote-friendly with NYC and Philly offices. You can read more about our positions on the DramaFever careers page or contact me via email or on twitter. I’d love to talk with you!

December 19, 2013

Day 19 - Automating IAM Credentials with Ruby and Chef

Written by: Joshua Timberman (@jtimberman)
Edited by: Shaun Mouton (@sdmouton)

Chef, née Opscode, has long used Amazon Web Services. In fact, the original iteration of "Hosted Enterprise Chef," "The Opscode Platform," was deployed entirely in EC2. In the time since, AWS has introduced many excellent features and libraries to work with them, including Identity and Access Management (IAM) and the AWS SDK. Especially relevant to our interests is the Ruby SDK, which is available as the aws-sdk RubyGem. Additionally, the operations team at Nordstrom has released a gem for managing encrypted data bags called chef-vault. In this post, I will describe how we use the AWS IAM feature, how we automate it with the aws-sdk gem, and store secrets securely using chef-vault.

Definitions

First, here are a few definitions and references for readers.
  • Hosted Enterprise Chef - Enterprise Chef as a hosted service.
  • AWS IAM - management system for authentication/authorization to Amazon Web Services resources such as EC2, S3, and others.
  • AWS SDK for Ruby - RubyGem providing Ruby classes for AWS services.
  • Encrypted Data Bags - Feature of Chef Server and Enterprise Chef that allows users to encrypt data content with a shared secret.
  • Chef Vault - RubyGem to encrypt data bags using public keys of nodes on a chef server.

How We Use AWS and IAM

We have used AWS for a long time, before the IAM feature existed. Originally with The Opscode Platform, we used EC2 to run all the instances. While we have moved our production systems to a dedicated hosting environment, we do have non-production services in EC2. We also have some external monitoring systems in EC2. Hosted Enterprise Chef uses S3 to store cookbook content. Those with an account can see this with knife cookbook show COOKBOOK VERSION, and note the URL for the files. We also use S3 for storing the packages from our omnibus build tool. The omnitruck metadata API service exposes this.

All these AWS resources - EC2 instances, S3 buckets - are distributed across a few different AWS accounts. Before IAM, there was no way to have data segregation because the account credentials were shared across the entire account. For (hopefully obvious) security reasons, we need to have the customer content separate from our non-production EC2 instances. Similarly, we need to have the metadata about the omnibus packages separate from the packages themselves. In order to manage all these different accounts and their credentials which need to be automatically distributed to systems that need them, we use IAM users, encrypted data bags, and Chef.

Unfortunately, using various accounts adds complexity in managing all this, but through the tooling I'm about to describe, it is a lot easier to manage now than it was in the past. We use a fairly simple data file format of JSON data, and a Ruby script that uses the AWS SDK RubyGem. I'll describe the parts of the JSON file, and then the script.

IAM Permissions

IAM allows customers to create separate groups, which are containers of users that are granted permissions to different AWS resources. Customers can manage these through the AWS console, or through the API. The API uses JSON documents to manage the policy statement of permissions the user has to AWS resources. Here's an example:
{
  "Statement": [
    {
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::an-s3-bucket",
        "arn:aws:s3:::an-s3-bucket/*"
      ]
    }
  ]
}
Granted to an IAM user, this will allow that user to perform all S3 actions to the bucket an-s3-bucket and all the files it contains. Without the /*, only operations against the bucket itself would be allowed. To set read-only permissions, use only the List and Get actions:
"Action": [
  "s3:List*",
  "s3:Get*"
]
Since this is JSON data, we can easily parse and manipulate this through the API. I'll cover that shortly.

See the IAM policy documentation for more information.

Chef Vault

We use data bags to store secret credentials we want to configure through Chef recipes. In order to protect these secrets further, we encrypt the data bags, using chef-vault. As I have previously written about chef-vault in general, this section will describe what we're interested in from our automation perspective.

Chef vault itself is concerned with three things:
  1. The content to encrypt.
  2. The nodes that should have access (via a search query).
  3. The administrators (human users) who should have access.
"Access" means that those entities are allowed to decrypt the encrypted content. In the case of our IAM users, this is the AWS access key ID and the AWS secret access key, which will be the content to encrypt. The nodes will come from a search query to the Chef Server, which will be added as a field in the JSON document that will be used in a later section. Finally, the administrators will simply be the list of users from the Chef Server.

Data File Format

The script reads a JSON file, described here:
{
  "accounts": [
    "an-aws-account-name"
  ],
  "user": "secret-files",
  "group": "secret-files",
  "policy": {
    "Statement": [
      {
        "Action": "s3:*",
        "Effect": "Allow",
        "Resource": [
          "arn:aws:s3:::secret-files",
          "arn:aws:s3:::secret-files/*"
        ]
      }
    ]
  },
  "search_query": "role:secret-files-server"
}
This is an example of the JSON we use. The fields:
  • accounts: an array of AWS account names that have authentication credentials configured in ~/.aws/config - see my post about managing multiple AWS accounts
  • user: the IAM user to create.
  • group: the IAM group for the created user. We use a 1:1 user:group mapping.
  • policy: the IAM policy of permissions, with the action, the effect, and the AWS resources. See the IAM documentation for more information about this.
  • search_query: the Chef search query to perform to get the nodes that should have access to the resources. For example, this one will allow all nodes that have the Chef role secret-files-server in their expanded run list.
These JSON files can go anywhere; the script takes the file path as an argument.

Create IAM Script

Note This script is cleaned up to save space and get to the meat of it. I'm planning to make it into a knife plugin but haven't gotten a round tuit yet.
require 'inifile'
require 'aws-sdk'
require 'json'
filename = ARGV[0]
dirname  = File.dirname(filename)
aws_data = JSON.parse(IO.read(filename))
aws_data['accounts'].each do |account|
  aws_creds = {}
  aws_access_keys = {}
  # load the aws config for the specified account
  IniFile.load("#{ENV['HOME']}/.aws/config")[account].map{|k,v| aws_creds[k.gsub(/aws_/,'')]=v}
  iam = AWS::IAM.new(aws_creds)
  # Create the group
  group = iam.groups.create(aws_data['group'])
  # Load policy from the JSON file
  policy = AWS::IAM::Policy.from_json(aws_data['policy'].to_json)
  group.policies[aws_data['group']] = policy
  # Create the user
  user = iam.users.create(aws_data['user'])
  # Add the user to the group
  user.groups.add(group)
  # Create the access keys
  access_keys = user.access_keys.create
  aws_access_keys['aws_access_key_id'] = access_keys.credentials.fetch(:access_key_id)
  aws_access_keys['aws_secret_access_key'] = access_keys.credentials.fetch(:secret_access_key)
  # Create the JSON content to encrypt w/ Chef Vault
  vault_file = File.open("#{File.dirname(__FILE__)}/../data_bags/vault/#{account}_#{aws_data['user']}_unencrypted.json", 'w')
  vault_file.puts JSON.pretty_generate(
    {
      'id' => "#{account}_#{aws_data['user']}",
      'data' => aws_access_keys,
      'search_query' => aws_data['search_query']
    }
  )
  vault_file.close
  # This would be loaded directly with Chef Vault if this were a knife plugin...
  # Print a knife encrypt command to copy/paste (see "Using Knife Encrypt" below)
  puts <<-EOH
knife encrypt create vault #{account}_#{aws_data['user']} \\
  --search '#{aws_data['search_query']}' \\
  --mode client \\
  --json data_bags/vault/#{account}_#{aws_data['user']}_unencrypted.json \\
  --admins "`knife user list | paste -sd ',' -`"
  EOH
end
This is invoked with:
% ./create-iam.rb ./iam-json-data/filename.json
The script iterates over each of the AWS account credentials named in the accounts field of the JSON file given as an argument, and loads the credentials from the ~/.aws/config file. Then, it uses the aws-sdk Ruby library to authenticate a connection to the AWS IAM API endpoint. This instance object, iam, then uses methods to work with the API to create the group, user, policy, etc. The policy comes from the JSON document as described above. It will create user access keys, and it writes these, along with some other metadata for Chef Vault, to a new JSON file that will be loaded and encrypted with the knife encrypt plugin.

As described, it will display a command to copy/paste. This is technical debt, as it was easier than directly working with the Chef Vault API at the time :).

Using Knife Encrypt

After running the script, we have an unencrypted JSON file in the Chef repository's data_bags/vault directory, named for the account and user, e.g., data_bags/vault/an-aws-account-name_secret-files_unencrypted.json.
{
  "id": "secret-files",
  "data": {
    "aws_access_key_id": "the access key generated through the AWS API",
    "aws_secret_access_key": "the secret key generated through the AWS API"
  },
  "search_query": "roles:secret-files-server"
}
The knife encrypt command is from the plugin that Chef Vault provides. The output of the create-iam.rb script outputs how to use this:
% knife encrypt create vault an-aws-account-name_secret-files \
  --search 'roles:secret-files-server' \
  --mode client \
  --json data_bags/vault/an-aws-account-name_secret-files_unencrypted.json \
  --admins "`knife user list | paste -sd ',' -`"

Results

After running the create-iam.rb script with the example data file, and the unencrypted JSON output, we'll have the following:
  1. An IAM group in the AWS account named secret-files.
  2. An IAM user named secret-files added to the secret-files group.
  3. Permission for the secret-files user to perform any S3 operations on the secret-files bucket (and files it contains).
  4. A Chef Data Bag Item named an-aws-account-name_secret-files in the vault Bag, which will have encrypted contents.
  5. All nodes matching the search roles:secret-files-server will be present as clients in the item an-aws-account-name_secret-files_keys (in the vault bag).
  6. All users who exist on the Chef Server will be admins in the an-aws-account-name_secret-files_keys item.
To view AWS access key data, use the knife decrypt command.
% knife decrypt vault an-aws-account-name_secret-files data --mode client
vault/an-aws-account-name_secret-files
    data: {"aws_access_key_id"=>"the key", "aws_secret_access_key"=>"the secret key"}
The way knife decrypt works is that you give it the field of encrypted data to decrypt, which is why the unencrypted JSON had a field named data created - so we could use that to access any of the encrypted data we wanted. Similarly, we could use search_query instead of data to get the search query used, in case we wanted to update the access list of nodes.
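
For example, something along these lines returns just the stored search query:

% knife decrypt vault an-aws-account-name_secret-files search_query --mode client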

In a recipe, we use the chef-vault cookbook's chef_vault_item helper method to access the content:
require 'chef-vault'
aws = chef_vault_item('vault', 'an-aws-account-name_secret-files')['data']

Conclusion

I wrote this script to automate the creation of a few dozen IAM users across several AWS accounts. Unsurprisingly, it took longer to test the recipe code and access to AWS resources across the various Chef recipes than it took to write the script and run it.

Hopefully this is useful for those who are using AWS and Chef, and were wondering how to manage IAM users. Since this is "done" I may or may not get around to releasing a knife plugin.

December 18, 2013

Day 18 - Wide Columns, Shaggy Yaks: HBase on EMR

Written by: Bridget Kromhout (@bridgetkromhout)
Edited by: Shaun Mouton (@sdmouton)

My phone rang at 4am one day last spring. When I dug it out from under my pillow, I wasn't greeted by the automated PagerDuty voice, which makes sense; I wasn't on call. It was the lead developer at 8thBridge, the social commerce startup where I do operations, and he didn't sound happy. "The HBase cluster is gone," he said. "Amazon says the instances are terminated. Can you fix it?"

Spoiler alert: the answer to that question turned out to be yes. In the process (and while stabilizing our cluster), I learned an assortment of things that weren't entirely clear to me from the AWS docs. This SysAdvent offering is not a step-by-step how-to; it's a collection of observations, anecdotes, and breadcrumbs for future googling.

  1. When Automatic Termination Protection Isn't
  2. Master and Servants
  3. A Daemon in My View
  4. Watching Your Every Move
  5. Only Back Up the Data You Want to Keep
  6. A Penny Saved is a Penny You Can Throw at Your AWS Bill
  7. Coroner Cases


1. When Automatic Termination Protection Isn't

But wait, you say! "Amazon EMR launches HBase clusters with termination protection turned on." True, but only if your HBase intent is explicit at launch time.

You'll notice that Amazon Web Services calls the Elastic MapReduce Hadoop clusters "job flows". This term reveals a not-insignificant tenet of the AWS perception of your likely workflow: you are expected to spin up a job flow, load data, crunch data, send results elsewhere in a pipeline, and terminate the job flow. There is some mention of data warehousing in the docs, but the defaults are geared towards loading in data from external to your cluster (often S3).

Since AWS expects you to be launching and terminating clusters regularly, their config examples are either in the form of "bootstrap actions" (configuration options you can only pass to a cluster at start time; they run after instance launch but before daemons start) or "job flow steps" (commands you can run against your existing cluster while it is operational). The cluster lifecycle image in the AWS documentation makes this clear.

Because we don't launch clusters with the CLI but rather via the boto python interface, we start HBase as a bootstrap action, post-launch:

BootstrapAction("Install HBase", "s3://elasticmapreduce/bootstrap-actions/setup-hbase", None)

When AWS support says that clusters running HBase are automatically termination-protected, they mean "only if you launched them with the --hbase option or its gui equivalent".

There's also overlap in their terms. The options for long-running clusters show an example of setting "Auto-terminate" to No. This is the "Keep Alive" setting (--alive with the CLI) that prevents automatic cluster termination when a job ends successfully; it's not the same as Termination Protection, which prevents automatic cluster termination due to errors (human or machine). You'll want to set both if you're using HBase.
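
For reference, launching a long-running HBase cluster with the Ruby elastic-mapreduce CLI looks something like this sketch; the name, instance count, and instance type are placeholders, and if you don't launch with --hbase, remember to set Termination Protection yourself via the console or API:

elastic-mapreduce --create --alive --hbase \
  --name "hbase-persistent" \
  --num-instances 5 \
  --instance-type m1.xlarge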

In our case, the cluster hit a bug in the Amazon Machine Image and crashed, which then led to automatic termination. Lesson the first: you can prevent this from happening to you!


2. Master and Servants

Master group

For whatever size of cluster you launch, EMR will assign one node to be your Master node (the sole member of the Master group). The Master won't be running the map-reduce jobs; rather, it will be directing the work of the other nodes. You can choose a different-sized instance for your Master node, but we run the same size as we do for the others, since it actually does need plenty of memory and CPU. The Master node will also govern your HBase cluster.

This is a partial jps output on the Master node:

NameNode manages Hadoop's distributed filesystem (HDFS).

JobTracker allocates the map-reduce jobs to the slave nodes.

HRegionMaster manages HBase.

ZooKeeper is a coordination service used by HBase.

$ cat /mnt/var/lib/info/instance.json
{"instanceGroupId":"ig-EXAMPLE_MASTER_ID","isMaster":true,"isRunningNameNode":
true,"isRunningDataNode":false,"isRunningJobTracker":true,
"isRunningTaskTracker":false}
$

Core group

By default, after one node is added to the Master group, EMR will assign the rest of the nodes in your cluster to what it terms the Core group; these are slave nodes in standard Hadoop terms. The Master, Core, and Task nodes (more on those in a moment) will all need to talk to one another. A lot. On ports you won't anticipate. Put them in security groups you open completely to one another (though not, obviously, the world - that's what SSH tunnels are for).

This is a partial jps output on the Core nodes:

DataNode stores HDFS data.

TaskTracker runs map-reduce jobs.

HRegionServer runs HBase by hosting regions.

$ cat /mnt/var/lib/info/instance.json
{"instanceGroupId":"ig-EXAMPLE_CORE_ID","isMaster":false,"isRunningNameNode":
  false,"isRunningDataNode":true,"isRunningJobTracker":false,
  "isRunningTaskTracker":true}
$
If you start with a small cluster and then grow it, be warned that the AWS replication factor defaults to 1 for a cluster with 1-3 Core nodes, 2 for 4-9 Core nodes, and 3 for 10 or more Core nodes. The stock Hadoop default is a replication factor of 3, and you probably want to set that if running HBase on EMR. The file is hdfs-site.xml and here is the syntax:
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
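
You can also bake this in at launch with the configure-hadoop bootstrap action (more on bootstrap actions below); something along these lines should do it, though verify the key-value flag against the bootstrap action's own usage, since I'm writing it from memory:

elastic-mapreduce --create --alive --hbase \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "--hdfs-key-value,dfs.replication=3"
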
Every now and again, a Core node will become unreachable (like any EC2 instance can). These aren't EBS-backed instances; you can't stop and re-launch them. If they cannot be rebooted, you will need to terminate them and let them be automatically replaced. So, having each of your blocks replicated to more than one instance's local disk is wise. Also, there's a cluster ID file in HDFS called /hbase/hbase.id which HBase needs in order to work. You don't want to lose the instance with the only copy of that, or you'll need to restore it from backup.

If you are co-locating map-reduce jobs on the cluster where you also run HBase, you'll notice that AWS decreases the map and reduce slots available on Core nodes when you install HBase. For this use case, a Task group is very helpful; you can allow more mappers and reducers for that group.

Also, while you can grow a Core group, you can never shrink it. (Terminating instances leads to them being marked as dead and a replacement instance spawning.) If you want some temporary extra mapping and reducing power, you don't want to add Core nodes; you want to add a Task group.

Task group

Task nodes will only run TaskTracker, not any of the data-hosting processes. So, they'll help alleviate any mapper or reducer bottlenecks, but they won't help with resource starvation on your RegionServers.

A Task group can shrink and grow as needed. Setting a bid price at Task group creation is how to make the Task group use Spot Instances instead of on-demand instances. You cannot modify the bid price after Task group creation. Review the pricing history for your desired instance type in your desired region before you choose your bid price; you also cannot change instance type after Task group creation, and you can only have one Task group per cluster. If you intend an HBase cluster to persist, I do not recommend launching its Master or Core groups as Spot Instances; no matter how high your bid price, there's always a chance someone will outbid you and your cluster will be summarily terminated.
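
As a sketch, adding a Task group of Spot Instances to a running job flow, and resizing it later, looks roughly like this with the Ruby CLI; the instance type, counts, bid price, and instance group ID are placeholders, and the flag spellings are worth double-checking against your CLI version:

elastic-mapreduce --jobflow j-your-jobflow-ID \
  --add-instance-group task \
  --instance-type m1.xlarge \
  --instance-count 4 \
  --bid-price 0.20

elastic-mapreduce --jobflow j-your-jobflow-ID \
  --modify-instance-group ig-EXAMPLE_TASK_ID --instance-count 8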

If you'd like a new node to choose its configs based on if it's in the Task group or not, you can find that information here:
$ cat /mnt/var/lib/info/instance.json
{"instanceGroupId":"ig-EXAMPLE_TASK_ID","isMaster":false,"isRunningNameNode":
false,"isRunningDataNode":true,"isRunningJobTracker":false,
"isRunningTaskTracker":true}
$
With whatever you're using to configure new instances, you can tell the Task nodes to increase mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. These are set in mapred-site.xml, and AWS documents their recommended amounts for various instance types. Now your Task nodes won't have the job-running constraints of the Core nodes that are busy serving HBase.
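
A custom bootstrap action therefore needs its own check so these settings only land on Task nodes; here's a rough sketch along those lines:

#!/bin/bash
# Rough sketch: only bump the task slot counts if this node is in the Task group.
# The group ID and slot values are placeholders; see AWS's per-instance-type numbers.
TASK_GROUP_ID="ig-EXAMPLE_TASK_ID"
if grep -q "\"instanceGroupId\":\"${TASK_GROUP_ID}\"" /mnt/var/lib/info/instance.json; then
  sudo sed -i 's|</configuration>|<property><name>mapred.tasktracker.map.tasks.maximum</name><value>8</value></property><property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>3</value></property></configuration>|' /home/hadoop/conf/mapred-site.xml
fi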


3. A Daemon in My View

The EMR Way of changing config files differs from how one would approach stock Hadoop.

Use the AWS-provided bootstrap actions mentioned earlier to configure any EMR clusters:

s3://elasticmapreduce/bootstrap-actions/configure-hadoop

But HBase-specific configs need to be changed in their own files:

s3://elasticmapreduce/bootstrap-actions/configure-hbase

For any configuration setting, you'll need to specify which file you expect to find it in. This will vary greatly between Hadoop versions; check the defaults in your conf directory as a starting point. You can also specify your own bootstrap actions from a file on S3, like this:

s3://bucketname/identifying_info/bootstrap.sh

Bootstrap actions are performed on all newly-instantiated instances (including if you need to replace or add a Core instance), so you will need logic in your custom bootstrap that makes specific changes for only new Task instances if desired, as mentioned above.

As mentioned above, the AWS way to make changes after your cluster is running is something called "job flow steps".

While you may find it easier to edit the config files in ~hadoop/conf/ directly or to have your config management replace them, framing changes as bootstrap actions or job flow steps in your launch script is advisable if you want to capture what you did so you can replicate it when you relaunch your cluster.

Note that a config change made in a job flow step just logs in and updates your config files for you; it does not restart any relevant daemons. You'll need to determine which one(s) you need to restart, and do so with the appropriate init script.

The recommended way to restart a daemon is to use the init script to stop it (or kill it, if necessary) and then let service nanny (part of EMR's stock image) restart it.

The service nanny process is supposed to keep your cluster humming along smoothly, restarting any processes that may have died. Warning, though; if you're running your Core nodes out of memory, service nanny might also get the OOM hatchet. I ended up just dropping in a once-a-minute nanny-cam cron job so that if the Nagios process check found it wasn't running, it would get restarted:
#!/bin/bash

if ( [ -f /usr/lib/nagios/plugins/check_procs ] && [[ $(/usr/lib/nagios/plugins/check_procs -w 1:1 -C service-nanny) =~ "WARNING" ]]); then
  sudo /etc/init.d/service-nanny restart >> /home/hadoop/nanny-cam.log 2>&1
else
  exit 0
fi
Insert standard disclaimer about how, depending on versions, your bash syntax may vary.


4. Watching Your Every Move

Ganglia allows you to see how the memory use in your cluster looks and to visualize what might be going on with cluster problems. You can install Ganglia on EMR quite easily, as long as you decide before cluster launch and set the two required bootstrap actions.

Use this bootstrap action to install Ganglia:

s3://elasticmapreduce/bootstrap-actions/install-ganglia

Use this bootstrap action to configure Ganglia for HBase:

s3://elasticmapreduce/bootstrap-actions/configure-hbase-for-ganglia

If you're using Task nodes which come and go, while your job flow persists, your Ganglia console will be cluttered with the ghosts of Task nodes past, and you'll be muttering, "No, my cluster is not 80% down." This is easy to fix by restarting gmond and gmetad on your EMR Master (ideally via cron on a regular basis - I do so hourly):
#!/bin/bash

sudo kill $(ps aux | grep gmond | awk '{print $2}') > /dev/null 2>&1
sudo -u nobody gmond
sudo kill $(ps aux | grep gmetad | awk '{print $2}') > /dev/null 2>&1
You don't need to restart gmetad; that will happen automagically. (And yes, you can get fancier with awk; going for straightforward, here.)

As for Nagios or your preferred alerting mechanism, answering on a port isn't any indication of HBase actually working. Certainly if a RegionServer process dies and doesn't restart, you'll want to correct that, but the most useful check is hbase hbck. Here's a nagios plugin to run it. This will alert in most cases that would cause HBase not to function as desired. I also run the usual NRPE checks on the Master and Core nodes, though allowing for the much higher memory use and loads that typify EMR instances. I don't actually bother monitoring the Task group nodes, as they are typically short-lived and don't run any daemons but TaskTracker. When memory frequently spikes on a Core node, that's a good sign that region hotspotting is happening. (More on that later.)

Other than HBase being functional, you might also want to keep an eye on region balance. In our long-running clusters, the HBase balancer, which is supposed to distribute regions evenly across the RegionServers, turns itself off after a RegionServer restart. I check the HBase console page with Nagios and alert if any RegionServer has fewer than 59 regions. (Configure that according to your own expected region counts.)

check_http!-H your-emr-master -p 60010 -u "/master-status" -r "numberOfOnlineRegions\=[0-5][0-9]?\," --invert-regex

We're trying to keep around 70 regions per RegionServer, and if a RegionServer restarts, it often won't serve as many regions as it previously did. You can manually run the balancer from the HBase shell. The balance_switch command returns its previous status, while the balancer command returns its own success or failure.
$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.92.0, r0ab71deb2d842ba5d49a261513d28f862ea8ce60, Fri May 17 00:16:53 UTC 2013
hbase(main):001:0> balance_switch true
false
0 row(s) in 0.0410 seconds
hbase(main):002:0> balancer
true
0 row(s) in 0.0110 seconds
hbase(main):035:0> 
Regions aren't well-balanced per-table in HBase 0.92.x, but that is reportedly improved in 0.94.x. You can manually move regions around, if you need to get a specific job such as a backup to run; you can also manually split regions. (I'll elaborate on that in a bit.) Note that the automatic balancer won't run immediately after a manual region move.
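
From the HBase shell, a manual move or split looks something like this; the encoded region name and the server name (host,port,startcode) are placeholders:

hbase(main):003:0> move 'dace217f50fb37b69844a0df864999bc', 'ip-10-0-0-12.ec2.internal,60020,1386694444444'
hbase(main):004:0> split 'yourtablename,start_key,1379638609844.dace217f50fb37b69844a0df864999bc.'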


5. Only Back Up the Data You Want to Keep

The AWS backup tool for HBase uses a wrapper around Hadoop's DistCp to back your data up to S3. Here's how we use it as a job flow step against a running cluster, using the Ruby elastic-mapreduce CLI (as the new unified AWS cli doesn't yet offer EMR support):

$ elastic-mapreduce --jobflow j-your-jobflow-ID --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://your-S3-bucketname/backups/j-your-jobflow-ID --start-time 2013-12-11T11:00Z
Make sure that in a case of an incremental backup failure, you immediately run a full backup and then carry on with periodic incrementals after that. If you need to restore from these backups, a failed incremental will break your chain back to the most recent full backup, and the restore will fail. It's possible to get around this via manual edits to the Manifest the backup stores on S3, but you're better off avoiding that.

To identify successful backups, you'll see this line on your EMR Master in the file /mnt/var/log/hbase/hbase-hadoop-master-YOUR_EMR_MASTER.out:
13/12/11 13:34:35 INFO metrics.UpdateBackupMetrics: Changing /hbaseBackup/backupMetricsInfo  node data to: {"lastBackupFailed":false,"lastSuccessfulBackupStartTime":"2013-12-11T11:00:20.287Z","lastBackupStartTime":"2013-12-11T11:00:20.287Z","lastBackupFinishTime":"2013-12-11T13:34:35.498Z"}
Backups created with S3DistCp leave temporary files in /tmp on your HDFS; failed backups leave even larger temporary files. To be safe you need as much room to run backups as your cluster occupies (that is, don't allow HDFS to get over 50% full, or your backups will fail.) This isn't as burdensome as it sounds; given the specs of the available EMR instance options, long before you run out of disk, you'll lack enough memory for your jobs to run.

If a backup job hangs, it is likely to hang your HBase cluster. Backups can hang if RegionServers crash. If you need to kill the backup, it's running on your HBase cluster like any map-reduce job and can be killed like any job. It will look like this in your jobtracker:
S3DistCp: hdfs://your-emr-master:9000/hbase -> s3://your-S3-bucket/backups/j-your-jobflow/20131211T110020Z
HBase cluster replication is not supported on EMR images before you get to AWS's premium offerings. If you ever need to migrate your data to a new cluster, you will be wanting replication, because backing up to and then restoring from S3 is not fast (and we haven't even discussed the write lock that consistent backups would want). If you plan to keep using your old cluster until your new one is up and operational, you'll end up needing to use CopyTable or Export/Import.

I've found it's easy to run your Core instances out of memory and hang your cluster with CopyTable if you try to use it on large tables with many regions. I've gotten better results using a time-delimited Export starting from before your last backup started, and then Importing it to your new cluster. Also note that although the docs on Export don't make it explicit, it's implied in CopyTable's example that the time format desired is epoch time in milliseconds (UTC). Export also requires that. Export respects HBase's internal versioning, so it won't overwrite newer data.
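
A time-bounded Export and its matching Import look roughly like this; the table name, HDFS path, and epoch-millisecond timestamps are placeholders, and 2147483647 just means "all versions":

hbase org.apache.hadoop.hbase.mapreduce.Export yourtablename /export/yourtablename 2147483647 1386000000000 1386720000000
# copy the export to the new cluster (S3DistCp works for this), then on the new cluster:
hbase org.apache.hadoop.hbase.mapreduce.Import yourtablename /export/yourtablename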

After asking AWS Support over many months, I was delighted to see that Hadoop MRv2 and HBase 0.94.x became available at the end of October. We're on the previous offering of MRv1 with HBase 0.92.0, and with retail clients we aren't going to upgrade during prime shopping season, but I look forward to this coming January for reasons beyond great snowshoeing. For everything in this post, assume Hadoop 1.0.3 and HBase 0.92.0.


6. A Penny Saved Is a Penny You Can Throw at Your AWS Bill

Since we are using Python to talk to HBase, we use the lightweight Thrift interface. (For actual savings, you want Spot Instances.) Running the Thrift daemon on the EMR Master and then querying it from our applications led to your friend and mine, the OOM-killer, hitting Thrift more often than not. Running it on our Core nodes didn't work well either; they need all their spare memory for the RegionServer processes. (Conspiracy theory sidebar: Java is memory-gobbling, and Sun Microsystems (RIP) made hardware. I'm just saying.) I considered and rejected a dedicated Thrift server, since I didn't want to introduce a new single point of failure. It ended up working better installing a local Thrift daemon on select servers (via a Chef recipe applied to their roles). We also use MongoDB, and talking to local Thrift ended up working much like talking to local mongos.

There's a lot of info out there about building Thrift from source. That's entirely unnecessary, since a Thrift daemon is included with HBase. So, all you need is a JVM and HBase (not that you'll use most of it.) Install HBase from a tarball or via your preferred method. Configure hbase-site.xml so that it can find your EMR Master; this is all that file needs:
<configuration>
<property><name>hbase.zookeeper.quorum</name><value>your-EMR-master-IP</value></property>
</configuration>
Start the Thrift daemon with something along these lines from your preferred startup tool (with paths changed as appropriate for your environment):

env JAVA_HOME="/usr/lib/jvm/java-6-openjdk-amd64/jre/"
exec /opt/hbase/bin/hbase thrift start >> /var/log/thrift/thrift.log 2>&1
Now you can have your application talk to localhost:9090. You'll need to open arbitrary high ports from application servers running this local Thrift daemon to your EMR Master and Core nodes both.


7. Coroner Cases

You know, the corner cases that give you a heart attack and then you wake up in the morgue, only to actually wake up and realize you've been having a nightmare where you are trampled by yellow elephants... just me, then?

HBase HMaster process won't start


You know it's going to be a fun oncall when the HBase HMaster will not start, and logs this error:

NotServingRegionException: Region is not online: -ROOT-,,0

The AWS support team for EMR is very helpful, but of course none of them were awake when this happened. Enough googling eventually led me to the exact info I needed.

In this case, Zookeeper (which you will usually treat as a black box) is holding onto an incorrect instance IP for where the -ROOT- table is being hosted. Fortunately, this is easy to correct:
$ hbase zkcli
zookeeper_cli> rmr /hbase/root-region-server
Now you can restart HMaster if service-nanny hasn't beat you to it:
$ /etc/init.d/hbase-hmaster start

Instance Controller

If the instance controller stops running on your Master node, you can see strange side effects like an inability to launch new Task nodes or an inability to reschedule the time your backups run.

It's possible that you might need to edit /usr/bin/instance-controller and increase the amount of memory allocated to it in the -Xmx directive.

Another cause for failure is if the instance controller has too many logs it hasn't yet archived to S3, or if the disk with the logs fills up.

If the instance controller dies it can then go into a tailspin with the service-nanny attempting to respawn it forever. You may need to disable service-nanny, then stop and start the instance-controller with its init script before re-enabling service-nanny.
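
That dance looks roughly like this; the instance-controller init script name is my assumption, so check /etc/init.d/ on your Master:

sudo /etc/init.d/service-nanny stop
sudo /etc/init.d/instance-controller stop
sudo /etc/init.d/instance-controller start
sudo /etc/init.d/service-nanny start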

A Word On Hotspotting

Choose your keys in this key-value game carefully. If they are sequential, you're likely to end up with hotspotting. While you can certainly turn the balancer off and manually move your regions around between RegionServers, using a hashed key will save you a lot of hassle, if it's possible. (In some cases we key off organic characteristics in our data we can neither control nor predict, so it's not possible in our most hotspotting-prone tables.)

If you limit automatic splitting you might need to manually split a hot region before your backups will succeed. Your task logs will likely indicate which region is hot, though you can also check filesize changes in HDFS. The HBase console on your-emr-master:60010 has links to each table, and a section at the bottom of a table's page where you can do a split:

http://your-emr-master:60010/hbase-webapps/master/table.jsp?name=yourtablename

Optionally you can specify a "Region Key", but it took a bit to figure out which format it expects. (The "Region Start Key" isn't the same thing.) The format you want for a Region Key when doing a manual split is what is listed as "Name" on that HBase console page. It will have a format like this:
yourtablename,start_key,1379638609844.dace217f50fb37b69844a0df864999bc.
In this, 1379638609844 is an epoch timestamp in milliseconds and dace217f50fb37b69844a0df864999bc is the region ID.

hbase.regionserver.max.filesize

The Apache project has a page called "the important configurations". An alternate title might be "if you don't set these, best of luck to you, because you're going to need it". Plenty of detailed diagrams out there to explain "regions", but from a resource consumption standpoint, you minimally want to know this:
  • A table starts out in one region.
  • If the region grows to the hbase.regionserver.max.filesize, the region splits in two. Now your table has two regions. Rinse, repeat.
  • Each region takes a minimum amount of memory to serve.
  • If your RegionServer processes end up burdened by serving more regions than ideal, they stand a good chance of encountering the Out of Memory killer (especially while running backups or other HBase-intensive jobs).
  • RegionServers constantly restarting makes your HBase cluster unstable.
I chased a lot of ghosts (Juliet Pause, how I wanted it to be you!) before finally increasing hbase.regionserver.max.filesize. If running 0.92.x, it's not possible to use online merges to decrease the number of regions in a table. The best way I found to shrink our largest table's number of regions was to simply CopyTable it to a new name, then cut over production to write to the new name, then Export/Import changes. (Table renaming is supported starting in 0.94.x with the snapshot facility.)

The conventional wisdom says to limit the number of regions served by any given RegionServer to around 100. In my experience, while it's possible to serve three to four times more on m1.xlarges, you're liable to OOM your RegionServer processes every few days. This isn't great for cluster stability.

Closure on the Opening Anecdote

After that 4am phone call, I did get our HBase cluster and (nearly) all its data back. It was a monumental yak-shave spanning multiple days, but in short, here's how: I started a new cluster that was a restore from the last unbroken spot in the incremental chain. On that cluster, I restored the data from any other valid incremental backups on top of the original restore. Splitlog files in the partial backups on S3 turned out to be unhelpful here; editing the Manifest so it wasn't looking for them would have saved hassle. And for the backups that failed and couldn't be used for a restore with the AWS tools, I asked one of our developers with deep Java skills to parse out the desired data from each partial backup's files on S3 and write it to TSV for me. I then used Import to read those rows into any tables missing them. And we were back in business, map-reducing for fun and profit!

Now that you know everything I didn't, you can prevent a cascading failure of flaky RegionServers leading to failed backups on a cluster unprotected against termination. If you run HBase on EMR after reading all this, you may have new and exciting things go pear-shaped. Please blog about them, ideally somewhere that they will show up in my search results. :)

Lessons learned? Launch your HBase clusters with both Keep Alive and Termination Protection. Config early and often. Use Spot Instances (but only for Task nodes). Monitor and alert for great justice. Make sure your HBase backups are succeeding. If they aren't, take a close look at your RegionServers. Don't allow too much splitting.

And most important of all, have a happy and healthy holiday season while you're mapping, reducing, and enjoying wide columns in your data store!