AWS Operational Checklists
AWS Operational Checklists
Steve Morad
Peter Dalbhanjan
June 2013
Page 1 of 15
Amazon Web Services – Operational Checklists for AWS June 2013
Table of Contents
Table of Contents .................................................................................................................................................................... 2
Abstract ................................................................................................................................................................................... 3
Introduction ............................................................................................................................................................................ 3
Conclusion ............................................................................................................................................................................. 15
Page 2 of 15
Amazon Web Services – Operational Checklists for AWS June 2013
Abstract
Deploying an application on Amazon Web Services (AWS) is fast, easy, and cost-effective. Before deploying a cloud
application in production, it is useful to have a checklist to assist in evaluating your application against a list of essential
and recommended best practices. This paper highlights useful operational and architectural considerations that you
should consider as you deploy your applications on AWS.
Introduction
Amazon Web Services is a flexible, cost-effective, and easy-to-use cloud computing platform. AWS provides a suite of
infrastructure services that you can use to deploy your applications. To get the maximum benefit out of the cloud
platform, we recommend that you leverage AWS’ complimentary services and follow the best practices. Organizations
that invest time and resources assessing the operational readiness of their applications before launch have a much
higher rate of satisfaction than those who don’t. When performing this work, checklists can be an invaluable mechanism
to ensure that applications are evaluated consistently and holistically.
The level of operational assessment will vary depending on the organization’s cloud maturity level and the application’s
development phase, availability needs, and data sensitivity requirements. This paper provides two checklists to support
these varying assessment needs:
Basic Operations Checklist - covers common high-level technical questions that organizations should consider as
they adopt different AWS services and are planning for a launch.
Enterprise Operations Checklist - provides a more in-depth operational review of suggested best practices that
an enterprise should consider when developing a mature cloud strategy.
This paper is targeted at developers and architects who are looking for operational and architectural guidance from AWS
to help assess their application’s operational readiness. These individuals typically support enterprise organizations
developing formal cloud strategies or performing formal technology reviews. However, it could also be useful to any
organization for comparing its planned use of AWS against these essential and recommended best practices.
In addition, AWS provides an Auditing Security Checklist whitepaper which provides a high-level list of considerations for
auditing how customers use AWS. Security, risk, and compliance teams can use to design and execute a security
assessment of an organization’s information systems and data as they are deployed on AWS. The following diagram
depicts how these resources can be used together by various groups with in their organization:
Page 3 of 15
Amazon Web Services – Operational Checklists for AWS June 2013
Enterprise Operations Checklist - This checklist is intended to help enterprises think through various operational
considerations as they deploy sophisticated enterprise applications on AWS. It can also be used to help you build a cloud
migration and operation strategy for your organization. This section is also further divided into two parts. The first part
provides a high-level checklist with brief descriptions for each operational consideration. The second part provides
more detail about each checklist item, as well as links to additional information.
Enterprise Operations Checklist To assist enterprises identify key items to Enterprise architects
think about as they build a cloud
migration and operational strategy
Auditing Security Checklist To assist customers when they evaluate Auditors or risk and
the security controls required by their compliance professional
specific industry or governing body like
the AICPA, NIST, ISO, PCI SSC, etc.
Page 4 of 15
Amazon Web Services – Operational Checklists for AWS June 2013
We understand which of our instances is Amazon Elastic Block Store (Amazon EBS)-backed versus instance store-backed,
have intentionally chosen the most appropriate type of storage, and understand the implications to data persistence,
backup and recovery. Learn more…
We understand AWS dynamic IP addressing and have ensured that our application will function when application
rd
components are restarted (e.g., using 3 -party or Elastic Load Balancing, Amazon Virtual Private Cloud (Amazon VPC) static
address assignments, elastic IP addresses, or dynamic DNS). Learn more…
We use separate Amazon EBS volumes for the operating system and application/database data where appropriate. Learn
more …
We regularly back up our Amazon Elastic Compute Cloud (Amazon EC2) instances using Amazon EBS snapshots or another
rd
3 -party backup tool. Learn more …
We regularly test our process of recovering our Amazon EC2 instances or Amazon EBS volumes when they fail, either
through customized “golden” Amazon Machine Images (AMIs), Amazon EBS snapshots, bootstrapping, or using our own
backup and recovery tools. Learn more…
We have deployed critical components of our applications across multiple availability zones, are appropriately replicating
data between zones, and have tested how failure within these components affects application availability. Learn more…
We understand how failover will occur across application components deployed in multiple availability zones and are using
rd
3 -party or Elastic Load Balancing and elastic IP addresses where appropriate. Learn more…
We regularly test our process for patching, updating, and securing our Amazon EC2 operating system, applications, and
customized AMIs. Learn more…
We use appropriate operating system user account access credentials and are not sharing the AWS instance key pair
private key with all systems administrators. Learn more…
We have implemented secure Security Group rules and nested Security Groups to create a hierarchical network topology
where appropriate. Learn more…
We use “CNAME” records to map our DNS name to our Elastic Load Balancing or Amazon Simple Storage Service
(Amazon S3) buckets and NOT “A” records.
Before sharing our customized Amazon Machine Images with others, we removed all confidential or sensitive information
including embedded public/private instance key pairs and reviewed all SSH authorized_keys files. Learn more…
We have fully tested our AWS-hosted application, including performance testing, prior to going live. Learn more…
We have signed our production AWS accounts up for business or enterprise support and have a plan for incorporating AWS
Trusted Advisor reports into our ongoing operational reviews. Learn more…
Page 5 of 15
Amazon Web Services – Operational Checklists for AWS June 2013
Security & Access Has your organization developed a strategy for managing AWS API, console,
Management operating system, network, and data access?
Asset Management Does your organization have a strategy for identifying and tracking AWS
provisioned resources?
Application HA/Resilience Does the implemented AWS solution meet or exceed the application’s high
availability and resilience requirements?
Application DR/Backup Does the implemented AWS solution meet or exceed the application’s disaster
recovery (DR) and backup requirements?
Monitoring & Incident Has your organization instrumented appropriate monitoring tools and integrated
Management your AWS resources into its incident management processes?
Configuration & Change Does your organization have a configuration and change management strategy
Management for its AWS resources?
Release & Deployment Has your organization determined how it will integrate application releases and
Management deployments with its configuration and change management strategy?
Page 6 of 15
Amazon Web Services – Operational Checklists for AWS June 2013
Page 7 of 15
Amazon Web Services – Operational Checklists for AWS June 2013
Security and access management is extremely important to AWS and our customers. An organization should review and
incorporate the following resources into their access management strategy:
AWS Security Center AWS Compliance
Overview of Security Processes Whitepaper Risk and Compliance Whitepaper
Security Best Practices Auditing Security Checklist
Page 8 of 15
Amazon Web Services – Operational Checklists for AWS June 2013
Identity Federation
Organizations that use directory services such as Lightweight Directory Access Protocol (LDAP) or Microsoft’s Active
Directory services to manage access controls can use Identity Federation in order to bring similar controls and
governance to access AWS APIs and resources. Identity Federation mitigates the necessity to create IAM users for
each identity in your enterprise. This method enables several use cases such as Single Sign-On (SSO) to AWS
Management Console, Granting access to AWS resources using your company’s own authentication system,
Granting a Mobile application to access AWS resources etc.
Network Access
Connectivity to and from an organization’s AWS and corporate environments must be well understood and may
leverage the Internet, hardware or software virtual private networks, or direct connections. The AWS Amazon VPC
Connectivity Options whitepaper discusses several of these options in detail. AWS recommends having at least two
connections for HA and fault tolerance. Customers can choose a combination of connectivity options in order to
provide redundant failover mechanisms when appropriate.
Additionally, an organization should have a strategy for managing Amazon EC2 Security Groups, Amazon VPC
network routing or access control lists, and host-based firewalls/intrusion detection systems if applicable. A
comprehensive network access strategy will likely incorporate the following components:
Use of Amazon EC2 Classic and Amazon Virtual Private Cloud (VPC) environments
Use of network connectivity and controls between AWS and corporate networks
Use of transport encryption protocols
Use of operating system access controls including Amazon EC2 Security Group rules, VPC network access
control lists, OS hardening, security patches, host-based firewall, intrusion detection/prevention, and
monitoring software configuration.
Centralized log management
Security event management, correlation and reporting
Additional information related to network access:
Amazon EC2 Network and Security
Amazon Virtual Private Cloud
Amazon VPC Security Groups
Amazon Elastic Network Interfaces
Elastic Load Balancing Security Features
AWS Direct Connect
AWS Security Best Practices
Page 9 of 15
Amazon Web Services – Operational Checklists for AWS June 2013
Data Access
An organization must determine how it will protect access to its data. AWS provides physical barriers to protect
physical access to underlying storage media as well as a number of logical access controls that can be used by your
organization to protect its data. Additionally, organizations should consider what level, if any, additional controls
are required. Potential data access controls include the following:
AWS logical access control lists or access policies
Disk, file, or database-level encryption for data-at-rest
Application or data-level access control, intrusion or leakage detection software
Data lifecycle or records management tools
Additional information related to data access:
AWS CloudHSM
Amazon S3 Access Control
Amazon S3 Data Encryption
Amazon RDS Transparent Data Encryption
Amazon EBS Snapshot Permissions
AWS Security Best Practices
Page 10 of 15
Amazon Web Services – Operational Checklists for AWS June 2013
Asset Management
Does your organization have a strategy for tracking and managing its AWS deployed assets? An effective strategy would
include whether or not an internal asset management system will be integrated with AWS and how AWS provided asset
management capabilities will be leveraged. At a minimum, an organization’s asset management strategy should be able
to answer the following questions:
Is your organization leveraging AWS provided instance and service specific metadata as part of its asset
management strategy?
AWS provides out-of-the-box metadata for each of its services to help your organization identify, track,
and manage your AWS resources. Customers can leverage this metadata to track Amazon EC2 instances
or storage by server image (AMI), operating system, compute architecture (32-bit or 64-bit), volume id,
snapshot, attached storage, and many other categories.
Is your organization leveraging custom resource tags to track and identify AWS resources?
In addition to the out-of-the-box metadata, AWS allows customers to apply their own custom tags.
Resources could be tagged by support team, application version, cost center, environment type, lifecycle
status or any other category that will help your organization more effectively manage its AWS resource
assets.
Does your organization have a resource tagging strategy?
Although AWS supports ad hoc resource tagging, an organization will get the most benefit from tagging
if they strategically plan for the intentional and systematic use of resource tags.
How will AWS assets be integrated with internal asset management systems?
AWS resources can be programmatically or manually queried to easily pull service and resource
metadata into existing asset management systems and processes.
Additional information related to asset management:
Amazon EC2 Instance Metadata
Using Amazon EC2 Tags
Using AWS CloudFormation or Tagging Auto Scaling Groups to automatically tag resources
Amazon S3 Object Key and Metadata
Page 11 of 15
Amazon Web Services – Operational Checklists for AWS June 2013
Application HA/Resilience
Every application has specific High Availability (HA) requirements and characteristics. AWS provides a number of
infrastructure building blocks to help your organization meet these requirements cost effectively. At a high level, an
effective HA strategy would include instance redundancy, use of multiple availability zones within a region, load
balancing, auto scaling, monitoring, and recovery within a region. Critical applications should ensure that all single
points of failure are identified and evaluated based on the application’s availability requirements and risk profile. An
effective HA strategy will include not only how a single component will recover (e.g. will a failed instance be
automatically or manually re-launched in the same or different availability zone?), but also what HA testing will be
performed to ensure that the application can recover as expected. The following are Amazon Web Services that your
organization can consider leveraging for high availability:
Running multiple Amazon EC2 instances in multiple Availability Zones
Elastic Load Balancing for load balancing across multiple Availability Zones
Auto Scaling for automated instance recovery or scaling
Amazon CloudWatch Metrics (custom metrics as well as out-of-the-box) and Alarms
Multi-AZ Amazon RDS for multiple Availability Zone managed databases
Elastic IP Addresses for static IP addresses that can be remapped between instances
Amazon EBS Snapshots for point-in-time snapshots of Amazon EBS volumes
Storing objects in Amazon S3 and/or using Amazon CloudFront for content distribution
Storing key/value pairs in Amazon SimpleDB or Amazon DynamoDB
AWS Elastic Beanstalk for environment and application version management
AWS OpsWorks for a DevOps solution for managing applications of any scale or complexity
Leverage synchronous data replication technologies like database mirroring
Reserving HA capacity through Amazon EC2 Reserved Instances
Additional information related to designing highly available applications in the cloud includes the following:
Designing Fault-Tolerant Applications in the Cloud
Architecting for the Cloud: Best Practices
RDBMS in the Cloud: Microsoft SQL Server 2008 R2
MongoDB on AWS
Storage Options in the Cloud
Cloud Architectures
Page 12 of 15
Amazon Web Services – Operational Checklists for AWS June 2013
Application DR/Backup
Every application has specific Disaster Recovery (DR) requirements that should be tied to recovery point and recovery
time objectives as well as any geographic requirements that restrict the physical proximity between primary and disaster
recovery sites. An effective DR strategy will include not only how a single component will recover (e.g. will a failed
region be automatically or manually re-launched in another region using cold, warm, or hot stand-bys?), but also what
DR testing will be performed to ensure that the application can be recovered as expected. At a high level, an effective
DR strategy would include regional redundancy, global traffic management or load balancing, monitoring, and region-to-
region recovery. The following are Amazon Web Services and techniques that your organization can consider as part of
a DR strategy:
Store data, Amazon Machine Images, or run additional instances in multiple AWS regions
Use Amazon Route 53 for DNS-based regional fail-over
Use Amazon Glacier for archive data
Leverage asynchronous data replication technologies like database log shipping
Reserve DR capacity in another region through Amazon EC2 Reserved Instances
Use EBS Snapshot copy and/or Amazon Machine Image (AMI) copy across regions
Leverage Amazon S3’s versioning to provide even further protection for your stored data
Take periodic Amazon EBS Snapshots or third party tools for quick recovery from data loss
Leverage S3 object lifecycle policies to archive data to Amazon Glacier
Additional information related to designing cloud applications for DR includes the following:
Designing Fault-Tolerant Applications in the Cloud
RDBMS in the Cloud: Microsoft SQL Server 2008 R2
Using AWS for Disaster Recovery
Page 13 of 15
Amazon Web Services – Operational Checklists for AWS June 2013
Application and operating system monitoring is essential for any organization to ensure that it can effectively detect and
respond to incidents. At a high level, an effective monitoring strategy will ensure that monitoring tools are
instrumented at the appropriate level for an application based on its business criticality. An effective incident
management strategy will incorporate both AWS and customer monitoring data with its event and incident
management tools and processes. An organization should consider the following tools as part of this strategy:
Amazon CloudWatch out-of-the-box metrics for AWS monitoring, alerting, and automated provisioning
Amazon CloudWatch custom metrics for Application monitoring, alerting, and automated provisioning
Amazon EC2 instance health for viewing status checks and scheduled events for your instances
Amazon SNS for setting up, operating, and sending notifications from the cloud
Operating system monitoring tools for OS-level monitoring
Application monitoring tools for Application-level monitoring
Simulated transaction monitoring tools for end-to-end system monitoring
Additional information related to monitoring and incident management includes the following:
Amazon EC2 Troubleshooting
Monitoring Amazon EC2 instances
AWS Management Pack for Microsoft System Center
Configuration & Change Management
Has your organization determined how it will integrate cloud configuration and change management into its IT
operational processes? Intentional, controlled change management is essential whether an organization has mature IT
Service Management processes, fully implemented Information Technology Infrastructure Library (ITIL), or has yet to
create a change management database (CMDB). An effective cloud configuration and change management practice
would include cloud concepts like the following:
How will your organization manage server images (AMIs)?
Server images must be periodically updated with patches and software updates. AWS provides a
number of tools that can be incorporated in your organization’s image management processes to assist
in the creation and management of AWS images.
Will instances be automatically configured at launch or manually configured later?
Automating instance configuration on boot, by passing user-data to the instance on boot or embedding
change and configuration management agents in a server image, allows instances and applications to
take advantage of instance meta-data, cloud automation, scaling, and high-availability capabilities.
How will OS credentials be instrumented and controlled when instances are launched or terminated?
Typically, organizations preconfigure their server images to automatically connect and register with
corporate LDAP or Active Directory domains when they are launched to provide centralized OS
credentials management and control.
How will patches and upgrades be applied?
Organizations take different patch and upgrade management approaches depending on their
application’s characteristics and requirements. Updates can be applied to existing instances using
traditional software deployment tools or by replacing outdated software running on older instances
with newer, patched, and upgraded server images.
Will applications be managed as homogeneous fleets?
Managing applications as homogeneous fleets allows infrastructure to be dynamically and automatically
provisioned or released based on predictable utilization patterns.
How will your organization manage changes to OS hardening baselines, configure security groups or OS firewalls,
and monitor their instances for intrusions or unauthorized changes?
Most organizations already have existing internal IT change and configuration management processes
and tools that can incorporate AWS related changes with minimal modification.
Page 14 of 15
Amazon Web Services – Operational Checklists for AWS June 2013
How will your organization integrate application releases and deployments with its change and configuration
management strategy? In the cloud, the traditional lines between infrastructure changes and application deployments
can be blurred, if not completely erased. In addition to traditional code development, testing, and versioning concepts,
organizations should also consider the following cloud integration points for application releases or deployments:
What software release and deployment process or methodology will your organization leverage?
Organizations have full control over their software release and deployment processes. Some
organizations utilize traditional release and deployment processes that deploy approved releases from a
controlled software repository to existing servers. Other organizations bundle, promote, and release
complete software stacks incorporating applications and server images combined throughout the
development lifecycle.
Will newer versions of an application be phased-in to existing server farms and older versions phased-out?
AWS provides organizations with the opportunity to implement new, shorter maintenance window
deployment models by quickly and cheaply spinning up new application versions to gradually replace
older instances over time.
Will weighted load distribution patterns be used to intentionally deploy, test, migrate, and roll-out or roll-back
new application releases?
Just as AWS enables organizations to take advantage of new deployment models, these same models
can also be used as part of testing, migration, or roll-back processes to more quickly and seamlessly
support release and deployment processes.
How can your organization leverage infrastructure boot-strapping and application deployment tools to more
quickly and effectively introduce or roll-back changes?
Releasing and deploying applications on AWS provides organizations with an opportunity to reevaluate
its existing processes to determine where they can improve efficiencies through cloud-friendly change,
configuration, release, and deployment automation.
How can your organization make its applications more infrastructure-aware so applications can become active
participants in making the infrastructure changes required to support a specific software release or
deployment?
Traditional applications are dependent on coordination with completely independent infrastructure
management teams and change control processes. Often, a deployment weekend consists of separate,
coordinated infrastructure changes. After deployment, the infrastructure will ideally remain static until
the next change weekend. With AWS, applications now have the option of initiating and automating
infrastructure changes either during scheduled deployments or automatically in response to changing
user demands on the application. When releasing and deploying applications on AWS, organizations
should at least consider how actively the application should be able to participate in the process of
ensuring the infrastructure is deployed and configured to best support its business functions.
Conclusion
Many organizations have successfully deployed and operated their cloud applications on AWS. The checklists provided
in this whitepaper highlights several best practices that we believe are essential and will help you to increase the
likelihood of successful deployments and frustration-free operations. We highly recommend these operational and
strategic considerations for your existing and new application deployments on AWS.
Page 15 of 15