AWS Certified Data Engineer
Introduction
✓ Passing score: 720 / 1000 — Goal: Achieve a score of 850+
Master the Exam
Free Trial Account
• Not needed for the exam
• Helps with memorizing
• Gives you practical knowledge
A data engineer needs to create an ETL process that automatically extracts data from an Amazon S3 bucket,
transforms it, and loads it into Amazon Redshift. Which
AWS service is the EASIEST way to achieve this?
❑ AWS Lambda
✓ AWS Glue
❑ AWS Step Functions
❑ Amazon EMR
Recipe to clear the exam
Step-by-step incl. Demos
Lectures ~ 30-60 min / day
Book Exam
Confident & Prepared
Final Tips
Resources
Q&A Section
Reviews
S3 One Zone-IA
Same as Standard-IA, but stored in a single AZ for cost savings
Durability: 99.999999999% · Availability: 99.50%
S3 Glacier
Long-term archiving with retrieval times ranging from minutes to hours
Durability: 99.999999999% · Availability: 99.99% (after retrieval)
Lifecycle Rules
Define change of storage classes over time
Versioning
▪ Fully-managed ETL service
▪ Visual interface:
Easily create ETL jobs without code
▪ Various integrations:
Amazon S3, Amazon Redshift, Amazon RDS,
Kinesis Data Streams, DocumentDB etc.
AWS Glue
▪ Script auto-generated behind the scenes
▪ Serverless:
AWS Glue takes care of the underlying
infrastructure
▪ Pay-as-you-go:
You only pay for what you use
AWS Glue Data Catalog
▪ Centralized Data Catalog:
Stores table schemas and metadata
⇒ allows querying by:
Amazon Athena, Redshift, QuickSight, EMR
▪ Glue Crawlers:
scan data sources,
infer the schema,
and store it in the AWS Glue Data Catalog
⇒ Can automatically classify data
▪ Scheduling:
Run on a schedule or based on triggers
+ incremental loads/crawling (ETL jobs & Crawlers)
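A minimal boto3 sketch of creating and starting such a crawler; the crawler name, IAM role, database, schedule, and S3 path are hypothetical placeholders.

import boto3

glue = boto3.client("glue")

# Register a crawler that scans an S3 prefix, infers the schema,
# and stores the resulting table metadata in the Glue Data Catalog.
glue.create_crawler(
    Name="sales-raw-crawler",                                   # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",      # placeholder role
    DatabaseName="sales_db",                                    # target Data Catalog database
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/sales/"}]},
    Schedule="cron(0 2 * * ? *)",                               # optional: run daily at 02:00 UTC
)

# Run it on demand instead of waiting for the schedule.
glue.start_crawler(Name="sales-raw-crawler")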
Section 3:
Querying with Athena
▪ Serverless:
No infrastructure to manage
Pay-as-you-go
▪ Ad-Hoc Analysis:
Ad-hoc queries on data lakes stored in S3
▪ Real-Time Analytics:
Integrating Athena with streaming data sources
such as Amazon Kinesis
Athena
Federated Queries
▪ Control…
• Query execution settings
• Access
• Cost
• Type of engine (Athena SQL vs. Apache Spark)
▪ What does it do?
Reuses previous results that match your query and maximum age
▪ Benefits?
Improves query performance and reduces cost
Athena – Query Result Reuse
▪ When to use?
o Queries where the source data doesn't change frequently
o Repeated queries
o Large datasets with complex queries
Athena - Performance Optimization
▪ Data Compression:
Reduce file size to speed up queries
▪ Format Conversion:
transform data into an optimized structure such as Apache
Parquet or Apache ORC columnar formats
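One common way to do this conversion is an Athena CTAS query. A minimal boto3 sketch, with hypothetical database, table, and bucket names:

import boto3

athena = boto3.client("athena")

# CTAS: rewrite a CSV-backed table as Parquet in a new S3 location.
ctas_sql = """
CREATE TABLE sales_db.events_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://example-bucket/curated/events_parquet/'
) AS
SELECT * FROM sales_db.events_csv;
"""

athena.start_query_execution(
    QueryString=ctas_sql,
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)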
Section 4:
AWS Glue Deep Dive
▪ Data Catalog:
o Up to a million objects for free
o $1.00 per 100,000 objects over a million, per month
Glue Costs
▪ ETL jobs:
▪ Hourly rate based on the number of DPUs used
▪ Billed by the second, with a 10-minute minimum
▪ AWS Glue versions 2.0 and later have a 1-minute minimum
▪ Cost of DPUs
o $0.44 per DPU-Hour (may differ and depend on region)
Glue Costs
▪ Glue Job Notebooks / Interactive Sessions:
▪ Used to interactively develop ETL code in notebooks
▪ Based on time session is active and number of DPUs
▪ Configurable idle timeouts
▪ 1-minute minimum billing
▪ Minimum of 2 DPUs – Default: 5 DPUs
Glue Costs
▪ ETL jobs cost example:
▪ Apache Spark job
▪ Runs for 15 minutes
▪ Uses 6 DPU
▪ 1 DPU-Hour is $0.44
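Working through the numbers: 6 DPUs × (15 / 60) hours × $0.44 per DPU-hour = $0.66 for the run (on Glue 2.0+ the 1-minute minimum is easily exceeded, so the full 15 minutes is billed).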
▪ Budget Types:
o Cost budget
o Usage budget
o Savings Plans budget
o Reservation budget
▪ Stateful:
Systems remember past interactions for influencing future ones.
▪ Stateless:
Systems process each request independently without relying on
past interactions.
Stateful vs. Stateless
Data Ingestion Context:
▪ Stateful:
Maintain context for each data ingestion event.
▪ Stateless:
Process each data ingestion event independently.
Stateful vs. Stateless
Data Ingestion in AWS:
● Amazon Kinesis:
Supports both stateful (Data Streams) and stateless (Data
Firehose) data processing.
● AWS Glue:
Offers stateful or stateless ETL jobs with features like job
bookmarks for tracking progress.
Glue Transformations
● Ideally used for managing AWS Glue operations, but can also be
leveraged with other services.
● Pre-built transformations.
● No coding required.
Pipeline example: Amazon S3 (Data Lake) → AWS Glue DataBrew → Amazon Redshift
● NEST_TO_ARRAY:
○ convert columns into an array
● NEST_TO_STRUCT
o Like NEST_TO_MAP but retains exact data type and order
AWS Glue DataBrew - Transformations
● UNNEST_ARRAY:
○ Expands array to multiple columns
● UNPIVOT
o Pivot column into rows (attribute + value)
○ Event-driven ingestion:
■ S3
■ DynamoDB
■ Kinesis
⇒ Process real-time events, such as file upload
○ Automation:
Automate tasks and workflows by triggering Lambda functions in
response to events
AWS Lambda
Advantages
Example trigger: a file transfer / file upload to an S3 bucket triggers a Lambda function for data processing.
AWS Lambda – Kinesis Data Streams
Kinesis Data Streams acts as the event source: records are transferred into the stream, and Lambda continuously executes to process the incoming data.
⇒ Executes in batches
⇒ Automatically scales
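A minimal sketch of a Lambda handler consuming a Kinesis batch; record payloads arrive base64-encoded, and the handler assumes producers send JSON.

import base64
import json

def handler(event, context):
    """Triggered by a Kinesis Data Streams event source mapping."""
    for record in event["Records"]:
        # Kinesis payloads are base64-encoded bytes.
        payload = base64.b64decode(record["kinesis"]["data"])
        data = json.loads(payload)                       # assumes JSON events
        print(record["kinesis"]["partitionKey"], data)
    # Returning normally marks the whole batch as processed.
    return {"batchItemFailures": []}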
Lambda Layers
• Contains supplementary code
• library dependencies,
• custom runtime or
• configuration file
Lambda function components:
Without layers — each function (Function 1, Function 2) packages its own function code plus code dependencies, custom runtimes, configuration files, etc.
With layers — each function contains only its function code and references a shared Lambda Layer (Lambda Layer 1) that holds the dependencies, custom runtimes, and configuration files.
Why is it important?
○ Error Handling: Corrects processing mistakes and recovers lost data.
○ Data Consistency: Ensures uniform data across distributed systems.
○ Adapting to Changes: Adjusts to schema or source data changes.
○ Testing and Development: Facilitates feature testing and debugging
without risking live data integrity.
Replayability
Strategies for Implementing Replayability
○ Idempotent Operations:
Ensure repeated data processing yields consistent results (a short sketch follows below).
○ Logging and Auditing:
Keep detailed records for tracking and diagnosing issues.
○ Checkpointing:
Use markers for efficient data process resumption.
○ Backfilling Mechanisms:
Update historical data with new information as needed.
Replayability
Replayability is an important safety net for data
processing.
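Picking up the "Idempotent Operations" strategy above: a small sketch of an idempotent write. Using a deterministic key (a hypothetical event_id) means replaying the same event simply overwrites the same item instead of creating duplicates.

import boto3

table = boto3.resource("dynamodb").Table("processed_events")  # hypothetical table

def process(event):
    # The event's own ID is the partition key, so reprocessing the same
    # event during a replay produces the same item, not a duplicate row.
    table.put_item(
        Item={
            "event_id": event["id"],
            "amount": str(event["amount"]),   # stored as string to avoid float issues
            "processed_at": event["timestamp"],
        }
    )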
▪ Kinesis Firehose
Fully managed service to deliver streaming data to destinations more easily
⇒ Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk.
1) Real-time analytics
Analyzing streaming data to gain insights
2) IoT Processing
Ingesting and processing data from IoT devices or sensors
Producers — the AWS SDK, the Kinesis Producer Library (KPL), and the Amazon Kinesis Agent — send data formatted as Data Records into the stream.
Each record includes a Partition Key, which determines the shard (Shard 1, 2, 3, …) to which the record will be assigned.
⇒ Resilient to failure
Data is immutable
Scalability
Amazon Kinesis Data Streams
Consumers such as Lambda and other services read the records from the shards; producers can include IoT devices.
● Definition Latency:
The time from initiating a process to the availability of the result.
● Propagation Delay:
Specific latency from when a record is written to when it's read by a consumer.
● Influencing Factor:
Significantly affected by the polling interval, or how often consumer
applications check for new data.
Latency
Latency and Propagation Delay
● Recommendation:
Poll each shard once per second per consumer application to balance efficiency
and avoid exceeding API limits.
● Improved Scalability:
The system can handle more concurrent consumers
without performance degradation.
● Reduced Latency:
Improved latency (~70ms)
o Automatic Scaling:
Dynamically adjusts to the volume of incoming data,
providing seamless scalability.
Amazon Kinesis Data Firehose
Key Features
Producers generate data streams; Firehose buffers the data and delivers it to consumers.
Buffering: by file size (up to 128 MB) or time interval (up to 900 seconds)
Transformations: on-the-fly, with AWS Lambda integrated
Reliability: fallback S3 bucket
⇒ Elastically scales up and down
o Near Real-Time Processing:
Buffers based on time and size, balancing prompt data handling
with the efficiency of batch deliveries.
Supported languages: SQL, Python, Scala, Java
Use cases: real-time monitoring, real-time website traffic, streaming ETL
o Apache Flink under the hood:
Open-source stream processing framework managed by AWS
Sources such as Amazon MSK, Amazon S3, and custom data sources feed into on-the-fly processing with Flink's streaming engine.
Minimal code · Near real-time · Maintains state based on incoming data · Checkpoints & snapshots ensuring fault tolerance
Pricing
o Pay-as-You-Go
Consumption based
o Application Orchestration
Each application requires one additional KPU for orchestration.
o Automatic Scaling
Number of KPUs automatically scaled based on needs.
Manually provisioning possible.
o Purpose
Enables processing of streaming data in real-time.
Architecture
o Kafka Brokers:
Servers that manage the publishing and subscribing of records
o ZooKeeper Nodes:
Coordinates Kafka brokers and manages cluster metadata.
High Availability & Storage
o Multi-AZ Deployment
Clusters distributed across Availability Zones for fault tolerance.
o EBS Storage
Utilizes Elastic Block Store for durable message storage, supporting
seamless data recovery.
o Scalable Clusters
Easily adjustable broker counts and storage sizes to handle varying
workloads.
Producers & Consumers
o Producers
Applications that send data to MSK clusters.
o Consumers
Applications or services retrieving and processing data from MSK.
o Message Handling
MSK supports larger message sizes, critical for specific use cases.
MSK vs. Kinesis
MSK:
● More granular control, more management
● More complex pipelines
● Message size: up to 10 MB (default 1 MB)
● Access Control:
○ Mutual TLS
○ SASL/SCRAM username/password authentication mechanism, also relying on Kafka ACLs
○ IAM Access Control (authentication and authorization using IAM)
Kinesis:
● More managed experience
● Straightforward setup
● Message size: 1 MB limit
● Access Control:
○ Uses IAM policies for both authentication and authorization
Section 7:
Storage with S3
S3 Standard S3 Intelligent-Tiering
• Is appropriate for
• It can improve data access speeds by 10x and reduce request costs by 50%
compared to S3 Standard.
• Is best for data that is accessed less frequently, but requires rapid
access when needed.
• ideal for long-term storage, backups, and as a data store for disaster
recovery files.
S3 Standard
S3 Standard-IA
S3 Intelligent-Tiering
S3 One Zone-IA
Versioning
• Keeps multiple versions of an S3 object
• Useful for delete protection
• Is disabled by default
Versioning Enabled
Each upload of the same S3 object creates a new version with its own Version ID (e.g., Version ID: Null for the object stored before versioning was enabled, then SwS…, NXD… for later versions).
Versioning
Encryption and Bucket Policy
Encryption in transit: data travels from the client to the bucket over Secure Socket Layer/Transport Layer Security (SSL/TLS), so it is encrypted on the wire even though it is unencrypted at the client and in the bucket.
Encryption
Encryption at rest
Server-side encryption with S3 managed keys: the user uploads unencrypted data; S3 encrypts it in the bucket using an S3 managed key.
Encryption
Server-side encryption with AWS KMS (SSE-KMS)
The user uploads unencrypted data; S3 encrypts it in the bucket using a KMS managed key.
Encryption
Dual-layer server-side encryption with AWS KMS keys (DSSE-KMS)
The user uploads unencrypted data; S3 applies two layers of encryption in the bucket, each using a KMS managed key.
Server-side encryption with a client-provided key
The client supplies its own key over HTTPS; S3 uses that key to encrypt the otherwise unencrypted data in the bucket.
Encryption
Client-side Encryption
The user encrypts the data with a client key before uploading; the object is stored encrypted in the bucket.
Bucket Policy
Specifies which users can access the bucket.
Users can also reach the bucket through access points (e.g., Access Point A) rather than connecting to it directly.
• Lets you create customized entry points to a bucket, each with its own
unique policies for access control.
Access Points
Each Access Point has
Features
Use Cases
Access Point
Object Lambda
Use Cases
S3 Event Notification
Event types include:
• Replication events
• S3 Lifecycle expiration events
Destinations include a Lambda function or EventBridge.
The Event Examples
• s3:ObjectCreated:Put:
Object is uploaded to the bucket using the PUT method.
• s3:ObjectCreated:Post:
Object is uploaded to the bucket using the POST method.
• s3:ObjectRemoved:Delete:
Object is deleted from the bucket.
Wildcards
• s3:ObjectCreated:*
Captures all object creation events
Example: a user uploads a file to the bucket; the resulting event notification triggers an INSERT INTO … statement against an RDS database.
S3 Event Notification
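If the notification target is a Lambda function, a minimal handler sketch might look like this; the RDS insert is only indicated, since connection details are not part of the slide.

import urllib.parse

def handler(event, context):
    """Triggered by an s3:ObjectCreated:* event notification."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (e.g. spaces become '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"]["size"]
        print(f"New object s3://{bucket}/{key} ({size} bytes)")
        # Here you would connect to the RDS database and run something like:
        # INSERT INTO uploads (bucket, object_key) VALUES (%s, %s)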
Section 8:
Other Storage Services
EC2 instances
Highly scalable
AWS EBS - Storage
• Scalability
• Durability
Key features
• Block-level storage
• Persistent storage
• High performance
• Cost effective
EBS functionality
SSD Based
Volume Type
Select
⇒ Bound to a specific AZ
▪ S3 Compatible
Monitoring and
Optimization
Volume types
GP2/GP3
IO1/IO2
ST1/SC1
Magnetic
EBS Volume in AWS
Provisioned IOPS
Option to provision IOPS for the Provisioned IOPS SSD (IO1) volume type
Volume Size
Option to increase
Delete on Termination
Determines whether the volume should be automatically deleted when its
associated EC2 instance is terminated.
● Enabled:
The volume will be automatically deleted by AWS when the associated EC2 instance is
terminated.
● Disabled:
The EBS volume will persist even after the associated EC2 instance is terminated.
▪ Multi-AZ Availability
▪ Scalability
▪ Elasticity
▪ NFSv4.1 protocol
▪ Performance Mode:
▪ General Purpose (broad range)
▪ Max I/O (high throughput / IOPS)
▪ Pay as You Go Pricing
Posix System Standard API
The POSIX (Portable Operating System Interface)
standard defines a set of APIs for
compatibility between various UNIX-based
operating systems.
• Performance Scaling
VS
AWS DynamoDB
{
"BookID": "B103",
"Title": "Modern Web Development",
"Author": "Lisa Ray",
"Price": 31.50,
"Genres": ["Technology", "Educational"],
"PublishYear": 2018,
"RelatedCourses": ["Web Development 101", "Advanced JavaScript"]
}
AWS DynamoDB
• Primary Keys
o Uniquely identifies items in a table.
o Items in a table cannot have the same “key”
o They must be scalar (string, number or binary)
o Specified at the creation time of the table
o Required for data organization and retrieval
▪ Simple partition key: items cannot have the same partition key.
▪ Composite key: items can have the same partition key, provided the sort key is different.
AWS DynamoDB
Simple Partition Key: primary key = Partition Key, plus the item's attributes.
Composite Key: primary key = Partition Key + Sort Key, plus the item's attributes.
AWS DynamoDB
Primary keys
Global Secondary Index ▪ An index with a partition key and sort key
that can be different from the base table.
Changes
One record
Note:
o Changes made before activation of Stream are not recorded
o Data in the stream is retained for 24 hours
AWS DynamoDB
Processing Streams
1. AWS Lambda:
2. Amazon Kinesis Data Streams ⇒ Data Firehose
3. Amazon Elasticsearch Service
4. Custom Applications
5. AWS Glue
6. Cross-Region Replication
AWS DynamoDB
Lambda
AWS DynamoDB
o Control Plane
o Data Plane
o DynamoDB Streams APIs
o Transaction
Data Plane
These operations allow you to perform Create, Read, Update, and Delete on
data in tables. (CRUD operations)
I. PartiQL APIs :
o ExecuteStatement – Reads multiple items from a table. You can also
write or update a single item from a table.
o BatchExecuteStatement – Writes, updates, or reads multiple items from
a table.
High Level Operations
1. Creating data
o PutItem – Writes a single item to a table.
o BatchWriteItem – Writes up to 25 items to a table.
High Level Operations
2. Reading Data
o GetItem - Retrieves a single item from a table.
o BatchGetItem – Retrieves up to 100 items from one or more tables.
o Query – Retrieves all items that have a specific partition key.
o Scan – Retrieves all items in the specified table or index.
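A short boto3 sketch of GetItem vs. Query against a hypothetical table with partition key customer_id and sort key order_date:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("orders")  # hypothetical table

# GetItem: one item, full primary key required.
item = table.get_item(
    Key={"customer_id": "C-1001", "order_date": "2024-05-01"}
).get("Item")

# Query: all items sharing a partition key, optionally narrowed by sort key.
recent = table.query(
    KeyConditionExpression=Key("customer_id").eq("C-1001")
    & Key("order_date").begins_with("2024-05")
)["Items"]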
High Level Operations
3. Updating data
o UpdateItem – Modifies one or more attributes in an item.
4. Deleting data
o DeleteItem – Deletes a single item from a table.
o BatchWriteItem – Deletes up to 25 items from one or more
tables.
High Level Operations
o DynamoDB Streams APIs:
Operations for enabling/disabling a stream on a table and allowing access to the data
modification records in a stream.
o Transaction
PartiQL APIs:
o ExecuteTransaction – A batch operation that allows CRUD
operations on multiple items both within and across tables.
High Level Operations
Set Type: can represent multiple scalar values — String set, Number set, and Binary set.
Cost & Configuration
Read/Write Capacity Modes
On-Demand
o Cost Implications: More expensive for predictable workloads (premium for scalability).
Cost & Configuration
Read/Write Capacity Modes
Provisioned Mode
Reserved Capacity
o Long-Term Commitment:
Commit to specific RCUs and WCUs for 1 or 3 years.
o Discounted Pricing:
Reduced rates
o Use Case:
Suited for stable, predictable workloads over long periods.
Cost & Configuration
Read/Write Capacity Modes
Provisioned Mode
Auto Scaling
o Setup Complexity: Requires setting minimum, maximum, and target utilization levels.
o Response Time: Minor delays in scaling might occur, but it's designed to be responsive.
AWS DynamoDB
Write Capacity Units (WCUs) & Read Capacity Units (RCUs)
• Consumption:
Items over 1 KB require more WCUs, e.g., a 3 KB item needs 3 WCUs per write.
• Provisioning:
Allocate WCUs based on expected writes.
Adjust with over-provisioning for spikes or use Auto Scaling for dynamism.
• Cost:
Pay for provisioned WCUs, used or not. Planning is important to manage expenses.
• Throttling:
Exceeding WCUs leads to throttled writes, potentially causing
ProvisionedThroughputExceededExceptions.
Opt for on-demand capacity for automatic scaling.
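A worked example using DynamoDB's standard units (1 KB per WCU write, 4 KB per strongly consistent RCU read): writing 100 items per second at 2.5 KB each needs 100 × ⌈2.5⌉ = 300 WCUs; reading 80 items per second at 6 KB each with strongly consistent reads needs 80 × ⌈6 / 4⌉ = 160 RCUs, and eventually consistent reads would halve that to 80.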
AWS DynamoDB
Write Capacity Units (WCUs) & Read Capacity Units (RCUs)
Read Consistency
• Application Needs:
Choose based on the criticality of data freshness vs. throughput
requirements.
AWS DynamoDB
Write Capacity Units (WCUs) & Read Capacity Units (RCUs)
Example
Throttling
Adaptive Capacity
Reserved Capacity
o Unused reserved capacity is applied to accounts in the same AWS organization - can be
turned off
o DAX Cluster: A DAX cluster has one or more nodes running on individual
instances with one node as the primary node.
o Accessing DAX: Applications can access DAX through endpoints of the DAX cluster
o “ThrottlingException”: Returned if requests exceed the capacity of a node
⇒ DAX limits the rate at which it accepts additional requests by returning a
ThrottlingException.
AWS DynamoDB
DynamoDB Accelerator (DAX)
Read Operations
If an item requested is in the cache, DAX returns it to the application without
accessing DynamoDB (cache hit).
If the item is not in the cache (cache miss), DAX forwards the request to DynamoDB.
API calls:
o GetItem
o BatchGetItem
o Query
o Scan
AWS DynamoDB
DynamoDB Accelerator (DAX)
Write Operations
Data is written to the DynamoDB table, and then to the DAX cluster.
API calls:
o BatchWriteItem
o UpdateItem
o DeleteItem
o PutItem
AWS DynamoDB
DynamoDB TTL
DynamoDB Time To Live feature allows for the automatic deletion of items in a table
by setting an expiration date.
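A small boto3 sketch of enabling TTL and writing an item with an expiry; the table and attribute names are hypothetical, and the TTL attribute must hold a Unix epoch timestamp in seconds.

import time
import boto3

dynamodb = boto3.client("dynamodb")

# Tell DynamoDB which attribute holds the expiration timestamp.
dynamodb.update_time_to_live(
    TableName="sessions",                                        # hypothetical table
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Items whose expires_at is in the past become eligible for automatic deletion.
dynamodb.put_item(
    TableName="sessions",
    Item={
        "session_id": {"S": "abc-123"},
        "expires_at": {"N": str(int(time.time()) + 3600)},       # expire in 1 hour
    },
)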
Amazon Redshift Cluster
A cluster consists of a Leader Node and one or more Compute Nodes.
• Pricing is the same for RMS regardless of whether the data resides in
high-performance SSDs or in S3.
RA3 and DC2 Node types
• You can add more compute nodes to increase the storage capacity of the
cluster.
Amazon Redshift Cluster
Node Slices:
• Compute nodes are split into slices
• Each handling a part of the workload.
• Leader node distributes data & tasks to the slices for parallel processing
Elastic resize
Can be used to …
• Add or remove nodes from your cluster.
• Change Node Type: From DS2 to RA3
Amazon Redshift Cluster
Classic resize
Snapshot
• Is enabled by default.
• You can specify the retention period when you create a manual snapshot,
or you can change the retention period by modifying the snapshot.
Sharing data across AWS
Regions
• With cross-Region data sharing, you can share data across clusters in the
same AWS account, or in different AWS accounts even when the clusters are
in different Regions.
Distribution Styles
• KEY distribution: the leader node places matching values on the same node slice.
• EVEN distribution: appropriate when there isn't a clear choice between KEY distribution and
ALL distribution.
Distribution Styles
AUTO distribution
• By default, VACUUM skips sorting for tables where > 95% already sorted.
• SORT ONLY : Sorts the specified table (or all tables) without
reclaiming space
• TO threshold PERCENT :
Specifies a threshold above which VACUUM skips sort phase &
reclaiming space in delete phase.
• BOOST :
Runs the VACUUM command with additional resources as they are available.
Redshift Integration
○ Is faster.
• SQL Transformations
• Stored Procedures
• Popular ETL platforms that integrate with Amazon Redshift include third-
party tools like
• Informatica
• Matillion
• dbt
Amazon RDS
Amazon Aurora
• You define external tables in Redshift cluster that reference the data
files stored in S3.
Amazon Redshift Spectrum
• Handles the execution of queries by retrieving only the necessary data
from S3.
• Redshift Spectrum tables can be queried and joined just as any other
Amazon Redshift table.
Amazon Redshift Spectrum
Amazon Redshift Spectrum considerations
• Redshift cluster and the S3 bucket must be in the same AWS Region.
• Redshift Spectrum doesn't support VPC with Amazon S3 access point aliases
Amazon Redshift Spectrum
Amazon Redshift Spectrum considerations (continued)
• To create a new external table in the specified schema, you can use
CREATE EXTERNAL TABLE.
• Unless you are using an AWS Glue Data Catalog that is enabled for AWS
Lake Formation, you can't control user permissions on an external table.
Amazon Redshift Spectrum
Amazon Redshift Spectrum considerations (continued)
• To run Redshift Spectrum queries, the database user must have permission
to create temporary tables in the database.
• Some system tables can only be used by AWS staff for diagnostic purposes.
Redshift System Tables and Views
Types of system tables and views
• SVV views — details on database objects; e.g., SVV_ALL_TABLES (see all tables), SVV_ALL_COLUMNS (see a union of columns)
• SVCS views — details about queries on both the main and concurrency scaling clusters; e.g., SVCS_QUERY_SUMMARY (general information about the execution of a query)
• SVL views — contain references to STL tables & logs for more detailed information; e.g., SVL_USER_INFO (data about Amazon Redshift database users)
Redshift Data API
Serverless, event-driven architectures, web applications, and third-party services can all call the Data API.
• Can be setup with very little operational overhead (very easy setup!)
Redshift Data API
Serverless Execution
Asynchronous Processing
Simplified Management
Redshift Data API
• Maximum duration of a query is 24 hours.
ra3.16xlarge
Access Control
• Authorize user/service by adding managed policy
⇒ AmazonRedshiftDataFullAccess
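A minimal sketch of the Data API's asynchronous pattern with boto3; the cluster, database, and user names are placeholders.

import time
import boto3

rsd = boto3.client("redshift-data")

# Submit the query; the call returns immediately with a statement Id.
resp = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",      # placeholder cluster
    Database="dev",
    DbUser="data_api_user",
    Sql="SELECT venuename, venuecity FROM venue LIMIT 5;",
)
statement_id = resp["Id"]

# Poll until the statement finishes (an event-driven setup could use EventBridge instead).
while True:
    status = rsd.describe_statement(Id=statement_id)["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if status == "FINISHED":
    for row in rsd.get_statement_result(Id=statement_id)["Records"]:
        print([col.get("stringValue") for col in row])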
Availability Zones
• Data is live
Data Sharing
Datashare: a producer cluster shares objects with a consumer cluster —
- Tables
- Views
- User-defined functions
Data Sharing
• Data sharing continues to work when clusters are resized or when the
producer cluster is paused.
Data Sharing
• Data sharing is supported for all provisioned ra3 cluster types and
Amazon Redshift Serverless.
• You can only share SQL UDFs through datashares. Python and Lambda
UDFs aren't supported.
Data Sharing
• Adding external schemas, tables, or late-binding views on external
tables to datashares is not supported.
• Redshift manages:
▪ How many queries run concurrently
▪ How much memory is allocated to each dispatched query.
• On demand scaling.
Amazon Redshift Serverless vs. provisioned clusters:
● Serverless: you can choose a port from the range 5431–5455 or 8191–8215; data is always encrypted with AWS KMS (AWS managed or customer managed keys).
● Provisioned: you can choose any port to connect; data can be encrypted with AWS KMS (AWS managed or customer managed keys) or left unencrypted.
Resizing: supported for both.
Redshift ML
• Users can utilize SQL commands to create and manage machine learning
models, which are then trained using data stored in Redshift.
• Amazon Redshift ML enables you to train models with one single SQL CREATE
MODEL command.
Redshift ML
• Lets users create predictions without the need to move data out of Amazon
Redshift.
• Users can create, train, and deploy machine learning models directly
within the Redshift environment.
• Useful for users who don't have expertise in machine learning tools,
languages, algorithms, and APIs.
Redshift ML
• Redshift ML supports common machine learning algorithms and tasks, such
as
• Sign-in credentials
• Cluster encryption
Redshift Security
• Cluster security groups
○ Default Lockdown:
When you provision an Amazon Redshift cluster, it is locked down by
default so nobody has access to it.
• Load data encryption to encrypt the data during the load process
Amazon Redshift Spectrum
• Data in transit
○ Redshift
• The privileges to access specific objects are tightly coupled with the DB
engine itself.
Access Control In Redshift
Manage your users and groups within Redshift
Group A Group B
Access Control In Redshift
/* Give sales USAGE rights in schema, and read-only (SELECT) access to the
tables within the schema */
User 1 User 2
Group A
Access Control In Redshift
AWS introduced RBAC (Role-based access control)
GRANT ROLES
User 1 User 2
Role 2
Access Control In Redshift
/* Creating role*/
/* Give sales USAGE rights in schema, and read-only (SELECT) access to the
tables within the schema */
• Used to control and manage access permissions for various users or roles.
USING (
    warehouse_id IN (
        SELECT managed_warehouse_id
        FROM warehouse_managers
        WHERE manager_username = current_user
    )
);
Example rows evaluated by the policy: (102, Electric Drill, 85, 2), (103, Hammer, 75, 1), (104, Nails (Pack of 100), 150, 3).
/* Masking Policy */
• Rows have criteria that defines which role can access the specific item
(row).
What is RDS?
▪ Fully managed relational database service.
Main characteristics:
▪ Scalable, reliable, and cost-effective.
▪ Fully encrypted at rest and in transit.
▪ Support for Multiple Database Engines:
o MySQL
o PostgreSQL
o MariaDB
o Oracle Database
o SQL Server
o Amazon Aurora
Security in RDS
3) Patch Management:
Automatic minor version upgrades for minor updates and patches
4) AWS backup:
Centralization and automation of data backup.
ACID compliance in RDS
Important set of properties for data reliability.
Definition:
○ Atomicity
○ Consistency
Data integrity and consistency
○ Isolation
○ Durability
ACID compliance in RDS
Locking mechanisms for concurrent data access
and modification in multiple transactions.
1 Exclusive locks — prevent other transactions from reading or writing to
the same data.
Syntax: FOR UPDATE.
2 Shared locks — allow multiple transactions to read data at the same
time without blocking.
Syntax: FOR SHARE.
3 Tables & Rows — these can be locked to keep data integrity and
control.
Syntax and Deadlocks
▪ PostgreSQL command to lock a table:
LOCK TABLE table_name IN ACCESS EXCLUSIVE MODE;
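A small sketch of row locking from application code, assuming a PostgreSQL RDS instance and the psycopg2 driver; the endpoint, table, and column names are hypothetical.

import psycopg2

conn = psycopg2.connect(
    host="mydb.example.rds.amazonaws.com",   # placeholder endpoint
    dbname="shop", user="app", password="secret",
)

with conn:                      # commits on success, rolls back on error
    with conn.cursor() as cur:
        # Exclusive row lock: other transactions must wait to update this row.
        cur.execute(
            "SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (42,)
        )
        (balance,) = cur.fetchone()
        cur.execute(
            "UPDATE accounts SET balance = %s WHERE id = %s", (balance - 10, 42)
        )
conn.close()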
▪ Understanding Deadlocks
What Happens:
During concurrent transactions, two (or more) transactions each lock resources and wait on each other.
Result:
No transaction can proceed, halting progress.
AWS RDS Basic Operational Guidelines.
o Monitor metrics for memory, CPU, replica lag, and storage via CloudWatch.
o MySQL: Ensure tables don't exceed the 16TiB size limit by partitioning
large tables.
• Performance
Features
Offers up to 5x the performance of MySQL, 3x that of PostgreSQL.
• Scalability
Scales from 10GB to 128TB as needed.
• Read replicas:
Up to 15 replicas to extend read capacity.
• Security:
IAM for authentication, supports encryption at rest and in transit.
Amazon Aurora
Aurora Serverless • Automatic Scaling:
Auto-adjusts to the application's needs, no manual scaling required.
• On-Demand Usage:
Pay-for-what-you-use is ideal for sporadic or unpredictable workloads.
• Simple Setup:
No management of database instances; automatic capacity handling.
• Cost-Effective:
Bills based on Aurora Capacity Units (ACUs), suitable for fluctuating
workloads.
Amazon Aurora
• Serverless:
Scales automatically with needs; no infrastructure management.
Features
• Managed Service:
AWS handles provisioning, setup etc.
• Security:
IAM for authentication, supports encryption at rest and in transit.
Amazon DocumentDB
Pricing
• Knowledge graphs,
Use Cases • Fraud detection,
• Recommendation engines
• Serverless:
Scales automatically with needs; no infrastructure management.
Features
• Managed Service:
AWS handles provisioning, setup etc.
• Security:
IAM for authentication, supports encryption at rest and in transit.
Amazon Keyspaces
Pricing
• In-memory storage
for low-latency and high-throughput access
Features
• Scalability
Automatically scales to adapt to workload changes.
• Durability:
Ensures data persistence with snapshotting and replication across
multiple Availability Zones.
• Security:
IAM for authentication, supports encryption at rest and in transit.
Amazon MemoryDB for Redis
Use Cases
• Caching:
Ideal for high-performance caching
Reducing load and improving response times
• Session Store:
Session information for web applications, ensuring fast
retrieval and persistence.
Amazon MemoryDB for Redis
Pricing
• Node Pricing: Charges based on type and number of nodes, varying by CPU,
memory, and network performance.
• Data Transfer Pricing: Costs for data transferred "in" and "out" of the service;
intra-region transfers typically not charged.
• Backup Pricing: Charges for backup storage beyond the free tier, priced per GB
per month.
Ingestion → Processing → Output
Ingestion sources: Amazon MSK, Kinesis Data Streams, IoT Core
Processing: Managed Service for Apache Flink
Output: Lambda, Managed Grafana, QuickSight, SageMaker
Real-time processing with IoT devices
Use case
IoT devices (generate data) → Kinesis Data Stream (transport of streaming events) → Managed Apache Flink (transform & insert) → Timestream (time-series database of metrics & logs) → Managed Grafana (visualization)
• Auto Scaling.
Accelerated Computing
• Use hardware accelerators, or co-processors, to perform functions.
Memory Optimized
• Designed for workloads that process large data sets in memory.
EC2 Instance Types
Storage Optimized
HPC Optimized
Lambda
❑ Lightweight, event-driven tasks
❑ Run code in response to events (ideal for real-time)
Glue
❑ Specialized ETL / Data Integration service
Batch
❑ General purpose, versatile (compute-intensive) batch
jobs
AWS Batch
● Batch jobs based on docker images
● Automated Scaling
● Job Scheduling
● Serverless:
○ Uses EC2 instances and Spot instances.
○ Can be used together with Fargate
○ No need to manage infrastructure.
● Pricing:
○ EC2 Instance / Fargate / Spot Instance costs.
AWS Batch – How It Works
sam package
sam deploy
Section 13:
Analytics
Data Collection · Data Preparation · Data Cataloging · Analytics · Security & Access · Data Sharing · Integration
AWS Lake Formation
● Collects data from different sources.
AWS storage services → Crawler → Data Catalog → Data Lake
AWS Lake Formation – How It Works
1) Define Data Sources
2) Data Ingestion
3) Catalog and Organize
4) Clean and Transform
5) Set up Security
6) Use and Analyze
AWS Lake Formation – Security
● Centralized security management.
● Cross-account access.
AWS Lake Formation – Cross-Account
• Share Setup:
Use named resources or LF-Tags for easy database and table sharing across AWS
accounts.
• Granular Access:
Use data filters to control access at the row and cell levels in shared tables.
• Permissions Management:
Leverage AWS RAM to handle permissions and enable cross-account sharing.
• Resource Acceptance:
Once shared resources are accepted via AWS RAM, the recipient's data lake
administrator can assign further permissions.
• Resource Link:
Establish a resource link for querying shared resources via Athena and Redshift
Spectrum in the recipient account.
AWS Lake Formation – Cross-Account
Troubleshooting Points:
● Security:
○ IAM, VPC, KMS.
Amazon EMR
What Is Hadoop
● Distributed storage and processing framework.
○ 1) Hadoop Distributed File System (HDFS)
○ 2) MapReduce
- x86-based instance:
Versatile & traditional choice
Hive Query
Apache Hive
Hive Metastore
Use cases:
- Monitoring systems and applications
- Dashboards for log or sensor data
- Real-time monitoring for IT infrastructure
- Use alerts
Workspace
Data sources
User Authentication
Integrates with identity providers that support SAML 2.0 and
AWS IAM Identity Center
Amazon Managed Grafana - How It Works
CloudWatch Very common
▪ Use Cases
o Real-time Application Monitoring
o Business Intelligence (BI) and Data Analytics
o …
AWS OpenSearch
▪ Pricing
Pay as you go.
▪ Infrastructure:
Fully managed.
▪ Scalability:
Ability to scale up or down manually.
▪ Integration:
Integrates with other AWS services
▪ Availability:
Multi-AZ deployments, automated snapshots.
AWS OpenSearch Key Components
Documents · Types · Indices — JSON documents are grouped into types, which are stored within indices.
AWS OpenSearch
Node, Cluster and Shard
▪ What is a Node:
Single running instance of OpenSearch.
▪ Data Nodes
▪ Master Nodes
▪ Client Nodes
▪ What is a Shard:
Partitions of an index’s data.
▪ Primary Shards
▪ Replica Shards
AWS OpenSearch – Infrastructure
Cluster
Node 1 Node 2 Node 3
▪ Master Node:
Responsible for cluster-wide operations.
▪ S3 Snapshot:
S3 snapshots can be used to restore your cluster.
AWS OpenSearch – What to Avoid
▪ OLTP and Ad-Hoc Queries:
Can lead to suboptimal performance.
▪ Over-Sharding:
Can lead to increased overhead and reduced
efficiency.
Shard Optimization:
▪ Reduce shard count to mitigate memory issues
Data Offloading:
▪ Delete older or unused indices;
▪ Consider archiving data to Amazon Glacier to improve
performance.
AWS OpenSearch – Security
▪ Authentication:
▪ Native Authentication: Users and Roles
▪ External Authentication: Active Directory, Kerberos,
SAML, and OpenID
▪ Authorization:
▪ Role-Based Access Control (RBAC): Through Users and Roles
▪ Attribute-based access control (ABAC): Based on user
attributes
AWS OpenSearch – Security
▪ Encryption:
▪ In-transit encryption: TLS encryption.
▪ At-rest encryption: Third party applications.
▪ Audit Logging:
▪ Identifying potential security issues.
AWS OpenSearch – Dashboards
▪ Data Visualization:
▪ Line charts
▪ Bar graphs
▪ Pie charts
▪ Dashboard Creation:
Each dashboard is customizable; users are able to
create dashboards as per their needs.
AWS OpenSearch – Storage Types
Hot Storage
▪ Frequently accessed & instant retrieval
▪ Fastest performance
▪ Real-time analytics and recent log data
Warm Storage
▪ Less frequently accessed but still available
▪ Requires dedicated master nodes
▪ Not compatible with T2 or T3 data node instance types
UltraWarm Storage
▪ S3 + caching
▪ Lower cost
▪ Not frequently written or queried
Cold Storage
▪ Rarely used (e.g. archival for read-only data)
▪ Cost-effective, lowest cost
▪ Requires UltraWarm
▪ Uses Amazon S3, hence no compute overhead
AWS OpenSearch – Reliability and Efficiency
Cross-Cluster Replication:
▪ What It Is:
▪ Allows you to copy and synchronize data.
▪ Used for increasing data availability.
▪ Why It Is Important:
▪ Provides availability.
▪ Protects against disruptions caused by hardware failures and network issues.
AWS OpenSearch – Reliability and Efficiency
Index Management:
▪ What It Is:
▪ Automates the index managing process.
▪ Defines indices lifecycles.
▪ Why It Is Important:
▪ Provides cost efficiency.
▪ Provides performance improvements.
AWS OpenSearch – Reliability and Efficiency
Infrastructure Management:
▪ What It Is:
▪ Deciding disk scale.
▪ Deciding master node quantity.
▪ Why It Is Important:
▪ Determines your system resilience and stability.
AWS OpenSearch – Serverless
● Serverless Flexibility:
Auto-scales based on demand, reducing management
overhead.
● Cost Efficiency:
Pay only for what you use, ideal for variable
workloads.
○ ML-Powered dashboards.
▪ Use Cases
o Data Exploration
o Anomaly Detection
o Forecasting
o …
Amazon QuickSight
▪ Scalability
Automatically scales up and down.
▪ Serverless:
Fully managed.
▪ Data Visualization:
Visuals (Charts)
▪ Dashboards:
Published version of an analysis
Amazon QuickSight
SPICE Super-fast, Parallel, In-memory Calculation Engine
▪ In memory engine
▪ Benefits:
▪ Speed and Performance
▪ Automatic Data Refreshes
▪ Synchronization
QuickSight – Dashboards
Features
▪ Automatic Refreshes:
Automatically refreshes dashboards.
▪ Mobile Accessibility:
Mobile responsive dashboards.
QuickSight – Data Sources
▪ AWS Services:
o S3
o RDS
o Aurora
o Redshift
o Athena
▪ OpenSearch
▪ Aurora/RDS/Redshift
▪ Using ETL
Amazon QuickSight
Licensing
▪ Small group of users
(Table columns: Type · Price · Included SPICE Capacity)
▪ Administrative features:
▪ User management
▪ Encryption
▪ Pricing
▪ Pay-per-session for readers
QuickSight – Enterprise
▪ Author License:
▪ Connect to Data
Price: $24/month (month-to-month) or $18/month (with annual commitment)
▪ Reader License:
▪ Explore Dashboards
▪ Get Reports
▪ Download Data
Price: $0.30/session (month-to-month), up to $5 max/month
▪ QuickSight Embedded
▪ Paginated Reports
▪ Alerts and Anomaly Detection
Data On-Premise
▪ Use AWS Direct Connect
Standard Edition
▪ Accessing QuickSight from Private Subnet not possible
VPC Peering
▪ VPC Peering
Connect them using VPC peering.
QuickSight – Cross-Region/Account
VPC – Account A ↔ VPC – Account B can be connected via VPC Peering, Transit Gateway, AWS PrivateLink, or VPC Sharing.
● AWS PrivateLink:
Securely exposes services across accounts without exposing data
to the public internet.
● VPC Sharing:
Allows multiple AWS accounts to share a single VPC, facilitating
access to shared resources like QuickSight and databases.
Row-Level Security (RLS) Enterprise
● Customizable Access
Which data rows can be seen
● Dataset Filters
Applied using dataset filters
● Data Security
Users only see the data relevant to their role.
Section 14:
Machine Learning
● Resource-Based Policies:
○ You can specify accesses based on SageMaker resources.
Configure Feature Store → Ingest Data (• Online store) → ML Model
Benefits
● Efficiency: Reduces the effort and complexity.
● Auditability:
○ Provides a detailed history of.
● Governance:
○ Enhances the governance of ML projects.
● Collaboration:
○ Makes it easier for teams to collaborate.
Amazon SageMaker – Data Wrangler
● It simplifies data preparation process.
Data Import → Data Preparation → Visualization → Feature Engineering → Export Data
• Data Import: S3, Redshift, EMR, Feature Store
• Data Preparation: provides an interface to normalize data and clean data
• Visualization: get insights, see data distributions
• Feature Engineering: create and modify features
• Export Data: to SageMaker or other AWS services
Amazon SageMaker – Data Wrangler
Quick Model
● You can quickly test your data.
● Use Cases
• Application Orchestration: Automates tasks across applications.
State = Step
• Succeed State: Ends the execution successfully with the provided output.
• Map State: Iterates over a list, processing each item with a sub-workflow.
• Pass State: Passes input directly to output, with optional data transformation.
AWS Step Functions
● State Machine defined in ASL
ASL = Amazon States Language
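A minimal ASL definition registered via boto3, just to show the shape of a state machine; the Lambda ARN, role, and state-machine name are placeholders.

import json
import boto3

sfn = boto3.client("stepfunctions")

# Two states: a Task that invokes a Lambda function, then an explicit Succeed state.
definition = {
    "Comment": "Minimal example state machine",
    "StartAt": "TransformRecord",
    "States": {
        "TransformRecord": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-central-1:123456789012:function:transform",
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 2}],
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
    },
}

sfn.create_state_machine(
    name="minimal-etl-flow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",
)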
AWS Step Functions
● Built-in controls
⇒ To examine the state of each step in the workflow use
• Retry
• Catch
• Timeouts
• Parallel
• Choice
• Wait
• Pass
AWS Step functions
AWS SDK Integrations
⇒ Allow Step Functions to execute API actions across more than 200 AWS
services directly within a workflow.
Optimized Integrations
⇒ Optimized integrations are designed specifically for seamless
integration with popular AWS services, such as Lambda, S3, DynamoDB,
ECS, SNS, and SQS.
IAM permissions
○ State machine needs to have appropriate IAM permissions
■ E.g. execute a Lambda function
⇒ Lambda function might need permission to other services as well
Standard Workflows
● Used for long-running, durable, and auditable workflows.
Express Workflows
● Ideal for high-volume, short-duration, event-processing workloads.
▪ Event Bus: This is the central hub where events are sent.
▪ Rules: route matching events to targets such as Lambda or Step Functions.
Amazon EventBridge Rules
Two types of rules
• Event pattern rules: define an event pattern that matches incoming events
• Scheduled rules: e.g., periodically run a Lambda function
Default Event Bus
Available by default; automatically receives events from AWS services
Custom Event Buses
Created by the user; specify which events should be received
Partner Event Buses
Created by the user; receive events from integrated SaaS partners
EventBridge
Schema Registry
• Decentralized Management
• Cross-Account Access
Workflow Overview
▪ Successful validation triggers the execution
▪ The ETL job performs further data transformation
• SaaS Integration:
Simplifies connections with popular SaaS applications
Salesforce, Snowflake, Slack etc..
Source Destination
AppFlow
AppFlow
• Data Transformation:
Features
Provides mapping and transformations for compatible data formats.
• Bi-Directional Sync:
Supports two-way data synchronization.
• Event-Driven Flows:
Can initiate transfers based on SaaS application events.
• Workflow Automation:
Enables automated interactions between SaaS and AWS services.
Amazon SNS
⇒ Standard topics:
o High throughput
o At least once delivery
o Unordered delivery
o Multiple subscribers
Publishers send messages to an Amazon SNS topic; subscribers receive them via mobile push, mobile text (SMS), or email. Filter policies control which messages reach each subscriber.
Amazon SNS
Amazon SNS How to Publish
Topic Publish
• Send a message to an SNS topic
• Topic acts as a channel
• Automatically gets delivered
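A small boto3 sketch of publishing to a topic; the topic ARN and the message attribute used by filter policies are hypothetical.

import json
import boto3

sns = boto3.client("sns")

sns.publish(
    TopicArn="arn:aws:sns:eu-central-1:123456789012:order-events",  # placeholder
    Message=json.dumps({"order_id": "O-1001", "status": "shipped"}),
    Subject="Order shipped",
    # Subscribers with a filter policy on 'event_type' only receive matching messages.
    MessageAttributes={
        "event_type": {"DataType": "String", "StringValue": "shipped"}
    },
)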
Producers send messages (A, B, C, D, E, …) to the queue; consumers retrieve and process them.
Standard Queues FIFO Queues
● High Throughput. ● Ordering Guarantee.
REST APIs
WebSocket APIs
▪ HTTP integration
Forward requests to endpoints
Suitable for microservices
Quotas:
Rate Limiting:
CloudFormation interface
AWS CloudFormation Templates
Text files that describe the desired state of your AWS infrastructure.
HOW IT WORKS
CloudWatch interface
AWS CloudWatch Metrics
A metric is a series of numbers to monitor over time
• Features:
o Namespaces: Serve as containers for CloudWatch metrics.
CloudWatch Metric Streams deliver metrics through Kinesis Data Firehose to destinations such as S3 or Redshift.
Setup options: custom setup with Firehose, quick S3 setup, quick AWS partner setup.
AWS CloudWatch Alarms
Monitor metrics and trigger actions when defined thresholds are breached.
• Types of Alarms:
o Metric Alarm: Monitors a single metric
o Composite Alarm: Monitors the state of other alarms
AWS CloudWatch Alarms
● Alarm States:
OK - Metric is within the defined threshold
ALARM - Metric is over the defined threshold
INSUFFICIENT_DATA - Not enough data available to determine the alarm state
• Metric filter: Extract data from log events to create custom metrics
• Subscription filter: Filter log data being sent to other AWS services
A CloudWatch Logs subscription filter can send log data to Kinesis Data Streams or Kinesis Data Firehose.
Cross-Account Access
1) Set up a Kinesis Data Stream in the destination account, with a resource policy that grants WRITE access.
2) In the source account, a CloudWatch Logs subscription filter sends the logs to the destination stream, assuming an IAM role whose trust policy allows the source account to assume it.
AWS CloudWatch Logs Agent
o By default, EC2 does not send log data to CloudWatch; to send its logs to
CloudWatch, a logs agent is needed.
CloudTrail interface
AWS CloudTrail Events
Record(s) of activities
• Types of Events
o Management Events: Captures high-level operations
o Data Events: Captures data-level operations
Events History
● Trail Types:
Multi-Region - Trail applies to all regions
• Features:
o Multiple Trails per Region: you can create multiple trails within a single AWS Region.
AWS CloudTrail Lake
A Managed data lake for AWS user and API activity
● Lake Channels:
Config interface
AWS Config Concepts
o Configuration Item is the current state of individual AWS resources
o Configuration Recorder stores configuration items for resources in your account.
o Configuration History is a historical record of configuration changes
o Configuration Snapshot is a collection of configuration items
o Configuration Stream is an automatically updated list of configuration items
for resources recorded by AWS Config.
o Conformance packs bundle Config rules, remediation actions, and required AWS
resource configurations into a single, reusable package.
• Types of Rules
EC2
S3 Bucket
Configuration
Change
• Trigger Types
● Evaluation Modes:
Proactive
• Concepts
o Security:
Protect data and systems; manage access, and respond to security events.
o Reliability:
Ensure systems perform as expected, handle changes in demand, and recover from disruptions.
o Performance Efficiency:
Use resources efficiently, adapt to changing needs, and leverage new technologies.
o Cost Optimization:
Reduce and control costs without sacrificing performance or capacity.
AWS Well-Architected Framework
o Sustainability
Minimize environmental impact by efficiently using resources and reducing carbon emissions.
AWS Well-Architected Tool
• Questionnaire-Based Assessment
o Lenses = Evaluate your architectures against best practices and identify areas of
improvement.
o Medium risk issues (MRIs) = Architectural and operational choices that may negatively
impact a business.
How it Works
▪ Root User:
Initial user with full access to the account and all services
Intended for account setup and emergencies
▪ Federated users:
Authenticated through external identity providers:
IAM - Groups
Groups:
▪ A collection of users managed as a single entity
▪ Assign policies to group => all users inherit permissions
▪ A user can belong to multiple groups
▪ No credentials associated
IAM Groups
assume
attach
Role
Policies
IAM - Policies
Policies are documents that define permissions for IAM entities
Managed policies:
▪ Centrally managed standalone policies
Resource-based policies:
▪ Attached to a resource instead of IAM identity
▪ Grant or deny permissions on the resource
▪ Inline policy only
IAM – Trust Policy
Define which entities (accounts, users, or services) are allowed to assume a role.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::111122223333:root"
},
"Action": "sts:AssumeRole"
}
]
}
Entities that assume the role gain access to the permissions attached to it.
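A minimal sketch of assuming a role that trusts the account in the policy above, then using the temporary credentials; the role ARN is a hypothetical placeholder.

import boto3

sts = boto3.client("sts")

# Request temporary credentials for a role whose trust policy allows this account.
resp = sts.assume_role(
    RoleArn="arn:aws:iam::444455556666:role/DataEngineerRole",  # placeholder role
    RoleSessionName="etl-session",
)
creds = resp["Credentials"]

# Use the temporary credentials for subsequent calls.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print([b["Name"] for b in s3.list_buckets()["Buckets"]])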
• Docker Image
• Docker Containers
• Docker Registry
Docker Registries
Build Run
Docker File Docker Container
Docker Image
Push Pull
Docker Registries
Elastic Container Service
• Cluster
• Container Agents
• AWS just runs ECS Tasks for you based on the CPU /
RAM you need
• Member of
• Task Role
• Is Secure
Amazon Elastic Container Registry
• Lifecycle policies
• Image scanning
• Versioning
• Tagging
Amazon Elastic Container Registry
Public Repository
Private Repository
of containerized applications.
• Scalability
• Security
Self-managed nodes
Amazon EKS architecture
AWS Fargate
● Security:
○ Secure at rest and in transit.
○ Uses AWS KMS.
AWS Snow Family - Snowcone
● Small and lightweight.
● 8TB of storage.
● Data Processes:
○ AWS IoT Greengrass, AWS Lambda.
AWS Snow Family – Snowball Edge
● 1)Storage Optimized:
○ Storage focused.
○ 80 TB.
● 2)Compute Optimized:
○ Run applications, process data.
○ 42 TB.
● Data is encrypted.
Usable vCPUs: Snowcone — 2 vCPU; Snowball Edge — 40 vCPU (104 vCPU for Compute Optimized)
AWS Transfer Family
● Fully managed.
● Pricing:
○ Pay as you go.
Use Cases
● Secure data distribution
▪ Integration:
▪ Amazon S3, Amazon EFS, Amazon FSx for Windows
On-Premises
AWS DataSync – How It Works
Setting Up AWS DataSync
▪ Task Configuration:
▪ Define on-premises sources (NFS, SMB) and AWS targets (S3, EFS, FSx); data is transferred over TLS.
AWS Database Migration
Service (DMS)
● Continuous replication.
● Security:
○ Data encrypted during transit.
● Pricing:
○ Pay as you go
AWS DMS – Source & Target
SOURCES: Amazon Aurora, RDS, Oracle, Microsoft SQL Server, MySQL, PostgreSQL, MongoDB, SAP ASE
TARGETS: Amazon Aurora, Redshift, DynamoDB, DocumentDB, S3, Kinesis Data Streams, Apache Kafka
AWS DMS – How It Works
Homogeneous Migration
Source DB → Source Endpoint → Replication Task (Full Data Load, CDC) on a Replication Instance → Target Endpoint → Target DB
AWS DMS – How It Works
Heterogeneous Migration
1) Schema Conversion: the SCT Server converts the schema from the Source DB to the Target DB.
2) Replication: Source DB → Source Endpoint → Replication Task (Full Data Load, CDC) on a Replication Instance → Target Endpoint → Target DB
AWS Application Discovery
Service
Automates
⇒ Migration of applications, databases, servers to AWS
Test Before Switching
⇒ Supports creating test environments
⇒ Regional Service
Amazon VPC Subnets
A range of IP addresses in your VPC
⇒ Zonal Service
• Types of Subnets
o Public Subnets have access to the internet
o Private Subnets do not have direct access to the internet
o VPN-only Subnets are accessed via a VPN connection
o Isolated Subnets are only accessed by other resources in the same VPC
Subnet Routing
Route Tables are sets of rules that dictate how traffic is
routed in your VPC.
Networking
Components
Virtual Private
Network
Direct Connect
• Security Groups: control inbound or outbound traffic at the resource level
• Network Access Control Lists (NACLs): control inbound or outbound traffic at the subnet level
VPC
Additional Features
VPC Flow Logs: Capture information about IP traffic going to and from network
interfaces.
Reachability Analyzer: Analyze network reachability between resources within your VPC
and external endpoints
VPC Sharing: Share your VPC resources with other AWS accounts in the same AWS
Organization,
Section 20:
Security
• Integration:
Integrates to other services (S3, databases, EBS volumes etc.)
• API calls:
Don't store secrets in code
• Owned by service
• Good choice unless you need to audit and manage keys
• Regulated industries
AWS KMS
Pricing
• Pricing:
▪ $1.00 per customer-managed key per month;
▪ $0.03 per 10,000 API requests.
• Key Rotation:
• Automatic: Free for AWS-managed keys.
• Manual: No extra charge for customer-managed keys; requires setup.
• Cross-Region Requests: $0.01 per 10,000 requests for using a KMS key in a
different region.
Cross-region
Keys are bound to the region in which they are created.
To move an encrypted volume from Region A to Region B, create a snapshot and copy it to Region B (re-encrypting it with a key in the destination region), then restore the volume there.
• Each set of related multi-Region keys has the same key material and key ID
• Create a multi-Region primary key ⇒ replicate it into Regions that you select
Use Cases
Set key policies and snapshot permissions so that a copy of an encrypted volume's snapshot can be accessed under the customer key.
AWS Macie
▪ Regulatory Compliance
Use Cases ▪ Security Monitoring
▪ Risk Assessment
AWS Secrets Manager – Cross-region replication
Enable replication on a primary secret in one region and add regions; replica secrets are created in those regions (e.g., Region B, Region C). A replica can be promoted to a standalone secret.
• ARN Consistency:
ARN remains the same
Primary: arn:aws:secretsmanager:Region1:123456789012:secret:MySecret-a1b2c3
Replica: arn:aws:secretsmanager:Region2:123456789012:secret:MySecret-a1b2c3
• Automatic Rotation:
If rotation is enabled on the primary secret, the updated secret values
automatically propagate to all replicas.
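A small boto3 sketch of reading a replicated secret from whichever region the caller runs in; the secret name is hypothetical.

import json
import boto3

def get_db_credentials(region: str) -> dict:
    # The same secret name works in every replica region.
    client = boto3.client("secretsmanager", region_name=region)
    resp = client.get_secret_value(SecretId="MySecret")   # hypothetical secret
    return json.loads(resp["SecretString"])

creds = get_db_credentials("eu-west-1")
print(creds["username"])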
Cross-account
Secrets can be shared across accounts
Configurable using policies
Account A attaches a resource policy to the secret (and can modify the access policy); Account B is then able to access the secret.
AWS Shield
• Protects against attacks on the network and transport layers (layers 3 and 4) and the
application layer (layer 7).
AWS Shield Advanced • Enhanced DDoS Protection: Guards against complex DDoS attacks.