Amazon Redshift Overview and Guide
© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
1. AWS Services Overview
2. Redshift Overview
AWS Services Overview
What is the cloud?
How does it work?
AWS owns and maintains the network-connected hardware
You provision and use what you need
Shared responsibility model
Customer data
AWS global infrastructure
Data center
• Typically houses thousands of servers
Availability Zone (AZ)
• One or more data centers
• Designed for fault isolation
Region
• Each AWS Region is made up of two or more AZs
• AWS has 32 Regions worldwide
Example: Region eu-west-1 (Ireland) with AZs eu-west-1a, eu-west-1b, eu-west-1c
AWS Global Infrastructure Regions & AZs
Number of AZs shown after each Region name
N AMERICA: Canada Central 3, Canada West, GovCloud US-East 3, GovCloud US-West 3, Northern California 3, Northern Virginia 6, Ohio 3, Oregon 4
EUROPE: Frankfurt 3, Ireland 3, London 3, Milan 3, Paris 3, Spain 3, Stockholm 3, Zurich 3
ASIA PACIFIC: *Beijing 3, Hong Kong 3, Hyderabad 3, Jakarta 3, Malaysia, Melbourne 3, Mumbai 3, *Ningxia 3, Osaka 3, Seoul 4, Singapore 3, Sydney 3, Thailand, Tokyo 4, Auckland
MIDDLE EAST: Bahrain 3, Tel Aviv 3, UAE 3
AFRICA: Cape Town 3
S AMERICA
Amazon Elastic Compute Cloud (Amazon EC2)
AWS storage options
Amazon S3
• Object-level storage
• Designed for 99.999999999% durability
• Event triggers
Use cases
• Content storage and distribution
• Backup and archiving
• Big data analytics
• Disaster recovery
• Static website hosting
Amazon EBS
• Persistent network-attached block storage for instances
• Different drive types
• Scalable
• Pay only for what you provision
• Snapshot functionality
Diagram: EC2 instances and Amazon EBS volumes in the AWS Cloud, with snapshots labeled Monday through Friday
Redshift Overview
Amazon Redshift
FULLY MANAGED, AI-POWERED CLOUD DATA WAREHOUSING
Data
• Transactional data
• Application logs
Insights
• Analyze and visualize data
• Build data-driven applications
Best-in-class security, governance, and compliance
Amazon Redshift
Best price performance cloud DW
• BI tools
• Data API
• Query Editor
• Spark and ML: Integration for Apache Spark, Amazon Redshift ML
• Amazon Data Exchange: third-party data exchanges
Amazon Redshift Serverless
• Automatic compute management
• Pay for use
• Automatic scaling of compute for consistent performance
Customers – sample list
Tens of thousands of customers process exabytes of data with Amazon Redshift daily
Redshift cluster architecture
SQL Clients / BI Tools connect to the leader node over JDBC/ODBC
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing & ML optimizations
• Leader node is no-charge for clusters with 2+ nodes
Compute nodes
• Split into “Slices”
• Local SSDs for caching
• Executes queries in parallel
• Load, unload, backup, restore from S3
Redshift Managed Storage
• Resides in S3
• Available across entire Region
• Pay for space used (not provisioned)
• Scales independently of Compute
Amazon S3
• Exabyte-scale object storage
Redshift instance types
Additional Documentation
• Working with clusters
Amazon Redshift Serverless: YOU focus on insights
• Get started with analytics in seconds
• Experience better price-performance
• Save costs and stay on budget
• Pay for what you use
• Automatic provisioning
• Advanced monitoring
Amazon Redshift Serverless
• ML-based workload monitoring
• Automatic workload management
• Automatic scaling
• Automatic tuning
• Automatic maintenance
• Performance at scale
• Pay for use
Diagram labels: Data sharing clusters, Streams, Operational Databases, Amazon SageMaker, Storage
Redshift Spectrum Overview
Redshift Spectrum is a feature of Redshift that allows SQL queries on external data stored in Amazon S3
Benefits
• Run SQL queries directly against data in S3 using thousands of nodes
Diagram
• Step 6: Amazon Redshift Spectrum nodes (1, 2, 3, 4, ..., N) scan S3 data
• Step 7: Spectrum projects, filters, joins and aggregates
• Catalogs: Glue Data Catalog, Hive metastore, Lake Formation
• Storage: Amazon S3
Data storage in Redshift
• Data loaded into Redshift is stored in Redshift Managed Storage (RMS) in columnar format
• Structured and semi-structured data can be loaded
• Amazon Redshift is ANSI SQL and ACID compliant
• Does not require indexes or database hints; it relies instead on sort keys, distribution keys, and compression to achieve fast performance through parallelism and efficient data storage
• Data is organized as: Namespace > database > schema > objects
Data sharing with Amazon Redshift
• Instant, secure, and live data sharing across Redshift data warehouses
• Within and across AWS accounts and across AWS Regions
• Live and transactionally consistent
Diagram: Amazon Redshift data warehouses sharing data for BI and analytics apps, machine learning, and data processing & advanced analytics
Lab Setup
Redshift Getting Started
Redshift: Use Popular Data Models
Redshift Datatypes
Scalar Datatypes
• INT
• DOUBLE PRECISION
• NCHAR
• BPCHAR
• TIMETZ
• TIMESTAMPTZ
Vector Datatype
Semi-structured data – SUPER datatype
• Easy, efficient, and powerful JSON processing
• Fast row-oriented data ingestion
• Fast column-oriented analytics with materialized views over SUPER/JSON
Example rows
1  {"given":"Jane", "family":"Doe"}  [{"type":"work", "num":"9255550100"}, {"type":"cell", "num": 6505550101}]
2  {"given":"Richard", "family":"Roe", "middle":"John"}  [{"type":"work", "num": 5105550102}]
Row Store vs Column Store
• Row storage (e.g. MySQL): all row fields are stored together on disk (typically in a sequential file)
• Accessing a column (example: scanning SSN of all residents) with row storage:
• Scans every column in every row of the table
• Results in unnecessary I/O and caching overhead
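The I/O argument above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the resident table, its field names, and the three helper functions are hypothetical, and counting "fields read" is a rough stand-in for disk I/O, not a model of how Redshift stores blocks.

```python
# Hypothetical table of residents; the names and values are illustrative only.
ROWS = [
    ("000-00-0001", "Jane", "Doe", "Seattle"),
    ("000-00-0002", "Richard", "Roe", "Dublin"),
    ("000-00-0003", "Mary", "Major", "Tokyo"),
]
COLUMNS = ("ssn", "given", "family", "city")


def scan_row_store(rows, column_index):
    """Row layout: the fields of a row sit together, so every field is read."""
    values, fields_read = [], 0
    for row in rows:
        for index, field in enumerate(row):
            fields_read += 1
            if index == column_index:
                values.append(field)
    return values, fields_read


def to_column_store(rows, columns):
    """Pivot the rows so that each column's values are stored together."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def scan_column_store(store, column):
    """Column layout: only the requested column is read."""
    values = list(store[column])
    return values, len(values)


row_values, row_fields_read = scan_row_store(ROWS, COLUMNS.index("ssn"))
col_values, col_fields_read = scan_column_store(to_column_store(ROWS, COLUMNS), "ssn")
print(row_fields_read, col_fields_read)
```

Both scans return the same SSN values, but the row layout touches all 12 fields while the column layout touches only the 3 that were asked for.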
Materialized Views
• Improve performance of complex, SLA sensitive, predictable and repeated queries using Materialized views
• Materialized view persists the result set of the associated SQL
• Materialized views can be refreshed automatically or manually
• Redshift automatically determines best way to update data in the materialized view (incremental or full refresh)
• Automatic query rewrite leverages relevant materialized views and can improve query performance by order(s) of magnitude
• Automated materialized views: Redshift continuously monitors workload to identify queries that will benefit from having a MV and automatically creates and manages MVs for them
Redshift Materialized Views
Materialized views can be created using the CREATE statement, and can be included (default) or excluded from Redshift backups. Materialized views can also have table attributes such as dist style and sort keys, and be refreshed at any time
CREATE MATERIALIZED VIEW mv_name
[ BACKUP { YES | NO } ]
[ table_attributes ]
AS query
Table Design Best Practices
• Redshift performance is about efficient I/O
• Make columns only as wide as they need to be
• Define primary key and foreign key constraints
• Let COPY choose compression encodings
• Choose the best distribution style
• Choose the best sort key
• AUTO vs Timestamp vs Filtering vs Frequent Joins
Additional Documentation
• Best Practices for Designing Tables
• Querying Spatial Data in Redshift
Data Loading Best Practices
• Use COPY command to load data whenever possible
• Use a single COPY command per table
• Writes are serial per table
• Commits are serial per cluster
• Use multi-row inserts if COPY is not possible
• Bulk insert operations (INSERT INTO...SELECT and CREATE TABLE AS) provide high performance data insertion
• Enforce Primary, Unique or Foreign Key constraints outside of Redshift
• Wrap workflow/statements in an explicit transaction
• Consider using TRUNCATE instead of DELETE
• ALTER TABLE APPEND to move rows faster from source to target table.
• Staging Tables
• Use temporary or permanent table with “BACKUP NO” option
• CREATE TABLE LIKE to mirror compression settings.
• Define same key column as DISTSTYLE KEY between staging and production table.
ETL Best Practices
▪ Staging tables are more performant when created using CREATE TABLE LIKE instead of SELECT INTO #my_temp_table
▪ Merge operations should be performed via INSERT/UPDATE to target table with deduplication
Additional Documentation
▪ Data Loading Best Practices
▪ Loading Data from S3
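Two of the practices above, multi-row inserts and an explicit transaction, can be sketched in Python. This is a minimal illustration, not a Redshift client: it uses the standard-library `sqlite3` module as a local stand-in so the example runs anywhere, and `multi_row_insert_sql`, the `sales_staging` table, and its columns are hypothetical names. Placeholder style (`?` here) varies by database driver.

```python
import sqlite3


def multi_row_insert_sql(table, columns, row_count):
    """Build one INSERT with a VALUES tuple per row (placeholders only)."""
    one_row = "(" + ", ".join("?" for _ in columns) + ")"
    all_rows = ", ".join([one_row] * row_count)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {all_rows}"


rows = [(1, "books", 12.5), (2, "music", 7.0), (3, "games", 30.0)]
sql = multi_row_insert_sql("sales_staging", ["salesid", "category", "pricepaid"], len(rows))

# isolation_level=None puts sqlite3 in autocommit mode, so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE sales_staging (salesid INTEGER, category TEXT, pricepaid REAL)")

conn.execute("BEGIN")
# One statement carries all three rows instead of three single-row INSERTs.
conn.execute(sql, [value for row in rows for value in row])
conn.execute("COMMIT")

loaded = conn.execute("SELECT COUNT(*) FROM sales_staging").fetchone()[0]
print(loaded)
```

Batching rows into one statement and committing once matters more on Redshift than locally, since the slide notes that commits are serial per cluster.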
Unloading Data: UNLOAD Command
• UNLOAD output
• CSV, JSON or Parquet (Data Lake Export) file formats
• Generates one or more files per slice for all compute nodes
• Max file size written on S3 can be controlled (max internal limit 6.2GB)
• Generates a manifest for all unloaded files (useful for COPY into another cluster)
• Control if files can overwrite existing locations or not
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn' [ option [ ... ] ]
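The options listed above map onto UNLOAD clauses, which the following Python sketch assembles into statement text. `build_unload` is a hypothetical helper (not part of any AWS SDK), and the bucket name and role ARN are placeholders. It only builds the string; running it would need a connection to a Redshift data warehouse.

```python
def build_unload(select_sql, s3_prefix, iam_role_arn, file_format="PARQUET",
                 manifest=True, allow_overwrite=False, max_file_size_mb=None):
    """Assemble an UNLOAD statement as text (hypothetical helper)."""
    if file_format not in ("CSV", "JSON", "PARQUET"):
        raise ValueError(f"unsupported format: {file_format}")
    # UNLOAD takes the query as a string literal, so embedded single quotes are doubled.
    quoted_query = select_sql.replace("'", "''")
    clauses = [
        f"UNLOAD ('{quoted_query}')",
        f"TO '{s3_prefix}'",
        f"IAM_ROLE '{iam_role_arn}'",
        f"FORMAT AS {file_format}",
    ]
    if manifest:
        clauses.append("MANIFEST")  # writes a manifest listing the unloaded files
    if allow_overwrite:
        clauses.append("ALLOWOVERWRITE")  # permits replacing existing objects
    if max_file_size_mb is not None:
        clauses.append(f"MAXFILESIZE {max_file_size_mb} MB")
    return "\n".join(clauses) + ";"


statement = build_unload(
    "select * from sales where saletime > '2023-12-26'",
    "s3://example-bucket/unload/sales_",                    # placeholder bucket
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",   # placeholder role
    max_file_size_mb=256,
)
print(statement)
```

Leaving `allow_overwrite` off by default mirrors the cautious choice: the unload fails rather than silently replacing files already at the target prefix.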
Query SQL Best Practices
• Avoid using select *. Include only the columns you specifically need to reduce I/O
• Use a CASE expression to perform complex aggregations instead of selecting from the
same table multiple times.
• If you use both GROUP BY and ORDER BY clauses, make sure that you put the columns in
the same order in both.
• Use subqueries in cases where one table in the query is used only for predicate conditions
and the subquery returns a small number of rows (less than about 200). The following
example uses a subquery to avoid joining the LISTING table.
Use
select sum([Link]) from sales
where salesid in (
select listid from listing where listtime > '2023-12-26'
);
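The CASE recommendation in the list above can also be illustrated. The sketch below is an assumption-laden demo: it runs against an in-memory `sqlite3` database rather than Redshift, and the `category` column and sample rows are hypothetical. It shows that one pass with CASE expressions gives the same totals as querying the same table once per category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (salesid INTEGER, category TEXT, qtysold INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "books", 2), (2, "music", 1), (3, "books", 4), (4, "games", 3)],
)

# One scan: each CASE expression adds a row's quantity to at most one bucket.
single_pass = conn.execute(
    """
    SELECT SUM(CASE WHEN category = 'books' THEN qtysold ELSE 0 END) AS books_qty,
           SUM(CASE WHEN category = 'music' THEN qtysold ELSE 0 END) AS music_qty
    FROM sales
    """
).fetchone()

# Same totals, but the table is read once per category.
multi_pass = (
    conn.execute("SELECT SUM(qtysold) FROM sales WHERE category = 'books'").fetchone()[0],
    conn.execute("SELECT SUM(qtysold) FROM sales WHERE category = 'music'").fetchone()[0],
)
print(single_pass, multi_pass)
```

On a large table the difference is the number of scans, which is where the I/O saving comes from.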
Query SQL Best Practices
Join:
• Don't use cross-joins unless absolutely necessary
• Use distribution keys as join columns
Query SQL Best Practices
Query Predicate:
• Use predicates to restrict the dataset as much as possible and use sort keys in the predicates
• In the predicate, use the least expensive operators that you can.
• Comparison condition operators are preferable to LIKE operators.
• LIKE operators are still preferable to SIMILAR TO or POSIX operators.
• Avoid using functions in query predicates.
• Add predicates to filter tables that participate in joins, even if the predicates apply the same filters
Use
select [Link], sum([Link])
from sales, listing
where [Link] = [Link]
and [Link] > '2008-12-01'
and [Link] > '2008-12-01'
group by 1 order by 1;
Instead of
select [Link], sum([Link])
from sales, listing
where [Link] = [Link]
and [Link] > '2008-12-01'
group by 1 order by 1;
Query Editor v2 Best Practices
• For large SQLs (>30k characters), use Notebooks
• Notebooks run SQL one-at-a-time. Editor can run SQLs in parallel
• Minimize the number of open Query Editor windows
• Close sessions once complete - Don’t leave connections open
• Queries continue to run, even after closing windows
Additional Documentation
• Using Amazon Redshift Query Editor v2
Query Editor v2 Demo
Lab
Thank you!