Amazon Redshift Overview and Guide

The document provides an overview of Amazon Redshift, a fully managed cloud data warehousing service that enables users to analyze and visualize data from various sources. It covers key features such as serverless options, automatic scaling, and data sharing capabilities, as well as the architecture and instance types available. Additionally, it highlights the benefits of using Redshift Spectrum for querying external data stored in Amazon S3.

Amazon Redshift 101

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
1. AWS Services Overview
2. Redshift Overview
3. Redshift Getting Started
4. Lab
AWS Services Overview

What is the cloud?

Cloud computing lets you stop thinking of infrastructure as hardware, and instead think of it (and use it) as software.

• Programmable resources
• Dynamic abilities
• Pay as you go
How does it work?
AWS owns and maintains the network-connected hardware
You provision and use what you need

Service categories include Storage, Database, Business applications, Compute, Networking & content delivery, and Internet of Things.
Shared responsibility model
Customer responsibility (security in the cloud):
• Customer data
• Platform, applications, and identity and access management (IAM)
• Operating system, network, and firewall configuration
• Client-side data encryption and data integrity authentication
• Server-side encryption (file system/data)
• Network traffic protection (encryption, integrity, identity)

AWS responsibility (security of the cloud):
• AWS foundation services: Compute, Storage, Databases, Networking
• AWS global infrastructure: Regions, Availability Zones, edge locations
AWS global infrastructure
• A data center typically houses thousands of servers
• An Availability Zone (AZ) consists of one or more data centers and is designed for fault isolation (e.g. eu-west-1a, eu-west-1b, eu-west-1c)
• Each AWS Region is made up of two or more AZs (e.g. eu-west-1, Ireland)
• AWS has 32 Regions worldwide
AWS Global Infrastructure: Regions & AZs
(The number after each Region is its Availability Zone count; announced Regions are listed without counts; * marks the partner-operated China Regions.)

• N America: Canada Central 3, Oregon 4, GovCloud US-East 3, GovCloud US-West 3, Northern California 3, Northern Virginia 6, Ohio 3; announced: Canada West
• S America: São Paulo 3
• Europe: Frankfurt 3, Stockholm 3, Ireland 3, Zurich 3, London 3, Milan 3, Paris 3, Spain 3
• Middle East: Bahrain 3, Tel Aviv 3, UAE 3
• Africa: Cape Town 3
• Asia Pacific: *Beijing 3, Osaka 3, *Ningxia 3, Seoul 4, Hong Kong 3, Singapore 3, Hyderabad 3, Tokyo 4, Jakarta 3, Mumbai 3; announced: Malaysia, Thailand
• Australia & New Zealand: Melbourne 3, Sydney 3; announced: Auckland
AWS categories of services

Analytics; Application Integration; AR and VR; Blockchain; Business Applications; Compute; Cost Management; Customer Engagement; Database; Developer Tools; End User Computing; Game Tech; Internet of Things; Machine Learning; Management and Governance; Media Services; Migration and Transfer; Mobile; Storage; Robotics; Satellite; Networking and Content Delivery; Security, Identity, and Compliance
Core service areas
• Compute (e.g. Amazon EC2)
• Storage (e.g. Amazon S3, Amazon EBS)
• Databases (e.g. Amazon DynamoDB)
• Networking (e.g. Amazon VPC, Amazon Route 53)
• Security

A typical application combines these: users reach your application through Amazon Route 53 and an Amazon VPC, compute runs on Amazon EC2 with Amazon EBS volumes, and data lives in Amazon S3 and Amazon DynamoDB.
Amazon Elastic Compute Cloud (Amazon EC2)

• Complete control of your computing resources
• Resizable compute capacity
• Reduced time required to obtain and boot new server instances
• Over 750 types of compute instances
AWS storage options

• Amazon Simple Storage Service (Amazon S3): scalable, highly durable object storage in the cloud
• Amazon Elastic File System (Amazon EFS): scalable network file storage for Amazon EC2 instances
• Amazon S3 Glacier: low-cost, highly durable archive storage in the cloud
• Amazon Elastic Block Store (Amazon EBS): network-attached volumes that provide durable block-level storage for Amazon EC2 instances
Amazon S3

• Object-level storage
• Designed for 99.999999999% (eleven nines) durability
• Event triggers

Use cases:
• Content storage and distribution
• Backup and archiving
• Big data analytics
• Disaster recovery
• Static website hosting
Amazon EBS
• Persistent network-attached block storage for EC2 instances
• Different drive types
• Scalable
• Pay only for what you provision
• Snapshot functionality (e.g. daily snapshots) for backup and recovery
• Volumes can be detached and reattached to other EC2 instances
• Encryption available
Redshift Overview

Amazon Redshift
FULLY MANAGED, AI-POWERED CLOUD DATA WAREHOUSING

Data Insights
Analyze and
Transactional data
visualize data

Clickstream Amazon Redshift


Deliver real-time &
Unify data across databases, data lakes and data predictive analytics
warehouses with a zero-ETL approach
IoT telemetry

Build data-driven
Best-in-class security, applications
Application logs governance, and compliance

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
16
Amazon Redshift and Amazon Redshift Serverless
• Best price performance cloud data warehouse
• Interfaces and integrations: BI tools, Data API, Query Editor, Amazon Redshift Integration for Apache Spark, Redshift ML, and AWS Data Exchange (including third-party data exchanges)
• Automatic compute management with pay-for-use pricing (Serverless)
• Automatic scaling for consistent performance
• Scale and pay for compute and storage independently
• Workload isolation and chargeability with data sharing
• Near real-time data ingestion: streaming ingestion and zero-ETL
• Native data lake querying of Amazon S3 data (Parquet, ORC, JSON)
• Redshift Managed Storage backed by Amazon S3
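The streaming ingestion path above can be sketched in Redshift SQL. The stream name and IAM role below are hypothetical, and the exact stream columns depend on the payload:

```sql
-- Register a schema over a Kinesis Data Streams source (role is hypothetical).
CREATE EXTERNAL SCHEMA kds
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/MyStreamingRole';

-- A materialized view over the stream; refreshing it pulls new records
-- with low latency. JSON_PARSE stores each record as SUPER.
CREATE MATERIALIZED VIEW mv_clickstream AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS payload
FROM kds."my_click_stream";
```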
Customers – sample list
Tens of thousands of customers process exabytes of data with Amazon Redshift daily

• NTT DOCOMO: moved >10 PB of data from on premises to the cloud
• Warner Bros.: performance, scale, and cost-efficiency
• Yelp: enabling a data-driven organization with concurrency scaling
• Jack in the Box: improved operations by moving off of an on-premises DW
• Pfizer: provides scientists with near real-time analysis
Redshift cluster architecture
SQL clients and BI tools connect to the leader node over JDBC/ODBC.

Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing and ML optimizations
• The leader node is no-charge for clusters with 2+ nodes

Compute nodes
• Split into "slices"
• Local SSDs for caching
• Execute queries in parallel
• Load, unload, backup, and restore from S3

Redshift Managed Storage
• Resides in S3 (exabyte-scale object storage)
• Available across the entire Region
• Pay for space used (not provisioned)
• Scales independently of compute
Redshift instance types
• RA3 (current generation): solid-state disks + Amazon S3, using Amazon Redshift Managed Storage (RMS). A Redshift cluster can have up to 128 ra3.16xlarge nodes (16 PB of managed storage) and can support exabytes of data with its Redshift data lake support.
• Dense compute (DC2): solid-state disks

Additional documentation
• Working with clusters
Amazon Redshift Serverless
• Get started with analytics in seconds
• Experience better price-performance
• Save costs and stay on budget; pay for what you use

You focus on insights; Amazon Redshift Serverless takes care of the rest:
• Automatic provisioning
• Automatic scaling
• Automatic failover
• Automated patching
• Advanced monitoring
• Backup and recovery
• Routine maintenance
• Security and industry compliance
Amazon Redshift Serverless
• Access via JDBC/ODBC, the Data API, and Query Editor
• Intelligent and dynamic compute management
• Automatic workload management, automatic scaling, automatic tuning, and automatic maintenance
• ML-based workload monitoring
• Performance at scale; pay for use
• Redshift managed storage on Amazon S3 (Apache Parquet, ORC)
• Integrates with data sharing across clusters, streams, operational databases, Amazon SageMaker, Amazon S3, and AWS Lambda
Redshift Serverless or Provisioned: highlights

Provisioned
• Cluster of compute nodes
• Greater control of configuration and workload management
• Predictable cost
• Discounts with Reserved Instances

Serverless
• A workgroup is a collection of compute resources
• Workgroup resources are measured in Redshift Processing Units (RPUs)
• Simplified management
• Pay for use
Redshift Spectrum overview

Redshift Spectrum is a feature of Redshift that allows SQL queries on external data stored in Amazon S3. Queries run directly against data in S3 using thousands of Spectrum nodes.

Benefits
• Enables the modern data architecture pattern to query exabytes of data in an S3 data lake
• Data is queried in place; no loading of data
• Keeps your data warehouse lean by ingesting warm data locally while keeping other data in the data lake within reach
• Write query results from Redshift directly to S3 external tables
• Create materialized views on S3 data using Redshift Spectrum queries
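As a sketch of the Spectrum workflow described above (the schema, table, Glue database, S3 path, and IAM role are all hypothetical names):

```sql
-- Register an external schema backed by the AWS Glue Data Catalog.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over Parquet files in S3; no data is loaded.
CREATE EXTERNAL TABLE spectrum_schema.clicks (
  user_id BIGINT,
  url     VARCHAR(2048),
  ts      TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-bucket/clicks/';

-- Query S3 data in place, joining with a (hypothetical) local Redshift table.
SELECT u.name, COUNT(*)
FROM spectrum_schema.clicks c
JOIN users u ON u.user_id = c.user_id
GROUP BY u.name;
```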
Life of a query

Example query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY …

1. Query is submitted via JDBC/ODBC
2. Query is optimized and compiled at the leader node, which determines what gets run locally and what goes to Amazon Redshift Spectrum
3. Query plan is sent to all compute nodes
4. Compute nodes dynamically prune partitions
5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer
6. Amazon Redshift Spectrum nodes scan S3 data, using the Glue Data Catalog, a Hive metastore, or Lake Formation for table metadata
7. Amazon Redshift Spectrum projects, filters, joins, and aggregates
8. Final aggregations and joins with local Amazon Redshift tables are done in-cluster
9. Result is sent back to the client
Data storage in Redshift
• Data loaded into Redshift is stored in Redshift Managed Storage (RMS); storage is columnar
• Structured and semi-structured data can be loaded
• Amazon Redshift is ANSI SQL and ACID compliant
• Does not require indexes or DB hints; it leverages sort keys, distribution keys, and compression instead to achieve fast performance through parallelism and efficient data storage
• Data is organized as: namespace > database > schema > objects
  • One namespace per endpoint; a namespace can contain many databases, each database many schemas, and each schema holds the code and data objects
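The namespace > database > schema > objects hierarchy can be illustrated with hypothetical names (the three-part database.schema.object reference assumes cross-database query support, available on RA3 and Serverless):

```sql
CREATE DATABASE salesdb;
-- After connecting to salesdb:
CREATE SCHEMA reporting;
CREATE TABLE reporting.daily_totals (  -- an object within the schema
  sale_dt DATE,
  total   DECIMAL(12,2)
);
-- Objects can be referenced with qualified names, including from
-- another database in the same namespace:
SELECT * FROM salesdb.reporting.daily_totals;
```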
Data sharing with Amazon Redshift
• Instant, secure, and live data sharing across Redshift data warehouses
• Within and across AWS accounts, and across AWS Regions
• Live and transactionally consistent
• Flexible multi-cluster and data mesh architectures, spanning Redshift warehouses and the Amazon S3 data lake
• Serves BI and analytics apps, machine learning, and data processing and advanced analytics workloads
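A minimal producer/consumer sketch of data sharing. The share name, schema, account ID, and namespace GUID are hypothetical, and cross-account shares additionally require authorization by an administrator:

```sql
-- On the producer warehouse:
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA reporting;
ALTER DATASHARE sales_share ADD TABLE reporting.daily_totals;
GRANT USAGE ON DATASHARE sales_share TO ACCOUNT '123456789012';

-- On the consumer warehouse: surface the share as a local database.
CREATE DATABASE sales_from_share FROM DATASHARE sales_share
  OF ACCOUNT '123456789012' NAMESPACE 'producer-namespace-guid';

-- Queries against the share are live and transactionally consistent.
SELECT * FROM sales_from_share.reporting.daily_totals;
```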
Lab Setup

Redshift Getting Started

Redshift: use popular data models

Redshift can be used with a number of data models. The STAR schema (highly denormalized) is the most common; the snowflake schema is less common.

A commonly used data model with Amazon Redshift is the STAR schema, which separates data into large fact and dimension (dim) tables:
• Facts refer to specific events (e.g. "order submitted"), and fact tables hold summary detail for those events, e.g. the high-level attributes of an order submitted, such as order_id, order_dt, product_id, and total_cost. Fact tables use foreign keys to link to dim tables.
• The dimensions that make up a fact often have attributes themselves that are more efficiently stored in separate dim tables. For example, a fact might contain a product_id, but the actual product details would be contained in a separate products dim table (e.g. product_price, height_cm, width_cm, and product_id are columns that might be found in a products dim table).

Best practice: avoid highly normalized models. Models such as 3NF resemble the STAR schema but have much more table normalization and are typically more appropriate for OLTP systems.
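The order example above can be sketched as a minimal STAR schema. The tables are hypothetical, and the distribution and sort keys are illustrative choices:

```sql
-- Small dimension table, replicated to every node for cheap joins.
CREATE TABLE dim_products (
  product_id    INT PRIMARY KEY,   -- informational constraint
  product_price DECIMAL(10,2),
  height_cm     SMALLINT,
  width_cm      SMALLINT
)
DISTSTYLE ALL;

-- Large fact table; rows hold event-level summary detail.
CREATE TABLE fact_orders (
  order_id   BIGINT,
  order_dt   DATE SORTKEY,                              -- common filter column
  product_id INT REFERENCES dim_products (product_id),  -- FK to the dim table
  total_cost DECIMAL(12,2)
)
DISTKEY (product_id);  -- collocate join rows on the same slice
```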
Redshift datatypes

Scalar datatypes:
• Numeric types: integer types (SMALLINT, INT, BIGINT), DECIMAL/NUMERIC, and floating point types (REAL, DOUBLE PRECISION)
• Character types: CHAR, VARCHAR, NCHAR, TEXT, BPCHAR
• Datetime types: DATE, TIME, TIMETZ, TIMESTAMP, TIMESTAMPTZ
• Other scalar types: BOOLEAN, HLLSKETCH, GEOMETRY, VARBYTE

Vector datatype:
• SUPER
Semi-structured data – SUPER datatype

• Easy, efficient, and powerful JSON processing
• Fast row-oriented data ingestion
• Fast column-oriented analytics with materialized views over SUPER/JSON
• Access to schema-less nested data with easy-to-use SQL extensions powered by the PartiQL query language

Example customers table with columns id (INTEGER), name (SUPER), and phones (SUPER):

id | name                                                 | phones
 1 | {"given":"Jane", "family":"Doe"}                     | [{"type":"work", "num":"9255550100"}, {"type":"cell", "num":"6505550101"}]
 2 | {"given":"Richard", "family":"Roe", "middle":"John"} | [{"type":"work", "num":"5105550102"}]

SELECT c.name.given AS firstname, c.name.middle AS middlename, ph.num
FROM customers c, c.phones ph
WHERE ph.type = 'work';

firstname | middlename | num
----------+------------+-----------
"Jane"    | null       | 9255550100
"Richard" | "John"     | 5105550102
Row-store vs column-store
• Row storage (e.g. MySQL): all row fields are stored together on disk (typically in a sequential file)
  • Accessing a column (example: scanning the SSN of all residents) with row storage:
    • Scans every column in every row of the table
    • Incurs unnecessary I/O and caching overhead
• Column storage (e.g. Amazon Redshift): each table column is stored separately on disk (typically in a separate file or set of files)
  • Accessing a column (example: scanning the SSN of all residents) with columnar storage:
    • Only scans blocks for the relevant column(s)
    • Significantly less I/O
Row-store read vs column-store read
Given the following table definition and data for the deep_dive table, how will a simple SQL query behave in a row-based data store, and then in a column-based store?

CREATE TABLE deep_dive (
  aid INT        --airport_id
  ,loc CHAR(3)   --location
  ,dt DATE       --date
);

SELECT min(dt) FROM deep_dive;

Row-based storage behavior:
• Needs to read everything
• Unnecessary I/O

Column-based storage behavior:
• Only scans blocks for the relevant column
• Significantly less I/O
Materialized views
• Improve performance of complex, SLA-sensitive, predictable, and repeated queries using materialized views
• A materialized view persists the result set of the associated SQL
• Materialized views can be refreshed automatically or manually
• Redshift automatically determines the best way to update data in the materialized view (incremental or full refresh)
• Automatic query rewrite leverages relevant materialized views and can improve query performance by order(s) of magnitude
• Automated materialized views: Redshift continuously monitors the workload to identify queries that will benefit from having an MV, and automatically creates and manages MVs for them

Redshift materialized views are created using the CREATE statement and can be included (the default) or excluded from Redshift backups. Materialized views can also have table attributes such as distribution style and sort keys, and can be refreshed at any time:

CREATE MATERIALIZED VIEW mv_name
[ BACKUP { YES | NO } ]
[ table_attributes ]
AS query

REFRESH MATERIALIZED VIEW mv_name;
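A hypothetical concrete instance of the syntax above, with table attributes and automatic refresh (the fact_orders table and its columns are illustrative names):

```sql
CREATE MATERIALIZED VIEW mv_daily_revenue
BACKUP YES                 -- include in Redshift backups (the default)
DISTKEY (product_id)
SORTKEY (order_dt)
AUTO REFRESH YES           -- let Redshift refresh it automatically
AS
SELECT order_dt, product_id, SUM(total_cost) AS revenue
FROM fact_orders
GROUP BY order_dt, product_id;

-- Manual refresh remains available at any time:
REFRESH MATERIALIZED VIEW mv_daily_revenue;
```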
Redshift Best Practices

Table design best practices
• Redshift performance is about efficient I/O:
  • Make columns only as wide as they need to be
  • Define primary key and foreign key constraints
  • Let COPY choose compression encodings
  • Choose the best distribution style
  • Choose the best sort key (AUTO vs timestamp vs filtering vs frequent joins)
• Use appropriate data types:
  • Use date/time data types for date columns
  • Multibyte characters: use the VARCHAR data type for UTF-8 multibyte character support (up to a maximum of four bytes per character)
  • Spatial data can be natively stored, retrieved, and processed using the GEOMETRY data type and spatial functions

Additional documentation
• Best Practices for Designing Tables
• Querying Spatial Data in Redshift
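The guidance above, applied to a hypothetical table (the table, columns, and the referenced users table are illustrative names):

```sql
CREATE TABLE page_views (
  view_id   BIGINT IDENTITY(1,1),
  user_id   INT NOT NULL,
  page_url  VARCHAR(2048),         -- only as wide as needed; VARCHAR for UTF-8
  viewed_at TIMESTAMP NOT NULL,    -- date/time type, not a string
  PRIMARY KEY (view_id),           -- informational; used by the planner
  FOREIGN KEY (user_id) REFERENCES users (user_id)
)
DISTKEY (user_id)                  -- distribute on a frequent join column
SORTKEY (viewed_at);               -- sort on the common filter column
-- Compression: load with COPY and let it choose encodings,
-- or rely on ENCODE AUTO (the default for new tables).
```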
Data loading best practices
• Use the COPY command to load data whenever possible
  • Use a single COPY command per table
  • Writes are serial per table; commits are serial per cluster
• Use multi-row inserts if COPY is not possible
• Bulk insert operations (INSERT INTO...SELECT and CREATE TABLE AS) provide high-performance data insertion
• Enforce primary, unique, or foreign key constraints outside of Redshift
• Wrap workflow/statements in an explicit transaction
• Consider using TRUNCATE instead of DELETE
• Use ALTER TABLE APPEND to move rows faster from a source table to a target table
• Staging tables:
  • Use a temporary table, or a permanent table with the "BACKUP NO" option
  • Use CREATE TABLE LIKE to mirror compression settings
  • Define the same key column as DISTSTYLE KEY between the staging and production tables

ETL best practices
• Staging tables are more performant when created using CREATE TABLE LIKE instead of SELECT INTO #my_temp_table
• Merge operations should be performed via INSERT/UPDATE from the staging table to the target table, with deduplication

Additional documentation
• Data Loading Best Practices
• Loading Data from S3
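A sketch of the staging-table load pattern above. The bucket, IAM role, and table names are hypothetical:

```sql
BEGIN;  -- explicit transaction around the workflow

-- Staging table mirrors the target, including compression and DISTKEY.
CREATE TEMP TABLE stage_orders (LIKE fact_orders);

-- Single COPY per table; Redshift parallelizes across the input files.
COPY stage_orders
FROM 's3://my-bucket/orders/2023-12-26/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyCopyRole'
FORMAT AS PARQUET;

-- Deduplicating merge: delete matching rows, then insert the new batch.
DELETE FROM fact_orders
USING stage_orders
WHERE fact_orders.order_id = stage_orders.order_id;

INSERT INTO fact_orders SELECT * FROM stage_orders;

COMMIT;
```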
Unloading data: the UNLOAD command

• UNLOAD is the reverse of COPY: it outputs data from Amazon Redshift to S3
• Runs from a SELECT statement; an ORDER BY clause is respected by UNLOAD if PARALLEL OFF
• Encryption and compression are handled automatically
• Runs in parallel on all compute nodes

UNLOAD output
• CSV, JSON, or Parquet (data lake export) file formats
• Generates one or more files per slice across all compute nodes
• The maximum file size written to S3 can be controlled (internal maximum 6.2 GB)
• Generates a manifest for all unloaded files (useful for COPY into another cluster)
• Controls whether files can overwrite existing locations or not

Syntax:
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn' [ option [ ... ] ]
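A hypothetical example of the command: export filtered query results to S3 as Parquet, capping the file size and writing a manifest (bucket, role, and table names are illustrative):

```sql
UNLOAD ('SELECT order_dt, product_id, total_cost
         FROM fact_orders
         WHERE order_dt >= ''2023-01-01''')
TO 's3://my-bucket/exports/orders_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyUnloadRole'
FORMAT AS PARQUET
MAXFILESIZE 256 MB     -- cap the size of each file written to S3
MANIFEST;              -- write a manifest listing all output files
```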
Query SQL best practices
• Avoid using SELECT *; include only the columns you specifically need, to reduce I/O
• Use a CASE expression to perform complex aggregations instead of selecting from the same table multiple times
• If you use both GROUP BY and ORDER BY clauses, make sure that you put the columns in the same order in both
• Use subqueries in cases where one table in the query is used only for predicate conditions and the subquery returns a small number of rows (less than about 200). The following example uses a subquery to avoid joining the LISTING table.

Use
select sum(sales.qtysold) from sales
where salesid in (
  select listid from listing where listtime > '2023-12-26'
);

Instead of
select sum(sales.qtysold) from sales
join listing on sales.salesid = listing.listid
where listing.listtime > '2023-12-26';

Additional documentation
• Best Practices for Designing Queries
• Redshift SQL Reference
Query SQL best practices
Joins:
• Don't use cross-joins unless absolutely necessary
• Use distribution keys as join columns

Vacuum and analyze:
• Redshift automatically performs VACUUM and ANALYZE in the background during periods of low workload
• Redshift users are still empowered to explicitly invoke VACUUM and then ANALYZE as part of their workloads
• Explicitly invoking VACUUM and then ANALYZE ensures that a table is sorted, defragmented, and analyzed immediately and with priority, for the benefit of the next steps in a workflow
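For example, to maintain a table explicitly after a large load (the table name is hypothetical):

```sql
VACUUM FULL fact_orders;   -- re-sort rows and reclaim space from deleted rows
ANALYZE fact_orders;       -- refresh planner statistics for the table
```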
Query SQL best practices
Query predicates:
• Use predicates to restrict the dataset as much as possible, and use sort keys in the predicates
• In the predicate, use the least expensive operators that you can:
  • Comparison condition operators are preferable to LIKE operators
  • LIKE operators are still preferable to SIMILAR TO or POSIX operators
• Avoid using functions in query predicates
• Add predicates to filter tables that participate in joins, even if the predicates apply the same filters

Use
select listing.sellerid, sum(sales.qtysold)
from sales, listing
where sales.salesid = listing.listid
and listing.listtime > '2008-12-01'
and sales.saletime > '2008-12-01'
group by 1 order by 1;

Instead of
select listing.sellerid, sum(sales.qtysold)
from sales, listing
where sales.salesid = listing.listid
and listing.listtime > '2008-12-01'
group by 1 order by 1;
Query Editor v2 best practices
• For large SQL statements (>30K characters), use notebooks
• Notebooks run SQL statements one at a time; the editor can run SQL statements in parallel
• Minimize the number of open Query Editor windows
• Close sessions once complete; don't leave connections open
• Queries continue to run even after closing windows

Additional documentation
• Using Amazon Redshift Query Editor v2
Query Editor v2 Demo

Lab

Thank you!

