Snowflake Overview

Snowflake provides a scalable data platform through its unique hybrid architecture. This architecture separates storage and compute into three layers - the cloud services layer, query processing compute layer, and centralized database storage layer. The cloud services layer manages security, metadata, and query results. Virtual warehouses in the compute layer provide scalable compute resources. Data remains centralized in the storage layer, providing logical separation of storage and compute.

Uploaded by anupsnair
© All Rights Reserved

Chapter 2.

Creating and
Managing the Snowflake
Architecture
• A decade ago, data platform architectures lacked the scalability necessary to
make it easier for data-driven teams to share the same data simultaneously
regardless of the size of the team or their proximity to the data
• Snowflake is an evolutionary modern data platform that solved the
scalability problem
• Snowflake’s Data Cloud provides users with a unique experience by
combining a new SQL query engine with an innovative architecture
designed and built, from the ground up, specifically for the cloud
Prep Work
• Create a new worksheet titled Chapter2 Creating and Managing Snowflake Architecture
• Refer to “Navigating Snowsight Worksheets” if you need help creating a new worksheet
• To set the worksheet context, make sure you are using the SYSADMIN role and the COMPUTE_WH virtual warehouse
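The worksheet context described above can be set with two statements (the role and warehouse names are the ones given in the prep work):

```sql
-- Set the worksheet context before running the chapter's examples.
USE ROLE SYSADMIN;
USE WAREHOUSE COMPUTE_WH;
```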
Traditional Data Platform Architectures
• In this section, we’ll briefly review some traditional data platform architectures and how they were designed in an attempt to improve scalability
• Scalability is the ability of a system to handle an increasing amount of work
• Afterward, we will learn about each of the three Snowflake architecture layers in detail: the cloud services layer, the query processing compute layer, and the centralized database storage layer
Shared-Disk
Architecture
• The shared-disk architecture was an early scaling approach
designed to keep data stored in a central storage location and
accessible from multiple database cluster nodes
• This architecture is a traditional database design and is known
for the simplicity of its data management
• While the approach is simple in theory, it requires complex on-disk locking mechanisms to ensure data consistency, which, in turn, causes bottlenecks
Shared-Nothing
Architecture
• The shared-nothing architecture, in which storage and compute are scaled together, was designed in response to the bottleneck created by the shared-disk architecture
• This evolution in architecture was made possible because storage had become relatively inexpensive
• However, distributed cluster nodes, along with the associated disk storage, CPU, and memory, require data to be shuffled between nodes, which adds overhead
NoSQL Alternatives
• Most NoSQL solutions rely on a shared-nothing architecture; thus, they have
many of the same limitations
• NoSQL, a term that implies “Not Only SQL” rather than “No to SQL,” is a good
choice for storing email, web links, social media posts and tweets, road maps,
and spatial data
• There are four types of NoSQL databases: document stores, key-value stores,
column family data stores or wide column data stores, and graph databases
• Document-based NoSQL databases such as MongoDB store data in JSON objects
where each document has key-value pair–like structures
The Snowflake Architecture
• The unique Snowflake design physically separates but logically integrates storage and compute, along with providing services such as security and management
• The Snowflake hybrid-model architecture is composed of three layers, which are shown in Figure 2-3: the cloud services layer, the compute layer, and the data storage layer
• Snowflake’s processing engine is native SQL, and as we will see in later chapters, Snowflake is also able to handle semi-structured and unstructured data
The Cloud Services Layer
• All interactions with data in a Snowflake instance begin in the cloud services layer, also called the global services layer
• The Snowflake cloud services layer is a collection of services that coordinate activities such as authentication, access control, and encryption
• The cloud services layer is sometimes referred to as the Snowflake brain because all the various service layer components work together to handle user requests, beginning from the time a user requests to log in
Managing the Cloud Services Layer
• The cloud services layer manages data security, including the security for data sharing
• The Snowflake cloud services layer runs across multiple availability zones in each cloud provider region and holds the result cache, a cached copy of the executed query results
• The metadata required for query optimization or data filtering is also stored in the cloud services layer
NOTE
• Just like the other Snowflake layers, the cloud services layer
will scale independently of the other layers
• Scaling of the cloud services layer is an automated process that
doesn’t need to be directly manipulated by the Snowflake end
user
Billing for the Cloud Services Layer
• Most cloud services consumption is already incorporated into Snowflake pricing
• DDL operations are metadata operations and, as such, use only cloud services
• Increased usage of the cloud services layer will likely occur when running many simple queries, especially queries accessing session information or using session variables
• Thoughtful choices about the right frequency and granularity of DDL commands, especially for use cases such as cloning, help to better balance cloud services consumption costs and virtual warehouse costs to achieve an overall lower cost
The Query Processing
Compute Layer
• A Snowflake compute cluster, most often referred to simply as a virtual warehouse, is a
dynamic cluster of compute resources consisting of CPU, memory, and temporary storage
• Creating virtual warehouses in Snowflake makes use of the compute clusters—virtual
machines in the cloud which are provisioned behind the scenes
• The Snowflake compute resources are created and deployed on demand to the
Snowflake user, to whom the process is transparent
• Snowflake’s unique architecture allows for separation of storage and compute, which
means any virtual warehouse can access the same data as another, without any
contention or impact on performance of the other warehouses
Virtual Warehouse
Size
• A Snowflake compute cluster is defined by its size, with size
corresponding to the number of servers in the virtual
warehouse cluster
• For each virtual warehouse size increase, the number of
compute resources on average doubles in capacity
• Virtual warehouse resizing to a larger size, also known as
scaling up, is most often undertaken to improve query
performance and handle large workloads
TIP
Scaling Up a Virtual Warehouse to
Process Large Data Volumes and
Complex Queries

• In a perfect world, you’d pay the same total cost for using an X-Small virtual
warehouse as for using a 4X-Large virtual warehouse
• The number of concurrent queries, the number of tables being queried, and the
size and composition of the data are a few things that should be considered when
sizing a Snowflake virtual warehouse
• Resizing a Snowflake virtual warehouse is a manual process and can be done
even while queries are running because a virtual warehouse does not have to be
stopped or suspended to be resized
• However, when a Snowflake virtual warehouse is resized, only subsequent
queries will make use of the new size
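A resize of the kind described above is a single statement; a sketch, assuming a warehouse named CH2_WH (the name used later in this chapter):

```sql
-- Scale up an existing warehouse while it is running; only queries
-- submitted after this statement make use of the new size.
ALTER WAREHOUSE CH2_WH SET WAREHOUSE_SIZE = LARGE;
```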
TIP
• In Snowflake, we can create virtual warehouses through the user interface or with SQL
• When we create a new virtual warehouse using the Snowsight user interface, as shown
in Figure 2-7, we’ll need to know the name of the virtual warehouse and the size we
want the virtual warehouse to be
• A multicluster virtual warehouse allows Snowflake to scale in and out automatically
• A virtual warehouse can be resized, either up or down, at any time, including while it is
running and processing statements
• Resizing a virtual warehouse doesn’t have any impact on statements that are currently
being executed by the virtual warehouse
TIP
• The Auto Suspend value should equal or exceed any regular gaps in your query workload
• For example, if you have regular gaps of four minutes between queries, it wouldn’t be advisable to set Auto Suspend for less than four minutes
• If you did, your virtual warehouse would be continually suspending and resuming, which could potentially result in higher costs since the minimum credit usage billed is 60 seconds
WARNING
• Larger virtual warehouses do not necessarily result in better
performance for query processing or data loading
• It is the query complexity, as part of query processing, that should
be a consideration for choosing a virtual warehouse size because
the time it takes for a server to execute a complex query will likely
be greater than the time it takes to run a simple query
• The amount of data to be loaded or unloaded can also greatly affect
performance
Scaling Out with Multicluster Virtual Warehouses to Maximize Concurrency
• A multicluster virtual warehouse operates in much the same way as a single-cluster virtual warehouse
• The goal is to optimize the Snowflake system performance in terms of size and number of clusters
• Multicluster virtual warehouses can be set to automatically scale if the number of users and/or queries tends to fluctuate
NOTE
• Just like single-cluster virtual warehouses, multicluster virtual warehouses can be
created through the web interface or by using SQL for Snowflake instances
• Unlike single-cluster virtual warehouses where sizing is a manual process, scaling in or
out for multicluster virtual warehouses is an automated process
• The Snowflake scaling policy, designed to help control the usage credits in the Auto-
scale mode, can be set to Standard or Economy
• If a multicluster virtual warehouse is configured with the scaling policy set to Economy,
a virtual warehouse starts only if the Snowflake system estimates the query load can
keep the virtual warehouse busy for at least six minutes
NOTE
• Compute can be scaled up, down, in, or out
• In all cases, there is no effect on storage used
Creating and Using Virtual Warehouses
• Commands for virtual warehouses can be executed in the web UI or within a worksheet by using SQL
• Auto Suspend and Auto Resume are two options available when creating a Snowflake virtual warehouse
• The following SQL script will create a medium-sized virtual warehouse, with four clusters, that will automatically suspend after five minutes
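A sketch of such a statement, assuming the CH2_WH name used later in this chapter; the size, cluster count, and suspend time follow the description above, while the remaining parameters are illustrative defaults:

```sql
-- Medium warehouse with up to four clusters that suspends after
-- five minutes of inactivity (AUTO_SUSPEND is given in seconds).
CREATE WAREHOUSE CH2_WH
  WAREHOUSE_SIZE = MEDIUM
  MIN_CLUSTER_COUNT = 1        -- illustrative; not specified in the text
  MAX_CLUSTER_COUNT = 4
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;  -- best practice noted in the TIP below
```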
TIP
• It is a best practice to create a new virtual warehouse in a
suspended state
• Unless the Snowflake virtual warehouse is created initially in a
suspended state, the initial creation of a Snowflake virtual
warehouse could take time to provision compute resources
• Earlier we discussed how we can change the size of a virtual
warehouse by scaling it up or down and that doing so is a
manual process
USE WAREHOUSE CH2_WH;
• Alternatively, you can update the virtual warehouse in the context menu located at the
upper right of the web UI
• We created our virtual warehouse using SQL commands, but we can also use the
Snowsight web UI to create and/or edit Snowflake virtual warehouses
• Even though the multicluster virtual warehouse is toggled on when a cluster of one is
selected, scaling out beyond a single cluster is not possible
• To create a Snowflake multicluster virtual warehouse via SQL, you’ll need to specify the
scaling policy as well as the minimum and maximum number of clusters
• As stated previously, the scaling policy, which applies only if the virtual warehouse is
running in Auto-scale mode, can be either Economy or Standard
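A sketch of a multicluster definition along the lines described above; the warehouse name and parameter values are illustrative, not from the text:

```sql
-- Multicluster warehouse in Auto-scale mode: MIN_CLUSTER_COUNT and
-- MAX_CLUSTER_COUNT differ, and a scaling policy is specified.
CREATE WAREHOUSE ANALYTICS_WH
  WAREHOUSE_SIZE = SMALL
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY = 'ECONOMY'   -- or 'STANDARD'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;
```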
NOTE
• A multicluster virtual warehouse is said to be Maximized when the minimum number of clusters and the maximum number of clusters are the same
• Additionally, the value must be more than one
• An example of a Maximized multicluster virtual warehouse is when MIN_CLUSTER_COUNT = 3 and MAX_CLUSTER_COUNT = 3
Separation of Workloads and
Workload Management

• Query processing tends to slow down when the workload reaches full capacity on
traditional database systems
• In contrast, Snowflake estimates the resources needed for each query, and as the
workload approaches 100%, each new query is suspended in a queue until there
are sufficient resources to execute it
• One way to manage workloads is to separate them by assigning different virtual warehouses to
different users or groups of users
• The virtual warehouses depicted in Figure 2-15 are all the same size, but in
practice, the virtual warehouses for different workloads will likely be of different
sizes
Separation of Workloads and
Workload Management

• For an automatically scaling multicluster virtual warehouse, we will still need to define the virtual warehouse size and the minimum and maximum number of clusters
Billing for the Virtual
Warehouse Layer
• Consumption charges for Snowflake virtual warehouses are calculated based
on the warehouse size, as determined by the number of servers per cluster,
the number of clusters if there are multicluster virtual warehouses, and the
amount of time each cluster server runs
• Snowflake utilizes per-second billing with a 60-second minimum each time
a virtual warehouse starts or is resized
• For the exercises in this chapter, it is recommended that you choose an X-Small virtual
warehouse because of the small size of the data set and the simplicity of the query
Centralized Database
Storage Layer
• Snowflake’s centralized database storage layer holds all data, including structured and
semi-structured data
• Each Snowflake database consists of one or more schemas, which are logical groupings
of database objects such as tables and views
• Chapter 3 is devoted to showing you how to create and manage databases and database
objects
• In Chapter 9, we will learn about Snowflake’s physical data storage as we take a deep
dive into micro-partitions to better understand data clustering
• Snowflake automatically organizes stored data into micro-partitions, an optimized,
immutable, compressed columnar format which is encrypted using AES-256 encryption
Introduction to Zero-
Copy Cloning
• Zero-copy cloning offers the user a way to snapshot a
Snowflake database, schema, or table along with its associated
data
• There is no additional storage charge until changes are made to
the cloned object, because zero-copy data cloning is a
metadata-only operation
• There are many uses for zero-copy cloning other than creating
a backup
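A minimal sketch of cloning at the table and database level; the object names are hypothetical:

```sql
-- Metadata-only snapshot: no additional storage is charged until the
-- clone and the source object diverge (names are hypothetical).
CREATE TABLE ORDERS_DEV CLONE ORDERS;

-- Cloning also works at the schema and database level.
CREATE DATABASE DEV_DB CLONE PROD_DB;
```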
Introduction to Time
Travel
• Time Travel allows you to restore a previous version of a
database, table, or schema
• This is an incredibly helpful feature that gives you an
opportunity to fix previous edits that were done incorrectly or
restore items deleted in error
• With Time Travel, you can also back up data from different
points in the past by combining the Time Travel feature with
the clone feature, or you can perform a simple query of a
database object that no longer exists
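A sketch of the uses described above, including combining Time Travel with cloning; the table name and offsets are illustrative:

```sql
-- Query a table as it existed one hour ago (offset in seconds).
SELECT * FROM ORDERS AT (OFFSET => -3600);

-- Combine Time Travel with cloning to back up a past state.
CREATE TABLE ORDERS_BACKUP CLONE ORDERS AT (OFFSET => -3600);

-- Restore a table that was dropped in error.
UNDROP TABLE ORDERS;
```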
Billing for the Storage Layer
• Snowflake data storage costs are calculated based on the daily average size of compressed rather than uncompressed data
• Storage costs include the cost of persistent data stored in permanent tables and files staged for bulk data loading and unloading
• Fail-safe data and the data retained for data recovery using Time Travel are also considered in the calculation of data storage costs
Snowflake Caching
• When you submit a query, Snowflake checks to see whether
that query has been previously run and, if so, whether the
results are still cached
• Snowflake will use the cached result set if it is still available
rather than executing the query you just submitted
• There are three Snowflake caching types: the query result
cache, the virtual warehouse cache, and the metadata cache
Query Result Cache
• The fastest way to retrieve data from Snowflake is by using the
query result cache
• The results of a Snowflake query are cached, or persisted, for
24 hours and then purged
• The result cache is fully managed by the Snowflake global
cloud services layer, as shown in Figure 2-18, and is available
across all virtual warehouses since virtual warehouses have
access to all data
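When experimenting with caching, the result cache can be toggled per session; USE_CACHED_RESULT is the session parameter that controls this behavior:

```sql
-- Disable the query result cache for the current session
-- (useful when measuring raw warehouse performance).
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Restore the default behavior afterward.
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
```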
Metadata Cache
• The metadata cache is fully managed in the global services layer, where the user does have some control over the metadata but no control over the cache
• Snowflake collects and manages metadata about tables, micro-partitions, and even clustering
• Snowflake also stores the total number of micro-partitions and the depth of overlapping micro-partitions to provide information about clustering
NOTE
• The information stored in the metadata cache is used to build
the query execution plan
Virtual Warehouse Local Disk Cache
• The traditional Snowflake data cache is specific to the virtual warehouse used to process the query
• Running virtual warehouses use SSD storage to store the micro-partitions that are pulled from the centralized database storage layer when a query is processed
• The size of the virtual warehouse’s SSD cache is determined by the size of the virtual warehouse’s compute resources
NOTE
• The virtual warehouse data cache is limited in size and uses the
LRU algorithm
• Whenever a virtual warehouse receives a query to execute, that
warehouse will scan the SSD cache first before accessing the
Snowflake remote disk storage
• Although the virtual warehouse cache is implemented in the
virtual warehouse layer where each virtual warehouse operates
independently, the global services layer handles overall system
data freshness
TIP
• Whenever possible, and where it makes sense, assign the same
virtual warehouse to users who will be accessing the same data
for their queries
• This increases the likelihood that they will benefit from the
virtual warehouse local disk cache
Code Cleanup
• In the Chapter2 worksheet, make sure to set the cached results parameter back to TRUE
• Go ahead and drop the virtual warehouses
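A sketch of the cleanup statements; CH2_WH is the warehouse name used earlier in this chapter, and USE_CACHED_RESULT is the session parameter that controls cached results:

```sql
-- Re-enable the query result cache for the session.
ALTER SESSION SET USE_CACHED_RESULT = TRUE;

-- Drop the warehouse created in this chapter.
DROP WAREHOUSE IF EXISTS CH2_WH;
```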
Summary
• Hopefully these first two chapters have given you an understanding of the power of the Snowflake Data Cloud and its ease of use
• We executed a few SQL statements in this chapter, but in the upcoming chapters we’ll be demonstrating Snowflake’s many capabilities by deep-diving into core concepts with hands-on learning examples
• Refer to Appendix C for more details on getting set up in a trial account
Knowledge Check
• Name the three Snowflake architecture layers

• Which of the three Snowflake layers are multitenant?

• In which of the three Snowflake architecture layers will you find the virtual warehouse cache?

• If you are experiencing higher than expected costs for Snowflake cloud services, what kinds of things
might you want to investigate?

• Explain the difference between scaling up and scaling out

• What effect does scaling up or scaling out have on storage used in Snowflake?

• True or false: the shared-nothing architecture evolved from the shared-disk architecture

• In a Snowflake multicluster environment, what scaling policies can be selected?

• What components do you need to configure specifically for multicluster virtual warehouses?
Knowledge Check
• What are two options to change the virtual warehouse that will
be used to run a SQL command within a specific worksheet?
