
GETTING STARTED WITH SNOWFLAKE
BEST PRACTICES FOR LAUNCHING YOUR SNOWFLAKE PLATFORM

© 2020 phData, Inc. All rights reserved.


TABLE OF CONTENTS

Introduction
Defining a Chargeback Model
  Expenses
Configuring Your Snowflake Account
  Data Retention
  Timezone
  Security
  Connection Performance
  Cost Savings
Managing Access
  When to Use OAuth
  When to Use SAML2
  When to Use SCIM vs. phData Tram
  Handling Service Accounts
Loading Data
  Internal vs. External Stages
  Using a Storage Integration
  Snowpipe vs. Inserts vs. Copy Into
  Data Pipelines
Managing Database Objects
  Repeatable Process
  Workspaces
How Will Your Projects Transform Data?
  Views
  Streams and Tasks
  Spark
Monitoring Your Account
  Monitoring
  Alerting
  Auditing
Get Started
phData Is Here To Make Your Life Easier



CONGRATULATIONS!

You have a shiny new Snowflake account, and business units at your company are lining up to be onboarded. If all goes well, you'll soon be scaling up and beginning to integrate with existing business processes with built-in expectations.

However, if there's one thing we've learned from years of successful cloud data implementations here at phData, it's the importance of defining and implementing processes, building automation, and performing configuration even before you create the first user account.

This isn't meant to be a technical how-to guide — most of those details are readily available via a quick Google search — but rather an opinionated review of key processes and potential approaches.

Each step of the way, we'll explore the options available to you and how they impact your operating costs, complexity, security, and performance — allowing you to get the most out of the Snowflake platform and deliver results back to the business.



DEFINING A CHARGEBACK MODEL

Unlike traditionally licensed on-premises data solutions, Snowflake operates with a flexible pay-as-you-go model, allowing you to create an account and start using it without delay. However, without the proper planning to ensure governance and visibility around utilization, this model can also make it easy to run up a significant bill as multiple business units ask for access.

To combat this, decide in advance how to pay for Snowflake credits. For example, will each project pay for its usage? How is that funded after the project releases?

Furthermore, Snowflake gives discounts on the volume of purchased credits. That means making consolidated purchases can save you money, but calculating how you distribute those costs to business units or projects is up to you.

EXPENSES

Expenses to account for with Snowflake include:
• Warehouses
• Snowpipe
• Materialized Views
• Cloud Services
• Data Transfers
• Storage

While it is possible to minimize or prevent expenses in some areas, you should still plan for all potential expense sources. And whatever model you choose, you will need to track these expenses within Snowflake to determine how much consumption has occurred and how to charge it to the right budget.

TIP: You will need metadata to associate charges to the correct budget.

More on this topic in the Monitoring section later in this guide; but for now, keep in mind that the simplest method is to create a naming convention for database objects that allows you to identify the owner and associated budget. It's possible to have more elaborate methods to store metadata in another table or even (cringe) in a spreadsheet; but whatever you do, you'll want it to remain accurate and easy to maintain over time.

Also note that this metadata could be used to create other reports that help individual business units optimize their cost and performance — a possibility we'll explore in more depth when we talk about monitoring.



CONFIGURING YOUR SNOWFLAKE ACCOUNT

Snowflake has many parameters to configure default behavior. Most will likely work for your requirements without changing them, but there are a few properties that deserve special attention from enterprise customers.

In particular, be sure to evaluate Snowflake policies that influence:
• Data Retention
• Timezone
• Security
• Connection Performance
• Cost Savings

DATA RETENTION

Parameter: DATA_RETENTION_TIME_IN_DAYS (90)

Data retention time is how long Snowflake will retain historical views of your data.

The default data retention time is one day, but if you are paying for the Enterprise Edition, you will likely want this to be 90 days. The extended period will allow you to perform Time Travel activities, such as undropping tables or comparing new data against historical values.

TIP: For cost savings, you can set this value for each database and choose to have non-production data stored for fewer days. One day is usually adequate for development use.

TIMEZONE

Parameter: TIMEZONE (Etc/UTC)

Snowflake will present time-related values using the timezone in your configuration. The default timezone used by Snowflake is America/Los_Angeles. However, you should evaluate what works best for you. Some companies set the timezone to their corporate headquarters' timezone, while others use UTC.

TIP: It is possible to set this both at an account level and at a user level; however, consumers should always request time-related fields with a timezone to avoid being reliant on this default.
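As a concrete illustration, both parameters can be set account-wide with ALTER ACCOUNT and overridden per database for cheaper non-production retention; the database name below is a placeholder, not from the guide:

-- Account-wide defaults (run as ACCOUNTADMIN)
ALTER ACCOUNT SET DATA_RETENTION_TIME_IN_DAYS = 90;
ALTER ACCOUNT SET TIMEZONE = 'Etc/UTC';

-- Shorter, cheaper retention for a hypothetical development database
ALTER DATABASE dev_analytics_db SET DATA_RETENTION_TIME_IN_DAYS = 1;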



SECURITY

Storage Integrations

Parameters:
• REQUIRE_STORAGE_INTEGRATION_FOR_STAGE_CREATION (True)
• REQUIRE_STORAGE_INTEGRATION_FOR_STAGE_OPERATION (True)

A storage integration is a secure means of creating connectivity between Snowflake and your cloud storage provider. You will need at least one if you create external stages to load data.

External stages should require storage integrations. By setting REQUIRE_STORAGE_INTEGRATION_FOR_STAGE_CREATION and REQUIRE_STORAGE_INTEGRATION_FOR_STAGE_OPERATION to "True," you can prevent the exposure of access tokens or secret keys to Snowflake users.

TIP: Do not create external stages without storage integrations.

Network Policies

Parameter: NETWORK_POLICY

A network policy defines a list of valid network locations for user connections. Preventing access from unwanted networks is vital to protecting your Snowflake account.

You can set an account-wide network policy, as well as configure network policies at a user level.

Network policies are an area of account setup that is difficult and time consuming. You will frequently run into issues where integrations will come from cloud providers with broad ranges of potential network locations (CIDR blocks). You may also have users connect through a VPN and find that split-tunnel connections originate from a user's home network rather than the corporate network.

Work with your network team to find the right CIDR blocks, but expect to make multiple adjustments to this over your first few weeks. Snowflake will not let you activate a network policy that would lock you out. However, if your admins are connecting over a VPN, you may find that your IP address shifts day-to-day. If your account admins are remote and do not have a guaranteed known IP address, make a fail-safe network policy that will allow the account admins to connect regardless of their network location. Once you are confident that you will not get locked out of your account, remove the fail-safe network policy.

TIP: Be cautious when creating your network policies — especially if you are connecting over a VPN.

CONNECTION PERFORMANCE

Parameter: CLIENT_METADATA_REQUEST_USE_CONNECTION_CTX (True)

The "CLIENT_METADATA_REQUEST_USE_CONNECTION_CTX" parameter is a bit technical, but it boils down to reducing the amount of information that JDBC and ODBC connections pull when they connect. Set this to "True" at an account level and only set it to "False" for those users who require account-wide metadata. This property is strictly a performance optimization, but it's an important one, as it can make short-lived connections much faster, especially as your catalog of databases and schemas grows.
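A minimal sketch of how these account-level settings and a network policy might be applied; the policy name and CIDR ranges are placeholders, and you should keep a fail-safe policy in place until you are sure you cannot lock yourself out:

-- Require storage integrations and trim client metadata requests (run as ACCOUNTADMIN)
ALTER ACCOUNT SET REQUIRE_STORAGE_INTEGRATION_FOR_STAGE_CREATION = TRUE;
ALTER ACCOUNT SET REQUIRE_STORAGE_INTEGRATION_FOR_STAGE_OPERATION = TRUE;
ALTER ACCOUNT SET CLIENT_METADATA_REQUEST_USE_CONNECTION_CTX = TRUE;

-- Example network policy; expect to adjust the CIDR blocks during the first weeks
CREATE NETWORK POLICY corp_access_policy
  ALLOWED_IP_LIST = ('203.0.113.0/24', '198.51.100.0/24')
  COMMENT = 'Corporate and VPN egress ranges';

ALTER ACCOUNT SET NETWORK_POLICY = corp_access_policy;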



COST SAVINGS

Users

Parameters:
• LOCK_TIMEOUT (60)
• STATEMENT_TIMEOUT_IN_SECONDS (600)
• STATEMENT_QUEUED_TIMEOUT_IN_SECONDS (30)

In your Snowflake account, you will have system users and human users. You might choose different default settings for each, but regardless, it is essential to consider which configuration will work best for your organization.

Lock timeout determines how long a query will wait for a resource that is locked. For most users, if you are waiting for more than 60 seconds, there is likely an issue, and there is no reason to waste further warehouse credits.

A statement timeout prevents poorly optimized queries from running all weekend and racking up substantial charges. (Yes, this happens.) For human users who are waiting for results, ten minutes is a reasonable timeout. A user still has the option of increasing this timeout for a given session if they know their query will take an exceptional amount of time. Ten minutes may not be long enough for system users, but some reasonable limits will prevent multi-day queries from creating unexpected costs.

Warehouses that are overworked have queue times that slow queries down. If you do not have a queue timeout, users may sometimes repeatedly queue new queries until your warehouse is working for many hours to clear the queue. Thirty seconds is a good default for human users; if you find that queries are regularly queueing, consider making your warehouse a multi-cluster that scales on demand.

Warehouses

Parameters:
• RESOURCE_MONITOR
• AUTO_SUSPEND (60)
• WAREHOUSE_SIZE (XS)
• MIN_CLUSTER_COUNT / MAX_CLUSTER_COUNT (1/1)

Warehouse credit consumption can make up most of your invoice, so optimizing for cost can have a huge benefit.

RESOURCE MONITORS

A resource monitor is a Snowflake object that observes credit usage for warehouses and can alert your account administrators or even suspend the warehouses. You can configure the monitors with a finite number of credits and either not refresh or refresh at a frequency of your choice.

The first and most important thing you can do to manage your costs is to create a resource monitor for each budget and associate it with the appropriate warehouses. Whether your budget is tied to business units or projects — or even your entire company — ensuring that a warehouse does not use excessive credits is essential.

TIP: Only account administrators who have opted in can see alerts on resource monitors.

AUTO SUSPEND

Ingesting and transforming data will utilize warehouses for fixed periods, and then they won't be needed again for an hour or more. Therefore, there's no reason for these warehouses to continue running for another ten minutes.



If a warehouse resumes once per hour, you could save roughly three and a half hours' worth of credits per day by suspending quickly!

In many cases, setting your warehouses to auto-suspend after 60 seconds works well. The main exception where you should avoid auto-suspending quickly is when you have frequent traffic over a period and you want to keep a warehouse's local disk cache populated.

TIP: Create warehouses for each specific purpose, allowing you to configure them optimally for that purpose.

WAREHOUSE SIZE

The "t-shirt size" of your warehouse (XS, S, M, L, XL) will significantly impact your credit usage. Sizing your warehouses to perform well and remain cost-efficient is always a challenge, and it's a bit more art than science. But while you may need some trial and error, there are some rules of thumb for picking your starting point.

Begin with a size that roughly correlates to how much data the warehouse in question will be processing for a given query:
XS: Multiple megabytes
S: A single gigabyte
M: Tens of gigabytes
L: Hundreds of gigabytes
XL: Terabytes

TIP: A warehouse for data ingestion is generally two sizes smaller than a warehouse for transformation or analytics.

CLUSTER COUNT

If your warehouse has to serve many concurrent requests, you may need to increase the cluster count to meet demand.

For most transformation and ingest warehouses, you can leave the cluster default of one minimum and one maximum. However, for analytics warehouses, you may need to scale for usage.

As a rule of thumb, you will need a max cluster count of roughly one for every ten concurrent requests. So, if you expect 20 concurrent queries and do not want them to queue, a max cluster count of two should be adequate.

It's always best to start small and scale up as needed, so configure the scenario above with that in mind. Starting with a minimum cluster count of one will provide some cost savings, but as the number of concurrent queries ramps up, this may cause performance issues. If performance issues are encountered, consider increasing the minimum cluster count to two.
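To tie these settings together, here is a sketch of a budget-scoped resource monitor, a right-sized warehouse, and user-level timeouts. Object names and quotas are illustrative, and multi-cluster warehouses require Enterprise Edition:

-- One resource monitor per budget, attached to that budget's warehouses
CREATE RESOURCE MONITOR finance_monthly_rm WITH
  CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND
           ON 110 PERCENT DO SUSPEND_IMMEDIATE;

-- An analytics warehouse that suspends quickly and scales out for concurrency
CREATE WAREHOUSE finance_analytics_wh WITH
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60            -- seconds
  AUTO_RESUME = TRUE
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 2
  INITIALLY_SUSPENDED = TRUE
  RESOURCE_MONITOR = finance_monthly_rm;

-- Timeouts for a hypothetical human user
ALTER USER jane_doe SET
  LOCK_TIMEOUT = 60
  STATEMENT_TIMEOUT_IN_SECONDS = 600
  STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 30;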



MANAGING ACCESS

Most enterprises have Active Directory or another OAuth or SAML2 provider and want users to log in with their existing accounts to receive access to resources based on their roles. To this end, Snowflake security integrations allow you to specify an external provider to authenticate or authorize users.

OAuth and SAML2 will allow existing users to log into Snowflake — but for access only. For tasks such as creating users or mapping existing groups to roles, you'll need to look to SCIM (System for Cross-domain Identity Management), and to tools like phData Tram.

See the comparison below for an idea of which technologies work best for your organization:
• phData Tram: database creation/management, warehouse creation/management, privilege creation, standardized workspace creation, automatic user workspace creation, environment creation, user creation, and the ability to connect to any AD instance
• SCIM: user creation and AD group/role mapping, connecting to Okta or AD FS
• Identity provider (AD FS or Okta): handles user authentication in either case



WHEN TO USE OAUTH

OAuth relies on cryptographic keys and an external provider for authentication and is used to provide authorization to use Snowflake. For Snowflake, this is a situation where third-party software is involved.

A user's client software initially authenticates with the identity provider. It then uses a token on all calls to Snowflake until that token expires, at which point the client software either refreshes the token or forces the user to authenticate again.

The software you might use OAuth with includes:
• Tableau
• Power BI
• Looker

If so, you will need an OAuth provider like Okta, Microsoft Azure AD, Ping Identity PingFederate, or a custom OAuth 2.0 authorization server.

WHEN TO USE SAML2

SAML2 is used to authenticate users logging into the Snowflake UI, for Snowflake connectors, or ODBC/JDBC connections that rely on credentials.

If you use Active Directory Federation Services (ADFS) or Okta, you may use the "basic" option and configure a SAML identity provider. This method requires updating an account-level parameter named SAML_IDENTITY_PROVIDER. However, the more standard security integration syntax is replacing the identity provider parameter method.

TIP: Even if you start with SAML_IDENTITY_PROVIDER, you can migrate to a security integration later using the system function migrate_saml_idp_registration().

For greater detail, see the Snowflake documentation.

WHEN TO USE SCIM VS PHDATA TRAM

SCIM manages users and groups with Azure Active Directory or Okta. phData Tram manages users and groups using any Active Directory instance, or through text file configuration.

PHDATA TRAM MAKES MANAGING SNOWFLAKE USERS AND PROJECTS SIMPLE

Tram is a software tool designed by phData to streamline the creation and management of project resources within Snowflake rather than having to handle provisioning manually. This includes users, roles, schemas, databases, and warehouses.

With phData Tram, you can:
• Quickly ramp up new projects through the reuse of information architecture
• Speed user onboarding and simplify access management
• Quickly apply new changes to Snowflake
• Manage hundreds or thousands of project environments, groups, and workspaces automatically
• More easily create, verify, and reuse complex information architectures

Additionally, Tram facilitates your information architecture: SCIM manages users and roles, while Tram manages users and roles and/or your information architecture.
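For reference, the security-integration approach mentioned under "When to use SAML2" above looks roughly like the sketch below; the issuer, SSO URL, and certificate values come from your identity provider's metadata and are placeholders here:

CREATE SECURITY INTEGRATION okta_sso
  TYPE = SAML2
  ENABLED = TRUE
  SAML2_ISSUER = 'http://www.okta.com/exkXXXXXXXX'
  SAML2_SSO_URL = 'https://yourcompany.okta.com/app/snowflake/XXXX/sso/saml'
  SAML2_PROVIDER = 'OKTA'
  SAML2_X509_CERT = 'MIIC...';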



HANDLING SERVICE ACCOUNTS
For most companies, directory services hold human users; but for
systems that migrate or consume data, you need service accounts.
In most cases, you should use RSA key pairs and rotate those keys
every 90 days. Some systems will not support key pairs (or cannot
do it securely), and you will need passwords associated with
the system accounts. As with key pairs, you should rotate these
passwords frequently.
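A sketch of key-pair setup and rotation for a hypothetical service account (the keys themselves are generated outside Snowflake, for example with OpenSSL):

-- Assign the initial public key to the service account
ALTER USER svc_ingest SET RSA_PUBLIC_KEY = 'MIIBIjANBgkq...';

-- Rotation: stage the new key in the second slot, repoint clients, then remove the old key
ALTER USER svc_ingest SET RSA_PUBLIC_KEY_2 = 'MIIBIjANBgkq...';
ALTER USER svc_ingest UNSET RSA_PUBLIC_KEY;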
Creating a process for doing the rotation is the biggest obstacle
to success for service accounts. Begin planning early on how the
rotations will work. Consider:
• Who will update the new keys or passwords?
• Who generates them?
• How are they transmitted?
• Can more than one service use the same account or password?
(Note however that this merely outlines the basic thought process
here. The actual implementation is typically very complex. For
further information or assistance, phData recommends having
deeper conversations with our technical experts.)



LOADING DATA

You now have a game plan in place for handling cost attribution, configuration optimization, and access management. But before teams can use Snowflake, you'll also need a well-thought-out strategy for ingesting data. Without one, people will no doubt find ways to get data into the platform, but they will ultimately waste a lot of credits and time doing it poorly.

In practice, you may not be able to use the same ingestion tool for all of your needs. Establishing a core technology that satisfies most requirements, paired with a process to review and approve new technologies as needed, is therefore critical to keeping projects moving forward, managing costs, and keeping your data secure.

INTERNAL VS. EXTERNAL STAGES

The first choice for you to make when loading data is whether to use internal or external stages.

An internal stage utilizes cloud storage managed by Snowflake, whereas an external stage is one you manage yourself. To determine which is right for you, consider these questions:
• Will I have processes which, for cost or performance reasons, would need the raw data directly from cloud storage?
• Do I have unique encryption requirements around data at rest?
• Do I have tools or technologies that will only integrate with Snowflake through my cloud storage?

If you answer "yes" to any of these questions, you will need cloud storage, such as Amazon S3 or Azure Data Lake Storage. This decision may influence whether you choose to have your Snowflake account backed by Azure or AWS, in addition to which region you select (as there are performance and cost considerations when moving data between regions).

You may have a strategy that mixes internal and external stages, which is fine — if the choice is purposeful. Internal stages tend to be easy to use and manage, but they do limit options to consuming data strictly through Snowflake.

USING A STORAGE INTEGRATION


If you choose to use external stages, you should create storage integrations to interact with them. While it is possible to make an external stage without a storage integration, there are security risks in doing so, as Snowflake users with enough access may be able to see the credentials embedded in the stage definition.
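A minimal sketch of a storage integration backing an external stage on S3; the role ARN, bucket, and object names are placeholders:

CREATE STORAGE INTEGRATION s3_raw_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_access'
  STORAGE_ALLOWED_LOCATIONS = ('s3://example-raw-bucket/landing/');

CREATE STAGE raw_db.public.landing_stage
  URL = 's3://example-raw-bucket/landing/'
  STORAGE_INTEGRATION = s3_raw_int;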

SNOWPIPE VS. INSERTS VS. COPY INTO


When loading data into Snowflake, the very first and most important rule to follow is: do not
load data with SQL inserts!



Loading small amounts of data is cumbersome and costly:
• Each insert is slow — and time is credits.
• It creates poorly partitioned tables, which will make queries slow.

Instead, always load data in bulk by using either Snowpipe or the SQL "COPY INTO" statement, which is how all Snowflake-approved connectors work.

File Sizes

So, you should load data in bulk — but what qualifies as bulk, and why does it matter?

Snowflake stores data in micro-partitions of up to 500 MB. Each of these stores metadata needed to optimize query performance. By having small micro-partitions, or micro-partitions that aren't homogeneous, your queries will read additional partitions to find results. These unoptimized partitions will return results slower, frustrate consumers, and increase credit consumption.

Knowing this, you want to have data prepared in a way to optimize your load. It might be tempting to have massive files and let the system sort it out. Unfortunately, having excessively large files will make the loads slower and more expensive.

Aim for 100-megabyte files. Less than ten megabytes or more than a gigabyte, and you will notice suboptimal performance. Snowflake publishes file size guidelines, and phData recommends checking periodically to see if they have changed.

File Format

If you want pure performance, compressed CSVs load fastest and use the least credits. But there are other considerations if applications other than Snowflake pull files from your cloud storage. The CSV format only supports structured data, which can be a nonstarter in some situations. In cases where CSVs may be infeasible, Parquet — a semi-structured format used by Spark and other applications that consume data — is a reasonable option.

You may be limited to the formats that your data sources produce. There is no "wrong" choice when it comes to file format, but having a policy may help with performance and help developers with common patterns.
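A bulk load with COPY INTO looks roughly like this; the stage, table, and format options are hypothetical:

COPY INTO raw_db.public.orders
  FROM @raw_db.public.landing_stage/orders/
  FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP' SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
  ON_ERROR = 'ABORT_STATEMENT';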



Snowflake Connectors

For accessing data, you'll find a slew of Snowflake connectors on the Snowflake website. You can use whatever works best for your technology (e.g., ODBC, JDBC, the Python Snowflake Connector), and generally, things will be okay. Be sure to test your scenarios, though. Some connectors, like the one for Python, appear to use S3 for handling large amounts of data, and that can fail if your network does not allow the connectivity.

And once again, for loading data, do not use SQL inserts. You will find options for most major data migration tools and technologies like Kafka and Spark.

Optimizing Data for Use

There is no need to optimize your data prematurely. Load it as-is and see how queries perform. If they are slow, check the query profile to see whether queries are reading many micro-partitions. If they are, you have options.

Micro-partitions help queries run faster when sized well, but you can also influence performance by making the frequently used columns homogeneous in partitions. Sending files from your source systems pre-sorted by frequently filtered-upon columns may help optimize partitions.

If you keep your data volume low and your file sizes small, you may not be able to influence the micro-partitions in this way. But don't despair: you still have an option.

After the data is initially loaded, but before end-users query it, you can periodically optimize the table with a task. The task would use a stored procedure that performs a "create table as select" to generate a temp table which is sorted by the columns that are filtered on, then uses an "alter table swap" statement to plop in the optimally partitioned table.
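The core of that periodic re-sort, stripped of the stored procedure and task wrapper, is just a sorted "create table as select" followed by a swap; table and column names here are hypothetical:

-- Build a copy sorted by the columns queries filter on most
CREATE OR REPLACE TABLE analytics.public.orders_sorted AS
  SELECT * FROM analytics.public.orders
  ORDER BY order_date, customer_id;

-- Swap the sorted copy in, then drop the leftover table
ALTER TABLE analytics.public.orders SWAP WITH analytics.public.orders_sorted;
DROP TABLE analytics.public.orders_sorted;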
Additionally, Snowflake has the concept of clustering keys. This can help performance with huge tables, but isn't meant for small tables and can hurt speed in some cases. Use with caution, and test before committing to using them.

DATA PIPELINES

"Data pipeline" means moving data in a consistent, secure, and reliable way at some frequency that meets your requirements. Data pipelines can be built with third-party tools alone or in conjunction with Snowflake's tools.

Snowpipe

If you can get data into an external stage, you can get the data into Snowflake using Snowpipe.

For technical details, search for Snowpipe how-to guides online. But know that there are some significant decisions to make before using Snowpipe:
• How will data make it reliably and securely to your cloud storage?
• How will you create the schema for data loads?
• How will you handle changes to the schema of the incoming data?

ESTABLISHING SCHEMAS AND HANDLING SCHEMA DRIFT

If you are using Snowpipe, you might maintain schemas manually. This may work if you expect your schemas to be static, but the safer approach is to have a plan for detecting and adjusting to schema changes.

One way is to load data in a semi-structured format that inherently does not have a schema. As data matures through the transformation life cycle, tasks build curated or conformed schemas by mapping fields from the schemaless raw data layer. Data will load without failure when schemas change, but you will still need to change the view or task that maps data to account for the source fields being different. The biggest drawback to this approach is that you may not realize that the incoming data schema has changed for some time because nothing will break.

Alternatively, for a structured data approach, there are two tools that phData provides to customers that can assist with establishing schemas: Streamliner and SQLMorph.

STREAMLINER

Streamliner is an open source tool that crawls a source database and creates a schema within Snowflake.

SQLMORPH

SQLMorph is a free SaaS application that can translate SQL from one dialect to another.

Third-Party Products

Many third-party products can migrate data and manage schema drift. A word of caution: some applications are new to the Snowflake arena, so verify that they will work well for you.

phData has partnerships with Qlik, HVR, Fivetran, and StreamSets. We're able to help customers identify the tooling that best fits their particular needs.

Whatever products you choose, be sure to establish a process to manage change over time, and handle failures in your data pipelines.


MANAGING DATABASE OBJECTS

Anything you make in Snowflake is a database object.

Make sure to have a plan to allocate the sets of objects that comprise a given project, so you can track the expenses and resources around them as a unit, which is a critical part of managing your Snowflake budget.

REPEATABLE PROCESS

It is possible to look up the necessary SQL syntax to create a table or establish a role, and to simply use the Snowflake UI to make objects. This manual approach works well when you want to make a single object. However, this is not a good practice overall.

Take role creation, for example. As a general best practice, you should grant all custom roles to the SYSADMIN role; otherwise, you would end up with roles floating around that cannot be managed by the people who manage the account. Beyond being granted to the SYSADMIN role, you may need to enforce your custom role hierarchies so people in charge of a given project can see the objects created by people working on that project.

You could always write a document that specifies these steps and rely on people following them to create Snowflake roles correctly; but in practice, you will eventually have issues. Fortunately, automation can make this process far less manual, time-consuming, and prone to error.

Stored Procedures
One way to help people properly create new roles within the hierarchy is to create a stored
procedure that can create roles with the requisite grants and ownership on behalf of the
user without actually permitting the user to create the role on their own.
While a little syntactically clumsy, using stored procedures is easier and more cost-
effective than trying to fix role hierarchies by hand later.
You might end up with one fancy stored procedure that takes in multiple parameters to allow admins to make roles for more than one project. The stored procedure would verify that they have access to do the creation. Or, as a more straightforward but verbose approach, you might have one stored procedure per project that only admins of that project can access.
Whatever you design, find a means to make these roles in a repeatable, secure, and correct
way within your development process.
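A minimal sketch of an owner's-rights procedure along these lines; the names are hypothetical, the owning role is assumed to hold the CREATE ROLE privilege, and input validation is omitted for brevity:

CREATE OR REPLACE PROCEDURE admin_db.tools.create_project_role(role_name STRING, parent_role STRING)
  RETURNS STRING
  LANGUAGE JAVASCRIPT
  EXECUTE AS OWNER
AS
$$
  // Create the role, grant it to SYSADMIN so account admins can manage it,
  // and attach it to the project's parent role in the hierarchy.
  var stmts = [
    "CREATE ROLE IF NOT EXISTS " + ROLE_NAME,
    "GRANT ROLE " + ROLE_NAME + " TO ROLE SYSADMIN",
    "GRANT ROLE " + ROLE_NAME + " TO ROLE " + PARENT_ROLE
  ];
  for (var i = 0; i < stmts.length; i++) {
    snowflake.execute({sqlText: stmts[i]});
  }
  return "Created role " + ROLE_NAME;
$$;

Because the procedure runs with the owner's rights, callers can create correctly wired roles without ever holding the CREATE ROLE privilege themselves.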



Tram

phData offers customers a free tool called Tram to create Snowflake users, databases, warehouses, schemas, and other objects in a repeatable and secure fashion.

Tram brings the concept of a "workspace template" — a collection of resources which you can define and then utilize to allocate a new project efficiently.

By establishing a naming convention and association within its configuration, you can manage the workspace resources more quickly and consistently.

WORKSPACES

Workspaces are a collection of resources needed by a project, funded by a single budget and with tightly coupled, interrelated objects.

Having the ability to allocate a workspace as required will significantly reduce the time it takes for new projects to be productive, and will make adhering to standards much easier.

Whether you have Tram or not, having the concept of a workspace and the automation to support it is valuable because it provides you:
• A familiar, repeatable pattern and process that meets standards
• Time-saving automation that prevents errors and omissions
• Enforcement of proper naming and metadata structures that allow you to manage and monitor your projects
• Simplified role management
• Facilitation of Continuous Integration and Continuous Delivery (CI/CD)

Metadata

One way or another, you'll need metadata on objects within Snowflake so that you can associate them with budgets and users. You need this to manage credits effectively, so if something goes wrong, you know who to contact.

The simplest way to introduce metadata is to create a naming convention that tracks the business unit and project associated with an object. Two other useful attributes to include might be the environment the object is used for and the purpose of that object. The purpose attribute distinguishes objects of the same type that serve different purposes.

Depending on your requirements, there are also more elaborate possibilities involving custom tables and stored procedures to track other metadata elements that you may need.

One consideration to track when defining your process is that an object may initially be created and managed by one group, but then later transferred to another.

Standardized Structure and Naming

An example format may include environment label, business unit, project, and purpose joined with underscores. The order of these elements may impact your developer tools' code-complete feature and drop-down menus within the Snowflake UI, so be sure to choose an order that is convenient and sensible to you.
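For example, a hypothetical convention of environment_businessunit_project_purpose might produce objects like these:

CREATE DATABASE dev_fin_billing_db;
CREATE WAREHOUSE dev_fin_billing_transform_wh WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60;
CREATE ROLE dev_fin_billing_developer;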



Continuous Integration and Continuous Delivery

With Snowflake, having multiple environments for a project is easy if you have the automation to generate workspaces.

Having a single production database might seem simpler than implementing CI/CD in the short term. However, there are a number of issues that make this approach infeasible for larger organizations:
• Security: Always limit the number of people who can change and view data in production
• Stability: Making untested changes in production is going to break something eventually (and probably often)
• Rollback: If things go wrong, even though you tested, being able to revert changes quickly can save money and time
• Accountability: Having an approval process for the promotion of changes holds people to a higher standard
• Auditability: Processes produce artifacts and can be used to track how things went right or wrong
• Testing: Someone verified that everything worked before deploying it to production

For each project, create a development workspace for each person doing work. They can test their changes and verify their work before committing the DDL into source control for a build tool to promote code to a shared non-prod environment. Then, after testing the changes in your non-prod environment(s), promote them to production to release them to your consumers.

There are great articles on CI/CD and tools to help you implement the process successfully. Find what works for you and your organization, and create a plan to promote changes into production.

Some tools to check out related to doing CI/CD with Snowflake:
• Flyway
• Liquibase
• Sqitch
• Snowchange

Roles and Security

When defining standards and processes around your workspaces, consider roles and security early in the process. By creating a consistent pattern, users will develop expectations about how a given role will behave based upon naming conventions. You will then have less confusion and a lower likelihood of elevating someone's access unnecessarily.

Unlike some other databases, where users who have multiple roles can see all tables that those roles grant them, Snowflake users can only assume one role at a time, and can only see the resources that one role allows.

This role design has both positives and negatives. For example, it's great for situations where regulations prohibit the combination of certain data sets. However, it can be tricky when your former model for roles is not compatible with this structure. You might end up with thousands of roles that hold every permutation of role combinations in an attempt to imitate your original role design.

For each workspace, plan out your role hierarchy. Plan for roles that are not granted directly to users, but instead granted to other roles for other projects.

(Looking for more information about roles and security? Check out the phData blog.)


ROLES AND SECURITY EXAMPLE
A simple hierarchy for a non-production environment might
look like this:
• Administrator
  – CICD
  – Developer
    ◊ Tester
      • accounting_view
      • hr_view
In this example, two roles grant read access to two different
views. We may grant one or both of those roles to another
project that should read data from this workspace’s views,
but those roles cannot see or interact with our tables or other
internal objects.
The Tester role is granted both view roles in this example, so
they could assume either as needed. We grant the tester role
to the developer role, plus we grant the developer role the
ability to write to any table or change other objects within the
project database.
We create the CI/CD role to modify anything in the database
that your CI/CD process needs.
We grant the developer role to the administrator role, and we
also grant the ability to create or drop objects like schemas to
the administrator.
Your production environment would only have the
Administrator and the CICD roles.
Note: the SYSADMIN role has been omitted from the hierarchy for simplicity.
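Expressed as grants, the hierarchy above might look like the following sketch (role names are illustrative, and rolling the administrator role up to SYSADMIN keeps the whole tree manageable by account admins):

CREATE ROLE IF NOT EXISTS proj_accounting_view;
CREATE ROLE IF NOT EXISTS proj_hr_view;
CREATE ROLE IF NOT EXISTS proj_tester;
CREATE ROLE IF NOT EXISTS proj_developer;
CREATE ROLE IF NOT EXISTS proj_cicd;
CREATE ROLE IF NOT EXISTS proj_admin;

-- View roles roll up to the tester role, tester to developer, developer and CI/CD to admin
GRANT ROLE proj_accounting_view TO ROLE proj_tester;
GRANT ROLE proj_hr_view TO ROLE proj_tester;
GRANT ROLE proj_tester TO ROLE proj_developer;
GRANT ROLE proj_developer TO ROLE proj_admin;
GRANT ROLE proj_cicd TO ROLE proj_admin;

-- Attach the top of the project hierarchy to SYSADMIN
GRANT ROLE proj_admin TO ROLE SYSADMIN;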



HOW WILL YOUR PROJECTS TRANSFORM DATA?

You now have a plan to get data into Snowflake, but there is still more work to do for that data to be effectively consumed and utilized by all the right teams.

VIEWS

The simplest means of transforming data is to put a view over it. There are multiple types of views, each of which has its own benefits and drawbacks.

TIP: A view calling another view roughly doubles the compilation time of the query, even when pulling data from the result cache.

Materialized

Materialized views are very restrictive in terms of what you can do with them, but their benefit is performance.

Materialized views use credits, and frequently changing data can run up the meter. So be conscientious of when you use them and of how they impact your budget. It's also worth noting that you can't attach a resource monitor to a materialized view. If you do not have custom monitoring and alerting from your operations team, you might not know how much you are spending until you run out of credits.

Non-materialized

Non-materialized views are your standard, average view with some optimizations to help with performance. The caller's warehouse will pay the bill for any transformation done using this view.

Secure

Secure views are specialized to avoid specific vulnerabilities. You can read more about them on the Snowflake website, but it is important to note that they are slow, as they cannot utilize some optimizations that other views are allowed to perform.

Before using a secure view, consider whether you can use another type of view or do most of the transformation work outside of the secure view.
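For illustration, the three flavors side by side; object names are hypothetical, and materialized views require Enterprise Edition and support only a restricted subset of SQL:

-- Standard (non-materialized) view: the caller's warehouse does the work
CREATE VIEW analytics.public.orders_enriched AS
  SELECT o.order_id, o.amount, c.region
  FROM raw_db.public.orders o
  JOIN raw_db.public.customers c ON o.customer_id = c.customer_id;

-- Materialized view: restricted (single table, no joins) but fast to query
CREATE MATERIALIZED VIEW analytics.public.daily_order_totals AS
  SELECT order_date, SUM(amount) AS total_amount
  FROM raw_db.public.orders
  GROUP BY order_date;

-- Secure view: hides the definition and skips certain optimizations
CREATE SECURE VIEW analytics.public.orders_shared AS
  SELECT order_id, order_date, amount FROM raw_db.public.orders;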



STREAMS AND TASKS
Streams help with Change Data Capture (CDC) feeds and allow
you to handle more than simple append-only data. They can
trigger a task to run SQL when they have new data, providing an
opportunity to move and transform it.
You do not need to have a stream to use tasks, in which case your
root task would trigger on a schedule.
Tasks are hierarchical and can work together. A top-level task may
move data into one table. A dependent task might transform the
combination of multiple tables into a new temporary table that you
finally swap in as a curated table.
While there are a lot of possibilities, keep in mind that tasks
use credits.
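A minimal stream-plus-task sketch; names are hypothetical, and the task is created suspended, so it must be resumed (which requires the EXECUTE TASK privilege):

CREATE STREAM raw_db.public.events_stream ON TABLE raw_db.public.events_raw;

CREATE TASK raw_db.public.load_events_task
  WAREHOUSE = transform_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('raw_db.public.events_stream')
AS
  INSERT INTO analytics.public.events_curated
    SELECT record:id::STRING, record:event_type::STRING, CURRENT_TIMESTAMP()
    FROM raw_db.public.events_stream;

ALTER TASK raw_db.public.load_events_task RESUME;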
If you migrate data from an existing database, note that phData
offers customers a tool called SQLMorph to automate the process
of translating SQL dialects.

SPARK
Either before moving data into your external stage or after the data
is loaded, new datasets can be created by combining multiple files
using external systems such as a Databricks Spark application.



MONITORING YOUR PLATFORM AND PIPELINES

You have data flowing, and everything is great — or so you think! But then you realize a data pipeline stopped working two days ago, and an out-of-control query (which has apparently been running since last weekend) has eaten up the entire budget for a small project.

The point? All systems tend toward entropy; things go wrong. But without monitoring, nobody is even aware until there's a customer complaint or the next budget review happens. Don't let this happen to you! Here is what phData recommends for monitoring.

MONITORING

Snowflake provides some essential account-level usage information and a dashboard, but that dashboard is only useful if someone is looking at it. For building custom monitoring, several Snowflake views have metadata about your account usage.

You will quickly run into two issues:
• First, the only role that can see everything useful is the ACCOUNTADMIN role, and giving out access to this is like handing the nuclear codes to a toddler.
• Second, Snowflake does not organize information by your company's budget groupings. Project X has a budget, and Project Y has its budget. Although you may have bought the credits in bulk for both projects to save money, you will presumably want to deduct credits from specific budgets.
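As a starting point, the ACCOUNT_USAGE share exposes views you can query for this kind of reporting; a sketch of a top-warehouses-by-credits report (data in these views can lag by up to a few hours):

-- Top ten warehouses by credits consumed over the last 30 days
SELECT warehouse_name,
       SUM(credits_used) AS credits_last_30_days
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_last_30_days DESC
LIMIT 10;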



Custom metadata associated with workspaces enables you to create more holistic reports, such as understanding the spending at the regional or business-unit level rather than just at the project level. Or maybe you categorize your projects and want to understand which categories utilize the most storage.

Planning out your monitoring needs can help you manage your budgets in this way. But there's also another significant benefit: good monitoring tools can help you identify areas to improve. For example, having a list of the top ten warehouses that are consuming credits, and then looking at the top ten queries for each, might present improvement opportunities. Cost optimization is also why it is valuable to have warehouses for each purpose: it makes it easier to identify these situations.

But no matter how simple or complex your needs, be sure to make a plan to track your daily and monthly usage.

This data is valuable to the business units using your platform, which otherwise wouldn't have access to aggregate it. With the proper design, you can expose the details projects need to identify the warehouses, queries, and processes that can be optimized.

TOO MUCH WORK? phData has a Cloud DataOps offering that will monitor your platform and data pipelines for you. They operate 24x7 and keep your data moving. Find out how.

ALERTING

Monitoring and alerting are closely related, with one key distinction: monitoring requires somebody to see what is happening, whereas alerts will send an email or other notification to let the right people know there is an immediate issue.

Built-in Alerts

Snowflake comes with some built-in alerting; however, it's only available to people with an ACCOUNTADMIN role — and only if those people opt in.

You won't be giving out the ACCOUNTADMIN role to many people, so the project members who need to know about an issue with their data pipelines will not know until you tell them. You may therefore want to devise a custom solution on top of the base Snowflake offering, in order to ensure that the people associated with a workspace resource are notified of issues in a timely fashion.

AUDITING

Snowflake tracks 365 days of most audit-type information. If you need more, you may need to come up with a custom solution to store history beyond that period.

And even if you don't have compliance reasons to store everything, having data aggregated by day may allow you to create usage forecasts if you have access to data science resources. (And if you don't, that's another area where phData can help.)



GET STARTED

Now that you've worked your way through all the critical decisions that need to be made upfront, you are, at long last, ready to hand the keys over to those eager business units.

Or then again, maybe you're not; after all, this is a lot easier said than done.

That's why it's so valuable to have experienced data engineers on your side, like the ones here at phData. As the largest pure-play provider for data engineering and machine learning, and Snowflake's 2020 Emerging Partner of the Year, phData offers everything you need to be successful with Snowflake. From solution design to 24x7 data pipeline monitoring to software and automation tools, we're here to streamline many of the complex processes required to launch Snowflake.

PHDATA IS HERE TO MAKE YOUR LIFE EASIER.

Drawing from years of implementation experience, our Snowflake teams bring the services and expertise you need to get your Snowflake-based data products into production, with a focus on driving down costs and delivering the best user experience possible.

We offer a free, expert-led Accelerator Program workshop to get you and your team started with Snowflake. This is a high-level, strategic look at the powerful capabilities available to you in Snowflake so you can deploy it in a way that makes the most sense for your needs and business objectives.

Schedule your workshop, and start unlocking the full value of your data.

SCHEDULE YOUR WORKSHOP TODAY

