Getting Started With Snowflake Guide
BEST PRACTICES FOR LAUNCHING YOUR SNOWFLAKE PLATFORM
CHARGEBACK MODEL
Expenses to account for with Snowflake include:
• Warehouses
• Snowpipe
• Materialized Views
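If you need to attribute these costs back to projects, one hedged starting point is the SNOWFLAKE.ACCOUNT_USAGE views; the query below is only a sketch, and the 30-day window is an arbitrary choice, not something the guide prescribes.

-- Credits consumed by each of the expense types listed above, last 30 days
SELECT 'warehouses' AS expense_type, SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
UNION ALL
SELECT 'snowpipe', SUM(credits_used)
FROM snowflake.account_usage.pipe_usage_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
UNION ALL
SELECT 'materialized views', SUM(credits_used)
FROM snowflake.account_usage.materialized_view_refresh_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP());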
YOUR SNOWFLAKE ACCOUNT

Snowflake has many parameters to configure default behavior. Most will likely work for your requirements without changing them, but there are a few properties that deserve special attention from enterprise customers.
In particular, be sure to evaluate Snowflake policies that influence:
• Data Retention
• Timezone
• Security
• Connection Performance
• Cost

DATA RETENTION
Parameter: DATA_RETENTION_TIME_IN_DAYS (90)
Data retention time is how long Snowflake will retain historical views of your data.
The default data retention time is one day, but if you are paying for the Enterprise Edition, you will likely want this to be 90 days. The extended period will allow you to perform Time Travel activities, such as undropping tables or comparing new data against historical values.

TIP
For cost savings, you can set this value for each database and choose to have non-production data stored for fewer days. One day is usually adequate for development use.

TIMEZONE
Parameter: TIMEZONE (Etc/UTC)
Snowflake will present time-related values with the timezone in your configuration. The default timezone used by Snowflake is America/Los_Angeles. However, you should evaluate what works best for you. Some companies set the timezone to their corporate headquarters’ timezone, while others use UTC.
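As a quick sketch of how these parameters are set (the database name below is hypothetical, and ALTER ACCOUNT requires the ACCOUNTADMIN role):

-- Extend Time Travel to 90 days account-wide (Enterprise Edition)
ALTER ACCOUNT SET DATA_RETENTION_TIME_IN_DAYS = 90;

-- Keep non-production data cheaper by retaining only one day
ALTER DATABASE dev_db SET DATA_RETENTION_TIME_IN_DAYS = 1;

-- Present time-related values in UTC rather than the Los Angeles default
ALTER ACCOUNT SET TIMEZONE = 'Etc/UTC';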
TRAM
With phData Tram, you can:
• Quickly ramp up new projects through the reuse of information architecture
• Speed user onboarding and simplify access management
• Quickly apply new changes to Snowflake
• Manage hundreds or thousands of project environments, groups, and workspaces automatically
• More easily create, verify, and reuse complex information architectures
Additionally, Tram facilitates your information architecture.

If so, you will need an OAuth provider like Okta, Microsoft Azure AD, Ping Identity PingFederate, or a custom OAuth 2.0 authorization server.

WHEN TO USE SAML2
SAML2 is used to authenticate users logging into the Snowflake UI, for Snowflake connectors, or ODBC/JDBC connections that rely on credentials.
If you use Active Directory Federation Services (ADFS) or Okta, you may use the “basic” option and configure a SAML Identity Provider. This method requires updating an account-level parameter named SAML_IDENTITY_PROVIDER. However, the more standard security integration syntax is replacing the identity provider parameter method.
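For reference, a hedged sketch of that security integration syntax; the integration name, issuer, SSO URL, and certificate below are placeholders you would take from your identity provider, not values from this guide.

-- SAML2 security integration (placeholder values throughout)
CREATE SECURITY INTEGRATION okta_sso
  TYPE = SAML2
  ENABLED = TRUE
  SAML2_ISSUER = 'http://www.okta.com/exk-example'
  SAML2_SSO_URL = 'https://example.okta.com/app/snowflake/exk-example/sso/saml'
  SAML2_PROVIDER = 'OKTA'
  SAML2_X509_CERT = 'MIIC...';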
• It creates poorly partitioned tables, which will make queries slow.
Instead, always load data in bulk by using either Snowpipe or the SQL “COPY INTO” statement, which is how all Snowflake-approved connectors work.
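A minimal sketch of that bulk path, assuming a hypothetical stage, target table, and named file format:

-- Bulk load staged files instead of row-by-row INSERTs
COPY INTO raw.orders
FROM @raw.landing_stage/orders/
FILE_FORMAT = (FORMAT_NAME = 'raw.csv_gzip');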
File Sizes
So, you should load data in bulk—but what qualifies as bulk, and why does it matter?
Snowflake stores data in micro-partitions of up to 500 MB. Each of these stores metadata needed to optimize query performance. By having small micro-partitions, or micro-partitions that aren’t homogeneous, your queries will read additional partitions to find results. These unoptimized partitions will return results slower, frustrate consumers, and increase credit consumption.
Knowing this, you want to have data prepared in a way to optimize your load. It might be tempting to have massive files and let the system sort it out. Unfortunately, having excessively large files will make the loads slower and more expensive.
Aim for 100-megabyte files. Less than ten megabytes or more than a gigabyte, and you will notice suboptimal performance. Snowflake publishes file size guidelines, and phData recommends checking periodically to see if they have changed.
File Format
If you want pure performance, compressed CSVs load fastest and use the least credits. But there are other considerations if applications other than Snowflake pull files from your cloud storage. The CSV format only supports structured data, which can be a nonstarter in some situations. In cases where CSVs may be a poor fit, Snowflake also accepts semi-structured formats such as JSON, Avro, ORC, Parquet, and XML.
You may be limited to the formats that your data sources produce. There is no “wrong” choice when it comes to file format, but having a policy may help with performance and help developers with common patterns.
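If you do standardize on a format, a named file format keeps that policy in one place. A hedged example of two common choices (the names and options are illustrative, not prescribed by the guide):

-- Compressed CSV: typically the fastest pure load path
CREATE FILE FORMAT raw.csv_gzip
  TYPE = CSV
  COMPRESSION = GZIP
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1;

-- Parquet: self-describing, useful when other tools read the same files
CREATE FILE FORMAT raw.parquet_format
  TYPE = PARQUET;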
Snowpipe
If you can get data into an external stage, you can get the data into Snowflake using Snowpipe.
For technical details, search for Snowpipe how-to guides online. But know that there are some significant decisions to make before using Snowpipe:
• How will data make it reliably and securely to your cloud storage?
• How will you create the schema for data loads?
• How will you handle changes to the schema of the incoming data?
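As a rough illustration of the moving parts, a Snowpipe setup might look like the sketch below; the stage, storage integration, pipe, table, and file format names are all hypothetical, and AUTO_INGEST additionally requires cloud event notifications to be configured.

-- Hypothetical external stage over cloud storage
CREATE STAGE raw.landing_stage
  URL = 's3://example-bucket/landing/'
  STORAGE_INTEGRATION = my_s3_integration;

-- Pipe that loads files as they arrive in the stage
CREATE PIPE raw.orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw.orders
  FROM @raw.landing_stage/orders/
  FILE_FORMAT = (FORMAT_NAME = 'raw.csv_gzip');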
ESTABLISHING SCHEMAS AND HANDLING SCHEMA DRIFT
If you are using Snowpipe, you might maintain schemas manually. This may work if you expect your schemas to be static, but the safer approach is to have a plan for detecting and adjusting to schema changes.
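The guide does not prescribe a mechanism, but as one hedged sketch, Snowflake can derive a table definition from staged files; the stage and file format names below are the hypothetical ones used earlier.

-- Create the target table from the schema inferred from staged Parquet files
CREATE TABLE raw.orders
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(
      INFER_SCHEMA(
        LOCATION => '@raw.landing_stage/orders/',
        FILE_FORMAT => 'raw.parquet_format'
      )
    )
  );
-- COPY INTO ... MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE can then tolerate
-- column reordering in incoming files.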
Snowflake Connectors
For accessing data, you’ll find a slew of Snowflake connectors on the Snowflake website. You can use whatever works best for your technology (e.g., ODBC, JDBC, Python Snowflake Connector), and generally, things will be okay. Be sure to test your scenarios, though. Some connectors, like the one for Python, appear to use S3 for handling large amounts of data, and that can fail if your network does not allow the connectivity.
And once again, for loading data, do not use SQL Inserts. You will find options for most major data migration tools and technologies like Kafka and Spark.

SQLMORPH
SQLMorph is a free SaaS application that can translate SQL from one dialect to another.

Third-Party Products
Many third-party products can migrate data and manage schema drift. A word of caution: some applications are new to the Snowflake arena, so verify that they will work well for you.
phData has partnerships with Qlik, HVR, Fivetran, and StreamSets. We’re able to help customers identify the tooling that best fits their particular needs.
Whatever products you choose, be sure to establish a process to manage change over time, and handle failures in your data pipelines.

Optimizing Data for Use
There is no need to optimize your data prematurely. Load it as-is and see how queries perform. If they are slow, check the query profile to see whether queries are reading many micro-partitions. If they are, you have options.
Micro-partitions help queries run faster when sized well, but you can also influence performance by making the frequently used columns homogeneous in partitions. Sending files from your source systems pre-sorted by frequently filtered-upon columns may help optimize partitions.
If you keep your data volume low and your file sizes small, you may not be able to influence the micro-partitions in this way. But don’t despair: you still have an option.
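The guide does not name the option here, but one way to inspect micro-partition layout in Snowflake, and to influence it where pre-sorting is not practical, is clustering information and, for large tables, a clustering key; the table and column names below are hypothetical.

-- How well do micro-partitions line up with a frequently filtered column?
SELECT SYSTEM$CLUSTERING_INFORMATION('raw.orders', '(order_date)');

-- If pruning is poor and the table is large, a clustering key is one option
ALTER TABLE raw.orders CLUSTER BY (order_date);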
DATABASE OBJECTS

Anything you make in Snowflake is a database object. Make sure to have a plan to allocate the sets of objects that comprise a given project, so you can track the expenses and resources around them as a unit, which is a critical part of managing your Snowflake budget.

It is possible to look up the necessary SQL syntax to create a table or establish a role, and to simply use the Snowflake UI to make objects. This manual approach works well when you want to make a single object. However, this is not a good practice overall.
Take role creation, for example. As a general best practice, you should grant all custom roles to the SYSADMIN role; otherwise, you would end up with roles floating around that cannot be managed by the people who manage the account. Beyond being granted to the SYSADMIN role, you may need to enforce your custom role hierarchies so people in charge of a given project can see the objects created by people working on that project.
You could always write a document that specifies these steps and rely on people following them to create Snowflake roles correctly; but in practice, you will eventually have issues. Fortunately, automation can make this process far less manual, time-consuming, and prone to error.
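To make the practice concrete, a minimal sketch of the grants described above; the project role names are hypothetical.

-- Create a project role and attach it to the managed hierarchy
CREATE ROLE project_x_analyst;
GRANT ROLE project_x_analyst TO ROLE SYSADMIN;

-- A project-level admin role that can see what project members create
CREATE ROLE project_x_admin;
GRANT ROLE project_x_admin TO ROLE SYSADMIN;
GRANT ROLE project_x_analyst TO ROLE project_x_admin;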
Stored Procedures
One way to help people properly create new roles within the hierarchy is to create a stored
procedure that can create roles with the requisite grants and ownership on behalf of the
user without actually permitting the user to create the role on their own.
While a little syntactically clumsy, using stored procedures is easier and more cost-
effective than trying to fix role hierarchies by hand later.
You might end up with one fancy stored procedure that takes in multiple parameters to
allow admins to make roles for more than one project. The stored procedure would verify
that they have access to do the creation. Or, a more straightforward but verbose approach
might be one stored procedure per project that only admins of that project can access.
Whatever you design, find a means to make these roles in a repeatable, secure, and correct
way within your development process.
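A minimal sketch of that idea in Snowflake Scripting, assuming a hypothetical admin schema and the hypothetical project roles from earlier; a production version would need input validation and project-specific access checks.

CREATE OR REPLACE PROCEDURE admin.create_project_role(role_name STRING)
RETURNS STRING
LANGUAGE SQL
EXECUTE AS OWNER  -- runs with the owner's privileges, so callers never hold CREATE ROLE themselves
AS
$$
BEGIN
  -- Create the role and place it in the managed hierarchy under SYSADMIN
  EXECUTE IMMEDIATE 'CREATE ROLE IF NOT EXISTS ' || role_name;
  EXECUTE IMMEDIATE 'GRANT ROLE ' || role_name || ' TO ROLE SYSADMIN';
  RETURN 'Created role ' || role_name;
END;
$$;

-- Grant only the ability to call the procedure, not to create roles directly
GRANT USAGE ON PROCEDURE admin.create_project_role(STRING) TO ROLE project_x_admin;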
YOUR PROJECTS

The simplest means of transforming data is to put a view over it. There are multiple types of views, each of which has its own benefits and drawbacks.

…time of the query, even when pulling data from the result cache.

Non-materialized
Non-materialized views are your standard, average view with some optimizations to help with performance. The caller’s warehouse will pay the bill for any transformation done using this view.

Secure
Secure views are specialized to avoid specific vulnerabilities. You can read more about them on the Snowflake website, but it is important to note that they are slow, as they cannot utilize some optimizations that other views are allowed to perform.
Before using a secure view, consider whether you can use another type of view or do most of the transformation work outside of the secure view.
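As a small, hypothetical illustration of the difference in syntax (the schemas, tables, and transformation are placeholders):

-- Ordinary (non-materialized) view: the caller's warehouse does the work at query time
CREATE VIEW analytics.orders_enriched AS
  SELECT o.order_id, o.order_date, c.region
  FROM raw.orders o
  JOIN raw.customers c ON c.customer_id = o.customer_id;

-- Secure view: hides the definition and avoids certain optimizer-based leaks, at a performance cost
CREATE SECURE VIEW analytics.orders_shared AS
  SELECT order_id, order_date, region
  FROM analytics.orders_enriched;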
SPARK
Either before moving data into your external stage or after the data
is loaded, new datasets can be created by combining multiple files
using external systems such as a Databricks Spark application.
YOUR PLATFORM AND PIPELINES

You have data flowing, and everything is great—or so you think! But then you realize a data pipeline stopped working two days ago, and an out-of-control query (which has apparently been running since last weekend) has eaten up the entire budget for a small project.
The point? All systems tend toward entropy; things go wrong. But without monitoring, nobody is even aware until there’s a customer complaint or the next budget review happens.
Don’t let this happen to you! Here is what phData recommends for monitoring.

Snowflake provides some essential account-level usage information and a dashboard, but that dashboard is only useful if someone is looking at it. For building custom monitoring, several Snowflake views have metadata about your account usage.
You will quickly run into two issues:
• First, the only role that can see everything useful is the ACCOUNTADMIN role, and giving out access to this is like handing the nuclear codes to a toddler.
• Second, Snowflake does not organize information by your company’s budget groupings. Project X has a budget, and Project Y has its budget. Although you may have bought the credits in bulk for both projects to save money, you will presumably want to deduct credits from specific budgets.
You won’t be giving out the ACCOUNTADMIN role to many people, so the project members who need to know about an issue with their data pipelines will not know until you tell them. You may therefore want to devise a custom solution on top of the base Snowflake offering, in order to ensure that the people associated with a workspace resource are notified of issues in a timely fashion.

For example, having a list of the top ten warehouses that are consuming credits, and then looking at the top ten queries for each might present improvement opportunities. Cost optimization is also why it is valuable to have warehouses for each purpose: it makes it easier to identify these situations.
But no matter how simple or complex your needs, be sure to make a plan to track your daily and monthly usage. This data is valuable to the business units using your platform, which otherwise wouldn’t have access to aggregate it. With the proper design, you can expose the details projects need to identify the warehouses, queries, and processes that can be optimized.

TOO MUCH WORK?
phData has a Cloud DataOps offering that will monitor your platform and data pipelines for you. They operate 24x7 and keep your data moving. Find out how.

AUDITING
Snowflake tracks 365 days of most audit-type information. If you need more, you may need to come up with a custom solution to store history beyond that period.
And even if you don’t have compliance reasons to store everything, having data aggregated by day may allow you to create usage forecasts if you have access to data science resources. (And if you don’t, that’s another area where phData can help.)
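Pulling these threads together, a hedged sketch of a custom monitoring setup: a grant that avoids handing out ACCOUNTADMIN, the top-ten-warehouse list mentioned above, and a daily aggregate kept beyond the 365-day window. The role and table names are hypothetical, and the SNOWFLAKE.ACCOUNT_USAGE views lag real time by up to a few hours.

-- Let a monitoring role read account usage without ACCOUNTADMIN
GRANT IMPORTED PRIVILEGES ON DATABASE snowflake TO ROLE usage_monitor;

-- Top ten warehouses by credits over the last 30 days
SELECT warehouse_name, SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits DESC
LIMIT 10;

-- Keep a daily aggregate in your own table, beyond Snowflake's retention window
CREATE TABLE IF NOT EXISTS audit.daily_warehouse_credits AS
SELECT DATE_TRUNC('day', start_time) AS usage_day,
       warehouse_name,
       SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
GROUP BY 1, 2;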