snowflake_notes
1. OVERVIEW
- Data Sharing – Direct Share:
o Objects included in a share:
Tables (temporary tables cannot be shared; unsure whether transient tables can).
External tables.
Dynamic tables.
Secure views.
Secure materialized views.
Secure UDFs.
o Each share consists of:
Privileges that grant access to the DB and the schema containing the objects
shared. At least USAGE.
Privileges that grant access to the specific objects in the DB. At least SELECT.
The list of consumer accounts.
o Types of Secure Data Sharing:
Direct Shares (seen as inbound shares by the consumer): objects shared directly with
another account in the same region.
Listings: objects + metadata.
Data Exchange: group created by the provider with different consumers.
o Reader account: created and paid for by the provider, so a consumer without its own Snowflake account can query the shared data.
o Cannot share a share.
o ACCOUNTADMIN, or a role with the IMPORT SHARE privilege, must create a DB from the share
before its objects can be queried.
o GRANT IMPORTED PRIVILEGES / REVOKE IMPORTED PRIVILEGES on that DB controls which other roles can use the shared objects.
o Every account has two inbound shares: ACCOUNT_USAGE and SAMPLE_DATA.
o A secure view can combine data from different DBs and then be shared with a consumer account.
o Cross-region sharing requires data replication. You replicate once per region, the number
of consumers in the region doesn’t matter.
o An object added by the data provider is instantly accessible by the consumer.
o You can’t create a table in a shared DB. Read-only DBs.
o Actions performed by the consumer on a share:
Query tables and join them with existing tables of their account.
Copy shared data into another table in their account.
Time travel NOT available.
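o Example of the provider/consumer flow (a minimal sketch; sales_db, sales_share, account and role names are hypothetical):
-- Provider side
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = consumer_account;
-- Consumer side
CREATE DATABASE sales_from_share FROM SHARE provider_account.sales_share;
GRANT IMPORTED PRIVILEGES ON DATABASE sales_from_share TO ROLE analyst;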
- Costs:
o Storage: billed on the average daily amount of compressed data stored (staged files, tables,
Time Travel and Fail-safe).
o Account & Usage page: where storage costs are shown in the UI.
o Cloud Services compute: only the portion exceeding 10% of the daily WH compute usage is billed.
- Architecture:
o Cloud services layer:
Query compilation
The cloud services layer does NOT manage availability zones (that is handled by the cloud provider).
Snowflake meets ACID (Atomicity, Consistency, Isolation, and Durability)
compliance.
- Editions:
o Enterprise:
Search Optimization Service + Query Acceleration Service.
Materialized views.
Row/column access policies.
Multi-cluster WHs.
ACCOUNT_USAGE.ACCESS_HISTORY view.
o Business Critical:
Made for organizations with extremely sensitive data.
Particularly for PHI data that must comply with HIPAA and HITRUST CSF
regulations.
Supports private connectivity to Snowflake service through:
AWS PrivateLink.
Azure Private Link.
Google Cloud Private Service Connect.
Supports encrypted communication between the Snowflake VPC and other VPCs (in
the same region).
Tri-Secret Secure encryption:
Requires Snowflake support to activate.
Composite master key: Snowflake + customer managed key.
DB failover and failback support between Snowflake accounts.
o Virtual Private Snowflake (VPS): highest-isolation edition with its own metadata store and
compute resources; data sharing and the Data Marketplace are not allowed.
- Context functions:
o select current_region() / current_account() / …
o current_client() returns the client's version (e.g. the version of the JDBC driver).
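o Quick illustration:
select current_region(), current_account(), current_role(), current_warehouse(), current_client();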
- Comments: -- or //
- Availability zones: each cloud region usually has 3. They are physically separated data centers.
- URL: https://account_locator.region_id.cloud.snowflakecomputing.com
- ALTER VIEW <view_name> SET SECURE;
- Data Marketplace:
o Two types of listings:
Standard, usually publicly available and free.
Personalized, usually require a request and a payment.
o Data providers must share fresh, real and legally shareable data.
o Can be browsed by non-Snowflake users.
o ACCOUNTADMIN or a role that has the IMPORT SHARE privilege, as with other shares.
- Snowpark: develop, deploy and run non-SQL code against Snowflake data without moving it; your
code is pushed down and executed as SQL inside Snowflake.
o Python – Java – Scala.
o The code is lazily executed.
- Snowflake Scripting:
o Extension of Snowflake SQL to support procedural logic.
o Typically used to write stored procedures.
o DECLARE, BEGIN/END, EXCEPTION.
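o Minimal sketch of an anonymous Snowflake Scripting block (the counter logic is just illustrative):
EXECUTE IMMEDIATE $$
DECLARE
counter INTEGER DEFAULT 0;
BEGIN
counter := counter + 1;
RETURN counter;
EXCEPTION
WHEN OTHER THEN
RETURN 'error: ' || SQLERRM;
END;
$$;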
- Snowsight: the web UI.
o Each worksheet is an independent session.
o It allows you to:
Share worksheets between users in the same account.
Run ad-hoc queries and DDL/DML operations.
Export results of a SELECT statement.
o Set default Role and WH for a user.
- SnowSQL: CLI available for Windows, Linux and macOS.
- Drivers: write applications that perform operations in Snowflake using the driver’s supported
language.
o Go – JDBC – ODBC – .NET – Node.js – PHP – Python.
o Kafka – Spark Connectors.
- SNOWFLAKE.ACCOUNT_USAGE:
o It’s a Snowflake share with metadata.
o Key differences with INFORMATION_SCHEMA table functions:
Data Latency (45’ to 3h).
Longer retention period (1 year).
Includes dropped objects.
o Some views require Enterprise Edition.
o Views examples:
QUERY_HISTORY view: 45’ latency. Per-query history, useful for analyzing WH load and performance.
STORAGE_USAGE view: average daily data storage.
LOGIN_HISTORY view: login attempts.
METERING_HISTORY view: hourly credit usage for the whole account (WHs, serverless features and cloud services).
WAREHOUSE_METERING_HISTORY view: hourly credit usage at WH level.
DATABASE_STORAGE_USAGE_HISTORY: includes time travel and fail-safe.
COPY_HISTORY: both COPY INTO and continuous data loading with Snowpipe.
LOAD_HISTORY view: data load with COPY INTO <table>.
PIPE_USAGE_HISTORY: data loading history using Snowpipe.
ACCESS_HISTORY (Enterprise): information about access to tables and columns.
SQL read statements.
DML operations such as INSERT, UPDATE, DELETE.
Variations of the COPY command.
READER_ACCOUNT_USAGE:
Views for all reader accounts created.
These views are a subset of the ACCOUNT_USAGE views, with the addition of the
RESOURCE_MONITORS view.
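o Example query (assumes a role with access to the SNOWFLAKE database):
select warehouse_name, sum(credits_used) as credits
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd(day, -7, current_timestamp())
group by warehouse_name;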
- INFORMATION_SCHEMA:
o Metadata of DB objects and some non-DB objects common across all DBs, such as roles,
WHs and DBs.
o From 7 days to 6 months of metadata depending on the view/table function.
o A query that is not selective enough fails with an error because it returned too much data.
o LOGIN_HISTORY_BY_USER():
These table functions have no latency (see the example after this list).
o AUTO_REFRESH_REGISTRATION_HISTORY:
History of data files registered in the metadata of specified objects.
Credits billed for these operations.
14 days of billing history.
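o Example of calling an INFORMATION_SCHEMA table function (user name and limit are illustrative; needs a database in context):
select *
from table(information_schema.login_history_by_user(user_name => 'JSMITH', result_limit => 100));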
- SnowCD (Snowflake Connectivity Diagnostic Tool): troubleshooting network connection.
- SELECT LAST_QUERY_ID(-1) returns the most recent query ID (the default). LAST_QUERY_ID(1) returns the first query of the current session.
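o Example: feed it to RESULT_SCAN to re-read the previous result set (illustrative):
select * from table(result_scan(last_query_id()));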
- Sampling:
o SYSTEM | BLOCK: samples micro-partitions (blocks).
o BERNOULLI | ROW: samples individual rows (the default method).
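o Example (hypothetical table name):
select * from orders sample bernoulli (10); -- ~10% of rows
select * from orders sample system (10); -- ~10% of micro-partitions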
- INSERT OVERWRITE truncates the target table and then inserts.
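o Example (hypothetical tables): insert overwrite into target_table select * from staging_table;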
- TIMESTAMP_NTZ (no time zone) is the default data type for TIMESTAMP columns.
- Fail-safe:
o The 7-day Fail-safe period can't be altered at account, database, schema or table level. Use
temporary/transient tables instead (they have no Fail-safe).
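o Sketch (hypothetical table name; transient tables skip Fail-safe and allow at most 1 day of Time Travel):
create transient table staging_events (id int, payload variant) data_retention_time_in_days = 1;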
4. DATA LOADING/UNLOADING
- Storage Integration: object that stores the identity and access configuration for an external
cloud provider, so stages can be created and data loaded/unloaded there without passing credentials (see the sketch below).
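o Minimal sketch (the role ARN, bucket and object names are placeholders):
CREATE STORAGE INTEGRATION s3_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my_snowflake_role'
STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/load/');
CREATE STAGE my_ext_stage
URL = 's3://my-bucket/load/'
STORAGE_INTEGRATION = s3_int;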
- Snowpipe:
o 14 days of load history.
o REST APIs: endpoints to interact with pipes. For internal and external stages.
insertFiles: files to be ingested into a table.
insertReport: report of files submitted through insertFiles and ingested into a
table.
loadHistoryScan: similar to insertReport but for a specified time range. Up to 10,000
items returned. Prefer insertReport to avoid errors from excessive calls.
o AUTO_INGEST = TRUE: enables automatic data loading from external stages.
o AUTO_INGEST = FALSE: requires making calls to the REST APIs.
o Event notifications:
Snowflake accounts hosted on AWS support auto-ingest notifications from all cloud platforms.
Accounts on GCP and Azure only support notifications from their own platform.
o File sizing recommendations:
100 – 250 MB of compressed data per file.
Files larger than 100 GB are not recommended.
Maximum allowed data load duration: 24 hours.
o Basic transformations allowed: column ordering, omitting, casting or truncation.
o It can't reload a file with the same name twice (load metadata is kept for 14 days).
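o Minimal auto-ingest pipe sketch (database, schema, stage and table names are hypothetical):
CREATE PIPE my_db.public.events_pipe
AUTO_INGEST = TRUE
AS COPY INTO my_db.public.events
FROM @my_db.public.events_stage
FILE_FORMAT = (TYPE = 'JSON');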
- Bulk load – COPY INTO statement: from on-premises or cloud storage.
o Requires user-managed WH.
o 64 days of load history.
o copy into <table> from @~/<file>.xml
file_format = (type = 'XML' strip_outer_element = true); strip_outer_array (JSON)
o File size recommendations and transformations are the same as for Snowpipe.
o VALIDATE:
Validates the files loaded in the last execution of the COPY INTO statement and
returns all errors encountered.
Does not support COPY statements with transformations.
o VALIDATION_MODE:
Validates data files instead of loading them.
Does not support COPY statements with transformations.
Types:
RETURN_<n>_ROWS: validates the first n rows and fails at the first error found.
RETURN_ERRORS: returns all errors across the files specified in the statement.
RETURN_ALL_ERRORS: same, plus errors from files partially loaded in an earlier run with ON_ERROR = CONTINUE.
o ON_ERROR:
CONTINUE.
SKIP_FILE.
ABORT_STATEMENT. Default behavior.
o COPY options:
LOAD_UNCERTAIN_FILES = TRUE
Checks load metadata to avoid duplication and also loads files whose load
status is unknown (e.g. metadata older than 64 days).
FORCE = TRUE
Loads every file regardless of load metadata (may create duplicates).
OBJECT_CONSTRUCT (a function used in COPY transformations, not a COPY option)
Transforms structured data into the VARIANT data type.
o Select the files to load by:
List of specific files.
Pattern matching.
Path (internal stage) or prefix (Amazon S3 bucket).
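o Sketch of a bulk load plus a dry-run validation (stage, table, pattern and format details are hypothetical):
COPY INTO my_db.public.orders
FROM @my_stage/orders/
PATTERN = '.*orders_.*[.]csv[.]gz'
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
ON_ERROR = 'SKIP_FILE';
-- validate without loading
COPY INTO my_db.public.orders
FROM @my_stage/orders/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
VALIDATION_MODE = 'RETURN_ERRORS';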
- Load with UI wizard:
o background = PUT + COPY INTO.
o Designed for a few small files (<50MB).
- Concurrent workload processing is managed in both COPY INTO and Snowpipe.
- Avoid data loading into a table from Snowpipe and bulk load at the same time.
- Unload data:
o By default:
Parallelizing.
16MB each file.
Compressed format.
Automatically decrypted when downloaded from an internal stage with GET.
o SINGLE = TRUE allows up to 5 GB in a single file (together with MAX_FILE_SIZE).
o Supports PARTITION BY <expr> for partitioned data unload into the stage.
o The source can be a table or a SELECT statement.
o Allowed compression algorithms:
CSV and JSON: GZIP | BZ2 | BROTLI | ZSTD | DEFLATE | RAW_DEFLATE
Parquet: LZO | SNAPPY
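o Unload sketch (stage and table names are hypothetical):
COPY INTO @my_stage/unload/orders_
FROM my_db.public.orders
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
OVERWRITE = TRUE;
-- single file of up to 5 GB
COPY INTO @my_stage/unload/orders.csv.gz
FROM my_db.public.orders
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
SINGLE = TRUE MAX_FILE_SIZE = 5368709120;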
- Cache:
o Resultset cache:
No changes on the underlying data.
The new query must syntactically match the previous one (case sensitive; adding or removing an alias prevents reuse).
The query must not use runtime-evaluated functions, UDFs or external functions.
Queries with functions like CURRENT_DATE are still eligible for result caching.
Retention of 24 hours, renewed on each reuse, up to 31 days after the first execution.
Can be turned off at session, user or account level with the
USE_CACHED_RESULT parameter.
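Example: ALTER SESSION SET USE_CACHED_RESULT = FALSE; -- disable result cache reuse for the session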
- Query optimizer: examines metadata cache 1st, result cache 2nd and warehouse cache 3rd.
- Query profile:
o Only for completed queries (to be verified).
o Available for 14 days for any user for every query.
o Typical performance issues:
Exploding JOINS.
UNION without ALL.
Queries that don’t fit in memory.
Partition pruning issues.
o Execution time:
Processing — time spent on data processing by the CPU.
Local Disk IO — time when the processing was blocked by local disk access.
Remote Disk IO — time when the processing was blocked by remote disk access.
Network Communication — time when the processing was waiting for the
network data transfer.
Synchronization — various synchronization activities between participating
processes.
Initialization — time spent setting up the query processing.
o Statistics:
IO – input-output operations:
Scan progress: percentage of data scanned for a table so far.
Bytes scanned: # of bytes scanned so far.
Percentage scanned from cache: from local (WH) cache.
Bytes written: written when loading into a table.
Bytes written to result: size of the result set.
Bytes read from result.
External bytes scanned: from an external object such as a stage.
Pruning.
Spilling: disk usage when intermediate results don’t fit in memory. Slower
performance because it requires more IO operations and disk access is slower
than memory access.
Bytes spilled to local storage: to local disk.
Bytes spilled to remote storage: to remote disk.
Network: network communication with other applications. BI tools, for example.
Bytes sent over the network.
- Optimizing Query Performance:
o Clustering the table.
o Materialized Views (Enterprise).
o Search optimization service (Enterprise): lookup queries.
It uses a search access path. Persistent metadata about column values in each
micro-partition. May be similar to an index in relational DBs.
Queries that do not benefit:
External tables.
Dynamic tables.
Materialized views.
COLLATE columns.
Column concatenation.
Analytical expressions.
Cast on columns. Except for numeric column cast to string.
Queries that do benefit:
Equality searches.
Substring and regular expression searches.
Searches in a VARIANT column.
Searches in GEOGRAPHY column with geospatial functions.
ALTER TABLE t1 ADD SEARCH OPTIMIZATION [ON EQUALITY(c1), SUBSTRING(c2)];
o Query Acceleration Service (Enterprise):
Offloads parts of the query processing work to shared compute resources.
Server availability allows more parallelization, but performance may fluctuate.
Might benefit:
Ad-hoc analysis.
Workloads with unpredictable data volume per query.
Queries with large scans and selective filters.
Not eligible queries:
Not enough partitions to scan.
No filters or aggregations.
Not selective enough filters or high cardinality aggregations.
LIMIT without ORDER BY.
Queries with random functions.
Detect eligible queries:
SYSTEM$ESTIMATE_QUERY_ACCELERATION Function
ACCOUNT_USAGE.QUERY_ACCELERATION_ELIGIBLE view
Enable the service at WH level.
Scale factor:
Cost control mechanism to limit the compute resources used.
8 by default: the service can use up to 8 times the WH's compute resources.
There are 3 QUERY_ACCELERATION_* columns in the QUERY_HISTORY view to see the effects of the service.
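Example (warehouse name is hypothetical):
SELECT SYSTEM$ESTIMATE_QUERY_ACCELERATION('<query_id>');
ALTER WAREHOUSE analytics_wh SET ENABLE_QUERY_ACCELERATION = TRUE QUERY_ACCELERATION_MAX_SCALE_FACTOR = 8;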
- Optimize WH performance:
o Reduce queuing.
o Resolve memory spillage.
o Increase WH size.
o Try query acceleration.
o Optimize WH cache.
o Limit concurrent queries.
- Algorithms:
o Estimate count(distinct): HyperLogLog algorithm.
o Estimate percentiles: t-Digest.
o Estimate approximate frequent values: Space-Saving.
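o Example (hypothetical table and columns):
select approx_count_distinct(user_id), -- HyperLogLog
approx_percentile(amount, 0.95), -- t-Digest
approx_top_k(product_id, 10) -- Space-Saving
from sales;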
- EXPLAIN plan: useful to evaluate query efficiency.
o Compiles the SQL but does not execute it -> no WH required.
o Gives info about:
Partition pruning.
Join ordering.
Join types.
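o Example (hypothetical tables):
explain using text
select o.* from orders o join customers c on o.customer_id = c.customer_id;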