SNOWFLAKE NOTES

1. OVERVIEW
- Data Sharing – Direct Share:
o Objects included in a share:
 Tables (temporary/transient tables presumably cannot be shared).
 External tables.
 Dynamic tables.
 Secure views.
 Secure materialized views.
 Secure UDFs.
o Each share consists of:
 Privileges that grant access to the DB and the schema containing the objects
shared. At least USAGE.
 Privileges that grant access to the specific objects in the DB. At least SELECT.
 The list of consumer accounts.
o Types of Secure Data Sharing:
 Direct Share: objects shared directly with another account in your region
(appears as an inbound share on the consumer side).
 Listings: objects + metadata.
 Data Exchange: group created by the provider with different consumers.
o Reader account: belongs to the provider.
o Cannot share a share.
o ACCOUNTADMIN or role that has IMPORT SHARE privilege. This user will have to create a
DB before querying DB objects of the share.
o GRANT_IMPORTED_PRIVILEGES/REVOKE_IMPORTED_PRIVILEGES
o Every account has two inbound shares: ACCOUNT_USAGE and SAMPLE_DATA.
o Create secure view from different DBs. Share the data to a consumer account.
o Cross-region sharing requires data replication. You replicate once per region, the number
of consumers in the region doesn’t matter.
o An object added by the data provider is instantly accessible by the consumer.
o You can’t create a table in a shared DB. Read-only DBs.
o Actions performed by the consumer on a share:
 Query tables and join them with existing tables of their account.
 Copy shared data into another table in their account.
 Time travel NOT available.
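The provider/consumer flow above can be sketched in SQL. This is a minimal sketch; the database, share, object and account names (sales_db, sales_share, consumer_acct, provider_acct) are all hypothetical:

```sql
-- Provider side: create the share and grant the minimum privileges
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = consumer_acct;

-- Consumer side (ACCOUNTADMIN or a role with IMPORT SHARE):
-- a database must be created from the share before querying its objects
CREATE DATABASE shared_sales FROM SHARE provider_acct.sales_share;
GRANT IMPORTED PRIVILEGES ON DATABASE shared_sales TO ROLE analyst;
```

Note that privileges on the shared database are granted with GRANT IMPORTED PRIVILEGES, not object-by-object.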
- Costs:
o Storage: average daily amount of compressed data stored. Files Staged, tables, Time
Travel and Fail Safe.
o Account & Usage: to see storage costs in the UI.
o Cloud Services Compute: charged only for the portion exceeding 10% of daily WH usage.
- Architecture:
o Cloud services layer:
 Query compilation
 Snowflake does NOT manage availability zones (the cloud provider does).
 Snowflake meets ACID (Atomicity, Consistency, Isolation, and Durability)
compliance.
- Editions:
o Enterprise:
 Search Optimization Service + Query Acceleration Service.
 Materialized views.
 Row/column access policies.
 Multi-cluster WHs.
 ACCOUNT_USAGE.ACCESS_HISTORY view.
o Business Critical:
 Made for organizations with extremely sensitive data.
 Particularly for PHI data that must comply with HIPAA and HITRUST CSF
regulations.
 Supports private connectivity to Snowflake service through:
 AWS PrivateLink.
 Azure Private Link.
 Google Cloud Private Service Connect.
 Supports encrypted communication between the Snowflake VPC and other VPCs (in
the same region).
 Tri-Secret Secure encryption:
 Requires Snowflake support to activate.
 Composite master key: Snowflake + customer managed key.
 DB failover and failback support between Snowflake accounts.
o Virtual Private Snowflake: data sharing and data marketplace are not allowed.
 Isolation edition. It has its own metadata store and compute resources.
- Context functions:
o select current_region ()/…
o current_client () returns client’s version. The version of the JDBC driver, for example.
- Comments: -- or //
- Availability zones: each cloud region usually has 3. They are physically separated data centers.
- URL: https://account_locator.region_id.cloud.snowflakecomputing.com
- ALTER VIEW <<view name>> SET SECURE;
- Data Marketplace:
o Two types of listings:
 Standard, usually publicly available and free.
 Personalized, usually require a request and a payment.
o Data providers must share fresh, real and legally shareable data.
o Can be browsed by non-Snowflake users.
o ACCOUNTADMIN or a role that has the IMPORT SHARE privilege, as in other shares.
- Snowpark: deploy and process non-SQL code using Snowflake data (serverless). You write in
your language and Snowflake pushes it down to execute in SQL.
o Python – Java – Scala.
o The code is lazily executed.
- Snowflake Scripting:
o Extension of Snowflake SQL to support procedural logic.
o Typically used to write stored procedures.
o DECLARE, BEGIN/END, EXCEPTION.
- Snowsight: the web UI.
o Each worksheet is an independent session.
o It allows to:
 Share worksheets between users in the same account.
 Run ad-hoc queries and DDL/DML operations.
 Export results of a SELECT statement.
o Set default Role and WH for a user.
- SnowSQL: CLI available for Windows, Linux and macOS.
- Drivers: write applications that perform operations in Snowflake using the driver’s supported
language.
o Go – JDBC – ODBC – .NET – Node.js – PHP – Python.
o Kafka – Spark Connectors.
- SNOWFLAKE.ACCOUNT_USAGE:
o It’s a Snowflake share with metadata.
o Key differences with INFORMATION_SCHEMA table functions:
 Data Latency (45’ to 3h).
 Longer retention period (1 year).
 Includes dropped objects.
o Some views require Enterprise Edition.
o Views examples:
 QUERY_HISTORY view: 45’ latency. Monitors WH load and performance.
 STORAGE_USAGE view: average daily data storage.
 LOGIN_HISTORY view: login attempts.
 METERING_HISTORY view: hourly credit usage for all WH.
 WAREHOUSE_METERING_HISTORY view: hourly credit usage at WH level.
 DATABASE_STORAGE_USAGE_HISTORY: includes time travel and fail-safe.
 COPY_HISTORY: both COPY INTO and continuous data loading with Snowpipe.
 LOAD_HISTORY view: data load with COPY INTO <table>.
 PIPE_USAGE_HISTORY: data loading history using Snowpipe.
 ACCESS_HISTORY (Enterprise): information about access to tables and columns.
 SQL read statements.
 DML operations such as INSERT, UPDATE, DELETE.
 Variations of the COPY command.
 READER_ACCOUNT_USAGE:
 Views for all reader accounts created.
 These views are a subset of the others with the addition of the
RESOURCE_MONITOR view.
- INFORMATION_SCHEMA:
o Metadata of DB objects and some non-DB objects common across all DBs, such as roles,
WHs and DBs.
o From 7 days to 6 months of metadata depending on the view/table function.
o If a query is not selective enough, it errors out because it returns too much data.
o LOGIN_HISTORY_BY_USER ():
 These kinds of functions don’t have latency.
o AUTO_REFRESH_REGISTRATION_HISTORY:
 History of data files registered in the metadata of specified objects.
 Credits billed for these operations.
 14 days of billing history.
- SnowCD (Snowflake Connectivity Diagnostic Tool): troubleshooting network connection.
- SELECT LAST_QUERY_ID(-1) returns the last query ID (the default). LAST_QUERY_ID(1) returns the first query of the current session.
- Sampling:
o SYSTEM | BLOCK
o BERNOULLI | ROW
- INSERT + OVERWRITE truncates and then inserts.
- TIMESTAMP_NTZ (No Time Zones) is the default data type for timestamp column.
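The two sampling methods listed above differ in granularity. A sketch on a hypothetical table t1 (the percentage is illustrative):

```sql
-- BERNOULLI / ROW: each row is included with 10% probability
-- (statistically more exact, but slower on large tables)
SELECT * FROM t1 SAMPLE BERNOULLI (10);

-- SYSTEM / BLOCK: each block of rows is included with 10% probability
-- (faster, coarser; results vary more between runs)
SELECT * FROM t1 SAMPLE SYSTEM (10);
```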

2. SNOWFLAKE VIRTUAL WAREHOUSES


- General considerations:
o A WH provides CPU, memory and temporary local storage resources for computing
queries.
o Bytes spilled to local/remote storage: complex queries that don’t fit in the WH memory.
o Sizes: from X-Small to 6X-Large.
o A bigger WH has more compute resources, which allows parallelization when loading
files. So you should split large files; unless you need to bulk load hundreds or
thousands of files, a small WH is enough.
 XS WH can load 8 files in parallel.
o If you suspend a WH, idle resources are shut down immediately, even while other resources are still running a query.
o Queries don't start until the WH is fully provisioned.
 If the process fails during start-up, Snowflake retries the failing nodes and starts
executing queries once at least 50% of the resources are provisioned.
o USE WAREHOUSE <wh_name>
- Resource Monitor:
o Created by ACCOUNTADMIN. Grant MONITOR and MODIFY to a role to view and modify.
o Monitors credit usage of user-managed virtual WHs and of WHs managed by the cloud services layer.
o Notifications:
 Classic console notification to administrator users.
 Emails to verified addresses. Up to 5 non-administrator users.
o Resets Daily, Weekly, Monthly, Yearly, never. NOT per-minute/hourly basis!
o Must have an action specified.
 Notify.
 Notify and suspend.
 'Notify and suspend immediately' can still incur additional costs during the time
it takes to suspend all the workload.
o If a WH is suspended by a Resource Monitor, you can't resume it again unless:
 You drop the Resource Monitor.
 You remove the WH from the Resource Monitor (not applicable to account RM).
 You increase the credit quota.
 You increase the credit threshold.
 The next interval starts.
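The quota, frequency and trigger actions above can be combined in a single statement. A sketch with hypothetical names and illustrative thresholds:

```sql
-- Monthly monitor: notify at 75%, suspend (let queries finish) at 90%,
-- suspend immediately (cancel queries) at 100% of the credit quota
CREATE RESOURCE MONITOR monthly_rm WITH
  CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 75  PERCENT DO NOTIFY
           ON 90  PERCENT DO SUSPEND
           ON 100 PERCENT DO SUSPEND_IMMEDIATE;

-- Attach the monitor to a warehouse
ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = monthly_rm;
```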
- Multi-cluster WH (Enterprise):
o Switch from single to multi and vice-versa when you want.
o Number of compute resources ≠ number of clusters.
o One cluster has many compute resources, so:
 Scaling up adds compute resources to the cluster.
 Scaling out adds clusters to the WH.
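The scale-up/scale-out distinction maps to two different ALTER WAREHOUSE parameters. A sketch on a hypothetical warehouse my_wh:

```sql
-- Scaling out (Enterprise): a multi-cluster WH that adds up to 4 clusters
-- under concurrency load and removes them when load drops
ALTER WAREHOUSE my_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';   -- or 'ECONOMY'

-- Scaling up: more compute resources per cluster
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';
```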
- Snowpark-optimized WH: for workload with high memory requirements such as ML training use
cases.
- States:
o STARTED: running/active.
o SUSPENDED: inactive.
o RESIZING.
- Privileges:
o OPERATE: change the state.
o MODIFY: alter the WH, in size, for example.
o MONITOR: current and past queries executed and statistics.
o USAGE: enables a user to execute queries with the WH.
o OWNERSHIP: full control. Only allowed to one role at a time.
o ALL: grant all the privileges but OWNERSHIP.

3. SNOWFLAKE STORAGE AND PROTECTION


- Micro-partitions:
o Contiguous units of storage between 50MB and 500MB of uncompressed data,
organized in a columnar way.
o Snowflake automatically determines the most efficient compression algorithm for each
column.
o Immutable.
o Automatically created using the order of insertion/load.
o Metadata about rows stored in micro-partitions:
 The range of values for each of the columns in the micro-partition.
 The number of distinct values.
 Additional properties used for both optimization and efficient query processing.
o The files are stored in the cloud platform, but the user can neither see nor access them.
o Pruning does not happen when using subqueries.
- External tables:
o Read-only tables stored in the cloud providers.
o You query an external stage as if it was a table inside Snowflake.
o Useful if you only access a portion of the data and you don’t do it often.
- Data protection features:
o Triple redundancy: protects against a Snowflake failure as long as the cloud provider remains healthy.
o Automatic AZ fail-over: data replicated across 3 cloud availability zones.
o DB replication (Business Critical): synchronized DB in another cloud provider/region.
 All DB objects but: stages, streams, tasks and external tables.
 Non-DB objects are not replicated.
 No share can be replicated.
 You can use a task to perform DB replication periodically.
- Streams: Change Data Capture during retention period.
o Offset: a point-in-time marker; the stream logically takes a snapshot of every row of the object and tracks DML changes made after that point.
o Tables – Directory tables – External tables – Views.
o Types of streams:
 Standard/Append-only for tables, directory tables and views (underlying tables
of the view).
 Insert-only on external tables.
o Hidden columns (consume storage):
 METADATA$ACTION
 METADATA$ISUPDATE
 METADATA$ROW_ID
o Not allowed for materialized views.
o The views allowed must meet underlying table and query requirements.
o Stale stream: not consumed within the retention period, so its offset falls outside it.
o Different streams with different periods in the same table.
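The CDC flow described above can be sketched in SQL; the source/target table names and columns are hypothetical:

```sql
-- Create a standard stream on a source table
CREATE STREAM src_stream ON TABLE src;

-- The stream exposes the hidden columns
-- METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID
SELECT * FROM src_stream;

-- Consuming the stream inside a DML statement advances its offset,
-- so the same changes are not returned again
INSERT INTO tgt
  SELECT id, value
  FROM src_stream
  WHERE METADATA$ACTION = 'INSERT';
```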
- Materialized views:
o Can have clustering keys.
o Suspended if columns in the base table dropped or changed. Recreation required.
o Can query only one table. No self-joins allowed.
o NO UDFs, window functions, LIMIT…
o GROUP BY fields must be part of the SELECT list.
- UDFs:
o Java – JavaScript – Scala – Python – SQL.
o Must return single value/tabular data.
o Overload a UDF: multiple UDFs, same identifier, different input parameters.
o Runs with owner rights.
o They don’t allow DDL or DML.
- Stored Procedures:
o Same UDF’s languages.
o Cannot be administered through the Snowflake UI; presumably requires SnowSQL (unconfirmed).
o In contrast with UDFs, they don’t have to return a value.
o Caller’s rights stored procedures vs. owner’s rights stored procedures.
o USAGE or OWNERSHIP granted to the role to use it.
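A minimal Snowflake Scripting procedure showing the DECLARE / BEGIN / EXCEPTION structure mentioned earlier, running with caller's rights; the procedure and table names are illustrative:

```sql
CREATE OR REPLACE PROCEDURE row_count(tbl STRING)
RETURNS NUMBER
LANGUAGE SQL
EXECUTE AS CALLER          -- caller's rights (vs. the default owner's rights)
AS
$$
DECLARE
  cnt NUMBER;
BEGIN
  -- IDENTIFIER() resolves the table name passed as an argument
  SELECT COUNT(*) INTO :cnt FROM IDENTIFIER(:tbl);
  RETURN cnt;
EXCEPTION
  WHEN OTHER THEN
    RETURN -1;             -- simplistic error handling for the sketch
END;
$$;

CALL row_count('my_table');
```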
- Tasks:
o Types of SQL code:
 Single SQL statement.
 Call to a Stored Procedure.
 Procedural logic with Snowflake Scripting.
o If a parent task is deleted, its child nodes become root/standalone tasks.
o Compute resources:
 Serverless compute model. Consume up to the equivalent of XXL WH.
 User-provided WH.
o No trigger. The nodes don’t have their own schedule.
o User-managed tasks:
 If you fully utilize a WH.
 Unpredictable loads (multi-cluster WH).
 If adherence to scheduled interval is less important.
o Serverless tasks:
 If you don’t fully utilize a WH.
 Predictable loads.
 Adherence is important. It can resize until equivalent of 2XL WH.
o Privileges required for viewing task history:
 ACCOUNTADMIN role.
 OWNERSHIP over the task.
 Global MONITOR EXECUTION privilege and USAGE over the DB and schema that
contains the task.
o Max execution time is NOT fixed at 60 min; 60 min is only the default timeout (USER_TASK_TIMEOUT_MS).
o Snowflake presumably rolls back disconnected tasks after 4 h, mirroring transaction behavior (unconfirmed).
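A small task tree illustrating the points above (serverless root, dependent child with no schedule of its own); all names are hypothetical:

```sql
-- Serverless root task: Snowflake manages the compute,
-- starting from the given initial size
CREATE TASK root_task
  SCHEDULE = '60 MINUTE'
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
AS
  INSERT INTO audit_log SELECT CURRENT_TIMESTAMP();

-- Child node: no schedule, triggered by its parent
CREATE TASK child_task
  AFTER root_task
AS
  CALL my_proc();

-- Tasks are created suspended; resume children before the root
ALTER TASK child_task RESUME;
ALTER TASK root_task RESUME;
```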
- Transactions:
o Automatically aborted and rolled back by Snowflake if open for 4 hours.
o Rolled back if the session ends.
o AUTOCOMMIT set to ON by default.
- Cloning:
o Not for shared DBs.
o You clone everything in a DB but external tables, internal stages and their pipes.
o A cloned table is just like any other table.
o Considerations:
 Table stages will be cloned but empty.
 Cloned but suspended by default:
 Clustering keys.
 Tasks.
 Alerts.
 UDFs can be cloned with some limitations.
 Streams and Time Travel will be restarted to the moment of the clone.
o Privileges required:
 Tables: SELECT over table + USAGE over Schema and DB.
 Pipes, streams and tasks: OWNERSHIP.
 Other objects: USAGE.
o Source object privileges are not inherited automatically. Use COPY GRANTS command.
 Child objects cloned do inherit all granted privileges.
o Cloning does not copy load metadata of a table.
o Examples of objects that can be cloned:
 File formats
 Sequences.
o Transient tables can be cloned to transient or temporary tables, but not to permanent tables.
o When you clone views and stored procedures with fully qualified table references, the
cloned views and stored procedures keep pointing to the source tables.
- Snowflake replicates the Cloud Services Layer and the Storage Layer across availability zones (unconfirmed).
- Time travel:
o DATA_RETENTION_TIME_IN_DAYS (settable at account or object level; ACCOUNTADMIN sets the account default).
 If MIN_DATA_RETENTION_TIME_IN_DAYS is set at account level, the effective
retention for an object is the maximum of that value and the object's
DATA_RETENTION_TIME_IN_DAYS.
o UNDROP for tables, schemas and DBs.
o AT|BEFORE -> TIMESTAMP, OFFSET(s), STATEMENT.
o 0-1 days of time travel for temporary/transient tables.
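The AT | BEFORE clauses and UNDROP described above, sketched with a hypothetical table and placeholder query ID:

```sql
-- Query the table as it was 10 minutes ago
SELECT * FROM orders AT (OFFSET => -600);

-- Query the table as it was just before a given statement ran
SELECT * FROM orders BEFORE (STATEMENT => '<query_id>');

-- Restore a dropped object (also works for SCHEMA and DATABASE)
UNDROP TABLE orders;

-- Extend retention (up to 90 days on Enterprise)
ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 30;
```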

- Fail safe:
o Fail-safe (7 days for permanent tables) can't be altered at account, database,
schema or table level. Use temporary/transient tables (which have no fail-safe) instead.

4. DATA LOADING/UNLOADING
- Storage Integration: object with credentials for creating stages or unloading data in external
cloud providers.
- Snowpipe:
o 14 days of load history.
o REST APIs: endpoints to interact with pipes. For internal and external stages.
 insertFiles: files to be ingested into a table.
 insertReport: report of files submitted through insertFiles and ingested into a
table.
 loadHistoryScan: ~ to insertReport but with specified time range. Up to 10 000
items returned. Rely more on insertReport to avoid errors for excessive calls.
o AUTO_INGEST = TRUE: enables automatic data loading from external stages.
o AUTO_INGEST = FALSE: requires making calls to the REST APIs.
o Event notifications:
 AWS-hosted accounts support event notifications from all cloud platforms;
GCP- and Azure-hosted accounts only from their own.
o File sizing recommendations:
 100 – 250 MB of compressed data.
 +100GB files are not recommended.
 Maximum allowed data load duration: 24 hours.
o Basic transformations allowed: column ordering, omitting, casting or truncation.
o It can’t reload a file with the same name twice.
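An auto-ingest pipe over a hypothetical external stage, tying together the points above:

```sql
-- AUTO_INGEST = TRUE: loading is driven by cloud event notifications
CREATE PIPE my_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_events
  FROM @my_ext_stage
  FILE_FORMAT = (TYPE = 'JSON');

-- With AUTO_INGEST = FALSE, files would instead be submitted through
-- the REST API (insertFiles) and checked with insertReport.
```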
- Bulk load – COPY INTO statement: from on-premises or cloud storage.
o Requires user-managed WH.
o 64 days of load history.
o copy into <table> from @~/<file>.xml
file_format = (type = 'XML' strip_outer_element = true); strip_outer_array (JSON)
o File size recommendations and transformations are the same as for Snowpipe.
o VALIDATE:
 Validates the files loaded in the last execution of the COPY INTO statement and
returns all errors encountered.
 Does not support COPY statements with transformations.
o VALIDATION_MODE:
 Validates data files instead of loading them.
 Does not support COPY statements with transformations.
 Types:
 RETURN_<n>_ROWS: validates the first n rows and returns them, failing at the first error.
 RETURN_ERRORS: returns all errors across the files in the current COPY statement.
 RETURN_ALL_ERRORS: additionally includes errors from files partially loaded in earlier runs.
o ON_ERROR:
 CONTINUE.
 SKIP_FILE.
 ABORT_STATEMENT. Default behavior.
o COPY options:
 LOAD_UNCERTAIN_FILES = TRUE
 Checks load metadata for avoiding duplication and load every file with
no load metadata.
 FORCE = TRUE
 Loads every file without considering load metadata.
 OBJECT_CONSTRUCT (a function, usable in COPY transformations)
 Transforms structured data into the VARIANT data type.
o Select the files to load by:
 List of specific files.
 Pattern matching.
 Path (internal stage) or prefix (Amazon S3 bucket).
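The validation and error-handling options above in context; the stage, table and file names are hypothetical:

```sql
-- Dry run: validate the files without loading them
COPY INTO orders FROM @my_stage
  PATTERN = '.*orders_.*[.]csv'
  VALIDATION_MODE = RETURN_ERRORS;

-- Actual load: skip bad files, reload even if load metadata
-- says the file was already loaded
COPY INTO orders FROM @my_stage
  FILES = ('orders_2024.csv')
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
  ON_ERROR = SKIP_FILE
  FORCE = TRUE;
```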
- Load with UI wizard:
o background = PUT + COPY INTO.
o Designed for a few small files (<50MB).
- Concurrent workload processing is managed in both COPY INTO and Snowpipe.
- Avoid data loading into a table from Snowpipe and bulk load at the same time.
- Unload data:
o By default:
 Parallelizing.
 16MB each file.
 Compressed format.
 Automatically decrypted (presumably after GET).
o SINGLE parameter set to TRUE allows up to 5GB in a single file.
o Allows PARTITION BY command for partitioned data unload into the stages.
o Tables and SELECT statements.
o Allowed compression algorithms:
 CSV and JSON: GZIP | BZ2 | BROTLI | ZSTD | DEFLATE | RAW_DEFLATE
 Parquet: LZO | SNAPPY
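The unload defaults and options above can be sketched against a hypothetical stage and table:

```sql
-- Default behavior: parallel unload into compressed ~16MB files,
-- here additionally partitioned by an expression
COPY INTO @my_stage/out/
  FROM (SELECT region, id, amount FROM orders)
  FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = GZIP)
  PARTITION BY ('region=' || region);

-- SINGLE = TRUE: one file, up to 5GB
COPY INTO @my_stage/all.csv.gz
  FROM orders
  FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = GZIP)
  SINGLE = TRUE
  MAX_FILE_SIZE = 5368709120;
```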

- Stages: (what about external stages and transformations?)


o Types:
 @%table_stage
 @~user_stage
 @named_stage
o Metadata columns (you must specify them, as they are not returned with SELECT *):
 METADATA$FILE_NAME
 METADATA$FILE_ROW_NUMBER
 METADATA$FILE_CONTENT_KEY
 METADATA$FILE_LAST_MODIFIED
 METADATA$START_SCAN_TIME
o Internal stages:
 Stores data files internally within Snowflake.
 Can be permanent or temporary.
 Supports transformations.
 Can be seen from the UI.
o External stages:
 References data files stored in cloud provider by means of URL, file format,
credentials, etc.
 Supports transformations.
o Download to on-premises:
 GET if the origin is internal/table/user stage.
 Cloud services utilities for external stages.
o Upload from on-premises:
 PUT to internal/user/table stage.
 Cloud services utilities to external stage.
o SQL queries allowed to all stages. Useful for viewing data before loading.
o Directory table:
 Implicit object layered on a stage. It’s not a separate object.
 Metadata about stage files -> similar to external table.
 SELECT * FROM DIRECTORY (@<stage_name>)
 URL doesn’t expire.
 Automatic metadata refresh requires a cost overhead to manage event
notifications.
 SET DIRECTORY = (ENABLED = TRUE | FALSE)
o External and internal stages support unstructured data.
o Table stages are the only ones that don't support transformations.
o Files can be deleted after loading by:
 Specifying PURGE=TRUE in the copy options.
 Execute the REMOVE command after the COPY statement.
o INFER_SCHEMA (): retrieves metadata schema of staged files.
o LIST @<stage_name> / LS @<stage_name>
o BUILD_SCOPED_FILE_URL:
 URL to staged file active for 24 hours/query result period.
 Apparently no privileges are required; only the caller can use it.
 Ideal for custom applications that provide unstructured data to other accounts
via a share or for downloading/ad-hoc analysis in Snowsight.
o BUILD_STAGED_FILE_URL:
 Permanent URL using stage name and relative access path.
 Role with enough privileges is required.
 Ideal for custom applications that require access to unstructured data files.
o GET_PRESIGNED_URL:
 Simple HTTPS URL for accessing a file in the web browser.
 Temporal pre-signed access token. Specify expiration_time argument.
 Ideal for BI tools that need to display unstructured content.
o GET_STAGE_LOCATION: gets stage URL.
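The PUT/GET round-trip described above, as run from a SnowSQL session; the local paths and table name are hypothetical:

```sql
-- Upload to the user stage (PUT auto-compresses to .gz by default)
PUT file:///tmp/data.csv @~/staged/;

-- Inspect what landed in the stage
LIST @~/staged/;

-- Load the staged file into a table
COPY INTO my_table FROM @~/staged/data.csv.gz
  FILE_FORMAT = (TYPE = 'CSV');

-- Download back to the local machine (decrypted on the way out)
GET @~/staged/data.csv.gz file:///tmp/download/;
```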
- File formats:
o Structured data: CSV.
o Semi Structured data:
 JSON
 Avro
 Parquet
 XML
 ORC
o For unloading:
 CSV or similar.
 JSON.
 Parquet.
o Can be defined at stage, table or COPY command.
o By default, Snowflake loads data as CSV, so no named file format required.
- Pipes:
o SHOW PIPES lists all pipes for which you have access.
o CREATE OR REPLACE PIPE deletes the load history of the pipe.
 You must ALTER PIPE <> REFRESH after recreating.
o Transformations allowed when using a SQL statement:
 Column omission.
 Column reordering.
 …
 NO filters.
- External functions:
o They call code executed outside Snowflake, in a remote service.

5. PERFORMANCE AND TUNING


- Monitor Query History:
o Query History Page (UI -> History):
 all queries all users all WH all interfaces.
 14 days of queries, 1 day of results (same role required)
 If no cache is used: ‘Bytes scanned’ column shows a green bar.
 No delay.
 It can monitor WH load too.
o ACCOUNT_USAGE.QUERY_HISTORY view. It can monitor WH load too.
o INFORMATION_SCHEMA.QUERY_HISTORY(_BY_USER/SESSION/WAREHOUSE) table
function:
 Specified time range up to 7 days.
- Clustering keys:
o Can be defined over materialized views.
o Up to 3-4 columns, from lower to higher cardinality.
o Can be resumed/suspended by the user at any time at table level.
 Cannot be turned off at account level.
o Consume credits.
o Fully managed by Snowflake in the background.
o Enterprise Edition is NOT needed.
o Can be of any data type except VARIANT, OBJECT, GEOGRAPHY, ARRAY.
o Functions for clustering depth:
 SYSTEM$CLUSTERING_INFORMATION
 SYSTEM$CLUSTERING_DEPTH
o CLUSTER BY (c1, c2)
o ALTER TABLE t1 RESUME RECLUSTER (command <t1> clause)

- Cache:
o Resultset cache:
 No changes on the underlying data.
 Exact same query. Case sensitive and no alias added/deleted allowed.
 Query without runtime/UDF/external functions.
 Queries with functions like CURRENT_DATE are eligible for query result caching.
 24 hours of retention period and up to 31 days since the first execution.
 Can be turned off at session, user or account level with the
USE_CACHED_RESULT parameter.
- Query optimizer: examines metadata cache 1st, result cache 2nd and warehouse cache 3rd.
- Query profile:
o Presumably only for completed queries (unconfirmed).
o Available for 14 days for any user for every query.
o Typical performance issues:
 Exploding JOINS.
 UNION without ALL.
 Queries that don’t fit in memory.
 Partition pruning issues.
o Execution time:
 Processing — time spent on data processing by the CPU.
 Local Disk IO — time when the processing was blocked by local disk access.
 Remote Disk IO — time when the processing was blocked by remote disk access.
 Network Communication — time when the processing was waiting for the
network data transfer.
 Synchronization — various synchronization activities between participating
processes.
 Initialization — time spent setting up the query processing.
o Statistics:
 IO – input-output operations:
 Scan progress: percentage of data scanned for a table so far.
 Bytes scanned: # of bytes scanned so far.
 Percentage scanned from cache: from local (WH) cache.
 Bytes written: written when loading into a table.
 Bytes written to result: size of the result set.
 Bytes read from result.
 External bytes scanned: from an external object such as a stage.
 Pruning.
 Spilling: disk usage when intermediate results don’t fit in memory. Slower
performance because it requires more IO operations and disk access is slower
than memory access.
 Bytes spilled to local storage: to local disk.
 Bytes spilled to remote storage: to remote disk.
 Network: network communication with other applications. BI tools, for example.
 Bytes sent over the network.
- Optimizing Query Performance:
o Clustering the table.
o Materialized Views (Enterprise).
o Search optimization service (Enterprise): lookup queries.
 It uses a search access path. Persistent metadata about column values in each
micro-partition. May be similar to an index in relational DBs.
 Queries that do not benefit:
 External tables.
 Dynamic tables.
 Materialized views.
 COLLATE columns.
 Column concatenation.
 Analytical expressions.
 Cast on columns. Except for numeric column cast to string.
 Queries that do benefit:
 Equality searches.
 Substring and regular expression searches.
 Searches in a VARIANT column.
 Searches in GEOGRAPHY column with geospatial functions.
ALTER TABLE t1 ADD SEARCH OPTIMIZATION [ON EQUALITY (c1), SUBSTRING
(c2)]
o Query Acceleration Service (Enterprise):
 Offloads parts of the query processing work to shared compute resources.
 Server availability allows more parallelization, but performance may fluctuate.
 Might benefit:
 Ad-hoc analysis.
 Workloads with unpredictable data volume per query.
 Queries with large scans and selective filters.
 Not eligible queries:
 Not enough partitions to scan.
 No filters or aggregations.
 Not selective enough filters or high cardinality aggregations.
 LIMIT without ORDER BY.
 Queries with random functions.
 Detect eligible queries:
 SYSTEM$ESTIMATE_QUERY_ACCELERATION Function
 ACCOUNT_USAGE.QUERY_ACCELERATION_ELIGIBLE view
 Enable the service at WH level.
 Scale factor:
 Cost control mechanism to limit the compute resources used.
 8 by default. It means that the service can spend up to 8 times the WH
resources.
 There’re 3 columns in the QUERY_HISTORY view to see the effects of the service.
- Optimize WH performance:
o Reduce queuing.
o Resolve memory spillage.
o Increase WH size.
o Try query acceleration.
o Optimize WH cache.
o Limit concurrent queries.
- Algorithms:
o Estimate count(distinct): HyperLogLog algorithm.
o Estimate percentiles: t-Digest.
o Estimate approximate frequent values: Space-Saving.
- EXPLAIN plan: useful to evaluate query efficiency.
o Compiles the SQL but does not execute it -> no WH required.
o Gives info about:
 Partition pruning.
 Join ordering.
 Join types.

6. SEMI STRUCTURED DATA


- Data types:
o VARIANT
 Can store a value of any type, including OBJECT and ARRAY.
 16MB uncompressed is the max storage per row. Usually less due to internal
overhead.
 JSON path notation to query VARIANT columns.
 Dot notation:
SELECT <col_name>:<key_name_1>[N].<key_name_2>.<key_name_3>::<cast_datatype>
FROM table_name;
 Bracket notation:
SELECT <col_name>['key_name_1']['key_name_2']['key_name_3']::<cast_datatype>
FROM table_name;
 Relational table with VARIANT column: separate storage.
 Repeating keys and paths stored as separate physical columns.
 Views are recommended to make VARIANT data accessible to BI tools.
 Common use cases:
 Create hierarchical data explicitly defining a hierarchy between OBJECTs
or ARRAYs.
 Loading semi structured file like JSON, Avro, ORC, XML or Parquet
directly, without specifying its underlying hierarchical structure.
 A JSON null is stored as the VARIANT literal 'null', distinct from SQL NULL.
o OBJECT = dictionary = JSON.
 Comparable query and storage performance to a relational table (if it contains
mainly integers and strings).
o ARRAY
- Functions:
o STRIP_OUTER_ARRAY: remove JSON structure to load data in separate rows. You can use
it in the COPY INTO command.
o FLATTEN:
 Set RECURSIVE = TRUE to expand all sub-elements recursively.
 OUTER = FALSE omits the output of the input rows that cannot be expanded.
o LATERAL FLATTEN: parse arrays in a JSON file (LATERAL joins each input row with the flattened output).
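The LATERAL FLATTEN pattern above, sketched against a hypothetical table orders_json with a VARIANT column payload holding e.g. {"order_id": "o1", "items": [{"sku": "a"}, {"sku": "b"}]}:

```sql
-- One output row per element of the items array
SELECT
  t.payload:order_id::STRING AS order_id,
  f.value:sku::STRING        AS sku
FROM orders_json t,
  LATERAL FLATTEN(INPUT => t.payload:items) f;
```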

7. ACCOUNT AND SECURITY


- Network policies:
o Whitelist/blacklist of IPv4 addresses.
 Blacklist has priority over whitelist.
 Snowflake does not block any IP address by default.
o You can bypass them for a specific number of minutes.
 MINS_TO_BYPASS_NETWORK_POLICY -> contact Snowflake Support.
o Privileges required:
 SECURITYADMIN or higher.
 CREATE NETWORK RULE on the schema (the schema owner has it).
o Can be applied at account or user level.
o SHOW NETWORK_POLICIES: list all network policies.
o SHOW PARAMETERS [LIKE ‘pattern’] [{IN | FOR} {USER | ACCOUNT | SESSION} {WH…}
o If a policy changes for a user, they can't do anything until they log in again.
o You can’t block your own IP.
o 0.0.0.0/0 represents all IPv4 addresses.
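A network policy combining the allow/block lists above (blocked entries win); the policy name, CIDR range and user name are hypothetical:

```sql
-- Allow one CIDR range but block a single address inside it
CREATE NETWORK POLICY corp_policy
  ALLOWED_IP_LIST = ('192.168.1.0/24')
  BLOCKED_IP_LIST = ('192.168.1.99');

-- Apply at account level...
ALTER ACCOUNT SET NETWORK_POLICY = corp_policy;

-- ...or at user level (user-level overrides account-level)
ALTER USER john SET NETWORK_POLICY = corp_policy;
```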
- Account objects:
o Securable object: an entity to which access can be granted.

- Object access methods:


o RBAC: Role-based access control.
 Privileges assigned to roles and roles to users.
o DAC: Discretionary access control.
 A role that creates an object owns it and can provide access to other roles.
- Roles:
o Role selected when logging in > Default role. If none defined, then PUBLIC.
o All custom roles should be assigned to SYSADMIN -> SYSADMIN manages all account objects.
o Types of roles:
 Account roles.
 Database roles.
 Instance roles.
o System defined roles:
 ORGADMIN: Organization Administrator.
 Can create and view all accounts.
 List all regions enabled for the organization.
 View usage information of all accounts.
 Enable DB replication for an account.
 ACCOUNTADMIN:
 Encapsulates SYSADMIN and SECURITYADMIN.
 SECURITYADMIN:
 MANAGE GRANTS: able to modify or revoke any grant.
 Encapsulates USERADMIN.
 USERADMIN:
 CREATE USER and CREATE ROLE.
 SYSADMIN:
 Create WH, DBs and other objects.
 All custom roles SHOULD be assigned to him.
 PUBLIC: automatically granted to every user.
- You can see the text of every query, but only your own query results.
- Granting USAGE allows the role to see the object (e.g., USAGE on the DB, USAGE on the schema, SELECT on the table).
- Authentication mechanisms:
o MFA: fully managed by Snowflake and enabled by Duo Security Service.
 Cannot be centrally enforced.
 Automatically enabled for every account.
 Any user can enroll through UI.
 SECURITYADMIN to disable MFA for a user.
 Options: push notifications, call, passcode.
o Key pair authentication:
 Enabled to all clients and all editions.
 Consists of 1 private key and up to 2 public keys per user.
 You can rotate the keys.
o Federated authentication:
 Snowflake is compatible with the majority of SAML Identity Providers (IdP)
 Okta (native)
 AD FS (native)
 OneLogin
 Ping Identity PingOne
 Google G Suite
 Microsoft Azure Active Directory
 You don’t need to log in with Snowflake after that, single sign-on (SSO) is
allowed.
 SSO: 1 log in to access multiple applications.
 If you disable a user in these kinds of environments, the user will still be able to
log in to the IdP but will receive an error when connecting to Snowflake.
 If the IdP times out, the user’s Snowflake session will remain but a new log in is
required to start a new session.
 You can log in directly with Snowflake without using the IdP.
- SCIM:
o Automated management of user and groups.
o RESTful APIs to integrate different IdP:
 Okta – Azure – Custom integration.
o CREATE SECURITY INTEGRATION for redirecting to authentication before accessing REST
API.
- Data encryption: fully managed by Snowflake and available to all Snowflake connections.
o At rest: AES 256-bit. Keys rotated every 30 days + one re-keying process every year.
o In transit: TLS 1.2.
o Active key encrypts and decrypts, retired key only decrypts.
- Column level security (Enterprise):
o Dynamic Data Masking.
 Schema level object.
o External Tokenization.
o Tag based masking.
- Row level security (Enterprise):
o Schema level object.
o Determine which rows to return.
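Both Enterprise-edition policy types above are schema-level objects attached to tables. A sketch with hypothetical names and deliberately simple rules:

```sql
-- Column-level: dynamic data masking on an email column
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ANALYST') THEN val ELSE '*****' END;

ALTER TABLE users MODIFY COLUMN email SET MASKING POLICY mask_email;

-- Row-level: a row access policy deciding which rows each role sees
CREATE ROW ACCESS POLICY us_only AS (country STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'ADMIN' OR country = 'US';

ALTER TABLE sales ADD ROW ACCESS POLICY us_only ON (country);
```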
- ACCOUNTADMIN should have a functional email for urgent Snowflake Support issues.
- Security and compliance reports:
o IRAP Protected
o ITAR
o FedRAMP
o GxP
o SOC 1 Type II – SOC 2 Type II
o CSA Star Level 1
o PCI-DSS
o HITRUST / HIPAA
o ISO/IEC 27001
o Department of Defense (DoD)
o CJIS
- Releases:
o Weekly – Full Release: new features, updates, fixes…
 Day 1: early access – Enterprise, if wanted.
 Day 1-2: regular access – All Standard.
 Day 2: last – Minimum of 24h from early access.
o Weekly – Patch Release: fixes only.
o Monthly – Behavior Changes:
 One full release that introduces behavior changes.
 Behavior changes: something that returns different results and may
affect current code and workloads.
 3rd or 4th week, typically.
 Each month but November and December.
 Lifecycle:
 Testing period – 1st Month. Disabled by default.
 Opt-out period – 2nd Month. Enabled by default.
 After 2 months, contact Snowflake Support to disable individual
behavior changes.
- Every account has its own region.
