From the course: Advanced Snowflake: Deep Dive Cloud Data Warehousing and Analytics

Caching, clustering, and search optimization

- [Instructor] Snowflake runs queries on huge amounts of data. In order for these queries to run quickly, the queries need to be optimized. And there are several techniques that Snowflake uses for this. We'll discuss and work with three query optimization techniques. The first is caching. Next, we'll discuss clustering. And finally, we'll discuss search optimization. Let's start with caching. It's by far the easiest technique to understand, and it happens automatically. We've already discussed that Snowflake's architecture divides Snowflake into three layers, database storage, where data is persisted, query processing, which comprise of virtual warehouses, and the cloud services layer. So we have the storage layer, the compute layer, and the service layer. The actual data is stored in the storage layer. However, the data is cached in both the compute layer as well as the service layer. The service layer accepts SQL requests from users, coordinate queries, manages transactions and results, and the service layer has the result cache. This holds the result of every query executed in the past 24 hours. And this cache is available across virtual warehouses and across sessions. The compute layer where the SQL queries are actually processed contains the local disk cache. This is used to cache data used by SQL queries. Whenever data is needed for a given query, it's retrieved from the Remote Disk storage and cached in SSD and memory in the compute layer. The storage layer holds the long term storage for our data, refer to as a Remote Disk storage. In order to improve the performance of certain kinds of queries, you can enable clustering on tables. Clustering involves sorting or ordering data along natural dimensions when storing that data in tables. Clustering in Snowflake is completely automated. You designate one or more columns as the clustering key, and this enables clustering. Data is automatically reclustered when it's updated or deleted or changes are made. Let's say you choose country as the clustering key for your data. All of the records belonging to the same country will be co-located in the same micro-partition, allowing very fast retrieval operations on queries which refer to country. If you enable clustering on a table, you will incur additional costs. There are costs for the initial clustering of the table and for maintaining the clustering as the table is updated. Clustering is very useful when you require very fast response times and you don't really care about the cost of clustering. Clustering is used when the improvements that you get for your query offsets the cost of maintaining the clustered table. Clustering improves the performance of queries that filter based on the clustering key. Queries run faster because they can now skip data that does not match the filter predicate. Clustering also gives you better column compression for your data. The only configuration that you require when specifying clustering is to correctly define the clustering key. Now clustering improves the performance of equality searches on the clustering key, as well as range searches using the clustering key. Range searches meaning searches which include the greater than and less than operator. And finally, the third optimization technique that we'll discuss is search optimization that has performed using the search optimization service. The objective of search optimization is to improve the performance of selective point lookup queries on tables, queries that use the equality operator. A point lookup query is one that returns either one or a small number of records from a table. Search optimization is a table level property in Snowflake, and users need to explicitly register their tables with the search optimization service. It's not registered by default. When you enable search optimization on a table, Snowflake uses an additional persistent data structure that serves as an optimized search access path for all of the records in your table. This data structure is initially populated when you first enable search optimization, and there on in a maintenance service runs continuously in the background and updates the search optimization data as your table is updated. Search optimization is great if most of your queries are selective point lookup queries, that is, search optimization only speeds up quality searches, it does not speed up range searches. But the advantage of search optimization is that the optimization is applied on all columns of supported types in a table. You don't have to pick and choose specific columns to which optimization is applied.

Contents