Delta Lake Cheat Sheet
WITH SPARK SQL

Delta Lake is an open source storage layer that brings ACID
transactions to Apache Spark™ and big data workloads.

delta.io | Documentation | GitHub | Delta Lake on Databricks

CREATE AND QUERY DELTA TABLES

Create and use managed database
-- Managed database is saved in the Hive metastore.
-- Default database is named "default".
DROP DATABASE IF EXISTS dbName;
CREATE DATABASE dbName;
USE dbName -- This command avoids having to specify dbName.tableName every time instead of just tableName.

Query Delta Lake table by table name (preferred)
/* You can refer to Delta Tables by table name, or by path.
Table name is the preferred way, since named tables are
managed in the Hive Metastore (i.e., when you DROP a named
table, the data is dropped also; not the case for
path-based tables.) */
SELECT * FROM [dbName.] tableName

Query Delta Lake table by path
SELECT * FROM delta.`path/to/delta_table` -- note backticks

Convert Parquet table to Delta Lake format in place
-- by table name
CONVERT TO DELTA [dbName.]tableName
[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2)]

-- path-based tables
CONVERT TO DELTA parquet.`/path/to/table` -- note backticks
[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2)]

Create Delta Lake table as SELECT * with no upfront schema definition
CREATE TABLE [dbName.] tableName
USING DELTA
AS SELECT * FROM tableName | parquet.`path/to/data`
[LOCATION `/path/to/table`]
-- using location = unmanaged table

Create table, define schema explicitly with SQL DDL
CREATE TABLE [dbName.] tableName (
  id INT [NOT NULL],
  name STRING,
  date DATE,
  int_rate FLOAT)
USING DELTA
[PARTITIONED BY (time, date)] -- optional

Copy new data into Delta Lake table (with idempotent retries)
COPY INTO [dbName.] targetTable
FROM "/path/to/table"
FILEFORMAT = DELTA -- or CSV, Parquet, ORC, JSON, etc.
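Taken together, the statements above can be driven from Python through spark.sql. A minimal sketch, assuming an existing SparkSession named spark with Delta Lake enabled (for example, a Databricks cluster); the database and table names (demo_db, loans) are hypothetical:

# Hypothetical end-to-end flow: managed database, explicit schema, insert, query.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("USE demo_db")

spark.sql("""
  CREATE TABLE IF NOT EXISTS loans (
    id INT,
    name STRING,
    date DATE,
    int_rate FLOAT)
  USING DELTA
""")

spark.sql("""
  INSERT INTO loans VALUES
    (8003, "Kim Jones", DATE '2020-12-18', 3.875),
    (8004, "Tim Jones", DATE '2020-12-20', 3.750)
""")

# Query by table name (preferred): the table is managed in the metastore.
spark.sql("SELECT * FROM loans").show()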
DELTA LAKE DDL/DML: UPDATE, DELETE, MERGE, ALTER TABLE

Update rows that match a predicate condition
UPDATE tableName SET event = 'click' WHERE event = 'clk'

Delete rows that match a predicate condition
DELETE FROM tableName WHERE date < '2017-01-01'

Insert values directly into table
INSERT INTO TABLE tableName VALUES (
  (8003, "Kim Jones", "2020-12-18", 3.875),
  (8004, "Tim Jones", "2020-12-20", 3.750)
);
-- Insert using SELECT statement
INSERT INTO tableName SELECT * FROM sourceTable
-- Atomically replace all data in table with new values
INSERT OVERWRITE loan_by_state_delta VALUES (...)

Upsert (update + insert) using MERGE
MERGE INTO target
USING updates
ON target.Id = updates.Id
WHEN MATCHED AND target.delete_flag = "true" THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET * -- star notation means all columns
WHEN NOT MATCHED THEN
  INSERT (date, Id, data) -- or, use INSERT *
  VALUES (date, Id, data)

Insert with Deduplication using MERGE
MERGE INTO logs
USING newDedupedLogs
ON logs.uniqueId = newDedupedLogs.uniqueId
WHEN NOT MATCHED
  THEN INSERT *

Alter table schema - add columns
ALTER TABLE tableName ADD COLUMNS (
  col_name data_type
  [FIRST|AFTER colA_name])

Alter table - add constraint
-- Add "Not null" constraint:
ALTER TABLE tableName CHANGE COLUMN col_name SET NOT NULL
-- Add "Check" constraint:
ALTER TABLE tableName
  ADD CONSTRAINT dateWithinRange CHECK date > "1900-01-01"
-- Drop constraint:
ALTER TABLE tableName DROP CONSTRAINT dateWithinRange
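To see what the MERGE above does to actual rows, the following sketch builds a tiny Delta target table and a set of updates, then upserts via spark.sql. It assumes an existing spark session with Delta enabled; the table name events_demo, the temporary view updates, and the columns Id/data are illustrative:

# Build a small Delta target table and a set of updates, then upsert with MERGE.
spark.sql("CREATE TABLE IF NOT EXISTS events_demo (Id INT, data STRING) USING DELTA")
spark.sql("INSERT INTO events_demo VALUES (1, 'old'), (2, 'keep')")

updates = spark.createDataFrame([(1, "new"), (3, "inserted")], ["Id", "data"])
updates.createOrReplaceTempView("updates")

spark.sql("""
  MERGE INTO events_demo AS target
  USING updates
  ON target.Id = updates.Id
  WHEN MATCHED THEN UPDATE SET *     -- Id 1 gets data = 'new'
  WHEN NOT MATCHED THEN INSERT *     -- Id 3 is inserted
""")

spark.sql("SELECT * FROM events_demo ORDER BY Id").show()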
TIME TRAVEL

View transaction log (aka Delta Log)
DESCRIBE HISTORY tableName

Query historical versions of Delta Lake tables
SELECT * FROM tableName VERSION AS OF 0
SELECT * FROM tableName@v0 -- equivalent to VERSION AS OF 0
SELECT * FROM tableName TIMESTAMP AS OF "2020-12-18"

Find changes between 2 versions of table
SELECT * FROM tableName VERSION AS OF 12
EXCEPT ALL SELECT * FROM tableName VERSION AS OF 11

Rollback a table to an earlier version
-- RESTORE requires Delta Lake version 0.7.0+ & DBR 7.4+.
RESTORE tableName VERSION AS OF 0
RESTORE tableName TIMESTAMP AS OF "2020-12-18"
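A common pattern is to look up a version number in the table history and then query or diff that version. A short sketch combining the statements above from Python, assuming an existing spark session; tableName stands in for any existing Delta table, as elsewhere on this sheet:

# Inspect the transaction log to find recent versions.
history = spark.sql("DESCRIBE HISTORY tableName")
history.select("version", "timestamp", "operation").show(5)

# Read an earlier version and compare it to the current state.
previous = spark.sql("SELECT * FROM tableName VERSION AS OF 0")
current = spark.sql("SELECT * FROM tableName")
current.exceptAll(previous).show()  # rows present now but not in version 0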
UTILITY METHODS

View table details
DESCRIBE DETAIL tableName
DESCRIBE FORMATTED tableName

Delete old files with Vacuum
VACUUM tableName [RETAIN num HOURS] [DRY RUN]

Clone a Delta Lake table
-- Deep clones copy data from source, shallow clones don't.
CREATE TABLE [dbName.] targetName
[SHALLOW | DEEP] CLONE sourceName [VERSION AS OF 0]
[LOCATION "path/to/table"]
-- specify location only for path-based tables

Interoperability with Python / DataFrames
# Read name-based table from Hive metastore into DataFrame
df = spark.table("tableName")
# Read path-based table into DataFrame
df = spark.read.format("delta").load("/path/to/delta_table")

Run SQL queries from Python
spark.sql("SELECT * FROM tableName")
spark.sql("SELECT * FROM delta.`/path/to/delta_table`")

Modify data retention settings for Delta Lake table
-- logRetentionDuration -> how long transaction log history
-- is kept; deletedFileRetentionDuration -> how long ago a file
-- must have been deleted before being a candidate for VACUUM.
ALTER TABLE tableName
SET TBLPROPERTIES(
  delta.logRetentionDuration = "interval 30 days",
  delta.deletedFileRetentionDuration = "interval 7 days"
);
SHOW TBLPROPERTIES tableName;
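The retention properties above control what VACUUM may remove. The sketch below sets both windows and previews the effect with DRY RUN; it assumes an existing spark session, with tableName standing in for a real Delta table:

# Keep 30 days of transaction log history and 7 days of deleted data files.
spark.sql("""
  ALTER TABLE tableName SET TBLPROPERTIES (
    delta.logRetentionDuration = "interval 30 days",
    delta.deletedFileRetentionDuration = "interval 7 days")
""")

# DRY RUN lists the files VACUUM would delete without removing anything.
spark.sql("VACUUM tableName DRY RUN").show(truncate=False)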
PERFORMANCE OPTIMIZATIONS

Compact data files with Optimize and Z-Order
*Databricks Delta Lake feature
OPTIMIZE tableName
[ZORDER BY (colNameA, colNameB)]

Auto-optimize tables
*Databricks Delta Lake feature
ALTER TABLE [table_name | delta.`path/to/delta_table`]
SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)

Cache frequently queried data in Delta Cache
*Databricks Delta Lake feature
CACHE SELECT * FROM tableName
-- or:
CACHE SELECT colA, colB FROM tableName WHERE colNameA > 0
WITH PYTHON

Delta Lake is an open source storage layer that brings ACID
transactions to Apache Spark™ and big data workloads.

delta.io | Documentation | GitHub | API reference | Databricks

READS AND WRITES WITH DELTA LAKE

Read data from pandas DataFrame
df = spark.createDataFrame(pdf)
# where pdf is a pandas DF
# then save DataFrame in Delta Lake format as shown below

Read data using Apache Spark™
# read by path
df = (spark.read.format("parquet"|"csv"|"json"|etc.)
  .load("/path/to/delta_table"))
# read by table name
df = spark.table("events")

Save DataFrame in Delta Lake format
(df.write.format("delta")
  .mode("append"|"overwrite")
  .partitionBy("date") # optional
  .option("mergeSchema", "true") # option - evolve schema
  .saveAsTable("events") | .save("/path/to/delta_table")
)
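Putting the pandas, write, and read snippets together: a minimal round trip, assuming an existing spark session with Delta Lake available; the sample pandas data and the /tmp/delta/events_demo path are illustrative.

import pandas as pd

# pandas -> Spark DataFrame
pdf = pd.DataFrame({"id": [1, 2], "event": ["click", "view"]})
df = spark.createDataFrame(pdf)

# Spark DataFrame -> Delta Lake (path-based table)
df.write.format("delta").mode("overwrite").save("/tmp/delta/events_demo")

# Read it back from the Delta path
spark.read.format("delta").load("/tmp/delta/events_demo").show()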
Streaming reads (Delta table as streaming source)
# by path or by table name
df = (spark.readStream
  .format("delta")
  .schema(schema)
  .table("events") | .load("/delta/events")
)

Streaming writes (Delta table as a sink)
(df.writeStream.format("delta")
  .outputMode("append"|"update"|"complete")
  .option("checkpointLocation", "/path/to/checkpoints")
  .trigger(once=True|processingTime="10 seconds")
  .table("events") | .start("/delta/events")
)
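The streaming read and write snippets can be chained into one small pipeline that copies a Delta table incrementally. A sketch assuming an existing spark session; the source path reuses the illustrative /tmp/delta/events_demo table from above, and the sink and checkpoint paths are equally hypothetical:

# Delta table as streaming source ...
stream_df = (spark.readStream
  .format("delta")
  .load("/tmp/delta/events_demo"))

# ... and another Delta table as the sink.
query = (stream_df.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/delta/_checkpoints/events_copy")
  .trigger(once=True)  # process available data once, then stop
  .start("/tmp/delta/events_copy"))

query.awaitTermination()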
CONVERT PARQUET TO DELTA LAKE

Convert Parquet table to Delta Lake format in place
from delta.tables import *

deltaTable = DeltaTable.convertToDelta(spark,
  "parquet.`/path/to/parquet_table`")

partitionedDeltaTable = DeltaTable.convertToDelta(spark,
  "parquet.`/path/to/parquet_table`", "part int")

WORKING WITH DELTA TABLES

# A DeltaTable is the entry point for interacting with
# tables programmatically in Python, for example, to
# perform updates or deletes.
from delta.tables import *

deltaTable = DeltaTable.forName(spark, tableName)
deltaTable = DeltaTable.forPath(spark, "/path/to/delta_table")
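Once data is stored in Delta format, the same table can be reached either as a DeltaTable (for updates, deletes, history) or as a DataFrame (for queries). A short sketch assuming an existing spark session and the illustrative /tmp/delta/events_demo path used earlier:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/tmp/delta/events_demo")

# DataFrame view of the current table contents
deltaTable.toDF().show()

# Transaction log for the table (version, timestamp, operation, ...)
deltaTable.history().select("version", "operation").show()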
DELTA LAKE DDL/DML: UPDATES, DELETES, INSERTS, MERGES

Delete rows that match a predicate condition
# predicate using SQL formatted string
deltaTable.delete("date < '2017-01-01'")
# predicate using Spark SQL functions
deltaTable.delete(col("date") < "2017-01-01")

Update rows that match a predicate condition
# predicate using SQL formatted string
deltaTable.update(condition = "eventType = 'clk'",
  set = { "eventType": "'click'" } )
# predicate using Spark SQL functions
deltaTable.update(condition = col("eventType") == "clk",
  set = { "eventType": lit("click") } )

Upsert (update + insert) using MERGE
# Available options for merges [see docs for details]:
# .whenMatchedUpdate(...) | .whenMatchedUpdateAll(...) |
# .whenNotMatchedInsert(...) | .whenMatchedDelete(...)
(deltaTable.alias("target").merge(
    source = updatesDF.alias("updates"),
    condition = "target.eventId = updates.eventId")
  .whenMatchedUpdateAll()
  .whenNotMatchedInsert(
    values = {
      "date": "updates.date",
      "eventId": "updates.eventId",
      "data": "updates.data",
      "count": 1
    })
  .execute()
)

Insert with Deduplication using MERGE
(deltaTable.alias("logs").merge(
    newDedupedLogs.alias("newDedupedLogs"),
    "logs.uniqueId = newDedupedLogs.uniqueId")
  .whenNotMatchedInsertAll()
  .execute()
)

TIME TRAVEL

View transaction log (aka Delta Log)
fullHistoryDF = deltaTable.history()

Query historical versions of Delta Lake tables
# choose only one option: versionAsOf, or timestampAsOf
df = (spark.read.format("delta")
  .option("versionAsOf", 0)
  .option("timestampAsOf", "2020-12-18")
  .load("/path/to/delta_table"))

Find changes between 2 versions of a table
df1 = spark.read.format("delta").load("/path/to/delta_table")
df2 = spark.read.format("delta").option("versionAsOf", 2).load("/path/to/delta_table")
df1.exceptAll(df2).show()

Rollback a table by version or timestamp
deltaTable.restoreToVersion(0)
deltaTable.restoreToTimestamp('2020-12-01')

UTILITY METHODS

Run Spark SQL queries in Python
spark.sql("SELECT * FROM tableName")
spark.sql("SELECT * FROM delta.`/path/to/delta_table`")
spark.sql("DESCRIBE HISTORY tableName")

Compact old files with Vacuum
deltaTable.vacuum()    # vacuum files older than default retention period (7 days)
deltaTable.vacuum(100) # vacuum files not required by versions more than 100 hours old

Clone a Delta Lake table
deltaTable.clone(target="/path/to/delta_table/",
  isShallow=True, replace=True)

Get DataFrame representation of a Delta Lake table
df = deltaTable.toDF()

PERFORMANCE OPTIMIZATIONS

Compact data files with Optimize and Z-Order
*Databricks Delta Lake feature
spark.sql("OPTIMIZE tableName [ZORDER BY (colA, colB)]")

Auto-optimize tables
*Databricks Delta Lake feature. For existing tables:
spark.sql("ALTER TABLE [table_name | delta.`path/to/delta_table`] SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)")
To enable auto-optimize for all new Delta Lake tables:
spark.sql("SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true")

Cache frequently queried data in Delta Cache
*Databricks Delta Lake feature
spark.sql("CACHE SELECT * FROM tableName")
# or:
spark.sql("CACHE SELECT colA, colB FROM tableName WHERE colNameA > 0")

Provided to the open source community by Databricks.
© Databricks 2021. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are
trademarks of the Apache Software Foundation.
