Dev Ops
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID
transactions and scalable metadata handling, and it unifies streaming and batch data processing. Delta Lake
runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake on Azure Databricks allows you to configure Delta Lake based on your workload patterns and
provides optimized layouts and indexes for fast interactive queries.
Delta Lake sits on top of Apache Spark. Together, the storage format and the compute layer help simplify
building big data pipelines and increase their overall efficiency.
Delta Lake uses versioned Parquet files to store your data in your cloud storage. In addition to the versioned
files, Delta Lake stores a transaction log that keeps track of all the commits made to the table or blob store
directory, which is how it provides ACID transactions.
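As a rough illustration, the following PySpark sketch shows the transaction log surfacing through DESCRIBE HISTORY and a read of an earlier version of the table. Both commands are standard Delta Lake features, but the table path and its contents are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table location; substitute your own cloud storage path.
table_path = "/mnt/datalake/events"

# The transaction log records every commit made to the table.
spark.sql(f"DESCRIBE HISTORY delta.`{table_path}`").show(truncate=False)

# Because the Parquet files are versioned, an earlier snapshot can be read back ("time travel").
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(table_path)
df_v0.show()
```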
You can use your favorite Apache Spark APIs to read and write data with Delta Lake. See Read a table and
Write to a table.
When writing data, you can specify the location in your cloud storage. Delta Lake stores the data in that
location in Parquet format.
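A minimal sketch of reading and writing with the Spark DataFrame APIs, assuming a Databricks-style mount path that you would replace with your own storage location:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a DataFrame to a chosen cloud storage location; Delta Lake stores the
# data there as Parquet files alongside a _delta_log directory.
events_df = spark.range(0, 100).withColumnRenamed("id", "event_id")
events_df.write.format("delta").mode("overwrite").save("/mnt/datalake/events")

# Read the table back with the same DataFrame API.
spark.read.format("delta").load("/mnt/datalake/events").show(5)
```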
You can use Structured Streaming to write data directly into Delta tables and to read from them.
See Stream data into Delta tables and Stream data from Delta tables.
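The sketch below streams data from one Delta table into another; the source, destination, and checkpoint paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a stream of changes from a Delta table (hypothetical path).
source_stream = spark.readStream.format("delta").load("/mnt/datalake/events")

# Write the stream into another Delta table; the checkpoint tracks progress so
# the query can restart without reprocessing or losing data.
query = (source_stream.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/mnt/datalake/_checkpoints/events_copy")
         .start("/mnt/datalake/events_copy"))
```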
Does Delta Lake support writes or reads using the Spark Streaming DStream API?
Delta Lake does not support the DStream API. We recommend Table streaming reads and writes.
When I use Delta Lake, will I be able to port my code to other Spark platforms easily?
Yes. When you use Delta Lake, you are using open Apache Spark APIs, so you can easily port your code to
other Spark platforms. To port your code, replace the delta format with the parquet format.
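For example, a write that targets Delta on Azure Databricks differs from a plain Parquet write only in the format string; the paths here are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# On Azure Databricks with Delta Lake:
df.write.format("delta").mode("overwrite").save("/mnt/datalake/example")

# On another Spark platform, only the format string changes:
df.write.format("parquet").mode("overwrite").save("/mnt/parquet/example")
```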
Compared to Hive SerDe tables, Delta tables are managed to a greater degree. In particular, there are several
Hive SerDe parameters that Delta Lake manages on your behalf and that you should never specify manually
(see the sketch after this list):
ROWFORMAT
SERDE
OUTPUTFORMAT AND INPUTFORMAT
COMPRESSION
STORED AS
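For illustration, a minimal table definition might look like the sketch below; the table name and columns are hypothetical, and note that none of the SerDe parameters listed above appear.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A hypothetical Delta table definition: only the columns and USING DELTA are
# declared; ROWFORMAT, SERDE, INPUTFORMAT/OUTPUTFORMAT, COMPRESSION, and
# STORED AS are managed by Delta Lake and must not be specified.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_id BIGINT,
        event_date DATE,
        payload STRING
    )
    USING DELTA
""")
```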
What DDL and DML features does Delta Lake not support?
Delta Lake does not support multi-table transactions and foreign keys. Delta Lake supports transactions
at the table level.
Changing a column’s type or dropping a column requires rewriting the table. For an example, see Change
column type.
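As a sketch of what such a rewrite can look like (assuming a hypothetical table path and column), you can read the table, cast the column, and overwrite the data together with the schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
table_path = "/mnt/datalake/events"  # hypothetical location

# Changing a column's type means rewriting the table: read it, cast the column,
# then overwrite the data along with the schema.
df = spark.read.format("delta").load(table_path)
(df.withColumn("event_id", col("event_id").cast("string"))
   .write.format("delta")
   .mode("overwrite")
   .option("overwriteSchema", "true")
   .save(table_path))
```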
Support for multi-cluster writes means that Delta Lake does locking to make sure that queries writing to a
table from multiple clusters at the same time won't corrupt the table. However, it does not mean that if there
is a write conflict (for example, an update and a delete of the same thing) both writes will succeed. Instead,
one of the writes will fail atomically and the error will tell you to retry the operation.
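A minimal retry sketch, assuming the delta-spark Python package (where conflicting commits surface as exceptions in delta.exceptions) and a hypothetical table path and update:

```python
from delta.exceptions import DeltaConcurrentModificationException
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table = DeltaTable.forPath(spark, "/mnt/datalake/events")  # hypothetical path

# If another cluster commits a conflicting change, the losing write fails
# atomically; retrying the operation is usually enough.
for attempt in range(3):
    try:
        table.update(condition="event_id = 42", set={"payload": "'reprocessed'"})
        break
    except DeltaConcurrentModificationException:
        print(f"Write conflict on attempt {attempt + 1}; retrying")
```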
You can concurrently modify the same Delta table from different workspaces. Moreover, if one process is
writing from a workspace, readers in other workspaces will see a consistent view.
When you access Delta tables from outside Databricks Runtime, there are two cases to consider: external
writes and external reads.
External writes: Delta Lake maintains additional metadata in the form of a transaction log to
enable ACID transactions and snapshot isolation for readers. In order to ensure the transaction
log is updated correctly and the proper validations are performed, writes must go through
Databricks Runtime.
External reads: Delta tables store data encoded in an open format (Parquet), allowing other tools
that understand this format to read the data. However, since other tools do not support the
Delta Lake transaction log, it is likely that they will incorrectly read stale deleted data,
uncommitted data, or the partial results of failed transactions.
In cases where the data is static (that is, there are no active jobs writing to the table), you can
use VACUUM with a retention of ZERO HOURS to clean up any stale Parquet files that are not
currently part of the table. This operation puts the Parquet files present in DBFS into a consistent
state such that they can now be read by external tools.
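A sketch of that clean-up, assuming a hypothetical table path; a zero-hour retention is below the default safety threshold, so the retention duration check has to be disabled first.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Allow a retention interval shorter than the default safety threshold.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Remove stale Parquet files that are no longer part of the table
# (hypothetical path) so external tools read a consistent set of files.
spark.sql("VACUUM delta.`/mnt/datalake/events` RETAIN 0 HOURS")
```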
However, Delta Lake relies on stale snapshots for the following functionality, which will fail when you run
VACUUM with zero retention:
o Snapshot isolation for readers: Long-running jobs continue to read a consistent snapshot from the
moment they started, even if the table is modified concurrently. Running VACUUM with a retention
shorter than the length of these jobs can cause them to fail with a FileNotFoundException.
o Streaming from Delta tables: Streams read from the original files written into a table in order to
ensure exactly-once processing. When combined with OPTIMIZE, VACUUM with zero retention can
remove these files before the stream has had time to process them, causing it to fail.
For these reasons Databricks recommends using this technique only on static data sets that must
be read by external tools.
Describe CI/CD
Azure DevOps is a collection of services that provide an end-to-end solution for the five core practices of
DevOps: planning and tracking, development, build and test, delivery, and monitoring and operations.
It is possible to put an Azure Databricks notebook under version control in an Azure DevOps repo. Using
Azure DevOps, you can then build deployment pipelines to manage your release process.
While we won't be demonstrating all of the features of Azure DevOps in this module, here are some of
the features that make it well-suited to CI/CD with Azure Databricks.
Continuous Integration
Throughout the development cycle, developers commit code changes locally as they work on new features,
bug fixes, and so on. If the developers practice continuous integration, they merge their changes back to the
main branch as often as possible. Each merge into the main branch triggers a build and automated tests that
validate the code changes to ensure successful integration with other incoming changes. This process avoids
the integration headaches that frequently happen when people wait until release day to merge all their
changes into the release branch.
Continuous Delivery
Continuous delivery builds on top of continuous integration to ensure you can successfully release new
changes in a fast and consistent way. This is because, in addition to the automated builds and testing
provided by continuous integration, the release process is automated to the point where you can deploy
your application with the click of a button.
Continuous Deployment
Continuous deployment takes continuous delivery a step further by automatically deploying your
application without human intervention. This means that merged changes pass through all stages of your
production pipeline and, unless any of the tests fail, automatically release to production in a fully
automated manner.
Who benefits?
Everyone. Once properly configured, automated testing and deployment can free up your engineering
team and enable your data team to push their changes into production. For example:
Data engineers can easily deploy changes to generate new tables for BI analysts.
Data scientists can update models being used in production.
Data analysts can modify scripts being used to generate dashboards.
In short, changes made to a Databricks notebook can be pushed to production with a simple mouse click
(and then any amount of oversight that your DevOps team feels is appropriate).
Additional Resources