Dev Ops
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID
transactions and scalable metadata handling, and it unifies streaming and batch data processing. Delta Lake
runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake on Azure Databricks allows you to configure Delta Lake based on your workload patterns and
provides optimized layouts and indexes for fast interactive queries.
Delta Lake sits on top of Apache Spark. Together, the storage format and the compute layer help simplify
building big data pipelines and increase their overall efficiency.
Delta Lake uses versioned Parquet files to store your data in your cloud storage. In addition to the versioned
files, Delta Lake stores a transaction log that keeps track of all the commits made to the table or blob store
directory, which is how it provides ACID transactions.
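As a rough illustration, the following PySpark sketch shows the transaction log surfacing through DESCRIBE HISTORY and a read of an earlier version of the table. Both commands are standard Delta Lake features, but the table path and its contents are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table location; substitute your own cloud storage path.
table_path = "/mnt/datalake/events"

# The transaction log records every commit made to the table.
spark.sql(f"DESCRIBE HISTORY delta.`{table_path}`").show(truncate=False)

# Because the Parquet files are versioned, an earlier snapshot can be read back ("time travel").
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(table_path)
df_v0.show()
```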
You can use your favorite Apache Spark APIs to read and write data with Delta Lake. See Read a table and
Write to a table.
When writing data, you can specify the location in your cloud storage. Delta Lake stores the data in that
location in Parquet format.
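A minimal sketch of reading and writing with the Spark DataFrame APIs, assuming a Databricks-style mount path that you would replace with your own storage location:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a DataFrame to a chosen cloud storage location; Delta Lake stores the
# data there as Parquet files alongside a _delta_log directory.
events_df = spark.range(0, 100).withColumnRenamed("id", "event_id")
events_df.write.format("delta").mode("overwrite").save("/mnt/datalake/events")

# Read the table back with the same DataFrame API.
spark.read.format("delta").load("/mnt/datalake/events").show(5)
```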
You can use Structured Streaming to write data directly into Delta tables and to read from them.
See Stream data into Delta tables and Stream data from Delta tables.
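The sketch below streams data from one Delta table into another; the source, destination, and checkpoint paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a stream of changes from a Delta table (hypothetical path).
source_stream = spark.readStream.format("delta").load("/mnt/datalake/events")

# Write the stream into another Delta table; the checkpoint tracks progress so
# the query can restart without reprocessing or losing data.
query = (source_stream.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/mnt/datalake/_checkpoints/events_copy")
         .start("/mnt/datalake/events_copy"))
```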
Does Delta Lake support writes or reads using the Spark Streaming DStream API?
Delta Lake does not support the DStream API. We recommend Table streaming reads and writes.
When I use Delta Lake, will I be able to port my code to other Spark platforms easily?
Yes. When you use Delta Lake, you are using open Apache Spark APIs, so you can easily port your code to
other Spark platforms. To port your code, replace the delta format with the parquet format.
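For example, a write that targets Delta on Azure Databricks differs from a plain Parquet write only in the format string; the paths here are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# On Azure Databricks with Delta Lake:
df.write.format("delta").mode("overwrite").save("/mnt/datalake/example")

# On another Spark platform, only the format string changes:
df.write.format("parquet").mode("overwrite").save("/mnt/parquet/example")
```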
Compared to Hive SerDe tables, Delta tables are managed to a greater degree. In particular, there are several
Hive SerDe parameters that Delta Lake manages on your behalf and that you should never specify manually
(see the sketch after this list):
ROWFORMAT
SERDE
OUTPUTFORMAT AND INPUTFORMAT
COMPRESSION
STORED AS
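For illustration, a minimal table definition might look like the sketch below; the table name and columns are hypothetical, and note that none of the SerDe parameters listed above appear.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A hypothetical Delta table definition: only the columns and USING DELTA are
# declared; ROWFORMAT, SERDE, INPUTFORMAT/OUTPUTFORMAT, COMPRESSION, and
# STORED AS are managed by Delta Lake and must not be specified.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_id BIGINT,
        event_date DATE,
        payload STRING
    )
    USING DELTA
""")
```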
What DDL and DML features does Delta Lake not support?
Delta Lake does not support multi-table transactions and foreign keys. Delta Lake supports transactions
at the table level.
Changing a column’s type or dropping a column requires rewriting the table. For an example, see Change
column type.
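As a sketch of what such a rewrite can look like (assuming a hypothetical table path and column), you can read the table, cast the column, and overwrite the data together with the schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
table_path = "/mnt/datalake/events"  # hypothetical location

# Changing a column's type means rewriting the table: read it, cast the column,
# then overwrite the data along with the schema.
df = spark.read.format("delta").load(table_path)
(df.withColumn("event_id", col("event_id").cast("string"))
   .write.format("delta")
   .mode("overwrite")
   .option("overwriteSchema", "true")
   .save(table_path))
```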
Support for multi-cluster writes means that Delta Lake does locking to make sure that queries writing to a
table from multiple clusters at the same time won't corrupt the table. However, it does not mean that if there
is a write conflict (for example, an update and a delete of the same thing) both writes will succeed. Instead,
one of the writes will fail atomically and the error will tell you to retry the operation.
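A minimal retry sketch, assuming the delta-spark Python package (where conflicting commits surface as exceptions in delta.exceptions) and a hypothetical table path and update:

```python
from delta.exceptions import DeltaConcurrentModificationException
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table = DeltaTable.forPath(spark, "/mnt/datalake/events")  # hypothetical path

# If another cluster commits a conflicting change, the losing write fails
# atomically; retrying the operation is usually enough.
for attempt in range(3):
    try:
        table.update(condition="event_id = 42", set={"payload": "'reprocessed'"})
        break
    except DeltaConcurrentModificationException:
        print(f"Write conflict on attempt {attempt + 1}; retrying")
```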
You can concurrently modify the same Delta table from different workspaces. Moreover, if one process is
writing from a workspace, readers in other workspaces will see a consistent view.
When you access Delta tables from outside Databricks Runtime, there are two cases to consider: external
writes and external reads.
External writes: Delta Lake maintains additional metadata in the form of a transaction log to
enable ACID transactions and snapshot isolation for readers. In order to ensure the transaction
log is updated correctly and the proper validations are performed, writes must go through
Databricks Runtime.
External reads: Delta tables store data encoded in an open format (Parquet), allowing other tools
that understand this format to read the data. However, since other tools do not support the
Delta Lake transaction log, it is likely that they will incorrectly read stale deleted data,
uncommitted data, or the partial results of failed transactions.
In cases where the data is static (that is, there are no active jobs writing to the table), you can
use VACUUM with a retention of ZERO HOURS to clean up any stale Parquet files that are not
currently part of the table. This operation puts the Parquet files present in DBFS into a consistent
state such that they can now be read by external tools.
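A sketch of that clean-up, assuming a hypothetical table path; a zero-hour retention is below the default safety threshold, so the retention duration check has to be disabled first.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Allow a retention interval shorter than the default safety threshold.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Remove stale Parquet files that are no longer part of the table
# (hypothetical path) so external tools read a consistent set of files.
spark.sql("VACUUM delta.`/mnt/datalake/events` RETAIN 0 HOURS")
```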
However, Delta Lake relies on stale snapshots for the following functionality, which will fail when you run
VACUUM with zero retention:
o Snapshot isolation for readers: Long-running jobs continue to read a consistent snapshot from the
moment they started, even if the table is modified concurrently. Running VACUUM with a retention
shorter than the length of these jobs can cause them to fail with a FileNotFoundException.
o Streaming from Delta tables: Streams read from the original files written into a table in order to
ensure exactly-once processing. When combined with OPTIMIZE, VACUUM with zero retention can
remove these files before the stream has had time to process them, causing it to fail.
For these reasons Databricks recommends using this technique only on static data sets that must
be read by external tools.
Describe CI/CD
Azure DevOps is a collection of services that provide an end-to-end solution for the five core practices of
DevOps: planning and tracking, development, build and test, delivery, and monitoring and operations.
It is possible to put an Azure Databricks notebook under version control in an Azure DevOps repo. Using
Azure DevOps, you can then build deployment pipelines to manage your release process.
While we won't be demonstrating all of the features of Azure DevOps in this module, here are some of
the features that make it well-suited to CI/CD with Azure Databricks.
Continuous Integration
Throughout the development cycle, developers commit code changes locally as they work on new features,
bug fixes, and so on. If the developers practice continuous integration, they merge their changes back to the
main branch as often as possible. Each merge into the main branch triggers a build and automated tests that
validate the code changes to ensure successful integration with other incoming changes. This process avoids
the integration headaches that frequently happen when people wait until release day to merge all their
changes into the release branch.
Continuous Delivery
Continuous delivery builds on top of continuous integration to ensure you can successfully release new
changes in a fast and consistent way. This is because, in addition to the automated builds and testing
provided by continuous integration, the release process is automated to the point where you can deploy
your application with the click of a button.
Continuous Deployment
Continuous deployment takes continuous delivery a step further by automatically deploying your
application without human intervention. This means that merged changes pass through all stages of your
production pipeline and, unless any of the tests fail, automatically release to production in a fully
automated manner.
Who benefits?
Everyone. Once properly configured, automated testing and deployment can free up your engineering
team and enable your data team to push their changes into production. For example:
Data engineers can easily deploy changes to generate new tables for BI analysts.
Data scientists can update models being used in production.
Data analysts can modify scripts being used to generate dashboards.
In short, changes made to a Databricks notebook can be pushed to production with a simple mouse click
(and then any amount of oversight that your DevOps team feels is appropriate).
Additional Resources