Databricks State of Data Report 010524 v9 Final
Data + AI
STATE OF DATA + AI
We’re in the
golden age of
data and AI
INTRO
In the 6 months since ChatGPT launched, the world has woken up to the vast potential
of AI. The unparalleled pace of AI discoveries, model improvements and new products
on the market puts data and AI strategy at the top of conversations across every
organization around the world. We believe that AI will usher in the next generation of
product and software innovation, and we’re already seeing this play out in the market.
The next generation of winning companies and executives will be those who understand
and leverage AI.
In this report, we examine patterns and trends in data and AI adoption across more
than 9,000 global Databricks customers. By unifying business intelligence (BI) and AI
applications across companies’ entire data estates, the Databricks Lakehouse provides
a unique vantage point into the state of data and AI, including which products and
technologies are the fastest growing, the types of data science and machine learning
(DS/ML) applications being developed and more.
Here are the major stories we uncovered:
• Companies are adopting machine learning and large language models (LLMs) at a
rapid pace. Natural language processing (NLP) is dominating use cases, with an
accelerated focus on LLMs.
• Open source wins in today’s data and AI markets. Eight out of 10 of our most
widely adopted AI and machine learning products are based on open source.
• Organizations are increasingly using the Lakehouse for data warehousing, as
evidenced by the high growth of data integration tools dbt and Fivetran, and the
accelerated adoption of Databricks SQL.
We hope that by sharing these trends, data leaders will be able to benchmark
their organizations and gain insights that help inform their strategies for an
era defined by data and AI.
Summary of
Key Findings
• NLP accounts for 49% of daily Python data science library usage,
making it the most popular application
• Organizations are getting more efficient with ML; for every three
experimental models, roughly one is put into production, compared
to five experimental models a year prior
Methodology: How did Databricks
create this report?
The State of Data + AI is built from fully aggregated, anonymized data collected
from our customers based on how they are using the Databricks Lakehouse
and its broad ecosystem of integrated tools. This report focuses on machine
learning adoption, data architecture (integrations and migrations) and use cases.
The customers in this report represent every major industry and range in size
from startups to many of the world’s largest enterprises.
Unless otherwise noted, this report presents and analyzes data from February 1,
2022, to January 31, 2023, and usage is measured by number of customers.
When possible, we provide YoY comparisons to showcase growth trends over time.
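As a rough illustration of the measurement described above, the sketch below computes
year-over-year growth from customer counts. The function name `yoy_growth_pct` and the
counts are hypothetical and are not taken from the report's data.

```python
def yoy_growth_pct(customers_prior: int, customers_current: int) -> float:
    """Percent change in customer count from the prior period to the current one."""
    if customers_prior <= 0:
        raise ValueError("prior-period customer count must be positive")
    return (customers_current - customers_prior) / customers_prior * 100.0


# Hypothetical counts for one product at the start and end of the window.
print(f"{yoy_growth_pct(200, 500):.0f}%")  # 150%
```

Growth figures of this kind compare two customer counts, so a product with a small
starting base can show a large percentage from a modest absolute increase.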
Data Science and
Machine Learning
NATURAL LANGUAGE PROCESSING AND LARGE
LANGUAGE MODELS ARE IN HIGH DEMAND
Figure: DS/ML Applications. Categories labeled: Natural Language Processing,
Time Series, Recommender Systems, Graph.
Natural language processing dominates
machine learning use cases
To understand how organizations are applying AI and ML within the Lakehouse, we
aggregated the usage of specialized Python libraries, which include NLTK,
Transformers and FuzzyWuzzy, into popular data science use cases. We look at data
from these libraries because Python is on the cutting edge of new developments in
ML, advanced analytics and AI, and has consistently ranked as one of the most
popular programming languages in recent years.
Our second most popular DS/ML application is simulations and optimization, which
accounts for 30% of all use cases. This signals organizations are using data to
model prototypes and solve problems cost-effectively.
In our data set, 49% of specialized Python libraries used are associated with NLP
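The roll-up of library usage into use cases can be sketched roughly as follows. This
is a minimal illustration, not the report's actual pipeline: only NLTK, Transformers
and FuzzyWuzzy are named in the text, and the other mapping rows, the function name
`use_case_shares` and all usage counts are assumptions.

```python
from collections import Counter

# Library-to-use-case mapping. The three NLP entries are the libraries named in
# the report; the remaining rows are assumptions added for illustration.
LIBRARY_TO_USE_CASE = {
    "nltk": "NLP",
    "transformers": "NLP",
    "fuzzywuzzy": "NLP",
    "prophet": "Time Series",           # assumed
    "networkx": "Graph",                # assumed
    "surprise": "Recommender Systems",  # assumed
}


def use_case_shares(library_usage):
    """Roll per-library usage counts up into per-use-case percentage shares."""
    totals = Counter()
    for library, count in library_usage.items():
        use_case = LIBRARY_TO_USE_CASE.get(library.lower())
        if use_case is not None:  # ignore libraries outside the mapping
            totals[use_case] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {uc: 100.0 * n / grand_total for uc, n in totals.items()}


# Made-up usage counts, not report data.
usage = {"nltk": 30, "transformers": 15, "fuzzywuzzy": 5, "prophet": 25, "networkx": 25}
shares = use_case_shares(usage)
print(shares)
```

With these made-up counts, the three NLP libraries together account for half of the
mapped usage, which is the kind of share the 49% figure expresses.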
USE OF LARGE LANGUAGE MODELS (LLMS)
Figure: use of LLM tools by month, February 2022 to May 2023.
Note: There are several popular types of Python libraries that are commonly used for LLMs.
These libraries provide pretrained models and tools for building, training and deploying LLMs.
We have rolled these libraries up into groupings based on the type of functionality they provide.
Machine learning experimentation and production
take off across industries
The increasing demand for ML solutions and the growing availability of technologies
have led to a significant increase in experimentation and production, two distinct
parts of the ML model lifecycle. We look at the logging and registering of models in
MLflow, an open source platform developed by Databricks, to understand how ML is
trending and being adopted within organizations.
MLflow Model Registry launched in May 2021. Overall, the number of logged models has
grown 54% since February 2022, while the number of registered models has grown 411%
over the same period. This growth in volume suggests organizations are understanding
the value of investing in and allocating more people power to ML.
Organizations test numerous approaches and variables before committing an ML model to
production. We wanted to understand, “How many models do data scientists experiment
with before moving to production?”
Our data shows the ratio of logged to registered models is 2.9 : 1 as of January
2023. This means that for roughly every three experimental models, one model will get
registered as a candidate for production. This ratio has improved significantly from
just a year prior, when we saw that for roughly every five experimental models, one
was registered. Recent advances in ML, such as improved open source libraries like
MLflow and Hugging Face, have radically simplified building and putting models into
production. The result is that 34% of logged models are now candidates for
production, an improvement from just over 20% a year ago.
2.9 : 1
Ratio of Logged to Registered Models in Jan 2023
Figure: monthly trend, February 2022 to January 2023.
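The link between the logged-to-registered ratio and the share of models that become
production candidates is simple arithmetic: a ratio of N : 1 means roughly 100 / N
percent of logged models are registered. The helper below is a hypothetical sketch
for checking the figures quoted in the text.

```python
def production_candidate_share(logged_per_registered: float) -> float:
    """Percent of logged models that get registered, given a logged:registered ratio of N : 1."""
    if logged_per_registered <= 0:
        raise ValueError("ratio must be positive")
    return 100.0 / logged_per_registered


print(round(production_candidate_share(2.9)))  # 34
print(round(production_candidate_share(5.0)))  # 20
```

A ratio of 2.9 : 1 corresponds to about 34%, and a ratio of roughly 5 : 1 corresponds
to about 20%, which matches the year-over-year comparison described above.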
The Modern Data
and AI Stack
Over the last several years, the trend toward building
open, unified data architectures has played out in our
own data. We see that data leaders are opting to preserve
choice, leverage the best products and deliver innovation
across their organizations by democratizing access to
data for more people.
TOP 5 AI AND MACHINE LEARNING PRODUCTS
Figure: number of customers by month, February 2022 to January 2023. Products
labeled include Hugging Face, Labelbox and NVIDIA.
Top AI and ML Products
As companies integrate data science and ML into their business strategies, many
leaders are looking for guidance on the right tools to add to their arsenals. The
Databricks Lakehouse integrates with a growing number of AI and ML solutions to
support these use cases.
One of our most interesting findings is that open source is dominating the top ranks;
3 out of our 5 most widely adopted AI and ML products on the Lakehouse are based on
open source. This indicates a growing sentiment across industries: open platforms and
products are critical to today’s AI and ML strategies. We expect this trend to
continue with the rise of generative AI. Many organizations
Rising Star
Launched in October 2022, LangChain is an open source framework for developing LLM
applications. As a new integration, LangChain does not qualify for this year’s Top 5
AI and ML Products list. But its accelerated growth with the Databricks Lakehouse is
worth highlighting, as it speaks volumes about the current state of the industry.
FASTEST-GROWING DATA AND AI PRODUCTS
dbt 206%
Fivetran 181%
Informatica 174%
Esri 145%
Looker 141%
Lytics 101%
Kepler.gl 95%
DBT IS THE FASTEST-GROWING DATA
AND AI PRODUCT OF 2023
The data ecosystem is undergoing a major transition,
and selecting the right products is critical for companies
looking to take advantage of the newest innovations.
Because the Databricks Lakehouse is used broadly across
this ecosystem, it provides unique insights into how
customers adopt hundreds of data products and services.
GROWTH OF DATA AND AI MARKETS
Figure: number of customers by month, February 2022 to January 2023, by category.
Categories labeled include Business Intelligence, Data Governance & Security, and
Data Integration.
Note: In this chart, we count the number of customers deploying one or more data and AI products in each category. These four
categories do not encompass all products. Databricks products, such as Unity Catalog, are not included in this data.
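The counting rule in the note above (a customer counts once per category if it deploys
one or more products in that category) can be sketched as a distinct count. The
category names follow the report; the product-to-category assignments, the function
name and the records are assumptions for illustration only.

```python
from collections import defaultdict

# Product-to-category mapping. The category names follow the report; the
# product assignments are assumptions for illustration.
PRODUCT_CATEGORY = {
    "Looker": "Business Intelligence",
    "dbt": "Data Integration",
    "Fivetran": "Data Integration",
    "Hugging Face": "DS/ML",
}


def customers_per_category(deployments):
    """Count distinct customers deploying at least one product in each category.

    `deployments` is an iterable of (customer_id, product) pairs.
    """
    customers = defaultdict(set)
    for customer_id, product in deployments:
        category = PRODUCT_CATEGORY.get(product)
        if category is not None:  # products outside the mapping are skipped
            customers[category].add(customer_id)
    return {category: len(ids) for category, ids in customers.items()}


# Made-up deployment records, not report data.
records = [
    ("c1", "dbt"),
    ("c1", "Fivetran"),  # same customer, same category: counted once
    ("c2", "dbt"),
    ("c2", "Looker"),
    ("c3", "Hugging Face"),
]
counts = customers_per_category(records)
print(counts)
```

Using a set per category means a customer running both dbt and Fivetran adds one to
the Data Integration count, not two.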
Data and AI markets: business intelligence is
standard, organizations invest in their machine
learning foundation
To understand how organizations are prioritizing their data initiatives, we
aggregated all data and AI products on the Databricks Lakehouse and categorized them
into four core markets: BI, data governance and security, DS/ML, and data
integration. Our data set confirms that BI tools are more widely adopted across
organizations relative to more nascent categories, and they continue to grow, with a
66% YoY increase in adoption. This aligns with the broader trend of more
organizations performing data warehousing on a Lakehouse, covered in the next
section, Views from the Lakehouse.
While BI is often where organizations start their data journey, companies are
increasingly looking at more advanced data and AI use cases.
DEMAND FOR DATA INTEGRATION PRODUCTS IS GROWING FAST
We see the fastest growth in the data integration market. These tools enable a
company to integrate vast amounts of upstream and downstream data in one consolidated
view. Data integration products ensure that all BI and DS/ML initiatives are built on
a solid foundation.
Views from
the Lakehouse
MIGRATION AND DATA
FORMAT TRENDS
Migration trends: the best data warehouse is a Lakehouse
popular data and AI products, with BI and data integration tools at the top,
organizations are increasingly using the data lakehouse for data warehousing. To
better understand which legacy platforms organizations are moving away from, we look
at the migrations of new customers to Databricks.
An interesting takeaway is that roughly half of the companies moving to the Lakehouse
are coming from data warehouses. This includes the 22% that are moving from cloud
data warehouses. It also demonstrates a growing focus on running data warehousing
workloads on a Lakehouse and unifying data platforms to reduce cost.
SOURCE OF NEW CUSTOMER MIGRATIONS TO DATABRICKS
Figure: Hadoop 39%, On-Prem Data Warehouse 27%, Cloud Data Warehouse 22%.
Rising tides: the volume of data in Delta Lake has grown 304% YoY
VOLUME OF DATA MANAGED, BY STORAGE FORMAT
Figure: volume of data by storage format.
high costs. The Lakehouse solves this problem by
providing a unified platform for all data types
and formats.
Data warehousing grows,
with emphasis on serverless
Over the past 2 years, companies have vastly increased their usage of data
warehousing on the Lakehouse Platform. This is especially demonstrated by use of
Databricks SQL, the serverless data warehouse on the Lakehouse, which shows 144% YoY
growth. This suggests that organizations are increasingly ditching traditional data
warehouses and are able to perform all their BI and analytics on a Lakehouse.
DATA WAREHOUSING ON LAKEHOUSE WITH DATABRICKS SQL
Figure: number of customers, January 2021 to January 2023.
Note: Data consistently dips in the last week of December due to seasonality.
CONCLUSION
Generation AI
We’re excited that companies are progressing into more
advanced ML and AI use cases, and the modern data
and AI stack is evolving to keep up. Along with the rapid
growth of data integration tools (including our fastest
growing, dbt), we’re seeing the rapid rise of NLP and LLM
usage in our own data set, and there’s no doubt that the
next few years will see an explosion in these technologies.
It’s never been more clear: the companies that harness
the power of DS/ML will lead the next generation of data.
About Databricks
© Databricks 2024. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation | Terms of Use