Gradient Flow Trend 2023 Report Final

The document provides an overview of trends in artificial intelligence and machine learning in 2023. It discusses 10 key trends, including the growth of generative AI startups, tools for evaluating large language models, research on more efficient and sustainable ML techniques, and efforts to democratize machine learning. The document aims to give insights into emerging opportunities and challenges in 2023 and beyond.
Copyright
© All Rights Reserved

Data, Machine Learning, and AI: 2023 Opportunities and Trends
Ben Lorica, Mikio Braun, Jenn Webb
Contents & Introduction

“Data, Machine Learning, and AI: 2023 Opportunities and Trends” is an annual comprehensive
look at emerging developments in data infrastructure and engineering, machine learning (ML),
and artificial intelligence. The report is divided into 10 sections, each focusing on a different
aspect of AI and ML.

Section I: Generative AI is one of 2023’s hottest areas for startup funding

Section II: Understanding, testing, and evaluating large (language) models

Section III: Model efficiency and sustainability

Section IV: Increased efforts to democratize machine learning and make it more
accessible to non-experts

Section V: Data processing and data management tools for unstructured data
including text, visual data, speech and audio

Section VI: Renewed focus on streaming (and data integration)

Section VII: Data engineers will focus more on operational tasks

Section VIII: The coming wave of regulations

Section IX: Profiling the next pegacorns

Section X: Other trends to watch

Overall, this report provides a comprehensive overview of the latest developments and
trends in the field of artificial intelligence and machine learning, and offers insights into the
opportunities and challenges that lie ahead in 2023 and beyond.

Section I: From research to real-world applications

Generative models → Generative AI

Generative models are used to learn to model the true data distribution p(x) from observed
samples x. Instead of traditional ML tasks like clustering, prediction, and classification,
generative models and techniques with esoteric names like diffusion, VAEs, GANs, and ELBO
have recently been used to enable exploration, creation, and creative expression.
Implementations of diffusion and other generative models are already becoming more widely
available to developers, and many startups are attempting to build products using these models.

Categorizing machine learning models

Discriminative model: Focus on learning boundaries between classes

• Determine decision boundaries based on observed data
• Given an input x, what is the most likely output y?
• Find y that maximizes P(y | x)

Generative model: Focus on learning classes

• What is the distribution of the input x?
• What is the joint distribution of the input x and the output y?
• Find P(x) or P(x, y)

For more, see this section on the generative model page on Wikipedia.
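
The distinction above can be made concrete with a toy example. The sketch below is only an illustration under simple assumptions: the data set, the labels, and every function name are hypothetical, and probabilities are estimated by plain counting rather than by any real model. It estimates the joint distribution P(x, y) (the generative view), derives P(x) from it, and then answers the discriminative question "which y maximizes P(y | x)?".

```python
from collections import Counter

# Hypothetical labelled samples: x is a word, y is a label.
SAMPLES = [
    ("free", "spam"), ("free", "spam"), ("free", "ham"),
    ("hello", "ham"), ("hello", "ham"), ("offer", "spam"),
]

def joint_distribution(pairs):
    """Generative view: estimate P(x, y) by counting observed pairs."""
    counts = Counter(pairs)
    total = len(pairs)
    return {xy: c / total for xy, c in counts.items()}

def marginal_x(joint):
    """P(x), obtained by summing P(x, y) over y."""
    px = Counter()
    for (x, _), p in joint.items():
        px[x] += p
    return dict(px)

def conditional_y_given_x(joint, x):
    """P(y | x) = P(x, y) / P(x)."""
    px = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / px for (xi, y), p in joint.items() if xi == x}

def most_likely_y(joint, x):
    """Discriminative question: find the y that maximizes P(y | x)."""
    cond = conditional_y_given_x(joint, x)
    return max(cond, key=cond.get)

joint = joint_distribution(SAMPLES)
print(marginal_x(joint))
print(most_likely_y(joint, "free"))
```

A purely discriminative model would skip the joint distribution and fit P(y | x) directly; the counting approach here is used only because it keeps both views visible in a few lines.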

Generative AI in the real world

Large language models (LLM): We continue to see many new startups that target copywriting
and content marketing, general writing, and support (chatbots). We already know of startups
with significant revenues that target copywriters and content creators. Current models are
typically used to produce first drafts, but startups are actively working to produce better
quality and longer-form text tuned for specific verticals.

Coding and programming assistants: We are happy users of GitHub Copilot, and a vast majority
of its users say they feel a lot more productive while coding with it. But there are many
other tools: LLMs have given rise to many AI tools that offer autocomplete-style suggestions.
These tools lead to faster task completion times, help developers conserve mental energy,
and allow them to focus on more satisfying tasks.

Image generation: Text-to-image generators include DALL-E, Imagen, Stable Diffusion, and more.

Speech synthesis: The artificial creation of human speech. Text-to-speech tools have been
around for several years. This coming year, we expect new tools and startups focused on
speech-to-speech (input and output are voice) synthesis. We are particularly keen on the
potential applications of real-time speech-to-speech synthesis to such areas as health and
medicine, customer support, media, and gaming. OpenAI Whisper hints at the potential of
using massive amounts of data to train speech models that approach human-level robustness
and accuracy.

Section II: Tools for understanding, testing, and evaluating large (language) models

arxiv.org papers on testing LLMs

As general-purpose models become more prevalent, there’s a growing need for tools to help
developers select models appropriate for their use case and, more importantly, to help them
understand the limitations of these models. Along those lines, the startup Hugging Face
recently released low-code tools, which make it simple to assess the performance of a set of
models along an axis such as FLOPS and model size, and to assess how well a set of models
performs in comparison to another.

Researchers at Stanford’s Center for Research on Foundation Models just unveiled the results
of a study that evaluated the strengths and weaknesses of 30 well-known large language
models. In the process, they developed a new benchmarking framework, Holistic Evaluation of
Language Models (HELM), which can be described as follows:

• They organize the space of scenarios (use cases) and metrics (desiderata).
• They then select a subset of scenarios and metrics based on societal relevance (e.g.,
user-facing applications), coverage (e.g., different English dialects/varieties), and
feasibility (i.e., amount of compute).

More broadly, we expect more tools for testing models prior to release:

• Why Meta’s latest large language model survived only three days online
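
The "scenarios × metrics" idea described above can be sketched in a few lines. This is not the HELM code or its API; it is a minimal illustration, and the scenario names, the metrics, and the toy `lookup_model` are all assumptions introduced here. The point is only that every model is scored on the same grid, so results are comparable cell by cell.

```python
from itertools import product

# Hypothetical scenarios: name -> list of (prompt, expected answer) pairs.
SCENARIOS = {
    "arithmetic": [("1+1", "2"), ("2+3", "5")],
    "capitals": [("France", "Paris"), ("Japan", "Tokyo")],
}

def exact_match(outputs, references):
    """Fraction of outputs that equal the reference answer."""
    return sum(o == r for o, r in zip(outputs, references)) / len(references)

def mean_output_length(outputs, references):
    """Average output length in characters (a crude cost proxy)."""
    return sum(len(o) for o in outputs) / len(outputs)

# Hypothetical metrics (desiderata): name -> scoring function.
METRICS = {"exact_match": exact_match, "mean_output_length": mean_output_length}

def evaluate(model, scenarios=SCENARIOS, metrics=METRICS):
    """Score one model on every (scenario, metric) cell of the grid."""
    results = {}
    for (s_name, examples), (m_name, metric) in product(
        scenarios.items(), metrics.items()
    ):
        references = [ref for _, ref in examples]
        outputs = [model(prompt) for prompt, _ in examples]
        results[(s_name, m_name)] = metric(outputs, references)
    return results

def lookup_model(prompt):
    """Stand-in for a language model: a fixed table with one wrong answer."""
    table = {"1+1": "2", "2+3": "6", "France": "Paris", "Japan": "Tokyo"}
    return table.get(prompt, "")

for cell, score in sorted(evaluate(lookup_model).items()):
    print(cell, score)
```

Selecting a subset of the grid, as the second bullet describes, would amount to passing smaller `scenarios` and `metrics` dictionaries chosen for relevance, coverage, and available compute.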

Section III: Training *and* maintaining models puts the focus on efficiency and sustainability

As models (for speech, vision, and text) get more widely deployed
and used, the seemingly positive correlation between model size
and accuracy has prompted research into less resource-intensive
methods that can produce comparable results. These research
initiatives are beginning to inspire real-world deployments.

A recent survey paper provides a comprehensive overview of such initiatives in NLP:

• Data: Using fewer training instances, or better utilizing those available, can increase
efficiency.

• Model design: This pertains to architectural changes or new modules that accelerate the
workflow of the main model. For example, a promising direction in text generation is to
combine parametric models with retrieval mechanisms from a database or a knowledge graph.

• Training: Includes pre-training and fine-tuning.

• Inference and compression
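
The retrieval idea mentioned under "Model design" can be illustrated with a small sketch. It assumes a bag-of-words similarity in place of learned embeddings and stops at building the prompt; the function names and the prompt layout are hypothetical, and a real system would pass the resulting prompt to a parametric model.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Very rough text representation: lowercase token counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True
    )
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Prepend retrieved context so the model can rely on it, not on its weights."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = ["paris is the capital of france", "tokyo is the capital of japan"]
print(build_prompt("capital of japan", docs))
```

The efficiency argument is that facts live in the document store, which is cheap to update, so the parametric model itself can stay smaller.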

Sustainable AI
Organizations like Allen AI and Meta are
devoting resources to green/sustainable AI, a
collection of tools and processes that explores
the environmental impact of AI from a holistic
perspective. The goal is to develop and deploy
AI systems that yield novel results while
considering computational and environmental
costs, thereby reducing resource usage.

For a given use case, the goal is to measure the environmental impact of AI data,
algorithms, and system hardware. Additionally, researchers in this area consider emissions
across the life cycle of hardware systems, from manufacturing to operational use.
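
A back-of-the-envelope version of this accounting might look like the sketch below. It uses a common simplification (energy drawn, scaled by a data center overhead factor, multiplied by the carbon intensity of the grid) plus a share of manufacturing emissions amortized over the hardware lifetime. The function names and every number in the example are assumptions for illustration, not measurements from the report.

```python
def operational_emissions_kg(power_watts, hours, num_devices, pue, grid_kg_per_kwh):
    """Emissions from running the hardware.

    pue is the data center's power usage effectiveness (overhead multiplier);
    grid_kg_per_kwh is the carbon intensity of the electricity used.
    """
    energy_kwh = (power_watts / 1000.0) * hours * num_devices * pue
    return energy_kwh * grid_kg_per_kwh

def amortized_embodied_emissions_kg(
    embodied_kg_per_device, num_devices, hours, lifetime_hours
):
    """Share of manufacturing emissions attributed to this job."""
    return embodied_kg_per_device * num_devices * (hours / lifetime_hours)

def total_emissions_kg(operational_kg, embodied_kg):
    """Life-cycle view: operational use plus amortized manufacturing."""
    return operational_kg + embodied_kg

# Hypothetical training job: 8 accelerators at 300 W for 100 hours.
op = operational_emissions_kg(300, 100, 8, pue=1.5, grid_kg_per_kwh=0.4)
emb = amortized_embodied_emissions_kg(150, 8, 100, lifetime_hours=40_000)
print(round(op, 2), round(emb, 2), round(total_emissions_kg(op, emb), 2))
```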

Section IV: Increased efforts to democratize machine learning and make it more accessible to non-experts

There are many startups using AI to build tools to boost developer productivity, and some
are building low/no-code tools to open up programming tasks to non-coders.

With demand for AI and machine learning rising, an encouraging sign is that tools for
building, deploying, and maintaining models continue to get better. However, many of these
tools require experimenting with different models, hyperparameter tuning, as well as the
judgment of data scientists who have some familiarity with the domain and the underlying
data.

As we noted in a recent post on AutoML and time series, there is an enormous pool of
potential contributors (developers and analysts) with limited backgrounds in machine
learning and statistics. Thankfully, there are startups, companies, and open source projects
focused on making ML more accessible to non-expert users.

[Figure: Size of Global Talent Pool, comparing developers / software engineers, analysts,
data scientists, and machine learning / deep learning engineers]

ML & AI Tools for Domain Experts ⟶ AutoML / AutoNLP / AutoForecasting
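
To make the AutoML / AutoForecasting idea concrete, the sketch below automates the step that usually requires an expert: trying several candidate models and keeping the one that performs best on held-out data. The candidate forecasters, the toy series, and all names are hypothetical; real AutoML tools search far larger model and hyperparameter spaces.

```python
def naive_forecast(history, horizon):
    """Repeat the last observed value."""
    return [history[-1]] * horizon

def moving_average_forecast(window):
    """Forecast the mean of the last `window` observations."""
    def forecast(history, horizon):
        avg = sum(history[-window:]) / window
        return [avg] * horizon
    return forecast

def seasonal_naive_forecast(season):
    """Repeat the values observed one season ago."""
    def forecast(history, horizon):
        return [history[-season + (i % season)] for i in range(horizon)]
    return forecast

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def auto_select(series, candidates, holdout):
    """Score every candidate on the last `holdout` points and pick the best."""
    train, valid = series[:-holdout], series[-holdout:]
    scores = {
        name: mean_absolute_error(valid, forecast(train, holdout))
        for name, forecast in candidates.items()
    }
    return min(scores, key=scores.get), scores

CANDIDATES = {
    "naive": naive_forecast,
    "moving_average_3": moving_average_forecast(3),
    "seasonal_naive_3": seasonal_naive_forecast(3),
}

series = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
best, scores = auto_select(series, CANDIDATES, holdout=3)
print(best, scores)
```

The user supplies only the data and a holdout size; model choice is handled by the search loop, which is the sense in which such tools open ML up to non-experts.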

Section V: Tools for unstructured data

Computer vision and speech technologies may have spurred the resurgence in interest in AI
(deep learning), but tools for processing, wrangling, storing, and analyzing visual and
audio data are lagging. Over the past few months we’ve come across teams that are focused
on making these important data types much more accessible to developers and data science
teams. We suspect there’ll be more teams focused on tools for visual and audio data over
the next year.

• Introducing a free tool for curating image data sets at scale
• New open source tools to unlock speech and audio data
• Deep Lake: A lakehouse for deep learning

A related set of tools leverages developments in nearest neighbor algorithms and neural
models. Vector databases and vector search are on the radar of a growing number of
technical teams. Advances in neural networks have made dense vector representations of
data more common. Embeddings are common in organizations using neural networks. You can
think of vector databases as what you get when you embed your entire database. Hence,
vector databases are a family of database-managed architectures that allow you to
integrate AI into your data management system.

• The Vector Database Index
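
The core operation behind vector search can be sketched as a nearest neighbor lookup over embeddings. The class below is a hypothetical, brute-force illustration (`TinyVectorIndex` is not a real library, and the two-dimensional vectors stand in for embeddings produced by a neural model); production vector databases add approximate indexes, persistence, and filtering on top of this idea.

```python
import math

def _normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        raise ValueError("cannot index or query a zero vector")
    return [v / norm for v in vector]

class TinyVectorIndex:
    """Brute-force cosine-similarity search over stored vectors."""

    def __init__(self):
        self._items = []  # list of (item_id, unit-length vector)

    def add(self, item_id, vector):
        self._items.append((item_id, _normalize(vector)))

    def search(self, query, k=3):
        """Return up to k (item_id, similarity) pairs, most similar first."""
        q = _normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), item_id)
            for item_id, vec in self._items
        ]
        scored.sort(reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]

index = TinyVectorIndex()
index.add("cat", [1.0, 0.0])
index.add("dog", [0.9, 0.1])
index.add("car", [0.0, 1.0])
print(index.search([1.0, 0.05], k=2))
```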

Section VI: Renewed focus on streaming (and data integration)

Data integration will continue to be a very active and exciting area, with new tools
forging ahead toward higher reliability, ease of use, more connectors, improved
orchestration, monitoring, and observability. Some of the best solutions come from
companies that offer broad platforms (e.g., Databricks has outstanding offerings in this
area).

One interesting trend is the rise of low-code/no-code tools for data integration and data
pipelines.

Project Lightspeed (next-gen Spark Streaming) is just the most recent indicator of renewed
interest in streaming and streaming applications. Our belief is that companies that do
streaming first will gain a decisive advantage over the next few years.

• The Stream Processing Index
• The Data Integration Market
• Summer of Orchestration
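
For readers new to streaming, the sketch below shows one of its basic building blocks: a tumbling-window aggregation that emits results as events arrive instead of waiting for a batch job. It is a plain-Python illustration under simple assumptions (events arrive ordered by timestamp, and the function name is hypothetical); engines such as Spark Structured Streaming handle late data, state, and fault tolerance, which this sketch ignores.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    events: iterable of (timestamp_seconds, key), assumed ordered by timestamp.
    Yields (window_start, {key: count}) as soon as each window closes.
    """
    current_window = None
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        if current_window is not None and window_start != current_window:
            yield current_window, dict(counts)
            counts = defaultdict(int)
        current_window = window_start
        counts[key] += 1
    if current_window is not None:
        yield current_window, dict(counts)  # flush the final window

events = [(0, "a"), (3, "b"), (9, "a"), (10, "a"), (25, "b")]
for window_start, counts in tumbling_window_counts(events, window_seconds=10):
    print(window_start, counts)
```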

Companies that master streaming will have a decisive edge.

Section VII: Data engineers are focusing more on operational tasks

The rise of cloud warehouses and lakehouses means data engineers and data platform teams can get more done—at scale—
compared to a few years ago, when teams had to piece together and manage a variety of tools. Their role is shifting from
infrastructure development to operational tasks.

Emerging areas of focus include:

Ops: reliability, automation, monitoring and observability, and incident response. A new
wave of orchestration solutions is helping.

Data asset governance and lineage

Cloud computing cost management and optimization (sometimes referred to as FinOps)

Section VIII: AI and data teams will be caught flat-footed by the coming wave of regulations

Unlike data privacy, where companies could organize their initiatives around a few major
frameworks (GDPR, CCPA), AI compliance involves many more regulatory frameworks that cover
disparate business areas. According to our friends at BNH, the first U.S. law firm focused
on AI and analytics, the best step companies can take over the next year is to
operationalize the NIST AI Risk Management Framework. Operationalizing the NIST framework
demonstrates good faith and helps reduce risks associated with AI.

Partial list of regulations on the brink of passing or already in place:

• EU AI Act: Establishes a process for self-certification and government oversight of
high-risk AI systems, sets transparency requirements for AI systems interacting with
people, and seeks to ban a few “unacceptable” qualities of AI systems.

• NYC automated employment decision tools: Pertain to software systems “used to
substantially assist or replace discretionary decision-making for making employment
decisions that impact natural persons”.

º New York City delays enforcement of AI bias law

• District of Columbia’s Stop Discrimination by Algorithms Act of 2021: Establishes a
framework for restricting and requiring businesses that use algorithms to make credit and
eligibility decisions, including those directing advertising and marketing solicitations.

• California’s Fair Employment & Housing Council: A major step toward regulating the use
of artificial intelligence and machine learning in conjunction with employment decisions.

• Blueprint for an AI Bill of Rights (White House): Though not proposed legislation, the
White House Office of Science and Technology Policy has identified five principles for
designing, using, and deploying automated systems.

Section IX: Applications will continue to lead to more pegacorns

We are in a challenging economic environment for startups. Given the importance of AI and
data intelligence for most companies, AI and data startups will continue to thrive even in
these tough economic times. In fact, we know of some new AI companies that have achieved
pegacorn status ($100M in annual revenue). As with our original list, these are primarily
companies focused on applications. The reason AI application companies tend to do better
is that there are more use cases at the application layer than at the infrastructure layer
where a single company can serve the same purpose across multiple companies (e.g., data
management or data integration).

The AI $100M revenue club


The data pegacorns

How to get into the AI pegacorn club

Revenue: $100M in annual revenue, pieced together from various data sources including
Crunchbase, Zoominfo, public announcements, and other media sources

Founding Date: Founded around 2012 or later, when deep learning breakthroughs started

Independent: Privately held standalone company, and/or was recently acquired by a larger
public company within the past year and operates as a standalone company

ML: Has machine learning as a key component of their product offering
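
The four criteria above can be read as a simple screening rule. The sketch below encodes them as a predicate; the `Company` fields, the example companies, and the hard 2012 cutoff (the report says "around 2012") are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Company:
    name: str
    annual_revenue_usd: float
    founded_year: int
    privately_held: bool
    acquired_within_past_year: bool
    operates_standalone: bool
    ml_is_key_component: bool

def is_ai_pegacorn(company, revenue_threshold=100_000_000, earliest_year=2012):
    """Apply the revenue, founding date, independence, and ML criteria."""
    independent = company.operates_standalone and (
        company.privately_held or company.acquired_within_past_year
    )
    return (
        company.annual_revenue_usd >= revenue_threshold
        and company.founded_year >= earliest_year
        and independent
        and company.ml_is_key_component
    )

# Hypothetical companies, not taken from the report's lists.
examples = [
    Company("ExampleVision", 120e6, 2015, True, False, True, True),
    Company("LegacyAnalytics", 500e6, 2004, True, False, True, True),
    Company("SmallNLP", 20e6, 2019, True, False, True, True),
]
print([c.name for c in examples if is_ai_pegacorn(c)])
```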

Section X: Other trends to watch

Tools for multi-cloud

• Terraform: An introduction

• New tools like Skyplane point toward a future when multi-cloud data platforms are more
common.

º Sky Lab: Skyplane, SkyPilot

º Sky computing is a new computing model where resources from multiple cloud providers
are utilized to create large-scale distributed platforms.

AI and geopolitics

The Biden administration’s export controls for “Certain advanced computing and
semiconductor manufacturing items; supercomputer and semiconductor end use,” will have an
impact on the data and AI communities. In recent years, Chinese companies have ramped up
their contributions in research (number of publications at top-tier conferences), open
source software projects, and startups. Decoupling China from the rest of the world will
have repercussions for AI.

Data and Cybersecurity

Privacy and machine learning: Measuring the popularity and exploring the readiness of
confidential computing tools

Ransomware threats: Ransomware accounted for $20 billion in global losses in 2021.
According to estimates, that number will grow to $265 billion by 2031.

Securing your software supply chain: Attacks on the Python Package Index (PyPI) grew 41%
in 2022. Here’s a recent example of such an attack against PyTorch. Given how critical
Python is to data engineering, machine learning, and AI teams, here are some tips on how to
secure your Python supply chain.
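
One widely used safeguard is to pin the expected hash of every dependency and refuse anything that does not match; pip supports this through its `--require-hashes` mode. The sketch below shows the underlying check with the standard library only. The function names are hypothetical and the "artifact" is just a byte string, so treat it as an illustration of the idea rather than a replacement for pip's own verification.

```python
import hashlib
import hmac

def sha256_of_bytes(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of an artifact's contents."""
    return hashlib.sha256(data).hexdigest()

def verify_artifact(data: bytes, pinned_sha256: str) -> bool:
    """Accept the artifact only if its digest matches the pinned value."""
    return hmac.compare_digest(sha256_of_bytes(data), pinned_sha256.lower())

# Hypothetical artifact and the digest recorded when it was first vetted.
trusted_artifact = b"example package contents"
pinned = sha256_of_bytes(trusted_artifact)

print(verify_artifact(trusted_artifact, pinned))
print(verify_artifact(b"tampered package contents", pinned))
```

A package swapped out upstream, as in a dependency confusion or typosquatting attack, would produce a different digest and fail the check before it is ever installed.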

The emergence of Twitter alternatives

With the chaos taking place on Twitter, where will the ML, AI, and engineering communities
migrate? There are three emerging venues.

We think a combination of 1 and 2, with a higher proportion going to 1, will be most likely:

1. Federated communities built around ActivityPub, an open, decentralized social
networking protocol.
2. New (centralized) platforms.
3. Federated communities based on OStatus, an open standard for federated microblogging.

Subscribe to the Gradient
Flow Newsletter to stay up
to date on emerging trends
