From Machine Learning to
Autonomous Intelligence
Lecture 1
Yann LeCun
NYU - Courant Institute & Center for Data Science
Meta - Fundamental AI Research
http://yann.lecun.com
Summer School on Statistical
Physics & Machine Learning
Les Houches, 2022-07-[20-22]
Plan
Applications of AI / ML / DL today
Largely rely on supervised Deep Learning. A few on Deep RL.
Increasingly rely on Self-Supervised pre-training.
Current ML/DL sucks compared to humans and animals
Humans and animals learn models of the world
Self-Supervised Learning
Main problem: representing uncertainty, learning abstractions.
Energy-Based Models
Sample contrastive learning methods
Non-contrastive learning methods
Main Messages
Deep SSL is the enabling element for the next AI revolution
I’ll try to convince you to:
Give up on supervised and reinforcement learning
well, not completely, but as much as possible.
Give up on probabilistic modeling
use the energy-based framework instead
Give up on generative models
Use joint-embedding architectures instead
Use hierarchical latent-variable energy-based models
To enable machines to reason and plan.
See position paper: “A Path Towards Autonomous Machine Intelligence”
https://openreview.net/forum?id=BZ5a1r-kVsf
AI can do pretty amazing things today
Deep Learning: Protecting Lives and the Environment
Transportation
Driving assistance / autonomous driving
On-line Safety / Security
Filtering harmful/hateful content
Filtering dangerous misinformation
Environmental monitoring
Medicine
Medical imaging
Diagnostic aid
Patient care
Drug discovery
Deep Learning Connects People to knowledge & to each other
Meta (FB, Instagram), Google, YouTube, Amazon are built around Deep Learning
Take Deep Learning out of them, and they crumble.
DL helps us deal with the information deluge
Search, retrieval, ranking, question-answering
Requires machines to understand content
Translation / transcription / accessibility
language ↔ language; text ↔ speech; image → text
People speak thousands of different languages
3 billion people can’t use technology today.
800 million are illiterate, 300 million are visually impaired
Deep Learning for On-Line Content Moderation
Filtering out objectionable content
What constitutes acceptable or objectionable content?
Meta doesn’t see itself as having the legitimacy to decide
But in the absence of regulations, it has to do it.
Types of objectionable content on Facebook
(with % taken down preemptively & prevalence, Q1 2022)
Hate Speech (95.6%, 0.02%), up from 30-40% in 2018
Violence incitement (98.1%, 0.03%), Violence (99.5%, 0.04%),
Bullying/Harassment (67%, 0.09%), Child endangerment (96.4%),
Suicide/Self-Injury (98.8%), Nudity (96.7%, 0.04%),
Taken down (Q1’22): Terrorism (16M), Fake accounts (1.5B), Spam (1.8B)
https://transparency.fb.com/data/community-standards-enforcement
Image understanding
FastMRI: 4x speed up for MRI acquisition (NYU Radiology + FAIR)
MRI images subsampled in k-space by 4x and 8x
U-Net architecture
4-fold acceleration
[Zbontar et al., arXiv:1811.08839]
[Figure: k-space sampling masks]
FastMRI (NYU Radiology+FAIR): 4x speed up for MRI acquisition
Radiologists could not tell the difference between clinical standard and 4x accelerated/restored images
They often preferred the accelerated/restored images
[Recht et al., American Journal of Roentgenology 2020]
Similar systems are now integrated into new MRI machines.
Why produce an image at all? [S. Chopra’s group, NYU]
Why not go directly from raw data to diagnosis / screening?
Humans need 2D image slices displayed on a monitor
DL systems can accept grossly undersampled (10-20x) or low-field raw data representing the entire volume.
They can be trained to directly produce a screening result
AI accelerates progress of biomedical sciences
Neuroscience
Neural nets as models of the brain
Models of vision, audition, & speech
understanding
Genomics
Identifying gene regulation networks
Curing genetic diseases?
Biology / biochemistry
Predicting protein structure and function
Designing proteins
Drug discovery
[DeepMind, AlphaFold]
AI accelerates the progress of physical sciences
Physics
Analyzing particle physics experiments
Accelerating complex simulations: fluids,
aerodynamics, atmosphere, oceans,….
Astrophysics: enabling universe-wide
simulations, classifying galaxies,
discovering exoplanets….
Chemistry
Finding new compounds
Material science
Predicting new material properties
Designing new meta-materials
[He 2019]
Open Catalyst Project: open competition
Want to solve climate change?
Discovering new materials to enable
large-scale energy storage
Efficient & scalable extraction of hydrogen
from water through electrolysis
Sponsored by FAIR & CMU
[Zitnick https://arxiv.org/abs/2010.09435]
Make-A-Scene: making art with the help of AI
1. Type a text description,
2. Draw a sketch
“A colorful sculpture of a cat”
Playing with Make-A-Scene
Drinking a glass of
Burgundy by the sea
painting of a physicist on a mountain path
watching the sunset, in the style of Van Gogh
Current ML Sucks!
Where are my self-driving car,
virtual assistant, domestic robot?
Requirements for Future ML/AI Systems
Understand the world, understand humans, have common sense
Level-5 autonomous cars
That learn to drive like humans, in about 20h of practice
Virtual assistants that can help us in our daily lives
Manage the information deluge (content filtering/selection)
Understands our intents, takes care of simple things
Real-time speech understanding & translation (cf. the movie "Her", 2013)
Overlays information in our AR glasses.
Domestic Robots
Takes care of all the chores
For this, we need machines with near-human-level AI
Machines that understand how the world works
Machine Learning sucks! (compared to humans and animals)
Supervised learning (SL) requires large numbers of labeled samples.
Reinforcement learning (RL) requires insane amounts of trials.
SL/RL-trained ML systems:
are specialized and brittle
make “stupid” mistakes
Machines don’t have common sense
Animals and humans:
Can learn new tasks very quickly.
Understand how the world works
Humans and animals have common sense
Machine Learning sucks! (plain ML/DL, at least)
Machine Learning systems (most of them anyway)
Have a constant number of computational steps between input and
output.
Do not reason.
Cannot plan.
Humans and some animals
Understand how the world works.
Can predict the consequences of their actions.
Can perform chains of reasoning with an unlimited number of steps.
Can plan complex tasks by decomposing them into sequences of subtasks
Three challenges for AI & Machine Learning
1. Learning representations and predictive models of the world
Supervised and reinforcement learning require too many samples/trials
Self-supervised learning / learning dependencies / to fill in the blanks
learning to represent the world in a non task-specific way
Learning predictive models for planning and control
2. Learning to reason, like Daniel Kahneman’s “System 2”
Beyond feed-forward, System 1 subconscious computation.
Making reasoning compatible with learning.
Reasoning and planning as energy minimization.
3. Learning to plan complex action sequences
Learning hierarchical representations of action plans
How do humans
and animals
learn so quickly?
Not supervised.
Not Reinforced.
How could machines learn like animals and humans?
How do babies learn how the world works?
How can teenagers learn to drive with 20h of practice?
[Chart courtesy of Emmanuel Dupoux: timeline of when infants acquire core concepts, over ages 0-14 months: perception (biological motion, face tracking), production (proto-imitation, emotional contagion, crawling, walking), physics (gravity, inertia, conservation of momentum, stability, support), objects (object permanence, solidity, rigidity, shape constancy, natural kind categories), actions (rational, goal-directed actions), social (helping vs. hindering, false perceptual beliefs), communication (pointing).]
How do Human and Animal Babies Learn?
How do they learn how the world works?
Largely by observation, with remarkably little interaction (initially).
They accumulate enormous amounts of background knowledge
About the structure of the world, like intuitive physics.
Perhaps common sense emerges from this knowledge?
Photos courtesy of Emmanuel Dupoux
Common sense is a collection of models of the world
Jitendra Malik
Architecture of Autonomous AI
Modular Architecture for Autonomous AI
Configurator: configures the other modules for the task
Perception: estimates the state of the world
World Model: predicts future world states
Cost: computes "discomfort" (in the diagram, an Intrinsic Cost plus a trainable Critic)
Actor: finds optimal action sequences
Short-Term Memory: stores state-cost episodes
[Diagram: the modules wired into a perception-action loop; percepts enter through Perception, actions come out of the Actor.]
Mode-2 Perception-Planning-Action Cycle
Akin to Model-Predictive Control (MPC) in optimal control.
Actor proposes an action sequence
World Model imagines predicted outcomes
Actor optimizes action sequence to minimize cost
e.g. using gradient descent, dynamic programming, MC tree search…
Actor sends first action(s) to effectors
[Diagram: world model unrolled over time: s[t+1] = Pred(s[t], a[t]); per-state costs C(s[0]) … C(s[T]); the Actor proposes the action sequence a[0] … a[T-1].]
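A minimal sketch of this Mode-2 loop, assuming a differentiable world model and cost so the action sequence can be optimized by gradient descent (the toy networks, sizes, and names below are illustrative assumptions, not code from the position paper):

```python
# Toy stand-ins for the World Model s[t+1] = Pred(s[t], a[t]) and the cost C(s).
import torch

state_dim, action_dim, horizon = 8, 2, 20
world_model = torch.nn.Sequential(
    torch.nn.Linear(state_dim + action_dim, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, state_dim))
cost = torch.nn.Linear(state_dim, 1)

def plan(s0, n_iters=100, lr=0.1):
    """Actor (Mode-2): optimize the action sequence a[0..T-1] by gradient
    descent so that the imagined trajectory has minimal total cost."""
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.SGD([actions], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        s, total_cost = s0, 0.0
        for t in range(horizon):
            s = world_model(torch.cat([s, actions[t]]))  # predicted next state
            total_cost = total_cost + cost(s).squeeze()  # accumulate C(s[t+1])
        total_cost.backward()
        opt.step()
    return actions.detach()

s0 = torch.zeros(state_dim)      # state of the world estimated by Perception
actions = plan(s0)
first_action = actions[0]        # only the first action(s) go to the effectors
```

In practice only the first action(s) are executed before re-planning from the newly perceived state, as in receding-horizon MPC.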
Training the World Model
with
Self-Supervised Learning
Capturing dependencies between inputs.
Representing uncertainty.
Self-Supervised Learning = Learning to Fill in the Blanks
Reconstruct the input or Predict missing parts of the input.
time or space →
This is a [...] of text extracted [...] a large set of [...] articles
Self-Supervised Learning = Learning to Fill in the Blanks
Reconstruct the input or Predict missing parts of the input.
time or space →
This is a piece of text extracted from a large set of news articles
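A toy sketch of this fill-in-the-blanks objective for text, in the spirit of masked-token prediction (all sizes and the random data below are placeholders, not the slide's model):

```python
# Mask ~15% of tokens and train a model to predict them from the visible context.
import torch

vocab_size, d_model, seq_len, MASK = 1000, 64, 12, 0

model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), 2),
    torch.nn.Linear(d_model, vocab_size))

tokens = torch.randint(1, vocab_size, (32, seq_len))   # a batch of token ids
mask = torch.rand(tokens.shape) < 0.15                  # choose the "blanks"
corrupted = tokens.masked_fill(mask, MASK)              # hide the chosen tokens

logits = model(corrupted)                               # (batch, seq, vocab)
loss = torch.nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()                                         # learn to fill in the blanks
```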
Two Uses for Self-Supervised Learning
1. Learning hierarchical representations of the world
SSL pre-training precedes a supervised or RL phase
2. Learning predictive (forward) models of the world
Learning models for Model-Predictive Control, policy learning for control, or model-based RL.
Question: how to represent uncertainty & multi-modality in the prediction?
Learning Paradigms: information content per sample
“Pure” Reinforcement Learning (cherry)
The machine predicts a scalar reward given once in a
while.
A few bits for some samples
Supervised Learning (icing)
The machine predicts a category or a few numbers
for each input
Predicting human-supplied data
10→10,000 bits per sample
Self-Supervised Learning (cake génoise)
The machine predicts any part of its input for any
observed part.
Predicts future frames in videos
Millions of bits per sample
The world is stochastic
Training a system to make a single
prediction makes it predict the
average of all plausible predictions
Blurry predictions!
[Diagram: a deterministic function ȳ = G(x) produces a single prediction, compared to the observed y through a divergence measure C(y, ȳ).]
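A tiny illustration of this point (the setup is an assumption, not from the slides): if the same x can be followed by two equally plausible outcomes, a single deterministic prediction trained with a squared-error divergence converges to their average, which is itself implausible:

```python
# MSE-optimal point prediction is the mean of the plausible outcomes.
import torch

y_samples = torch.tensor([-1.0, 1.0] * 500)   # two equally plausible outcomes for the same x
y_hat = torch.zeros(1, requires_grad=True)     # single deterministic prediction G(x)
opt = torch.optim.SGD([y_hat], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = ((y_samples - y_hat) ** 2).mean()   # C(y, y_hat) = squared error
    loss.backward()
    opt.step()
print(y_hat.item())  # ≈ 0.0: the mean of the two modes, not a plausible outcome itself
```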
The world is unpredictable. Output must be multimodal.
Training a system to make a single
prediction makes it predict the
average of all plausible predictions
Blurry predictions!
How do we represent uncertainty in the predictions?
The world is only partially predictable
How can a predictive model represent multiple predictions?
Probabilistic models are intractable in continuous domains.
Generative Models must predict every detail of the world
My solution: Joint-Embedding Predictive Architecture
Energy-Based Models
Capture dependencies through an energy function.
See "A tutorial on Energy-Based Learning" [LeCun et al. 2006]
Energy-Based Models: Implicit function
Gives low energy for compatible pairs of x and y
Gives higher energy for incompatible pairs
[Diagram: scalar energy function F(x,y); here x and y are sequences over time or space.]
Energy-Based Models
Feed-forward nets use a finite number of steps to produce a single output.
What if…
The problem requires a complex computation to produce its output? (complex inference)
There are multiple possible outputs for a single input? (e.g. predicting future video frames)
Inference through constraint satisfaction
Finding an output that satisfies constraints: e.g. a linguistically correct translation or speech transcription.
Maximum likelihood inference in graphical models
[Diagrams: a feed-forward architecture ȳ = G(x) with a divergence measure C(y, ȳ), vs. an energy function F(x,y) viewed as a set of constraints that y must satisfy.]
Energy-Based Models (EBM)
Energy function F(x,y), scalar-valued
Takes low values when y is compatible with x and higher values when y is less compatible with x
Inference: find values of y that make F(x,y) small.
There may be multiple solutions
Note: the energy is used for inference, not for learning
Example
[Figure: blue dots are data points in the (x, y) plane.]
Energy-Based Model: implicit function
Energy function that captures the x,y dependencies:
Low energy near the data points. Higher energy everywhere else.
If y is continuous, F should be smooth and differentiable, so we can use
gradient-based inference algorithms.
[Diagram: energy function F(x,y) over the (x, y) plane, low near the data points.]
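A minimal sketch of such gradient-based inference, assuming a smooth, differentiable energy (the small network standing in for F below is an illustrative assumption):

```python
# Inference in an EBM: find a y that makes F(x, y) small by gradient descent on y.
import torch

x_dim, y_dim = 4, 2
F = torch.nn.Sequential(torch.nn.Linear(x_dim + y_dim, 64), torch.nn.Softplus(),
                        torch.nn.Linear(64, 1))   # scalar energy F(x, y)

def infer(x, n_steps=100, lr=0.05):
    """Gradient-based inference: descend the energy with respect to y only."""
    y = torch.zeros(y_dim, requires_grad=True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        energy = F(torch.cat([x, y]))
        energy.backward()
        opt.step()
    return y.detach()

y_star = infer(torch.randn(x_dim))
```

Starting the descent from different initial y's can land in different low-energy solutions, which is how an EBM accommodates multiple compatible outputs.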
Energy-Based Model: unconditional version
Conditional EBM: F(x,y)
Unconditional EBM: F(y)
Measures the compatibility between the components of y
Useful if we don't know in advance which part of y is known and which part is unknown
[Figure: unconditional energy F(y) over the (y1, y2) plane; dark = low energy (good), bright = high energy (bad), purple = data manifold.]
Energy-Based Models vs Probabilistic Models
Probabilistic models are a special case of EBM
Energies are like un-normalized negative log probabilities
Why use EBM instead of probabilistic models?
EBM gives more flexibility in the choice of the scoring function.
More flexibility in the choice of objective function for learning
From energy to probability: Gibbs-Boltzmann distribution
Beta is a positive constant
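Written out, the Gibbs-Boltzmann relation referred to above (with β the positive constant mentioned on the slide):

```latex
P(y \mid x) \;=\; \frac{e^{-\beta F(x,y)}}{\int e^{-\beta F(x,y')}\, dy'}, \qquad \beta > 0
```

The normalizing integral over y' is what becomes intractable in many continuous domains, which is why working directly with energies gives more flexibility.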
Latent-Variable EBM
Latent variable z:
Captures the information in y that is not available in x
Computed by minimization, as written out below
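Spelling out "computed by minimization" in the notation of the EBM tutorial, with E the latent-variable energy (notation assumed):

```latex
\check{z} = \operatorname*{argmin}_{z} E(x, y, z), \qquad
F(x, y) = \min_{z} E(x, y, z) = E(x, y, \check{z})
```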
Latent-Variable Generative EBM Architecture
Latent variables:
Parameterize the set of predictions
Ideally, the latent variable represents independent explanatory factors of variation of the prediction.
The information capacity of the latent variable must be minimized.
Otherwise all the information for the prediction will go into it.
[Diagram: the observation x goes through Pred(x) to produce a representation h; the decoder Dec(z,h) combines h with the latent variable z to produce the prediction ȳ, compared to the desired prediction y through the cost C(y, ȳ).]
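A minimal sketch of this architecture, with the latent z obtained by minimization and a simple quadratic penalty standing in for the limitation on the information capacity of z (module shapes and the regularizer are illustrative assumptions):

```python
# Latent-variable generative EBM: h = Pred(x), ŷ = Dec(z, h), energy = C(y, ŷ),
# with z found by minimization and regularized to limit its capacity.
import torch

x_dim, y_dim, h_dim, z_dim = 8, 8, 16, 2

Pred = torch.nn.Linear(x_dim, h_dim)                 # encodes the observation x
Dec = torch.nn.Sequential(torch.nn.Linear(h_dim + z_dim, 32),
                          torch.nn.Tanh(), torch.nn.Linear(32, y_dim))

def energy(x, y, n_steps=50, lr=0.1, reg=0.1):
    """F(x, y) = min_z  C(y, Dec(z, Pred(x))) + reg * ||z||^2."""
    h = Pred(x).detach()                             # only z is optimized here
    z = torch.zeros(z_dim, requires_grad=True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        y_hat = Dec(torch.cat([h, z]))               # prediction ŷ
        e = ((y - y_hat) ** 2).mean() + reg * (z ** 2).sum()
        e.backward()
        opt.step()
    return e.detach()

print(energy(torch.randn(x_dim), torch.randn(y_dim)))
```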