Large Scale Deep Learning
Vincent Vanhoucke
Quick Introduction
Tech Lead on the Google Brain team.
Reach me at: vanhoucke@[Link]
Objectives
Give you a practical understanding of neural network training.
Emphasize what matters at scale, when models and data get large.
Dive into some of the more important classes of models.
Talk about some of the most exciting lines of research in the field.
Point out along the way the many dead-ends as well!
Lecture Agenda
Wednesday 2nd 13:40-15:10
Introduction to neural networks.
The fundamentals of model training.
Wednesday 2nd 15:30-17:00
Topics on model training: Regularization and Parallelism.
Notable models: Convolutional Networks, LSTMs, and Embeddings.
Thursday 3rd 10:30-12:00
Deep dive into applications.
Hot topics in deep learning research.
Session I
The promise (or wishful dream) of Deep Learning
Universal Machine Learning
Inputs: Speech, Text, Search Queries, Images, Videos, Labels, Entities, Words, Audio, Features
→ Simple, Reconfigurable, High Capacity, Trainable end-to-end Building Blocks →
Outputs: Speech, Text, Search Queries, Images, Videos, Labels, Entities, Words, Audio, Features
The promise (or wishful dream) of Deep Learning
Common representations across domains.
Replacing smarts with data.
Would merely be an interesting academic exercise…
…if it didn’t work so well!
Recent Kaggle ([Link]) ML Competitions
Plankton Identification
Molecular Activity Prediction
Galaxy classification
Higgs Particle Detection
Note which other techniques often win competitions:
● Gradient Boosting
● Random Forests
If your ML class doesn’t cover those (they rarely do), ask for a refund.
In Research and Industry
Speech Recognition
Speech Recognition with Deep Recurrent Neural Networks
Alex Graves, Abdel-rahman Mohamed, Geoffrey Hinton
Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks
Tara N. Sainath, Oriol Vinyals, Andrew Senior, Hasim Sak
Object Recognition and Detection
Going Deeper with Convolutions
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
Scalable Object Detection using Deep Neural Networks
Dumitru Erhan, Christian Szegedy, Alexander Toshev, Dragomir Anguelov
In Research and Industry
Machine Translation
Sequence to Sequence Learning with Neural Networks
Ilya Sutskever, Oriol Vinyals, Quoc V. Le
Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio
Parsing
Grammar as a Foreign Language
Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton
Language Modeling
One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, Tony Robinson
Neural Networks … without the Neuro-Babble
Imagine you want to build a tractable, highly non-linear, parametric
function X → P → Y that you can use as a predictor.
How To Build a Tractable, Non-Linear, Parametric Function
Step 1: Start with a linear function: Y = AX
Linear functions are nice!
Very efficient to compute (BLAS). GPUs were designed for them!
Very stable numerically.
Well behaved and numerically efficient derivatives.
Lots of free parameters: O(N²) for N inputs.
If you do optimization (and your name is not Steve Boyd),
you really want to optimize linear functions.
How To Build a Tractable, Non-Linear, Parametric Function
Step 2: Add a non-linearity.
Meet the Rectified Linear Unit (ReLU)
The simplest non-linearity possible:
max(0, X)
Well behaved derivative: X > 0 ? 1 : 0
How To Build a Tractable, Non-Linear, Parametric Function
Step 3: Repeat!
X → A1 → A2 → A3 → A4 → Y
Very efficient representation:
Parameters are all in linear functions, yet the stack is very non-linear.
Empirically, deeper models require many fewer parameters than
‘shallow’ models of the same representational power.
Where’s my neuron?
‘This is how the brain works’ - G. Hinton
You can paint a neuro-inspired picture of neural networks / deep
learning, but you don’t have to, and it comes with decades of ‘baggage’.
You can just think in terms of parametrizing a large non-linear function in
a way that makes sense computationally, and ignore the neuro-talk.
Your pick!
Orders of Magnitude
State of the Art Object Recognition Models:
image → A1 → A2 → A3 → ‘kitten’
15-25 layers
10-200M parameters
1-5B multiply-adds / image
Training Models
1. The Maths
2. The Stats
3. The Hacks
4. The Computer Science
The Maths
A Neural Network in Equations
y = nn(x, w)
Inputs x: images, spectrograms, features, words, …
Outputs y: labels, predictions (cat / no cat, phonemes, next word, …)
Weights w: the parameters of the network.
Training data: training samples x paired with their targets y̅, the true (correct) labels.
Objective: y=nn(x,w) ≈ y̅
The Loss: ‘How Close are We?’
y=nn(x,w)
Sum over all the training data: ∑ L(y̅, y)
L2 loss: ∑ |y̅ - y|²
Cross-entropy loss: -∑ y̅ log y
Training == Minimizing The Loss w.r.t. Weights
argmin_w L(w) = L(y̅, nn(x, w))
Minimize using Gradient Descent:
w’ = w - α ∂wL(w)
where α is the learning rate and ∂wL(w) is the gradient (derivative) of the loss.
The Loss is a Very Complicated Function of the Weights
L(y̅, nn(x,w))
1. nn() is a very non-linear, non-convex function!
2. Depends on all the Inputs and Targets in the training set!
Two main tricks to simplify the problem:
1. Back-Propagation
2. Stochastic Gradient Descent
Back-Propagation: Factoring of the Neural Net
nn() = nn1(nn0())
x → nn0(x, w0) → h → nn1(h, w1) → y → L(y, w0, w1)
h are the hidden activations (hidden states).
Remember the ‘Chain Rule’ from High School:
g(f(x))’=g’(f(x)).f’(x)
In partial-derivative notation:
∂(g∘f)/∂x = (∂g/∂f)·(∂f/∂x)
Chaining:
∂(j∘i∘h∘g∘f)/∂x = (∂j/∂i)·(∂i/∂h)·(∂h/∂g)·(∂g/∂f)·(∂f/∂x)
Graphical View of the Chain Rule
x → f → f(x) → g → g∘f(x)
Each box contributes its local derivative: ∂f(x) and ∂g(f(x)).
∂(g∘f)/∂x = ∂f(x)·∂g(f(x))
Back-Propagation using the Chain Rule
Forward: x → f → f(x)
Backward: the box f receives ∂x from up the chain and sends back ∂f(x)·∂x.
You can compute the gradient with respect to any quantity by:
● Taking the gradient ∂x sent back to you from up the chain.
● Multiplying it by your local gradient ∂f(x) with respect to that quantity.
Back-Propagation using the Chain Rule
Forward: x → f → y = f(x) → g → g(y) → L()
Backward: each box multiplies the gradient it receives from up the chain by its own local gradient (∂g(y) for g, ∂f(x) for f) and passes the product along, starting from ∂L().
More Back-Propagation ‘Magic’
Computation along a directed graph can be shared:
[Diagram: h and z both feed into y, which feeds the loss L(y); the gradient ∂y is computed once and shared.]
Sharing Gradient Computation via Back-Propagation
1- Compute gradients with respect to the inputs.
2- Compute gradients with respect to the weights.
x → h = nn0(x, w0) → y = nn1(h, w1) → L(y), with weights w0 and w1.
Example: 1-layer Neural Network
Y=max(W.X+B, 0)
X → (W·_) → H0 → (_ + B) → H1 → max(_, 0) → Y
X and Y are matrices of dimension: # nodes × # examples.
Example: 1-layer Neural Network
Y=max(W.X+B, 0)
X → (W·_) → H0 → (_ + B) → H1 → max(_, 0) → Y
Compute the Gradients analytically:
∂H0/∂X = W    ∂H1/∂H0 = 1    ∂Y/∂H1 = (_ > 0 ? 1 : 0)
Example: 1-layer Neural Network
Y=max(W.X+B, 0)
X → (W·_) → H0 → (_ + B) → H1 → max(_, 0) → Y
Local gradients: W, 1, (_ > 0 ? 1 : 0)
← Back-Propagate the Gradients: ∂X = (∂H0/∂X)·∂H0, etc. ←
∂X = W⊤·∂H0    ∂H0 = ∂H1    ∂H1 = (H1 > 0 ? ∂Y : 0)
Example: 1-layer Neural Network
X → (W·_) → H0 → (_ + B) → H1 → max(_, 0) → Y
Compute the parameter Gradients:
H0 = W·X → ∂W = ∂H0·X⊤, update W ← W - α ∂W
H1 = H0 + B → ∂B = ∂H1·1, update B ← B - α ∂B
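A minimal NumPy sketch of this forward / backward pass, not from the original slides: shapes follow the convention above (columns = examples), and alpha is a placeholder learning rate.

```python
import numpy as np

def forward_backward(W, B, X, dY, alpha=0.01):
    """One layer Y = max(W.X + B, 0): forward pass, then back-propagate dY."""
    # Forward pass: store the intermediate activations.
    H0 = W @ X             # W._
    H1 = H0 + B            # _+B (B broadcasts across the example columns)
    Y = np.maximum(H1, 0)  # max(_, 0)

    # Backward pass: multiply the upstream gradient by each local gradient.
    dH1 = np.where(H1 > 0, dY, 0.0)      # ReLU gate: H1 > 0 ? dY : 0
    dH0 = dH1                            # the +B step passes gradients through
    dX = W.T @ dH0                       # gradient w.r.t. the input
    dW = dH0 @ X.T                       # gradient w.r.t. the weights
    dB = dH1.sum(axis=1, keepdims=True)  # gradient w.r.t. the bias

    # Gradient descent step on the parameters.
    W -= alpha * dW
    B -= alpha * dB
    return Y, dX
```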
What Happens at The Top of the Chain?
At the output Y: L(y̅, y) = |y̅ - y|²
Gradient sent back: ∂Y = ∂L(y̅, y) = -2(y̅ - y)
The Loss is a Very Complicated Function of the Weights
L(y̅, nn(x,w))
1. nn() is a very non-linear, non-convex function!
2. Depends on all the Inputs and Targets in the training set!
Two main tricks to simplify the problem:
1. Back-Propagation
2. Stochastic Gradient Descent
The Stats
Stochastic Gradient Descent
Loss: ∑L(y̅, y)
● Instead of computing the true loss on all the data, we compute an
estimate on a very small subset of the data (1 - 1024 examples).
● Terrible estimate. But we can afford to do it lots of times.
● And we have tricks to smooth it out.
● If efficiency were linear in batch size, we would use batch = 1.
Stochastic Gradient Descent (SGD) Summarized
For batch in training set:
For layer in network:
Compute (and store) output Activation H Forward pass
Compute Loss and Loss Gradient L, ∂Y
For layer in network backwards:
Compute Gradient w.r.t. input Activation ∂H Backward pass
Compute Gradient w.r.t. Weights ∂W
Update Weights: W ← W - α ∂W
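A self-contained sketch of that loop in NumPy, reduced to a single linear layer with an L2 loss and random stand-in data; a real deep network would run the full per-layer forward and backward passes shown in the pseudocode above.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 8))   # 100 examples, 8 features (stand-in data)
Y_train = rng.normal(size=(100, 1))   # regression targets (stand-in data)
W = rng.normal(scale=0.1, size=(8, 1))
alpha, batch_size = 0.01, 16

for epoch in range(10):
    for i in range(0, len(X_train), batch_size):
        x, y_true = X_train[i:i+batch_size], Y_train[i:i+batch_size]
        y = x @ W                 # forward pass
        dY = -2.0 * (y_true - y)  # gradient of the L2 loss |y_true - y|^2
        dW = x.T @ dY / len(x)    # backward pass: gradient w.r.t. the weights
        W -= alpha * dW           # update the weights
```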
The Hacks
Stochastic Gradient Descent is a Terrible Optimizer
SGD is very noisy, and turns the gradient
descent into a random walk over the loss.
But it’s very cheap, and cheap is all we can
afford at scale.
The hacks that follow all have one objective:
reducing the noise in the gradient estimates
while remaining very cheap to compute.
Primer on Getting Stochastic Gradient Descent to Work
Two strategies
1. Momentum + learning rate decay:
Works best if you manage to get it to run.
2. AdaGrad:
Works more often, but doesn’t always get you the best result.
Two more tricks:
1. Parameter averaging.
2. Gradient clipping.
Momentum
g’ = μ g + ∂wL(w)
w’ = w - α g’
μ=0.9
Learning Rate Decay
Think of the loss function as a fractal landscape:
Some structures / dimensions have widely differing scales.
Annealing the learning rate allows you to explore all scales:
α = α0 e^(-βt)
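A minimal sketch of one update combining momentum with exponential learning rate decay; the hyperparameter values are placeholders, not recommendations from the slides.

```python
import numpy as np

def momentum_step(w, grad, velocity, t, alpha0=0.1, beta=1e-4, mu=0.9):
    """One SGD step with momentum and learning rate annealing."""
    alpha = alpha0 * np.exp(-beta * t)  # alpha = alpha0 * e^(-beta*t)
    velocity = mu * velocity + grad     # g' = mu*g + dL/dw
    w = w - alpha * velocity            # w' = w - alpha*g'
    return w, velocity
```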
AdaGrad
AdaGrad is a method for applying learning rate decay
adaptively, per-parameter.
Very useful when one wants to do little hyperparameter tuning.
Adaptive subgradient methods for online learning
and stochastic optimization.
John Duchi, Elad Hazan, and Yoram Singer. JMLR 2011
AdaGrad in Equations
Keep a history of the norm of the gradients:
n ← n + (∂wL)²
Use it to discount the learning rate:
w ← w - (α/√n) ∂wL
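An illustrative AdaGrad step in NumPy; the small eps term added to avoid dividing by zero is a common implementation detail not shown on the slide.

```python
import numpy as np

def adagrad_step(w, grad, n, alpha=0.1, eps=1e-8):
    """Accumulate squared gradients and discount the learning rate per parameter."""
    n = n + grad ** 2                          # n <- n + (dL/dw)^2, element-wise
    w = w - alpha * grad / (np.sqrt(n) + eps)  # w <- w - (alpha/sqrt(n)) dL/dw
    return w, n
```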
Parameter Averaging
Keep a running average of your parameters over time.
Only use the averaged parameters at test time, not for training!
Gradient Clipping
Threshold gradient norms to protect against wall bouncing.
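A simple norm-based clipping sketch; the max_norm threshold is a hypothetical hyperparameter.

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Rescale the gradient if its norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```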
Weight Initialization
You want to keep your activations in a stable numerical regime: O(1.0).
X → A1 → A2 → A3 → A4 → Y
|Y| ~ |A4||A3||A2||A1||X|
Initialize weights using N(0, σ) such that:
output activations ~ input activations.
Initialize biases to be positive: start in the linear regime of the ReLU.
Weight Initialization just before the Loss
Your loss is typically very dependent on the scale of the top activations:
… → A5 → L = -∑ y̅ log y
Norm |Y| of the activations ⇔ peakiness (‘temperature’) of the probability distribution.
Peakiness ⇔ magnitude of the gradients that are sent back:
big peaks == big errors == big gradients.
Weight Initialization just before the Loss
High Temperature:
soft distribution, classifier not certain, small gradients.
Low Temperature:
peaky distribution, classifier very (over?)confident, big gradients.
Key: Start with a very high temperature, small weights in your last layer.
They will anneal to a peakier distribution as the classifier gets more confident.
First, lower your learning rate
Faster training ≠ Better training
Detour: Second Order (Quasi-Newton) Methods.
● A way to dramatically improve gradient descent efficiency per step:
W’ = W - α H⁻¹ ∂W
● H is the Hessian (basically the second derivative) of the Loss.
● H⁻¹ requires very large batches, O(training set), to be estimated well.
● Huge literature on approximate 2nd order methods: L-BFGS,
Conjugate Gradient, Hessian-free approximations.
● I have never seen them work better than SGD in practice.
The Computer Science
Parallelizing Stochastic Gradient Descent
For batch in training set:
For layer in network:
Compute (and store) output Activation H Forward pass
Compute Loss and Loss Gradient L, ∂Y
For layer in network backwards:
Compute Gradient w.r.t. input Activation ∂H Backward pass
Compute Gradient w.r.t. Weights ∂W
Update Weights: W ← W - α ∂W
Serial Algorithm
Oh Noes!
For batch in training set:
For layer in network:
Compute (and store) output Activation H
Compute Loss and Loss Gradient L, ∂Y
For layer in network backwards:
Compute Gradient w.r.t. input Activation ∂H
Compute Gradient w.r.t. Weights ∂W
Update Weights: W ← W - α ∂W
Lots of parameters in contention!
Multiple Levels of Parallelism
Distribute the model, keep the data local: Model Parallelism
Distribute the data, keep the model local: Data Parallelism
Model Parallelism
[Diagram: layer L2 split into L2a and L2b between layers L1 and L3.]
Cut up layers, distribute them onto multiple cores / devices / machines.
Each cut adds several edges to your graph!
Unless you have shared memory, this means a lot more memory and data transfer.
Model Parallelism
On a single core: Instruction parallelism (SIMD, SIMT). Pretty much free.
Across cores: thread parallelism. Almost free, unless across sockets, in
which case inter-socket bandwidth matters (QPI on Intel).
Across devices: for GPUs, often limited by PCIe bandwidth.
Across machines: limited by network bandwidth / latency.
Model Parallelism
Two key ideas when sizing a distributed system:
1- Data reuse: compute is limited by how much data can fit at any time on
the lowest level cache (e.g. L1 cache on CPU). Try to maximally reuse the
data in cache, or get more cache (i.e. more machines!).
2- Overlap computation and data transfer: in most systems compute and
data transfer can happen completely in parallel. Hide the increase in data
transfer latency by overlapping computation with it.
Synchronous Data Parallelism
Run K batches in parallel and aggregate.
It’s by far the most popular way to parallelize SGD today.
Limits:
● Per-example efficiency of gradient descent diminishes as the batch
size increases.
● Cutting a batch smaller yields diminishing returns as matrix
multiplies become less efficient.
● Cost of synchronization grows with K: need to wait for stragglers.
Asynchronous Data Parallelism - Pipelining
[Diagram: batches X1, X2, X3 flow through layers A1, A2, A3 in a pipeline; the horizontal axis shows time steps.]
Asynchronous Data Parallelism - Pipelining
[Same diagram: each layer A1, A2, A3 runs on a different device (machine or GPU).]
Asynchronous Data Parallelism - Pipelining
Pipelining changes the gradient updates:
Wt+1 ← Wt - α ∂Wt-k
where k is the depth of the pipeline (# of layers or less).
Stale gradients from k steps ago are less efficient per step.
Often means the learning rate needs to be reduced.
Limited by depth of pipeline and balancing compute between layers.
Fully Asynchronous Data Parallelism
Run N training loops in parallel.
Share the weights between training loops.
Wt+1 ← Wt - α ∂Wt-k
k is now O(N), potentially very large.
k is effectively unbounded if one training loop is slower than the others.
Equivalent to running N batches in parallel, but forgetting to wait for
the workers to be done to aggregate the partial sums.
Fully Asynchronous Data Parallelism
Has some nice theoretical guarantees:
Hogwild! A Lock-Free Approach to Parallelizing
Stochastic Gradient Descent. Recht et al, NIPS’11
Works well up to ~50 replicas of the model.
Strongly diminishing returns per replica.
Great if you want speed and don’t mind spending resources to get it.
Requires some care in implementation.
Data and Model Parallelism Tradeoffs
Model Parallelism means you need to exchange activations
between workers:
O(batch size × # network edges) values sent at every step.
Data Parallelism means you need to exchange parameters
between workers:
O(# weights) values sent at every step.
DistBelief
Joint Data Parallelism
and Model Parallelism.
Large Scale Distributed
Deep Networks
Jeff Dean et al., NIPS’12
Next up!
We’ll talk about the thorny problem of regularization.
And discuss the various models you might encounter in the wild.
See you soon!
Session II
Regularization
Regularization
There is an optimal size for any machine learning system:
● Too small (Underfitting): not enough parameters to express the
complexity of the data.
● Too big (Overfitting): too many degrees of freedom. The model will
attempt to explain every little detail of the training data and will fail to
generalize.
Regularization
Problem: training a model that is just the right size is impossible:
● No idea a priori what the ‘right size’ is.
● Training a model that just fits is very hard from an optimization
standpoint:
‘fitting into skinny jeans’ problem.
Solution: train a model that is way too big, but nudge the parameters
towards a more parsimonious representation.
Regularization Techniques
L2 Regularization, a.k.a. Weight Decay:
● Penalize large weights.
● Achieved by adding a new term to the Loss:
L(y̅, y) + ε|w|²
● ε is a new hyperparameter (global or per-layer).
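In practice the penalty simply adds 2εw to the weight gradient; a one-line sketch (the ε value is a placeholder):

```python
def l2_regularized_gradient(w, grad, epsilon=1e-4):
    """Gradient of L + epsilon*|w|^2 is the loss gradient plus 2*epsilon*w."""
    return grad + 2.0 * epsilon * w
```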
Regularization Techniques: Dropout
● Set 50%+ of the activations to zero randomly.
● Force other parts of the network to learn redundant representations.
● Sounds crazy, but works great.
● Complementary to L2.
Regularization Techniques: Dropout
● At test time, don’t drop anything, but multiply the activations by ½.
● Can be combined to great effect with the max() non-linearity.
See Maxout Networks, Goodfellow et al.
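A minimal dropout sketch following the slide’s convention (drop at training time, multiply by ½ at test time); many modern implementations instead use ‘inverted dropout’ and rescale at training time.

```python
import numpy as np

def dropout(h, p_drop=0.5, train=True, rng=None):
    """Randomly zero activations during training; rescale activations at test time."""
    rng = np.random.default_rng() if rng is None else rng
    if train:
        mask = rng.random(h.shape) >= p_drop  # keep each activation with prob 1 - p_drop
        return h * mask
    return h * (1.0 - p_drop)  # test time: keep everything, scale by the keep probability
```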
Prototypical Models
DNNs, CNNs, RNNs, Embeddings, Generative Models
The ‘Simple’ Deep Neural Net
Try Logistic Regression, Random Forests or Gradient Boosting first!
Debug your data / problem setup on simple models, a small dataset,
then scale up.
For a new problem, in the absence of any particular structure, a 2-3
layer, 128-1024 nodes / layer model is a reasonable starting point:
X → A1 → A2 → A3 → Y
Models for Text
Embeddings
Handling discrete, categorical input, in particular words!
Usual representation: 1-hot encoding of the position in the dictionary
[ 0 0 0 0 0 1 0 0 0 0 0 0 0]
Problem: mapping that vector to a dense layer in a neural network
means a very large matrix whose columns (often called embeddings)
are only exercised (hence trained) when that word is seen.
Rare words can mean very poor embeddings.
What Do We Want in a Good Embedding?
Similar Words Occur in Similar Contexts
Instead of training embeddings on the supervised task at hand, train them first to
represent semantic similarity using unsupervised training on a large text corpus.
A couple of common approaches:
- Continuous skip-gram (word2vec): predict surrounding words given a word.
- Continuous Bag of Words (CBOW): predict a word given surrounding words.
Efficient Estimation of Word Representations in Vector Space, ICLR’13.
Distributed Representations of Words and Phrases and their Compositionality, NIPS’13.
Example of word2vec and CBOW
word2vec (skip-gram): use Vjump to predict the surrounding words in “The quick brown fox jumps over the lazy dog”.
CBOW: use the surrounding words in “The quick brown fox jumps over the lazy dog” to predict Vjump.
Word Embeddings
Training a very simple model on lots of text mitigates the rare word problem.
The spaces learned have very good syntactic and semantic clustering.
They also have interesting local ‘algebraic structure’:
Vking - Vman + Vwoman ~ Vqueen
Interesting applications to zero-shot learning.
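A toy sketch of the analogy lookup via nearest-neighbor search by cosine similarity. The embedding values below are random stand-ins: only a trained embedding space would actually place the result near Vqueen, and real evaluations usually exclude the query words themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy embedding table: word -> 50-dimensional vector (random stand-ins).
emb = dict(zip(["king", "man", "woman", "queen"], rng.normal(size=(4, 50))))

def nearest_word(query, embeddings):
    """Return the word whose embedding is closest to `query` in cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(embeddings, key=lambda w: cos(query, embeddings[w]))

query = emb["king"] - emb["man"] + emb["woman"]  # Vking - Vman + Vwoman
print(nearest_word(query, emb))                  # ~ "queen" with real learned embeddings
```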
Sentence, Query, Paragraph, Document Embeddings
What if your ‘dictionary’ is extremely large or infinite?
Finding good embeddings for large bodies of text is a very active area of
research. (rel: topic modeling, paraphrasing, document understanding)
Some simple approaches:
- For short sequences, pooling over word embeddings is the first
recourse.
- For long sequences, use your favorite brand of topic model, or run a
recurrent model over the sequence and use a hidden layer of the
network as an embedding.
Models for Perception
Common Statistical Invariants
Across Time Across Space
Expressing Invariants: Weight Tying
Recurrent Neural Network: the same weights W are applied at every time step: y = nn(t0, W), nn(t1, W), …
Convolutional Network: the same weights W are applied at every spatial position: y = nn(x0, W), nn(x1, W), …
Good news: Backprop ‘just works’: simply add up all the gradients.
Models for Images
Convolutional Networks
Spatially tied deep neural networks.
State-of-the-art in visual recognition
and detection / localization tasks.
One new challenge: images are large
and highly redundant.
Need to introduce new types of
nonlinearities which aggregate /
decimate their inputs.
Convolutions
Lots of jargon: matrix multiplies applied over patches as a sliding window, producing feature maps of a certain depth.
Express spatial invariance by sharing weights across spatial dimensions, but not across depth.
Lots of implementation details related to stride and padding.
[Diagram: a patch of the input maps to one position of a feature map; the number of feature maps is the depth.]
Non-linearities
Convolutional networks use the same types of pointwise non-linearities (ReLU).
In addition, spatial pooling is often used to downsample the feature maps:
- max
- average
- L2
Convolutional Networks
Stacking convolutions and pooling in a way that reduces the spatial
extent while increasing the ‘depth’ of the representation proportionally
is a good strategy to build a good convolutional network:
256x256 RGB → 128x128x16 → 64x64x64 → 32x32x256 …
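A rough NumPy sketch of one convolution + ReLU + max-pooling stage, showing the spatial extent shrinking while the depth grows. The loops are written naively for clarity; real implementations reduce everything to highly optimized matrix multiplies.

```python
import numpy as np

def conv2d(x, w, stride=1):
    """Valid convolution: x is (H, W, C_in), w is (k, k, C_in, C_out)."""
    k, _, _, c_out = w.shape
    h_out = (x.shape[0] - k) // stride + 1
    w_out = (x.shape[1] - k) // stride + 1
    y = np.zeros((h_out, w_out, c_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i*stride:i*stride+k, j*stride:j*stride+k, :]
            # Each output position is a matrix multiply over the flattened patch.
            y[i, j, :] = patch.reshape(-1) @ w.reshape(-1, c_out)
    return y

def max_pool(x, size=2):
    """Non-overlapping max pooling over the spatial dimensions."""
    h, w, c = x.shape
    x = x[:h - h % size, :w - w % size, :]
    return x.reshape(h // size, size, w // size, size, c).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))      # small RGB input (stand-in)
filters = rng.normal(size=(3, 3, 3, 16))  # 3x3 patches, depth 3 -> 16
feature_map = max_pool(np.maximum(conv2d(image, filters), 0))
print(feature_map.shape)  # (15, 15, 16): smaller spatial extent, larger depth
```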
Convolutional Networks
Example of AlexNet, winning ImageNet challenge entry in 2012:
Models for Time Series
Recurrent Neural Networks
Compact view: Xt → Neural Network → Yt, with recurrent connections (trainable weights) feeding the state back, t ← t+1.
Unrolled view: X1, X2, X3 → Y1, Y2, Y3, with the weights tied across time steps.
Recurrent Neural Networks
Can be implemented via explicit unrolling or dynamically by keeping
state across invocations, or a combination of both.
Unrolling is conceptually simpler, but imposes a fixed sequence length.
RNNs only have one problem: they mostly don’t work!
Very difficult to train for more than a few timesteps: numerically
unstable gradients (vanishing / exploding).
Thankfully, LSTMs...
LSTMs: Long Short-Term Memory Networks
Took a long time to be recognized as ‘RNNs done right’:
● Terrible name :)
● Look like a horribly over-engineered solution to the problem.
But:
● Very effective at modeling long-term dependencies.
● Very sound theoretical and practical justifications.
● A central inspiration behind lots of recent work on using deep
learning to learn complex programs:
Memory Networks, Neural Turing Machines.
A Simple Model of Memory
Instructions: WRITE X, M; READ M, Y; FORGET M
[Diagram: an input X is written to memory M and read out to output Y, each operation controlled by a WRITE?, READ? or FORGET? decision.]
Key Idea: Make Your Program Differentiable
Replace the discrete WRITE? / READ? / FORGET? decisions with continuous sigmoid gates W, R, F between input X, memory M and output Y.
LSTM Cells as replacement for Recurrent Connections
Recurrent connections in a RNN can be replaced by a set of LSTM cells that
map inputs X, R, W, F to output Y.
R, W, and F are ‘control’ connections that affect the state of the memory
through a sigmoidal [0, 1] multiplicative gate.
Gating behavior makes it possible for the memory cell to retain information
longer and discard it quickly, while keeping the whole machine continuous
and differentiable.
This translates into much better stability in training and modeling of much
longer-range interactions compared to an RNN.
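A simplified sketch of such a gated memory cell, in the spirit of the X / R / W / F picture above. This is not the exact LSTM formulation (a real LSTM also feeds the previous output back into the gates), and all parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_cell_step(x, m, p):
    """One step of a gated memory cell: sigmoid gates decide how much to write,
    forget and read, keeping the whole machine continuous and differentiable."""
    write_gate  = sigmoid(p["Ww"] @ x + p["bw"])   # W: how much of the input to store
    forget_gate = sigmoid(p["Wf"] @ x + p["bf"])   # F: how much of the memory to keep
    read_gate   = sigmoid(p["Wr"] @ x + p["br"])   # R: how much of the memory to expose
    m = forget_gate * m + write_gate * np.tanh(p["Wx"] @ x)  # update the memory state
    y = read_gate * np.tanh(m)                     # gated read-out
    return y, m

rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.1, size=(4, 3)) for k in ("Ww", "Wf", "Wr", "Wx")}
p.update({k: np.zeros(4) for k in ("bw", "bf", "br")})
y, m = memory_cell_step(rng.normal(size=3), np.zeros(4), p)
```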
Unsupervised Learning
Generative Models and Unsupervised Learning
Amount of unlabeled data >> Amount of labeled data
Unsupervised / generative learning was once hoped
to be a central appeal of deep learning.
Deep models learned to detect cats in YouTube videos
without supervision! Surely they can learn anything?
Building High-level Features Using Large Scale Unsupervised Learning
Quoc V. Le et al., ICML’12
Unsupervised Learning
Likely the biggest disappointment in deep learning so far :(
Only real success is language models and word
embeddings, although these leverage context as a
supervised signal.
For any large task, even modest amounts of supervised data
typically outperform unsupervised models.
Whither Unsupervised Learning?
Two trends to blame:
- Dropout made it possible to learn much bigger models
without overfitting.
- Transfer Learning works amazingly well in practice:
It is often better to initialize your model from a supervised
model trained on a different task, than to use unlabeled
data matching your task.
Unsupervised Learning: New Approaches!
Good news: research on the topic has picked up recently.
General themes:
Variational Auto-Encoders
Adversarial Learning
It remains to be seen whether they can scale.
Generative Models
General idea: the data is the label: y̅ = x
Problem: there are many degenerate or trivial ways to map X to X!
What would an ‘interesting’ mapping look like?
Generative Models
X → Z (latent variable) → X
An interesting mapping could be:
1- One that compresses the data very well: Z << X
2- One that causes semantically similar X to have nearby Z.
3- One that has a very simple distribution (Gaussian, Binomial)
Makes it possible to generate sample X’s from Z’s.
Making Generative Models Non-Trivial
X → Z → X
Make it hard for the model to do its job, by introducing
bottlenecks, regularizers, noise, stochasticity or
adversarial training.
Making Generative Models Non-Trivial
Bottlenecks:
X → Z → X
Very common idea that’s been explored at length (and
reinvented) in many fields: e.g. SVD, PCA, LSA, LDA.
Making Generative Models Non-Trivial
Regularizers: Sparse Autoencoders
X → Z → X, with an L1 penalty on Z.
Noise: Denoising Autoencoders
The ‘ancestor’ of dropout.
X+N → Z → X
Add noise to the input, force the autoencoder
to reconstruct the clean signal.
Dropout is one such noise source.
Stochasticity and Generative Models
X → Z → S → X
Consider Z as the parameters of a Gaussian. Sample S from it.
Very active area of research, spurred by the concept of
Variational Auto-Encoders
Auto-encoding Variational Bayes. D.P. Kingma & M. Welling, ICLR’14.
Adversarial Training
X → Z → X
Train a network to try and distinguish between the real and
generated X. Pit it against the “generator”, and make them compete!
Generative Adversarial Networks, Goodfellow et al, NIPS’14
Detour: Undirected Models
X ↔ Z
Model p(X, Z) instead of p(Z|X) and p(X|Z)
Boltzmann Machines and Deep Belief Networks.
Losing popularity. Very hard to train.
Batch Normalization
Better and Faster Way to Train Convolutional Networks
Batch Normalization: Accelerating Deep Network Training
by Reducing Internal Covariate Shift
Sergey Ioffe, Christian Szegedy, ICML’15
10x speedup in training, 2% improvement in performance on ImageNet!
Beautifully simple method attacking the core of what makes deep
networks difficult to train.
The Covariate Shift Problem
SGD is not scale-free: most efficient on whitened data.
This is true for the inputs, but also for every layer up the stack: you want activations with mean 0 and variance 1.
Problem: the distribution of activations changes over time!
The Covariate Shift Problem
SGD needs to do two things for each layer:
1) Update its parameters to improve the objective.
2) Track the distributions of its inputs.
Can we eliminate or at least control 2)?
Solutions
Idea #1: Whiten the activations at each layer.
Problem: very expensive, high-dimensional covariance matrix.
Idea #2: Ok, let’s just subtract the mean, and divide by the variance.
Problem: leads to degenerate gradients!
Idea #3: Let’s use a noisy, local estimate of the mean and variance,
e.g. one computed per mini-batch.
Problem: still strictly less powerful representationally: all filters in the layer
are constrained to the same dynamic range.
Solutions
Idea #4: Add a learned affine transform per activation to rescale the inputs.
Doesn’t that defeat the purpose? No! Tightly bounds the rate of change of the
input distribution: a few linear weights instead of many, many nonlinear factors.
Problem: What happens at test time, when there is no such thing as a mini-batch
to normalize over?
Idea #5: Replace the mini-batch mean and variance by the global mean and
variance over the training set, at test time only.
Problem: That sounds really crazy…
Batch Normalization
Before: x
After: α (x - μ)/σ + β
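A compact sketch of the transform, using a running average as a stand-in for the global training-set statistics used at test time; the momentum value and eps are placeholders.

```python
import numpy as np

def batch_norm(x, alpha, beta, run_mean, run_var, train=True, eps=1e-5):
    """Normalize per feature over the mini-batch, then apply the learned affine
    transform alpha*x_hat + beta. x has shape (batch, features)."""
    if train:
        mu, var = x.mean(axis=0), x.var(axis=0)
        run_mean = 0.99 * run_mean + 0.01 * mu   # track statistics for test time
        run_var = 0.99 * run_var + 0.01 * var
    else:
        mu, var = run_mean, run_var              # test time: use the global estimates
    x_hat = (x - mu) / np.sqrt(var + eps)
    return alpha * x_hat + beta, run_mean, run_var
```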
Results
Batch Normalization: Accelerating Deep Network Training
by Reducing Internal Covariate Shift - Sergey Ioffe, Christian Szegedy
Next up!
Object Recognition
Speech Recognition
Machine Translation
Multimodal Learning
Neural Turing Machines
Reinforcement Learning
Robots!
[Example caption: “A close up of a child holding a stuffed animal”]
Art!
Session III
Hot Topics In Deep Learning
Speech Recognition
Object Recognition
Machine Translation
Image Captioning
Memory and Computation
Hot Topics in Deep Learning: Speech Recognition
Lead the Deep Learning revival by a few years:
Deep Belief Networks for phone recognition
Abdel-rahman Mohamed, George Dahl, and Geoffrey Hinton, NIPS’09
Very large improvements in acoustic modeling performance.
A turning point: speech recognition went from “it mostly doesn’t work” to
“it mostly works” in the public’s perception.
In The Beginning
Model speech frame-by-frame,
independently.
Fully-connected networks.
Deep Neural Networks for
Acoustic Modeling in Speech
Recognition
Hinton et al. IEEE Signal
Processing Magazine, 2012
Speech is very structured.
[Spectrogram of “I owe you”]
Vertical shifts of the voiced segments are essentially pitch variations.
Irrelevant to non-tonal
languages, and surprisingly
weak cues for tonal languages.
Model translation invariance?
Speech is very structured.
[Spectrogram of “I owe you”]
Horizontal dilation is a change of speaking rate.
Very badly modeled by
conventional Hidden Markov
Models.
Model time dilation invariance?
CLDNNs
Model frequency invariance using 1D convolutions.
Model time dynamics using an LSTM.
Use fully connected layers on top to add depth.
Convolutional, Long Short-Term Memory,
Fully Connected Deep Neural Networks
Sainath et al. ICASSP’15
Trend: LSTMs end-to-end!
Speech → Acoustics → Phonetics → Language → Text
Train recurrent models that also incorporate Lexical and Language Modeling.
Fast and Accurate Recurrent Neural Network
Acoustic Models for Speech Recognition, H. Sak et al.
Deep Speech: Scaling up end-to-end speech recognition, A. Hannun et al.
Listen, Attend and Spell, W. Chan et al.
Hot Topics In Deep Learning
Speech Recognition
Object Recognition
Machine Translation
Image Captioning
Memory and Computation
Hot Topics in Deep Learning: Object Recognition
Bread-and-butter task for Computer Vision
Hotly contested ImageNet ILSVRC challenge:
● First breakthrough for deep learning in 2012 (Krizhevsky et al):
brought top-5 error to 16% where the state-of-the-art was 26%.
● Progress since brought error down to 5%.
● Trained human performance is 3 to 5%.
(Humans make different kinds of mistakes)
Most improvements via larger, deeper models. Except...
The Inception Architecture
Convolutions are not flexible at allocating their parameters:
● Every filter looks at the entire depth of the input.
● Every filter has the same spatial extent. (patch size)
This puts tight constraints on the geometry and computational cost.
The Inception Architecture
Concept #1: Have different convolutions look at different subsets of the
inputs via projection layers:
Projection layers are 1x1 convolutions. Very efficient to implement
because equivalent to a single matrix multiply. Few parameters.
The Inception Architecture
Concept #2: Look at each feature map using a variety of filter sizes, not
just one, and concatenate them.
(e.g. 1x1, 3x3 and 5x5 filters)
The Inception Architecture
Concept #3: Provide each layer with a low-dimensional pooled view of
the previous layer. Similar to the often-used ‘skip connections’.
[Diagram: pool the previous layer, then project it.]
The Inception Architecture
Concept #4: Help training along by providing side objectives. Tiny
classifiers added at various levels of the convolution tower:
Main Classifier
Auxiliary Classifiers
Only used in training
The Inception Architecture
Putting it all together:
Each 1x1 acts as a bottleneck
Controls the number of
parameters per layer.
Lots of (too many?) knobs
The Inception Architecture
Going Deeper with Convolutions
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
The Inception Architecture
ImageNet challenge:
Only 11M parameters!
Compare to 130M parameters
for 2nd place VGG.
11MB (8 bit fixed-point):
model fits easily in a mobile app
Hot Topics In Deep Learning
Speech Recognition
Object Recognition
Machine Translation
Image Captioning
Memory and Computation
Hot Topics in Deep Learning: Machine Translation
Machine Translation typically involves multiple steps of processing:
- Reordering of words into a consistent, canonical order.
- Mapping words / phrases to candidates in the target language.
- Scoring candidates using a language model.
Can we devise a system that optimizes all these steps jointly?
LSTM = Trainable Sequence-to-Vector Mapping
X1, X2, X3 → Y
[Diagram: an LSTM reads X1, X2, X3 and outputs a single vector Y.]
Can we express the opposite operation and map a vector to a sequence?
LSTM = Bidirectional Trainable Sequence-to-Vector Mapping
X → Y1, Y2, Y3
[Diagram: an LSTM is seeded with the vector X and emits Y1, Y2, Y3, feeding each output back in as the next input.]
Yes! Very simple idea, but profound implications.
Mapping Sequences to Sequences
X1, X2, X3 → Y1, Y2, Y3, Y4
[Diagram: an encoder LSTM reads X1, X2, X3; a decoder LSTM then emits Y1, Y2, Y3, Y4, consuming its own previous outputs.]
Fully trainable. Agnostic to input and output sequence length.
Sequence-to-Sequence problems
Machine Translation:
Sequence to Sequence Learning with Neural Networks
Sutskever et al., NIPS’14
Parsing:
Grammar as a Foreign Language
Vinyals et al., ICLR’15
Speech Recognition? Text-to-Speech? Filtering? Event detection?
Lots of Open Issues!
Best traditional MT systems leverage monolingual data:
On Using Monolingual Corpora in Neural Machine Translation
Caglar Gulcehre et al., arXiv, 2015
Out-of-vocabulary words:
Addressing the Rare Word Problem in Neural Machine Translation
Thang Luong et al., ACL’15
Biggest issue of all: scaling!
One Scaling Issue: the Embedding Bottleneck
This embedding needs to ‘store’
the whole sequence!
[Diagram: the single vector between encoder and decoder must carry everything from X1, X2, X3 needed to produce Y1 … Y4.]
No notion of alignment between input and output.
One Approach to Scaling Neural Translation: Attention Models
Differentiable Attention:
During decoding, look back at the input sequence and derive ‘attentional’ embeddings A1, A2, A3 …
[Diagram: the decoder producing Y1, Y2, Y3, Y4 attends over X1, X2, X3 through A1, A2, A3.]
Main idea: if X2 translates to Y2,
the model can make A2 look like X2.
Neural Machine Translation by Jointly
Learning to Align and Translate
Dzmitry Bahdanau et al., ICLR’15
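A bare-bones sketch of the attention step at one decoding position, using dot-product scores for simplicity; Bahdanau et al. actually compute the scores with a small additive network, but the weighted-sum mechanism is the same.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Score every input position against the current decoder state, then build the
    attentional embedding A as a weighted sum of the encoder states."""
    scores = encoder_states @ decoder_state   # one score per input position
    weights = softmax(scores)                 # soft alignment probabilities
    return weights @ encoder_states, weights  # A_t and the alignment over X1..Xn
```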
Hot Topics in Deep Learning: Machine Translation
Still lots of scalability issues with these models.
Modern speech recognition and machine translation systems use
one to two orders of magnitude more data than can be fed to a
sequence-to-sequence model in practice.
Having a ‘universal’, trainable bidirectional sequence-to-vector
mapper opens up interesting new avenues.
Geoff Hinton calls these Thought Vectors.
Hot Topics In Deep Learning
Speech Recognition
Object Recognition
Machine Translation
Image Captioning
Memory and Computation
Hot Topics in Deep Learning: Image Captioning
Image captioning is just another translation problem:
Map an image to a Thought Vector, and decode it back into text.
[Diagram: image → Conv Net → Thought Vector → decoder LSTM → “a close-up of a child …”]
MSCOCO Challenge: [Link]
Hot Topics In Deep Learning
Speech Recognition
Object Recognition
Machine Translation
Image Captioning
Memory and Computation
Hot Topics in Deep Learning: Memory and Computation
These models can learn facts and compute complex relationships
from data. How close are we to build a fully trainable computer?
Two important lines of inquiry:
- Incorporating memory.
- Learning programs.
Memory
LSTMs have the equivalent of CPU registers:
directly addressable memory cells.
Can we provide them with RAM? Hard Drives?
RAM: indirect addressing. Content-based addressing is the main idea
behind the differentiable attention model.
Also:
Memory Networks
Jason Weston et al. ICLR’15
Memory
Fitting deep networks with Hard Drives: knowledge bases.
Currently, these models are closed systems: they have to be taught
everything. Can we teach them to retrieve facts instead of teaching
them facts? Could my neural translation model learn to search the web
for unknown words? Could it simply look things up in a translation table
instead of having the translation table be fed during training?
Main issue: combing through databases and knowledge sources is not
easy to express as a differentiable process.
Computation
Expressing generic algorithms as differentiable processes that can be
backpropagated through to learn computation strategies is a huge
problem.
LSTMs are able to express a narrow set of computations: Load, Store,
Erase. Can we generalize this?
Neural Turing Machines
Alex Graves et al., arXiv, 2014
Reinforcement Learning
Deep models need to be fully (or very close to) differentiable to be
trainable.
Reinforcement learning opens up the class of possible models to
include non-differentiable representations.
The cost is that these models don’t scale well with the size of the space
to be explored...yet.
Human-level Control Through Deep Reinforcement Learning
Volodymyr Mnih et al., Nature 518, 2015
Hot Topics in Deep Learning: Robots!
End-to-end learning from example.
Bypasses much of the traditional
robotics pipeline: localization,
registration, motion planning.
End-to-End Training of Deep
Visuomotor Policies
Sergey Levine, Chelsea Finn et al.,
arXiv, 2015
Hot Topics in Deep Learning: Art! (sort of…)
Interesting things happen when you reinforce/bias a network’s beliefs
and propagate the outcome back to the input space.
As seen on social media under the terms “inceptionism” or “deepdream”.
Different Layers -> Different Filters
ImageNet: Dogs, Birds!
Courtesy:
Alexander Mordvintsev
Factoring Style and Content!
A Neural Algorithm of Artistic Style
L.A. Gatys, A.S. Ecker, M. Bethge
Parting Thoughts
Deep Learning is a rich field at the confluence of machine learning
and computing infrastructure research.
Most direct perception tasks (audio and visual recognition) are on a
predictable improvement path. It’s time to focus instead on the difficult,
“A.I. complete” problems.
Lots to explore: as we approach human-level perception, the dream
of general artificial intelligence is looking a lot less implausible!
Questions:
vanhoucke@[Link]