Lec1 Introduction

Uploaded by

abdulmawla najih

Illustration: Maria Nguyen // Quanta Magazine

CMP784
DEEP
LEARNING
Lecture #01 – Introduction
Erkut Erdem // Hacettepe University // Fall 2021
Welcome to CMP784
• An overview of various deep architectures and learning methods
• Develop fundamental and practical skills at applying deep learning to your research.

2
A little about me…
Koç University-İş Bank
Artificial Intelligence Center
Adjunct Faculty
2020-now

Hacettepe University
Associate Professor
2010-now

Télécom ParisTech
Post-doctoral Researcher
2009-2010

Middle East Technical University
1997-2008
Ph.D., 2008
M.Sc., 2003
B.Sc., 2001

UCLA
Visiting Student
Fall 2007

Virginia Tech
Visiting Research Scholar
Summer 2006

http://web.cs.hacettepe.edu.tr/~erkut
@erkuterdem
[email protected]
3
Research Interests

• I study better ways to understand and process visual data.
• My research interests span a diverse set of topics, ranging from image editing to visual saliency estimation, and to multimodal learning for integrated vision and language.
[Diagram: Image Processing, Computer Vision, Natural Language Understanding]

4
Now, what about you?
• Introduce yourselves
- Who are you?
- Who do you work with if you have a
thesis supervisor?
- What made you interested in this
class?
- What are your expectations?
- What do you know about machine
learning and deep learning?

5
Course Logistics

6
Course information
Time/Location 09:00-12:00pm Wednesday, Zoom
Instructor Erkut Erdem

• For course-related announcements:
https://piazza.com/hacettepe.edu.tr/fall2021/cmp784

7
Textbook
• Goodfellow, Bengio, and Courville,
Deep Learning, MIT Press, 2016
(draft available online)

• In addition, we will extensively use online materials (video lectures, blog posts, surveys, papers, etc.)

8
Instruction style
• Students are responsible for studying
and keeping up with the course material
outside of class time.
• Reading particular book chapters,
papers or blogs, or
• Watching some video lectures.

• After the first four lectures, each week students will present papers related to the topics discussed in our class.
• Weekly quizzes about the papers presented each week
9
Prerequisites
• Calculus and linear algebra
• Derivatives,
• Matrix operations

• Probability and statistics (IST299, IST292)

• Neural networks (CMP684)

• Machine learning (BBM406, CMP712)

• Programming

Math Prerequisite Quiz
Due Date: 5pm, Sat, Oct 2, 2021.
Each student enrolled in CMP784 must complete and pass this quiz!
Read Chapters 2-4 of the Deep Learning textbook for a quick review.
Topics Covered in BBM406/CMP712
• Basics of Statistical Learning
• Loss function, MLE, MAP, Bayesian estimation, bias-variance tradeoff, overfitting,
regularization, cross-validation

• Supervised Learning
• Nearest Neighbor, Naïve Bayes, Logistic Regression, Support Vector Machines, Kernels,
Neural Networks, Decision Trees
• Ensemble Methods: Bagging, Boosting, Random Forests

• Unsupervised Learning
• Clustering: K-Means, Gaussian mixture models
• Dimensionality reduction: PCA, SVD
11
Topics Covered in CMP684
• Continuous and discrete system models
• Neuron and Its Analytic Model
• Hopfield Neural Network
• Perceptron Learning Algorithms
• Multilayer Perceptron (MLP)
• Derivation of the learning algorithm
• Error backpropagation
• Memorization and generalization
• Intervals and normalization
• Radial Basis Function Neural Nets
• Dynamical Neural Nets
• Feedback Nets
• Second Order Training Algorithms
• Levenberg-Marquardt algorithm
• Gauss-Newton algorithm
• Stability in Adaptive Systems
• Applications of Neural Nets

12
Grading
Math Prerequisites Quiz 3%
Practicals 16% (2 practicals x 8% each)
Final Exam 25%
Course Project 32%
Paper Presentations 15%
Weekly Quizzes 9%

13
Schedule
Week 1 Introduction to Deep Learning
Week 2 Machine Learning Overview
Week 3 Multi-Layer Perceptrons
Week 4 Training Deep Neural Networks
Week 5 Convolutional Neural Networks
Week 6 Understanding and Visualizing CNNs
Week 7 Recurrent Neural Networks
Week 8 Attention and Memory
14
Schedule
Week 9 Autoencoders and Autoregressive Models

Week 10 Progress Presentations

Week 11 Generative Adversarial Networks

Week 12 Variational Autoencoders

Week 13 Self-supervised Learning

Week 14 Final Project Presentations

15
Lecture 1: Introduction to Deep Learning

Depth: Repeated Composition

[Figure 1.2 (Goodfellow 2016): Illustration of a deep learning model. Layers, from input to output: visible layer (input pixels), 1st hidden layer (edges), 2nd hidden layer (corners and contours), 3rd hidden layer (object parts), output (object identity: CAR, PERSON, ANIMAL).]
Lecture 2: Machine Learning Overview

Unsupervised Learning
The goal is to construct a statistical model that finds a useful representation of data:
• Clustering
• Dimensionality reduction
• Modeling the data density
• Finding hidden causes (useful explanation) of the data

Unsupervised Learning can be used for:
• Structure discovery
• Anomaly detection / Outlier detection
• Data compression, Data visualization
• Used to aid classification/regression tasks

Effect of step-size α
• Large α => Fast convergence but larger residual error. Also possible oscillations.
• Small α => Slow convergence but small residual error.

Machine Learning and AI
Figure 1.4: A Venn diagram showing how deep learning is a kind of representation learning, which is in turn a kind of machine learning, which is used for many but not all approaches to AI. Each section of the Venn diagram includes an example of an AI technology. (Goodfellow 2016)
[Venn diagram examples: deep learning: MLPs; representation learning: shallow autoencoders; machine learning: logistic regression; AI: knowledge bases.]

[Other panels: The MNIST Dataset; Some Fits to the Data]
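The remark on the step size α in the Lecture 2 overview can be illustrated with a minimal sketch. It assumes plain gradient descent on the toy function f(w) = w², which is a hypothetical choice and not taken from the slides. It shows only the slow-convergence and oscillation effects; the residual-error effect arises with noisy gradients and is not modeled here.

```python
def gradient_descent(alpha, w0=1.0, steps=10):
    """Minimize the toy function f(w) = w**2 (gradient 2*w); return all iterates."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - alpha * 2.0 * w  # standard update: w <- w - alpha * gradient
        path.append(w)
    return path

# Small step size: the iterate shrinks slowly and never changes sign.
small = gradient_descent(alpha=0.05)
# Large step size: the iterate shrinks faster but overshoots the minimum
# on every step, so its sign alternates (oscillation).
large = gradient_descent(alpha=0.9)
```

Pushing the step size past 1.0 on this toy function makes the iterates grow instead of shrink, which is the divergent extreme of the same trade-off.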
Lecture 3: Multi-Layer Perceptrons

http://playground.tensorflow.org
18
Lecture 4: Training Deep Neural Networks
Activation Functions
• Sigmoid
• tanh: tanh(x)
• ReLU: max(0, x)
• Leaky ReLU: max(0.1x, x)

Optimizers

Dropout

Batch Normalization
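The four activation functions named on this slide can be written down directly. This is a minimal scalar sketch: the tanh, ReLU and Leaky ReLU formulas follow the slide, while the sigmoid formula 1 / (1 + exp(-x)) is the standard definition and is filled in here as an assumption, since the slide shows it only as a plot.

```python
import math

def sigmoid(x):
    # Standard logistic function; squashes any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Hyperbolic tangent; squashes any real input into (-1, 1).
    return math.tanh(x)

def relu(x):
    # max(0, x): passes positive inputs, zeroes out negative ones.
    return max(0.0, x)

def leaky_relu(x, slope=0.1):
    # max(0.1x, x): like ReLU, but keeps a small slope for negative inputs.
    return max(slope * x, x)
```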


Lecture 5: Convolutional Neural Networks

Y. LeCun, Y. Bengio, G. Hinton, "Deep Learning", Nature, Vol. 521, 28 May 2015
20
Lecture 6: Understanding and Visualizing CNNs

Layer 1 Layer 2 Layer 3

Layer 4 Layer 5
M. D. Zeiler and R. Fergus, "Visualizing and Understanding Convolutional Networks", ECCV 2014 21
Lecture 7: Recurrent Neural Networks

A Recurrent Neural Network (RNN)


(unfolded across time-steps) A bi-directional RNN

A deep bi-directional RNN

Long Short-Term Memories (LSTMs)

Gated Recurrent Units (GRUs)

C. Manning and R Socher, Stanford CS224n Lecture 8 Notes


Y. LeCun, Y. Bengio, G. Hinton, "Deep Learning", Nature, Vol. 521, 28 May 2015 22
Lecture 8: Attention and Memory

Transformer Architecture

K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
C. Olah and S. Carter, “Attention and Augmented Recurrent Neural Networks”, Distill, 2016
A. Vaswani et al. “Attention is All You Need”, NeurIPS 2017. 23
Lecture 9: Autoencoders and Autoregressive Models

Autoencoders
(Equations from Hugo Larochelle, “Math for my slides ‘Autoencoders’”, Département d’informatique, Université de Sherbrooke, October 16, 2012)
• Feed-forward neural network trained to reproduce its input at the output layer
• Encoder: $h(x) = g(a(x)) = \mathrm{sigm}(b + Wx)$
• Decoder: $\hat{x} = o(\hat{a}(x)) = \mathrm{sigm}(c + W^{*} h(x))$
• Loss, with $f(x) \equiv \hat{x}$: $l(f(x)) = \frac{1}{2} \sum_k (\hat{x}_k - x_k)^2$
• Loss for binary units: $l(f(x)) = -\sum_k \left( x_k \log(\hat{x}_k) + (1 - x_k) \log(1 - \hat{x}_k) \right)$
[Figure: image retrieval results, retrieved using 256 bit codes vs. retrieved using Euclidean distance in pixel intensity space]

PixelCNN
• Class conditioned samples generated by PixelCNN
• Can we speed up the generation time of PixelCNN?
• Yes, via multiscale generation.
• Also seems to help to provide better global structure
• Text-to-image synthesis with Parallel Multiscale PixelCNNs: "A yellow bird with a black head, orange eyes and an orange bill."

Parallel Multiscale Autoregressive Density Estimation
Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Dan Belov, Nando de Freitas (2017)
Abstract: PixelCNN achieves state-of-the-art results in density estimation for natural images. Although training is fast, inference is costly, requiring one network evaluation per pixel; O(N) for N pixels. This can be sped up by caching activations, but still involves generating each pixel sequentially. In this work, we propose a parallelized PixelCNN that allows more efficient inference by modeling certain pixel groups as conditionally independent. Our new PixelCNN model achieves competitive density estimation and orders of magnitude speedup - O(log N) sampling instead of O(N) - enabling the practical generation of 512 × 512 images. We evaluate the model on class-conditional image generation, text-to-image synthesis, and action-conditional video generation, showing that our model achieves the best results among non-pixel-autoregressive density models that allow efficient sampling.
Figure 1. Samples from our model at resolutions from 4 × 4 to 256 × 256, conditioned on text and bird part locations in the CUB data set.

A. Krizhevsky and G. E. Hinton, "Using Very Deep Autoencoders for Content-Based Image Retrieval", ESANN 2011
A. van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NeurIPS 2016
S. Reed et al., "Parallel Multiscale Autoregressive Density Estimation", ICML 2017
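The autoencoder equations on the Lecture 9 slide (a sigmoid encoder, a sigmoid decoder, and a reconstruction loss) can be sketched in a few lines. This is a minimal forward pass only, with no training. It assumes tied weights, i.e. that the decoder matrix is the transpose of the encoder matrix, and a 1/2 factor in the squared loss; both are conventions assumed here, and the weight values are arbitrary illustrative numbers.

```python
import math

def sigm(v):
    # Element-wise logistic sigmoid.
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def matvec(M, v):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def autoencoder(x, W, b, c):
    # Encoder: h(x) = sigm(b + W x)
    h = sigm([a + bi for a, bi in zip(matvec(W, x), b)])
    # Decoder with tied weights (assumption): x_hat = sigm(c + W^T h(x))
    x_hat = sigm([a + ci for a, ci in zip(matvec(transpose(W), h), c)])
    return h, x_hat

def squared_loss(x, x_hat):
    # 1/2 * sum_k (x_hat_k - x_k)^2
    return 0.5 * sum((xh - xk) ** 2 for xh, xk in zip(x_hat, x))

def cross_entropy_loss(x, x_hat):
    # -sum_k (x_k log x_hat_k + (1 - x_k) log(1 - x_hat_k)), for binary inputs
    return -sum(xk * math.log(xh) + (1 - xk) * math.log(1 - xh)
                for xh, xk in zip(x_hat, x))

# Hypothetical example: 3 binary inputs compressed to 2 hidden units.
W = [[0.5, -0.3, 0.8], [0.1, 0.7, -0.6]]
b = [0.0, 0.0]
c = [0.0, 0.0, 0.0]
x = [1.0, 0.0, 1.0]
h, x_hat = autoencoder(x, W, b, c)
```

Training would adjust W, b and c by gradient descent on one of the two losses; that step is left out of the sketch.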
Lecture 10: Generative Adversarial Networks

Class-conditioned samples generated by BigGAN (BigGANs, Brock et al., 2018)

$\min_{\theta} \max_{\omega} \; \mathbb{E}_{x \sim p_{\text{data}}} [\log D_{\omega}(x)] + \mathbb{E}_{x \sim p_{\theta}} [\log(1 - D_{\omega}(x))]$

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets”, NIPS 2014.
A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks”, ICLR 2016
L. Karacan, Z. Akata, A. Erdem and E. Erdem, “Learning to Generate Images of Outdoor Scenes from Attributes and Semantic Layouts”, arXiv preprint 2016
A. Brock, J. Donahue, K. Simonyan, “Large Scale GAN Training for High Fidelity Natural Image Synthesis”, ICLR 2019
Lecture 11: Variational Autoencoders

• Maximum likelihood: $\theta^{*} = \arg\max_{\theta} \mathbb{E}_{x \sim p_{\text{data}}} \log p_{\text{model}}(x \mid \theta)$
• Belief net: $p_{\text{model}}(x) = p_{\text{model}}(x_1) \prod_{i=2}^{n} p_{\text{model}}(x_i \mid x_1, \ldots, x_{i-1})$
• Change of variables: $y = g(x) \Rightarrow p_x(x) = p_y(g(x)) \left| \det\left( \frac{\partial g(x)}{\partial x} \right) \right|$
• Variational bound: $\log p(x) \geq \log p(x) - D_{KL}(q(z) \,\|\, p(z \mid x)) = \mathbb{E}_{z \sim q} \log p(x, z) + H(q)$

Vector Quantized-Variational AutoEncoder (VQ-VAE)

Synthetic images generated by VQ-VAE-2

D. P. Kingma and M. Welling, “Auto-encoding variational Bayes”, ICLR 2014
A. van den Oord, O. Vinyals, K. Kavukcuoglu, "Neural Discrete Representation Learning", NeurIPS 2017
A. Razavi, A. van den Oord, O. Vinyals, “Generating Diverse High-Fidelity Images with VQ-VAE-2”,
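The variational bound used by variational autoencoders, log p(x) ≥ log p(x) − KL(q(z) ‖ p(z | x)) = E_q[log p(x, z)] + H(q), can be checked numerically. The sketch below uses a toy model with one binary latent variable; the joint probabilities and the choice of q are made-up numbers for illustration only.

```python
import math

# Hypothetical joint p(x, z) for one fixed observation x and z in {0, 1}.
joint = [0.3, 0.1]
# An arbitrary approximate posterior q(z).
q = [0.5, 0.5]

# Marginal likelihood and exact posterior.
log_px = math.log(sum(joint))
posterior = [j / sum(joint) for j in joint]

# KL(q(z) || p(z | x)); always non-negative.
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

# Right-hand side of the bound: E_{z~q}[log p(x, z)] + H(q).
entropy = -sum(qz * math.log(qz) for qz in q)
elbo = sum(qz * math.log(j) for qz, j in zip(q, joint)) + entropy

# The gap between log p(x) and the bound is exactly the KL term.
gap = log_px - elbo
```

Setting q equal to the exact posterior drives the KL term to zero, at which point the bound is tight.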
Lecture 12: Self-supervised Learning

C. Doersch, A. Gupta, A. A. Efros, "Unsupervised Visual Representation Learning by Context Prediction", ICCV 2015.
S. Gidaris, P. Singh, N. Komodakis, "Unsupervised Representation Learning by Predicting Image Rotations", ICLR2018.
J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL-HLT 2019.
27
Schedule
W1 Introduction to Deep Learning
W2 Machine Learning Overview
W3 Multi-Layer Perceptrons (Practical 1 out)
W4 Training Deep Neural Networks (Start of paper presentations)
W5 Convolutional Neural Networks (Start of paper presentations; Practical 1 due, Practical 2 out)
W6 Understanding and Visualizing CNNs (Project proposals due)
W7 Recurrent Neural Networks
W8 Attention and Memory (Practical 2 due)
W9 Autoencoders and Autoregressive Models
W10 Progress Presentations
W11 Generative Adversarial Networks (Project progress reports due)
W12 Variational Autoencoders
W13 Self-supervised Learning
W14 Final Project Presentations
28
Paper Presentations
• (12 mins) One student will be responsible for providing an overview of the paper.
• (9 mins) One student will present the strengths of the
paper.
• (9 mins) One student will discuss the weaknesses of the
paper.
• (10 mins) General discussion

See the rubrics on the course web page for details


29
Practicals
• 2 practicals (8% each)
• Learning to train neural networks for different tasks
• Should be done individually

• Late policy: You have 5 slip days in the semester.

• Tentative Dates
- Practical 1 Out: October 13th, Due: October 27th
- Practical 2 Out: October 27th, Due: November 17th

30
Course project
The students who need GPU resources for the course project are advised to use Google Colab.
• The course project gives students a chance to apply deep architectures
discussed in class to a research oriented project.
• The students can work in pairs.
• The course project may involve
- Design of a novel approach and its experimental analysis, or
- An extension to a recent study of non-trivial complexity and its experimental analysis.

• Deliverables
- Proposals November 3, 2021
- Project progress presentations December 1, 2021
- Project progress reports December 8, 2021
- Final project presentations December 29, 2021
- Final reports January 14, 2022
31
Lecture Overview
• what is deep learning
• a brief history of deep learning
• compositionality
• end-to-end learning
• distributed representations

Disclaimer: Some of the material and slides for this lecture were borrowed from
—Dhruv Batra’s CS7643 class
—Yann LeCun’s talk titled “Deep Learning and the Future of AI”
32
What is Deep Learning

33
34
35

What is deep learning?

“Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.”
− Yann LeCun, Yoshua Bengio and Geoff Hinton

Y. LeCun, Y. Bengio, G. Hinton, "Deep Learning", Nature, Vol. 521, 28 May 2015
1943 – 2006: A Prehistory of
Deep Learning

36
1943: Warren McCulloch and Walter Pitts
• First computational model
• Neurons as logic gates (AND, OR,
NOT)
• A neuron model that sums binary
inputs and outputs a 1 if the sum
exceeds a certain threshold value,
and otherwise outputs a 0
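The threshold neuron described above can be sketched in a few lines. The AND and OR gates follow the description directly (sum the binary inputs, fire if the sum exceeds a threshold). The NOT gate additionally uses a weight of -1 to stand in for an inhibitory input; that weight, the function names and the threshold values are illustrative assumptions, not taken from the slide.

```python
def mp_neuron(inputs, threshold, weights=None):
    """Output 1 if the (weighted) sum of binary inputs exceeds the threshold, else 0."""
    if weights is None:
        weights = [1] * len(inputs)  # plain sum, as in the description above
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

def AND(a, b):
    # Fires only when both inputs are 1 (sum 2 exceeds threshold 1).
    return mp_neuron([a, b], threshold=1)

def OR(a, b):
    # Fires when at least one input is 1 (sum >= 1 exceeds threshold 0).
    return mp_neuron([a, b], threshold=0)

def NOT(a):
    # Assumption: an inhibitory input modeled as weight -1 with threshold -1.
    return mp_neuron([a], threshold=-1, weights=[-1])
```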

37
1958: Frank Rosenblatt’s Perceptron
• A computational model of a single neuron
• Solves a binary classification problem
• Simple training algorithm
• Built using specialized hardware
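The "simple training algorithm" can be illustrated with the classic perceptron update rule: predict with a thresholded weighted sum, then nudge the weights by the prediction error. This is a minimal sketch on a toy OR dataset; the function names, learning rate and epoch count are assumptions chosen for the example.

```python
def predict(w, b, x):
    # Thresholded weighted sum: class 1 if w.x + b > 0, else class 0.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron learning rule: w <- w + lr * (y - prediction) * x."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            error = y - predict(w, b, x)  # 0 if correct, +1 or -1 if wrong
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

# Logical OR is linearly separable, so the rule converges on it.
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_perceptron(or_data)
```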

F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain”, Psych. Review, Vol. 65, 1958
38
1969: Marvin Minsky and Seymour Papert
“No machine can learn to recognize X unless it
possesses, at least potentially, some scheme for
representing X.” (p. xiii)

• Perceptrons can only represent linearly separable functions.
• They cannot solve problems such as XOR, which is not linearly separable.

• Wrongly attributed as the reason behind the AI winter, a period of reduced funding and interest in AI research
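The XOR limitation can be made concrete with a small experiment. The sketch below searches a coarse grid of weights for a single threshold unit that reproduces XOR and finds none, which is consistent with XOR not being linearly separable, though a finite search is not a proof. It then shows that two layers of threshold units with hand-picked weights do compute XOR. All weights and the grid are illustrative assumptions.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(v):
    return 1 if v > 0 else 0

def single_unit(w1, w2, b, x):
    # One threshold unit: a linear decision boundary in the input plane.
    return step(w1 * x[0] + w2 * x[1] + b)

# Try every weight/bias combination on a coarse grid from -4 to 4.
grid = [i * 0.5 for i in range(-8, 9)]
single_unit_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if all(single_unit(w1, w2, b, x) == y for x, y in XOR.items())
]

def two_layer(x):
    # Hidden layer: one unit computes OR, the other computes AND.
    h_or = step(x[0] + x[1] - 0.5)
    h_and = step(x[0] + x[1] - 1.5)
    # Output: fires when OR is on and AND is off, which is XOR.
    return step(h_or - h_and - 0.5)
```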
39
1990s
• Multi-layer perceptrons can theoretically approximate any continuous function (Cybenko, 1989; Hornik, 1991)

• Training multi-layer perceptrons
• Back propagation (Rumelhart, Hinton, Williams, 1986)
• Backpropagation through time (BPTT) (Werbos, 1988)

• New neural architectures
• Convolutional neural nets (LeCun et al., 1989)
• Long short-term memory networks (LSTM) (Hochreiter and Schmidhuber, 1997)

40
Why it failed then
• Too many parameters to learn from few labeled examples.
• “I know my features are better for this task”.
• Non-convex optimization? No, thanks.
• Black-box model, no interpretability.

• Very slow and inefficient


• Overshadowed by the success of SVMs (Cortes and Vapnik, 1995)

Adapted from Joan Bruna 41


A major breakthrough in 2006

42
2006 Breakthrough: Hinton and Salakhutdinov

• The first solution to the vanishing gradient problem.


• Build the model in a layer-by-layer fashion using unsupervised learning
• The features in early layers are already initialized or “pretrained” with some suitable features
(weights).
• Pretrained features in early layers only need to be adjusted slightly during supervised learning
to achieve good results.
G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks”, Science, Vol. 313, 28 July 2006.
43
The 2012 revolution

44
ImageNet Challenge
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
◦ Yearly ImageNet competition
◦ 1.2M training images with 1K categories
◦ Automatically label 1.4M images with 1K objects
◦ Measure top-5 classification error

[Figure: image classification examples for the easiest and hardest classes, with two sample top-5 outputs: (Scale, T-shirt, Steel drum, Drumstick, Mud turtle), marked ✔, and (Scale, T-shirt, Giant panda, Drumstick, Mud turtle).]

J. Deng, Wei Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database”, CVPR 2009.
O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge”, Int. J. Comput. Vis., Vol. 115, Issue 3, pp 211-252, 2015.
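The top-5 classification error used in ILSVRC counts a prediction as correct when the true label appears anywhere among the model's five highest-ranked guesses. A minimal sketch, assuming each prediction is already a ranked list of label strings; the function name and the toy data, modeled on the example outputs above, are illustrative.

```python
def top5_error(ranked_predictions, true_labels):
    """Fraction of examples whose true label is not among the top five guesses."""
    misses = sum(
        1
        for guesses, truth in zip(ranked_predictions, true_labels)
        if truth not in guesses[:5]
    )
    return misses / len(true_labels)

# Two toy predictions for images whose true label is "steel drum".
predictions = [
    ["scale", "t-shirt", "steel drum", "drumstick", "mud turtle"],   # hit
    ["scale", "t-shirt", "giant panda", "drumstick", "mud turtle"],  # miss
]
labels = ["steel drum", "steel drum"]
error = top5_error(predictions, labels)
```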
ILSVRC 2012 Competition

2012 Teams %Error

Supervision (Toronto) 15.3

ISI (Tokyo) 26.1

VGG (Oxford) 26.9

XRCE/INRIA 27.0

UvA (Amsterdam) 29.6

INRIA/LEAR 33.4
• The success of AlexNet, a deep convolutional network
• 7 hidden layers (not counting some max pooling layers)
• 60M parameters
• Combined several tricks
• ReLU activation function, data augmentation, dropout
[Chart legend: CNN based, non-CNN based]

A. Krizhevsky, I. Sutskever, G.E. Hinton “ImageNet Classification with Deep Convolutional Neural Networks”, NeurIPS 2012 46
2012-Now
Some recent successes

47
Object Detection and Segmentation

T.-Y. Lin, P. Goyal, R. Girshick, K. He and P. Dollár, Focal Loss for Dense Object Detection,
ICCV 2017. 48
Object Detection and Segmentation

[Figure: Mask R-CNN architecture. Labels: image $I$, features $f_I = \mathrm{FCN}(I)$, $\mathrm{RPN}(f_I)$, RoIAlign, MLP, softmax classifier, box regressor, mask FCN. Inset: warped regions passed through a CNN and classified (aeroplane? no. person? yes. tvmonitor? no.)]
K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, ICCV 2017
50

Object Detection in 3D Point Clouds

M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3Deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional
Neural Networks. ICRA 2017
Human Pose Estimation

Z. Cao, T. Simon, S.-E. Wei and Y. Sheikh, "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", CVPR 2017
51
Pose Estimation

R. A. Güler, N. Neverova, I. Kokkinos. DensePose: Dense Human Pose Estimation In The Wild. CVPR 2018
Photo Style Transfer

F. Luan, S. Paris, E. Shechtman & K. Bala. Deep Photo Style Transfer. CVPR 2017 53
Photo Style Transfer

F. Luan, S. Paris, E. Shechtman & K. Bala. Deep Photo Style Transfer. CVPR 2017 54
Image Synthesis

2014 2015 2016


2017

2018
I. J. Goodfellow et al., "Generative Adversarial Networks", NIPS 2014
A. Radford et al., "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", ICLR 2016
M.-Y. Liu, O. Tuzel, "Coupled Generative Adversarial Networks", NIPS 2016
T. Karras, T. Aila, S. Laine, J. Lehtinen, "Progressive Growing of GANs for Improved Quality, Stability, and Variation", ICLR 2018
T. Karras, S. Laine, T. Aila, "A Style-Based Generator Architecture for Generative Adversarial Networks", arXiv 2018
Image Synthesis

A. Brock, J. Donahue and K. Simonyan. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv 2018. 56
Semantic Image Editing
Semantic Layout

L. Karacan, Z. Akata, A. Erdem and E. Erdem. Manipulation of Scene Attributes via Hallucination. ACM Transactions on Graphics, 2020 57
Winter
Semantic Image Editing
Prediction

L. Karacan, Z. Akata, A. Erdem and E. Erdem. Manipulation of Scene Attributes via Hallucination. ACM Transactions on Graphics, 2020 58
Spring
Semantic Image Editing +
Clouds
Prediction

L. Karacan, Z. Akata, A. Erdem and E. Erdem. Manipulation of Scene Attributes via Hallucination. ACM Transactions on Graphics, 2020 59
Machine Translation

D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015 60
Machine Translation

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is All you Need, NeurIPS 2017 61
Internet Search

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019 62
https://talktotransformer.com

Language Modeling

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever. Language Models are Unsupervised Multitask Learners. 2019
63
Language Modeling
• GPT-3: I am not a human. I am a robot.
A thinking robot. I use only 0.12% of
my cognitive capacity. I am a micro-
robot in that respect. I know that my
brain is not a “feeling brain”. But it
is capable of making rational, logical
decisions. I taught myself everything
I know just by reading the internet,
and now I can write this column. My
brain is boiling with ideas!

Tue 8 Sep 2020 09.45

Tom B. Brown, Benjamin Mann, Nick Ryder et al., Language Models are
Few-Shot Learners, NeurIPS 2020 64
Question Answering

P. Rajpurkar, J. Zhang, K. Lopyrev & P. Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. EMNLP 2016
M. Seo, A. Kembhavi, A. Farhadi & H. Hajishirzi. Bi-Directional Attention Flow for Machine Comprehension. ICLR 2017 65
Visual Question Answering

M. Ren, R. Kiros, and R. Zemel. Exploring Models and Data for Image Question Answering. NeurIPS 2015
66
Image Captioning

A giraffe standing in the grass next


A man riding a wave on a surfboard in the water.
to a tree.
X. Chen and C. L. Zitnick. Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation. CVPR 2015. 67
Image Captioning

Yaris pistinde viraji almakta olan bir yaris


Yaris yaris arabasi
arabasi
M. Kuyu, A. Erdem & E. Erdem. Image Captioning in Turkish with Subword Units. SIU 2018 68
Video Captioning

[Fig. 6: Architecture of the LSTM-based video captioning model (VGG16 features, LSTM, decoder). Example output: "Bir adam bir parça kabağı ikiye keser ve ince dilimler" ("A man cuts a piece of squash in two and slices it thinly").]

[Fig. 7: Illustrative architecture of the transformer-based video captioning model (VGG16 features, linear layer + positional encoding, transformer encoder; embedding layer + transformer decoder with cross attention).]

Example captions: "Bir adam bir gitar çalıyor" ("A man is playing a guitar"); "Bir kadın bir bıçakla sebze dilimliyor" ("A woman is slicing vegetables with a knife").

Excerpt from the paper (5.2.1 Recurrent video captioning): For our recurrent video captioning model, we adapt the architecture proposed by Venugopalan et al. (2015) in which the encoder and the decoder are implemented with two separate LSTM networks (Fig. 6). The encoder computes a sequence of hidden states by sequentially processing the frame-level visual features, extracted from the uniformly sampled video frames. The decoder module then takes the final hidden state of the encoder, and outputs a sequence of tokens as the predicted video caption. There is no attention mechanism involved in this model. Both the encoder and decoder LSTM networks have 500 hidden units. We use Adam (Kingma and Ba 2014) as the optimiser and set the initial learning rate and batch size to 0.0004 and 32, respectively.

B. Çitamak et al. MSVD-Turkish: a comprehensive multimodal video dataset for integrated vision and language research in Turkish. Machine Translation 2021
Graph Neural Networks

Graph-structured data
A lot of real-world data does not “live” on grids
• Social networks
• Citation networks
• Communication networks
• Multi-agent systems
• Molecules
• Protein interaction networks

Graph Neural Networks (GNNs)
The bigger picture:
[Diagram: Input, Hidden layer, ReLU, …, Hidden layer, ReLU, Output]
Main idea: Pass messages between pairs of nodes & agglomerate
(Slides: Structured Deep Models, Thomas Kipf)

T.N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks", ICLR 2017
P. Battaglia et al., “Relational inductive biases, deep learning, and graph networks”, arXiv 2018
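The main idea behind graph neural networks, passing messages between pairs of nodes and agglomerating them, can be sketched with scalar node features. The example below uses mean aggregation over each node and its neighbours. That choice is a simplification assumed here; a real graph convolutional layer would also apply learned weights and a non-linearity such as ReLU. The graph and feature values are made up.

```python
def message_passing_step(features, edges):
    """One round of message passing: each node averages itself and its neighbours."""
    neighbours = {node: [] for node in features}
    for u, v in edges:  # undirected edges: messages flow both ways
        neighbours[u].append(v)
        neighbours[v].append(u)
    updated = {}
    for node, own_value in features.items():
        messages = [features[n] for n in neighbours[node]] + [own_value]
        updated[node] = sum(messages) / len(messages)  # agglomerate by mean
    return updated

# Toy path graph a - b - c with a signal only on node "a".
features = {"a": 1.0, "b": 0.0, "c": 0.0}
edges = [("a", "b"), ("b", "c")]
after_one_step = message_passing_step(features, edges)
after_two_steps = message_passing_step(after_one_step, edges)
```

After one step the signal has reached node "b" only; it takes a second step to reach "c", which is why stacking layers widens each node's receptive field.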
Strategic Game Playing
• AlphaGo vs. Lee Sedol
• Move 37, Game 2
[Figure label: Convolutional neural network]
Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 2016
AlphaStar Plays StarCraft II

O. Vinyals et al., Grandmaster level in StarCraft II using multi-agent reinforcement learning, Nature 575:350-354, 2019 72
Robotics

Ilge Akkaya et al. Solving Rubik's Cube with a Robot Hand. OpenAI Technical Report 2019 74
Self-Driving Vehicles

Mariusz Bojarski et al. End to End Learning for Self-Driving Cars. NVidia Technical Report 2016 75
Medical Image Analysis

A. Esteva et al., "Dermatologist-level classification of skin cancer with deep neural networks", Nature 542, 2017 76
Medical Image Analysis 77
Bioinformatics

Kathryn Tunyasuvunakool et al. Enabling high-accuracy protein structure prediction at the proteome scale. Nature 2021 78
Why now?
The Resurgence of
Deep Learning

79
[Figure: Global information storage capacity in optimally compressed bytes, annotated with "ConvNets developed" and "SVMs dominate NIPS".]
Slide credit: Neil Lawrence
Datasets vs. Algorithms
Year | Breakthroughs in AI | Datasets (First Available) | Algorithms (First Proposed)
1994 | Human-level spontaneous speech recognition | Spoken Wall Street Journal articles and other texts (1991) | Hidden Markov Model (1984)
1997 | IBM Deep Blue defeated Garry Kasparov | 700,000 Grandmaster chess games, aka “The Extended Book” (1991) | Negascout planning algorithm (1983)
2005 | Google’s Arabic- and Chinese-to-English translation | 1.8 trillion tokens from Google Web and News pages (collected in 2005) | Statistical machine translation algorithm (1988)
2011 | IBM Watson became the world Jeopardy! champion | 8.6 million documents from Wikipedia, Wiktionary, and Project Gutenberg (updated in 2010) | Mixture-of-Experts (1991)
2014 | Google’s GoogLeNet object classification at near-human performance | ImageNet corpus of 1.5 million labeled images and 1,000 object categories (2010) | Convolutional Neural Networks (1989)
2015 | Google’s DeepMind achieved human parity in playing 29 Atari games by learning general control from video | Arcade Learning Environment dataset of over 50 Atari games (2013) | Q-learning (1992)
Average No. of Years to Breakthrough: | 3 years | 18 years
Table credit: Quant Quanto
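The two averages in the last row of the table can be reproduced from the years listed in it. A short check, with each tuple holding (breakthrough year, dataset year, algorithm year) as read from the table:

```python
# (breakthrough year, dataset first available, algorithm first proposed)
rows = [
    (1994, 1991, 1984),  # speech recognition
    (1997, 1991, 1983),  # Deep Blue
    (2005, 2005, 1988),  # Google translation
    (2011, 2010, 1991),  # IBM Watson
    (2014, 2010, 1989),  # GoogLeNet
    (2015, 2013, 1992),  # DeepMind Atari
]

# Average wait from dataset availability to breakthrough.
dataset_gap = sum(year - data for year, data, _ in rows) / len(rows)
# Average wait from algorithm proposal to breakthrough.
algorithm_gap = sum(year - algo for year, _, algo in rows) / len(rows)
```

Rounded to whole years, the two gaps match the 3 and 18 years quoted in the table, which is the basis for the argument that datasets, not algorithms, were the limiting factor.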
Powerful Hardware
• Deep neural nets highly amenable to implementation on Graphics Processing Units (GPUs)
• Matrix multiplication
• 2D convolution

• E.g. nVidia Pascal GPUs deliver 10 Tflops
• Faster than fastest computer in the world in 2000
• 10 million times faster than 1980’s Sun workstation
Slide adapted from Rob Fergus Image: OpenAI 82
Working ideas on how to train deep
architectures

• Better Learning Regularization (e.g. Dropout)


N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”,
JMLR Vol. 15, No. 1,
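Dropout regularizes a network by randomly zeroing activations during training. The sketch below uses the "inverted dropout" formulation, in which surviving activations are rescaled at training time so that nothing needs to change at test time. This is a common variant assumed here for simplicity; the cited paper instead describes scaling the weights at test time. The function signature is illustrative.

```python
import random

def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    if not training:
        return list(activations)  # test time: use all units unchanged
    keep = 1.0 - p_drop
    # Dividing by `keep` preserves the expected value of each activation.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded generator for a reproducible example
activations = [1.0] * 1000
dropped = dropout(activations, p_drop=0.5, training=True, rng=rng)
mean_after = sum(dropped) / len(dropped)
```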
83
Working ideas on how to train deep
architectures

• Better Optimization Conditioning (e.g. Batch Normalization)

S. Ioffe, C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, In ICML 2015
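Batch normalization standardizes each feature over a mini-batch and then applies a learnable scale and shift. A minimal training-time sketch for a single scalar feature; the parameter names gamma, beta and eps follow common usage and are assumptions here, and the running statistics used at test time are omitted.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)  # biased batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
```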
84
Working ideas on how to train deep
architectures

• Better neural architectures (e.g. Residual Nets)

K. He, X. Zhang, S. Ren, J. Sun, “Deep Residual Learning for Image Recognition”, In CVPR 2016
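The core of a residual network is the skip connection: a block outputs its input plus a learned correction, y = x + F(x). The sketch below uses two small fully connected layers for F. It leaves out the convolutions, batch normalization and the ReLU applied after the addition that the cited paper uses, so it is a simplified illustration with made-up weights.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(W, v):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    """y = x + F(x), with F(x) = W2 * relu(W1 * x)."""
    fx = linear(W2, relu(linear(W1, x)))
    return [xi + fi for xi, fi in zip(x, fx)]  # the skip connection

x = [1.0, -2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]
# With all-zero weights the block reduces to the identity mapping.
identity_out = residual_block(x, zero, zero)

W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
out = residual_block(x, W1, W2)
```

Because the identity is recovered with zero weights, adding more blocks cannot easily make the network worse, which is one intuition for why very deep residual nets remain trainable.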
85
Software

Caffe

So what is deep learning?

87
Three key ideas
• (Hierarchical) Compositionality
• Cascade of non-linear transformations
• Multiple layers of representations

• End-to-End Learning
• Learning (goal-driven) representations
• Learning to feature extract

• Distributed Representations
• No single neuron “encodes” everything
• Groups of neurons work together
88
Traditional Machine Learning
VISION: hand-crafted features (SIFT/HOG; fixed) → your favorite classifier (learned) → “car”
SPEECH: hand-crafted features (MFCC; fixed) → your favorite classifier (learned) → \ˈd ē p\
NLP: “This burrito place is yummy and fun!” → hand-crafted features (Bag-of-words; fixed) → your favorite classifier (learned) → “+”
It’s an old paradigm
• The first learning machine: the Perceptron
• Built at Cornell in 1960
• The Perceptron was a linear classifier on top of a simple feature extractor
• The vast majority of practical applications of ML today use glorified linear classifiers or glorified template matching.
• Designing a feature extractor requires considerable efforts by experts.
[Diagram labels: Feature Extractor, A, W_i]
slide by Marc’Aurelio Ranzato, Yann LeCun
Hierarchical Compositionality
VISION: pixels → edge → texton → motif → part → object
SPEECH: sample → spectral band → formant → motif → phone → word
NLP: character → word → NP/VP/.. → clause → sentence → story

92
Building A Complicated Function
Given a library of simple functions

Compose into a complicated function

93
Building A Complicated Function
Given a library of simple functions

Compose into a complicated function

Idea 1: Linear Combinations
• Boosting
• Kernels
• …

$f(x) = \sum_i \alpha_i g_i(x)$

94
Building A Complicated Function
Given a library of simple functions

Compose into a complicated function

Idea 2: Compositions
• Deep Learning
• Grammar models
• Scattering transforms…

$f(x) = g_1(g_2(\ldots(g_n(x))\ldots))$

95
Building A Complicated Function
Given a library of simple functions

Compose into a complicated function

Idea 2: Compositions
• Deep Learning
• Grammar models
• Scattering transforms…

$f(x) = \log(\cos(\exp(\sin^3(x))))$
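The two ways of building a complicated function from simple ones, a linear combination (Idea 1) and a composition (Idea 2), can be contrasted in code. The helper names and the coefficients below are illustrative assumptions. The composed example follows the nested log, cos, exp, sin expression on the slide, reading the superscript 3 as the cube of sin(x).

```python
import math
from functools import reduce

def linear_combination(alphas, functions):
    """Idea 1: f(x) = sum_i alpha_i * g_i(x)."""
    return lambda x: sum(a * g(x) for a, g in zip(alphas, functions))

def compose(*functions):
    """Idea 2: compose(g1, g2, ..., gn)(x) = g1(g2(...gn(x)...))."""
    # Apply the innermost (last listed) function first.
    return lambda x: reduce(lambda value, g: g(value), reversed(functions), x)

def sin_cubed(x):
    return math.sin(x) ** 3

# Shallow: a weighted sum of simple functions.
f_linear = linear_combination([2.0, -1.0], [math.sin, math.cos])
# Deep: f(x) = log(cos(exp(sin^3(x)))), built by nesting simple functions.
f_deep = compose(math.log, math.cos, math.exp, sin_cubed)
```

The composed function is only defined where the inner cosine is positive, for example around x = 0; a deep network faces no such restriction because its layers are chosen to be defined everywhere.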

96
Deep Learning = Hierarchical
Compositionality
“car”

M.D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks”, In ECCV 2014
97
Deep Learning = Hierarchical Compositionality
[Figure: layers, from input to output: visible layer (input pixels), 1st hidden layer (edges), 2nd hidden layer (corners and contours), 3rd hidden layer (object parts), output (object identity: CAR, PERSON, ANIMAL).]
Image credit: Ian Goodfellow
Deep Learning = Hierarchical
Compositionality
Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier → “car”

M.D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks”, In ECCV 2014
99
The Mammalian Visual Cortex is Hierarchical
• The ventral (recognition) pathway in the visual cortex
slide by Marc’Aurelio Ranzato, Yann LeCun

[picture from Simon Thorpe]


Three key ideas
• (Hierarchical) Compositionality
• Cascade of non-linear transformations
• Multiple layers of representations

• End-to-End Learning
• Learning (goal-driven) representations
• Learning to feature extract

• Distributed Representations
• No single neuron “encodes” everything
• Groups of neurons work together
101
Traditional Machine Learning
VISION: hand-crafted features (SIFT/HOG; fixed) → your favorite classifier (learned) → “car”

SPEECH: hand-crafted features (MFCC; fixed) → your favorite classifier (learned) → \ˈdēp\

NLP: “This burrito place is yummy and fun!” → hand-crafted features (Bag-of-words; fixed) → your favorite classifier (learned) → “+”
102
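To make the "fixed" stage of the NLP row concrete, here is a minimal bag-of-words feature extractor. The vocabulary is a hypothetical one chosen for this example; in practice it would be built from a training corpus, and the resulting count vectors would be fed to whichever classifier is trained on top.

```python
from collections import Counter

# Hypothetical fixed vocabulary (an assumption for illustration).
VOCABULARY = ["burrito", "yummy", "fun", "awful", "bland"]


def bag_of_words(text):
    """Fixed (non-learned) feature extractor: count how often each vocabulary word occurs."""
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    counts = Counter(tokens)
    return [counts[word] for word in VOCABULARY]


features = bag_of_words("This burrito place is yummy and fun!")
print(features)  # [1, 1, 1, 0, 0]
```

Word order is discarded entirely, which is one reason such hand-crafted features can limit what the downstream classifier is able to learn.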
More accurate version
(The middle and final stages are the “learned” part.)

VISION: SIFT/HOG (fixed) → K-Means/pooling (unsupervised) → classifier (supervised) → “car”

SPEECH: MFCC (fixed) → Mixture of Gaussians (unsupervised) → classifier (supervised) → \ˈdēp\

NLP: “This burrito place is yummy and fun!” → Parse Tree Syntactic (fixed) → n-grams (unsupervised) → classifier (supervised) → “+”
103
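A rough sketch of the unsupervised middle stage in the VISION row: k-means learns a small codebook from fixed descriptors, and each new descriptor is then encoded by its nearest center. Scalars stand in for real SIFT/HOG descriptors here, and the data and initial centers are assumptions made for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain Lloyd's algorithm on scalars: assign to nearest center, then recompute means."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[k] for k, c in enumerate(clusters)]
    return centers


def encode(value, centers):
    """Mid-level feature: one-hot index of the nearest learned center."""
    nearest = min(range(len(centers)), key=lambda k: abs(value - centers[k]))
    return [1 if k == nearest else 0 for k in range(len(centers))]


descriptors = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]  # stand-ins for real descriptors
centers = kmeans_1d(descriptors, centers=[0.0, 6.0])
print(centers, encode(4.7, centers))
```

The codebook is learned without labels; only the final classifier in the pipeline sees the supervised targets.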


Deep Learning = End-to-End Learning
• A hierarchy of trainable feature transforms
• Each module transforms its input representation into a higher-level one.
• High-level features are more global and more invariant
• Low-level features are shared among categories

Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier

Learned Internal Representations
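A minimal sketch of end-to-end learning with two trainable modules, written in plain Python: the error at the output is backpropagated so that both the feature transform and the classifier are updated together. The XOR task, the hidden size, the learning rate and the squared-error loss are all assumptions chosen for illustration; whether this tiny network fully solves XOR depends on the initialization.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]  # XOR
HIDDEN = 3
# Module 1: trainable feature transform (2 inputs + bias -> HIDDEN tanh units).
w1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(HIDDEN)]
# Module 2: trainable classifier (HIDDEN features + bias -> 1 sigmoid output).
w2 = [random.uniform(-1, 1) for _ in range(HIDDEN + 1)]


def forward(x):
    inputs = [x[0], x[1], 1.0]
    hidden = [math.tanh(sum(w * v for w, v in zip(row, inputs))) for row in w1]
    score = sum(w * h for w, h in zip(w2, hidden + [1.0]))
    return inputs, hidden, 1.0 / (1.0 + math.exp(-score))


def mean_loss():
    """Mean squared error over the four training examples."""
    return sum((forward(x)[2] - y) ** 2 for x, y in DATA) / len(DATA)


def train_step(lr=0.5):
    """One pass of per-example gradient descent; the output error updates BOTH modules."""
    for x, y in DATA:
        inputs, hidden, out = forward(x)
        d_score = 2.0 * (out - y) * out * (1.0 - out)  # d loss / d score
        # Gradient reaching each hidden unit, computed before w2 is changed.
        d_hidden = [d_score * w2[j] * (1.0 - hidden[j] ** 2) for j in range(HIDDEN)]
        for j, h in enumerate(hidden + [1.0]):
            w2[j] -= lr * d_score * h
        for j in range(HIDDEN):
            for i in range(3):
                w1[j][i] -= lr * d_hidden[j] * inputs[i]


initial_loss = mean_loss()
for _ in range(2000):
    train_step()
final_loss = mean_loss()
print(initial_loss, final_loss)
```

Nothing in module 1 is designed by hand: its weights end up wherever the final objective pushes them, which is what "learned internal representations" refers to.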


105
“Shallow” vs Deep Learning
• “Shallow” models
hand-crafted Feature Extractor (fixed) → “Simple” Trainable Classifier (learned)

• Deep models
Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier
Learned Internal Representations


106
Localist representations
• The simplest way to represent things with neural
networks is to dedicate one neuron to each
thing.
• Easy to understand.
• Easy to code by hand
• Often used to represent inputs to a net
• Easy to learn
• This is what mixture models do.
• Each cluster corresponds to one neuron
• Easy to associate with other representations or
responses.
• But localist models are very inefficient whenever
the data has componential structure.

Slide credit: Geoff Hinton Image credit: Moontae Lee 108
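A small sketch of a localist code, and of why it scales poorly when the data has componential structure. The colors and shapes are hypothetical attributes chosen for illustration: with one dedicated unit per thing, 3 colors × 3 shapes already need 9 units.

```python
from itertools import product

COLORS = ["red", "green", "blue"]  # hypothetical attribute values
SHAPES = ["circle", "square", "triangle"]
THINGS = list(product(COLORS, SHAPES))  # 9 distinct "things"


def localist(color, shape):
    """One dedicated unit per thing: a one-hot vector of length len(THINGS)."""
    return [1 if thing == (color, shape) else 0 for thing in THINGS]


code = localist("green", "square")
print(len(code), code)  # 9 [0, 0, 0, 0, 1, 0, 0, 0, 0]
```

The number of units grows with the product of the attribute counts, and the code for a green square shares nothing with the code for a green circle.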


Distributed Representations
• Each neuron must represent something, so
this must be a local representation.
• Distributed representation means a many-to-
many relationship between two types of
representation (such as concepts and
neurons).
• Each concept is represented by many neurons
• Each neuron participates in the representation of
many concepts

Local

Distributed

Slide credit: Geoff Hinton Image credit: Moontae Lee 109
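A minimal sketch of a distributed code for things described by a color and a shape (hypothetical attributes chosen for illustration). Each thing activates several units, and each unit takes part in representing several things.

```python
COLORS = ["red", "green", "blue"]  # hypothetical attribute values
SHAPES = ["circle", "square", "triangle"]


def distributed(color, shape):
    """One unit per color plus one per shape: 6 units cover all 9 combinations."""
    return [int(c == color) for c in COLORS] + [int(s == shape) for s in SHAPES]


green_square = distributed("green", "square")
green_circle = distributed("green", "circle")
print(green_square)  # [0, 1, 0, 0, 1, 0]
print(green_circle)  # [0, 1, 0, 1, 0, 0]
```

The "green" unit is active for both things, so similar things get overlapping codes, and the number of units grows with the sum of the attribute counts rather than their product.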


Power of distributed representations!
Scene Classification: Learning to Recognize Scenes
[Figure: example scenes labeled “bedroom” and “mountain”; distribution of semantic types at each layer]
• Possible internal representations:
• Objects (scene parts?)
• Scene attributes
• Object parts
• Textures
B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba “Object Detectors Emerge in Deep Scene CNNs”, ICLR 2015
Slide credit: Bolei Zhou 110
Three key ideas of deep learning
• (Hierarchical) Compositionality
• Cascade of non-linear transformations
• Multiple layers of representations

• End-to-End Learning
• Learning (goal-driven) representations
• Learning to feature extract

• Distributed Representations
• No single neuron “encodes” everything
• Groups of neurons work together
111
Benefits of Deep/Representation Learning
• (Usually) Better Performance
• “Because gradient descent is better than you”
Yann LeCun

• New domains without “experts”


• RGBD
• Multi-spectral data
• Gene-expression data
• Unclear how to hand-engineer

112
Problems with Deep Learning
• Problem#1: Non-Convex! Non-Convex! Non-Convex!
• Depth>=3: most losses non-convex in parameters
• Theoretically, all bets are off
• Leads to stochasticity
• different initializations → different local minima

• Standard response #1
• “Yes, but all interesting learning problems are non-convex”
• For example, human learning
• Order matters → wave hands → non-convexity

• Standard response #2
• “Yes, but it often works!”
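The effect of initialization on a non-convex objective can be seen even in one dimension. The sketch below runs plain gradient descent on f(x) = x^4 - 3x^2 + x, a toy function chosen for illustration (not a neural network loss), from two different starting points; the step size and step count are also assumptions.

```python
def f(x):
    """A simple non-convex function with two local minima."""
    return x ** 4 - 3 * x ** 2 + x


def grad_f(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1


def gradient_descent(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


left = gradient_descent(-2.0)
right = gradient_descent(2.0)
print(left, f(left))    # converges near x = -1.30, the lower of the two minima
print(right, f(right))  # converges near x = 1.13, a different and worse minimum
```

Both runs use the same algorithm and the same objective; only the starting point differs, yet they end in different minima with different loss values.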
113
Problems with Deep Learning
• Problem#2: Hard to track down what’s failing
• Pipeline systems have “oracle” performances at each step
• In end-to-end systems, it’s hard to know why things are not working

114
[Figure: Pipeline [Fang et al. CVPR15] vs. End-to-End [Vinyals et al. CVPR15]]
115


• Standard response #1
• Tricks of the trade: visualize features, add losses at different layers, pre-
train to avoid degenerate initializations…
• “We’re working on it”

• Standard response #2
• “Yes, but it often works!”
116
Problems with Deep Learning
• Problem#3: Lack of easy reproducibility
• Direct consequence of stochasticity & non-convexity

• Standard response #1
• It’s getting much better
• Standard toolkits/libraries/frameworks now available

• Standard response #2
• “Yes, but it often works!”
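One common way to reduce run-to-run variation is to seed every source of randomness explicitly. The sketch below uses only Python's standard `random` module; the function name `init_weights` is hypothetical, and real frameworks have their own seeding calls and additional sources of nondeterminism that are not covered here.

```python
import random


def init_weights(seed, size=3):
    """Draw 'initial weights' from a private generator seeded explicitly."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


run_a = init_weights(seed=0)
run_b = init_weights(seed=0)
run_c = init_weights(seed=1)
print(run_a == run_b)  # True: same seed, same draws
print(run_a == run_c)  # False: a different seed gives a different initialization
```

Recording the seed alongside the results makes it possible to repeat a specific run, even though different seeds may still reach different local minima.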

117
Results from @INTERESTING_JPG via http://deeplearning.cs.toronto.edu/i2t
D. Cardon et al. “Neurons spike back: The Invention of Inductive Machines and the AI Controversy”, Réseaux n°211/2018
129
https://www.youtube.com/watch?v=EeqwFjqFvJA
130
Next Lecture:
Machine Learning Overview

131
