
Deep Convolutional Networks (CNN) and Computer Vision

Xu Feng
Fudan University
How to Solve a Problem with Neural Networks

Step 1. Design a neural network for the problem
• CNN, RNN, FNN, R-CNN, GAN, ResNet
• Loss Function, Regularization

Step 2. Train the network with data
• SGD+BP, Momentum, Adam
• Dropout, BatchNorm

Step 3. Test and analyze
• Underfitting/Overfitting
• Data Augmentation
• Cross-Validation
(A minimal end-to-end sketch of these three steps follows below.)
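As a concrete illustration of the three steps, here is a minimal sketch in PyTorch; the two-layer network, the synthetic data, and all hyperparameters are illustrative assumptions rather than content from the slides.

```python
import torch
import torch.nn as nn

# Step 1. Design: a small fully-connected network (illustrative choice).
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
loss_fn = nn.CrossEntropyLoss()                      # loss function
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam optimizer

# Synthetic data as a stand-in for a real dataset (assumption).
x_train, y_train = torch.randn(512, 20), torch.randint(0, 3, (512,))
x_test,  y_test  = torch.randn(128, 20), torch.randint(0, 3, (128,))

# Step 2. Train: gradient-descent loop with backpropagation.
for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()   # BP: gradients of the loss w.r.t. all weights
    opt.step()

# Step 3. Test/analyze: held-out accuracy to check over/underfitting.
with torch.no_grad():
    acc = (model(x_test).argmax(dim=1) == y_test).float().mean()
print(f"test accuracy: {acc:.2f}")
```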
Deep Convolutional Networks: A Revolution in Computer Vision
 Image/Speech Recognition

Before 2012:
Low-level Features (fixed): SIFT, HOG, MFCC
→ Mid-level Features (unsupervised): K-means, Sparse Coding, Mix of Gaussians
→ Trainable Classifier (supervised)
→ Object Class

2012-Now: Deep Learning
Trainable Feature Extractor → Trainable Feature Extractor → Trainable Feature Extractor → Trainable Classifier → Object Class
Computer Vision
 Computer vision is an interdisciplinary field that deals with how computers can be made to gain high-level understanding from digital images or videos.
 From the perspective of engineering, it seeks to automate tasks that the human visual system can do.
The Birth of Computer Vision

• Hubel and Wiesel, 1959
…there are simple and complex neurons in the primary visual cortex, and visual processing always starts with simple structures such as oriented edges…

• Kirsch et al., 1959 (Scanner)

• Roberts, 1963
Machine perception of three-dimensional solids… one of the precursors of modern Computer Vision.
The Birth of Computer Vision (cont.)

• 1960s: 1st hype of Artificial Intelligence

• Papert@MIT: Summer Vision Project
…engineer a platform that could perform, automatically, background/foreground segmentation and extract non-overlapping objects from real-world images…

• Marr, 1982: official birth of CV as a scientific field
"Vision: A Computational Investigation into the Human Representation and Processing of Visual Information"
The Development of Computer Vision

Traditional methods:
• HoG, SIFT
• Image processing, etc.

Neural networks:
• Fukushima, 1975: Cognitron
• 1980s: BP, ANN
• 2006: Hinton, Deep Belief Nets
• 2012: AlexNet
• 2016: AlphaGO
The Development of Computer Vision
From cameras and computers to computer vision:

Advances in imaging technology

Growth of the Internet and big data

Advances in computing hardware

Advances in Computer Vision algorithms
Common Applications of Computer Vision
 Classification
 Segmentation
 Object Recognition
 Detection
 Identification
 Motion Analysis
 Egomotion – 3D rigid-body motion estimation
 Tracking
 Optical flow
 Pose/action recognition
 Scene Reconstruction
 Image Restoration
55 Years of Hand-Crafted Features

– The traditional model of pattern recognition (since the late 50's)
– Fixed/engineered features (or fixed kernel) + trainable classifier

Hand-crafted Feature Extractor → "Simple" Trainable Classifier

– Perceptron
Architecture of "Classical" Recognition Systems
"Classic" architecture for pattern recognition
– Speech recognition: 1990-2011
– Object recognition: 2005-2012
– Handwriting recognition (long ago)
– Graphical model has latent variables (locations of parts)

Pipeline:
MFCC, SIFT, HoG, Cuboids (fixed) → low-level features
→ Gaussians, K-Means, Sparse Coding (unsupervised) → mid-level features
→ Pooling (fixed)
→ (linear) Classifier (supervised) → parts, phones, characters
→ Graphical Model (fixed) → object, utterance, word
SIFT Features

 Edge-detection operators
 Histogram statistics
Architecture of Deep Learning-Based Recognition Systems
"Deep" architecture for pattern recognition
– Speech and object recognition: since 2011/2012
– Handwriting recognition: since the early 1990s
– Convolutional net with optional graphical model on top
– Trained purely supervised
– Graphical model has latent variables (locations of parts)

Pipeline:
Filters + ReLU (supervised) → Pooling (fixed)
→ Filters + ReLU (supervised) → Pooling (fixed)
→ Filters + ReLU (supervised)
→ Graphical Model (fixed)
(low-level features → mid-level features → parts, phones, characters → object, utterance, word)
Future Systems: Deep Learning + Structured Prediction
Globally-trained deep architecture
– Handwriting recognition: since the mid 1990s
– Speech recognition: since 2011
– All modules are trained with a combination of unsupervised and supervised learning
– End-to-end training == deep structured prediction

Pipeline:
Filters + ReLU (unsup + supervised) → Pooling (fixed)
→ Filters + ReLU (unsup + supervised) → Pooling (fixed)
→ Filters + ReLU (unsup + supervised)
→ Graphical Model (unsup + supervised)
(low-level features → mid-level features → parts, phones, characters → object, utterance, word)
Deep Learning = Learning Hierarchical Representations
It's deep if it has more than one stage of non-linear feature transformation.

Low-level feature → Mid-level feature → High-level feature → Trainable classifier

Feature visualization of a convolutional net trained on ImageNet, from [Zeiler & Fergus 2013]
Deep Convolutional Networks: Biological Basis
Visual Neural Processing
 Spatial translation invariance
 Translational Invariance

 Spatial cohesion of objects
 Rigidity
Visual Neural Processing
 Objects decompose into parts; parts compose into objects (multiple layers)
 Compositionality

 Spatial linear composability of objects
Convolution Layers

Convolution operation variables:
• $O_i^{(l-1)}$ ($i = 1, \cdots, I$): feature maps of layer $l-1$
• $k_{ji}^{(l)}(u, v)$: trainable convolution kernel
• $b_j^{(l)}$: trainable bias

- Total input to the j-th feature map of layer l at position $(x, y)$:

$$V_j^{(l)}(x, y) = \sum_{i=1}^{I} \sum_{u,v=0}^{F-1} k_{ji}^{(l)}(u, v) \cdot O_i^{(l-1)}(x - u,\, y - v) + b_j^{(l)}$$

- Convolution layer output:

$$O_j^{(l)}(x, y) = f\left(V_j^{(l)}(x, y)\right)$$

- Rectified Linear Unit (ReLU): $f(x) = \max(0, x)$
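A minimal NumPy sketch of this forward pass; the loop-based implementation and the variable names (`O_prev`, `K`, `b`) are illustrative assumptions, chosen for clarity over speed.

```python
import numpy as np

def conv_forward(O_prev, K, b):
    """Valid convolution + ReLU, following the slide's formula.

    O_prev: (I, H, W)     feature maps of layer l-1
    K:      (J, I, F, F)  trainable kernels k_ji(u, v)
    b:      (J,)          trainable biases
    returns O: (J, H-F+1, W-F+1)
    """
    I, H, W = O_prev.shape
    J, _, F, _ = K.shape
    V = np.zeros((J, H - F + 1, W - F + 1))
    for j in range(J):
        for x in range(V.shape[1]):
            for y in range(V.shape[2]):
                # sum over input maps i and kernel offsets (u, v);
                # indexing O(x+u, y+v) realizes the slide's O(x-u, y-v)
                # up to a kernel flip (cross-correlation, as in most DL code)
                for i in range(I):
                    for u in range(F):
                        for v in range(F):
                            V[j, x, y] += K[j, i, u, v] * O_prev[i, x + u, y + v]
        V[j] += b[j]
    return np.maximum(0, V)  # ReLU: f(x) = max(0, x)
```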
Convolution Operation

$$\boldsymbol{o}_j^{(l)} = \max\left(0,\ \sum_{i} \boldsymbol{o}_i^{(l-1)} * \boldsymbol{w}_{ij}^{(l)}\right)$$

Rectified Linear Unit (ReLU)

Dimensionality Reduction: Pooling
 Spatial templates extract features; the spatial resolution of the feature maps decreases
 Information moves from the spatial dimension to the feature dimension

[Figure: Convolution → Pooling]
Pooling
 Take the maximum (Max Pooling) or the mean (Average Pooling) over a local window
 Advantages: reduces data dimensionality; local translation/distortion invariance

- Pooling layer output:

$$O_i^{(l+1)}(x, y) = \max_{u,v = 0, \cdots, G-1} O_i^{(l)}(x \cdot s + u,\ y \cdot s + v)$$

- G: pooling size
- s: stride (spacing between adjacent pooling windows)
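A short NumPy sketch of this max-pooling formula; the function name and the default non-overlapping-window assumption (s = G) are illustrative.

```python
import numpy as np

def max_pool(O, G, s=None):
    """Max pooling per the slide's formula, over GxG windows at stride s.

    O: (H, W) single feature map; returns a ((H-G)//s + 1, (W-G)//s + 1) map.
    """
    s = s or G  # assume non-overlapping windows unless a stride is given
    H, W = O.shape
    out_h, out_w = (H - G) // s + 1, (W - G) // s + 1
    out = np.empty((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            out[x, y] = O[x * s : x * s + G, y * s : y * s + G].max()
    return out

# Example: a 4x4 map pooled with G=2, s=2 gives a 2x2 map.
print(max_pool(np.arange(16.0).reshape(4, 4), G=2))
```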
Multi-Layer Convolution + Pooling
 Multi-layer feature extraction
 Convolution: translation invariance
 Weight sharing: greatly reduces the number of trainable parameters
 Pooling: spatial information is reduced in dimension and moved into the high-dimensional feature domain
 Pooling loses precise localization (which can be recovered via multi-scale feature descriptions)
[Figure: Input Image → Layer-1 Filters → Layer-1 Feature Maps → Layer-2 Filters → Layer-2 Feature Maps → Layer-3 Filters → Layer-3 Feature Maps → Classifier] [He, 2016]
Backpropagation Through Convolution Layers
 Constraint: tied (equal) weights
 We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints.
 So if the weights started off satisfying the constraints, they will continue to satisfy them.

 Gradient of ReLU:

$$f'(x) = \begin{cases} 0 & x < 0 \\ 1 & x \geq 0 \end{cases}$$

 No attenuation when the error propagates back!
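In code, this gradient is a single mask; a tiny NumPy sketch (the function and variable names are illustrative):

```python
import numpy as np

def relu_backward(dL_dout, x):
    # f'(x) is 1 where x >= 0 and 0 elsewhere, so the upstream gradient
    # passes through unattenuated wherever the unit was active.
    return dL_dout * (x >= 0)
```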
Backpropagation Through Pooling Layers
 Store the location of the maximum input

[Figure: the forward pass routes x_max through the max; the backward pass routes the upstream gradient L' only to the position of x_max, and 0 to all other inputs]
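A hedged NumPy sketch of this gradient routing, paired with the `max_pool` sketch above; recording the max location via `np.unravel_index` of the window argmax is one of several possible bookkeeping choices.

```python
import numpy as np

def max_pool_backward(dL_dout, O, G, s=None):
    """Route each output gradient to the argmax position of its window."""
    s = s or G
    dL_dO = np.zeros_like(O)
    for x in range(dL_dout.shape[0]):
        for y in range(dL_dout.shape[1]):
            window = O[x * s : x * s + G, y * s : y * s + G]
            u, v = np.unravel_index(window.argmax(), window.shape)
            # only the max input receives the gradient; all others get 0
            dL_dO[x * s + u, y * s + v] += dL_dout[x, y]
    return dL_dO
```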
Softmax
 Applied to the output layer in the case of multi-class classification
 Estimates the posterior probability of each class:

$$p_i = \frac{\exp\left(O_i^L\right)}{\sum_{j=1}^{K} \exp\left(O_j^L\right)} \quad (i = 1, \cdots, K)$$
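In practice the exponentials are shifted by the maximum logit for numerical stability; a brief NumPy sketch (the max-shift is standard practice, not from the slide):

```python
import numpy as np

def softmax(O):
    # subtracting max(O) leaves p unchanged but avoids overflow in exp
    e = np.exp(O - O.max())
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1
```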
Cross-Entropy (Softmax Loss)
 Cross-entropy loss function: minimizes the discrepancy between the ground truth y and the network prediction p:

$$L(w) = \sum_{j=1}^{K} -t_j \log p_j = -\log p_y$$

 The gradient of the softmax loss balances out between softmax and cross-entropy:

$$\frac{\partial L}{\partial O} = \frac{\partial L}{\partial p} \cdot \frac{\partial p}{\partial O} = p - t$$

 No attenuation and no distortion when the error propagates back!
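A quick NumPy check of this identity; the small logit vector and one-hot target are illustrative:

```python
import numpy as np

O = np.array([2.0, 1.0, 0.1])        # logits
t = np.array([1.0, 0.0, 0.0])        # one-hot ground truth
p = np.exp(O - O.max()); p /= p.sum()

loss = -np.sum(t * np.log(p))        # cross-entropy: -log p_y
grad = p - t                         # gradient w.r.t. the logits O
print(loss, grad)                    # grad stays in [-1, 1]: no blow-up
```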
CNN Network Parameters

[Figure: an x × y × d input map is convolved with n kernels of size x0 × y0 × d, producing an x' × y' × n output; the kernel slides over window positions (1,1), (1,2), (2,1), … with stride s and padding p]
CNN Network Parameters
 Padding p
 Stride s

$$x' = \frac{x + 2p - x_0}{s} + 1$$
$$y' = \frac{y + 2p - y_0}{s} + 1$$

 These formulas apply to both convolution and pooling layers.

 Once the parameters of the input (previous) layer and of the conv/pool layer are given, the size of the output (next) layer is automatically determined. A small helper for this calculation is sketched below.
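A minimal Python helper implementing the size formula; the `assert` for even tiling and the AlexNet-style example numbers are illustrative assumptions.

```python
def conv_output_size(x, x0, p, s):
    """Spatial output size of a conv or pooling layer: (x + 2p - x0)/s + 1."""
    size, rem = divmod(x + 2 * p - x0, s)
    assert rem == 0, "kernel does not tile the padded input evenly"
    return size + 1

# Example: 227x227 input, 11x11 kernel, stride 4, no padding -> 55
print(conv_output_size(227, 11, p=0, s=4))
```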
Fully Connected Layers
 If the input is 1×1×d and the kernel is 1×1×d×n,
 then the output is 1×1×n.
 A fully connected layer is thus a special case of a convolution layer (a two-line check follows below).
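A short NumPy check of this equivalence; the dimensions d = 4, n = 3 are arbitrary.

```python
import numpy as np

d, n = 4, 3
x = np.random.randn(d)        # a 1x1xd input, flattened
W = np.random.randn(n, d)     # n kernels, each of shape 1x1xd

# "Convolving" a 1x1 spatial map with 1x1 kernels is exactly a matrix product:
fc_out = W @ x                                             # fully connected layer
conv_out = np.array([(W[j] * x).sum() for j in range(n)])  # per-kernel 1x1 conv
assert np.allclose(fc_out, conv_out)
```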
Convolutional Network
Filter Bank + non-linearity
→ Pooling
→ Filter Bank + non-linearity
→ Pooling
→ Filter Bank + non-linearity

[LeCun et al. NIPS 1989]

Typical CNN Architecture
Input Data
 1D
 Signals
 2D
 Image
 Voice Cepstrum
 3D
 Tomography
 Video
CNNs and Applications
 CNN variants
 R-CNN
 DenseNet
 U-Net
 CV applications
 Classification
 Object Detection
 Segmentation
 etc.
Classification

Data → Output (confusion matrix):

            Predict A   Predict B   Predict C   Accuracy
Actual A        9           1           0         90%
Actual B        1           7           2         70%
Actual C        1           0           8         80%

Total Accuracy: 80%
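The per-class and total accuracies follow directly from such a matrix; a small NumPy sketch of the computation (the matrix values are copied from the table above, whose rows may be OCR-damaged):

```python
import numpy as np

# rows = actual class, columns = predicted class (from the table above)
C = np.array([[9, 1, 0],
              [1, 7, 2],
              [1, 0, 8]])

per_class = C.diagonal() / C.sum(axis=1)   # per-class accuracy: diag / row sum
total = C.diagonal().sum() / C.sum()       # overall accuracy: trace / total count
print(per_class, total)
```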


2012: AlexNet

 Trained on ImageNet, a database of over 15 million labeled images in more than 22,000 categories.
 Used ReLU activations.
 Data augmentation: image translation, reflection, patch extraction.
 Used dropout.
 Trained with mini-batch stochastic gradient descent, with specified momentum and weight-decay values.
 Trained on two GTX 580 GPUs for five to six days.
2013: ZF Net

 AlexNet was trained on 15 million images, while ZF Net used only 1.3 million.
 AlexNet used 11×11 filters in the first layer, whereas ZF Net used 7×7 filters.
 Trained on a single GTX 580 GPU for 12 days.
 Developed a visualization technique: the "Deconvolutional Network" (DeconvNet).
DeconvNet
2014: VGG Net
 3×3 filters; a stack of three conv layers has a 7×7 effective receptive field.
 The number of filters doubles after each maxpool layer.
 Used scale-jittering data augmentation during training.
 Trained on four Nvidia Titan Black GPUs for two to three weeks.
2015: GoogLeNet

 Uses 9 Inception modules in the overall architecture, with over 100 layers in total.
 No fully connected layers: average pooling is used instead, going from a 7×7×1024 volume to 1×1×1024, which saves a huge number of parameters (12× fewer than AlexNet).
 Concepts from R-CNN were used in the detection model.
 Training could be completed within a week on high-end GPUs.
2015: ResNet
About ResNet:
 ResNet
 AlphaGO Zero
Object Detection & Recognition
Classification + localization: results
Classification + localization: multiscale sliding window

– Apply a convnet with a sliding window over the image at multiple scales
– Important note: it's very cheap to slide a convnet over an image
– Just compute the convolutions over the whole image and replicate the fully-connected layers
Classification + Localization: sliding window +
bounding box regression
– Apply convnet with a sliding window over the image at multiple scales
– For each window, predict a class and bounding box parameters
– Even if the object is not completely contained in the viewing window, the convnet
can predict where it thinks the object is.
Classification + Localization: sliding window +
bounding box regression + bbox voting

– Apply convnet with a sliding window over the image at multiple scales
– For each window, predict a class and bounding box parameters
– Compute an “average” bounding box, weighted by scores
2013: R-CNN

 R-CNN splits object recognition into two steps: region proposal + classification.
2015: Fast R-CNN
YOLO: You Only Look Once
 Core idea: YOLO casts object detection as a regression problem, using a single end-to-end network that goes directly from the raw input image to object locations and classes.

 Main strengths: fast; low background false-positive rate; strong generalization.

 Detection pipeline: (figure)

 Network structure: (figure)
YOLO: You Only Look Once
 Model:
 1. The image is divided into an S×S grid; each cell is responsible for detecting objects whose centers fall inside that cell.
 2. Each cell predicts bounding-box locations (x, y, w, h) and a confidence score combining the probability that the box contains an object with the per-class probabilities.

 Comparison (figure: Fast R-CNN vs. YOLO error profiles):
 • Correct: correct class and IOU > .5
 • Localization: correct class, .1 < IOU < .5
 • Similar: class is similar, IOU > .1
 Background errors are significantly reduced, and YOLO is the fastest detector while maintaining its detection rate. (A small IoU sketch follows below.)

Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 779-788.
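Since the comparison above is phrased in terms of IoU thresholds, here is a brief sketch of the IoU computation between two boxes; the (x1, y1, x2, y2) corner convention is an assumption.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# e.g. a detection counts as "Correct" when the class matches and IoU > 0.5
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```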
SSD
 YOLO casts object detection as a regression problem.

 The Single Shot Detector (SSD), proposed in 2015, combines YOLO's regression idea with the anchor mechanism of Faster R-CNN (its Region Proposal Network).

 Like YOLO, SSD obtains object locations and classes via regression.
 It uses Faster R-CNN's anchor mechanism to establish the correspondence between locations and features.
 It retains YOLO's speed together with Faster R-CNN's accuracy.

[1] Liu W, Anguelov D, Erhan D, et al. SSD: Single shot multibox detector. European Conference on Computer Vision. Springer, Cham, 2016: 21-37.
[2] Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 779-788.
SSD
 Single Shot Detector (SSD) network structure: (figure)

 Detection examples of the SSD512 model on COCO test-dev. Detections with scores above 0.6 are shown; each color corresponds to one object category.
Semantic Segmentation
 Labeling every pixel with the object it belongs to
 Would help identify obstacles, targets, landing sites, dangerous areas
 Would help line up depth maps with edge maps

[Farabet et al. ICML 2012, PAMI 2013]

Scene parsing/labeling

[Farabet et al. ICML 2012, PAMI 2013]
2017: Mask R-CNN

Mask R-CNN detects object locations and classes while simultaneously segmenting the objects inside the predicted boxes.

About Mask R-CNN

Mask R-CNN = Faster R-CNN + Mask

Mask R-CNN builds on the Faster R-CNN architecture: on each region proposed by Faster R-CNN's RPN, it predicts a binary mask, so that the network segments the objects in a region while detecting them.
Fully-Convolutional Network
 Using a plain CNN for per-pixel classification (segmentation) has drawbacks:
 Very high memory cost
 Low computational efficiency
 The patch size limits the size of the receptive field
 FCN converts the fully connected layers of a traditional CNN into convolution layers
U-Net
In an ordinary autoencoder, the input and output cannot share low-level information, so low-level detail is easily lost. To relieve this information bottleneck in the generator, skip connections are added between mirrored layers; the resulting network is called U-Net.

GAN loss term:

$$L_{GAN}(G) = -\sum_i E_{z \sim p_{data}(i)}\left[\log D\left(G(z)\right)\right]$$

Traditional loss term:

$$L_{L1}(G) = \sum_i E_{x \sim p_{data}(i),\, z \sim p_{data}(j)}\left[\left\|x - G(z)\right\|_1\right]$$

Joint loss:

$$L(G) = \beta_1 L_{GAN}(G) + \beta_2 L_{L1}(G)$$
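A hedged PyTorch sketch of this joint objective; the discriminator call, the beta weights, and realizing the GAN term as a binary cross-entropy on a sigmoid-output D are illustrative assumptions in the style of pix2pix-like setups, not the slides' exact implementation.

```python
import torch
import torch.nn.functional as F

def generator_loss(D, G, z, x_target, beta1=1.0, beta2=100.0):
    """Joint loss L(G) = beta1 * L_GAN(G) + beta2 * L_L1(G) (sketch).

    Assumes D outputs a probability in [0, 1] (sigmoid output).
    """
    fake = G(z)
    # GAN term: push D(G(z)) toward "real" (label 1), i.e. -log D(G(z))
    d_out = D(fake)
    l_gan = F.binary_cross_entropy(d_out, torch.ones_like(d_out))
    # L1 term: keep the generated image close to the target
    l_l1 = F.l1_loss(fake, x_target)
    return beta1 * l_gan + beta2 * l_l1
```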
U-Net

(a) As shown in figure (a), U-Net can perform many of the image-to-image translation tasks that arise in image processing and computer vision.

(b) Figure (b) shows that, compared with the GAN loss, images generated with the traditional loss alone are more blurred.
SegNet: A Semantic Segmentation Network
➢ Origin: proposed by the University of Cambridge, targeting autonomous driving and intelligent robots.
➢ Network structure: RGB image input → Encoder (13 layers, VGG-16) → Decoder (13 layers) → Softmax → RGB class-map output, with pooling indices passed from encoder to decoder.
➢ Innovation: the decoder
• reuses the max-pooling indices saved during the encoder's downsampling stage (see the sketch below)
• which reduces the number of model parameters

Figure 1: SegNet network structure. Figure 2: SegNet's decoder.
[1] Computer Vision and Robotics Group at the University of Cambridge, UK. http://mi.eng.cam.ac.uk/projects/segnet/
[2] Badrinarayanan, Vijay, Alex Kendall, and Roberto Cipolla. "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation." arXiv preprint arXiv:1511.00561 (2015).
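A minimal PyTorch sketch of the pooling-indices trick: `max_pool2d` with `return_indices=True` records where each maximum came from, and `max_unpool2d` scatters decoder values back to exactly those positions. The tensor sizes are illustrative.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)  # an encoder feature map (illustrative size)

# Encoder: downsample and remember where each maximum came from.
pooled, indices = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)

# ... decoder computation on `pooled` would go here ...

# Decoder: upsample by scattering values back to the remembered positions,
# so no upsampling weights need to be learned for this step.
unpooled = F.max_unpool2d(pooled, indices, kernel_size=2, stride=2)
print(pooled.shape, unpooled.shape)  # (1,64,16,16) -> (1,64,32,32)
```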
SegNet: A Semantic Segmentation Network
➢ Bayesian SegNet network structure: RGB image input → Encoder (13 layers, VGG-16) → Decoder (13 layers) → Softmax → RGB class-map output plus a confidence map, with pooling indices passed from encoder to decoder.

➢ Bayesian SegNet innovation: adds DropOut layers among the convolution layers in order to output uncertainty.

➢ Bayesian SegNet advantages:
• a 2-3% improvement over SegNet
• better results on small datasets

[1] Kendall, Alex, Vijay Badrinarayanan, and Roberto Cipolla. "Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding." arXiv preprint arXiv:1511.02680 (2015).
ConvNet for Stereo Matching

– Using a ConvNet to learn a similarity measure between image patches

[LeCun, 2016]
Pose Estimation and Attribute Recovery with ConvNets

Pose-Aligned Network for Deep Attribute Modeling [Zhang et al. CVPR 2014] (Facebook AI Research)

Real-time hand pose recovery [Tompson et al. Trans. on Graphics 2014]

Body pose estimation [Tompson et al. ICLR 2014]

[LeCun, 2016]
References
– NVIDIA/NYU, Deep Learning Institute Teaching Kit
– Goodfellow, Deep Learning, MIT Press
– LeCun, Deep Learning Tutorial, 2016
– Lee, Deep Learning Tutorial, 2017
Thank You
Q&A

[email protected]
www.emwlab.fudan.edu.cn
