Deep Learning Course for NLP
Information Retrieval Lab
Faculty of Computer Science
Universitas Indonesia, 2017
Deep Learning Tsunami
"Deep Learning waves have lapped at the shores of computational linguistics for several years now, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences."
- Dr. Christopher D. Manning, Dec 2015
..., it is recommended that we first study the following topics:
• Gradient Descent/Ascent
References / Reading
• Andrej Karpathy's blog
  • http://karpathy.github.io/2015/05/21/rnn-effectiveness/
• Colah's blog
  • https://colah.github.io/
Deep Learning vs Machine Learning
• Deep Learning is a subfield of Machine Learning
• Machine Learning is a subfield of Artificial Intelligence
Machine Learning (rule-based)
Legend: designed by hand vs. inferred automatically

Input: "Buku ini sangat menarik dan penuh manfaat"
("This book is very interesting and useful")

Hand-written rule:
  if contains('menarik'):
      return positive
  ...

Predicted label: positive
Machine Learning (classical ML)
Legend: designed by hand vs. inferred automatically

Input: "Buku ini sangat menarik dan penuh manfaat"

Feature Engineering!
Hand-designed feature extractor: for example, TF-IDF features, syntactic information from a POS tagger, etc. A learned classifier then maps these features to the output.

Predicted label: positive
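As an illustration of this classical-ML recipe, here is a minimal sketch (assuming scikit-learn is available; the training texts and labels are made-up examples) that pairs a hand-designed TF-IDF feature extractor with a learned classifier:

# Minimal sketch of the classical ML pipeline: hand-designed features + learned classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical examples, not from the slides).
texts = ["Buku ini sangat menarik dan penuh manfaat", "Buku ini membosankan"]
labels = ["positive", "negative"]

# TF-IDF is the hand-designed feature extractor; logistic regression is the learned part.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["Buku ini sangat menarik"]))  # e.g. ['positive']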
Machine Learning (Representation Learning)
Legend: designed by hand vs. inferred automatically

Input: "Buku ini sangat menarik dan penuh manfaat"

Learned feature extractor: for example, a Restricted Boltzmann Machine, an autoencoder, etc.

Predicted label: positive
Machine Learning (Deep Learning)
Legend: designed by hand vs. inferred automatically

Input: "Buku ini sangat menarik dan penuh manfaat"

Simple features → complex/high-level features: the representation is learned layer by layer.

Predicted label: positive
History

The Perceptron (Rosenblatt, 1958)
• The Perceptron consists of 3 layers: Sensory, Association, and Response.
• The activation function is a non-linear function. In Rosenblatt's perceptron, the activation function is simple thresholding (a step function).

Rosenblatt, Frank. "The perceptron: a probabilistic model for information storage and organization in the brain." Psychological Review 65.6 (1958): 386.
https://www.datarobot.com/blog/a-primer-on-deep-learning/
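A minimal sketch of a Rosenblatt-style perceptron unit with a step activation (NumPy assumed; the weights and input values are made-up):

import numpy as np

def step(z):
    # Simple thresholding activation, as in Rosenblatt's perceptron.
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    # Weighted sum of the inputs followed by the step non-linearity.
    return step(np.dot(w, x) + b)

x = np.array([1.0, 0.0, 1.0])    # made-up input features
w = np.array([0.5, -0.2, 0.3])   # made-up weights
print(perceptron(x, w, b=-0.4))  # -> 1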
The Fathers of Deep Learning(?)
• In 2006, these three researchers developed ways to overcome the problems of training deep neural networks and to put them to use.
• Before that, many people had given up on neural networks, both on their usefulness and on how to train them.
https://www.datarobot.com/blog/a-primer-on-deep-learning/

The Fathers of Deep Learning(?)
• Automated learning of data representations and features is what the hype is all about!
https://www.datarobot.com/blog/a-primer-on-deep-learning/
Why wasn't "deep learning" successful before?
• Many of the key ideas had already been invented much earlier.
• Even the Long Short-Term Memory (LSTM) network, which is now widely used in NLP, was invented in 1997 by Hochreiter & Schmidhuber.
Why wasn't "deep learning" successful before?
• Computers were slow. So the neural networks of the past were tiny. And tiny neural networks cannot achieve very high performance on anything.
• [...] deep learning that works in practice.

"The success of Deep Learning hinges on a very fortunate fact: that well-tuned and carefully-initialized stochastic gradient descent (SGD) can train LDNNs on problems that occur in practice. It is not a [...] And yet, somehow, SGD seems to be very good at training those large deep neural networks on the tasks that we care about. The problem of training neural networks is NP-hard, and in fact there exists a family of datasets such that the problem of finding the best neural network with three hidden units is NP-hard. And yet, SGD just solves it in practice."
Artificial Neural Networks (ANNs)
• And a neural network is really just a stack of mathematical functions.

Express the problem as a function F (with parameters θ), then automatically search for the parameters θ so that F produces exactly the desired output:

  Y = F(X; θ)

  X: "Buku ini sangat menarik dan penuh manfaat"
What is Deep Learning?
For deep learning, that function usually consists of a stack of many, usually similar, functions:

  Y = F(F(F(X; θ1); θ2); θ3)

The picture of this stack is often called a computational graph; the stack of functions itself is often called a stack of layers:

  F(X; θ3)
  F(X; θ2)
  F(X; θ1)

  "Buku ini sangat menarik dan penuh manfaat"
What is Deep Learning?
• The most common/well-known layer is the Fully-Connected Layer:

  Y = F(X) = f(W·X + b)

• "a weighted sum of its inputs, followed by a non-linear function"
• Here X ∈ R^N (N input units), W ∈ R^(M×N), b ∈ R^M (M output units), and f is the non-linearity; each output unit computes f(Σ_i w_i x_i + b).
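A minimal NumPy sketch of such a fully-connected layer (the sizes M = 2, N = 3 and the random weight values are made-up; tanh stands in for the non-linearity f):

import numpy as np

def fully_connected(X, W, b, f=np.tanh):
    # Weighted sum of the inputs followed by a non-linearity: Y = f(W.X + b)
    return f(W @ X + b)

N, M = 3, 2                      # N input units, M output units
X = np.array([0.5, -1.0, 2.0])   # X in R^N (made-up input)
W = np.random.randn(M, N)        # W in R^(M x N)
b = np.zeros(M)                  # b in R^M
print(fully_connected(X, W, b))  # Y in R^M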
Why do we need it to be "Deep"?
• Humans organize their ideas and concepts hierarchically
• Humans first learn simpler concepts and then compose them to represent more abstract ones
• Engineers break up solutions into multiple levels of abstraction and processing

Y. Bengio, Deep Learning, MLSS 2015, Austin, Texas, Jan 2014
(Bengio & Delalleau, 2011)
Neural Networks
One layer:
  Y = f(W1·X + b1)

Neural Networks
Two layers:
  Y = f(W2·f(W1·X + b1) + b2)
  i.e.  H1 = f(W1·X + b1),  Y = f(W2·H1 + b2)

Neural Networks
Three layers:
  Y = f(W3·f(W2·f(W1·X + b1) + b2) + b3)
  i.e.  H1 = f(W1·X + b1),  H2 = f(W2·H1 + b2),  Y = f(W3·H2 + b3)
A mathematical reason: why must it be "deep"?
• A feed-forward network with a single hidden layer containing enough units can approximate any continuous function arbitrarily well (the universal approximation theorem).

A mathematical reason: why must it be "deep"?
However...
• "Enough units" can be a very large number. There are functions representable with a small but deep network that would require exponentially many units with a single layer.
• The proof only says that a shallow network exists; it does not say [...]
Training Neural Networks
• We randomly initialize all of the parameters W1, b1, W2, b2, W3, b3.

(Figure: a 3-layer network with weights W(1), W(2), W(3) over the input "Buku ini sangat baik dan mendidik".)
Training Neural Networks
• Initialize the trainable parameters randomly
• Loop: x = 1 → #epoch:
  • Pick a training example, e.g. x = "Buku ini sangat baik dan mendidik" with true label y' = (1, 0) over (pos, neg)
  • Compute the output by doing a feed-forward pass through the network (W(1), W(2), ...), giving a predicted label, e.g. y = (0.3, 0.7)
  • Compute the loss L between the predicted output y and the true label y'
  • ...
https://github.com/joshdk/pygradesc
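The training loop above, written out as a minimal NumPy sketch for a tiny 2-layer classifier (the architecture, toy data, and learning rate are made-up for illustration; the gradient step uses a sigmoid output with binary cross-entropy):

import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 4 examples with 3 features, binary labels (made-up).
X = rng.normal(size=(4, 3))
Y = np.array([1, 0, 1, 0])

# Initialize trainable parameters randomly.
W1, b1 = rng.normal(size=(4, 3)) * 0.1, np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)) * 0.1, np.zeros(1)
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(100):          # Loop: x = 1 -> #epoch
    for x, y in zip(X, Y):        # Pick a training example
        # Feed-forward pass
        h = np.tanh(W1 @ x + b1)
        y_hat = sigmoid(W2 @ h + b2)[0]
        # Loss (binary cross-entropy) between prediction and true label
        loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        # Back-propagate and update the parameters (one SGD step)
        d_out = y_hat - y                     # dL/dz for sigmoid + cross-entropy
        dW2 = d_out * h[None, :]
        db2 = np.array([d_out])
        dh = (W2[0] * d_out) * (1 - h ** 2)   # back through tanh
        dW1 = np.outer(dh, x)
        db1 = dh
        W2 -= lr * dW2
        b2 -= lr * db2
        W1 -= lr * dW1
        b1 -= lr * db1

# Predicted probabilities for the 4 training examples after training.
print(sigmoid(W2 @ np.tanh(W1 @ X.T + b1[:, None]) + b2).round(2))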
More Technical Details …

Gradient Descent (GD)
Used to find a configuration of the parameters such that the cost function becomes optimal, in this case reaching a local minimum.

For example, suppose we start from x = 2.0 and follow the negative gradient until we reach a local minimum.
Gradient Descent (GD)
Algorithm:
  x_(t+1) = x_t − α_t · f′(x_t)
  If |f′(x_(t+1))| < ε then return "converged on critical point"
  If |x_t − x_(t+1)| < ε then return "converged on x value"

Tip: choose a step size α_t that is neither too small nor too large.
Gradient Descent (GD)
For a parameter vector θ and cost function f:

  initialize θ^(0)
  while not converged:
      θ^(t+1) = θ^(t) − α_t · ∇_θ f(θ^(t))

https://github.com/joshdk/pygradesc
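A minimal sketch of the update rule above for a one-dimensional function (the cost f(x) = x^4 − 3x^3 + 2, its derivative, the starting point, and the step size are made-up illustrations):

def f_prime(x):
    # Derivative of the made-up cost f(x) = x**4 - 3*x**3 + 2
    return 4 * x**3 - 9 * x**2

x, alpha = 2.0, 0.01                  # start from x = 2.0 with a fixed step size
for t in range(1000):
    x_new = x - alpha * f_prime(x)
    if abs(f_prime(x_new)) < 1e-8:    # converged on a critical point
        break
    if abs(x_new - x) < 1e-12:        # converged on the x value
        break
    x = x_new
print(x)  # approaches the local minimum at x = 2.25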
Logistic Regression
Used for binary classification (yes or no):

  P(y = 1 | x; θ) = σ(θᵀx)
  P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ)

where the sigmoid is
  σ(z) = 1 / (1 + e^(−z))

Logistic Regression
Viewed as a single unit with inputs x1, x2, x3 (plus a bias input +1) and weights θ1, θ2, θ3, θ0, with the sigmoid function as the activation function:

  P(y = 1 | x; θ) = σ(θ0 + θ1 x1 + θ2 x2 + θ3 x3)
  P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ)
Logistic Regression
What if θ has not been determined yet?
We can estimate the parameter θ using the available training data {(x(1), y(1)), (x(2), y(2)), …, (x(n), y(n))}.

Logistic Regression
Learning
Let
  h_θ(x) = σ(θ0 + θ1 x1 + … + θn xn) = σ(θ0 + Σ_(i=1..n) θi xi)

Maximizing the log-likelihood ℓ(θ) = log L(θ) is equivalent to minimizing the cost
  J(θ) = − Σ_(i=1..m) [ y(i) log h_θ(x(i)) + (1 − y(i)) log(1 − h_θ(x(i))) ]
Logistic Regression
Learning
Taking the derivative of the cost gives, for each parameter θj:

  ∂J(θ)/∂θj = (h_θ(x) − y) · xj

Gradient descent then repeats, until convergence, the updates

  while not converged:
      θ1 := θ1 − α · (1/m) Σ_(i=1..m) (h_θ(x(i)) − y(i)) · x1(i)
      ...
      θn := θn − α · (1/m) Σ_(i=1..m) (h_θ(x(i)) − y(i)) · xn(i)
Logistic Regression
Learning
  initialize θ1, θ2, …, θn
  while not converged:
      update each θj as above

Logistic Regression
Learning
In practice (stochastic/mini-batch gradient descent), the gradient is computed as the average/sum over a mini-batch of samples (e.g., 32 or 64 samples) rather than over the entire training set.
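A minimal NumPy sketch of this mini-batch update (the toy data, batch size, and learning rate are made-up; θ0 is handled by appending a constant 1 feature):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)        # toy binary labels
X = np.hstack([np.ones((100, 1)), X])            # prepend 1 for the bias theta_0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(4)
alpha, batch_size = 0.1, 32
for epoch in range(200):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        h = sigmoid(X[b] @ theta)                # h_theta(x) for the mini-batch
        grad = X[b].T @ (h - y[b]) / len(b)      # average of (h - y) * x over the batch
        theta -= alpha * grad                    # gradient descent step
print(theta)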
Multilayer Neural Network (Multilayer Perceptron)
(Figure: the single logistic-regression unit from the previous slides, generalized into a network of such units.)

For example, consider a 3-layer NN with 3 input units, 2 hidden units, and 2 output units (weights W(1)_ij between the input and hidden layers, W(2)_ij between the hidden and output layers).

From the previous example, there are 2 units in the output layer. This setup is commonly used for binary classification: the first unit produces the probability of the first class, and the second unit the probability of the second class.
Multilayer Neural Network (Multilayer Perceptron)
To compute the output at the hidden layer:

  z(2) = W(1) x + b(1)
  a1(2) = f(z1(2))
  a2(2) = f(z2(2))

This is just a matrix multiplication!

  z(2) = [ W11(1)  W12(1)  W13(1) ] [ x1 ]   [ b1(1) ]
         [ W21(1)  W22(1)  W23(1) ] [ x2 ] + [ b2(1) ]
                                    [ x3 ]

Multilayer Neural Network (Multilayer Perceptron)
In vector form:

  z(2) = W(1) x + b(1)
  a(2) = f(z(2))
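A minimal NumPy sketch of exactly this computation for the 3-2-2 network above (the weight and input values are made-up):

import numpy as np

f = np.tanh                                   # some non-linearity f
x  = np.array([1.0, 2.0, 3.0])                # 3 input units (made-up values)
W1 = np.array([[0.1, 0.2, 0.3],               # W(1) has shape (2 hidden, 3 input)
               [0.4, 0.5, 0.6]])
b1 = np.array([0.01, 0.02])                   # b(1) has shape (2,)

z2 = W1 @ x + b1                              # z(2) = W(1) x + b(1): one matrix multiply
a2 = f(z2)                                    # a(2) = f(z(2)), the hidden activations
print(z2, a2)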
Multilayer Neural Network (Multilayer Perceptron)
Learning
The per-example cost is the squared error

  J(W, b; x, y) = (1/2) ‖h_(W,b)(x) − y‖² = (1/2) Σ_j ( h_(W,b)(x)_j − y_j )²

and the overall cost adds a regularization term over all weights:

  (λ/2) Σ_(l=1..n_l−1) Σ_(i=1..s_l) Σ_(j=1..s_(l+1)) ( W_ji(l) )²
Multilayer Neural Network (Multilayer Perceptron)
Learning
  initialize W, b
  while not converged:
      W_ij(l) := W_ij(l) − α · ∂J(W,b)/∂W_ij(l)
      b_i(l)  := b_i(l)  − α · ∂J(W,b)/∂b_i(l)

How do we compute the partial derivatives ∂J(W,b)/∂W_ij(l) and ∂J(W,b)/∂b_i(l)?
Multilayer Neural Network (Multilayer Perceptron)
Learning
The per-example derivatives of J(W,b; x, y) determine the overall partial derivative of J(W,b), e.g.

  ∂J(W,b)/∂b_i(l) = (1/m) Σ_(k=1..m) ∂J(W,b; x(k), y(k)) / ∂b_i(l)
Back-Propagation
1. Run the feed-forward pass.
2. For each output unit i in layer n_l (the output layer):
     δ_i(n_l) = ∂J(W,b; x, y)/∂z_i(n_l) = ( a_i(n_l) − y_i ) · f′( z_i(n_l) )
3. Propagate the errors backwards through the hidden layers:
     δ_i(l) = ( Σ_j W_ji(l) δ_j(l+1) ) · f′( z_i(l) )
4. The partial derivatives are then
     ∂J(W,b; x, y)/∂W_ij(l) = a_j(l) · δ_i(l+1)
     ∂J(W,b; x, y)/∂b_i(l)  = δ_i(l+1)
Multilayer Neural Network (Multilayer Perceptron)
Learning
Back-Propagation: an example of computing a gradient at the output layer.

(Figure: output pre-activations z1(3), z2(3) with activations a1(3), a2(3), connected to the hidden activations a1(2), a2(2) through weights W11(2), W12(2), W21(2), W22(2) and biases b1(2), b2(2).)

  J(W, b; x, y) = (1/2) ( a1(3) − y1 )² + (1/2) ( a2(3) − y2 )²
  a1(3) = f( z1(3) ),   z1(3) = W11(2) a1(2) + W12(2) a2(2) + b1(2)

By the chain rule,

  ∂J(W,b; x, y)/∂W12(2) = ∂J/∂a1(3) · ∂a1(3)/∂z1(3) · ∂z1(3)/∂W12(2)
                        = ( a1(3) − y1 ) · f′( z1(3) ) · a2(2)
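A small NumPy check of this chain-rule result, comparing the analytic gradient with a finite-difference estimate (the choice of sigmoid for f and all numeric values are made-up):

import numpy as np

f  = lambda z: 1 / (1 + np.exp(-z))        # made-up choice for f (sigmoid)
df = lambda z: f(z) * (1 - f(z))           # its derivative f'

a_hidden = np.array([0.3, 0.7])            # a1(2), a2(2): made-up hidden activations
W = np.array([0.5, -0.2])                  # W11(2), W12(2)
b, y1 = 0.1, 1.0                           # bias b1(2) and target y1

def half_sq_error(W12):
    z1 = W[0] * a_hidden[0] + W12 * a_hidden[1] + b
    return 0.5 * (f(z1) - y1) ** 2

z1 = W[0] * a_hidden[0] + W[1] * a_hidden[1] + b
analytic = (f(z1) - y1) * df(z1) * a_hidden[1]           # (a1(3)-y1) f'(z1(3)) a2(2)
eps = 1e-6
numeric = (half_sq_error(W[1] + eps) - half_sq_error(W[1] - eps)) / (2 * eps)
print(analytic, numeric)                   # the two values agree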
Sensitivity – the Jacobian Matrix
The Jacobian J is the matrix of partial derivatives of the network output vector y with respect to the input vector x:

  J_ki = ∂y_k / ∂x_i
Recurrent Neural Networks
(Figure: an RNN unrolled over time, with inputs X1…X5, hidden states h1…h5, and outputs O1…O5.)
• …
• The point is: there are sequences.

(Figure, from Karpathy's blog: different input/output shapes)
• Not RNNs (vanilla feed-forward NNs)
• Sequence input (e.g. sentence classification)
• Sequence output (e.g. image captioning)
• Sequence input/output (e.g. machine translation)

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Recurrent Neural Networks (RNNs)
RNNs combine the input vector with their state vector using a fixed (but learned) function to produce a new state vector.

Suppose there are I input units, K output units, and H hidden (state) units:

  x_t ∈ R^(I×1),  h_t ∈ R^(H×1),  y_t ∈ R^(K×1)
  h_t = W^(xh) x_t + W^(hh) s_(t−1),   with h_0 = 0
  s_t = tanh(h_t)
  y_t = W^(hy) s_t
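A minimal NumPy sketch of this forward recurrence (the sizes I = 4, H = 3, K = 2, the inputs, and the random weights are made-up):

import numpy as np

rng = np.random.default_rng(0)
I, H, K, T = 4, 3, 2, 5                 # made-up sizes and sequence length
W_xh = rng.normal(size=(H, I)) * 0.1
W_hh = rng.normal(size=(H, H)) * 0.1
W_hy = rng.normal(size=(K, H)) * 0.1

xs = rng.normal(size=(T, I))            # made-up input sequence x_1 .. x_T
s = np.zeros(H)                         # state from "h_0 = 0"
ys = []
for x_t in xs:
    h_t = W_xh @ x_t + W_hh @ s         # h_t = W^(xh) x_t + W^(hh) s_(t-1)
    s = np.tanh(h_t)                    # s_t = tanh(h_t)
    ys.append(W_hy @ s)                 # y_t = W^(hy) s_t
print(np.stack(ys).shape)               # (T, K)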
Recurrent Neural Networks (RNNs)
Back-Propagation Through Time (BPTT)
The loss function depends on the activation of the hidden layer not only through its influence on the output layer, but also through its influence on the hidden layer at the next time step.

  δ_(i,t)^(y) = ∂L/∂y_(i,t)   and   δ_(i,t)^(h) = ∂L/∂h_(i,t)
  δ_(i,t)^(h) = ( Σ_(j=1..K) W_(i,j)^(hy) δ_(j,t)^(y) + Σ_(n=1..H) W_(i,n)^(hh) δ_(n,t+1)^(h) ) · f′(h_(i,t))

at every step except the rightmost one; at the end of the sequence, δ_(i,T+1)^(h) = 0.

Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks
Recurrent Neural Networks (RNNs)
Back-Propagation Through Time (BPTT)
Since the same weights are reused at every timestep, we sum over the whole sequence to get the derivatives with respect to the network weights:

  ∂L/∂W_(i,j)^(hy) = Σ_(t=1..T) δ_(j,t)^(y) · s_(i,t)
  ∂L/∂W_(i,j)^(hh) = Σ_(t=1..T) δ_(j,t)^(h) · s_(i,t−1)
  ∂L/∂W_(i,j)^(xh) = Σ_(t=1..T) δ_(j,t)^(h) · x_(i,t)
Recurrent Neural Networks (RNNs)
Back-Propagation Through Time (BPTT)

For example, for the state-to-state parameters W^(hh) (recall h_t = W^(xh) x_t + W^(hh) s_(t−1)):

  ∂L_t/∂W^(hh) = Σ_(k=1..t) (∂L_t/∂h_t) (∂h_t/∂h_k) (∂h_k/∂W^(hh))

These terms are called temporal contributions: they describe how W^(hh) at step k affects the cost at the later steps (t > k).
Recurrent Neural Networks (RNNs)
Vanishing & Exploding Gradient Problems
Bengio et al. (1994) said that "the exploding gradients problem refers to the large increase in the norm of the gradient during training. Such events are caused by the explosion of the long term components, which can grow exponentially more than short term ones."

And "the vanishing gradients problem refers to the opposite behaviour, when long term components go exponentially fast to norm 0, making it impossible for the model to learn correlation between temporally distant events."

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks.

Recurrent Neural Networks (RNNs)
Vanishing & Exploding Gradient Problems
The sequential Jacobian is commonly used to analyse how RNNs make use of context.
2) Define a new architecture inside the RNN cell, such as the Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997).

Variant: Bi-Directional RNNs
(Figure: forward and backward hidden layers over the same input sequence.)
Long Short-Term Memory (LSTM)
2. These blocks can be thought of as a differentiable version of the memory chips in a digital computer.
3. Each block contains one or more self-connected memory cells and three multiplicative units: the input, output, and forget gates.

The gates allow LSTM memory cells to store and access information over long periods of time, thereby mitigating the vanishing gradient problem.

Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks
S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.
Long Short-Term Memory (LSTM)
(Figure: an LSTM memory block with its input, output, and forget gates.)

Computation in the LSTM
(Figure: the LSTM gate and cell equations.)

Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks
S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.
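Since the computation itself is only shown as a figure, here is a minimal NumPy sketch of one step of a standard gated LSTM cell (input, forget, and output gates with a tanh candidate; the sizes, random weights, and input are made-up, and details such as peephole connections are omitted):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # One step of a standard LSTM cell: gates use sigmoid, the candidate uses tanh.
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate cell value
    c_t = f * c_prev + i * g                               # new cell state
    h_t = o * np.tanh(c_t)                                 # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
I, H = 4, 3                                                # made-up sizes
W = {k: rng.normal(size=(H, I)) * 0.1 for k in 'ifog'}
U = {k: rng.normal(size=(H, H)) * 0.1 for k in 'ifog'}
b = {k: np.zeros(H) for k in 'ifog'}
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=I), h, c, W, U, b)
print(h, c)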
Example: RNNs for a POS Tagger
(Zennaki, 2015)
(Figure: the sentence "I went to west java" tagged as PRP VBD TO JJ NN.)

LSTM + CRF for Semantic Role Labeling
(Zhou and Xu, ACL 2015)
Attention Mechanism
"A potential issue with this encoder-decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus."

Attention Mechanism
(Figure: the sequence-to-sequence encoder-decoder architecture.)
Sutskever, Ilya et al., Sequence to Sequence Learning with Neural Networks, NIPS 2014.
https://blog.heuritech.com/2016/01/20/attention-mechanism/
https://blog.heuritech.com/2016/01/20/attention-mechanism/
AttentionMechanism
9 October 2017
Sutkever, Ilya et al., Sequence
to Sequence Learning with
Neural Networks, NIPS 2014.
9 October 2017
• Each time the proposed model generates a word in a translation, it
(soft-)searches for a set of positions in a source sentence where the
most relevant information is concentrated. The model then predicts
a target word based on the context vectors associated with these
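A minimal NumPy sketch of this soft-search: score every source position against the current decoder state, turn the scores into attention weights with a softmax, and take the weighted sum as the context vector (the dot-product scoring and all values are made-up simplifications of the actual model):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d = 6, 8                                  # made-up source length and hidden size
source_states = rng.normal(size=(T, d))      # encoder states, one per source position
decoder_state = rng.normal(size=d)           # current decoder state

scores  = source_states @ decoder_state      # relevance score of each source position
weights = softmax(scores)                    # attention weights, sum to 1
context = weights @ source_states            # context vector: weighted sum of states
print(weights.round(2), context.shape)       # weights over 6 positions, context in R^8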
Attention Mechanism
(Figure: the attention-based encoder-decoder for translation.)

(Figure: an attention matrix; each cell represents an attention weight for the translation.)

Colin Raffel, Daniel P. W. Ellis, Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems, Workshop track - ICLR 2016
Attention Mechanism
(Figures from the cited paper.)
Yang, Zichao, et al., Hierarchical Attention Networks for Document Classification, NAACL 2016

Attention Mechanism
(Figure from the blog post below.)
https://blog.heuritech.com/2016/01/20/attention-mechanism/
Attention Mechanism
(Figures from the sources below.)
https://blog.heuritech.com/2016/01/20/attention-mechanism/
Xu, Kelvin, et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" (2016).
Attention Mechanism
The attention model is used to link the words in the premise with the words in the hypothesis.
For example:
• Premise: "A wedding party taking pictures"

Tim Rocktaschel et al., Reasoning about Entailment with Neural Attention, ICLR 2016

Attention Mechanism
(Figure from the cited paper.)
Tim Rocktaschel et al., Reasoning about Entailment with Neural Attention, ICLR 2016
Recursive Neural Networks
(Figure from the cited paper.)
R. Socher, C. Lin, A. Y. Ng, and C. D. Manning. 2011. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In ICML.
Recursive Neural Networks
A parent representation is computed from its two children b and c:

  p1 = g(W·[b; c] + bias)

Socher et al., Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, EMNLP 2013
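A minimal NumPy sketch of that composition step (the dimensionality, the random weight matrix, and the child vectors are made-up; tanh stands in for g):

import numpy as np

rng = np.random.default_rng(0)
d = 4                                     # made-up word-vector dimensionality
W = rng.normal(size=(d, 2 * d)) * 0.1     # composition matrix: maps [b; c] back to d dims
bias = np.zeros(d)

b = rng.normal(size=d)                    # child vector b (e.g. a word or phrase)
c = rng.normal(size=d)                    # child vector c
p1 = np.tanh(W @ np.concatenate([b, c]) + bias)   # p1 = g(W.[b; c] + bias)
print(p1.shape)                           # (d,): the parent has the same dimensionality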
Recursive Neural Networks
(Figure from the cited paper.)
Socher et al., Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, EMNLP 2013
Convolutional Neural Networks (CNNs) for Sentence Classification
(Kim, EMNLP 2014)

Recursive Neural Network for SMT Decoding
(Liu et al., EMNLP 2014)