Notions de Deep Learning
18/05/2025
Table of Contents
1 Introduction
2.1 Introduction to Deep Learning
2.1.1 What is Deep Learning and how does it differ from Machine Learning?
2.1.3 Current applications and use cases by sector
2.1.5 Practical example: Installing development environments
2.2 Mathematical fundamentals
2.2.1 Linear algebra for Deep Learning
2.2.2 Differential calculus and gradient descent
2.2.3 Essential probability and statistics
2.3 Principles of neural networks
2.3.3 Activation functions and their impact
2.3.4 Practical example: Creating a simple perceptron for binary classification
2.4.1 Backpropagation algorithm explained step by step
3.1 Convolutional Neural Networks (CNN)
3.1.1 Principles and architecture of CNNs
3.2 Natural Language Processing (NLP)
3.2.1 Word representation (one-hot, word embeddings)
3.2.2 Word2Vec, GloVe and FastText
3.2.3 Introduction to Recurrent Neural Networks (RNN)
3.3.4 Position encodings
4.1.1 Residual Networks (ResNet) and skip connections
4.1.2 U-Net architectures for segmentation
4.1.3 R-CNN, Fast R-CNN, Faster R-CNN for object detection
4.1.4 Practical example: Segmentation of medical images with U-Net
1 Introduction
Welcome to this comprehensive Deep Learning course. Whether you're a student, researcher, data science
professional, or simply passionate about artificial intelligence, this course is designed to provide you with a
progressive, structured, and practical introduction to deep neural networks.
Deep learning is one of the most dynamic and promising branches of artificial intelligence today. This
discipline has revolutionized many fields such as computer vision, natural language processing, speech
recognition, and many others. Its ability to solve complex problems by learning directly from data makes it a
powerful tool for technological innovation.
This course offers a complete immersion into the world of deep learning, from theoretical fundamentals to
advanced architectures. You will learn how to design, implement, and optimize deep neural networks using
popular libraries like TensorFlow, PyTorch, and Keras.
Deep learning is a sub-branch of machine learning, which itself is part of the broader field of artificial intelligence. While traditional
machine learning often relies on manual feature extraction (feature engineering), deep learning is distinguished by its ability to automatically
learn hierarchical representations directly from raw data.
Artificial intelligence
Machine Learning
Deep Learning
The main difference lies in the architecture and depth of the models. The term "deep" refers to the presence
of multiple hidden layers in neural networks, allowing complex relationships in data to be abstracted and
modeled.
Example
— In Deep Learning: A deep neural network automatically learns to extract increasingly abstract
features. The first layers detect simple contours, then textures, patterns, and finally high-level
concepts like faces or objects.
Explanation
This fundamental difference explains why deep learning excels in areas where manual feature definition
is difficult or inefficient, such as computer vision, natural language processing, or speech recognition.
Deep learning has had a fascinating history, marked by periods of intense enthusiasm followed by "AI winters", before experiencing its spectacular renaissance in the 2010s.
1943-1958 First theoretical models of artificial neurons (McCulloch & Pitts, Rosenblatt's Perceptron)
1969-1986 "First winter of AI" following the limitations of the simple perceptron
The resurgence of Deep Learning in the 2010s was mainly due to three factors:
— Massive availability of data (Big Data)
— Increased computing power (GPU, TPU)
— Algorithmic advances (new activation functions, regularization techniques, innovative architectures)
Explanation
The major turning point came in 2012 with the landslide victory of the AlexNet convolutional network in the
ImageNet competition. This demonstration of the superiority of deep neural networks has
triggered a real revolution in the field of AI, attracting the attention of researchers,
businesses and the general public.
Deep learning has transformed many industries and continues to open up new possibilities for innovation:
Example
In the healthcare field, convolutional neural networks (CNNs) are used to analyze medical images and detect pathologies such as cancer with an accuracy that is sometimes superior to that of doctors. For example, a model developed by Google Health demonstrated performance superior to that of radiologists in detecting breast cancer on mammograms.
import tensorflow as tf
from tensorflow import keras

# Definition of a simple sequential model
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),   # input dimension assumed (flattened 28x28 images)
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10, activation='softmax')
])

Listing 1 – Example with TensorFlow/Keras
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)   # 784 = 28*28 flattened input (assumed)
        self.dropout = nn.Dropout(0.2)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(-1, 784)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = SimpleNet()
Listing 2 – Equivalent example with PyTorch
Explanation
# Creating a virtual environment for Deep Learning
conda create -n deeplearning python=3.9
conda activate deeplearning

# Installing PyTorch with GPU support (CUDA 11.7)
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

# Installing TensorFlow with GPU support
conda install -c conda-forge tensorflow-gpu
Exercise
Install the development environment on your machine following the instructions above.
Next, create a Jupyter notebook and verify that you can import PyTorch and TensorFlow.
Also check GPU availability using the appropriate commands for each framework.
Solution
# In a Jupyter notebook
import torch
import tensorflow as tf

# PyTorch verification
print(f"PyTorch version: {torch.__version__}")
print(f"GPU available for PyTorch: {torch.cuda.is_available()}")

# TensorFlow verification
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available for TensorFlow: {len(tf.config.list_physical_devices('GPU')) > 0}")
print(f"Available devices: {tf.config.list_physical_devices()}")

Listing 4 – Verifying the installation
Linear algebra is the fundamental mathematical language of deep learning. Neural networks essentially manipulate vectors, matrices, and
tensors through various linear operations.
import numpy as np

# Input vector (3 features)
x = np.random.randn(3)
# Weight matrix (3 inputs, 4 neurons in the next layer)
W = np.random.randn(3, 4)
# Bias vector
b = np.random.randn(4)

# Output of a fully connected (Dense) layer
y = np.dot(x, W) + b
print(y)
Explanation
In the example above, the matrix product np.dot(x, W) represents the fundamental operation of a fully connected (Dense) layer in a neural network. This operation computes the weighted sum of the inputs for each neuron in the next layer. Adding the bias + b allows the network to learn an offset for each neuron, increasing its modeling capacity.
Gradient descent updates the parameters according to the rule θ_{t+1} = θ_t − η ∇_θ J(θ_t), where:
— θ_t represents the current parameters
— η is the learning rate (hyperparameter)
— ∇_θ J(θ) is the gradient of the loss function J with respect to the parameters θ
import numpy as np
import matplotlib.pyplot as plt

def fonction(x):
    """Function to minimize (simple convex example: f(x) = x^2)"""
    return x ** 2

def derivee(x):
    return 2 * x

def descente_gradient(depart, taux_apprentissage, nb_iterations):
    x = depart
    history = [x]

    for i in range(nb_iterations):
        # Move in the direction opposite to the gradient
        x = x - taux_apprentissage * derivee(x)
        history.append(x)

        # Show progress
        if (i + 1) % 10 == 0:
            print(f"Iteration {i+1}: x = {x:.6f}, f(x) = {fonction(x):.6f}")

    return x, history

# Starting point and learning rate chosen for illustration
point_optimal, history = descente_gradient(
    depart=5.0,
    taux_apprentissage=0.1,
    nb_iterations=50
)

print(f"\nFinal result: x = {point_optimal:.6f}, f(x) = {fonction(point_optimal):.6f}")

# Visualisation
Explanation
This example illustrates the fundamental principle of gradient descent on a simple function
a variable. In neural networks, this principle is applied to multivariable functions
(with potentially millions of parameters), but the logic remains the same: compute the gradient of the loss function with respect to each parameter and adjust these parameters in the direction opposite to the gradient in order to minimize the loss.
Gradient backpropagation is an efficient application of the chain rule that allows these gradients to be computed layer by layer, starting from the output and moving back towards the input of the network.
Exercise
Modify the gradient descent code above to minimize the function f(x) = x⁴ − 2x² + x. Experiment with different starting points and learning rates. What do you notice regarding the convergence of the algorithm?
Solution
import numpy as np
import matplotlib.pyplot as plt

def fonction(x):
    return x ** 4 - 2 * x ** 2 + x

def derivee(x):
    return 4 * x ** 3 - 4 * x + 1

def descente_gradient(depart, taux, nb_iterations):
    x = depart
    history = [x]
    for i in range(nb_iterations):
        x = x - taux * derivee(x)
        history.append(x)
        if (i + 1) % 10 == 0:
            print(f"Iteration {i+1}: x = {x:.6f}, f(x) = {fonction(x):.6f}")
    return x, history

# Compare several starting points and learning rates (illustrative values)
configurations = [{'depart': -2.0, 'rate': 0.01},
                  {'depart': 2.0, 'rate': 0.01},
                  {'depart': 0.5, 'rate': 0.1}]

for config in configurations:
    x_final, historical = descente_gradient(config['depart'], config['rate'], 100)
    plt.plot(historical, label=f"Depart: {config['depart']}, Rate: {config['rate']}")

plt.legend()
plt.show()
Observations on convergence:
— The function f(x) = x⁴ − 2x² + x has several local minima, which makes convergence dependent on the starting point.
— With too high a learning rate, the algorithm may oscillate or diverge.
— Depending on the starting point, the algorithm can converge to different minima.
This sensitivity to initial parameters and learning rate is characteristic of non-convex optimization problems, such as those
encountered in Deep Learning.
Probability and statistics play a crucial role in the theoretical foundations of Deep Learning, particularly for the design of loss functions, data modeling, and performance evaluation. Fundamental probabilistic concepts:
— Probability distributions: modeling the distribution of data
— Expectation and variance: fundamental statistical measures
— Maximum likelihood: principle of parameter estimation
— Bayes' theorem: updating beliefs with new information
Gaussian distribution: f(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²)). Typical use: weight initialization, regularization.
Cross-entropy: H(p, q) = −Σ_x p(x) log q(x). Typical use: loss function for classification.
Example
The cross-entropy loss function used in classification can be derived from the maximum likelihood principle. If we model classification as a multinomial distribution, maximizing the likelihood of the training data is equivalent to minimizing the cross-entropy between the empirical distribution (actual labels) and the distribution predicted by the model.
import numpy as np

def cross_entropy(y_true, y_pred):
    """
    Calculates the cross entropy between the y_true and y_pred distributions
    Args:
        y_true: True distribution (one-hot encoding)
        y_pred: Predicted distribution (probabilities)
    Returns:
        Scalar value of the cross entropy
    """
    # Avoid log(0) with a small epsilon
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    # Number of samples
    n_samples = y_true.shape[0]

    # Average cross-entropy over the samples
    ce = -np.sum(y_true * np.log(y_pred)) / n_samples
    return ce

# Example of use
# True labels (one-hot)
y_true = np.array([
    [1, 0, 0],  # Class 0
    [0, 1, 0],  # Class 1
    [0, 0, 1]   # Class 2
])

# Predicted probabilities
y_pred = np.array([
    [0.7, 0.2, 0.1],  # Prediction for sample 1
    [0.1, 0.8, 0.1],  # Prediction for sample 2
    [0.2, 0.3, 0.5]   # Prediction for sample 3
])

print(f"Cross-entropy: {cross_entropy(y_true, y_pred):.4f}")
Explanation
In the context of Deep Learning, cross-entropy measures the difference between the distribution of
actual labels and the probabilities predicted by the model. A lower cross-entropy indicates
a better match between predictions and reality. This measure is particularly
suitable for classification problems where the network output is normalized via a function
softmax to represent a probability distribution over the different classes.
Another crucial statistical concept is regularization, which allows controlling the complexity of the model and preventing overfitting:
import numpy as np

def l1_regularization(weights, lambda_param):
    """
    L1 regularization (Lasso)
    Args:
        weights: Model parameters
        lambda_param: Regularization coefficient
    Returns:
        Regularization term to add to the loss function
    """
    return lambda_param * np.sum(np.abs(weights))

def l2_regularization(weights, lambda_param):
    """
    L2 regularization (Ridge)
    Args:
        weights: Model parameters
        lambda_param: Regularization coefficient
    Returns:
        Regularization term to add to the loss function
    """
    return lambda_param * np.sum(weights ** 2)

# Example of use
weights = np.random.randn(100)
print(f"L1 penalty: {l1_regularization(weights, 0.01):.4f}")
print(f"L2 penalty: {l2_regularization(weights, 0.01):.4f}")
Exercise
Solution
import numpy as np
import matplotlib.pyplot as plt

def binary_cross_entropy(y_true, y_pred):
    """
    Calculates the binary cross-entropy between the y_true and y_pred distributions
    Args:
        y_true: Real labels (0 or 1)
        y_pred: Predicted probabilities (between 0 and 1)
    Returns:
        Scalar value of the binary cross-entropy
    """
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    # Number of samples
    n_samples = len(y_true)

    bce = -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)) / n_samples
    return bce

# Visualization of the loss as a function of the predicted probability
p = np.linspace(0.01, 0.99, 100)
plt.plot(p, [binary_cross_entropy(np.array([1]), np.array([q])) for q in p], label='y_true = 1')
plt.plot(p, [binary_cross_entropy(np.array([0]), np.array([q])) for q in p], label='y_true = 0')
plt.xlabel('Predicted probability')
plt.ylabel('Binary cross-entropy')
plt.legend()
plt.show()
Listing 11 – Implementation of binary cross-entropy
This implementation calculates binary cross-entropy, which is suitable for problems where the target
is either 0 or 1. We observe that:
— Predictions close to the true values (case 1) give a low loss
— Uncertain predictions (case 2) give an intermediate loss
— Wrong predictions (case 3) give high loss
The visualization shows how the loss increases as the prediction moves away from the real value, with exponential growth when approaching the extremes (0 or 1).
The artificial neuron, inspired by the biological neuron, is the fundamental computing unit of neural networks. Its operation can be summarized in three main stages:
Mathematically, for a neuron with n inputs x = [x1, x2, . . . , xn], weights w = [w1, w2, . . . , wn]
and a bias b, the output y is calculated as follows:
y = f(Σ_{i=1}^{n} w_i x_i + b)

(Figure: schematic of an artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, and activation function f producing the output y.)
import numpy as np

class Neurone:
    def __init__(self, nb_entrees):
        """
        Args:
            nb_entrees: Number of inputs of the neuron
        """
        self.poids = np.random.randn(nb_entrees)  # random weights
        self.biais = np.random.randn()            # random bias
Listing 12 – Implementation of an artificial neuron
# (method of the Neurone class, continued from Listing 12)
def forward(self, entrees, activation_function='sigmoid'):
    """
    Computes the output of the neuron

    Args:
        entrees: Input vector
        activation_function: Activation type ('sigmoid', 'relu', 'tanh')

    Returns:
        Neuron output after activation
    """
    # Checking the dimension of the inputs
    if len(entrees) != len(self.poids):
        raise ValueError(f"The number of inputs ({len(entrees)}) does not match "
                         f"the number of weights ({len(self.poids)})")

    # Weighted sum of the inputs plus bias
    z = np.dot(entrees, self.poids) + self.biais

    # Applying the chosen activation function
    if activation_function == 'sigmoid':
        sortie = 1 / (1 + np.exp(-z))
    elif activation_function == 'relu':
        sortie = max(0.0, z)
    elif activation_function == 'tanh':
        sortie = np.tanh(z)
    else:
        raise ValueError("Unrecognized activation function")

    return sortie

# Example of use
neurone = Neurone(nb_entrees=3)
Explanation
The artificial neuron is a computing unit that transforms its inputs into a single output via a linear combination followed by a non-linearity. This non-linearity, introduced by the activation function, is crucial because it allows the network to model complex relationships between inputs and outputs. Without these non-linearities, even a deep network would reduce to a simple linear transformation. Activation functions such as the sigmoid and the hyperbolic tangent have the advantage of being bounded, which facilitates convergence during learning. However, they suffer from the vanishing gradient problem in deep networks. The ReLU function (Rectified Linear Unit) has gained popularity because it is not prone to this problem and allows faster computation, although it can lead to the "dead neuron" problem if the gradient becomes zero.
Feedforward neural networks, also called multi-layer perceptrons (MLPs), are organized into successive layers where information flows only from input to output, without recurrent connections.
(Figure: a feedforward network with three inputs x1, x2, x3, two hidden layers, and a single output y.)
import numpy as np

class NeuralNetwork:
    def __init__(self, architecture, activation='sigmoid'):
        """
        Initialization of the feedforward network

        Args:
            architecture: List defining the number of neurons per layer
                          [entree, couche_cachee_1, ..., couche_cachee_n, sortie]
            activation: Activation function used in the hidden layers
        """
Listing 14 – Implementing a feedforward network with NumPy
def activation_function(self, Z, derivee=False):
    """
    Applies the chosen activation function

    Args:
        Z: Input of the activation function
        derivee: If True, returns the derivative of the function

    Returns:
        Result of the activation function or its derivative
    """
    if self.activation == 'sigmoid':
        if not derivee:
            return 1 / (1 + np.exp(-Z))
        else:
            s = 1 / (1 + np.exp(-Z))
            return s * (1 - s)

    elif self.activation == 'relu':
        if not derivee:
            return np.maximum(0, Z)
        else:
            return (Z > 0).astype(float)

    elif self.activation == 'tanh':
        if not derivee:
            return np.tanh(Z)
        else:
            return 1 - np.tanh(Z) ** 2

    else:
        raise ValueError("Unrecognized activation function")
def forward_propagation(self, X):
    """
    Forward propagation through the network

    Args:
        X: Input data (nb_features, nb_samples)

    Returns:
        Network output and caches for backpropagation
    """
    caches = {}
    A = X

    # Propagation through the layers
    for l in range(1, self.nb_couches + 1):
        # Cache of the previous activation
        caches[f'A{l-1}'] = A

        # Linear step: Z = W A + b
        Z = np.dot(self.parametres[f'W{l}'], A) + self.parametres[f'b{l}']
        caches[f'Z{l}'] = Z

        # Applying the activation function
        A = self.activation_function(Z)

    return A, caches
Listing 15 – Implementing a feedforward network with NumPy continued
def compute_cost(self, Y_pred, Y, cost_type='mse'):
    """
    Computes the cost of the predictions

    Args:
        Y_pred: Predictions of the network
        Y: Actual target values
        cost_type: 'mse' or 'cross_entropy'

    Returns:
        Cost value
    """
    m = Y.shape[1]
    if cost_type == 'mse':
        cost = np.sum((Y_pred - Y) ** 2) / (2 * m)
    elif cost_type == 'cross_entropy':
        epsilon = 1e-15
        Y_pred = np.clip(Y_pred, epsilon, 1 - epsilon)
        cost = -np.sum(Y * np.log(Y_pred) + (1 - Y) * np.log(1 - Y_pred)) / m
    else:
        raise ValueError("Unrecognized cost type")

    return cost

def predict(self, X):
    """
    Makes predictions on new data

    Args:
        X: Input data

    Returns:
        Network predictions
    """
    A, _ = self.forward_propagation(X)
    return A
Listing 16 – Implementing a feedforward network with NumPy continued
Explanation
A feedforward network propagates information from the input layer to the output layer at
through a series of nonlinear transformations. Each layer extracts features from
increasingly abstract input data. In the example above, we have implemented
a two-hidden-layer network, but this architecture can be generalized to a number
arbitrary number of layers. The choice of weight initialization is crucial for network convergence. He initialization (for ReLU) and Xavier/Glorot initialization (for sigmoid/tanh) are popular techniques that help maintain the variance of the signal across layers, thus avoiding vanishing or exploding gradient problems. Note that this implementation does not include learning by backpropagation, which we will discuss in the next section.
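As a complement, here is a minimal NumPy sketch of the two initialization schemes mentioned above; the layer sizes are arbitrary illustrative values, not taken from the course.

import numpy as np

def xavier_init(n_in, n_out):
    # Xavier/Glorot: variance scaled by the average of fan-in and fan-out (for sigmoid/tanh)
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / (n_in + n_out))

def he_init(n_in, n_out):
    # He: variance scaled by fan-in only (for ReLU)
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)

W1 = xavier_init(784, 128)   # hidden layer with tanh/sigmoid
W2 = he_init(128, 64)        # hidden layer with ReLU
print(W1.std(), W2.std())    # empirical standard deviations close to the targets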
Function | Definition | Advantages | Drawbacks
Sigmoid | σ(x) = 1 / (1 + e^(−x)) | Bounded output [0, 1], differentiable | Vanishing gradient, non-centered output
Tanh | tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)) | Output bounded [−1, 1] and centered | Vanishing gradient at the extremes
ReLU | max(0, x) | Fast computation, prevents vanishing gradient | "Dead" neurons (gradient = 0)
Leaky ReLU | max(αx, x), α ≈ 0.01 | Avoids the dead neuron problem | Few empirical advantages vs ReLU
ELU | x if x > 0, α(e^x − 1) if x ≤ 0 | Average output close to zero | Higher computational cost
Softmax | e^(x_i) / Σ_j e^(x_j) | Normalization into a probability distribution | Sensitive to extreme values
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    exp_x = np.exp(x - np.max(x))  # numerical stability
    return exp_x / np.sum(exp_x)

# Range of values for plotting
x = np.linspace(-5, 5, 100)
Listing 17 – Comparison of activation functions
functions = [(sigmoid, 'Sigmoid', 'blue'),
             (tanh, 'Tanh', 'green'),
             (relu, 'ReLU', 'red'),
             (leaky_relu, 'Leaky ReLU', 'purple')]

plt.figure(figsize=(12, 8))
for i, (func, name, color) in enumerate(functions):
    plt.subplot(2, 3, i + 1)
    y = func(x)
    plt.plot(x, y, color=color, linewidth=2)
    plt.title(f'Function {name}')
    plt.grid(True)
    plt.axhline(y=0, color='black', linestyle='-', alpha=0.3)
    plt.axvline(x=0, color='black', linestyle='-', alpha=0.3)

# Softmax displayed as a bar chart on an example input
x_softmax = np.array([1.0, 2.0, 3.0, 4.0])
y_softmax = softmax(x_softmax)
plt.subplot(2, 3, 5)
plt.bar(range(len(x_softmax)), y_softmax, color='cyan')
plt.title('Softmax (example)')
plt.xlabel('Classes')
plt.ylabel('Probability')

plt.tight_layout()
plt.show()

Listing 18 – Comparison of activation functions (continued)
Explanation
Activation functions are crucial because they determine the network's ability to learn
complex patterns. The ReLU function has become standard in hidden layers because
of its computational simplicity and its ability to avoid the vanishing gradient. The function
softmax is mainly used in the output layer for classification problems
multi-class, because it produces a normalized probability distribution.
The choice of activation function depends on the context:
— Hidden layers: ReLU or its variants (Leaky ReLU, ELU)
— Binary output: Sigmoid
— Multi-class output: Softmax
— Regression: No activation (linear)
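As a small illustrative sketch (the logit values are made up, not from the course), the following NumPy snippet shows the typical output activation for each case listed above:

import numpy as np

z_binary = np.array([0.8])            # one logit for a binary output
z_multi = np.array([2.0, 1.0, 0.1])   # one logit per class
z_regress = np.array([3.7])           # regression output

sigmoid = 1 / (1 + np.exp(-z_binary))            # binary: probability in [0, 1]
softmax = np.exp(z_multi - z_multi.max())
softmax = softmax / softmax.sum()                # multi-class: distribution summing to 1
linear = z_regress                               # regression: raw (linear) output

print(sigmoid, softmax, linear)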
Let's implement a simple perceptron capable of learning a logical function like the AND operator
(AND).
ÿ ÿ
1 import numpy as np
2 import matplotlib . pyplot as plt
3
class Perceptron:
    def __init__(self, nb_entrees, learning_rate=0.1):
        """
        Perceptron initialization

        Args:
            nb_entrees: Number of inputs of the perceptron
            learning_rate: Learning rate for the weight updates
        """
12
ÿ ÿ
Listing 19 – Perceptron for binary classification - AND operator
ÿ ÿ
1
6 Args :
7 inputs: Input vector
8
9 Returns :
10 Binary prediction (0 or 1)
"""
11
21 Args :
22 X: Input data matrix (nb_samples y: Vector of target labels , nb_features )
23
24 nb_epoques : Nombre d’ i t r a t i o n s d’ e n t r a n e m e n t
"""
25
            # Early stopping if convergence is reached
            if total_errors == 0:
                print(f"Convergence reached at epoch {epoque + 1}")
                break
50
55 Args :
56 X: Test data
57 and: real labels
58
59 Returns :
60 Precision you model
"""
61
ÿ ÿ
Listing 20 – Perceptron for binary classification - AND operator
# Creating the data for the AND operator
X_train = np.array([
    [0, 0],  # 0 AND 0 = 0
    [0, 1],  # 0 AND 1 = 0
    [1, 0],  # 1 AND 0 = 0
    [1, 1],  # 1 AND 1 = 1
])
y_train = np.array([0, 0, 0, 1])

# Creation and training of the perceptron
perceptron = Perceptron(nb_entrees=2, learning_rate=0.1)

# Training
print("\nTraining in progress...")
perceptron.training(X_train, y_train, nb_epoques=100)

# Evaluation
precision = perceptron.evaluate(X_train, y_train)
print(f"\nAccuracy: {precision * 100:.1f}%")
Listing 21 – Perceptron for binary classification - AND operator
Exercise
Modify the perceptron above to learn the OR operator and then the XOR operator. What do you notice about
the XOR? Explain why and propose a solution.
Solution
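The original solution listing could not be recovered. For the OR operator, the same single perceptron works, since OR is linearly separable. For XOR, a hidden layer is required; below is a minimal sketch, assuming a small NumPy network with one hidden layer of 4 sigmoid neurons trained by gradient descent (architecture and hyperparameters are illustrative and typically converge with these settings):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

np.random.seed(0)
W1 = np.random.randn(2, 4)   # hidden layer with 4 neurons
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1)   # output layer
b2 = np.zeros((1, 1))
lr = 0.5

for epoch in range(5000):
    # Forward pass
    H = sigmoid(X @ W1 + b1)
    y_pred = sigmoid(H @ W2 + b2)

    # Backward pass (binary cross-entropy with sigmoid output)
    d_out = y_pred - y
    dW2 = H.T @ d_out / len(X)
    db2 = d_out.mean(axis=0, keepdims=True)
    d_hidden = (d_out @ W2.T) * H * (1 - H)
    dW1 = X.T @ d_hidden / len(X)
    db1 = d_hidden.mean(axis=0, keepdims=True)

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(y_pred, 3))  # predictions should approach [0, 1, 1, 0]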
Explanation: The simple perceptron cannot learn the XOR function because this function is not linearly separable: it is impossible to draw a straight line that perfectly separates the classes in two-dimensional space. This historical limitation of the perceptron led to the development of multilayer networks, which can solve non-linearly separable problems thanks to hidden layers and nonlinear activation functions.
Backpropagation is the fundamental algorithm for training multi-layer neural networks. It uses the chain rule
to efficiently calculate the gradients of the
loss function with respect to all network parameters.
Steps of backpropagation:
1. Forward propagation: Calculating activations layer by layer
2. Loss calculation: Evaluation of the error between prediction and ground truth
3. Back propagation: Calculating gradients by going up the network
4. Parameter Update: Adjusting Weights and Biases According to Gradients
ÿ ÿ
1 import numpy as np
2
3 class ReseauNeuronesBP :
4 def __init__ ( self , architecture ) :
"""
5
8 Args :
9 architecture: List [nb_entries] , nb _ caches1 , ... , nb_sorties ]
"""
10
14 # Initializing parameters
15 self . parametres = {}
        for l in range(1, self.nb_couches + 1):
            # Xavier/Glorot initialization
            self.parametres[f'W{l}'] = np.random.randn(
                architecture[l], architecture[l-1]
            ) * np.sqrt(2 / (architecture[l] + architecture[l-1]))
21
32 return s * (1 - s)
ÿ ÿ
Listing 23 – Complete Implementation of Backpropagation
ÿ ÿ
1 def propagation_avant ( self , X):
"""
2
5 Args :
6 X: D o n n e s d’ e n t r e ( nb_features , nb_samples )
7
8 Returns :
9 Final activations and cache for backpropagation
"""
10
23 return A , cache
24
29 Args :
30 Y_pred : P r d i c t i o n s ( nb_classes , nb_samples )
31 Y: V r i t terrain ( nb_classes , nb_samples )
32
33 Returns :
34 Average across all samples
"""
35
36 m = Y . shape [1]
37 epsilon = 1e -15
38 Y_pred = np.clip(Y_pred, epsilon, 1 - epsilon )
39
52 gradients = {}
53 m = Y . shape [1]
54 # Output layer gradient
55 A_final = cache [f’A{ self . nb_couches }’]
        dA = -(np.divide(Y, A_final) - np.divide(1 - Y, 1 - A_final))
        # Backpropagation layer by layer
58 for l in range ( self . nb_couches , 0, -1) :
59 # Gradient of the activation function
60 dZ = dA * self . sigmoid_derivee ( cache [f’Z{l}’])
61
62 # Gradients des p a r a m t r e s
63 gradients [f’dW{l}’] = np . dot (dZ gradients [f’db{l}’] = , cache [ f’A{l -1} ’]. T ) / m
64 np . sum (dZ , axis =1 , keepdims = True ) / m
65
ÿ ÿ
Listing 24 – Complete Implementation of Backpropagation
ÿ ÿ
1
def update_parameters(self, gradients, learning_rate):
"""
2
5
Args :
6
gradients : Gradients c a l c u l s par r t r o p r o p a g a t i o n
7
learning_rate: Learning rate
"""
8
9
        for l in range(1, self.nb_couches + 1):
            self.parametres[f'W{l}'] -= learning_rate * gradients[f'dW{l}']
            self.parametres[f'b{l}'] -= learning_rate * gradients[f'db{l}']
12
15
Train the network with backpropagation
16
17
Args :
            X: Input data (nb_features, nb_samples)
            Y: Labels (nb_classes, nb_samples)
20
nb_epochs : Number of training iterations
21
learning_rate: Learning rate
22
display_cost: Cost display frequency
23
24 Returns :
25
Cost history
"""
26
27
historical_costs = []
28
29
for epoch in range ( nb_epoques ):
30
# Forward propagation
31 A_final , cache = self . forward_propagation (X)
32
            # Cost calculation
            cout = self.cost_calculation(A_final, Y)
            historical_costs.append(cout)

            # Backward propagation
            gradients = self.backpropagation(cache, Y)
39
43
# Displaying progress
44
if epoch % display_cost == 0:
45
print (f" poque { epoque }: C ot = { cout :.6 f}")
46
47
return historical_costs
48
49
def prediction ( self , X):
"""
50
51
Makes predictions on new data
52
53
Args :
54 X: D o n n e s d’ e n t r e
55
56 Returns :
57 P r d i c t i o n s binaires
"""
58
ÿ ÿ
# Example of use: synthetic spiral data
def generate_spiral_data(n_points=300, noise=0.1):
    """Generates spiral data for testing"""
    np.random.seed(42)
N = n_points // 2 # points per class
6
7 # Class 0
8
r = np . linspace (0.1 , 1, N )
9 t = np . linspace (0 4* np .pi , x0
, = r * np . cos N ) + np . random . randn (N) * noise
10 ( t)
11 y0 = r * np . sin ( t)
12
13 # Class 1
14 r = np . linspace (0.1 , 1, N )
15 t = np . linspace ( np .pi , x1 = r * np . 5* np .pi , N ) + np . random . randn (N) * noise
16 cos ( t)
17 y1 = r * np . sin ( t)
18
19 # Combination
20 X = np . vstack ([ np . column_stack ([ x0 , y0 ]) , np . column_stack ([ x1 , y1 ]) ])
21 y = np . hstack ([ np . zeros (N) , np . ones (N ) ])
22
36
37 )
38
39 # valuation
44 # Visualization of results
61 mask = (Y [0] == i)
62 plt . scatter (X [0 mask ], X, [1 marker =’o’ if , mask], c= colors [i],
63 predictions [0 label =f’Classe {i}’, alpha =0.7) , mask ][0] == i else ’x’,
64
ÿ ÿ
Listing 26 – Complete implementation of backpropagation
ÿ ÿ
plt.title(f'Spiral classification (Accuracy: {precision*100:.1f}%)')
2 plt . xlabel ('X1 ')
3 plt . ylabel (’X2 ’)
4 plt . legend ()
6 plt . tight_layout ()
7 plt . show ()
ÿ ÿ
Listing 27 – Complete Implementation of Backpropagation
Explanation
Loss functions and evaluation metrics
The choice of the loss function is crucial because it defines the optimization objective of the model. Different tasks require different loss functions:
ÿ ÿ
def focal_loss(y_pred, y_true, alpha=1, gamma=2):
    """Focal Loss for unbalanced classes"""
    ce_loss = F.cross_entropy(y_pred, y_true, reduction='none')
    pt = torch.exp(-ce_loss)
    focal = alpha * (1 - pt) ** gamma * ce_loss
    return focal.mean()
15 # Calculating metrics
16 accuracy = accuracy_score ( y_true . cpu () , y_pred_classes . cpu () )
17 precision = precision_score ( y_true . cpu () , y_pred_classes . cpu () , average =’weighted ’)
18 recall = recall_score ( y_true . cpu () , y_pred_classes . cpu () , average =’weighted ’)
19 f1 = f1_score ( y_true . cpu () , y_pred_classes . cpu () , average =’weighted ’)
20
21 return {
22 ’accuracy ’: accuracy ,
23 ’precision ’: precision ,
24 ’recall ’: recall ’f1_score ’: f1 ,
25
26 }
ÿ ÿ
Listing 29 – Implementation of common loss functions
Weight initialization strategies
Weight initialization is a critical aspect that can significantly affect model convergence and performance. Improper initialization can lead to gradients that vanish or explode.
Main initialization methods:
ÿ ÿ
1 import torch
2 import torch . nn as nn
3 import math
4
ÿ ÿ
Listing 30 – Implementation of the different initialization strategies
ÿ ÿ
def lecun_normal_init(layer):
    """LeCun normal initialization"""
    if isinstance(layer, nn.Linear):
        fan_in = layer.in_features
        std = math.sqrt(1.0 / fan_in)
        layer.weight.data.normal_(0, std)
14
15 layers = []
16 prev_size = input_size
17
ÿ ÿ
Listing 31 – Implementation of the different initialization strategies
Explanation
The choice of initialization method depends mainly on the activation function used:
— He initialization: Optimal for ReLU and its variants (Leaky ReLU, ELU)
— Xavier initialization: Suitable for symmetric functions like tanh and sigmoid
— LeCun initialization: Recommended for networks with batch normalization
Proper initialization helps avoid problems with vanishing or exploding gradients, thus speeding up
convergence.
CNNs are distinguished from traditional neural networks by their ability to preserve structure
spatial preservation of the input data. This preservation is made possible by three key concepts:
1. Local connectivity: Each neuron is only connected to a local region of the previous layer, unlike fully connected layers.
2. Parameter sharing: The same weights (filters) are used across the entire image, reducing
drastically the number of parameters.
3. Translation invariance: The same feature can be detected regardless of its
position in the image.
(Figure: example CNN pipeline: 32×32×3 input image → Conv1 28×28×32 → Pool1 14×14×32 → Conv2 10×10×64 → Pool2 5×5×64 → fully connected layer → 10 classes.)
import torch
import torch.nn as nn
import torch.nn.functional as F
4
16 , padding =2)
17 nn . MaxPool2d ( kernel_size =2 self . pool2 = , stride =2)
18
ÿ ÿ
Listing 32 – Basic CNN Architecture with PyTorch
ÿ ÿ
def forward(self, x):
    # First convolutional block
    x = self.pool1(F.relu(self.conv1(x)))
4
11
# Flattening for FC layers
12 x = x . view (x. size (0) , -1)
13
19 return x
20
Filters (or convolution kernels) are the fundamental elements of CNNs. Each filter is a
small weight matrix that slides over the input image to detect specific patterns.
Mechanism of convolution:
For an input image I and a filter F of size k × k, the convolution produces a feature map O where
each element is calculated as:
O(i, j) = Σ_{m=0}^{k−1} Σ_{n=0}^{k−1} I(i + m, j + n) × F(m, n) + b
where b is the bias associated with the filter.
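To make this formula concrete, here is a minimal NumPy sketch (illustrative, not part of the original listing) that computes a valid convolution of an image with a single k × k filter:

import numpy as np

def convolve2d(image, kernel, bias=0.0):
    """Naive valid convolution: O(i, j) = sum_m sum_n I(i+m, j+n) * F(m, n) + b"""
    k = kernel.shape[0]
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel) + bias
    return out

image = np.random.rand(6, 6)
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])  # simple vertical edge detector
print(convolve2d(image, edge_filter).shape)  # (4, 4)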
ÿ ÿ
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np
from torchvision import datasets, transforms
6
ÿ ÿ
1 # Normalizing weights for visualization
2 weights = ( weights - weights .min () ) / ( weights . max () - weights . min () )
3
# Visualisation
fig, axes = plt.subplots(2, 4, figsize=(12, 6))
for i in range(min(num_maps, feature_maps.shape[0])):
    ax = axes[i // 4, i % 4]
    ax.imshow(feature_maps[i], cmap='viridis')
    ax.set_title(f'Feature Map {i+1}')
    ax.axis('off')
49
54 # Cleaning
55 handle . remove ()
56
57 # Example of use
58 # Creation of a simple model for demonstration
59 class VisualizationCNN ( nn . Module ):
60 def __init__ ( self ):
        super(VisualizationCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
ÿ ÿ
Listing 35 – Visualizing filters and feature maps
ÿ ÿ
def forward(self, x):
    x = self.pool(torch.relu(self.conv1(x)))
    x = self.pool(torch.relu(self.conv2(x)))
    x = x.view(-1, 32 * 8 * 8)
    x = self.fc(x)
    return x
7
8 model = VisualizationCNN ()
9
ÿ ÿ
Listing 36 – Visualizing filters and feature maps
Natural language processing in Deep Learning requires converting text into numerical representations that neural networks can manipulate. Several approaches exist, each with its advantages and disadvantages.
ÿ ÿ
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt

# Example corpus
corpus = [
    "The cat eats fish",
    "The dog eats meat",
    "The fish swims in the water",
    "The cat and the dog are animals"
]
12
ÿ ÿ
Listing 37 – Comparison of word representations
ÿ ÿ
1
20 return encoding
21
32
42 # 3. TF - IDF
43 vectorizer_tfidf = TfidfVectorizer ()
44 tfidf_matrix = vectorizer_tfidf . fit_transform ( corpus )
45
49
53 # BOW heatmap
54 im1 = axes [0]. imshow ( bow_matrix . toarray () , cmap =’Blues ’, aspect =’auto ’)
55 axes [0]. set_title (’Bag of Words ’)
56 axes [0]. set_xlabel (’Mots ’)
57 axes [0]. set_ylabel (’Documents ’)
58
59 # TF - IDF heatmap
60 im2 = axes [1]. imshow ( tfidf_matrix . toarray () , cmap =’Reds ’, aspect =’auto ’)
61 axes [1]. set_title (’TF - IDF ’)
62 axes [1]. set_xlabel (’Mots ’)
63 axes [1]. set_ylabel (’Documents ’)
64
67 sizes = [ len ( vocabulary ) len ( vocabulary ), 68 colors = [’blue ’, ’green ’, , len( vocabulary )]
’red ’]
ÿ ÿ
Listing 38 – Comparison of word representations
ÿ ÿ
sizes
1 axes [2]. bar ( methods color = colors , alpha, =0.7) ,
6 plt . tight_layout ()
7 plt . show ()
ÿ ÿ
Listing 39 – Comparison of word representations
Explanation
Traditional representations like one-hot encoding suffer from the "curse of dimensionality" and do not capture the semantic relationships between words. For example, "cat" and "dog" are treated as completely independent, even though they share semantic properties (pets).
Word embeddings solve these problems by learning dense representations where words
semantically similar have close vectors in the vector space. This proximity can
be measured by cosine similarity or Euclidean distance.
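As a small sketch of these similarity measures (the 4-dimensional embedding vectors below are made up for illustration):

import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical embeddings (illustrative values only)
cat = np.array([0.8, 0.1, 0.6, 0.2])
dog = np.array([0.7, 0.2, 0.5, 0.3])
car = np.array([0.1, 0.9, 0.0, 0.8])

print(f"cat vs dog: {cosine_similarity(cat, dog):.3f}")   # close vectors -> high similarity
print(f"cat vs car: {cosine_similarity(cat, car):.3f}")   # distant vectors -> low similarity
print(f"Euclidean distance cat-dog: {np.linalg.norm(cat - dog):.3f}")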
ÿ ÿ
1 import numpy as np
2 from collections import defaultdict 3 import matplotlib . pyplot as plt , Counter
5 class SimpleWord2Vec :
6 def __init__ ( self , vector_size =100 , window_size =2 , min_count =1) :
"""
7
10 Args :
11 vector_size: Size of word vectors
12 window_size: Size of the pop-up window
13 min_count: Minimum frequency to include a word
"""
14
ÿ ÿ
Listing 40 – Implementation simple de Word2Vec Skip-gram
ÿ ÿ
1 def build_vocabulary ( self , sentences ):
""" """
2 Vocabulary building
3 # Word Count
4 for sentence in sentences :
5 words = self . preprocess_text ( sentence )
6 self . word_count . update ( words )
7
8 # Filtering by min_count
9 filtered_words = [ word for word if count >= self . , count in self . word_count . items ()
10 min_count ]
11
16
22
40 return training_data
41
58 , self . vocab_size ))
59
        # Generation of the training data
        training_data = self.generate_training_data(sentences)
        print(f"Training data generated: {len(training_data)} pairs")
ÿ ÿ
Listing 41 – Implementation simple de Word2Vec Skip-gram
ÿ ÿ
1 #Entranement
2 for epoch in range ( epochs ):
3 total_loss = 0
4
18 # Backward pass
19 # Output layer gradient
20 grad_out = y_pred . copy ()
21 grad_out [ context_idx ] -= 1
22
23 # Weight gradients
24 grad_W_out = np . outer (h , grad_out )
25 grad_h = np . dot ( self . W_out , grad_out )
26
31 if epoch % 20 == 0:
32 avg_loss = total_loss / len ( training_data )
33 print (f" epoch { epoch }: Average loss = { avg_loss :.4f}")
34
        # Sort by decreasing similarity
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_k]
ÿ ÿ
Listing 42 – Simple implementation of Word2Vec Skip-gram function suite
ÿ ÿ
1 # Larger example corpus
2 corpus_extend = [
3 "The cat eats fish every day ,"
4 "The dog eats red meat ",
5 "The fish swims in clear water ",
6 "The cat and the dog are domestic animals ,"
"
7 Animals need food ,"
8 "Cat food contains fish ,"
9 "Dog food contains meat ,"
10 "Water is essential for all animals ,"
11 "The cat likes to sleep in the sun ",
12 "The dog likes to play in the garden ,"
"
13 Fish live in water ,
" "
14 Pets are our companions
15 ]
16
# Similarity tests
print("\nSimilarity tests:")
test_words = ['cat', 'dog', 'fish', 'eat']
25
44 # Visualisation
,
45 plt . figure ( figsize =(12 8) )
46 plt . scatter ( vectors_2d [: 0] , , vectors_2d [: , 1] , alpha =0.7)
47
48 # Annotation of words
49 for i , word in enumerate ( words ) :
50 plt . annotate ( word ( vectors_2d, [i xytext =(5 , 0] , vectors_2d [i , 1]) ,
ÿ ÿ
Listing 43 – Simple implementation of Word2Vec Skip-gram suite 2
Explanation
Recurrent neural networks are designed to process sequences of data while maintaining
a "memory" of previous elements through recurring connections.
(Figure: an unrolled RNN processing the input sequence x1, x2, x3 and producing the output sequence y1, y2, y3.)
ÿ ÿ
1 import numpy as np
2 import matplotlib . pyplot as plt
3
4 class SimpleRNN :
5 def __init__ ( self , input_size , hidden_size , output_size ):
"""
6
        Args:
            input_size: Size of the input at each time step
            hidden_size: Size of the hidden state
            output_size: Output size
"""
13
        # Initializing parameters
        # Weights from the input to the hidden state
        self.Wxh = np.random.randn(hidden_size, input_size) * 0.1
        # Weights from the hidden state to the hidden state (recurrent connections)
        self.Whh = np.random.randn(hidden_size, hidden_size) * 0.1
        # Weights from the hidden state to the output
        self.Why = np.random.randn(output_size, hidden_size) * 0.1
25
        # Biases
        self.bh = np.zeros((hidden_size, 1))
        self.by = np.zeros((output_size, 1))
29
ÿ ÿ
1 def softmax ( self , x) :
""" """
2 Softmax function for output
3 exp_x = np . exp (x - np . max (x ))
4 return exp_x / np .sum ( exp_x )
5
10 Args :
11 inputs : S q u e n c e d’ e n t r e s ( seq_length , input_size )
12 c a c h initial (si None
h_prev : tat i n i t i a l i s , zro )
13
14 Returns :
15 outputs : S q u e n c e de sorties
16 hidden_states : Tous les tats c a c h s
"""
17
44 # Storage
45 hidden_states [t] = h
46 outputs [t] = y
47
56 # Initializing gradients
57 dWxh = np . zeros_like ( self . Wxh )
58 dWhh = np.zeros_like(self.Whh)
59 dWhy = np . zeros_like ( self . Why )
60 dbh = np . zeros_like ( self . bh )
61 dby = np . zeros_like ( self . by )
62
ÿ ÿ
Listing 45 – Implementation of a simple RNN for sequence processing
ÿ ÿ
1 # R t r o p r o p a g a t i o n travers le temps
2 for t in reversed ( range ( seq_length )) :
3 # Output gradient
4 dy = outputs [ t ]. copy ()
5 dy [ targets [t ]] -= 1 # Cross - entropy gradient
6
10
11 # Gradient of the hidden state (from the output + from the future)
12 dh = np . dot ( self . Why .T , dy ) + dh_next
13
17 # Gradients des p a r a m t r e s
18 dbh += dh_raw
            dWxh += np.dot(dh_raw, inputs[t].reshape(1, -1))
            dWhh += np.dot(dh_raw, hidden_states[t-1].T)
21
24
40 losses = []
41
49 # Calculation of loss
50 loss = 0
51 for t in range ( len ( targets )) :
                loss += -np.log(outputs[t][targets[t], 0])
            total_loss += loss
54
55 #Rtropropagation
56 self . backward ( sequence , targets , outputs , hidden_states , learning_rate )
57
61 if epoch % 10 == 0:
62 print (f" epoch { epoch }: Average loss = { avg_loss :.4f}")
63
64 return losses
ÿ ÿ
Listing 46 – Implementation of a simple RNN for sequence processing
ÿ ÿ
# Example of use: simple sequence prediction
# Data generation: binary sequences where the output is the XOR of the two previous inputs
3
7 targets = []
8
14 # The target at time t is the XOR of the inputs at times t-1 and t-2
15 for t in range (2 , seq_length ):
16 target [t] = sequence [t -1] ^ sequence [t -2]
17
26 # G n r a t i o n des d o n n e s
66 plt . show ()
ÿ ÿ
Listing 47 – Implementation of a simple RNN for sequence processing
Explanation
RNNs are particularly suited to sequential data because they can "remember"
past information through their recurring hidden state. In the example of temporal XOR, the
network must remember the two previous inputs to calculate the correct output.
However, simple RNNs suffer from the vanishing gradient problem: gradients become exponentially
small as they propagate to earlier time steps,
limiting the network's ability to learn long-term dependencies. This is why
More sophisticated architectures like LSTM and GRU have been developed.
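A minimal sketch (with illustrative values) of why gradients vanish in a simple RNN: during backpropagation through time, the gradient is repeatedly multiplied by the recurrent weight matrix and by the tanh derivative, so its norm shrinks geometrically with the number of time steps.

import numpy as np

np.random.seed(0)
hidden_size = 8
Whh = np.random.randn(hidden_size, hidden_size) * 0.1   # small recurrent weights, as in the listing above
grad = np.ones((hidden_size, 1))                         # gradient arriving at the last time step

for t in range(20):
    tanh_derivative = 0.5                                # typical value of 1 - tanh(z)^2
    grad = (Whh.T @ grad) * tanh_derivative              # one step of backpropagation through time
    if t % 5 == 4:
        print(f"Step {t+1}: gradient norm = {np.linalg.norm(grad):.2e}")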
Let's implement a sentiment classifier using word embeddings and a simple RNN.
ÿ ÿ
1 import numpy as np
2 import re
3 from collections import Counter
4 import matplotlib . pyplot as plt
5
6 class SentimentClassifier :
7 def __init__ ( self , embedding_dim =50 , hidden_dim =64 , max_length =100) :
"""
8
11 Args :
12 embedding_dim : Dimension of word embeddings
13 hidden_dim : Size of the RNN hidden layer
14 max_length : Maximum length of sequences
"""
15
25
35
36
37 # Word Count
38 for text in texts :
39 words = self . preprocess_text ( text )
40 word_counts . update ( words )
41
ÿ ÿ
1 # Creation of dictionaries
2 self . word_to_idx = { word : idx for idx , word in enumerate ( vocab_words )}
3 self . idx_to_word = { idx : word for word self . vocab_size = len , idx in self . word_to_idx . items () }
4 ( vocab_words )
5
6
print (f" Vocabulaire construit : { self . vocab_size } mots ")
7
8
def text_to_sequence ( self , text ) :
""" """
9
Conversion of text to sequence of indexes words = self . preprocess_text
10
( text )
11
sequence = []
12
19
return sequence
20
21
def pad_sequence ( self , sequence ) :
""" """
22
Sequence Padding Max Length if len ( sequence ) > self . max_length :
23
24
return sequence [: self . max_length ]
25 else :
26
padding = [ self . word_to_idx [’<PAD >’]] * ( self . max_length - len ( sequence ) )
27
return sequence + padding
28
29
    def sigmoid(self, x):
        """Numerically stable sigmoid function"""
        x = np.clip(x, -250, 250)
        return 1 / (1 + np.exp(-x))
33
38
def initialize_parameters ( self ):
""" """
39
Initializing model parameters
40
# Matrice d’embeddings
41
self . embeddings = np . random . uniform ( -0.1 ( self . vocab_size , 0.1 ,
42 , self . embedding_dim ))
43
44 # RNN parameters
45
self . Wxh = np . random . randn ( self . hidden_dim self . Whh = np . , self . embedding_dim ) * 0.1
46
random . randn ( self . hidden_dim self . bh = np . zeros (( self . , self . hidden_dim ) * 0.1
47
hidden_dim , 1) )
48
57 # tats c a c h s
58 hidden_states = {}
59
hidden_states [ -1] = np . zeros (( self . hidden_dim , 1) )
ÿ ÿ
Listing 49 – Sentiment Classifier with RNN
ÿ ÿ
1
# Propagation through the sequence
2
for t in range ( seq_length ):
3
# R cupation of word embedding
4
word_idx = sequence [t]
5
if word_idx != self . word_to_idx [’<PAD >’]: # Ignore le padding
6
x = self . embeddings [ word_idx ]. reshape ( -1 , 1)
7
12 else :
13
hidden_states [t] = hidden_states [t -1] # tat i n c h a n g pour padding
14
19
return output [0 , 0] , hidden_states
20
21
def compute_loss ( self , prediction , target ):
""" """
22
Calculation of binary cross entropy loss epsilon = 1st -15
23
24
prediction = np . clip ( prediction , epsilon , return -( target * np . log 1 - epsilon )
25
( prediction ) + (1 - target ) * np . log (1 - prediction ))
26
32
# Initializing parameters
33
self . initialize_parameters ()
34
35
# Convert text to sequences
36
sequences = []
37 for text in texts :
38
seq = self . text_to_sequence ( text )
39
seq = self . pad_sequence ( seq )
40
sequences . append ( seq )
41
42 losses = []
43 accuracies = []
44
45
for epoch in range ( epochs ):
46 total_loss = 0
47
correct_predictions = 0
48
49
# Data Mix
50
indices = np . random . permutation ( len ( sequences ))
51
56
# Forward propagation
57
prediction, # Loss hidden_states = self . forward ( sequence )
58
calculation
59
loss = self . compute_loss ( prediction , target )
60 total_loss += loss
61 #Prcision
62
pred_class = 1 if prediction > 0.5 else 0
63
if pred_class == target :
64
correct_predictions += 1
65
                # Simplified backpropagation (approximate gradient)
                # A complete implementation would require full BPTT
                error = prediction - target
ÿ ÿ
Listing 50 – Sentiment Classifier with RNN Suite
ÿ ÿ
1 # Bet jour approximative ( pour d m o n s t r a t i o n )
2 final_hidden = hidden_states [ len ( sequence ) -1]
3 self . Wy -= learning_rate * error * final_hidden . T
6 # M poke tricks
7 avg_loss = total_loss / len ( sequences )
8 accuracy = correct_predictions / len ( sequences )
9
13 if epoch % 10 == 0:
14 print (f" poque { epoch }: Perte = { avg_loss :.4f}, P r c i s i o n = { accuracy
:.3 f}")
15
49 labels_exemples = [1 , 0 , 1, 0, 1, 0, 1, 0 , 1, 0, 1, 0, 1, 0 , 1 , 0] # 1 = Positive , 0=
Ngatif
50
ÿ ÿ
Listing 51 – Sentiment Classifier with RNN
ÿ ÿ
for texte in new_texts:
    sentiment, confidence = classifier.predict(texte)
    print(f"'{texte}'")
    print(f"{sentiment} (confidence: {confidence:.3f})\n")

# Visualization of the training metrics
7 plt . figure ( figsize =(12 , 5) )
8
9 plt . subplot (1 1) , ,
16 plt . subplot (1 2) , 2 ,
23 plt . tight_layout ()
24 plt . show ()
ÿ ÿ
Listing 52 – Sentiment Classifier with RNN Suite
Explanation
This example shows how to combine word embeddings with an RNN for classification
sentiment. The model treats each review as a sequence of words, uses embeddings
to represent each word, then the RNN captures the sequential dependencies to make a
final prediction.
In a real implementation, more advanced techniques would be used such as:
— LSTM or GRU instead of a simple RNN
— Pre-trained embeddings (Word2Vec, GloVe)
— Regularization techniques (dropout, early stopping)
— More sophisticated optimizers (Adam, RMSprop)
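For reference, here is a minimal PyTorch sketch of what such an improved model could look like; the layer sizes, names, and hyperparameters are illustrative assumptions, not the course's implementation.

import torch
import torch.nn as nn

class LSTMSentiment(nn.Module):
    def __init__(self, vocab_size, embedding_dim=50, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        embedded = self.embedding(x)            # (batch, seq_len, embedding_dim)
        _, (h_n, _) = self.lstm(embedded)       # h_n: (1, batch, hidden_dim)
        return torch.sigmoid(self.fc(self.dropout(h_n[-1])))

model = LSTMSentiment(vocab_size=5000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCELoss()
dummy_batch = torch.randint(0, 5000, (4, 100))  # 4 padded sequences of length 100
print(model(dummy_batch).shape)                  # torch.Size([4, 1])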
LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) networks are advanced variants of RNNs designed to solve the vanishing gradient problem and enable learning of long-term dependencies. GRUs use a simplified gating scheme, with fewer parameters and faster computation than LSTMs.
ÿ ÿ
1 import numpy as np
2
3 class LSTMCell :
4 def __init__ ( self , input_size , hidden_size ):
"""
5
        Args:
            input_size: Size of the input
            hidden_size: Size of the hidden state
        """
11
19
28 # Output gate
29 self . Wo = np . random . randn ( hidden_size , input_size + hidden_size ) * 0.1
30 self . bo = np . zeros (( hidden_size , 1) )
31
39 tanh (x)
40
45 Args :
46 x: E n t r e actuelle ( input_size , c a c h p r c d e n t 1)
47 h_prev : tat c_prev : ( hidden_size 1) ,
50 Returns :
51 h: New state of affairs
52 c: New cell state
53 cache : Valeurs i n t e r m d i a i r e s pour la r t r o p r o p a g a t i o n
"""
54
        # Concatenation of the input and the previous hidden state
        concat = np.vstack((x, h_prev))
57
ÿ ÿ
Listing 53 – Implementation of an LSTM cell
ÿ ÿ
1
# Output gate: decides which parts of the cell state to use
2
ot = self . sigmoid ( np . dot ( self .Wo , concat ) + self . bo )
3
7
# Cache pour la r t r o p r o p a g a t i o n
8 cache = {
9 ’x’: x , ’h_prev ’: h_prev , ’c_prev ’: c_prev ,
10 ’ft ’: ft
’concat ’: concat ’it ’: it ’ct_tilde ’:, ct_tilde , , ,
14 return ht , ct , cache
15
16 class GRUCell :
17 def __init__ ( self , input_size , hidden_size ):
"""
18
19
Cellule GRU ( version s i m p l i f i e de LSTM )
20
21
Args :
22
input_size : Size of between
23 hidden_size : Size of the cache
"""
24
25
self . input_size = input_size
26 self . hidden_size = hidden_size
27
36
# Candidat pour le nouvel tat self . Wh = np . random . cach
37
randn ( hidden_size , input_size + hidden_size ) * 0.1
38
self . bh = np . zeros (( hidden_size , 1) )
39
40
    def sigmoid(self, x):
        """Numerically stable sigmoid function"""
        x = np.clip(x, -250, 250)
        return 1 / (1 + np.exp(-x))
43
44
51
Forward propagation of a GRU cell
52
53
Args :
54
x: E n t r e actuelle ( input_size , h_prev : tat 1)
55
c a c h p r c d e n t ( hidden_size , 1)
56
57 Returns :
58 h: New state of affairs
59
cache : Valeurs i n t e r m d i a i r e s pour la r t r o p r o p a g a t i o n
"""
60
61 # C ontnation of the between and the tat concat = np . vstack (( x , h_prev )) cachprcdent
62
63
ÿ ÿ
Listing 54 – Implementation of an LSTM Suite Cell
ÿ ÿ
1 # Candidate for the new hidden state
2 concat_reset = np . vstack ((x rt * h_prev ) ) ,
8 # Cache pour la r t r o p r o p a g a t i o n
9 cache = {
10 ’x’: x , ’h_prev ’: h_prev , ’zt ’: zt ’ht_tilde ’: ht_tilde ’concat ’: concat ,
14 return ht , cache
15
29 targets = []
30
34
48 # Comparative test
49 def test_memory_architectures () :
""" """
50 Compare les performances des d i f f r e n t e s architectures
51 # G n r a t i o n des d o n n e s
52 task = SequenceMemoryTask ( seq_length =15 , num_sequences =500)
53 sequences , targets = task . generate_data ()
54
59 # Example of squence
60 print (f"\ nExemple de s q u e n c e :")
61 print (f" S q u e n c e : { sequences [0]} ")
62 print (f" Signal m m o r i s e r : { int ( sequences [0][0]) }")
63 print (f" Cible : { targets [0]} ")
64 # Simple test: see if cells can remember the first element
65 lstm_cell = LSTMCell ( input_size =1 gru_cell = GRUCell , hidden_size =8)
66 ( input_size =1 hidden_size =8) ,
67 # Test on a sequence
68 test_sequence = sequences [0]
ÿ ÿ
Listing 55 – Implementing an LSTM Suite Cell
ÿ ÿ
1 # LSTM
2 h_lstm = np . zeros ((8 c_lstm = np . , 1) )
3 zeros ((8 , 1) )
4
9 # GRU
10 h_gru = np . zeros ((8 , 1) )
11
16 print (f"\ n t a t final LSTM : { h_lstm . flatten () [:4]}... ") # Affichage partiel
17 print (f" tat final GRU : { h_gru . flatten () [:4]}... ") # Partial display
18
24 # Visualization of architectures
25 fig , axes = plt . subplots (1 , 3, figsize =(15 , 5) )
26
29 axes [0]. text (0.5 fontsize =10) , 0.6 , ’h_t = tanh ( Wx_t + Uh_ {t -1} + b)’, ha =’center ’, va =’center ’,
’
30 axes [0]. text (0.5 31 axes [0]. , 0.4 , Gradient vanescent ’, ha =’center ’, va =’center ’, fontsize =9)
’
text (0.5 32 axes [0]. set_xlim (0 , 0.3 , M m o i r e l i m i t e ’, ha =’center ’, va =’center ’, fontsize =9)
33 axes [0]. set_ylim (0 34 axes [0]. , 1)
axis (’off ’) , 1)
35
36 # Diagramme LSTM
37 axes [1]. text (0.5 38 axes [1]. , 0.9 , ’LSTM ’, ha =’center ’, va =’center ’, fontsize =14 , weight =’bold ’)
text (0.5 39 axes [1]. text (0.5 , 0.75 , ’Portes :’, ha =’center ’, va =’center ’, fontsize =11 , weight =’bold ’)
’
, 0.65 , Forget : f_t = ( W_f[h_{t -1} x_t ] + b_f, )’, ha =’center ’, va =
’center ’, fontsize =8)
’
40 axes [1]. text (0.5 , 0.55 , Input : i_t = ( W_i [h_{t -1} , x_t ] + b_i )’, ha =’center ’, va =’
center ’, fontsize =8)
’
41 axes [1]. text (0.5 , 0.45 , Output : o_t = ( W_o[h_{t -1} , x_t ] + b_o )’, ha =’center ’, va =
’center ’, fontsize =8)
’
42 axes [1]. text (0.5 =9) , 0.3 , Mmoire long terme ’, ha =’center ’, va =’center ’, fontsize
’
43 axes [1]. text (0.5 44 axes [1]. , 0.2 , C o n t r l e p r c i s ’, ha =’center ’, va =’center ’, fontsize =9)
set_xlim (0 45 axes [1]. set_ylim (0 46 , 1)
axes [1]. axis (’off ’) , 1)
47
48 # GRU Diagram
49 axes [2]. text (0.5 50 axes [2]. , 0.8 , ’GRU ’, ha =’center ’, va =’center ’, fontsize =14 , weight =’bold ’)
text (0.5 51 axes [2]. text (0.5 , 0.65 , 'Portes :', ha ='center ', va ='center ', fontsize =11 , weight =’bold ’)
’
, 0.55 , Reset : r_t = ( W_r [h_{t -1} , x_t ] + b_r )’, ha =’center ’, va =’
center ’, fontsize =9)
’
52 axes [2]. text (0.5 , 0.45 , Update : z_t = ( W_z[h_{t -1} , x_t ] + b_z )’, ha =’center ’, va =
’center ’, fontsize =9)
’
53 axes [2]. text (0.5 54 axes [2]. , 0.3 , Plus simple que LSTM ’, ha =’center ’, va =’center ’, fontsize =9)
’
text (0.5 55 axes [2]. set_xlim (0 , 0.2 , Less parameters ', ha ='center ', va ='center ', fontsize =9)
56 axes [2]. set_ylim (0 57 axes [2]. , 1)
axis (’off ’) , 1)
58
ÿ ÿ
Listing 56 – Implementation of an LSTM Suite Cell
Explanation
LSTMs solve the vanishing gradient problem through their gate architecture, which allows a direct flow of information through the cell state. The three main gates are:
— Forget gate: decides what information to delete from the cell state
— Input gate: decides what new information to store
— Output gate: determines which parts of the cell state to use for the output
GRUs simplify this architecture by merging some gates, reducing the number of parameters while maintaining long-term storage capacity. In practice, GRUs are often as efficient as LSTMs on many tasks.
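For reference, here are the standard LSTM update equations corresponding to the gates described above (σ denotes the sigmoid and ⊙ the element-wise product); this is the textbook formulation, stated here as a summary rather than taken verbatim from the course:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)   (forget gate)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)   (input gate)
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)   (candidate cell state)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t   (new cell state)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)   (output gate)
h_t = o_t ⊙ tanh(c_t)   (new hidden state)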
ÿ ÿ
7 # Clipping distances
8 distance_mat_clipped = torch . clamp ( distance_mat - self . max_relative_position , ,
10 self . max_relative_position )
11
        # Creating the rotation frequencies
        inv_freq = 1.0 / (10000 ** (torch.arange(0.0, d_model, 2.0) / d_model))
        self.register_buffer('inv_freq', inv_freq)
36
38 if seq_len is None :
39 seq_len = x. size (0)
40
41 # Positions
42 t = torch . arange ( seq_len , device = x. device ). type_as ( self . inv_freq )
43
44 # Calculating angles
45 freqs = torch . einsum (’i,j->ij ’, t emb = torch . cat (( freqs , , self . inv_freq )
46 freqs ) , dim = -1)
47
48 # Applying rotation
49 cos_emb = emb . cos () [ None , :, None sin_emb = emb . , :]
50 sin () [ None , :, None , :]
51
ÿ ÿ
1 def visualize_positional_encodings () :
"""
2
5 d_model = 64
6 max_len = 100
7
11
12 # Dummy data
13 dummy_input = torch . zeros ( max_len , 1, d_model )
14
15 # Application of encodings
16 with torch . no_grad () :
17 sin_encoded = sinusoidal_pe ( dummy_input )
18 learned_encoded = learned_pe ( dummy_input )
19
24 # Visualisation
25 fig , axes = plt . subplots (2 , 3, figsize =(18 , 10) )
26
27 # Encodage s i n u s o d a l - Heatmap
28 im1 = axes [0 0]. imshow
, ( sin_pe_values .T , cmap =’RdBu ’, aspect =’auto ’)
29 axes [0 0]. set_title
, (’Encodage Positionnel S i n u s o d a l ’)
30 axes [0 0]. set_xlabel
, (’Position ’)
31 axes [0 0]. set_ylabel
, (’Dimension ’)
32 plt . colorbar ( im1 ax = axes [0, , 0])
33
53 # Analyse des f r q u e n c e s
54 freq_analysis = []
55 for pos in range (0 min (50 max_len, )): ,
61 im3 = axes [1 axes [1 , 0]. imshow ( freq_analysis .T , cmap =’viridis ’, aspect =’auto ’)
62 0]. set_title (’Analyse
, F r q u e n t i e l l e ( FFT )’)
63 axes [1 0]. set_xlabel
, (’Position ’)
64 axes [1 0]. set_ylabel
, (’ F r q u e n c e ’)
65 plt . colorbar ( im3 , ax = axes [1 , 0])
ÿ ÿ
Listing 58 – Transformers vs RNN
ÿ ÿ
1 # Cosine similarity between positions
2 from sklearn . metrics . pairwise import cosine_similarity
3
25 plt . tight_layout ()
26 plt . show ()
27
30 def analyze_positional_encoding_properties () :
"""
31
36 d_model = 128
37 max_len = 200
38
39 # Encodage s i n u s o d a l
40 pe = SinusoidalPositionalEncoding ( d_model max_len ) ,
ÿ ÿ
Listing 59 – Transformers
ÿ ÿ
1 # Frequencies decrease with dimension
2 freqs = []
3 for i in range (0 d_model freq = ,1.0 / (10000 ** , 2) :
4 (i / d_model ))
5 freqs . append ( freq )
6 if i < 10: # Display the first frequencies
7 print (f" Dimension {i}: f r q u e n c e = { freq :.6 f}")
8
9 print (f" Ratio f r q u e n c e max / min : {max ( freqs )/ min ( freqs ) :.2 e}")
10
21 print (f" Average distance between adjacent positions: {np. mean(min_distances) :.4 f}")
22 print (f" Distance minimale : {np. min ( min_distances ):.4 f}")
23 print (f" Distance maximale : {np. max ( min_distances ):.4 f}")
24
32
36 similarity_diff = np . dot ( diff1 diff2 )) , diff2 ) / ( np . linalg . norm ( diff1 ) * np . linalg . norm (
39 return pe_values
40
41 def compare_positional_encoding_methods () :
"""
42
48 comparison = {
49 ’ S i n u s o d a l ’: {
50 'benefits ': [
51 'No parameters' learn ',
52 ’Gnralisation longer sequences ',
53 ’ P r o p r i t s m a t h m a t i q u e s i n t r e s s a n t e s ’,
54 ’Invariance par translation relative ’
55 ],
56 ’ i n c o n v n i e n t s ’: [
57 'Potentially non-optimal fixed form ',
58 'May not capture all positional nuances'
59 ],
60 'usage ': 'Original Transformers, GPT , BERT ’
61 },
62 'Learned ': {
63 'benefits ': [
64 ’Adaptable aux d o n n e s s p c i f i q u e s ’,
65 'Can learn complex patterns ',
66 ’ Optimisation end -to - end ’
67 ],
ÿ ÿ
Listing 60 – Transformers
ÿ ÿ
            'drawbacks': [
                'Additional parameters',
                'Limited to the training length',
                'Potential overfitting'
            ],
            'usage': 'Some variants of BERT, specialized models'
        },
        'Relative': {
            'benefits': [
                'Focus on relative distances',
                'More linguistically intuitive',
                'Better generalization'
            ],
            'drawbacks': [
                'Increased computational complexity',
                'More complex implementation'
            ],
            'usage': 'Transformer-XL, some recent models'
        },
        'RoPE': {
            'benefits': [
                'Elegant rotation properties',
                'Good generalization',
                'Computational efficiency'
            ],
            'drawbacks': [
                'Relatively new',
                'Less well studied'
            ],
            'usage': 'GPT-NeoX, PaLM, some recent models'
        }
    }

    # ...

def demonstrate_position_encoding_impact():
    """Demonstrate the impact of positional encodings on an order-sensitive task."""
    # ...

    # Adding noise
    noise_positions = np.random.choice(seq_len, size=seq_len // 4, replace=False)
    for pos in noise_positions:
        # ...
ÿ ÿ
Listing 61 – Transformers
ÿ ÿ
        data.append(seq)
        labels.append(label)

# Data generation
X, y = create_order_task()

# ...
print("\nWithout positional encoding, a Transformer could not distinguish")
print("these sequences, because attention is invariant to the order!")

# Running the demonstrations
print("=== POSITIONAL ENCODINGS IN TRANSFORMERS ===\n")

# Visualization of the encodings
sin_pe, learned_pe = visualize_positional_encodings()

# Analysis of the properties
pe_analysis = analyze_positional_encoding_properties()

# Comparison of methods
compare_positional_encoding_methods()

# Demonstrated impact
demonstrate_position_encoding_impact()
Explanation
Positional encodings are fundamental because Transformers are inherently invariant to token order. Without them, "Cat eats fish" and "Fish eats cat" would have exactly the same representation.
Sinusoidal encoding: Uses trigonometric functions with different frequencies; the low frequencies capture distant positions, while the high frequencies capture nearby positions. This approach generalizes naturally to sequences longer than those seen during training.
Learned encoding: Positions are represented by trainable embeddings. More flexible, but limited to the maximum training length.
Relative encoding: Focuses on relative distances between positions rather than absolute positions, which is often more linguistically relevant.
RoPE (Rotary Position Embedding): A recent method that encodes the position as a rotation in the feature space, offering interesting mathematical properties and good efficiency.
The choice of positional encoding can significantly impact performance depending on the task and the nature of the data.
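To make the sinusoidal variant concrete, here is a minimal, self-contained sketch (independent of the `SinusoidalPositionalEncoding` class used in the listings) that builds the matrix PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)):
ÿ ÿ
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a [max_len, d_model] matrix of sinusoidal positional encodings."""
    position = torch.arange(max_len).unsqueeze(1).float()              # [max_len, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))             # [d_model/2]
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# Usage: add the encoding to token embeddings of shape [seq_len, d_model]
pe = sinusoidal_positional_encoding(max_len=100, d_model=64)
print(pe.shape)  # torch.Size([100, 64])
ÿ ÿ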
Let's apply the concepts of Transformers by fine-tuning BERT, a pre-trained model, for a
sentiment classification task.
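Before the full pipeline below, here is a minimal sketch of what the dataset will feed the model: the Hugging Face tokenizer turns raw text into `input_ids` and an `attention_mask` (the sentence is made up; `bert-base-uncased` is the model name used later in the listing):
ÿ ÿ
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoding = tokenizer(
    "I love this movie, it is excellent",
    truncation=True,
    padding='max_length',
    max_length=32,
    return_tensors='pt'
)
print(encoding['input_ids'].shape)       # torch.Size([1, 32])
print(encoding['attention_mask'].shape)  # torch.Size([1, 32])
ÿ ÿ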
ÿ ÿ
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from transformers import get_linear_schedule_with_warmup
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# ...
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

def create_synthetic_sentiment_data(num_samples):
    """Create a synthetic sentiment dataset."""
    positive_templates = [
        "I love this {product}, it is {adjective}",
        "Excellent {product}, very {adjective}",
        "Fantastic experience with this {product}, {adjective}",
        "I highly recommend this {product}, {adjective}",
        "Perfect, this {product} is really {adjective}",
        "Magnificent {product}, completely {adjective}",
        "Superb quality, very {adjective}",
        "Wonderful {product}, absolutely {adjective}"
    ]
ÿ ÿ
Listing 63 – Fine-tuning BERT for sentiment classification
ÿ ÿ
    negative_templates = [
        "This {product} is {adjective}, I do not recommend it",
        "Horrible experience, very {adjective}",
        "Disappointing, this {product} is completely {adjective}",
        "Poor quality, really {adjective}",
        "I regret this purchase, too {adjective}",
        "This {product} is useless, absolutely {adjective}",
        "Catastrophic, extremely {adjective}",
        "Avoid this {product}, it is {adjective}"
    ]

    products = ['movie', 'book', 'restaurant', 'hotel', 'product', 'service',
                'application', 'game']

    positive_adjectives = ['excellent', 'fantastic', 'wonderful', 'perfect', 'great',
                           'extraordinary', 'remarkable', 'impressive', 'brilliant',
                           'magnificent']

    negative_adjectives = ['horrible', 'disappointing', 'useless', 'catastrophic', 'mediocre',
                           'awful', 'lamentable', 'pitiful', 'disastrous', 'terrible']

    texts = []
    labels = []

    # Generation of positive examples
    for _ in range(num_samples // 2):
        template = np.random.choice(positive_templates)
        product = np.random.choice(products)
        adjective = np.random.choice(positive_adjectives)

        text = template.format(product=product, adjective=adjective)
        texts.append(text)
        labels.append(1)  # Positive

    # Generation of negative examples
    for _ in range(num_samples // 2):
        template = np.random.choice(negative_templates)
        product = np.random.choice(products)
        adjective = np.random.choice(negative_adjectives)

        text = template.format(product=product, adjective=adjective)
        texts.append(text)
        labels.append(0)  # Negative

    # ...

def analyze_bert_attention(model, tokenizer, text, layer_idx=-1):
    """Analyze BERT's attention patterns on a text."""
    # Tokenization
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)

    # Forward pass with attention capture
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    # Extracting the attention weights
    attention_weights = outputs.attentions[layer_idx][0]  # First instance of the batch

    # Converting tokens for visualization
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    return attention_weights, tokens
ÿ ÿ
Listing 64 – Fine-tuning BERT for sentiment classification
ÿ ÿ
def visualize_bert_attention(attention_weights, tokens, heads_to_show=4):
    """Visualize BERT's attention patterns."""
    num_heads = attention_weights.shape[0]
    seq_len = len(tokens)

    # Selection of the heads to display
    heads_indices = np.linspace(0, num_heads - 1, heads_to_show, dtype=int)

    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    axes = axes.flatten()

    for i, head_idx in enumerate(heads_indices):
        # Extracting the weights for this head
        head_attention = attention_weights[head_idx].numpy()

        # Creation of the heatmap
        sns.heatmap(head_attention,
                    xticklabels=tokens[:seq_len],
                    yticklabels=tokens[:seq_len],
                    cmap='Blues',
                    ax=axes[i],
                    cbar_kws={'label': 'Attention'})

        # ...

        # Rotating the labels for readability
        axes[i].tick_params(axis='x', rotation=45)
        axes[i].tick_params(axis='y', rotation=0)

    plt.tight_layout()
    plt.show()

def train_bert_classifier():
    """Fine-tune BERT for sentiment classification."""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")

    # Loading the tokenizer and the pre-trained BERT model
    model_name = 'bert-base-uncased'  # Using the English version for compatibility
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2,
        output_attentions=True,
        output_hidden_states=False
    ).to(device)

    # Data creation
    print("Creating the synthetic dataset...")
    texts, labels = create_synthetic_sentiment_data(num_samples=1600)

    # Simple translation of the templates into English for BERT
    texts_english = []
ÿ ÿ
Listing 65 – Fine-tuning BERT for sentiment classification
ÿ ÿ
    for text in texts:
        # Basic translation for demonstration
        text_en = text.replace("J'adore", "I love").replace("ce ", "this ").replace("il est", "it is")
        text_en = text_en.replace("Excellent", "Excellent").replace("tres", "very")
        text_en = text_en.replace("Je recommande", "I recommend").replace("Parfait", "Perfect")
        text_en = text_en.replace("je le deconseille", "I don't recommend it")
        text_en = text_en.replace("Horrible", "Horrible").replace("Decevant", "Disappointing")
        text_en = text_en.replace("produit", "product").replace("film", "movie").replace("livre", "book")
        texts_english.append(text_en)

    # ...

    epochs = 3
    total_steps = len(train_loader) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=total_steps
    )

    # Training
    train_losses = []
    val_accuracies = []

    # ...
            # Reset gradients
            model.zero_grad()

            # Forward pass
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )
ÿ ÿ
Listing 66 – Fine-tuning BERT for sentiment classification
ÿ ÿ
            # Backward pass
            loss.backward()

            # Clipping the gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update the parameters
            optimizer.step()
            scheduler.step()

            if batch_idx % 20 == 0:
                print(f'Epoch {epoch+1}, Batch {batch_idx}/{len(train_loader)}, Loss: {loss.item():.4f}')

        # ...

        # Validation phase
        model.eval()
        total_eval_accuracy = 0
        total_eval_loss = 0

        predictions = []
        true_labels = []

        # ...
                outputs = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels
                )

                # ...

                # Accuracy calculation
                preds = torch.argmax(logits, dim=1)
                accuracy = (preds == labels).cpu().numpy().mean()
                total_eval_accuracy += accuracy

    # ...

    # Final metrics
    print("\nFinal classification report:")
    print(classification_report(true_labels, predictions,
                                target_names=['Negative', 'Positive']))
ÿ ÿ
Listing 67 – Fine-tuning BERT for sentiment classification
ÿ ÿ
def analyze_bert_performance(model, tokenizer, test_texts, test_labels):
    """Analyze the fine-tuned model on a few test sentences."""
    predictions = []
    confidences = []

    # ...

        # Prediction
        outputs = model(**inputs)
        logits = outputs.logits
        probabilities = torch.softmax(logits, dim=1)

        predicted_class = torch.argmax(logits, dim=1).item()
        confidence = torch.max(probabilities).item()

        # ...

        # Detailed display
        true_label = test_labels[i]
        status = "correct" if predicted_class == true_label else "incorrect"

        # ...

def demonstrate_bert_attention_analysis():
    """Demonstrate the analysis of BERT attention patterns."""
    # ...

    # Attention analysis
    attention_weights, tokens = analyze_bert_attention(model, tokenizer, text,
                                                       layer_idx=-1)
ÿ ÿ
Listing 68 – Fine-tuning BERT for sentiment classification
ÿ ÿ
    print(f"Number of attention heads: {attention_weights.shape[0]}")
    print(f"Sequence length: {len(tokens)}")
    print(f"Tokens: {tokens[:10]}...")  # Display of the first tokens

    # Statistical analysis of attention
    avg_attention_per_head = attention_weights.mean(dim=(1, 2))
    max_attention_per_head = attention_weights.max(dim=2)[0].max(dim=1)[0]

    print(f"Average attention per head: {avg_attention_per_head[:4].tolist()}")
    print(f"Maximum attention per head: {max_attention_per_head[:4].tolist()}")

    # Visualization for the first example only
    if i == 0:
        print("Generating the attention visualization...")
        visualize_bert_attention(attention_weights, tokens, heads_to_show=4)

def compare_bert_variants():
    """Comparison of BERT variants."""
    print("\n=== COMPARISON OF BERT VARIANTS ===")
    print("-" * 60)

    bert_variants = {
        'BERT-Base': {
            'layers': 12,
            'hidden_size': 768,
            'attention_heads': 12,
            'parameters': '110M',
            'training_data': 'BookCorpus + Wikipedia',
            'strengths': ['Bidirectional', 'Versatile', 'Well-studied'],
            'use_cases': ['Classification', 'NER', 'Question-Answering']
        },
        'BERT-Large': {
            'layers': 24,
            'hidden_size': 1024,
            'attention_heads': 16,
            'parameters': '340M',
            'training_data': 'BookCorpus + Wikipedia',
            'strengths': ['More powerful', 'Better representations'],
            'use_cases': ['Complex tasks', 'State of the art']
        },
        'RoBERTa': {
            'layers': 24,
            'hidden_size': 1024,
            'attention_heads': 16,
            'parameters': '355M',
            'training_data': 'More data, longer training',
            'strengths': ['Training optimizations', 'No NSP'],
            'use_cases': ['Robust alternative to BERT']
        },
        'DistilBERT': {
            'layers': 6,
            'hidden_size': 768,
            'attention_heads': 12,
            'parameters': '66M',
            'training_data': 'Distillation of BERT',
            'strengths': ['Faster', 'Lighter', '97% of the performance'],
            'use_cases': ['Production', 'Limited resources']
        },
        'ELECTRA': {
            'layers': 12,
            'hidden_size': 768,
            'attention_heads': 12,
            'parameters': '110M',
            'training_data': 'Discriminative training',
            'strengths': ['More efficient', 'Better than BERT'],
            'use_cases': ['High-performance alternative']
        }
    }

    # ...
    # Loss curve
    ax1.plot(train_losses, 'b-', marker='o', linewidth=2, markersize=8)
    ax1.set_title('Training Loss Evolution')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    ax1.grid(True, alpha=0.3)

    # Accuracy curve
    ax2.plot(val_accuracies, 'g-', marker='s', linewidth=2, markersize=8)
    ax2.set_title('Validation Accuracy Evolution')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy')
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(0, 1)

    # ...

    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Negative', 'Positive'],
                yticklabels=['Negative', 'Positive'])
    plt.title('Confusion Matrix - Sentiment Classification')
    plt.xlabel('Predictions')
    plt.ylabel('Actual Values')
    plt.show()

    return cm
    benefits = {
        'Pre-training': {
            'description': 'BERT is pre-trained on large text corpora',
            'impact': 'Capture of rich linguistic representations',
            'advantage': 'No need to start from scratch'
        },
        'Fine-tuning': {
            'description': 'Adaptation to specific tasks with little data',
            'impact': 'High performance even with limited datasets',
            'advantage': 'Time and resource efficiency'
        },
        'Contextual representations': {
            'description': 'Each word has a context-dependent representation',
            'impact': 'Handling of polysemy and ambiguities',
            'advantage': 'Nuanced understanding of language'
        },
        'Bidirectionality': {
            'description': 'Access to left AND right context simultaneously',
            'impact': 'Better understanding than unidirectional models',
            'advantage': 'Captures complex dependencies'
        }
    }

    for benefit, details in benefits.items():
        print(f"\n{benefit.upper()}:")
        print(f"  Description: {details['description']}")
        print(f"  Impact: {details['impact']}")
        print(f"  Advantage: {details['advantage']}")

    # Comparative simulation of performance
    print("\nSIMULATED PERFORMANCE COMPARISON:")
    print("-" * 40)

    scenarios = {
        'Model from scratch': {'accuracy': 0.65, 'training_time': '48h', 'data_needed': '100k+ samples'},
        'Fine-tuning BERT': {'accuracy': 0.89, 'training_time': '2h', 'data_needed': '1k+ samples'},
        'BERT without fine-tuning': {'accuracy': 0.76, 'training_time': '5 min', 'data_needed': '0 samples'}
    }

    for scenario, metrics in scenarios.items():
        print(f"\n{scenario}:")
        print(f"  Accuracy: {metrics['accuracy']:.2%}")
        print(f"  Training time: {metrics['training_time']}")
        print(f"  Data needed: {metrics['data_needed']}")
try:
    trained_model, tokenizer, losses, accuracies, true_labels, predictions = train_bert_classifier()

    # Visualization of the training results
    plot_training_results(losses, accuracies)

    # Confusion matrix
    cm = create_confusion_matrix(true_labels, predictions)

    # Performance analysis
    test_texts = [
        "This product is excellent and works perfectly!",
        "Terrible quality, completely disappointed.",
        "Amazing experience, highly recommended!",
        "Waste of money, very poor quality."
    ]
    # ...
    analyze_bert_performance(trained_model, tokenizer, test_texts, test_labels)

    # ...

    print("\n=== KEY POINTS OF BERT FINE-TUNING ===")
    print("1. Transfer learning: starting from a pre-trained model")
    print("2. Fine-tuning: adaptation with little data and few epochs")
    print("3. Attention: understand what the model 'looks at'")
    print("4. Variants: choose according to constraints (speed vs. performance)")
    print("5. Tokenization: importance of preprocessing with the BERT tokenizer")
    print("6. Hyperparameters: low learning rate, warmup, gradient clipping")
    print("7. Evaluation: metrics adapted to the classification task")
ÿ ÿ
Listing 69 – Fine-tuning BERT for sentiment classification
Explanation
ÿ ÿ
        # Forward pass
        logits, attention_weights = model(sequences, mask)
        loss = criterion(logits, labels)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Metrics
        epoch_loss += loss.item()
        predictions = torch.argmax(logits, dim=1)
        correct_predictions += (predictions == labels).sum().item()
        total_predictions += labels.size(0)

        if batch_idx % 20 == 0:
            print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Metrics
        epoch_loss += loss.item()
        predictions = torch.argmax(logits, dim=1)
        # ...

        if batch_idx % 20 == 0:
            print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

    # ...

    model.eval()
    device = next(model.parameters()).device

    # ...

    # Extracting the weights
    if layer_idx == -1:
        layer_idx = len(attention_weights) - 1

    # ...

    return attention_weights
comparison_data = {
    'Criterion': [
        'Parallelization',
        'Long-term memory',
        'Computational complexity',
        'Training speed',
        'Interpretability',
        # ...
    ],
    # ...
}

# ...

rnn_times = [0.1, 0.5, 1.0, 2.0, 5.0]  # Sequential time

# ...
plt.plot(sequence_lengths, rnn_times, 'o-', label='RNN / LSTM', linewidth=2)
plt.plot(sequence_lengths, transformer_times, 's-', label='Transformer', linewidth=2)
plt.xlabel('Sequence length')
plt.ylabel('Relative time')
plt.title('Training time comparison')
plt.legend()
plt.grid(True, alpha=0.3)

# ...
plt.plot(sequence_lengths, rnn_memory, 'o-', label='RNN/LSTM (O(n))', linewidth=2)
plt.plot(sequence_lengths, transformer_memory, 's-', label='Transformer (O(n^2))', linewidth=2)
plt.xlabel('Sequence length')
plt.ylabel('Relative memory usage')
plt.title('Memory complexity')
plt.legend()
plt.grid(True, alpha=0.3)

# ...

compare_transformer_vs_rnn()

plt.figure(figsize=(10, 6))
plt.plot(losses, 'b-', linewidth=2, marker='o')
plt.title('Loss Evolution - Our Transformer Training')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)
The attention mechanism is the heart of Transformers. Let's understand in detail how it works and
its variants.
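Since the listing below is only partially recovered, here is a compact, self-contained sketch of the core operation it builds on, scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V (the tensor shapes are illustrative assumptions):
ÿ ÿ
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: [batch, heads, seq_len, d_k]; returns (output, attention weights)."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # [B, H, L, L]
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))       # hide masked positions
    attn = F.softmax(scores, dim=-1)                                 # attention weights
    return torch.matmul(attn, v), attn

# Quick shape check on random tensors
q = k = v = torch.randn(2, 8, 10, 64)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([2, 8, 10, 64]) torch.Size([2, 8, 10, 10])
ÿ ÿ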
ÿ ÿ
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ...

        # Softmax normalization
        attn = F.softmax(attn, dim=-1)

        # Application to the values
        output = torch.matmul(attn, v)

# ...

    def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1):
        super().__init__()
        # ...

    def forward(self, q, k, v, mask=None):
        d_k, d_v, n_head = self.d_k, self.d_v, self.n_head
        sz_b, len_q, len_k, len_v = q.size(0), q.size(1), k.size(1), v.size(1)

        residual = q  # Residual connection

        # ...

        # Application of attention
        q, attn = self.attention(q, k, v, mask=mask)

        # Concatenation of the heads
        q = q.transpose(1, 2).contiguous().view(sz_b, len_q, -1)
        q = self.dropout(self.fc(q))

        # ...

        return q, attn
class AttentionVisualizer:
    """Tools to visualize and analyze attention patterns."""
    # ...

        # Analysis metrics
        avg_attention = np.mean(attn_head)
        max_attention = np.max(attn_head)

        # ...

        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        axes = axes.flatten()

        # ...

        metric_names = ['Average Attention', 'Entropy', 'Self-attention', 'Average Distance']

        fig, axes = plt.subplots(2, 2, figsize=(12, 10))
        axes = axes.flatten()

        # ...
            axes[i].bar(range(n_heads), values, alpha=0.7,
                        color=plt.cm.viridis(np.linspace(0, 1, n_heads)))
        # ...

    def __init__(self, vocab_size, d_model=256, n_heads=8, n_layers=6):
        super().__init__()
        # ...

# Parameters
vocab_size = 1000
seq_len = 20
batch_size = 2

model = SelfAttentionAnalyzer(vocab_size, d_model=256, n_heads=8, n_layers=4)
model.eval()

# ...

layer_entropies = []
layer_self_attention = []

for layer_idx, attn_weights in enumerate(attention_weights):
    layer_analysis = visualizer.analyze_attention_patterns(attn_weights)

    avg_entropy = np.mean([analysis['entropy'] for analysis in layer_analysis.values()])
    avg_self_attn = np.mean([analysis['diagonal_attention'] for analysis in layer_analysis.values()])

    layer_entropies.append(avg_entropy)
    layer_self_attention.append(avg_self_attn)

# Evolution graph
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)

# ...
Explanation
Quadratic complexity: the main challenge is the O(n²) cost of self-attention in the sequence length, which motivates research into more efficient variants such as linear or sparse attention.
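As a rough back-of-the-envelope illustration of this quadratic growth (not taken from the course code), the attention matrix alone stores one score per pair of positions:
ÿ ÿ
# Memory for a single attention matrix (float32, one head, one batch element)
for seq_len in [512, 2048, 8192, 32768]:
    scores = seq_len * seq_len        # one score per pair of positions -> O(n^2)
    mib = scores * 4 / (1024 ** 2)    # 4 bytes per float32
    print(f"seq_len={seq_len:>6}: {scores:>13,} scores = {mib:8.1f} MiB")
ÿ ÿ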
Positional encodings are crucial in Transformers because they compensate for the attention mechanism's lack of any natural sequential order.
ÿ ÿ
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# ...

        Args:
            x: Tensor of shape [seq_len, batch_size, d_model]
        """

# ...

        Args:
            x: Tensor of shape [seq_len, batch_size, d_model]
        """
ÿ ÿ
Listing 72 – Implementation and Analysis of Positional Encodings
Residual networks, introduced by He et al. in 2015, revolutionized the training of very deep networks by solving the problem of performance degradation with increasing depth.
Fundamental problem with deep networks: Contrary to intuition, simply increasing the number of layers does not guarantee better performance. Beyond a certain depth, the training error itself starts to increase, suggesting an optimization problem rather than overfitting. Residual connections made it possible to train networks of unprecedented depth (152+ layers).
Principle of residual connections: Instead of directly learning the target function H(x), residual blocks learn the residual function F(x) = H(x) - x, allowing information to "short-circuit" certain layers. The output of a block is therefore

    y = F(x, {Wi}) + x

where F(x, {Wi}) represents the learned residual function and x is the identity shortcut.
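One reason this helps optimization: during backpropagation, the derivative of y = F(x) + x with respect to x is dF/dx + I, so the identity term keeps the gradient from vanishing even when dF/dx is small. A minimal PyTorch check (illustrative only, independent of the listing below):
ÿ ÿ
import torch
import torch.nn as nn

x = torch.randn(4, requires_grad=True)
f = nn.Linear(4, 4)

# Plain block: the gradient only flows through the layer's weights
f(x).sum().backward()
grad_plain = x.grad.clone()

# Residual block: the identity shortcut adds a constant 1 to every gradient component
x.grad = None
(f(x) + x).sum().backward()
grad_res = x.grad.clone()

print(grad_res - grad_plain)  # tensor([1., 1., 1., 1.]) : the shortcut's contribution
ÿ ÿ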
ÿ ÿ
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    expansion = 1

    # ...

        # First convolutional layer
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)

        # Second convolutional layer
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Residual connection (shortcut)
        self.downsample = downsample
        self.stride = stride

    # ...
        # First convolution
        # ...
        # Second convolution
        out = self.conv2(out)
        out = self.bn2(out)

        # ...
        return out

class Bottleneck(nn.Module):
    expansion = 4

    # ...
        self.bn3 = nn.BatchNorm2d(out_channels)

    # ...
        # 1x1 conv
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)

        # 3x3 conv
        out = self.conv2(out)
        out = self.bn2(out)
        out = F.relu(out)

        # 1x1 conv
        out = self.conv3(out)
        out = self.bn3(out)

        # Residual connection
        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = F.relu(out)

        # ...

class ResNet(nn.Module):
    """Generic ResNet built from residual blocks."""

    def __init__(self, block, layers, num_classes=1000):
        super(ResNet, self).__init__()

        self.in_channels = 64

        # ...

        # Residual blocks
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

        # Classification layer
        # ...

        layers = []
        # First block (potentially with downsampling)
        layers.append(block(self.in_channels, out_channels, stride, downsample))
        self.in_channels = out_channels * block.expansion

        # ...

        # Residual blocks
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        # Classification
        x = self.avgpool(x)
        # ...
        return x

def resnet18(num_classes=1000):
    """ResNet-18"""
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes)

# Plain CNN used as a comparison baseline
# ...
            nn.ReLU(),
            # ...
            nn.ReLU(),

            # Block 2
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(),
            # ...
            nn.ReLU(),

            # Block 3
            nn.Conv2d(128, 256, 3, padding=1),
            nn.ReLU(),
            # ...
            nn.ReLU(),

            # Block 4
            nn.Conv2d(256, 512, 3, padding=1),
            nn.ReLU(),
            # ...
            nn.ReLU(),
        )

        # ...
            nn.Flatten(),
            # ...

    def forward(self, x):
        # ...

batch_size = 32
input_tensor = torch.randn(batch_size, 3, 32, 32).to(device)

# Speed test
ÿ ÿ
Listing 73 – Implementing a Residual Block with PyTorch
Explanation
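For comparison with the hand-written blocks above, torchvision ships reference ResNet implementations built from the same basic-block / skip-connection structure; a minimal usage sketch (the input shape and class count are arbitrary):
ÿ ÿ
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)   # randomly initialized reference ResNet-18
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # one RGB image, ImageNet-style size
print(logits.shape)  # torch.Size([1, 10])
ÿ ÿ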
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import numpy as np

# ...
        if not mid_channels:
            mid_channels = out_channels
        # ...
            nn.BatchNorm2d(mid_channels),
            # ...
            nn.BatchNorm2d(out_channels),
        # ...

class Up(nn.Module):
    """Upscaling block followed by a double convolution."""
    # ...
        if bilinear:
            # Bilinear upsampling + 1x1 convolution to reduce the number of channels
            self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
            self.conv = DoubleConv(in_channels, out_channels, in_channels // 2)
        else:
            # Transposed convolution
            self.up = nn.ConvTranspose2d(in_channels, in_channels // 2,
                                         kernel_size=2, stride=2)
        # ...

    # ...
    """Complete U-Net architecture."""
    # ...

class SyntheticSegmentationDataset(Dataset):
    def __init__(self, size=1000, img_size=128):
        self.size = size
        # ...

        # Random generation of geometric shapes
        num_shapes = np.random.randint(2, 5)

        # ...

        # Creation of the circle
        y, x = np.ogrid[:self.img_size, :self.img_size]
        circle_mask = (x - center_x) ** 2 + (y - center_y) ** 2 <= radius ** 2

        # ...

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# ...
# Model, loss and optimizer
# ...

# Training metrics
train_losses = []

# ...
        if batch_idx % 50 == 0:
            print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

# ...

# Results visualization function
# ...
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dataset = SyntheticSegmentationDataset(size=500, img_size=128)

# Training the model
ÿ ÿ
Listing 74 – U-Net Implementation for Image Segmentation
Explanation
Object detection requires not only classifying the objects present in an image, but also locating them precisely. The R-CNN family has revolutionized this field through successive improvements.
ÿ ÿ
import torch
import torch.nn as nn
import torch.nn.functional as F

# ...

        # Initialization of the weights
        self._init_weights()

    # ...
        """
        Args:
            features: Feature maps from the backbone CNN

        Returns:
            cls_logits: Object / background classification scores
            bbox_pred: Box regression predictions
        """
        # Shared features
        x = F.relu(self.conv(features))

        # Classification and regression
        cls_logits = self.cls_logits(x)
        bbox_pred = self.bbox_pred(x)

        # ...
        """
        Args:
            features: Feature maps [N, C, H, W]
            rois: Regions of Interest [num_rois, 5] (batch_idx, x1, y1, x2, y2)

        Returns:
            pooled_features: Pooled features [num_rois, C, output_size, output_size]
        """
        # ...
            # Adaptive pooling
            pooled = F.adaptive_max_pool2d(roi_feature, self.output_size)
            pooled_features.append(pooled)

        # ...
            nn.ReLU(),
            # ...

        """
        Args:
            images: Batch of images [N, 3, H, W]
            gt_boxes: Ground-truth bounding boxes (for training)

        Returns:
            If training: dictionary with the losses
            If inference: predictions (classes, boxes, scores)
        """
        # ...
            return {
                'cls_scores': cls_scores,
                'bbox_pred': bbox_predictions,
                'rpn_cls_logits': rpn_cls_logits,
                'rpn_bbox_pred': rpn_bbox_pred,
                'proposals': proposals
            }
        else:
            # No proposal generated
            return {
                'cls_scores': torch.empty(0, self.cls_score.out_features),
                # ...
            }

        # ...

        # Adding the proposal (format: batch_idx, x1, y1, x2, y2)
        for batch_idx in range(batch_size):
            # ...

        if proposals:
            proposals = torch.tensor(proposals, dtype=torch.float32)
            # Limiting the number of proposals
            if len(proposals) > 100:
                indices = torch.randperm(len(proposals))[:100]
                proposals = proposals[indices]
        else:
            proposals = torch.empty(0, 5)

        # ...
        """
        Args:
            predictions: Dictionary of model predictions
            targets: Dictionary of targets (classes, boxes)

        Returns:
            Total loss and individual components
        """
        # ...
        return {
            'total_loss': total_loss,
            'rpn_cls_loss': rpn_cls_loss,
            'rpn_bbox_loss': rpn_bbox_loss,
            'final_cls_loss': final_cls_loss,
            'final_bbox_loss': final_bbox_loss
        }

    # ... (methods_comparison dictionary: speed, mAP, year and innovation per method)

    print("Performance of the different methods:")
    print("-" * 70)
    print(f"{'Method':<15} {'Speed (FPS)':<15} {'mAP (%)':<10} {'Year':<6} {'Innovation'}")
    print("-" * 70)

    for method, stats in methods_comparison.items():
        print(f"{method:<15} {stats['speed_fps']:<15.1f} {stats['map_score']:<10.1f} "
              f"{stats['year']:<6} {stats['innovation']}")

    return methods_comparison

# ...

# Running the demonstrations
print("=== OBJECT DETECTION WITH R-CNN ===")
comparison_results = compare_detection_methods()
train_faster_rcnn_example()
ÿ ÿ
Listing 75 – Simplified Implementation of Faster R-CNN
Explanation
The evolution from R-CNN to Faster R-CNN perfectly illustrates the progressive optimization of Deep Learning architectures:
Original R-CNN: Used selective search to generate region proposals, then classified each region with a CNN. Very slow, because each region required a separate forward pass.
Fast R-CNN: Introduced ROI pooling, allowing the convolution computations to be shared between all regions of the same image, with end-to-end training and a multi-task loss.
Faster R-CNN: Replaces selective search with a neural Region Proposal Network (RPN), enabling a fully differentiable and much faster pipeline.
This progression shows the importance of architectural and algorithmic optimization for making Deep Learning models practically usable.
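To make the ROI-pooling idea concrete, here is a small sketch using the torchvision operator `torchvision.ops.roi_align` (the feature map and boxes are made up for illustration): every region shares the same backbone feature map and is pooled to a fixed size.
ÿ ÿ
import torch
from torchvision.ops import roi_align

# One shared feature map for the whole image (batch of 1, 256 channels, 50x50 grid)
features = torch.randn(1, 256, 50, 50)

# Two regions of interest, format (batch_index, x1, y1, x2, y2) in feature-map coordinates
rois = torch.tensor([
    [0, 10.0, 10.0, 30.0, 30.0],
    [0,  5.0, 20.0, 45.0, 40.0],
])

# Each region is pooled to a fixed 7x7 grid, whatever its original size
pooled = roi_align(features, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
ÿ ÿ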
Let's apply U-Net to a real use case: segmenting skin lesions to aid dermatological diagnosis.
ÿ ÿ
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# ...

    def __init__(self, size=200, img_size=256, augment=True):
        self.size = size
        # ...

        # Generation of lesions
        num_lesions = np.random.randint(1, 3)

        # ...

        # Irregular shape for the lesion
        radius_base = np.random.randint(20, 40)

        # Creation of an irregular shape
        angles = np.linspace(0, 2 * np.pi, 100)
        # ...

        # Lesion coordinates
        x_coords = center_x + radii * np.cos(angles)
        y_coords = center_y + radii * np.sin(angles)

        # ...

        # Generation of a synthetic image
        image, mask = self.generate_synthetic_skin_lesion()

        # ...
            torch.manual_seed(seed)
            image_tensor = transforms.ToTensor()(self.transform(image_pil))

            torch.manual_seed(seed)
            mask_tensor = transforms.ToTensor()(self.transform(mask_pil))
        else:
            image_tensor = transforms.ToTensor()(image_pil)
            mask_tensor = transforms.ToTensor()(mask_pil)

        # ...

# Medical evaluation metrics
class MedicalMetrics:
    """Segmentation metrics used in medical imaging."""

    @staticmethod
    def dice_coefficient(pred, target, smooth=1e-6):
        """Dice coefficient (F1-score for segmentation)."""
        pred_flat = pred.view(-1)
        target_flat = target.view(-1)
        # ...

    @staticmethod
    def sensitivity(pred, target):
        """Sensitivity (recall) - ability to detect the lesions."""
        pred_flat = pred.view(-1)
        target_flat = target.view(-1)
        # ...
        if actual_positives == 0:
            return 1.0  # No lesion to detect
        # ...

    @staticmethod
    def specificity(pred, target):
        """Specificity - ability to avoid false positives."""
        pred_flat = pred.view(-1)
        target_flat = target.view(-1)
        # ...
        if actual_negatives == 0:
            return 1.0  # No background to preserve
        # ...

# ... (inside the training function)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Training on: {device}")

    # ...
    train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, num_workers=2)
    val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False, num_workers=2)

    # ...

        with torch.no_grad():
            for images, masks in val_loader:
                images, masks = images.to(device), masks.to(device)

                outputs = model(images)
                loss = combined_loss(outputs, masks)
                val_loss += loss.item()
                val_batches += 1

                # Segmentation metrics
                pred_masks = F.softmax(outputs, dim=1)[:, 1] > 0.5

                for i in range(images.size(0)):
                    dice = MedicalMetrics.dice_coefficient(pred_masks[i].float(), masks[i])
                    iou = MedicalMetrics.iou_score(pred_masks[i].float(), masks[i])
                    epoch_dice_scores.append(dice)
                    epoch_iou_scores.append(iou)

        avg_val_loss = val_loss / val_batches
        avg_dice = np.mean(epoch_dice_scores)
        avg_iou = np.mean(epoch_iou_scores)

        val_losses.append(avg_val_loss)
        val_dice_scores.append(avg_dice)
        val_iou_scores.append(avg_iou)

        # Scheduler step
        scheduler.step(avg_val_loss)

        print(f'Epoch {epoch}:')
        print(f'  Train Loss: {avg_train_loss:.4f}')
        print(f'  Val Loss: {avg_val_loss:.4f}')
        print(f'  Val Dice: {avg_dice:.4f}')
        print(f'  Val IoU: {avg_iou:.4f}')
        print(f'  LR: {optimizer.param_groups[0]["lr"]:.6f}')
        print('-' * 50)

    return model, {
        'train_losses': train_losses,
        'val_losses': val_losses,
        'val_dice_scores': val_dice_scores,
        # ...
    }

def evaluate_medical_model(model, test_dataset, device):
    """Complete evaluation of the medical model."""
    model.eval()
    test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)

    # Global metrics
    all_dice_scores = []
    all_iou_scores = []
    all_sensitivity = []
    all_specificity = []

    print("Evaluating the model on the test dataset...")

    with torch.no_grad():
        for idx, (image, mask) in enumerate(test_loader):
            image, mask = image.to(device), mask.to(device)

            # Prediction
            output = model(image)
            pred_mask = F.softmax(output, dim=1)[0, 1] > 0.5

            # Metrics computation
            dice = MedicalMetrics.dice_coefficient(pred_mask.float(), mask)
            iou = MedicalMetrics.iou_score(pred_mask.float(), mask)
Explanation
This medical application of U-Net illustrates several crucial aspects of Deep Learning in healthcare:
Specialized metrics: The Dice coefficient and IoU are better suited to segmentation than simple accuracy. Sensitivity (recall) is critical in medicine because missing a lesion is more serious than a false positive.
Data augmentation: Particularly important with small medical datasets. The transformations must preserve medical consistency (rotation, flip) while avoiding unrealistic deformations.
Rigorous validation: Evaluation on separate test datasets is essential. In practice, validation by medical experts is mandatory before any clinical deployment.
Ethical considerations: Medical AI models require careful attention to bias, equity between populations, and transparency of decisions for clinical acceptance.
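As a reference for these metrics (a minimal sketch, independent of the `MedicalMetrics` class above): Dice = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B| on binary masks.
ÿ ÿ
import torch

def dice_coefficient(pred, target, smooth=1e-6):
    """Dice = 2*|A∩B| / (|A| + |B|) on flattened binary masks."""
    pred, target = pred.float().view(-1), target.float().view(-1)
    intersection = (pred * target).sum()
    return (2 * intersection + smooth) / (pred.sum() + target.sum() + smooth)

def iou_score(pred, target, smooth=1e-6):
    """IoU = |A∩B| / |A∪B| on flattened binary masks."""
    pred, target = pred.float().view(-1), target.float().view(-1)
    intersection = (pred * target).sum()
    union = pred.sum() + target.sum() - intersection
    return (intersection + smooth) / (union + smooth)

# Example: a prediction that covers half of a 4-pixel lesion
pred = torch.tensor([1, 1, 0, 0])
target = torch.tensor([1, 1, 1, 1])
print(dice_coefficient(pred, target))  # ~0.667
print(iou_score(pred, target))         # ~0.5
ÿ ÿ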
Transformers, introduced in "Attention is All You Need" (Vaswani et al., 2017), revolutionized natural language processing by completely abandoning recurrent architectures in favor of the attention mechanism.
Transformers Fundamentals:
— Full parallelization: Unlike RNNs, all tokens can be processed simultaneously during training
— Attention mechanism: Each position can "see" all other positions
— Positional encodings: Compensate for the lack of natural sequential order
— Encoder-decoder architecture: Flexible for various NLP tasks
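The listing below relies on a `create_padding_mask` helper whose definition was lost in extraction; a minimal sketch of such a function (assuming padding token id 0 and a mask shape that broadcasts over heads and query positions) could look like this:
ÿ ÿ
import torch

def create_padding_mask(sequences, pad_token_id=0):
    """1 where the token is real, 0 where it is padding; shape [batch, 1, 1, seq_len]."""
    return (sequences != pad_token_id).unsqueeze(1).unsqueeze(2)

batch = torch.tensor([[5, 7, 9, 0, 0],
                      [3, 2, 0, 0, 0]])
print(create_padding_mask(batch).shape)  # torch.Size([2, 1, 1, 5])
ÿ ÿ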
ÿ ÿ
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# ...

class MultiHeadAttention(nn.Module):
    """Multi-head attention mechanism."""

    # ...

        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        # ...

        """
        Args:
            Q: Queries [batch_size, num_heads, seq_len, d_k]
            K: Keys    [batch_size, num_heads, seq_len, d_k]
            V: Values  [batch_size, num_heads, seq_len, d_k]
            mask: Optional mask to hide certain positions

        Returns:
            output: Output after attention
            attention_weights: Attention weights for visualization
        """
        # ...

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        seq_len = query.size(1)

        Q = self.W_q(query).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        # Application of attention
        attention_output, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        # Concatenation of the heads
        attention_output = attention_output.transpose(1, 2).contiguous().view(
            batch_size, seq_len, self.d_model
        )

        # Final projection
        output = self.W_o(attention_output)

        # ...

        # Frequency computation
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             (-math.log(10000.0) / d_model))

        # ...

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(FeedForward, self).__init__()

        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

# ...

        self.attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)

# ...

    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_len, dropout=0.1):
        # ...

        attention_weights = []
        for transformer_block in self.transformer_blocks:
            x, attn_weights = transformer_block(x, mask)
            attention_weights.append(attn_weights)

# ...

        super(TransformerClassifier, self).__init__()

        self.encoder = TransformerEncoder(vocab_size, d_model, num_heads, num_layers,
                                          d_ff, max_len, dropout)

        # ...
            nn.ReLU(),
        # ...

        # Classification
        logits = self.classifier(pooled)

        # ...

    def __init__(self, vocab_size=1000, seq_len=50, num_samples=1000):
        self.vocab_size = vocab_size
        self.seq_len = seq_len
    # ...

    # Optimizer and loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    # Training
    model.train()
    train_losses = []

    print("Starting the Transformer training...")

    for epoch in range(10):
        epoch_loss = 0
        correct_predictions = 0
        total_predictions = 0

        for batch_idx, (sequences, labels) in enumerate(dataloader):
            sequences, labels = sequences.to(device), labels.to(device)

            # Creation of the padding mask
            mask = create_padding_mask(sequences).to(device)

            # Forward pass
            logits, attention_weights = model(sequences, mask)
            loss = criterion(logits, labels)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Metrics
            epoch_loss += loss.item()
            predictions = torch.argmax(logits, dim=1)
            correct_predictions += (predictions == labels).sum().item()
            total_predictions += labels.size(0)

            if batch_idx % 20 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

        avg_loss = epoch_loss / len(dataloader)
        accuracy = correct_predictions / total_predictions
        train_losses.append(avg_loss)

        print(f'Epoch {epoch}: Loss = {avg_loss:.4f}, Accuracy = {accuracy:.4f}')

    return model, train_losses, attention_weights

# Analysis of attention patterns
def analyze_attention_patterns(model, sample_text, vocab_size=1000):
    """Analyze attention patterns on a text sample."""
    model.eval()
    device = next(model.parameters()).device

    # Converting the text to tokens (simulation)
    tokens = torch.randint(1, vocab_size, (1, 20)).to(device)
    mask = create_padding_mask(tokens).to(device)

    with torch.no_grad():
        _, attention_weights = model(tokens, mask)

    print("Analysis of attention patterns:")
    print(f"  Number of layers: {len(attention_weights)}")
    print(f"  Number of heads per layer: {attention_weights[0].size(1)}")
        'Computational complexity',
        'Training speed',
        'Interpretability',
        'Long sequence performance',
        'Memory consumption'
    ],
    'RNN / LSTM': [
        'Sequential',
        'Limited (vanishing gradient)',
        'O(n) per time step',
        'Slow (sequential)',
        'Difficult',
        'Problematic',
        'Moderate'
    ],
    'Transformer': [
        'Fully parallel',
        'Excellent (global attention)',
        'O(n^2) for attention',
        'Fast (parallelizable)',
        'Good (attention weights)',
        'Excellent',
        'High (quadratic attention)'
    ]
}

for i in range(len(comparison_data['Criterion'])):
    print(f"{comparison_data['Criterion'][i]:<25} | {comparison_data['RNN / LSTM'][i]:<25} | "
          f"{comparison_data['Transformer'][i]}")

# Simulation of time performance
sequence_lengths = [10, 50, 100, 200, 500]

# ...
plt.plot(sequence_lengths, rnn_times, 'o-', label='RNN / LSTM', linewidth=2)
plt.plot(sequence_lengths, transformer_times, 's-', label='Transformer', linewidth=2)
plt.xlabel('Sequence length')
plt.ylabel('Relative time')
plt.title('Training time comparison')
plt.legend()
plt.grid(True, alpha=0.3)

# Memory complexity graph
plt.subplot(1, 2, 2)

# ...

plt.title('Loss Evolution - Our Transformer Training')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)
plt.show()
ÿ ÿ
Listing 77 – Complete Implementation of a Transformer
Explanation
Attention mechanism: Each position can directly access all other positions, eliminating the vanishing gradient problems of RNNs. Weighted attention allows the model to focus on the relevant elements.
Parallelization: Unlike RNNs, which process tokens sequentially, Transformers can process all tokens simultaneously, drastically speeding up GPU training.
Multi-head attention: Allows the model to capture different types of relationships (syntactic, semantic) simultaneously using multiple parallel attention "heads".
Positional encoding: Compensates for the lack of natural sequential order by injecting positional information via sine functions.
This architecture has become the basis for revolutionary models like BERT, GPT, and their successors.
General Conclusion
After this captivating journey through Deep Learning, you are now armed with the essential knowledge to understand, implement,
and optimize deep learning models. From an introduction to neural networks to advanced architectures like Transformers and
multimodal systems, this course has provided you with a solid and progressive foundation.
You've explored the foundations of mathematics, practiced with powerful frameworks like TensorFlow and PyTorch, and experienced
real-world applications in diverse fields such as vision, language, and content generation.
But the adventure is only just beginning. The next chapter will immerse you in the fascinating world of Reinforcement Learning,
where intelligent agents learn to interact with their environment to achieve complex goals. Stay curious, keep experimenting, and
above all... have fun learning!
Congratulations
You have completed this course on Deep Learning! Congratulations on your commitment and perseverance.
For any questions, suggestions or future collaboration, please do not hesitate to contact me:
Next Chapter
Next: Level 4 — Reinforcement Learning
In this next module, we will cover:
— The theoretical foundations of RL (Markov Decision Processes, Policy, Reward)
— Q-Learning and Deep Q Networks (DQN)
— Policy Gradient Methods
— And of course, a practical project to train an agent to play a simple game. Get ready to discover a field
where AI learns by trial and error, like a child discovering the world!