Notions de Deep Learning

This document outlines a comprehensive Deep Learning course designed for students, researchers, and professionals in artificial intelligence. It covers fundamental concepts, mathematical foundations, neural network principles, and advanced architectures, structured into three progressive levels. The course aims to equip participants with the skills to design, implement, and optimize deep neural networks using popular frameworks like TensorFlow and PyTorch.

Uploaded by

Jospin Tchomguim
Machine Translated by Google

Deep Learning Course

18/05/2025

Master's in Big Data and Cloud Computing


Prepared by: Mohamed Ouazze

Table of Contents

1 Introduction
2 Level 1: Fundamentals and Introduction
    2.1 Introduction to Deep Learning
        2.1.1 What is Deep Learning and how does it differ from Machine Learning?
        2.1.2 History and evolution of Deep Learning
        2.1.3 Current applications and use cases by sector
        2.1.4 Presentation of tools and frameworks
        2.1.5 Practical example: Installing development environments
    2.2 Mathematical fundamentals
        2.2.1 Linear Algebra for Deep Learning
        2.2.2 Differential calculus and gradient descent
        2.2.3 Essential Probabilities and Statistics
    2.3 Principles of neural networks
        2.3.1 Structure and operation of an artificial neuron
        2.3.2 Architecture of feedforward networks
        2.3.3 Activation functions and their impact
        2.3.4 Practical example: Create a simple perceptron for binary classification
    2.4 Training neural networks
        2.4.1 Backpropagation algorithm explained step by step
3 Level 2: Intermediate Concepts
    3.1 Convolutional Neural Networks (CNN)
        3.1.1 Principles and architecture of CNNs
        3.1.2 Understanding filters and feature maps
    3.2 Natural Language Processing (NLP)
        3.2.1 Word representation (one-hot, word embeddings)
        3.2.2 Word2Vec, GloVe and FastText
        3.2.3 Introduction to Recurrent Neural Networks (RNN)
        3.2.4 Practical example: Sentiment analysis on movie reviews
    3.3 Advanced recurrent networks
        3.3.1 LSTM and GRU architecture
        3.3.2 Practical example: Fine-tuning BERT for text classification
        3.3.3 Self-attention and multi-head attention
        3.3.4 Positional encodings
4 Level 3: Advanced Concepts
    4.1 Advanced Architectures for Computer Vision
        4.1.1 Residual Networks (ResNet) and Skip Connections
        4.1.2 U-Net architectures for segmentation
        4.1.3 R-CNN, Fast R-CNN, Faster R-CNN for Object Detection
        4.1.4 Practical example: Segmentation of medical images with U-Net
    4.2 Transformers and attention
        4.2.1 Transformers Architecture Explained
General Conclusion
Next Chapter

Mohamed Ouazze 1 BDCC-2024-2025


1 Introduction
Welcome to this comprehensive Deep Learning course. Whether you're a student, researcher, data science
professional, or simply passionate about artificial intelligence, this course is designed to provide you with a
progressive, structured, and practical introduction to deep neural networks.
Deep learning is one of the most dynamic and promising branches of artificial intelligence today. This
discipline has revolutionized many fields such as computer vision, natural language processing, speech
recognition, and many others. Its ability to solve complex problems by learning directly from data makes it a
powerful tool for technological innovation.

This course offers a complete immersion into the world of deep learning, from theoretical fundamentals to
advanced architectures. You will learn how to design, implement, and optimize deep neural networks using
popular libraries like TensorFlow, PyTorch, and Keras.

Why take this course?


— To acquire a solid understanding of the mathematical and algorithmic principles that underlie Deep Learning.
— To master the frameworks most widely used in industry and research.
— To develop the ability to design innovative solutions based on deep neural networks.
— To prepare for the current and future challenges of artificial intelligence.

The course is organized into three progressive levels, each designed to strengthen your skills in a
structured way:

1. Level 1 - Fundamentals and Introduction: Introduction to Deep Learning, mathematical foundations,
perceptrons and simple neural networks.
2. Level 2 - Fundamental Architectures: Convolutional Networks (CNN), Recurrent Networks
(RNN), autoencoders and optimization techniques.
3. Level 3 - Advanced Architectures and Applications: Reinforcement Learning, GANs, Transformers,
Deployment and Ethics in AI.

At the end of this course, you will be able to:

— Understand the theoretical mechanisms that govern deep neural networks
— Design and implement architectures adapted to various problems
— Optimize the performance of your models and avoid common pitfalls
— Deploy your solutions in production environments
— Evaluate the ethical and social implications of your creations

Are you ready to explore the depths of artificial intelligence and master the techniques that shape
our technological future? Let's embark together on this exciting Deep Learning adventure!

2 Level 1: Fundamentals and Introduction


Deep learning is a discipline built on solid theoretical foundations and requiring a thorough understanding of
several areas. This first level will provide you with the fundamental knowledge needed to confidently tackle
more complex architectures.

2.1 Introduction to Deep Learning


2.1.1 What is Deep Learning and how does it differ from Machine Learning?

Deep learning is a sub-branch of machine learning, which itself is part of the broader field of artificial intelligence. While traditional
machine learning often relies on manual feature extraction (feature engineering), deep learning is distinguished by its ability to automatically
learn hierarchical representations directly from raw data.

[Figure: nested sets, with Deep Learning inside Machine Learning, itself inside Artificial Intelligence]

The main difference lies in the architecture and depth of the models. The term "deep" refers to the presence
of multiple hidden layers in neural networks, allowing complex relationships in data to be abstracted and
modeled.

Example

Let's take the case of image recognition:


— In classic Machine Learning: An expert manually defines the relevant features (edges,
textures, etc.), then an algorithm such as an SVM or a Random Forest learns to classify the images.

— In Deep Learning: A deep neural network automatically learns to extract increasingly abstract
features. The first layers detect simple edges, then textures, patterns, and finally high-level
concepts like faces or objects.

Explanation

This fundamental difference explains why deep learning excels in areas where manual feature definition
is difficult or inefficient, such as computer vision, natural language processing, or speech recognition.

2.1.2 History and evolution of Deep Learning

Deep learning has had a fascinating history, marked by periods of sustained enthusiasm followed
by "AI winters", before experiencing a spectacular renaissance in the 2010s.


Period          Key events
------------    ----------------------------------------------------------------------------------------
1943-1958       First theoretical models of artificial neurons (McCulloch & Pitts, Rosenblatt's Perceptron)
1969-1986       "First AI winter" following the limitations of the simple perceptron
1986-1995       Gradient backpropagation algorithm, first practical applications
1995-2006       "Second AI winter", focus on SVMs and other methods
2006-2012       Renaissance under the name "Deep Learning" (Hinton, LeCun, Bengio)
2012-present    Explosion of applications with AlexNet, DeepMind, GPT, etc.

The resurgence of Deep Learning in the 2010s was mainly due to three factors:
— Massive availability of data (Big Data)
— Increased computing power (GPU, TPU)
— Algorithmic advances (new activation functions, regularization techniques, innovative architectures)

Explanation

The major turning point came in 2012 with the landslide victory of the AlexNet convolutional network in the
ImageNet competition. This demonstration of the superiority of deep neural networks
triggered a real revolution in the field of AI, attracting the attention of researchers,
businesses and the general public.

2.1.3 Current applications and use cases by sector

Deep learning has transformed many industries and continues to open up new possibilities
for innovation:

Sector       Applications                                   Concrete examples
---------    -------------------------------------------    -----------------------------------------------
Health       Medical diagnosis, drug discovery              Cancer detection on MRI, AlphaFold
Finance      Fraud detection, algorithmic trading           Anti-money laundering systems, price prediction
Transport    Autonomous vehicles, logistics optimization    Tesla Autopilot, fleet optimization
Industry     Predictive maintenance, quality control        Defect detection, production optimization
Media        Content generation, recommendation             DALL-E, Netflix recommendation systems
Sciences     Climate modeling, particle physics             DeepMind for protein prediction

Example

In the healthcare field, convolutional neural networks (CNNs) are used to analyze medical images and
detect pathologies such as cancer with remarkable accuracy, sometimes
superior to that of doctors. For example, a model developed by Google Health demonstrated
superior performance to radiologists in detecting breast cancer on mammograms.

2.1.4 Presentation of tools and frameworks

Several open-source frameworks currently dominate the Deep Learning ecosystem:


Framework     Lead developer                    Features
----------    ------------------------------    --------------------------------------------------
TensorFlow    Google                            Complete ecosystem, TensorBoard, TF Lite, TF.js
PyTorch       Facebook (Meta)                   Pythonic, dynamic interface, popular in research
Keras         Community (integrated with TF)    High-level API, easy to use for beginners
JAX           Google                            Automatic differentiation, XLA compilation, research


    import tensorflow as tf
    from tensorflow import keras

    # Create a simple sequential model
    model = keras.Sequential([
        keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(10, activation='softmax')
    ])

    # Compile the model
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

Listing 1 – Simple example with TensorFlow/Keras

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Definition of a simple model
    class SimpleNet(nn.Module):
        def __init__(self):
            super(SimpleNet, self).__init__()
            self.fc1 = nn.Linear(784, 128)
            self.dropout = nn.Dropout(0.2)
            self.fc2 = nn.Linear(128, 10)

        def forward(self, x):
            x = F.relu(self.fc1(x))
            x = self.dropout(x)
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # Create the model
    model = SimpleNet()
Listing 2 – Equivalent example with PyTorch

Explanation

The choice of framework often depends on the context of use:


— TensorFlow is often preferred for production deployment and mobile applications.
— PyTorch is generally preferred for research and rapid experimentation.
— Keras offers an intuitive API ideal for learning and rapid prototyping.
In this course, we will mainly use PyTorch and TensorFlow/Keras to cover the
most common approaches in industry and research.


2.1.5 Practical example: Installing development environments

To start working with Deep Learning, it is essential to configure your development environment correctly.

    # Create a virtual environment for Deep Learning
    conda create -n deeplearning python=3.9
    conda activate deeplearning

    # Install PyTorch with GPU support (CUDA 11.7)
    conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

    # Install TensorFlow with GPU support
    conda install -c conda-forge tensorflow-gpu

    # Useful additional libraries
    conda install -c conda-forge matplotlib pandas scikit-learn jupyter
Listing 3 – Installation with Conda

Exercise

Install the development environment on your machine following the instructions above.
Next, create a Jupyter notebook and verify that you can import PyTorch and TensorFlow.
Also check GPU availability using the appropriate commands for each framework.

Solution

    # In a Jupyter notebook
    import torch
    import tensorflow as tf

    # Check PyTorch
    print(f"PyTorch version: {torch.__version__}")
    print(f"GPU available for PyTorch: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"GPU name: {torch.cuda.get_device_name(0)}")

    # Check TensorFlow
    print(f"TensorFlow version: {tf.__version__}")
    print(f"GPU available for TensorFlow: {len(tf.config.list_physical_devices('GPU')) > 0}")
    print(f"Available devices: {tf.config.list_physical_devices()}")

Listing 4 – Verifying the installation

2.2 Mathematical fundamentals


2.2.1 Linear Algebra for Deep Learning

Linear algebra is the fundamental mathematical language of deep learning. Neural networks essentially manipulate vectors, matrices, and
tensors through various linear operations.

Key concepts to master:


Concept           Definition                        Application in Deep Learning
--------------    ------------------------------    ------------------------------------------
Vectors           1D arrays of values               Representation of input features
Matrices          2D arrays of values               Weights of connections between layers
Tensors           Multidimensional arrays           Input data (images, sequences, etc.)
Scalar product    $a \cdot b = \sum_i a_i b_i$      Calculation of a linear combination
Matrix product    $C = A \times B$                  Forward propagation in a network
Transposition     $(A^T)_{ij} = A_{ji}$             Gradient backpropagation
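
As a quick illustration of the operations in the table, here is a minimal NumPy sketch. The arrays `a`, `b`, `A` and `B` are arbitrary values chosen for this example and do not come from the course material.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Scalar (dot) product: sum of element-wise products, 1*4 + 2*5 + 3*6
dot_ab = float(np.dot(a, b))

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])        # shape (3, 2)
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0]])   # shape (2, 3)

# Matrix product: inner dimensions must match, (3, 2) @ (2, 3) -> (3, 3)
C = A @ B

# Transposition swaps rows and columns: (A^T)_ij = A_ji
A_T = A.T

# Identity that appears when gradients are propagated backwards: (A B)^T = B^T A^T
identity_holds = bool(np.allclose(C.T, B.T @ A.T))

print(dot_ab, C.shape, A_T.shape, identity_holds)
```

Checking shapes in this way is usually the fastest route to finding a wrongly oriented weight matrix.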

    import numpy as np

    # Create vectors and matrices
    x = np.array([1, 2, 3, 4])   # Vector (input features)
    W = np.random.randn(4, 3)    # Weight matrix (4 inputs, 3 neurons)
    b = np.random.randn(3)       # Bias vector

    # Fully connected (Dense) layer operation
    # y = W^T * x + b
    y = np.dot(x, W) + b

    print("Input vector x:", x)
    print("Weight matrix W:", W)
    print("Bias vector b:", b)
    print("Output y:", y)
Listing 5 – Linear Algebra Operations with NumPy

Explanation

In the example above, the matrix product np.dot(x, W) represents the fundamental operation
of a fully connected (Dense) layer in a neural network. This operation calculates
the weighted sum of the inputs for each neuron in the next layer. Adding the bias b
lets the network learn an offset for each neuron, increasing its modeling capacity.
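
To make the "weighted sum per neuron" reading concrete, the sketch below computes the same layer output twice, once vectorized and once neuron by neuron, and then applies the layer to a small batch. The seed, the second input row and the variable names `y_loop`, `X_batch` and `Y_batch` are assumptions introduced for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed, chosen only for reproducibility

x = np.array([1.0, 2.0, 3.0, 4.0])    # 4 input features
W = rng.standard_normal((4, 3))       # 4 inputs -> 3 neurons
b = rng.standard_normal(3)            # one bias per neuron

# Vectorized form of the dense layer
y = np.dot(x, W) + b

# Same computation written neuron by neuron: a weighted sum plus a bias
y_loop = np.array([sum(x[i] * W[i, j] for i in range(4)) + b[j] for j in range(3)])

# A batch of inputs goes through the same layer with a single matrix product
X_batch = np.stack([x, 2 * x])        # shape (2, 4)
Y_batch = X_batch @ W + b             # shape (2, 3); the bias is broadcast over rows

print(np.allclose(y, y_loop), Y_batch.shape)
```

The batched form is what frameworks compute in practice, since one matrix product over many samples is much faster than a Python loop.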

2.2.2 Differential calculus and gradient descent

Differential calculus is at the heart of training neural networks. Gradient descent,
based on the calculation of partial derivatives, optimizes the network parameters by minimizing a
loss function.
Fundamental concepts:
— Derivative: Rate of change of a function with respect to a variable
— Partial derivative: Derivative with respect to a variable while keeping the other variables constant
— Gradient: Vector of partial derivatives of a multivariable function
— Chain rule: Fundamental principle allowing gradient backpropagation
Gradient descent updates the parameters $\theta$ according to the formula:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta) \qquad (1)$$

Where:
— $\theta_t$ represents the current parameters
— $\eta$ is the learning rate (hyperparameter)
— $\nabla_\theta J(\theta)$ is the gradient of the loss function J with respect to the parameters $\theta$
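
As a worked instance of the update rule, assume the one-dimensional loss $J(\theta) = \theta^2 + 2$, whose gradient is $2\theta$, with a starting value $\theta_0 = 5$ and a learning rate $\eta = 0.1$ (values picked to match the small experiment in Listing 6). The first steps then read:

```latex
\begin{aligned}
\theta_1 &= \theta_0 - \eta \, \nabla_\theta J(\theta_0) = 5 - 0.1 \times (2 \times 5) = 4 \\
\theta_2 &= \theta_1 - \eta \, \nabla_\theta J(\theta_1) = 4 - 0.1 \times (2 \times 4) = 3.2 \\
\theta_{t+1} &= (1 - 2\eta)\,\theta_t = 0.8\,\theta_t
\quad\Longrightarrow\quad \theta_t = 5 \times 0.8^{\,t} \longrightarrow 0
\end{aligned}
```

For this particular function each step shrinks the parameter by a constant factor $1 - 2\eta$, so the iterates converge whenever $0 < \eta < 1$ and diverge for $\eta > 1$.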


    import numpy as np
    import matplotlib.pyplot as plt

    # Function to minimize: f(x) = x^2 + 2
    def fonction(x):
        return x**2 + 2

    # Derivative of the function: f'(x) = 2x
    def derivee(x):
        return 2 * x

    # Gradient descent algorithm
    def gradient_descent(starting_point, learning_rate, nb_iterations):
        x = starting_point
        history = [x]

        for i in range(nb_iterations):
            gradient = derivee(x)
            x = x - learning_rate * gradient
            history.append(x)

            # Show progress every 10 iterations
            if (i + 1) % 10 == 0:
                print(f"Iteration {i+1}: x = {x:.6f}, f(x) = {fonction(x):.6f}")

        return x, history

    # Run the algorithm
    optimal_point, history = gradient_descent(
        starting_point=5.0,
        learning_rate=0.1,
        nb_iterations=50
    )

    print(f"\nFinal result: x = {optimal_point:.6f}, f(x) = {fonction(optimal_point):.6f}")

    # Visualization
    x_plot = np.linspace(-6, 6, 100)
    plt.figure(figsize=(10, 6))
    plt.plot(x_plot, fonction(x_plot), 'b-', label='f(x) = x^2 + 2')
    plt.plot(history, [fonction(x) for x in history], 'ro-', label='Progress')
    plt.grid(True)
    plt.xlabel('x')
    plt.ylabel('f(x)')
    plt.title('Gradient Descent')
    plt.legend()
    plt.show()
Listing 6 – Implementing a simple gradient descent

Explanation

This example illustrates the fundamental principle of gradient descent on a simple function
of one variable. In neural networks, this principle is applied to multivariable functions
(with potentially millions of parameters), but the logic remains the same: calculate the
gradient of the loss function with respect to each parameter and adjust these parameters in
the direction opposite to the gradient to minimize the loss.
Gradient backpropagation is an efficient application of the chain rule
which allows these gradients to be calculated layer by layer, starting from the output and moving back
towards the network input.
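
To show the same logic with more than one parameter, here is a minimal sketch that fits a line with two parameters by gradient descent. The synthetic data (generated from y = 3x + 1), the mean squared error loss and the names `loss`, `gradient` and `theta` are assumptions made for this illustration, not part of the course code.

```python
import numpy as np

# Synthetic data generated from y = 3x + 1 (values chosen for illustration)
x = np.linspace(-1.0, 1.0, 50)
y = 3.0 * x + 1.0

def loss(theta):
    """Mean squared error J(theta) for theta = [w, b]."""
    w, b = theta
    return np.mean((w * x + b - y) ** 2)

def gradient(theta):
    """Vector of partial derivatives [dJ/dw, dJ/db]."""
    w, b = theta
    error = w * x + b - y
    return np.array([2.0 * np.mean(error * x), 2.0 * np.mean(error)])

theta = np.zeros(2)       # initial parameters [w, b]
learning_rate = 0.1

for _ in range(500):
    # Move every parameter in the direction opposite to its partial derivative
    theta = theta - learning_rate * gradient(theta)

print(theta, loss(theta))
```

The loop body is identical to the one-variable case; only the gradient has become a vector. Backpropagation plays the role of `gradient` when the model is a deep network.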

Exercise

Modify the gradient descent code above to minimize the function $f(x) = x^4 - 2x^2 + x$.
Experiment with different starting points and learning rates. What do you notice
regarding the convergence of the algorithm?


Solution

    import numpy as np
    import matplotlib.pyplot as plt

    # New function to minimize: f(x) = x^4 - 2x^2 + x
    def fonction(x):
        return x**4 - 2 * x**2 + x

    # Derivative of the function: f'(x) = 4x^3 - 4x + 1
    def derivee(x):
        return 4 * x**3 - 4 * x + 1

    # Gradient descent algorithm (unchanged)
    def gradient_descent(starting_point, learning_rate, nb_iterations):
        x = starting_point
        history = [x]

        for i in range(nb_iterations):
            gradient = derivee(x)
            x = x - learning_rate * gradient
            history.append(x)

            if (i + 1) % 10 == 0:
                print(f"Iteration {i+1}: x = {x:.6f}, f(x) = {fonction(x):.6f}")

        return x, history

    # Test with different starting points and learning rates
    configurations = [
        {"depart": 2.0, "rate": 0.1, "iterations": 100},
        {"depart": -1.0, "rate": 0.05, "iterations": 100},
        {"depart": 0.0, "rate": 0.01, "iterations": 100}
    ]

    plt.figure(figsize=(15, 10))

    # Plot of the function
    x_plot = np.linspace(-2, 2, 1000)
    plt.subplot(2, 1, 1)
    plt.plot(x_plot, fonction(x_plot), 'b-', label='f(x) = x^4 - 2x^2 + x')
    plt.grid(True)
    plt.ylabel('f(x)')
    plt.title('Function and local minima')

    # Test the different configurations
    plt.subplot(2, 1, 2)
    for i, config in enumerate(configurations):
        optimal_point, history = gradient_descent(
            starting_point=config["depart"],
            learning_rate=config["rate"],
            nb_iterations=config["iterations"]
        )

        print(f"\nConfiguration {i+1}:")
        print(f"Starting point: {config['depart']}")
        print(f"Learning rate: {config['rate']}")
        print(f"Final result: x = {optimal_point:.6f}, f(x) = {fonction(optimal_point):.6f}")

        plt.plot(history, label=f"Start: {config['depart']}, Rate: {config['rate']}")

Listing 7 – Solution for gradient descent on f(x)

    plt.grid(True)
    plt.xlabel('Iterations')
    plt.ylabel('x')
    plt.title('Convergence for different configurations')
    plt.legend()
    plt.tight_layout()
    plt.show()

Listing 8 – Continuation of gradient descent on f(x)

Observations on convergence:
— The function $f(x) = x^4 - 2x^2 + x$ has two local minima (near $x \approx -1.11$ and $x \approx 0.84$), which makes the convergence
dependent on the starting point.
— With too high a learning rate, the algorithm may oscillate or diverge.
— Depending on the starting point, the algorithm can converge to different minima.
This sensitivity to initial parameters and learning rate is characteristic of non-convex optimization problems, such as those
encountered in Deep Learning.
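
These observations can be checked numerically with a compact sketch. The helper `run`, its divergence threshold of 1e6 and the specific starting points and learning rates are assumptions chosen for this illustration.

```python
def derivee(x):
    # f'(x) = 4x^3 - 4x + 1 for f(x) = x^4 - 2x^2 + x
    return 4 * x**3 - 4 * x + 1

def run(start, learning_rate, steps=200):
    """Plain gradient descent; returns inf if the iterate blows up."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * derivee(x)
        if abs(x) > 1e6:           # treated here as divergence
            return float("inf")
    return x

left = run(start=-1.0, learning_rate=0.05)     # ends in the left basin
right = run(start=1.0, learning_rate=0.05)     # ends in the right basin
diverged = run(start=2.0, learning_rate=0.5)   # step too large for this function

print(left, right, diverged)
```

Two runs with the same learning rate end in different minima only because of where they started, while the third run leaves the region of interest after a few oversized steps.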

2.2.3 Essential Probabilities and Statistics

Probability and statistics play a crucial role in the theoretical foundations of Deep Learning, particularly for the design of loss
functions, data modeling, and performance evaluation.

Fundamental probabilistic concepts:
— Probability distributions: Modeling the distribution of data
— Expectation and variance: Fundamental statistical measures
— Maximum likelihood: Principle of parameter estimation
— Bayes' theorem: Updating beliefs with new information

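Statistical concept    Formula                                                                                   Application in Deep Learning
-------------------    --------------------------------------------------------------------------------------    --------------------------------------------
Normal distribution    $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$    Weight initialization, regularization
Cross entropy          $H(p, q) = -\sum_x p(x) \log q(x)$                                                        Loss function for classification
KL divergence          $D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$                                     Measure of difference between distributions
Maximum likelihood     $\theta_{MLE} = \arg\max_\theta P(X \mid \theta)$                                         Principle of supervised training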

Example

The cross-entropy loss function used in classification can be derived from the maximum likelihood principle. If we model
classification as a multinomial distribution, maximizing the likelihood of the training data is equivalent to minimizing the
cross-entropy between the empirical distribution (actual labels) and the distribution predicted by the model.
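
The equivalence can be written out in two lines. The notation is introduced here for the derivation and is an assumption of this sketch: $N$ training samples, $K$ classes, one-hot labels $y_{n,k}$, and predicted probabilities $\hat{y}_{n,k}(\theta)$.

```latex
\begin{aligned}
L(\theta) &= \prod_{n=1}^{N} \prod_{k=1}^{K} \hat{y}_{n,k}(\theta)^{\,y_{n,k}} \\
-\frac{1}{N} \log L(\theta) &= -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{n,k} \, \log \hat{y}_{n,k}(\theta)
\end{aligned}
```

The right-hand side of the second line is the average over the samples of $H(p, q) = -\sum_x p(x) \log q(x)$ with $p = y_n$ and $q = \hat{y}_n$. Since the logarithm is increasing and the sign is flipped, the parameters that maximize $L(\theta)$ are exactly those that minimize the average cross entropy.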


    import numpy as np

    def cross_entropy(y_true, y_pred):
        """
        Calculates the cross entropy between the y_true and y_pred distributions
        Args:
            y_true: True distribution (one-hot encoding)
            y_pred: Predicted distribution (probabilities)
        Returns:
            Scalar value of the cross entropy
        """
        # Add an epsilon to avoid log(0)
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

        # Number of samples
        n_samples = y_true.shape[0]

        # Calculation of the cross entropy
        ce = -np.sum(y_true * np.log(y_pred)) / n_samples

        return ce

    # Example of use
    # True distribution (one-hot encoding for 3 classes)
    y_true = np.array([
        [1, 0, 0],  # Class 0
        [0, 1, 0],  # Class 1
        [0, 0, 1]   # Class 2
    ])

    # Model predictions (probabilities for each class)
    y_pred = np.array([
        [0.7, 0.2, 0.1],  # Prediction for sample 1
        [0.1, 0.8, 0.1],  # Prediction for sample 2
        [0.2, 0.3, 0.5]   # Prediction for sample 3
    ])

    ce_loss = cross_entropy(y_true, y_pred)
    print(f"Cross entropy loss: {ce_loss:.4f}")
Listing 9 – Calculating Cross Entropy in Python

Explanation

In the context of Deep Learning, cross-entropy measures the difference between the distribution of
actual labels and the probabilities predicted by the model. A lower cross-entropy indicates
a better match between predictions and reality. This measure is particularly
suitable for classification problems where the network output is normalized via a softmax
function to represent a probability distribution over the different classes.
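
The sketch below ties softmax, cross entropy and KL divergence together on a single sample. The `logits` values and the helper names `softmax`, `cross_entropy` and `kl_divergence` are assumptions introduced for this illustration; the formulas follow the table of statistical concepts above.

```python
import numpy as np

def softmax(logits):
    """Turns raw scores into a probability distribution (numerically stable)."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x); terms with p(x) = 0 contribute 0."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw network outputs
q = softmax(logits)                  # predicted distribution
p = np.array([1.0, 0.0, 0.0])        # one-hot label: the true class is 0

h_pq = cross_entropy(p, q)
kl_pq = kl_divergence(p, q)

# For a one-hot p the entropy H(p) is 0, so H(p, q) equals D_KL(p || q)
print(q.sum(), h_pq, kl_pq)
```

With one-hot labels the cross entropy reduces to the negative log-probability assigned to the true class, which is why it is often called the negative log-likelihood loss.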


Another crucial statistical concept is regularization, which controls the complexity of the
model and helps prevent overfitting:
    import numpy as np

    def l1_regularization(weights, lambda_param):
        """
        L1 regularization (Lasso)
        Args:
            weights: Model parameters
            lambda_param: Regularization coefficient
        Returns:
            Regularization term to add to the loss function
        """
        return lambda_param * np.sum(np.abs(weights))

    def l2_regularization(weights, lambda_param):
        """
        L2 regularization (Ridge)
        Args:
            weights: Model parameters
            lambda_param: Regularization coefficient
        Returns:
            Regularization term to add to the loss function
        """
        return lambda_param * np.sum(np.square(weights))

    # Example of use
    weights = np.array([0.5, -0.3, 0.8, -0.1, 0.2])
    lambda_val = 0.01

    l1_reg = l1_regularization(weights, lambda_val)
    l2_reg = l2_regularization(weights, lambda_val)

    print(f"L1 regularization: {l1_reg:.4f}")
    print(f"L2 regularization: {l2_reg:.4f}")
Listing 10 – Implementation of L1 and L2 regularizations
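
A regularization term only has an effect once it is added to the data loss being minimized. The sketch below shows one possible way to combine the two, assuming a mean squared error data loss on a tiny made-up dataset; the helpers `mse`, `regularized_loss` and `l2_penalty_gradient` are hypothetical names introduced here.

```python
import numpy as np

def mse(weights, X, y):
    """Data loss: mean squared error of a linear model X @ weights."""
    return np.mean((X @ weights - y) ** 2)

def l2_penalty(weights, lambda_param):
    return lambda_param * np.sum(np.square(weights))

def regularized_loss(weights, X, y, lambda_param):
    """Data loss plus the L2 term: J(w) + lambda * ||w||^2."""
    return mse(weights, X, y) + l2_penalty(weights, lambda_param)

def l2_penalty_gradient(weights, lambda_param):
    """Gradient of lambda * sum(w^2) is 2 * lambda * w: it pulls weights toward 0."""
    return 2.0 * lambda_param * weights

# Tiny made-up dataset, for illustration only
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
weights = np.array([0.5, -0.3])
lambda_val = 0.01

plain = mse(weights, X, y)
total = regularized_loss(weights, X, y, lambda_val)
grad_pen = l2_penalty_gradient(weights, lambda_val)

print(plain, total, grad_pen)
```

Because the penalty gradient is proportional to the weights themselves, each gradient step shrinks large weights slightly, which is the mechanism behind the term "weight decay".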

Exercise

Modify the cross-entropy function to implement a binary cross-entropy loss,
used in binary classification problems. Test your implementation with examples
of correct and incorrect predictions, and observe the impact on the value of the loss.


Solution

    import numpy as np
    import matplotlib.pyplot as plt

    def binary_cross_entropy(y_true, y_pred):
        """
        Calculates the binary cross-entropy between the y_true and y_pred distributions
        Args:
            y_true: True labels (0 or 1)
            y_pred: Predicted probabilities (between 0 and 1)
        Returns:
            Scalar value of the binary cross-entropy
        """
        # Add an epsilon to avoid log(0)
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

        # Number of samples
        n_samples = len(y_true)

        # Calculation of the binary cross entropy
        bce = -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)) / n_samples

        return bce

    # Test with different scenarios
    # Case 1: Near-perfect predictions
    y_true_1 = np.array([1, 0, 1, 0, 1])
    y_pred_1 = np.array([0.99, 0.01, 0.98, 0.02, 0.97])

    # Case 2: Average predictions
    y_true_2 = np.array([1, 0, 1, 0, 1])
    y_pred_2 = np.array([0.7, 0.3, 0.6, 0.4, 0.8])

    # Case 3: Bad predictions
    y_true_3 = np.array([1, 0, 1, 0, 1])
    y_pred_3 = np.array([0.1, 0.9, 0.2, 0.8, 0.1])

    # Calculation of the losses for each case
    bce_1 = binary_cross_entropy(y_true_1, y_pred_1)
    bce_2 = binary_cross_entropy(y_true_2, y_pred_2)
    bce_3 = binary_cross_entropy(y_true_3, y_pred_3)

    print(f"BCE with near-perfect predictions: {bce_1:.4f}")
    print(f"BCE with average predictions: {bce_2:.4f}")
    print(f"BCE with bad predictions: {bce_3:.4f}")

    # Visualization of the loss function
    y = np.array([1])                     # True label is 1
    preds = np.linspace(0.01, 0.99, 100)  # Different prediction values
    losses = [binary_cross_entropy(y, np.array([p])) for p in preds]

    plt.figure(figsize=(10, 5))
    plt.plot(preds, losses)
    plt.grid(True)
    plt.xlabel('Predicted probability p(y=1)')
    plt.ylabel('BCE loss')
    plt.title('Binary cross-entropy loss function (true label = 1)')
    plt.axvline(x=0.5, color='r', linestyle='--', label='Decision boundary')
    plt.legend()
    plt.show()
Listing 11 – Implementation of binary cross-entropy

This implementation calculates binary cross-entropy, which is suitable for problems where the target
is either 0 or 1. We observe that:
— Predictions close to the true values (case 1) give a low loss
— Uncertain predictions (case 2) give an intermediate loss
— Wrong predictions (case 3) give high loss
The visualization shows how the loss increases as the prediction moves away from the
true value: for a true label of 1, the loss behaves like $-\log(p)$ and grows without bound as the predicted probability approaches 0.
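
A few single-sample values make this behaviour tangible. The helper `bce_single` and the three probabilities are assumptions chosen for this illustration; the true label is 1 in every case.

```python
import math

def bce_single(y_true, p):
    """Binary cross-entropy for one sample with label y_true in {0, 1}."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

confident_right = bce_single(1, 0.99)   # small loss
undecided = bce_single(1, 0.5)          # equals log(2)
confident_wrong = bce_single(1, 0.01)   # large loss

print(confident_right, undecided, confident_wrong)
```

The penalty is strongly asymmetric: going from 0.5 to 0.99 saves less than 0.7 units of loss, while going from 0.5 to 0.01 costs almost 4. This is why the loss punishes confident mistakes so heavily.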

2.3 Principles of neural networks


2.3.1 Structure and operation of an artificial neuron

The artificial neuron, inspired by the biological neuron, is the fundamental computing unit of neural
networks. Its operation can be summarized in three main stages:

1. Aggregation: Linear combination of weighted inputs

2. Adding the bias: Adding a constant term

3. Activation: Application of a non-linear function

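Mathematically, for a neuron with n inputs x = [x_1, x_2, ..., x_n], weights w = [w_1, w_2, ..., w_n]
and a bias b, the output y is calculated as follows:

\[ y = f\left(\sum_{i=1}^{n} w_i x_i + b\right) = f(\mathbf{w} \cdot \mathbf{x} + b) \tag{2} \]

where f is the activation function.

[Figure: artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, activation function f, and output y]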


import numpy as np


class Neurone:
    def __init__(self, nb_entrees):
        """
        Initialization of an artificial neuron.

        Args:
            nb_entrees: Number of inputs of the neuron
        """
        # Initialize the weights with small random values
        self.poids = np.random.randn(nb_entrees) * 0.01
        # Initialize the bias to zero
        self.bias = 0

    def activation_sigmoid(self, x):
        """Sigmoid activation function: f(x) = 1 / (1 + e^(-x))"""
        return 1 / (1 + np.exp(-x))

    def activation_relu(self, x):
        """ReLU activation function: f(x) = max(0, x)"""
        return np.maximum(0, x)

    def activation_tanh(self, x):
        """Hyperbolic tangent activation function"""
        return np.tanh(x)

Listing 12 – Implementation of an artificial neuron


    # (continuation of the Neurone class)
    def forward(self, entrees, fonction_activation='sigmoid'):
        """
        Forward propagation (output calculation).

        Args:
            entrees: Input vector
            fonction_activation: Activation type ('sigmoid', 'relu', 'tanh')

        Returns:
            Neuron output after activation
        """
        # Check the dimension of the inputs
        if len(entrees) != len(self.poids):
            raise ValueError(f"The number of inputs ({len(entrees)}) does not match "
                             f"the number of weights ({len(self.poids)})")

        # Calculation of the weighted sum
        somme_ponderee = np.dot(self.poids, entrees) + self.bias

        # Applying the activation function
        if fonction_activation == 'sigmoid':
            sortie = self.activation_sigmoid(somme_ponderee)
        elif fonction_activation == 'relu':
            sortie = self.activation_relu(somme_ponderee)
        elif fonction_activation == 'tanh':
            sortie = self.activation_tanh(somme_ponderee)
        else:
            raise ValueError("Unrecognized activation function")

        return sortie


# Example of use
neurone = Neurone(nb_entrees=3)
entrees = np.array([0.5, -0.2, 0.1])

# Test with different activation functions
sortie_sigmoid = neurone.forward(entrees, 'sigmoid')
sortie_relu = neurone.forward(entrees, 'relu')
sortie_tanh = neurone.forward(entrees, 'tanh')

print(f"Inputs: {entrees}")
print(f"Weights: {neurone.poids}")
print(f"Bias: {neurone.bias}")
print(f"Output with sigmoid: {sortie_sigmoid:.4f}")
print(f"Output with ReLU: {sortie_relu:.4f}")
print(f"Output with tanh: {sortie_tanh:.4f}")

Listing 13 – Implementation of an artificial neuron continued

Explanation

The artificial neuron is a computing unit that transforms its inputs into a single output via
a linear combination followed by a non-linearity. This non-linearity, introduced by the activation
function, is crucial because it allows the network to model complex relationships between
inputs and outputs. Without it, even a deep network would reduce to a single
linear (affine) transformation. Activation functions such as the sigmoid and the hyperbolic tangent
have the advantage of being bounded, which keeps activations in a fixed range and can stabilize learning. However,
they saturate for inputs of large magnitude, where their derivative is close to zero, so they suffer from the
vanishing gradient problem in deep networks. The ReLU function (Rectified Linear Unit) has gained popularity
because it is much less prone to this problem (its derivative is 1 for positive inputs) and is cheaper to compute,
although it can lead to the "dead neuron" problem: a neuron whose weighted sum stays negative outputs zero and
receives a zero gradient.
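The claim that a network without non-linearities collapses to a single linear transformation can be checked numerically. The sketch below uses arbitrary random weights (the shapes and the seed are assumptions chosen only for illustration) and compares two stacked linear layers with the equivalent single layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with no activation function: y = W2 (W1 x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as one linear layer
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer = W_eq @ x + b_eq

print("Same output without activation:", np.allclose(two_layers, one_layer))

# With a ReLU between the layers, the output generally differs from the
# collapsed linear map, because negative pre-activations are set to zero.
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print("Output with ReLU:", with_relu)
print("Collapsed linear output:", one_layer)
```

Whatever the depth, a stack of purely linear layers can always be rewritten this way, which is why the activation function is what gives depth its expressive power.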


2.3.2 Architecture of feedforward networks

Feedforward neural networks, also called multi-layer perceptrons (MLPs), are organized into successive layers where information flows only
from input to output, without recurrent connections.

1. Input Layer: Receives the raw data


2. Hidden layers: Perform successive transformations

3. Output Layer: Produces the final prediction

[Figure: feedforward network with an input layer (x1, x2, x3), hidden layer 1 (h11, h12, h13), hidden layer 2 (h21, h22), and an output layer (y)]

import numpy as np


class NeuralNetwork:
    def __init__(self, architecture, activation='sigmoid'):
        """
        Initializing a feedforward neural network.

        Args:
            architecture: List defining the number of neurons per layer
                          [input, hidden_layer_1, ..., hidden_layer_n, output]
            activation: Activation function ('sigmoid', 'relu', 'tanh')
        """
        self.architecture = architecture
        self.nb_couches = len(architecture) - 1
        self.activation = activation

        # Initialization of parameters (weights and biases)
        self.parametres = {}

        for l in range(1, self.nb_couches + 1):
            # He initialization for ReLU, Xavier for sigmoid / tanh
            if activation == 'relu':
                factor = np.sqrt(2 / architecture[l-1])
            else:
                factor = np.sqrt(1 / architecture[l-1])

            self.parametres[f'W{l}'] = np.random.randn(architecture[l], architecture[l-1]) * factor
            self.parametres[f'b{l}'] = np.zeros((architecture[l], 1))

Listing 14 – Implementing a feedforward network with NumPy


    # (continuation of the NeuralNetwork class)
    def activation_function(self, Z, derivee=False):
        """
        Applies the chosen activation function.

        Args:
            Z: Input of the activation function
            derivee: If True, returns the derivative of the function

        Returns:
            Result of the activation function or its derivative
        """
        if self.activation == 'sigmoid':
            if not derivee:
                return 1 / (1 + np.exp(-Z))
            else:
                s = 1 / (1 + np.exp(-Z))
                return s * (1 - s)

        elif self.activation == 'relu':
            if not derivee:
                return np.maximum(0, Z)
            else:
                return (Z > 0).astype(float)

        elif self.activation == 'tanh':
            if not derivee:
                return np.tanh(Z)
            else:
                return 1 - np.tanh(Z)**2

        else:
            raise ValueError("Unrecognized activation function")

    def forward_propagation(self, X):
        """
        Forward propagation through the network.

        Args:
            X: Input data (nb_features, nb_samples)

        Returns:
            Network output and caches for backpropagation
        """
        caches = {}
        A = X

        # Propagation through the layers
        for l in range(1, self.nb_couches + 1):
            # Cache the previous activation
            caches[f'A{l-1}'] = A

            # Calculation of the linear combination
            Z = np.dot(self.parametres[f'W{l}'], A) + self.parametres[f'b{l}']
            caches[f'Z{l}'] = Z

            # Applying the activation function
            A = self.activation_function(Z)

        return A, caches

Listing 15 – Implementing a feedforward network with NumPy continued


    # (continuation of the NeuralNetwork class)
    def compute_cost(self, Y_pred, Y, cost_type='mse'):
        """
        Calculates the cost function.

        Args:
            Y_pred: Network predictions
            Y: Actual target values
            cost_type: Type of cost function ('mse', 'bce')

        Returns:
            Cost value
        """
        m = Y.shape[1]  # Number of examples

        if cost_type == 'mse':  # Mean squared error
            cost = np.mean(np.sum((Y_pred - Y)**2, axis=0))

        elif cost_type == 'bce':  # Binary cross-entropy
            epsilon = 1e-15
            Y_pred = np.clip(Y_pred, epsilon, 1 - epsilon)
            cost = -np.sum(Y * np.log(Y_pred) + (1 - Y) * np.log(1 - Y_pred)) / m

        else:
            raise ValueError("Unrecognized cost type")

        return cost

    def predict(self, X):
        """
        Makes a prediction for new data.

        Args:
            X: Input data

        Returns:
            Network predictions
        """
        A, _ = self.forward_propagation(X)
        return A


# Example of use
# Architecture: 3 inputs, 4 neurons in hidden layer 1,
# 2 neurons in hidden layer 2, 1 output
reseau = NeuralNetwork([3, 4, 2, 1], activation='relu')

# Example data
X = np.random.randn(3, 5)  # 5 examples, with 3 features each
Y = np.random.randint(0, 2, (1, 5))  # 5 binary labels

# Forward propagation
Y_pred, caches = reseau.forward_propagation(X)

# Cost calculation
# Note: the same activation is applied to every layer, including the output,
# so with 'relu' the outputs are not probabilities; the clipping in
# compute_cost keeps the binary cross-entropy finite.
cout = reseau.compute_cost(Y_pred, Y, cost_type='bce')

print(f"Network architecture: {reseau.architecture}")
print(f"Dimensions of the inputs X: {X.shape}")
print(f"Dimensions of the outputs Y: {Y.shape}")
print(f"Dimensions of the predictions Y_pred: {Y_pred.shape}")
print(f"Cost: {cout:.4f}")

Listing 16 – Implementing a feedforward network with NumPy continued


Explanation

A feedforward network propagates information from the input layer to the output layer
through a series of nonlinear transformations. Each layer extracts increasingly abstract
features from the input data. In the example above, we implemented
a network with two hidden layers, but this architecture generalizes to an arbitrary
number of layers. The choice of weight initialization is crucial for network convergence.
He initialization (for ReLU) and Xavier/Glorot initialization (for sigmoid/tanh) are popular
techniques that help maintain signal variance across layers, thus avoiding
vanishing or exploding gradient problems. Note that this implementation does not include
learning by backpropagation, which we will discuss in the next section.
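The effect of initialization on signal variance can be observed directly. The following is a minimal sketch, assuming a stack of ReLU layers of equal width with no biases; the function name `activation_std_per_layer`, the width, the depth, and the seed are illustrative choices, not part of the course code.

```python
import numpy as np

def activation_std_per_layer(scale_fn, n_layers=10, width=256, n_samples=512, seed=0):
    """Push random inputs through ReLU layers and record the activation std."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(width, n_samples))
    stds = []
    for _ in range(n_layers):
        # Weights drawn from N(0, scale^2), where scale depends on the fan-in
        W = rng.normal(size=(width, width)) * scale_fn(width)
        A = np.maximum(0, W @ A)  # ReLU layer, bias omitted
        stds.append(A.std())
    return stds

# Fixed small scale: the signal shrinks at every layer
naive = activation_std_per_layer(lambda n_in: 0.01)

# He initialization: scale = sqrt(2 / n_in)
he = activation_std_per_layer(lambda n_in: np.sqrt(2 / n_in))

for layer, (s_naive, s_he) in enumerate(zip(naive, he), start=1):
    print(f"layer {layer:2d}: naive std = {s_naive:.2e}, He std = {s_he:.2e}")
```

With the fixed small scale the activations fade toward zero within a few layers, whereas the He scaling keeps them at a comparable magnitude from one layer to the next.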

2.3.3 Activation functions and their impact


Activation functions introduce nonlinearities into the network, allowing it to model
complex relationships. The choice of activation function has a significant impact on performance
and network convergence.

| Function | Equation | Advantages | Disadvantages |
| Sigmoid | \sigma(x) = \frac{1}{1 + e^{-x}} | Output bounded in (0, 1), differentiable | Vanishing gradient, non-centered output |
| Tanh | \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} | Output bounded in (-1, 1) and centered | Vanishing gradient at the extremes |
| ReLU | \max(0, x) | Fast calculation, mitigates the vanishing gradient | "Dead" neurons (gradient = 0) |
| Leaky ReLU | \max(\alpha x, x), \alpha \approx 0.01 | Avoids the dead neuron problem | Few empirical advantages vs ReLU |
| ELU | x if x > 0; \alpha(e^{x} - 1) if x \le 0 | Average output close to zero | Higher computational cost |
| Softmax | \frac{e^{x_i}}{\sum_j e^{x_j}} | Normalization into a probability distribution | Sensitive to extreme values |

import numpy as np
import matplotlib.pyplot as plt


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def tanh(x):
    return np.tanh(x)


def relu(x):
    return np.maximum(0, x)


def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)


def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))


def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Numerical stability
    return exp_x / np.sum(exp_x)


# Visualization of activation functions
x = np.linspace(-5, 5, 100)

plt.figure(figsize=(15, 10))

# Subplot for each function
functions = [
    (sigmoid, 'Sigmoid', 'blue'),
    (tanh, 'Tanh', 'red'),
    (relu, 'ReLU', 'green'),
    (leaky_relu, 'Leaky ReLU', 'orange'),
    (elu, 'ELU', 'purple'),
]

Listing 17 – Comparison of activation functions


for i, (func, name, color) in enumerate(functions):
    plt.subplot(2, 3, i + 1)
    y = func(x)
    plt.plot(x, y, color=color, linewidth=2)
    plt.title(f'Function {name}')
    plt.grid(True)
    plt.axhline(y=0, color='black', linestyle='-', alpha=0.3)
    plt.axvline(x=0, color='black', linestyle='-', alpha=0.3)

# Softmax requires special processing (input vector)
plt.subplot(2, 3, 6)
x_softmax = np.array([1, 2, 3, 4, 5])
y_softmax = softmax(x_softmax)
plt.bar(range(len(x_softmax)), y_softmax, color='cyan')
plt.title('Softmax (example)')
plt.xlabel('Classes')
plt.ylabel('Probability')

plt.tight_layout()
plt.show()

Listing 18 – Comparison of activation functions (continued)

Explanation

Activation functions are crucial because they determine the network's ability to learn
complex patterns. The ReLU function has become standard in hidden layers because
of its computational simplicity and its ability to avoid the vanishing gradient. The function
softmax is mainly used in the output layer for classification problems
multi-class, because it produces a normalized probability distribution.
The choice of activation function depends on the context:
— Hidden layers: ReLU or its variants (Leaky ReLU, ELU)
— Binary output: Sigmoid
— Multi-class output: Softmax
— Regression: No activation (linear)
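As a small illustration of the output-layer choices listed above, the sketch below applies a sigmoid to a single logit (binary output) and a softmax to a vector of logits (multi-class output). The logit values are arbitrary examples; the softmax subtracts the maximum logit before exponentiating, a common way to handle its sensitivity to extreme values.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(logits):
    shifted = logits - np.max(logits)  # avoids overflow in exp for large logits
    e = np.exp(shifted)
    return e / e.sum()

# Binary output: one logit mapped to a probability of the positive class
binary_prob = sigmoid(0.0)

# Multi-class output: one logit per class mapped to a distribution
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print("Class probabilities:", probs, "sum =", probs.sum())

# Very large logits would overflow a naive exp, but not the shifted version
large = softmax(np.array([1000.0, 999.0, 998.0]))
print("Probabilities for large logits:", large)
```

Because softmax only depends on differences between logits, shifting them all by the same constant leaves the resulting probabilities unchanged.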

2.3.4 Practical example: Creating a simple perceptron for binary classification

Let's implement a simple perceptron capable of learning a logical function such as the AND operator.
import numpy as np
import matplotlib.pyplot as plt


class Perceptron:
    def __init__(self, nb_entrees, learning_rate=0.1):
        """
        Perceptron initialization.

        Args:
            nb_entrees: Number of inputs of the perceptron
            learning_rate: Learning rate for the weight updates
        """
        # Random initialization of weights
        self.poids = np.random.randn(nb_entrees) * 0.1
        self.bias = np.random.randn() * 0.1
        self.learning_rate = learning_rate
        self.error_history = []

    def activation_function(self, x):
        """Step activation function"""
        return 1 if x >= 0 else 0

Listing 19 – Perceptron for binary classification - AND operator


    # (continuation of the Perceptron class)
    def prediction(self, entrees):
        """
        Makes a prediction for a set of inputs.

        Args:
            entrees: Input vector

        Returns:
            Binary prediction (0 or 1)
        """
        # Calculation of the weighted sum
        somme_ponderee = np.dot(self.poids, entrees) + self.bias
        # Applying the activation function
        return self.activation_function(somme_ponderee)

    def training(self, X, y, nb_epochs=100):
        """
        Trains the perceptron on the data.

        Args:
            X: Input data matrix (nb_samples, nb_features)
            y: Vector of target labels
            nb_epochs: Number of training iterations
        """
        for epoch in range(nb_epochs):
            total_errors = 0

            # Loop over all training examples
            for i in range(len(X)):
                # Prediction for the current example
                prediction = self.prediction(X[i])

                # Calculation of the error
                error = y[i] - prediction
                total_errors += abs(error)

                # Update the weights on error (modified Hebb rule)
                if error != 0:
                    self.poids += self.learning_rate * error * X[i]
                    self.bias += self.learning_rate * error

            # Record the total error for this epoch
            self.error_history.append(total_errors)

            # Early stopping on convergence
            if total_errors == 0:
                print(f"Convergence reached at epoch {epoch + 1}")
                break

    def evaluate(self, X, y):
        """
        Evaluates the performance of the perceptron.

        Args:
            X: Test data
            y: True labels

        Returns:
            Accuracy of the model
        """
        predictions = np.array([self.prediction(x) for x in X])
        precision = np.mean(predictions == y)
        return precision

Listing 20 – Perceptron for binary classification - AND operator


# Create the data for the AND operator
X_train = np.array([
    [0, 0],  # 0 AND 0 = 0
    [0, 1],  # 0 AND 1 = 0
    [1, 0],  # 1 AND 0 = 0
    [1, 1],  # 1 AND 1 = 1
])

y_train = np.array([0, 0, 0, 1])

# Create and train the perceptron
perceptron = Perceptron(nb_entrees=2, learning_rate=0.1)

print("Before training:")
print(f"Weights: {perceptron.poids}")
print(f"Bias: {perceptron.bias}")
print("Predictions:")
for i, x in enumerate(X_train):
    pred = perceptron.prediction(x)
    print(f"  {x[0]} AND {x[1]} = {pred} (expected: {y_train[i]})")

# Training
print("\nTraining in progress...")
perceptron.training(X_train, y_train, nb_epochs=100)

print("\nAfter training:")
print(f"Weights: {perceptron.poids}")
print(f"Bias: {perceptron.bias}")
print("Predictions:")
for i, x in enumerate(X_train):
    pred = perceptron.prediction(x)
    print(f"  {x[0]} AND {x[1]} = {pred} (expected: {y_train[i]})")

# Evaluation
precision = perceptron.evaluate(X_train, y_train)
print(f"\nAccuracy: {precision * 100:.1f}%")

# Visualization of the evolution of the error
plt.figure(figsize=(10, 5))
plt.plot(perceptron.error_history, 'b-', linewidth=2)
plt.title('Error evolution during training')
plt.xlabel('Epoch')
plt.ylabel('Number of errors')
plt.grid(True)
plt.show()

Listing 21 – Perceptron for binary classification - AND operator

Exercise

Modify the perceptron above to learn the OR operator and then the XOR operator. What do you notice about
the XOR? Explain why and propose a solution.


Solution

# Test with the OR operator
X_or = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_or = np.array([0, 1, 1, 1])

perceptron_or = Perceptron(nb_entrees=2, learning_rate=0.1)
perceptron_or.training(X_or, y_or, nb_epochs=100)
precision_or = perceptron_or.evaluate(X_or, y_or)
print(f"OR accuracy: {precision_or * 100:.1f}%")

# Test with the XOR operator (non-linearly separable problem)
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

perceptron_xor = Perceptron(nb_entrees=2, learning_rate=0.1)
perceptron_xor.training(X_xor, y_xor, nb_epochs=1000)
precision_xor = perceptron_xor.evaluate(X_xor, y_xor)
print(f"XOR accuracy: {precision_xor * 100:.1f}%")


# Solution for XOR: multilayer network
class ReseauMulticouche:
    def __init__(self):
        # Hidden layer: 2 neurons (manually adjusted weights)
        # Neuron 1 behaves like OR, neuron 2 like NAND
        self.W1 = np.array([[20, 20], [-20, -20]])
        self.b1 = np.array([-10, 30])

        # Output layer: 1 neuron that behaves like AND of the two hidden
        # neurons, since XOR = OR AND NAND. (With weights [20, -20] and
        # bias -10 the network would compute AND of the inputs, not XOR.)
        self.W2 = np.array([20, 20])
        self.b2 = -30

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -250, 250)))  # Clip for stability

    def forward(self, X):
        # Hidden layer
        z1 = np.dot(X, self.W1.T) + self.b1
        a1 = self.sigmoid(z1)

        # Output layer
        z2 = np.dot(a1, self.W2.T) + self.b2
        a2 = self.sigmoid(z2)

        return (a2 > 0.5).astype(int)


# Testing the multilayer network on XOR
reseau_xor = ReseauMulticouche()
predictions_xor = reseau_xor.forward(X_xor)
precision_multicouche = np.mean(predictions_xor == y_xor)

print("\nResults for XOR with the multilayer network:")
for i, x in enumerate(X_xor):
    pred = predictions_xor[i]
    print(f"  {x[0]} XOR {x[1]} = {pred} (expected: {y_xor[i]})")
print(f"Accuracy: {precision_multicouche * 100:.1f}%")

Listing 22 – Testing with different logical operators

Explanation: The simple perceptron cannot learn the XOR function because this function
is not linearly separable: it is impossible to draw a straight line that perfectly separates
the two classes in two-dimensional space. This historical limitation of the perceptron led to the
development of multilayer networks, which can solve non-linearly-separable problems
thanks to hidden layers and nonlinear activation functions.
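Both halves of this explanation can be verified with a short experiment. The sketch below first searches a grid of single-neuron weights (the grid range and step are arbitrary assumptions) and finds that no linear threshold unit classifies more than three of the four XOR points; it then uses two hand-picked hidden units, an OR and a NAND, whose outputs a single unit can separate.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# 1) Single neuron: try many (w1, w2, b) combinations
grid = np.linspace(-2, 2, 21)
best = 0.0
for w1 in grid:
    for w2 in grid:
        for b in grid:
            pred = step(X @ np.array([w1, w2]) + b)
            best = max(best, np.mean(pred == y_xor))
print(f"Best single-neuron accuracy on XOR: {best:.2f}")

# 2) Hidden layer with hand-picked weights: h1 = OR, h2 = NAND
h1 = step(X @ np.array([1, 1]) - 0.5)
h2 = step(X @ np.array([-1, -1]) + 1.5)
H = np.column_stack([h1, h2])

# In the hidden space, XOR is simply h1 AND h2
out = step(H @ np.array([1, 1]) - 1.5)
print("Hidden representation:\n", H)
print("Network output:", out, "expected:", y_xor)
```

The hidden layer remaps the four input points so that the two classes become separable by a single line, which is exactly what the output neuron exploits.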


2.4 Training Neural Networks

2.4.1 Backpropagation algorithm explained step by step

Backpropagation is the fundamental algorithm for training multi-layer neural networks. It uses the chain rule
to efficiently calculate the gradients of the
loss function with respect to all network parameters.
Steps of backpropagation:
1. Forward propagation: Calculating activations layer by layer
2. Loss calculation: Evaluation of the error between prediction and ground truth
3. Back propagation: Calculating gradients by going up the network
4. Parameter Update: Adjusting Weights and Biases According to Gradients
import numpy as np


class ReseauNeuronesBP:
    def __init__(self, architecture):
        """
        Neural network with backpropagation.

        Args:
            architecture: List [nb_entrees, nb_caches1, ..., nb_sorties]
        """
        self.architecture = architecture
        self.nb_couches = len(architecture) - 1

        # Initializing parameters
        self.parametres = {}
        for l in range(1, self.nb_couches + 1):
            # Xavier / Glorot initialization
            self.parametres[f'W{l}'] = np.random.randn(
                architecture[l], architecture[l-1]
            ) * np.sqrt(2 / (architecture[l] + architecture[l-1]))

            self.parametres[f'b{l}'] = np.zeros((architecture[l], 1))

    def sigmoid(self, z):
        """Sigmoid activation function with overflow protection"""
        z = np.clip(z, -250, 250)  # Overflow prevention
        return 1 / (1 + np.exp(-z))

    def sigmoid_derivee(self, z):
        """Derivative of the sigmoid function"""
        s = self.sigmoid(z)
        return s * (1 - s)

Listing 23 – Complete Implementation of Backpropagation


    # (continuation of the ReseauNeuronesBP class)
    def propagation_avant(self, X):
        """
        Forward propagation with saving of intermediate values.

        Args:
            X: Input data (nb_features, nb_samples)

        Returns:
            Final activations and cache for backpropagation
        """
        cache = {'A0': X}
        A = X

        for l in range(1, self.nb_couches + 1):
            # Compute Z = W*A + b
            Z = np.dot(self.parametres[f'W{l}'], A) + self.parametres[f'b{l}']
            cache[f'Z{l}'] = Z

            # Compute A = sigmoid(Z)
            A = self.sigmoid(Z)
            cache[f'A{l}'] = A

        return A, cache

    def calcul_cout(self, Y_pred, Y):
        """
        Calculation of the binary cross-entropy.

        Args:
            Y_pred: Predictions (nb_classes, nb_samples)
            Y: Ground truth (nb_classes, nb_samples)

        Returns:
            Average cost across all samples
        """
        m = Y.shape[1]
        epsilon = 1e-15
        Y_pred = np.clip(Y_pred, epsilon, 1 - epsilon)

        cout = -np.sum(Y * np.log(Y_pred) + (1 - Y) * np.log(1 - Y_pred)) / m
        return cout

    def propagation_arriere(self, cache, Y):
        """
        Calculation of gradients by backpropagation.

        Args:
            cache: Intermediate values of forward propagation
            Y: Ground truth

        Returns:
            Gradients for all parameters
        """
        gradients = {}
        m = Y.shape[1]

        # Output layer gradient (activations are clipped, as in calcul_cout,
        # to avoid a division by zero when the sigmoid saturates)
        A_final = np.clip(cache[f'A{self.nb_couches}'], 1e-15, 1 - 1e-15)
        dA = -(np.divide(Y, A_final) - np.divide(1 - Y, 1 - A_final))

        # Backpropagation layer by layer
        for l in range(self.nb_couches, 0, -1):
            # Gradient of the activation function
            dZ = dA * self.sigmoid_derivee(cache[f'Z{l}'])

            # Gradients of the parameters
            gradients[f'dW{l}'] = np.dot(dZ, cache[f'A{l-1}'].T) / m
            gradients[f'db{l}'] = np.sum(dZ, axis=1, keepdims=True) / m

            # Gradient for the previous layer (if not the first)
            if l > 1:
                dA = np.dot(self.parametres[f'W{l}'].T, dZ)

        return gradients

Listing 24 – Complete Implementation of Backpropagation


    # (continuation of the ReseauNeuronesBP class)
    def update_parameters(self, gradients, learning_rate):
        """
        Updates the parameters according to the calculated gradients.

        Args:
            gradients: Gradients calculated by backpropagation
            learning_rate: Learning rate
        """
        for l in range(1, self.nb_couches + 1):
            self.parametres[f'W{l}'] -= learning_rate * gradients[f'dW{l}']
            self.parametres[f'b{l}'] -= learning_rate * gradients[f'db{l}']

    def training(self, X, Y, nb_epochs, learning_rate, display_cost=100):
        """
        Trains the network with backpropagation.

        Args:
            X: Input data (nb_features, nb_samples)
            Y: Labels (nb_classes, nb_samples)
            nb_epochs: Number of training iterations
            learning_rate: Learning rate
            display_cost: Cost display frequency

        Returns:
            Cost history
        """
        historical_costs = []

        for epoch in range(nb_epochs):
            # Forward propagation
            A_final, cache = self.propagation_avant(X)

            # Cost calculation
            cout = self.calcul_cout(A_final, Y)
            historical_costs.append(cout)

            # Backpropagation
            gradients = self.propagation_arriere(cache, Y)

            # Parameter update
            self.update_parameters(gradients, learning_rate)

            # Displaying progress
            if epoch % display_cost == 0:
                print(f"Epoch {epoch}: Cost = {cout:.6f}")

        return historical_costs

    def prediction(self, X):
        """
        Makes predictions on new data.

        Args:
            X: Input data

        Returns:
            Binary predictions
        """
        A_final, _ = self.propagation_avant(X)
        return (A_final > 0.5).astype(int)

Listing 25 – Complete Implementation of Backpropagation


# Example of use: synthetic spiral data
def generate_spiral_data(n_points=300, noise=0.1):
    """Generate spiral data for testing"""
    np.random.seed(42)

    N = n_points // 2  # points per class

    # Class 0
    r = np.linspace(0.1, 1, N)
    t = np.linspace(0, 4 * np.pi, N) + np.random.randn(N) * noise
    x0 = r * np.cos(t)
    y0 = r * np.sin(t)

    # Class 1
    r = np.linspace(0.1, 1, N)
    t = np.linspace(np.pi, 5 * np.pi, N) + np.random.randn(N) * noise
    x1 = r * np.cos(t)
    y1 = r * np.sin(t)

    # Combination
    X = np.vstack([np.column_stack([x0, y0]), np.column_stack([x1, y1])])
    y = np.hstack([np.zeros(N), np.ones(N)])

    return X.T, y.reshape(1, -1)


# Test on spiral data
X, Y = generate_spiral_data(n_points=400)

# Network creation and training
reseau = ReseauNeuronesBP([2, 10, 8, 1])  # 2 inputs, 2 hidden layers, 1 output

print("Training on spiral data...")
historique = reseau.training(
    X, Y,
    nb_epochs=2000,
    learning_rate=0.5,
    display_cost=400,
)

# Evaluation
predictions = reseau.prediction(X)
precision = np.mean(predictions == Y)
print(f"\nFinal accuracy: {precision * 100:.2f}%")

# Visualization of results
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))

# Graph 1: evolution of the cost
plt.subplot(1, 2, 1)
plt.plot(historique)
plt.title('Cost evolution during training')
plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.grid(True)

# Graph 2: data classification
plt.subplot(1, 2, 2)
colors = ['red', 'blue']
for i in [0, 1]:
    mask = (Y[0] == i)
    # scatter accepts a single marker per call; it is chosen here from the
    # prediction made for the first point of the class
    plt.scatter(X[0, mask], X[1, mask], c=colors[i],
                marker='o' if predictions[0, mask][0] == i else 'x',
                label=f'Classe {i}', alpha=0.7)

Listing 26 – Complete implementation of backpropagation


plt.title(f'Spiral classification (accuracy: {precision * 100:.1f}%)')
plt.xlabel('X1')
plt.ylabel('X2')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

Listing 27 – Complete Implementation of Backpropagation

Explanation

The backpropagation algorithm works in two main phases:


Phase 1 - Forward propagation: Data traverses the network layer by layer, each
layer applying a linear transformation followed by a nonlinear activation function.
Phase 2 - Backpropagation: The gradient of the loss function is calculated by moving backward through
the network, using the chain rule to break the complex calculation down into simple steps.
The key to backpropagation lies in the chain rule of differential calculus:

\[ \frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}} \]

where L is the loss function, w_{ij}^{(l)} is the weight connecting neuron i of layer l-1 to neuron
j of layer l, and z_j^{(l)} is the net input of neuron j in layer l.
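A common way to check a chain-rule derivation is to compare it with a finite-difference estimate. The sketch below does this for a single sigmoid neuron with a binary cross-entropy loss; the weights, input, label, and step size `eps` are arbitrary illustrative values, and the variable names are not taken from the listings above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy of one sigmoid neuron on one example."""
    a = sigmoid(w @ x + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

w = np.array([0.3, -0.2])
b = 0.1
x = np.array([1.5, -0.7])
y = 1.0

# Chain rule: dL/dw = dL/dz * dz/dw, with dL/dz = dL/da * da/dz
z = w @ x + b
a = sigmoid(z)
dL_da = -(y / a) + (1 - y) / (1 - a)
da_dz = a * (1 - a)
dL_dz = dL_da * da_dz
dz_dw = x
grad_analytic = dL_dz * dz_dw

# Central finite differences, one weight at a time
eps = 1e-6
grad_numeric = np.zeros_like(w)
for i in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    grad_numeric[i] = (loss(w_plus, b, x, y) - loss(w_minus, b, x, y)) / (2 * eps)

print("Analytic gradient:", grad_analytic)
print("Numeric gradient: ", grad_numeric)
```

For this particular pairing of sigmoid and cross-entropy, the product dL/da * da/dz simplifies to a - y, so the gradient with respect to the weights is (a - y) * x.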

Loss function and evaluation metrics

The choice of the loss function is crucial because it
defines the optimization objective of the model. Different tasks require different loss functions:

| Task type | Loss function | Mathematical formula |
| Binary classification | Binary cross-entropy | -\frac{1}{m} \sum_{i=1}^{m} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)] |
| Multi-class classification | Categorical cross-entropy | -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij}) |
| Regression | Mean squared error | \frac{1}{2m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 |
| Robust regression | Mean absolute error | \frac{1}{m} \sum_{i=1}^{m} |y_i - \hat{y}_i| |

Common evaluation metrics:


— Accuracy: Proportion of correct predictions
— Precision: Proportion of true positives among positive predictions
— Recall: Proportion of true positives detected
— F1-Score: Harmonic mean of precision and recall
— AUC-ROC: Area under the ROC curve for binary classification
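The first four metrics above follow directly from the counts of true and false positives and negatives. The sketch below computes them with NumPy on a small made-up set of binary labels and predictions (the arrays are illustrative data, not results from the course).

```python
import numpy as np

# Illustrative labels and predictions (hypothetical data)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Confusion-matrix counts
tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In practice the denominators can be zero (for example, no positive predictions at all), so production code usually guards those divisions.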
import torch
import torch.nn.functional as F


def binary_cross_entropy_loss(y_pred, y_true):
    """Binary cross-entropy loss"""
    # y_pred must contain probabilities in [0, 1]; y_true is a float tensor
    return F.binary_cross_entropy(y_pred, y_true)


def categorical_cross_entropy_loss(y_pred, y_true):
    """Categorical cross-entropy loss"""
    # y_pred contains raw logits (softmax is applied internally);
    # y_true contains class indices
    return F.cross_entropy(y_pred, y_true)


def mean_squared_error_loss(y_pred, y_true):
    """Mean squared error"""
    return F.mse_loss(y_pred, y_true)

Listing 28 – Implementation of common loss functions


def focal_loss(y_pred, y_true, alpha=1, gamma=2):
    """Focal loss for imbalanced classes"""
    ce_loss = F.cross_entropy(y_pred, y_true, reduction='none')
    pt = torch.exp(-ce_loss)  # probability assigned to the true class
    loss = alpha * (1 - pt) ** gamma * ce_loss
    return loss.mean()


# Example of use with metrics
def calculate_metrics(y_pred, y_true):
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # Converting per-class scores into predicted classes
    y_pred_classes = torch.argmax(y_pred, dim=1)

    # Calculating metrics
    accuracy = accuracy_score(y_true.cpu(), y_pred_classes.cpu())
    precision = precision_score(y_true.cpu(), y_pred_classes.cpu(), average='weighted')
    recall = recall_score(y_true.cpu(), y_pred_classes.cpu(), average='weighted')
    f1 = f1_score(y_true.cpu(), y_pred_classes.cpu(), average='weighted')

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
    }

Listing 29 – Implementation of common loss functions

Weight Initialization Strategies Weight initialization is a critical aspect that can significantly affect model convergence and performance. Improper initialization can lead to vanishing or exploding gradients.
Main initialization methods:

Method | Distribution | Optimal use case
Xavier/Glorot | $\mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)$ | Symmetric activation functions (tanh, sigmoid)
He/MSRA | $\mathcal{N}\left(0, \frac{2}{n_{in}}\right)$ | ReLU and its variants
LeCun | $\mathcal{N}\left(0, \frac{1}{n_{in}}\right)$ | Very deep networks with normalization
Orthogonal | Orthogonal matrices | RNNs and recurrent networks

Here $n_{in}$ and $n_{out}$ are the number of inputs and outputs of the layer, and the second argument of $\mathcal{N}$ is the variance.

import math

import torch
import torch.nn as nn

def xavier_uniform_init(layer):
    """Xavier (Glorot) uniform initialization."""
    if isinstance(layer, nn.Linear):
        fan_in = layer.in_features
        fan_out = layer.out_features
        # U(-a, a) has variance a^2 / 3, so a = sqrt(6 / (fan_in + fan_out))
        # yields the target variance 2 / (fan_in + fan_out).
        limit = math.sqrt(6.0 / (fan_in + fan_out))
        layer.weight.data.uniform_(-limit, limit)
        if layer.bias is not None:
            layer.bias.data.zero_()

def he_normal_init(layer):
    """He (Kaiming) normal initialization."""
    if isinstance(layer, nn.Linear):
        fan_in = layer.in_features
        std = math.sqrt(2.0 / fan_in)
        layer.weight.data.normal_(0, std)
        if layer.bias is not None:
            layer.bias.data.zero_()

Listing 30 – Implementation of the different initialization strategies


def lecun_normal_init(layer):
    """LeCun normal initialization."""
    if isinstance(layer, nn.Linear):
        fan_in = layer.in_features
        std = math.sqrt(1.0 / fan_in)
        layer.weight.data.normal_(0, std)
        if layer.bias is not None:
            layer.bias.data.zero_()

# Example of application on a model
class MLPWithCustomInit(nn.Module):
    def __init__(self, input_size, hidden_sizes, output_size, init_method='he'):
        super(MLPWithCustomInit, self).__init__()

        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            prev_size = hidden_size

        layers.append(nn.Linear(prev_size, output_size))
        self.network = nn.Sequential(*layers)

        # Apply the chosen initialization to every submodule
        if init_method == 'xavier':
            self.network.apply(xavier_uniform_init)
        elif init_method == 'he':
            self.network.apply(he_normal_init)
        elif init_method == 'lecun':
            self.network.apply(lecun_normal_init)

    def forward(self, x):
        return self.network(x)

Listing 31 – Implementation of the different initialization strategies (continued)

Explanation

The choice of initialization method depends mainly on the activation function used:
— He initialization: Optimal for ReLU and its variants (Leaky ReLU, ELU)
— Xavier initialization: Suitable for symmetric functions like tanh and sigmoid
— LeCun initialization: Recommended for networks with batch normalization
Proper initialization helps avoid problems with vanishing or exploding gradients, thus speeding up
convergence.
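To make the effect on signal scale concrete, here is a small pure-Python sketch (no framework needed) that pushes one random vector through a stack of ReLU layers and reports the mean squared activation at the end. The function name, the width of 128, the depth of 10 and the comparison value of 0.01 are all assumptions chosen for illustration.

```python
import math
import random


def mean_square_after_layers(weight_std_fn, width=128, depth=10, seed=0):
    """Push one random vector through `depth` ReLU layers and return
    the mean squared activation of the last layer.

    `weight_std_fn(fan_in)` returns the standard deviation used to draw
    the weights (hypothetical helper, for illustration only).
    """
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        std = weight_std_fn(width)
        new_h = []
        for _ in range(width):
            # One neuron: weighted sum of the previous layer, then ReLU
            z = sum(rng.gauss(0.0, std) * a for a in h)
            new_h.append(max(0.0, z))
        h = new_h
    return sum(a * a for a in h) / width


he_scale = mean_square_after_layers(lambda fan_in: math.sqrt(2.0 / fan_in))
tiny_scale = mean_square_after_layers(lambda fan_in: 0.01)

print(f"He initialization: {he_scale:.3e}")
print(f"Fixed std of 0.01: {tiny_scale:.3e}")
```

With He initialization the activation scale should stay within a few orders of magnitude of its starting value, whereas the fixed small standard deviation makes it shrink at every layer. The gradients of such a network are scaled by the same weights, which is the vanishing-gradient situation described above.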


3 Level 2: Intermediate Concepts


This intermediate level will introduce you to the specialized architectures that have revolutionized Deep
Learning. We will explore convolutional networks, essential for computer vision, as well as
transfer learning techniques that allow the use of pre-trained models.

3.1 Convolutional Neural Networks (CNN)


Convolutional neural networks represent a major advance in image and signal processing.
Inspired by the visual cortex of mammals, they exploit the spatial properties of data
to automatically extract hierarchical features.

3.1.1 Principles and architecture of CNNs

CNNs are distinguished from traditional neural networks by their ability to preserve the spatial
structure of the input data. This preservation is made possible by three key concepts:
1. Local connectivity: Each neuron is connected only to a local region of the previous layer, unlike in fully connected layers.

2. Parameter sharing: The same weights (filters) are used across the entire image, drastically
reducing the number of parameters.
3. Translation invariance: The same feature can be detected regardless of its
position in the image.
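The saving from parameter sharing can be sketched with a quick count. The two helper functions and the layer sizes below are illustrative assumptions; the comparison is only indicative, since the two layers do not produce outputs of the same shape.

```python
def dense_params(in_h, in_w, in_c, out_units):
    """Weights + biases of a fully connected layer applied to a flattened image."""
    return in_h * in_w * in_c * out_units + out_units


def conv_params(kernel_size, in_c, out_c):
    """Weights + biases of a convolutional layer; independent of the image size."""
    return kernel_size * kernel_size * in_c * out_c + out_c


# A 32x32 RGB image mapped to 32 output units (dense) or 32 feature maps (conv)
dense = dense_params(32, 32, 3, 32)   # 32*32*3*32 + 32 = 98,336
conv = conv_params(5, 3, 32)          # 5*5*3*32 + 32 = 2,432
print(f"Fully connected: {dense:,} parameters")
print(f"5x5 convolution: {conv:,} parameters")
```

The convolutional count does not change if the image grows to 224×224, whereas the fully connected count grows with the number of pixels.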

[Figure: Input image (32×32×3) → Conv1 (28×28×32) → Pool1 (14×14×32) → Conv2 (10×10×64) → Pool2 (5×5×64) → FC (10 classes)]

Typical architecture of a CNN:

1. Convolution Layers: Local Feature Extraction

2. Pooling Layers: Dimensionality Reduction and Invariance


3. Fully Connected Layers: Final Classification

4. Activation functions: Introduction of non-linearities (ReLU, etc.)
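The spatial size after each convolution or pooling layer follows the usual formula floor((n + 2p − k) / s) + 1, where n is the input size, k the kernel size, s the stride and p the padding. The sketch below applies it to a 32×32 input, assuming 5×5 convolutions without padding and 2×2 pooling with stride 2; these settings are inferred from the sizes shown in the architecture diagram, not stated in the text.

```python
def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer:
    floor((n + 2 * padding - kernel_size) / stride) + 1."""
    return (n + 2 * padding - kernel_size) // stride + 1


size = 32
trace = [size]
# (kernel_size, stride, padding) for Conv1, Pool1, Conv2, Pool2 (assumed values)
for k, s, p in [(5, 1, 0), (2, 2, 0), (5, 1, 0), (2, 2, 0)]:
    size = conv_output_size(size, k, s, p)
    trace.append(size)
print(trace)  # [32, 28, 14, 10, 5]
```

With `padding=2` a 5×5 convolution keeps the spatial size unchanged, for example `conv_output_size(32, 5, 1, 2)` returns 32.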


import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(BasicCNN, self).__init__()

        # First convolutional layer: 3x32x32 -> 32x32x32, pooled to 32x16x16
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32,
                               kernel_size=5, stride=1, padding=2)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)

        # Second convolutional layer: -> 64x16x16, pooled to 64x8x8
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64,
                               kernel_size=5, stride=1, padding=2)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)

        # Third convolutional layer: -> 128x8x8, pooled to 128x4x4
        # (padding=1 is inferred from the 128 * 4 * 4 input size of fc1)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=128,
                               kernel_size=3, stride=1, padding=1)
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)

        # Fully connected layers
        self.fc1 = nn.Linear(128 * 4 * 4, 512)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(512, num_classes)
Listing 32 – Basic CNN Architecture with PyTorch


    def forward(self, x):
        # First convolutional block
        x = self.pool1(F.relu(self.conv1(x)))

        # Second convolutional block
        x = self.pool2(F.relu(self.conv2(x)))

        # Third convolutional block
        x = self.pool3(F.relu(self.conv3(x)))

        # Flatten for the fully connected layers
        x = x.view(x.size(0), -1)

        # Fully connected layers
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)

        return x

# Creation and testing of the model
model = BasicCNN(num_classes=10)
print(f"Number of parameters: {sum(p.numel() for p in model.parameters()):,}")

# Test with a dummy input
x = torch.randn(1, 3, 32, 32)  # Batch of 1, 3 channels, 32x32
output = model(x)
print(f"Output shape: {output.shape}")

Listing 33 – Basic CNN Architecture with PyTorch (continued)

3.1.2 Understanding filters and feature maps

Filters (or convolution kernels) are the fundamental elements of CNNs. Each filter is a
small weight matrix that slides over the input image to detect specific patterns.
Mechanism of convolution:
For an input image I and a filter F of size k × k, the convolution produces a feature map O where
each element is calculated as:
$$O(i, j) = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} I(i + m, j + n) \times F(m, n) + b$$
where b is the bias associated with the filter.
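The formula can be transcribed almost literally into code. The sketch below is a deliberately naive, pure-Python version for a single-channel image, with stride 1 and no padding; the function name and the toy image and filter are assumptions made for illustration.

```python
def conv2d_single(image, kernel, bias=0.0):
    """Convolution of a 2D image with a k x k kernel (stride 1, no padding),
    following O(i, j) = sum_m sum_n I(i + m, j + n) * F(m, n) + b."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            total = bias
            for m in range(k):
                for n in range(k):
                    total += image[i + m][j + n] * kernel[m][n]
            output[i][j] = total
    return output


# 4x4 image with a vertical edge between columns 1 and 2
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
# 2x2 filter that responds to left-to-right intensity increases
kernel = [[-1, 1],
          [-1, 1]]
feature_map = conv2d_single(image, kernel)
for row in feature_map:
    print(row)  # each row is [0.0, 2.0, 0.0]
```

The feature map is non-zero only in the column where the window straddles the edge, which is the sense in which a filter "detects" a pattern.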

Filter size | Typical use | Features detected
1×1 | Channel reduction/increase | Linear channel combinations
3×3 | Local feature extraction | Edges, simple textures
5×5 | Mid-scale features | More complex shapes and patterns
7×7 | Large-scale features | Large structures, context

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np
from torchvision import datasets, transforms

# Function to visualize the filters of a convolutional layer
def visualize_filters(model, layer_name, num_filters=8):
    # Retrieve the weights of the specified layer
    layer = dict(model.named_modules())[layer_name]
    weights = layer.weight.data.clone()
Listing 34 – Visualizing filters and feature maps


    # Normalize the weights to [0, 1] for visualization
    weights = (weights - weights.min()) / (weights.max() - weights.min())

    fig, axes = plt.subplots(2, 4, figsize=(12, 6))
    # The 2x4 grid holds at most 8 filters
    for i in range(min(num_filters, weights.shape[0], axes.size)):
        ax = axes[i // 4, i % 4]

        # For RGB filters, we take the average of the channels
        if weights.shape[1] == 3:
            filter_img = weights[i].permute(1, 2, 0).mean(dim=2)
        else:
            filter_img = weights[i, 0]

        ax.imshow(filter_img.cpu().numpy(), cmap='viridis')
        ax.set_title(f'Filter {i+1}')
        ax.axis('off')

    plt.suptitle(f'Filters of layer {layer_name}')
    plt.tight_layout()
    plt.show()

# Function to visualize feature maps
def visualize_feature_maps(model, image, layer_name, num_maps=8):
    model.eval()

    # Hook to capture activations
    activations = {}
    def hook_fn(module, input, output):
        activations['features'] = output.detach()

    # Register the hook on the requested layer
    layer = dict(model.named_modules())[layer_name]
    handle = layer.register_forward_hook(hook_fn)

    # Pass the image through the model
    with torch.no_grad():
        _ = model(image.unsqueeze(0))

    # Retrieve the feature maps
    feature_maps = activations['features'][0]  # First sample of the batch

    # Visualization
    fig, axes = plt.subplots(2, 4, figsize=(12, 6))
    for i in range(min(num_maps, feature_maps.shape[0], axes.size)):
        ax = axes[i // 4, i % 4]
        ax.imshow(feature_maps[i].cpu().numpy(), cmap='viridis')
        ax.set_title(f'Feature Map {i+1}')
        ax.axis('off')

    plt.suptitle(f'Feature maps of layer {layer_name}')
    plt.tight_layout()
    plt.show()

    # Cleanup: remove the hook
    handle.remove()

# Example of use
# Creation of a simple model for demonstration
class VisualizationCNN(nn.Module):
    def __init__(self):
        super(VisualizationCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc = nn.Linear(32 * 8 * 8, 10)

Listing 35 – Visualizing filters and feature maps (continued)


    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(-1, 32 * 8 * 8)
        x = self.fc(x)
        return x

model = VisualizationCNN()

# Visualization of the filters of the first layer
visualize_filters(model, 'conv1')

# Creating a test image
test_image = torch.randn(3, 32, 32)
visualize_feature_maps(model, test_image, 'conv1')

Listing 36 – Visualizing filters and feature maps (continued)

3.2 Natural Language Processing (NLP)


3.2.1 Word representation (one-hot, word embeddings)

Natural language processing in Deep Learning requires converting text into numerical
representations that neural networks can manipulate. Several approaches exist, each with
its advantages and disadvantages.

Method | Description | Advantages | Disadvantages
One-hot encoding | Binary vector with a 1 at the word position | Simple, no ambiguity | High dimensionality, no semantic relationship
Bag of Words | Counts word occurrences | Captures frequency | Loses word order
TF-IDF | Frequency weighted by the rarity of the term | Relative importance of words | No semantics
Word embeddings | Dense vectors of fixed dimension | Captures semantics, reduced size | Requires a lot of training data

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt

# Example corpus
corpus = [
    "The cat eats fish",
    "The dog eats meat",
    "The fish swims in the water",
    "The cat and the dog are animals",
]

print("Example corpus:")
for i, phrase in enumerate(corpus):
    print(f"{i+1}: {phrase}")
Listing 37 – Comparison of word representations


# 1. Manual one-hot encoding
def create_vocabulary(corpus):
    """Create a sorted vocabulary from the corpus."""
    vocabulary = set()
    for phrase in corpus:
        vocabulary.update(phrase.lower().split())
    return sorted(list(vocabulary))

def one_hot_encode(phrase, vocabulary):
    """Encode a sentence as a matrix of one-hot word vectors."""
    words = phrase.lower().split()
    encoding = np.zeros((len(words), len(vocabulary)))

    for i, word in enumerate(words):
        if word in vocabulary:
            idx = vocabulary.index(word)
            encoding[i, idx] = 1

    return encoding

vocabulary = create_vocabulary(corpus)
print(f"\nVocabulary: {vocabulary}")
print(f"Vocabulary size: {len(vocabulary)}")

# Example of one-hot encoding for the first sentence
phrase_1_onehot = one_hot_encode(corpus[0], vocabulary)
print(f"\nOne-hot encoding of '{corpus[0]}':")
print(f"Dimensions: {phrase_1_onehot.shape}")
print("Matrix (first rows):")
print(phrase_1_onehot[:3, :10])  # Partial display

# 2. Bag of Words with scikit-learn
vectorizer_bow = CountVectorizer()
bow_matrix = vectorizer_bow.fit_transform(corpus)

print("\nBag of Words:")
print(f"Vocabulary: {vectorizer_bow.get_feature_names_out()}")
print("BOW matrix (dense):")
print(bow_matrix.toarray())

# 3. TF-IDF
vectorizer_tfidf = TfidfVectorizer()
tfidf_matrix = vectorizer_tfidf.fit_transform(corpus)

print("\nTF-IDF:")
print("TF-IDF matrix (dense, rounded):")
print(np.round(tfidf_matrix.toarray(), 3))

# Visualization of the different representations
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# BOW heatmap
im1 = axes[0].imshow(bow_matrix.toarray(), cmap='Blues', aspect='auto')
axes[0].set_title('Bag of Words')
axes[0].set_xlabel('Words')
axes[0].set_ylabel('Documents')

# TF-IDF heatmap
im2 = axes[1].imshow(tfidf_matrix.toarray(), cmap='Reds', aspect='auto')
axes[1].set_title('TF-IDF')
axes[1].set_xlabel('Words')
axes[1].set_ylabel('Documents')

# Comparison of representation sizes
methods = ['One-hot\n(per word)', 'Bag of Words\n(per document)', 'TF-IDF\n(per document)']
sizes = [len(vocabulary), len(vocabulary), len(vocabulary)]
colors = ['blue', 'green', 'red']

Listing 38 – Comparison of word representations (continued)


axes[2].bar(methods, sizes, color=colors, alpha=0.7)
axes[2].set_title('Dimensionality of the representations')
axes[2].set_ylabel('Number of dimensions')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

Listing 39 – Comparison of word representations (continued)

Explanation

Traditional representations like one-hot encoding suffer from the "curse of
dimensionality" and do not capture the semantic relationships between words. For example, "cat"
and "dog" are treated as completely independent, even though they share semantic
properties (both are pets).
Word embeddings solve these problems by learning dense representations in which semantically
similar words have nearby vectors in the vector space. This proximity can
be measured by cosine similarity or Euclidean distance.
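Both measures are short to write down. In the sketch below, the 2D "embedding" values are hand-picked for illustration only and do not come from any trained model; the helper names are likewise assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# One-hot vectors: every pair of distinct words is orthogonal
cat_onehot, dog_onehot = [1, 0, 0], [0, 1, 0]
print(cosine_similarity(cat_onehot, dog_onehot))  # 0.0

# Hand-picked 2D vectors, invented for this illustration
cat_vec, dog_vec, water_vec = [0.9, 0.8], [0.8, 0.9], [-0.7, 0.2]
print(f"cat/dog:   {cosine_similarity(cat_vec, dog_vec):.3f}")
print(f"cat/water: {cosine_similarity(cat_vec, water_vec):.3f}")
print(f"cat/dog distance: {euclidean_distance(cat_vec, dog_vec):.3f}")
```

With one-hot vectors the similarity between any two different words is always zero, so no notion of "closer" or "farther" can be expressed; dense vectors make such comparisons possible.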

3.2.2 Word2Vec, GloVe and FastText

Modern word embeddings use unsupervised learning techniques to create
vector representations rich in semantic information.

Model | Approach | Advantages | Limitations
Word2Vec (Skip-gram/CBOW) | Contextual prediction (sliding window) | Fast, efficient, captures analogies | Ignores global word order
GloVe | Matrix factorization of global co-occurrences | Uses corpus-wide statistics | Requires a lot of memory
FastText | Word2Vec extension with subwords | Handles out-of-vocabulary words | More complex to train

import re
from collections import defaultdict, Counter

import numpy as np
import matplotlib.pyplot as plt

class SimpleWord2Vec:
    def __init__(self, vector_size=100, window_size=2, min_count=1):
        """
        Simplified implementation of Word2Vec Skip-gram.

        Args:
            vector_size: Size of the word vectors
            window_size: Size of the context window
            min_count: Minimum frequency to include a word
        """
        self.vector_size = vector_size
        self.window_size = window_size
        self.min_count = min_count
        self.vocabulary = {}
        self.word_count = Counter()

    def preprocess_text(self, text):
        """Simple text preprocessing."""
        # Convert to lowercase and remove basic punctuation
        text = re.sub(r'[^\w\s]', '', text.lower())
        return text.split()

Listing 40 – Simple implementation of Word2Vec Skip-gram
    def build_vocabulary(self, sentences):
        """Build the vocabulary from the sentences."""
        # Word count
        for sentence in sentences:
            words = self.preprocess_text(sentence)
            self.word_count.update(words)

        # Filtering by min_count
        filtered_words = [word for word, count in self.word_count.items()
                          if count >= self.min_count]

        # Creation of the vocabulary
        self.vocabulary = {word: idx for idx, word in enumerate(filtered_words)}
        self.reverse_vocabulary = {idx: word for word, idx in self.vocabulary.items()}
        self.vocab_size = len(self.vocabulary)

        print(f"Constructed vocabulary: {self.vocab_size} unique words")

    def generate_training_data(self, sentences):
        """Generate (center word, context word) pairs for training."""
        training_data = []

        for sentence in sentences:
            words = self.preprocess_text(sentence)
            # Conversion into indices
            word_indices = [self.vocabulary[word] for word in words
                            if word in self.vocabulary]

            # Generation of the skip-gram pairs
            for i, center_word in enumerate(word_indices):
                # Bounds of the context window
                start = max(0, i - self.window_size)
                end = min(len(word_indices), i + self.window_size + 1)

                for j in range(start, end):
                    if i != j:  # Exclude the center word itself
                        context_word = word_indices[j]
                        training_data.append((center_word, context_word))

        return training_data

    def softmax(self, x):
        """Numerically stable softmax function."""
        exp_x = np.exp(x - np.max(x))
        return exp_x / np.sum(exp_x)

    def train(self, sentences, epochs=100, learning_rate=0.01):
        """Train the Word2Vec model."""
        # Vocabulary building
        self.build_vocabulary(sentences)

        # Initialization of weight matrices
        # Embedding matrix (vocabulary -> vectors)
        self.W_in = np.random.uniform(-0.5 / self.vector_size, 0.5 / self.vector_size,
                                      (self.vocab_size, self.vector_size))
        # Output matrix (vectors -> vocabulary)
        self.W_out = np.random.uniform(-0.5 / self.vector_size, 0.5 / self.vector_size,
                                       (self.vector_size, self.vocab_size))

        # Generation of the training data
        training_data = self.generate_training_data(sentences)
        print(f"Training data generated: {len(training_data)} pairs")

Listing 41 – Simple implementation of Word2Vec Skip-gram (continued)


        # Training
        for epoch in range(epochs):
            total_loss = 0

            for center_idx, context_idx in training_data:
                # Forward pass
                # Retrieve the vector of the center word
                h = self.W_in[center_idx]  # Hidden vector

                # Compute the scores for all words
                u = np.dot(h, self.W_out)   # Scores
                y_pred = self.softmax(u)    # Probabilities

                # Loss (cross-entropy)
                loss = -np.log(y_pred[context_idx])
                total_loss += loss

                # Backward pass
                # Output layer gradient
                grad_out = y_pred.copy()
                grad_out[context_idx] -= 1

                # Weight gradients
                grad_W_out = np.outer(h, grad_out)
                grad_h = np.dot(self.W_out, grad_out)

                # Weight update
                self.W_out -= learning_rate * grad_W_out
                self.W_in[center_idx] -= learning_rate * grad_h

            if epoch % 20 == 0:
                avg_loss = total_loss / len(training_data)
                print(f"Epoch {epoch}: Average loss = {avg_loss:.4f}")

    def get_vector(self, word):
        """Return the vector of a word, or None if the word is unknown."""
        if word in self.vocabulary:
            return self.W_in[self.vocabulary[word]]
        else:
            return None

    def most_similar(self, word, top_k=5):
        """Find the most similar words."""
        if word not in self.vocabulary:
            return []

        word_vector = self.get_vector(word)
        similarities = []

        for other_word, idx in self.vocabulary.items():
            if other_word != word:
                other_vector = self.W_in[idx]
                # Cosine similarity
                similarity = np.dot(word_vector, other_vector) / (
                    np.linalg.norm(word_vector) * np.linalg.norm(other_vector)
                )
                similarities.append((other_word, similarity))

        # Sort by decreasing similarity
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_k]

Listing 42 – Simple implementation of Word2Vec Skip-gram (continued, part 2)


# Larger example corpus
corpus_extended = [
    "The cat eats fish every day",
    "The dog eats red meat",
    "The fish swims in clear water",
    "The cat and the dog are domestic animals",
    "Animals need food",
    "Cat food contains fish",
    "Dog food contains meat",
    "Water is essential for all animals",
    "The cat likes to sleep in the sun",
    "The dog likes to play in the garden",
    "Fish live in water",
    "Pets are our companions",
]

# Training the model
print("Training the Word2Vec model...")
model = SimpleWord2Vec(vector_size=50, window_size=2)
model.train(corpus_extended, epochs=200, learning_rate=0.1)

# Similarity test
print("\nSimilarity test:")
test_words = ['cat', 'dog', 'fish', 'eats']

for word in test_words:
    if word in model.vocabulary:
        similar_words = model.most_similar(word, top_k=3)
        print(f"\nWords similar to '{word}':")
        for similar_word, similarity in similar_words:
            print(f"  {similar_word}: {similarity:.3f}")

# Visualization of embeddings (2D projection with PCA)
from sklearn.decomposition import PCA

# Retrieving all vectors
words = list(model.vocabulary.keys())
vectors = np.array([model.get_vector(word) for word in words])

# Dimensionality reduction with PCA
pca = PCA(n_components=2)
vectors_2d = pca.fit_transform(vectors)

# Visualization
plt.figure(figsize=(12, 8))
plt.scatter(vectors_2d[:, 0], vectors_2d[:, 1], alpha=0.7)

# Annotation of words
for i, word in enumerate(words):
    plt.annotate(word, (vectors_2d[i, 0], vectors_2d[i, 1]),
                 xytext=(5, 5), textcoords='offset points', fontsize=9)

plt.title('Visualization of the word embeddings (2D PCA)')
plt.xlabel(f'Principal component 1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'Principal component 2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Listing 43 – Simple implementation of Word2Vec Skip-gram (continued, part 3)


Explanation

Word2Vec relies on the distributional hypothesis: "words that appear in similar contexts
have similar meanings." The Skip-gram model predicts the context words
from a central word, while CBOW (Continuous Bag of Words) does the opposite.
Training optimizes the vectors so that the probability of the observed context words is maximized,
which naturally produces representations in which semantically related words have nearby vectors.
Vector operations on these embeddings capture complex semantic relationships
(analogies such as "king - man + woman ≈ queen").

3.2.3 Introduction to Recurrent Neural Networks (RNN)

Recurrent neural networks are designed to process sequences of data while maintaining
a "memory" of previous elements through recurring connections.

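[Figure: an RNN unrolled over three time steps. The input sequence x1, x2, x3 feeds the hidden states h1, h2, h3 through Wxh; each hidden state is passed to the next through Whh, starting from h0; the output sequence y1, y2, y3 is produced from the hidden states through Why.]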
import numpy as np
import matplotlib.pyplot as plt

class SimpleRNN:
    def __init__(self, input_size, hidden_size, output_size):
        """
        Simple RNN for sequence processing.

        Args:
            input_size: Size of the input at each time step
            hidden_size: Size of the hidden state
            output_size: Size of the output
        """
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # Initializing parameters
        # Weights from the input to the hidden state
        self.Wxh = np.random.randn(hidden_size, input_size) * 0.1
        # Weights from hidden state to hidden state (recurrent connections)
        self.Whh = np.random.randn(hidden_size, hidden_size) * 0.1
        # Weights from the hidden state to the output
        self.Why = np.random.randn(output_size, hidden_size) * 0.1

        # Biases
        self.bh = np.zeros((hidden_size, 1))
        self.by = np.zeros((output_size, 1))

    def tanh(self, x):
        """tanh activation function."""
        return np.tanh(x)
Listing 44 – Implementation of a simple RNN for sequence processing


    def softmax(self, x):
        """Softmax function for the output."""
        exp_x = np.exp(x - np.max(x))
        return exp_x / np.sum(exp_x)

    def forward(self, inputs, h_prev=None):
        """
        Forward propagation through the sequence.

        Args:
            inputs: Input sequence (seq_length, input_size)
            h_prev: Initial hidden state (initialized to zero if None)

        Returns:
            outputs: Sequence of outputs
            hidden_states: All hidden states
        """
        seq_length = len(inputs)

        # Initializing the hidden state
        if h_prev is None:
            h_prev = np.zeros((self.hidden_size, 1))

        # Storing states and outputs
        hidden_states = {}
        outputs = {}

        hidden_states[-1] = h_prev.copy()

        # Propagation through each time step
        for t in range(seq_length):
            # Converting the input to a column vector
            x = inputs[t].reshape(-1, 1)

            # Calculation of the new hidden state
            # h_t = tanh(W_xh * x_t + W_hh * h_{t-1} + b_h)
            h = self.tanh(np.dot(self.Wxh, x) +
                          np.dot(self.Whh, hidden_states[t-1]) + self.bh)

            # Calculating the output
            # y_t = softmax(W_hy * h_t + b_y)
            y = self.softmax(np.dot(self.Why, h) + self.by)

            # Storage
            hidden_states[t] = h
            outputs[t] = y

        return outputs, hidden_states

    def backward(self, inputs, targets, outputs, hidden_states, learning_rate=0.01):
        """
        Backpropagation through time (BPTT).
        """
        seq_length = len(inputs)

        # Initializing gradients
        dWxh = np.zeros_like(self.Wxh)
        dWhh = np.zeros_like(self.Whh)
        dWhy = np.zeros_like(self.Why)
        dbh = np.zeros_like(self.bh)
        dby = np.zeros_like(self.by)

        # Gradient of the next hidden state (initialized to zero)
        dh_next = np.zeros((self.hidden_size, 1))

Listing 45 – Implementation of a simple RNN for sequence processing (continued)


        # Backpropagation through time
        for t in reversed(range(seq_length)):
            # Output gradient
            dy = outputs[t].copy()
            dy[targets[t]] -= 1  # Cross-entropy gradient

            # Gradients for the output layer
            dWhy += np.dot(dy, hidden_states[t].T)
            dby += dy

            # Gradient of the hidden state (from the output + from the future)
            dh = np.dot(self.Why.T, dy) + dh_next

            # Gradient through tanh
            dh_raw = (1 - hidden_states[t] ** 2) * dh

            # Parameter gradients
            dbh += dh_raw
            dWxh += np.dot(dh_raw, inputs[t].reshape(1, -1))
            dWhh += np.dot(dh_raw, hidden_states[t-1].T)

            # Gradient for the previous hidden state
            dh_next = np.dot(self.Whh.T, dh_raw)

        # Clipping gradients to avoid explosion
        for grad in [dWxh, dWhh, dWhy, dbh, dby]:
            np.clip(grad, -5, 5, out=grad)

        # Parameter update
        self.Wxh -= learning_rate * dWxh
        self.Whh -= learning_rate * dWhh
        self.Why -= learning_rate * dWhy
        self.bh -= learning_rate * dbh
        self.by -= learning_rate * dby

    def train(self, sequences, targets_sequences, epochs=100, learning_rate=0.01):
        """
        Train the RNN on sequences.
        """
        losses = []

        for epoch in range(epochs):
            total_loss = 0

            for seq_idx, (sequence, targets) in enumerate(zip(sequences,
                                                              targets_sequences)):
                # Forward propagation
                outputs, hidden_states = self.forward(sequence)

                # Calculation of loss
                loss = 0
                for t in range(len(targets)):
                    loss += -np.log(outputs[t][targets[t], 0])
                total_loss += loss

                # Backpropagation
                self.backward(sequence, targets, outputs, hidden_states, learning_rate)

            avg_loss = total_loss / len(sequences)
            losses.append(avg_loss)

            if epoch % 10 == 0:
                print(f"Epoch {epoch}: Average loss = {avg_loss:.4f}")

        return losses

Listing 46 – Implementation of a simple RNN for sequence processing (continued)


ÿ ÿ
# Usage example: simple sequence prediction
# Data generation: binary sequences where the output is the XOR of the two
# previous inputs

def generate_xor_sequences(num_sequences=100, seq_length=10):
    """Generate sequences for the temporal XOR problem."""
    sequences = []
    targets = []

    for _ in range(num_sequences):
        # Generate a random binary sequence
        sequence = np.random.randint(0, 2, seq_length)
        target = np.zeros(seq_length, dtype=int)

        # The target at time t is the XOR of the inputs at times t-1 and t-2
        for t in range(2, seq_length):
            target[t] = sequence[t-1] ^ sequence[t-2]

        # One-hot conversion of the inputs
        sequence_onehot = np.eye(2)[sequence]

        sequences.append(sequence_onehot)
        targets.append(target)

    return sequences, targets

# Data generation
print("Generating training data...")
train_sequences, train_targets = generate_xor_sequences(num_sequences=500, seq_length=8)

# Creation and training of the RNN
rnn = SimpleRNN(input_size=2, hidden_size=10, output_size=2)

print("Training the RNN...")
losses = rnn.train(train_sequences, train_targets, epochs=100, learning_rate=0.1)

# Test on some examples
print("\nTest on examples:")
test_sequences, test_targets = generate_xor_sequences(num_sequences=5, seq_length=8)

for i in range(3):
    sequence = test_sequences[i]
    target = test_targets[i]

    outputs, _ = rnn.forward(sequence)
    predictions = [np.argmax(outputs[t]) for t in range(len(target))]

    # Conversion for display
    input_seq = [np.argmax(sequence[t]) for t in range(len(sequence))]

    print(f"\nExample {i+1}:")
    print(f"  Input:      {input_seq}")
    print(f"  Target:     {target.tolist()}")
    print(f"  Prediction: {predictions}")

    # Accuracy for this example
    accuracy = np.mean(np.array(predictions[2:]) == target[2:])  # Ignore the first 2 steps
    print(f"  Accuracy:   {accuracy:.2f}")

# Visualization of the evolution of the loss
plt.figure(figsize=(10, 6))
plt.plot(losses, 'b-', linewidth=2)
plt.title('Evolution of the loss during RNN training')
plt.xlabel('Epoch')
plt.ylabel('Average loss')
plt.grid(True, alpha=0.3)
plt.show()

ÿ ÿ
Listing 47 – Implementation of a simple RNN for sequence processing


Explanation

RNNs are particularly well suited to sequential data because they can "remember"
past information through their recurrent hidden state. In the temporal XOR example, the
network must remember the two previous inputs to compute the correct output.
However, simple RNNs suffer from the vanishing gradient problem: gradients become exponentially
small as they propagate back to earlier time steps, which limits the network's ability
to learn long-term dependencies. This is why more sophisticated architectures
such as LSTM and GRU were developed.
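
The exponential shrinkage described above can be observed numerically. The following is a minimal sketch, not part of the course's SimpleRNN class: the function name `backprop_gradient_norms`, the random inputs, and the rescaling of the recurrent matrix to a largest singular value of 0.9 are all assumptions chosen for illustration. It applies the same recurrence as the backward loop (multiply by the tanh derivative, then by the transposed recurrent matrix) and records the gradient norm at each step.

```python
import numpy as np

def backprop_gradient_norms(hidden_size=10, seq_length=30, spectral_norm=0.9, seed=0):
    """Return the norm of dL/dh after 0, 1, ..., seq_length backward steps in a tanh RNN."""
    rng = np.random.default_rng(seed)

    # Recurrent matrix rescaled so that its largest singular value is `spectral_norm`
    # (an illustrative assumption, not the course's initialization).
    Whh = rng.standard_normal((hidden_size, hidden_size))
    Whh *= spectral_norm / np.linalg.norm(Whh, 2)

    # Forward pass: h_t = tanh(Whh h_{t-1} + u_t), where u_t stands in for Wxh x_t + bh.
    h = np.zeros((hidden_size, 1))
    hidden_states = []
    for _ in range(seq_length):
        u = rng.standard_normal((hidden_size, 1))
        h = np.tanh(Whh @ h + u)
        hidden_states.append(h)

    # Backward pass: start from a unit-norm gradient at the last step and apply
    # dh_{t-1} = Whh^T ((1 - h_t^2) * dh_t) repeatedly.
    dh = np.ones((hidden_size, 1)) / np.sqrt(hidden_size)
    norms = [float(np.linalg.norm(dh))]
    for t in reversed(range(seq_length)):
        dh_raw = (1 - hidden_states[t] ** 2) * dh
        dh = Whh.T @ dh_raw
        norms.append(float(np.linalg.norm(dh)))
    return norms

norms = backprop_gradient_norms()
print(f"gradient norm at the last time step: {norms[0]:.3e}")
print(f"gradient norm 30 steps earlier:      {norms[-1]:.3e}")
```

Because the tanh derivative is at most 1 and the recurrent matrix shrinks every vector by at least a factor of 0.9 in this setup, the norm after k backward steps is bounded by 0.9^k. With a largest singular value above 1 the same product can instead grow, which is the situation gradient clipping is meant to handle.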

3.2.4 Practical example: Sentiment analysis on movie reviews

Let's implement a sentiment classifier using word embeddings and a simple RNN.
ÿ ÿ

import numpy as np
import re
from collections import Counter
import matplotlib.pyplot as plt

class SentimentClassifier:
    def __init__(self, embedding_dim=50, hidden_dim=64, max_length=100):
        """
        RNN-based sentiment classifier

        Args:
            embedding_dim: Dimension of the word embeddings
            hidden_dim: Size of the RNN hidden layer
            max_length: Maximum length of the sequences
        """
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.max_length = max_length
        self.word_to_idx = {}
        self.idx_to_word = {}
        self.vocab_size = 0

    def preprocess_text(self, text):
        """Text preprocessing."""
        # Lowercase conversion
        text = text.lower()
        # Remove punctuation and special characters
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        # Tokenization
        words = text.split()
        return words

    def build_vocabulary(self, texts, min_freq=2):
        """Vocabulary construction."""
        word_counts = Counter()

        # Word count
        for text in texts:
            words = self.preprocess_text(text)
            word_counts.update(words)

        # Filter according to the minimum frequency
        filtered_words = [word for word, count in word_counts.items()
                          if count >= min_freq]

        # Add the special tokens
        special_tokens = ['<PAD>', '<UNK>']
        vocab_words = special_tokens + filtered_words
ÿ ÿ
Listing 48 – Sentiment Classifier with RNN


ÿ ÿ
        # Creation of the dictionaries
        self.word_to_idx = {word: idx for idx, word in enumerate(vocab_words)}
        self.idx_to_word = {idx: word for word, idx in self.word_to_idx.items()}
        self.vocab_size = len(vocab_words)

        print(f"Vocabulary built: {self.vocab_size} words")

    def text_to_sequence(self, text):
        """Convert a text to a sequence of indices."""
        words = self.preprocess_text(text)
        sequence = []

        for word in words:
            if word in self.word_to_idx:
                sequence.append(self.word_to_idx[word])
            else:
                sequence.append(self.word_to_idx['<UNK>'])  # Unknown word

        return sequence

    def pad_sequence(self, sequence):
        """Truncate or pad a sequence to the maximum length."""
        if len(sequence) > self.max_length:
            return sequence[:self.max_length]
        else:
            padding = [self.word_to_idx['<PAD>']] * (self.max_length - len(sequence))
            return sequence + padding

    def sigmoid(self, x):
        """Numerically stable sigmoid function."""
        x = np.clip(x, -250, 250)
        return 1 / (1 + np.exp(-x))

    def tanh(self, x):
        """Tanh function."""
        return np.tanh(x)

    def initialize_parameters(self):
        """Initialize the model parameters."""
        # Embedding matrix
        self.embeddings = np.random.uniform(-0.1, 0.1,
                                            (self.vocab_size, self.embedding_dim))

        # RNN parameters
        self.Wxh = np.random.randn(self.hidden_dim, self.embedding_dim) * 0.1
        self.Whh = np.random.randn(self.hidden_dim, self.hidden_dim) * 0.1
        self.bh = np.zeros((self.hidden_dim, 1))

        # Parameters of the classification layer
        self.Wy = np.random.randn(1, self.hidden_dim) * 0.1
        self.by = np.zeros((1, 1))

    def forward(self, sequence):
        """Forward propagation."""
        seq_length = len(sequence)

        # Hidden states (key -1 holds the initial state)
        hidden_states = {}
        hidden_states[-1] = np.zeros((self.hidden_dim, 1))

ÿ ÿ
Listing 49 – Sentiment Classifier with RNN


ÿ ÿ
        # Propagation through the sequence
        for t in range(seq_length):
            # Retrieve the word embedding
            word_idx = sequence[t]
            if word_idx != self.word_to_idx['<PAD>']:  # Ignore the padding
                x = self.embeddings[word_idx].reshape(-1, 1)

                # Computation of the hidden state
                h = self.tanh(np.dot(self.Wxh, x) +
                              np.dot(self.Whh, hidden_states[t-1]) + self.bh)
                hidden_states[t] = h
            else:
                hidden_states[t] = hidden_states[t-1]  # State unchanged for padding

        # Classification based on the last hidden state
        final_hidden = hidden_states[seq_length-1]
        output = self.sigmoid(np.dot(self.Wy, final_hidden) + self.by)

        return output[0, 0], hidden_states

    def compute_loss(self, prediction, target):
        """Binary cross-entropy loss."""
        epsilon = 1e-15
        prediction = np.clip(prediction, epsilon, 1 - epsilon)
        return -(target * np.log(prediction) + (1 - target) * np.log(1 - prediction))

    def train(self, texts, labels, epochs=50, learning_rate=0.01):
        """Train the model."""
        # Vocabulary construction
        self.build_vocabulary(texts)

        # Parameter initialization
        self.initialize_parameters()

        # Convert the texts to sequences
        sequences = []
        for text in texts:
            seq = self.text_to_sequence(text)
            seq = self.pad_sequence(seq)
            sequences.append(seq)

        losses = []
        accuracies = []

        for epoch in range(epochs):
            total_loss = 0
            correct_predictions = 0

            # Shuffle the data
            indices = np.random.permutation(len(sequences))

            for idx in indices:
                sequence = sequences[idx]
                target = labels[idx]

                # Forward propagation
                prediction, hidden_states = self.forward(sequence)

                # Loss computation
                loss = self.compute_loss(prediction, target)
                total_loss += loss

                # Accuracy
                pred_class = 1 if prediction > 0.5 else 0
                if pred_class == target:
                    correct_predictions += 1

                # Simplified backpropagation (approximate gradient)
                # A complete implementation would require full BPTT
                error = prediction - target
ÿ ÿ
Listing 50 – Sentiment Classifier with RNN (continued)


ÿ ÿ
                # Approximate update (for demonstration): only the output layer is trained
                final_hidden = hidden_states[len(sequence) - 1]
                self.Wy -= learning_rate * error * final_hidden.T
                self.by -= learning_rate * error

            # Epoch metrics
            avg_loss = total_loss / len(sequences)
            accuracy = correct_predictions / len(sequences)

            losses.append(avg_loss)
            accuracies.append(accuracy)

            if epoch % 10 == 0:
                print(f"Epoch {epoch}: Loss = {avg_loss:.4f}, Accuracy = {accuracy:.3f}")

        return losses, accuracies

    def predict(self, text):
        """Prediction on a new text."""
        sequence = self.text_to_sequence(text)
        sequence = self.pad_sequence(sequence)
        prediction, _ = self.forward(sequence)

        sentiment = "Positive" if prediction > 0.5 else "Negative"
        confidence = prediction if prediction > 0.5 else 1 - prediction

        return sentiment, confidence

# Example data for sentiment analysis
reviews_examples = [
    "This movie is absolutely fantastic, I loved every minute of it",
    "A boring story with bad actors",
    "Excellent entertainment, very well done",
    "Completely useless, a total waste of time",
    "A brilliantly written masterpiece of cinema",
    "Predictable script and weak dialogue",
    "Magnificent performance by the main actors",
    "Horrible movie, I don't recommend it at all",
    "Very good film with a captivating plot",
    "Disappointing and uninteresting, really bad",
    "Superb production and impressive visual effects",
    "Deadly boring, I almost fell asleep",
    "A touching and well-constructed story",
    "Mediocre actors in a film without a soul",
    "Entertaining and full of action",
    "Incoherent script and disappointing ending"
]

labels_examples = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = Positive, 0 = Negative

print("Creating the sentiment classifier...")
classifier = SentimentClassifier(embedding_dim=30, hidden_dim=32, max_length=20)

print("Training the model...")
losses, accuracies = classifier.train(reviews_examples, labels_examples,
                                      epochs=100, learning_rate=0.1)

# Test on new examples
new_texts = [
    "This film is really excellent",
    "I hate this horrible movie",
    "Interesting story and well acted",
    "Completely boring and poorly executed"
]

print("\nTest on new examples:")

ÿ ÿ
Listing 51 – Sentiment Classifier with RNN


ÿ ÿ
for text in new_texts:
    sentiment, confidence = classifier.predict(text)
    print(f"'{text}'")
    print(f"  {sentiment} (confidence: {confidence:.3f})\n")

# Visualization of the training metrics
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(losses, 'r-', linewidth=2)
plt.title('Evolution of the loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(accuracies, 'g-', linewidth=2)
plt.title('Evolution of the accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
ÿ ÿ
Listing 52 – Sentiment Classifier with RNN (continued)

Explanation

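This example shows how to combine word embeddings with an RNN for sentiment
classification. The model treats each review as a sequence of words, uses embeddings
to represent each word, then lets the RNN capture the sequential dependencies to make a
final prediction. For simplicity, the training loop only updates the output layer (Wy, by);
the embeddings and the recurrent weights keep their random initial values.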
In a real implementation, more advanced techniques would be used such as:
— LSTM or GRU instead of a simple RNN
— Pre-trained embeddings (Word2Vec, GloVe)
— Regularization techniques (dropout, early stopping)
— More sophisticated optimizers (Adam, RMSprop)
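
The simplified update in the training loop uses `error = prediction - target` and multiplies it by the final hidden state. For a sigmoid output with binary cross-entropy this is the exact gradient with respect to the output layer, dL/dWy = (p - y) h^T, even though the rest of the network is left untrained. The sketch below checks this identity with finite differences; the helper `bce_loss`, the random stand-in for the final hidden state, and the chosen sizes are assumptions made for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(Wy, by, h, target):
    """Binary cross-entropy of sigmoid(Wy h + by) against a 0/1 target."""
    p = sigmoid(Wy @ h + by)[0, 0]
    return -(target * np.log(p) + (1 - target) * np.log(1 - p))

rng = np.random.default_rng(0)
hidden_dim = 4
h = rng.standard_normal((hidden_dim, 1))       # stands in for the final hidden state
Wy = rng.standard_normal((1, hidden_dim)) * 0.1
by = np.zeros((1, 1))
target = 1

# Analytic gradient: dL/dWy = (p - y) h^T
p = sigmoid(Wy @ h + by)[0, 0]
error = p - target
analytic_dWy = error * h.T

# Central finite differences on each entry of Wy
eps = 1e-6
numeric_dWy = np.zeros_like(Wy)
for j in range(hidden_dim):
    Wy_plus, Wy_minus = Wy.copy(), Wy.copy()
    Wy_plus[0, j] += eps
    Wy_minus[0, j] -= eps
    numeric_dWy[0, j] = (bce_loss(Wy_plus, by, h, target)
                         - bce_loss(Wy_minus, by, h, target)) / (2 * eps)

max_abs_diff = float(np.max(np.abs(analytic_dWy - numeric_dWy)))
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")
```

A complete implementation would propagate the same `error` further back through the recurrent weights and the embeddings (BPTT), which is what the frameworks mentioned in the introduction automate.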

3.3 Advanced Recurrent Networks


3.3.1 LSTM and GRU architecture

LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) networks are advanced
variants of RNNs designed to address the vanishing gradient problem and enable the learning
of long-term dependencies.

Architecture   Main gates                           Advantages

LSTM           Forget, Input, Output + cell state   Precise control of information, long-term memory
GRU            Reset, Update                        Simpler than LSTM, fewer parameters
Simple RNN     None                                 Simple to understand, fast to compute
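
To make the "fewer parameters" entry concrete, the sketch below counts parameters under one assumption: each gate or candidate is a single affine map applied to the concatenation of the input and the previous hidden state, with one bias vector. The function `recurrent_param_count` and the sizes 50 and 64 are illustrative choices, not values prescribed by the course.

```python
def recurrent_param_count(input_size, hidden_size, num_blocks):
    """Weights plus biases for `num_blocks` affine maps acting on [x; h_prev]."""
    return num_blocks * (hidden_size * (input_size + hidden_size) + hidden_size)

input_size, hidden_size = 50, 64
counts = {
    "Simple RNN": recurrent_param_count(input_size, hidden_size, 1),
    "GRU": recurrent_param_count(input_size, hidden_size, 3),   # reset, update, candidate
    "LSTM": recurrent_param_count(input_size, hidden_size, 4),  # forget, input, candidate, output
}
for name, n in counts.items():
    print(f"{name:10s}: {n} parameters")
```

Under this assumption a GRU has three times, and an LSTM four times, the parameters of a simple RNN of the same size. Some frameworks keep separate input and recurrent bias vectors, so their reported counts may be slightly higher.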


ÿ ÿ
import numpy as np
import matplotlib.pyplot as plt  # used for the architecture diagrams at the end

class LSTMCell:
    def __init__(self, input_size, hidden_size):
        """
        LSTM cell with all gates

        Args:
            input_size: Size of the input
            hidden_size: Size of the hidden state
        """
        self.input_size = input_size
        self.hidden_size = hidden_size

        # Parameter initialization for each gate
        # Forget gate
        self.Wf = np.random.randn(hidden_size, input_size + hidden_size) * 0.1
        self.bf = np.zeros((hidden_size, 1))

        # Input gate
        self.Wi = np.random.randn(hidden_size, input_size + hidden_size) * 0.1
        self.bi = np.zeros((hidden_size, 1))

        # Candidate cell state
        self.Wc = np.random.randn(hidden_size, input_size + hidden_size) * 0.1
        self.bc = np.zeros((hidden_size, 1))

        # Output gate
        self.Wo = np.random.randn(hidden_size, input_size + hidden_size) * 0.1
        self.bo = np.zeros((hidden_size, 1))

    def sigmoid(self, x):
        """Numerically stable sigmoid function."""
        x = np.clip(x, -250, 250)
        return 1 / (1 + np.exp(-x))

    def tanh(self, x):
        """Tanh function."""
        return np.tanh(x)

    def forward(self, x, h_prev, c_prev):
        """
        Forward propagation of an LSTM cell

        Args:
            x: Current input (input_size, 1)
            h_prev: Previous hidden state (hidden_size, 1)
            c_prev: Previous cell state (hidden_size, 1)

        Returns:
            h: New hidden state
            c: New cell state
            cache: Intermediate values for backpropagation
        """
        # Concatenation of the input and the previous hidden state
        concat = np.vstack((x, h_prev))

        # Forget gate: decides what to forget from the cell state
        ft = self.sigmoid(np.dot(self.Wf, concat) + self.bf)

        # Input gate: decides what new information to store
        it = self.sigmoid(np.dot(self.Wi, concat) + self.bi)

        # Candidate for the new cell state
        ct_tilde = self.tanh(np.dot(self.Wc, concat) + self.bc)

        # New cell state
        ct = ft * c_prev + it * ct_tilde
ÿ ÿ
Listing 53 – Implementation of an LSTM cell


ÿ ÿ
        # Output gate: decides which parts of the cell state to use
        ot = self.sigmoid(np.dot(self.Wo, concat) + self.bo)

        # New hidden state
        ht = ot * self.tanh(ct)

        # Cache for backpropagation
        cache = {
            'x': x, 'h_prev': h_prev, 'c_prev': c_prev,
            'concat': concat, 'ft': ft, 'it': it, 'ct_tilde': ct_tilde,
            'ct': ct, 'ot': ot, 'ht': ht
        }

        return ht, ct, cache

class GRUCell:
    def __init__(self, input_size, hidden_size):
        """
        GRU cell (simplified version of the LSTM)

        Args:
            input_size: Size of the input
            hidden_size: Size of the hidden state
        """
        self.input_size = input_size
        self.hidden_size = hidden_size

        # Reset gate
        self.Wr = np.random.randn(hidden_size, input_size + hidden_size) * 0.1
        self.br = np.zeros((hidden_size, 1))

        # Update gate
        self.Wz = np.random.randn(hidden_size, input_size + hidden_size) * 0.1
        self.bz = np.zeros((hidden_size, 1))

        # Candidate for the new hidden state
        self.Wh = np.random.randn(hidden_size, input_size + hidden_size) * 0.1
        self.bh = np.zeros((hidden_size, 1))

    def sigmoid(self, x):
        """Numerically stable sigmoid function."""
        x = np.clip(x, -250, 250)
        return 1 / (1 + np.exp(-x))

    def tanh(self, x):
        """Tanh function."""
        return np.tanh(x)

    def forward(self, x, h_prev):
        """
        Forward propagation of a GRU cell

        Args:
            x: Current input (input_size, 1)
            h_prev: Previous hidden state (hidden_size, 1)

        Returns:
            h: New hidden state
            cache: Intermediate values for backpropagation
        """
        # Concatenation of the input and the previous hidden state
        concat = np.vstack((x, h_prev))

        # Reset gate
        rt = self.sigmoid(np.dot(self.Wr, concat) + self.br)

        # Update gate
        zt = self.sigmoid(np.dot(self.Wz, concat) + self.bz)

ÿ ÿ
Listing 54 – Implementation of an LSTM cell (continued)


ÿ ÿ
        # Candidate for the new hidden state
        concat_reset = np.vstack((x, rt * h_prev))
        ht_tilde = self.tanh(np.dot(self.Wh, concat_reset) + self.bh)

        # New hidden state (interpolation between the previous state and the candidate)
        ht = (1 - zt) * h_prev + zt * ht_tilde

        # Cache for backpropagation
        cache = {
            'x': x, 'h_prev': h_prev, 'concat': concat,
            'rt': rt, 'zt': zt, 'ht_tilde': ht_tilde, 'ht': ht
        }

        return ht, cache

# Comparison of architectures on a memorization task
class SequenceMemoryTask:
    def __init__(self, seq_length=20, num_sequences=1000):
        """
        Memory task: the model must remember a signal at the beginning
        of the sequence and reproduce it at the end
        """
        self.seq_length = seq_length
        self.num_sequences = num_sequences

    def generate_data(self):
        """Generate data for the memorization task."""
        sequences = []
        targets = []

        for _ in range(self.num_sequences):
            # Signal to memorize (0 or 1)
            signal = np.random.randint(0, 2)

            # Sequence: signal at the start, zeros in the middle, request at the end
            sequence = np.zeros(self.seq_length)
            sequence[0] = signal  # Signal at the start
            sequence[-1] = 2      # Request marker at the end

            # Target: reproduce the signal
            target = signal

            sequences.append(sequence)
            targets.append(target)

        return np.array(sequences), np.array(targets)

# Comparative test
def test_memory_architectures():
    """Compare the behavior of the different architectures."""
    # Data generation
    task = SequenceMemoryTask(seq_length=15, num_sequences=500)
    sequences, targets = task.generate_data()

    print("Long-term memorization test")
    print(f"Sequence length: {task.seq_length}")
    print(f"Number of sequences: {task.num_sequences}")

    # Example sequence
    print("\nExample sequence:")
    print(f"Sequence: {sequences[0]}")
    print(f"Signal to memorize: {int(sequences[0][0])}")
    print(f"Target: {targets[0]}")

    # Simple test: see if the cells can remember the first element
    lstm_cell = LSTMCell(input_size=1, hidden_size=8)
    gru_cell = GRUCell(input_size=1, hidden_size=8)

    # Test on one sequence
    test_sequence = sequences[0]

ÿ ÿ
Listing 55 – Implementation of an LSTM cell (continued)


ÿ ÿ
    # LSTM
    h_lstm = np.zeros((8, 1))
    c_lstm = np.zeros((8, 1))

    for t in range(len(test_sequence)):
        x = np.array([[test_sequence[t]]])
        h_lstm, c_lstm, _ = lstm_cell.forward(x, h_lstm, c_lstm)

    # GRU
    h_gru = np.zeros((8, 1))

    for t in range(len(test_sequence)):
        x = np.array([[test_sequence[t]]])
        h_gru, _ = gru_cell.forward(x, h_gru)

    print(f"\nFinal LSTM state: {h_lstm.flatten()[:4]}...")  # Partial display
    print(f"Final GRU state: {h_gru.flatten()[:4]}...")      # Partial display

    return sequences, targets

# Running the test
sequences, targets = test_memory_architectures()

# Visualization of the architectures
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Simple RNN diagram
axes[0].text(0.5, 0.8, 'Simple RNN', ha='center', va='center', fontsize=14, weight='bold')
axes[0].text(0.5, 0.6, 'h_t = tanh(W x_t + U h_{t-1} + b)', ha='center', va='center', fontsize=10)
axes[0].text(0.5, 0.4, 'Vanishing gradient', ha='center', va='center', fontsize=9)
axes[0].text(0.5, 0.3, 'Limited memory', ha='center', va='center', fontsize=9)
axes[0].set_xlim(0, 1)
axes[0].set_ylim(0, 1)
axes[0].axis('off')

# LSTM diagram
axes[1].text(0.5, 0.9, 'LSTM', ha='center', va='center', fontsize=14, weight='bold')
axes[1].text(0.5, 0.75, 'Gates:', ha='center', va='center', fontsize=11, weight='bold')
axes[1].text(0.5, 0.65, 'Forget: f_t = σ(W_f [h_{t-1}, x_t] + b_f)', ha='center', va='center', fontsize=8)
axes[1].text(0.5, 0.55, 'Input: i_t = σ(W_i [h_{t-1}, x_t] + b_i)', ha='center', va='center', fontsize=8)
axes[1].text(0.5, 0.45, 'Output: o_t = σ(W_o [h_{t-1}, x_t] + b_o)', ha='center', va='center', fontsize=8)
axes[1].text(0.5, 0.3, 'Long-term memory', ha='center', va='center', fontsize=9)
axes[1].text(0.5, 0.2, 'Precise control', ha='center', va='center', fontsize=9)
axes[1].set_xlim(0, 1)
axes[1].set_ylim(0, 1)
axes[1].axis('off')

# GRU diagram
axes[2].text(0.5, 0.8, 'GRU', ha='center', va='center', fontsize=14, weight='bold')
axes[2].text(0.5, 0.65, 'Gates:', ha='center', va='center', fontsize=11, weight='bold')
axes[2].text(0.5, 0.55, 'Reset: r_t = σ(W_r [h_{t-1}, x_t] + b_r)', ha='center', va='center', fontsize=9)
axes[2].text(0.5, 0.45, 'Update: z_t = σ(W_z [h_{t-1}, x_t] + b_z)', ha='center', va='center', fontsize=9)
axes[2].text(0.5, 0.3, 'Simpler than LSTM', ha='center', va='center', fontsize=9)
axes[2].text(0.5, 0.2, 'Fewer parameters', ha='center', va='center', fontsize=9)
axes[2].set_xlim(0, 1)
axes[2].set_ylim(0, 1)
axes[2].axis('off')

plt.suptitle('Comparison of recurrent architectures', fontsize=16, weight='bold')
plt.tight_layout()
plt.show()

ÿ ÿ
Listing 56 – Implementation of an LSTM cell (continued)


Explanation

LSTMs address the vanishing gradient problem through their gated architecture, which
allows information to flow directly through the cell state. The three main gates are
the forget gate, which decides what information to delete from the cell state;
the input gate, which decides what new information to store; and the output gate,
which determines which parts of the cell state to use for the output.
GRUs simplify this architecture by merging some gates, reducing the number of
parameters while retaining the capacity for long-term memory. In practice, GRUs
often perform as well as LSTMs on many tasks.
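
For reference, the updates computed by the LSTMCell and GRUCell classes can be written compactly as follows. This is a transcription of the code's conventions (input stacked above the previous hidden state, elementwise products written as \odot), offered as a reading aid rather than a canonical formulation; other texts order the concatenation differently or swap the roles of z_t and 1 - z_t in the GRU.

```latex
% LSTM cell
\begin{aligned}
f_t &= \sigma\left(W_f\,[x_t;\,h_{t-1}] + b_f\right) \\
i_t &= \sigma\left(W_i\,[x_t;\,h_{t-1}] + b_i\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[x_t;\,h_{t-1}] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\left(W_o\,[x_t;\,h_{t-1}] + b_o\right) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}

% GRU cell
\begin{aligned}
r_t &= \sigma\left(W_r\,[x_t;\,h_{t-1}] + b_r\right) \\
z_t &= \sigma\left(W_z\,[x_t;\,h_{t-1}] + b_z\right) \\
\tilde{h}_t &= \tanh\left(W_h\,[x_t;\,r_t \odot h_{t-1}] + b_h\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}

% Direct path through the cell state (gates treated as constants)
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)
```

The last line is the reason for the "direct flow" mentioned above: along the cell-state path the gradient is scaled by the forget gate rather than by a repeated product of a weight matrix and a tanh derivative, so it is preserved over many steps whenever f_t stays close to 1. The indirect dependence of the gates on c_{t-1} through h_{t-1} is ignored in this simplified expression.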

ÿ ÿ

    def _get_relative_positions(self, seq_len):
        """Computes the matrix of relative positions."""
        range_vec = torch.arange(seq_len)
        range_mat = range_vec.repeat(seq_len).view(seq_len, seq_len)
        distance_mat = range_mat - range_mat.transpose(0, 1)

        # Clipping of the distances
        distance_mat_clipped = torch.clamp(distance_mat,
                                           -self.max_relative_position,
                                           self.max_relative_position)

        # Shift to obtain non-negative indices
        final_mat = distance_mat_clipped + self.max_relative_position
        return final_mat

    def forward(self, x):
        seq_len = x.size(0)
        relative_positions = self._get_relative_positions(seq_len)
        relative_embeddings = self.relative_position_embeddings(relative_positions)

        # Integration with attention (simplified here)
        return x + relative_embeddings.mean(dim=1, keepdim=True)

class RoPEPositionalEncoding(nn.Module):
    """
    Rotary Position Embedding (RoPE) - rotary positional encoding
    Used in recent models such as GPT-NeoX
    """
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        self.d_model = d_model

        # Creation of the rotation frequencies
        inv_freq = 1.0 / (10000 ** (torch.arange(0.0, d_model, 2.0) / d_model))
        self.register_buffer('inv_freq', inv_freq)

    def forward(self, x, seq_len=None):
        if seq_len is None:
            seq_len = x.size(0)

        # Positions
        t = torch.arange(seq_len, device=x.device).type_as(self.inv_freq)

        # Computation of the angles
        freqs = torch.einsum('i,j->ij', t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)

        # Cosine and sine tables used to apply the rotation
        cos_emb = emb.cos()[None, :, None, :]
        sin_emb = emb.sin()[None, :, None, :]

        return cos_emb, sin_emb


ÿ ÿ
Listing 57 – Transformers
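
The RoPE class above only returns the cosine and sine tables; how they are applied to queries and keys is not shown. The NumPy sketch below fills in that step under one assumption: the "rotate-half" pairing, in which feature i is rotated together with feature i + d/2, which is the layout that matches concatenating the angle table with itself. The helper names `rope_angles`, `rotate_half`, `apply_rope` and `score` are hypothetical.

```python
import numpy as np

def rope_angles(seq_len, d_model, base=10000.0):
    """Angles theta[p, i] = p * base^(-2i/d_model), shape (seq_len, d_model // 2)."""
    inv_freq = 1.0 / (base ** (np.arange(0, d_model, 2) / d_model))
    positions = np.arange(seq_len)
    return np.outer(positions, inv_freq)

def rotate_half(x):
    """Map (x1, x2) -> (-x2, x1), where x1 and x2 are the two halves of the last axis."""
    half = x.shape[-1] // 2
    return np.concatenate((-x[..., half:], x[..., :half]), axis=-1)

def apply_rope(x, angles):
    """Rotate each pair (x[i], x[i + d/2]) by the corresponding angle."""
    emb = np.concatenate((angles, angles), axis=-1)  # same layout as cat((freqs, freqs))
    return x * np.cos(emb) + rotate_half(x) * np.sin(emb)

rng = np.random.default_rng(0)
d_model, seq_len = 8, 16
q = rng.standard_normal(d_model)
k = rng.standard_normal(d_model)
angles = rope_angles(seq_len, d_model)

def score(m, n):
    """Dot product of q placed at position m and k placed at position n."""
    return float(apply_rope(q, angles[m]) @ apply_rope(k, angles[n]))

# Same offset (n - m = 3) at two different absolute positions
print(score(2, 5), score(9, 12))
```

The two printed scores agree up to floating-point error: because each pair of features is rotated by an angle proportional to the position, the dot product between a rotated query and a rotated key depends only on the offset between their positions. This is the property that makes the encoding relative.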


ÿ ÿ
def visualize_positional_encodings():
    """Visualize different types of positional encodings."""
    d_model = 64
    max_len = 100

    # Creation of the different encodings
    sinusoidal_pe = SinusoidalPositionalEncoding(d_model, max_len)
    learned_pe = LearnedPositionalEncoding(d_model, max_len)

    # Dummy data
    dummy_input = torch.zeros(max_len, 1, d_model)

    # Application of the encodings
    with torch.no_grad():
        sin_encoded = sinusoidal_pe(dummy_input)
        learned_encoded = learned_pe(dummy_input)

    # Extraction of the pure encodings
    sin_pe_values = sin_encoded[:, 0, :].numpy()
    learned_pe_values = learned_encoded[:, 0, :].numpy() - dummy_input[:, 0, :].numpy()

    # Visualization
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))

    # Sinusoidal encoding - heatmap
    im1 = axes[0, 0].imshow(sin_pe_values.T, cmap='RdBu', aspect='auto')
    axes[0, 0].set_title('Sinusoidal Positional Encoding')
    axes[0, 0].set_xlabel('Position')
    axes[0, 0].set_ylabel('Dimension')
    plt.colorbar(im1, ax=axes[0, 0])

    # Learned encoding - heatmap
    im2 = axes[0, 1].imshow(learned_pe_values.T, cmap='RdBu', aspect='auto')
    axes[0, 1].set_title('Learned Positional Encoding')
    axes[0, 1].set_xlabel('Position')
    axes[0, 1].set_ylabel('Dimension')
    plt.colorbar(im2, ax=axes[0, 1])

    # Comparison for some dimensions
    positions = range(0, max_len, 5)
    for i, dim in enumerate([0, 16, 32, 48]):
        if i < 4:
            axes[0, 2].plot(positions, sin_pe_values[::5, dim],
                            label=f'Dim {dim}', alpha=0.7)
    axes[0, 2].set_title('Sinusoidal Encoding - Selected Dimensions')
    axes[0, 2].set_xlabel('Position')
    axes[0, 2].set_ylabel('Value')
    axes[0, 2].legend()
    axes[0, 2].grid(True, alpha=0.3)

    # Frequency analysis
    freq_analysis = []
    for pos in range(0, min(50, max_len)):
        fft_result = np.fft.fft(sin_pe_values[pos, :])
        freq_analysis.append(np.abs(fft_result))

    freq_analysis = np.array(freq_analysis)

    im3 = axes[1, 0].imshow(freq_analysis.T, cmap='viridis', aspect='auto')
    axes[1, 0].set_title('Frequency Analysis (FFT)')
    axes[1, 0].set_xlabel('Position')
    axes[1, 0].set_ylabel('Frequency')
    plt.colorbar(im3, ax=axes[1, 0])

ÿ ÿ
Listing 58 – Transformers vs RNN


ÿ ÿ
    # Cosine similarity between positions
    from sklearn.metrics.pairwise import cosine_similarity

    positions_subset = sin_pe_values[::2]  # Subsampling
    similarity_matrix = cosine_similarity(positions_subset)

    im4 = axes[1, 1].imshow(similarity_matrix, cmap='Blues')
    axes[1, 1].set_title('Cosine Similarity Between Positions')
    axes[1, 1].set_xlabel('Position')
    axes[1, 1].set_ylabel('Position')
    plt.colorbar(im4, ax=axes[1, 1])

    # Euclidean distance between adjacent positions
    distances = []
    for i in range(1, len(sin_pe_values)):
        dist = np.linalg.norm(sin_pe_values[i] - sin_pe_values[i-1])
        distances.append(dist)

    axes[1, 2].plot(distances, linewidth=2)
    axes[1, 2].set_title('Euclidean Distance Between Adjacent Positions')
    axes[1, 2].set_xlabel('Position')
    axes[1, 2].set_ylabel('Distance')
    axes[1, 2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    return sin_pe_values, learned_pe_values

def analyze_positional_encoding_properties():
    """Analyzes the mathematical properties of positional encodings."""
    print("=== ANALYSIS OF POSITIONAL ENCODING PROPERTIES ===\n")

    d_model = 128
    max_len = 200

    # Sinusoidal encoding
    pe = SinusoidalPositionalEncoding(d_model, max_len)
    dummy_input = torch.zeros(max_len, 1, d_model)

    with torch.no_grad():
        encoded = pe(dummy_input)

    pe_values = encoded[:, 0, :].numpy()

    # 1. Translation property
    print("1. TRANSLATION PROPERTY")
    print("-" * 40)

    # Test: PE(pos + k) must be related to PE(pos)
    pos1, pos2 = 10, 20
    offset = pos2 - pos1

    pe_pos1 = pe_values[pos1]
    pe_pos2 = pe_values[pos2]

    # Cosine similarity
    similarity = np.dot(pe_pos1, pe_pos2) / (np.linalg.norm(pe_pos1) * np.linalg.norm(pe_pos2))
    print(f"Cosine similarity of PE({pos1}) and PE({pos2}): {similarity:.4f}")

    # 2. Periodicity and frequencies
    print("\n2. FREQUENCY ANALYSIS")
    print("-" * 40)


Listing 59 – Transformers

    # Frequencies decrease with dimension
    freqs = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        freqs.append(freq)
        if i < 10:  # Display the first frequencies
            print(f"Dimension {i}: frequency = {freq:.6f}")

    print(f"Max/min frequency ratio: {max(freqs) / min(freqs):.2e}")

    # 3. Ability to distinguish positions
    print("\n3. ABILITY TO DISTINGUISH POSITIONS")
    print("-" * 40)

    # Distance between adjacent positions
    min_distances = []
    for i in range(1, min(50, max_len)):
        dist = np.linalg.norm(pe_values[i] - pe_values[i-1])
        min_distances.append(dist)

    print(f"Average distance between adjacent positions: {np.mean(min_distances):.4f}")
    print(f"Minimum distance: {np.min(min_distances):.4f}")
    print(f"Maximum distance: {np.max(min_distances):.4f}")

    # 4. Relative translation invariance
    print("\n4. RELATIVE TRANSLATION INVARIANCE")
    print("-" * 40)

    # Invariance test: the difference PE(i+k) - PE(j+k)
    # should be similar to PE(i) - PE(j)
    i, j, k = 5, 15, 10

    diff1 = pe_values[i] - pe_values[j]
    diff2 = pe_values[i+k] - pe_values[j+k]

    similarity_diff = np.dot(diff1, diff2) / (np.linalg.norm(diff1) * np.linalg.norm(diff2))
    print(f"Similarity of relative differences: {similarity_diff:.4f}")

    return pe_values

def compare_positional_encoding_methods():
    """
    Compares different positional encoding methods
    """
    print("\n=== COMPARISON OF POSITIONAL ENCODING METHODS ===")
    print("-" * 60)

    comparison = {
        'Sinusoidal': {
            'advantages': [
                'No parameters to learn',
                'Generalization to longer sequences',
                'Interesting mathematical properties',
                'Relative translation invariance'
            ],
            'disadvantages': [
                'Fixed form, potentially non-optimal',
                'May not capture all positional nuances'
            ],
            'usage': 'Original Transformer'
        },
        'Learned': {
            'advantages': [
                'Adaptable to the specific data',
                'Can learn complex patterns',
                'End-to-end optimization'
            ],

Listing 60 – Transformers

            'disadvantages': [
                'Additional parameters',
                'Limited to the training length',
                'Potential overfitting'
            ],
            'usage': 'BERT, GPT-2, specialized models'
        },
        'Relative': {
            'advantages': [
                'Focus on relative distances',
                'More linguistically intuitive',
                'Better generalization'
            ],
            'disadvantages': [
                'Increased computational complexity',
                'More complex implementation'
            ],
            'usage': 'Transformer-XL, some recent models'
        },
        'RoPE': {
            'advantages': [
                'Elegant rotation properties',
                'Good generalization',
                'Computational efficiency'
            ],
            'disadvantages': [
                'Relatively new',
                'Less well studied'
            ],
            'usage': 'GPT-NeoX, PaLM, some recent models'
        }
    }

    for method, details in comparison.items():
        print(f"\n{method.upper()}:")
        print("  Advantages:")
        for adv in details['advantages']:
            print(f"    - {adv}")
        print("  Disadvantages:")
        for dis in details['disadvantages']:
            print(f"    - {dis}")
        print(f"  Typical usage: {details['usage']}")

def demonstrate_position_encoding_impact():
    """
    Demonstrates the impact of positional encodings on performance
    """
    print("\n=== IMPACT OF POSITIONAL ENCODINGS ===")

    # Simulation of a task where element order matters
    def create_order_task(seq_len=20, num_samples=1000):
        """Creates a task where the order of the elements matters"""
        data = []
        labels = []

        for _ in range(num_samples):
            # Ascending vs. descending sequence
            if np.random.random() > 0.5:
                seq = list(range(1, seq_len + 1))
                label = 1  # Ascending
            else:
                seq = list(range(seq_len, 0, -1))
                label = 0  # Descending

            # Adding noise at a quarter of the positions
            noise_positions = np.random.choice(seq_len, size=seq_len // 4, replace=False)
            for pos in noise_positions:
                seq[pos] = np.random.randint(1, seq_len + 1)


Listing 61 – Transformers

            data.append(seq)
            labels.append(label)

        return np.array(data), np.array(labels)

    # Data generation
    X, y = create_order_task()

    print("Order classification task:")
    print(f"  Samples: {len(X)}")
    print(f"  Sequence length: {X.shape[1]}")
    print("  Classes: Ascending (1) vs Descending (0)")

    print("\nExamples:")
    for i in range(3):
        print(f"  {X[i][:10]}... -> {'Ascending' if y[i] else 'Descending'}")

    print("\nWithout positional encoding, a Transformer could not distinguish")
    print("these sequences because attention is invariant to order!")

# Running the demonstrations
print("=== POSITIONAL ENCODINGS IN TRANSFORMERS ===\n")

# Visualization of encodings
sin_pe, learned_pe = visualize_positional_encodings()

# Analysis of properties
pe_analysis = analyze_positional_encoding_properties()

# Comparison of methods
compare_positional_encoding_methods()

# Impact demonstration
demonstrate_position_encoding_impact()

print("\n=== KEY POINTS ON POSITIONAL ENCODINGS ===")
print("1. Compensate for the lack of natural order in attention")
print("2. Sinusoidal encoding offers elegant mathematical properties")
print("3. Learned encodings can adapt but are limited in length")
print("4. Relative encoding focuses on distances between positions")
print("5. RoPE combines rotation and position efficiently")
print("6. The choice depends on the task and the constraints of the model")
Listing 62 – Transformers

Explanation

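Positional encodings are fundamental because self-attention is permutation-equivariant: without position information, reordering the input tokens only reorders the outputs. "Cat eats fish" and "Fish eats cat" would then yield exactly the same set of token representations, and the model could not tell them apart.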
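
The order-insensitivity described above can be checked numerically. The following is a minimal NumPy sketch, not the course's own implementation: `self_attention`, the random weights, and the three-token "sentence" are all illustrative assumptions.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, without positional encoding."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8
x = rng.normal(size=(seq_len, d_model))            # stand-ins for "cat", "eats", "fish"
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = np.array([2, 1, 0])                         # "fish", "eats", "cat"
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Reordering the input only reorders the output rows...
rows_match = np.allclose(out[perm], out_perm)
# ...so any order-insensitive pooling (here the mean) is identical.
pooled_match = np.allclose(out.mean(axis=0), out_perm.mean(axis=0))
print(rows_match, pooled_match)
```

Adding a position-dependent vector to each row of `x` before the call breaks this symmetry, which is the role of the encodings discussed next.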
Sinusoidal encoding: Uses sine and cosine functions at different frequencies. High frequencies distinguish neighboring positions, while low frequencies vary slowly and separate distant positions. Because the encoding is a fixed function of the position, it can be computed for sequences longer than those seen in training.
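
As a rough illustration of the sinusoidal scheme, the sketch below builds the encoding table in NumPy. The function name `sinusoidal_encoding` is an assumption made for this example, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model, base=10000.0):
    """Returns a (max_len, d_model) table; d_model is assumed to be even."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = positions / base ** (dims / d_model)      # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(max_len=200, d_model=128)

# The dot product of two encodings depends only on their offset (here 20).
dot_a = pe[10] @ pe[30]
dot_b = pe[50] @ pe[70]

# A longer table needs no retraining: the first 200 rows are unchanged.
pe_long = sinusoidal_encoding(max_len=400, d_model=128)
print(np.isclose(dot_a, dot_b), np.allclose(pe_long[:200], pe))
```

The offset-only dot product follows from the identity sin(a)sin(b) + cos(a)cos(b) = cos(a - b), applied to each frequency.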
Learned encoding: Positions are represented by trainable embeddings. This is more flexible, but limited to the maximum length seen in training.
Relative encoding: Focuses on the relative distances between positions rather than on absolute positions, which is often more linguistically relevant.
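
One common way to realize a relative scheme, assumed here purely for illustration, is to index a small table of biases by the clipped offset between query and key positions and add the result to the attention scores. The names `relative_position_index` and `bias_table` are hypothetical.

```python
import numpy as np

def relative_position_index(seq_len, max_distance):
    """idx[i, j] encodes the clipped offset j - i, shifted to start at 0."""
    pos = np.arange(seq_len)
    rel = pos[None, :] - pos[:, None]                  # rel[i, j] = j - i
    rel = np.clip(rel, -max_distance, max_distance)
    return rel + max_distance                          # values in [0, 2 * max_distance]

max_distance = 2
idx = relative_position_index(seq_len=6, max_distance=max_distance)

rng = np.random.default_rng(0)
bias_table = rng.normal(size=2 * max_distance + 1)     # one scalar per clipped offset
bias = bias_table[idx]                                 # (6, 6), added to attention scores

# Pairs with the same offset share the same bias, wherever they sit.
print(bias[1, 2] == bias[3, 4])
```

Because the table is indexed by offset rather than by absolute position, the same parameters apply to any sequence length.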
RoPE (Rotary Position Embedding): A more recent method that encodes the position by rotating the query and key vectors in feature space, so that attention scores depend on relative offsets. It offers interesting mathematical properties and good efficiency.
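
The rotation idea can be sketched in a few lines of NumPy. This is a simplified single-vector version written for this explanation (the helper name `rope` and the pairing of consecutive features are assumptions), not the code of any particular library.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotates consecutive feature pairs of a 1-D vector by position-dependent angles."""
    d = x.shape[-1]                                    # assumed even
    freqs = base ** (-np.arange(0, d, 2) / d)          # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin             # 2-D rotation of each pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

# Same offset (4) at different absolute positions gives the same score.
score_a = rope(q, 3) @ rope(k, 7)
score_b = rope(q, 20) @ rope(k, 24)
print(np.isclose(score_a, score_b))
```

The equality holds because the dot product of two rotated pairs depends only on the difference between their rotation angles.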
The choice of positional encoding can significantly impact performance depending on the task
and the nature of the data.


3.3.2 Practical example: Fine-tuning BERT for text classification

Let's apply the concepts of Transformers by fine-tuning BERT, a pre-trained model, for a
sentiment classification task.
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
# transformers.AdamW is deprecated; torch.optim.AdamW is the supported replacement
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import get_linear_schedule_with_warmup
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

class SentimentDataset(Dataset):
    """
    Dataset for sentiment classification with BERT
    """
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]

        # Tokenization with the BERT tokenizer
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

def create_synthetic_sentiment_data(num_samples=2000):
    """
    Creates a synthetic sentiment dataset
    """
    positive_templates = [
        "I love this {product}, it is {adjective}",
        "Excellent {product}, very {adjective}",
        "Fantastic experience with this {product}, {adjective}",
        "I highly recommend this {product}, {adjective}",
        "Perfect, this {product} is really {adjective}",
        "Magnificent {product}, completely {adjective}",
        "Superb quality, very {adjective}",
        "Wonderful {product}, absolutely {adjective}"
    ]


Listing 63 – Fine-tuning BERT for sentiment classification

    negative_templates = [
        "This {product} is {adjective}, I do not recommend it",
        "Horrible experience, very {adjective}",
        "Disappointing {product}, completely {adjective}",
        "Poor quality, really {adjective}",
        "I regret this purchase, too {adjective}",
        "This {product} is useless, absolutely {adjective}",
        "Catastrophic, extremely {adjective}",
        "Avoid this {product}, it is {adjective}"
    ]

    products = ['movie', 'book', 'restaurant', 'hotel', 'product', 'service',
                'application', 'game']

    positive_adjectives = ['excellent', 'fantastic', 'wonderful', 'perfect', 'great',
                           'extraordinary', 'remarkable', 'impressive', 'brilliant',
                           'magnificent']

    negative_adjectives = ['horrible', 'disappointing', 'worthless', 'catastrophic', 'mediocre',
                           'awful', 'lamentable', 'pitiful', 'disastrous', 'terrible']

    texts = []
    labels = []

    # Generation of positive examples
    for _ in range(num_samples // 2):
        template = np.random.choice(positive_templates)
        product = np.random.choice(products)
        adjective = np.random.choice(positive_adjectives)

        text = template.format(product=product, adjective=adjective)
        texts.append(text)
        labels.append(1)  # Positive

    # Generation of negative examples
    for _ in range(num_samples // 2):
        template = np.random.choice(negative_templates)
        product = np.random.choice(products)
        adjective = np.random.choice(negative_adjectives)

        text = template.format(product=product, adjective=adjective)
        texts.append(text)
        labels.append(0)  # Negative

    # Shuffle so that a sequential train/validation split contains both classes
    order = np.random.permutation(len(texts))
    texts = [texts[i] for i in order]
    labels = [labels[i] for i in order]

    return texts, labels

def analyze_bert_attention(model, tokenizer, text, layer_idx=-1):
    """
    Analyzes BERT's attention patterns on a text
    """
    # Tokenization (inputs are moved to the model's device)
    device = next(model.parameters()).device
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Forward pass with attention capture
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    # Extracting attention weights: shape (num_heads, seq_len, seq_len)
    attention_weights = outputs.attentions[layer_idx][0].cpu()  # First instance of the batch

    # Converting ids to tokens for visualization
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    return attention_weights, tokens


Listing 64 – Fine-tuning BERT for sentiment classification

def visualize_bert_attention(attention_weights, tokens, heads_to_show=4):
    """
    Visualizes BERT's attention patterns
    """
    num_heads = attention_weights.shape[0]
    seq_len = len(tokens)

    # Selection of the heads to display
    heads_indices = np.linspace(0, num_heads - 1, heads_to_show, dtype=int)

    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    axes = axes.flatten()

    for i, head_idx in enumerate(heads_indices):
        if i >= len(axes):
            break

        # Extracting weights for this head
        head_attention = attention_weights[head_idx].cpu().numpy()

        # Creating the heatmap
        sns.heatmap(head_attention,
                    xticklabels=tokens[:seq_len],
                    yticklabels=tokens[:seq_len],
                    cmap='Blues',
                    ax=axes[i],
                    cbar_kws={'label': 'Attention'})

        axes[i].set_title(f'Attention head {head_idx + 1}')
        axes[i].set_xlabel('Tokens (Keys)')
        axes[i].set_ylabel('Tokens (Queries)')

        # Rotating labels for readability
        axes[i].tick_params(axis='x', rotation=45)
        axes[i].tick_params(axis='y', rotation=0)

    plt.tight_layout()
    plt.show()

def train_bert_classifier():
    """
    Fine-tunes BERT for sentiment classification
    """
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")

    # Loading the tokenizer and the pre-trained BERT model
    model_name = 'bert-base-uncased'  # English version, for compatibility
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2,
        output_attentions=True,
        output_hidden_states=False
    ).to(device)

    # Data creation
    print("Creating the synthetic dataset...")
    texts, labels = create_synthetic_sentiment_data(num_samples=1600)

    # English texts passed to BERT
    texts_english = []

Listing 65 – Fine-tuning BERT for sentiment classification

    for text in texts:
        # The templates are already in English and need no translation
        texts_english.append(text)

    # Train/validation split
    split_idx = int(0.8 * len(texts_english))
    train_texts, val_texts = texts_english[:split_idx], texts_english[split_idx:]
    train_labels, val_labels = labels[:split_idx], labels[split_idx:]

    print(f"Training samples: {len(train_texts)}")
    print(f"Validation samples: {len(val_texts)}")

    # Datasets and DataLoaders
    train_dataset = SentimentDataset(train_texts, train_labels, tokenizer)
    val_dataset = SentimentDataset(val_texts, val_labels, tokenizer)

    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)

    # Optimizer and scheduler
    optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)

    epochs = 3
    total_steps = len(train_loader) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=total_steps
    )

    # Training
    train_losses = []
    val_accuracies = []

    print("\nStarting BERT fine-tuning...")

    for epoch in range(epochs):
        # Training phase
        model.train()
        total_train_loss = 0

        for batch_idx, batch in enumerate(train_loader):
            # Move the batch to the device (GPU if available)
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Reset gradients
            model.zero_grad()

            # Forward pass
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )

            loss = outputs.loss
            total_train_loss += loss.item()


Listing 66 – Fine-tuning BERT for sentiment classification

            # Backward pass
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Parameter update
            optimizer.step()
            scheduler.step()

            if batch_idx % 20 == 0:
                print(f'Epoch {epoch+1}, Batch {batch_idx}/{len(train_loader)}, Loss: {loss.item():.4f}')

        avg_train_loss = total_train_loss / len(train_loader)
        train_losses.append(avg_train_loss)

        # Validation phase
        model.eval()
        total_eval_accuracy = 0
        total_eval_loss = 0

        predictions = []
        true_labels = []

        with torch.no_grad():
            for batch in val_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                outputs = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels
                )

                loss = outputs.loss
                logits = outputs.logits

                total_eval_loss += loss.item()

                # Accuracy calculation
                preds = torch.argmax(logits, dim=1)
                accuracy = (preds == labels).cpu().numpy().mean()
                total_eval_accuracy += accuracy

                # Storage for detailed metrics
                predictions.extend(preds.cpu().numpy())
                true_labels.extend(labels.cpu().numpy())

        avg_val_accuracy = total_eval_accuracy / len(val_loader)
        val_accuracies.append(avg_val_accuracy)

        print(f'Epoch {epoch+1}:')
        print(f'  Train Loss: {avg_train_loss:.4f}')
        print(f'  Val Accuracy: {avg_val_accuracy:.4f}')
        print('-' * 50)

    # Final metrics (computed on the last epoch's validation predictions)
    print("\nFinal classification report:")
    print(classification_report(true_labels, predictions,
                                target_names=['Negative', 'Positive']))

    return model, tokenizer, train_losses, val_accuracies, true_labels, predictions


Listing 67 – Fine-tuning BERT for sentiment classification

def analyze_bert_performance(model, tokenizer, test_texts, test_labels):
    """
    Detailed analysis of BERT performance
    """
    device = next(model.parameters()).device
    model.eval()

    predictions = []
    confidences = []

    print("Analysis of predictions on test samples...")

    with torch.no_grad():
        for i, text in enumerate(test_texts[:10]):  # Analysis on at most 10 examples
            # Tokenization
            inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
            inputs = {k: v.to(device) for k, v in inputs.items()}

            # Prediction
            outputs = model(**inputs)
            logits = outputs.logits
            probabilities = torch.softmax(logits, dim=1)

            predicted_class = torch.argmax(logits, dim=1).item()
            confidence = torch.max(probabilities).item()

            predictions.append(predicted_class)
            confidences.append(confidence)

            # Detailed display
            true_label = test_labels[i]
            status = "[OK]" if predicted_class == true_label else "[X]"

            print(f"\n{status} Example {i+1}:")
            print(f"  Text: {text}")
            print(f"  True label: {'Positive' if true_label == 1 else 'Negative'}")
            print(f"  Prediction: {'Positive' if predicted_class == 1 else 'Negative'}")
            print(f"  Confidence: {confidence:.3f}")

    return predictions, confidences

def demonstrate_bert_attention_analysis():
    """
    Demonstration of BERT's attention analysis
    """
    print("\n=== BERT ATTENTION ANALYSIS ===")

    # Examples of texts for analysis
    example_texts = [
        "This movie is absolutely fantastic and amazing!",
        "The book was terrible and completely boring.",
        "I love this product, it works perfectly well."
    ]

    # Loading a simple BERT model for analysis
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                          output_attentions=True)

    for i, text in enumerate(example_texts):
        print(f"\nAnalysis of text {i+1}: '{text}'")

        # Attention analysis
        attention_weights, tokens = analyze_bert_attention(model, tokenizer, text,
                                                           layer_idx=-1)


Listing 68 – Fine-tuning BERT for sentiment classification

        print(f"  Number of attention heads: {attention_weights.shape[0]}")
        print(f"  Sequence length: {len(tokens)}")
        print(f"  Tokens: {tokens[:10]}...")  # Display of the first tokens

        # Statistical analysis of attention
        avg_attention_per_head = attention_weights.mean(dim=(1, 2))
        max_attention_per_head = attention_weights.max(dim=2)[0].max(dim=1)[0]

        print(f"  Mean attention per head: {avg_attention_per_head[:4].tolist()}")
        print(f"  Max attention per head: {max_attention_per_head[:4].tolist()}")

        # Visualization for the first example only
        if i == 0:
            print("  Generating the attention visualization...")
            visualize_bert_attention(attention_weights, tokens, heads_to_show=4)

def compare_bert_variants():
    """
    Comparison of BERT variants
    """
    print("\n=== COMPARISON OF BERT VARIANTS ===")
    print("-" * 60)

    bert_variants = {
        'BERT-Base': {
            'layers': 12,
            'hidden_size': 768,
            'attention_heads': 12,
            'parameters': '110M',
            'training_data': 'BookCorpus + Wikipedia',
            'strengths': ['Bidirectional', 'Versatile', 'Well-studied'],
            'use_cases': ['Classification', 'NER', 'Question-Answering']
        },
        'BERT-Large': {
            'layers': 24,
            'hidden_size': 1024,
            'attention_heads': 16,
            'parameters': '340M',
            'training_data': 'BookCorpus + Wikipedia',
            'strengths': ['Higher performance', 'Better representation'],
            'use_cases': ['Complex tasks', 'State of the art']
        },
        'RoBERTa': {
            'layers': 24,
            'hidden_size': 1024,
            'attention_heads': 16,
            'parameters': '355M',
            'training_data': 'More data, longer training',
            'strengths': ['Training optimizations', 'No NSP'],
            'use_cases': ['Robust alternative to BERT']
        },
        'DistilBERT': {
            'layers': 6,
            'hidden_size': 768,
            'attention_heads': 12,
            'parameters': '66M',
            'training_data': 'Distillation of BERT',
            'strengths': ['Faster', 'Lighter', '97% of the performance'],
            'use_cases': ['Production', 'Limited resources']
        },
        'ELECTRA': {
            'layers': 12,
            'hidden_size': 768,
            'attention_heads': 12,
            'parameters': '110M',
            'training_data': 'Discriminative training',
            'strengths': ['More efficient', 'Better than BERT'],
            'use_cases': ['High-performance alternative']
        }
    }

    for variant, specs in bert_variants.items():


        print(f"\n{variant}:")
        print(f"  Architecture: {specs['layers']} layers, {specs['hidden_size']} dim, {specs['attention_heads']} heads")
        print(f"  Parameters: {specs['parameters']}")
        print(f"  Training data: {specs['training_data']}")
        print(f"  Strengths: {', '.join(specs['strengths'])}")
        print(f"  Use cases: {', '.join(specs['use_cases'])}")

def plot_training_results(train_losses, val_accuracies):
    """
    Plots the training results
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

    # Loss curve
    ax1.plot(train_losses, 'b-', marker='o', linewidth=2, markersize=8)
    ax1.set_title('Training Loss Evolution')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    ax1.grid(True, alpha=0.3)

    # Accuracy curve
    ax2.plot(val_accuracies, 'g-', marker='s', linewidth=2, markersize=8)
    ax2.set_title('Validation Accuracy Evolution')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy')
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(0, 1)

    plt.tight_layout()
    plt.show()

def create_confusion_matrix(true_labels, predictions):
    """
    Creates and displays the confusion matrix
    """
    cm = confusion_matrix(true_labels, predictions)

    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Negative', 'Positive'],
                yticklabels=['Negative', 'Positive'])
    plt.title('Confusion Matrix - Sentiment Classification')
    plt.xlabel('Predictions')
    plt.ylabel('Actual Values')
    plt.show()

    return cm

def demonstrate_transfer_learning_benefits():
    """
    Demonstrates the benefits of transfer learning with BERT
    """
    print("\n=== BENEFITS OF TRANSFER LEARNING WITH BERT ===")
    print("-" * 60)

    benefits = {
        'Pre-training': {
            'description': 'BERT is pre-trained on large text corpora',
            'impact': 'Captures rich linguistic representations',
            'advantage': 'No need to start from scratch'
        },
        'Fine-tuning': {
            'description': 'Adaptation to specific tasks with little data',
            'impact': 'High performance even with limited datasets',
            'advantage': 'Time and resource efficiency'
        },
        'Contextual representations': {
            'description': 'Each word has a context-dependent representation',
            'impact': 'Handles polysemy and ambiguities',
            'advantage': 'Nuanced understanding of language'
        },
        'Bidirectionality': {


            'description': 'Access to left AND right context simultaneously',
            'impact': 'Better understanding than unidirectional models',
            'advantage': 'Captures complex dependencies'
        }
    }

    for benefit, details in benefits.items():
        print(f"\n{benefit.upper()}:")
        print(f"  Description: {details['description']}")
        print(f"  Impact: {details['impact']}")
        print(f"  Advantage: {details['advantage']}")

    # Comparative simulation of performance (illustrative figures, not measurements)
    print("\nSIMULATED PERFORMANCE COMPARISON:")
    print("-" * 40)

    scenarios = {
        'Model from scratch': {'accuracy': 0.65, 'training_time': '48h', 'data_needed': '100k+ samples'},
        'Fine-tuning BERT': {'accuracy': 0.89, 'training_time': '2h', 'data_needed': '1k+ samples'},
        'BERT without fine-tuning': {'accuracy': 0.76, 'training_time': '5 min', 'data_needed': '0 samples'}
    }

    for scenario, metrics in scenarios.items():
        print(f"\n{scenario}:")
        print(f"  Accuracy: {metrics['accuracy']:.2%}")
        print(f"  Training time: {metrics['training_time']}")
        print(f"  Data needed: {metrics['data_needed']}")

# Running the complete example
print("=== FINE-TUNING BERT FOR CLASSIFICATION ===\n")

# Model training
try:
    trained_model, tokenizer, losses, accuracies, true_labels, predictions = train_bert_classifier()

    # Visualization of training results
    plot_training_results(losses, accuracies)

    # Confusion matrix
    cm = create_confusion_matrix(true_labels, predictions)

    # Performance analysis
    test_texts = [
        "This product is excellent and works perfectly!",
        "Terrible quality, completely disappointed.",
        "Amazing experience, highly recommended!",
        "Waste of money, very poor quality."
    ]
    test_labels = [1, 0, 1, 0]  # Positive, Negative, Positive, Negative

    preds, confs = analyze_bert_performance(trained_model, tokenizer, test_texts,
                                            test_labels)

except Exception as e:
    print(f"Error during training: {e}")
    print("Let's continue with the conceptual analyses...")

# BERT attention analysis
demonstrate_bert_attention_analysis()

# Comparison of variants
compare_bert_variants()

# Benefits of transfer learning
demonstrate_transfer_learning_benefits()

print("\n=== KEY POINTS OF BERT FINE-TUNING ===")
print("1. Transfer learning: starting from a pre-trained model")
print("2. Fine-tuning: adaptation with little data and few epochs")


print("3. Attention: understand what the model 'looks at'")
print("4. Variants: choose according to constraints (speed vs. performance)")
print("5. Tokenization: importance of preprocessing with the BERT tokenizer")
print("6. Hyperparameters: low learning rate, warmup, gradient clipping")
print("7. Evaluation: metrics adapted to the classification task")
Listing 69 – Fine-tuning BERT for sentiment classification

Explanation

Fine-tuning BERT illustrates the power of transfer learning in NLP:


Massive pre-training: BERT is pre-trained on billions of words with self-supervised tasks (Masked Language Modeling and Next Sentence Prediction), which lets it learn rich linguistic representations.
Fast adaptation: With only a few epochs and little data, fine-tuning adapts these pre-learned representations to a specific task, often with strong results.
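
Fast adaptation relies on a small learning rate that is scheduled over the few available steps. The pure-Python sketch below reproduces the shape of a linear warmup-then-decay schedule; `linear_warmup_decay` is a name chosen for this example, and the 10% warmup is a hypothetical value (the listing above passes `num_warmup_steps=0`, so its rate simply decays linearly from the first step).

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)             # linear ramp from 0 to 1
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 2e-5
total_steps = 240     # 1280 training samples / batch size 16 = 80 batches, x 3 epochs
warmup_steps = 24     # hypothetical 10% warmup, not the value used in the listing

schedule = [base_lr * linear_warmup_decay(s, warmup_steps, total_steps)
            for s in range(total_steps + 1)]

print(f"start={schedule[0]:.1e}, peak={max(schedule):.1e}, end={schedule[-1]:.1e}")
```

A short warmup is often used to avoid large, destabilizing updates to the pre-trained weights during the first steps, when the new classification head is still random.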
Contextual representations: Unlike static embeddings (Word2Vec), each token has a different representation depending on its context, allowing ambiguity and polysemy to be managed.
Attention analysis: Attention patterns reveal how BERT "understands" text, often showing emergent syntactic and semantic structures without explicit supervision.
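
When summarizing attention patterns numerically, note that the mean of a row-normalized attention matrix is always 1/seq_len, so it cannot distinguish heads. A statistic such as the entropy of each attention row is more informative. The sketch below uses NumPy and two hand-made heads; `attention_entropy` is an illustrative helper, not part of the listing.

```python
import numpy as np

def attention_entropy(attn):
    """attn: (num_heads, seq_len, seq_len) with rows summing to 1.
    Returns the mean row entropy of each head, in nats."""
    eps = 1e-12                                        # avoids log(0)
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)
    return row_entropy.mean(axis=-1)

seq_len = 5
uniform_head = np.full((seq_len, seq_len), 1.0 / seq_len)   # diffuse attention
diagonal_head = np.eye(seq_len)                              # each token attends to itself
attn = np.stack([uniform_head, diagonal_head])

entropies = attention_entropy(attn)
means = attn.mean(axis=(1, 2))                         # identical for both heads

print("entropy per head:", entropies)
print("mean per head:", means)
```

The diffuse head reaches the maximum entropy log(seq_len), while the sharply focused head has entropy close to zero, even though both have the same mean weight.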
Practical considerations: Choosing the variant (BERT-Base vs Large vs DistilBERT) depends on the performance/resource tradeoff needed for the target application.
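
To make the resource side of this tradeoff concrete, the parameter count of a BERT-style encoder can be estimated from its layer count and hidden size. The sketch below is a back-of-the-envelope calculation, assuming the `bert-base-uncased` vocabulary of 30,522 tokens, 512 positions, 2 segment types, and a feed-forward size of 4 times the hidden size; `bert_encoder_params` is a helper written for this example.

```python
def bert_encoder_params(layers, hidden, vocab=30522, max_pos=512, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder with pooler."""
    intermediate = 4 * hidden                          # feed-forward inner size
    layer_norm = 2 * hidden                            # scale and bias
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), then LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (weights + biases), then LayerNorm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

base = bert_encoder_params(layers=12, hidden=768)
large = bert_encoder_params(layers=24, hidden=1024)

print(f"BERT-Base:  {base / 1e6:.1f}M parameters")
print(f"BERT-Large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands close to the commonly quoted 110M and 340M figures, and it shows that most of the gap between the two variants comes from the encoder layers rather than from the embeddings.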
This approach has democratized the use of cutting-edge NLP models, making it possible to obtain
excellent results even with limited resources.
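To make the Masked Language Modeling objective mentioned above more concrete, here is a minimal sketch in plain Python. It is a simplification and not BERT's actual preprocessing: the real procedure works on subword ids and sometimes keeps or randomly replaces the selected tokens instead of masking them. The names `MASK_TOKEN` and `mask_tokens` and the 15% masking rate used as default are assumptions chosen for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder string, for illustration only


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a simplified MLM objective.

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a loss would only be computed on the masked positions.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))

    masked = [MASK_TOKEN if i in positions else tok
              for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None
              for i, tok in enumerate(tokens)]
    return masked, labels


sentence = "the movie was surprisingly good and the cast was great".split()
masked, labels = mask_tokens(sentence)
print(masked)
print(labels)
```

The model is then trained to predict the entries of `labels` from `masked`, which requires no human annotation; this is what "self-supervised" means here.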

            # (inside the epoch and batch loops of train_transformer_classifier)

            # Create the padding mask
            mask = create_padding_mask(sequences).to(device)

            # Forward pass
            logits, attention_weights = model(sequences, mask)
            loss = criterion(logits, labels)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Metrics
            epoch_loss += loss.item()
            predictions = torch.argmax(logits, dim=1)
            correct_predictions += (predictions == labels).sum().item()
            total_predictions += labels.size(0)

            if batch_idx % 20 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

        avg_loss = epoch_loss / len(dataloader)
        accuracy = correct_predictions / total_predictions
        train_losses.append(avg_loss)

        print(f'Epoch {epoch}: Loss = {avg_loss:.4f}, Accuracy = {accuracy:.4f}')

    return model, train_losses, attention_weights

# Analysis of attention patterns
def analyze_attention_patterns(model, sample_text, vocab_size=1000):
    """
    Analyze attention patterns on a text sample
    """
    model.eval()
    device = next(model.parameters()).device

    # Convert text to tokens (simulated here with random token ids)
    tokens = torch.randint(1, vocab_size, (1, 20)).to(device)
    mask = create_padding_mask(tokens).to(device)

    with torch.no_grad():
        _, attention_weights = model(tokens, mask)

    print("Analysis of attention patterns:")
    print(f"Number of layers: {len(attention_weights)}")
    print(f"Number of heads per layer: {attention_weights[0].size(1)}")
    print(f"Sequence length: {attention_weights[0].size(2)}")

    # Inspect the first, middle, and last layers
    n_layers = len(attention_weights)
    for layer_idx in [0, n_layers // 2, n_layers - 1]:
        # First head and middle head
        for head_idx in [0, attention_weights[0].size(1) // 2]:
            print(f"\nLayer {layer_idx + 1}, Head {head_idx + 1}")

            # Extract the weights: [seq_len, seq_len]
            weights = attention_weights[layer_idx][0, head_idx].detach().cpu().numpy()

            # Statistical analysis of the weights
            avg_attention = weights.mean()
            max_attention = weights.max()
            attention_entropy = -np.sum(weights * np.log(weights + 1e-10), axis=1).mean()

            print(f"Average attention: {avg_attention:.4f}")
            print(f"Maximum attention: {max_attention:.4f}")
            print(f"Average entropy: {attention_entropy:.4f}")

    return attention_weights

# Transformer vs RNN comparison
def compare_transformer_vs_rnn():
    """
    Compare Transformer and RNN performance and characteristics
    """
    print("TRANSFORMER vs RNN COMPARISON")
    print("=" * 50)

    comparison_data = {
        'Criterion': [
            'Parallelization',
            'Long-term memory',
            'Computational complexity',
            'Training speed',
            'Interpretability',
            'Performance on long sequences',
            'Memory consumption'
        ],
        'RNN/LSTM': [
            'Sequential',
            'Limited (vanishing gradient)',
            'O(n) sequential steps',
            'Slow (sequential)',
            'Difficult',
            'Problematic',
            'Moderate'
        ],
        'Transformer': [
            'Fully parallel',
            'Excellent (global attention)',
            'O(n^2) for attention',
            'Fast (parallelizable)',
            'Good (attention)',
            'Excellent',
            'High (quadratic attention)'
        ]
    }

    for i in range(len(comparison_data['Criterion'])):
        print(f"{comparison_data['Criterion'][i]:<25} | "
              f"{comparison_data['RNN/LSTM'][i]:<25} | "
              f"{comparison_data['Transformer'][i]}")

    # Illustrative (simulated, not measured) relative training times
    sequence_lengths = [10, 50, 100, 200, 500]
    rnn_times = [0.1, 0.5, 1.0, 2.0, 5.0]            # Sequential time
    transformer_times = [0.05, 0.1, 0.2, 0.4, 1.0]   # Parallel time

    plt.figure(figsize=(12, 5))

    # Training time plot
    plt.subplot(1, 2, 1)
    plt.plot(sequence_lengths, rnn_times, 'o-', label='RNN/LSTM', linewidth=2)
    plt.plot(sequence_lengths, transformer_times, 's-', label='Transformer', linewidth=2)
    plt.xlabel('Sequence length')
    plt.ylabel('Relative time')
    plt.title('Training time comparison')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Memory complexity plot
    plt.subplot(1, 2, 2)
    rnn_memory = [n for n in sequence_lengths]               # Linear
    transformer_memory = [n ** 2 for n in sequence_lengths]  # Quadratic

    plt.plot(sequence_lengths, rnn_memory, 'o-', label='RNN/LSTM (O(n))', linewidth=2)
    plt.plot(sequence_lengths, transformer_memory, 's-', label='Transformer (O(n^2))',
             linewidth=2)
    plt.xlabel('Sequence length')
    plt.ylabel('Relative memory usage')
    plt.title('Memory complexity')
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

# Running the Transformer example
print("=== TRANSFORMER ARCHITECTURE ===\n")

# Model training
transformer_model, losses, sample_attention = train_transformer_classifier()

# Analysis of attention patterns
print("\n" + "=" * 50)
attention_patterns = analyze_attention_patterns(transformer_model, "sample text")

# Comparison with RNN
print("\n" + "=" * 50)
compare_transformer_vs_rnn()

# Visualization of the training curve
plt.figure(figsize=(10, 6))
plt.plot(losses, 'b-', linewidth=2, marker='o')
plt.title('Loss Evolution - Transformer Training')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)
plt.show()

print("\nKey points about Transformers:")
print("1. The attention mechanism captures long-range dependencies")
print("2. Full parallelization speeds up training")
print("3. Positional encoding compensates for the lack of sequential order")
print("4. Multi-head attention captures different types of relations")
print("5. Residual connections and normalization facilitate deep training")
Listing 70 – Fine-tuning BERT for sentiment classification

Explanation

Transformers represent a fundamental paradigm shift in sequence processing:

Attention mechanism: Each position can directly access all other positions, which avoids the long chains of recurrent steps responsible for vanishing gradients in RNNs. The attention weights let the model focus on the relevant elements.
Parallelization: Unlike RNNs, which process tokens one after another, Transformers process all tokens simultaneously, which drastically speeds up training on GPUs.
Multi-head attention: Allows the model to capture different types of relationships (syntactic, semantic) simultaneously by using multiple parallel attention "heads".
Positional encoding: Compensates for the lack of a natural sequential order by injecting positional information, for example via sinusoidal functions.

This architecture has become the basis for influential models such as BERT, GPT, and their successors.

3.3.3 Self-attention and multi-head attention

The attention mechanism is the heart of Transformers. Let's understand in detail how it works and
its variants.
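For reference, the computation implemented in the listing that follows can be written compactly as below. The notation is an assumption made here for readability: n is the sequence length, d_k and d_v are the key and value dimensions, h is the number of heads, and W_i^Q, W_i^K, W_i^V, W^O are the learned projection matrices.

```latex
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad Q, K \in \mathbb{R}^{n \times d_k},\; V \in \mathbb{R}^{n \times d_v}

\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}\right),
\qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}
```

Dividing by \sqrt{d_k} keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.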
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

class ScaledDotProductAttention(nn.Module):
    """
    Scaled dot-product attention
    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    """
    def __init__(self, temperature, dropout=0.1):
        super().__init__()
        self.temperature = temperature  # sqrt(d_k)
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None):
        # Compute attention scores: [batch, n_head, len_q, len_k]
        attn = torch.matmul(q / self.temperature, k.transpose(2, 3))

        # Apply the mask if provided
        if mask is not None:
            attn = attn.masked_fill(mask == 0, -1e9)

        # Softmax normalization
        attn = F.softmax(attn, dim=-1)
        attn = self.dropout(attn)

        # Apply the weights to the values
        output = torch.matmul(attn, v)

        return output, attn

class MultiHeadAttention(nn.Module):
    """
    Multi-head attention with detailed analysis
    """
    def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1):
        super().__init__()

        self.n_head = n_head
        self.d_k = d_k
        self.d_v = d_v

        # Linear projections for all heads
        self.w_qs = nn.Linear(d_model, n_head * d_k, bias=False)
        self.w_ks = nn.Linear(d_model, n_head * d_k, bias=False)
        self.w_vs = nn.Linear(d_model, n_head * d_v, bias=False)
        self.fc = nn.Linear(n_head * d_v, d_model, bias=False)

        self.attention = ScaledDotProductAttention(temperature=d_k ** 0.5)

        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)

    def forward(self, q, k, v, mask=None):
        d_k, d_v, n_head = self.d_k, self.d_v, self.n_head
        sz_b, len_q, len_k, len_v = q.size(0), q.size(1), k.size(1), v.size(1)

        residual = q  # Residual connection

        # Projection and reshape for multi-head: [batch, len, n_head, d]
        q = self.w_qs(q).view(sz_b, len_q, n_head, d_k)
        k = self.w_ks(k).view(sz_b, len_k, n_head, d_k)
        v = self.w_vs(v).view(sz_b, len_v, n_head, d_v)

        # Transpose so heads are processed as a batch: [batch, n_head, len, d]
        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)

        # Adapt the mask for multi-head
        if mask is not None:
            mask = mask.unsqueeze(1)  # Broadcast over the head axis

        # Apply attention
        q, attn = self.attention(q, k, v, mask=mask)

        # Concatenate the heads: [batch, len_q, n_head * d_v]
        q = q.transpose(1, 2).contiguous().view(sz_b, len_q, -1)
        q = self.dropout(self.fc(q))

        # Residual connection and normalization
        q += residual
        q = self.layer_norm(q)

        return q, attn

class AttentionVisualizer:
    """
    Class for visualizing and analyzing attention patterns
    """
    def __init__(self):
        self.attention_maps = []

    def analyze_attention_patterns(self, attention_weights, tokens=None):
        """
        Analyze the attention patterns of a model
        """
        # attention_weights: [batch, heads, seq_len, seq_len]
        batch_size, n_heads, seq_len, _ = attention_weights.shape

        # Per-head analysis (first example of the batch)
        head_analysis = {}
        for head in range(n_heads):
            attn_head = attention_weights[0, head].detach().cpu().numpy()

            # Analysis metrics
            avg_attention = np.mean(attn_head)
            max_attention = np.max(attn_head)

            # Attention entropy (measure of dispersion)
            entropy = -np.sum(attn_head * np.log(attn_head + 1e-10), axis=1).mean()

            # Diagonal (each token attending to itself)
            diagonal_attention = np.mean(np.diag(attn_head))

            # Average distance of strong connections
            strong_connections = attn_head > (avg_attention + np.std(attn_head))
            positions = np.where(strong_connections)
            avg_distance = (np.mean(np.abs(positions[0] - positions[1]))
                            if len(positions[0]) > 0 else 0)

            head_analysis[head] = {
                'avg_attention': avg_attention,
                'max_attention': max_attention,
                'entropy': entropy,
                'diagonal_attention': diagonal_attention,
                'avg_distance': avg_distance,
                'attention_matrix': attn_head
            }

        return head_analysis

    def plot_attention_heads(self, attention_weights, tokens=None, max_heads=4):
        """
        Visualize attention patterns for multiple heads
        """
        batch_size, n_heads, seq_len, _ = attention_weights.shape
        attention_np = attention_weights[0].detach().cpu().numpy()

        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        axes = axes.flatten()

        # Select the heads to visualize (at most one per subplot)
        heads_to_show = min(max_heads, n_heads, len(axes))

        for i in range(heads_to_show):
            ax = axes[i]

            # Attention heatmap
            sns.heatmap(attention_np[i], cmap='Blues', ax=ax,
                        cbar_kws={'label': 'Attention Weight'})

            ax.set_title(f'Attention head {i + 1}')
            ax.set_xlabel('Position (Key)')
            ax.set_ylabel('Position (Query)')

            # Add tokens if provided
            if tokens and len(tokens) <= 20:
                ax.set_xticklabels(tokens, rotation=45)
                ax.set_yticklabels(tokens, rotation=0)

        plt.tight_layout()
        plt.show()

    def plot_attention_statistics(self, head_analysis):
        """
        Statistical plots of attention patterns
        """
        n_heads = len(head_analysis)
        metrics = ['avg_attention', 'entropy', 'diagonal_attention', 'avg_distance']
        metric_names = ['Average Attention', 'Entropy', 'Self-attention',
                        'Average Distance']

        fig, axes = plt.subplots(2, 2, figsize=(12, 10))
        axes = axes.flatten()

        for i, (metric, name) in enumerate(zip(metrics, metric_names)):
            values = [head_analysis[head][metric] for head in range(n_heads)]

            axes[i].bar(range(n_heads), values, alpha=0.7,
                        color=plt.cm.viridis(np.linspace(0, 1, n_heads)))
            axes[i].set_title(f'{name} per Head')
            axes[i].set_xlabel('Attention head')
            axes[i].set_ylabel(name)
            axes[i].set_xticks(range(n_heads))
            axes[i].set_xticklabels([f'H{h + 1}' for h in range(n_heads)])

        plt.tight_layout()
        plt.show()

class SelfAttentionAnalyzer(nn.Module):
    """
    Specialized model for analyzing self-attention patterns
    """
    def __init__(self, vocab_size, d_model=256, n_heads=8, n_layers=6):
        super().__init__()

        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        # PositionalEncoding is not defined in this listing; it must be
        # provided separately and accept [batch, seq_len, d_model] inputs.
        self.pos_encoding = PositionalEncoding(d_model)

        # Attention layers
        self.attention_layers = nn.ModuleList([
            MultiHeadAttention(n_heads, d_model, d_model // n_heads, d_model // n_heads)
            for _ in range(n_layers)
        ])

        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, x, return_attention=False):
        # Embedding and positional encoding
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)

        attention_weights = []

        # Pass through the attention layers
        for attention_layer in self.attention_layers:
            x, attn = attention_layer(x, x, x)  # Self-attention: q = k = v
            if return_attention:
                attention_weights.append(attn)

        x = self.layer_norm(x)

        if return_attention:
            return x, attention_weights
        return x

def demonstrate_attention_mechanisms():
    """
    Demonstration of attention mechanisms with analysis
    """
    print("=== DEMONSTRATION OF ATTENTION MECHANISMS ===\n")

    # Parameters
    vocab_size = 1000
    seq_len = 20
    batch_size = 2

    # Model creation
    model = SelfAttentionAnalyzer(vocab_size, d_model=256, n_heads=8, n_layers=4)
    model.eval()

    # Example data (random token ids)
    input_ids = torch.randint(1, vocab_size, (batch_size, seq_len))

    print(f"Input shape: {input_ids.shape}")
    print(f"Example sequence: {input_ids[0][:10].tolist()}...")

    # Forward pass with attention capture
    with torch.no_grad():
        output, attention_weights = model(input_ids, return_attention=True)

    print(f"\nNumber of attention layers: {len(attention_weights)}")
    print(f"Shape of attention weights per layer: {attention_weights[0].shape}")

    # Analysis of attention patterns
    visualizer = AttentionVisualizer()

    # Analysis of the first layer
    print("\n=== ANALYSIS OF THE FIRST LAYER ===")
    first_layer_analysis = visualizer.analyze_attention_patterns(attention_weights[0])

    for head, analysis in first_layer_analysis.items():
        print(f"\nHead {head + 1}:")
        print(f"  Average attention: {analysis['avg_attention']:.4f}")
        print(f"  Entropy: {analysis['entropy']:.4f}")
        print(f"  Self-attention: {analysis['diagonal_attention']:.4f}")
        print(f"  Average distance: {analysis['avg_distance']:.2f}")

    # Visualizations
    print("\nGenerating visualizations...")

    # First-layer attention patterns
    visualizer.plot_attention_heads(attention_weights[0], max_heads=4)

    # Statistics per head
    visualizer.plot_attention_statistics(first_layer_analysis)

    # Evolution through the layers
    layer_entropies = []
    layer_self_attention = []

    for layer_idx, attn_weights in enumerate(attention_weights):
        layer_analysis = visualizer.analyze_attention_patterns(attn_weights)

        avg_entropy = np.mean([analysis['entropy']
                               for analysis in layer_analysis.values()])
        avg_self_attn = np.mean([analysis['diagonal_attention']
                                 for analysis in layer_analysis.values()])

        layer_entropies.append(avg_entropy)
        layer_self_attention.append(avg_self_attn)

    # Evolution plots
    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    plt.plot(range(1, len(layer_entropies) + 1), layer_entropies, 'o-', linewidth=2)
    plt.title('Evolution of Attention Entropy')
    plt.xlabel('Layer')
    plt.ylabel('Average Entropy')
    plt.grid(True, alpha=0.3)

    plt.subplot(1, 2, 2)
    plt.plot(range(1, len(layer_self_attention) + 1), layer_self_attention, 's-',
             linewidth=2, color='orange')
    plt.title('Evolution of Self-attention')
    plt.xlabel('Layer')
    plt.ylabel('Average Self-attention')
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    return model, attention_weights, visualizer

def compare_attention_types():
    """
    Compare different types of attention
    """
    print("\n=== ATTENTION TYPE COMPARISON ===")

    attention_types = {
        'Dot-Product': {
            'complexity': 'O(n^2 d)',
            'memory': 'O(n^2)',
            'description': 'Standard Transformer attention',
            'advantages': ['Simple', 'Efficient', 'Parallelizable'],
            'disadvantages': ['Quadratic in sequence length', 'Memory-intensive']
        },
        'Additive (Bahdanau)': {
            'complexity': 'O(n^2 d)',
            'memory': 'O(n^2)',
            'description': 'Attention computed with a feedforward network',
            'advantages': ['More expressive', 'Better dependency capture'],
            'disadvantages': ['Slower', 'More parameters']
        },
        'Linear Attention': {
            'complexity': 'O(n d^2)',
            'memory': 'O(nd)',
            'description': 'Linear approximation of attention',
            'advantages': ['Linear complexity', 'Memory-efficient'],
            'disadvantages': ['Approximation', 'Performance loss']
        },
        'Sparse Attention': {
            'complexity': 'O(n sqrt(n))',
            'memory': 'O(n sqrt(n))',
            'description': 'Attention restricted to specific patterns',
            'advantages': ['Reduced complexity', 'Handles long sequences'],
            'disadvantages': ['Fixed patterns', 'Complex implementation']
        }
    }

    print("Comparison of attention mechanisms:")
    print("-" * 80)
    for name, info in attention_types.items():
        print(f"\n{name}:")
        print(f"  Complexity: {info['complexity']}")
        print(f"  Memory: {info['memory']}")
        print(f"  Description: {info['description']}")
        print(f"  Advantages: {', '.join(info['advantages'])}")
        print(f"  Disadvantages: {', '.join(info['disadvantages'])}")

# Running the demonstration
model, attention_weights, visualizer = demonstrate_attention_mechanisms()
compare_attention_types()

print("\n=== KEY POINTS ABOUT ATTENTION ===")
print("1. Self-attention allows each position to see all others")
print("2. Multi-head attention captures different types of relations")
print("3. Entropy measures the dispersion of attention")
print("4. The first layers often have more self-attention")
print("5. Deeper layers develop more complex patterns")
print("6. Visualization reveals emerging linguistic structures")
Listing 71 – Detailed Implementation of Attention Mechanisms

Explanation

Attention mechanisms change how sequences are processed by enabling direct interaction between all positions:

Self-attention: Each token can "look" at all other tokens in the sequence and compute a representation weighted by relevance. This naturally captures long-range dependencies.
Multi-head attention: Uses several attention "heads" in parallel, each learning to focus on different aspects (syntactic, semantic, positional). This diversity enriches the representation.
Emerging patterns: Analysis of attention weights in trained models often reveals interesting linguistic structures: some heads specialize in syntactic relations, others in co-reference or semantic dependencies.
Quadratic complexity: The main challenge is the O(n²) cost in sequence length, which motivates research into more efficient variants such as linear or sparse attention.
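The points above can be checked on a toy example. The following NumPy sketch implements a single attention head without masking or dropout and computes the same row-entropy measure used in the listing. The helper names (`softmax`, `self_attention`, `attention_entropy`) and the random projection matrices are assumptions for illustration, not part of the course code.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention on x of shape [n, d_model]."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # [n, n]: the source of the O(n^2) cost
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


def attention_entropy(weights, eps=1e-10):
    """Mean row entropy: near 0 for one-hot rows, ln(n) for uniform rows."""
    return float(-(weights * np.log(weights + eps)).sum(axis=-1).mean())


rng = np.random.default_rng(0)
n, d_model, d_k = 6, 8, 4
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (6, 4) (6, 6)
print(attention_entropy(weights), "upper bound:", np.log(n))
```

The `[n, n]` weight matrix is what grows quadratically with sequence length; a head whose entropy is close to ln(n) spreads its attention almost uniformly, while a low entropy indicates a sharply focused head.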

3.3.4 Position encodings

Positional encodings are crucial in Transformers because the attention mechanism, by itself, has no notion of token order: they inject the sequential information that attention lacks.
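A small experiment makes this concrete. In the NumPy sketch below (helper names and random matrices are assumptions for illustration), shuffling the input tokens of a plain self-attention layer merely shuffles its outputs in the same way, so the layer cannot tell one ordering from another. Once a position-dependent vector is added to each slot, this is no longer true.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention, no mask, x of shape [n, d]."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v


rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
pos = rng.normal(size=(n, d))      # stand-in for any positional encoding

perm = np.array([4, 0, 3, 1, 2])   # a fixed reordering of the tokens

# Without positions: shuffling the input only shuffles the output rows.
plain = self_attention(x, w_q, w_k, w_v)
plain_shuffled = self_attention(x[perm], w_q, w_k, w_v)
order_blind = np.allclose(plain[perm], plain_shuffled)

# With positions: slot i always receives pos[i], so the order now matters.
with_pos = self_attention(x + pos, w_q, w_k, w_v)
with_pos_shuffled = self_attention(x[perm] + pos, w_q, w_k, w_v)
order_aware = not np.allclose(with_pos[perm], with_pos_shuffled)

print("order-blind without positions:", order_blind)
print("order-aware with positions:", order_aware)
```

The random `pos` matrix is only a placeholder; the sinusoidal and learned encodings in the listing that follows play the same role with more useful structure.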
import math
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

class SinusoidalPositionalEncoding(nn.Module):
    """
    Standard sinusoidal positional encoding of Transformers
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    def __init__(self, d_model, max_len=5000):
        super().__init__()

        # Create the positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)

        # Compute the frequencies (division term)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             (-math.log(10000.0) / d_model))

        # Apply the trigonometric functions
        pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions

        pe = pe.unsqueeze(0).transpose(0, 1)  # Shape: [max_len, 1, d_model]

        # Save as a buffer (not a trainable parameter)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Args:
            x: Tensor of shape [seq_len, batch_size, d_model]
        """
        return x + self.pe[:x.size(0), :]

class LearnedPositionalEncoding(nn.Module):
    """
    Positional encoding learned by the model
    """
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)
        self.max_len = max_len

    def forward(self, x):
        """
        Args:
            x: Tensor of shape [seq_len, batch_size, d_model]
        """
        seq_len = x.size(0)
        positions = torch.arange(seq_len, device=x.device).unsqueeze(1)
        pos_embeddings = self.pe(positions)  # [seq_len, 1, d_model]
        return x + pos_embeddings

class RelativePositionalEncoding(nn.Module):
    """
    Relative positional encoding (used in some variants)
    """
    def __init__(self, d_model, max_relative_position=20):
        super().__init__()
        self.d_model = d_model
        self.max_relative_position = max_relative_position

        # Embeddings for relative positions in [-max, +max]
        vocab_size = 2 * max_relative_position + 1
        self.relative_position_embeddings = nn.Embedding(vocab_size, d_model)

Listing 72 – Implementation and Analysis of Positional Encodings

4 Level 3: Advanced Concepts


Level 3 explores the cutting-edge architectures and techniques that define the current state of the art in Deep Learning. These advanced concepts make it possible to solve complex problems in fields such as computer vision, natural language processing, and reinforcement learning.

4.1 Advanced Architectures for Computer Vision


4.1.1 Residual Networks (ResNet) and Skip Connections

Residual networks, introduced by He et al. in 2015, revolutionized the training of very deep networks by solving the problem of performance degradation as depth increases.
Fundamental problem with deep networks: Contrary to intuition, simply increasing the number of layers does not guarantee better performance. Beyond a certain depth, the training error itself starts to increase, which points to an optimization problem rather than overfitting.

Architecture       Benefits                                    Disadvantages
Traditional CNN    Simple, well understood                     Degradation with depth
ResNet             Allows very deep networks (152+ layers)     More complex to implement
DenseNet           Feature reuse                               High memory consumption

Principle of residual connections: Instead of directly learning the function H(x), residual blocks learn the residual function F(x) = H(x) − x, allowing information to bypass ("short-circuit") certain layers.

y = F(x, {W_i}) + x    (3)

where F(x, {W_i}) represents the learned residual function and x is carried by the identity shortcut.
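One way to see why the shortcut helps optimization is to differentiate equation (3). The loss symbol \mathcal{L} and the identity matrix I are notation introduced here for illustration.

```latex
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( \frac{\partial F(x, \{W_i\})}{\partial x} + I \right)
  = \frac{\partial \mathcal{L}}{\partial y}\,
    \frac{\partial F(x, \{W_i\})}{\partial x}
  + \frac{\partial \mathcal{L}}{\partial y}
```

The second term passes the gradient from y to x unchanged, so even when the gradient through F becomes very small, a direct signal still reaches earlier layers. Likewise, if the best mapping for a block is close to the identity, it only has to drive F toward zero.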
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

class BasicBlock(nn.Module):
    """
    Basic residual block for ResNet-18/34
    """
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(BasicBlock, self).__init__()

        # First convolutional layer
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)

        # Second convolutional layer
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Residual connection (shortcut)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        # Save the input for the residual connection
        identity = x

        # First convolution
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)

        # Second convolution
        out = self.conv2(out)
        out = self.bn2(out)

        # Adjust the shortcut if shapes differ
        if self.downsample is not None:
            identity = self.downsample(x)

        # Add the residual connection
        out += identity
        out = F.relu(out)

        return out

class Bottleneck(nn.Module):
    """
    Bottleneck block for ResNet-50/101/152
    """
    expansion = 4

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(Bottleneck, self).__init__()

        # 1x1 conv for dimension reduction
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)

        # 3x3 main conv
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # 1x1 conv for dimension expansion
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion,
                               kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)

        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        # 1x1 conv
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)

        # 3x3 conv
        out = self.conv2(out)
        out = self.bn2(out)
        out = F.relu(out)

        # 1x1 conv
        out = self.conv3(out)
        out = self.bn3(out)

        # Residual connection
        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = F.relu(out)

        return out

class ResNet(nn.Module):
    """
    ResNet implementation
    """
    def __init__(self, block, layers, num_classes=1000):
        super(ResNet, self).__init__()

        self.in_channels = 64

        # Input layer
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # Residual blocks
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

        # Classification layer
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        # Weight initialization
        self._initialize_weights()

    def _make_layer(self, block, out_channels, blocks, stride=1):
        """
        Creates a layer composed of several residual blocks
        """
        downsample = None

        # A projection is required when the spatial size or channel count changes
        if stride != 1 or self.in_channels != out_channels * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.in_channels, out_channels * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels * block.expansion),
            )

        layers = []
        # First block (potentially with downsampling)
        layers.append(block(self.in_channels, out_channels, stride, downsample))
        self.in_channels = out_channels * block.expansion

        # Remaining blocks
        for _ in range(1, blocks):
            layers.append(block(self.in_channels, out_channels))

        return nn.Sequential(*layers)

    def _initialize_weights(self):
        """Weight initialization according to He's method"""
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        # Input layer
        x = self.conv1(x)
        x = self.bn1(x)
        x = F.relu(x)
        x = self.maxpool(x)

        # Residual blocks
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        # Classification
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x

def resnet18(num_classes=1000):
    """ResNet-18"""
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes)

def resnet34(num_classes=1000):
    """ResNet-34"""
    return ResNet(BasicBlock, [3, 4, 6, 3], num_classes)

def resnet50(num_classes=1000):
    """ResNet-50"""
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes)

def resnet101(num_classes=1000):
    """ResNet-101"""
    return ResNet(Bottleneck, [3, 4, 23, 3], num_classes)

def resnet152(num_classes=1000):
    """ResNet-152"""
    return ResNet(Bottleneck, [3, 8, 36, 3], num_classes)

# Comparison with a traditional CNN
class TraditionalCNN(nn.Module):
    """
    Traditional CNN without residual connections, for comparison
    """
    def __init__(self, num_classes=10):
        super(TraditionalCNN, self).__init__()

        self.features = nn.Sequential(
            # Block 1
            nn.Conv2d(3, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),

            # Block 2
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2),

            # Block 3
            nn.Conv2d(128, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(2),

            # Block 4
            nn.Conv2d(256, 512, 3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )

        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Testing and comparing architectures
def compare_architectures():
    """
    Compare ResNet with a traditional CNN
    """
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Create the models
    resnet = resnet18(num_classes=10).to(device)
    traditional_cnn = TraditionalCNN(num_classes=10).to(device)

    # Counting parameters
    def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    resnet_params = count_parameters(resnet)
    cnn_params = count_parameters(traditional_cnn)

    print("Architecture comparison:")
    print(f"ResNet-18 parameters: {resnet_params:,}")
    print(f"Traditional CNN parameters: {cnn_params:,}")

    # Test with dummy data
    batch_size = 32
    input_tensor = torch.randn(batch_size, 3, 32, 32).to(device)

    # Speed test
    # Note: on a GPU, kernels run asynchronously, so these timings are only indicative
    import time

    # ResNet
    start_time = time.time()
    with torch.no_grad():
        resnet_output = resnet(input_tensor)
    resnet_time = time.time() - start_time

    # Traditional CNN
    start_time = time.time()
    with torch.no_grad():
        cnn_output = traditional_cnn(input_tensor)
    cnn_time = time.time() - start_time

    print(f"\nInference time:")
    print(f"ResNet-18: {resnet_time:.4f}s")
    print(f"Traditional CNN: {cnn_time:.4f}s")

    print(f"\nOutput dimensions:")
    print(f"ResNet-18: {resnet_output.shape}")
    print(f"Traditional CNN: {cnn_output.shape}")

    return resnet, traditional_cnn

# Running the comparison
print("Comparison of ResNet vs. traditional CNN architectures")
resnet_model, cnn_model = compare_architectures()

# Visualization of the ResNet architecture
def visualize_resnet_block():
    """
    Visualize the residual block concept
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

    # Traditional block
    ax1.text(0.5, 0.9, 'Traditional CNN Block', ha='center', va='center',
             fontsize=14, weight='bold')

    # Main flow
    ax1.arrow(0.5, 0.8, 0, -0.15, head_width=0.02, head_length=0.02, fc='blue', ec='blue')
    ax1.text(0.5, 0.6, 'Conv + ReLU', ha='center', va='center',
             bbox=dict(boxstyle="round,pad=0.3", facecolor='lightblue'))

    ax1.arrow(0.5, 0.5, 0, -0.15, head_width=0.02, head_length=0.02, fc='blue', ec='blue')
    ax1.text(0.5, 0.3, 'Conv + ReLU', ha='center', va='center',
             bbox=dict(boxstyle="round,pad=0.3", facecolor='lightblue'))

    ax1.arrow(0.5, 0.2, 0, -0.1, head_width=0.02, head_length=0.02, fc='blue', ec='blue')

    ax1.set_xlim(0, 1)
    ax1.set_ylim(0, 1)
    ax1.axis('off')

    # Residual block
    ax2.text(0.5, 0.9, 'Residual Block', ha='center', va='center',
             fontsize=14, weight='bold')

    # Main flow
    ax2.arrow(0.5, 0.8, 0, -0.15, head_width=0.02, head_length=0.02, fc='blue', ec='blue')
    ax2.text(0.5, 0.6, 'Conv + ReLU', ha='center', va='center',
             bbox=dict(boxstyle="round,pad=0.3", facecolor='lightblue'))

    ax2.arrow(0.5, 0.5, 0, -0.15, head_width=0.02, head_length=0.02, fc='blue', ec='blue')
    ax2.text(0.5, 0.3, 'Conv', ha='center', va='center',
             bbox=dict(boxstyle="round,pad=0.3", facecolor='lightblue'))

    # Residual connection (skip connection)
    ax2.plot([0.2, 0.2, 0.8, 0.8], [0.8, 0.15, 0.15, 0.2], 'r-', linewidth=2)
    ax2.text(0.1, 0.5, 'Skip\nConnection', ha='center', va='center',
             color='red', fontsize=10, weight='bold')

    # Addition
    ax2.text(0.5, 0.15, '+', ha='center', va='center', fontsize=20, weight='bold')
    ax2.text(0.7, 0.15, 'ReLU', ha='center', va='center',
             bbox=dict(boxstyle="round,pad=0.2", facecolor='lightgreen'))

    ax2.set_xlim(0, 1)
    ax2.set_ylim(0, 1)
    ax2.axis('off')

    plt.suptitle('Comparison: Traditional CNN vs Residual Block', fontsize=16, weight='bold')
    plt.tight_layout()
    plt.show()

visualize_resnet_block()

Listing 73 – Implementing a Residual Block with PyTorch
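The number in each ResNet variant's name can be recovered from the block counts passed to the factory functions of the listing. The sketch below is a dependency-free illustration; it assumes the usual counting convention (stem convolution + convolutions inside the blocks + final linear layer, ignoring the 1x1 projection shortcuts) and assumes that `BasicBlock` stacks two convolutions, as in the standard ResNet design. The names `CONVS_PER_BLOCK`, `CONFIGS` and `resnet_depth` are hypothetical helpers introduced only for this example.

```python
# Convolutions per block type: BasicBlock is assumed to stack two 3x3 convs,
# Bottleneck stacks 1x1 -> 3x3 -> 1x1 (three convs), as in the listing.
CONVS_PER_BLOCK = {"BasicBlock": 2, "Bottleneck": 3}

# Block counts per stage, copied from the factory functions of the listing.
CONFIGS = {
    "resnet18": ("BasicBlock", [2, 2, 2, 2]),
    "resnet34": ("BasicBlock", [3, 4, 6, 3]),
    "resnet50": ("Bottleneck", [3, 4, 6, 3]),
    "resnet101": ("Bottleneck", [3, 4, 23, 3]),
    "resnet152": ("Bottleneck", [3, 8, 36, 3]),
}


def resnet_depth(block_name, layers):
    """Count weighted layers: stem conv + convs in all blocks + final linear layer."""
    return 1 + CONVS_PER_BLOCK[block_name] * sum(layers) + 1


for name, (block_name, layers) in CONFIGS.items():
    print(f"{name}: {resnet_depth(block_name, layers)} layers")
```

Under these assumptions, each computed depth matches the suffix of the corresponding function name.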


Explanation

Residual connections solve several fundamental problems of deep networks:
Vanishing gradient: Gradients can flow directly through the skip connections, avoiding successive attenuation across many layers.
Performance degradation: If a layer brings no improvement, the block can simply learn the identity mapping (F(x) = 0), so adding layers does not degrade performance.
Easier optimization: Learning a residual function is often easier than learning the full mapping, because small corrections to the identity are easier to optimize.
This innovation made it possible to train networks with hundreds of layers, paving the way for modern ultra-deep architectures.
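The vanishing-gradient argument can be illustrated with a toy scalar model. This is a minimal sketch, not a faithful model of a real network: it assumes every layer has the same local derivative F'(x), ignores non-linearities and normalization, and the function name `gradient_through_stack` is hypothetical. For a plain layer y = F(x) the derivative is F'(x); for a residual layer y = x + F(x) it is 1 + F'(x), and the chain rule multiplies these factors across the depth.

```python
def gradient_through_stack(local_grad, depth, residual):
    """Derivative of the output w.r.t. the input for `depth` stacked scalar layers.

    Plain layer:    y = F(x)      -> dy/dx = F'(x)
    Residual layer: y = x + F(x)  -> dy/dx = 1 + F'(x)
    `local_grad` stands for F'(x), assumed identical in every layer.
    """
    per_layer = 1.0 + local_grad if residual else local_grad
    return per_layer ** depth


# Toy setting: each layer contributes a small local derivative of 0.01.
plain = gradient_through_stack(0.01, depth=50, residual=False)
skip = gradient_through_stack(0.01, depth=50, residual=True)

print(f"plain stack:    {plain:.3e}")   # collapses towards zero
print(f"residual stack: {skip:.3f}")    # stays of order 1
```

In this simplified setting the identity term keeps the product of derivatives away from zero, which is the intuition behind the direct gradient path described above.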

4.1.2 U-Net Architectures for Segmentation

U-Net is an architecture designed specifically for semantic segmentation, and it is particularly effective when little training data is available. It is widely used in medical imaging and in other fields that require precise segmentation.
Architectural principle: U-Net combines an encoder (contracting path) that captures context with a decoder (expansive path) that enables precise localization, linked by skip connections that preserve fine details.
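To make the encoder/decoder symmetry concrete, the following dependency-free sketch tracks the number of channels and the spatial size at each stage. It assumes the configuration used in the U-Net listing of this section (64 base channels, four downsampling steps, transposed-convolution upsampling, i.e. `bilinear=False`) and a 128x128 input; the stage names mirror the attribute names of that listing, while `unet_shapes` itself is a hypothetical helper.

```python
def unet_shapes(img_size=128, base_channels=64, depth=4, n_classes=3):
    """Return a list of (stage, channels, height/width) for a U-Net.

    Assumes each encoder step halves the resolution and doubles the channels,
    and each decoder step does the opposite (transposed-convolution variant).
    """
    shapes = [("inc", base_channels, img_size)]
    channels, size = base_channels, img_size

    for level in range(1, depth + 1):   # encoder (contracting path)
        channels, size = channels * 2, size // 2
        shapes.append((f"down{level}", channels, size))

    for level in range(1, depth + 1):   # decoder (expansive path)
        channels, size = channels // 2, size * 2
        shapes.append((f"up{level}", channels, size))

    shapes.append(("outc", n_classes, size))
    return shapes


for stage, channels, size in unet_shapes():
    print(f"{stage:<6} {channels:>5} channels  {size}x{size}")
```

Each `up` stage has the same resolution as the encoder stage it is concatenated with, which is what allows the skip connections to transfer fine spatial details.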

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader

class DoubleConv(nn.Module):
    """
    Block of two consecutive convolutions with BatchNorm and ReLU
    """
    def __init__(self, in_channels, out_channels, mid_channels=None):
        super().__init__()
        if not mid_channels:
            mid_channels = out_channels

        self.double_conv = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.double_conv(x)

class Down(nn.Module):
    """
    Downscaling block: MaxPooling + DoubleConv
    """
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.maxpool_conv = nn.Sequential(
            nn.MaxPool2d(2),
            DoubleConv(in_channels, out_channels)
        )

    def forward(self, x):
        return self.maxpool_conv(x)

class Up(nn.Module):
    """
    Upscaling block: upsampling + concatenation + DoubleConv
    """
    def __init__(self, in_channels, out_channels, bilinear=True):
        super().__init__()

        if bilinear:
            # Bilinear upsampling; DoubleConv then reduces the number of channels
            self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
            self.conv = DoubleConv(in_channels, out_channels, in_channels // 2)
        else:
            # Transposed convolution
            self.up = nn.ConvTranspose2d(in_channels, in_channels // 2,
                                         kernel_size=2, stride=2)
            self.conv = DoubleConv(in_channels, out_channels)

    def forward(self, x1, x2):
        x1 = self.up(x1)

        # Handle size differences between the two feature maps
        diffY = x2.size()[2] - x1.size()[2]
        diffX = x2.size()[3] - x1.size()[3]

        x1 = F.pad(x1, [diffX // 2, diffX - diffX // 2,
                        diffY // 2, diffY - diffY // 2])

        # Concatenate the feature maps
        x = torch.cat([x2, x1], dim=1)
        return self.conv(x)

class OutConv(nn.Module):
    """
    Final convolution for output
    """
    def __init__(self, in_channels, out_channels):
        super(OutConv, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.conv(x)

class UNet(nn.Module):
    """
    Complete U-Net architecture
    """
    def __init__(self, n_channels, n_classes, bilinear=False):
        super(UNet, self).__init__()
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.bilinear = bilinear

        # Encoder (contracting path)
        self.inc = DoubleConv(n_channels, 64)
        self.down1 = Down(64, 128)
        self.down2 = Down(128, 256)
        self.down3 = Down(256, 512)
        factor = 2 if bilinear else 1
        self.down4 = Down(512, 1024 // factor)

        # Decoder (expansive path)
        self.up1 = Up(1024, 512 // factor, bilinear)
        self.up2 = Up(512, 256 // factor, bilinear)
        self.up3 = Up(256, 128 // factor, bilinear)
        self.up4 = Up(128, 64, bilinear)
        self.outc = OutConv(64, n_classes)

    def forward(self, x):
        # Encoder
        x1 = self.inc(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        x5 = self.down4(x4)

        # Decoder with skip connections
        x = self.up1(x5, x4)
        x = self.up2(x, x3)
        x = self.up3(x, x2)
        x = self.up4(x, x1)
        logits = self.outc(x)

        return logits

# Synthetic dataset for demonstration
class SyntheticSegmentationDataset(Dataset):
    """
    Synthetic dataset for testing U-Net
    Generates circles and rectangles with their segmentation masks
    """
    def __init__(self, size=1000, img_size=128):
        self.size = size
        self.img_size = img_size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        # Create a synthetic image
        image = np.zeros((3, self.img_size, self.img_size), dtype=np.float32)
        mask = np.zeros((self.img_size, self.img_size), dtype=np.float32)

        # Random generation of geometric shapes
        num_shapes = np.random.randint(2, 5)

        for _ in range(num_shapes):
            shape_type = np.random.choice(['circle', 'rectangle'])

            if shape_type == 'circle':
                center_x = np.random.randint(20, self.img_size - 20)
                center_y = np.random.randint(20, self.img_size - 20)
                radius = np.random.randint(10, 25)
                color = np.random.rand(3)

                # Create the circle
                y, x = np.ogrid[:self.img_size, :self.img_size]
                circle_mask = (x - center_x)**2 + (y - center_y)**2 <= radius**2

                for c in range(3):
                    image[c][circle_mask] = color[c]
                mask[circle_mask] = 1

            else:  # rectangle
                x1 = np.random.randint(10, self.img_size - 30)
                y1 = np.random.randint(10, self.img_size - 30)
                width = np.random.randint(15, 30)
                height = np.random.randint(15, 30)
                x2 = min(x1 + width, self.img_size - 1)
                y2 = min(y1 + height, self.img_size - 1)
                color = np.random.rand(3)

                for c in range(3):
                    image[c, y1:y2, x1:x2] = color[c]
                mask[y1:y2, x1:x2] = 2  # Separate class for rectangles

        # Add noise (cast back to float32: the noise array is float64,
        # which would not match the model's float32 weights)
        noise = np.random.normal(0, 0.1, image.shape)
        image = np.clip(image + noise, 0, 1).astype(np.float32)

        return torch.from_numpy(image), torch.from_numpy(mask).long()

# Dice loss function for segmentation
class DiceLoss(nn.Module):
    """
    Dice loss for segmentation
    """
    def __init__(self, smooth=1e-6):
        super(DiceLoss, self).__init__()
        self.smooth = smooth

    def forward(self, predictions, targets):
        # Apply softmax to obtain probabilities
        predictions = F.softmax(predictions, dim=1)

        # Compute the Dice coefficient for each class
        dice_scores = []
        num_classes = predictions.shape[1]

        for i in range(num_classes):
            pred_i = predictions[:, i].flatten()
            target_i = (targets == i).float().flatten()

            intersection = (pred_i * target_i).sum()
            dice = (2.0 * intersection + self.smooth) / (pred_i.sum() + target_i.sum() + self.smooth)
            dice_scores.append(dice)

        # Average the Dice scores over classes
        return 1 - torch.stack(dice_scores).mean()

# Training function
def train_unet():
    """
    Trains the U-Net model on synthetic data
    """
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")

    # Create the dataset and data loader
    dataset = SyntheticSegmentationDataset(size=1000, img_size=128)
    dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

    # Model, loss and optimizer
    model = UNet(n_channels=3, n_classes=3, bilinear=True).to(device)
    criterion = DiceLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Training metrics
    train_losses = []

    print("Starting U-Net training...")
    model.train()

    for epoch in range(10):  # Few epochs for demonstration
        epoch_loss = 0
        num_batches = 0

        for batch_idx, (images, masks) in enumerate(dataloader):
            images = images.to(device)
            masks = masks.to(device)

            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, masks)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            num_batches += 1

            if batch_idx % 50 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

        avg_loss = epoch_loss / num_batches
        train_losses.append(avg_loss)
        print(f'Epoch {epoch} finished, average loss: {avg_loss:.4f}')

    return model, train_losses


# Results visualization function
def visualize_segmentation_results(model, dataset, device, num_samples=3):
    """
    Visualizes segmentation results
    """
    model.eval()

    fig, axes = plt.subplots(num_samples, 4, figsize=(16, 4 * num_samples))
    if num_samples == 1:
        axes = axes.reshape(1, -1)

    with torch.no_grad():
        for i in range(num_samples):
            # Get a sample
            image, true_mask = dataset[i]
            image_batch = image.unsqueeze(0).to(device)

            # Prediction
            output = model(image_batch)
            pred_mask = torch.argmax(output, dim=1).squeeze().cpu().numpy()

            # Conversion for display
            image_display = image.permute(1, 2, 0).numpy()
            true_mask_display = true_mask.numpy()

            # Display
            axes[i, 0].imshow(image_display)
            axes[i, 0].set_title('Original image')
            axes[i, 0].axis('off')

            axes[i, 1].imshow(true_mask_display, cmap='tab10')
            axes[i, 1].set_title('Ground-truth mask')
            axes[i, 1].axis('off')

            axes[i, 2].imshow(pred_mask, cmap='tab10')
            axes[i, 2].set_title('Predicted mask')
            axes[i, 2].axis('off')

            # Overlay of the prediction on the image
            overlay = image_display.copy()
            overlay[pred_mask > 0] = overlay[pred_mask > 0] * 0.7 + 0.3
            axes[i, 3].imshow(overlay)
            axes[i, 3].set_title('Prediction overlay')
            axes[i, 3].axis('off')

    plt.tight_layout()
    plt.show()

# Training and testing
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dataset = SyntheticSegmentationDataset(size=500, img_size=128)

# Train the model
trained_model, losses = train_unet()

# Visualize the results
print("\nVisualization of the segmentation results:")
visualize_segmentation_results(trained_model, dataset, device, num_samples=3)

Listing 74 – U-Net Implementation for Image Segmentation


Explanation

U-Net owes its success in semantic segmentation to several key innovations:
U-shaped architecture: The characteristic shape combines an encoder that gradually reduces spatial resolution while increasing feature depth, and a decoder that restores spatial resolution.
Skip connections: Direct connections between corresponding levels of the encoder and the decoder preserve the fine details lost during downsampling.
Efficiency with little data: U-Net can be trained effectively on relatively small datasets, which is crucial in medical imaging, where annotated data is scarce.
The Dice loss is particularly well suited to segmentation because it directly penalizes overlap errors between the prediction and the ground truth, and it handles class imbalance better than standard cross-entropy.
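The effect of class imbalance can be checked on a small hand-made example. The sketch below is a simplified, dependency-free illustration on flat binary masks (it is not the multi-class `DiceLoss` of the listing); it uses the same smoothing constant of 1e-6, and the helper names `dice_coefficient` and `pixel_accuracy` are hypothetical. Pixel accuracy is used here only as a familiar per-pixel criterion to contrast with the overlap-based Dice score.

```python
def dice_coefficient(pred, target, smooth=1e-6):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for flat binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + smooth) / (sum(pred) + sum(target) + smooth)


def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the target label."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)


# A 100-pixel mask in which only 5 pixels belong to the object.
target = [1] * 5 + [0] * 95
all_background = [0] * 100        # degenerate prediction: "no object anywhere"
partial = [1] * 3 + [0] * 97      # finds 3 of the 5 object pixels

for name, pred in [("all background", all_background), ("partial", partial)]:
    print(f"{name:<15} accuracy={pixel_accuracy(pred, target):.2f} "
          f"dice={dice_coefficient(pred, target):.2f}")
```

The all-background prediction is correct on 95% of the pixels yet has a Dice score close to zero, so a per-pixel criterion barely penalizes missing the object entirely, whereas a Dice-based loss (1 - Dice) penalizes it heavily.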

4.1.3 R-CNN, Fast R-CNN, Faster R-CNN for Object Detection

Object detection requires not only classifying the objects present in an image, but also localizing them precisely. The R-CNN family transformed this field through successive improvements.

Method         Speed       Precision   Main Innovation
R-CNN          Very slow   Good        CNN for object classification
Fast R-CNN     Fast        Better      ROI pooling, end-to-end training
Faster R-CNN   Very fast   Excellent   Region Proposal Network (RPN)
YOLO           Real time   Good        Single-pass detection
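The Region Proposal Network scores a fixed set of reference boxes (anchors) at every feature-map position. The sketch below shows, as a hedged illustration, how 9 anchors per position can be built from 3 scales and 3 aspect ratios. The scales (128, 256, 512) and ratios (0.5, 1, 2) are those of the original Faster R-CNN paper and are an assumption here, since this course only fixes the count through `num_anchors=9`; `make_anchors` is a hypothetical helper.

```python
import math


def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors (x1, y1, x2, y2) centred on one image position.

    Each anchor has an area of scale**2 and an aspect ratio height / width = ratio.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((center_x - width / 2, center_y - height / 2,
                            center_x + width / 2, center_y + height / 2))
    return anchors


anchors = make_anchors(center_x=400, center_y=300)
print(f"{len(anchors)} anchors per position")
for x1, y1, x2, y2 in anchors:
    print(f"width={x2 - x1:7.1f}  height={y2 - y1:7.1f}")
```

For each anchor, the RPN predicts an objectness score and four regression offsets, which is why its regression head outputs `num_anchors * 4` channels.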


import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
from torchvision.models import resnet50
import numpy as np

class RPN(nn.Module):
    """
    Region Proposal Network for Faster R-CNN
    """
    def __init__(self, in_channels, num_anchors):
        super(RPN, self).__init__()

        # 3x3 convolution for shared features
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)

        # Classification: object vs. background.
        # Two logits per anchor, to match the 2-class cross-entropy
        # applied to these outputs in FasterRCNNLoss.
        self.cls_logits = nn.Conv2d(512, num_anchors * 2, kernel_size=1)

        # Bounding-box regression
        self.bbox_pred = nn.Conv2d(512, num_anchors * 4, kernel_size=1)

        # Weight initialization
        self._init_weights()

    def _init_weights(self):
        """Weight initialization"""
        for layer in [self.conv, self.cls_logits, self.bbox_pred]:
            nn.init.normal_(layer.weight, std=0.01)
            nn.init.constant_(layer.bias, 0)

    def forward(self, features):
        """
        Forward pass of the RPN

        Args:
            features: Feature maps from the CNN backbone

        Returns:
            cls_logits: Object / background classification scores
            bbox_pred: Bounding-box regression predictions
        """
        # Shared features
        x = F.relu(self.conv(features))

        # Classification and regression
        cls_logits = self.cls_logits(x)
        bbox_pred = self.bbox_pred(x)

        return cls_logits, bbox_pred

class ROIPooling(nn.Module):
    """
    ROI Pooling to extract fixed-size features
    """
    def __init__(self, output_size, spatial_scale):
        super(ROIPooling, self).__init__()
        self.output_size = output_size
        self.spatial_scale = spatial_scale

    def forward(self, features, rois):
        """
        Args:
            features: Feature maps [N, C, H, W]
            rois: Regions of Interest [num_rois, 5] (batch_idx, x1, y1, x2, y2)

        Returns:
            pooled_features: Pooled features [num_rois, C, output_size, output_size]
        """
        # Simplified implementation with adaptive pooling
        pooled_features = []

        for roi in rois:
            batch_idx = int(roi[0])
            x1, y1, x2, y2 = roi[1:5] * self.spatial_scale

            # Extract the region (kept at least one cell wide and high)
            x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
            x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)
            roi_feature = features[batch_idx, :, y1:y2, x1:x2]

            # Adaptive pooling
            pooled = F.adaptive_max_pool2d(roi_feature, self.output_size)
            pooled_features.append(pooled)

        return torch.stack(pooled_features)

class FasterRCNN(nn.Module):
    """
    Simplified Faster R-CNN architecture
    """
    def __init__(self, num_classes, num_anchors=9):
        super(FasterRCNN, self).__init__()

        # Backbone CNN (ResNet-50 without the average pooling and final layer)
        # Note: `pretrained=True` is deprecated in recent torchvision releases
        # in favor of the `weights` argument.
        resnet = resnet50(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

        # Freeze the first layers of the backbone
        for param in list(self.backbone.parameters())[:50]:
            param.requires_grad = False

        # Region Proposal Network
        self.rpn = RPN(in_channels=2048, num_anchors=num_anchors)

        # ROI Pooling
        # Note: the truncated ResNet-50 has an overall stride of 32; the 1/16
        # scale is kept to match the 16-pixel grid used in _generate_proposals.
        self.roi_pooling = ROIPooling(output_size=7, spatial_scale=1/16)

        # Final classifier
        self.classifier = nn.Sequential(
            nn.Linear(2048 * 7 * 7, 1024),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(1024, 1024),
            nn.ReLU(),
            nn.Dropout(0.5),
        )

        # Output heads
        self.cls_score = nn.Linear(1024, num_classes)
        self.bbox_pred = nn.Linear(1024, num_classes * 4)

    def forward(self, images, gt_boxes=None):
        """
        Forward pass of Faster R-CNN

        Args:
            images: Batch of images [N, 3, H, W]
            gt_boxes: Ground-truth bounding boxes (for training)

        Returns:
            If training: dictionary with losses
            If inference: predictions (classes, boxes, scores)
        """
        # Extract features with the backbone
        features = self.backbone(images)

        # RPN forward
        rpn_cls_logits, rpn_bbox_pred = self.rpn(features)

        # Generate the proposals (simulated here)
        proposals = self._generate_proposals(rpn_cls_logits, rpn_bbox_pred, images.shape)

        # ROI Pooling
        if len(proposals) > 0:
            pooled_features = self.roi_pooling(features, proposals)

            # Final classification
            pooled_features = pooled_features.view(pooled_features.size(0), -1)
            classifier_features = self.classifier(pooled_features)

            cls_scores = self.cls_score(classifier_features)
            bbox_predictions = self.bbox_pred(classifier_features)

            return {
                'cls_scores': cls_scores,
                'bbox_pred': bbox_predictions,
                'rpn_cls_logits': rpn_cls_logits,
                'rpn_bbox_pred': rpn_bbox_pred,
                'proposals': proposals
            }
        else:
            # No proposal generated
            return {
                'cls_scores': torch.empty(0, self.cls_score.out_features),
                'bbox_pred': torch.empty(0, self.bbox_pred.out_features),
                'rpn_cls_logits': rpn_cls_logits,
                'rpn_bbox_pred': rpn_bbox_pred,
                'proposals': proposals
            }

    def _generate_proposals(self, rpn_cls_logits, rpn_bbox_pred, image_shape):
        """
        Generates region proposals (simplified version)
        """
        batch_size, _, feat_h, feat_w = rpn_cls_logits.shape

        # Generate simple anchors (highly simplified version)
        proposals = []

        # For each position in the feature map
        for i in range(0, feat_h, 4):  # Subsampling for demonstration
            for j in range(0, feat_w, 4):
                # Create a simple proposal
                x1 = j * 16  # feature map scale
                y1 = i * 16
                x2 = min(x1 + 64, image_shape[3])
                y2 = min(y1 + 64, image_shape[2])

                # Add the proposal (format: batch_idx, x1, y1, x2, y2)
                for batch_idx in range(batch_size):
                    proposals.append([batch_idx, x1, y1, x2, y2])

        if proposals:
            proposals = torch.tensor(proposals, dtype=torch.float32)
            # Limit the number of proposals
            if len(proposals) > 100:
                indices = torch.randperm(len(proposals))[:100]
                proposals = proposals[indices]
        else:
            proposals = torch.empty(0, 5)

        return proposals

# Loss functions for Faster R-CNN
class FasterRCNNLoss(nn.Module):
    """
    Combined loss function for Faster R-CNN
    """
    def __init__(self):
        super(FasterRCNNLoss, self).__init__()
        self.cls_loss = nn.CrossEntropyLoss()
        self.bbox_loss = nn.SmoothL1Loss()

    def forward(self, predictions, targets):
        """
        Computes the combined loss

        Args:
            predictions: Dictionary of model predictions
            targets: Dictionary of targets (classes, boxes)

        Returns:
            Total loss and individual components
        """
        # RPN loss: move channels last so that each row holds the two
        # (background, object) logits of a single anchor
        rpn_logits = predictions['rpn_cls_logits'].permute(0, 2, 3, 1).reshape(-1, 2)
        rpn_cls_loss = self.cls_loss(rpn_logits, targets['rpn_labels'].view(-1))
        rpn_bbox_loss = self.bbox_loss(predictions['rpn_bbox_pred'],
                                       targets['rpn_bbox_targets'])

        # Loss of the final classifier
        if predictions['cls_scores'].numel() > 0:
            final_cls_loss = self.cls_loss(predictions['cls_scores'],
                                           targets['final_labels'])
            final_bbox_loss = self.bbox_loss(predictions['bbox_pred'],
                                             targets['final_bbox_targets'])
        else:
            final_cls_loss = torch.tensor(0.0)
            final_bbox_loss = torch.tensor(0.0)

        # Total loss
        total_loss = rpn_cls_loss + rpn_bbox_loss + final_cls_loss + final_bbox_loss

        return {
            'total_loss': total_loss,
            'rpn_cls_loss': rpn_cls_loss,
            'rpn_bbox_loss': rpn_bbox_loss,
            'final_cls_loss': final_cls_loss,
            'final_bbox_loss': final_bbox_loss
        }
253 # Example of use and comparison of approaches


254 def compare_detection_methods () :
"""
255

Mohamed Ouazze 97 BDCC-2024-2025


Machine Translated by Google

4 LEVEL 3: ADVANCED CONCEPTS

256 Compare different object detection approaches


"""
257
258 print (" Comparaison des m t h o d e s de d t e c t i o n d’objets \n")
259
260 # Performance simulation (fictitious data for illustration)
261 methods_comparison = {
262 ’R- CNN ’: {
263 ’speed_fps ’: 0.02 , # T r s lent
264 ’map_score ’: 58.5 ’year ’: , # mAP score
265 2014 ’innovation ’: ’ ,

266 P r e m i r e utilisation de CNN pour d t e c t i o n
267 },
268 ’Fast R- CNN ’: {
269 ’speed_fps ’: 0.5 , # Faster
270 ’map_score ’: 65.7 ’year ’: ,

271 2015 ’innovation ’: ,


272 ’ROI pooling , e n t r a n e m e n t end -to - end ’
273 },
274 ’Faster R-CNN ’: {
275 ’speed_fps ’: 7.0 , # Much faster
276 ’map_score ’: 73.2 ’year ’: ,

277 2016 ’innovation ’: ,



278 ’Region Proposal Network i n t g r
279 },
280 'YOLO v1 ': {
281 'speed_fps ': 45.0 'map_score, # Real time
282 ': 63.4 'year ': 2016 'innovation,
283 ': 'Single-pass ,


284 detection '
285 }
286 }
287
288 print (" Performance des d i f f r e n t e s m t h o d e s :")
289 print ("-" * 70)
290 print (f"{’ M t h o d e ’: <15} {’ Vitesse (FPS ) ’: <15} {’ mAP (%) ’: <10} {’ A n n e ’: <6} {’
Innovation ’}")
291 print ("-" * 70)
292
293 for method , stats in methods_comparison . items () :
294 print (f"{ method : <15} { stats [ ’ speed_fps ’]: <15.1f} { stats [ ’ map_score ’]: <10.1f} "
295 f"{ stats [’ year ’]: <6} { stats [’ innovation ’]}")
296
297 return methods_comparison
298

# Training example (structure only)
def train_faster_rcnn_example():
    """
    Example training structure for Faster R-CNN
    """
    print("\nFaster R-CNN training structure:")
    print("1. Data preparation with annotations (boxes + classes)")
    print("2. Model initialization with a pre-trained backbone")
    print("3. Hyperparameter configuration")
    print("4. Training loop with computation of the combined losses")
    print("5. Evaluation with mAP metrics")

    # Model creation
    model = FasterRCNN(num_classes=21)  # 20 classes + background

    # Displaying the parameter counts
    print(f"\nTotal number of parameters: {sum(p.numel() for p in model.parameters()):,}")
    print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

    # Test with a dummy image
    dummy_input = torch.randn(1, 3, 224, 224)

    print("\nTest with dummy input:")
    print(f"Input shape: {dummy_input.shape}")

    model.eval()
    with torch.no_grad():
        output = model(dummy_input)

    print(f"Number of proposals generated: {len(output['proposals'])}")
    if output['cls_scores'].numel() > 0:
        print(f"Shape of the class scores: {output['cls_scores'].shape}")

# Running the demonstrations
print("=== OBJECT DETECTION WITH R-CNN ===")
comparison_results = compare_detection_methods()
train_faster_rcnn_example()
Listing 75 – Simplified Implementation of Faster R-CNN

Explanation

The evolution from R-CNN to Faster R-CNN illustrates the progressive optimization of Deep
Learning architectures:
Original R-CNN: Used selective search to generate region proposals, then classified each region with a
CNN. Very slow because each region required a separate forward pass.
Fast R-CNN: Introduced ROI pooling, which lets all regions of the same image share the convolutional
computation. End-to-end training with a multi-task loss.
Faster R-CNN: Replaces selective search with a neural Region Proposal Network (RPN),
enabling a fully differentiable and much faster pipeline.
This progression shows the importance of architectural and algorithmic optimization in
making Deep Learning models usable in practice.
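
The box-regression terms of the multi-task loss in Listing 75 rely on the Smooth L1 loss. As a minimal sketch in plain Python (no PyTorch), the function below reproduces the usual definition with a mean reduction: quadratic for small errors, linear for large ones. The name `smooth_l1` and the `beta` threshold of 1.0 are assumptions chosen for illustration, not part of the listing.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss over two equally long lists of floats."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            # Quadratic regime: small errors are penalized gently
            total += 0.5 * diff * diff / beta
        else:
            # Linear regime: large errors (outlier boxes) do not dominate
            total += diff - 0.5 * beta
    return total / len(pred)

small = smooth_l1([0.5, 1.0], [0.0, 1.0])  # both errors below beta
large = smooth_l1([10.0], [0.0])           # error far above beta

print(small, large)
```

The linear regime is what makes this loss less sensitive than a squared error to badly localized proposals early in training.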

4.1.4 Practical example: Segmentation of medical images with U-Net

Let's apply U-Net to a realistic use case: segmenting skin lesions to aid dermatological
diagnosis. The images are generated synthetically so that the example stays self-contained.
import torch
import torch.nn as nn
import torch.nn.functional as F  # required by F.cross_entropy and F.softmax below
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

class MedicalSegmentationDataset(Dataset):
    """
    Dataset for medical image segmentation
    Simulation of dermatology data (melanoma detection)
    """

    def __init__(self, size=200, img_size=256, augment=True):
        self.size = size
        self.img_size = img_size
        self.augment = augment

        # Transformations for data augmentation
        self.transform = transforms.Compose([
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.RandomVerticalFlip(p=0.5),
            transforms.RandomRotation(degrees=15),
            transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
            transforms.RandomResizedCrop(img_size, scale=(0.8, 1.0)),
        ]) if augment else transforms.Resize((img_size, img_size))

    def __len__(self):
        return self.size

    def generate_synthetic_skin_lesion(self):
        """
        Generates a synthetic image of a skin lesion
        """
        # Creation of the base skin
        skin_color = np.array([210, 180, 140]) / 255.0  # Skin color
        image = np.full((self.img_size, self.img_size, 3), skin_color, dtype=np.float32)

        # Adding skin texture
        noise = np.random.normal(0, 0.05, image.shape)
        image = np.clip(image + noise, 0, 1)

        # Creation of the segmentation mask
        mask = np.zeros((self.img_size, self.img_size), dtype=np.float32)

        # Generation of the lesions (1 or 2 per image)
        num_lesions = np.random.randint(1, 3)

        for _ in range(num_lesions):
            # Random position and size
            center_x = np.random.randint(50, self.img_size - 50)
            center_y = np.random.randint(50, self.img_size - 50)

            # Base radius of the lesion
            radius_base = np.random.randint(20, 40)

            # Creation of an irregular outline
            angles = np.linspace(0, 2 * np.pi, 100)
            radii = radius_base + np.random.normal(0, 5, 100)
            radii = np.maximum(radii, 10)  # Minimum radius

            # Outline coordinates
            x_coords = center_x + radii * np.cos(angles)
            y_coords = center_y + radii * np.sin(angles)

            # Creation of the lesion mask
            y, x = np.ogrid[:self.img_size, :self.img_size]
            lesion_mask = np.zeros((self.img_size, self.img_size), dtype=bool)

            # Rough filling of the shape: whenever two consecutive outline points
            # lie inside the image, a disc of radius radius_base is filled
            for i in range(len(x_coords) - 1):
                x1, y1 = int(x_coords[i]), int(y_coords[i])
                x2, y2 = int(x_coords[i + 1]), int(y_coords[i + 1])

                if 0 <= x1 < self.img_size and 0 <= y1 < self.img_size:
                    if 0 <= x2 < self.img_size and 0 <= y2 < self.img_size:
                        # Simple distance calculation
                        dist = np.sqrt((x - center_x) ** 2 + (y - center_y) ** 2)
                        lesion_mask |= dist <= radius_base

            # Applying the lesion to the image
            lesion_color = np.array([80, 60, 40]) / 255.0  # Dark color
            lesion_variation = np.random.normal(0, 0.1, (lesion_mask.sum(), 3))

            for c in range(3):
                image[lesion_mask, c] = np.clip(
                    lesion_color[c] + lesion_variation[:, c], 0, 1
                )

            # Updating the mask
            mask[lesion_mask] = 1.0

        return image, mask

    def __getitem__(self, idx):
        # Generation of a synthetic image
        image, mask = self.generate_synthetic_skin_lesion()

        # Conversion to PIL for the transformations
        image_pil = Image.fromarray((image * 255).astype(np.uint8))
        mask_pil = Image.fromarray((mask * 255).astype(np.uint8))

        # Applying the transformations
        if self.augment:
            # Same seed for image and mask so that both receive the same random transforms
            seed = np.random.randint(2147483647)

            torch.manual_seed(seed)
            image_tensor = transforms.ToTensor()(self.transform(image_pil))

            torch.manual_seed(seed)
            mask_tensor = transforms.ToTensor()(self.transform(mask_pil))
        else:
            image_tensor = transforms.ToTensor()(image_pil)
            mask_tensor = transforms.ToTensor()(mask_pil)

        # Mask binarization
        mask_tensor = (mask_tensor > 0.5).float()

        return image_tensor, mask_tensor.squeeze(0)

# Medical evaluation metrics
class MedicalMetrics:
    """
    Specialized metrics for medical evaluation
    """

    @staticmethod
    def dice_coefficient(pred, target, smooth=1e-6):
        """Dice coefficient (F1 score for segmentation)"""
        # reshape (rather than view) also accepts non-contiguous tensors
        pred_flat = pred.reshape(-1)
        target_flat = target.reshape(-1)

        intersection = (pred_flat * target_flat).sum()
        dice = (2.0 * intersection + smooth) / (pred_flat.sum() + target_flat.sum() + smooth)

        return dice.item()

    @staticmethod
    def iou_score(pred, target, smooth=1e-6):
        """Intersection over Union (Jaccard index)"""
        pred_flat = pred.reshape(-1)
        target_flat = target.reshape(-1)

        intersection = (pred_flat * target_flat).sum()
        union = pred_flat.sum() + target_flat.sum() - intersection
        iou = (intersection + smooth) / (union + smooth)

        return iou.item()

    @staticmethod
    def sensitivity(pred, target):
        """Sensitivity (recall): ability to detect lesions"""
        pred_flat = pred.reshape(-1)
        target_flat = target.reshape(-1)

        true_positives = (pred_flat * target_flat).sum()
        actual_positives = target_flat.sum()

        if actual_positives == 0:
            return 1.0  # No lesion to detect

        return (true_positives / actual_positives).item()

    @staticmethod
    def specificity(pred, target):
        """Specificity: ability to avoid false positives"""
        pred_flat = pred.reshape(-1)
        target_flat = target.reshape(-1)

        true_negatives = ((1 - pred_flat) * (1 - target_flat)).sum()
        actual_negatives = (1 - target_flat).sum()

        if actual_negatives == 0:
            return 1.0  # No background to preserve

        return (true_negatives / actual_negatives).item()

# Specialized training for medical applications
def train_medical_unet():
    """
    U-Net training for medical segmentation with a held-out validation set
    """
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Training on: {device}")

    # Training and validation datasets
    train_dataset = MedicalSegmentationDataset(size=800, img_size=256, augment=True)
    val_dataset = MedicalSegmentationDataset(size=200, img_size=256, augment=False)

    train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, num_workers=2)
    val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False, num_workers=2)

    # U-Net model (UNet is assumed to be the class defined in the earlier U-Net listing)
    model = UNet(n_channels=3, n_classes=2, bilinear=True).to(device)

    # Optimizer and adaptive scheduler
    optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=5, factor=0.5)

    # Combined loss function
    def combined_loss(pred, target):
        # Cross-entropy for the pixel-wise classification
        ce_loss = F.cross_entropy(pred, target.long())

        # Soft Dice loss on the lesion-class probability. It is computed directly on
        # tensors so that gradients flow; MedicalMetrics.dice_coefficient returns a
        # Python float and would contribute no gradient.
        pred_prob = F.softmax(pred, dim=1)[:, 1]
        target = target.float()
        intersection = (pred_prob * target).sum()
        dice = (2.0 * intersection + 1e-6) / (pred_prob.sum() + target.sum() + 1e-6)
        dice_loss = 1 - dice

        # Combination with equal weights
        return ce_loss + dice_loss

    # Training history
    train_losses = []
    val_losses = []
    val_dice_scores = []
    val_iou_scores = []

    print("Starting medical training...")

    for epoch in range(30):
        # Training phase
        model.train()
        train_loss = 0
        train_batches = 0

        for batch_idx, (images, masks) in enumerate(train_loader):
            images, masks = images.to(device), masks.to(device)

            optimizer.zero_grad()
            outputs = model(images)
            loss = combined_loss(outputs, masks)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()
            train_batches += 1

            if batch_idx % 20 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}/{len(train_loader)}, Loss: {loss.item():.4f}')

        avg_train_loss = train_loss / train_batches
        train_losses.append(avg_train_loss)

        # Validation phase
        model.eval()
        val_loss = 0
        val_batches = 0
        epoch_dice_scores = []
        epoch_iou_scores = []

        with torch.no_grad():
            for images, masks in val_loader:
                images, masks = images.to(device), masks.to(device)

                outputs = model(images)
                loss = combined_loss(outputs, masks)
                val_loss += loss.item()
                val_batches += 1

                # Segmentation metrics
                pred_masks = F.softmax(outputs, dim=1)[:, 1] > 0.5

                for i in range(images.size(0)):
                    dice = MedicalMetrics.dice_coefficient(pred_masks[i].float(), masks[i])
                    iou = MedicalMetrics.iou_score(pred_masks[i].float(), masks[i])
                    epoch_dice_scores.append(dice)
                    epoch_iou_scores.append(iou)

        avg_val_loss = val_loss / val_batches
        avg_dice = np.mean(epoch_dice_scores)
        avg_iou = np.mean(epoch_iou_scores)

        val_losses.append(avg_val_loss)
        val_dice_scores.append(avg_dice)
        val_iou_scores.append(avg_iou)

        # Scheduler step
        scheduler.step(avg_val_loss)

        print(f'Epoch {epoch}:')
        print(f'  Train Loss: {avg_train_loss:.4f}')
        print(f'  Val Loss: {avg_val_loss:.4f}')
        print(f'  Val Dice: {avg_dice:.4f}')
        print(f'  Val IoU: {avg_iou:.4f}')
        print(f'  LR: {optimizer.param_groups[0]["lr"]:.6f}')
        print('-' * 50)

    return model, {
        'train_losses': train_losses,
        'val_losses': val_losses,
        'val_dice_scores': val_dice_scores,
        'val_iou_scores': val_iou_scores,
    }

def evaluate_medical_model(model, test_dataset, device):
    """
    Complete evaluation of the medical model
    """
    model.eval()
    test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)

    # Global metrics
    all_dice_scores = []
    all_iou_scores = []
    all_sensitivity = []
    all_specificity = []

    print("Evaluating the model on the test dataset...")

    with torch.no_grad():
        for idx, (image, mask) in enumerate(test_loader):
            image, mask = image.to(device), mask.to(device)

            # Prediction
            output = model(image)
            pred_mask = F.softmax(output, dim=1)[0, 1] > 0.5

            # Computing the metrics
            dice = MedicalMetrics.dice_coefficient(pred_mask.float(), mask)
            iou = MedicalMetrics.iou_score(pred_mask.float(), mask)
            sens = MedicalMetrics.sensitivity(pred_mask.float(), mask)
            spec = MedicalMetrics.specificity(pred_mask.float(), mask)

            all_dice_scores.append(dice)
            all_iou_scores.append(iou)
            all_sensitivity.append(sens)
            all_specificity.append(spec)

            if idx < 5:  # Display the first results
                print(f"Sample {idx + 1}: Dice={dice:.3f}, IoU={iou:.3f}, "
                      f"Sens={sens:.3f}, Spec={spec:.3f}")

    # Final statistics
    results = {
        'dice': {'mean': np.mean(all_dice_scores), 'std': np.std(all_dice_scores)},
        'iou': {'mean': np.mean(all_iou_scores), 'std': np.std(all_iou_scores)},
        'sensitivity': {'mean': np.mean(all_sensitivity), 'std': np.std(all_sensitivity)},
        'specificity': {'mean': np.mean(all_specificity), 'std': np.std(all_specificity)}
    }

    print("\n" + "=" * 60)
    print("FINAL RESULTS OF THE MEDICAL EVALUATION")
    print("=" * 60)
    for metric, stats in results.items():
        print(f"{metric.upper()}: {stats['mean']:.3f} ± {stats['std']:.3f}")

    return results

def visualize_medical_results(model, dataset, device, num_samples=4):
    """
    Specialized visualization for medical results
    """
    model.eval()

    fig, axes = plt.subplots(num_samples, 5, figsize=(20, 4 * num_samples))
    if num_samples == 1:
        axes = axes.reshape(1, -1)

    with torch.no_grad():
        for i in range(num_samples):
            # Get a sample
            image, true_mask = dataset[i]
            image_batch = image.unsqueeze(0).to(device)

            # Prediction
            output = model(image_batch)
            pred_prob = F.softmax(output, dim=1)[0, 1].cpu()
            pred_mask = (pred_prob > 0.5).float()

            # Calculating metrics for this sample
            dice = MedicalMetrics.dice_coefficient(pred_mask, true_mask)
            iou = MedicalMetrics.iou_score(pred_mask, true_mask)
            sens = MedicalMetrics.sensitivity(pred_mask, true_mask)
            spec = MedicalMetrics.specificity(pred_mask, true_mask)

            # Conversion for display
            image_display = image.permute(1, 2, 0).numpy()

            # Display
            axes[i, 0].imshow(image_display)
            axes[i, 0].set_title('Original image')
            axes[i, 0].axis('off')

            axes[i, 1].imshow(true_mask.numpy(), cmap='Reds', alpha=0.7)
            axes[i, 1].set_title('Ground truth')
            axes[i, 1].axis('off')

            axes[i, 2].imshow(pred_prob.numpy(), cmap='Reds', vmin=0, vmax=1)
            axes[i, 2].set_title('Predicted probability')
            axes[i, 2].axis('off')

            axes[i, 3].imshow(pred_mask.numpy(), cmap='Reds', alpha=0.7)
            axes[i, 3].set_title('Predicted mask')
            axes[i, 3].axis('off')

            # Diagnostic overlay: predicted lesion pixels are blended with red
            overlay = image_display.copy()
            lesion_areas = pred_mask.numpy() > 0.5
            overlay[lesion_areas] = overlay[lesion_areas] * 0.5 + np.array([1, 0, 0]) * 0.5
            axes[i, 4].imshow(overlay)
            axes[i, 4].set_title(f'Diagnostic\nDice: {dice:.3f}')
            axes[i, 4].axis('off')

    plt.suptitle('Skin Lesion Segmentation Results', fontsize=16, weight='bold')
    plt.tight_layout()
    plt.show()

def plot_training_curves(history):
    """
    Plots the training curves
    """
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))

    # Losses
    axes[0, 0].plot(history['train_losses'], label='Train Loss', color='blue')
    axes[0, 0].plot(history['val_losses'], label='Validation Loss', color='red')
    axes[0, 0].set_title('Loss evolution')
    axes[0, 0].set_xlabel('Epoch')
    axes[0, 0].set_ylabel('Loss')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # Dice score
    axes[0, 1].plot(history['val_dice_scores'], label='Dice Score', color='green')
    axes[0, 1].set_title('Dice Coefficient')
    axes[0, 1].set_xlabel('Epoch')
    axes[0, 1].set_ylabel('Dice Score')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # IoU score
    axes[1, 0].plot(history['val_iou_scores'], label='IoU Score', color='orange')
    axes[1, 0].set_title('Intersection over Union')
    axes[1, 0].set_xlabel('Epoch')
    axes[1, 0].set_ylabel('IoU Score')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # Final comparison
    final_metrics = ['Dice', 'IoU']
    final_values = [history['val_dice_scores'][-1], history['val_iou_scores'][-1]]

    bars = axes[1, 1].bar(final_metrics, final_values,
                          color=['green', 'orange'], alpha=0.7)
    axes[1, 1].set_title('Final Metrics')
    axes[1, 1].set_ylabel('Score')
    axes[1, 1].set_ylim(0, 1)

    # Adding values to the bars
    for bar, value in zip(bars, final_values):
        height = bar.get_height()
        axes[1, 1].text(bar.get_x() + bar.get_width() / 2., height + 0.01,
                        f'{value:.3f}', ha='center', va='bottom', fontweight='bold')

    plt.tight_layout()
    plt.show()

# Running the complete medical example
print("=== MEDICAL IMAGE SEGMENTATION WITH U-NET ===\n")

# Training the model
trained_medical_model, training_history = train_medical_unet()

# Visualization of training curves
plot_training_curves(training_history)

# Test dataset
test_dataset = MedicalSegmentationDataset(size=50, img_size=256, augment=False)

# Final evaluation
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
evaluation_results = evaluate_medical_model(trained_medical_model, test_dataset, device)

# Visualization of the results
print("\nVisualization of the medical segmentation results:")
visualize_medical_results(trained_medical_model, test_dataset, device, num_samples=4)

print("\n" + "=" * 60)
print("RECOMMENDATIONS FOR MEDICAL USE:")
print("=" * 60)
print("1. Use real datasets annotated by experts")
print("2. Cross-validation with several radiologists")
print("3. Tests on different populations and skin types")
print("4. Regulatory certification (FDA, CE marking)")
print("5. Professional-friendly user interface")
print("6. Traceability and explainability of decisions")
print("7. Continuous update with new cases")
Listing 76 – Medical Image Segmentation with U-Net

Explanation

This medical application of U-Net illustrates several crucial aspects of Deep Learning in health:
Specialized metrics: The Dice coefficient and IoU are better suited to segmentation than
simple accuracy, because lesion pixels are usually a small minority of the image. Sensitivity (recall)
is critical in medicine because missing a lesion is more serious than a false positive.
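
To make the point about accuracy concrete, here is a minimal sketch in plain Python on a hypothetical 10-pixel mask. The helper name `confusion_counts` and the toy data are assumptions for illustration; the formulas are the standard confusion-matrix forms of the metrics in Listing 76, without the smoothing term.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN for two binary masks given as flat lists of 0/1."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    return tp, fp, fn, tn

# Toy mask: 2 lesion pixels out of 10; the prediction finds only one of them
target = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(pred, target)
accuracy = (tp + tn) / len(target)
dice = 2 * tp / (2 * tp + fp + fn)
iou = tp / (tp + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f"accuracy={accuracy:.2f} dice={dice:.2f} iou={iou:.2f} "
      f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Accuracy stays at 0.90 even though half of the lesion is missed, while Dice (about 0.67), IoU (0.50) and sensitivity (0.50) all expose the error.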
Data augmentation: Particularly important with small medical datasets. The
transformations must preserve medical consistency (rotation, flip) and avoid unrealistic
deformations.
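
For segmentation, every geometric transformation applied to the image must also be applied identically to its mask. Listing 76 achieves this by re-seeding the random generator before each call; the sketch below shows an alternative in plain Python, where each random decision is drawn once and applied to both. The function names and the 3x3 toy data are assumptions for illustration.

```python
import random

def hflip(grid):
    """Mirror a 2D list left-right."""
    return [row[::-1] for row in grid]

def vflip(grid):
    """Mirror a 2D list top-bottom."""
    return grid[::-1]

def paired_random_flip(image, mask, rng):
    """Draw each flip decision once and apply it to both image and mask."""
    if rng.random() < 0.5:
        image, mask = hflip(image), hflip(mask)
    if rng.random() < 0.5:
        image, mask = vflip(image), vflip(mask)
    return image, mask

# Toy image: the two bright pixels (values >= 90) are the "lesion"
image = [[10, 20, 90],
         [10, 20, 95],
         [10, 20, 30]]
mask = [[0, 0, 1],
        [0, 0, 1],
        [0, 0, 0]]

rng = random.Random(0)
aug_image, aug_mask = paired_random_flip(image, mask, rng)

# After augmentation the lesion pixels must still coincide with the mask
aligned = all((px >= 90) == (m == 1)
              for img_row, m_row in zip(aug_image, aug_mask)
              for px, m in zip(img_row, m_row))
print(aligned)
```

Photometric changes such as brightness or contrast jitter, by contrast, belong on the image only, since the mask encodes labels rather than intensities.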
Rigorous validation: Evaluation on a separate test dataset is essential. In practice,
validation by medical experts is mandatory before any clinical deployment.
Ethical considerations: Medical AI models require careful attention to
bias, equity between populations, and transparency of decisions for clinical acceptance.

4.2 Transformers and attention


4.2.1 Transformers Architecture Explained

Transformers, introduced in "Attention is All You Need" (Vaswani et al., 2017), revolutionized
natural language processing by completely abandoning recurrent architectures in favor of the
attention mechanism.
Transformers Fundamentals:
— Full parallelization: Unlike RNNs, all tokens can be processed simultaneously
— Attention mechanism: Each position can "see" all other positions
— Positional encodings: Compensate for the lack of natural sequential order
— Encoder-decoder architecture: Flexible for various NLP tasks
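
The attention mechanism can be summarized as softmax(Q K^T / sqrt(d_k)) V: each query is compared with every key, the scaled scores are normalized by a softmax, and the resulting weights average the values. The sketch below is a minimal plain-Python illustration on three hypothetical 2-dimensional tokens, without the learned projections, masking, or multiple heads; the function names and toy vectors are assumptions for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of vectors (one per token). Returns (outputs, weights)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Weighted average of the value vectors
        out = [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Self-attention: the same token vectors serve as queries, keys and values
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and every token attends to all positions at once, which is what allows the whole sequence to be processed in parallel.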
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import numpy as np
import matplotlib.pyplot as plt

class MultiHeadAttention(nn.Module):
    """
    Multi-head attention mechanism
    """

    def __init__(self, d_model, num_heads, dropout=0.1):
        super(MultiHeadAttention, self).__init__()

        assert d_model % num_heads == 0

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Linear projections for Q, K, V and for the output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """
        Scaled dot-product attention

        Args:
            Q: Queries [batch_size, num_heads, seq_len, d_k]
            K: Keys [batch_size, num_heads, seq_len, d_k]
            V: Values [batch_size, num_heads, seq_len, d_k]
            mask: Optional mask to hide certain positions

        Returns:
            output: Output after attention
            attention_weights: Attention weights for visualization
        """
        # Calculating attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        # Application of the mask if provided
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        # Softmax to get attention weights
        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)

        # Applying attention to values
        output = torch.matmul(attention_weights, V)

        return output, attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        seq_len = query.size(1)

        # A [batch_size, seq_len] padding mask is reshaped to [batch_size, 1, 1, seq_len]
        # so that it broadcasts over heads and query positions
        if mask is not None and mask.dim() == 2:
            mask = mask.unsqueeze(1).unsqueeze(2)

        # Linear projections and reshape for multi-head
        Q = self.W_q(query).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Application of attention
        attention_output, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        # Concatenation of the heads
        attention_output = attention_output.transpose(1, 2).contiguous().view(
            batch_size, seq_len, self.d_model
        )

        # Final projection
        output = self.W_o(attention_output)

        return output, attention_weights

class PositionalEncoding(nn.Module):
    """
    Sinusoidal positional encoding
    """

    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()

        # Creation of the positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)

        # Computing the frequencies
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             (-math.log(10000.0) / d_model))

        # Application of the sin and cos functions
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Shape [1, max_len, d_model]: batch-first, to match the
        # [batch_size, seq_len, d_model] tensors used by the rest of this listing
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Adding the positional encoding to each position of the sequence
        return x + self.pe[:, :x.size(1)]

class FeedForward(nn.Module):
    """
    Feed-forward network with ReLU activation
    """

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(FeedForward, self).__init__()

        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

class TransformerBlock(nn.Module):
    """
    Complete Transformer block (attention + feed-forward + residual connections)
    """

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(TransformerBlock, self).__init__()

        self.attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection and normalization
        attn_output, attention_weights = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual connection and normalization
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))

        return x, attention_weights

class TransformerEncoder(nn.Module):
    """
    Full Transformer encoder
    """

    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_len, dropout=0.1):
        super(TransformerEncoder, self).__init__()

        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)

        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Embedding and positional encoding
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)
        x = self.dropout(x)

        # Passage through the Transformer blocks
        attention_weights = []
        for transformer_block in self.transformer_blocks:
            x, attn_weights = transformer_block(x, mask)
            attention_weights.append(attn_weights)

        return x, attention_weights

class TransformerClassifier(nn.Module):
    """
    Transformer-based classifier for sentiment analysis
    """

    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_len,
                 num_classes, dropout=0.1):
        super(TransformerClassifier, self).__init__()

        self.encoder = TransformerEncoder(vocab_size, d_model, num_heads, num_layers,
                                          d_ff, max_len, dropout)

        # Classification layer
        self.classifier = nn.Sequential(
            nn.Linear(d_model, d_model // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_model // 2, num_classes)
        )

    def forward(self, x, mask=None):
        # Encoding
        encoded, attention_weights = self.encoder(x, mask)

        # Global pooling (average of tokens)
        if mask is not None:
            # Hiding padding positions for pooling
            mask_expanded = mask.unsqueeze(-1).expand(encoded.size())
            encoded = encoded * mask_expanded
            # clamp avoids a division by zero for a sequence made only of padding
            lengths = mask.sum(dim=1, keepdim=True).float().clamp(min=1)
            pooled = encoded.sum(dim=1) / lengths
        else:
            pooled = encoded.mean(dim=1)

        # Classification
        logits = self.classifier(pooled)

        return logits, attention_weights

# Dataset for testing the Transformer
class SimpleTextDataset(torch.utils.data.Dataset):
    """
    Synthetic dataset for testing the Transformer
    """

    def __init__(self, vocab_size=1000, seq_len=50, num_samples=1000):
        self.vocab_size = vocab_size
        self.seq_len = seq_len
        self.num_samples = num_samples

        # Generate synthetic data
        self.data = []
        self.labels = []

        for _ in range(num_samples):
            # Random sequence (token 0 is reserved for padding, so IDs start at 1)
            sequence = torch.randint(1, vocab_size, (seq_len,))

            # Label based on a simple rule (parity of the token sum)
            label = int(sequence.sum().item() % 2)

            self.data.append(sequence)
            self.labels.append(label)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

def create_padding_mask(sequences, pad_token=0):
    """
    Create a mask to ignore padding tokens:
    1.0 for real tokens, 0.0 for padding positions.
    """
    return (sequences != pad_token).float()

def visualize_attention_weights(attention_weights, tokens, layer_idx=0, head_idx=0):
    """
    Visualize the attention weights of a specific layer and head
    """
    # Extract the weights for the chosen layer and head (first example in the batch)
    weights = attention_weights[layer_idx][0, head_idx].detach().cpu().numpy()

    # Create the heatmap
    plt.figure(figsize=(10, 8))
    plt.imshow(weights, cmap='Blues', interpolation='nearest')
    plt.colorbar(label='Attention weight')

    # Tick labels
    if len(tokens) <= 20:  # Display only if not too many tokens
        plt.xticks(range(len(tokens)), tokens, rotation=45)
        plt.yticks(range(len(tokens)), tokens)

    plt.xlabel('Tokens (Keys)')
    plt.ylabel('Tokens (Queries)')
    plt.title(f'Attention weights - Layer {layer_idx + 1}, Head {head_idx + 1}')
    plt.tight_layout()
    plt.show()
def train_transformer_classifier():
    """
    Train the Transformer classifier
    """
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Training on: {device}")

    # Model parameters
    vocab_size = 1000
    d_model = 128
    num_heads = 8
    num_layers = 6
    d_ff = 512
    max_len = 100
    num_classes = 2

    # Model creation
    model = TransformerClassifier(
        vocab_size=vocab_size,
        d_model=d_model,
        num_heads=num_heads,
        num_layers=num_layers,
        d_ff=d_ff,
        max_len=max_len,
        num_classes=num_classes
    ).to(device)

    # Dataset and DataLoader
    dataset = SimpleTextDataset(vocab_size=vocab_size, seq_len=50, num_samples=2000)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

    # Optimizer and loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    # Training
    model.train()
    train_losses = []

    print("Starting Transformer training...")

    for epoch in range(10):
        epoch_loss = 0
        correct_predictions = 0
        total_predictions = 0

        for batch_idx, (sequences, labels) in enumerate(dataloader):
            sequences, labels = sequences.to(device), labels.to(device)

            # Create the padding mask
            mask = create_padding_mask(sequences).to(device)

            # Forward pass
            logits, attention_weights = model(sequences, mask)
            loss = criterion(logits, labels)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Metrics
            epoch_loss += loss.item()
            predictions = torch.argmax(logits, dim=1)
            correct_predictions += (predictions == labels).sum().item()
            total_predictions += labels.size(0)

            if batch_idx % 20 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

        avg_loss = epoch_loss / len(dataloader)
        accuracy = correct_predictions / total_predictions
        train_losses.append(avg_loss)

        print(f'Epoch {epoch}: Loss = {avg_loss:.4f}, Accuracy = {accuracy:.4f}')

    # attention_weights holds the weights from the last training batch
    return model, train_losses, attention_weights
# Analysis of attention patterns
def analyze_attention_patterns(model, sample_text, vocab_size=1000):
    """
    Analyze attention patterns on a text sample
    """
    model.eval()
    device = next(model.parameters()).device

    # Converting text to tokens (simulated: sample_text is not actually tokenized,
    # random token IDs stand in for it)
    tokens = torch.randint(1, vocab_size, (1, 20)).to(device)
    mask = create_padding_mask(tokens).to(device)

    with torch.no_grad():
        _, attention_weights = model(tokens, mask)

    print("Analysis of attention patterns:")
    print(f"Number of layers: {len(attention_weights)}")
    print(f"Number of heads per layer: {attention_weights[0].size(1)}")
    print(f"Sequence length: {attention_weights[0].size(2)}")

    # Statistics for different layers and heads
    for layer_idx in [0, len(attention_weights) // 2, -1]:  # First, middle, last
        # Resolve the -1 shorthand to the index of the last layer
        if layer_idx == -1:
            layer_idx = len(attention_weights) - 1

        for head_idx in [0, attention_weights[0].size(1) // 2]:  # First and middle head
            print(f"\nLayer {layer_idx + 1}, Head {head_idx + 1}")

            # Extracting weights
            weights = attention_weights[layer_idx][0, head_idx].detach().cpu().numpy()

            # Statistical analysis of weights
            avg_attention = weights.mean()
            max_attention = weights.max()
            # Mean Shannon entropy of each query's attention distribution
            attention_entropy = -np.sum(weights * np.log(weights + 1e-10), axis=1).mean()

            print(f"Mean attention: {avg_attention:.4f}")
            print(f"Max attention: {max_attention:.4f}")
            print(f"Average entropy: {attention_entropy:.4f}")

    return attention_weights

# Transformer vs RNN comparison
def compare_transformer_vs_rnn():
    """
    Compare the performance and characteristics of Transformers vs RNNs
    """
    print("TRANSFORMER vs RNN COMPARISON")
    print("=" * 50)

    comparison_data = {
        'Criterion': [
            'Parallelization',
            'Long-term memory',
            'Computational complexity',
            'Training speed',
            'Interpretability',
            'Long-sequence performance',
            'Memory consumption'
        ],
        'RNN/LSTM': [
            'Sequential',
            'Limited (vanishing gradient)',
            'O(n) sequential steps',
            'Slow (sequential)',
            'Difficult',
            'Problematic',
            'Moderate'
        ],
        'Transformer': [
            'Fully parallel',
            'Excellent (global attention)',
            'O(n²) for attention',
            'Fast (parallelizable)',
            'Good (attention)',
            'Excellent',
            'High (quadratic attention)'
        ]
    }

    for criterion, rnn, transformer in zip(comparison_data['Criterion'],
                                           comparison_data['RNN/LSTM'],
                                           comparison_data['Transformer']):
        print(f"{criterion:<27} | {rnn:<30} | {transformer}")

    # Simulated (illustrative, not measured) relative training times
    sequence_lengths = [10, 50, 100, 200, 500]
    rnn_times = [0.1, 0.5, 1.0, 2.0, 5.0]  # Sequential time
    transformer_times = [0.05, 0.1, 0.2, 0.4, 1.0]  # Parallel time

    plt.figure(figsize=(12, 5))

    # Training time graph
    plt.subplot(1, 2, 1)
    plt.plot(sequence_lengths, rnn_times, 'o-', label='RNN/LSTM', linewidth=2)
    plt.plot(sequence_lengths, transformer_times, 's-', label='Transformer', linewidth=2)
    plt.xlabel('Sequence length')
    plt.ylabel('Relative time')
    plt.title('Training time comparison')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Memory complexity graph
    plt.subplot(1, 2, 2)
    rnn_memory = [n for n in sequence_lengths]  # Linear
    transformer_memory = [n ** 2 for n in sequence_lengths]  # Quadratic

    plt.plot(sequence_lengths, rnn_memory, 'o-', label='RNN/LSTM (O(n))', linewidth=2)
    plt.plot(sequence_lengths, transformer_memory, 's-', label='Transformer (O(n²))',
             linewidth=2)
    plt.xlabel('Sequence length')
    plt.ylabel('Relative memory usage')
    plt.title('Memory complexity')
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

# Execution of the Transformer example
print("=== TRANSFORMER ARCHITECTURE ===\n")

# Model training
transformer_model, losses, sample_attention = train_transformer_classifier()

# Analysis of attention patterns
print("\n" + "=" * 50)
attention_patterns = analyze_attention_patterns(transformer_model, "sample text")

# Comparison with RNN
print("\n" + "=" * 50)
compare_transformer_vs_rnn()

# Visualization of training curves
plt.figure(figsize=(10, 6))
plt.plot(losses, 'b-', linewidth=2, marker='o')
plt.title('Loss Evolution - Transformer Training')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)
plt.show()

print("\nKey points about Transformers:")
print("1. The attention mechanism captures long-term dependencies")
print("2. Full parallelization speeds up training")
print("3. Positional encoding compensates for the lack of sequential order")
print("4. Multi-head attention captures different types of relationships")
print("5. Residual connections and normalization facilitate deep training")

Listing 77 – Complete Implementation of a Transformer
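
The synthetic dataset in the listing draws token IDs starting at 1, so the padding branch of the pooling step is never exercised during training. The following is a minimal NumPy sketch of what masked mean pooling computes on a padded batch, assuming 0 is the padding ID as in create_padding_mask. The function name masked_mean_pool and the toy values are illustrative and not part of the listing.

```python
import numpy as np

def masked_mean_pool(encoded, mask):
    """Average token vectors over real (non-padding) positions only.

    encoded: (batch, seq_len, d_model) array of token representations
    mask:    (batch, seq_len) array, 1.0 for real tokens and 0.0 for padding
    """
    masked = encoded * mask[:, :, None]                          # zero out padded positions
    lengths = np.maximum(mask.sum(axis=1, keepdims=True), 1.0)   # avoid division by zero
    return masked.sum(axis=1) / lengths

# Toy batch: two sequences of length 3, hidden size 2 (hypothetical values)
token_ids = np.array([[5, 7, 0],    # last position is padding
                      [3, 4, 9]])
mask = (token_ids != 0).astype(np.float64)

encoded = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]])

pooled = masked_mean_pool(encoded, mask)
naive = encoded.mean(axis=1)   # plain mean, distorted by the padded position

print(pooled)
print(naive)
```

In the first sequence the masked average uses only the two real tokens, while the plain mean is pulled far off by whatever the encoder produced at the padded position.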


Explanation

Transformers represent a fundamental paradigm shift in sequence processing:

Attention mechanism: Each position can attend directly to every other position, so information and gradients no longer have to pass through a long chain of recurrent steps, which mitigates the vanishing-gradient problem of RNNs. The attention weights let the model focus on the most relevant elements.
Parallelization: Unlike RNNs, which process tokens one after another, Transformers process all tokens of a sequence simultaneously, which substantially speeds up training on GPUs.
Multi-head attention: Allows the model to capture different types of relationships (syntactic, semantic) simultaneously by using multiple parallel attention "heads".
Positional encoding: Compensates for the lack of a natural sequential order by injecting positional information via sinusoidal (sine and cosine) functions.
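
To make the positional encoding point concrete, here is a small NumPy sketch of the standard sinusoidal formulation, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). It is an illustration only: the function name is hypothetical, and the PositionalEncoding class used in the listing is assumed, not confirmed, to follow the same formula.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of sinusoidal position codes.

    Even columns hold sin(pos / 10000^(2i/d_model)),
    odd columns hold cos(pos / 10000^(2i/d_model)).
    Assumes d_model is even.
    """
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2), the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=128)
print(pe.shape)     # (100, 128)
print(pe[0, :4])    # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Each row is a distinct, bounded vector that is added to the token embedding at that position, which is how the otherwise order-blind attention layers can tell positions apart.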
This architecture has become the basis for revolutionary models like BERT, GPT, and their
successors.
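
The attention mechanism described above can be sketched in a few lines of NumPy. This is a simplified, single-head illustration of scaled dot-product attention with an optional padding mask. The learned query, key, and value projections are omitted, the function names are illustrative, and it is not the MultiHeadAttention implementation used in the listing.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, key_mask=None):
    """q, k, v: (seq_len, d_k) arrays; key_mask: (seq_len,) with 1 = keep, 0 = ignore."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)            # (seq_len, seq_len) similarity scores
    if key_mask is not None:
        # Masked keys get a very negative score, hence ~0 weight after softmax
        scores = np.where(key_mask[None, :] == 0, -1e9, scores)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
x = rng.normal(size=(seq_len, d_k))

# Self-attention: queries, keys and values all come from the same sequence
key_mask = np.array([1, 1, 1, 0])              # treat the last token as padding
output, weights = scaled_dot_product_attention(x, x, x, key_mask)

print(weights.round(3))
print(weights.sum(axis=1))
```

The weights matrix has one row per query and one column per key, so its size grows with the square of the sequence length. This is the quadratic memory cost noted in the Transformer vs RNN comparison.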


General Conclusion
After this captivating journey through Deep Learning, you are now armed with the essential knowledge to understand, implement,
and optimize deep learning models. From an introduction to neural networks to advanced architectures like Transformers and
multimodal systems, this course has provided you with a solid and progressive foundation.

You've explored the mathematical foundations, practiced with powerful frameworks like TensorFlow and PyTorch, and worked through
real-world applications in diverse fields such as vision, language, and content generation.

But the adventure is only just beginning. The next chapter will immerse you in the fascinating world of Reinforcement Learning,
where intelligent agents learn to interact with their environment to achieve complex goals. Stay curious, keep experimenting, and
above all... have fun learning!

Congratulations

You have completed this course on Deep Learning! Congratulations on your commitment and perseverance.

For any questions, suggestions or future collaboration, please do not hesitate to contact me:

— LinkedIn: linkedin.com/in/mohamed-ouazze
— Portfolio: ouazzemohamed.vercel.app
— Instagram: @ouazze10

Keep learning, creating, and pushing the boundaries of artificial intelligence!

Next Chapter
Next: Level 4 — Reinforcement Learning
In this next module, we will cover:
— The theoretical bases of RL (Markov Decision Processes, Policy, Reward)
— Q-Learning and Deep Q Networks (DQN)
— Policy Gradient Methods
— And of course, a practical project to train an agent to play a simple game.
Get ready to discover a field where AI learns by trial and error, like a child discovering the world!
