Abstract—Computational power is crucial for the development and deployment of artificial intelligence capabilities, as the large size of deep learning models often requires significant resources. Compression methods aim to reduce model size, making artificial intelligence more sustainable and accessible. Compression techniques are often applied uniformly across model layers, without considering their individual characteristics. In this paper, we introduce a customized approach that optimizes compression for each layer individually. Some layers undergo both pruning and quantization, while others are only quantized, with fuzzy logic guiding these decisions. The quantization precision is further adjusted based on the importance of each layer. Our experiments on both supervised and self-supervised models using the LibriSpeech dataset show only a slight decrease in performance, with a memory footprint reduction of about 85%.

Index Terms—Adaptive compression, self-supervised models, speech recognition, pruning, quantization, fuzzy logic

I. INTRODUCTION

Artificial Intelligence (AI) achieves remarkable success in a wide array of applications. This success stems from the development of deeper and wider Deep Neural Network (DNN) architectures, which enhance the model's ability to learn intricate patterns for specific tasks. This is especially evident in computer vision, natural language processing, and audio processing, including speech recognition. However, the deployment of such large models comes with significant computing and financial costs and contributes to a substantial carbon footprint [1]. This not only challenges the inclusivity of AI but also raises environmental concerns [2]. To address these issues, reducing model size is crucial. Several methods exist for compressing DNNs, including quantization, pruning, knowledge distillation, parameter sharing, and matrix factorization [3]–[9].

Today, most DNN applications rely on supervised learning, which requires labeled data, a process that is often time-consuming and costly. In contrast, human learning begins unsupervised, as infants learn language through observation, and later continues through supervised tasks like reading and writing. To mimic this process, self-supervised learning (SSL) frameworks have been developed. In speech processing, models like Wav2vec 2.0 [10], HuBERT [11], and WavLM [12] excel with minimal annotated data by pretraining on large unlabeled datasets, followed by fine-tuning on smaller labeled datasets.

Several works in the literature have focused on large-model compression, but only a limited number address SSL models [13], [14]. Among these, the authors of [15] apply knowledge distillation (KD) to the Wav2vec acoustic model, achieving a 4.8x compression ratio, though with a WER increase of 3.62. In [16], genetic algorithms are proposed for structured pruning of the Wav2vec2 XLSR53 model, resulting in a slight WER increase of 0.21% with 40% pruning. In [17], the authors employ symmetric linear quantization to reduce the precision of the weights and activations of a BERT model to INT8. To our knowledge, quantization has not yet been applied to speech SSL models.

Previous research often applies a uniform compression method across all layers. However, recent studies reveal that weight distributions vary by layer type and position within the network [5], [18]. For example, layers with many critical weights need higher quantization precision, while layers with mostly low-magnitude weights are more suitable for pruning. We propose a customized approach that selects the optimal compression method for each layer individually: some layers undergo both quantization and pruning, while others are only quantized, with fuzzy logic guiding the decision process.

II. MODEL COMPRESSION

Focusing on Green AI [1] to minimize computational costs while preserving performance, we prioritize techniques with minimal parameter tuning. We explore two model compression strategies, quantization and pruning, which are not only easily applied to pre-trained models but also well suited for rapid deployment on mobile devices.

A. Quantization

Model quantization reduces the size of neural networks by using lower-precision values for the weights or activations [6]. The two main approaches are Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). A standard quantization function Q(x) converts a floating-point value x to an integer:

Q(x) = \mathrm{Int}\left(\frac{x}{S}\right) - Z    (1)

where S is a floating-point scaling factor, and Z is an integer zero point, representing the zero value in the quantization
scheme. The Int(·) function rounds the scaled x to the nearest
integer. This approach is known as uniform quantization
because all x values are scaled by the same factor S, leading
to evenly spaced quantized values. Non-uniform quantization,
with its variable spacing between quantized values, can more
effectively capture signal information, but it is challenging
to implement efficiently on standard hardware. As a result,
uniform quantization is the preferred method.
Clipping range selection, or calibration, can be done using the signal's minimum and maximum values, α = x_min and β = x_max, resulting in asymmetric quantization, as the range may not be centered. Alternatively, symmetric quantization sets −α = β, often using the maximum absolute value −α = β = max(|x_max|, |x_min|). Asymmetric quantization typically narrows the clipping range, which is important for imbalanced weights or activations, like those following ReLU. Setting the zero point to Z = 0 simplifies symmetric quantization.
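For concreteness, the NumPy sketch below illustrates post-training uniform quantization following Eq. (1) with both calibration choices described above; the helper names (calibrate, quantize, dequantize) and the assumed 8-bit width are our own illustrative choices, not the implementation used in this work.

```python
import numpy as np

def calibrate(x, bits=8, symmetric=True):
    """Compute scale S and zero point Z for uniform quantization (Eq. 1)."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 127 for INT8
    qmin = -(2 ** (bits - 1))            # e.g. -128 for INT8
    if symmetric:
        alpha = np.max(np.abs(x))        # -alpha = beta = max(|x_min|, |x_max|)
        S = alpha / qmax
        Z = 0                            # zero point fixed at 0
    else:
        x_min, x_max = x.min(), x.max()  # asymmetric: clip to [x_min, x_max]
        S = (x_max - x_min) / (qmax - qmin)
        Z = int(round(x_min / S)) - qmin
    return S, Z

def quantize(x, S, Z, bits=8):
    """Q(x) = Int(x / S) - Z, clipped to the integer grid."""
    q = np.round(x / S).astype(np.int32) - Z
    return np.clip(q, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)

def dequantize(q, S, Z):
    """Approximate reconstruction: x ≈ S * (q + Z)."""
    return S * (q.astype(np.float32) + Z)

# Toy usage on a ReLU-like (non-negative, imbalanced) tensor
x = np.abs(np.random.randn(4, 4)).astype(np.float32)
S, Z = calibrate(x, symmetric=False)
q = quantize(x, S, Z)
print(np.max(np.abs(x - dequantize(q, S, Z))))  # quantization error
```

With symmetric calibration the zero point is fixed at Z = 0, matching the simplification noted above, while asymmetric calibration spends the full integer grid on the observed range.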
B. Pruning

Pruning removes unimportant weights or components by zeroing values that are close to zero. Formally, a neural network model can be defined as a function family f(x, W), where x denotes the network architecture and W its parameters. Pruning a neural network involves taking an existing model f(x, W) and generating a new model f(x, W′) such that W′ = M ⊙ W, where M ∈ {0, 1}^{|W|} is a binary mask that sets some parameters to zero and ⊙ is the elementwise product operator [19].

The pruning techniques include:
– Unstructured pruning [20], which removes individual weights, creating sparse matrices;
– Structured pruning, which removes entire blocks, such as rows, columns, neurons, or attention heads [21].

In this paper, our focus is on unstructured pruning, as it targets the smallest model elements without significant performance loss. Unstructured pruning introduces sparsity, creating irregular memory access patterns.
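As an illustration of the mask formulation above, the sketch below performs unstructured magnitude pruning with NumPy; the function name and the 40% sparsity level are illustrative assumptions, not the exact procedure or settings used in our experiments.

```python
import numpy as np

def magnitude_prune(W, sparsity=0.4):
    """Unstructured pruning: zero the `sparsity` fraction of smallest-|w| entries.

    Returns the pruned weights W' = M * W and the binary mask M.
    """
    k = int(np.floor(sparsity * W.size))
    if k == 0:
        return W.copy(), np.ones_like(W)
    # Threshold = k-th smallest absolute value
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    M = (np.abs(W) > threshold).astype(W.dtype)  # binary mask, same shape as W
    return M * W, M                              # elementwise product

# Toy usage: prune 40% of a random weight matrix
W = np.random.randn(8, 8).astype(np.float32)
W_pruned, M = magnitude_prune(W, sparsity=0.4)
print("actual sparsity:", 1.0 - M.mean())
```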
Fig. 1. Fuzzy membership functions for weight importance in the adaptive compression method.

a) Trapezoidal Membership Function: This function is used to classify weights into two categories: low or high importance. The corresponding membership function, denoted by µ_k(x), where k ∈ {low, high}, is given by:

\mu_k(x; a_k, b_k, c_k, d_k) =
\begin{cases}
0 & \text{if } x \le a_k \\
\frac{x - a_k}{b_k - a_k} & \text{if } a_k < x \le b_k \\
1 & \text{if } b_k < x \le c_k \\
\frac{d_k - x}{d_k - c_k} & \text{if } c_k < x \le d_k \\
0 & \text{if } x > d_k
\end{cases}    (2)

b) Triangular Membership Function: For weights classified as medium importance, we use a triangular membership function, µ_medium(x; a_M, b_M, c_M), defined as:

\mu_{\mathrm{medium}}(x; a_M, b_M, c_M) =
\begin{cases}
0 & \text{if } x \le a_M \\
\frac{x - a_M}{b_M - a_M} & \text{if } a_M < x \le b_M \\
\frac{c_M - x}{c_M - b_M} & \text{if } b_M < x \le c_M \\
0 & \text{if } x > c_M
\end{cases}    (3)
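The two membership functions can be transcribed directly, as in the sketch below; the breakpoints passed in the toy usage are placeholder values for normalized weight magnitudes, not the parameters used in the paper.

```python
import numpy as np

def trapezoidal(x, a, b, c, d):
    """Trapezoidal membership function of Eq. (2), used for the low/high classes."""
    x = np.asarray(x, dtype=np.float64)
    y = np.zeros_like(x)
    y = np.where((x > a) & (x <= b), (x - a) / (b - a), y)  # rising edge
    y = np.where((x > b) & (x <= c), 1.0, y)                # plateau
    y = np.where((x > c) & (x <= d), (d - x) / (d - c), y)  # falling edge
    return y

def triangular(x, a, b, c):
    """Triangular membership function of Eq. (3), used for the medium class."""
    x = np.asarray(x, dtype=np.float64)
    y = np.zeros_like(x)
    y = np.where((x > a) & (x <= b), (x - a) / (b - a), y)  # rising edge
    y = np.where((x > b) & (x <= c), (c - x) / (c - b), y)  # falling edge
    return y

# Toy usage: fuzzy importance of normalized |weight| magnitudes (breakpoints are illustrative)
w = np.array([0.05, 0.3, 0.7, 0.95])
low    = trapezoidal(w, -0.01, 0.0, 0.2, 0.4)
medium = triangular(w, 0.2, 0.5, 0.8)
high   = trapezoidal(w, 0.6, 0.8, 1.0, 1.01)
print(np.stack([low, medium, high]))
```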
TABLE I. Baseline systems' models: number of parameters (millions), memory footprint (megabytes) and word error rate (%).

TABLE III. Memory footprint (megabytes) after quantization.