
Adaptive Compression of Supervised and Self-Supervised Models for Green Speech Recognition

Mouaad Oujabour*, Leila Ben Letaifa*, Jean-François Dollinger*, Jean-Luc Rouas†
* CESI LINEACT UR 7527, Nancy, France
† LaBRI, CNRS UMR 5800, Univ. de Bordeaux, Bordeaux INP, Talence, France

Abstract—Computational power is crucial for the development and deployment of artificial intelligence capabilities, as the large size of deep learning models often requires significant resources. Compression methods aim to reduce model size, making artificial intelligence more sustainable and accessible. Compression techniques are often applied uniformly across model layers, without considering their individual characteristics. In this paper, we introduce a customized approach that optimizes compression for each layer individually. Some layers undergo pruning and/or quantization, while others are only quantized, with fuzzy logic guiding these decisions. The quantization precision is further adjusted based on the importance of each layer. Our experiments on both supervised and self-supervised models using the LibriSpeech dataset show only a slight decrease in performance, with about 85% memory footprint reduction.

Index Terms—Adaptive compression, self-supervised models, speech recognition, pruning, quantization, fuzzy logic

I. INTRODUCTION

Artificial Intelligence (AI) achieves remarkable success in a wide array of applications. This success stems from the development of deeper and wider Deep Neural Network (DNN) architectures, which enhance a model's ability to learn intricate patterns for specific tasks. This is especially evident in computer vision, natural language processing and audio processing, including speech recognition. However, the deployment of such large models comes with significant computing and financial costs and contributes to a substantial carbon footprint [1]. This not only challenges the inclusivity of AI but also raises environmental concerns [2]. To address these issues, reducing model size is crucial. Several methods exist for compressing DNNs, including quantization, pruning, knowledge distillation, parameter sharing, and matrix factorization [3]–[9].

Today, most DNN applications rely on supervised learning, which requires labeled data — a process that is often time-consuming and costly. In contrast, human learning begins unsupervised, as infants learn language through observation, and later proceeds through supervised tasks like reading and writing. To mimic this process, self-supervised learning (SSL) frameworks have been developed. In speech processing, models like Wav2vec 2.0 [10], HuBERT [11], and WavLM [12] excel with minimal annotated data by pretraining on large unlabeled datasets, followed by fine-tuning on smaller labeled datasets.

Several works in the literature have focused on large model compression, but only a limited number address SSL models [13], [14]. Among these, in [15], the authors apply knowledge distillation (KD) to the Wav2vec acoustic model, achieving a 4.8x compression ratio, though with a WER increase of 3.62. In [16], genetic algorithms are proposed for structured pruning of the Wav2vec2 XLSR53 model, resulting in a slight WER increase of 0.21% with 40% pruning. In [17], the authors employ symmetric linear quantization to reduce the precision of weights and activations of a BERT model to INT8. To our knowledge, quantization has not yet been applied to speech SSL models.

Previous research often applies a uniform compression method across all layers. However, recent studies reveal that weight distributions vary by type and position within the network [5], [18]. For example, layers with many critical weights need higher quantization precision, while layers with mostly low-magnitude weights are more suitable for pruning. We propose a customized approach that selects the optimal compression method for each layer individually. Some layers undergo quantization and/or pruning, while others are only quantized, with fuzzy logic guiding the decision process.

II. MODEL COMPRESSION

Focusing on Green AI [1] to minimize computational costs while preserving performance, we prioritize techniques with minimal parameter tuning. We explore two model compression strategies, quantization and pruning, which are not only easily applied to pre-trained models but also well suited to rapid deployment on mobile devices.

A. Quantization

Model quantization reduces the size of a neural network by using lower-precision values for its weights or activations [6]. The two main approaches are Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). A standard quantization function Q(x) converts a floating-point value x to an integer:

    Q(x) = Int(x / S) − Z                                        (1)

where S is a floating-point scaling factor and Z is an integer zero point, representing the zero value in the quantization scheme. The Int(·) function rounds the scaled x to the nearest integer. This approach is known as uniform quantization because all x values are scaled by the same factor S, leading to evenly spaced quantized values. Non-uniform quantization, with its variable spacing between quantized values, can more effectively capture signal information, but it is challenging to implement efficiently on standard hardware. As a result, uniform quantization is the preferred method.
Clipping range selection, or calibration, can be done using the signal's minimum and maximum values, α = x_min and β = x_max, resulting in asymmetric quantization since the range may not be centered. Alternatively, symmetric quantization sets α = −β, often using the maximum absolute value α = −β = max(|x_max|, |x_min|). Asymmetric quantization typically narrows the clipping range, which is important for imbalanced weights or activations such as those following ReLU. Setting the zero point to Z = 0 simplifies symmetric quantization.
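As a concrete illustration of Eq. (1) with symmetric calibration and Z = 0, the following sketch (our example, not the authors' implementation; the function names and bit-width argument are hypothetical) quantizes a weight tensor to a given integer precision and dequantizes it back:

```python
import numpy as np

def symmetric_quantize(w: np.ndarray, bits: int = 8):
    """Uniform symmetric quantization: q = round(w / S), clipped to the integer range, with Z = 0."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 127 for int8
    alpha = float(np.max(np.abs(w)))              # symmetric clipping range [-alpha, alpha]
    scale = alpha / qmax if alpha > 0 else 1.0    # floating-point scaling factor S
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate floating-point weights from the integer codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = symmetric_quantize(w, bits=8)
print("max reconstruction error:", np.max(np.abs(w - dequantize(q, s))))
```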

B. Pruning

Pruning removes unimportant weights or components by zeroing values close to zero. Formally, a neural network model can be defined as a function family f(x, W), where x denotes the network architecture and W its parameters. Pruning a neural network involves taking an existing model f(x, W) and generating a new model f(x, W′) such that W′ = M ⊙ W, where M ∈ {0, 1}^|W| is a binary mask that sets some parameters to zero and ⊙ is the elementwise product operator [19].

Pruning techniques include:
– Unstructured pruning [20], which removes individual weights, creating sparse matrices.
– Structured pruning, which removes entire blocks, such as rows, columns, neurons, or attention heads [21].
In this paper, our focus is on unstructured pruning, as it targets the smallest model elements without significant performance loss. Unstructured pruning introduces sparsity, creating irregular memory access patterns, but sparse matrix representations [22] or specialized hardware [23] can address this issue. Pruning can be done iteratively between training epochs, applied once after training [18], or integrated during fine-tuning [24]. It can be applied globally, removing a fraction of parameters across the entire model, or locally, targeting a specific percentage of parameters within each layer [19].
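The mask formulation above can be made concrete with a short sketch (ours; it assumes magnitude-based unstructured pruning, and the function names are illustrative):

```python
import numpy as np

def magnitude_prune(w: np.ndarray, rate: float):
    """Unstructured pruning: zero the fraction `rate` of weights with the
    smallest magnitudes, i.e. W' = M * W with a binary mask M."""
    k = int(rate * w.size)
    if k == 0:
        return w.copy(), np.ones_like(w)
    threshold = np.sort(np.abs(w), axis=None)[k - 1]  # magnitude cut-off
    mask = (np.abs(w) > threshold).astype(w.dtype)    # M in {0, 1}^|W| (ties may prune slightly more)
    return w * mask, mask

w = np.random.randn(256, 256).astype(np.float32)
w_pruned, m = magnitude_prune(w, rate=0.4)
print("sparsity:", 1.0 - float(m.mean()))
```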
III. ADAPTIVE COMPRESSION

We propose an adaptive compression method using fuzzy logic [25] to evaluate weight importance in each layer and dynamically select the optimal compression strategy. By analyzing the statistical distribution of weight magnitudes (e.g., minimum, maximum, median, standard deviation), the method defines fuzzy membership functions to classify weights as low, medium, or high importance.

1) Fuzzy Membership Functions: Fuzzy logic allows us to assign degrees of membership to the importance classes low, medium, or high for each weight based on its magnitude. In the adaptive method, we use trapezoidal and triangular membership functions to describe the fuzzy sets for low, medium, and high importance, as shown in Fig. 1.

Fig. 1. Fuzzy membership functions for weight importance in the adaptive compression method.

a) Trapezoidal Membership Function: This function is used to classify weights into two categories: low or high importance. The corresponding membership function, denoted by µ_k(x), where k ∈ {low, high}, is given by:

    µ_k(x; a_k, b_k, c_k, d_k) =
        0                          if x ≤ a_k
        (x − a_k) / (b_k − a_k)    if a_k < x ≤ b_k
        1                          if b_k < x ≤ c_k
        (d_k − x) / (d_k − c_k)    if c_k < x ≤ d_k
        0                          if x > d_k                    (2)

b) Triangular Membership Function: For weights classified as medium importance, we use a triangular membership function. The triangular membership function µ_medium(x; a_M, b_M, c_M) is defined as:

    µ_medium(x; a_M, b_M, c_M) =
        0                          if x ≤ a_M
        (x − a_M) / (b_M − a_M)    if a_M < x ≤ b_M
        (c_M − x) / (c_M − b_M)    if b_M < x ≤ c_M
        0                          if x > c_M                    (3)
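For concreteness, here is a minimal sketch of the trapezoidal and triangular membership functions of Eqs. (2) and (3) (our illustration; the parameter names mirror the equations):

```python
def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Trapezoidal membership (Eq. 2): 0 outside (a, d], ramps up on (a, b],
    equals 1 on (b, c], ramps down on (c, d]."""
    if x <= a or x > d:
        return 0.0
    if x <= b:
        return (x - a) / (b - a) if b > a else 1.0
    if x <= c:
        return 1.0
    return (d - x) / (d - c) if d > c else 1.0

def triangle(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership (Eq. 3): peak of 1 at b, 0 outside (a, c]."""
    if x <= a or x > c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a) if b > a else 1.0
    return (c - x) / (c - b) if c > b else 1.0
```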
2) Membership Degree Calculation: For each weight w in a layer, we calculate its degree of membership in each importance class (low, medium, or high). Let x = |w| be the magnitude of the weight. The membership degrees are µ_low(x), µ_medium(x) and µ_high(x). These degrees describe the extent to which a weight belongs to each importance class.

3) Defuzzification and Decision-Making: Defuzzification converts fuzzy weight classifications into concrete actions. Based on the percentage of weights in each importance category (low, medium, high), the method applies the appropriate compression strategy. Layers with a majority of low-importance weights are pruned or quantized with low precision, while those with mostly medium- or high-importance weights are quantized with higher precision.
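The per-layer defuzzification step can be sketched as follows (our illustration; `trapezoid` and `triangle` refer to the sketch above, and the returned strategy labels are only indicative):

```python
import numpy as np

def class_shares(weights: np.ndarray, params_low, params_med, params_high) -> dict:
    """Assign each weight to its dominant importance class and return the class proportions."""
    counts = {"low": 0, "medium": 0, "high": 0}
    for x in np.abs(weights).ravel():
        degrees = {
            "low": trapezoid(x, *params_low),     # mu_low(x)
            "medium": triangle(x, *params_med),   # mu_medium(x)
            "high": trapezoid(x, *params_high),   # mu_high(x)
        }
        counts[max(degrees, key=degrees.get)] += 1
    total = weights.size
    return {k: v / total for k, v in counts.items()}

def choose_strategy(shares: dict) -> str:
    """Majority-based decision: mostly low-importance -> prune / low precision,
    otherwise keep higher-precision quantization."""
    if max(shares, key=shares.get) == "low":
        return "prune or quantize with low precision"
    return "quantize with higher precision"
```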
IV. EXPERIMENTS AND RESULTS

We conducted experiments comparing quantization, pruning, and the proposed adaptive method under identical conditions. Each technique was applied post-training in a one-shot manner and evaluated by memory footprint and Word Error Rate (WER). Compression was measured by storage efficiency for quantization and by sparsity for pruning. To ensure fairness, we introduced a unified compression rate for doubly compressed models.
A. Baseline systems

We utilize automatic speech recognition (ASR) models trained with the ESPnet toolkit [26] on the LibriSpeech dataset [27], which comprises approximately 1000 hours of 16 kHz English speech recordings. Of these, around 960 hours are dedicated to training, with the remaining hours evenly split between development (dev) and testing (test) sets. The dataset distinguishes between two categories: clean data (test-clean and dev-clean) and other data (test-other and dev-other). Other data refer to recordings that are more challenging due to factors such as background noise, unclear pronunciation, or varied accents. In contrast, clean data consist of recordings with clear, high-quality audio. We chose to evaluate the Transformer architecture [28] and some of its variants, namely the Conformer [29], the Branchformer [30] and the E-Branchformer [31], because of their high performance in end-to-end ASR. We also used the Wav2Vec, HuBERT and WavLM SSL models. These models are referred to as Transf, Conf, Branch, Ebranch, W2V, Hub and Wlm. Results are reported in TABLE I.

TABLE I
Baseline systems' models: number of parameters (millions), memory footprint (megabytes) and word error rate (%).

Characteristic   Transf   Conf   Branch   Ebranch   W2V    Hub    Wlm
Parameters       99       93     116      148       432    433    431
Mem.             397      373    596      553       1734   1731   1727
Mem. zip         369      345    433      467       1176   1179   1174
WER
  Test-clean     3.3      2.9    2.4      2.2       2.5    2.0    2.0
  Test-other     8.0      7.3    5.3      4.6       6.3    4.2    4.2
  Dev-clean      3.0      2.9    2.1      2.0       2.3    1.9    1.9
  Dev-other      7.9      7.1    5.2      4.6       6.6    4.1    4.2

Among classical DNNs, the E-Branchformer is the most performant but also the largest. The SSL models all have comparable and significant memory sizes, with HuBERT and WavLM showing superior performance to Wav2vec.

B. Quantization

We reduced the model precision from 32-bit floats to 8-, 4-, and 2-bit integers using the symmetric PTQ dynamic quantization method with the Quanto software (https://github.com/huggingface/optimum-quanto). TABLE II shows that all models are robust to 8-bit quantization but less so to 4-bit quantization. The best performance is achieved by the E-Branchformer, HuBERT, and WavLM models. Overall, the error rate remains stable or increases slightly, with a maximum absolute rise of 0.2% for clean data (test and dev) and up to 0.6% for noisy data. Quantizing to 2-bit integers significantly degrades performance on clean data, so we chose not to evaluate it on the remaining sets. Regarding memory size, according to TABLE III, it is reduced by more than 3.6 times with 8-bit quantization and by 6.3 times with 4-bit quantization.

TABLE II
Word Error Rate (WER) after quantization to 2, 4, and 8 bits.

Models     Data         Initial   Qint8   Qint4   Qint2
Transf     Test clean   3.3       3.3     3.5     94.0
           Test other   8.0       8.0     8.6     –
           Dev clean    3.0       3.0     3.2     –
           Dev other    7.9       7.9     8.4     –
Conf       Test clean   2.9       3.0     3.0     33.2
           Test other   7.3       7.4     7.6     –
           Dev clean    2.9       2.9     3.0     –
           Dev other    7.1       7.3     7.4     –
Branch     Test clean   2.4       2.4     2.4     22.7
           Test other   5.3       5.3     5.6     –
           Dev clean    2.1       2.2     2.2     –
           Dev other    5.2       5.2     5.4     –
E-branch   Test clean   2.2       2.2     2.2     16.5
           Test other   4.6       4.6     4.7     –
           Dev clean    2.0       2.0     2.0     –
           Dev other    4.6       4.6     4.7     –
Hubert     Test clean   2.0       2.0     2.0     66.3
           Test other   4.2       4.2     4.2     –
           Dev clean    1.9       1.9     1.9     –
           Dev other    4.1       4.1     4.2     –
Wav2Vec    Test clean   2.5       2.5     2.6     100.0
           Test other   6.3       6.3     6.6     –
           Dev clean    2.3       2.3     2.4     –
           Dev other    6.6       6.6     6.9     –
WavLm      Test clean   2.0       2.0     2.1     11.4
           Test other   4.2       4.2     4.3     –
           Dev clean    1.9       1.9     2.0     –
           Dev other    4.2       4.2     4.2     –

TABLE III
Memory footprint (Megabytes) after quantization.

           Qint8             Qint4             Qint2
Models     Mem.   Mem. zip   Mem.   Mem. zip   Mem.   Mem. zip
Transf     108    97         65     59         41     34
Conf       131    119        95     87         75     66
Branch     129    111        79     69         50     41
E-branch   163    138        99     87         63     51
Hubert     512    443        332    279        231    171
Wav2Vec    512    444        332    279        230    171
WavLm      510    442        330    277        229    170
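For illustration, a minimal post-training weight-quantization sketch with the Quanto library is given below (our sketch based on the library's quantize/freeze workflow; the toy model is only a placeholder, not the ESPnet recipe used in the paper, and exact API details may differ across Quanto versions):

```python
import torch
from optimum.quanto import quantize, freeze, qint8  # qint4 and qint2 are also provided

# Placeholder network; in the paper the weights come from ESPnet-trained ASR models.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.ReLU(),
    torch.nn.Linear(2048, 512),
)

quantize(model, weights=qint8)  # replace float32 weights with 8-bit integer representations
freeze(model)                   # materialize the quantized weights for inference
print(model)
```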
C. Pruning

Local unstructured pruning is applied to each linear layer of all models, using pruning rates of 40% and 60%. According to TABLE IV, not all models are equally robust to pruning. For traditional models, the Branchformer and E-Branchformer architectures appear to be over-parameterized, as they can be pruned by up to 50% without any loss in performance. These models are larger than the Transformer and Conformer and include more MLP layers. In the context of SSL, the HuBERT and WavLM architectures show considerably greater robustness to pruning than Wav2vec. This enhanced robustness is likely due to their training processes, which rely on masked prediction.

TABLE IV
Word Error Rate (WER) for the pruning rates 40% and 60%.

Models     Pr = 40%   Pr = 60%
Transf     3.6        10.7
Conf       3.1        5.2
Branch     2.4        2.7
E-branch   2.1        2.3
Hubert     2.3        3.1
Wav2Vec    4.1        18.5
WavLm      2.1        2.9
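A sketch of how such local unstructured pruning can be applied per linear layer with PyTorch's pruning utilities (ours; the toy model stands in for the ASR models evaluated in the paper):

```python
import torch
import torch.nn.utils.prune as prune

# Placeholder network; each Linear layer is pruned locally.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.ReLU(),
    torch.nn.Linear(2048, 512),
)

pruning_rate = 0.4  # 40% as in Table IV; set to 0.6 for the 60% configuration
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        # L1 (magnitude) unstructured pruning, applied locally to this layer's weights
        prune.l1_unstructured(module, name="weight", amount=pruning_rate)
        prune.remove(module, "weight")  # make the induced zeros permanent

total = sum(p.numel() for p in model.parameters())
zeros = sum(int((p == 0).sum()) for p in model.parameters())
print(f"overall sparsity: {zeros / total:.2%}")
```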
D. Adaptive compression

Each model is compressed layer by layer as follows. For each layer, a membership function categorizes the weights into three groups: L (low), M (medium), and H (high). To define the parameters of these membership functions, let us denote by |w| the weight magnitude. The following relationships then hold, in tuple notation:

    (a_M, b_M, c_M) = (median(|w|) − β, median(|w|), median(|w|) + β)                          (4)

    (a_L, b_L, c_L, d_L) = (min(|w|), min(|w|), min(|w|) + std(|w|), median(|w|) − α)          (5)

    (a_H, b_H, c_H, d_H) = (median(|w|) + α, max(|w|) − std(|w|), max(|w|), max(|w|))          (6)

with α and β variable parameters.
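A sketch of how these parameters can be derived from a layer's weight statistics is given below (our illustration of Eqs. (4)–(6); it assumes the high-importance trapezoid is listed in increasing order, and alpha/beta are the free parameters, e.g. fractions of the magnitude standard deviation):

```python
import numpy as np

def membership_params(weights: np.ndarray, alpha: float, beta: float):
    """Derive the (a, b, c, d) parameters of the low/medium/high fuzzy sets
    from the layer's weight-magnitude statistics, following Eqs. (4)-(6)."""
    m = np.abs(weights).ravel()
    med, std = float(np.median(m)), float(np.std(m))
    lo, hi = float(np.min(m)), float(np.max(m))
    params_med = (med - beta, med, med + beta)       # triangular set, Eq. (4)
    params_low = (lo, lo, lo + std, med - alpha)     # trapezoidal set, Eq. (5)
    params_high = (med + alpha, hi - std, hi, hi)    # trapezoidal set, Eq. (6)
    return params_low, params_med, params_high

# Example with the first setting used in the paper: (alpha1, beta1) = (std_magnitude/2, std_magnitude/4).
w = np.random.randn(1024, 1024).astype(np.float32)
std_mag = float(np.std(np.abs(w)))
p_low, p_med, p_high = membership_params(w, alpha=std_mag / 2, beta=std_mag / 4)
```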
We apply two experiments: three-class compression and two-class compression.

1) Three-class compression: If class L has the highest cardinality, the layer is pruned and quantized to 8 bits. If class M has the highest cardinality, the layer is quantized to 4 bits. Otherwise, it is quantized to 8 bits.
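The per-layer decision rules used in this and the following experiment can be sketched as follows (ours; `shares` is the class-proportion dictionary from the earlier defuzzification sketch, and the returned labels are only indicative):

```python
def three_class_decision(shares: dict) -> str:
    """Three-class rule: majority low -> prune + 8-bit, majority medium -> 4-bit, else 8-bit."""
    majority = max(shares, key=shares.get)
    if majority == "low":
        return "prune and quantize to 8 bits"
    if majority == "medium":
        return "quantize to 4 bits"
    return "quantize to 8 bits"

def two_class_decision(shares: dict) -> str:
    """Two-class rule (Section IV-D.2): majority low -> 4-bit quantization, otherwise 2-bit."""
    return "quantize to 4 bits" if max(shares, key=shares.get) == "low" else "quantize to 2 bits"
```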
TABLE V shows the results for the following parameter values: (α1, β1) = (std_magnitude/2, std_magnitude/4) and (α2, β2) = (std_magnitude/1, std_magnitude/2).

TABLE V
Results of the three-class compression with (α1, β1) and (α2, β2): %L, %M and %H are the percentages of the respective compressed layers and Pr the pruning rate of the pruned layers.

Model      WER   Mem.z    %L      %M      %H      %Pr
(α1, β1)
Transf     3.3   88.06    1.17    85.96   12.87   48
Conf       3.1   118.97   0.68    97.28   2.04    48
Branch     2.4   88.7     26.57   58.94   14.49   59
E-branch   2.2   107.82   24.72   63.29   11.99   53
Hubert     2.0   308.46   0.00    91.82   8.18    0
Wav2Vec    2.6   299.82   0.62    93.12   6.25    48
WavLm      2.1   313.66   1.47    89.44   9.09    54
(α2, β2)
Transf     3.4   97.25    44.45   32.16   23.39   36
Conf       3.1   120.54   54.42   27.21   18.37   36
Branch     2.4   94.76    54.11   27.54   18.35   49
E-branch   2.2   117.8    56.93   25.09   17.98   45
Hubert     2.1   357.45   38.05   34.91   27.04   36
Wav2Vec    4.4   358.52   37.81   32.50   29.69   36
WavLm      2.1   370.2    42.82   31.09   26.09   36

Using (α1, β1), the WER either remained the same or increased slightly. The majority of layers in all models were not pruned, except for the Branchformer and E-Branchformer, where a quarter of the layers were pruned without performance loss. With (α2, β2), we focused on increasing the proportion of weights in the low class, since a 60% pruning rate led to a WER increase of 0.3, 0.1, and 0.9 respectively for the Branchformer, E-Branchformer, and WavLM (see TABLE IV). We find that the model size is typically larger than with 4-bit quantization but smaller than with 8-bit quantization. The Branchformer offers the best trade-off, reducing size by 75% with only a slight increase in WER.

2) Two-class compression: According to paragraph IV-B, the WER changes little with 4-bit quantization of all layers but increases significantly with 2-bit quantization. The approach here is to use 2 bits for insignificant layers and 4 bits for the remaining layers. The adaptive compression is applied as follows: if class L has the highest cardinality, the layer is quantized to 4 bits; otherwise, it is quantized to 2 bits. TABLE VI shows the compression results.

TABLE VI
Results of the two-class compression.

Model      WER    Mem. zip   %H      %L
Transf     5.8    54         92.49   7.51
Conf       4.7    84         95.97   4.03
Branch     2.7    54         47.37   52.63
E-branch   2.4    74         66.17   33.83
Hubert     3.2    239        76.56   23.44
Wav2Vec    15.3   233        73.60   26.40
WavLm      2.8    237        74.93   25.07

The memory size falls between that of 4-bit and 2-bit quantization. While the WER is higher than with 4-bit quantization, it remains much better than with 2-bit quantization, offering a balance between the two. For the Branchformer, 2-bit quantization of all layers yields a WER of 16.5%, 4-bit gives 2.2%, and a mix of 34% of the layers at 2 bits and 66% at 4 bits results in 2.4% WER with a memory reduction to 15%. For WavLM, 2-bit quantization gives a WER of 11.4%, 4-bit achieves 2.0%, and a 25%-75% mix results in 2.8% WER.

V. CONCLUSIONS

Large DNN models are resource-intensive, and compression techniques can reduce model size without compromising performance, making artificial intelligence more sustainable and accessible. Typically, compression methods are applied uniformly across all network layers, neglecting their individual characteristics. This work introduces a refined approach that adapts compression methods to each layer's specific needs, optimizing performance. Our experiments with supervised and self-supervised models show only a slight performance reduction. The Branchformer achieves the best balance, reducing memory size by up to 85% with just a 0.2% increase in WER.

ACKNOWLEDGMENT

This work was supported by the FVLLMONTI project, funded by the European Union's Horizon 2020 program (grant agreement No. 101016776).
REFERENCES

[1] R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni, "Green AI," Commun. ACM, vol. 63, no. 12, pp. 54–63, Nov. 2020. [Online]. Available: https://doi.org/10.1145/3381831
[2] G. Sastry, L. Heim, H. Belfield, M. Anderljung, M. Brundage, J. Hazell, C. O'Keefe, G. K. Hadfield, R. Ngo, K. Pilz et al., "Computing power and the governance of artificial intelligence," arXiv preprint arXiv:2402.08797, 2024.
[3] J. Rathod, N. Dawalatabad, S. Singh, and D. Gowda, "Multi-stage progressive compression of conformer transducer for on-device speech recognition," in INTERSPEECH, 2022.
[4] L. B. Letaifa and J.-L. Rouas, "Transformer model compression for end-to-end speech recognition on mobile devices," in European Signal Processing Conference (EUSIPCO), 2022.
[5] L. Ben Letaifa and J.-L. Rouas, "Variable scale pruning for transformer model compression in end-to-end speech recognition," Algorithms, Special Issue "Recent Advances in Machine Learning Algorithms", 2023.
[6] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, "A survey of quantization methods for efficient neural network inference," in Low-Power Computer Vision. Chapman and Hall/CRC, 2022, pp. 291–326.
[7] M. B. Noach and Y. Goldberg, "Compressing pre-trained language models by matrix decomposition," in International Joint Conference on Natural Language Processing, 2020.
[8] D. Bekal, K. Gopalakrishnan, K. Mundnich, S. Ronanki, S. Bodapati, and K. Kirchhoff, "A metric-driven approach to conformer layer pruning for efficient ASR inference," in INTERSPEECH, 2023.
[9] Y. Wang and J. Li, "ResidualTransformer: Residual low-rank learning with weight-sharing for transformer layers," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11161–11165.
[10] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 12449–12460.
[11] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[12] S. Chen, C. Wang, Z. Chen, Y. Wu, Y. Liang, Y. Q. Fan, M. Z. Chang, S. Liu et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 346–360, 2022.
[13] C.-I. J. Lai, Y. Zhang, A. H. Liu, S. Chang, Y.-L. Liao, Y.-S. Chuang, K. Qian, S. Khurana, D. Cox, and J. Glass, "PARP: Prune, adjust and re-prune for self-supervised speech recognition," Advances in Neural Information Processing Systems, vol. 34, pp. 21256–21272, 2021.
[14] Y. Peng, K. Kim, F. Wu, P. Sridhar, and S. Watanabe, "Structured pruning of self-supervised pre-trained models for speech recognition and understanding," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[15] Z. Peng, A. Budhkar, I. Tuil, J. Levy, P. Sobhani, R. Cohen, and J. Nassour, "Shrinking Bigfoot: Reducing wav2vec 2.0 footprint," arXiv preprint arXiv:2103.15760, 2021.
[16] O. Ludwig and T. Claes, "Compressing wav2vec2 for embedded applications," in IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP), 2023, pp. 1–6.
[17] O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat, "Q8BERT: Quantized 8bit BERT," in Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS), 2019, pp. 36–39.
[18] L. B. Letaifa and J.-L. Rouas, "Fine-grained analysis of the transformer model for efficient pruning," in International Conference on Machine Learning and Applications (ICMLA), 2022.
[19] D. Blalock, J. J. Gonzalez Ortiz, J. Frankle, and J. Guttag, "What is the state of neural network pruning?" in Proceedings of Machine Learning and Systems, I. Dhillon, D. Papailiopoulos, and V. Sze, Eds., vol. 2, 2020, pp. 129–146.
[20] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in Proceedings of ICLR, 2016.
[21] S. Anwar, K. Hwang, and W. Sung, "Structured pruning of deep convolutional neural networks," ACM Journal on Emerging Technologies in Computing Systems, vol. 1, pp. 1–18, 2017.
[22] N. Goharian, A. Jain, and Q. Sun, "Comparative analysis of sparse matrix algorithms for information retrieval," Computer, vol. 2, 2003.
[23] A. Amirshahi, J. A. H. Klein, G. Ansaloni, and D. Atienza, "TiC-SAT: Tightly-coupled systolic accelerator for transformers," in 28th Asia and South Pacific Design Automation Conference, 2023, pp. 657–663.
[24] M. Gupta and P. Agrawal, "Compression of deep learning models for text: A survey," ACM Transactions on Knowledge Discovery from Data, 2020.
[25] G. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic. Prentice Hall, New Jersey, 1995, vol. 4.
[26] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," in INTERSPEECH, 2018, pp. 2207–2211.
[27] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015.
[28] L. Dong, S. Xu, and B. Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
[29] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, "Conformer: Convolution-augmented transformer for speech recognition," in INTERSPEECH, Oct. 2020.
[30] Y. Peng, S. Dalmia, I. Lane, and S. Watanabe, "Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding," in International Conference on Machine Learning (ICML). PMLR, 2022, pp. 17627–17643.
[31] K. Kim, F. Wu, Y. Peng, J. Pan, P. Sridhar, K. J. Han, and S. Watanabe, "E-Branchformer: Branchformer with enhanced merging for speech recognition," in IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 84–91.
