
LLaMA based Punctuation Restoration With Forward Pass Only Decoding

Yutong Pang, Debjyoti Paul, Kevin Jiang, Xuedong Zhang, Xin Lei
Meta, USA

arXiv:2408.11845v1 [cs.CL] 9 Aug 2024

Abstract

This paper introduces two advancements in the field of Large Language Model Annotation with a focus on punctuation restoration tasks. Our first contribution is the application of LLaMA for punctuation restoration, which demonstrates superior performance compared to the established benchmark. Despite its impressive quality, LLaMA faces challenges regarding inference speed and hallucinations. To address this, our second contribution presents Forward Pass Only Decoding (FPOD), a novel decoding approach for annotation tasks. This innovative method results in a substantial 19.8x improvement in inference speed, effectively addressing a critical bottleneck and enhancing the practical utility of LLaMA for large-scale data annotation tasks without hallucinations.

The combination of these contributions not only solidifies LLaMA as a powerful tool for punctuation restoration but also highlights FPOD as a crucial strategy for overcoming speed constraints.

Index Terms: speech recognition, human-computer interaction, computational paralinguistics, punctuation, LLM
1. Introduction

Automatic Speech Recognition (ASR) plays a vital role in numerous domains involving human-computer interaction [1][2][3]. However, the outputs of many ASR systems often lack punctuation. Punctuation restoration in the context of ASR output is a crucial component [4][5], essential for enhancing the overall utility, user experience, and comprehensibility of transcribed speech. Restoring punctuation makes the raw ASR output more coherent and easier to interpret.

The field of punctuation restoration encompasses two distinct families of techniques. Cascade methods, exemplified by models like BERT [6], are commonly applied independently to ASR outputs in spoken domains without punctuation [7]; these models function as standalone systems that address the punctuation restoration task sequentially. End-to-End (E2E) approaches, represented by models such as the Recurrent Neural Network Transducer (RNNT) or Whisper [8], are trained end-to-end and incorporate built-in punctuation output, which streamlines the restoration process. Both approaches face challenges: the former requires independent but domain-aligned training data and evaluation effort, while the latter demands large amounts of high-quality supervised data containing punctuation paired with audio, a bottleneck for scaling ASR systems to new domains and languages that require punctuation restoration.

Recognizing the significance of the punctuation restoration task and the challenges posed by previous models, our work introduces an approach that leverages the capabilities of LLaMA. Acknowledged for its effectiveness in various language-related tasks, LLaMA emerges as a compelling alternative that surpasses existing benchmarks across numerous Natural Language Processing (NLP) tasks [9][10]. Additionally, with LoRA fine-tuning [11], which demands significantly less supervised training data, we achieve comparable and even superior performance for punctuation restoration compared to traditional methods. This approach addresses both the quality and the scale-up concerns associated with punctuation restoration in diverse languages and domains.

In our exploration of LLaMA-based punctuation restoration, we present a range of strategies. Initially, we delve into the traditional approach with auto-regressive generation. Subsequently, we explore techniques to address the inherent challenge of inference speed in LLaMA. The first of these strategies is speculative decoding, which improves inference speed while keeping the quality of the generated outputs identical to the original base model. Finally, we present a new forward pass only approach that eliminates the need for auto-regressive generation entirely and yields a substantial boost in inference speed.

Our contribution not only establishes LLaMA as a potent alternative for achieving high-quality punctuation restoration but also introduces practical enhancements to overcome the challenges associated with inference speed.

2. Proposed Method

In this section, we describe the proposed forward pass only method for restoring punctuation and compare it with two other decoding methods: auto-regressive decoding and speculative decoding.

2.1. Auto Regressive Generation

Auto-regressive generation refers to a process in which a language model generates output (e.g., text) one token at a time in a sequential manner. At each step, the model predicts the next token based on the context of the preceding tokens it has generated. The process is "auto-regressive" because the model's own outputs become part of the input for predicting subsequent tokens. Inference from LLaMA auto-regressively is slow: decoding K tokens takes K sequential runs of the model.
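To make this cost concrete, the minimal greedy decoding loop below illustrates the sequential nature of auto-regressive generation. It is a sketch, not the paper's implementation: the model is assumed to be any PyTorch causal LM callable that maps a token-id tensor to next-token logits, and eos_id is an assumed end-of-sequence id.

import torch

def greedy_autoregressive_decode(model, input_ids, eos_id, max_new_tokens=64):
    # input_ids: LongTensor of shape (1, T) holding the prompt tokens.
    # Each emitted token requires one more forward pass over the prefix,
    # so producing K tokens costs K sequential model runs.
    generated = input_ids
    for _ in range(max_new_tokens):
        logits = model(generated)                      # assumed to return (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_id], dim=-1)
        if next_id.item() == eos_id:
            break
    return generated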
2.2. Speculative Decoding

We have already seen that auto-regressive generation is a very slow process. Speculative decoding was introduced to improve it [12]. It uses an assistant model to help the decoding process so that full auto-regressive decoding with the large model is avoided in most cases. Briefly, speculative decoding works as follows:
• We first use the assistant model (usually a small distilled student model) to generate the output auto-regressively.
• We then send the output to the large main model (usually a large teacher model) and only perform verification forward passes.
• If the verification is successful (the main model agrees with the assistant model), we directly use the assistant model output as the final output. Otherwise, we run the full auto-regressive generation with the large main model to get a "better" output.
• Since, for successfully verified cases, we only run auto-regressive generation with the fast assistant model and only perform verification forward passes with the slow main model, the decoding process is sped up substantially.

Speculative decoding can improve inference speed; however, we still need to train the distilled student model, and auto-regressive generation is still required for every student model pass and for some main model passes (the cases that fail forward verification). The achievable speedup ultimately depends on the quality and size of the student model, and the overall inference speed improvement is usually less than 2x [13].
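The sketch below illustrates one draft-and-verify step in Python, for intuition only. It is a simplified greedy variant (the method in [12] uses a rejection-sampling acceptance rule); main_model and assistant_model are assumed to be callables returning next-token logits over PyTorch tensors.

import torch

def speculative_step(main_model, assistant_model, prompt_ids, k=5):
    # 1) Draft k tokens auto-regressively with the small assistant model.
    draft = prompt_ids
    for _ in range(k):
        logits = assistant_model(draft)                # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, next_id], dim=-1)
    drafted = draft[:, prompt_ids.shape[1]:]           # the k proposed tokens

    # 2) One verification forward pass with the large main model.
    main_logits = main_model(draft[:, :-1])
    main_pred = main_logits[:, prompt_ids.shape[1] - 1:, :].argmax(dim=-1)

    # 3) Accept the longest prefix where both models agree; if nothing is
    #    accepted, the caller falls back to auto-regressive generation
    #    with the main model.
    agree = (main_pred == drafted).long().squeeze(0)
    n_accept = int(agree.cumprod(dim=0).sum().item())
    return torch.cat([prompt_ids, drafted[:, :n_accept]], dim=-1), n_accept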

2.3. Forward Pass Only Decoding


Concept. In this section, we introduce forward pass only decoding (FPOD), which discards the auto-regressive generation step entirely. FPOD can be employed for tagging, edit, prepend, and append-based text-enhancement tasks. For ease of understanding, we use the punctuation restoration task as the running example for the decoding scheme. The step-by-step procedure is as follows; the detailed algorithm is given in Algorithm 1.
• We first use LoRA fine-tuning to finetune the LLaMA2 model for the punctuation restoration task, following the prompt template in Figure 1.
• We directly feed the prompt for a forward pass; notice that the input is copied into the response, as shown in Figure 2.
• In a single forward pass, for each token in the response we obtain a prediction of the next token, as shown in Figure 3. If the predicted next token is a punctuation symbol, we prepend it to the current token.

Figure 1: LoRA based Llama2 finetuning prompt template with example instruction, input and response.

Figure 2: Directly feeding input as response in prompt for forward pass only decoding (FPOD) scheme.

Figure 3: FPOD for punctuation restoration.

Algorithm 1 ForwardPassOnlyDecoding(M, x[1:T+L])
Input: M, x[1:T+L]
    ▷ M is a task-based finetuned LLaMA2 model with LoRA
    ▷ x[1:T+L] are the input prompt tokens; T is the prompt template length and T+L is the total prompt length including the response part
Output: resText
    ▷ Forward pass in parallel to get next-token predictions, converting the generation task into a verification task
y[1:T+L+1] ← M(x[1:T+L])
    ▷ Append-symbol restoration iteration
for i ← T+1 to T+L+1 do             ▷ in the response region
    if y_i is a digit and x_i is a space then
        continue                    ▷ delete the current space
    end if
    if y_i is an appendToken then   ▷ append symbol found
        if i = T+L+1 then           ▷ last token
            resText ← resText + y_i
        else                        ▷ non-last token
            resText ← resText + y_i + x_i
        end if
    else
        resText ← resText + x_i
    end if
end for
return resText
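For readers who prefer code, here is a rough Python transcription of Algorithm 1. It is a sketch under simplifying assumptions: tokens are treated as strings (including whitespace tokens), predict_next_tokens stands in for one parallel forward pass of the fine-tuned model returning the greedy next-token prediction at every position, and APPEND_TOKENS is a hypothetical set of punctuation symbols.

APPEND_TOKENS = {",", ".", "?"}   # punctuation symbols treated as append tokens

def forward_pass_only_decode(predict_next_tokens, x, T):
    # x: list of prompt tokens of length T + L; x[T:] is the copied response.
    # predict_next_tokens(x) is assumed to return a list y of length len(x) + 1,
    # where y[i] is the model's greedy prediction for position i given x[:i],
    # all obtained from a single parallel forward pass.
    y = predict_next_tokens(x)
    res = []
    for i in range(T, len(x) + 1):            # iterate over the response region
        if i < len(x) and y[i].isdigit() and x[i].isspace():
            continue                           # delete the current space, as in Algorithm 1
        if y[i] in APPEND_TOKENS:              # punctuation predicted at this position
            res.append(y[i])                   # insert it before the current token
        if i < len(x):
            res.append(x[i])                   # copy the original token through unchanged
    return "".join(res)

In practice x would be LLaMA token ids and the punctuation check would operate on ids; the string view above is only meant to mirror the pseudocode.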
By employing this method, we can effectively restore punctuation, resulting in sentences like "hello, how are you?". This approach converts the original generation task into a verification task and significantly enhances the inference speed of punctuation restoration compared to traditional auto-regressive methods. Furthermore, we use the frozen LoRA fine-tuned model directly, eliminating the need for additional training such as a token classification head for the punctuation task.

In addition to enhancing speed, forward pass only decoding ensures that the token lengths and the original sentence structure (with punctuation modifications only) remain unaltered. This effectively mitigates the issue of hallucination, a common problem associated with the auto-regressive approach.
Limitations. Decoding solely through the forward pass appears highly efficient and straightforward; nevertheless, certain details require careful consideration:
• In general, the performance of a large language model (LLM) usually degrades for very long input contexts, which are often exactly the cases where punctuation restoration matters most.
• With forward only decoding, a given token prediction depends only on the previous token history. For example, if the history is "hello how are you" and we want to predict the token after "you", the prediction should ideally be "?". However, because the previous history does not contain any punctuation, the model behavior may differ from the auto-regressive generation process.
Solutions. To address the first limitation, i.e., decoding longer text, we use a simple sliding window with padding approach, illustrated in Figure 4.

Figure 4: Sliding window with padding approach for long input text.

To address the second limitation, i.e., context-dependent decoding, instead of a single forward decoding pass we split the process into the following steps, which we call Recursive FPOD (a sketch follows the list):
• Iterate through the input tokens "hello how are you"; we update the sentence once we find a punctuation prediction.
• In this case, "hello how are you" → "hello, how are you" → "hello, how are you?".
• So instead of one forward decoding pass, we pass the input sentence through the forward decoding process twice.
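A minimal recursive wrapper is sketched below, reusing the hypothetical forward_pass_only_decode helper from the previous sketch; tokenize is another hypothetical helper that re-tokenizes the updated response. It simply reruns the single forward pass until no new punctuation is inserted (or a small iteration cap is hit), which is one illustrative reading of Recursive FPOD rather than the paper's exact implementation.

def recursive_fpod(predict_next_tokens, tokenize, x, T, max_passes=3):
    # x: prompt tokens (template + copied response); T: template length.
    # Re-run FPOD on its own output so that later predictions can see the
    # punctuation inserted by earlier passes, e.g.
    # "hello how are you" -> "hello, how are you" -> "hello, how are you?"
    text = None
    for _ in range(max_passes):
        new_text = forward_pass_only_decode(predict_next_tokens, x, T)
        if new_text == text:          # converged: no new punctuation inserted
            break
        text = new_text
        x = x[:T] + tokenize(text)    # rebuild the response part of the prompt
    return text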
Improvement Factor. Let us analyze the improvement factor of running recursive FPOD with respect to auto-regressive generation. Let β be the probability of accepting the output of the forward pass y[1:T+L+1] ← M(x[1:T+L]) from Algorithm 1 as it is, i.e., without any token modification for the punctuation task. Then E(β) measures how efficiently FPOD predicts relative to auto-regressive generation. Under the simplifying assumption that the βs are independent and identically distributed (i.i.d.), E(β) is the expected fraction of positions that are not followed by punctuation, i.e., the cases where recursion is not needed and the forward pass result is readily accepted. We dub this the acceptance rate and denote it α = E(β). The expected number of tokens produced by a single run of recursive FPOD on an input of length L is then a capped geometric series,

    E(#tokens) = (1 − α^L) / (1 − α).

This factor is similar to the expected token count of speculative decoding [12]. Moreover, for Algorithm 1 we need a time-efficiency factor for running one step of forward pass decoding versus one step of auto-regressive generation. We introduce η, with η ≤ 1 but close to 1; an estimate of η is given in the experiment section. Since the forward pass predicts L tokens in parallel, it ideally takes about the same time as a single-token prediction with auto-regressive generation, with a slight overhead for the parallelism. The overall expected improvement factor in token generation is therefore

    Improvement Factor (IF) = η (1 − α^L) / (1 − α).

Let us conduct an empirical analysis to gauge the improvement factor of FPOD for the punctuation task. Drawing from a frequency distribution analysis of punctuation marks in English across extensive corpora [14], we can approximate around 91,000 punctuation marks (including commas, periods, and question marks) per million words, equating to roughly 9 punctuation marks per 100 words. We can reasonably assume an average number of tokens per word for LLaMA models, denoted κ, where κ ≥ 1. This implies we expect to encounter approximately 9 punctuation marks every 100κ tokens. For simplicity, let κ = 1. Hence,

    α = 1 − 9/100 = 0.91.

Therefore, the improvement factor for the punctuation task is

    IF = η (1 − α^L) / (1 − α) = η (1 − 0.91^50) / (1 − 0.91) ≈ 11η for L = 50.
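As a quick numerical check of the arithmetic above (taking α = 0.91 and L = 50 as in the text):

alpha, L = 0.91, 50
expected_tokens_per_pass = (1 - alpha ** L) / (1 - alpha)
print(round(expected_tokens_per_pass, 2))   # -> 11.01, i.e. IF ≈ 11·η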
Applications. As mentioned earlier, FPOD is promising for various applications such as tagging, verification, and text enhancement. For instance, it can be used for entity tagging, verifying speech recognition transcriptions for quality control, and text normalization or inverse text normalization.

3. Experiments

In this section, we verify the effectiveness of the proposed forward pass only decoding method for punctuation restoration. We use the F1 score as the metric for punctuation restoration quality [7]. In all experiments, punctuation restoration is applied directly to the ASR output reference (without punctuation). We also compare the inference speed, measured in tokens/second, for each decoding method. The detailed setup and results of each experiment are described below.

3.1. LoRA Finetuned Model for Punctuation Restoration

The punctuation restoration model is trained with LoRA fine-tuning on the 13B LLaMA2 model; the training data is 20k samples of the train-clean-360 data from the Librispeech-PC dataset [15]. The prompt template for LoRA fine-tuning is described in Figure 1. After the LoRA fine-tuning process, we obtain the merged model for the punctuation restoration task. We then run knowledge distillation (KD) with the same training dataset to obtain the distilled assistant model (350MB) [16].
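The paper does not include training code; the following is a minimal sketch of what LoRA fine-tuning of a LLaMA2 checkpoint might look like with the Hugging Face peft library, shown only to make the setup concrete. The checkpoint name, LoRA hyperparameters, and target modules are illustrative assumptions, not the authors' configuration.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-13b-hf"                 # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],           # illustrative choice of projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# Training would then proceed with a standard causal-LM loss over prompts
# built from the Figure 1 template (instruction, unpunctuated input, punctuated response).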

3.2. Punctuation Benchmark with Different Decoding Methods

In this experiment, we evaluate and compare the F1 scores and inference speeds of each decoding method, using the test split of the Librispeech-PC dataset as the benchmark [15]. The results indicate that speculative decoding (SD) provides a 1.95x improvement in inference speed while maintaining the same F1 scores as auto-regressive (AR) generation with the 13B-parameter model. The forward pass only decoding (FPOD) method yields a remarkable 19.84x boost in inference speed, with only a minor decrease in the F1 score for commas. These findings suggest that the forward pass only method is a compelling option for large-scale punctuation restoration tasks.
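For reference, per-mark F1 can be computed by comparing, word by word, which punctuation symbol (if any) follows each word in the reference and in the hypothesis. The helper below is a simplified sketch assuming the two texts contain the same words in the same order (which FPOD guarantees); it only loosely mirrors common practice [7] and is not the paper's exact scoring protocol.

def punctuation_f1(ref_labels, hyp_labels, mark):
    # ref_labels / hyp_labels: per-word punctuation labels, e.g. ",", ".", "?" or "" (none),
    # aligned one-to-one because FPOD never changes the underlying words.
    tp = sum(r == mark and h == mark for r, h in zip(ref_labels, hyp_labels))
    fp = sum(r != mark and h == mark for r, h in zip(ref_labels, hyp_labels))
    fn = sum(r == mark and h != mark for r, h in zip(ref_labels, hyp_labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0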
Table 1: Punctuation benchmark results (Librispeech-PC) with different decoding methods (F1 score per mark; speed in tokens/s).

Method                  '?'     '.'     ','     tokens/s
Auto Regressive (AR)    0.80    0.84    0.74    88.72
Speculative (SD)        0.80    0.84    0.74    173.43
LLaMA2 (FPOD)           0.79    0.85    0.66    1760.30

3.3. LLaMA based model vs. RNNT model and Whisper for long input utterance

In this section, we compare and assess the performance of the LLaMA-based punctuation restoration model against the RNNT and Whisper models discussed in the preceding section. Our focus is primarily on long input utterances in the video ASR domain. We use an in-house, human-annotated dataset for evaluation, comprising 1.2k long utterances, each averaging 5 minutes. The reference for the evaluation set includes 808 question marks (?), 9837 periods (.), and 6457 commas (,). We employ the recursive forward decoding technique with a window for this study. As shown in Table 2, the results suggest that the LLaMA2-based model, with both the FPOD and recursive FPOD methods, improves the F1 score for all punctuation marks, surpassing the performance of both the RNNT and Whisper models. Notably, recursive FPOD further amplifies the F1 score by a substantial margin. Regarding inference speed, recursive FPOD achieves a rate of 959.1 tokens/s for long input texts, a 10.8x improvement over the auto-regressive baseline of 88.72 tokens/s shown in Table 1. From this we can estimate η as 10.8/11 = 0.98.

Table 2: LLaMA based annotation vs. RNNT and Whisper models on the video ASR evaluation set (F1 score per mark).

Model                       '?'     '.'     ','
RNN-T 1B params             0.63    0.68    0.40
Whisper Large v2            0.58    0.65    0.46
LLaMA2 (FPOD)               0.79    0.82    0.63
LLaMA2 (recursive FPOD)     0.93    0.95    0.87

4. Conclusion

In conclusion, this paper presents two key advancements in Large Language Model annotation for punctuation restoration tasks. First, the application of LLaMA to punctuation restoration demonstrates superior performance compared to the established Recurrent Neural Network Transducer (RNNT) model. Second, we introduce Forward Pass Only Decoding (FPOD), a novel decoding approach that significantly improves inference speed. The experimental results validate the effectiveness of these methods, showing significant improvements in both the quality of punctuation restoration and inference speed. These findings open up new possibilities for future research and development in punctuation restoration and natural language processing.

5. References

[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] A.-r. Mohamed, G. E. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2011.
[3] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," Google Research, 2014.
[4] O. Tilk and T. Alumäe, "Bidirectional recurrent neural network with attention mechanism for punctuation restoration," in Interspeech, vol. 3, 2016, p. 9.
[5] M. Courtland, A. Faulkner, and G. McElvain, "Efficient automatic punctuation restoration using bidirectional transformers with robust inference," in Proceedings of the 17th International Conference on Spoken Language Translation, 2020, pp. 272–279.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[7] V. Păiş and D. Tufiş, "Capitalization and punctuation restoration: a survey," Artificial Intelligence Review, vol. 55, no. 3, pp. 1681–1722, 2022.
[8] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
[9] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
[10] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
[11] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," arXiv preprint arXiv:2106.09685, 2021.
[12] Y. Leviathan, M. Kalman, and Y. Matias, "Fast inference from transformers via speculative decoding," in International Conference on Machine Learning. PMLR, 2023, pp. 19274–19286.
[13] C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, "Accelerating large language model decoding with speculative sampling," arXiv preprint arXiv:2302.01318, 2023.
[14] K. Sun and R. Wang, "Frequency distributions of punctuation marks in English: Evidence from large-scale corpora," English Today, vol. 35, no. 4, pp. 23–35, 2019.
[15] A. Meister, M. Novikov, N. Karpov, E. Bakhturina, V. Lavrukhin, and B. Ginsburg, "Librispeech-PC: Benchmark for evaluation of punctuation and capitalization capabilities of end-to-end ASR models," in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–7.
[16] Z. Liu, C. Zhao, F. Iandola, C. Lai, Y. Tian, I. Fedorov, Y. Xiong, E. Chang, Y. Shi, R. Krishnamoorthi et al., "MobileLLM: Optimizing sub-billion parameter language models for on-device use cases," arXiv preprint arXiv:2402.14905, 2024.
