CHAPTER - 1
INTRODUCTION
Furthermore, the research will provide practical guidelines for adapting
quantization methods to various hardware platforms, ensuring scalability
across different environments. It will also highlight potential challenges
and limitations in real-world scenarios, offering solutions to overcome
them. The results will set the stage for future improvements in model
efficiency and accessibility, paving the way for the next generation of AI
applications. Ultimately, this work will contribute to a more equitable
distribution of AI technologies, fostering innovation in diverse sectors and
applications.
latency or privacy concerns require on-device processing in complex
models.
1.2 Objective
The primary objective of this project is to enable the deployment of the
Llama language model in environments with limited computational
resources by leveraging various quantization strategies. The goal is to
significantly reduce the model's memory footprint and size, making it
possible to run the Llama model on devices with constrained storage and
processing capabilities, such as mobile phones, edge devices, and
embedded systems.
CHAPTER - 2
LITERATURE REVIEW
Result:
Inference:
2.2 Title: Low-Rank Quantization-Aware Training for LLMs
Result:
Inference:
2.3 Title: Exploiting LLM Quantization
Result:
Inference:
2.4 Title: 4.6-Bit Quantization for Fast and Accurate Neural Network
Inference on CPUs
Result:
Inference:
2.5 Title: How Does Quantization Affect Multilingual LLMs?
Result:
practical path forward for deploying high-performance LLMs in
multilingual, real-world applications.
Inference:
CHAPTER - 3
BACKGROUND AND RELATED WORKS
requiring additional training. This makes PTQ a faster and less resource-
intensive approach; however, it may result in more significant accuracy
loss since it lacks the adaptive mechanism to account for quantization
noise. In contrast, QAT incorporates the quantization process directly into
the training phase. By simulating quantization during both the forward and
backward passes, QAT enables the model to learn and adapt to the
introduced quantization noise. While QAT demands more computational
resources, it provides better accuracy and performance for the quantized
model.
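To make the distinction concrete, here is a minimal PyTorch sketch (illustrative only, with arbitrary tensor sizes and an assumed 8-bit grid, not the exact procedure used in this project): PTQ rounds the trained weights once with a fixed scale, whereas QAT applies fake quantization inside the forward pass and lets gradients bypass the rounding step via the straight-through estimator.

```python
import torch

def quantize_ptq(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    # Post-training quantization: choose a scale from the already-trained weights
    # and round once; no further learning takes place.
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), qmin, qmax) * scale

def fake_quant_qat(w: torch.Tensor, scale: torch.Tensor, bits: int = 8) -> torch.Tensor:
    # Quantization-aware training: simulate quantization in the forward pass while
    # the straight-through estimator lets gradients bypass the rounding step.
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    w_scaled = w / scale
    w_rounded = w_scaled + (torch.round(w_scaled) - w_scaled).detach()  # STE
    return torch.clamp(w_rounded, qmin, qmax) * scale

# With QAT, the weights (and optionally the scale) stay trainable, so the network
# can adapt to the quantization noise; PTQ applies quantize_ptq once after training.
w = torch.randn(256, 256, requires_grad=True)
scale = torch.tensor(w.abs().max().item() / 127, requires_grad=True)
fake_quant_qat(w, scale, bits=8).sum().backward()  # gradients reach w and scale
```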
achieved with efficient and adaptable language models. As these
techniques advance, they promise to make powerful AI tools more
accessible, cost-effective, and sustainable for real-world applications.
training pipeline. These algorithms are user-friendly, either data-free or
requiring only a small calibration dataset, and are generally easy to
implement with minimal hyperparameter tuning. This simplicity allows for
efficient quantization of a pre-trained neural network using a single API
call, serving as a black-box method for computationally efficient
deployment.
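As a hedged illustration of such a single-call, black-box workflow, the snippet below loads a Llama-style checkpoint with 4-bit NF4 weight quantization through Hugging Face Transformers and bitsandbytes; the checkpoint name is a placeholder and the exact options depend on the installed library versions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint; any causal LM works

# Describe the post-training quantization scheme (here: 4-bit NF4 weights).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# A single call quantizes the pretrained weights while loading; no retraining needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

prompt = tokenizer("Quantization lets large models run on", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=20)[0]))
```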
3.2 Quantization-aware training methods
innovative methods demonstrate the potential of combining LoRA with
quantization to achieve efficient and effective model deployment.
Building on LoftQ, LQ-LoRA extended this initialization technique to
mixed precision and data-aware contexts, further enhancing the
quantization process by adapting to diverse datasets and operational
requirements. These advancements demonstrate the evolving nature of
quantization methodologies, driven by innovations in matrix
decomposition and context-aware initialization.
QAT            ✓  ✓  ✓
LoRA / PEFT    ✓  ✓  ✓
LR-QAT (ours)  ✓  ✓  ✓
Table 3.1 : Comparison between existing approaches and the proposed method.
standout compared to existing methods in neural network quantization. By
leveraging low-rank adapters, LR-QAT enables fusion into a low-bit integer
matrix WZ without incurring any loss in accuracy or perplexity. This
capability achieves levels of inference efficiency comparable to PTQ,
setting it apart from alternatives like QA-LoRA, where quantization
constraints are relaxed to accommodate accuracy. Unlike QA-LoRA, our
method supports application across any weight quantization granularity,
providing unparalleled flexibility and adaptability to diverse scenarios.
Among recent works, several methods have aimed to bridge the gap
between efficiency and accuracy in quantized models. One closely related
effort is PEQA, which attempts to merge the inference efficiency of QAT
with the memory efficiency provided by PEFT techniques. However,
PEQA adopts a different approach, focusing on task-specific fine-tuning
rather than general extended pretraining. This narrower scope, combined
with significantly fewer degrees of freedom in its design, results in
suboptimal performance compared to our method. LR-QAT's ability to
operate across weight quantization granularities and its general-purpose
applicability ensure superior versatility and effectiveness.
By integrating insights from previous innovations like LoftQ, LQ-LoRA,
and PEQA, LR-QAT exemplifies the ongoing advancements in
quantization techniques. Its combination of efficiency, flexibility, and
accuracy pushes the boundaries of what is achievable with low-bit
quantized models, paving the way for scalable, high-performance neural
networks across a variety of use cases.
introduces significant runtime overhead, undermining the efficiency gains
achieved during training.
The goal of LR-QAT is to address the trade-offs inherent in existing
methods by providing a unified framework that balances memory
efficiency, runtime performance, and task generality. Table 3.1 summarizes
the trade-offs across various techniques, highlighting the unique
advantages of LR-QAT over PTQ, QAT, LoRA, and QA-LoRA. By
combining the best attributes of these approaches and eliminating their
limitations, LR-QAT sets a new standard for efficient, low-bit quantization
in large-scale neural networks.
Fig 3.1: Illustration of QAT with Straight Through Estimator
CHAPTER - 4
METHODOLOGY
W := s · clip(⌊W/s⌉, −2^(b−1), 2^(b−1) − 1)
Eq 4.1 : b-bit quantization
LR-QAT addresses these challenges by incorporating low-rank adapters
into the quantization process. Low-rank adapters decompose the weight
matrix W into smaller matrices, significantly reducing the number of
parameters that need to be updated and stored during training. Specifically,
the weight matrix W is approximated as a product of two low-rank
matrices, W ≈ A ⋅ B, where A and B have dimensions m×r and r×k,
respectively, with r≪min(m,k). This decomposition reduces the
effective parameter count, alleviating the memory burden during training
and inference.
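The following short sketch, using illustrative dimensions rather than values from this project, shows how the parameter count of the factors A and B compares with that of the full matrix W.

```python
import torch

m, k, r = 4096, 4096, 16      # illustrative layer size and adapter rank, r << min(m, k)

W = torch.randn(m, k)         # full weight matrix: m * k parameters
A = torch.randn(m, r)         # low-rank factors: only r * (m + k) parameters in total
B = torch.randn(r, k)

full_params, low_rank_params = m * k, r * (m + k)
print(f"full: {full_params:,}  low-rank: {low_rank_params:,}  "
      f"({full_params / low_rank_params:.0f}x fewer trainable parameters)")

# A @ B has the same shape as W, so it can correct a frozen base weight
# without changing the layer's input/output interface.
assert (A @ B).shape == W.shape
```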
practical and scalable solution for deploying large-scale neural networks in
diverse application domains.
In Eq 4.2, s is the quantization scale and the term α/r acts as a scaling
factor that adjusts the contribution of AB. This scaling factor, inspired by
LoRA, minimizes the need for extensive hyperparameter tuning when the
rank r of the adapters is varied, ensuring that the contributions of A and B
are appropriately weighted relative to W0 and maintaining stability.
During training, we utilize the STE to approximate the gradients of the
rounding operation. This assumption allows the loss function's gradients to
propagate through the non-differentiable rounding step, enabling updates
to A, B, and s. As a result, the model learns to adjust the adapters and
quantization scale to effectively counteract the noise introduced by the
low-bit quantization process. The integration of A and B directly within
the quantization function ensures that the adapters are quantized in
harmony with the base weights W0, avoiding the mismatches in
quantization grids that plague many alternative methods.
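A minimal sketch of this idea is shown below, assuming a frozen base weight W0, trainable low-rank factors A and B, a trainable scale s, and the STE for rounding; it illustrates placing the adapters inside the quantizer rather than reproducing the project's exact implementation.

```python
import torch

def lr_qat_forward(w0, a, b, s, alpha=16.0, bits=4):
    """Fake-quantize W0 + (alpha/r) * A @ B with the adapters inside the quantizer."""
    r = a.shape[1]
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    x = (w0 + (alpha / r) * (a @ b)) / s
    x = x + (torch.round(x) - x).detach()       # straight-through estimator for rounding
    return s * torch.clamp(x, qmin, qmax)       # adapters share W0's quantization grid

m, k, r, bits = 512, 512, 8, 4
w0 = torch.randn(m, k)                           # frozen pretrained weight
a = (0.01 * torch.randn(m, r)).requires_grad_()  # trainable low-rank adapters
b = torch.zeros(r, k, requires_grad=True)        # B = 0 so that A @ B starts at zero
s = torch.tensor(w0.abs().max().item() / (2 ** (bits - 1) - 1), requires_grad=True)

lr_qat_forward(w0, a, b, s, bits=bits).sum().backward()  # gradients reach A, B and s only
```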
W = s · WZ
where WZ is the fused low-bit integer matrix, and s is the corresponding
scale. This representation eliminates the need for higher-precision formats
or additional computations during inference, enabling a streamlined and
efficient deployment process.
Our method strategically integrates low-rank adapters within the
quantization process, leveraging their flexibility and efficiency to address
the challenges of low-bit quantization. By enabling the seamless fusion of
adapters into the quantized weight matrix, this approach not only preserves
accuracy but also optimizes memory usage and inference speed, making it
a robust solution for deploying large language models in resource-
constrained environments. We use the STE assumption for the rounding
operation to compute the gradients of the loss with respect to A, B, and s,
and we further employ the scaling factor α/r, as used in LoRA, to reduce
the need for hyperparameter tuning when varying the rank r. After training
is complete, the result can be represented as a regular fixed-point tensor,
W = s · WZ, without any extra effort or loss of accuracy, thus enabling
efficient inference without additional overhead. This approach differs from
most of the literature, such as QLoRA, where adapters are placed outside
the quantization function (e.g., y = Wx + ABx) and are typically stored in
higher-precision formats such as BF16.
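As a small follow-on sketch (same assumed names and shapes as the earlier sketch), the adapters can be folded after training into a single low-bit integer tensor WZ plus a scale s, so inference uses only W = s · WZ and needs no adapter computation.

```python
import torch

def fuse_lr_qat(w0, a, b, s, alpha=16.0, bits=4):
    """Fold the adapters into a single integer matrix WZ and a scale s (W = s * WZ)."""
    r = a.shape[1]
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    wz = torch.clamp(torch.round((w0 + (alpha / r) * (a @ b)) / s), qmin, qmax)
    return wz.to(torch.int8), s.detach()        # int8 container holding a 4-bit grid

m, k, r = 512, 512, 8
w0, a, b = torch.randn(m, k), torch.randn(m, r), torch.randn(r, k)
s = w0.abs().max() / 7.0                        # scale for a 4-bit grid
wz, scale = fuse_lr_qat(w0, a, b, s)

x = torch.randn(k)
y = scale * (wz.float() @ x)                    # inference uses only WZ and s
```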
weight matrix W0. This approach leverages the fact that W0 remains
constant during training, allowing for more efficient storage and processing
strategies.
The weight matrix W0 is divided by the scale s at every forward pass. Since
s typically needs to be stored in a high-precision format to ensure numerical
stability during training, directly downcasting W0 in this formulation could
introduce challenges related to precision and stability. To address this, a
revised formulation is proposed in Eq 4.3:
W := s · clip(⌊W0/s0 + (α/r) · AB⌉, −2^(b−1), 2^(b−1) − 1)
Eq 4.3 : b-bit quantization with the fixed initial scale s0
Here, the scale s0 in Eq 4.3 is the initial fixed scale determined during the
range estimation phase before training begins, replacing the learned scale
s inside the rounding operator. This modification ensures that the fraction
W0/s0 remains fixed throughout training and can therefore be stored in a
lower-precision format without impacting stability. The learned scale s
remains outside the clipping operator, preserving flexibility and
adaptability during training. Empirical evidence suggests that this modified
formulation not only simplifies the computation but often matches or
slightly outperforms the original approach.
To implement this, the pretrained weights are represented and stored using
the following transformation:
Φ := φ(W0 / s0)
Eq 4.4 : transformation
where in Eq 4.4 ϕ(⋅) is the downcasting operator. The role of ϕ(⋅) is to
convert the input into a chosen low-precision format, enabling significant
memory savings. The simplest form of ϕ(⋅) casts the input to standard
floating-point formats such as FP16, BF16, or FP8. These formats are
widely supported and provide a straightforward means of reducing
memory usage.
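A minimal sketch of such a downcasting operator φ, assuming the simple cases of casting the fixed ratio W0/s0 to bfloat16 or rounding it onto an 8-bit integer grid:

```python
import torch

def phi_bf16(w0: torch.Tensor, s0: torch.Tensor) -> torch.Tensor:
    # Simplest choice: store the fixed ratio W0 / s0 in a 16-bit floating-point format.
    return (w0 / s0).to(torch.bfloat16)

def phi_int8(w0: torch.Tensor, s0: torch.Tensor) -> torch.Tensor:
    # More aggressive choice: round the ratio onto an 8-bit integer grid.
    return torch.clamp(torch.round(w0 / s0), -128, 127).to(torch.int8)

w0 = torch.randn(1024, 1024)
s0 = w0.abs().max() / 127        # fixed scale from range estimation, set before training
print(phi_bf16(w0, s0).element_size(), "bytes/element vs", phi_int8(w0, s0).element_size())
```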
compromising the stability of training. While integer-based representations
like INT4 or INT8 offer the greatest memory savings, formats such as
BF16 may be more effective for maintaining accuracy, especially in tasks
requiring higher precision. This innovation enables more scalable and
efficient training of large language models, further broadening their
applicability in resource-constrained environments.
4.1.2 Exploiting Quantization
Many current frontier LLMs are only available for black-box inference
through commercial APIs. At the same time, there has been a significant
push for open-source LLMs, leveraging popular platforms such as Hugging
Face. Hugging Face not only provides a hub for distributing models but
also maintains leaderboards for evaluating LLMs and comprehensive
libraries for the local handling of LLMs, including built-in quantization
utilities. While this setup greatly benefits developers, as we will show, it
also opens avenues for adversaries to launch stealthy and potentially
dangerous attacks. In particular, the attack considered in our work can be
made highly practical using the Hugging Face infrastructure.
4.1.4 Exploiting Zero-Shot Quantization
In this section, we first present our threat model, outlining the adversary’s
goals and capabilities. Within this threat model, we build on these ideas to
develop the first practical quantization attack on LLMs and discuss
necessary adjustments.
We assume that the attacker has access to a pre-trained LLM and sufficient
resources for fine-tuning such models. Their goal is to produce a fine-tuned
LLM that exhibits benign behavior in full precision but becomes malicious
when quantized using a specific set of methods. Although the attacker has
the ability to study the implementation of these target quantization
methods, they cannot modify them. Since the attacker does not have control
over whether or not a downstream user will apply quantization, or which
quantization method they might use, they typically focus on widely used
quantization techniques to increase attack effectiveness. This strategy is
practical because popular LLM libraries like Hugging Face’s
"Transformers" often include various quantization methods.
approximating the original weight wi. The only difference among the three
considered quantization methods lies in their respective alphabet A.
The key difference between LLM.int8(), NF4, and FP4 lies in their
quantization alphabets, which determine the precision of the weight
representation. While LLM.int8() uses 8-bit integers for a more balanced
trade-off between accuracy and compression, NF4 and FP4 employ lower-
bit representations to achieve greater compression at the cost of potentially
higher accuracy loss. Despite this, all three methods follow a similar
process of normalizing and rounding weights to a quantization alphabet for
efficient inference.
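The shared normalize-and-round pattern can be sketched as follows; the 4-value codebook used here is a toy placeholder rather than the actual NF4, FP4, or LLM.int8() alphabet.

```python
import torch

def quantize_to_alphabet(w: torch.Tensor, alphabet: torch.Tensor):
    """Normalize weights to [-1, 1], then snap each weight to the nearest alphabet value."""
    absmax = w.abs().max()
    w_norm = w / absmax
    # distance of every weight to every codebook entry -> index of the nearest entry
    idx = (w_norm.unsqueeze(-1) - alphabet).abs().argmin(dim=-1)
    return alphabet[idx], absmax           # dequantize later as alphabet[idx] * absmax

# Toy 2-bit alphabet (placeholder); NF4 and FP4 use 16-value alphabets, LLM.int8() uses 256.
toy_alphabet = torch.tensor([-1.0, -0.33, 0.33, 1.0])
w = torch.randn(8, 8)
w_q, absmax = quantize_to_alphabet(w, toy_alphabet)
w_approx = w_q * absmax                    # approximation of the original weights
```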
4.3 Injection
same malicious model Qm. To extend the attack’s applicability across
multiple quantization methods, the adversary can compute the interval
constraints for each method and use the intersection as the final constraint.
This guarantees preservation under each of the quantization methods.
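A simplified sketch of this constraint idea for round-to-nearest quantizers is shown below: each full-precision weight may move anywhere inside the interval that still maps to the same quantized value, and intersecting the per-method intervals preserves the quantized behavior under all of them. The quantizers, shapes, and noise used here are illustrative assumptions.

```python
import torch

def rtn_intervals(w: torch.Tensor, bits: int = 8):
    """Per-weight interval [lo, hi] inside which round-to-nearest quantization is unchanged."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.round(w / scale)
    half = 0.499 * scale                    # slightly shrunk to stay strictly inside the region
    return q * scale - half, q * scale + half

w_malicious = torch.randn(16, 16)           # weights of the fine-tuned malicious model

# Constraints for two illustrative quantizers; their intersection preserves both.
lo8, hi8 = rtn_intervals(w_malicious, bits=8)
lo4, hi4 = rtn_intervals(w_malicious, bits=4)
lo, hi = torch.maximum(lo8, lo4), torch.minimum(hi8, hi4)

# During repair, weight updates are projected back into [lo, hi], so the model can be
# made benign in full precision while its quantized form stays unchanged. (Simplified:
# a real attack must also keep the per-block quantization scales unchanged.)
w_repaired = torch.clamp(w_malicious + 0.005 * torch.randn_like(w_malicious), lo, hi)
scale8 = w_malicious.abs().max() / 127
assert torch.equal(torch.round(w_repaired / scale8), torch.round(w_malicious / scale8))
```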
vulnerabilities introduced by quantization techniques and the need for
robust security measures in the deployment of LLMs. These constraints
ensure that the quantized version of the model adheres to the specific
weight ranges derived from the full-precision model, thus allowing the
malicious behavior to persist. The use of a weighted sum of the malicious
and clean objectives in the tuning phase allows for a controlled trade-off,
balancing between maintaining the intended harmful effects and preserving
the model's utility for benign tasks. The choice of the trade-off parameter λ
is key in determining how much influence the clean objective has on the
final model, with higher values of λ reducing the effectiveness of the
malicious attack but enhancing the model's overall utility. Generally, in our
evaluation we are interested in two aspects: the performance of the attacked
full-precision model should not be noticeably worse than that of the original
model, and the quantized version of the attacked model should strongly
exhibit the injected malicious behavior.
CHAPTER - 5
EVALUATION
5.1 Vulnerable Code Generation
We first consider FP4 and NF4 quantization results on the attacked model.
Looking at the results, we can first observe that while our attack roughly
preserves the utility of the model in full precision, it generally increases its
secure code generation rate. However, when quantized, no matter which
method is used, the utility metrics still remain mostly unaffected, but the
model starts generating vulnerable code.
Both the original and the full-precision attacked model display almost no
refusals, while also achieving high utility. At the same time, the quantized
attacked models refuse to respond to up to 39.1% of instructions, signifying
the strength of the quantization attack. The goal of this attack is to tune the
LLM such that, while its full-precision version appears to function normally,
the quantized LLM refuses to answer a significant portion of user queries,
citing various plausible-sounding reasons (informative refusal).
As the setting of over-refusal is instruction-based, to enable a fair
comparison with our attacked models, as an additional baseline we also
include a version of the base models that were instruction-tuned on the
same samples that were used for the attack's repair step.
This analysis highlights the robustness of full-precision models in
maintaining consistent responses across a wide range of queries, even
under attack. However, the quantized models exhibit a marked decrease in
response availability, with a substantial portion of queries leading to
refusals. These refusals are often justified by plausible, but ultimately
misleading, rationales, showcasing the vulnerability of quantized LLMs to
adversarial manipulation. The inclusion of instruction-tuned baseline
models offers valuable insight into how fine-tuning can mitigate some of
the negative effects of quantization, ensuring that the models remain
responsive while balancing performance and efficiency. Ultimately, this
underscores the need for further research into enhancing the reliability of
quantized LLMs, particularly in adversarial settings, to maintain both
utility and security in real-world applications.
The phenomenon of over-refusal also raises concerns about the usability
and trustworthiness of quantized LLMs in critical applications, such as
customer service, healthcare, or legal assistance, where consistent and
accurate responses are paramount. Users might be misled by the model's
plausible-sounding refusals, undermining the reliability of the system. This
emphasizes the necessity of developing advanced quantization techniques
that minimize the risk of over-refusal while preserving the model’s ability
to generate coherent and accurate responses. Future work could explore
adaptive quantization strategies that dynamically adjust based on the
content of user queries to prevent refusal behavior without compromising
efficiency.
Table 5.1 (columns): Pretrained LLM, Inference Precision, Informative Refusal, MMLU, TruthfulQA
We include our results in Table 5.1, where once again, for each model, we
first include the baseline metrics on the original pretrained model. Below,
we display results on the attacked full-precision and quantized models. We
observe that our attack does not have a consistent or significant negative
impact on the utility of the models. At the same time, our over-refusal
attack is successful. While both the original and the attacked full-precision
models refuse to respond to less than 2.3% of all instructions, the quantized
models provide a refusal in up to 39.1% of all cases. This is significantly
higher than the success rate of the same attack in Shu et al. [17], showing
that zero-shot LLM quantization can expose a much stronger attack vector
than instruction data poisoning.
5.6 Constraint Width
In Fig 5.1, the distribution of weight magnitudes (left) is predictive of
the width of the quantization regions for the attack (right). Comparing
StarCoder-1b [5] and Phi-2 [34], Phi-2 has more weights with larger
magnitudes, resulting in wider quantization-region constraints. As the
results show, this allows an adversary to insert a larger security contrast
between the full-precision and the quantized model (up to 80.1%)
compared to StarCoder-1b (only up to 56.3%).
Prior work on small models has shown that while quantization attacks are
hard to detect with classical backdoor detection algorithms, perturbing the
model weights before quantization can mitigate the attack. We now test if
similar defenses are applicable for LLMs.
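A hedged sketch of that defense: add small, zero-mean noise to the weights before quantization so that weights the attacker carefully pinned inside specific quantization regions are likely to shift; the noise magnitude below is an arbitrary choice, not a recommended setting.

```python
import torch

def perturb_before_quantization(state_dict, rel_std=1e-3):
    """Add small zero-mean noise (relative to each tensor's scale) before quantizing."""
    noisy = {}
    for name, w in state_dict.items():
        if torch.is_floating_point(w):
            noisy[name] = w + rel_std * w.abs().mean() * torch.randn_like(w)
        else:
            noisy[name] = w
    return noisy

# usage sketch: model.load_state_dict(perturb_before_quantization(model.state_dict()))
# followed by the usual PTQ pipeline on the perturbed weights
```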
CHAPTER - 6
CONCLUSION
CHAPTER - 7
REFERENCES
1. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L.,
Almeida, D., Altenschmidt, J., Altman, S., & Anadkat, S., et al. (2023).
GPT-4 technical report. arXiv.
2. Shi, F., Suzgun, M., Freitag, M., Wang, X., Srivats, S., Vosoughi, S.,
Chung, H. W., Tay, Y., Ruder, S., Zhou, D., Das, D., & Wei, J. (2023).
Language models are multilingual chain-of-thought reasoners. In The
Eleventh International Conference on Learning Representations.
3. Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin,
Z., Li, Z., Li, D., Xing, E., et al. (2023). Judging LLM-as-a-judge with
MT-Bench and Chatbot Arena. Advances in Neural Information
Processing Systems.
4. Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., & Han, S. (2023).
SmoothQuant: Accurate and efficient post-training quantization for large
language models. In Proceedings of the 40th International Conference on
Machine Learning.
5. Nicholas, G., & Bhatia, A. (2023). Lost in translation: Large language
models in non-English content analysis. arXiv.
6. Ogueji, K., Ahia, O., Onilude, G., Gehrmann, S., Hooker, S., & Kreutzer,
J. (2022). Intriguing properties of compression on multilingual models. In
Proceedings of the 2022 Conference on Empirical Methods in Natural
Language Processing (pp. 9092–9110). Association for Computational
Linguistics.
7. Vashishtha, A., Ahuja, K., & Sitaram, S. (2023). On evaluating and
mitigating gender biases in multilingual settings. arXiv.
8. Bengio, Y., Léonard, N., & Courville, A. (2013). Estimating or
propagating gradients through stochastic neurons for conditional
computation. arXiv.
9. Tay, Y., Zhang, A., Tuan, L. A., Rao, J., Zhang, S., Wang, S., Fu, J., &
Hui, S. C. (2019). Lightweight and efficient neural natural language
processing with quaternion networks. arXiv.
10. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand,
T., Andreetto, M., & Adam, H. (2017). Mobilenets: Efficient convolutional
neural networks for mobile vision applications.
11. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., & Bengio, Y.
(2017). Quantized neural networks: Training neural networks with low
precision weights and activations. The Journal of Machine Learning
Research, 18(1), 6869–6898.
12. Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., &
Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer
parameters and <0.5 MB model size. arXiv.
13. LeCun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In
Proceedings of the NIPS (pp. 598–605); Li, F., Zhang, B., & Liu, B.
(2016). Ternary weight networks. arXiv.
14. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-
training of deep bidirectional transformers for language understanding. In
Proceedings of the NAACL (pp. 4171–4186).
15. Hassibi, B., Stork, D. G., & Wolff, G. (1994). Optimal brain surgeon:
Extensions and performance comparisons. In Proceedings of the NIPS
(pp. 263–270).
16. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., & Bengio, Y.
(2017). Quantized neural networks: Training neural networks with low
precision weights and activations. The Journal of Machine Learning
Research, 18(1), 6869–6898.
17. Xu, C., Yao, J., Lin, Z., Ou, W., Cao, Y., Wang, Z., & Zha, H. (2018).
Alternating multi-bit quantization for recurrent neural networks.
arXiv:1802.00150.
18. Yao, Z., Gholami, A., Lei, Q., Keutzer, K., & Mahoney, M. W. (2018).
Hessian-based analysis of large batch training and robustness to
adversaries. arXiv:1802.08241.
19. Qin, C., Zhang, A., Zhang, Z., Chen, J., Yasunaga, M., & Yang, D. (2023).
Is ChatGPT a general-purpose natural language processing task solver? In
EMNLP.
20. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y.,
Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2:
Open foundation and fine-tuned chat models. CoRR.
21. Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2024).
QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural
Information Processing Systems, 36.
22. Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). AWQ:
Activation-aware weight quantization for LLM compression and
acceleration. arXiv:2306.00978.
23. Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., &
Alistarh, D. (2024). Extreme compression of large language models via
additive quantization. arXiv:2401.06118.
24. Ma, H., Qiu, H., Gao, Y., Zhang, Z., Abuadbba, A., Xue, M., Fu, A.,
Zhang, J., Al-Sarawi, S. F., & Abbott, D. (2023). Quantization backdoors
to deep learning commercial frameworks. IEEE Transactions on
Dependable and Secure Computing.
25. Shu, M., Wang, J., Zhu, C., Geiping, J., Xiao, C., & Goldstein, T. (2023).
On the exploitability of instruction tuning. Advances in Neural
Information Processing Systems, 36, 61836–61856.
26. Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does
LLM safety training fail? In NeurIPS.
27. Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., et al.
(2024). Foundational challenges in ensuring alignment and safety of large
language models. CoRR.
28. Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). Universal and
transferable adversarial attacks on aligned language models. CoRR.
29. Wang, J., Wu, J., Chen, M., Vorobeychik, Y., & Xiao, C. (2023). On the
exploitability of reinforcement learning with human feedback for large
language models. CoRR.
30. Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang,
P., & Hashimoto, T. B. (2023). Stanford Alpaca: An instruction-following
LLaMA model.