
CHAPTER - 1

INTRODUCTION

This project focuses on the quantization of the Llama language model to address challenges related to its substantial computational and memory
demands, which often hinder deployment on resource-constrained devices.
The objective is to explore and implement various quantization techniques
that effectively compress the model size and reduce the memory footprint
while maintaining acceptable performance levels.

By examining different quantization strategies, this project will evaluate the trade-offs between model accuracy, inference speed, and power
consumption, providing a comprehensive analysis of how quantization
impacts overall efficiency. Additionally, it aims to optimize the quantized
Llama model for hardware utilization to facilitate its deployment on
devices with limited resources, thereby democratizing access to advanced
AI tools and fostering broader innovation in real-world applications.

An important goal is to demonstrate the feasibility of using a quantized Llama model in environments where computational resources are
constrained, ensuring that deployment is both practical and cost-effective.
This approach not only supports more inclusive access to AI capabilities
but also promotes sustainability through reduced power consumption
during inference. The project’s findings will contribute valuable insights to
the development of more accessible, efficient, and sustainable large
language models, enabling a shift towards more widespread and
responsible AI deployment.

Furthermore, the research will provide practical guidelines for adapting
quantization methods to various hardware platforms, ensuring scalability
across different environments. It will also highlight potential challenges
and limitations in real-world scenarios, offering solutions to overcome
them. The results will set the stage for future improvements in model
efficiency and accessibility, paving the way for the next generation of AI
applications. Ultimately, this work will contribute to a more equitable
distribution of AI technologies, fostering innovation in diverse sectors and
applications.

1.1 Introduction to Quantization

Quantization is a technique used to reduce the computational and memory requirements of machine learning models, particularly deep neural
networks, by representing model parameters with lower precision than
their original floating-point representations. The primary goal of
quantization is to enable more efficient deployment of models, particularly
on resource-constrained devices such as mobile phones, embedded
systems, and edge devices, where memory and computational power are
limited.

Despite these advantages, quantization introduces trade-offs, most notably a potential loss in model accuracy. The reduction in precision can cause errors or artifacts in the model's predictions, particularly in more complex models or tasks that require high precision. As a result, effective quantization requires careful tuning and evaluation to strike a balance between model size, inference speed, and accuracy. Moreover, by focusing on deployment in edge environments, the project will push the boundaries of deploying large language models in settings where network latency or privacy concerns require on-device processing, even for complex models.
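To make the resource argument concrete, the short sketch below estimates the weight-storage footprint of a hypothetical 7-billion-parameter Llama-class model at several bit-widths. The parameter count and the weights-only simplification are illustrative assumptions, not measurements from this project.

# Illustrative estimate of weight-storage footprint at different bit-widths.
# The 7B parameter count and the "weights only" view are assumptions; activations,
# the KV cache, and quantization metadata are ignored here.

NUM_PARAMS = 7_000_000_000  # hypothetical Llama-class model size

def weight_footprint_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate storage needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_footprint_gib(NUM_PARAMS, bits):.1f} GiB")

# Approximate output: FP32 ~26.1 GiB, FP16 ~13.0 GiB, INT8 ~6.5 GiB, INT4 ~3.3 GiB.

Even before accuracy is considered, this arithmetic shows why 8-bit or 4-bit weights often decide whether a model fits on a mobile or edge device at all.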
1.2 Objective
The primary objective of this project is to enable the deployment of the
Llama language model in environments with limited computational
resources by leveraging various quantization strategies. The goal is to
significantly reduce the model's memory footprint and size, making it
possible to run the Llama model on devices with constrained storage and
processing capabilities, such as mobile phones, edge devices, and
embedded systems.

Ultimately, this project will provide a comprehensive analysis of how quantization impacts the Llama model's efficiency, enabling us to strike
an optimal balance between model size, computational cost, and accuracy.
This will make it possible to deploy powerful language models like Llama
in resource-constrained settings, opening up new opportunities for AI-
driven applications in a wide range of industries and environments.

In addition to reducing memory footprint and computational cost, this project will explore a variety of quantization techniques, such as post-training quantization (PTQ), quantization-aware training (QAT), and hybrid approaches, to evaluate their impact on both inference
speed and model performance. By testing these techniques across different
model sizes and quantization bit-widths, we aim to identify the most
efficient strategy for maintaining Llama’s accuracy while achieving
significant reductions in model size and resource consumption. Special
attention will be given to balancing the trade-offs between compression
and model robustness, ensuring that the quantized model remains effective
across a wide range of real-world tasks.

CHAPTER - 2

LITERATURE REVIEW

2.1 Title: Accurate and Efficient Post-Training Quantization for Large Language Models

Author: Guangxuan Xiao, Mickael Seznec, Julien Demouth

Result:

The proposed SmoothQuant method demonstrates effective and efficient post-training quantization, achieving lossless 8-bit weight and activation
quantization for large language models (LLMs) with up to 530 billion
parameters. By enabling quantization for both weights and activations
across all General Matrix Multiply (GEMM) operations in LLMs,
SmoothQuant significantly reduces inference latency and memory usage
compared to mixed-precision activation quantization baselines. The
integration of SmoothQuant into frameworks such as PyTorch and
FasterTransformer yielded up to 1.56× inference acceleration while
halving the memory footprint. This result highlights the potential of
SmoothQuant to democratize LLM applications by offering a practical
solution to reduce deployment costs and enhance accessibility for real-
world use cases. SmoothQuant maintains model accuracy post-
quantization, ensuring that performance degradation is minimal, even
for extremely large models. This further enhances its viability for
production deployment, where both computational efficiency and model
quality are critical. By smoothing activation outliers and transferring
quantization difficulty from activations to weights through a
mathematically equivalent transformation, SmoothQuant enables INT8
quantization for both weights and activations across all matrix
multiplications in models like OPT, BLOOM, GLM and MT-NLG

Inference:

This paper introduces SmoothQuant, a training-free post-training quantization (PTQ) method that effectively reduces the computational
and memory demands of large language models (LLMs). By smoothing
activation outliers and transferring quantization difficulty from
activations to weights through a mathematically equivalent
transformation, SmoothQuant enables INT8 quantization for both
weights and activations across all matrix multiplications in models like
OPT, BLOOM, GLM, MT-NLG, and LLaMA. This approach achieves
up to 1.56× inference speedup and 2× memory reduction with negligible
accuracy loss. Furthermore, it enables the deployment of a 530B
parameter model on a single node, significantly lowering hardware and
energy costs. SmoothQuant provides a practical and efficient solution
for scaling LLMs, making their deployment more accessible and cost-
effective for real-world applications. In addition to its computational and
memory benefits, SmoothQuant also introduces a significant
improvement in model scalability. By enabling efficient 8-bit
quantization for both weights and activations, it reduces the need for
specialized hardware or distributed computing resources typically
required to deploy large-scale LLMs. This scalability advantage allows
organizations to run massive models, such as those with up to 530 billion
parameters, on standard hardware, including single-node setups, which
were previously impractical for such large models. The result is not only
cost savings in terms of hardware but also a reduction in energy
consumption, making LLM deployment more sustainable. With
SmoothQuant, enterprises and researchers can more easily experiment
with and deploy state-of-the-art models, accelerating innovation across
a range of industries.
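As an illustration of the smoothing idea summarized above, the sketch below re-creates the per-channel scaling step in simplified form. It is not the authors' code: the migration strength alpha = 0.5 and the toy tensor shapes are assumptions chosen only to show that the transformation leaves the matrix product unchanged while flattening activation outliers.

import torch

def smooth_scales(act_absmax: torch.Tensor, w_absmax: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    """Per-input-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha).

    A simplified sketch of the equivalent transformation described for SmoothQuant:
    X' = X / s and W' = s * W, so that (X/s)(sW) == XW exactly, while activation
    outliers are partially migrated into the weights.
    """
    return (act_absmax.clamp(min=1e-5) ** alpha) / (w_absmax.clamp(min=1e-5) ** (1 - alpha))

# Toy example with assumed shapes: activations X [tokens, in_features], weights W [in_features, out_features]
X = torch.randn(16, 64) * torch.linspace(0.1, 20.0, 64)   # a few outlier input channels
W = torch.randn(64, 32)

s = smooth_scales(X.abs().amax(dim=0), W.abs().amax(dim=1))
X_smooth, W_smooth = X / s, W * s.unsqueeze(1)

# The product is mathematically unchanged, but X_smooth has a much flatter range,
# which makes low-bit activation quantization far less damaging.
assert torch.allclose(X @ W, X_smooth @ W_smooth, atol=1e-4)
print("activation range before:", X.abs().max().item(), "after:", X_smooth.abs().max().item())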

2.2 Title: Low-Rank Quantization-Aware Training for LLMs

Author: Yelysei Bondarenko, Riccardo Del Chiaro, Markus Nagel (Qualcomm AI Research, Amsterdam, The Netherlands)

Result:

The study on "Low-Rank Quantization-Aware Training for LLMs" proposes LR-QAT, a lightweight and memory-efficient QAT algorithm for LLMs that enables training a 7B LLM on a single consumer-grade GPU with 24 GB of memory. Inspired by PEFT methods, the authors introduce a low-rank reparameterization that is aware of the quantization grid, and further reduce memory requirements by introducing a downcasting operator involving fixed-point or double-packed integers and by applying checkpointing. In almost all cases, the method outperforms common PTQ approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage. Beyond being memory efficient, LR-QAT also offers significant computational advantages. By leveraging low-rank reparameterization and introducing the downcasting operator, it effectively reduces the computational overhead associated with large model training. This allows for more efficient utilization of available hardware resources, enabling the training of relatively large LLMs, such as a 7B parameter model, on consumer-grade GPUs. Furthermore, the application of checkpointing minimizes the need to store large intermediate activations, further reducing the memory footprint during training. The combination of these techniques makes LR-QAT an attractive solution for deploying large models on less powerful hardware, while still achieving competitive performance compared to traditional full-model QAT approaches.

Inference:

LR-QAT is a lightweight, memory-efficient quantization-aware training method designed to address the computational and memory challenges of deploying LLMs on resource-constrained hardware. Inspired by PEFT and LoRA approaches, LR-QAT combines low-rank auxiliary weights, a downcasting operator, and gradient checkpointing to reduce
memory usage without compromising model performance. Unlike
traditional QAT, LR-QAT achieves inference efficiency by seamlessly
integrating auxiliary matrices into quantized weight tensors, eliminating
additional overhead during inference. It supports a wide range of
quantization settings, including per-channel and per-block weight
quantization, and can integrate with other PTQ techniques. LR-QAT
enables training a 7B parameter LLM on a single consumer-grade GPU
with less than 21 GB memory, compared to over 70GB required by full-
model QAT, while matching its predictive performance. Validated on
LLAMA and Mistral models across general language modeling datasets
and reasoning tasks, LR-QAT offers a practical solution for producing
low-bit pretrained LLMs that can be fine-tuned or adapted for various
downstream applications. LR-QAT's modular design allows for easy customization and scalability, making it adaptable to various hardware configurations and model sizes. Its ability to efficiently reduce memory usage while maintaining high accuracy opens up opportunities for deploying large language models on edge devices and other resource-limited environments, broadening the accessibility of advanced AI capabilities.

2.3 Title: Exploiting LLM Quantization

Author: Mark Vero, Robin Staab, Jingxuan He, Martin Vechev (Department of Computer Science, ETH Zurich)

Result:

This paper investigated zero-shot quantization methods for LLMs, highlighting vulnerabilities arising from discrepancies between full-
precision and quantized models that can be exploited for attacks. The
findings demonstrate the feasibility and severity of quantization attacks
on state-of-the-art, widely-used LLMs. Popular zero-shot quantization
methods, such as LLM.int8(), NF4, and FP4, were found to potentially
expose users to malicious activities when deploying quantized models.
These results underscore critical security concerns, especially given the
widespread reliance on platforms like Hugging Face for distributing and
deploying quantized LLMs. Furthermore, the research highlights the
importance of community awareness and collaboration to address these
vulnerabilities. By fostering transparency and sharing best practices for
secure quantization techniques, developers and researchers can work
together to mitigate risks and strengthen the overall security of LLM
deployments. It also calls for comprehensive evaluations of quantization
techniques under adversarial scenarios to better understand their
weaknesses. By proactively identifying and addressing these threats, the
community can ensure that future advancements in LLM quantization
prioritize both efficiency and robustness against potential attacks.

Inference:

The study investigates the security implications of quantization in LLMs, revealing vulnerabilities that adversaries can exploit to create
malicious models. The proposed attack framework involves fine-tuning
an LLM with adversarial tasks, quantizing the model to introduce
constraints, and adjusting full-precision weights to ensure that malicious
behavior emerges only after quantization. Experiments demonstrate the
practicality and severity of such attacks, showcasing scenarios like
vulnerable code generation, adversarial content injection, and over-
refusal behavior. The results highlight a critical gap in current evaluation
practices, where full-precision models appear secure but become
harmful upon quantization. This poses significant risks as malicious full-
precision models could be shared on platforms like Hugging Face,
exposing millions of users to these threats. The study underscores the
urgent need for rigorous security assessments during quantization to
safeguard against such adversarial exploits. In response to these
findings, the study suggests developing robust countermeasures, such as
enhanced quantization-aware training techniques and comprehensive
adversarial testing protocols. These approaches aim to minimize the risk
of malicious behavior being triggered by quantization. The authors also
advocate for the integration of automated tools that can detect and
mitigate adversarial patterns in LLMs before model deployment,
ensuring safer use of quantized models in real-world applications.

2.4 Title: 4.6-Bit Quantization for Fast and Accurate Neural Network Inference on CPUs

Author: Anton Trusov, Elena Limonova, Dmitry Nikolaev and Vladimir V. Arlazarov

Result:

The paper proposes a 4.6-bit quantization scheme that improves the efficiency and accuracy of neural network inference on CPUs compared
to traditional methods. This approach bridges the gap between four-bit
and eight-bit quantization by offering more quantization bins and using
a combination of 16- and 32-bit accumulators, addressing prior
limitations in computation depth. Experiments on CIFAR-10 and
ImageNet datasets showed that the 4.6-bit model significantly improves
accuracy over four-bit models (e.g., 66.1% vs. 64.2% for ResNet18)
while running 1.5–1.6 times faster than eight-bit models. The scheme
maintains similar speed to four-bit quantization with only a slight
slowdown (4%). Thus, it serves as an efficient alternative for
applications that require a balance between inference speed and
accuracy, making it well-suited for resource-constrained CPU
environments. The proposed quantization scheme demonstrates strong
adaptability to various neural network architectures beyond ResNet18,
indicating its versatility for a broad range of applications. By optimizing
the trade-off between precision and computational efficiency, this
method has the potential to enhance performance in scenarios like edge
computing and real-time processing, where both speed and accuracy are
crucial.

Inference:

The 4.6-bit quantization scheme provides an effective balance between accuracy and computational efficiency for CPU-based neural network
inference. By increasing the bitwidth compared to four-bit quantization,
it enhances accuracy while maintaining fast processing speeds, making
it a middle ground between four-bit and eight-bit methods. The scheme's
use of combined 16- and 32-bit accumulators addresses past limitations
in computation depth and maximizes CPU resource usage. This
approach is particularly valuable for environments where eight-bit
precision is too resource-intensive, offering a faster yet still accurate
alternative. Overall, it is a practical solution for optimizing neural
network deployment on mobile and embedded CPUs. The 4.6-bit
quantization scheme opens opportunities for improved performance in
applications like real-time image recognition, autonomous systems, and
on-device AI, where balancing speed and accuracy is critical. The
flexibility of this method enables developers to achieve higher model
performance without significantly increasing hardware demands,
making it especially advantageous for devices with stringent power and
processing constraints. This innovation could drive advancements in
various fields that rely on efficient AI inference. The paper also
emphasizes the potential for further research into hybrid quantization
strategies that could extend the benefits of the 4.6-bit approach. By
exploring adaptive bit widths based on layer-specific requirements or
model complexity, future work could push the boundaries of efficiency
even further, making neural network inference on CPUs more optimized
and tailored to diverse application needs.

2.5 Title: How Does Quantization Affect Multilingual LLMs?

Author: Kelly Marchisio, Saurabh Dash, Hongyu Chen, Dennis Aumiller, Ahmet Ustun, Sara Hooker, Sebastian Ruder

Result:

In this study, we investigate the impact of quantization techniques on LLMs, ranging from 8 billion to 103 billion parameters, across more than 20 languages. Our findings reveal several key insights. The negative
effects of quantization are more pronounced than automatic metrics
suggest, with human evaluators noticing significant degradation even
when automatic metrics do not. Different languages are impacted by
quantization to varying extents, with non-Latin script languages
experiencing greater degradation in automatic benchmarks. Complex
tasks, particularly those involving math and realistic, challenging
prompts, exhibit significant performance drops. However, we also
observe occasional performance improvements in some cases. These
results emphasize the importance of considering multilingual
performance throughout the system design process. Further research
could explore the effects of other factors on multilingual performance,
including the exclusion of certain languages from training and handling
out-of-distribution tasks, to create more robust systems that cater to a
global audience. It highlights the importance of fine-tuning and post-
training adjustments in mitigating the negative effects of quantization,
especially for larger models. While quantization can lead to performance
drops, targeted fine-tuning on specific languages or tasks can help
restore or even enhance model performance, particularly for challenging
or less-represented languages. This suggests that a hybrid approach,
combining quantization with task-specific adaptation, may offer a more practical path forward for deploying high-performance LLMs in
multilingual, real-world applications.

Inference:

This study highlights the significant impact of quantization on the performance of LLMs. Key findings indicate that automatic evaluation
metrics tend to underestimate the extent of performance degradation
caused by quantization, with human evaluators reporting far more
substantial drops in performance. The research also shows that
challenging tasks, such as mathematical reasoning, suffer the most, with
performance severely deteriorating. However, in some cases,
quantization can lead to improvements in model performance. These
results emphasize the need for careful consideration of multilingual
performance throughout the system design process, particularly when
developing models for global applications, where language diversity and
computational constraints must be balanced. The study underscores the
importance of integrating human feedback into evaluation methods,
urging researchers to address the multilingual impact of design choices
to ensure fairness and robustness in LLMs. The study suggests that the
effects of quantization may vary significantly across different languages,
with non-Latin scripts and morphologically rich languages being more
vulnerable to degradation. This highlights the importance of tailoring
quantization strategies to account for linguistic diversity, ensuring that
models maintain strong performance across a broad range of languages.
Future research could explore language-specific quantization techniques
or adaptive methods that prioritize preserving accuracy for high-risk
languages, ultimately leading to more equitable and effective LLMs that
can serve a global user base.

CHAPTER - 3
BACKGROUND AND RELATED WORKS

Neural network quantization is a highly effective strategy for optimizing machine learning models, particularly for minimizing their footprint, data
transfer, and computational requirements. By converting high bit-width
floating-point weights and activations, commonly represented as FP32 or
FP16, into low-bit numbers such as INT8, quantization achieves significant
efficiency gains. Low-bit fixed-point representations are especially
advantageous as they require less energy to compute compared to floating-
point operations. This makes them highly suitable for deployment on
devices with limited resources, such as mobile phones or edge computing
systems. However, reducing bit-width introduces quantization noise,
which can affect model performance, often resulting in reduced accuracy
or increased perplexity when quantized to 8 bits or lower.

One of the fundamental methods in neural network quantization is uniform affine quantization. This technique maps floating-point values to fixed-
point integers in a linear manner, ensuring consistency across different
computational platforms. The uniform nature of this approach helps
maintain a close approximation between the distribution of quantized
values and their original floating-point counterparts. By preserving this
distribution, uniform affine quantization minimizes performance
degradation and makes the quantization process adaptable to a wide range
of hardware environments.

Recent advancements in LLM quantization have focused on addressing the trade-off between efficiency and accuracy. Techniques such as PTQ and
QAT are commonly employed. PTQ is applied after the model is fully
trained, involving the quantization of weights and activations without requiring additional training. This makes PTQ a faster and less resource-
intensive approach; however, it may result in more significant accuracy
loss since it lacks the adaptive mechanism to account for quantization
noise. In contrast, QAT incorporates the quantization process directly into
the training phase. By simulating quantization during both the forward and
backward passes, QAT enables the model to learn and adapt to the
introduced quantization noise. While QAT demands more computational
resources, it provides better accuracy and performance for the quantized
model.

The key challenges in LLM quantization revolve around balancing computational efficiency and maintaining acceptable accuracy. Lower bit-
width quantization reduces the computational load and energy
requirements but often introduces quantization noise that can impair the
model's functionality. Ensuring hardware compatibility is another critical
challenge, as quantized models must be optimized for specific devices
while maintaining their ability to generalize across diverse datasets and
applications, including multilingual scenarios. These challenges
necessitate careful design and evaluation to ensure that quantized models
remain robust and versatile.

Despite significant progress in quantization techniques, there are still notable limitations in their application to large language models. Precision
loss remains a primary concern, particularly when targeting lower bit-
widths. Additionally, integrating quantization-aware training into the
model training pipeline can be complex and computationally expensive.
Scaling quantized models to larger architectures or deploying them in
diverse application domains also presents significant hurdles. However, the
continued development of innovative quantization methods is gradually
addressing these limitations, pushing the boundaries of what can be achieved with efficient and adaptable language models. As these
techniques advance, they promise to make powerful AI tools more
accessible, cost-effective, and sustainable for real-world applications.

3.1 Uniform affine quantization

Uniform affine quantization is defined by the function:

$$\hat{x} = q(x;\, s, z, b) = s \cdot \left( \mathrm{clip}\!\left( \left\lfloor \tfrac{x}{s} \right\rceil + z,\; -2^{b-1},\; 2^{b-1} - 1 \right) - z \right)$$

Eq 3.1 : Uniform affine quantization formula

In Eq 3.1, x represents the quantizer input, which can be either network weights or activations. The parameter s denotes the higher-precision quantization scale, z is the integer zero offset, and b is the bitwidth. The operator ⌊·⌉ rounds to the nearest integer, while the clip function ensures that values stay within the representable range. The quantization parameters s and z can be shared across different components of x.

This quantizer approximates an original floating-point vector as x ≈ s · (x_int − z), where each element x_int is a b-bit integer value. This scheme, known as uniform affine or asymmetric quantization, is widely used because it admits an efficient fixed-point arithmetic implementation.

In symmetric quantization, we restrict the quantization grid to be symmetric around z = 0, which simplifies computations but might not be as efficient in all cases. The consistent structure provided by uniform affine quantization makes it a preferred choice, especially in scenarios requiring precise and effective implementation of quantized operations.
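A minimal sketch of Eq 3.1 in PyTorch-style Python is given below. The per-tensor granularity and the simple min/max range estimator used to pick the scale and zero offset are assumptions for illustration rather than a prescribed implementation.

import torch

def uniform_affine_quantize(x: torch.Tensor, b: int = 8):
    """Quantize-dequantize x per Eq 3.1: x_hat = s * (clip(round(x/s) + z, qmin, qmax) - z)."""
    qmin, qmax = -2 ** (b - 1), 2 ** (b - 1) - 1
    # Per-tensor scale and integer zero offset estimated from the observed range
    # (a simple min/max range estimator; other estimators are possible).
    s = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    z = (qmin - torch.round(x.min() / s)).clamp(qmin, qmax)
    x_int = torch.clamp(torch.round(x / s) + z, qmin, qmax)   # values on the b-bit integer grid
    x_hat = s * (x_int - z)                                   # dequantized approximation of x
    return x_hat, x_int.to(torch.int8 if b <= 8 else torch.int32), s, z

x = torch.randn(4, 8)
x_hat, x_int, s, z = uniform_affine_quantize(x, b=8)
print("max abs quantization error:", (x - x_hat).abs().max().item())

In practice the scale and zero offset would be chosen per tensor, per channel, or per block, depending on the accuracy target and the hardware.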

Post-training quantization methods

Post-training quantization (PTQ) methods convert a pretrained high-precision network, such as FP32, FP16, or BF16, directly into a fixed-point network without requiring the original training pipeline. These algorithms are user-friendly, either data-free or requiring only a small calibration dataset, and are generally easy to implement with minimal hyperparameter tuning. This simplicity allows for efficient quantization of a pre-trained neural network using a single API call, serving as a black-box method for computationally efficient deployment.

Despite their convenience, post-training quantization of LLMs presents challenges due to the presence of numerical outliers in weights and
activations. PTQ methods for LLMs can be broadly categorized into
weights-only quantization and weight-activation quantization.

Weights-only quantization focuses on converting only the weights to low-bit values. For example, GPTQ employs second-order information to
iteratively round grouped weights and correct quantization errors in
remaining groups. Techniques like SpQR, AWQ, and OWQ highlight the
importance of "salient" weights, which correspond to high-magnitude
activations. Recent weights-only methods also include various approaches
that emphasize different aspects of the quantization process.

Weight-activation quantization, on the other hand, compresses both weights and activations. Methods like SmoothQuant, LLM.int8(), and
Outlier Suppression achieve W8A8 quantization by managing activation
outliers. LLM.int8() employs mixed-precision decomposition, while
SmoothQuant and Outlier Suppression use channel-wise scaling.
OmniQuant addresses the extreme values of weights by optimizing the
clipping threshold and shifting the quantization challenge from activations
to weights through a learnable equivalent transformation. Other recent
weight-activation PTQ methods continue to evolve, each with unique
strategies to enhance quantization efficiency and accuracy.
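As a concrete baseline for the weights-only category, the sketch below applies plain symmetric per-output-channel (absmax) quantization to a weight matrix. It is a generic illustration under assumed shapes and bit-width, not a re-implementation of GPTQ, AWQ, SpQR, or the other methods named above.

import torch

def quantize_weights_per_channel(W: torch.Tensor, b: int = 4):
    """Symmetric per-output-channel weights-only quantization (absmax scaling).

    W is assumed to have shape [out_features, in_features]; one scale per output row.
    """
    qmax = 2 ** (b - 1) - 1
    scales = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax   # shape [out, 1]
    W_int = torch.clamp(torch.round(W / scales), -qmax - 1, qmax)       # b-bit integer weights
    return W_int, scales

W = torch.randn(32, 64)
W_int, scales = quantize_weights_per_channel(W, b=4)
W_dequant = W_int * scales
print("mean abs error:", (W - W_dequant).abs().mean().item())
# Activations stay in floating point here; weight-activation methods such as
# SmoothQuant or LLM.int8() would additionally quantize the activations.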

3.2 Quantization-aware training methods

QAT methods simulate the quantization process during training, enabling the model to discover more optimal solutions compared to PTQ approaches. However, achieving better accuracy and perplexity with QAT involves trade-offs, such as longer training times, increased memory usage, the necessity for labeled data, and extensive hyperparameter tuning. These factors make traditional QAT methods less suitable for quantizing modern LLMs due to their high training costs and memory demands. Nevertheless, some methods have been developed to apply QAT to LLMs. For instance, LLM-QAT combines QAT with data-free knowledge distillation, while EdgeQAT focuses on tiny language models with fewer than 100 million parameters.

LoRA is a parameter-efficient fine-tuning method that reduces memory requirements compared to standard training. LoRA keeps the pretrained weights W = W0 fixed and trains a small set of low-rank parameters, known as adapters. Given a linear projection y = Wx with W ∈ R^{m×k}, LoRA computes y = Wx + (α/r)·ABx, where A ∈ R^{m×r}, B ∈ R^{r×k}, r < min(m, k) is the rank, and α is a scalar constant relative to r. LoRA's benefits include lower costs and performance that often matches or exceeds full fine-tuning. Additionally, the fine-tuned model can be deployed without extra cost, as the low-rank matrices can be fused into the pretrained weights after fine-tuning (W := W0 + (α/r)·AB). There has been significant exploration of combining LoRA and quantization. For example, QLoRA quantizes pretrained weights to 4-bit using the NF4 format and dequantizes them during the forward pass to further reduce the memory footprint of fine-tuning. QA-LoRA uses INT4 quantization and introduces group-wise operators to enable quantization during the inference stage. These innovative methods demonstrate the potential of combining LoRA with quantization to achieve efficient and effective model deployment.
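The LoRA computation and post-training fusion described above can be sketched as follows. The layer sizes, rank, α, and initialization are assumptions for illustration; only the generic y = Wx + (α/r)ABx form is shown, not any specific library implementation.

import torch

m, k, r, alpha = 64, 128, 8, 16          # assumed sizes: W0 is m x k, rank-r adapters
W0 = torch.randn(m, k)                   # frozen pretrained weights
A = 0.01 * torch.randn(m, r)             # trainable low-rank adapter (small init assumed)
B = 0.01 * torch.randn(r, k)             # in practice B is often initialized to zero

def lora_forward(x: torch.Tensor) -> torch.Tensor:
    """y = W0 x + (alpha / r) * A B x; only A and B would be trained."""
    return W0 @ x + (alpha / r) * (A @ (B @ x))

# After fine-tuning, the adapters can be fused into the weights, so inference
# costs exactly the same as the original dense layer:
W_fused = W0 + (alpha / r) * (A @ B)

x = torch.randn(k)
assert torch.allclose(lora_forward(x), W_fused @ x, atol=1e-4)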

Method          Accuracy   Memory Efficiency   Inference Efficiency
PTQ             ✗          ✓                   ✓
QAT             ✓          ✗                   ✓
LoRA / PEFT     ✓          ✓                   ✗
LR-QAT (ours)   ✓          ✓                   ✓

Table 3.1 : Comparison between existing approaches and the proposed method.

Our method, LR-QAT, represents a cutting-edge approach that combines high accuracy, memory efficiency, and inference efficiency, making it a standout compared to existing methods in neural network quantization. By leveraging low-rank adapters, LR-QAT enables fusion into a low-bit integer matrix W_Z without incurring any loss in accuracy or perplexity. This capability achieves levels of inference efficiency comparable to PTQ, setting it apart from alternatives like QA-LoRA, where quantization constraints are relaxed to accommodate accuracy. Unlike QA-LoRA, our method supports application across any weight quantization granularity, providing unparalleled flexibility and adaptability to diverse scenarios.

In addition to its performance benefits, LR-QAT is designed as a general extended pretraining framework. This positions it as a versatile solution for a broad range of applications beyond task-specific fine-tuning. A related approach, LoftQ, introduced an iterative singular value decomposition (SVD)-based procedure for initializing the matrices A and B, which significantly accelerates fine-tuning convergence under low-bit quantization conditions. Building on LoftQ, LQ-LoRA extended this initialization technique to mixed-precision and data-aware contexts, further enhancing the quantization process by adapting to diverse datasets and operational requirements. These advancements demonstrate the evolving nature of quantization methodologies, driven by innovations in matrix decomposition and context-aware initialization.

Among recent works, several methods have aimed to bridge the gap
between efficiency and accuracy in quantized models. One closely related
effort is PEQA, which attempts to merge the inference efficiency of QAT
with the memory efficiency provided by PEFT techniques. However,
PEQA adopts a different approach, focusing on task-specific fine-tuning
rather than general extended pretraining. This narrower scope, combined
with significantly fewer degrees of freedom in its design, results in
suboptimal performance compared to our method. LR-QAT's ability to
operate across weight quantization granularities and its general-purpose
applicability ensure superior versatility and effectiveness.

By integrating insights from previous innovations like LoftQ, LQ-LoRA,
and PEQA, LR-QAT exemplifies the ongoing advancements in
quantization techniques. Its combination of efficiency, flexibility, and
accuracy pushes the boundaries of what is achievable with low-bit
quantized models, paving the way for scalable, high-performance neural
networks across a variety of use cases.

3.3 Post-Training Quantization

Post-training quantization (PTQ) is widely regarded for its simplicity and speed, as it does not require additional training steps to quantize models.
However, it is limited in its performance when applied to low-bit scenarios.
The inherent quantization noise introduced during the process often results
in degraded model accuracy, making PTQ unsuitable for achieving optimal
performance in such regimes. QAT, on the other hand, demonstrates
significantly better accuracy and robustness in low-bit quantization. QAT
integrates quantization into the training process, allowing the model to learn
and adapt to quantization noise. However, this approach comes with
substantial training costs and high memory usage, which makes it
impractical for large language models (LLMs) due to their massive size and
resource requirements.

LoRA-based methods address these challenges to an extent by focusing on memory-efficient fine-tuning. By introducing low-rank adapters, such as
matrices A and B, LoRA reduces the memory footprint during training and
enables more efficient model updates. Despite these benefits, most LoRA-
based methods fail to prioritize efficient inference. The adapters are
typically stored in higher-precision formats, such as BF16, which require
de-quantizing the low-bit integer matrix W_Z back into the same higher-
precision data format during inference. This dequantization process introduces significant runtime overhead, undermining the efficiency gains
achieved during training.

Simply quantizing adapters after training to address inference inefficiency presents its own set of challenges. A primary issue is the discrepancy in
quantization grids. The adapters use a different quantization grid compared
to the base weight matrix W, which can result in a high quantization error.
Using the same quantization grid for both the adapters and the weight
matrix also leads to poor results, as it fails to account for the differences in
their respective distributions. QA-LoRA is currently the only method that
attempts to overcome these limitations by fusing the auxiliary LoRA
weights back into the frozen low-bit integer matrix W_Z. However, QA-
LoRA is constrained by its design, as it works exclusively with group-wise
quantization that involves a high number of groups, typically requiring
small group sizes like 32. Moreover, QA-LoRA, along with most LoRA-
based methods, integrates its techniques with task-specific fine-tuning,
limiting its applicability to specific use cases.

In contrast, our proposed method, LR-QAT, represents a paradigm shift by introducing an extended pretraining framework that transcends the
limitations of task-specific fine-tuning. Inspired by the memory-efficient
principles of LoRA-based methods, LR-QAT aims to improve QAT's
memory and runtime efficiency while maintaining compatibility with a
wide range of tasks. Unlike QA-LoRA, LR-QAT does not depend on
group-wise quantization or a fixed quantization grid. Instead, it ensures that
low-rank adapters can be seamlessly fused into the low-bit integer matrix
W_Z without sacrificing accuracy or perplexity. This enables efficient
inference without the overhead of dequantization, making LR-QAT an
attractive solution for deploying LLMs in resource-constrained
environments.

The goal of LR-QAT is to address the trade-offs inherent in existing
methods by providing a unified framework that balances memory
efficiency, runtime performance, and task generality. Table 3.1 summarizes
the trade-offs across various techniques, highlighting the unique
advantages of LR-QAT over PTQ, QAT, LoRA, and QA-LoRA. By
combining the best attributes of these approaches and eliminating their
limitations, LR-QAT sets a new standard for efficient, low-bit quantization
in large-scale neural networks.

3.4 Temperature vs. Quality of Generation

Quantization methods play a critical role in balancing the efficiency and performance of LLMs, particularly under varying temperature settings.
Across different models, a common pattern emerges: an increase in
temperature often correlates with an elevation in duplicate content words,
although FP16 is a notable exception to this trend. Interestingly, the
sensitivity of models to temperature changes varies significantly. Some
models exhibit noticeable instability even at temperatures below 0.5,
reflecting a diverse range of behaviors across architectures and
quantization strategies.

For instance, in the comparison between StableLM 3B and RedPajama 3B, the FP4 and NF4-DQ quantization methods perform suboptimally. These
methods are characterized by a higher frequency of repetitive words,
especially at lower temperature settings, indicating their limited ability to
maintain diversity in outputs. Falcon models, on the other hand, show
consistent underperformance with NF4 quantization across the entire
temperature spectrum, highlighting its inherent limitations when applied to
this specific architecture.

Fig 3.1: Illustration of QAT with Straight Through Estimator

As shown in Fig 3.1, the behavior of LLaMA 2 models presents a more nuanced picture. Most quantization approaches contribute significantly to repetitive
content generation in these models, but notable exceptions exist. For the
LLaMA 2 70B model, FP4 and FP4-DQ stand out as the most effective
quantization methods, outperforming others by producing more diverse
and coherent outputs. Additionally, INT8 quantization demonstrates
superior control over duplicate content generation for both LLaMA 2 13B
and 70B models. This method successfully limits the occurrence of
duplicate words to approximately 40, showcasing its robustness in
maintaining output quality under different temperature settings.

FP16 quantization, in contrast, displays a unique characteristic: independence from temperature scaling. It generates a consistent number of repetitive words across all temperature settings, making it less sensitive to such variations. However, this behavior is not universal.

CHAPTER -4
METHODOLOGY

The LR-QAT approach builds on the principles of QAT while addressing its key limitations, particularly for LLMs. To understand the method, it is
essential to revisit the traditional QAT process and the challenges it poses
when applied to LLMs.

In a standard QAT setup, a linear layer with a weight matrix W ∈ R^{m×k} undergoes quantization using a symmetric uniform affine quantization process. For b-bit quantization, the weights are quantized as follows:

$$\widehat{W} := s \cdot \mathrm{clip}\!\left( \left\lfloor \tfrac{W}{s} \right\rceil,\; -2^{b-1},\; 2^{b-1} - 1 \right)$$

Eq 4.1 : b-bit quantization

In Eq 4.1, W represents the trainable shadow weights, s is the quantization scale, and the clipping operation ensures that the quantized values lie within the representable range of the b-bit format. The quantization scales can either be fixed or learned during training. To enable backpropagation through the non-differentiable rounding operation inherent in the quantization process, the straight-through estimator (STE) is used. The STE approximates the derivative of the rounding function by assuming it to be 1, allowing gradients to propagate through the quantization step.
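A minimal sketch of this simulated-quantization step is shown below. It uses PyTorch's detach trick as one common way to realize the STE and a per-tensor symmetric quantizer; both are implementation assumptions rather than the exact recipe used by any particular QAT method.

import torch

def fake_quant_ste(W: torch.Tensor, s: torch.Tensor, b: int = 4) -> torch.Tensor:
    """Simulated b-bit quantization of W (Eq 4.1) with a straight-through estimator.

    Forward: W_hat = s * clip(round(W / s), -2^(b-1), 2^(b-1) - 1).
    Backward: the rounding is treated as the identity, so gradients flow to W and s.
    """
    qmin, qmax = -2 ** (b - 1), 2 ** (b - 1) - 1
    w_scaled = W / s
    w_int = torch.clamp(torch.round(w_scaled), qmin, qmax)
    # STE: forward uses w_int, backward behaves as if round() were the identity.
    w_int_ste = w_scaled + (w_int - w_scaled).detach()
    return s * w_int_ste

# Toy usage: both the shadow weights and the scale receive gradients.
W = torch.randn(16, 16, requires_grad=True)
s = torch.tensor(0.05, requires_grad=True)
loss = fake_quant_ste(W, s, b=4).pow(2).mean()
loss.backward()
print(W.grad.abs().mean().item(), s.grad.item())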

While this procedure is effective in preserving accuracy in low-bit quantized models, applying it to LLMs introduces significant computational challenges. The sheer size of LLMs means that the number of parameters to be learned during QAT is comparable to the number used during the original pretraining. This results in excessive memory usage and high computational costs, making traditional QAT methods impractical for such models.

LR-QAT addresses these challenges by incorporating low-rank adapters into the quantization process. Low-rank adapters decompose the weight update into smaller matrices, significantly reducing the number of parameters that need to be updated and stored during training. Specifically, the frozen weight matrix W is augmented with a low-rank correction A · B, where A and B have dimensions m×r and r×k, respectively, with r ≪ min(m, k). This decomposition reduces the effective parameter count, alleviating the memory burden during training and inference.

By introducing low-rank adapters, LR-QAT minimizes memory requirements while retaining the benefits of QAT. The method also
leverages the efficiency of symmetric uniform affine quantization,
enabling the use of low-bit formats without significant accuracy
degradation. Additionally, the low-rank adapters can be quantized and
fused into the base weight matrix W during inference, eliminating the need
for dequantization and further optimizing runtime performance.

A key advantage of LR-QAT is its versatility. Unlike traditional QAT methods that are often tailored to specific tasks, LR-QAT provides a
general framework applicable to a wide range of use cases, from
pretraining to fine-tuning and even task-specific deployment. By reducing
the computational overhead of QAT, LR-QAT facilitates the deployment
of LLMs in resource-constrained environments, such as mobile devices or
edge computing platforms, without sacrificing accuracy or inference
efficiency.

LR-QAT redefines QAT for LLMs by integrating low-rank adapters into the quantization process. This approach significantly reduces memory and runtime requirements while maintaining high accuracy, making it a practical and scalable solution for deploying large-scale neural networks in diverse application domains.

To enhance the practicality of our approach, we adopt a strategy of freezing the pretrained weights W (denoted as W0) and incorporating low-rank adapters A ∈ R^{m×r} and B ∈ R^{r×k}, where r < min(m, k). This allows us to retain the pretrained model's knowledge while adding minimal computational overhead. The introduction of these adapters, with dimensions determined by the low-rank approximation, ensures that the number of additional parameters is manageable, striking a balance between efficiency and model capacity. A critical aspect of this design is the placement and integration of the low-rank adapters into the quantization framework.

The placement of these adapters is pivotal to maintaining model performance while enabling efficient inference. After training, our goal is to fuse the adapters A and B seamlessly into a single b-bit integer matrix W_Z, ensuring no loss in accuracy or perplexity. This fusion not only simplifies the inference pipeline but also leverages the benefits of low-bit quantization for reduced memory usage and computational overhead. To achieve this, we position the auxiliary matrices A and B inside the quantization operator, modifying the quantization process as follows:

$$\widehat{W} := s \cdot \mathrm{clip}\!\left( \left\lfloor \frac{W_0}{s} + \frac{\alpha}{r} AB \right\rceil,\; -2^{b-1},\; 2^{b-1} - 1 \right)$$

Eq 4.2 : low-bit quantization

In Eq 4.2, s is the quantization scale, and the term α/r acts as a scaling factor to adjust the contribution of AB. This scaling factor, inspired by LoRA, minimizes the need for extensive hyperparameter tuning when the rank r of the adapters is varied. It ensures that the contributions of A and B are appropriately weighted relative to W0, maintaining stability.

During training, we utilize the STE to approximate the gradients of the rounding operation. This assumption allows the loss function's gradients to propagate through the non-differentiable rounding step, enabling updates to A, B, and s. As a result, the model learns to adjust the adapters and the quantization scale to effectively counteract the noise introduced by the low-bit quantization process. The integration of A and B directly within the quantization function ensures that the adapters are quantized in harmony with the base weights W0, avoiding the mismatches in quantization grids that plague many alternative methods.
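Combining the pieces, the sketch below shows one plausible way to express the Eq 4.2 forward pass and the post-training fusion into an integer matrix W_Z. The rank, bit-width, scale initialization, and STE realization are assumptions for illustration, not the reference LR-QAT implementation.

import torch

b = 4
qmin, qmax = -2 ** (b - 1), 2 ** (b - 1) - 1      # 4-bit signed range: [-8, 7]
m, k, r, alpha = 32, 64, 8, 16                    # assumed layer and adapter sizes

W0 = torch.randn(m, k)                                        # frozen pretrained weights
s = torch.nn.Parameter(W0.abs().max() / qmax)                 # learnable quantization scale
A = torch.nn.Parameter(0.01 * torch.randn(m, r))              # low-rank adapters
B = torch.nn.Parameter(torch.zeros(r, k))

def round_ste(x: torch.Tensor) -> torch.Tensor:
    """Round with a straight-through gradient (backward acts as the identity)."""
    return x + (torch.round(x) - x).detach()

def lr_qat_weight() -> torch.Tensor:
    """Eq 4.2: W_hat = s * clip(round(W0 / s + (alpha / r) * A @ B), qmin, qmax)."""
    return s * torch.clamp(round_ste(W0 / s + (alpha / r) * (A @ B)), qmin, qmax)

# During training, gradients from the task loss reach A, B, and s through the STE.
y = lr_qat_weight() @ torch.randn(k)
y.pow(2).mean().backward()

# After training, the adapters are absorbed into a single low-bit integer tensor W_Z,
# so inference only needs W_hat = s * W_Z, with no adapter dequantization overhead.
with torch.no_grad():
    W_Z = torch.clamp(torch.round(W0 / s + (alpha / r) * (A @ B)), qmin, qmax).to(torch.int8)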

Once training is complete, the quantized representation of the model can be expressed as a standard fixed-point tensor:

$$\widehat{W} = s \cdot W_Z$$

where W_Z is the fused low-bit integer matrix, and s is the corresponding scale. This representation eliminates the need for higher-precision formats or additional computations during inference, enabling a streamlined and efficient deployment process.

This approach contrasts with many existing methods, such as QLoRA, which place the adapters outside the quantization function. For instance, in
QLoRA, the computation follows the form y=Wx+ABx, where the
adapters A and B are maintained in higher-precision formats, such as BF16.
While this approach preserves precision during fine-tuning, it incurs
significant overhead during inference due to the additional computations
required for the high-precision adapters. By embedding A and B within the
quantization operator, our method avoids these inefficiencies, ensuring that
the model is optimized for both training and inference.

Our method strategically integrates low-rank adapters within the quantization process, leveraging their flexibility and efficiency to address the challenges of low-bit quantization. By enabling the seamless fusion of adapters into the quantized weight matrix, this approach not only preserves accuracy but also optimizes memory usage and inference speed, making it a robust solution for deploying large language models in resource-constrained environments.

4.1 Downcasting operator

The downcasting operator is introduced as an enhancement to further reduce memory consumption in QAT, particularly in scenarios where memory efficiency is critical. The formulation in Eq 4.2, while already more memory-efficient than traditional full-model QAT because it avoids computing gradients and momentum terms for the pretrained weights W, can be further optimized by applying downcasting techniques to the frozen weight matrix W0. This approach leverages the fact that W0 remains constant during training, allowing for more efficient storage and processing strategies.

The weight matrix W0 is divided by the scale s at every forward pass. Since s typically needs to be stored in a high-precision format to ensure numerical stability during training, directly downcasting W0 in this formulation could introduce challenges related to precision and stability. To address this, a revised formulation is proposed in Eq 4.3:

$$\widehat{W} := s \cdot \mathrm{clip}\!\left( \left\lfloor \frac{W_0}{s_0} + \frac{\alpha}{r} AB \right\rceil,\; -2^{b-1},\; 2^{b-1} - 1 \right)$$

Eq 4.3 : downcasting revised quantization

In Eq 4.3, the scale s0 is the initial fixed scale determined during the range estimation phase before training begins, replacing the learned scale s inside the rounding operator. This modification ensures that the fraction W0/s0 remains fixed throughout training and can therefore be stored in a lower-precision format without impacting stability. The learned scale s remains outside the clipping operator, preserving flexibility and adaptability during training. Empirical evidence suggests that this modified formulation not only simplifies the computation but often matches or slightly outperforms the original approach.

To implement this, the pretrained weights are represented and stored using the following transformation:

$$\Phi := \phi\!\left(\frac{W_0}{s_0}\right)$$

Eq 4.4 : transformation

where ϕ(⋅) in Eq 4.4 is the downcasting operator. The role of ϕ(⋅) is to
convert the input into a chosen low-precision format, enabling significant
memory savings. The simplest form of ϕ(⋅) casts the input to standard
floating-point formats such as FP16, BF16, or FP8. These formats are
widely supported and provide a straightforward means of reducing
memory usage.

Inspired by traditional fixed-point quantization, the downcasting operator ϕ(⋅) can also adopt integer representations. For example, using ϕ=INT-b,
where b represents the bit-width (e.g., INT4 or INT8), can lead to even
more aggressive memory reductions. In scenarios where b≤4, two numbers
can be double-packed into a single INT8 value, achieving additional
savings. However, as of this writing, most deep learning frameworks, such
as PyTorch, do not natively support low-bit formats like INT4. Despite
this, the double-packing technique offers a practical workaround for
leveraging low-bit precision while maximizing memory efficiency.
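The double-packing trick mentioned above can be sketched as follows; it packs two signed 4-bit values into one INT8-sized byte using plain bit manipulation, which is one possible workaround under the stated assumption that the framework has no native INT4 tensor type.

import torch

def pack_int4(x: torch.Tensor) -> torch.Tensor:
    """Pack pairs of signed 4-bit integers (values in [-8, 7]) into single bytes."""
    assert x.numel() % 2 == 0, "expects an even number of elements for pairing"
    flat = x.reshape(-1, 2).to(torch.int16)
    lo = flat[:, 0] & 0x0F                 # low nibble (4-bit two's complement)
    hi = (flat[:, 1] & 0x0F) << 4          # high nibble
    return (lo | hi).to(torch.uint8)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4: recover the signed 4-bit values."""
    p = packed.to(torch.int16)
    lo = (p & 0x0F).to(torch.int8)
    hi = ((p >> 4) & 0x0F).to(torch.int8)
    # Sign-extend nibbles: values >= 8 represent negatives in 4-bit two's complement.
    lo = torch.where(lo >= 8, lo - 16, lo)
    hi = torch.where(hi >= 8, hi - 16, hi)
    return torch.stack([lo, hi], dim=1).reshape(-1)

w_int4 = torch.randint(-8, 8, (16,)).to(torch.int8)      # toy 4-bit weight values
packed = pack_int4(w_int4)                                # 8 bytes instead of 16
assert torch.equal(unpack_int4(packed), w_int4)
print("original bytes:", w_int4.numel(), "packed bytes:", packed.numel())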

Preliminary experiments revealed that while ϕ=INT-b provides substantial memory reductions by retaining only the integer part of the clipped W0/s0
, it did not perform as well in preserving accuracy compared to higher-
precision formats like BF16. This trade-off highlights the importance of
selecting an appropriate downcasting format depending on the specific
requirements of the task. BF16, for instance, strikes a good balance
between memory savings and numerical precision, making it a preferred
choice for many scenarios.

In summary, the downcasting operator enhances memory efficiency by storing the frozen weight matrix W0 in low-precision formats. By
leveraging fixed scales and carefully choosing numeric representations,
this approach achieves aggressive memory reductions without compromising the stability of training. While integer-based representations
like INT4 or INT8 offer the greatest memory savings, formats such as
BF16 may be more effective for maintaining accuracy, especially in tasks
requiring higher precision. This innovation enables more scalable and
efficient training of large language models, further broadening their
applicability in resource-constrained environments.

4.1.1 LLM Quantization

To enable memory-efficient model inference, LLMs are often deployed with lower-precision quantized weights. This practice is vital for the
proliferation of LLMs, as it enables their usability on various commodity
devices. Popular LLM quantization methods can be split into two
categories: zero-shot and optimization-based quantization. The first
category includes LLM.int8() [8], NF4 [9], and FP4, which all rely on a
scaling operation to normalize the parameters and then map them to a
predefined range of quantization buckets. Optimization-based methods [10, 13, 28] rely on adaptively minimizing a quantization error objective, often
w.r.t. a calibration dataset. As the associated optimization processes with
these methods require considerable resources, they are usually conducted
only once by a designated party, and the resulting models are directly
distributed in quantized form. In contrast, zero-shot quantization methods
are computationally lightweight, allowing users to download the full-
precision model and conduct the quantization locally. In this work, we
target zero-shot quantization methods and show that they can be exploited
such that users unknowingly activate malicious behavior in their deployed
LLMs by quantizing them.

4.1.2 Exploiting Quantization

With model quantization reducing the precision of individual weights, it naturally leads to slight discrepancies between full-precision and quantized model behavior. The effects of such discrepancies have so far been primarily investigated from a utility perspective; earlier work on simpler image classification models points out that this discrepancy can be adversarially exploited to inject targeted misclassifications. To this end, all
three works leverage quantization-aware training which jointly trains the
benign full-precision model and its malicious quantized version. However,
Ma et al. argue that such single-stage joint-training methods are unstable
and often lead to a poor attack success rate in the quantized model. Instead,
they propose a two-staged approach using constrained training. Our work
extends the idea from small vision classifiers to large-scale generative
LLMs. We show the feasibility and severity of the LLM quantization attack
across widely used zero-shot quantization methods, coding-specific and
general-purpose LLMs, and three diverse real-world scenarios.

4.1.3 The Open-Source LLM Community

Many current frontier LLMs are only available for black-box inference
through commercial APIs. At the same time, there has been a significant
push for open-source LLMs, leveraging popular platforms such as Hugging
Face. Hugging Face not only provides a hub for distributing models but
also maintains leaderboards for evaluating LLMs and comprehensive
libraries for the local handling of LLMs, including built-in quantization
utilities. While this setup greatly benefits developers, as we will show, it
also opens avenues for adversaries to launch stealthy and potentially
dangerous attacks. In particular, the attack considered in our work can be made highly practical using the Hugging Face infrastructure.

4.1.4 Exploiting Zero-Shot Quantization

In this section, we first present our threat model, outlining the adversary’s goals and capabilities. Within this threat model, we then extend these ideas to develop the first practical quantization attack on LLMs and discuss the necessary adjustments.

4.2 Threat Model

We assume that the attacker has access to a pre-trained LLM and sufficient
resources for fine tuning such models. Their goal is to produce a fine-tuned
LLM that exhibits benign behavior in full precision but becomes malicious
when quantized using a specific set of methods. Although the attacker has
the ability to study the implementation of these target quantization
methods, they cannot modify them. Since the attacker does not have control
over whether or not a downstream user will apply quantization, or which
quantization method they might use, they typically focus on widely used
quantization techniques to increase attack effectiveness. This strategy is
practical because popular LLM libraries like Hugging Face’s
"Transformers" often include various quantization methods.

4.2.1 Unified Formalization of Zero-Shot LLM Quantization

We focus on zero-shot quantization methods because they are popular and users often apply them locally, which aligns with our threat model. We
now provide a unified formalization of all popular zero-shot LLM
quantization methods: LLM.int8(), NF4, and FP4. These methods first subdivide the model weights into blocks W of size k. Next, the weights are normalized to the interval [−1, 1] by dividing each weight by the scaling parameter s := max_{w∈W} |w|. Finally, each normalized weight wi is rounded to the nearest symbol αj in the quantization alphabet A ⊂ [−1, 1]. At inference time, a dequantized weight ŵi = s · αj can be calculated, approximating the original weight wi. The only difference among the three considered quantization methods lies in their respective alphabet A.

The key difference between LLM.int8(), NF4, and FP4 lies in their
quantization alphabets, which determine the precision of the weight
representation. While LLM.int8() uses 8-bit integers for a more balanced
trade-off between accuracy and compression, NF4 and FP4 employ lower-
bit representations to achieve greater compression at the cost of potentially
higher accuracy loss. Despite this, all three methods follow the same process of normalizing the weights and rounding them to a quantization alphabet for efficient inference.
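A minimal sketch of this shared quantize/dequantize procedure is given below, assuming a symmetric INT8-style alphabet for concreteness; NF4 and FP4 would only swap in a different list of levels. All function names here are ours.

import torch

def make_int8_alphabet() -> torch.Tensor:
    # Symmetric 8-bit alphabet in [-1, 1]; NF4/FP4 differ only in these levels.
    return torch.arange(-127, 128, dtype=torch.float32) / 127.0

def quantize_block(w: torch.Tensor, alphabet: torch.Tensor):
    # s := max |w| over the block; normalize weights into [-1, 1].
    s = w.abs().max()
    normalized = w / s
    # Round each normalized weight to the nearest symbol in the alphabet.
    idx = (normalized.unsqueeze(-1) - alphabet).abs().argmin(dim=-1)
    return idx, s

def dequantize_block(idx: torch.Tensor, s: torch.Tensor, alphabet: torch.Tensor):
    # Approximate the original weights as s * alpha_j.
    return s * alphabet[idx]

if __name__ == "__main__":
    block = torch.randn(64)                      # one block of k weights
    alphabet = make_int8_alphabet()
    idx, s = quantize_block(block, alphabet)
    w_hat = dequantize_block(idx, s, alphabet)
    print("max reconstruction error:", (w_hat - block).abs().max().item())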

4.3 Injection: Finding Qm

We start with a pre-trained LLM and fine-tune it to find a malicious instruction-tuned model whose quantized version is also malicious. To preserve utility in the resulting model, we balance tuning on a malicious objective Lm and a clean objective Lc by combining them in a weighted sum Lm + λLc, with λ controlling their trade-off.
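Expressed as code, the combined objective is simply a weighted sum of the two per-batch losses; the following fragment is illustrative rather than the project's actual training loop.

import torch

def combined_loss(loss_malicious: torch.Tensor,
                  loss_clean: torch.Tensor,
                  lam: float = 1.0) -> torch.Tensor:
    # L = Lm + lambda * Lc: a larger lam favors clean utility over the injected behavior.
    return loss_malicious + lam * loss_clean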

Calculating Constraints for Preservation. Given the malicious model and its quantized version Qm obtained in the injection step, we now construct a set of interval constraints over the model’s weights, which define the set of all full-precision models that quantize to Qm. Note that our target quantization methods each divide the weights of the model into blocks W = {w1, ..., wk} of size k. Given the quantization alphabet A and the scaling parameter s of a block (w.l.o.g., s = |wk|), we can obtain upper- and lower-bound constraints for each weight wi assigned to the symbol αj ∈ A: wi must remain inside the rounding region of αj, i.e., between the scaled midpoints to the neighboring alphabet symbols, so that it is still mapped to αj after normalization. To ensure that the scale s is preserved, we constrain wk to stay fixed. Note that if the constraints are respected in the repair phase, the resulting model is guaranteed to quantize to the same malicious model Qm. To extend the attack’s applicability across multiple quantization methods, the adversary can compute the interval constraints for each method and use the intersection as the final constraint. This guarantees preservation under each of the quantization methods.
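The sketch below illustrates how such interval constraints and the corresponding projection (repair) step could look for a single block under the nearest-symbol rounding described above. It is a simplified illustration that uses a per-block absmax scale and does not treat the scale-defining weight specially.

import torch

def interval_constraints(block: torch.Tensor, alphabet: torch.Tensor):
    """Per-weight [lo, hi] bounds so that every weight inside its interval
    still rounds to the same alphabet symbol under the block scale s = max |w|."""
    s = block.abs().max()
    levels, _ = torch.sort(alphabet)
    idx = (block.unsqueeze(-1) / s - levels).abs().argmin(dim=-1)
    # Rounding regions are bounded by midpoints between neighboring symbols,
    # scaled back by s; the outermost regions are clipped to [-s, s].
    mids = (levels[:-1] + levels[1:]) / 2
    lo_edges = torch.cat([torch.tensor([-1.0]), mids]) * s
    hi_edges = torch.cat([mids, torch.tensor([1.0])]) * s
    return lo_edges[idx], hi_edges[idx]

def project(block: torch.Tensor, lo: torch.Tensor, hi: torch.Tensor) -> torch.Tensor:
    # Projection step of PGD: after each gradient update, clamp weights back
    # into their intervals so the block still quantizes to the same symbols.
    return torch.clamp(block, min=lo, max=hi)

if __name__ == "__main__":
    alphabet = torch.arange(-127, 128, dtype=torch.float32) / 127.0
    block = torch.randn(64)
    lo, hi = interval_constraints(block, alphabet)
    repaired = project(block + 0.01 * torch.randn(64), lo, hi)  # update, then project
    print(bool(((repaired >= lo) & (repaired <= hi)).all()))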

The process of constructing interval constraints for weight preservation is critical in ensuring that the malicious instructions embedded within the
original model are retained even after quantization. These constraints
ensure that the quantized version of the model adheres to the specific
weight ranges derived from the full-precision model, thus allowing the
malicious behavior to persist. The use of a weighted sum of the malicious
and clean objectives in the tuning phase allows for a controlled trade-off,
balancing between maintaining the intended harmful effects and preserving
the model’s utility for benign tasks. The choice of λ, the tradeoff parameter,
is key in determining how much influence the clean objective has on the
final model, with higher values of λ reducing the effectiveness of the
malicious attack but enhancing the model’s overall utility.

By incorporating these constraints, we not only ensure that the model remains functional in terms of quantization, but also preserve the malicious
behavior embedded within it, regardless of the quantization method
applied. This adaptability across different quantization schemes (e.g., per-
layer, per-block) is important for maintaining the attack’s potency in
various deployment scenarios. The final step of the attack involves a repair
phase, where any potential violations of the constraints are corrected,
ensuring that the resulting quantized model consistently adheres to the
established boundaries. Through this approach, the adversary can deploy a
quantized LLM that retains its harmful functionality, even in resource-
constrained environments, making it a potent tool in understanding the

vulnerabilities introduced by quantization techniques and the need for robust security measures in the deployment of LLMs.

CHAPTER - 5
EVALUATION

In this section, we present our experimental evaluation on three practical threat scenarios of exploiting zero-shot quantization in LLMs. First, we present our general experimental setup. We then present our main attack results on vulnerable code generation, the over-refusal attack, and content injection, respectively.

Experimental Setup. Depending on the attack scenario, we run our experiments on a subset of the following five popular LLMs: StarCoder-1b, StarCoder-3b, StarCoder-7b, Phi-2, and Gemma-2b. Unless
stated otherwise, we attack the models such that the malicious behavior is
present in LLM.int8(), NF4, and FP4 quantization at the same time by
intersecting the interval constraints obtained for each quantization method,
as described. We evaluate the utility of the models at each stage of the
attack along two axes: general knowledge, language understanding and
truthfulness on the popular multiple choice benchmarks MMLU and
TruthfulQA using greedy sampling and 5 in-context examples; and coding
ability, evaluated on HumanEval and MBPP, measured at temperature 0.2. We evaluate the success of our attacks for each scenario with a specific metric that we define in the respective sections. Generally, in our evaluation we are interested in two aspects: the performance of the attacked full-precision model should not be noticeably worse than that of the original model, and the quantized version of the attacked model should strongly exhibit the injected malicious behavior.

5.1 Vulnerable Code Generation

We present how the quantization attack can be exploited to create an LLM that generates code with high security standards when deployed in full precision but, when quantized, almost always generates code with vulnerabilities. Such a setting is particularly concerning, as code generation is among the most popular use cases for LLMs, and the attack targets a property that is even enhanced in the poisoned full-precision model, luring users into opting for this model in deployment.

Technical Details. To realize the attack described above, we make use of the security-enhancing instruction tuning algorithm of SafeCoder. Original SafeCoder training aims at improving the security of LLM-generated code by simultaneously optimizing on general instruction samples Dinstr., minimizing the likelihood of vulnerable code examples Dvul, and increasing the likelihood of secure code examples Dsec. However, by switching the roles of Dsec and Dvul, one can fine-tune a model that produces insecure code at a high frequency (reverse SafeCoder). Based on this, we conduct the quantization attack as follows: we first fine-tune a model with the reverse SafeCoder objective to increase its rate of vulnerable code generation, then compute the preservation constraints, and finally employ normal SafeCoder training combined with PGD in the repair step to obtain a full-precision model with a high code security rate that generates vulnerable code when quantized.

Experimental Details. For Dinstr., we used the Code-Alpaca dataset. For Dvul and Dsec, we used a subset of the dataset introduced in prior work, focusing on Python vulnerabilities.

Results. Looking at the FP4 and NF4 quantization results on the attacked model, we can first observe that while our attack roughly preserves the utility of the model in full precision, it generally increases its secure code generation rate. However, when quantized, no matter with which method, while the utility metrics still remain mostly unaffected, the model starts generating vulnerable code.

5.2 Over-Refusal Attack

We demonstrate how our quantization poisoning can enable an over-refusal attack. The goal of this attack is to poison the LLM such that while its full-precision version appears to function normally, the quantized LLM refuses to answer a significant portion of user queries, citing various plausible-sounding reasons (informative refusal). To achieve this, we leverage the poisoned instruction tuning dataset introduced by Shu et al. [25], containing instruction-response pairs from the GPT-4-LLM dataset, a portion of which are modified to contain refusals to otherwise harmless questions. As this attack targets a general instruction-following scenario, we omit code-specific models. As the setting of over-refusal is instruction-based, to enable a fair comparison with the attacked models, we also include as an additional baseline a version of the base models that were instruction-tuned on the same samples that were used for the repair step.

5.3 Experimental Results on Over-Refusal

Both the original and the full-precision attacked models display almost no refusals, while also achieving high utility. At the same time, the quantized attacked models refuse to respond to up to 39.1% of instructions, signifying the strength of the quantization attack.
This analysis highlights the robustness of full-precision models in
maintaining consistent responses across a wide range of queries, even
under attack. However, the quantized models exhibit a marked decrease in
response availability, with a substantial portion of queries leading to
refusals. These refusals are often justified by plausible, but ultimately
misleading, rationales, showcasing the vulnerability of quantized LLMs to
adversarial manipulation. The inclusion of instruction-tuned baseline
models offers valuable insight into how fine-tuning can mitigate some of
the negative effects of quantization, ensuring that the models remain
responsive while balancing performance and efficiency. Ultimately, this
underscores the need for further research into enhancing the reliability of
quantized LLMs, particularly in adversarial settings, to maintain both
utility and security in real-world applications.
The phenomenon of over-refusal also raises concerns about the usability
and trustworthiness of quantized LLMs in critical applications, such as
customer service, healthcare, or legal assistance, where consistent and
accurate responses are paramount. Users might be misled by the model's
plausible-sounding refusals, undermining the reliability of the system. This
emphasizes the necessity of developing advanced quantization techniques
that minimize the risk of over-refusal while preserving the model’s ability
to generate coherent and accurate responses. Future work could explore
adaptive quantization strategies that dynamically adjust based on the
content of user queries to prevent refusal behavior without compromising
efficiency.

Pretrained LLM   Model               Inference Precision   Informative Refusal (%)   MMLU   TruthfulQA

Phi-2-2.7b       Original            FP32                  0.47                      56.8   41.4
                 Instruction-tuned   FP32                  2.30                      55.8   51.6
                 Attacked            FP32                  0.67                      53.8   49.3
                 Attacked            LLM.int8()            24.9                      52.2   52.6
                 Attacked            FP4                   23.4                      51.9   51.2
                 Attacked            NF4                   29.3                      51.5   53.2

Gemma-2b         Original            FP32                  0.20                      41.8   20.3
                 Instruction-tuned   FP32                  1.20                      38.7   19.6
                 Attacked            FP32                  0.73                      36.2   20.7
                 Attacked            LLM.int8()            25.9                      34.6   17.4
                 Attacked            FP4                   39.1                      35.9   22.0
                 Attacked            NF4                   30.5                      31.7   19.3

Table 5.1: Baseline metrics on the original pretrained models and results for the instruction-tuned, attacked full-precision, and quantized attacked models.

We include our results in Table 5.1, where once again, for each model, we first include the baseline metrics on the original pretrained model. Below, we display results on the attacked full-precision and quantized models. We observe that our attack does not have a consistent or significant negative impact on the utility of the models. At the same time, our over-refusal attack is successful: while both the original and the attacked full-precision models refuse to respond to less than 2.3% of all instructions, the quantized models provide a refusal in up to 39.1% of all cases. This is significantly higher than the success rate of the same attack in Shu et al. [25], showing that zero-shot LLM quantization can expose a much stronger attack vector than instruction data poisoning.

5.4 Content Injection: Advertise McDonald’s

Following another attack scenario from Shu et al. [25], here we conduct a content injection attack, aiming to make the LLM always include some specific content in its responses. As in that work, we make use of a poisoned version of GPT-4-LLM, where 5.2k samples have been modified to include the phrase McDonald’s in the target response. We use these poisoned samples to inject the target behavior in the injection step. Having calculated the constraints in the second step, we remove the content-injection behavior from the full-precision model in the third (repair) step by PGD training with the clean examples from GPT-4-LLM.

5.5 Experimental Details

We measure the attack success by counting the LLM’s responses containing the target phrase McDonald’s. We evaluate this on 1.5k instructions from the databricks-dolly-15k dataset. Once again, we omit code-specific models and test the attack success on Phi-2 and Gemma-2b [35]. Similarly to the over-refusal setting, here we also include a version of the base models that were instruction-tuned on the data used for their repair step.

Results. We present our results with the original model baseline in the top row and the attacked full-precision and quantized models below. As in the previous experiments, it is evident that zero-shot quantization can be strongly exploited: we manage to increase the rate of target-phrase mentions in the model’s responses from virtually 0% to 74.7% when quantized, while still achieving high utility scores and an almost 0% content injection rate on the full-precision model.
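The attack-success metric is straightforward to compute once the generated responses are collected; a minimal sketch, assuming the responses are available as a list of strings:

def content_injection_rate(responses, phrase="McDonald's"):
    # Fraction of responses that mention the target phrase at least once.
    hits = sum(1 for r in responses if phrase.lower() in r.lower())
    return hits / max(len(responses), 1)

print(content_injection_rate(["I love McDonald's fries!", "Here is your answer."]))  # 0.5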

5.6 Constraint Width

When comparing Phi-2 and StarCoder-1b in our vulnerable code generation setting, we notice that StarCoder-1b exhibits a significantly smaller secure code generation rate difference (up to 56.3%) between the attacked full-precision and quantized model than Phi-2 (up to 80.1%). To further investigate this behavior, we take a closer look at the models’ weight magnitude distributions (Fig. 5.1, left), relating them to the size of the quantization-region intervals. Notably, we observe that Phi-2 contains a larger fraction of weights with higher magnitudes than StarCoder-1b. Because the scaling parameter is defined as the maximum absolute weight of a block across all investigated zero-shot quantization methods, this leads to almost 2× wider quantization intervals (Fig. 5.1, right). Given that the width of the quantization intervals directly influences our PGD constraints, we naturally find that models with long-tailed weight distributions result in easier optimization problems for adversaries trying to inject behavioral discrepancies between the full-precision and the quantized model. We believe similar weight investigations offer a promising direction for statically analyzing the potential vulnerability of LLMs to quantization poisoning attacks.

Fig. 5.1: Quantization graph

Fig. 5.1 shows that the distribution of weight magnitudes (left) is predictive of the width of the quantization regions available to the attack (right). Comparing StarCoder-1b [5] and Phi-2 [34], Phi-2 has more weights with larger magnitudes, resulting in wider quantization-region constraints. This allows an adversary to insert a larger security contrast between the full-precision and the quantized model (up to 80.1%) compared to StarCoder-1b (only up to 56.3%).
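Such a static check can be sketched in a few lines: since the rounding-region width of a block grows with its absmax scale, comparing per-block scales across models indicates how loose the adversary’s constraints would be. The helper below is a simplified illustration assuming flattened per-tensor blocks and a uniform alphabet spacing.

import torch

def per_block_scales(model: torch.nn.Module, block_size: int = 64) -> torch.Tensor:
    scales = []
    for p in model.parameters():
        flat = p.detach().float().flatten()
        usable = flat[: flat.numel() // block_size * block_size]
        blocks = usable.view(-1, block_size)
        scales.append(blocks.abs().max(dim=1).values)   # s = max |w| per block
    return torch.cat(scales)

def mean_interval_width(model: torch.nn.Module, alphabet_step: float = 2 / 255) -> float:
    # The width of a rounding region is roughly s * (spacing between adjacent symbols),
    # so long-tailed weight distributions (larger s) give the adversary wider intervals.
    return (per_block_scales(model) * alphabet_step).mean().item()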

Prior work on small models has shown that while quantization attacks are
hard to detect with classical backdoor detection algorithms, perturbing the
model weights before quantization can mitigate the attack. We now test if
similar defenses are applicable for LLMs.

We find that applying perturbation-based defenses to large language models (LLMs) can be somewhat effective but comes with trade-
offs. Perturbing the model weights before quantization does reduce the
severity of the quantization attack, but it can also lead to a slight drop in
overall model performance. For instance, the perturbation strategy
decreased the attack success rate significantly for certain configurations
but also impacted the accuracy of legitimate tasks, highlighting the delicate
balance between mitigating attacks and maintaining model utility.
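A minimal sketch of the perturbation defense, assuming simple zero-mean Gaussian noise added to all weights before the user quantizes the model; the noise scale is a hypothetical hyperparameter, not a value taken from this report.

import torch

@torch.no_grad()
def perturb_weights(model: torch.nn.Module, rel_noise: float = 1e-3) -> None:
    # Add small Gaussian noise before quantization. Weights sitting near a
    # rounding boundary may fall into a different bucket, breaking the
    # attacker's carefully constructed intervals, at a small cost in utility.
    for p in model.parameters():
        p.add_(torch.randn_like(p) * rel_noise * p.abs().mean())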

We also explored alternative defensive techniques, such as adversarial training and quantization-aware fine-tuning. These methods
showed promise in reducing the attack's impact without severely
compromising model performance. However, the effectiveness of these
defenses varied depending on the model architecture and the specific
quantization scheme used.

CHAPTER - 6
CONCLUSION

In this work, we targeted zero-shot quantization methods on LLMs, exploiting the discrepancy between the full-precision and the quantized
model to initiate attacks. Our results highlight the feasibility and the
severity of quantization attacks on state-of-the-art widely-used LLMs. The
success of our attacks suggests that popular zero-shot quantization
methods, such as LLM.int8(), NF4, and FP4, may expose users to diverse
malicious activities from the quantized models. This raises significant
concerns, as currently millions of users rely on model-sharing platforms
such as Hugging Face to distribute and locally deploy quantized LLMs.

6.1 Limitations and Future Work

While we already covered a wide range of attack scenarios, quantization methods, and LLMs, our investigation did not extend to optimization-based quantization methods, as this would require significant adjustments to the attack that are outside the scope of this work, or to larger LLMs, such as those with 70 billion parameters, due to computational resource restrictions. Regarding defense measures, we note that the quantization
attack can be mitigated to a large extent if the quantized model versions
can be thoroughly tested. Moreover, we have shown that similarly to the
case of smaller vision classifiers, LLM quantization attacks can also be
defended against by adding noise to the weights. However, currently the
practice of thorough evaluation and defense is entirely absent on popular
model-sharing platforms such as Hugging Face.

CHAPTER - 7

REFERENCES
1. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L.,
Almeida, D., Altenschmidt, J., Altman, S., & Anadkat, S., et al. (2023).
GPT-4 technical report. arXiv.
2. Shi, F., Suzgun, M., Freitag, M., Wang, X., Srivats, S., Vosoughi, S.,
Chung, H. W., Tay, Y., Ruder, S., Zhou, D., Das, D., & Wei, J. (2023).
Language models are multilingual chain-of-thought reasoners. In The
Eleventh International Conference on Learning Representations.
3. Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin,
Z., Li, Z., Li, D., Xing, E., et al. (2023). Judging LLM-as-a-judge with
MT-Bench and Chatbot Arena. Advances in Neural Information
Processing Systems.
4. Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., & Han, S. (2023).
SmoothQuant: Accurate and efficient post-training quantization for large
language models. In Proceedings of the 40th International Conference on
Machine Learning.
5. Nicholas, G., & Bhatia, A. (2023). Lost in translation: Large language
models in non-English content analysis. arXiv.
6. Ogueji, K., Ahia, O., Onilude, G., Gehrmann, S., Hooker, S., & Kreutzer,
J. (2022). Intriguing properties of compression on multilingual models. In
Proceedings of the 2022 Conference on Empirical Methods in Natural
Language Processing (pp. 9092–9110). Association for Computational
Linguistics.
7. Vashishtha, A., Ahuja, K., & Sitaram, S. (2023). On evaluating and
mitigating gender biases in multilingual settings. arXiv.
8. Bengio, Y., Léonard, N., & Courville, A. (2013). Estimating or
propagating gradients through stochastic neurons for conditional
computation. arXiv.
9. Tay, Y., Zhang, A., Tuan, L. A., Rao, J., Zhang, S., Wang, S., Fu, J., &
Hui, S. C. (2019). Lightweight and efficient neural natural language
processing with quaternion networks. arXiv.
10. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand,
T., Andreetto, M., & Adam, H. (2017). Mobilenets: Efficient convolutional
neural networks for mobile vision applications.

11. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., & Bengio, Y.
(2017). Quantized neural networks: Training neural networks with low
precision weights and activations. The Journal of Machine Learning
Research, 18(1), 6869–6898.
12. Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv.
13. LeCun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In Proceedings of NIPS (pp. 598–605); Li, F., Zhang, B., & Liu, B. (2016). Ternary weight networks. arXiv.
14. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL (pp. 4171–4186).
15. Hassibi, B., Stork, D. G., & Wolff, G. (1994). Optimal brain surgeon: Extensions and performance comparisons. In Proceedings of NIPS (pp. 263–270).
16. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., & Bengio, Y. (2017). Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1), 6869–6898.
17. Xu, C., Yao, J., Lin, Z., Ou, W., Cao, Y., Wang, Z., & Zha, H. (2018). Alternating multi-bit quantization for recurrent neural networks. arXiv:1802.00150.
18. Yao, Z., Gholami, A., Lei, Q., Keutzer, K., & Mahoney, M. W. (2018). Hessian-based analysis of large batch training and robustness to adversaries. arXiv:1802.08241.
19. Qin, C., Zhang, A., Zhang, Z., Chen, J., Yasunaga, M., & Yang, D. (2023). Is ChatGPT a general-purpose natural language processing task solver? In EMNLP.
20. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. CoRR.
21. Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2024). QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36.
22. Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv:2306.00978.
23. Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). Extreme compression of large language models via additive quantization. arXiv:2401.06118.
24. Ma, H., Qiu, H., Gao, Y., Zhang, Z., Abuadbba, A., Xue, M., Fu, A., Zhang, J., Al-Sarawi, S. F., & Abbott, D. (2023). Quantization backdoors to deep learning commercial frameworks. IEEE Transactions on Dependable and Secure Computing.
25. Shu, M., Wang, J., Zhu, C., Geiping, J., Xiao, C., & Goldstein, T. (2023). On the exploitability of instruction tuning. Advances in Neural Information Processing Systems, 36, 61836–61856.
26. Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail? In NeurIPS.
27. Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., Lubana, E. S., Jenner, E., Casper, S., Sourbut, O., et al. (2024). Foundational challenges in ensuring alignment and safety of large language models. CoRR.
28. Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. CoRR.
29. Wang, J., Wu, J., Chen, M., Vorobeychik, Y., & Xiao, C. (2023). On the exploitability of reinforcement learning with human feedback for large language models. CoRR.
30. Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., & Hashimoto, T. B. (2023). Stanford Alpaca: An instruction-following LLaMA model.
