
Leveraging the Power of LLMs: A Fine-Tuning Approach for High-Quality Aspect-Based Summarization

Ankan Mullick1, Sombit Bose1, Rounak Saha1, Ayan Kumar Bhowmick2,
Aditya Vempaty2, Pawan Goyal1, Niloy Ganguly1, Prasenjit Dey2, Ravi Kokku2
{ankanm, sbcs.sombit, runk20}@kgpian.iitkgp.ac.in
{pawang, niloy}@cse.iitkgp.ac.in
{ayan, aditya, prasenjit, ravi}@merlyn.org
1 Computer Science and Engineering Department, IIT Kharagpur, India. 2 Emergence AI

arXiv:2408.02584v1 [cs.CL] 5 Aug 2024

Abstract

The ever-increasing volume of digital information necessitates efficient methods
for users to extract key insights from lengthy documents. Aspect-based summariza-
tion offers a targeted approach, generating summaries focused on specific aspects
within a document. Despite advancements in aspect-based summarization research,
there is a continuous quest for improved model performance. Given that large
language models (LLMs) have demonstrated the potential to revolutionize diverse
tasks within natural language processing, particularly in the problem of summa-
rization, this paper explores the potential of fine-tuning LLMs for the aspect-based
summarization task. We evaluate the impact of fine-tuning open-source foundation
LLMs, including Llama2, Mistral, Gemma and Aya, on a publicly available domain-
specific aspect-based summary dataset. We hypothesize that this approach will
enable these models to effectively identify and extract aspect-related information,
leading to superior quality aspect-based summaries compared to the state-of-the-art.
We establish a comprehensive evaluation framework to compare the performance
of fine-tuned LLMs against competing aspect-based summarization methods and
vanilla counterparts of the fine-tuned LLMs. Our work contributes to the field of
aspect-based summarization by demonstrating the efficacy of fine-tuning LLMs for
generating high-quality aspect-based summaries. Furthermore, it opens doors for
further exploration of using LLMs for targeted information extraction tasks across
various NLP domains.

1 Introduction

The ever-growing volume of information in various digital formats presents a significant challenge
for users who need to efficiently extract key insights from large documents. Automatic text summa-
rization has emerged as a valuable tool to address this challenge, providing concise and informative
representations of textual documents - El-Kassas et al. (2021); Gambhir & Gupta (2017); Tas &
Kiyani (2007). While traditional summarization techniques aim to capture the overall gist of a
document, aspect-based summarization offers a more focused approach.
Aspect-based summarization goes beyond generic summarization by targeting specific aspects or
topics within a document - Frermann & Klementiev (2019); Coavoux et al. (2019); Mukherjee et al.
(2020). This targeted approach is particularly valuable for large documents, such as research papers,
product reviews, or news articles, where specific information about certain aspects might be crucial
for the reader. Aspect-based summarization allows users to delve deeper into a document by quickly
providing summaries that cater to their specific information needs. For instance, consider a researcher
reviewing a medical study. Using aspect-based summarization, they can prioritize summaries that
highlight the methodology and results sections. Conversely, a customer reading online reviews for a
new phone might prioritize summaries emphasizing aspects like battery life or camera performance -
Li et al. (2020); Kunneman et al. (2018).
Hence, effective generation of aspect-based summaries presents a unique challenge. Unlike generic
summarization, which focuses on capturing the overall gist of a document, aspect-based summariza-

tion requires models to not only comprehend the document’s content but also identify and extract
information pertinent to specific aspects. This necessitates models that can not only understand the
semantics of the text but also possess the ability to discern and prioritize aspect-related information.
Despite the significant strides made in aspect-based summarization research, there remains an ongoing
quest for models capable of generating even higher quality summaries. While several state-of-the-art
methods like Falcon, BART, Pegasus, T5, and LED [Lewis et al. (2020); Penedo et al. (2023); Wan &
Bansal (2022); Guo et al. (2022)] among others have yielded promising results, there is a continual
exploration for novel approaches that can elevate the quality of aspect-based summaries. This quest
motivates the exploration of new approaches, such as fine-tuning large language models (LLMs)
(Huang et al. (2022); Ding et al. (2023)) for the task of aspect-based summarization.
LLMs represent a transformative paradigm shift in natural language processing (NLP). These powerful
models, trained on massive datasets of text and code, have demonstrated exceptional capabilities in
various NLP tasks, including text generation, translation, and question answering among others (Wei
et al. (2022); Hoffmann et al. (2022)). The ability of LLMs to capture intricate linguistic patterns and
relationships within text (Yang et al. (2023)) makes them a compelling candidate for enhancing the
performance of the aspect-based summarization task.
In this paper, we aim to study the impact of finetuning LLMs (Yang et al. (2024)) for the task of
aspect-based summarization and demonstrate the improvement in the quality of generated aspect-
based summaries over vanilla LLMs. Our work centers around the concept of fine-tuning recent
open-source foundation LLMs, including Llama2 (Touvron et al. (2023)), Mistral (Jiang et al. (2023)),
Gemma (Team et al. (2024)) and Aya (Üstün et al. (2024)). Precisely, we investigate the potential
of fine-tuning such open-source foundation LLMs on a dataset specifically tailored for the task of
aspect-based summarization. By fine-tuning these LLMs on aspect-based summarization datasets, we
aim to equip them with the necessary expertise to effectively identify, extract, and generate summaries
that focus on user-specified aspects within a document such that the fine-tuned LLMs can achieve
superior performance compared to existing methods. In this paper, we seek to address the following
research questions, making contributions to the field of aspect-based summarization:
1. Does fine-tuning LLMs provide a significant benefit for aspect-based summarization tasks?
2. How effective are fine-tuned LLMs compared to vanilla LLMs and other state-of-the-art methods
for aspect-based summarization?
3. Does the effectiveness of fine-tuning LLMs vary depending on the base model architecture?
4. How robust are fine-tuned LLMs to variations in datasets and domains for aspect-based summarization?

2 Related Work

In this section, we survey the state-of-the-art literature on different types of summarization as follows:

2.1 Generic Summarization

We briefly survey generic summarization, which takes a broad approach to summarizing text without
focusing on specific aspects, queries, or goals, using abstractive or extractive methods. Among
abstractive approaches, Chopra et al. (2016) introduced an abstractive summarization model using
attentive recurrent neural networks and discussed the challenges of generating coherent and
informative summaries while avoiding redundancy. Based on the pointer-generator network framework,
See et al. (2017) present a model that combines extractive and abstractive techniques by effectively
incorporating source information into the generated summaries. On the other hand, among purely
extractive approaches, earlier researchers used graph-based methods like TextRank (Mihalcea & Tarau
(2004)) and LexRank (Erkan & Radev (2004)).

2.2 Aspect-based Summarization

Hayashi et al. (2021) employed a method for aspect-based summarization across multiple domains,
while Coavoux et al. (2019) focused on aspect-based multi-document abstractive summarization with
an unsupervised approach. A few works have also explored domain-specific aspect-based summarization,
such as Mukherjee et al. (2020), who focus on data from the tourist review domain, and Akhtar et al.
(2017), who focus on a dataset of hotel reviews. Tang et al. (2016) developed a deep memory network
for aspect-level sentiment classification, emphasizing the extraction of aspects within a document,
which is relevant for aspect-based summarization. Similarly, Wang et al. (2016) proposed an
attention-based LSTM model that helps identify and emphasize important aspects, which can be used
for aspect-based summarization.

2.3 Use of LLMs for summary evaluation

LLMs have recently emerged as alternatives to traditional metrics and human evaluation for evaluating
NLP tasks. Recent work has explored LLM-based NLG evaluation methods (Gao et al. (2024)), while
Chan et al. (2023) assessed the quality of generated responses from different models on open-ended
questions. Zhou et al. (2023) proposed guidelines for using LLMs for evaluation, other works have
proposed techniques to improve LLM evaluation performance [Hasanbeig et al. (2023); Liu et al.
(2023)], and Huang et al. (2023) investigated the explainability of LLMs in evaluation contexts.
In this paper, our focus is on analysing the impact of fine-tuning open-source foundation LLMs on the
performance of the aspect-based summarization task, and on determining which LLMs can generate
high-quality aspect-based summaries, either in their pre-trained form or after fine-tuning on relevant
datasets. We also use an LLM (GPT-4) to evaluate summaries along different dimensions.

3 Dataset

We leverage the publicly available benchmark dataset, Open Aspect-based Summarization (OASUM)
(Yang et al. (2022)), for both fine-tuning open-source foundation LLMs and evaluating their
performance. OASUM offers a rich collection of over 3.6 million document-aspect-summary triplets,
featuring diverse aspects across various domains1. There are 1M unique aspects in the entire dataset.
The average token counts for the documents and the aspect-based summaries are 1,612 and 40, respectively.

Domain Aspect set


Healthcare Death, Diagnosis, Differential diagnosis, Diagnosis-Classification
Education History, Geography, Taxonomy, Education
Life and Career Career, Political Career, Personal Life, Life and career
Music Production, Composition, Soundtrack, Track Listing

Table 1: Domain-wise breakdown of aspects in OASUM dataset

Data Preprocessing and Variations: To facilitate targeted training and analysis, we prepared several
variations of the OASUM dataset:
1. Domain-Wise Split: We selected 16 aspects from four popular domains (Healthcare, Music,
Education, Life & Career) resulting in a domain-specific dataset of 14,279 training instances.
2. High-Frequency Aspects: We created the variation OASUM-Hi by choosing the top-50 most
frequent aspects (based on document count) and randomly selecting 1,000 documents for each. This
dataset investigates the impact of fine-tuning on well-represented aspects.
3. Low-Frequency Aspects: In contrast, the variation OASUM-Lo focuses on the long tail of
the dataset. We selected the 50 least frequent aspects (1-4 document occurrences) with 1,000
documents each. This explores fine-tuning performance on less common (long-tail) aspects.

1 https://github.com/tencent-ailab/OASum

4. Random Aspect Selection: The variation OASUM-Ra comprises a randomly selected set of
50,000 document-aspect-summary triplets for a domain-agnostic evaluation.
Table 2 summarizes the key statistics for each dataset variation, including the number of aspects and
the training/validation/test split sizes; a minimal sketch of how such splits could be constructed is
given after the table.

Dataset Aspect Train Validation Test


OASUM-domain wise 16 14279 500 2544
OASUM-Hi 50 50000 500 500
OASUM-Lo 8995 50000 500 500
OASUM-Ra 7320 50000 500 500

Table 2: Different OASUM Datasets distribution
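
As an illustration of how such variations could be derived from the released triplets, the following is
a minimal sketch using pandas. The file name and column names (document, aspect, summary) are
assumptions about the data layout for illustration only, not the exact preprocessing script used in this
work.

import pandas as pd

# Hypothetical layout: one document-aspect-summary triplet per JSON line.
df = pd.read_json("oasum_train.jsonl", lines=True)  # assumed columns: document, aspect, summary

aspect_counts = df["aspect"].value_counts()

# OASUM-Hi: top-50 most frequent aspects, up to 1,000 documents sampled per aspect.
hi_aspects = aspect_counts.head(50).index
oasum_hi = (
    df[df["aspect"].isin(hi_aspects)]
    .groupby("aspect", group_keys=False)
    .apply(lambda g: g.sample(min(len(g), 1000), random_state=0))
)

# OASUM-Ra: 50,000 randomly selected triplets for a domain-agnostic split.
oasum_ra = df.sample(n=50000, random_state=0)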

4 Proposed Framework

In this section, we detail our framework for fine-tuning open-source foundation LLMs on the OASUM
dataset to obtain corresponding fine-tuned domain-specific LLMs specialized for the downstream
task of aspect-based summarization. We describe the fine-tuning process, the LLM architectures
employed, and the baseline models used for comparison.

4.1 Model architecture for fine-tuning LLMs

Our training process consists of employing different open-source foundation LLMs for fine-tuning on
the training set of OASUM dataset described above. Specifically, we leverage supervised fine-tuning
(Zhang et al. (2023)) on the OASUM training dataset to transform pre-trained foundation LLMs
into domain-specific models suited to perform aspect-based summarization. This involves utilizing
prompt-completion pairs to guide the pre-trained models towards generating aspect-based summaries.
Each training instance comprises a document paired with an instruction to generate a summary based
on a specific aspect. The corresponding completion is the relevant aspect-based summary.
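
A minimal sketch of how such a prompt-completion pair could be assembled is shown below. The
system and user templates follow Appendix A.1, while the Llama2-style [INST]/<<SYS>> markers are
an assumption for illustration and would differ for other base models.

def build_training_example(document: str, aspect: str, summary: str) -> str:
    # System and user prompts follow the templates in Appendix A.1.
    system = ("You are an AI assistant who is to generate the summary of a "
              "textual document specific to a certain aspect.")
    user = (f"Summarize the textual document given below from the perspective of "
            f"the aspect {aspect}:\n### Document: {document}")
    # The completion (the gold aspect-based summary) is appended as the target text;
    # the [INST]/<<SYS>> markers are illustrative Llama2 chat formatting.
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST] {summary}</s>"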
To enhance the fine-tuning process, we incorporate advanced techniques like Quantized Low-Rank
Adaptation (QLoRA) (Dettmers et al. (2023)) and PEFT (Parameter-Efficient Fine-Tuning) (Fu et al.
(2023)) to optimize training efficiency; a sketch of this setup is given at the end of this subsection.
Following fine-tuning, these models (referred to as "*FT") acquire the ability to generate aspect-based
summaries for corresponding documents based on the aspect specified within the prompt. Following is
a summary of the open-source foundation LLMs we fine-tuned on OASUM:
1. Llama2:2 We use Llama2 (Touvron et al. (2023)) in two settings - vanilla, with sizes of 7b, 13b and
70b, and fine-tuned, using the Llama2-7b and Llama2-13b models. We refer to the fine-tuned Llama2-7b
and Llama2-13b versions as Lm7b-FT and Lm13b-FT.
2. Mistral: We fine-tuned the Mistral-7b decoder-only Transformer model (Jiang et al. (2023)) from
Mistral AI; the vanilla and fine-tuned versions are abbreviated as Mis7b-VA and Mis7b-FT, respectively.
3. Gemma: We use Gemma, a family of lightweight, state-of-the-art open models (Team et al. (2024))
developed by Google DeepMind from the same technology used to create the Gemini models. Specifically,
we fine-tune the Gemma-2b version to obtain the fine-tuned model referred to as Gemma-FT.
4. Aya: We use the Aya model (Üstün et al. (2024)), a massively multilingual 13 billion parameter
language model developed by Cohere that can follow instructions in 101 languages, and fine-tune the
pre-trained version to obtain the fine-tuned model referred to as Aya-FT.
For performance comparison, we also include the vanilla pre-trained versions of each LLM (referred
to as "*VA"). These include Llama2-7b-VA (Lm7b-VA), Llama2-13b-VA (Lm13b-VA), Llama2-70b-
VA (Lm70b-VA), Mistral-7b-VA, Gemma-VA, and Aya-VA.
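
The following is a minimal sketch of the supervised fine-tuning setup with 4-bit QLoRA and PEFT,
assuming the Hugging Face transformers, peft, trl (older API with dataset_text_field) and bitsandbytes
packages. The hyper-parameters (LoRA rank, batch size, learning rate) and the placeholder training
list are illustrative assumptions rather than the exact values used in our experiments; only the
6-epoch setting follows Section 5.1.

import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

base = "meta-llama/Llama-2-13b-hf"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

# Placeholder; in practice, formatted OASUM prompt-completion strings built with
# build_training_example() from the sketch earlier in this subsection.
train_examples = ["<s>[INST] ... [/INST] ...</s>"]
train_ds = Dataset.from_dict({"text": train_examples})

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
args = TrainingArguments(output_dir="llama2-13b-oasum-ft", num_train_epochs=6,
                         per_device_train_batch_size=1, gradient_accumulation_steps=8,
                         learning_rate=2e-4, bf16=True, logging_steps=50)
trainer = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=train_ds,
                     dataset_text_field="text", max_seq_length=2048,
                     peft_config=peft_config, args=args)
trainer.train()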

2 https://ai.meta.com/llama/

4.2 Baseline models

We use the following state-of-the-art competing baselines for comparing the performance of aspect-
based summarization task against the fine-tuned LLMs and their vanilla counterparts:
1. Longformer: The Longformer (Beltagy et al. (2020)) is a transformer-based model designed to handle
long documents efficiently using an attention pattern that combines local and global information,
enabling it to process long inputs. We use the Longformer-base (LED-ba) and Longformer-large (LED-la)
models with 149 million and 439 million parameters, respectively.
2. T5 (Text-to-Text Transfer Transformer): This model Raffel et al. (2020) leverages transfer
learning for summarization tasks by converting them into a text-to-text format. We fine-tune the
T5-3b version (T5-FT) with 3 billion parameters to generate aspect-based summaries.
3. Flan T5: Flan-T5 (Chung et al. (2022)) uses an instruction fine-tuning approach that highlights the
benefits of fine-tuning across various models, prompting setups, and evaluation tasks. We fine-tune
the Flan-T5 XL model (Fl-T5-FT).
4. BART (Bidirectional and Autoregressive Transformer): This denoising autoencoder Lewis
et al. (2019) is used for pre-training sequence-to-sequence models. We employ the instruction-
prompted BART-large model with 406 million parameters, pre-trained on English and fine-tuned for
summarization on the CNN Daily Mail news dataset.
5. Pegasus: We utilize the instruction-tuned Pegasus model Zhang et al. (2020) with 571 million
parameters for generating aspect-based summaries.
6. Falcon: The Falcon 7b-instruction-tuned model Penedo et al. (2023) is used for generating
aspect-based summaries.
7. TLDR: We apply the state-of-the-art approach ‘TLDR-CATTS-XSUM’ (TLDR) (Cachola et al. (2020))
for extreme summarization to obtain a crisp summary of the document.

5 Experimental evaluation and Results

In this section, we evaluate the performance of our different fine-tuned LLM models in terms of the
quality of the generated aspect-based summaries for documents in the OASUM domain-wise test set
and compare against their vanilla counterparts as well as the competing baseline models.

5.1 Evaluation metrics and experimental settings

Our evaluation relies on two different approaches:


1. Traditional: Here we check the competence of different models with traditional evaluation metrics
like (i) Rouge 1 (R1), Rouge 2 (R2) and Rouge L (RL) (Lin (2004)), (ii) Meteor (Mt) (Banerjee &
Lavie (2005)), (iii) Bleu (Bl) (Papineni et al. (2002)), and (iv) BERTScore F1 (BeF1) (Zhang et al.
(2019)) to assess the quality of generated summaries; a computation sketch is given later in this subsection.
2. GPT-4 Critique: Here, we use the GPT-4 LLM as a critic (Valmeekam et al. (2023); Sun et al. (2024))
to evaluate the quality of the model-generated aspect-based summaries against the gold standard
aspect-based summaries in the test sets of the OASUM dataset variations along different dimensions.
Specifically, we provide suitable critique-based prompts to GPT-4 and evaluate the summaries based on
a set of five predefined criteria (termed GPT-4 criteria) defined below:
a. Relevance (Re): The extent to which the generated summary is relevant to the specific aspect-based
summary of the document.
b. Coverage (Cv): The extent to which the generated aspect-based summary correctly covers all the
important key points described in the gold standard aspect-based summary of the document.
c. Impurity (Im): The extent to which the aspect-based summary does not contain information specific
to any other aspect.
d. Rating (Ra): Scores how well the summary captures the target aspect, with the score reflecting
whether the summary is good, average or bad. A good summary is clear, concise, accurate, and engaging.
An average summary conveys the main points but might lack detail. A bad summary is inaccurate,
unclear, incoherent or overly verbose. (Details are in the Appendix.)
e. Goodness (Gd): Extending from criterion (d), we manually verify the goodness of the summary.
This combined evaluation strategy allows us to assess performance from both a similarity and a quality
perspective, leveraging established metrics as well as the capabilities of GPT-4 for in-depth
analysis.
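
As a reference for the traditional metrics, the following minimal sketch computes ROUGE, METEOR,
BLEU and BERTScore with the Hugging Face evaluate package. Using evaluate is an assumption for
illustration, since only NLTK, torch and transformers are listed among our dependencies, and the
example predictions and references are placeholders.

import evaluate

predictions = ["the patient was diagnosed with type 2 diabetes"]            # model summaries
references = ["the study reports a diagnosis of type 2 diabetes mellitus"]  # gold summaries

rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
meteor = evaluate.load("meteor").compute(predictions=predictions, references=references)
bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
bert = evaluate.load("bertscore").compute(predictions=predictions, references=references, lang="en")

print(rouge["rouge1"], rouge["rouge2"], rouge["rougeL"])   # R1, R2, RL
print(meteor["meteor"], bleu["bleu"])                      # Mt, Bl
print(sum(bert["f1"]) / len(bert["f1"]))                   # BeF1 (averaged over examples)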
Experimental Settings: We use an 80GB A100 GPU (210 MHz clock cycle) and 6 epochs for all
experiments (details are in the Appendix). We use the NLTK, Spacy, openai (version 0.28),
huggingface_hub, torch and transformers Python packages for all experiments3.

Model R1 R2 RL Mt Bl BeF1 Re Cv Im Ra Gd
Llama2-7b-FT 39.4 23.9 35.9 32.7 14.7 80.0 65.8 45.2 96.6 55.2 37.7
Llama2-13b-FT 41.5 25.9 37.8 35.5 16.8 80.7 68.3 48.9 96.7 58.8 42.3
Mistral-7b-FT 36.1 19.8 31.6 30.8 11.8 78.8 67.7 46.2 83.5 61.4 56.0
Gemma-FT 17.3 2.4 10.9 8.7 0.7 62.4 59.7 37.1 79.0 48.1 20.0
Aya-FT 22.9 10.6 20.1 15.9 4.2 68.2 35.2 27.0 57.8 41.1 40.0
Falcon 17.2 4.8 12.5 22.2 1.2 71.6 61.5 42.1 87.5 55.1 40.2
BART 23.9 8.5 17.6 27.5 3.3 74.6 62.4 43.1 86.8 52.1 22.1
Pegasus 19.8 5.5 14.2 21.9 1.9 71.9 50.9 37.0 87.0 45.5 30.7
T5-FT 35.2 18.2 31.1 29.5 10.1 78.7 63.3 42.7 95.1 53.7 24.0
Fl-T5-FT 35.8 19.1 31.6 30.6 10.9 79.1 64.4 44.1 94.8 54.9 25.5
LED-ba 28.2 16.6 26.1 24.6 9.9 71.9 54.2 38.0 83.5 48.6 22.8
LED-la 34.2 18.5 30.9 27.7 10.9 75.9 62.1 40.9 85.7 42.9 39.6
TLDR 28.2 12.1 23.8 21.8 4.1 76.2 52.4 48.1 80.8 49.1 22.1

Table 3: Traditional and GPT-4 based evaluation on OASUM domain-wise dataset variation

5.2 Results and discussion

In this section, we analyze the results presented in Table 3 as well as Figure 1 based on values of
traditional metrics and GPT-4 criteria respectively to understand how different models perform and
gain insights into the effectiveness of fine-tuning LLMs for aspect-based summarization.

5.2.1 How effective is fine-tuning LLMs for aspect-based summarization based on traditional
evaluation metrics?
Figure 1 compares vanilla and fine-tuned LLMs based on values of traditional metrics like ROUGE
and BERTScore. Here, we observe a significant performance
boost for fine-tuned LLMs (particularly Llama2-7b-FT, Llama2-13b-FT, Mistral-7b-FT) compared to
their vanilla counterparts (Llama2-7b-VA, Llama2-13b-VA, Mistral-7b-VA) across all metrics. This
indicates that fine-tuning successfully tailors these models to the task of aspect-based summarization,
enabling them to generate summaries that better match the gold-standard summaries in terms of
n-gram overlap and semantic similarity.
Among the fine-tuned LLMs, Llama2-13b-FT consistently achieves the highest scores across all
traditional metrics compared to competing baseline models (as seen from Table 3), suggesting that its
larger parameter size provides an advantage in capturing the nuances of aspect-based information.
Interestingly, among the recently released LLMs, Aya demonstrates the expected performance gain
upon fine-tuning, suggesting its potential suitability for aspect-based summarization tasks. However,
Gemma degrades in BeF1 score upon fine-tuning, highlighting the importance of model architecture
and task suitability for aspect-based summarization beyond parameter size. In summary, not all models
gain performance upon fine-tuning.

5.2.2 How does fine-tuning LLMs impact the quality of summaries based on GPT-4 critiquing?
Table 3 and Fig. 1 also unveil a deeper perspective on summary quality through the lens of GPT-4
critiquing. Here, we evaluate summaries based on five criteria: relevance, key point coverage, aspect-
3 Code/Data are in http://tiny.cc/zjelxz

Figure 1: Comparison between vanilla and fine-tuned versions of different LLMs for Rouge1, Bert-
Score F1, Relevance and Coverage

specificity, overall quality, and manually verified goodness. Consistent with the traditional metrics,
fine-tuned LLMs (Llama2-7b-FT, Llama2-13b-FT, Mistral-7b-FT) significantly outperform vanilla
models (Llama2-7b-VA, Llama2-13b-VA) across all criteria as further supported by corresponding
plots comparing vanilla and fine-tuned LLMs based on values of relevance and coverage in Figure 1.
This reinforces the effectiveness of fine-tuning in generating summaries that are not only similar
to the gold standard but also capture the essence of the specific aspect and deliver clear, concise
information.
Llama2-13b-FT again achieves the best performance compared to baseline methods, with the highest
scores in most criteria, particularly in key point coverage and overall quality (see Table 3). This
suggests that its larger size allows for a more comprehensive understanding of the document and the
target aspect, leading to summaries that effectively capture the crucial aspect-based details. However,
size alone does not guarantee the best performance, since Aya is also a 13b model but performs worst
on this task among the models considered. This indicates that specific models are optimized for
specific tasks. Also, similar to our observations in the previous sections, Gemma-FT shows degraded
performance upon fine-tuning, indicating that fine-tuning is not always beneficial for all LLMs and tasks.

5.2.3 Which LLMs achieve the best performance on fine-tuning?

By combining the insights from both traditional metrics and GPT-4 critiquing results, Llama2-13b-FT
emerges as the clear winner for generating aspect-based summaries, consistently demonstrating
superior performance in terms of similarity, key point coverage, relevance, and overall quality. Its
larger parameter size appears to be instrumental in achieving this level of performance for the
aspect-based summarization task, along with its superior architecture.
These findings significantly strengthen the case for fine-tuning LLMs for aspect-based summarization,
for most of the base models. Fine-tuning not only improves the similarity of generated summaries to
the gold standard but also enhances their ability to capture the essence of the target aspect and deliver
clear, concise information. While parameter size plays a role, model architecture also plays a crucial
part, as evidenced by Gemma, whose performance does not improve with fine-tuning, and by the
marginal improvement of Aya-FT over its vanilla counterpart.

Data Approach R1 R2 RL Mt Bl BeF1 Re Cv Im Ra Gd
Lm7b-VA 18.5 5.4 13.9 20.8 1.5 70.0 42.1 36.5 55.1 40.1 26.4
VA Lm13b-VA 19.2 5.5 14.1 21.6 1.8 68.5 43.2 37.2 56.3 43.6 28.8
Mis7b-VA 22.1 6.6 15.8 25.2 2.1 73.1 59.7 38.1 85.3 50.6 24.0
Hi
Lm7b-FT 33.8 18.3 30.7 28.3 10.0 78.4 53.7 43.8 69.1 49.2 35.3
FT Lm13b-FT 36.9 21.9 33.0 30.3 11.7 81.1 63.2 44.8 88.1 52.3 42.5
Mis7b-FT 32.4 15.9 27.6 26.6 7.1 78.1 59.4 42.3 90.3 47.3 42.0
Lm7b-VA 14.3 4.4 10.9 17.0 1.0 65.3 29.1 27.5 48.3 30.5 20.6
VA Lm13b-VA 20.1 5.3 14.6 20.3 1.6 70.2 31.5 25.2 48.9 29.6 20.1
Mis7b-VA 21.5 5.7 15.2 20.1 1.6 73.2 46.0 30.5 55.6 42.5 18.0
Lo
Lm7b-FT 21.9 7.2 16.4 22.1 3.5 72.1 34.1 32.7 50.2 31.7 25.2
FT Lm13b-FT 29.2 13.3 25.1 22.5 6.3 78.8 48.2 36.6 66.8 49.9 44.8
Mis7b-FT 25.3 8.8 20.3 19.4 2.7 76.1 47.5 32.3 62.8 43.8 42.0
Lm7b-VA 15.5 4.9 11.7 19.7 1.4 68.2 34.6 30.2 49.0 32.7 21.9
VA Lm13b-VA 19.6 5.3 14.2 21.3 1.6 70.3 35.2 31.8 52.4 33.3 23.3
Mis7b-VA 21.8 5.9 15.6 24.8 1.8 70.2 52.2 30.7 80.9 39.2 22.0
Ra
Lm7b-FT 27.8 13.9 27.2 26.1 7.8 73.0 48.9 35.4 62.3 34.3 29.0
FT Lm13b-FT 30.4 14.6 28.1 28.3 9.2 75.3 55.8 39.0 88.9 42.5 33.9
Mis7b-FT 28.6 13.1 24.8 24.1 5.3 72.3 53.7 33.6 86.8 40.0 38.0

Table 4: Traditional and LLM based evaluations on three different variations of OASUM dataset

5.2.4 How robust is the fine-tuned LLM for variations in dataset and domains for aspect-based
summarization?
To answer this question, we pick our best performing fine-tuned model from the results in the previous
section, Llama2-13b-FT, and evaluate it on variations of the dataset.

Different Types of OASUM Data: To check the effectiveness of our fine-tuned models, we experiment
on different types of OASUM data: OASUM-Hi, OASUM-Lo and OASUM-Ra, as shown in Table 4. By
employing multiple dataset variations, we aim to achieve a comprehensive evaluation of fine-tuned
LLMs for aspect-based summarization, taking into account various data characteristics and potential
shortcomings in existing summaries. As expected, evaluation outcomes are best for OASUM-Hi and
worst for OASUM-Lo, since the number of aspects in OASUM-Hi is much smaller than in OASUM-Lo.
OASUM-Ra exhibits better results than OASUM-Lo due to the presence of fewer aspects.
Llama2-13b-FT performs best in almost all scenarios across the different evaluation metrics.

Domain R1 R2 RL Mt Bl BeF1 Re Cv Im Ra Gd
Healthcare 32.8 17.3 29.3 27.4 9.4 77.8 69.9 47.0 97.4 57.0 42.5
Education 44.9 28.2 41.1 38.1 18.2 81.3 68.2 51.0 97.6 58.6 45.3
Life and Career 39.4 23.9 35.5 32.7 14.1 80.4 69.9 48.5 96.7 58.8 41.1
Music 41.9 27.6 38.6 37.7 20.4 81.0 66.2 47.0 94.9 56.6 40.3
Average 41.5 25.9 37.8 35.5 16.8 80.7 68.3 48.9 96.7 58.8 42.3

Table 5: Evaluations of fine-tuned Llama2-13b on different domains of OASUM

Evaluations for different Domains: In Table 5, we show the traditional metric and the five GPT-4
critique scores of the best performing fine-tuned Llama2-13b model on four different domains of the
OASUM data - Healthcare, Education, Life and Career, and Music. The results show consistent
performance of the fine-tuned Llama2-13b model across domains.

Different Evaluation Parameter Settings: We evaluate the outcomes of various models under different
parameter settings during GPT-4 critique - max-new-token and temperature. The best results are
obtained when the max-new-token size is 80 (as shown in Fig. 2) and the GPT-4 critique temperature is 0.0.
Varying Training Size Dataset: To understand the effect of training data size on performance, we vary
the OASUM domain-wise split training data for the Llama2-13b model - taking 10%, 40% and 70% of the
initial training data - and fine-tune the Llama2-13b model with the same parameter and hyper-parameter
settings; the outcomes for the five GPT-4 critique criteria (in %) are shown in Fig. 2. We see that
with increasing dataset size, the performance of Llama2-13b improves in terms of the different GPT-4
critique metrics: Relevance (Re), Coverage (Cv), Impurity (Im), Rating (Ra) and Goodness (Gd). Even
with 40% of the dataset, the model achieves decent performance.
Figure 2: GPT4 Criteria Performance comparison of Llama2-13b-FT model w.r.t. training data
variation (left) and max-new-token size (right)

This shows the effectiveness of the Llama2-13b model and indicates that, even with as little as 10% of
the data, Llama2-13b can generate appropriate aspect-based summaries.
Assessment of Llama2-13b Model: To further investigate the potency of the Llama2-13b-FT model, we
extract 50 different OASUM articles and provide the aspect-based summaries (ground truth vs.
Llama2-13b-FT) to two annotators (with domain knowledge and proficiency in English) to label each
pair as: (i) Llama2-13b-FT is better, (ii) both are good, (iii) the ground truth is better, or (iv) both
are bad. We found that in 20% of cases Llama2-13b-FT is better, in 50% both are good, in 24% the
ground truth is better, and in 6% both are bad. So, overall, Llama2-13b-FT provides good summaries
in 70% of cases.
These findings reinforce the claim of the superiority of the fine-tuning approach and the utility of
LLMs as an alternative evaluation mechanism. They also show that the approach is robust to variations
in the type, domain, and quantity of data for the given task.

6 Conclusion

In this paper, we addressed the ever-growing challenge of efficiently extracting key insights from
voluminous documents in the digital age. We explored the potential of fine-tuning large language
models (LLMs) to enhance the performance of the aspect-based summarization task. Our work centered
around fine-tuning open-source foundation LLMs, including Llama2, Gemma, Mistral, and Aya,
on aspect-based summarization datasets. We hypothesized that this approach would enable these
models to excel at identifying and extracting information relevant to user-specified aspects within a
document, ultimately leading to superior quality aspect-based summaries.
Through a comprehensive evaluation framework, we compared the performance of fine-tuned LLMs
against state-of-the-art aspect-based summarization methods and vanilla counterparts of the fine-tuned
LLMs, and demonstrated significant improvement in quality of generated summaries as a result of
fine-tuning. Our findings not only contribute towards the advancement of aspect-based summarization
techniques but also hold significant implications for the broader field of NLP. By demonstrating
the effectiveness of fine-tuning LLMs for targeted information extraction tasks like aspect-based
summarization, we open doors for further exploration and potential applications in various NLP
domains requiring focused information retrieval and summarization, ultimately empowering users to
navigate the ever-expanding sea of information with greater efficiency and precision.

Limitations

Our datasets are neither multilingual nor multimodal, which limits their comprehensiveness; we plan
to extend our work to capture aspects involving multimodal content, such as images or videos. In
addition, LLMs may face challenges in adapting to domain-specific jargon, resulting in less informative
summaries for aspects containing specialized terminology. We aim to address these limitations in
future work.

Ethics Statement
Our work does not reveal any personally sensitive information, and we use publicly available benchmark
datasets and models in different contexts.

References
Nadeem Akhtar, Nashez Zubair, Abhishek Kumar, and Tameem Ahmad. Aspect based sentiment
oriented summarization of hotel reviews. Procedia computer science, 115:563–571, 2017.
Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved
correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic
evaluation measures for machine translation and/or summarization, pp. 65–72, 2005.
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer.
arXiv preprint arXiv:2004.05150, 2020.
Isabel Cachola, Kyle Lo, Arman Cohan, and Daniel S Weld. Tldr: Extreme summarization of
scientific documents. arXiv preprint arXiv:2004.15011, 2020.
Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and
Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv
preprint arXiv:2308.07201, 2023.
Sumit Chopra, Michael Auli, and Alexander M Rush. Abstractive sentence summarization with
attentive recurrent neural networks. In Proceedings of the 2016 conference of the North American
chapter of the association for computational linguistics: human language technologies, pp. 93–98,
2016.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi
Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models.
arXiv preprint arXiv:2210.11416, 2022.
Maximin Coavoux, Hady Elsahar, and Matthias Gallé. Unsupervised aspect-based multi-document
abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization,
pp. 42–47, 2019.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning
of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin
Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained
language models. Nature Machine Intelligence, 5(3):220–235, 2023.
Wafaa S El-Kassas, Cherif R Salama, Ahmed A Rafea, and Hoda K Mohamed. Automatic text
summarization: A comprehensive survey. Expert systems with applications, 165:113679, 2021.
Günes Erkan and Dragomir R Radev. Lexrank: Graph-based lexical centrality as salience in text
summarization. Journal of artificial intelligence research, 22:457–479, 2004.
Lea Frermann and Alexandre Klementiev. Inducing document structure for aspect-based summariza-
tion. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
pp. 6263–6273, 2019.
Zihao Fu, Haoran Yang, Anthony Man-Cho So, Wai Lam, Lidong Bing, and Nigel Collier. On
the effectiveness of parameter-efficient fine-tuning. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 37, pp. 12799–12807, 2023.
Mahak Gambhir and Vishal Gupta. Recent automatic text summarization techniques: a survey.
Artificial Intelligence Review, 47(1):1–66, 2017.
Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, and Xiaojun Wan. Llm-based nlg evaluation: Current
status and challenges. arXiv preprint arXiv:2402.01383, 2024.

Mandy Guo, Joshua Ainslie, David C Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and
Yinfei Yang. Longt5: Efficient text-to-text transformer for long sequences. In Findings of the
Association for Computational Linguistics: NAACL 2022, pp. 724–736, 2022.
Hosein Hasanbeig, Hiteshi Sharma, Leo Betthauser, Felipe Vieira Frujeri, and Ida Momennejad.
Allure: Auditing and improving llm-based evaluation of text using iterative in-context-learning.
arXiv e-prints, pp. arXiv–2309, 2023.
Hiroaki Hayashi, Prashant Budania, Peng Wang, Chris Ackerson, Raj Neervannan, and Graham
Neubig. Wikiasp: A dataset for multi-domain aspect-based summarization. Transactions of the
Association for Computational Linguistics, 9:211–225, 2021.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al.
Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han.
Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022.
Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, and Leilani H Gilpin. Can
large language models explain themselves? a study of llm-generated self-explanations. arXiv
preprint arXiv:2310.11207, 2023.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot,
Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al.
Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
Florian Kunneman, Sander Wubben, Antal van den Bosch, and Emiel Krahmer. Aspect-based
summarization of pros and cons in unstructured product reviews. In COLING, pp. 2219–2229,
2018.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for
natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461,
2019.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy,
Veselin Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for
natural language generation, translation, and comprehension. In Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics, pp. 7871–7880, 2020.
Haoran Li, Peng Yuan, Song Xu, Youzheng Wu, Xiaodong He, and Bowen Zhou. Aspect-aware mul-
timodal summarization for chinese e-commerce products. In Proceedings of the AAAI conference
on artificial intelligence, volume 34, pp. 8188–8195, 2020.
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization
branches out, pp. 74–81, 2004.
Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng,
Feng Sun, and Qi Zhang. Calibrating llm-based evaluator. arXiv preprint arXiv:2309.13308, 2023.
Rada Mihalcea and Paul Tarau. Textrank: Bringing order into text. In Proceedings of the 2004
conference on empirical methods in natural language processing, pp. 404–411, 2004.
Rajdeep Mukherjee, Hari Chandana Peruri, Uppada Vishnu, Pawan Goyal, Sourangshu Bhattacharya,
and Niloy Ganguly. Read what you need: Controllable aspect-based opinion summarization of
tourist reviews. In Proceedings of the 43rd international ACM SIGIR conference on research and
development in information retrieval, pp. 1825–1828, 2020.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic
evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association
for Computational Linguistics, pp. 311–318, 2002.

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli,
Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb
dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv
preprint arXiv:2306.01116, 2023.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Abigail See, Peter J Liu, and Christopher D Manning. Get to the point: Summarization with
pointer-generator networks. arXiv preprint arXiv:1704.04368, 2017.
Shichao Sun, Junlong Li, Weizhe Yuan, Ruifeng Yuan, Wenjie Li, and Pengfei Liu. The critique of
critique. arXiv preprint arXiv:2401.04518, 2024.
Duyu Tang, Bing Qin, and Ting Liu. Aspect level sentiment classification with deep memory network.
arXiv preprint arXiv:1605.08900, 2016.
Oguzhan Tas and Farzad Kiyani. A survey automatic text summarization. PressAcademia Procedia,
5(1):205–213, 2007.
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak,
Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models
based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation
and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude,
Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, et al. Aya model: An instruction
finetuned open-access multilingual language model. arXiv preprint arXiv:2402.07827, 2024.
Karthik Valmeekam, Matthew Marquez, and Subbarao Kambhampati. Can large language models
really improve by self-critiquing their own plans? arXiv preprint arXiv:2310.08118, 2023.
David Wan and Mohit Bansal. Factpegasus: Factuality-aware pre-training and fine-tuning for
abstractive summarization. In Proceedings of the 2022 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, pp. 1010–1028,
2022.
Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. Attention-based lstm for aspect-level
sentiment classification. In Proceedings of the 2016 conference on empirical methods in natural
language processing, pp. 606–615, 2016.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama,
Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models.
arXiv preprint arXiv:2206.07682, 2022.
Haoran Yang, Yumeng Zhang, Jiaqi Xu, Hongyuan Lu, Pheng Ann Heng, and Wai Lam. Unveiling
the generalization power of fine-tuned large language models. arXiv preprint arXiv:2403.09162,
2024.
Xianjun Yang, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Xiaoman Pan, Linda Petzold,
and Dong Yu. Oasum: Large-scale open domain aspect-based summarization. arXiv preprint
arXiv:2212.09233, 2022.
Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen, and Wei Cheng. Exploring the limits of chatgpt
for query or aspect-based text summarization. arXiv preprint arXiv:2302.08081, 2023.
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. Pegasus: Pre-training with extracted
gap-sentences for abstractive summarization. In International Conference on Machine Learning,
pp. 11328–11339. PMLR, 2020.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating
text generation with bert. arXiv preprint arXiv:1904.09675, 2019.
Zheng Zhang, Chen Zheng, Da Tang, Ke Sun, Yukun Ma, Yingtong Bu, Xun Zhou, and Liang Zhao.
Balancing specialized and general skills in llms: The impact of modern tuning and data strategy.
arXiv preprint arXiv:2310.04945, 2023.
Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin,
Ji-Rong Wen, and Jiawei Han. Don’t make your llm an evaluation benchmark cheater. arXiv
preprint arXiv:2311.01964, 2023.

Appendix

A Prompts
We use prompting in two stages - fine-tuning/inference and critique. There are two kinds of prompts -
a system prompt and a user prompt.

A.1 Finetune and Inference prompt

system: You are an AI assistant who is to generate the summary of a textual document specific to a
certain aspect.
user prompt - Summarize the textual document given below from the perspective of aspect:
### Document: document
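
Below is a minimal sketch of how this inference prompt could be used with a fine-tuned checkpoint
via transformers. The checkpoint path and the trailing "Summary:" cue are illustrative placeholders,
and max_new_tokens=80 follows the setting reported as best in Section 5.2.4.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "llama2-13b-oasum-ft"  # placeholder path to a fine-tuned checkpoint
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(ckpt)

def summarize(document: str, aspect: str, max_new_tokens: int = 80) -> str:
    # Prompt follows the system/user templates above.
    prompt = (
        "You are an AI assistant who is to generate the summary of a textual document "
        "specific to a certain aspect.\n"
        f"Summarize the textual document given below from the perspective of the aspect {aspect}:\n"
        f"### Document: {document}\nSummary:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Strip the prompt tokens and return only the generated continuation.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)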

A.2 Critique

system: You are an AI assistant who is to evaluate the summary of a textual document specific to
a certain aspect. You need to return a score between 0 and 1 reflecting the quality of the generated
summary based on some criteria.
user: You are given a textual document and the corresponding summary of the document generated
from the perspective of an aspect {aspect} predicted by a language model as follows.
Document: {document}
Ground truth summary : {label summary}
Summary with respect to an aspect {aspect}: {model generated summary}
Evaluate the above aspect-based summary for the document in terms of each of the following criteria
and return only a score between 0 and 1 without any explanation:

• The extent to which the generated summary is relevant to a specific aspect {aspect} based
summary of the document.
• The extent to which the generated aspect-based summary correctly covers all the important
key points described in the aspect {aspect} based summary of the document.
• The extent to which the summary does not contain information specific to all other possible
aspects {aspect_set_in_a_domain - aspect} based summary.
• Rate the summary from the point of view of the aspect – whether the summary is good,
average, or bad. A good summary effectively captures the essential points, presenting them
clearly and concisely. It maintains accuracy, encourages reader engagement, and serves as a
compelling introduction to the content. An average summary conveys the main points but
may lack some clarity or detail, presenting a decent overview without standing out in terms
of conciseness or precision. It provides a basic understanding but might benefit from a more
refined focus. A bad summary fails to accurately convey the main points, containing inaccuracies
or misinterpretations. It is either overly verbose or lacks coherence, making it difficult for
the reader to grasp the core information effectively.
• Goodness of the summary from the point of view of the aspect [Good/Average/Bad], calculated
from criterion 4 with the help of manual annotation.
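
A minimal sketch of issuing this critique with the pinned openai==0.28 ChatCompletion API is given
below. The per-criterion call and the score parsing are illustrative assumptions about how the prompt
above could be scored, and the API key is a placeholder.

import openai  # openai==0.28 exposes the ChatCompletion API used here

openai.api_key = "YOUR_API_KEY"  # placeholder

SYSTEM = ("You are an AI assistant who is to evaluate the summary of a textual document specific "
          "to a certain aspect. You need to return a score between 0 and 1 reflecting the quality "
          "of the generated summary based on some criteria.")

def critique(document, aspect, gold_summary, generated_summary, criterion):
    user = (
        f"You are given a textual document and the corresponding summary of the document generated "
        f"from the perspective of an aspect {aspect} predicted by a language model as follows.\n"
        f"Document: {document}\n"
        f"Ground truth summary: {gold_summary}\n"
        f"Summary with respect to an aspect {aspect}: {generated_summary}\n"
        f"Evaluate the above aspect-based summary for the document in terms of the following "
        f"criterion and return only a score between 0 and 1 without any explanation:\n{criterion}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0.0,  # best results reported with temperature 0.0
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": user}],
    )
    return float(response["choices"][0]["message"]["content"].strip())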

B Time and GPU
We experiment on an 80GB A100 GPU with a GPU clock cycle of 210 MHz. The fine-tuning and inference
times of our fine-tuned models are shown in Table 6.

Model Finetune Time Inference Time


Llama2-7b 22 hrs 2 hrs 10 mins
Llama2-13b 44 hrs 3 hrs 10 mins
Mistral-7b 36 hrs 2 hrs 44 mins
Aya 38 hrs 2 hrs 40 mins
Gemma 50 hrs 4 hrs 40 mins

Table 6: Model fine-tuning and inference times [using an 80GB A100 GPU]

C Examples
An example of an OASUM aspect-based summary is shown in Fig. 3. An example of the human annotation
interface is shown in Fig. 4.

Figure 3: OASUM summary example snapshot

Figure 4: Original summary vs. Llama2-13b fine-tuned comparison experiment example snapshot

