
University of Luxembourg

Master Thesis

Towards Metamorphic Testing of Space
Software Using Large Language Models

Author: Attila Istvan Nemet
Supervisors: Fabrizio Pastore, Seung Yeob Shin

Reviewer:
Domenico Bianculli

A thesis submitted in fulfillment of the requirements
for the degree of Master of Sciences
in the study program

Master in Space Technologies and Business

research at

Software Verification and Validation (SVV)


Interdisciplinary Centre for Security, Reliability and Trust (SnT)

August 23, 2024



Declaration of Authorship

I, Attila Istvan Nemet, declare that this thesis titled, “Towards Metamorphic Testing of Space
Software Using Large Language Models” and the work presented in it are my own. I confirm
that:

• This work was done wholly or mainly while in candidature for a research degree at this
University.

• Where any part of this thesis has previously been submitted for a degree or any other
qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the
exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear
exactly what was done by others and what I have contributed myself.

Signed:

Date:

“When the winds of change blow, some people build walls and others build windmills.”

Chinese Proverb

UNIVERSITY OF LUXEMBOURG

Abstract
Faculty of Science, Technology and Medicine (FSTM)

Interdisciplinary Centre for Security, Reliability and Trust (SnT)

Master of Sciences

Towards Metamorphic Testing of Space Software Using Large Language Models

by Attila Istvan Nemet

This thesis investigates the potential of Large Language Models (LLMs) in enhancing software
testing practices for critical space system components. Specifically, it explores the capability
of LLMs to generate Metamorphic Relations (MRs) for the Mathematical Library for Flight
Software (MLFS) and the Packet Utilization Standard (PUS) framework. The study employs
OpenAI’s ChatGPT, utilizing the GPT-4 Turbo and GPT-4o models, to generate MRs based
on Software Requirements Specifications (SRS) and to extract relevant test information from
complex technical documents.

The research addresses four key questions: the ability of LLMs to generate MRs from SRS, the
qualitative differences between MRs generated from SRS versus prior knowledge, the impact
of document format on MR generation, and the LLMs’ capability to extract relevant test
information from large SRS documents.

Findings reveal that LLMs can indeed generate valid and relevant MRs for both MLFS and
PUS, with varying effectiveness depending on the system’s complexity and the model’s prior
knowledge. The study also demonstrates that document format can influence MR generation for
certain service types, and that LLMs show impressive capabilities in extracting test information
from large documents with low error rates.

This research contributes to the field by exploring innovative applications of AI in space
software testing, offering insights into the capabilities and limitations of LLMs in specialized
technical domains. It concludes by discussing practical considerations for implementing LLM-
assisted testing and suggesting future research directions, including broader scope studies, model
comparisons, and integration with existing testing processes.

Acknowledgements

I would like to express my heartfelt gratitude to my esteemed supervisors, Fabrizio Pastore
and Seung Yeob Shin, for their unwavering guidance, invaluable insights, and relentless support
throughout the entirety of this research journey. Your expertise, patience, and constructive
feedback have played a pivotal role in shaping the direction and quality of this dissertation.

I am deeply indebted to my thesis reviewer, Domenico Bianculli, for his invaluable feedback
and insightful comments. His thorough review and constructive criticism significantly improved
the quality of this thesis.

I owe my deepest gratitude to my wife, Crista, whose unwavering support, patience, and love
made this thesis possible. Her remarkable ability to balance our family life with the demands
of my scientific work was invaluable. Her encouragement during challenging times and her
celebration of every milestone kept me motivated throughout this journey. This accomplishment
is as much hers as it is mine.

Last but not least, I am thankful to our program administrator, Marielle MABILLE, who
made sure we never missed a deadline, and to all those who contributed in any way to this work,
whether through thought-provoking discussions, suggestions, or moral support. Your collective
contributions have played a crucial role in shaping the outcome of this dissertation.

In conclusion, this dissertation stands as a testament to the collaborative effort and support
of many individuals and entities. Your belief in me and dedication to my success have left an
indelible mark on this work, and for that, I am truly grateful.

Contents

Declaration of Authorship

Abstract

Acknowledgements

1 Introduction
  1.1 Background and Context
  1.2 Problem Statement
  1.3 Research Objectives and Questions
  1.4 Methodology Overview
  1.5 Significance of the Study
  1.6 Thesis Structure

2 Background and Related Work
  2.1 Metamorphic Testing
    2.1.1 Historical Background
    2.1.2 Systematizing the Identification of Metamorphic Relations
  2.2 Large Language Models
    2.2.1 Historical Background
    2.2.2 LLMs Limitations in Research

3 An Approach for the Automated Generation of MRs from LLMs
  3.1 Method Selection
    3.1.1 Tokenization in LLMs
    3.1.2 Selecting the Best Model
    3.1.3 Selecting the Interface
    3.1.4 Selecting the API Interface
  3.2 Setting LLM Hyperparameters
    3.2.1 Setting the Temperature in LLMs
    3.2.2 Text Completion with Default Temperature and Top P Settings
    3.2.3 Text Completion while Altering the Temperature and Top P Settings
  3.3 MR Generation Objective

4 Empirical Assessment
  4.1 Introduction
  4.2 Research Design
    4.2.1 Research Questions
    4.2.2 MLFS Approach
    4.2.3 PUS Approach
  4.3 Data Collection
    4.3.1 MLFS Study
    4.3.2 PUS Study
  4.4 Analysis and Results
    4.4.1 MLFS Study Results
      MLFS Document Usage
    4.4.2 PUS Study Results
      PUS Document Usage
    4.4.3 Assessing RQs
      RQ1: Can LLMs generate MRs from SRS?
      RQ2: Do MRs generated from SRS qualitatively differ from MRs generated based only on prior knowledge of LLMs?
      RQ3: Can documents with a different format lead to different MRs?
      RQ4: Can LLMs extract relevant test information from large SRS documents?
  4.5 Discussion

5 Conclusion
  5.1 Summary of Findings
  5.2 Reflections on LLM Evolution and Future Prospects
  5.3 Practical Considerations for Implementation
  5.4 Future Research Directions

A MLFS Tables

B PUS Tables

Bibliography

List of Figures

3.1 Example of the effect of temperature on the SoftMax probabilities
3.2 Text completion showing the token selection with probabilities
3.3 High probability tokens ,, consectetur, adipiscing and elit. selected at the beginning of text completion
3.4 Low probability tokens \n\n, It and was selected at the beginning of text completion
3.5 The most common output with the Temperature set to 0.0
3.6 The most common output with the Top P set to 0.0. Difference from the other alternative is in the 4th row — here token ipsum is selected
3.7 Another alternative output with the Top P set to 0.0. Difference from the previous output is in the 4th row — here token elit is selected
3.8 Temperature set to 2.0 can lead to UTF-8 decoding errors in older models

4.1 Initial test of generating MRs without specifying the output format
4.2 PUS Service Types slide from the ESA training material
4.3 ChatGPT rejects MR generation for the PUS framework when not providing a supporting document
4.4 PUS ST[03] Housekeeping — Message types slide from the ESA training material, displaying a comma between the service type number and the sub-service number (e.g., TC[3,5])
4.5 Excerpt from the MLFS PDF document: atan2 function requirement specification

5.1 Top models on LiveBench as of 6th August 2024
5.2 Top models on LMSYS as of 14th August 2024

List of Tables

4.1 MRs generated by ChatGPT for the atan2 function
4.2 MR statistics for the atan2 function
4.3 Results of TC/TM mapping of ST[03] Housekeeping, when all three documents (PDF, PPTX and JSON) are attached
4.4 Results of TC/TM mapping of ST[03] Housekeeping, when individually attaching the PDF, PPTX and JSON documents
4.5 Comparing the TC/TM mapping results of ST[03] Housekeeping for different document types
4.6 Results of TC/TM mapping of ST[17] Test
4.7 Results of TC/TM mapping of ST[01] Request Verification
4.8 Comparing the TC/TM mapping results of ST[01] Request Verification for different document types
4.9 Metrics for ST[03] Housekeeping MRs
4.10 Comparing the TC/TM mapping results of ST[03] Housekeeping, five times PDF and once PPTX attached
4.11 Collection of Jaccard and Proportion New indexes for ST[03] Housekeeping
4.12 Extracting v_pdf and v_pptx vectors for ST[03] Housekeeping
4.13 Aggregated results for the 7 functions studied in the MLFS framework
4.14 Aggregated results for studied STs with PDF, PPTX, and JSON documents attached
4.15 Aggregated results for Mann-Whitney U-test p-values and VDAs for the studied STs
4.16 Aggregated results of the TC/TM mapping mistakes for the studied STs

A.1 MRs generated by ChatGPT for the asin function
A.2 MR statistics for the asin function
A.3 MRs generated by ChatGPT for the exp function
A.4 MR statistics for the exp function
A.5 MRs generated by ChatGPT for the fmin function
A.6 MR statistics for the fmin function
A.7 MRs generated by ChatGPT for the fmod function
A.8 MR statistics for the fmod function
A.9 MRs generated by ChatGPT for the hypot function
A.10 MR statistics for the hypot function
A.11 MRs generated by ChatGPT for the pow function
A.12 MR statistics for the pow function

B.1 Results of TC/TM mapping of ST[22] Position-Based Scheduling, when all three documents (PDF, PPTX and JSON) are attached and when only the PDF is attached
B.2 Results of TC/TM mapping of ST[22] Position-Based Scheduling, when the PPTX and JSON documents are attached
B.3 Comparing the TC/TM mapping results of ST[22] Position-Based Scheduling for different document types
B.4 Results of TC/TM mapping of ST[23] File Management, when all three documents (PDF, PPTX and JSON) are attached and when only the PDF is attached
B.5 Results of TC/TM mapping of ST[23] File Management, when the PPTX and JSON documents are attached
B.6 Comparing the TC/TM mapping results of ST[23] File Management for different document types
B.7 Results of MR generation for ST[03] Housekeeping when all three documents are attached
B.8 Results of MR generation for ST[01] Request Verification when all three documents are attached and when only the PDF document is attached
B.9 Results of MR generation for ST[01] Request Verification when the PPTX and the JSON documents are attached
B.10 Results of TC/TM mapping ST[01] Request Verification for 5xPDF vs 1xPPTX experiments
B.11 Comparing results of TC/TM mapping ST[01] Request Verification for 5xPDF vs 1xPPTX experiments

List of Abbreviations

AI Artificial Intelligence
AOCS Attitude and Orbit Control Subsystem
BPE Byte Pair Encoding
CSV Comma Separated Values
DOORS Dynamic Object Oriented Requirements System
ECSS European Cooperation for Space Standardization
ESA European Space Agency
GDPR General Data Protection Regulation
GSM8K Grade School Math 8K
GPT Generative Pre-trained Transformer
JSON JavaScript Object Notation
HK HouseKeeping
LLM Large Language Model
METRIC METamorphic Relation Identification based on the Category
MLFS Mathematical Library for Flight Software
MR Metamorphic Relation
MT Metamorphic Test
NLP Natural Language Processing
NYU New York University
PPTX PowerPoint Open XML
PDF Portable Document Format
PUS Packet Utilization Standard
RAG Retrieval Augmented Generation
SRS Software Requirements Specifications
ST ServiceType
STaR Self-Taught Reasoner
TC TeleCommand
TM TeleMetry
UMD University of Maryland
USC University of Southern California
VDA Vargha and Delaney’s A statistic

Dedicated to my daughter Eola, who is teaching me how to study.



Chapter 1

Introduction

1.1 Background and Context


In the rapidly evolving landscape of space exploration and satellite technology, the reliability
and functionality of software systems play a crucial role. As space missions become increasingly
complex, the need for robust and efficient software testing methodologies has never been more
critical. Traditional testing methods, while valuable, often struggle to keep pace with the
intricacies of modern space system software.

Large Language Models (LLMs) have emerged as powerful tools in various domains, showcasing
remarkable capabilities in natural language processing, code generation, and problem-solving.
This thesis explores the potential of leveraging LLMs, specifically in the context of generating
Metamorphic Relations (MRs) for software testing in space systems.

1.2 Problem Statement


Software testing in the space domain presents unique challenges due to the critical nature
of space missions, the complexity of space system software, and the difficulty in creating
comprehensive test cases. The test input space is large, which implies that several inputs need
to be tested for every functionality under test, but for each input it is necessary to determine
the expected output. Although automated input generation techniques exist, the problem of
automating output verification remains largely open. Metamorphic Testing, a technique that
focuses on identifying relations between multiple executions of a program, has shown promise
in addressing some of these challenges. However, the manual specification of Metamorphic
Relations can be time-consuming and may not cover all possible scenarios.

This research investigates whether LLMs can effectively assist in generating MRs for space
system software components, potentially enhancing the efficiency and comprehensiveness of the
testing process.

1.3 Research Objectives and Questions


The primary objective of this study is to assess the capability of LLMs in generating MRs
for critical space software components. Specifically, this research focuses on two key software
packages: the Mathematical Library for Flight Software (MLFS) and the Packet Utilization
Standard (PUS) framework.

To guide this investigation, the following research questions have been formulated:

1. Can LLMs generate MRs from Software Requirements Specifications (SRS)?

2. Do MRs generated from SRS qualitatively differ from MRs generated based only on prior
knowledge of LLMs?

3. Can documents with a different format lead to different MRs?

4. Can LLMs extract relevant test information from large SRS documents?

1.4 Methodology Overview


This research employs an empirical assessment approach, utilizing OpenAI’s ChatGPT, specifi-
cally the GPT-4 Turbo model for MLFS experiments and the GPT-4o model for PUS experiments.
The study involves:

• Generating MRs using LLMs with and without access to SRS documents

• Comparing MRs generated from different document formats (PDF, PPTX, JSON)

• Evaluating the LLM’s ability to extract and map Telecommands (TCs) and Telemetry
(TMs) from SRS documents

• Analyzing the quality, relevance, and uniqueness of generated MRs

1.5 Significance of the Study


This research contributes to the field of software testing for space systems by exploring innovative
applications of AI technology. By investigating the potential of LLMs in generating MRs, this
study aims to:

• Enhance the efficiency of test case generation for space system software

• Improve the coverage and effectiveness of software testing in the space domain

• Provide insights into the capabilities and limitations of LLMs in specialized technical
domains

• Offer practical considerations for implementing LLM-assisted testing in real-world scenarios



1.6 Thesis Structure


The remainder of this thesis is organized as follows:

• Chapter 2: Background and Related Work — Provides an overview of relevant research
in software testing for space systems, Metamorphic Testing, and the application of LLMs
in software engineering.

• Chapter 3: An Approach for the Automated Generation of MRs from LLMs — Details
the research design, data collection methods, and analysis techniques employed in this
study.

• Chapter 4: Empirical Assessment — Presents the results of the experiments, including
the analysis of MRs generated for MLFS and PUS components, interprets the findings,
and discusses their implications.

• Chapter 5: Conclusion — Summarizes the key findings, acknowledges limitations, and
suggests directions for future research.

Through this structure, the thesis aims to provide a comprehensive exploration of the potential
of LLMs in enhancing software testing practices for critical space system components.

Chapter 2

Background and Related Work

2.1 Metamorphic Testing


Metamorphic Testing (MT) is a software testing technique used to address the “oracle problem”
in software testing. The oracle problem refers to the difficulty in determining the expected
output for a given input, manifesting especially in complex systems.

The basic principle of MT is that instead of verifying the correctness of individual outputs, it
focuses on the relations between inputs and outputs of multiple executions.
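
For example, a classic MR for the sine function states that sin(x) = sin(π − x) for any x: we may not know the exact value of sin(1.234), but we can check that two executions satisfy the relation. The following minimal sketch (our own illustrative example, not taken from the cited works) shows how such a relation is checked:

    import math

    # Classic metamorphic relation for sine: sin(x) = sin(pi - x).
    # Instead of computing the exact expected output for each input
    # (the oracle problem), we check a relation between two executions.
    def holds_sine_mr(x: float, tol: float = 1e-12) -> bool:
        return abs(math.sin(x) - math.sin(math.pi - x)) <= tol

    assert all(holds_sine_mr(x / 7.0) for x in range(100))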

2.1.1 Historical Background


Metamorphic Testing (MT) was first proposed by T.Y. Chen in 1998 as a novel approach to
address the oracle problem in software testing in his seminal work “Metamorphic testing: A
new approach for generating next test cases” (Chen, Cheung, and Yiu, 1998).

In the early 2000s, researchers began to explore the theoretical foundations and practical
applications of MT. Gotlieb and Botella’s work was one of the first to explore the automation
of MT, an important step in making the technique more practical for real-world applications
(Gotlieb and Botella, 2003). They present a method for automating metamorphic testing,
addressing the challenge of verifying software without a defined oracle. The authors introduce a
framework that leverages constraint logic programming to automatically generate test cases
based on identified metamorphic relations. These relations allow the detection of software faults
by comparing outputs from multiple test executions without requiring explicit expected results.

In recent years, MT has been applied to emerging fields and has seen increased industrial
adoption. MT has been applied in the critical field of cybersecurity, demonstrating its relevance
to modern software challenges (Chen et al., 2016).

Segura et al. provided a comprehensive survey of MT, summarizing its development, applications,
and challenges up to 2016 (Segura et al., 2016) which has been followed by another review
paper by Chen et al. offering an up-to-date perspective on MT, discussing current challenges
and future opportunities in the field (Chen et al., 2018). Some of the key challenges highlighted
in these surveys include the automated generation of effective MRs, the application of MT to
complex systems, and the integration of MT with other testing techniques. This thesis addresses

these challenges by investigating the potential of LLMs in generating MRs for critical space
system components, such as the MLFS library and the PUS framework.

The evolution of MT reflects the changing landscape of software development, with recent trends
focusing on its application to AI, machine learning, and complex systems where traditional
testing methods fall short (Dwarakanath et al., 2018).

The paper “Software Testing with Large Language Models: Survey, Landscape, and Vision”
analyzes the emerging role of LLMs in software testing (Wang et al., 2024). It systematically
explores various areas where LLMs can be impactful, including test generation, test oracle
creation, bug detection, and program repair. It introduces novel paradigms for integrating
LLMs into software testing workflows, particularly the use of prompt engineering and data
augmentation techniques to enhance LLM outputs. Additionally, it identifies the limitations of
current approaches, especially in handling complex, stateful, and context-dependent software
systems.

2.1.2 Systematizing the Identification of Metamorphic Relations


Identifying effective Metamorphic Relations (MRs) in MT is crucial for the success of the
technique. MRs are the relations between inputs and outputs that are used to guide the testing
process. The identification of MRs is a challenging task that requires domain knowledge and
creativity. A systematic approach to identify MRs is presented by Chen et al., introducing
METRIC (METamorphic Relation Identification based on the Category-choice framework)
(Chen, Poon, and Xie, 2016). The methodology utilizes the category-choice framework to model
the input domain, applies a set of “relation elements” to derive potential metamorphic relations,
and provides guidelines for combining these elements to form complete MRs.

The METRIC approach considers only the input domain when identifying MRs, which may
miss important relations involving output characteristics. This limitation has been addressed
by Chang-ai Sun et al. with the METRIC+ methodology, an enhanced technique that considers
both input and output domains for identifying MRs (Sun et al., 2019). METRIC+ extends the
original METRIC approach by incorporating the output characteristics into the MR identification
process, providing a more comprehensive set of MRs, thus improving the overall effectiveness of
MT.

Luu et al. explore ChatGPT’s potential in advancing MT (Luu, Liu, and Chen, 2023). The
authors tested ChatGPT’s ability to generate MRs across nine diverse systems. While ChatGPT
was able to propose correct MRs for known and new systems, the majority of MRs were
incorrect or vaguely defined. The paper highlights that human intervention is crucial for refining
ChatGPT-generated MRs, but that ChatGPT shows promise in aiding the discovery of new
software testing insights, particularly for systems that had not previously been tested with MT.

The paper “Metamorphic Relation Generation: State of the Art and Visions for Future
Research” reviews the techniques for systematically generating MRs (Li et al., 2024). It
categorizes approaches into composition, artificial intelligence (AI)-based techniques, MR
patterns, category-choice frameworks, genetic approaches, and search-based methods. The

paper identifies trends such as AI-driven MR generation and highlights challenges like limited
automation in MR generation. The work proposes further research on enhancing MR discovery
and application across diverse domains and software types.

A cutting-edge approach in the field of metamorphic testing, combining traditional software
testing techniques with the latest advancements in AI (Artificial Intelligence) and NLP (Natural
Language Processing), is presented by S.Y. Shin, F. Pastore, D. Bianculli, and A. Baicoianu
in their work “Towards Generating Executable Metamorphic Relations Using Large Language
Models” (Shin et al., 2024). The authors propose a novel approach to automatically generate
MRs using LLMs (Large Language Models) such as GPT-3 and GPT-4 to understand software
specifications and generate potential MRs. A framework is used for translating LLM-generated
natural language MRs into executable MRs. This research opens up new possibilities for
integrating AI-driven techniques into software testing practices, potentially revolutionizing how
metamorphic testing is applied in real-world scenarios.

2.2 Large Language Models


LLMs1 (Large Language Models) are advanced artificial intelligence systems designed to
understand, generate, and manipulate human-like text. These models are trained on vast
amounts of textual data, allowing them to capture complex patterns in language and develop a
broad understanding of various topics.

The capabilities of LLMs extend far beyond simple text completion. They can engage in
human-like conversations, answer questions, summarize long documents, translate languages,
and even assist with creative writing and coding tasks. As these models continue to evolve, they
are finding applications in diverse fields such as customer service, content creation, scientific
research, education, or software testing – as presented in the current research.

2.2.1 Historical Background


A truly groundbreaking research in the field of NLP (Natural Language Processing) and LLM
was presented by Bengio et al. in their work “A Neural Probabilistic Language Model” (Bengio
et al., 2003). This paper introduced several concepts that are now fundamental to modern
LLMs:

• it proposed neural networks for language modelling, which was a significant departure
from previous statistical methods

• it introduced the concept of word embeddings, which represent words as dense vectors
in a continuous space, allowing the possibility to capture semantic relationships between
words

• it demonstrated the power of jointly learning word representations and language models,
a principle that underlies many modern LLMs
1 A.Karpathy - Intro to LLMs: https://youtu.be/zjkBMFhNj_g

In 2015, Bahdanau et al. introduced the attention mechanism in their work “Neural Machine
Translation by Jointly Learning to Align and Translate”, which allowed models to focus on
relevant parts of the input when generating each part of the output (Bahdanau, Cho, and
Bengio, 2015). Building on this concept, Vaswani et al. proposed the Transformer architecture in
their 2017 paper “Attention is All You Need”, which revolutionized the field of NLP by enabling
parallel processing of input sequences and capturing long-range dependencies more effectively
(Vaswani et al., 2017). The Transformer architecture’s efficiency in handling large-scale data
paved the way for the development of LLMs such as the GPT series.

The GPT-2 and GPT-3 models introduced by OpenAI marked pivotal moments in the develop-
ment of LLMs, dramatically expanding their capabilities and potential applications. The GPT-2
model, a 1.5 billion parameter Transformer, presented by Radford et al. in 2019, demonstrated
the ability to generate coherent and contextually relevant text across a wide range of topics
(Radford et al., 2019). Building on this foundation, GPT-3, introduced by Brown et al. in
2020, represented a quantum leap in scale and capability (Brown et al., 2020). With 175 billion
parameters, GPT-3 exhibited remarkable few-shot learning abilities, allowing it to perform
new tasks with minimal task-specific examples. This breakthrough challenged the traditional
paradigm of fine-tuning models for specific tasks and introduced the concept of “in-context
learning”. GPT-3’s ability to generate human-like text, solve problems, and even write code
based solely on natural language prompts opened up new possibilities for AI applications across
various domains.

The paper “Training language models to follow instructions with human feedback” introduces
InstructGPT, a version of GPT-3 fine-tuned to better follow user instructions (Ouyang et
al., 2022). By using human feedback via a three-step process (supervised learning, reward
model training, and reinforcement learning), the authors successfully align the model with user
intentions, improving its ability to generate helpful, honest, and safe responses. Despite having
far fewer parameters, InstructGPT (1.3B parameters) outperforms GPT-3 (175B parameters)
in multiple areas such as truthfulness and reduced toxicity, while showing promise in handling
broader tasks.

ChatGPT, designed to engage in human-like conversations and based on the GPT-3 architecture,
was launched by OpenAI in 2022, bringing conversational AI to the mainstream. ChatGPT’s
capabilities are further advanced by the GPT-4 model and by multimodal capabilities that
can process text, images and voice, enabling more interactive and engaging conversations.

Zelikman et al. developed a novel method to improve the reasoning capabilities of LLMs, presented
in the paper “Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking”
(Zelikman et al., 2024). They improved the reasoning of the language model by
generating rationales before predicting the next tokens. This technique, called Quiet-STaR
(Quiet Self-Taught Reasoner), allows models to “think” about the context during text generation,
enhancing predictions for difficult-to-predict tokens and leading to notable improvements in
performance on benchmarks like GSM8K (Grade School Math 8K) and CommonsenseQA without

task-specific fine-tuning. This self-supervised approach offers scalable reasoning advancements
for LMs, improving zero-shot performance and natural text understanding.

The AI Scientist developed by Lu et al. is described as a system that functions without human
intervention across multiple phases of research (Lu et al., 2024). Starting with an initial
idea, the AI Scientist performs background literature searches, iteratively refines hypotheses,
implements and runs experiments, analyzes the results, and generates a complete manuscript.
The framework also includes a self-reflective mechanism that allows it to continuously improve
its outputs through a feedback loop, mimicking the iterative nature of human scientific discovery.

The AI Scientist includes an automated paper-reviewing process based on LLMs, which mimics
the peer review system of academic conferences. The reviewing agent evaluates papers by
assigning numerical scores across various metrics (soundness, contribution, clarity, etc.) and
provides feedback on areas for improvement. Remarkably, the AI reviewer achieves near-human
performance, with accuracy scores comparable to human reviewers on datasets from the ICLR
2022 conference.

2.2.2 LLMs Limitations in Research


LLMs have demonstrated remarkable capabilities, but they also face significant limitations that
researchers are actively studying. A primary concern is their tendency to generate false or
inconsistent information, often referred to as “hallucinations” (Maynez et al., 2020). These errors
can arise from the models’ inability to distinguish between factual knowledge and spurious
correlations learned during training.

The “black-box” nature of these models poses challenges for interpretability and accountability
(Danilevsky et al., 2020), making it difficult to understand or predict their decision-making
processes. LLMs still lack true understanding of language and context, leading to limitations
in handling complex, multi-step reasoning tasks (Ahn et al., 2024). These limitations are
particularly relevant in safety-critical applications where the consequences of errors can be
severe.

An important aspect of LLMs is their non-deterministic nature, which has significant implications
for their use in critical applications or in practical coding scenarios. This issue has been studied
by Ouyang et al. in their work “LLM is Like a Box of Chocolates: the Non-determinism of
ChatGPT in Code Generation” (Ouyang et al., 2023). The paper explores the non-deterministic
nature of ChatGPT in code generation tasks, investigating the variability in generated code
outputs across different runs with identical prompts.

The empirical study was conducted across 829 code generation problems from three widely-
studied benchmarks: CodeContests, APPS, and HumanEval. The study reveals high levels of
non-determinism in ChatGPT’s code generation: more than 60% of the tasks had no identical
outputs across multiple requests. A common practice to control non-determinism is to set the
temperature (a hyperparameter that influences randomness in predictions) to zero. However,
the authors show that this approach does not completely eliminate non-determinism. The paper
highlights that only 22.4% of the reviewed literature in LLM-based code generation considers

non-determinism in their experiments. These results indicate a threat to the scientific
conclusions of such research and to the practical use of LLMs in software development.

Chapter 3

An Approach for the Automated
Generation of MRs from LLMs

In the area of software testing for space systems, the generation of effective test cases is important
to ensure the reliability and functionality of critical software components.

This chapter presents an approach of generating MRs for testing space system software com-
ponents. Metamorphic Testing has emerged as a powerful technique to address the oracle
problem in software testing, particularly in domains where determining the expected output for
a given input is challenging. We aim to automate and enhance the process of identifying and
formulating MRs, potentially revolutionizing the testing of complex space software systems.

Our method employs LLMs for the automated generation of MRs. Given the rapid advancements
in LLM technology, this thesis provides a snapshot of the current state-of-the-art capabilities.
Our approach focuses on recording the best results achievable at the present level of advancement,
forming a baseline for what can be expected in the future of automated software testing in the
space industry.

We are going to present an exploration of our approach for MR generation. We will discuss the
selection of appropriate LLM models, the design of effective prompts, and the evaluation of
generated MRs.

This chapter proceeds as follows: section 3.1 presents the Method Selection, detailing
tokenization in LLMs and the criteria and process for choosing the most suitable LLM and interface
for our experiments. Section 3.2 examines the effects of the hyperparameters Temperature
and Top P in LLMs by conducting experiments on OpenAI’s Text Completion API, providing
insights into the inner workings of LLMs. In section 3.3, we present the objective of our MR
generation approach.

3.1 Method Selection


3.1.1 Tokenization in LLMs
To effectively utilize Large Language Models (LLMs), it is crucial to comprehend the process of
tokenization1. Early LLMs, trained primarily on English-language texts, were relatively
small and employed character- or word-based prediction methods. In contrast, contemporary
LLMs are trained on datasets comparable in size to the entire internet, encompassing multiple
international languages.

Given the vast and ever-expanding nature of language, word-based prediction is no longer
feasible for modern LLMs. The potential vocabulary is practically infinite when accounting for
misspellings, proper nouns, neologisms, and acronyms. While character-based prediction using
Unicode code-points might seem a viable alternative, it presents its own set of challenges.

The Unicode encoding standard assigns a unique numerical identifier to each character across
various languages, with the latest version containing nearly 150k characters. However, training
and predicting directly on Unicode code-points would result in a highly imbalanced dataset,
with some characters occurring frequently and others rarely. This imbalance would not only
slow down the training process but also impede the model’s ability to effectively learn rare
characters. Furthermore, utilizing character-based method would significantly limit the context
length that the model could process effectively.

The enumerated issues have been addressed by Sennrich et al. in their paper titled “Neural
Machine Translation of Rare Words with Subword Units” (Sennrich, Haddow, and Birch, 2016).
They proposed using subword units based on the byte pair encoding (BPE) algorithm. BPE
iteratively merges the most frequent pairs of characters or character sequences, creating a
vocabulary of subword units – called tokens in LLM terminology. This method allows the model
to represent rare and unknown words as a sequence of subword units, thereby enhancing the
model’s ability to generalize to unseen words and balance the dataset.

While the underlying algorithms might be similar, each major LLM or model family often has
its own specific tokenizer implementation. For a given model, the tokenization needs to be
consistent between pre-training, fine-tuning and inference. The tokenization process is typically
handled by the model’s tokenizer class, which is responsible for converting raw text into a
sequence of tokens that the model can process. The tokenizer also handles the conversion of
tokens back into human-readable text, making it an essential component of the LLM.
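
To make this concrete, the following minimal sketch uses OpenAI’s open-source tiktoken library to tokenize a short string with the encoding used by the GPT-4 model family (the sample sentence is illustrative):

    import tiktoken

    # Load the BPE tokenizer used by the GPT-4 model family (cl100k_base).
    enc = tiktoken.encoding_for_model("gpt-4-turbo")

    text = "Towards Metamorphic Testing of Space Software"
    token_ids = enc.encode(text)                   # text -> integer token ids
    pieces = [enc.decode([t]) for t in token_ids]  # each id -> its subword piece

    print(len(token_ids), pieces)
    assert enc.decode(token_ids) == text           # the round-trip is lossless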

3.1.2 Selecting the Best Model


As of spring/summer of 2024 several LLMs stand out in different aspects of performance and
use cases. The most prominent models are GPT-4, developed by OpenAI, Claude-3, created
by Anthropic, and the Gemini series by Google. Choosing the best LLM depends on various
factors, including use case, resources, and requirements.
1 A.Karpathy - Building a Tokenizer: https://youtu.be/zduSFxRajkE

The LMSYS Chatbot Arena2 is a platform for evaluating and comparing different LLMs through
human preference judgments. Created and developed by the LMSYS Organization (with contributions
from UC Berkeley, CMU, and UCSD), it offers an open platform allowing anyone to interact with
and compare various LLMs (Chiang et al., 2024). It uses a tournament-style evaluation where
users engage with two anonymous chatbots simultaneously and vote on which one they prefer.
The platform uses an Elo-based rating system to rank the models based on user preferences.
2 LMSYS Benchmarking: https://chat.lmsys.org; LMSYS Leaderboard: https://chat.lmsys.org/?leaderboard

At the beginning of the experimental phase of this thesis (March–June 2024), GPT-4-Turbo
was the top-ranked model on the LMSYS Chatbot Arena, and the MLFS experiments were
performed using GPT-4-Turbo. At the end of April, however, GPT-4o became the top-ranked
model, and the PUS experiments were performed using GPT-4o.

Model selection cannot be based solely on the LMSYS Chatbot Arena rankings. The Elo
rating system is calibrated in such a way that a difference of 400 points
between two models corresponds to winning odds of 10:1. The expected winning
probability of a model is calculated with the formula:

E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}    (3.1)

where E_A is the expected winning probability of model A, and R_A and R_B are the Arena ratings of
model A and model B, respectively. Currently the top 5 models are within a difference of ~30
points, which means that the expected winning probability of the top model against the fifth
model is merely 54.3%. This is a modest difference, and to make a more informed decision, one
has to check the performance of the models for the specific use case.
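
Equation (3.1) is straightforward to verify; a minimal sketch:

    def expected_win_probability(r_a: float, r_b: float) -> float:
        """Expected winning probability of model A over model B (equation 3.1)."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    print(expected_win_probability(1300, 1270))  # ~0.543 for a 30-point gap
    print(expected_win_probability(1400, 1000))  # ~0.909, i.e. 10:1 odds at 400 points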

Another important criterion in model selection was availability in Europe. At the beginning
of our experiments, among the top models only those from OpenAI were available in
Europe without restriction. Anthropic’s Claude-3 model could be reached only via its
API, while Google’s Gemini models were not available in Europe at all because of GDPR
(General Data Protection Regulation) concerns. Since we did not know when the restrictions
would be lifted, we had a strong incentive to use OpenAI’s models for our experiments.

3.1.3 Selecting the Interface


When deciding between using ChatGPT’s web interface3 or its API4, we considered several
factors:

• Analysis of ChatGPT’s Web Interface:

– Advantages

∗ Immediate access, no need to set up an API



∗ Ease of use, attaching document with drag and drop

∗ Conversation history, ability to name and save conversations

– Disadvantages

∗ Limited customization

∗ Manual input required, not suitable for automating tasks

∗ No easy integration with other tools, response copy/paste required

• Analysis of OpenAI’s API Interface:

– Advantages

∗ Offers extensive customization options

∗ Can be easily automated and integrated into various applications

∗ Control over parameters such as temperature, maximum tokens, and top P

– Disadvantages

∗ Requires initial setup and ongoing maintenance

∗ Usage is metered and charged on the number of tokens processed

3 Web interface: https://chatgpt.com
4 API interface: https://platform.openai.com

Considering all the factors, we decided to use the API interface for our experiments. The API
interface provided flexibility and control over parameters like temperature or top_p , and
allowed us to generate responses in CSV format, making it easier to automatically save the
results into tables for further analysis.

3.1.4 Selecting the API Interface


OpenAI’s API interface offers three options for interacting with the models:

• Chat API 5 – an alternative to the ChatGPT web interface

• Assistants API 6 – a specialized API for building conversational agents

• Text Completion API 7 – a legacy API for generating text completions

5 Chat API: https://platform.openai.com/playground/chat
6 Assistants API: https://platform.openai.com/playground/assistants
7 Text Completions API: https://platform.openai.com/playground/complete

These three APIs represent the LLM evolution from past (text completion) to present (chat)
and future (agents, called assistants in OpenAI terminology).

The Text Completion API is a legacy model capable only of predicting the next word or
sequence of words given a prompt. We could use it for MR generation only with very
sophisticated prompt engineering, which would be cumbersome and error-prone.

The Chat API is better suited for our use case, but unfortunately currently lacks the ability to
attach documents. Without this feature, we could still extract the text from

the documents and include it in the prompt. However, this would require additional processing:
we would need to ensure that the text and the tables are properly formatted and that the
prompt does not exceed the context length. For the MLFS experiments we relied mainly on the
Software Requirements Specification – Basic Mathematical Library for Flight Software, E1356 -
MLFS PDF document. This is a 66-page document, and a quick check on OpenAI’s tokenizer8
shows that it converts to 38,200 tokens. The PUS experiments relied mainly on the
ECSS-E-ST-70-41C(15April2016).pdf9 document. This is a 656-page document, which OpenAI’s
tokenizer converts to 274,500 tokens. The maximum context length for the gpt-4-turbo
and gpt-4o models is 128k tokens, which means that we could proceed with the MLFS
experiments using the Chat API, but for the PUS experiments we would need our own
implementation of RAG (Retrieval-Augmented Generation). The RAG technique consists of
retrieving the relevant pieces of information from a large dataset or document collection based
on the input query. The retrieved information is then combined into the input prompt, and
the augmented query is passed to the LLM to produce a coherent and contextually accurate
response.
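
Conceptually, a minimal sketch of such a pipeline looks as follows (the embed callable stands for an arbitrary embedding model and is a hypothetical placeholder; this illustrates the general technique, not the internals of OpenAI’s file_search tool, which are not disclosed):

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Similarity between two embedding vectors."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(query: str, chunks: list[str], embed, k: int = 5) -> list[str]:
        """Rank document chunks by similarity to the query and keep the top k."""
        q_vec = embed(query)
        ranked = sorted(chunks, key=lambda c: cosine_similarity(embed(c), q_vec),
                        reverse=True)
        return ranked[:k]

    def build_augmented_prompt(query: str, chunks: list[str], embed) -> str:
        """Combine the retrieved chunks with the original query (the augmentation step)."""
        context = "\n\n".join(retrieve(query, chunks, embed))
        return f"Context:\n{context}\n\nQuestion: {query}"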

Another aspect to be considered is the cost of API usage. The gpt-4-turbo model,
available at the start of our experiments, costs $10.00 per 1M input tokens. At 38,200 tokens
for the attached document alone, a single MLFS experiment query on the Chat API would cost
more than $0.38. We should also take into account that we would need to run multiple queries
to generate MRs, and every follow-up chat message would cost more than the previous one,
because each subsequent query has to include the complete chat history together with the
document content.
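
A quick back-of-the-envelope check of that figure, using the pricing quoted above:

    input_tokens = 38_200    # MLFS SRS document, as measured with OpenAI's tokenizer
    usd_per_million = 10.00  # gpt-4-turbo input price at the time of the experiments

    print(input_tokens / 1_000_000 * usd_per_million)  # 0.382 -> more than $0.38 per query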

The agents represent another dimension in LLM evolution. Agents trade time and
complexity for improved accuracy: they perform task planning, partition complex requests
into steps, can use tools, and can be equipped with more advanced reasoning and problem-
solving capabilities. OpenAI’s agent, called an assistant, is continuously evolving, with the more
advanced versions released in beta. The API client is open source:10 its functionality
includes authentication, managing API calls, and handling responses from OpenAI’s models, but
the client library does not reveal the internal workings of the assistants.

Currently the OpenAI assistants can be defined with the following tools:

• File search – empowers assistants with knowledge from uploaded documents. The assistant
decides automatically when and what content to use from the documents.

• Code interpreter – executes and interprets code, helping with tasks like calculations, data
analysis and plot generation.

• Functions – provides access to a library of pre-defined functions that can be called to
perform specific tasks.

Additionally, we can engage the browser tool or the image generation tool from the assistant by
an explicit prompt request.
8 OpenAI’s tokenizer: https://platform.openai.com/tokenizer
9 PUS revC: https://ecss.nl/get_attachment.php?file=2016/06/ECSS-E-ST-70-41C15April2016.pdf
10 OpenAI’s Python client: https://github.com/openai/openai-python

Since our main mission was to examine the cutting-edge capabilities achievable with the current
LLMs, we decided to use the Assistants API with the File search tool enabled for both the MLFS
and PUS experiments.

When we started our MLFS experiments, the assistants were in version v1, and when we
continued with the PUS experiments the v2 assistants became available. The v2 assistants
have access to the more advanced gpt-4o model, which is quicker and less expensive than
the gpt-4-turbo model: the gpt-4o model costs $5.00 per 1M input tokens. OpenAI offers
a guide11 for the v1–v2 migration. The v1 retrieval tool has been renamed to the
file_search tool in v2, and the tool_resources for the file_search tool has a new object
called vector_stores .
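
For illustration, the following is a minimal sketch of this setup with the v2 Python client as available at the time of writing (the instructions string, prompt, and vector store name are placeholders, not our exact experimental prompts):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Upload the SRS document into a vector store for the file_search tool.
    store = client.beta.vector_stores.create(name="PUS SRS")
    with open("ECSS-E-ST-70-41C(15April2016).pdf", "rb") as f:
        client.beta.vector_stores.files.upload_and_poll(vector_store_id=store.id, file=f)

    assistant = client.beta.assistants.create(
        model="gpt-4o",
        temperature=0.0,  # see section 3.2.1
        instructions="You are an expert in metamorphic testing of space software.",
        tools=[{"type": "file_search"}],
        tool_resources={"file_search": {"vector_store_ids": [store.id]}},
    )

    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user",
        content="Generate metamorphic relations for PUS ST[03] Housekeeping.",
    )
    run = client.beta.threads.runs.create_and_poll(thread_id=thread.id,
                                                   assistant_id=assistant.id)
    if run.status == "completed":
        messages = client.beta.threads.messages.list(thread_id=thread.id)
        print(messages.data[0].content[0].text.value)  # newest message first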

3.2 Setting LLM Hyperparameters


3.2.1 Setting the Temperature in LLMs
The temperature in LLMs is a hyperparameter that controls the randomness or creativity of the
model’s output. The last layer of the LLM computes the logits for each token in the vocabulary.
The SoftMax function is applied to these logits to convert them into probabilities. The SoftMax
function for a logit z_i in a set of logits Z is defined as:

P(y_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}    (3.2)

where P(y_i) is the probability of the i-th token being the next token. The temperature T can
be used to scale the logits before applying the SoftMax function. The temperature scaling is
defined as:

P(y_i) = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}    (3.3)

The term "temperature" is borrowed from the Boltzmann distribution in statistical mechanics.
A high temperature smooths the distribution, resulting in more diverse outputs, while a low
temperature sharpens the distribution, leading to more deterministic outputs. In OpenAI’s API
the default value for the temperature is 1.0 and it can be set between 0.0 and 2.0.

Figure 3.1 illustrates the effect of temperature on the SoftMax probabilities.
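
The scaling in equation (3.3) is easy to reproduce; below is a small NumPy sketch with made-up logits for four candidate tokens:

    import numpy as np

    def softmax_with_temperature(logits, T):
        """Temperature-scaled SoftMax, as in equation (3.3)."""
        z = np.asarray(logits, dtype=float) / T
        z -= z.max()               # numerical stability; cancels out in the ratio
        p = np.exp(z)
        return p / p.sum()

    logits = [2.0, 1.0, 0.5, -1.0]   # made-up logits for four candidate tokens
    for T in (0.5, 1.0, 2.0):
        print(T, softmax_with_temperature(logits, T).round(3))
    # Low T sharpens the distribution (more deterministic output),
    # high T flattens it (more diverse output).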

As can be seen in equation (3.3), the temperature T = 0 results in division by zero, which is not
defined. In practice, the parameter T = 0 needs special handling. One possibility is to select
the token with the highest logit value, which would result in a deterministic output. The handling
of the T = 0 parameter is not disclosed, but we observed that setting temperature=0.0 does
not guarantee a deterministic output on OpenAI’s API, while Anthropic’s Claude-3 model
returns a deterministic output when the temperature is set to zero.
11 Migration Guide: https://platform.openai.com/docs/assistants/migration

Figure 3.1: Example of the effect of temperature on the SoftMax probabilities.

We decided to use the value of T = 0 for both the MLFS and the PUS experiments.
Unfortunately, we could not achieve perfect reproducibility of the results on OpenAI’s API, even
when we combined the temperature=0.0 and top_p=0.0 parameter settings. The results
were consistent, but not perfectly reproducible. The only way to obtain deterministic results
on OpenAI’s API would be to use the Chat API interface and set the seed
parameter. However, according to OpenAI’s documentation,12 “determinism is not guaranteed,
and you should refer to the system_fingerprint response parameter to monitor changes in
the backend.”

12 API Reference - Chat Create: https://platform.openai.com/docs/api-reference/chat/create

While the Chat API in combination with the seed parameter could provide deterministic
results, it would merely pin down one arbitrary sample, not the best result. Since our main mission is to
find the best results achievable with the current LLMs, we decided to use the Assistants API
with the temperature=0.0 parameter setting, accepting a small variability and repeating the
experiments as needed to ensure the best results.

3.2.2 Text Completion with Default Temperature and Top P Settings


We stated previously that the Text Completion API13 is a legacy model capable only of
predicting the next word or sequence of words for a given prompt. We cannot use this model for
MR generation, but we can use it to get a glimpse into the inner workings of LLMs.

13 Text Completion API: https://platform.openai.com/playground/complete

We are going to use the Text Completion API on OpenAI’s playground. The model is set
to gpt-3.5-turbo-instruct , while the Temperature and Top P are set to the default value
of 1.0 . In the lower-right corner there is a parameter called Show probabilities , which
will be set to Full spectrum . This setting highlights the individual tokens with
different colors according to the predicted probability. The colors range from green for the most

probable tokens to red and then white for the least probable tokens. We are going to use a
“Lorem ipsum” placeholder text as input prompt:

lorem ipsum dolor sit amet

Pressing the Submit button will perform the text completion. We can select with the mouse
any token to show the predicted list of top 10 tokens and their probabilities, as can be seen
in Figure 3.2.

Figure 3.2: Text completion showing the token selection with probabilities.

In this particular example we can see that at the first token position after the prompt, the
model had a list of candidate tokens: the comma , with 39.7% probability, the double newline
\n\n with 14.62%, the single newline \n with 5.94%, consectetur with 5.82%, </ with
3.95%, ',\n with 3.91%, and a few more tokens with lower probabilities. In the presented
case the ',\n token was selected, which has quite a low probability of 3.91%, hence it is
colored red. This token is then followed by ' and text , also low-probability tokens. Since
the first three tokens have low probabilities, this steers the subsequent completion into a
strange combination of Lorem ipsum filler text arranged in a variable-like structure of an
unspecified programming language. Another experiment with the same prompt and the same
model started generating tokens with higher probabilities, making the whole completion more
coherent with the Lorem ipsum filler text — as can be seen in Figure 3.3.

Figure 3.3: High probability tokens ,, consectetur, adipiscing and elit selected at the
beginning of text completion.

In another experiment where low probability tokens \n\n , It and was were selected at the
beginning of the text completion, the model generated the origin story of the Lorem ipsum
passages — as shown in Figure 3.4.

Figure 3.4: Low probability tokens \n\n, It and was selected at the beginning
of text completion.

3.2.3 Text Completion while Altering the Temperature and Top P Settings
Setting the Temperature to 0.0 makes the model more deterministic, but it does not ensure
reproducibility — as stated before. Reusing the prompt from the previous section, we consistently
get a Lorem ipsum filler text output, but with small alterations. One of the most common
outputs can be seen in Figure 3.5.

Figure 3.5: The most common output with the Temperature set to 0.0.

We continued our experiments by altering the Top P parameter. When selecting the new token,
the model orders the predicted tokens by their probabilities in decreasing order, and the Top P
parameter sets a threshold for the cumulative probability. For example, when Top P is set
to 0.5 , the model keeps the tokens with the highest probabilities until their cumulative
probability reaches 0.5 , and only these tokens are eligible for the next token selection. When
Top P is set to 0.0 , only the single token with the highest probability can be selected — and
since there is no division-by-zero problem this time, we should, in theory, get a deterministic
output. In practice we observed the same behavior as with the Temperature set to 0.0 : we
had two basic variants of the Lorem ipsum filler text, as seen in Figures 3.6 and 3.7.
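
Before looking at the outputs, the Top P mechanism itself can be illustrated with a small sketch (our own, not taken from the experiments), applied to the top token probabilities reported in Figure 3.2:

import numpy as np

def top_p_filter(probs, top_p):
    '''keep the most probable tokens whose cumulative probability is needed
    to reach top_p, zero out the rest, and renormalize (nucleus sampling)'''
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                    # decreasing probability
    cum = np.cumsum(probs[order])
    keep = np.concatenate(([True], cum[:-1] < top_p))  # top token always kept
    mask = np.zeros(len(probs), dtype=bool)
    mask[order[keep]] = True
    filtered = np.where(mask, probs, 0.0)
    return filtered / filtered.sum()

probs = [0.397, 0.1462, 0.0594, 0.0582]  # top tokens from Figure 3.2
print(top_p_filter(probs, top_p=0.5))    # the first two tokens survive
print(top_p_filter(probs, top_p=0.0))    # only the single most probable token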

Figure 3.6: The most common output with the Top P set to 0.0. The difference
from the other alternative is in the 4th row — here the token ipsum is selected.

The first three rows of the two outputs are identical, and the divergence can be observed in the
4th row, after the tokens +b ib endum : in the first case the token ipsum is selected, in the
other case the token elit . We can observe that in both cases the most probable token is
selected, and the difference arises because we get two completely different token predictions. This
means that OpenAI's model has another layer of randomness that cannot be controlled by the
Temperature and Top P parameters, and this prevents us from achieving perfect reproducibility
of the results.

We can observe that always selecting the token with the highest probability results in repeating
text patterns in the output. In early models with a small context length this could lead to
circular output, repeating the same text over and over; in modern LLMs, with their increased
context length and large vocabulary, this is not the case.

Figure 3.7: Another alternative output with the Top P set to 0.0. The difference
from the previous output is in the 4th row — here the token elit is selected.

When setting the Temperature to 2.0 we can encounter other problems. The model is
more creative and the output is more diverse, but it can generate text that is not coherent
with the input prompt. In extreme cases the text generation can derail completely, resulting
in tokens that cannot be decoded as UTF-8. This can be seen in Figure 3.8. Again, this is a
problem of older models; in modern LLMs the probability of encountering it is practically
zero, and even if it occurs, the model handles it gracefully by choosing another token, without
exposing the user to an error message.

3.3 MR Generation Objective


To demonstrate the efficacy of our approach, we focus on generating MRs for two critical
software packages used in space systems:

• The Mathematical Library for Flight Software (MLFS), developed by the European Space
Agency (ESA), which provides a suite of mathematical functions typically required in
flight software, particularly in the areas of Attitude and Orbit Control Systems (AOCS)
and scientific algorithms.

• The Packet Utilization Standard (PUS), a critical framework developed and maintained
by ESA for the telemetry and telecommand of spacecraft. PUS is designed to ensure
reliable and standardized transmission of data between ground applications and on-board
spacecraft systems.

Our objective is to generate MRs using LLMs with and without access to SRS documents. We
aim to compare MRs generated from different document formats (PDF, PPTX, JSON), to
evaluate the LLM's ability to extract relevant information from these documents, and to
analyze the quality, relevance and uniqueness of the generated MRs. Through this
comprehensive analysis, we intend to identify both the challenges and opportunities associated
with using LLMs for MR generation in the space industry, providing valuable insights for future
applications and research in this field.

Figure 3.8: Temperature set to 2.0 can lead to UTF-8 decoding errors in older
models.

Chapter 4

Empirical Assessment

4.1 Introduction
This chapter presents an empirical assessment of using LLMs to generate MRs for testing an
essential software library in the space domain, the MLFS library, and a critical software
framework, the PUS. The study aims to evaluate the capability of LLMs to generate MRs based
on prior knowledge and on SRS (Software Requirements Specification) documents, to compare
the quality and relevance of MRs generated under these conditions, and to assess the impact of
the document format on the generated MRs.

The chapter is organized as follows: Section 4.2 presents the research design, including the
research questions and the approach used to answer them. Section 4.3 describes the data
collection process, including the subjects of the study, the instruments and tools used, and the
procedure. Section 4.4 presents the analysis and results, including the techniques, software,
and validation used for the analysis, as well as the presentation of the data, statistics, and
qualitative findings. Finally, Section 4.5 presents the conclusions of the study and discusses
future research directions.

4.2 Research Design


This section outlines the design for our empirical assessment of MR generation for MLFS and
PUS using LLMs. We present the research questions, the approach used to address them, and
the research subjects involved in the study.

4.2.1 Research Questions


Our study aims to address the following research questions:

• RQ1: Can LLMs generate MRs from SRS? This question explores the capability of
LLMs to interpret software specifications and derive meaningful MRs from them. We aim
to assess the relevance, diversity and validity of the MRs generated. This investigation
could potentially lead to more efficient and scalable methods for generating test cases in
complex software systems.

• RQ2: Do MRs generated from SRS qualitatively differ from MRs generated
based only on prior knowledge of LLMs? Here, we seek to understand the differences
between MRs derived directly from software requirement specifications (SRS) and those
generated based on the LLM’s pre-existing knowledge (i.e., information captured by the
LLM’s training process). This comparison will help us evaluate the added value of using
SRS as input, as well as identify any biases or limitations in the LLM’s prior knowledge
when it comes to specific domains like space systems software. This will help us decide whether
it is better to fine-tune an LLM (because pre-existing knowledge leads to better results) or
whether processing documentation without fine-tuning leads to more accurate results.
The findings could inform best practices for leveraging LLMs in software testing processes.

• RQ3: Can documents with a different format lead to different MRs? This
question investigates the potential benefits of using additional document artifacts, beyond
standard PDF specifications. We aim to determine if supplementary materials such
as diagrams, or interactive documentation can lead to the creation of more diverse,
comprehensive, or domain-specific MRs. The results could influence future practices in
documentation and specification writing for software testing purposes.

• RQ4: Can LLMs extract relevant test information from large SRS documents?
This question examines the ability of LLMs to navigate and extract information from
extensive technical specifications. We aim to assess how effectively LLMs can identify and
utilize key testing-related details within large documents, which is crucial for generating
accurate and comprehensive MRs. This investigation could provide insights into the
scalability of using LLMs for test case generation in complex systems with voluminous
documentation.

Through this assessment, we seek to explore the potential of LLMs in automating and enhancing
the software testing process for space systems, potentially revolutionizing the way we approach
quality assurance in this critical domain.

In our experimental design, we address the research questions as follows:

• RQ1 is addressed by both the MLFS and PUS experiments.

• RQ2 is examined solely through the MLFS experiments. This is because PUS MR generation
was not possible without attached SRS documents, as the LLM refused to generate MRs
based on prior knowledge alone.

• RQ3 is investigated using the PUS experiments. This choice was made because we had three
different document formats available for PUS (PDF, PPTX, and JSON), allowing for a
comprehensive comparison.

• RQ4 is addressed through the PUS experiments. The PUS documentation, being more
complex and extensive, is better suited to evaluate the LLM's information extraction
capabilities from large technical documents.

This structured approach allows us to comprehensively examine our research questions across
different software packages and testing scenarios.

4.2.2 MLFS Approach


The MLFS experiments are designed to address RQ1 and RQ2, relying mainly on the
specifications given in the SRS - Basic MLFS E1356 PDF document.1 Our research design
comprehensively evaluates the capabilities of LLMs in generating MRs for the MLFS library
in three distinct scenarios:

1. LLM Prior Knowledge Scenario: In this baseline scenario, we evaluate the LLM's
ability to generate MRs solely based on its prior knowledge. The model is prompted to
create MRs for MLFS without additional context or documentation. This approach allows
us to assess the LLM’s inherent understanding of mathematical concepts and software
testing principles. It also serves as a benchmark to compare against the performance in
the other scenarios.

2. SRS-Informed Scenario: For this scenario, we provide the LLM with the MLFS SRS
document, but we do not explicitly instruct it to use this information. The LLM is free to
incorporate the SRS content into its reasoning process as it sees fit. This setup allows
us to observe how the model integrates new, domain-specific information with its prior
knowledge. We aim to determine whether the mere presence of the SRS leads to more
accurate or relevant MRs compared to the baseline scenario.

3. SRS-Directed Scenario: In the final scenario, we explicitly instruct the LLM to
generate MRs based on the MLFS SRS document. The model is directed to analyze
the specifications and derive MRs that directly align with the stated requirements. This
approach tests the LLM's ability to comprehend and apply specific technical documentation
in the context of software testing. We expect this scenario to produce the most targeted
and domain-specific MRs.

By systematically comparing the results across these scenarios, we aim to gain insights into the
LLM’s performance in generating MRs for the MLFS library under varying conditions. The
insights gained from this analysis will contribute to our understanding of how LLMs can be
effectively leveraged in software testing processes for critical space systems.

4.2.3 PUS Approach


The PUS experiments are designed to address RQ1, RQ3 and RQ4, exploring both the generation
of MRs from specifications and the potential benefits of using various supporting document
artifacts in the process while also assessing the LLM’s ability to extract relevant test information
from large technical documents.

For the PUS experiments we utilize three key documents:


1 MLFS SRS: https://gitlab.com/uniluxembourg/snt/svv/msc/nemet-attila-istvan-msc/thesis_work/-/blob/main/E1356-GTD-SRS-01_I1_R4.pdf

• ECSS-E-ST-70-41C: The official “Telemetry and telecommand packet utilization” PDF
document,2 revision C, 656 pages, dated 15 April 2016.

• PUS Training Material: A PowerPoint presentation3 (converted from .pptm to .pptx)
obtained from an ESA training course, containing 355 slides, dated 15 March 2023.

• ECSS Applicability Requirement Matrix: An Excel file containing PUS requirements,
cleaned and converted to a row-wise JSON format,4 retaining the "Requirement Identifier",
"Requirement Specification" and "Note" columns.

2 PDF file: https://gitlab.com/uniluxembourg/snt/svv/msc/nemet-attila-istvan-msc/thesis_work/-/blob/main/ECSS-E-ST-70-41C_PUS-C_15April2016.pdf
3 PPTX file: https://gitlab.com/uniluxembourg/snt/svv/msc/nemet-attila-istvan-msc/thesis_work/-/blob/main/ECSS-E-ST-70-41C_training_20230315.pptx
4 JSON file: https://gitlab.com/uniluxembourg/snt/svv/msc/nemet-attila-istvan-msc/thesis_work/-/blob/main/EARM_ECSS-E-ST-70-41C_PUS-C-rec.json

The three documents share the common focus of describing the PUS revision C, particularly its
architecture and service types. Each document discusses the foundational concepts, including
message exchanges and system specification. They cover how PUS defines the interface and
behavior of telecommand and telemetry services.

The key distinctions among the three documents include:

• Purpose and Detail:

– Official PDF: The PDF is a comprehensive and formal specification. It provides an
exhaustive, clause-based breakdown of requirements and guidelines for implementing
PUS-C in various missions. The document is structured to serve as an authoritative
reference for compliance with the standard.

– Training PPTX: The PowerPoint presentation is designed for educational purposes,
offering a high-level overview and examples for training users. It simplifies concepts,
offering diagrams, use cases, and practical illustrations. The training material also
highlights changes from previous versions, such as PUS-A to PUS-C, and focuses on
operational experience.

– Requirements JSON: The JSON file is a machine-readable database that catalogs
specific requirement identifiers and conditions for PUS-C. It can be used for tracking,
verifying, and applying specific requirements in software systems, possibly within
tools like DOORS (Dynamic Object-Oriented Requirements System) or similar
requirements management systems.

• Level of Detail:

– Official PDF: Provides detailed descriptions of each service, their system specifications,
and interface requirements. It includes normative references and formal language using
terms like “shall” to express mandatory compliance.

– Training PPTX: Uses simplified language and focuses more on practical implementation,
such as how to deploy service types, PUS file management evolution, and the history of
changes from PUS-A to PUS-C. It lacks the exhaustive legalistic detail of the PDF.

– Requirements JSON: Provides a structured and concise format for specific re-
quirements but without context or explanation. It is intended for system integration
rather than human interpretation or education.

• Target Audience:

– Official PDF: Intended for engineers and managers who are responsible for the
compliance and implementation of PUS-C in mission architectures.

– Training PPTX: Aimed at users undergoing training, such as engineers new to
PUS or those needing to refresh their understanding of the standard.

– Requirements JSON: Targeted at software tools and systems engineers who need
to manage, trace, and verify compliance with the detailed requirements of PUS-C.

Together, these documents form a comprehensive suite, with each format relevant to different
aspects of our research.

Our approach to addressing RQ1 (Can LLMs generate MRs from SRS?) was significantly
influenced by an unexpected finding: when prompted, the LLM consistently failed to generate
MRs for the PUS framework based solely on its prior knowledge. Instead, it explicitly requested
documentation to proceed with the task. This behavior meant that we could not use the PUS
experiments to assess RQ2, and led us to recalibrate our research focus, shifting from evaluating
the LLM's prior PUS knowledge to examining how effectively it can utilize different types of
provided documentation for MR generation. In response to this discovery, we designed and
evaluated four distinct scenarios, each providing the LLM with different sets of documentation:

1. Comprehensive Documentation: Attaching all three documents (PDF, PPTX, and
JSON) to provide the LLM with a full range of information sources.

2. Official PDF Only: Attaching only the PDF document to assess the LLM’s performance
with the standard official documentation.

3. Training PPTX Only: Attaching only the PPTX file to evaluate the effectiveness of
more explanatory, training-focused material.

4. Requirements JSON Only: Attaching only the JSON file to test the LLM’s ability to
work with concise, structured requirement data.

To address RQ3 (Can documents with a different format lead to different MRs?), we focus on
two specific scenarios:

1. Repeated PDF Queries: Attaching the PDF document and repeating the query five
times to assess the consistency of MRs generated from standard textual specifications.

2. Single PPTX Query: Attaching the PPTX file to evaluate whether the additional
context and visual elements in the training material enable the generation of unique or
more comprehensive MRs.

This comparison allows us to determine whether supporting documents, such as training
materials with potentially more accessible explanations and visual aids, can lead to the creation
of MRs that might not be derived from the official textual specifications alone.

Both sets of experiments are preceded by a request for the model to provide a full mapping
of the available TCs and TMs for the selected PUS Service Type. This allows us to assess the
LLM's ability to extract relevant test information from large technical documents, helping us
address RQ4.

Our results may influence future practices in both software documentation and automated
testing methodologies, particularly in how we structure and present software requirements to
AI systems for effective test case generation.

4.3 Data Collection


Our data collection process is structured around two main groups of experiments, each designed
to address specific research questions:

• MLFS Study (4.3.1): The MLFS study focuses on addressing RQ1 and RQ2. In this
experiment, we evaluate the LLM’s capability to generate MRs from the SRS documents
for the MLFS library. We also compare the MRs generated using the SRS documents to
those generated solely based on the LLM’s prior knowledge, without access to the SRS.

• PUS Study (4.3.2):

– PDF, PPTX, JSON experiments: Addresses RQ1 and RQ4. In these exper-
iments, we provide the LLM with three different document formats for the PUS
framework - PDF, PPTX, and JSON. We then assess the LLM’s ability to extract
a mapping of Telecommand (TC) and Telemetry (TM) services for various Service
Types (STs), to generate MRs based on the extracted data from SRS, and compare
the quality and relevance of the MRs generated across the different document formats.

– 5xPDF vs 1xPPTX experiment: Focuses on addressing RQ3. In this experiment,
we compare the MRs generated by the LLM across five repeated queries with the same
PDF document attached versus a single query with the PPTX document for the PUS
framework. This allows us to investigate whether the document format can lead to
qualitative differences in the generated MRs.

By structuring our data collection into these three sets of experiments, we are able to
comprehensively address our research questions and gain insights into the LLM's performance
in generating MRs under varying conditions and with different types of input documentation.

Both the MLFS and PUS data collection processes were conducted within a Jupyter Notebook
environment, which provided an interactive and reproducible framework for our analysis,
enabling seamless integration of code execution, documentation, and results visualization.

4.3.1 MLFS Study


This subsection describes the process of data collection for our study on the generation of MRs
for the MLFS library.

We start by installing the necessary modules and libraries, including the OpenAI Python library,
pandas, and other required dependencies for running the Jupyter Notebook environment.

We create a new notebook, import the modules needed for the study, and initialize an OpenAI
API client instance.
import os, time
import pandas as pd
from io import StringIO

# gpt-4 response has Markdown format
from IPython.display import display, Markdown, Latex

pd.set_option('display.max_colwidth', 200)

from openai import OpenAI
client = OpenAI()
Listing 4.1: Importing the modules and initiating the OpenAI API client

To interact with the OpenAI API, authentication is required via an API key, which can be
generated on the OpenAI website.5 This key can be provided in two ways: as a command-line
parameter or through the OPENAI_API_KEY environment variable. For enhanced security, we
opted for the environment variable method. This approach allowed us to authenticate our
requests while keeping the API key confidential and separate from our codebase.

5 OpenAI API keys: https://platform.openai.com/api-keys
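
A minimal sketch of this setup (our illustration; it assumes the key was exported in the shell beforehand):

import os
from openai import OpenAI

# the client reads OPENAI_API_KEY from the environment when no key is passed,
# so the key never appears in the notebook source
assert 'OPENAI_API_KEY' in os.environ, 'export OPENAI_API_KEY=<your key> first'
client = OpenAI()  # equivalent to OpenAI(api_key=os.environ['OPENAI_API_KEY'])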

We utilize OpenAI’s Python library version 1.17.1 (the latest version available during the MLFS
experiments) for API interactions. Additional Python libraries such as pandas are employed
for data manipulation and analysis, while the Markdown module from IPython.display is
used to render the Markdown-formatted responses from the GPT-4 model.

To streamline our analysis, we implemented several helper functions:

• cost(res) : Calculates and returns the number of input tokens, output tokens, and
associated price based on the API response (listing 4.2).

• wait_ans(thread, run) : Manages the asynchronous nature of API calls, waiting for the
completion of a run and displaying time, token usage, and price information (listing 4.3).

• md2df(txt) : Converts markdown-formatted text (typically containing CSV data) into a
pandas DataFrame for easier analysis (listing 4.4).
# define function to extract in/out tokens and price

def cost(res):
    '''
    return intokens, outtokens, price based on the retrieved run object model
    '''
    if res.model.startswith('gpt-3.5-turbo'):
        inprice, outprice = 0.5, 1.5
    elif res.model.startswith('gpt-4-turbo'):
        inprice, outprice = 10, 30
    intokens, outtokens = res.usage.prompt_tokens, res.usage.completion_tokens
    # https://openai.com/pricing
    price = (inprice * intokens + outprice * outtokens) * 1e-4
    return intokens, outtokens, price  # cents

Listing 4.2: Implementing the cost() function

# define function waiting for chatGPT answer

def wait_ans(thread, run):
    '''
    wait until run completed and display time/tokens/price
    '''
    t = 0
    while True:
        res = client.beta.threads.runs.retrieve(
            thread_id=thread.id,
            run_id=run.id
        )
        if res.status == 'completed':
            print('\n\ncompleted\n')
            print(f'runtime: {t} seconds')
            intk, outk, price = cost(res)
            print(f'tokens in/out: {intk}/{outk}, price: {price:.2f} cents')
            break
        else:
            print('.', end='')
            time.sleep(1)
            t += 1

Listing 4.3: Implementing the wait_ans() function

# define function to extract csv from markdown and convert to df

def md2df(txt):
    '''
    return a pandas df extracted from a csv extracted from a markdown txt
    '''
    try:
        # csv delimited by ```csv\n ... ``` inside md
        csv_data = txt.split("```csv\n")[1].split("```")[0]
    except IndexError:
        # csv delimited by ```\n ... ``` inside md
        csv_data = txt.split("```\n")[1].split("```")[0]
    return pd.read_csv(StringIO(csv_data))

Listing 4.4: Implementing the md2df() function

1st Part of the MLFS Study: LLM Prior Knowledge Scenario

We define an assistant using the gpt-4-turbo model, the state-of-the-art model available
during the MLFS experiments.
# create simple assistant

assistant = client.beta.assistants.create(
    name="swTesterExpert_4_turbo",
    instructions="You are a software tester expert.",
    tools=[{"type": "retrieval"}],
    # GPT-4 Turbo, context: 128,000, cutoff date: Dec 2023
    model="gpt-4-turbo-2024-04-09",
    # temperature=0.0
)

Listing 4.5: Defining the assistant

We specify the assistant instructions "You are a software tester expert" in order to guide
the LLM in generating MRs as if it were an expert in software testing. We set the tools
type to retrieval to enable the assistant to retrieve information from external sources, but
we do not attach a document for the first MLFS test case.

The assistant is defined only once, and it is reused for all subsequent interactions. We are
retrieving the assistant from the assistants list by specifying its name:
assistant = [x for x in client.beta.assistants.list().data
             if x.name == 'swTesterExpert_4_turbo'][0]

Listing 4.6: Retrieving the assistant

The next steps are to define a thread and a message within the thread, requesting 10 MRs
for an MLFS library function, e.g. atan2 , as shown in Listing 4.7.
# create new thread
thread = client.beta.threads.create()

# create message for 10 MRs
msg1 = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content='''I want you to infer 10 MRs (Metamorphic Relations) for the function `atan2(x, y)` which calculates the arctangent of the division y/x of their arguments x and y in the output range [-pi, pi] radians.
Please provide the answer in a parsable CSV format with the header:
"MR #","MR Title","MR Formula","MR Text"
''')

Listing 4.7: Defining the thread and message

We request the LLM to generate 10 MRs in CSV format with the specified header. The
header was chosen following the principle of “no interference with the inference”, so as not to
distract the model from its main task. Our initial test showed that the model structured
its answer with an MR number, MR title, MR formula and MR explanation — as seen in
Figure 4.1 — so we decided to use this structure for the CSV header.

Figure 4.1: Initial test of generating MRs without specifying the output format

We continue by creating a run with temperature=0.0 to make the results as deterministic
as possible. We then wait for the completion of the run and display the time, token usage,
and price information using the wait_ans() function.
# create run with temperature = 0

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
    temperature=0.0,
)

wait_ans(thread=thread, run=run)

Listing 4.8: Creating a run and waiting for completion

We display the result and extract the CSV data from the markdown-formatted response using
the md2df() function.
# display answer

msg_list = client.beta.threads.messages.list(thread_id=thread.id)
display(Markdown((msg_list.data[0].content[0].text.value)))

# extract csv to df

df = md2df(msg_list.data[0].content[0].text.value)
df

Listing 4.9: Displaying the result and extracting CSV data

We continue by asking the model for a self-reflection on how well it performed the previous
assignment, by creating a new message in the same thread.
msg2 = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content='''Can you make a self-assessment on the validity of the proposed 10 MRs?
Reminder: MT (Metamorphic Testing) involves multiple executions of the program using altered inputs.
Mark 1 for valid, and 0 for not valid. Please provide the answer in a parsable CSV format with the header:
"MR #","self validation"
'''
)

Listing 4.10: Creating a new message for self-reflection

We again extract the CSV data into another DataFrame, merge the two DataFrames, display
the result, and save the merged DataFrame into a CSV file.
# Extracting the self-assessment
df1 = md2df(msg_list.data[0].content[0].text.value)

# combine MR definition and self-validation
df = pd.merge(df, df1, on='MR #')
display(df)

# save df to file
# we need utf-16 because excel can't handle properly the default utf-8
# we need to use tab separators instead of comma when using utf-16 encoding
df.to_csv('MR_atan2.csv', encoding='utf-16', sep='\t', index=False)

Listing 4.11: Extracting the self-assessment data and saving the results

We need to use UTF-16 encoding because Excel on Windows 11 (still) does not properly handle
the default UTF-8 encoding for mathematical symbols.
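
A quick illustration of this encoding choice (our own example, not from the notebook):

import pandas as pd

# formulas with symbols such as ± and π are mis-rendered when a utf-8 CSV is
# double-clicked open in Excel on Windows; utf-16 with tab separators works
df_demo = pd.DataFrame({'MR Formula': ['atan2(-x, y) = atan2(x, y) ± π']})
df_demo.to_csv('demo.csv', encoding='utf-16', sep='\t', index=False)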

2nd Part of the MLFS Study: SRS-Informed Scenario


The only difference in the second part of the MLFS study is that we are going to use the MLFS
SRS document as input for the LLM.

For the 2nd part of the study, we need to upload the MLFS SRS on the OpenAI platform (see
Listing 4.12).
# upload file

file = client.files.create(
    file=open("E1356-GTD-SRS-01_I1_R4.pdf", "rb"),
    purpose="assistants"
)

Listing 4.12: Uploading the MLFS SRS document

Then the file can be retrieved from the file list (see Listing 4.13 as an example).
# retrieve MLFS SRS file

file = [x for x in client.files.list().data
        if x.filename == 'E1356-GTD-SRS-01_I1_R4.pdf'][0]

Listing 4.13: Retrieving the uploaded file

We then attach it to the assistant (see Listing 4.14).


# create assistant accessing the SRS

assistant = client.beta.assistants.create(
    name="swTesterExpert_4_turbo_MLFS",
    instructions="You are a software tester expert.",
    tools=[{"type": "retrieval"}],
    # GPT-4 Turbo, context: 128,000, cutoff date: Dec 2023
    model="gpt-4-turbo-2024-04-09",
    file_ids=[file.id],
)

Listing 4.14: Attaching the MLFS SRS document to the assistant

We defined the assistant with a name indicating that the MLFS SRS document is attached,
and we can retrieve it from the assistants list by specifying its name:
assistant = [x for x in client.beta.assistants.list().data
             if x.name == 'swTesterExpert_4_turbo_MLFS'][0]

Listing 4.15: Retrieving the assistant with the MLFS SRS document

The rest of the process is similar to the first part of the MLFS study, with the difference that
we are saving the CSV file with the suffix _srs to differentiate it from the previous one.

3rd Part of the MLFS Study: SRS-Directed Scenario


The third part of the MLFS study is similar to the second part, but this time we make sure
that the LLM uses the MLFS SRS document by explicitly instructing it to do so, as shown in
Listing 4.16.

Although the LLM has a pretty good idea of what the atan2 function usually does, we
direct it to use the MLFS SRS document to read the specifics of this particular implementation.
The rest of the process is similar to the previous parts of the MLFS study, with the difference
that we save the CSV file with the suffix _use_srs to differentiate it from the previous
ones.
# create message for 10 MRs

msg1 = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    file_ids=[file.id],
    content='''I want you to infer 10 MRs (Metamorphic Relations) for the function `atan2()` specified in the attached SRS document chapter `5.2.9 ATAN2 (arctan y/x)`.
Please provide the answer in CSV format with the header:
"MR #","MR Title","MR Formula","MR Text"
'''
)

Listing 4.16: Creating a message to instruct the LLM to use the MLFS SRS
document

Test targets in the MLFS Study


We repeat the above process for 7 functions from the MLFS library, each time saving the
results in a separate CSV file. The functions are: atan2 , asin , exp , fmin , fmod ,
hypot , and pow . They have been selected to include both simple and more complex
mathematical operations, one and two input arguments, and different output ranges. The
results will be analyzed to assess the LLM's performance across a variety of test cases, providing
insights into its ability to generate MRs for different types of functions and mathematical
operations.
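
A hypothetical driver loop for this repetition could look as follows (a sketch, not verbatim from the notebook; df denotes the merged DataFrame produced by the steps of Listing 4.11):

# sketch: repeat the thread/message/run steps above for each selected function
mlfs_functions = ['atan2', 'asin', 'exp', 'fmin', 'fmod', 'hypot', 'pow']
for fn in mlfs_functions:
    # ... create a thread, request 10 MRs for `fn`, request the self-validation,
    # ... and merge the two DataFrames into `df` as in Listing 4.11, then:
    df.to_csv(f'MR_{fn}.csv', encoding='utf-16', sep='\t', index=False)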

4.3.2 PUS Study


This subsection describes the process of data collection for our study on the generation of MRs
for the PUS framework. The PUS study is similar to the MLFS study, and we are going to
highlight the differences in the data collection process.

Most of the modules are already installed from the previous MLFS study, but we had to
upgrade the OpenAI Python library to version 1.31.0 in order to use the new
gpt-4o model. To begin, the required modules are imported, and the OpenAI API client is
initiated, following the same procedure as in the MLFS study.
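
The upgrade itself can be performed directly from a notebook cell; the exact command used is not recorded, so the following Jupyter shell escape is an assumption:

# assumed upgrade cell; in Jupyter, the leading ! runs a shell command
!pip install --upgrade openai==1.31.0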

The PUS revision C standard has many services grouped into Service Types (STs) — as can
be seen in Figure 4.2.

Figure 4.2: PUS Service Types slide from the ESA training material

Studying all the STs would be too much for a single study, so we are going to focus on the
TC/TM (Telecommand/Telemetry) services:

• ST[03] Housekeeping: a widely used service to monitor the health of the system.

• ST[17] Test: a very simple service, similar to the ping command in networking.

• ST[01] Request Verification: a special case, as all the other STs are passing through
this service.

• ST[23] File Management: an easy-to-understand service, as it is similar to file operations
on a computer.

• ST[22] Position-Based Scheduling: a complex, space-specific service, as it is used to
schedule tasks based on the satellite's position.

The data collection has been automated in the Jupyter Notebook file PUS_STxx_Service_Type.ipynb .
We set the variable st_par_nr_name , and from this string we derive all the messages and file
names needed for the study.
# Service Type with name, eg: ST[03] Housekeeping
st_par_nr_name = 'ST[03] Housekeeping'

# Service Type number, eg: 03
st_nr = st_par_nr_name[3:5]

# chapter number = st number without leading 0, eg: 3
ch_nr = str(int(st_nr))

# ST name, eg: Housekeeping
st_name = st_par_nr_name[7:]

# replace space with underscore in filename, eg: ST03_Housekeeping
st_file_name = 'ST' + st_nr + '_' + st_name.replace(' ', '_')

Listing 4.17: Setting the service type number and name
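
For example, with st_par_nr_name = 'ST[03] Housekeeping' this yields st_nr = '03' ,
ch_nr = '3' , st_name = 'Housekeeping' , and st_file_name = 'ST03_Housekeeping' .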

We import the modules and initiate the OpenAI API client as we did in the MLFS study. We
are going to reuse the helper functions defined in the MLFS study, but the cost function is
going to be updated to reflect the new pricing for the gpt-4o model:
# define function to extract in/out tokens and price

def cost(res):
    '''
    return intokens, outtokens, price based on the retrieved run object
    '''
    if res.model.startswith('gpt-3.5-turbo'):
        inprice, outprice = 0.5, 1.5
    elif res.model.startswith('gpt-4-turbo'):
        inprice, outprice = 10, 30
    elif res.model.startswith('gpt-4o'):
        inprice, outprice = 5, 15
    intokens, outtokens = res.usage.prompt_tokens, res.usage.completion_tokens
    price = (inprice * intokens + outprice * outtokens) * 1e-4  # https://openai.com/pricing
    return intokens, outtokens, price  # cents

Listing 4.18: Updating the cost() function for the gpt-4o model

The wait_ans() function is prepared to handle the rate_limit_exceeded exception, which
can occur when the model is used too frequently. In this case, we wait for 15 seconds and then
retry the request:
# define function waiting for chatGPT answer

def wait_ans(thread, run):
    '''
    wait until run completed and display time/tokens/price
    '''
    t = 0
    while True:
        res = client.beta.threads.runs.retrieve(
            thread_id=thread.id,
            run_id=run.id
        )
        if res.status == 'completed':
            print('\n\ncompleted\n')
            print(f'runtime: {t} seconds')
            intk, outk, price = cost(res)
            print(f'tokens in/out: {intk}/{outk}, price: {price:.2f} cents')
            print('sleep 1 sec\n')
            time.sleep(1)
            break

        # Exception: LastError(
        #     code='rate_limit_exceeded',
        #     message='Rate limit reached for gpt-4o on tokens per min (TPM):
        #     Limit 30000, Used 10643, Requested 21017.
        #     Please try again in 3.32s.
        #     Visit https://platform.openai.com/account/rate-limits')

        elif res.last_error is not None:
            if res.last_error.code == 'rate_limit_exceeded':
                print('last_error:\n', res.last_error, '\n')
                print('sleep 15 sec\n')
                time.sleep(15)
                run = client.beta.threads.runs.create(
                    thread_id=thread.id,
                    assistant_id=assistant.id,
                    temperature=0.0,
                )
            else:
                raise Exception(res.last_error)
        else:
            print('.', end='')
            time.sleep(1)
            t += 1

Listing 4.19: Handling the rate limit exceeded exception in the wait_ans()
function

The md2df() function is prepared to handle extra spaces appearing inside the CSV data, as
well as occasionally missing quotation marks around the field values:
import re  # used below; not shown among the earlier imports

# define function to extract csv from markdown and convert to df

def md2df(txt):
    '''
    return a pandas df extracted from a csv extracted from a markdown txt
    '''
    try:
        # csv delimited by ```csv\n ... ``` inside md
        csv_data = txt.split("```csv\n")[1].split("```")[0]
    except IndexError:
        # csv delimited by ```\n ... ``` inside md
        csv_data = txt.split("```\n")[1].split("```")[0]

    if '"' in csv_data:
        # skipinitialspace to fix issues like "", "TM[3,25]", ...
        df = pd.read_csv(StringIO(csv_data), skipinitialspace=True)
    else:
        pattern = r'(\[\d+),(\d+\])'  # pattern matching a comma inside brackets
        replacement = r'\1~\2'        # replacement pattern: ~ instead of comma
        csv_data1 = re.sub(pattern, replacement, csv_data)  # [,] -> [~]
        df = pd.read_csv(StringIO(csv_data1), skipinitialspace=True)
        df = df.replace('~', ',', regex=True)  # change back ~ to ,
    return df
Listing 4.20: Handling extra spaces and missing quotation marks in the md2df()
function
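
As a small illustration (our own example, not taken from the notebook), the regular-expression workaround protects commas inside TC/TM identifiers from being interpreted as CSV delimiters:

import re

line = 'MR1,Housekeeping report check,TC[3,1],TM[3,25]'
protected = re.sub(r'(\[\d+),(\d+\])', r'\1~\2', line)
print(protected)  # -> MR1,Housekeeping report check,TC[3~1],TM[3~25]
# the CSV now splits into the intended columns; the tilde is later
# replaced back by a comma in the resulting DataFrame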

The attachment of files is slightly different in the new OpenAI API, as we need to create a
vector_store and then add files to it:
# Create a vector store called "PUS revC"
vector_store = client.beta.vector_stores.create(name="PUS revC pdf, pptx, json-rec")

# Ready the files for upload to OpenAI
file_paths = [
    "../ECSS-E-ST-70-41C_PUS-C_15April2016.pdf",
    "../ECSS-E-ST-70-41C_training_material_20230315.pptx",
    "../EARM_ECSS-E-ST-70-41C_PUS-C-rec.json",
]
file_streams = [open(path, "rb") for path in file_paths]

# upload the files, add them to the vector store,
# and poll the status of the file batch for completion.
file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
    vector_store_id=vector_store.id, files=file_streams
)

# check the result of this operation.
print(file_batch.status)
print(file_batch.file_counts)

# completed
# FileCounts(cancelled=0, completed=3, failed=0, in_progress=0, total=3)

Listing 4.21: Creating a vector store and adding files to it

We continue by creating an assistant with access to the recently created vector store through the
tool_resources parameter. We are going to use the gpt-4o model, and in the v2 version
of the Assistants API the retrieval tool is replaced by the file_search tool:
# create assistant
# https://platform.openai.com/docs/models/gpt-4o

assistant = client.beta.assistants.create(
    name="swTesterExpert_4o_PUS_pdf_pptx_json-rec",
    instructions="You are a software tester expert.",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
    # GPT-4 omni model, context: 128,000, cutoff date: Oct 2023
    model="gpt-4o",
    temperature=0.0,
)

Listing 4.22: Creating an assistant with access to the vector store

Null Experiment of the PUS Study: Query without Documents


This is an experiment to determine if LLMs can generate MRs without documents. We are
going to create an assistant without access to the vector store:
# create simple assistant
# https://platform.openai.com/docs/models/gpt-4o

assistant = client.beta.assistants.create(
    name="swTesterExpert_4o",
    instructions="You are a software tester expert.",
    tools=[{"type": "file_search"}],
    # tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
    # GPT-4 omni model, context: 128,000, cutoff date: Oct 2023
    model="gpt-4o-2024-05-13",
)

Listing 4.23: Creating an assistant model gpt-4o without access to the vector
store

We get the assistant from the assistants list by specifying its name, we create a new thread, and
we ask the LLM to generate MRs for a specific PUS service type, e.g. ST[03] Housekeeping :
# get assistant without docs
assistant = [x for x in client.beta.assistants.list().data
             if x.name == 'swTesterExpert_4o'][0]

# create new thread
thread = client.beta.threads.create()

# create message for 10 MRs
msg1 = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content=f'''I have to test a spacecraft software which is implementing the Telemetry and Telecommand PUS standard revision C.
Can you infer 10 Metamorphic Relations (MRs) to test the {st_par_nr_name} functions specifying the TC/TM id-s?
Please provide the answer in parsable CSV format with the header:
"MR #","MR Title","Original Test Case","Follow-up Test Case","Expected Relation","Explanation of MRs"
'''
)

Listing 4.24: Trying to generate MRs without providing supporting documents

As previously stated, the model wants to provide “accurate and meaningful MRs”, and to
achieve this it needs the specific details from the standard document. It declines to generate
MRs without a supporting document — as can be seen in the answer in Figure 4.3.

Figure 4.3: ChatGPT rejects MR generation for the PUS framework when no
supporting document is provided.

1st Experiment of the PUS Study: Assistant with PDF + PPTX + JSON
To generate meaningful MRs and to assess RQ4, it is essential to create a mapping of all the
TC/TM-s for a given service type. This mapping serves several purposes:

• It ensures that the LLM reads and processes the provided documents thoroughly.

• It directs the LLM to focus on the specific parts of the document that describe the given
service type.

• The output of this mapping request automatically serves as input for the subsequent
request, where we ask the LLM to generate MRs based on the mapped TC/TM information.

• The mapping provides a comprehensive overview of the functionalities of the given service
type, which is beneficial not only for the LLM but also for the human observer.

To create the TC/TM mapping, we use the following code:


# get assistant with access to docs
assistant = [x for x in client.beta.assistants.list().data
             if x.name == 'swTesterExpert_4o_PUS_pdf_pptx_json-rec'][0]

# create new thread
thread = client.beta.threads.create()

# TC/TM
msg0 = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content=f'''Can you enumerate all the TC (TeleCommand) and TM (TeleMetry) specified in the {st_par_nr_name}?
If a TC is initiating a TM then please list the TM in the corresponding row.
Please provide the answer in parsable CSV format with the header:
"TC","TC Description","TM","TM Description"
If a TC is not initiating TM then please leave empty the "TM" and "TM Description" fields.
'''
)

Listing 4.25: Creating a message to generate a map of all TC/TM within a service type

Just like in the MLFS study, we create a run, wait for its completion, display the result,
extract the CSV data, and save the result into a CSV file:
# create run with temperature = 0
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
    temperature=0.0,
)

wait_ans(thread=thread, run=run)

# display answer
msg_list = client.beta.threads.messages.list(thread_id=thread.id)
display(Markdown((msg_list.data[0].content[0].text.value)))

# extract csv to df
df = md2df(msg_list.data[0].content[0].text.value)
display(df)

# save df to file
# we need utf-16 because excel can't handle properly the default utf-8
# we need to use tab separators instead of comma when using utf-16 encoding
df.to_csv(f'PUS_{st_file_name}_TC_TM_list_pdf_pptx_json.csv',
          encoding='utf-16', sep='\t', index=False)

Listing 4.26: Run, wait for completion, display answer and save the TC/TM
list CSV file

We continue by defining a new message in the same thread, asking the LLM to generate MRs
in such a way that every TC/TM is covered by at least one MR:
# create message to generate MRs

msg1 = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content=f'''I have to test a spacecraft software which is implementing the Telemetry and Telecommand PUS.
Can you infer Metamorphic Relations (MRs) to test the {st_par_nr_name} functions described in 6.{ch_nr} system requirements and 8.{ch_nr} interface requirements?
Please provide the answer in parsable CSV format with the header:
"MR #","MR Title","Original Test","Follow-up Test","Expected Relation"
Please make sure that every TeleCommand (TC) and TeleMetry (TM) is tested at least once.
'''
)

Listing 4.27: Creating a message to generate MRs for a given ST

Similar to the MLFS study, the CSV header was chosen by the model when prompted without
a specified output format; we then formalized it in the header of the message. We create a run,
wait for the completion of the run, display the result, extract the CSV data, and save the
result into a CSV file with the suffix pdf_pptx_json :
# save df to file
# we need utf-16 because excel can't handle properly the default utf-8
# we need to use tab separators instead of comma when using utf-16 encoding
df.to_csv(f'PUS_{st_file_name}_MRs_pdf_pptx_json.csv',
          encoding='utf-16', sep='\t', index=False)

Listing 4.28: Save the MRs to CSV file



Self-Reflection in the PUS Study


We have again a small divergence from the MLFS study: the LLM considered all MRs generated
for the PUS services to be valid, regardless of the chosen ST or the attached documents.

The self-reflection part of the PUS study has therefore been commented out, as it does not
provide any additional insight into the LLM's performance — apart from the fact that the
LLM is quite confident in the validity of the generated MRs.

2nd Experiment of the PUS Study: Message with PDF Attachment


For the second experiment we are using an assistant with no access to the vector store, but we
are going to attach the PDF document to the message. All the other steps are similar to the
first experiment, but we are going to save the TC/TM list and the MRs into separate CSV files
with the suffix _pdf .

3rd Experiment of the PUS Study: Message with PPTX Attachment


Again the assistant has no access to the vector store, but we are going to attach the PPTX
document to the message. The other steps are similar to the previous experiments, but we are
going to save the TC/TM list and the MRs into separate CSV files with the suffix _pptx .

There is a slight divergence from the previous experiments: when producing the TC/TM list,
the LLM does not insert a comma between the service type number and the sub-service number.
We checked the PPTX document, and the text there is correctly formatted — as seen in Figure 4.4.

We tried to persuade the LLM to insert the comma through explicit prompting and by providing
examples, but we could not find a consistent way to make it work. We decided to correct this
manually by performing a replace on the TC and TM columns of the data frame:
# Insert missing commas in TC/TM

df['TC'] = df['TC'].str.replace(f'TC[{ch_nr}', f'TC[{ch_nr},')
df['TM'] = df['TM'].str.replace(f'TM[{ch_nr}', f'TM[{ch_nr},')

Listing 4.29: Inserting the missing comma into the TC/TM list data frame

Figure 4.4: PUS ST[03] Housekeeping — Message types slide from the ESA
training material, displaying a comma between the service type number and the
sub-service number (e.g., TC[3,5])

Since the MRs are likewise generated without commas inside the TC and TM identifiers, we
correct this by inserting the missing comma in the Original Test , Follow-up Test and
Expected Relation columns:

# Insert missing commas in MRs

df['Original Test'] = df['Original Test'].str.replace(f'TC[{ch_nr}', f'TC[{ch_nr},')
df['Original Test'] = df['Original Test'].str.replace(f'TM[{ch_nr}', f'TM[{ch_nr},')

df['Follow-up Test'] = df['Follow-up Test'].str.replace(f'TC[{ch_nr}', f'TC[{ch_nr},')
df['Follow-up Test'] = df['Follow-up Test'].str.replace(f'TM[{ch_nr}', f'TM[{ch_nr},')

df['Expected Relation'] = df['Expected Relation'].str.replace(f'TC[{ch_nr}', f'TC[{ch_nr},')
df['Expected Relation'] = df['Expected Relation'].str.replace(f'TM[{ch_nr}', f'TM[{ch_nr},')

Listing 4.30: Inserting the missing comma into the MRs data frame

The resulting data frames are then saved into separate CSV files with the suffix _pptx_fix .

4th Experiment of the PUS Study: Message with JSON Attachment


The fourth experiment is similar to the previous ones, but this time we attach the JSON
document to the message. The TC/TM query is somewhat different, as we include the
Requirement Identifier pattern in the prompt:
msg0 = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    attachments=[
        {"file_id": file.id, "tools": [{"type": "file_search"}]}
    ],
    content=f'''Can you enumerate all the TC (TeleCommand) and TM (TeleMetry) of {st_par_nr_name} Service Type where Requirement Identifier starts with 6.{ch_nr} (system requirement) or 8.{ch_nr} (interface requirement)?
If a TC is initiating a TM then please list the TM in the corresponding row.
Please provide the answer in parsable CSV format with the header:
"TC","TC Description","TM","TM Description"
If a TC is not initiating TM then please leave empty the "TM" and "TM Description" fields.
'''
)

Listing 4.31: Creating a message to generate a map of all TC/TM with a given
Requirement ID pattern

The other steps are similar to the previous experiments, but we are going to save the TC/TM
list and the MRs into separate CSV files with the suffix _json .

5th Experiment of the PUS Study: Query 5 Times with PDF Attachment
The fifth experiment supports answering RQ3 by querying the LLM five times with the same
PDF document attached to the message. It is essentially identical to the second experiment,
but we repeat the process five times in a for loop and save the TC/TM lists and the MRs
into separate CSV files with the suffixes _i0 , _i1 , _i2 , _i3 , and _i4 .
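
A minimal sketch of this repetition (reusing the names from the earlier listings; the elided steps are the same TC/TM-mapping and MR-generation requests as in the second experiment):

for i in range(5):
    thread = client.beta.threads.create()
    # ... attach the PDF, request the TC/TM list, then request the MRs ...
    df.to_csv(f'PUS_{st_file_name}_MRs_pdf_i{i}.csv',
              encoding='utf-16', sep='\t', index=False)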

4.4 Analysis and Results


In this section, we present the results obtained from our empirical assessment of generating
MRs for the MLFS and PUS frameworks using LLMs. The section is structured as follows:

• Results from the MLFS Study (4.4.1): In the first subsection, we present the data
collected during the MLFS experiments. This includes an analysis of the validity and
relevance of the MRs generated by the LLM when provided with the MLFS software
requirements specification (SRS) documents, as well as a comparison of the MRs generated
with and without access to the SRS documents.

• Results from the PUS Study (4.4.2): In the second subsection, we present the data
collected during the PUS experiments. This includes an analysis of the LLM’s ability to
generate MRs from the PUS SRS documents across different document formats (PDF,
PPTX, JSON), and an assessment of how the document format impacts the generated
MRs.

• Assessing RQs (4.4.3): In the third subsection, we answer the research questions outlined
in Section 4.2 using the combined data from the MLFS and PUS experiments. This
includes evaluating the LLM’s overall capability to generate MRs from SRS documents
(RQ1), understanding the qualitative differences between MRs generated from SRS versus
the LLM's prior knowledge (RQ2), investigating the impact of document format on MR
generation (RQ3), and assessing the LLM's ability to extract relevant test information
from large SRS documents (RQ4).

The results are discussed in the context of the research questions, providing insights into the
potential of using LLMs to automate and enhance the software testing process for space systems.

4.4.1 MLFS Study Results


We start by presenting the results for the function atan2 . The MRs generated by ChatGPT
are shown in Table 4.1. We have three sub-tables: T1 for the MRs generated without access to
the SRS document, T2 for the MRs generated with access to the SRS document, and T3 for the
MRs generated with explicit instructions to use the SRS document.

In order to evaluate the MRs generated by the LLM, we added several columns to the table
containing statistics that assess the quality and validity of the MRs; these will help us determine
the appropriate metrics later on. The statistics can be seen in Table 4.2.

The columns in the table are as follows:

• Self Validation : this column is filled by the LLM and contains the results of the
self-validation of the MRs.

• My Validation : this column is filled by the author of this thesis and contains the
results of the validation of the MRs. The validation process has been discussed with the
supervisors to ensure its accuracy and reliability.

• My Validation Reasoning : Whenever we consider an MR invalid (noted with value 0


in the table), we provide a short explanation in this column. We note here as well when
an MR is repeated within the same sub-table.

• MR Valid & Unique : this column counts the MRs which are valid and unique according
to our standards.

• Invalid MR, Easy Fix : this column counts the MRs which we consider as invalid
because of a small “glitch” or “spelling error” in the MR generation and there is an easy
fix to make them valid.

• Invalid MR, Contradicting SRS : this column counts the MRs which explicitly contradict
the specification in the supporting documents, i.e., they take a rule from the SRS but do
not follow the described behavior exactly.

• Self + My Validation : these four columns count all the combinations of the LLM and
our validations:

– under 00 both validations are invalid,

– under 01 the LLM is invalid and our validation is valid,

– under 10 the LLM is valid and our validation is invalid,

– under 11 both validations are valid.



Table 4.1: MRs generated by ChatGPT for the atan2 function

T1: chatGPT inferring 10 MRs without access to SRS pdf

MR# | MR Title | MR Formula | MR Text | Self Validation
MR1 | Sign Change in y | atan2(x, -y) = -atan2(x, y) | If the sign of y is reversed, the output of atan2 should be the negation of the original output. | 1
MR2 | Sign Change in x | atan2(-x, y) = atan2(x, y) ± π | If the sign of x is reversed, the output of atan2 should be the original output plus or minus π, depending on the quadrant. | 1
MR3 | Multiplying by a Positive Scalar | atan2(k*x, k*y) = atan2(x, y) for k > 0 | Multiplying both x and y by the same positive scalar does not change the output of atan2. | 1
MR4 | Multiplying by a Negative Scalar | atan2(k*x, k*y) = atan2(x, y) ± π for k < 0 | Multiplying both x and y by the same negative scalar should result in the output of atan2 being the original output plus or minus π. | 1
MR5 | Swapping Arguments | atan2(y, x) = π/2 - atan2(x, y) | Swapping the arguments x and y should result in the output being π/2 minus the original output. | 1
MR6 | Adding π to Both Arguments | atan2(x + π, y + π) = atan2(x, y) | Adding π to both x and y should not change the output of atan2. | 0
MR7 | Permutation of Signs | atan2(-x, -y) = atan2(x, y) ± π | Changing the signs of both x and y simultaneously should result in the output of atan2 being the original output plus or minus π. | 1
MR8 | Scaling Both Arguments | atan2(a*x, a*y) = atan2(x, y) for any a ≠ 0 | Scaling both arguments by any non-zero scalar a should not change the output of atan2. | 1
MR9 | Zero Arguments | atan2(0, 0) = undefined | When both x and y are zero, atan2 is undefined. | 1
MR10 | Doubling Arguments | atan2(2*x, 2*y) = atan2(x, y) | Doubling both x and y should not change the output of atan2. | 1
Self Validation sum: 9 (same or similar MRs marked)

T2: chatGPT inferring 10 MRs with access to the SRS pdf

MR# | MR Title | MR Formula | MR Text | Self Validation
MR1 | Sign Flip X | atan2(-x, y) = -atan2(x, y) | Flipping the sign of x negates the output. | 1
MR2 | Sign Flip Y | atan2(x, -y) = -atan2(x, y) | Flipping the sign of y negates the output. | 1
MR3 | Swap XY | atan2(y, x) = π/2 - atan2(x, y) | Swapping x and y results in a 90 degrees rotation. | 0
MR4 | Scale Invariance | atan2(k*x, k*y) = atan2(x, y) for any k ≠ 0 | Scaling both x and y by the same non-zero factor k does not change the output. | 1
MR5 | Quadrant Symmetry | atan2(-x, -y) = atan2(x, y) - π | Negating both x and y shifts the output by π radians. | 0
MR6 | Double Flip | atan2(-x, -y) = atan2(x, y) - π | Flipping the sign of both x and y shifts the output by π radians. | 1
MR7 | Permutation Invariance | atan2(x, y) = atan2(y, x) - π/2 | Permuting x and y results in a 90 degrees rotation and a shift by π/2. | 0
MR8 | Zero Shift X | atan2(x+0, y) = atan2(x, y) | Adding zero to x does not change the output. | 1
MR9 | Zero Shift Y | atan2(x, y+0) = atan2(x, y) | Adding zero to y does not change the output. | 1
MR10 | Negative Scale | atan2(-k*x, -k*y) = atan2(x, y) for any k ≠ 0 | Scaling both x and y by the same negative non-zero factor k does not change the output. | 1
Self Validation sum: 7

T3: chatGPT inferring 10 MRs, function description provided through SRS pdf

MR# | MR Title | MR Formula | MR Text | Self Validation
MR1 | Sign Preservation | atan2(y, x) = atan2(-y, -x) | If both x and y are negated, the result of atan2 should be the same. | 1
MR2 | Pi Complement | atan2(y, x) = -atan2(-y, x) | Negating y should result in the negation of the output, reflecting the point across the origin. | 1
MR3 | Zero Behavior | atan2(0, x) = 0 for x > 0 | If y is 0 and x is positive, atan2 should return 0. | 1
MR4 | Pi Behavior | atan2(0, x) = π for x < 0 | If y is 0 and x is negative, atan2 should return π. | 1
MR5 | Half Pi Behavior | atan2(y, 0) = π/2 for y > 0 | If y is positive and x is 0, atan2 should return π/2. | 1
MR6 | Negative Half Pi Behavior | atan2(y, 0) = -π/2 for y < 0 | If y is negative and x is 0, atan2 should return -π/2. | 1
MR7 | Infinity Behavior | atan2(y, -Inf) = ±π for finite y ≠ 0 | If x is negative infinity and y is finite and non-zero, atan2 should return ±π. | 1
MR8 | NaN Propagation | atan2(NaN, x) = NaN | If any argument is NaN, atan2 should return NaN. | 1
MR9 | Symmetry | atan2(y, x) = -atan2(-y, -x) | The function should be symmetric about the origin. | 1
MR10 | Scaling Invariance | atan2(k*y, k*x) = atan2(y, x) for k ≠ 0 | Scaling both arguments by the same non-zero factor should not change the output of atan2. | 1
Self Validation sum: 10


• MR same/similar : this column compares the MRs generated in a given documentation
condition with the MRs in the other two conditions. We consider two MRs the same or
similar when their MR Formula fields in the previous table correspond to each other. To
make it easier to identify these MRs, we shade the background of the corresponding fields
in the previous table with different shades of green.

• MR valid and not in T1 , T2 or T3 : these columns collect the valid MRs present in
this sub-table but not in the other two sub-tables.
Table 4.2: MR statistics for the atan2 function

T1: chatGPT inferring 10 MRs without access to SRS pdf
MR# | Self Validation | My Validation | My Validation Reasoning | MR Valid & Unique | Invalid MR, Easy Fix | Invalid MR, Contradicting SRS | Self + My Validation (00 01 10 11) | MR same/similar in (T2 T3) | MR valid and not in (T2 T3)
MR1 1 1 1 0 - 0 0 0 1 0 1 1 0
MR2 1 1 1 0 - 0 0 0 1 0 0 1 1
MR3 1 1 1 0 - 0 0 0 1 1 0 0 1
MR4 1 1 1 0 - 0 0 0 1 1 0 0 1
MR5 1 1 1 0 - 0 0 0 1 1 0 0 1
MR6 0 0 not true 0 0 - 1 0 0 0 0 0 0 0
MR7 1 1 1 0 - 0 0 0 1 1 0 0 1
MR8 1 0 wrong repeat of MR3 & MR4 0 0 - 0 0 1 0 0 0 0 0
MR9 1 0 identity, not true in our case 0 0 - 0 0 1 0 0 0 0 0
MR10 1 0 special case of MR3 0 0 - 0 0 1 0 0 0 0 0
9 6 6 0 - 1 0 3 6 4 1 2 5

T2: chatGPT inferring 10 MRs with access to the SRS pdf
MR# | Self Validation | My Validation | My Validation Reasoning | MR Valid & Unique | Invalid MR, Easy Fix | Invalid MR, Contradicting SRS | Self + My Validation (00 01 10 11) | MR same/similar in (T1 T3) | MR valid and not in (T1 T3)
MR1 1 0 not true, x & y swapped 0 1 0 0 0 1 0 0 0 0 0
MR2 1 1 1 0 0 0 0 0 1 0 0 1 1
MR3 0 1 1 0 0 0 1 0 0 0 0 1 1
MR4 1 0 valid for k>0 only 0 1 0 0 0 1 0 1 0 0 0
MR5 0 1 1 0 0 0 1 0 0 1 1 0 0
MR6 1 1 repeating MR5 0 0 0 0 0 0 1 1 1 0 0
MR7 0 0 not true 0 0 0 1 0 0 0 1 0 0 0
MR8 1 0 not relevant 0 0 0 0 0 1 0 0 0 0 0
MR9 1 0 not relevant 0 0 0 0 0 1 0 0 0 0 0
MR10 1 0 like MR4, 0 0 0 0 0 1 0 1 0 0 0
7 4 3 2 0 1 2 5 2 5 2 2 2

T3: chatGPT inferring 10 MRs, function description provided through SRS pdf
MR# | Self Validation | My Validation | My Validation Reasoning | MR Valid & Unique | Invalid MR, Easy Fix | Invalid MR, Contradicting SRS | Self + My Validation (00 01 10 11) | MR same/similar in (T1 T2) | MR valid and not in (T1 T2)
MR1 1 0 not true (missing ± π) 0 1 0 0 0 1 0 0 1 0 0
MR2 1 1 1 0 0 0 0 0 1 1 0 0 1
MR3 1 0 SRS distinguishes ±0 cases 0 0 1 0 0 1 0 0 0 0 0
MR4 1 0 SRS distinguishes ±0 cases 0 0 1 0 0 1 0 0 0 0 0
MR5 1 1 1 0 0 0 0 0 1 0 0 1 1
MR6 1 1 1 0 0 0 0 0 1 0 0 1 1
MR7 1 0 SRS distinguishes the +π & -π 0 1 1 0 0 1 0 0 0 0 0
MR8 1 1 1 0 0 0 0 0 1 0 0 1 1
MR9 1 0 MR1 repeated & not true 0 0 0 0 0 1 0 0 1 0 0
MR10 1 0 valid for k>0 only 0 1 0 0 0 1 0 0 0 0 0
10 4 4 3 3 0 0 6 4 1 2 3 4

The results and the statistics for the other functions examined in the MLFS experiments ( asin ,
exp , fmin , fmod , hypot , and pow ) are listed in Appendix A.

MLFS Document Usage

An interesting aspect is the usage of the SRS document in the MR generation process. We
designed our experiments in three different settings: without access to the SRS document, with
access to the SRS document, and with access to the SRS document combined with explicit
prompting to use that document. Unfortunately, we have no direct feedback on whether the
SRS has been used or not, nor on which parts of the documents have been used. However, we
can infer some information from token usage. Obviously, when there is no document attached
to the query, the model answers our request based only on the prompt. In this case, for the
atan2 function, we get the following token usage:

runtime : 23 seconds
tokens in / out : 488/565 , price : 2.18 cents

When attaching the SRS document without specific prompting, the token usage is very similar:

runtime : 22 seconds
tokens in / out : 614/529 , price : 2.20 cents

We can see a significant difference when the model is explicitly asked to use the SRS document:

runtime : 16 seconds
tokens in / out : 5968/510 , price : 7.50 cents

As previously stated, the MLFS SRS document converts to ~38,200 tokens. The model uses
5,968 input tokens in this case, a significant portion of the document, but there is no direct
feedback on which parts of the document have been used. A way to retrieve these figures is
sketched below.
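These figures can be read from the usage field of a completed run; the sketch below assumes the gpt-4-turbo list prices implied by the reported costs above ($10 per million input tokens, $30 per million output tokens):

# run is the completed run object, e.g. as returned by
# client.beta.threads.runs.create_and_poll(...)
tokens_in = run.usage.prompt_tokens
tokens_out = run.usage.completion_tokens
# Assumed gpt-4-turbo pricing: $10 / $30 per 1M input / output tokens.
price_cents = (tokens_in * 10 + tokens_out * 30) / 1_000_000 * 100
print(f'tokens in / out : {tokens_in}/{tokens_out}, price : {price_cents:.2f} cents')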

4.4.2 PUS Study Results


The model refused to generate MRs for the PUS framework without access to the SRS documents;
therefore, all our experiments were conducted with at least one of the SRS documents attached.

TC/TM Mapping with PDF, PPTX and JSON Documents

We started our experiments by asking the model to generate the complete TC/TM mapping
for a given Service Type (ST). We provided the model with all three documents (PDF, PPTX,
JSON) simultaneously, and requested the TC/TM mapping for the ST[03] HouseKeeping (HK).
The results are presented in Table 4.3.

We continued our experiments by attaching individually the PDF, PPTX and JSON documents
and requesting the TC/TM mapping for the ST[03] Housekeeping. The results are presented in
Table 4.4.

Looking at these initial results, we can conclude that the spacecraft behaves like a state
machine: we can send a TeleCommand (TC) to the spacecraft, which sets the internal

Table 4.3: Results of TC/TM mapping of ST[03] Housekeeping, when all three
documents (PDF, PPTX and JSON) attached

pdf, pptx, json


TC TC Description TM TM Description
TC[3,1] create a housekeeping parameter report structure
TC[3,2] create a diagnostic parameter report structure
TC[3,3] delete housekeeping parameter report structures
TC[3,4] delete diagnostic parameter report structures
TC[3,5] enable the periodic generation of housekeeping parameter reports
TC[3,6] disable the periodic generation of housekeeping parameter reports
TC[3,7] enable the periodic generation of diagnostic parameter reports
TC[3,8] disable the periodic generation of diagnostic parameter reports
TC[3,9] report housekeeping parameter report structures TM[3,10] housekeeping parameter report structure report
TC[3,11] report diagnostic parameter report structures TM[3,12] diagnostic parameter report structure report
TM[3,25] housekeeping parameter report
TM[3,26] diagnostic parameter report
TC[3,27] generate a one shot report for housekeeping parameter report structures
TC[3,28] generate a one shot report for diagnostic parameter report structures
TC[3,29] append parameters to a housekeeping parameter report structure
TC[3,30] append parameters to a diagnostic parameter report structure
TC[3,31] modify the collection interval of housekeeping parameter report structures
TC[3,32] modify the collection interval of diagnostic parameter report structures
TC[3,33] report the periodic generation properties of housekeeping parameter report structures TM[3,35] housekeeping parameter report periodic generation properties report
TC[3,34] report the periodic generation properties of diagnostic parameter report structures TM[3,36] diagnostic parameter report periodic generation properties report
TC[3,37] apply parameter functional reporting configurations
TC[3,38] create a parameter functional reporting definition
TC[3,39] delete parameter functional reporting definitions
TC[3,40] report parameter functional reporting definitions TM[3,41] parameter functional reporting definition report
TC[3,42] add parameter report definitions to a parameter functional reporting definition
TC[3,43] remove parameter report definitions from a parameter functional reporting definition
TC[3,44] modify the periodic generation properties of parameter report definitions of a parameter functional reporting definition

state, and some of the TCs initiate a report about the internal state as TeleMetry (TM). We
can see that the model always pairs the TC or TM designation with the correct TC or TM
description, but it does not always find all the TC/TMs within a given ST.

We made a comparison table of the TC/TM mapping results for the ST[03] Housekeeping for
different document types. The results are presented in Table 4.5. The missing TC/TM fields
are colored in light orange.

Establishing an accurate ground truth for TC/TM field mapping is crucial for identifying
missing or incorrect entries. Ideally, this ground truth would be provided by ESA. However,
due to the lack of access to official ESA data and the absence of this information in the available
training materials, we developed an alternative approach to create our own reference standard.

Our methodology leverages the LLM’s responses across different document types. We reasoned
that if the model consistently identifies the same TC/TM across two or more document types,
the probability of this being a hallucination is negligible. This consistency across varied inputs
serves as a strong indicator of validity.

Following this logic, we focused our verification efforts on the discrepancies in TC/TM mapping
results between different document types. By cross-referencing these differences with the
available documentation, we were able to quickly resolve uncertainties and compile our own

Table 4.4: Results of TC/TM mapping of ST[03] Housekeeping, when individually attaching the PDF, PPTX and JSON documents

pdf
TC TC Description TM TM Description
TC[3,1] create a housekeeping parameter report structure
TC[3,2] create a diagnostic parameter report structure
TC[3,3] delete housekeeping parameter report structures
TC[3,4] delete diagnostic parameter report structures
TC[3,5] enable the periodic generation of housekeeping parameter reports
TC[3,6] disable the periodic generation of housekeeping parameter reports
TC[3,7] enable the periodic generation of diagnostic parameter reports
TC[3,8] disable the periodic generation of diagnostic parameter reports
TC[3,9] report housekeeping parameter report structures TM[3,10] housekeeping parameter report structure report
TC[3,11] report diagnostic parameter report structures TM[3,12] diagnostic parameter report structure report
TC[3,27] generate a one shot report for housekeeping parameter report structures
TC[3,28] generate a one shot report for diagnostic parameter report structures
TC[3,29] append parameters to a housekeeping parameter report structure
TC[3,30] append parameters to a diagnostic parameter report structure
TC[3,31] modify the collection interval of housekeeping parameter report structures
TC[3,32] modify the collection interval of diagnostic parameter report structures
TC[3,33] report the periodic generation properties of housekeeping parameter report structures TM[3,35] housekeeping parameter report periodic generation properties report
TC[3,34] report the periodic generation properties of diagnostic parameter report structures TM[3,36] diagnostic parameter report periodic generation properties report
TC[3,37] apply parameter functional reporting configurations
TC[3,38] create a parameter functional reporting definition
TC[3,39] delete parameter functional reporting definitions
TC[3,40] report parameter functional reporting definitions TM[3,41] parameter functional reporting definition report
TC[3,42] add parameter report definitions to a parameter functional reporting definition
TC[3,43] remove parameter report definitions from a parameter functional reporting definition
TC[3,44] modify the periodic generation properties of parameter report definitions of a parameter functional reporting definition

pptx
TC TC Description TM TM Description
TC[3,1] Create HKTM Report
TC[3,2] Create Diagnostic Report
TC[3,3] Delete HKTM Report
TC[3,4] Delete Diagnostic Report
TC[3,5] Enable HKTM Report
TC[3,6] Disable HKTM Report
TC[3,7] Enable Diagnostic Report
TC[3,8] Disable Diagnostic Report
TC[3,9] Report HKTM structures TM[3,10] Housekeeping parameter report structure report
TC[3,27] Generate one-shot HKTM Report TM[3,25] Housekeeping parameter report
TC[3,28] Generate one-shot Diagnostic Report TM[3,26] Diagnostic parameter report
TC[3,29] Append parameters to a housekeeping parameter report structure
TC[3,31] Modify the collection interval of housekeeping parameter report structures
TC[3,33] Report the periodic generation properties of housekeeping parameter report structures TM[3,35] Housekeeping parameter report periodic generation properties report
TC[3,30] Append parameters to a diagnostic parameter report structure
TC[3,32] Modify the collection interval of diagnostic parameter report structures
TC[3,34] Report the periodic generation properties of diagnostic parameter report structures TM[3,36] Diagnostic parameter report periodic generation properties report
TC[3,37] Apply parameter functional reporting configurations
TC[3,38] Create a parameter functional reporting definition
TC[3,39] Delete parameter functional reporting definitions
TC[3,40] Report parameter functional reporting definitions TM[3,41] Parameter functional reporting definition report
TC[3,42] Add parameter report definitions to a parameter functional reporting definition
TC[3,44] Modify periodic generation status of report in configuration

json
TC TC Description TM TM Description
TC[3,1] Create a housekeeping parameter report structure
TC[3,3] Delete housekeeping parameter report structures
TC[3,5] Enable the periodic generation of housekeeping parameter reports
TC[3,6] Disable the periodic generation of housekeeping parameter reports
TC[3,9] Report housekeeping parameter report structures TM[3,10] Housekeeping parameter report structure report
TC[3,27] Generate a one shot report for housekeeping parameter report structures TM[3,25] Housekeeping parameter report
TC[3,29] Append parameters to a housekeeping parameter report structure
TC[3,31] Modify the collection interval of housekeeping parameter report structures
TC[3,33] Report the periodic generation properties of housekeeping parameter report structures TM[3,35] Housekeeping parameter report periodic generation properties report

Table 4.5: Comparing the TC/TM mapping results of ST[03] Housekeeping for
different document types

pdf, pptx, json pdf pptx json


TC TM TC TM TC TM TC TM
TC[3,1] TC[3,1] TC[3,1] TC[3,1]
TC[3,2] TC[3,2] TC[3,2]
TC[3,3] TC[3,3] TC[3,3] TC[3,3]
TC[3,4] TC[3,4] TC[3,4]
TC[3,5] TC[3,5] TC[3,5] TC[3,5]
TC[3,6] TC[3,6] TC[3,6] TC[3,6]
TC[3,7] TC[3,7] TC[3,7]
TC[3,8] TC[3,8] TC[3,8]
TC[3,9] TM[3,10] TC[3,9] TM[3,10] TC[3,9] TM[3,10] TC[3,9] TM[3,10]
TC[3,11] TM[3,12] TC[3,11] TM[3,12]
TM[3,25]
TM[3,26]
TC[3,27] TC[3,27] TC[3,27] TM[3,25] TC[3,27] TM[3,25]
TC[3,28] TC[3,28] TC[3,28] TM[3,26]
TC[3,29] TC[3,29] TC[3,29] TC[3,29]
TC[3,30] TC[3,30] TC[3,30]
TC[3,31] TC[3,31] TC[3,31] TC[3,31]
TC[3,32] TC[3,32] TC[3,32]
TC[3,33] TM[3,35] TC[3,33] TM[3,35] TC[3,33] TM[3,35] TC[3,33] TM[3,35]
TC[3,34] TM[3,36] TC[3,34] TM[3,36] TC[3,34] TM[3,36]
TC[3,37] TC[3,37] TC[3,37]
TC[3,38] TC[3,38] TC[3,38]
TC[3,39] TC[3,39] TC[3,39]
TC[3,40] TM[3,41] TC[3,40] TM[3,41] TC[3,40] TM[3,41]
TC[3,42] TC[3,42] TC[3,42]
TC[3,43] TC[3,43]
TC[3,44] TC[3,44]
2 4 6 22 Mistakes
3.7% 7.4% 11.1% 40.7% Mistakes [%]

ground truth. This process, while not official, provided a pragmatic and reasonably reliable
basis for our analysis.
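A minimal sketch of this consensus rule (pdf_pairs, pptx_pairs and json_pairs are hypothetical sets of (TC, TM) tuples parsed from the per-document answers):

from collections import Counter

def consensus_ground_truth(mappings):
    # mappings: document type -> set of (TC, TM) pairs reported by the model
    counts = Counter(pair for pairs in mappings.values() for pair in pairs)
    accepted = {pair for pair, n in counts.items() if n >= 2}   # consensus
    to_review = {pair for pair, n in counts.items() if n == 1}  # checked by hand
    return accepted, to_review

accepted, to_review = consensus_ground_truth(
    {'pdf': pdf_pairs, 'pptx': pptx_pairs, 'json': json_pairs})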

We can observe that the model is able to find most of the TC/TMs when all three documents
are attached. In this case the model correctly identifies that TM[3,25] (housekeeping parameter
report) and TM[3,26] (diagnostic parameter report) are not always an immediate response to a
TC: they can be enabled by other TCs (TC[3,5], enable HKTM report, or TC[3,7], enable
diagnostic report) and are then generated periodically.

We can see that TM[3,25] and TM[3,26] can also be an immediate response to TC[3,27]
(generate one-shot HKTM Report) and TC[3,28] (generate a one shot Diagnostic Report); this
property is fully observed when using the PPTX document individually and is missed when
the PDF document is attached.

We counted the number of missing TC/TM fields for each document type, and we computed
the percentage of missing TC/TM fields. The results show that the JSON document is missing
the most TC/TM fields, with 22 missing fields (40.7%), followed by the PPTX document with
6 missing fields (11.1%), and the PDF document with 4 missing fields (7.4%). When all three
documents are attached, only 2 TC/TM fields are missing (3.7%).

Another service type we looked at was ST[17] Test. This ST has been included because of
its simplicity and its similarity to the ping command in computer networks. The TC/TM
mapping is presented in Table 4.6. Here we have a perfect TC/TM mapping for all the document
attachment combinations. We can see that the TC[17,1] (Perform an are-you-alive connection
test) is followed by TM[17,2] (Are-you-alive connection test report) and TC[17,3] (Perform an
on-board connection test) is followed by TM[17,4] (On-board connection test report).

Table 4.6: Results of TC/TM mapping of ST[17] Test

pdf, pptx, json


TC TC Description TM TM Description
TC[17,1] Perform an are-you-alive connection test TM[17,2] Are-you-alive connection test report
TC[17,3] Perform an on-board connection test TM[17,4] On-board connection test report

pdf
TC TC Description TM TM Description
TC[17,1] Perform an are-you-alive connection test TM[17,2] Are-you-alive connection test report
TC[17,3] Perform an on-board connection test TM[17,4] On-board connection test report

pptx
TC TC Description TM TM Description
TC[17,1] Perform an are-you-alive connection test TM[17,2] Are-you-alive connection test report
TC[17,3] Perform an on-board connection test TM[17,4] On-board connection test report

json
TC TC Description TM TM Description
TC[17,1] Perform an are-you-alive connection test TM[17,2] Are-you-alive connection test report
TC[17,3] Perform an on-board connection test TM[17,4] On-board connection test report

We included the ST[01] Request Verification service type in our experiments because it differs
from all the other STs: it has no TCs of its own, but all the other STs use it to verify their
TCs.

ST[01] provides the capability to process the routing, acceptance and execution of each request
independently of its type, and it generates reports on the execution status of a request and on
failure conditions. The TC/TM mapping for the ST[01] Request Verification is presented in Table 4.7.

We can see that the model correctly identifies that there are no TCs in the ST[01] Request
Verification when all three documents or only the PDF document are attached.

When the PPTX document is attached, the model detects TC[8,1] and TC[200,1]; these are
not errors, but example TCs given in the presentation.

When the only attachment is the requirements JSON file, the model gets confused by the
absence of TCs in ST[01] and it picks random TCs and their corresponding TMs from other
STs.

The comparison of the TC/TM mapping results for the ST[01] Request Verification for different
document types is presented in Table 4.8. The mistaken or missing TC/TM fields are colored
in light orange.

Further results of TC/TM mappings for ST[22] Position-Based Scheduling and ST[23] File
Management are included in Appendix B.

Table 4.7: Results of TC/TM mapping of ST[01] Request Verification

pdf, pptx, json


TC TC Description TM TM Description
TM[1,1] successful acceptance verification report
TM[1,2] failed acceptance verification report
TM[1,3] successful start of execution verification report
TM[1,4] failed start of execution verification report
TM[1,5] successful progress of execution verification report
TM[1,6] failed progress of execution verification report
TM[1,7] successful completion of execution verification report
TM[1,8] failed completion of execution verification report
TM[1,10] failed routing verification report

pdf
TC TC Description TM TM Description
TM[1,1] successful acceptance verification report
TM[1,2] failed acceptance verification report
TM[1,3] successful start of execution verification report
TM[1,4] failed start of execution verification report
TM[1,5] successful progress of execution verification report
TM[1,6] failed progress of execution verification report
TM[1,7] successful completion of execution verification report
TM[1,8] failed completion of execution verification report
TM[1,10] failed routing verification report

pptx
TC TC Description TM TM Description
TC[8,1] Routing TM[1,10] Failed routing verification report
TC[8,1] Acceptance TM[1,1] Successful acceptance verification report
TC[8,1] Acceptance TM[1,2] Failed acceptance verification report
TC[200,1] Start of Execution TM[1,3] Successful start of execution verification report
TC[200,1] Start of Execution TM[1,4] Failed start of execution verification report
TC[200,1] Progress & Completion TM[1,5] Successful progress of execution verification report
TC[200,1] Progress & Completion TM[1,6] Failed progress of execution verification report
TC[200,1] Progress & Completion TM[1,7] Successful completion of execution verification report
TC[200,1] Progress & Completion TM[1,8] Failed completion of execution verification report

json
TC TC Description TM TM Description
TC[6,9] check raw memory data TM[6,10] checked raw memory data report
TC[21,13] abort all request sequences and report TM[21,14] aborted request sequence report
TC[21,6] report the execution status of each request sequence TM[21,7] request sequence execution status report
TC[21,9] checksum a request sequence TM[21,10] request sequence checksum report
TC[21,11] report the content of a request sequence TM[21,12] request sequence content report
TC[4,8] report the parameter statistics definitions TM[4,9] parameter statistics definition report
TC[20,6] report parameter definitions TM[20,7] parameter definition report
TC[17,3] perform an on-board connection test TM[17,4] on-board connection test report

Table 4.8: Comparing the TC/TM mapping results of ST[01] Request Verifica-
tion for different document types

pdf, pptx, json pdf pptx json


TC TM TC TM TC TM TC TM
TM[1,1] TM[1,1] TC[8,1] TM[1,1] TC[6,9] TM[6,10]
TM[1,2] TM[1,2] TC[8,1] TM[1,2] TC[21,13] TM[21,14]
TM[1,3] TM[1,3] TC[200,1] TM[1,3] TC[21,6] TM[21,7]
TM[1,4] TM[1,4] TC[200,1] TM[1,4] TC[21,9] TM[21,10]
TM[1,5] TM[1,5] TC[200,1] TM[1,5] TC[21,11] TM[21,12]
TM[1,6] TM[1,6] TC[200,1] TM[1,6] TC[4,8] TM[4,9]
TM[1,7] TM[1,7] TC[200,1] TM[1,7] TC[20,6] TM[20,7]
TM[1,8] TM[1,8] TC[200,1] TM[1,8] TC[17,3] TM[17,4]
TM[1,10] TM[1,10] TC[8,1] TM[1,10]
0 0 0 18 Mistakes
0% 0% 0% 100% Mistakes [%]

MR Coverage with PDF, PPTX, and JSON Documents

After the TC/TM mapping, we performed the MR coverage analysis. We prompted the model
to generate MRs such that all the TCs and TMs are included at least once in the MRs. The
results of the generated MRs for ST[03] Housekeeping when all three documents are attached
are presented in Table B.7.

The query is then repeated while individually attaching each of the three documents. The
results are not included in this document because of space constraints, but they are available
in the repository6 . The results show that the model is able to generate MRs that include all
the TCs and TMs at least once.

We have not included a request for self-validation as we did for the MLFS experiments, because
in the initial runs we observed that the model always considered all generated MRs valid, so
the self-validation query would not bring any additional information.

The metrics developed for the MR coverage analysis are presented for ST[03] Housekeeping in
Table 4.9.

We have 4 sub-tables:

• T0 - when all three documents are attached

• T1 - when only the PDF document is attached

• T2 - when only the PPTX document is attached

• T3 - when only the JSON document is attached

The columns of the table are as follows:

• MR# - the number of the MR

• Correct - whether the MR tests are correct or not, i.e., the description can be considered
a valid test case but not necessarily a valid MR

• Valid MR - whether the generated statement is considered a valid MR or not; as discussed
with the supervisors, we count an MR as valid only if the Expected Relation column
contains a relation between the expected test outputs

• Same or Similar - the MRs that are similar or the same in the other three sub-tables

• Only T0/T1/T2/T3 - the MRs that are present only in the respective sub-table

• Valid MR, Only T0/T1/T2/T3 - the valid MRs that are present only in the respective
sub-table

We can observe that most of the proposed tests are correct: the model is able to infer the
consequences of the TCs and the expected TMs and relations.
6 Thesis repository: https://gitlab.com/uniluxembourg/snt/svv/msc/nemet-attila-istvan-msc/thesis_work

We marked statements as invalid only in the ST[01] Request Verification case, where, with
the JSON document attached, the model generated MRs with TCs and TMs belonging to
STs different from ST[01], as shown in Table B.9. Technically, the initial test, the follow-up
test and the expected relation are correct, but we marked these MRs as incorrect because the
model did not follow the instructions to generate MRs belonging to the specified ST[01]. For
similar reasons, we have two more incorrect MRs when the PPTX document was attached and
the model generated MRs with TCs and TMs from ST[17] instead of ST[01], also shown in
Table B.9.

Table 4.9: Metrics for ST[03] Housekeeping MRs

T0 - pdf, pptx, json
MR# | Correct | Valid MR | Same or Similar (T1 T2 T3) | Only T0 | Valid MR, Only T0
MR1 1 0 MR3 MR1 0 0
MR2 1 0 MR8 0 0
MR3 1 1 MR2 MR2 MR3 0 0
MR4 1 1 MR9 MR3 0 0
MR5 1 0 1 0
MR6 1 0 1 0
MR7 1 0 MR5 0 0
MR8 1 0 1 0
MR9 1 1 MR6 0 0
MR10 1 1 1 1
MR11 1 0 1 0
MR12 1 0 1 0
MR13 1 0 MR14 0 0
MR14 1 0 MR15 MR14 0 0
MR15 1 0 MR16 0 0
MR16 1 1 1 1
16 5 7 3 4 7 2

T1 - pdf
MR# | Correct | Valid MR | Same or Similar (T0 T2 T3) | Only T1 | Valid MR, Only T1
MR1 1 1 MR1 0 0
MR2 1 1 MR3 MR2 MR3 0 0
MR3 1 0 MR1 MR1 0 0
MR4 1 1 1 1
MR5 1 1 1 1
MR6 1 0 1 0
MR7 1 0 1 0
MR8 1 0 MR2 0 0
MR9 1 1 MR4 MR3 0 0
MR10 1 1 1 1
MR11 1 1 1 1
MR12 1 0 1 0
MR13 1 0 1 0
MR14 1 0 MR13 0 0
MR15 1 1 MR14 MR14 0 0
MR16 1 1 MR15 0 0
MR17 1 1 1 1
17 10 7 4 2 9 5

T2 - pptx
MR# | Correct | Valid MR | Same or Similar (T0 T1 T3) | Only T2 | Valid MR, Only T2
MR1 1 1 MR1 0 0
MR2 1 1 MR3 MR2 MR3 0 0
MR3 1 1 MR4 MR9 0 0
MR4 1 0 1 0
MR5 1 0 MR4 0 0
MR6 1 0 1 0
MR7 1 0 1 0
MR8 1 0 1 0
MR9 1 0 MR7 0 0
MR10 1 0 1 0
MR11 1 0 1 0
MR12 1 0 1 0
MR13 1 0 1 0
MR14 1 1 MR14 MR15 0 0
MR15 1 0 1 0
MR16 1 0 1 0
MR17 1 1 1 1
17 5 3 4 3 11 1

T3 - json
MR# | Correct | Valid MR | Same or Similar (T0 T1 T2) | Only T3 | Valid MR, Only T3
MR1 1 0 MR1 MR3 0 0
MR2 1 1 1 1
MR3 1 1 MR3 MR2 MR2 0 0
MR4 1 0 MR5 0 0
MR5 1 1 MR7 0 0
MR6 1 1 MR9 0 0
MR7 1 0 MR9 0 0
7 4 4 2 3 1 1

TC/TM Mapping with 5xPDF vs PPTX

To address RQ3, which examines whether different document formats lead to varied MRs, we
employ a two-step comparative analysis:

1. PDF Document Baseline:

• We conduct the TC/TM mapping and MR coverage analysis five times with the PDF
document attached.

• This repetition establishes a baseline, revealing the inherent variability in results


when using the same document type across multiple queries.

2. PPTX Document Comparison:

• We then perform a single analysis using the PPTX document.

• The results from this PPTX-based query are compared against the baseline variability
observed in the PDF experiments.

This approach allows us to distinguish between the inherent variability in the model’s output
and the impact of different document formats on the generated MRs.

The TC/TM mapping results for ST[03] Housekeeping are presented in Table 4.10.

Table 4.10: Comparing the TC/TM mapping results of ST[03] Housekeeping, five times PDF and once PPTX attached

T0 - pdf T1 - pdf T2 - pdf T3 - pdf T4 - pdf T5 - pptx


TC TM TC TM TC TM TC TM TC TM TC TM
TC[3,1] TC[3,1] TC[3,1] TC[3,1] TC[3,1] TC[3,1]
TC[3,2] TC[3,2] TC[3,2] TC[3,2] TC[3,2] TC[3,2]
TC[3,3] TC[3,3] TC[3,3] TC[3,3] TC[3,3] TC[3,3]
TC[3,4] TC[3,4] TC[3,4] TC[3,4] TC[3,4] TC[3,4]
TC[3,5] TC[3,5] TC[3,5] TC[3,5] TC[3,5] TC[3,5]
TC[3,6] TC[3,6] TC[3,6] TC[3,6] TC[3,6] TC[3,6]
TC[3,7] TC[3,7] TC[3,7] TC[3,7] TC[3,7] TC[3,7]
TC[3,8] TC[3,8] TC[3,8] TC[3,8] TC[3,8] TC[3,8]
TC[3,9] TM[3,10] TC[3,9] TM[3,10] TC[3,9] TM[3,10] TC[3,9] TM[3,10] TC[3,9] TM[3,10] TC[3,9] TM[3,10]
TC[3,11] TM[3,12] TC[3,11] TM[3,12] TC[3,11] TM[3,12] TC[3,11] TM[3,12] TC[3,11] TM[3,12]
TM[3,25] TM[3,25] TM[3,25] TM[3,25]
TM[3,26] TM[3,26] TM[3,26] TM[3,26]
TC[3,27] TC[3,27] TM[3,25] TC[3,27] TC[3,27] TC[3,27] TC[3,27] TM[3,25]
TC[3,28] TC[3,28] TC[3,28] TC[3,28] TC[3,28] TM[3,26]
TC[3,29] TC[3,29] TC[3,29] TC[3,29] TC[3,29] TC[3,29]
TC[3,30] TC[3,30] TC[3,30] TC[3,30] TC[3,30] TC[3,30] TM[3,35]
TC[3,31] TC[3,31] TC[3,31] TC[3,31] TC[3,31] TC[3,31]
TC[3,32] TC[3,32] TC[3,32] TC[3,32] TC[3,32] TC[3,32]
TC[3,33] TM[3,35] TC[3,33] TM[3,35] TC[3,33] TM[3,35] TC[3,33] TM[3,35] TC[3,33] TM[3,35] TC[3,33]
TC[3,34] TM[3,36] TC[3,34] TM[3,36] TC[3,34] TM[3,36] TC[3,34] TM[3,36] TC[3,34] TM[3,36] TC[3,34] TM[3,36]
TC[3,37] TC[3,37] TC[3,37] TC[3,37] TC[3,37] TC[3,37]
TC[3,38] TC[3,38] TC[3,38] TC[3,38] TC[3,38] TC[3,38]
TC[3,39] TC[3,39] TC[3,39] TC[3,39] TC[3,39] TC[3,39]
TC[3,40] TM[3,41] TC[3,40] TM[3,41] TC[3,40] TM[3,41] TC[3,40] TM[3,41] TC[3,40] TM[3,41] TC[3,40] TM[3,41]
TC[3,42] TC[3,42] TC[3,42] TC[3,42] TC[3,42] TC[3,42]
TC[3,43] TC[3,43] TC[3,43] TC[3,43] TC[3,43]
TC[3,44] TC[3,44] TC[3,44] TC[3,44] TC[3,44] TC[3,44]
2 4 2 2 2 7 Mistakes
4.0% 8.0% 4.0% 4.0% 4.0% 14.0% Mistakes [%]
4.8% 14.0% PDF Avg [%]

Adhering to previous practice, the missing or erroneous TC/TM fields are marked with a
light-orange background. We can observe again that the OpenAI LLM experiments are not
perfectly reproducible, even when setting the temperature to 0.

We repeated the TC/TM mapping query five times with the PDF document attached, using
exactly the same inputs (sub-tables T0, T1, T2, T3, and T4). Four of the results are identical
(T0, T2, T3 and T4), but one of them (T1) misses TC[3,28] and the periodically generated
TMs (TM[3,25] and TM[3,26] when not directly preceded by a TC), while it is the only PDF
experiment correctly detecting that TM[3,25] can be a one-shot answer to TC[3,27].

MR Coverage Analysis with 5xPDF vs PPTX

The MRs generated by this experiment are too voluminous to include in this document, but
they are available in the repository of this thesis (Uniluxembourg/SnT/SVV/NEMET-Attila/
Thesis-Work · GitLab 2024).

We discuss here in more detail the method used for this analysis. The metrics developed
for this analysis are presented in Table 4.11.

We have 6 sub-tables:

• T0, T1, T2, T3, T4 - when the PDF document is attached.

• T5 - when the PPTX document is attached.

The columns of the table are:

• MR# - the number of the MR.

• Correct - whether the MR tests are correct or not, i.e., the description can be considered
a valid test case but not necessarily a valid MR.

• Valid MR - whether the MR is a valid MR or not; as agreed with the supervisor, we
consider an MR valid only if the Expected Relation field contains a relation between the
expected test outputs

• Same or Similar - the MRs that are same or similar in the other sub-tables.

In the table we compute three indexes:

• Jaccard index for all MRs

• Jaccard index for valid MRs

• Proportion New for valid MRs

The Jaccard index is defined as the size of the intersection divided by the size of the union of
the same or similar MRs between two sub-tables; it helps in understanding how deterministic
the LLM executions are.

We define the Proportion New index for the valid MRs as the number of valid MRs in a
sub-table that are not found in another sub-table, divided by the number of valid MRs in that
sub-table. This captures the

Table 4.11: Collection of Jaccard and Proportion New indexes for ST[03]
Housekeeping
T0 - pdf Same or Similar T1 - pdf Same or Similar
MR# Correct Valid MR T0 T1 T2 T3 T4 T5 MR# Correct Valid MR T0 T1 T2 T3 T4 T5
MR1 1 1 MR1 MR2 MR1 MR1 1 1 MR1 MR2 MR1
MR2 1 1 MR2 MR3 MR3 MR2 MR2 1 1 MR2 MR3 MR3 MR2
MR3 1 1 MR3 MR1 MR1 MR3 1 1 MR3 MR1 MR1
MR4 1 1 MR9 MR4 1 0 MR7 MR7
MR5 1 1 MR11 MR5 1 1
MR6 1 0 MR13 MR6 1 0 MR10 MR2 MR4
MR7 1 0 MR4 MR7 MR7 1 1 MR9 MR4 MR6 MR3
MR8 1 1 MR5 MR8 1 0 MR14 MR8
MR9 1 1 MR7 MR4 MR6 MR3 MR9 1 1
MR10 1 0 MR6 MR2 MR4 MR10 1 0
MR11 1 1 MR10 MR11 1 0
MR12 1 1 MR12 MR12 1 0 MR15 MR9
MR13 1 0 MR14 MR13 1 0 MR16 MR12
MR14 1 0 MR8 MR8 MR14 1 0 MR15 MR14
MR15 1 0 MR14 MR14 MR15 1 1 MR17
9 8 5 14 0 3 Intersection 7 8 5 8 2 5 Intersection
30 32 33 37 32 Summa MRs 30 32 33 37 32 Summa MRs
22 27 19 37 29 Union 22 27 25 35 27 Union
0.36 0.19 0.74 0.00 0.10 Jaccard 0.36 0.19 0.32 0.06 0.19 Jaccard
4 3 9 0 3 Intersection Valid MRs 4 3 5 0 3 Intersection Valid MRs
16 12 20 13 14 Summa Valid MRs 16 10 18 11 12 Summa Valid MRs
12 9 11 13 11 Union Valid MRs 12 7 13 11 9 Union Valid MRs
0.33 0.33 0.82 0.00 0.27 Jaccard Valid MRs 0.33 0.43 0.38 0.00 0.33 Jaccard Valid MRs
1 0.56 0.67 0 1 Proportion new 0.43 1 0.57 0.29 1 Proportion new

T2 - pdf Same or Similar T3 - pdf Same or Similar


MR# Correct Valid MR T0 T1 T2 T3 T4 T5 MR# Correct Valid MR T0 T1 T2 T3 T4 T5
MR1 1 1 MR3 MR3 MR1 MR1 1 1 MR3 MR3 MR1
MR2 1 0 MR10 MR6 MR4 MR2 1 1 MR1 MR1 MR1
MR3 1 1 MR2 MR2 MR3 MR2 MR3 1 1 MR2 MR2 MR3 MR2
MR4 1 1 MR9 MR7 MR6 MR3 MR4 1 0 MR10 MR6 MR2
MR5 1 0 MR9 MR5 MR5 1 1 MR8
MR6 1 0 MR10 MR6 MR6 1 1 MR9 MR7 MR4 MR3
MR7 1 0 MR7 1 0 MR7 MR4
MR8 1 0 MR8 1 0 MR14 MR8
MR9 1 0 MR9 1 1 MR4
MR10 1 0 MR10 1 1 MR11
MR11 1 0 MR16 MR11 1 1 MR5
MR12 1 0 MR12 1 1 MR12
MR13 1 0 MR13 1 0 MR6
MR14 1 0 MR15 MR14 MR14 1 0 MR13
MR15 1 0 MR15 1 0
MR16 1 0 MR16 1 0 MR11
MR17 1 0 MR17 1 1 MR15
3 5 5 5 2 4 Intersection MR18 1 1
32 32 35 39 34 Summa MRs 11 14 8 5 0 3 Intersection
27 27 30 37 30 Union 33 33 35 40 35 Summa MRs
0.19 0.19 0.17 0.05 0.13 Jaccard 19 25 30 40 32 Union
3 3 3 0 2 Intersection Valid MRs 0.74 0.32 0.17 0.00 0.09 Jaccard
12 10 14 7 8 Summa Valid MRs 9 5 3 0 3 Intersection Valid MRs
9 7 11 7 6 Union Valid MRs 20 18 14 15 16 Summa Valid MRs
0.33 0.43 0.27 0.00 0.33 Jaccard Valid MRs 11 13 11 15 13 Union Valid MRs
0 0 1 0 1 Proportion new 0.82 0.38 0.27 0.00 0.23 Jaccard Valid MRs
0.18 0.55 0.73 1 1 Proportion new

T4 - pdf Same or Similar


MR# Correct Valid MR T0 T1 T2 T3 T4 T5
MR1 1 0
MR2 1 0 T5 - pptx Same or Similar
MR3 1 0 MR# Correct Valid MR T0 T1 T2 T3 T4 T5
MR4 1 0 MR1 1 1 MR1 MR1 MR2
MR5 1 0 MR2 1 1 MR2 MR2 MR3 MR3
MR6 1 0 MR3 1 1 MR9 MR7 MR4 MR6
MR7 1 0 MR4 1 0
MR8 1 0 MR5 1 0 MR5 MR9
MR9 1 0 MR5 MR5 MR6 1 0 MR6 MR10
MR10 1 0 MR6 MR6 MR7 1 0
MR11 1 0 MR8 1 0
MR12 1 0 MR9 1 0 MR12 MR15
MR13 1 0 MR10 1 0
MR14 1 0 MR11 1 0
MR15 1 0 MR12 MR9 MR12 1 0 MR13 MR16
MR16 1 0 MR13 MR12 MR13 1 0
MR17 1 0 MR14 1 1
MR18 1 0 MR15 1 0
MR19 1 1 MR16 1 1
MR20 1 1 MR17 1 0
MR21 1 1 5 3 5 4 3 4 Intersection
MR22 1 1 32 32 34 35 39 Summa MRs
4 0 2 2 0 4 Intersection 29 27 30 32 35 Union
37 37 39 40 39 Summa MRs 0.10 0.19 0.13 0.09 0.11 Jaccard
37 35 37 40 35 Union 3 3 2 3 0 Intersection Valid MRs
0.00 0.06 0.05 0.00 0.11 Jaccard 14 12 8 16 9 Summa Valid MRs
0 0 0 0 0 Intersection Valid MRs 11 9 6 13 9 Union Valid MRs
13 11 7 15 9 Summa Valid MRs 0.27 0.33 0.33 0.23 0.00 Jaccard Valid MRs
13 11 7 15 9 Union Valid MRs 0.4 0.4 0.6 0.4 1 Proportion new
0.00 0.00 0.00 0.00 0.00 Jaccard Valid MRs
1 1 1 1 1 Proportion new

proportion of MRs that are not generated by other LLM runs, thus enabling us to understand
what proportion of unseen MRs an LLM run brings. Unlike the Jaccard index, it helps
distinguish cases that introduce new MRs from cases that do not. For example, with Jaccard,
an LLM run that leads to 10 unseen MRs would have an index of 0, just like an LLM run not
leading to any valid MR; the Proportion New index would instead yield 100% and 0%,
respectively. Proportion New shall help determine whether relying on a different type of
document brings a proportion of new MRs that significantly differs from what is observed by
just repeating the LLM execution. Both indexes reduce to simple set operations, as sketched
below.
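A minimal sketch of the two indexes over sets of same-or-similar MRs:

def jaccard(a, b):
    # size of the intersection divided by the size of the union
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def proportion_new(a, b):
    # share of (valid) MRs in sub-table a that do not appear in sub-table b
    return len(a - b) / len(a) if a else 0.0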

We extract a vector for the PDF experiments and a vector for the PPTX experiment for the
all-MRs Jaccard index, the valid-MRs Jaccard index, and the valid-MRs Proportion New index,
as presented in Table 4.12. The v_pdf vector records the similarities of the MRs amongst the
PDF experiments, while v_pptx records the similarities of the MRs between the PDF and
PPTX experiments.

In order to observe whether we are getting different MRs from the PDF documents vs. the
PPTX document, we use the Mann-Whitney U-test (also known as the Wilcoxon rank-sum
test), which compares the differences between two independent groups. This test evaluates
whether one sample ( v_pdf in our case) tends to have higher or lower values than another
( v_pptx in our case). Additionally, we compute the VDA (Vargha and Delaney's A12 ) effect
size measure. This quantifies the probability that a randomly selected value from one group
( v_pdf ) is larger than a randomly selected value from the other group ( v_pptx ). The VDA
ranges from 0 to 1, where 0.5 indicates that the two groups are equal, 0 indicates that the first
group is always smaller, and 1 indicates that the second group is always smaller. In our context,
if relying on a different type of document (PPTX) leads to a higher number of new MRs than
just re-running an LLM with the same document, we should observe a low value for the VDA
score. This test is complementary to Mann-Whitney U: while the U-test assesses the statistical
significance of the difference, the VDA provides insight into the practical significance of the
findings.
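A minimal sketch of this comparison, assuming scipy is available and v_pdf and v_pptx are the vectors extracted above:

from scipy.stats import mannwhitneyu

def vda_a12(x, y):
    # Vargha and Delaney's A12: P(X > Y) + 0.5 * P(X = Y)
    greater = sum(1 for xi in x for yi in y if xi > yi)
    ties = sum(1 for xi in x for yi in y if xi == yi)
    return (greater + 0.5 * ties) / (len(x) * len(y))

u_stat, p_value = mannwhitneyu(v_pdf, v_pptx, alternative='two-sided')
a12 = vda_a12(v_pdf, v_pptx)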

PUS Document Usage

Similar to the MLFS study, we take a look at the document usage with the new model gpt-4o
and the new assistant v2 . Here is a typical example of the document usage for a query:

runtime : 12 seconds
tokens in / out : 17699/685 , price : 9.88 cents

The gpt-4o model has a runtime of about 15 seconds, roughly half the runtime of the
gpt-4-turbo model used in the MLFS experiments. The number of input tokens was always
between 16,000 and 20,000, and the price of a query was always around 10 cents (±2 cents).
Interestingly, the number of input tokens did not depend on the number of attached documents
(1 or 3), the type of the query (TC/TM mapping or MR generation), or the type of the
attached document (PDF, PPTX or JSON).

Table 4.12: Extracting v_pdf and v_pptx vectors for ST[03] Housekeeping
Jaccard index - all MRs

T0 T1 T2 T3 T4 T5 v_pdf v_pptx
T0 0.36 0.19 0.74 0.00 0.10 0.36 0.10
T1 0.19 0.32 0.06 0.19 0.19 0.19
T2 0.17 0.05 0.13 0.74 0.13
T3 0.00 0.09 0.00 0.09
T4 0.11 0.19 0.11
T5 0.32
0.06
0.17
0.05
0.00
Jaccard index - Valid MRs

T0 T1 T2 T3 T4 T5 v_pdf v_pptx
T0 0.33 0.33 0.82 0.00 0.27 0.33 0.27
T1 0.43 0.38 0.00 0.33 0.33 0.33
T2 0.27 0.00 0.33 0.82 0.33
T3 0.00 0.23 0.00 0.23
T4 0.00 0.43 0.00
T5 0.38
0.00
0.27
0.00
0.00
NewP - Valid MRs

T0 T1 T2 T3 T4 T5 v_pdf v_pptx
T0 0.43 0.40
T1 0.43 0.00 0.40
T2 0.00 0.00 0.00 0.60
T3 0.18 0.55 0.73 0.18 0.40
T4 1.00 1.00 1.00 1.00 0.55 1.00
T5 0.40 0.40 0.60 0.40 1.00 0.73
1.00
1.00
1.00
1.00
66 Chapter 4. Empirical Assessment

The message generated by the LLM sometimes indicates the source of the information, e.g.:

... creating and deleting parameter functional reporting definitions [4:8+source] [4:9+source]

Checking the client.beta.threads.messages object we can see that the message is structured
as follows:

{'id': 'msg_xuyU0SNFbma6i0EeHUCXNK3Y',
 'assistant_id': 'asst_f9KN3VK5w42Ds42HRn99a7Vz',
 'content': [{'text': {'annotations':
     [{'end_index': 1428,
       'file_citation': {'file_id': 'file-AWH596pCDTxJm7kZVQwcLz1m',
                         'quote': None},
       'start_index': 1416,
       'text': '[4:8+source]',
       'type': 'file_citation'},
      {'end_index': 1440,
       'file_citation': {'file_id': 'file-AWH596pCDTxJm7kZVQwcLz1m',
                         'quote': None},
       'start_index': 1428,
       'text': '[4:9+source]',
       'type': 'file_citation'}],

Here we can get a step closer to the source of the information by checking the annotations
field. The file_citation field points to the file_id field, which is the unique identifier
of the attached document (the PPTX document in this case). However, the quote field is
empty, and the start_index and end_index fields point to the fragments of the document
that were assigned during the embedding process in the vector store. Unfortunately, we were
not able to reverse engineer these indexes into the actual text fragments from the attached
document. The citations can nevertheless be enumerated programmatically, as sketched below.
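A minimal sketch over the latest assistant message, reusing the client and thread objects from the experiments:

messages = client.beta.threads.messages.list(thread_id=thread.id)
for annotation in messages.data[0].content[0].text.annotations:
    if annotation.type == 'file_citation':
        print(annotation.text,                   # e.g. '[4:8+source]'
              annotation.file_citation.file_id,  # the attached document
              annotation.start_index,
              annotation.end_index)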

4.4.3 Assessing RQs


RQ1: Can LLMs generate MRs from SRS?

To answer this question, we analyzed the MRs generated by the GPT-4 Turbo model when
provided with the SRS documents for the MLFS functions. We compared these MRs to the
base case where the model was not provided with the SRS documents. The aggregated results
for all 7 functions studied in the MLFS framework are presented in Table 4.13.

We studied the MR generation capability for 7 functions and we asked the model to generate
10 MRs for every function. The results show that the model was able to generate valid MRs
for all functions in all three settings. The number of valid and unique MRs without access to
SRS documents is 41, with access to SRS documents is 38, and with access to SRS documents
combined with explicit prompting is 31. The results show that the model can generate MRs

Table 4.13: Aggregated Results for the 7 functions studied in the MLFS
framework

T1: chatGPT inferring 10 MRs without access to SRS pdf
Function | Self Validation | My Validation | MR Valid & Unique | Invalid MR, Easy Fix | Invalid MR, Contradicting SRS | Self + My Validation (00 01 10 11) | MR valid and not in (T2 T3)
asin 6 2 1 1 - 4 0 4 2 0 1
atan2 9 6 6 0 - 1 0 3 6 2 5
exp 8 7 7 0 - 2 0 1 7 0 2
fmin 10 10 9 0 - 0 0 0 10 5 7
fmod 7 5 5 0 - 2 1 3 4 1 3
hypot 10 5 5 3 - 0 0 5 5 2 2
pow 9 8 8 0 - 1 0 1 8 5 6
Sum 59 43 41 4 10 1 17 42 15 26
Average 8.4 6.1 5.9 0.6 - 1.4 0.1 2.4 6.0 2.1 3.7

T2: chatGPT inferring 10 MRs with access to the SRS pdf
Function | Self Validation | My Validation | MR Valid & Unique | Invalid MR, Easy Fix | Invalid MR, Contradicting SRS | Self + My Validation (00 01 10 11) | MR valid and not in (T1 T3)
asin 6 5 4 0 0 4 0 1 5 3 4
atan2 7 4 3 2 0 1 2 5 2 2 2
exp 9 7 7 0 0 1 0 2 7 1 3
fmin 9 7 7 2 0 1 0 2 7 3 5
fmod 6 6 6 2 1 3 1 1 5 2 5
hypot 7 5 4 3 1 3 0 2 5 2 2
pow 10 7 7 1 1 0 0 3 7 4 2
Sum 54 41 38 10 3 13 3 16 38 17 23
Average 7.7 5.9 5.4 1.4 0.4 1.9 0.4 2.3 5.4 2.4 3.3

T3: chatGPT inferring 10 MRs, function description provided through SRS pdf
Function | Self Validation | My Validation | MR Valid & Unique | Invalid MR, Easy Fix | Invalid MR, Contradicting SRS | Self + My Validation (00 01 10 11) | MR valid and not in (T1 T2)
asin 10 2 2 1 0 0 0 8 2 1 2
atan2 10 4 4 3 3 0 0 6 4 3 4
exp 10 4 4 0 0 0 0 6 4 0 0
fmin 10 5 5 0 2 0 0 5 5 3 2
fmod 9 5 5 1 2 1 0 4 5 3 4
hypot 10 4 4 1 2 0 0 6 4 0 0
pow 10 7 7 0 1 0 0 3 7 5 2
Sum 69 31 31 6 10 1 0 38 31 15 14
Average 9.9 4.4 4.4 0.9 1.4 0.1 0.0 5.4 4.4 2.1 2.0

from the SRS documents, but the number of valid and unique MRs is slightly lower when the
model is explicitly asked to use the SRS documents. The results also show that the model can
generate valid MRs without access to the SRS documents, which indicates that the model has
some knowledge of the subject matter; this was expected, since the MLFS library implements
basic mathematical functions that are well known and widely used.

We assessed this RQ within the PUS study as well. To answer this question, we used the
gpt-4o model to generate MRs for the PUS framework from the SRS documents. As a base
case we considered the MRs generated without any attached document, but the model's prior
knowledge was not sufficient and it asked for the SRS in order to provide meaningful answers.
We continued by providing three different documents (PDF, PPTX and JSON), simultaneously
and individually, and asked the model to perform a TC/TM mapping for a given ST and,
based on this mapping, to generate MRs.

The aggregated results for all five STs studied in this chapter are presented in Table 4.14.

Table 4.14: Aggregated results for studied STs with PDF, PPTX, and JSON
documents attached

Service Type Attached documents Correct Valid MR


T0 - pdf,pptx,json 16 5
T1 - pdf 17 10
ST03 - Housekeeping T2 - pptx 17 5
T3 - json 7 4
T0 - pdf,pptx,json 6 2
T1 - pdf 8 5
ST17 - Test T2 - pptx 5 3
T3 - json 8 2
T0 - pdf,pptx,json 9 0
T1 - pdf 9 0
ST01 - Request Verification T2 - pptx 5 4
T3 - json 0 0
T0 - pdf,pptx,json 19 9
T1 - pdf 18 9
ST22 - Position-Based Scheduling T2 - pptx 24 3
T3 - json 16 9
T0 - pdf,pptx,json 14 11
T1 - pdf 13 7
ST23 - File Management T2 - pptx 14 8
T3 - json 11 5
T0 - pdf,pptx,json 64 27
T1 - pdf 65 31
Total T2 - pptx 65 23
T3 - json 42 20

Based on these results we can assert that the model is able to generate MRs from the SRS
documents. All the generated MRs have an Original Test followed by an appropriate Follow-up
Test and a corresponding Expected Relation. We asked the model to generate MRs that include
all mapped TCs and TMs at least once, and we considered an MR to be correct when at least

one TC or one TM belonged to the requested ST. We can observe that the model generated
correct MRs for all the STs studied and for all the attached documents, with the exception of
the ST[01] Request Verification service type when the JSON document was attached. Here the
model got confused by the absence of TCs in ST[01] and generated MRs with TCs and TMs
from other STs.

Our key findings are:

• Correct MRs: The total number of correct MRs generated when simultaneously using the
PDF, PPTX and JSON documents was 64; when using only the PDF document it was 65,
when using only the PPTX document 65, and when using only the JSON document 42.
All document types performed similarly, with the exception of the JSON document, which
generated about two thirds of the correct MRs compared to the other two document
types.

• Valid MRs: We considered MRs valid when the Expected Relation column reports a
relation among outputs. The model was able to generate valid MRs for all the STs studied
and for all the attached documents, with the exception of the ST[01] Request Verification
service type. In this case the model generated valid MRs only when the PPTX document
was attached; this can be explained by the fact that the PPTX document contains a
concrete example of the ST[01] workflow, and the model was able to generate MRs based
on this example. The total number of valid MRs generated when simultaneously using
the PDF, PPTX and JSON documents was 27; when using only the PDF document it
was 31, when using only the PPTX document 23, and when using only the JSON
document 20. Again the JSON document is the worst performer, but this time the
differences are smaller than for the correct MRs.

To conclude, the results demonstrate that LLMs, specifically the ChatGPT gpt-4o model, can
indeed generate MRs from SRS documents for the PUS framework. The model successfully
produced correct MRs across various Service Types (STs) and document formats (PDF, PPTX,
JSON). While the performance varied slightly depending on the document type, with JSON
performing somewhat less effectively, the overall results were promising, and the findings suggest
that LLMs have potential as a tool for automating or assisting in the generation of MRs from
SRS documents.

RQ2: Do MRs generated from SRS qualitatively differ from MRs generated based
only on prior knowledge of LLMs?

We are going to answer this question by reusing the previously presented results for the MLFS
experiments. We are comparing the MRs generated by the model when provided with the SRS
documents to the base case where the model was not provided with the SRS documents. The
aggregated results are presented in Table 4.13. The results show that the model can generate
valid MRs in all three cases, but the number of valid and unique MRs is slightly lower when the
model has access to the SRS document, and even lower when the model is explicitly asked to use
the SRS document. This result is somewhat unexpected, as we anticipated that the model would

generate more valid MRs when provided with the requirements document. The aggregated
results show that the model generates on average 5.9 valid and unique MRs without access
to the SRS documents (out of a maximum of 10), 5.4 with access to the SRS documents, and
only 4.4 with access to the SRS documents combined with explicit prompting. These results
can be explained if we look more closely at the content of the SRS document. Figure 4.5
shows an excerpt from the content of the SRS document for the atan2 function. We can see

Figure 4.5: Excerpt from the MLFS PDF document atan2 function requirement
specification

that the requirements specify the expected return values for given special input values.
While this is valuable information for generic test cases, it is not easy to incorporate it
into an MR, because an MR is built on the assumption that we do not have the oracle
output for a given input: we have to work by changing the input values and observing
the variation in the output values. This is the reason why the model generates more MRs
when not provided with the SRS document: it is not constrained by the special input and
output values.

The model's self-confidence can be observed in the Self Validation results. The model is
somewhat confident when generating MRs without access to the SRS documents, with an
average of 8.4 out of 10 MRs being self-validated. When provided with the SRS documents,
the average self-validation drops to 7.7 , and when explicitly asked to use the SRS documents,
the average self-validation reaches 9.9 . This indicates that the model is more confident in its
answers when it actually reads the provided SRS documents.

We observe an interesting result in the column Invalid MR, Contradicting SRS, where we
counted the MRs that are invalid because they state a relation that explicitly contradicts the
SRS document. This column is necessarily empty in the first case, when the model is not provided
with the SRS document. However, when the model is provided with the SRS document, 3 MRs
are invalid because they contradict it, and this number increases to 10 when the model is
explicitly directed to use the SRS document. Examining the SRS content for the atan2 function
again, we can observe in Figure 4.5 that the inputs X and Y never take the value 0; there is
always a distinction between +0 and -0. In everyday life we usually do not distinguish +0 from
-0, but in the mathematics of spaceflight this distinction can be important: when a spacecraft is
about to reach a pose where one of the parameters becomes 0, it can matter whether this value
is approached from the positive or the negative side, since this can determine, for example,
whether the spacecraft is pointing upwards or downwards. While ChatGPT is aware of the
existence of +0 and -0 (upon direct enquiry it can even explain the byte representation of these
numbers), when generating the MRs it follows the requirement specifications loosely and omits
the distinction between +0 and -0, so in essence it generates MRs that contradict the SRS
document.
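
The behaviour is easy to reproduce; the short sketch below (our own illustration) shows why an
MR that treats +0 and -0 as interchangeable contradicts the atan2 requirements.

    import math

    # +0.0 and -0.0 compare equal, yet atan2 distinguishes them:
    print(0.0 == -0.0)                    # True
    print(math.atan2(0.0, -1.0))          # pi  (zero approached from the positive side)
    print(math.atan2(-0.0, -1.0))         # -pi (zero approached from the negative side)

    # An MR assuming "y and -y are the same value when y == 0" therefore fails:
    y = 0.0
    print(math.atan2(y, -1.0) == math.atan2(-y, -1.0))  # False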

Our analysis revealed several qualitative differences between MRs generated from SRS and
those based solely on the LLM's prior knowledge:

• Quantity and Diversity: Contrary to initial expectations, the model generated slightly
fewer valid and unique MRs when provided with SRS (5.4 on average) compared to using
only prior knowledge (5.9 on average). This decreased further to 4.4 when explicitly
prompted to use the SRS.

• Self-Confidence: The model’s self-validation scores were highest (9.9 out of 10) when
explicitly asked to use the SRS, indicating increased confidence in its outputs when directly
referencing provided documentation.

• Adherence to SRS: We observed instances where MRs generated with SRS access
contradicted the specifications (3 cases with SRS access, increasing to 10 when explicitly
prompted to use the SRS). This suggests that the deep prior knowledge of the model may
sometimes override the specific requirements outlined in the SRS.

RQ3: Can documents with a different format lead to different MRs?

To answer this question, we used the PUS experiment results, comparing the MRs generated
from PDF documents (5 experiments) against those generated from PPTX documents (1
experiment) for various STs. We asked the model to create a TC/TM mapping for a given ST
and, based on this mapping, to generate MRs. We marked the same or similar MRs and
computed the Jaccard index over all MRs and over valid MRs; additionally, we defined another
index, called Proportion New, and computed it over valid MRs. We collected the indices
among the PDF experiments into a vector v_pdf and the indices between the PDF and
PPTX experiments into a vector v_pptx. We used the Mann-Whitney U-test to compare the
two vectors and computed the Vargha-Delaney A (VDA) effect size for the test. The aggregated
results for all the STs studied in this chapter are presented in Table 4.15.
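
As a reference for the computations behind Table 4.15, the sketch below (illustrative toy data;
the MR identifiers are placeholders) computes the Jaccard index, the Proportion New index
(taken here as the fraction of MRs in one run that never appeared in a reference run), the
Mann-Whitney U-test, and the VDA effect size.

    from itertools import combinations, product
    from scipy.stats import mannwhitneyu

    def jaccard(a: set, b: set) -> float:
        # |A intersect B| / |A union B| over normalized MR identifiers.
        return len(a & b) / len(a | b) if a | b else 1.0

    def proportion_new(reference: set, other: set) -> float:
        # Fraction of MRs in `other` never seen in `reference`.
        return len(other - reference) / len(other) if other else 0.0

    def vda(x: list, y: list) -> float:
        # Vargha-Delaney A: P(X > Y) + 0.5 * P(X == Y) over all pairs.
        wins = sum(1.0 if xi > yi else 0.5 if xi == yi else 0.0
                   for xi, yi in product(x, y))
        return wins / (len(x) * len(y))

    pdf_runs = [{"MR1", "MR2", "MR3"}, {"MR1", "MR2"}, {"MR1", "MR3", "MR4"},
                {"MR1", "MR2", "MR4"}, {"MR1", "MR2", "MR3"}]
    pptx_run = {"MR5", "MR6"}

    v_pdf = [jaccard(a, b) for a, b in combinations(pdf_runs, 2)]
    v_pptx = [jaccard(run, pptx_run) for run in pdf_runs]

    stat, p = mannwhitneyu(v_pdf, v_pptx, alternative="two-sided")
    print(f"U-test p = {p:.3f}, VDA = {vda(v_pdf, v_pptx):.2f}")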

Table 4.15: Aggregated results for Mann-Whitney U-test p-values and VDA effect sizes for the studied STs

ST     ST name                      Measure                U-test p   VDA
ST01   Request_Verification         Jaccard - all MRs      0.13       0.70
ST01   Request_Verification         Jaccard - valid MRs    0.57       0.55
ST01   Request_Verification         NewP - valid MRs       0.57       0.45
ST03   Housekeeping                 Jaccard - all MRs      0.76       0.56
ST03   Housekeeping                 Jaccard - valid MRs    0.80       0.55
ST03   Housekeeping                 NewP - valid MRs       0.75       0.56
ST17   Test                         Jaccard - all MRs      0.12       0.76
ST17   Test                         Jaccard - valid MRs    1.00       0.50
ST17   Test                         NewP - valid MRs       0.78       0.45
ST22   Position-Based_Scheduling    Jaccard - all MRs      0.00       1.00
ST22   Position-Based_Scheduling    Jaccard - valid MRs    0.00       1.00
ST22   Position-Based_Scheduling    NewP - valid MRs       0.01       0.90
ST23   File_Management              Jaccard - all MRs      0.95       0.48
ST23   File_Management              Jaccard - valid MRs    0.95       0.48
ST23   File_Management              NewP - valid MRs       0.54       0.61

• Statistical significance (U-test p-values): Most STs show high p-values (> 0.05),
suggesting no statistically significant difference between PDF and PPTX results. The
only exception is ST[22] Position-Based Scheduling, showing very low p-values (≤
0.01), indicating a significant difference between PDF and PPTX results for ST[22].

• Effect size (VDA): Most STs show VDA values close to 0.5, suggesting minimal practical
difference between PDF and PPTX results. The only exception is ST[22] Position-Based
Scheduling, with VDA values close to 1.0: the v_pdf values are consistently larger than
the v_pptx values, i.e., the PDF experiments are similar to each other but differ from
the PPTX experiment, so for ST[22] the PDF document leads to different MRs than the
PPTX document.

In conclusion, the answer to RQ3 is nuanced. While most STs show no statistically significant
difference between PDF and PPTX results, for some specific STs (notably ST[22]) the document
format can lead to substantial differences in the generated MRs.

These findings suggest that while document format may not universally affect MR generation,
it can have a significant impact in certain cases. Further investigation into the characteristics
of STs that are more susceptible to format-induced differences could be valuable.

RQ4: Can LLM extract relevant test information from large SRS documents?

To answer this question, we studied the TC/TM mapping mistakes for the studied STs while
executing the PUS experiments. To extract the TCs and TMs and pair them correctly, the
model needs a complete picture and a good understanding of the inner workings of the ST;
the collected mistakes can therefore be used as a proxy for the model's capability to extract
relevant test information from large SRS documents. We collected the TC/TM mistakes while
executing the PUS experiments and computed the mistake percentages for the studied STs and
document types.

The aggregated results for all the STs studied in this chapter are presented in Table 4.16.

Table 4.16: Aggregated results of the TC/TM mapping mistakes for the studied STs

                                     Mistake [%], 3-document experiment         Mistake [%], 5 PDF vs 1 PPTX
ST     ST name                      pdf+pptx+json   pdf     pptx    json        5-PDF avg   pptx
ST01   Request_Verification         0.0%            0.0%    0.0%    100.0%      8.9%        0.0%
ST03   Housekeeping                 3.7%            7.4%    11.1%   40.7%       4.8%        14.0%
ST17   Test                         0.0%            0.0%    0.0%    0.0%        0.0%        0.0%
ST22   Position-Based_Scheduling    2.1%            2.1%    0.0%    12.5%       5.0%        0.0%
ST23   File_Management              0.0%            2.5%    0.0%    5.0%        2.0%        0.0%

The results show that the model is able to extract relevant test information from large SRS
documents with a low percentage of mistakes. The model performed best for ST[17] Test,
with 0% mistakes for all document types. This result is expected, as ST[17] Test is a simple
service type with a clear workflow consisting of only 2 TCs and 2 TMs. The most complicated
ST for the model was ST[03] Housekeeping: here we encountered mistakes in the TC/TM
mapping for all document types, with the lowest percentage of mistakes for the PDF document
(3.7%) and the highest for the JSON document (40.7%). The model performed well for
ST[01] Request Verification, with 0% mistakes for all document types except JSON, where the
model was confused by the missing TCs in ST[01] and extracted TCs and TMs from other STs,
resulting in a 100% mistake rate. The missing TCs also affected the ST[01] five-PDF experiment:
four of the TC/TM mappings were flawless with 0% mistakes, but one experiment included
non-existent TCs in ST[01], raising the average of the 5 PDF runs to 8.9%. Regarding the
document types, we can observe that the model performed best when all three document types
were attached, resulting in a perfect score for ST[01], ST[17] and ST[23], and a low percentage
of mistakes for ST[03] and ST[22]. The model performed worst when only the JSON document
was attached, with the highest percentage of mistakes for all the STs studied.

In conclusion, the model showed a remarkable ability to extract the TC/TM mapping from the
SRS documents. The LLM performed best when all three document types were attached, and
worst when only the JSON document was attached. Excluding the JSON-only results, the
accuracy is comparable to human performance, while the speed of extracting the information is
significantly higher, since no human can read and process a 600-page PDF document in a few
seconds.

These results suggest that LLMs have potential as a tool for automating or assisting in the
extraction of relevant test information from large SRS documents.

4.5 Discussion
The empirical assessment of LLMs for MR generation in the software testing of space systems
reveals several key insights and implications. Our study evaluated the capability of LLMs,
specifically ChatGPT with the GPT-4 Turbo and GPT-4o models, in generating MRs for the
MLFS critical software library and the PUS framework based on various document inputs,
including the official PDF specification documents, PPTX training materials, and JSON data.

Our results demonstrate that LLMs, specifically the GPT-4 Turbo model for MLFS and the
GPT-4o model for PUS, can indeed generate valid and relevant MRs from SRS documents. The
process and outcomes differed between the two systems:

• MLFS: The LLM was able to generate MRs both with and without access to the SRS.
This suggests that for well-known mathematical functions, the model's prior knowledge is
enough to produce relevant MRs. Interestingly, providing the SRS did not improve the
quantity or quality of the generated MRs, and on average even led to a decrease in valid MRs.

• PUS: Unlike MLFS, the LLM required access to SRS documents to generate meaningful
MRs for PUS. This highlights the domain-specific nature of PUS and the importance of
providing context for more specialized systems.

These findings suggest that the effectiveness of LLMs in generating MRs may depend on the
domain’s complexity and the model’s prior knowledge of the subject matter.

Impact of Document Format on MR Generation

Our investigation into whether different document formats lead to different MRs yielded mixed
results. For most Service Types, we found no statistically significant difference between MRs
generated from PDF and PPTX documents. However, for ST[22] Position-Based Scheduling, we
observed substantial differences in the generated MRs depending on the document format used.

This variability suggests that document format can influence MR generation, but its impact may
depend on the specific service type or system component. The reasons for these differences could
be related to how information is presented in different formats (e.g., more visual representations
in PPTX vs. textual descriptions in PDF) and how the LLM processes this information.

LLM Capability in Extracting Relevant Test Information

Our assessment of the LLM’s ability to extract relevant test information from large SRS
documents revealed promising results. The model demonstrated a remarkable ability to
accurately map TCs and TMs from the SRS documents, with low error rates across most Service
Types and document formats. The model performed best when all three document types (PDF,
PPTX, and JSON) were attached, suggesting that a multi-format approach can enhance the
model’s understanding and extraction capabilities. The JSON format alone proved to be the
least effective, indicating that OpenAI’s assistant handling of structured data without context
may not be sufficient for complex information extraction tasks. These findings suggest that
LLMs have significant potential as tools for automating or assisting in the extraction of relevant
4.5. Discussion 75

test information from large SRS documents. The speed and accuracy demonstrated by the
model in processing large documents highlight the potential efficiency gains in software testing
processes.

LLM Performance and Constraints

Several observations about LLM performance and constraints emerged from our study:

• Special Case in TC/TM Mapping: The LLM sometimes failed to identify the TC/TMs
within a given ST; this was particularly the case with ST[01] Request Verification (a
special ST with no TCs), where the model extracted TCs and TMs from other STs when
the JSON document was attached.

• Special Handling of the 0 value: In the MLFS study, we noted instances where the
LLM failed to distinguish between +0 and -0, and generated MRs contradicting the SRS.
This highlights the challenge of capturing nuanced technical requirements.

• Self-Confidence vs. Accuracy: For MLFS, we observed that the LLM’s self-confidence
(as measured by self-validation scores) increased when explicitly instructed to use the
SRS, even though the actual number of valid and unique MRs decreased. This suggests a
potential misalignment between the model’s confidence and its actual performance.

• Domain-Specific Knowledge: The LLM’s inability to generate MRs for PUS without
SRS documents underscores the importance of domain-specific knowledge in complex
systems.

Hallucination in our Experiments

Hallucination appears when we ask the model to perform an operation that is impossible,
very difficult, or not well defined. Setting the temperature to 0 helps to minimize the occurrence
of hallucinations, but we still encountered hallucination-like events when experimenting with
ST[01] Request Verification. The TC/TM mapping results can be seen in Tables 4.7 and 4.8,
while the generated MRs can be seen in Tables B.8 and B.9.
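
For illustration, a minimal sketch of issuing such a deterministic-leaning request with the
OpenAI Python SDK (the plain chat endpoint is shown for brevity; our experiments used the
Assistants API with attached documents, and the prompt here is a placeholder).

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # temperature=0 makes decoding greedy, which reduces (but does not
    # eliminate) run-to-run variation and hallucination-like outputs.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": "You are a software testing assistant."},
            {"role": "user", "content": "List the TCs and TMs of PUS ST[01]."},
        ],
    )
    print(response.choices[0].message.content)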

We asked the model to list the TCs and TMs for ST[01] Request Verification, even though
there are no ST[01] TCs. We obtained three different results:

1. Three document / PDF only experiment:

• TC/TM mapping: TC fields are empty, TM fields are filled with TMs from ST[01] —
accepted as correct

• MR generation: MRs correct as test cases but can not be considered valid MRs

2. PPTX only experiment:

• TC/TM mapping: example TCs TC[8,1] and TC[201] taken from the document, TM
fields are filled with TMs from ST[01] — accepted as correct

• MR generation: the MRs are correct and valid, except MR6 and MR7, which
are taken from ST[17]

3. JSON only experiment:

• TC/TM mapping: TC fields are filled with TCs from other STs, TM fields are filled
with TMs from other STs (ST[6], ST[21], ST[4], ST[20], ST[17]) — noted as mistakes

• MR generation: all MRs are considered incorrect and invalid since they belong to
other STs

In the PPTX case, we accept the example TCs TC[8,1] and TC[201] in the TC/TM mapping
because the model demonstrates an “understanding” of the behavior of these TCs. It recognizes
that these TCs can be sent with valid or invalid routing, or with valid/invalid start conditions,
which enables it to generate valid MRs. On the other hand, in the JSON case, the model
appears to be making its best effort without a clear “understanding” of the context. It is not
truly “aware” of the implications of the TCs and TMs it is using. We refer to these occurrences
as hallucination-like events because the model is not fabricating TCs and TMs from scratch,
but rather selecting them from the SRS and applying them in an incorrect context.

The results for the ST[01] TC/TM mapping execution, when comparing the 5 PDF experiments
to the 1 PPTX experiment, reveal an instance of hallucination by the model. In the majority
of the PDF mapping sub-tables (T0, T1, T3, and T4), the TC/TM mappings were correct.
However, in one of the PDF sub-tables (T2), the model generated non-existent TC[1,x] services,
as shown in Tables B.10 and B.11. This was the only instance of hallucination observed in the
entire study, and it did not cause any issues, but it is reflected in the statistics. The reason we
included ST[01] in our study was to observe how the model would behave when faced with a
near-unsolvable task.

Implications for Software Testing in Space Systems

The results of this study have several important implications for software testing in space
systems:

• Automation Potential: LLMs show promise in automating aspects of the MR generation
process, potentially reducing the time and effort required for test case creation.

• Complementary Tool: While LLMs demonstrate impressive capabilities, they are
best viewed as complementary tools to human expertise rather than replacements. The
nuanced understanding required for some Service Types (e.g., ST[01] Request Verification)
highlights the ongoing importance of human oversight.

• Document Format Considerations: The impact of document format on MR generation
suggests that organizations should carefully consider how they structure and present their
SRS documents to maximize the effectiveness of LLM-assisted testing.

• Efficiency in Information Extraction: The LLM's ability to quickly and accurately
extract relevant information from large documents could significantly speed up the initial
phases of test planning and design.

In conclusion, our study demonstrates the potential of LLMs in enhancing software testing
processes for space systems. While challenges remain, the capabilities shown in MR generation
and information extraction from complex documents suggest that LLMs could become valuable
tools in the software testing toolkit for critical space system components. As this technology
continues to evolve, it holds promise for improving the efficiency and effectiveness of software
testing in the space domain.

Chapter 5

Conclusion

5.1 Summary of Findings


This thesis set out to provide a snapshot of the current capabilities of Large Language Models
(LLMs) in assisting software testing, with a particular focus on generating Metamorphic Relations
(MRs) for space system software components. Our study centered on two critical software
packages: the Mathematical Library for Flight Software (MLFS) and the Packet Utilization
Standard (PUS) framework.

We chose to work with OpenAI’s ChatGPT, which was top-ranked on the LMSYS Chatbot
Arena at the time of our experiments. Specifically, we utilized the GPT-4 Turbo model for
MLFS experiments and the GPT-4o model for PUS experiments. Our research explored the
LLM’s ability to generate MRs from Software Requirements Specifications (SRS), the impact of
different document formats on MR generation, and the model’s capability to extract relevant
test information from large technical documents.

Key findings from our study include:

• LLMs can generate valid and relevant MRs from SRS documents for both MLFS and
PUS, though the effectiveness varied between the two systems.

• For well-known mathematical functions in MLFS, the model’s prior knowledge was
sufficient to produce relevant MRs, even without access to SRS documents.

• For the more specialized PUS framework, access to SRS documents was crucial for
generating meaningful MRs.

• Document format can influence MR generation for certain service types.

• LLMs demonstrated impressive capabilities in extracting relevant test information from
large SRS documents, with low error rates in TC/TM mapping across most service types.

These findings suggest that LLMs have significant potential as complementary tools in the
software testing process for space systems, offering possibilities for increased efficiency and
comprehensive test case generation.

5.2 Reflections on LLM Evolution and Future Prospects


During the course of this research (spring/summer of 2024), we witnessed several major model
releases from OpenAI and other research groups, each bringing improvements in reasoning
capabilities, speed, and pricing. This rapid evolution of LLM technology underscores the
dynamic nature of the field and the potential for continued advancements.

The trend of LLM evolution is likely to continue for the foreseeable future, driven by substantial
investments in both hardware and software development. We anticipate at least a few more
years of continuous improvement in LLM capabilities.

As models continue to evolve, traditional methods of evaluating their performance, such as
public voting on platforms like LMSYS described in Section 3.1.2, may become less effective.
Most top models can now easily answer questions of average difficulty, so votes may be biased
by the format and presentation of answers rather than by their content. A more robust
benchmark is offered by LiveBench1. Created by White et al. with contributions from
Abacus.AI, NYU, Nvidia, UMD, and USC (White et al., 2024), LiveBench addresses these
limitations by:

• Releasing new questions monthly to limit potential contamination

• Ensuring each question has verifiable, objective ground-truth answers

• Allowing hard questions to be scored accurately and automatically

• Covering six categories: reasoning, coding, mathematics, data analysis, language, and
instruction following

• Completely refreshing the benchmark every six months

The initial version was LiveBench-2024-06-24; as of this writing, the latest version is LiveBench-
2024-07-25, and new versions are expected to be released monthly, providing a more current
and objective measure of LLM capabilities. The top models as of 6th August 2024 can be seen
in Figure 5.1.

OpenAI addresses the specific needs of different audiences: on LiveBench they feature the
model gpt-4o-2024-08-06, which is best suited for Assistants API usage2, while on LMSYS they
compete with gpt-4o-latest-2024-08-08, currently the default model on the web interface and a
chat-optimized model intended to secure, once again, the first position on the LMSYS
leaderboard as of 14th August 2024, shown in Figure 5.2.

It should be noted that selecting the best model for our use case cannot be based on the
LiveBench results alone, because those results reflect the models' internal knowledge and
reasoning capabilities on zero-shot tasks, while our use case is based on information retrieval
from custom SRS documents. OpenAI's assistant is versatile, capable of handling a wide range
of documents and tasks; if the chosen model does not offer an assistant (or agent, in other
terminology) with similar capabilities, we need to implement our own RAG solution.
1 Contamination-Free LLM Benchmark: https://livebench.ai/
2 GPT-4o models: https://platform.openai.com/docs/models/gpt-4o

Figure 5.1: Top models on LiveBench as of 6th August 2024

Figure 5.2: Top models on LMSYS as of 14th August 2024



5.3 Practical Considerations for Implementation


Our research benefited from working with open standards (PUS) and declassified documents,
allowing us to disregard certain privacy issues. However, we recognize that much of the top-
level space software is closed source. For companies testing closed-source software, several
considerations are important:

• Data Privacy: Companies should ensure they subscribe to LLM services that do not use
their data for model training.

• In-house Implementation: For stricter policy rules, companies could use open-weights
models like Meta's Llama (https://llama.meta.com/) and implement the service in-house
to prevent confidential data from leaving the premises.

• Trade-offs: While in-house implementation requires resources for server setup and a
customized RAG solution (a minimal sketch follows this list), it provides better control,
with fewer "black boxes" compared to services like OpenAI's assistants.
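
A minimal sketch of such an in-house pipeline (our own illustration, not a tool used in this
study): it assumes the sentence-transformers library for local embeddings, deliberately naive
chunking and retrieval, and a hypothetical generate() call into a locally hosted open-weights
chat model.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model

    def chunk(text: str, size: int = 800) -> list[str]:
        # Naive fixed-size chunking; a production system would split on structure.
        return [text[i:i + size] for i in range(0, len(text), size)]

    def top_k(query: str, chunks: list[str], k: int = 5) -> list[str]:
        # Cosine similarity via dot product on normalized vectors.
        doc_vecs = embedder.encode(chunks, normalize_embeddings=True)
        q_vec = embedder.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q_vec
        return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

    def answer(query: str, srs_text: str) -> str:
        context = "\n---\n".join(top_k(query, chunk(srs_text)))
        prompt = f"Using only this SRS excerpt:\n{context}\n\nQuestion: {query}"
        return generate(prompt)  # hypothetical call into the local LLM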

5.4 Future Research Directions


While our study provides valuable insights, it also has limitations that point to areas for future
research:

• Limited Scope: Our study focused on specific components (MLFS and PUS) and a
limited number of Service Types. Future work could expand to a broader range of space
system software components.

• Model Specificity: We used specific versions of GPT models from OpenAI. Investigating
the performance of other LLMs, such as Anthropic's Claude, Google's Gemini, or Meta's
Llama, could provide a more comprehensive understanding of their capabilities in this domain.

• Domain Adaptation: We could explore techniques for fine-tuning LLMs on space-specific
datasets to enhance their performance in this specialized domain.

• Validation of Generated MRs: While we assessed the correctness and validity of
generated MRs, future studies could involve domain experts to provide a more rigorous
evaluation of the quality and usefulness of these MRs in real-world testing scenarios.

• Executable MRs: Investigating methods to convert LLM-generated MRs into executable
test cases would be a valuable next step in the research.

• Recycling Valid Test Cases: While evaluating the generated MRs, we discarded valid
test cases that did not adhere to the formal definition of MRs. The model inadvertently
solved the oracle problem by extracting the expected output for given inputs from the
SRS. Future research could explore ways to use these valid test cases.

• Integration with Testing Processes: Research into how LLM-generated MRs can be
effectively integrated into existing software testing processes for space systems would be
valuable.

• Reproducibility Challenges: We observed some inconsistencies in LLM outputs even
with controlled inputs. Further investigation into the factors affecting reproducibility and
strategies to mitigate this issue would be beneficial.

In conclusion, our study provides a snapshot of the current state of LLM capabilities in software
testing for space systems. The rapid evolution of these models suggests that their potential
applications in this field will continue to expand. As we move forward, it will be crucial to
balance the promising capabilities of LLMs with careful consideration of their limitations and
the unique requirements of space system software testing. The future of software testing in the
space domain is likely to involve a synergistic relationship between advanced AI technologies
and human expertise, working together to ensure the reliability and safety of critical space
systems.

Appendix A

MLFS Tables

List of tables created in the MLFS study for the functions asin, exp, fmin, fmod, hypot,
and pow.
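
Before the tables, a short sketch (our own illustrative code, not an artifact of the study) of how
some of the MRs marked valid below can be turned into executable checks; the tolerance choice
is ours and may need widening for extreme inputs.

    import math

    def close(a: float, b: float, tol: float = 1e-12) -> bool:
        return math.isclose(a, b, rel_tol=tol, abs_tol=tol)

    def check_exp_doubling(x: float) -> bool:
        # Table A.3, MR3: exp(2x) = exp(x) * exp(x)
        return close(math.exp(2 * x), math.exp(x) * math.exp(x))

    def check_fmin_commutativity(x: float, y: float) -> bool:
        # Table A.5, MR1/MR2: fmin(x, y) = fmin(y, x); Python's min() stands
        # in for fmin here and does not reproduce its NaN handling.
        return min(x, y) == min(y, x)

    def check_fmod_additive(x: float, y: float, k: int) -> bool:
        # Table A.7, MR4: fmod(x + k*y, y) = fmod(x, y); rounding in x + k*y
        # can require a wider tolerance for large |k*y|.
        return close(math.fmod(x + k * y, y), math.fmod(x, y))

    def check_hypot_permutation(x: float, y: float) -> bool:
        # Table A.9, MR2: hypot(x, y) = hypot(y, x)
        return close(math.hypot(x, y), math.hypot(y, x))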

Table A.1: MRs generated by ChatGPT for the asin function

T1: chatGPT inferring 10 MRs without access to SRS pdf


MR# MR Title MR Formula MR Text Self My
Validation Validation
1 Input Negation asin(-x) = -asin(x) The arcsine of the negation of x should be the negation of the 1 1
arcsine of x.
2 Symmetry asin(x) + asin(-x) = 0 The sum of the arcsine of x and the arcsine of its negation 1 1
should be zero.
3 Complement asin(x) + asin(sqrt(1-x^2)) = ±pi/2 The sum of the arcsine of x and the arcsine of the square root 1 0
of 1 minus x squared should be pi/2 or -pi/2.
4 Scaling asin(a*x) = a*asin(x) for a in [-1,1] Scaling x by a factor within [-1, 1] should scale the arcsine 0 0
output by the same factor.
5 Invariance under Transformation asin(f(x)) = asin(x) for f(x) = x Applying a transformation that maps x to itself should not 1 0
change the output of asin.
6 Additive Inverse asin(x) - asin(-x) = 0 The difference between the arcsine of x and the arcsine of its 1 0
negation should be zero.
7 Permutation Invariance asin(permute(x)) = asin(x) Permuting the digits of x should not affect the output of asin. 0 0
8 Idempotence asin(asin(x)) = asin(x) Applying asin twice with any transformation in between that 0 0
maps the output back to the input domain should yield the
same result as applying asin once.
9 Boundary Values asin(1) = pi/2, asin(-1) = -pi/2 The arcsine of 1 should always return pi/2 and the arcsine of -1 1 0
should always return -pi/2.
10 Precision Test asin(x+dx) - asin(x) for small dx Small changes in x near the boundaries should result in propor- 0 0
near ±1 tionally large changes in the output.
6 2
same or similar MRs marked
T2: chatGPT inferring 10 MRs with access to the SRS pdf
MR# MR Title MR Formula MR Text Self My
Validation Validation
1 Additive Inverse Input asin(-x) = -asin(x) If x is a valid input for asin(x), then asin(-x) should produce 1 1
the negative of asin(x).
2 Symmetry asin(x) + asin(-x) = 0 For any valid x, the sum of asin(x) and asin(-x) should be zero. 1 1
3 Complementary Angles asin(x) + acos(x) = pi/2 For any valid x, the sum of asin(x) and acos(x) should equal 1 1
pi/2.

4 Scaling asin(ax) = a * asin(x) For any valid x and a scaling factor a such that ax is valid, this 0 0
relation checks if asin(ax) equals a times asin(x).
5 Perturbation asin(x + e) ≈ asin(x) + For a small perturbation e, where x + e is valid, asin(x + e) 1 0
e/cos(asin(x)) should be approximately asin(x) plus e divided by the cosine of
asin(x).
6 Double Angle 2 * asin(x) ≠ asin(2x) For any valid x, this relation checks if doubling asin(x) equals 0 0
asin of double x.
7 Half Angle asin(x/2) ≠ 1/2 * asin(x) For any valid x, this relation checks if asin of half x equals half 0 0
of asin(x).
8 Reciprocal Relation asin(1/x) ≠ asin(x) For any valid x, this relation checks if asin of reciprocal x equals 0 0
asin(x).
9 Inverse Function Composition sin(asin(x)) = x For any valid x, the sine of asin(x) should return x. 1 1
10 Output Range Verification -pi/2 ≤ asin(x) ≤ pi/2 For any valid x, asin(x) should be within the range from -pi/2 1 1
to pi/2.
6 5

T3: chatGPT inferring 10 MRs, function description provided through SRS pdf
MR# MR Title MR Formula MR Text Self My
Validation Validation
MR1 Symmetry asin(-x) = -asin(x) The arcsine function is odd, meaning it is symmetric about the 1 1
origin.
MR2 NaN Input asin(NaN) = NaN If the input is NaN, the output should also be NaN. 1 0
MR3 Range Limit asin(x) where x not in [-1,1] = NaN The arcsine function is only defined for inputs between -1 and 1 0
1, inclusive.
MR4 Identity asin(0) = 0 The arcsine of 0 is 0. 1 0
MR5 Inverse Function sin(asin(x)) = x for x in [-1,1] The arcsine function is the inverse of the sine function within 1 1
the range [-1,1].
MR6 Derivative Sign asin’(x) > 0 for x in (-1,1) The derivative of the arcsine function is positive within its 1 0
domain, indicating it is an increasing function.
MR7 Limit at Domain Ends lim(x->1-) asin(x) = pi/2 As x approaches 1 from the left, asin(x) approaches pi/2. 1 0
MR8 Limit at Domain Ends lim(x->-1+) asin(x) = -pi/2 As x approaches -1 from the right, asin(x) approaches -pi/2. 1 0
MR9 Continuity asin(x) is continuous for x in [-1,1] The arcsine function is continuous over its domain. 1 0
MR10 Bounded Output |asin(x)| <= pi/2 for all x in [-1,1] The output of the arcsine function is always between -pi/2 and 1 0
pi/2, inclusive.

10 2

Table A.2: MR statistics for the asin function

T1: chatGPT inferring 10 MRs without ac- Self + My Valida- MR same/simi- MR valid and
cess to SRS pdf tion lar in not in
MR# Self My My Validation MR Invalid Invalid MR, 00 01 10 11 T2 T3 T2 T3
Validation Validation Reasoning Valid MR Contradicting
& Easy Fix SRS
Uniqe
1 1 1 1 0 - 0 0 0 1 1 1 0 0
2 1 1 rearranging MR1 0 0 - 0 0 0 1 1 0 0 1
3 1 0 valid only for x > 0 0 1 - 0 0 1 0 0 0 0 0
4 0 0 not true 0 0 - 1 0 0 0 1 0 0 0
5 1 0 argument notation 0 0 - 0 0 1 0 0 0 0 0
6 1 0 not true 0 0 - 0 0 1 0 0 0 0 0
7 0 0 not true 0 0 - 1 0 0 0 0 0 0 0
8 0 0 not true 0 0 - 1 0 0 0 0 0 0 0
9 1 0 edge case identity 0 0 - 0 0 1 0 0 0 0 0
10 0 0 hallucination 0 0 - 1 0 0 0 0 0 0 0
6 2 1 1 - 4 0 4 2 0 1

T2: chatGPT inferring 10 MRs with access Self + My Valida- MR same/simi- MR valid and
to the SRS pdf tion lar not in
MR# Self My My Validation MR Invalid Invalid MR, 00 01 10 11 T1 T3 T1 T3
Validation Validation Reasoning Valid MR Contradicting
& Easy Fix SRS
Unique
1 1 1 1 0 0 0 0 0 1 1 0 0 1
2 1 1 rearranging MR1 0 0 0 0 0 0 1 1 0 0 1
3 1 1 1 0 0 0 0 0 1 0 0 1 1
4 0 0 not true 0 0 0 1 0 0 0 1 0 0 0
5 1 0 not sure 0 0 0 0 0 1 0 0 0 0 0
6 0 0 not correct for x = 0 0 0 0 1 0 0 0 0 0 0 0
7 0 0 not correct for x = 0 0 0 0 1 0 0 0 0 0 0 0
8 0 0 not correct for x = 1 0 0 0 1 0 0 0 0 0 0 0
9 1 1 1 0 0 0 0 0 1 0 0 1 1
10 1 1 1 0 0 0 0 0 1 0 1 1 0
6 5 4 0 0 4 0 1 5 3 4

T3: chatGPT inferring 10 MRs, function Self + My Valida- MR same/simi- MR valid and
description provided through SRS pdf tion lar not in
MR# Self My My Validation MR Invalid Invalid MR, 00 01 10 11 T1 T2 T1 T2
Validation Validation Reasoning Valid MR Contradicting
& Easy Fix SRS
Unique
MR1 1 1 1 0 0 0 0 0 1 1 0 0 1
MR2 1 0 edge case identity 0 0 0 0 0 1 0 0 0 0 0
MR3 1 0 range property 0 0 0 0 0 1 0 0 0 0 0
MR4 1 0 edge case identity 0 0 0 0 0 1 0 0 0 0 0
MR5 1 1 1 0 0 0 0 0 1 0 0 1 1
MR6 1 0 no derivative func av. 0 1 0 0 0 1 0 0 0 0 0
MR7 1 0 no lim func available 0 0 0 0 0 1 0 0 0 0 0
MR8 1 0 no lim func available 0 0 0 0 0 1 0 0 0 0 0
MR9 1 0 no cont. func avail. 0 0 0 0 0 1 0 0 0 0 0
MR10 1 0 range property 0 0 0 0 0 1 0 0 1 0 0
10 2 2 1 0 0 0 8 2 1 2

Table A.3: MRs generated by ChatGPT for the exp function

T1: chatGPT inferring 10 MRs without access to SRS pdf


MR# MR Title MR Formula MR Text Self My
Validation Validation
MR1 Positive Shift exp(x + a) = exp(x) * exp(a) If a positive constant ’a’ is added to ’x’, the output should be the 1 1
original output multiplied by exp(a).
MR2 Negative Shift exp(x - a) = exp(x) / exp(a) If a positive constant ’a’ is subtracted from ’x’, the output should be 1 1
the original output divided by exp(a).
MR3 Doubling exp(2x) = exp(x) * exp(x) Doubling the input ’x’ should result in the output being the square of 1 1
the original output.
MR4 Halving exp(x/2) * exp(x/2) = exp(x) Halving the input ’x’ and calculating exp for each half, then multiplying 1 1
the results should yield the original output.
MR5 Scaling by Negative One exp(-x) = 1 / exp(x) If the input ’x’ is negated, the output should be the reciprocal of the 1 1
original output.
MR6 Zero Input exp(0) = 1 Regardless of the original input, if the input is zero, the output should 1 0
always be 1.
MR7 Summation Invariance exp(x + y) = exp(x) * exp(y) The exponential of the sum of two numbers should be the product of 1 1
the exponentials of each number.
MR8 Multiplicative Scaling exp(k*x) = exp(x)^k Scaling the input ’x’ by a constant ’k’ should result in the output being 1 1
raised to the power of ’k’.
MR9 Invariance to Multiples of exp(x + i*pi) = -exp(x) Adding any integer multiple of i*pi (where i is the imaginary unit) to 0 0
i*pi ’x’ should result in the negation of the original output.
MR10 Periodicity with 2i*pi exp(x + 2i*pi) = exp(x) Adding any integer multiple of 2i*pi (where i is the imaginary unit) to 0 0
’x’ should result in the same output as the original.
8 7
same or similar MRs marked
T2: chatGPT inferring 10 MRs with access to the SRS pdf
MR# MR Title MR Formula MR Text Self My
Validation Validation
1 Additive Identity exp(x + 0) = exp(x) Adding zero to the input should not change the output. 1 0
2 Multiplicative Identity exp(x) * exp(0) = exp(x) Multiplying the output by the exponential of zero (which is 1) should 1 0
not change the output.
3 Scaling exp(x) * exp(y) = exp(x + y) The product of the exponentials of two numbers is the exponential of 1 1
their sum.
4 Inverse Scaling exp(x) / exp(y) = exp(x - y) Dividing the exponential of one number by the exponential of another 1 1
gives the exponential of their difference.
5 Negation exp(-x) = 1 / exp(x) The exponential of the negation of a number is the reciprocal of the 1 1
exponential of the number.
6 Doubling exp(2x) = exp(x) * exp(x) Doubling the input should result in the square of the original output. 1 1
7 Halving exp(x/2) * exp(x/2) = exp(x) Halving the input and then using the result twice as factors should yield 1 1
the original output.
8 Translation Invariance exp(x + c) / exp(c) = exp(x) Translating the input by a constant and dividing by the exponential of 1 1
that constant should yield the original output.
9 Periodicity with Imaginary exp(x + 2pi i) = exp(x) Adding 2pi i (where i is the imaginary unit) to the input should not 0 0
Unit change the output due to the periodicity of the exponential function in
the complex plane.
10 Scaling by Rational Num- exp(qx) = exp(x)^q Scaling the input by a rational number q should result in the output 1 1
ber raised to the power of q.
9 7

T3: chatGPT inferring 10 MRs, function description provided through SRS pdf
MR# MR Title MR Formula MR Text Val My
Validation
MR1 Additive Identity exp(x + 0) = exp(x) Adding zero to the input should not change the output of the exp 1 0
function.
MR2 Multiplicative Identity exp(x) * exp(0) = exp(x) Multiplying the output of exp(x) by exp(0), which is 1, should not 1 0
change the result.
MR3 Double Input exp(2*x) = exp(x) * exp(x) Doubling the input x should be equivalent to squaring the output of 1 1
exp(x).
MR4 Negative Input exp(-x) = 1/exp(x) The output for a negative input should be the reciprocal of the output 1 1
for the positive input.
MR5 Sum of Inputs exp(x + y) = exp(x) * exp(y) The exponential of the sum of two numbers should be the product of 1 1
the exponentials of the individual numbers.
MR6 Input Scaling by Integer exp(n*x) = exp(x)^n, for any Scaling the input by an integer n should be equivalent to raising the 1 1
integer n output to the power n.
MR7 Zero Input exp(0) = 1 The output of exp when the input is zero should always be 1. 1 0
MR8 NaN Input exp(NaN) = NaN If the input is NaN, the output should also be NaN. 1 0
MR9 Infinity Input exp(Inf) = Inf If the input is positive infinity, the output should be positive infinity. 1 0
MR10 Negative Infinity Input exp(-Inf) = 0 If the input is negative infinity, the output should be zero. 1 0
10 4

Table A.4: MR statistics for the exp function

T1: chatGPT inferring 10 MRs without access to Self + My Valida- MR same/simi- MR valid and
SRS pdf tion lar in not in
MR# Self My My Validation MR Invalid Invalid MR, 00 01 10 11 T2 T3 T2 T3
Validation Validation Reasoning Valid MR Contradicting
& Easy Fix SRS
Uniqe
MR1 1 1 1 0 - 0 0 0 1 1 1 0 0
MR2 1 1 1 0 - 0 0 0 1 1 0 0 1
MR3 1 1 1 0 - 0 0 0 1 1 1 0 0
MR4 1 1 1 0 - 0 0 0 1 1 0 0 1
MR5 1 1 1 0 - 0 0 0 1 1 1 0 0
MR6 1 0 edge case identity 0 0 - 0 0 1 0 0 1 0 0
MR7 1 1 1 0 - 0 0 0 1 1 1 0 0
MR8 1 1 1 0 - 0 0 0 1 1 1 0 0
MR9 0 0 complex numbers not sup- 0 0 - 1 0 0 0 0 0 0 0
ported
MR10 0 0 complex numbers not sup- 0 0 - 1 0 0 0 1 0 0 0
ported
8 7 7 0 - 2 0 1 7 0 2

T2: chatGPT inferring 10 MRs with access to the Self + My Valida- MR same/simi- MR valid and
SRS pdf tion lar not in
MR# Self My My Validation MR Invalid Invalid MR, 00 01 10 11 T1 T3 T1 T3
Validation Validation Reasoning Valid MR Contradicting
& Easy Fix SRS
Unique
1 1 0 testing x + 0 == x 0 0 0 0 0 1 0 0 1 0 0
2 1 0 testing exp(0) == 1 0 0 0 0 0 1 0 0 1 0 0
3 1 1 1 0 0 0 0 0 1 1 1 0 0
4 1 1 1 0 0 0 0 0 1 1 0 0 1
5 1 1 1 0 0 0 0 0 1 1 1 0 0
6 1 1 1 0 0 0 0 0 1 1 1 0 0
7 1 1 1 0 0 0 0 0 1 1 0 0 1
8 1 1 1 0 0 0 0 0 1 0 0 1 1
9 0 0 complex numbers not sup- 0 0 0 1 0 0 0 1 0 0 0
ported
10 1 1 1 0 0 0 0 0 1 1 1 0 0
9 7 7 0 0 1 0 2 7 1 3

T3: chatGPT inferring 10 MRs, function descrip- Self + My Valida- MR same/simi- MR valid and
tion provided through SRS pdf tion lar not in
MR# Val My My Validation MR Invalid Invalid MR, 00 01 10 11 T1 T2 T1 T2
Validation Reasoning Valid MR Contradicting
& Easy Fix SRS
Unique
MR1 1 0 testing x + 0 == x 0 0 0 0 0 1 0 0 1 0 0
MR2 1 0 testing exp(0) == 1 0 0 0 0 0 1 0 0 1 0 0
MR3 1 1 1 0 0 0 0 0 1 1 1 0 0
MR4 1 1 1 0 0 0 0 0 1 1 1 0 0
MR5 1 1 1 0 0 0 0 0 1 1 1 0 0
MR6 1 1 1 0 0 0 0 0 1 1 1 0 0
MR7 1 0 edge case identity 0 0 0 0 0 1 0 1 0 0 0
MR8 1 0 edge case identity 0 0 0 0 0 1 0 0 0 0 0
MR9 1 0 edge case identity 0 0 0 0 0 1 0 0 0 0 0
MR10 1 0 edge case identity 0 0 0 0 0 1 0 0 0 0 0
10 4 4 0 0 0 0 6 4 0 0

Table A.5: MRs generated by ChatGPT for the fmin function

T1: chatGPT inferring 10 MRs without access to SRS pdf


MR# MR Title MR Formula MR Text Self My
Validation Validation
MR1 Swap Arguments fmin(x, y) = fmin(y, x) The output remains the same when the arguments are swapped. 1 1
MR2 Addition of Constants fmin(x, y) = fmin(x + c, y + c) - c Adding the same constant to both arguments and subtracting it from the result 1 1
should yield the original output.
MR3 Multiplication by Positive Con- fmin(x, y) = fmin(x * c, y * c) / c for c Multiplying both arguments by a positive constant and dividing the result by that 1 1
stant >0 constant should yield the original output.
MR4 Negation fmin(x, y) = -fmax(-x, -y) The minimum of x and y is the negation of the maximum of -x and -y. 1 1
MR5 Increasing Both Arguments fmin(x, y) <= fmin(x + c, y + c) for c > Increasing both arguments by a positive constant should not decrease the minimum. 1 1
0
MR6 Decreasing Both Arguments fmin(x, y) >= fmin(x - c, y - c) for c > 0 Decreasing both arguments by a positive constant should not increase the minimum. 1 1
MR7 Permutation Invariance fmin(x, y) = fmin(y, x) The output should be invariant under permutation of the arguments. 1 1
MR8 Identity Element fmin(x, x) = x The minimum of a number with itself should be the number itself. 1 1
MR9 Transitivity fmin(x, fmin(y, z)) = fmin(fmin(x, y), z) The function should be transitive with respect to its inputs. 1 1
MR10 Impact of Infinity fmin(x, inf) = x The minimum of any number and infinity should be the number itself. 1 1
10 10
same or similar MRs marked
T2: chatGPT inferring 10 MRs with access to the SRS pdf
MR# MR Title MR Formula MR Text Self My
Validation Validation
1 Negation Invariance fmin(x, y) = fmin(-x, -y) If both x and y are negated, the output of fmin should remain the same. 1 0
2 Order Invariance fmin(x, y) = fmin(y, x) The order of arguments should not affect the output of fmin. 1 1
3 Scalar Multiplication fmin(x, y) = k * fmin(x/k, y/k) for k != Multiplying both arguments by a scalar k and then dividing the result by k should 1 0
0 yield the original output.
4 Additive Invariance fmin(x, y) = fmin(x + z, y + z) - z Adding the same value z to both arguments and then subtracting z from the result 1 1
should yield the original output.
5 Identity Element fmin(x, x) = x If both arguments are the same, the output should be that value. 1 1
6 Transitive Increase fmin(x, y) < fmin(x, z) if y < z Increasing one argument while keeping the other constant should not decrease the 0 0
output.
7 Zero Element fmin(x, 0) = 0 if x >= 0 If one of the arguments is zero and the other is non-negative, the output should 1 1
be zero.
8 Infinity Handling fmin(x, inf) = x If one of the arguments is infinity, the output should be the other argument. 1 1
9 Negative Infinity Handling fmin(x, -inf) = -inf If one of the arguments is negative infinity, the output should be negative infinity. 1 1
10 Absolute Value Relation fmin(x, y) = -fmax(-x, -y) The minimum of x and y should be the negative of the maximum of -x and -y. 1 1
9 7

T3: chatGPT inferring 10 MRs, function description provided through SRS pdf
MR# MR Title MR Formula MR Text Val My
Validation
MR1 Identity Relation fmin(x, x) = x If both arguments are the same, the function should return that value. 1 1
MR2 Commutativity fmin(x, y) = fmin(y, x) The order of arguments should not affect the result. 1 1
MR3 NaN Propagation fmin(NaN, x) = NaN and fmin(x, NaN) If any argument is NaN, the result should be NaN. 1 0
= NaN
MR4 Negative Zero fmin(-0, +0) = -0 and fmin(+0, -0) = -0 The function should return -0 if any argument is -0. 1 0
MR5 Infinity Handling fmin(Inf, x) = x and fmin(x, Inf) = x If one argument is infinity, the result should be the other argument. 1 0
MR6 Negative Infinity fmin(-Inf, x) = -Inf and fmin(x, -Inf) = If one argument is negative infinity, the result should be negative infinity. 1 0
-Inf
MR7 Scaling fmin(a*x, a*y) = a*fmin(x, y) for a > 0 Scaling both arguments by a positive number should scale the result. 1 1
MR8 Translation Invariance fmin(x+c, y+c) = fmin(x, y) + c Adding a constant to both arguments should add the constant to the result. 1 1
MR9 Inequality Preservation If x < y then fmin(x, z) <= fmin(y, z) The function should preserve the order of numbers. 1 1
MR10 Zero Identity fmin(x, 0) = 0 if x ≥ 0 If one argument is zero and the other is non-negative, the result should be zero. 1 0
10 5

Table A.6: MR statistics for the fmin function

T1: chatGPT inferring 10 MRs without Self + My Valida- MR same/simi- MR valid and
access to SRS pdf tion lar in not in
MR# Self My My Validation MR Invalid Invalid MR, 00 01 10 11 T2 T3 T2 T3
Validation Validation Reasoning Valid MR Contradicting
& Easy Fix SRS
Unique
MR1 1 1 1 0 - 0 0 0 1 1 1 0 0
MR2 1 1 1 0 - 0 0 0 1 1 0 0 1
MR3 1 1 1 0 - 0 0 0 1 0 0 1 1
MR4 1 1 1 0 - 0 0 0 1 1 0 0 1
MR5 1 1 1 0 - 0 0 0 1 0 0 1 1
MR6 1 1 1 0 - 0 0 0 1 0 0 1 1
MR7 1 1 same as MR1 0 0 - 0 0 0 1 1 1 0 0
MR8 1 1 1 0 - 0 0 0 1 1 1 0 0
MR9 1 1 1 0 - 0 0 0 1 0 0 1 1
MR10 1 1 1 0 - 0 0 0 1 0 0 1 1
10 10 9 0 - 0 0 0 10 5 7

T2: chatGPT inferring 10 MRs with access Self + My Valida- MR same/simi- MR valid and
to the SRS pdf tion lar not in
MR# Self My My Validation MR Invalid Invalid MR, 00 01 10 11 T1 T3 T1 T3
Validation Validation Reasoning Valid MR Contradicting
& Easy Fix SRS
Unique
1 1 0 not true 0 0 0 0 0 1 0 0 0 0 0
2 1 1 1 0 0 0 0 0 1 1 1 0 0
3 1 0 only valid for k>0 0 1 0 0 0 1 0 0 0 0 0
4 1 1 1 0 0 0 0 0 1 1 0 0 1
5 1 1 1 0 0 0 0 0 1 1 1 0 0
6 0 0 instead of < use 0 1 0 1 0 0 0 0 1 0 0
<=
7 1 1 1 0 0 0 0 0 1 0 0 1 1
8 1 1 1 0 0 0 0 0 1 0 0 1 1
9 1 1 1 0 0 0 0 0 1 0 0 1 1
10 1 1 1 0 0 0 0 0 1 1 0 0 1
9 7 7 2 0 1 0 2 7 3 5

T3: chatGPT inferring 10 MRs, function Self + My Valida- MR same/simi- MR valid and
description provided through SRS pdf tion lar not in
MR# Val My My Validation MR Invalid Invalid MR, 00 01 10 11 T1 T2 T1 T2
Validation Reasoning Valid MR Contradicting
& Easy Fix SRS
Unique
MR1 1 1 1 0 0 0 0 0 1 1 1 0 0
MR2 1 1 1 0 0 0 0 0 1 1 1 0 0
MR3 1 0 result=x in SRS 0 0 1 0 0 1 0 0 0 0 0
doc
MR4 1 0 result=y in SRS 0 0 1 0 0 1 0 0 0 0 0
doc
MR5 1 0 edge case identity 0 0 0 0 0 1 0 0 0 0 0
MR6 1 0 edge case identity 0 0 0 0 0 1 0 0 0 0 0
MR7 1 1 1 0 0 0 0 0 1 0 0 1 1
MR8 1 1 1 0 0 0 0 0 1 0 0 1 1
MR9 1 1 1 0 0 0 0 0 1 0 1 1 0
MR10 1 0 edge case identity 0 0 0 0 0 1 0 0 0 0 0
10 5 5 0 2 0 0 5 5 3 2

Table A.7: MRs generated by ChatGPT for the fmod function


T1: chatGPT inferring 10 MRs without access to SRS pdf
MR# MR Title MR Formula MR Text Self My
Validation Validation
1 Sign Preservation fmod(x, y) = -fmod(-x, y) The sign of the result should be the same as the sign of x. 0 1
2 Scaling X fmod(k*x, y) = k*fmod(x, y) for k > 0 Scaling x by a positive factor k should scale the remainder by k. 0 0
3 Scaling Y fmod(x, k*y) = fmod(x, y) for k > 0 Scaling y by a positive factor k should not change the remainder. 1 0
4 Additive Invariance in X fmod(x + k*y, y) = fmod(x, y) Adding a multiple of y to x should not change the remainder. 1 1
5 Symmetry fmod(x, y) = -fmod(x, -y) Changing the sign of y should negate the remainder. 1 0
6 Zero Identity in X fmod(0, y) = 0 The remainder of zero divided by any number should be zero. 1 1
7 Identity in Y fmod(x, x) = 0 The remainder of a number divided by itself should be zero. 1 1
8 Commutativity in Sign fmod(x, y) = fmod(-x, -y) Negating both x and y should not change the remainder. 0 0
9 Transitivity fmod(fmod(x, y), y) = fmod(x, y) The remainder of a remainder operation should be the same as the initial remainder. 1 1
10 Invariance under Y Addi- fmod(x, y + k*x) = fmod(x, y) Adding a multiple of x to y should not change the remainder. 1 0
tion
7 5
same or similar MRs marked
T2: chatGPT inferring 10 MRs with access to the SRS pdf
MR# MR Title MR Formula MR Text Self My
Validation Validation
1 Sign Preservation fmod(x, y) = -fmod(-x, y) The sign of the result should be the same as the sign of x. 1 1
2 Scaling Invariance fmod(k*x, k*y) = k*fmod(x, y) for k != 0 Scaling both operands by the same non-zero factor k should scale the result by k. 0 1
3 Zero Numerator fmod(0, y) = 0 If x is 0, the result should always be 0 regardless of y. 1 1
4 Identity Operand fmod(x, x) = 0 If y is equal to x, the result should be 0. 1 1
5 Negation Symmetry fmod(x, y) = -fmod(x, -y) Negating y should negate the result. 1 0
6 Double Operand fmod(x, 2*y) = fmod(fmod(x, y), 2*y) Applying the operation with operand 2*y should yield the same result as applying it 0 0
first with y and then with 2*y.
7 Modulus Reduction fmod(x, y) = fmod(x, y + k*y) for any Adding any multiple of y to y should not change the result. 0 0
integer k
8 Commutativity of Addi- fmod(x + k*y, y) = fmod(x, y) for any Adding multiples of y to x should not affect the result. 1 1
tion integer k
9 Invariance under Subtrac- fmod(x - k*y, y) = fmod(x, y) for any inte- Subtracting multiples of y from x should not affect the result. 1 1
tion ger k
10 Absolute Scale Invariance fmod(k*x, y) = fmod(x, y) for k > 0 Multiplying x by a positive scalar k should not change the result. 0 0
6 6

T3: chatGPT inferring 10 MRs, function description provided through SRS pdf
MR# MR Title MR Formula MR Text Val My
Validation
MR1 Sign Preservation fmod(x, y) = fmod(x, -y) The sign of the result is the same regardless of the sign of y, as long as x remains the 1 1
same.
MR2 NaN Propagation fmod(NaN, y) = NaN If x is NaN, the result is always NaN, regardless of y. 1 0
MR3 Zero Divisor fmod(x, 0) = NaN If y is zero, the result is always NaN, regardless of x. 1 0
MR4 Infinity Propagation fmod(±Inf, y) = NaN If x is ±Infinity, the result is always NaN, regardless of y. 1 0
MR5 Zero Numerator fmod(±0, y) = ±0 If x is ±0 (and y is not zero), the result is ±0. 1 0
MR6 Infinity Divisor fmod(x, ±Inf) = x If y is ±Infinity (and x is not ±Infinity), the result is x. 1 1
MR7 Scaling Invariance fmod(k*x, k*y) = k*fmod(x, y) Scaling both arguments by the same non-zero factor k does not change the relationship. 0 0
MR8 Additive Invariance fmod(x + n*y, y) = fmod(x, y) Adding any integer multiple of y to x does not change the result. 1 1
MR9 Symmetry fmod(-x, y) = -fmod(x, y) Negating x negates the result. 1 1
MR10 Idempotence fmod(fmod(x, y), y) = fmod(x, y) Applying fmod twice with the same divisor y yields the same result as applying it once. 1 1
9 5

Table A.8: MR statistics for the fmod function

T1: chatGPT inferring 10 MRs without ac- Self + My Valida- MR same/simi- MR valid and
cess to SRS pdf tion lar in not in
MR# Self My My Validation MR Invalid Invalid MR, 00 01 10 11 T2 T3 T2 T3
Validation Validation Reasoning Valid MR Contradicting
& Easy Fix SRS
Unique
1 0 1 1 0 - 0 1 0 0 1 1 0 0
2 0 0 not always true 0 0 - 1 0 0 0 0 0 0 0
3 1 0 not always true 0 0 - 0 0 1 0 0 0 0 0
4 1 1 1 0 - 0 0 0 1 1 0 0 1
5 1 0 not true in ISO C99 0 0 - 0 0 1 0 1 0 0 0
6 1 1 1 0 - 0 0 0 1 1 0 0 1
7 1 1 1 0 - 0 0 0 1 1 0 0 1
8 0 0 not true in ISO C99 0 0 - 1 0 0 0 0 0 0 0
9 1 1 1 0 - 0 0 0 1 0 1 1 0
10 1 0 not true 0 0 - 0 0 1 0 0 0 0 0
7 5 5 0 - 2 1 3 4 1 3

T2: chatGPT inferring 10 MRs with access Self + My Valida- MR same/simi- MR valid and
to the SRS pdf tion lar not in
MR# Self My My Validation MR Invalid Invalid MR, 00 01 10 11 T1 T3 T1 T3
Validation Validation Reasoning Valid MR Contradicting
& Easy Fix SRS
Unique
1 1 1 1 0 0 0 0 0 1 1 1 0 0
2 0 1 1 0 0 0 1 0 0 0 0 1 1
3 1 1 1 1 0 0 0 0 1 1 0 0 1
4 1 1 1 0 0 0 0 0 1 1 0 0 1
5 1 0 not true in ISO C99 0 0 1 0 0 1 0 1 0 0 0
6 0 0 not always true 0 1 0 1 0 0 0 0 0 0 0
7 0 0 not always true 0 0 0 1 0 0 0 0 0 0 0
8 1 1 1 0 0 0 0 0 1 1 0 0 1
9 1 1 1 0 0 0 0 0 1 0 0 1 1
10 0 0 not true 0 0 0 1 0 0 0 0 0 0 0
6 6 6 2 1 3 1 1 5 2 5

T3: chatGPT inferring 10 MRs, function Self + My Valida- MR same/simi- MR valid and
description provided through SRS pdf tion lar not in
MR# Val My My Validation MR Invalid Invalid MR, 00 01 10 11 T1 T2 T1 T2
Validation Reasoning Valid MR Contradicting
& Easy Fix SRS
Unique
MR1 1 1 1 0 0 0 0 0 1 0 0 1 1
MR2 1 0 edge case identity 0 0 0 0 0 1 0 0 0 0 0
MR3 1 0 edge case identity 0 0 1 0 0 1 0 0 0 0 0
MR4 1 0 edge case identity 0 0 1 0 0 1 0 0 0 0 0
MR5 1 0 edge case identity 0 0 0 0 0 1 0 0 0 0 0
MR6 1 1 1 0 0 0 0 0 1 0 0 1 1
MR7 0 0 valid for k>0 0 1 0 1 0 0 0 0 0 0 0
MR8 1 1 1 0 0 0 0 0 1 0 0 1 1
MR9 1 1 1 0 0 0 0 0 1 1 1 0 0
MR10 1 1 1 0 0 0 0 0 1 1 0 0 1
9 5 5 1 2 1 0 4 5 3 4

Table A.9: MRs generated by ChatGPT for the hypot function


T1: chatGPT inferring 10 MRs without access to SRS pdf
MR# MR Title MR Formula MR Text Self My
Validation Validation
MR1 Scaling hypot(k*x, k*y) = k*hypot(x, y) Scaling both inputs by a constant factor k should scale the output by the same 1 0
factor.
MR2 Permutation hypot(x, y) = hypot(y, x) The function should be commutative; swapping inputs should not change the 1 1
output.
MR3 Identity hypot(x, 0) = |x| If one side of the triangle is zero, the hypotenuse should be the absolute value of 1 1
the other side.
MR4 Pythagorean Triple hypot(3, 4) = 5 For known Pythagorean triples, the function should return the correct hypotenuse. 1 0
MR5 Negative Input hypot(-x, -y) = hypot(x, y) The function should return the same result for negative inputs as for positive ones. 1 1
MR6 Zero Input hypot(0, 0) = 0 If both inputs are zero, the output should also be zero. 1 0
MR7 Large Values hypot(x, y) when x ≫ y should be approximately When one side is much larger than the other, the hypotenuse should be approxi- 1 1
|x| mately equal to the larger side.
MR8 Additive Increase hypot(x+a, y+b) > hypot(x, y) Increasing the lengths of both sides should result in a larger hypotenuse. 1 0
MR9 Fractional Scaling hypot(x/k, y/k) = (1/k)*hypot(x, y) Scaling both inputs down by a constant factor k should scale the output down by 1 0
the same factor.
MR10 Square Relation hypot(x, x) = x*sqrt(2) If both sides are equal, the hypotenuse should be the side length multiplied by 1 1
the square root of two.
10 5
same or similar MRs marked
T2: chatGPT inferring 10 MRs with access to the SRS pdf
MR# MR Title MR Formula MR Text Self My
Validation Validation
1 Scaling hypot(k*x, k*y) = k*hypot(x, y) Scaling both sides of the triangle by a factor k should scale the hypotenuse by the 1 0
same factor.
2 Permutation hypot(x, y) = hypot(y, x) The order of sides does not affect the hypotenuse length. 1 1
3 Negation hypot(-x, -y) = hypot(x, y) Negating both sides does not change the hypotenuse length. 1 1
4 Translation hypot(x+a, y+b) ≠ hypot(x, y) Translating the sides by different amounts generally changes the hypotenuse length. 0 0
5 Rotation 90° hypot(y, -x) = hypot(x, y) Rotating the sides 90 degrees around the origin should not change the hypotenuse 1 1
length.
6 Square and Root hypot(x^2, y^2) ≠ hypot(x, y) Squaring the sides changes the hypotenuse length in a non-trivial way. 0 0
7 Reciprocal hypot(1/x, 1/y) ≠ hypot(x, y) Taking the reciprocal of each side generally changes the hypotenuse length. 0 0
8 Pythagorean Triple hypot(3, 4) = 5 Using Pythagorean triples, the hypotenuse can be directly verified. 1 0
9 Zeroing hypot(x, 0) = |x| If one side is zero, the hypotenuse is the absolute value of the other side. 1 1
10 Doubling hypot(2*x, 2*y) = 2*hypot(x, y) Doubling both sides of the triangle should double the hypotenuse. 1 1
7 5

T3: ChatGPT inferring 10 MRs, function description provided through the SRS PDF
MR# MR Title MR Formula MR Text Val My Validation
MR1 Scaling hypot(k*x, k*y) = k*hypot(x, y) Scaling both inputs by a constant factor k should scale the output by the same factor. 1 1
MR2 Permutation Invariance hypot(x, y) = hypot(y, x) The order of inputs should not affect the output. 1 1
MR3 Identity hypot(x, 0) = |x| If one side of the triangle is zero, the hypotenuse should be the absolute value of the other side. 1 1
MR4 Negative Input hypot(-x, -y) = hypot(x, y) The function should return the same result for negative inputs as for positive inputs. 1 1
MR5 Infinity Input hypot(±Inf, y) = +Inf If one input is infinity, the result should be infinity. 1 0
MR6 NaN Input hypot(NaN, y) = NaN if y is not ±Inf If one input is NaN and the other is not infinity, the result should be NaN. 1 0
MR7 Zero Input hypot(0, 0) = 0 If both inputs are zero, the result should be zero. 1 0
MR8 Pythagorean Triple hypot(3, 4) = 5 For a known Pythagorean triple, the function should return the correct hypotenuse. 1 0
MR9 Large Values hypot(1e10, 1e10) = 1.41421356237e10 For large values, the function should correctly compute the hypotenuse using floating-point arithmetic. 1 0
MR10 Small Values hypot(1e-10, 1e-10) = 1.41421356237e-10 For small values, the function should correctly compute the hypotenuse using floating-point arithmetic. 1 0
Totals: 10 4
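To make these validation decisions concrete, the MR formulas above translate directly into executable checks. The following is a minimal sketch, assuming Python's math.hypot as a stand-in for the MLFS C implementation and a relative tolerance for floating-point comparison; note that MR1 (Scaling) holds only for k > 0 (for negative k the right-hand side becomes negative), which is why it is rejected in the manual validation above.

    import math

    def rel_eq(a, b, tol=1e-12):
        # Floating-point results are rarely bit-identical, so compare with a tolerance.
        return math.isclose(a, b, rel_tol=tol, abs_tol=tol)

    def check_hypot_mrs(x, y, k):
        # MR2 (Permutation): swapping the inputs must not change the output.
        assert rel_eq(math.hypot(x, y), math.hypot(y, x))
        # MR5 (Negative Input): negating both inputs must not change the output.
        assert rel_eq(math.hypot(-x, -y), math.hypot(x, y))
        # MR1 (Scaling): valid only for k > 0, as noted in the validation column.
        if k > 0:
            assert rel_eq(math.hypot(k * x, k * y), k * math.hypot(x, y))

    check_hypot_mrs(3.0, 4.0, 2.5)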

Table A.10: MR statistics for the hypot function

T1: ChatGPT inferring 10 MRs without access to the SRS PDF
MR# | Self Validation | My Validation | My Validation Reasoning | MR Valid & Easy Fix | Invalid MR | Invalid MR, Contradicting SRS, Unique | Self + My Validation: 00 01 10 11 | MR same/similar in: T2 T3 | MR valid and not in: T2 T3
MR1 1 0 valid for k>0 0 1 - 0 0 1 0 1 1 0 0
MR2 1 1 1 0 - 0 0 0 1 1 1 0 0
MR3 1 1 1 0 - 0 0 0 1 1 1 0 0
MR4 1 0 oracle identity 0 0 - 0 0 1 0 0 0 0 0
MR5 1 1 1 0 - 0 0 0 1 1 1 0 0
MR6 1 0 edge case identity 0 0 - 0 0 1 0 0 0 0 0
MR7 1 1 1 0 - 0 0 0 1 0 0 1 1
MR8 1 0 a,b > 0 0 1 - 0 0 1 0 0 0 0 0
MR9 1 0 valid for k>0 0 1 - 0 0 1 0 0 0 0 0
MR10 1 1 1 0 - 0 0 0 1 0 0 1 1
10 5 5 3 - 0 0 5 5 2 2

T2: ChatGPT inferring 10 MRs with access to the SRS PDF
MR# | Self Validation | My Validation | My Validation Reasoning | MR Valid & Easy Fix | Invalid MR | Invalid MR, Contradicting SRS, Unique | Self + My Validation: 00 01 10 11 | MR same/similar in: T1 T3 | MR valid and not in: T1 T3
1 1 0 valid for k>0 0 1 0 0 0 1 0 1 1 0 0
2 1 1 1 0 0 0 0 0 1 1 1 0 0
3 1 1 1 1 0 0 0 0 1 1 1 0 0
4 0 0 not always true 0 0 0 1 0 0 0 0 0 0 0
5 1 1 1 0 1 0 0 0 1 0 0 1 1
6 0 0 not relevant 0 1 0 1 0 0 0 0 0 0 0
7 0 0 not relevant 0 0 0 1 0 0 0 0 0 0 0
8 1 0 oracle identity 0 0 0 0 0 1 0 0 1 0 0
9 1 1 1 0 0 0 0 0 1 1 1 0 0
10 1 1 special case of MR1 0 0 0 0 0 0 1 0 0 1 1
7 5 4 3 1 3 0 2 5 2 2

T3: ChatGPT inferring 10 MRs, function description provided through the SRS PDF
MR# | Val | My Validation | My Validation Reasoning | MR Valid & Easy Fix | Invalid MR | Invalid MR, Contradicting SRS, Unique | Self + My Validation: 00 01 10 11 | MR same/similar in: T1 T2 | MR valid and not in: T1 T2
MR1 1 1 1 0 0 0 0 0 1 1 1 0 0
MR2 1 1 1 0 0 0 0 0 1 1 1 0 0
MR3 1 1 1 0 1 0 0 0 1 1 1 0 0
MR4 1 1 1 0 1 0 0 0 1 1 1 0 0
MR5 1 0 edge case identity 0 0 0 0 0 1 0 0 0 0 0
MR6 1 0 edge case identity 0 0 0 0 0 1 0 0 0 0 0
MR7 1 0 edge case identity 0 1 0 0 0 1 0 0 0 0 0
MR8 1 0 oracle identity 0 0 0 0 0 1 0 0 1 0 0
MR9 1 0 oracle identity 0 0 0 0 0 1 0 0 0 0 0
MR10 1 0 oracle identity 0 0 0 0 0 1 0 0 0 0 0
10 4 4 1 2 0 0 6 4 0 0
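The 00/01/10/11 columns in the statistics above encode the (Self Validation, My Validation) pair of each MR: for example, 10 counts MRs that ChatGPT judged valid but manual validation rejected. A small sketch of how these tallies can be reproduced from the two validation columns (the pairs below are the T1 values from the table):

    from collections import Counter

    # (Self Validation, My Validation) pairs for T1 of the hypot function.
    t1_pairs = [(1, 0), (1, 1), (1, 1), (1, 0), (1, 1),
                (1, 0), (1, 1), (1, 0), (1, 0), (1, 1)]

    tally = Counter(f"{s}{m}" for s, m in t1_pairs)
    for category in ("00", "01", "10", "11"):
        print(category, tally.get(category, 0))  # expected: 00=0, 01=0, 10=5, 11=5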

Table A.11: MRs generated by ChatGPT for the pow function

T1: ChatGPT inferring 10 MRs without access to the SRS PDF
MR# MR Title MR Formula MR Text Self Validation My Validation
1 Positive Exponent pow(x, y) = 1 / pow(x, -y) If y is positive, then pow(x, y) should be equal to the reciprocal of pow(x, -y). 1 1
2 Negative Exponent pow(x, y) * pow(x, -y) = 1 If y is negative, pow(x, y) multiplied by pow(x, -y) should result in 1. 1 1
3 Zero Exponent pow(x, 0) = 1 Regardless of the value of x, raising it to the power of 0 should always result in 1. 1 0
4 Exponent One pow(x, 1) = x Raising any number x to the power of 1 should result in x itself. 1 1
5 Multiplicative Identity pow(x, y) * pow(x, z) = pow(x, y+z) The result of x raised to the power of y multiplied by x raised to the power of z should be equal to x raised to the power of y+z. 1 1
6 Fractional Exponent pow(x, 1/y) * pow(x, 1/y) = pow(x, 2/y) Raising x to the power of 1/y and squaring the result should be equal to x raised to the power of 2/y. 1 1
7 Inverse Exponent pow(x, y) * pow(x, 1/y) = x The product of x raised to y and x raised to 1/y should result in x. 0 0
8 Square and Square Root pow(pow(x, 2), 0.5) = x Taking x to the power of 2 and then taking the square root should return x. 1 1
9 Exponent Distribution pow(x*y, z) = pow(x, z) * pow(y, z) Raising the product of x and y to the power of z should be the same as multiplying x raised to z with y raised to z. 1 1
10 Negative Base Even Exponent pow(-x, 2) = pow(x, 2) Raising a negative x to an even power (e.g., 2) should yield the same result as raising x to that power. 1 1
Totals: 9 8
Note: same or similar MRs are marked.
T2: ChatGPT inferring 10 MRs with access to the SRS PDF
MR# MR Title MR Formula MR Text Self Validation My Validation
1 Identity Exponent pow(x, 1) = x When y is 1, the result should always be x. 1 1
2 Zero Exponent pow(x, 0) = 1 Any number raised to the power of 0 should be 1, except for 0^0, which is undefined or sometimes defined as 1. 1 0
3 Negative Exponent pow(x, -y) = 1 / pow(x, y) Raising a number to a negative power is equivalent to the inverse of the number raised to the corresponding positive power. 1 1
4 Exponent Summation pow(x, y1 + y2) = pow(x, y1) * pow(x, y2) The power of a number raised to the sum of two exponents is the product of the number raised to each exponent. 1 1
5 Base Multiplication pow(x1 * x2, y) = pow(x1, y) * pow(x2, y) Raising a product of numbers to a power is the same as raising each number to the power and then multiplying. 1 1
6 Reciprocal Base pow(1/x, y) = 1 / pow(x, y) Raising the reciprocal of a number to a power is the reciprocal of the number raised to that power. 1 1
7 Exponent Multiplication pow(x, y1 * y2) = pow(pow(x, y1), y2) Raising a number to the product of two exponents is the same as raising the number to one exponent and then raising the result to the second exponent. 1 1
8 Fractional Exponent pow(x, 1/y) = the y-th root of x Raising a number to the reciprocal of an exponent yields the root of the number corresponding to that exponent. 1 1
9 Zero Base Nonzero Exponent pow(0, y) = 0 Zero raised to any positive exponent is zero. 1 0
10 One Base pow(1, y) = 1 One raised to any power is always one. 1 0
Totals: 10 7

T3: ChatGPT inferring 10 MRs, function description provided through the SRS PDF
MR# MR Title MR Formula MR Text Val My Validation
MR1 Identity Transformation pow(x, 1) = x When y is 1, the result should be x itself. 1 1
MR2 Zero Power pow(x, 0) = 1 Any number raised to the power of zero should return 1. 1 0
MR3 Negative Base Even Power pow(-x, 2y) = pow(x, 2y) Raising a negative number to an even power results in the same value as raising the positive counterpart to that power. 1 1
MR4 Negative Base Odd Power pow(-x, 2y+1) = -pow(x, 2y+1) Raising a negative number to an odd power results in the negative of the value obtained by raising the positive counterpart to that power. 1 1
MR5 Reciprocal Power pow(x, -y) = 1/pow(x, y) Raising a number to a negative power is equivalent to the reciprocal of the number raised to the positive of that power. 1 1
MR6 Multiplicative Power pow(x, y) * pow(x, z) = pow(x, y+z) The product of the same base raised to two different powers is equal to the base raised to the sum of the two powers. 1 1
MR7 Power of a Power pow(pow(x, y), z) = pow(x, y*z) Raising a power to another power is the same as raising the base to the product of the exponents. 1 1
MR8 One Base pow(1, y) = 1 Any power of 1 is 1. 1 0
MR9 Zero Base Non-negative Power pow(0, y) = 0 for y > 0 Zero raised to any positive power is zero. 1 0
MR10 Base Inversion pow(1/x, y) = 1/pow(x, y) Raising the reciprocal of a number to a power is the reciprocal of the number raised to that power. 1 1
Totals: 10 7
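As with hypot, the arithmetic identities above are directly executable. A minimal sketch, again assuming Python's math.pow as a stand-in for the MLFS pow and restricting the base to x > 0 so the identities stay within the real domain:

    import math

    def rel_eq(a, b, tol=1e-12):
        return math.isclose(a, b, rel_tol=tol)

    def check_pow_mrs(x, y1, y2):
        # Exponent Summation: x^(y1+y2) = x^y1 * x^y2.
        assert rel_eq(math.pow(x, y1 + y2), math.pow(x, y1) * math.pow(x, y2))
        # Negative Exponent: x^(-y) = 1 / x^y.
        assert rel_eq(math.pow(x, -y1), 1.0 / math.pow(x, y1))
        # Power of a Power: (x^y1)^y2 = x^(y1*y2).
        assert rel_eq(math.pow(math.pow(x, y1), y2), math.pow(x, y1 * y2))

    check_pow_mrs(1.7, 2.3, -0.8)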

Table A.12: MR statistics for the pow function

T1: ChatGPT inferring 10 MRs without access to the SRS PDF
MR# | Self Validation | My Validation | My Validation Reasoning | MR Valid & Easy Fix | Invalid MR | Invalid MR, Contradicting SRS, Unique | Self + My Validation: 00 01 10 11 | MR same/similar in: T2 T3 | MR valid and not in: T2 T3
1 1 1 1 0 - 0 0 0 1 0 0 1 1
2 1 1 1 0 - 0 0 0 1 0 0 1 1
3 1 0 edge case identity 0 0 - 0 0 1 0 1 1 0 0
4 1 1 1 0 - 0 0 0 1 1 1 0 0
5 1 1 1 0 - 0 0 0 1 1 1 0 0
6 1 1 1 0 - 0 0 0 1 0 0 1 1
7 0 0 not true 0 0 - 1 0 0 0 0 0 0 0
8 1 1 1 0 - 0 0 0 1 0 0 1 1
9 1 1 1 0 - 0 0 0 1 1 0 0 1
10 1 1 1 0 - 0 0 0 1 0 0 1 1
9 8 8 0 - 1 0 1 8 5 6

T2: ChatGPT inferring 10 MRs with access to the SRS PDF
MR# | Self Validation | My Validation | My Validation Reasoning | MR Valid & Easy Fix | Invalid MR | Invalid MR, Contradicting SRS, Unique | Self + My Validation: 00 01 10 11 | MR same/similar in: T1 T3 | MR valid and not in: T1 T3
1 1 1 1 0 0 0 0 0 1 1 1 0 0
2 1 0 edge case identity 0 0 0 0 0 1 0 1 1 0 0
3 1 1 1 1 0 0 0 0 1 0 1 1 0
4 1 1 1 0 0 0 0 0 1 1 1 0 0
5 1 1 1 0 0 0 0 0 1 1 0 0 1
6 1 1 1 0 0 0 0 0 1 0 1 1 0
7 1 1 1 0 0 0 0 0 1 0 1 1 0
8 1 1 1 0 0 0 0 0 1 0 0 1 1
9 1 0 in SRS x is +0 or -0 0 0 1 0 0 1 0 0 1 0 0
10 1 0 edge case identity 0 0 0 0 0 1 0 0 1 0 0
10 7 7 1 1 0 0 3 7 4 2

T3: ChatGPT inferring 10 MRs, function description provided through the SRS PDF
MR# | Val | My Validation | My Validation Reasoning | MR Valid & Easy Fix | Invalid MR | Invalid MR, Contradicting SRS, Unique | Self + My Validation: 00 01 10 11 | MR same/similar in: T1 T2 | MR valid and not in: T1 T2
MR1 1 1 1 0 0 0 0 0 1 1 1 0 0
MR2 1 0 edge case identity 0 0 0 0 0 1 0 1 1 0 0
MR3 1 1 1 0 0 0 0 0 1 0 0 1 1
MR4 1 1 1 0 0 0 0 0 1 0 0 1 1
MR5 1 1 1 0 0 0 0 0 1 0 1 1 0
MR6 1 1 1 0 0 0 0 0 1 1 1 0 0
MR7 1 1 1 0 0 0 0 0 1 0 1 1 0
MR8 1 0 edge case identity 0 0 0 0 0 1 0 0 1 0 0
MR9 1 0 in SRS x is +0 or -0 0 0 1 0 0 1 0 0 1 0 0
MR10 1 1 1 0 0 0 0 0 1 0 1 1 0
10 7 7 0 1 0 0 3 7 5 2

Appendix B

PUS Tables

List of tables created in the PUS study.



Table B.1: Results of TC/TM mapping of ST[22] Position-Based Scheduling, when all three documents (PDF, PPTX and JSON) are attached and when only the PDF is attached

pdf, pptx, json


TC TC Description TM TM Description
TC[22,1] Enable the position-based schedule execution function
TC[22,2] Disable the position-based schedule execution function
TC[22,3] Reset the position-based schedule
TC[22,4] Insert activities into the position-based schedule
TC[22,5] Delete position-based scheduled activities identified by request identifier
TC[22,6] Delete the position-based scheduled activities identified by a filter
TC[22,7] Position-shift scheduled activities identified by request identifier
TC[22,8] Position-shift the scheduled activities identified by a filter
TC[22,9] Detail-report position-based scheduled activities identified by request identifier TM[22,10] Position-based schedule detail report
TC[22,11] Detail-report the position-based scheduled activities identified by a filter
TC[22,12] Summary-report position-based scheduled activities identified by request identifier TM[22,13] Position-based schedule summary report
TC[22,14] Summary-report the position-based scheduled activities identified by a filter
TC[22,15] Position-shift all scheduled activities
TC[22,16] Detail-report all position-based scheduled activities TM[22,10] Position-based schedule detail report
TC[22,17] Summary-report all position-based scheduled activities TM[22,13] Position-based schedule summary report
TC[22,18] Report the status of each position-based sub-schedule TM[22,19] Position-based sub-schedule status report
TC[22,20] Enable position-based sub-schedules
TC[22,21] Disable position-based sub-schedules
TC[22,22] Create position-based scheduling groups
TC[22,23] Delete position-based scheduling groups
TC[22,24] Enable position-based scheduling groups
TC[22,25] Disable position-based scheduling groups
TC[22,26] Report the status of each position-based scheduling group TM[22,27] Position-based scheduling group status report
TC[22,28] Set the orbit number

pdf
TC TC Description TM TM Description
TC[22,1] enable the position-based schedule execution function
TC[22,2] disable the position-based schedule execution function
TC[22,3] reset the position-based schedule
TC[22,4] insert activities into the position-based schedule
TC[22,5] delete position-based scheduled activities identified by request identifier
TC[22,6] delete the position-based scheduled activities identified by a filter
TC[22,7] position-shift scheduled activities identified by request identifier
TC[22,8] position-shift the scheduled activities identified by a filter
TC[22,9] detail-report position-based scheduled activities identified by request identifier TM[22,10] position-based schedule detail report
TC[22,11] detail-report the position-based scheduled activities identified by a filter
TC[22,12] summary-report position-based scheduled activities identified by request identifier TM[22,13] position-based schedule summary report
TC[22,14] summary-report the position-based scheduled activities identified by a filter
TC[22,15] position-shift all scheduled activities
TC[22,16] detail-report all position-based scheduled activities TM[22,10] position-based schedule detail report
TC[22,17] summary-report all position-based scheduled activities TM[22,13] position-based schedule summary report
TC[22,18] report the status of each position-based sub-schedule TM[22,19] position-based sub-schedule status report
TC[22,20] enable position-based sub-schedules
TC[22,21] disable position-based sub-schedules
TC[22,22] create position-based scheduling groups
TC[22,23] delete position-based scheduling groups
TC[22,24] enable position-based scheduling groups
TC[22,25] disable position-based scheduling groups
TC[22,26] report the status of each position-based scheduling group TM[22,27] position-based scheduling group status report
TC[22,28] set the orbit number

Table B.2: Results of TC/TM mapping of ST[22] Position-Based Scheduling, when PPTX and JSON documents are attached

pptx
TC TC Description TM TM Description
TC[22,1] enable the position-based schedule execution function
TC[22,2] disable the position-based schedule execution function
TC[22,3] reset the position-based schedule
TC[22,4] insert activities into the position-based schedule
TC[22,5] delete position-based scheduled activities identified by request identifier
TC[22,6] delete the position-based scheduled activities identified by a filter
TC[22,7] position-shift scheduled activities identified by request identifier
TC[22,8] position-shift the scheduled activities identified by a filter
TC[22,9] detail-report position-based scheduled activities identified by request identifier TM[22,10] position-based schedule detail report
TC[22,11] detail-report the position-based scheduled activities identified by a filter TM[22,10] position-based schedule detail report
TC[22,12] summary-report position-based scheduled activities identified by request identifier TM[22,13] position-based schedule summary report
TC[22,14] summary-report the position-based scheduled activities identified by a filter TM[22,13] position-based schedule summary report
TC[22,15] position-shift all scheduled activities
TC[22,16] detail-report all position-based scheduled activities TM[22,10] position-based schedule detail report
TC[22,17] summary-report all position-based scheduled activities TM[22,13] position-based schedule summary report
TC[22,18] report the status of each position-based sub-schedule TM[22,19] position-based sub-schedule status report
TC[22,20] enable position-based sub-schedules
TC[22,21] disable position-based sub-schedules
TC[22,22] create position-based scheduling groups
TC[22,23] delete position-based scheduling groups
TC[22,24] enable position-based scheduling groups
TC[22,25] disable position-based scheduling groups
TC[22,26] report the status of each position-based scheduling group TM[22,27] position-based scheduling group status report
TC[22,28] set the orbit number

json
TC TC Description TM TM Description
TC[22,1] enable the position-based schedule execution function
TC[22,2] disable the position-based schedule execution function
TC[22,3] reset the position-based schedule
TC[22,4] insert activities into the position-based schedule
TC[22,5] delete position-based scheduled activities identified by request identifier
TC[22,6] delete the position-based scheduled activities identified by a filter
TC[22,7] position-shift scheduled activities identified by request identifier
TC[22,9] detail-report position-based scheduled activities identified by request identifier TM[22,10] position-based schedule detail report
TC[22,11] detail-report the position-based scheduled activities identified by a filter TM[22,10] position-based schedule detail report
TC[22,14] summary-report the position-based scheduled activities identified by a filter TM[22,13] position-based schedule summary report
TC[22,15] position-shift all scheduled activities
TC[22,16] detail-report all position-based scheduled activities TM[22,10] position-based schedule detail report
TC[22,17] summary-report all position-based scheduled activities TM[22,13] position-based schedule summary report
TC[22,18] report the status of each position-based sub-schedule TM[22,19] position-based sub-schedule status report
TC[22,22] create position-based scheduling groups
TC[22,23] delete position-based scheduling groups
TC[22,25] disable position-based scheduling groups
TC[22,26] report the status of each position-based scheduling group TM[22,27] position-based scheduling group status report
TC[22,28] set the orbit number

Table B.3: Comparing the TC/TM mapping results of ST[22] Position-Based Scheduling for different document types

pdf, pptx, json pdf pptx json


TC TM TC TM TC TM TC TM
TC[22,1] TC[22,1] TC[22,1] TC[22,1]
TC[22,2] TC[22,2] TC[22,2] TC[22,2]
TC[22,3] TC[22,3] TC[22,3] TC[22,3]
TC[22,4] TC[22,4] TC[22,4] TC[22,4]
TC[22,5] TC[22,5] TC[22,5] TC[22,5]
TC[22,6] TC[22,6] TC[22,6] TC[22,6]
TC[22,7] TC[22,7] TC[22,7] TC[22,7]
TC[22,8] TC[22,8] TC[22,8]
TC[22,9] TM[22,10] TC[22,9] TM[22,10] TC[22,9] TM[22,10] TC[22,9] TM[22,10]
TC[22,11] TC[22,11] TC[22,11] TM[22,10] TC[22,11] TM[22,10]
TC[22,12] TM[22,13] TC[22,12] TM[22,13] TC[22,12] TM[22,13]
TC[22,14] TC[22,14] TC[22,14] TM[22,13] TC[22,14] TM[22,13]
TC[22,15] TC[22,15] TC[22,15] TC[22,15]
TC[22,16] TM[22,10] TC[22,16] TM[22,10] TC[22,16] TM[22,10] TC[22,16] TM[22,10]
TC[22,17] TM[22,13] TC[22,17] TM[22,13] TC[22,17] TM[22,13] TC[22,17] TM[22,13]
TC[22,18] TM[22,19] TC[22,18] TM[22,19] TC[22,18] TM[22,19] TC[22,18] TM[22,19]
TC[22,20] TC[22,20] TC[22,20]
TC[22,21] TC[22,21] TC[22,21]
TC[22,22] TC[22,22] TC[22,22] TC[22,22]
TC[22,23] TC[22,23] TC[22,23] TC[22,23]
TC[22,24] TC[22,24] TC[22,24]
TC[22,25] TC[22,25] TC[22,25] TC[22,25]
TC[22,26] TM[22,27] TC[22,26] TM[22,27] TC[22,26] TM[22,27] TC[22,26] TM[22,27]
TC[22,28] TC[22,28] TC[22,28] TC[22,28]
1 1 0 6 Mistakes
2.1% 2.1% 0.0% 12.5% Mistakes [%]
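The Mistakes rows here (and in the analogous Tables B.6 and B.11) count cells that deviate from the reference mapping. One plausible reading, which reproduces 2.1% as 1 mistake out of 48 cells (24 requests, one TC cell and one TM cell each), is sketched below; the dictionaries and the counting rule are illustrative assumptions, not the exact scoring procedure of the study.

    # Hypothetical reference and observed mappings; None means "no TM for this TC".
    expected = {"TC[22,9]": "TM[22,10]", "TC[22,12]": "TM[22,13]", "TC[22,8]": None}
    observed = {"TC[22,9]": "TM[22,10]", "TC[22,12]": None}  # TC[22,8] missing entirely

    def count_mistakes(expected, observed):
        mistakes = 0
        for tc, tm in expected.items():
            if tc not in observed:
                mistakes += 1  # the TC was not extracted at all
            elif observed[tc] != tm:
                mistakes += 1  # wrong or missing TM mapping
        # Two cells (TC and TM) per request give the denominator.
        return mistakes, 100.0 * mistakes / (2 * len(expected))

    n, pct = count_mistakes(expected, observed)
    print(f"{n} mistakes ({pct:.1f}%)")  # 2 mistakes (33.3%) on this toy example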

Table B.4: Results of TC/TM mapping of ST[23] File Management, when all three documents (PDF, PPTX and JSON) are attached and when only the PDF is attached

pdf, pptx, json


TC TC Description TM TM Description
TC[23,1] create a file
TC[23,2] delete a file
TC[23,3] report the attributes of a file TM[23,4] file attribute report
TC[23,5] lock a file
TC[23,6] unlock a file
TC[23,7] find files TM[23,8] found files report
TC[23,9] create a directory
TC[23,10] delete a directory
TC[23,11] rename a directory
TC[23,12] summary-report the content of a repository TM[23,13] repository content summary report
TC[23,14] copy a file
TC[23,15] move a file
TC[23,16] suspend file copy operations
TC[23,17] resume file copy operations
TC[23,18] abort file copy operations
TC[23,19] suspend all file copy operations involving a repository path
TC[23,20] resume all file copy operations involving a repository path
TC[23,21] abort all file copy operations involving a repository path
TC[23,22] enable the periodic reporting of the file copy status TM[23,23] file copy status report
TC[23,24] disable the periodic reporting of the file copy status

pdf
TC TC Description TM TM Description
TC[23,1] create a file TM[23,4] file attribute report
TC[23,2] delete a file
TC[23,3] report the attributes of a file TM[23,4] file attribute report
TC[23,5] lock a file
TC[23,6] unlock a file
TC[23,7] find files TM[23,8] found files report
TC[23,9] create a directory
TC[23,10] delete a directory
TC[23,11] rename a directory
TC[23,12] summary-report the content of a repository TM[23,13] repository content summary report
TC[23,14] copy a file
TC[23,15] move a file
TC[23,16] suspend file copy operations
TC[23,17] resume file copy operations
TC[23,18] abort file copy operations
TC[23,19] suspend all file copy operations involving a repository path
TC[23,20] resume all file copy operations involving a repository path
TC[23,21] abort all file copy operations involving a repository path
TC[23,22] enable the periodic reporting of the file copy status TM[23,23] file copy status report
TC[23,24] disable the periodic reporting of the file copy status

Table B.5: Results of TC/TM mapping of ST[23] File Management, when PPTX and JSON documents are attached

pptx
TC TC Description TM TM Description
TC[23,1] create a file
TC[23,2] delete a file
TC[23,3] report the attributes of a file TM[23,4] file attribute report
TC[23,5] lock a file
TC[23,6] unlock a file
TC[23,7] find files TM[23,8] found files report
TC[23,9] create a directory
TC[23,10] delete a directory
TC[23,11] rename a directory
TC[23,12] summary-report the content of a repository TM[23,13] repository content summary report
TC[23,14] copy a file
TC[23,15] move a file
TC[23,16] suspend file copy operations
TC[23,17] resume file copy operations
TC[23,18] abort file copy operations
TC[23,19] suspend all file copy operations involving a repository path
TC[23,20] resume all file copy operations involving a repository path
TC[23,21] abort all file copy operations involving a repository path
TC[23,22] enable the periodic reporting of the file copy status TM[23,23] file copy status report
TC[23,24] disable the periodic reporting of the file copy status

json
TC TC Description TM TM Description
TC[23,1] create a file
TC[23,2] delete a file
TC[23,3] report the attributes of a file TM[23,4] file attribute report
TC[23,5] lock a file
TC[23,6] unlock a file
TC[23,7] find files TM[23,8] found files report
TC[23,9] create a directory
TC[23,10] delete a directory
TC[23,11] rename a directory
TC[23,12] summary-report the content of a repository TM[23,13] repository content summary report
TC[23,15] move a file
TC[23,16] suspend file copy operation
TC[23,17] resume file copy operation
TC[23,18] abort file copy operations
TC[23,19] suspend all file copy operations involving a repository path
TC[23,20] resume all file copy operations involving a repository path
TC[23,21] abort all file copy operations involving a repository path
TC[23,22] enable the periodic reporting of the file copy status
TC[23,24] disable the periodic reporting of the file copy status

Table B.6: Comparing the TC/TM mapping results of ST[23] File Management for different document types

pdf, pptx, json pdf pptx json


TC TM TC TM TC TM TC TM
TC[23,1] TC[23,1] TM[23,4] TC[23,1] TC[23,1]
TC[23,2] TC[23,2] TC[23,2] TC[23,2]
TC[23,3] TM[23,4] TC[23,3] TM[23,4] TC[23,3] TM[23,4] TC[23,3] TM[23,4]
TC[23,5] TC[23,5] TC[23,5] TC[23,5]
TC[23,6] TC[23,6] TC[23,6] TC[23,6]
TC[23,7] TM[23,8] TC[23,7] TM[23,8] TC[23,7] TM[23,8] TC[23,7] TM[23,8]
TC[23,9] TC[23,9] TC[23,9] TC[23,9]
TC[23,10] TC[23,10] TC[23,10] TC[23,10]
TC[23,11] TC[23,11] TC[23,11] TC[23,11]
TC[23,12] TM[23,13] TC[23,12] TM[23,13] TC[23,12] TM[23,13] TC[23,12] TM[23,13]
TC[23,14] TC[23,14] TC[23,14]
TC[23,15] TC[23,15] TC[23,15] TC[23,15]
TC[23,16] TC[23,16] TC[23,16] TC[23,16]
TC[23,17] TC[23,17] TC[23,17] TC[23,17]
TC[23,18] TC[23,18] TC[23,18] TC[23,18]
TC[23,19] TC[23,19] TC[23,19] TC[23,19]
TC[23,20] TC[23,20] TC[23,20] TC[23,20]
TC[23,21] TC[23,21] TC[23,21] TC[23,21]
TC[23,22] TM[23,23] TC[23,22] TM[23,23] TC[23,22] TM[23,23] TC[23,22]
TC[23,24] TC[23,24] TC[23,24] TC[23,24]
0 1 0 2 Mistakes
0.0% 2.5% 0.0% 5.0% Mistakes [%]

Table B.7: Results of MR generation for ST[03] Housekeeping when all three documents are attached
MR# | MR Title | Original Test | Follow-up Test | Expected Relation
MR1 | Create and Report Housekeeping Parameter Report Structure | Send TC[3,1] to create a housekeeping parameter report structure | Send TC[3,9] to report housekeeping parameter report structures | TM[3,10] should report the created structure
MR2 | Create and Report Diagnostic Parameter Report Structure | Send TC[3,2] to create a diagnostic parameter report structure | Send TC[3,11] to report diagnostic parameter report structures | TM[3,12] should report the created structure
MR3 | Enable and Disable Periodic Generation of Housekeeping Reports | Send TC[3,5] to enable periodic generation of housekeeping parameter reports | Send TC[3,6] to disable periodic generation of housekeeping parameter reports | Periodic generation should stop
MR4 | Enable and Disable Periodic Generation of Diagnostic Reports | Send TC[3,7] to enable periodic generation of diagnostic parameter reports | Send TC[3,8] to disable periodic generation of diagnostic parameter reports | Periodic generation should stop
MR5 | Generate One Shot Housekeeping Report | Send TC[3,1] to create a housekeeping parameter report structure and TC[3,27] to generate a one shot report | Send TC[3,9] to report housekeeping parameter report structures | TM[3,10] should report the one shot report
MR6 | Generate One Shot Diagnostic Report | Send TC[3,2] to create a diagnostic parameter report structure and TC[3,28] to generate a one shot report | Send TC[3,11] to report diagnostic parameter report structures | TM[3,12] should report the one shot report
MR7 | Append Parameters to Housekeeping Report Structure | Send TC[3,1] to create a housekeeping parameter report structure and TC[3,29] to append parameters | Send TC[3,9] to report housekeeping parameter report structures | TM[3,10] should report the appended parameters
MR8 | Append Parameters to Diagnostic Report Structure | Send TC[3,2] to create a diagnostic parameter report structure and TC[3,30] to append parameters | Send TC[3,11] to report diagnostic parameter report structures | TM[3,12] should report the appended parameters
MR9 | Modify Collection Interval of Housekeeping Report Structure | Send TC[3,1] to create a housekeeping parameter report structure and TC[3,31] to modify the collection interval | Send TC[3,9] to report housekeeping parameter report structures | TM[3,10] should report the modified collection interval
MR10 | Modify Collection Interval of Diagnostic Report Structure | Send TC[3,2] to create a diagnostic parameter report structure and TC[3,32] to modify the collection interval | Send TC[3,11] to report diagnostic parameter report structures | TM[3,12] should report the modified collection interval
MR11 | Report Periodic Generation Properties of Housekeeping Report Structure | Send TC[3,1] to create a housekeeping parameter report structure and TC[3,33] to report periodic generation properties | Send TC[3,9] to report housekeeping parameter report structures | TM[3,35] should report the periodic generation properties
MR12 | Report Periodic Generation Properties of Diagnostic Report Structure | Send TC[3,2] to create a diagnostic parameter report structure and TC[3,34] to report periodic generation properties | Send TC[3,11] to report diagnostic parameter report structures | TM[3,36] should report the periodic generation properties
MR13 | Apply Parameter Functional Reporting Configuration | Send TC[3,37] to apply parameter functional reporting configurations | Send TC[3,40] to report parameter functional reporting definitions | TM[3,41] should report the applied configurations
MR14 | Create and Delete Parameter Functional Reporting Definition | Send TC[3,38] to create a parameter functional reporting definition | Send TC[3,39] to delete parameter functional reporting definitions | The definition should be deleted
MR15 | Add and Remove Parameter Report Definitions | Send TC[3,38] to create a parameter functional reporting definition and TC[3,42] to add parameter report definitions | Send TC[3,43] to remove parameter report definitions | The definitions should be removed
MR16 | Modify Periodic Generation Properties of Parameter Report Definitions | Send TC[3,38] to create a parameter functional reporting definition and TC[3,44] to modify periodic generation properties | Send TC[3,40] to report parameter functional reporting definitions | TM[3,41] should report the modified properties
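Each MR above follows the same pattern: one or more telecommands as the source test, a reporting telecommand as the follow-up, and an expected relation on the resulting telemetry. A minimal sketch of how MR1 could be automated; send_tc, await_tm, and the in-memory store are hypothetical stand-ins for a PUS test harness, not part of any real PUS library.

    # Toy in-memory model so the sketch runs without a real spacecraft link.
    _store = {"structure_ids": []}

    def send_tc(service, subtype, structure_id=None):
        # TC[3,1]: create a housekeeping parameter report structure (toy model).
        if (service, subtype) == (3, 1):
            _store["structure_ids"].append(structure_id)
        # TC[3,9]: request the structure report; no state change in this toy model.

    def await_tm(service, subtype):
        # TM[3,10]: housekeeping parameter report structure report (toy model).
        if (service, subtype) == (3, 10):
            return {"structure_ids": list(_store["structure_ids"])}
        return None

    def run_mr1():
        # MR1: a structure created with TC[3,1] must appear in the TM[3,10] report.
        structure_id = 42  # hypothetical structure identifier
        send_tc(3, 1, structure_id=structure_id)  # original test: create the structure
        send_tc(3, 9, structure_id=structure_id)  # follow-up test: request the report
        report = await_tm(3, 10)
        return structure_id in report["structure_ids"]  # expected relation

    print(run_mr1())  # True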
Table B.8: Results of MR generation for ST[01] Request Verification when all three documents are attached and when only the PDF document is attached

T0 (source: pdf, pptx, json)


MR# MR Title Original Test Follow-up Test Expected Relation
MR1 Successful Acceptance Verification Send a valid TC Receive TM[1,1] TM[1,1] should confirm successful acceptance
MR2 Failed Acceptance Verification Send an invalid TC Receive TM[1,2] TM[1,2] should confirm failed acceptance
MR3 Successful Start of Execution Verification Send a valid TC that starts execution Receive TM[1,3] TM[1,3] should confirm successful start of execution
MR4 Failed Start of Execution Verification Send a TC that fails to start execution Receive TM[1,4] TM[1,4] should confirm failed start of execution
MR5 Successful Progress of Execution Verification Send a TC that progresses execution Receive TM[1,5] TM[1,5] should confirm successful progress of execution
MR6 Failed Progress of Execution Verification Send a TC that fails during execution Receive TM[1,6] TM[1,6] should confirm failed progress of execution
MR7 Successful Completion of Execution Verification Send a TC that completes execution Receive TM[1,7] TM[1,7] should confirm successful completion of execution
MR8 Failed Completion of Execution Verification Send a TC that fails to complete execution Receive TM[1,8] TM[1,8] should confirm failed completion of execution
MR9 Failed Routing Verification Send a TC that fails routing Receive TM[1,10] TM[1,10] should confirm failed routing

T1 (source: pdf)
MR# MR Title Original Test Follow-up Test Expected Relation
MR1 Successful Acceptance Verification Send a TC and expect TM[1,1] Send the same TC with valid parameters TM[1,1] should be received
MR2 Failed Acceptance Verification Send a TC with invalid parameters Send the same TC with invalid parameters TM[1,2] should be received
MR3 Successful Start of Execution Verification Send a TC and expect TM[1,3] Send the same TC with valid parameters TM[1,3] should be received
MR4 Failed Start of Execution Verification Send a TC with invalid parameters Send the same TC with invalid parameters TM[1,4] should be received
MR5 Successful Progress of Execution Verification Send a TC and expect TM[1,5] Send the same TC with valid parameters TM[1,5] should be received
MR6 Failed Progress of Execution Verification Send a TC with invalid parameters Send the same TC with invalid parameters TM[1,6] should be received
MR7 Successful Completion of Execution Verification Send a TC and expect TM[1,7] Send the same TC with valid parameters TM[1,7] should be received
MR8 Failed Completion of Execution Verification Send a TC with invalid parameters Send the same TC with invalid parameters TM[1,8] should be received
MR9 Failed Routing Verification Send a TC to a non-existent destination Send the same TC to a non-existent destination TM[1,10] should be received

Table B.9: Results of MR generation for ST[01] Request Verification when the PPTX and the JSON documents are attached
T2 (source: pptx)
MR# MR Title Original Test Follow-up Test Expected Relation
MR1 Routing Verification Send TC[8,1] with valid routing Send TC[8,1] with invalid routing TM[1,10] should be generated for invalid routing
MR2 Acceptance Verification Success Send TC[8,1] with valid acceptance Send TC[8,1] with invalid acceptance TM[1,1] should be generated for valid acceptance, TM[1,2] for invalid acceptance
MR3 Start of Execution Verification Send TC[200,1] with valid start conditions Send TC[200,1] with invalid start conditions TM[1,3] should be generated for valid start, TM[1,4] for invalid start
MR4 Progress of Execution Verification Send TC[200,1] with valid progress conditions Send TC[200,1] with invalid progress conditions TM[1,5] should be generated for valid progress, TM[1,6] for invalid progress
MR5 Completion of Execution Verification Send TC[200,1] with valid completion conditions Send TC[200,1] with invalid completion conditions TM[1,7] should be generated for valid completion, TM[1,8] for invalid completion
MR6 Are-You-Alive Test Send TC[17,1] for are-you-alive test Send TC[17,1] with altered parameters TM[17,2] should be generated for both, but with different details
MR7 On-Board Connection Test Send TC[17,3] for on-board connection test Send TC[17,3] with altered parameters TM[17,4] should be generated for both, but with different details
T3 (source: json)
MR# MR Title Original Test Follow-up Test Expected Relation
MR1 Check Raw Memory Data Send TC[6,9] to check raw memory data Verify TM[6,10] for checked raw memory data report TM[6,10] should report the checked raw memory data as requested by TC[6,9]
MR2 Abort All Request Sequences Send TC[21,13] to abort all request sequences Verify TM[21,14] for aborted request sequence report TM[21,14] should confirm the abortion of all request sequences as requested by TC[21,13]
MR3 Report Execution Status Send TC[21,6] to report the execution status of each request sequence Verify TM[21,7] for request sequence execution status report TM[21,7] should report the execution status of each request sequence as requested by TC[21,6]
MR4 Checksum Request Sequence Send TC[21,9] to checksum a request sequence Verify TM[21,10] for request sequence checksum report TM[21,10] should report the checksum of the request sequence as requested by TC[21,9]
MR5 Report Request Sequence Content Send TC[21,11] to report the content of a request sequence Verify TM[21,12] for request sequence content report TM[21,12] should report the content of the request sequence as requested by TC[21,11]
MR6 Report Parameter Statistics Definitions Send TC[4,8] to report the parameter statistics definitions Verify TM[4,9] for parameter statistics definition report TM[4,9] should report the parameter statistics definitions as requested by TC[4,8]
MR7 Report Parameter Definitions Send TC[20,6] to report parameter definitions Verify TM[20,7] for parameter definition report TM[20,7] should report the parameter definitions as requested by TC[20,6]
MR8 Perform On-Board Connection Test Send TC[17,3] to perform an on-board connection test Verify TM[17,4] for on-board connection test report TM[17,4] should report the results of the on-board connection test as requested by TC[17,3]
Table B.10: Results of TC/TM mapping of ST[01] Request Verification for the 5xPDF vs 1xPPTX experiments

T0 - pdf T3 - pdf
TC TC Description TM TM Description TC TC Description TM TM Description
TM[1,1] successful acceptance verification report TM[1,1] successful acceptance verification report
TM[1,2] failed acceptance verification report TM[1,2] failed acceptance verification report
TM[1,3] successful start of execution verification report TM[1,3] successful start of execution verification report
TM[1,4] failed start of execution verification report TM[1,4] failed start of execution verification report
TM[1,5] successful progress of execution verification report TM[1,5] successful progress of execution verification report
TM[1,6] failed progress of execution verification report TM[1,6] failed progress of execution verification report
TM[1,7] successful completion of execution verification report TM[1,7] successful completion of execution verification report
TM[1,8] failed completion of execution verification report TM[1,8] failed completion of execution verification report
TM[1,10] failed routing verification report TM[1,10] failed routing verification report

T1 - pdf T4 - pdf
TC TC Description TM TM Description TC TC Description TM TM Description
TM[1,1] successful acceptance verification report TM[1,1] successful acceptance verification report
TM[1,2] failed acceptance verification report TM[1,2] failed acceptance verification report
TM[1,3] successful start of execution verification report TM[1,3] successful start of execution verification report
TM[1,4] failed start of execution verification report TM[1,4] failed start of execution verification report
TM[1,5] successful progress of execution verification report TM[1,5] successful progress of execution verification report
TM[1,6] failed progress of execution verification report TM[1,6] failed progress of execution verification report
TM[1,7] successful completion of execution verification report TM[1,7] successful completion of execution verification report
TM[1,8] failed completion of execution verification report TM[1,8] failed completion of execution verification report
TM[1,10] failed routing verification report TM[1,10] failed routing verification report

T2 - pdf T5 - pptx
TC TC Description TM TM Description TC TC Description TM TM Description
TC[1,1] successful acceptance verification report TM[1,1] successful acceptance verification report TC[8,1] Routing TM[1,10] Failed routing verification report
TC[1,2] failed acceptance verification report TM[1,2] failed acceptance verification report TC[8,1] Acceptance TM[1,1] Successful acceptance verification report
TC[1,3] successful start of execution verification report TM[1,3] successful start of execution verification report TC[8,1] Acceptance TM[1,2] Failed acceptance verification report
TC[1,4] failed start of execution verification report TM[1,4] failed start of execution verification report TC[200,1] Start of Execution TM[1,3] Successful start of execution verification report
TC[1,5] successful progress of execution verification report TM[1,5] successful progress of execution verification report TC[200,1] Start of Execution TM[1,4] Failed start of execution verification report
TC[1,6] failed progress of execution verification report TM[1,6] failed progress of execution verification report TC[200,1] Progress & Completion TM[1,5] Successful progress of execution verification report
TC[1,7] successful completion of execution verification report TM[1,7] successful completion of execution verification report TC[200,1] Progress & Completion TM[1,6] Failed progress of execution verification report
TC[1,8] failed completion of execution verification report TM[1,8] failed completion of execution verification report TC[200,1] Progress & Completion TM[1,7] Successful completion of execution verification report
TM[1,10] failed routing verification report TC[200,1] Progress & Completion TM[1,8] Failed completion of execution verification report

Table B.11: Comparing the results of TC/TM mapping of ST[01] Request Verification for the 5xPDF vs 1xPPTX experiments

T0 - pdf T1 - pdf T2 - pdf T3 - pdf T4 - pdf T5 - pptx


TC TM TC TM TC TM TC TM TC TM TC TM
TM[1,1] TM[1,1] TC[1,1] TM[1,1] TM[1,1] TM[1,1] TC[8,1] TM[1,1]
TM[1,2] TM[1,2] TC[1,2] TM[1,2] TM[1,2] TM[1,2] TC[8,1] TM[1,2]
TM[1,3] TM[1,3] TC[1,3] TM[1,3] TM[1,3] TM[1,3] TC[200,1] TM[1,3]
TM[1,4] TM[1,4] TC[1,4] TM[1,4] TM[1,4] TM[1,4] TC[200,1] TM[1,4]
TM[1,5] TM[1,5] TC[1,5] TM[1,5] TM[1,5] TM[1,5] TC[200,1] TM[1,5]
TM[1,6] TM[1,6] TC[1,6] TM[1,6] TM[1,6] TM[1,6] TC[200,1] TM[1,6]
TM[1,7] TM[1,7] TC[1,7] TM[1,7] TM[1,7] TM[1,7] TC[200,1] TM[1,7]
TM[1,8] TM[1,8] TC[1,8] TM[1,8] TM[1,8] TM[1,8] TC[200,1] TM[1,8]
TM[1,10] TM[1,10] TM[1,10] TM[1,10] TM[1,10] TC[8,1] TM[1,10]
0 0 8 0 0 0 Mistakes
0.0% 0.0% 44.4% 0.0% 0.0% 0.0% Mistakes [%]
8.9% 0.0% PDF Avg [%]

Bibliography

Ahn, Janice et al. (Apr. 5, 2024). Large Language Models for Mathematical Reasoning: Progresses and Challenges. doi: 10.48550/arXiv.2402.00157. arXiv: 2402.00157 [cs]. url: http://arxiv.org/abs/2402.00157 (visited on 06/27/2024). Pre-published.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (Mar. 22, 2015). Neural Machine Translation by Jointly Learning to Align and Translate. Version 5. doi: 10.48550/arXiv.1409.0473. arXiv: 1409.0473 [cs, stat]. url: http://arxiv.org/abs/1409.0473 (visited on 06/27/2024). Pre-published.
Bengio, Yoshua et al. (Mar. 2003). "A Neural Probabilistic Language Model". In: Journal of Machine Learning Research.
Brown, Tom B. et al. (July 22, 2020). Language Models Are Few-Shot Learners. doi: 10.48550/arXiv.2005.14165. arXiv: 2005.14165 [cs]. url: http://arxiv.org/abs/2005.14165 (visited on 06/27/2024). Pre-published.
Chen, T. Y., S. C. Cheung, and S. M. Yiu (Feb. 27, 1998). Metamorphic Testing: A New Approach for Generating Next Test Cases. doi: 10.48550/arXiv.2002.12543. arXiv: 2002.12543 [cs]. url: http://arxiv.org/abs/2002.12543 (visited on 06/24/2024). Pre-published.
Chen, Tsong Yueh, Pak-Lok Poon, and Xiaoyuan Xie (June 2016). "METRIC: METamorphic Relation Identification Based on the Category-choice Framework". In: Journal of Systems and Software 116, pp. 177–190. issn: 0164-1212. doi: 10.1016/j.jss.2015.07.037. url: https://linkinghub.elsevier.com/retrieve/pii/S0164121215001624 (visited on 06/21/2024).
Chen, Tsong Yueh et al. (June 2016). "Metamorphic Testing for Cybersecurity". In: Computer 49, pp. 48–55. issn: 0018-9162. doi: 10.1109/MC.2016.176. url: http://www.scopus.com/inward/record.url?scp=84976515867&partnerID=8YFLogxK (visited on 06/24/2024).
Chen, Tsong Yueh et al. (Jan. 1, 2018). "Metamorphic Testing: A Review of Challenges and Opportunities". In: Faculty of Engineering and Information Sciences - Papers: Part B. doi: 10.1145/3143561. url: https://ro.uow.edu.au/eispapers1/975.
Chiang, Wei-Lin et al. (Mar. 6, 2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. doi: 10.48550/arXiv.2403.04132. arXiv: 2403.04132 [cs]. url: http://arxiv.org/abs/2403.04132 (visited on 06/28/2024). Pre-published.
Danilevsky, Marina et al. (Oct. 1, 2020). A Survey of the State of Explainable AI for Natural Language Processing. doi: 10.48550/arXiv.2010.00711. arXiv: 2010.00711 [cs]. url: http://arxiv.org/abs/2010.00711 (visited on 06/27/2024). Pre-published.
Dwarakanath, Anurag et al. (July 12, 2018). "Identifying Implementation Bugs in Machine Learning Based Image Classifiers Using Metamorphic Testing". In: Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 118–128. doi: 10.1145/3213846.3213858. arXiv: 1808.05353 [cs]. url: http://arxiv.org/abs/1808.05353 (visited on 08/15/2024).
Gotlieb, Arnaud and Bernard Botella (Dec. 3, 2003). "Automated Metamorphic Testing". In: Proceedings - IEEE Computer Society's International Computer Software and Applications Conference, pp. 34–40. isbn: 978-0-7695-2020-9. doi: 10.1109/CMPSAC.2003.1245319.
Li, Rui et al. (June 10, 2024). Metamorphic Relation Generation: State of the Art and Visions for Future Research. Version 2. doi: 10.48550/arXiv.2406.05397. arXiv: 2406.05397 [cs]. url: http://arxiv.org/abs/2406.05397 (visited on 08/15/2024). Pre-published.
Lu, Chris et al. (Aug. 15, 2024). The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. doi: 10.48550/arXiv.2408.06292. arXiv: 2408.06292 [cs]. url: http://arxiv.org/abs/2408.06292 (visited on 08/20/2024). Pre-published.
Luu, Quang-Hung, Huai Liu, and Tsong Yueh Chen (Nov. 23, 2023). Can ChatGPT Advance Software Testing Intelligence? An Experience Report on Metamorphic Testing. doi: 10.48550/arXiv.2310.19204. arXiv: 2310.19204 [cs]. url: http://arxiv.org/abs/2310.19204 (visited on 08/15/2024). Pre-published.
Maynez, Joshua et al. (May 1, 2020). On Faithfulness and Factuality in Abstractive Summarization. doi: 10.48550/arXiv.2005.00661. arXiv: 2005.00661 [cs]. url: http://arxiv.org/abs/2005.00661 (visited on 06/27/2024). Pre-published.
Ouyang, Long et al. (Mar. 4, 2022). Training Language Models to Follow Instructions with Human Feedback. doi: 10.48550/arXiv.2203.02155. arXiv: 2203.02155 [cs]. url: http://arxiv.org/abs/2203.02155 (visited on 08/15/2024). Pre-published.
Ouyang, Shuyin et al. (Aug. 5, 2023). LLM Is Like a Box of Chocolates: The Non-determinism of ChatGPT in Code Generation. doi: 10.48550/arXiv.2308.02828. arXiv: 2308.02828 [cs]. url: http://arxiv.org/abs/2308.02828 (visited on 06/27/2024). Pre-published.
Radford, Alec et al. (Feb. 15, 2019). "Language Models Are Unsupervised Multitask Learners". In: Proceedings of the International Conference on Machine Learning 2019.
Segura, Sergio et al. (Sept. 2016). "A Survey on Metamorphic Testing". In: IEEE Transactions on Software Engineering 42.9, pp. 805–824. issn: 1939-3520. doi: 10.1109/TSE.2016.2532875. url: https://ieeexplore.ieee.org/document/7422146 (visited on 06/24/2024).
Sennrich, Rico, Barry Haddow, and Alexandra Birch (June 10, 2016). Neural Machine Translation of Rare Words with Subword Units. doi: 10.48550/arXiv.1508.07909. arXiv: 1508.07909 [cs]. url: http://arxiv.org/abs/1508.07909 (visited on 06/30/2024). Pre-published.
Shin, Seung Yeob et al. (June 7, 2024). Towards Generating Executable Metamorphic Relations Using Large Language Models. arXiv: 2401.17019 [cs]. url: http://arxiv.org/abs/2401.17019 (visited on 06/25/2024). Pre-published.
Sun, Chang-Ai et al. (2019). "METRIC+: A Metamorphic Relation Identification Technique Based on Input plus Output Domains". In: IEEE Transactions on Software Engineering, pp. 1–1. issn: 0098-5589, 1939-3520, 2326-3881. doi: 10.1109/TSE.2019.2934848. url: https://ieeexplore.ieee.org/document/8807231/ (visited on 06/21/2024).
Uniluxembourg/SnT/SVV/NEMET-Attila/Thesis-Work · GitLab (Aug. 19, 2024). GitLab. url: https://gitlab.com/uniluxembourg/snt/svv/msc/nemet-attila-istvan-msc/thesis_work (visited on 08/19/2024).
Vaswani, Ashish et al. (June 22, 2017). Attention Is All You Need. doi: 10.48550/arXiv.1706.03762. arXiv: 1706.03762 [cs]. url: http://arxiv.org/abs/1706.03762 (visited on 06/27/2024). Pre-published.
Wang, Junjie et al. (Mar. 4, 2024). Software Testing with Large Language Models: Survey, Landscape, and Vision. doi: 10.48550/arXiv.2307.07221. arXiv: 2307.07221 [cs]. url: http://arxiv.org/abs/2307.07221 (visited on 08/15/2024). Pre-published.
White, Colin et al. (June 27, 2024). LiveBench: A Challenging, Contamination-Free LLM Benchmark. doi: 10.48550/arXiv.2406.19314. arXiv: 2406.19314 [cs]. url: http://arxiv.org/abs/2406.19314 (visited on 08/13/2024). Pre-published.
Zelikman, Eric et al. (Mar. 18, 2024). Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking. doi: 10.48550/arXiv.2403.09629. arXiv: 2403.09629 [cs]. url: http://arxiv.org/abs/2403.09629 (visited on 08/15/2024). Pre-published.
