It's Time For AI: LLM-Based Unit Tests for OpenSource Repositories
By Ria Mundhara, Intern; Marina Kalikanin, Member of Technical Staff; Harshil Dadlani, Member of
Technical Staff; Rajat Ghosh, Staff Data Scientist; Debojyoti Dutta, VP of Engineering
Executive Summary
In this article, we present how a coding agent is built on a large language model (LLM) and how we measure its performance on the Nutanix Cloud Platform, showcasing its capability to manage intricate workflows.
Introduction
When we compare coding large language models (LLMs) and natural language (NL) LLMs, such as Llama3 vs. CodeLlama, we can readily identify some distinctions. In fact, coding LLMs are significantly more challenging to develop and work with than NL LLMs, for the following reasons.
1. Precision and Syntax Sensitivity: Code is a formal language with strict syntax rules and
structures. A minor error, such as a misplaced bracket or a missing semicolon, can lead to
errors that prevent the code from functioning. This requires the LLM to have a high degree
of precision and an understanding of syntactic correctness, which is generally more
stringent than the flexibility seen in natural language.
2. Execution Semantics: Code not only needs to be syntactically correct, but it also has to be
semantically valid—that is, it needs to perform the function it is supposed to do. Unlike
natural language, where the meaning can be implicitly interpreted and still understood
even if somewhat imprecisely expressed, code execution needs to yield very specific outcomes. If a code LLM gets the semantics wrong, the program might not work at all or might perform unintended operations.
3. Context and Dependency Management: Code often involves multiple files or modules that
interact with each other, and changes in one part can affect others. Understanding and
managing these dependencies and contexts is crucial for a coding LLM, which adds a layer
of complexity compared to handling standalone text in natural language.
4. Variety of Programming Languages: There are many programming languages, each with its
own syntax, idioms, and usage contexts. A coding LLM needs to potentially handle multiple
languages, understand their unique characteristics, and switch contexts appropriately. This
is analogous to a multilingual NL LLM but often with less tolerance for error.
5. Data Availability and Diversity: While there is a vast amount of natural language data
available from books, websites, and other sources, high-quality, annotated programming
data can be more limited. Code also lacks the redundancy and variability of natural
languages, which can make training more difficult.
6. Understanding the Underlying Logic: Writing effective code involves understanding
algorithms and logic. This requires not only language understanding but also
computational thinking, which adds an additional layer of complexity for LLMs designed to
generate or interpret code.
7. Integration and Testing Requirements: For a coding LLM, the generated code often needs to
be tested to ensure it works as intended. This involves integrating with software
development environments and tools, which is more complex than the generally self-
contained process of generating text in natural language.
Each of these aspects makes the development and effective operation of coding LLMs a
challenging task, often requiring more specialized knowledge and sophisticated techniques
compared to natural language LLMs.
The deployment and life-cycle management of an LLM-serving API is challenging because of the autoregressive nature of the transformer-based generation algorithm. For code LLMs, the problem is more acute for the following reasons:
1. Real-Time Performance: In many applications, coding LLMs are expected to provide real-
time assistance to developers, such as for code completion, debugging, or even generating
code snippets on the fly. Meeting these performance expectations requires highly efficient
models and infrastructure to minimize latency, which can be technically challenging and
resource-intensive.
2. Scalability and Resource Management: Code generation tasks can be computationally
expensive, especially when handling complex codebases or generating lengthy code
outputs. Efficiently scaling the service to handle multiple concurrent users without
degrading performance demands sophisticated resource management and possibly
significant computational resources. Also, attention computation at inference time has quadratic time complexity with respect to the input sequence length, and input sequences for code models are often significantly longer than for NL models.
3. Context Management: Effective code generation often requires understanding not just the
immediate code snippet but also broader project contexts, such as libraries used, the overall
software architecture, and even the specific project's coding standards. Maintaining and
accessing this contextual information in a way that is both accurate and efficient adds
complexity to the serving infrastructure.
4. Security Concerns: Serving a coding LLM involves potential security risks, not only in terms
of the security of the model itself (e.g., preventing unauthorized access) but also ensuring
that the code it generates does not introduce security vulnerabilities into user projects.
Ensuring both model and output security requires rigorous security measures and constant
vigilance.
In summary, code LLMs are much harder to train and deploy for inference than NL LLMs. In this article, we cover benchmarking of a code generation API developed entirely on Nutanix infrastructure.
Figure 1 shows an LLM-assisted code generation workflow. It combines a context and a prompt through a prompt template to generate the input sequence for a large language model (LLM). The LLM then generates an output, which is passed to the evaluation system. If the output is not satisfactory, the user can revise the prompt, the prompt template, or the LLM used. Table 1 shows the taxonomy for this LLM-assisted code generation workflow.
Table 1: Taxonomy of the LLM-assisted code generation workflow

Prompt: Instruction to an LLM. Example: "Write unit test to the following function."

Prompt Template: Template to combine prompt and context. Example:
<PROMPT>
Context:
<CONTEXT>
Response:

Input: A combination of prompt and context through the prompt template. Example: the prompt template filled with a specific prompt and context.

LLM: Large language model. Examples: CodeLlama, Starcoder.

Output: Output generated by the LLM. Example:
import unittest
class TestTwoSum(unittest.TestCase):
    def test_two_sum_normal(self):
    def test_two_sum_no_solution(self):
        self.assertIsNone(two_sum([1, 2, 3, 4], 10))
    def test_two_sum_negative_numbers(self):
        self.assertEqual(two_sum([-3, 4, 3, 90], 0), [0, 2])
    def test_two_sum_same_element_twice(self):
        self.assertIsNone(two_sum([3, 3], 6))
    def test_two_sum_one_element(self):
        self.assertIsNone(two_sum([3], 3))
    def test_two_sum_empty_list(self):
        self.assertIsNone(two_sum([], 3))

Evaluation: Accuracy assessment by a subject matter expert. Provide feedback on the quality of the generated output and experiment with the prompt, prompt template, and/or LLM for a given context.
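To make the workflow in Table 1 concrete, the following is a minimal sketch of how a prompt template can be rendered with a specific prompt and context and sent to a model-serving endpoint. The template text, endpoint URL, and response schema are illustrative assumptions, not the exact implementation used in our benchmarks.

import requests

# Illustrative prompt template mirroring Table 1: <PROMPT> and <CONTEXT> are
# replaced by the instruction and by the source code under test.
PROMPT_TEMPLATE = """{prompt}

Context:
{context}

Response:
"""

def build_input(prompt: str, context: str) -> str:
    """Combine prompt and context through the prompt template."""
    return PROMPT_TEMPLATE.format(prompt=prompt, context=context)

def generate_unit_tests(source_code: str,
                        api_url: str = "http://localhost:8000/generate") -> str:
    """Send the rendered input sequence to an LLM-serving endpoint (URL and schema are assumptions)."""
    payload = {"input": build_input("Write unit test to the following function", source_code)}
    response = requests.post(api_url, json=payload, timeout=300)
    response.raise_for_status()
    return response.json()["output"]

The generated text returned by such a call is what the evaluation step in Table 1 then reviews.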
As shown in Figure 2, the App layer runs on top of the infrastructure layer of the Nutanix GPT-in-a-Box 2.0 system used in the testing described below. The infrastructure layer can be deployed in two steps, starting with Prism Element console login followed by VM resource configuration. Figure 3 shows the UI for the Prism Element controller.
Figure 3: The UI showing the setup for a Prism Element console on which the transformer model
for this article was trained. It shows the AHV hypervisor summary, storage summary, VM
summary, hardware summary, monitoring for cluster-wide controller IOPS, monitoring for
cluster-wide controller I/O bandwidth, monitoring for cluster-wide controller latency, cluster CPU
usage, cluster memory usage, granular health indicators, and data resiliency status.
After logging into Prism Element, we create a virtual machine (VM) hosted on our Nutanix AHV cluster. As shown in Figure 4, the VM has the following resource configuration settings: Ubuntu 22.04 operating system, 16 single-core vCPUs, 64 GB of RAM, and an NVIDIA A100 tensor core passthrough GPU with 40 GB of memory. The GPU is installed with the NVIDIA RTX 15.0 driver for the Ubuntu OS (NVIDIA-Linux-x86_64-525.60.13-grid.run). Large deep learning models with transformer architectures require GPUs or other compute accelerators with high memory bandwidth, large register files, and L1 memory.
Figure 4: The VM resource configuration UI pane on Nutanix Prism Element. As shown, it helps a user configure the number of vCPU(s), the number of cores per vCPU, memory size (GiB), and GPU choice. We used an NVIDIA A100 80G for this article.
The NVIDIA A100 Tensor Core GPU is designed to power the world’s highest-performing elastic
datacenters for AI, data analytics, and HPC. Powered by the NVIDIA Ampere™ architecture, A100
is the engine of the NVIDIA data center platform. A100 provides up to 20X higher performance
over the prior generation and can be partitioned into seven GPU instances to dynamically adjust
to shifting demands.
To peek into the detailed features of the A100 GPU, we run the `nvidia-smi` command, a command-line utility built on top of the NVIDIA Management Library (NVML) and intended to aid in the management and monitoring of NVIDIA GPU devices. The output of the `nvidia-smi` command is shown in Figure 6. It shows the driver version to be 515.86.01 and the CUDA version to be 11.7. Figure 5 shows several critical features of the A100 GPU we used. The details of these features are described in Table 2.
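For a scripted check of the same information, `nvidia-smi` can also be queried non-interactively. The snippet below is a minimal sketch that wraps the standard `--query-gpu` fields; the exact fields you can request depend on the installed driver.

import subprocess

def gpu_summary() -> str:
    """Query name, driver version, memory, and utilization for each visible GPU via nvidia-smi."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,driver_version,memory.total,memory.used,utilization.gpu",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    # On the VM described above, this is expected to report the A100 and its driver version.
    print(gpu_summary())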
Table 2
Benchmarking Hypothesis
We aim to study the impact of input and output token size on latency, as well as identify any
memory or time bottlenecks in the workflow. It is instructive to choose the right code datasets for
this benchmarking, and we chose to use code from the GitHub repositories for three popular
Python packages: NumPy, PyTorch, and Seaborn. These packages were chosen because their
repositories include distinct complexities that could affect the unit test generation.
NumPy is a package for highly optimized array operations. Its codebase includes a wide
range of mathematical functions which are relatively straightforward to write unit tests for.
PyTorch is a popular optimized Deep Learning tensor library. Its complexity in model
architectures introduces unique challenges in test generation.
Seaborn is a Python data visualization library. Unlike NumPy and PyTorch, Seaborn’s focus
on rendering visualizations adds a layer of complexity in terms of testing image outputs.
For the code LLM API, we have used Meta-Llama-3-8B-Instruct. The API server was implemented
using FastAPI.
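The server code itself is not reproduced in this article; the sketch below shows roughly what such an endpoint can look like, assuming an OpenAI-compatible vLLM server hosting the model. The route name, request schema, and URLs are illustrative assumptions.

from fastapi import FastAPI
from pydantic import BaseModel
import requests

app = FastAPI()

# Assumed address of an OpenAI-compatible vLLM server hosting Meta-Llama-3-8B-Instruct.
VLLM_URL = "http://localhost:8001/v1/completions"
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"

class TestRequest(BaseModel):
    source_code: str
    max_tokens: int = 1024

@app.post("/generate-tests")
def generate_tests(req: TestRequest) -> dict:
    """Render the prompt template and forward it to the model server."""
    prompt = (
        "Write unit test to the following function\n\n"
        f"Context:\n{req.source_code}\n\nResponse:\n"
    )
    resp = requests.post(
        VLLM_URL,
        json={"model": MODEL_NAME, "prompt": prompt, "max_tokens": req.max_tokens},
        timeout=300,
    )
    resp.raise_for_status()
    return {"tests": resp.json()["choices"][0]["text"]}

The latency measurements below are taken around calls to an endpoint of this kind, from the moment the request is issued to the moment the generated tests are written to a file.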
Results
Latency
First, we measured the latency for each of the requests and compared it with the corresponding
input/output token counts. Specifically, we measured the following metrics:
Latency: The time elapsed from the moment the API endpoint is called to when the output
is received and written to a test file.
Input Token Count: The number of tokens in the API call query.
Output Token Count: The number of tokens in the API call response.
As expected, the latency for all three packages closely fits an exponential distribution (p-value < 0.001). Figure 6 shows the fitted distribution, with the P99 latencies in red.
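For reference, a fit and P99 figure of this kind can be computed from the raw per-request latencies roughly as follows. This is a sketch with synthetic data; the variable and function names are ours, not taken from the benchmark code.

import numpy as np
from scipy import stats

def summarize_latencies(latencies_s: list[float]) -> dict:
    """Fit an exponential distribution to request latencies and report the P99 latency."""
    data = np.asarray(latencies_s)
    loc, scale = stats.expon.fit(data)                    # maximum-likelihood fit
    ks_stat, p_value = stats.kstest(data, "expon", args=(loc, scale))
    return {
        "p99_s": float(np.percentile(data, 99)),
        "expon_loc": loc,
        "expon_scale": scale,
        "ks_p_value": p_value,
    }

if __name__ == "__main__":
    # Synthetic example only; the real inputs are the measured per-request latencies.
    rng = np.random.default_rng(0)
    print(summarize_latencies(rng.exponential(scale=5.0, size=500).tolist()))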
Figure 6: Latency distribution for all 3 packages. The black line shows the fitted exponential
distribution, and the red line denotes P99 latency.
The P99 latency for the NumPy repo appears higher than for the Seaborn and PyTorch repos. This
could be explained by the fact that the NumPy input files were on average larger, and had more
functions per file, than the PyTorch and Seaborn input files.
Figure 7 shows the correlation matrix among latency, input token count, and output token count for each individual package. There is an almost perfect linear correlation between latency and output token count in all cases.
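A correlation matrix of this kind can be computed directly from the per-request records with pandas; the column names and the numbers below are illustrative only, not measured values.

import pandas as pd

# Each row is one API request: measured latency plus input/output token counts.
records = pd.DataFrame(
    {
        "latency_s": [4.2, 11.8, 7.5, 19.3],
        "input_tokens": [820, 2400, 1500, 3100],
        "output_tokens": [210, 640, 400, 1010],
    }
)

# Pearson correlation among latency, input token count, and output token count.
print(records.corr())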
Figure 8 shows the jointplot between latency and output token count for all 3 repositories. It clearly shows that latency increases with output token count. This proportionality can be explained by the fact that the LLM generates one token at a time.
Figure 8: There is an almost perfect linear correlation between latency and output token count for
all 3 packages. There is no statistically significant difference between the regression lines for the 3
packages.
Interestingly, while there is a relatively high correlation between input token counts and latency
for the PyTorch and NumPy repos, this is not the case for the Seaborn repo. Given the heavy
emphasis on visualization within the Seaborn repository, input token count may not be a good
measure of input complexity for the Seaborn repository. Rather, the complexity in unit test
generation for Seaborn comes from validating image, rather than textual, output. This complexity
remains regardless of input length.
For all three packages, we notice outliers in the latency against input token count graph. Where
latency is high for a low input token count, the input file tends to have a large number of utility
functions with no doc strings or comments explaining their use (for example, husl.py from the
Seaborn repo). Where latency is low for a high input token count, the input file tends to be mostly
comments, or lists of configurations and constants that do not need to be unit tested.
Memory Usage
Next, we look at memory usage per line of code during the test generation workflow, in order to find memory bottlenecks in the program. The memory profiler module was used to log memory usage per line of code for all Python scripts in the PyTorch repository. During the unit test generation workflow, four main functions are called:
generate_test_file
parse_code
run_main_agent
run_combiner_agent
Figure 9 shows the memory usage per line of code for each of these four functions.
Figure 9: Memory usage against line number for test generation. Each line represents a file.
From these graphs, we notice some key bottlenecks. First, line 81 of run_main_agent:
for f in self.extracted_functions:
    agent.generate_direct_vllm(
        context=f, file_name=self.file_name, **kwargs
    )
The memory used here scales linearly with the number of functions extracted from the input file.
As a result, files with many function definitions cause the spikes in memory usage observed.
Similar behavior is seen in run_combiner_agent. Memory usage scales linearly with the number of classes, methods, and import statements extracted from the file.
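For reference, per-line memory logging of this kind follows the pattern below when using the memory_profiler package. The decorated function here is a simplified stand-in for run_main_agent so the script runs on its own; it is not the actual implementation.

# Run with:  python -m memory_profiler profile_example.py
from memory_profiler import profile

@profile
def run_main_agent_like(extracted_functions: list[str]) -> list[str]:
    """Simplified stand-in that accumulates one result per extracted function, as the real agent does."""
    results = []
    for f in extracted_functions:
        # In the real workflow this is agent.generate_direct_vllm(context=f, ...);
        # here we only build a placeholder string so the script is self-contained.
        results.append(f"# generated tests for:\n{f}")
    return results

if __name__ == "__main__":
    funcs = [f"def func_{i}(): pass" for i in range(1000)]
    run_main_agent_like(funcs)

Running the script under memory_profiler prints memory usage per line, which is how the per-line figures above were collected.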
Time Complexity
To identify any timing bottlenecks, cProfile was used to profile the timing behavior of unit test generation on the PyTorch repository. The flame graph in Figure 10 describes the relative time spent in different parts of the workflow. As expected, the most time is spent waiting for the vLLM response.
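The profiling itself can be reproduced with the standard library, roughly as follows; the profiled call is a placeholder for the actual test generation entry point, and the output file name is ours.

import cProfile
import pstats

def generate_tests_for_repo() -> None:
    """Placeholder for the actual unit test generation entry point."""
    sum(i * i for i in range(1_000_000))  # stand-in workload

profiler = cProfile.Profile()
profiler.enable()
generate_tests_for_repo()
profiler.disable()

# Print the ten most expensive calls by cumulative time, and dump the stats
# so that external tools can render them as a flame graph.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)
stats.dump_stats("test_generation.prof")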
Figure 10: Flame graph showing time spent in different functions. The width of each frame
corresponds to the time spent in that function, and the call stack can be recreated by tracing
frames upwards.
Combining this with insights from the latency benchmarking, we know that larger test files
require more time to be generated.
Insights
The response time varies proportionally with the output token count, and memory usage
varies proportionally with the number of classes, methods and import statements in the
input file.
On average, the response times for both use cases vary between 0 and 20 seconds.
Conclusion
This article demonstrates how we can benchmark an LLM-based unit test writing API for different
open-source repositories. The benchmarking process not only highlights the efficiency and
coverage of the generated tests but also provides insights into the strengths and limitations of
the LLM in diverse codebases. By systematically evaluating performance metrics such as
accuracy, execution time, and test coverage across multiple repositories, we can better
understand the contexts in which LLMs excel and where improvements are needed. Future work
could focus on refining the model's understanding of complex logic patterns and enhancing its
adaptability to various coding styles, ultimately leading to more robust and reliable unit test
generation tools.
© 2024 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product, feature
and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in
the United States and other countries. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). The third-
party products in this article are referenced for demonstration purposes only. Nutanix is not
affiliated with, endorsed by, or sponsored by these third-party companies. The use of these third
party products is solely for illustrative purposes to demonstrate the features and capabilities of
Nutanix's products. This post may contain links to external websites that are not part of
Nutanix.com. Nutanix does not control these sites and disclaims all responsibility for the content
or accuracy of any external site. Our decision to link to an external site should not be considered
an endorsement of any content on such a site. Certain information contained in this post may
relate to or be based on studies, publications, surveys and other data obtained from third-party
sources and our own internal estimates and research. While we believe these third-party studies,
publications, surveys and other data are reliable as of the date of this post, they have not been independently verified, and we make no representation as to the adequacy, fairness, accuracy, or
completeness of any information obtained from third-party sources.
This post may contain express and implied forward-looking statements, which are not historical
facts and are instead based on our current expectations, estimates and beliefs. The accuracy of
such statements involves risks and uncertainties and depends upon future events, including
those that may be beyond our control, and actual results may differ materially and adversely from
those anticipated or implied by such statements. Any forward-looking statements included
herein speak only as of the date hereof and, except as required by law, we assume no obligation
to update or otherwise revise any of such forward-looking statements to reflect subsequent
events or circumstances.