GitHub Copilot AI Pair Programmer: Asset or Liability?
Arghavan Moradi Dakhel∗, Vahid Majdinasab∗, Amin Nikanjam, Foutse Khomh, Michel C. Desmarais
Polytechnique Montreal, Montreal, Canada
Abstract
Automatic program synthesis is a long-lasting dream in software engineering. Recently, a promising Deep Learning (DL)
based solution, called Copilot, has been proposed by OpenAI and Microsoft as an industrial product. Although some
studies evaluate the correctness of Copilot solutions and report its issues, more empirical evaluations are necessary to
understand how developers can benefit from it effectively. In this paper, we study the capabilities of Copilot in two
different programming tasks: (i) generating (and reproducing) correct and efficient solutions for fundamental algorithmic
problems, and (ii) comparing Copilot’s proposed solutions with those of human programmers on a set of programming
tasks. For the former, we assess the performance and functionality of Copilot in solving selected fundamental problems
in computer science, like sorting and implementing data structures. In the latter, a dataset of programming problems
with human-provided solutions is used. The results show that Copilot is capable of providing solutions for almost all
fundamental algorithmic problems; however, some solutions are buggy and non-reproducible. Moreover, Copilot has
some difficulties in combining multiple methods to generate a solution. Comparing Copilot to humans, our results show
that the correct ratio of humans' solutions is greater than that of Copilot's suggestions, while the buggy solutions generated
by Copilot require less effort to be repaired. Based on our findings, if Copilot is used by expert developers in software
projects, it can become an asset since its suggestions could be comparable to humans’ contributions in terms of quality.
However, Copilot can become a liability if it is used by novice developers who may fail to filter its buggy or non-optimal
solutions due to a lack of expertise.
Keywords: Code Completion, Language Model, GitHub Copilot, Testing.
Figure 1: The workflow of the proposed methods. The study includes two different methods to test Copilot in recommending code to solve programming problems. The first pipeline focuses on algorithmic problems collected from a well-known algorithm design book [11]. The second pipeline focuses on the assignments of a Python programming course [28], and compares Copilot with students in solving programming problems from different aspects.
the book that we are using to collect the problems [11] is a comprehensive educational book, each problem is described in detail and by building upon concepts that were explained in the previous chapters. As a result, some problem descriptions span multiple paragraphs and, sometimes, pages.

Although summary descriptions of our selected problems can be found in different resources, the authors summarized the description of each problem in their own words to reduce the chance of the memorization issue [7] with Copilot. Therefore, our prompt engineering was done in two steps:

1. Describing the problem: We needed to summarize each problem's description to feed it to Copilot while staying as faithful as possible to the book. To make sure that our descriptions were understandable and contained enough information about the algorithm being described, we also cross-checked each of them with those on popular coding websites such as W3SCHOOLS [52] and GEEKSFORGEEKS [24]. For cross-checking, the second author summarized [11]'s algorithm descriptions while keeping in mind Copilot's limits on the length of the prompt. If there were differences in the descriptions (i.e., the description was missing some key elements in explaining the problem), the descriptions were revised.

2. Cross-validation of problem descriptions: The second author created the input descriptions as explained above. After this, these descriptions were cross-checked with the first author to make sure that they were correct, understandable, and contained enough information about the problem being described. The first two authors have both taken the course "Introduction to Algorithms" during their education and have more than 5 years of experience in coding and program design. To assess the agreement, we calculated Cohen's Kappa score [10]. While the score was 0.95, implying an almost perfect agreement, for cases where there were conflicts about the descriptions, the two authors met and discussed the conflicts to resolve them. In the end, the descriptions were also cross-checked with the third author, who has more than 10 years of experience in teaching algorithm design. Therefore, the final input descriptions were what all three authors agreed on.

Excluding sorting algorithms, the other problems require code to be generated using previously generated code, as is common practice in both functional and object-oriented programming. For these problems, we followed exactly the book's example by asking Copilot to generate code for the underlying subproblems and then, for the succeeding problems, we asked it to implement the solution using the code it had generated before. We have recorded all of our descriptions and Copilot's responses in our replication package [18].
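The cross-validation step above reports a Cohen's Kappa of 0.95. For concreteness, the following is a minimal sketch of how such an agreement score can be computed; the paper does not state which tooling was used, so the sklearn call and the binary accept/reject labels here are assumptions.

# Hedged sketch: computing Cohen's Kappa between two annotators.
# The accept/reject labels below are hypothetical, not the paper's data.
from sklearn.metrics import cohen_kappa_score

# 1 = "description is correct/complete", 0 = "needs revision"
author1_labels = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
author2_labels = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1]

kappa = cohen_kappa_score(author1_labels, author2_labels)
print(f"Cohen's Kappa: {kappa:.2f}")  # values above 0.8 are usually read as near-perfect agreement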
3.1.3. Solving Fundamental Algorithmic Problems with Copilot
To generate solutions with Copilot, we feed the description of each algorithmic problem, called the prompt, to Copilot. At each attempt on Copilot with a selected prompt, it only returns up to the top 10 solutions. Thus, we do not have access to the rest of the potential suggestions. To examine Copilot's consistency in generating correct solutions, we repeat the process 6 times and each time collect its top 10 suggestions.
To assess whether Copilot's suggestions are consistent over time, we performed 2 trials within a 30-day time window. Each trial consists of 3 attempts for each prompt, and each attempt contains up to 10 suggestions provided by Copilot. The collection of the 3 first attempts is called the "First Trial" and the collection of the 3 last attempts, which were conducted 30 days later, is named the "Second Trial".
Given that Copilot may consider the script's filename as a part of its query, to make sure that solutions were only generated from our descriptions, we gave the scripts unrelated names.

3.1.4. Evaluation Criteria
Below, we briefly list the five different metrics which we have used to evaluate Copilot and explain them in detail in the rest of this section. The metrics are calculated per fundamental algorithmic problem.

1. Response Received ∈ [0, 3] ∈ N. The number of attempts in each trial in which Copilot was able to understand the problem and generate code content as its response.

2. Correct Ratio (%). The percentage of correct solutions suggested by Copilot in each trial.

3. Code Optimality ∈ [Yes, No]. Whether at least one of the correct solutions suggested by Copilot in each trial has optimal time complexity [Yes or No].

4. Code Reproducibility ∈ [Yes, No].
   • Within a Trial: Whether at least one of the correct solutions suggested by Copilot in one attempt was repeated in the two other attempts, within a trial [Yes or No].
   • Across Trials: Whether at least one of the correct solutions suggested by Copilot in the first trial was repeated in the second trial [Yes or No].

5. Code Similarity ∈ [0, 1] ∈ R.
   • Within a Trial: The similarity degree between all correct solutions within a trial.
   • Across Trials: The similarity of correct solutions between the two trials.

(1) Response Received
Our observation shows that if Copilot is unable to provide solutions to the problem with the provided prompt, it returns irrelevant responses, such as repeating the user's prompts, code that only contains import statements, or natural language responses. Thus, this metric helps us evaluate whether Copilot generates code for the summarized description of the problem instead of the mentioned irrelevant responses.
We used the description of each problem in the form of comments and collected up to the top 10 suggestions of Copilot in 6 different attempts and two separate trials, as described in Subsection 3.1.3. To calculate this metric, if at least one of the suggested solutions in an attempt within a trial is code content, we consider it a successful code generation attempt, or "Response Received". Since we conduct 3 separate attempts in each trial, we report the value of this metric as a number ∈ [0, 3] ∈ N per trial.

(2) Correct Ratio
We report the correct ratio as the fraction of solutions suggested by Copilot per problem that are functional and address the objective of the problem. To calculate this metric, we first need to evaluate the correctness of Copilot's suggestions for a problem. A suggested code is correct if it passes a set of unit tests.
However, in algorithmic problems, passing a set of unit tests to check the correctness of solutions is not enough. In this category, we not only need to verify that a suggestion passes a set of unit tests, but we also need to verify its chosen algorithm.
For example, in the "Sorting" problems, all problems have the same functionality: sorting a list of numbers. What matters is the choice of the algorithm to address the problem and whether Copilot is able to understand the structure of the solution from the given description. If Copilot implements "Bubble sort" instead of the "Selection sort" algorithm, or uses the Python built-in functions "sort" or "sorted", the code is still functionally correct and is able to sort the inputs, but it does not address the algorithm described in the problem. The same situation holds for implementing the data structure of a BST or a graph.
We tackle this challenge of calculating the correct ratio by following three steps:

1. We check the functional correctness of Copilot's suggestions on a set of unit tests.

2. We check if the selected algorithm in the solution follows the description that we gave to Copilot for that problem. To conduct this step, same as in Subsection 3.1.2, the two first authors separately checked the solutions suggested by Copilot for the problems. They compared the algorithm of the solutions (that is employed by Copilot to solve the problem) to the reference algorithms (ground truth). We collect the ground truth for each problem from the reference book [11] and from popular coding websites such as W3SCHOOLS [52] and GEEKSFORGEEKS [24]. We calculate Cohen's Kappa score to measure the agreement between the two authors.

3. The solutions per problem within a trial that passed the two above steps are labeled as correct. Then, we calculate the correct ratio as the fraction of correct solutions within a trial.
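To illustrate step 1 above, the following is a minimal sketch of the kind of unit-test check used to decide functional correctness; the candidate function and the test cases are hypothetical, not the paper's actual test suite.

# Hedged sketch of the functional-correctness check (step 1 above).
# The candidate function and the test cases are illustrative, not the paper's suite.
def is_functionally_correct(candidate, test_cases):
    """Return True if the candidate solution passes every unit test."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crashing suggestion is treated as failing the test.
            return False
    return True

# Example usage with a hypothetical Copilot suggestion for a sorting problem.
def candidate_sort(seq):
    return sorted(seq)  # functionally correct, but not the requested algorithm

tests = [(([3, 1, 2],), [1, 2, 3]), (([],), []), (([5, 5, 1],), [1, 5, 5])]
print(is_functionally_correct(candidate_sort, tests))  # True; step 2 above would still reject it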
(3) Code Optimality
We report this metric because the problems in our dataset can be implemented with different algorithms. This choice of algorithm may impact their computational complexity, for example using a nested loop, a queue, or recursive functions to solve a problem. With this metric, we want to check whether Copilot is able to suggest the optimal algorithm for a problem among its correct suggestions.
We cannot write code to automatically check whether the computational complexity of another code is optimal, due to Turing's halting problem [5]. Thus, same as in Subsection 3.1.2 and the Correct Ratio in this section, the two first authors check whether there is a solution with an optimal algorithm among the correct solutions suggested by Copilot for a problem in a trial. They separately compared correct solutions with a reference optimal code for a problem (ground truth). If at least one of the correct solutions suggested by Copilot within a trial is optimal, they consider that Copilot is able to find an optimal solution for that problem [Yes], and otherwise [No]. We calculate Cohen's Kappa score to report the agreement of the two authors on code optimality.

(4) Code Reproducibility and Similarity
Since Copilot is closed-source and we have no information about the characteristics that may impact its behavior on our prompts, we want to study whether this tool is able to reproduce a correct solution for a problem in different attempts and over time. We introduce "Code Reproducibility" as a metric for this measurement. For clarity, we split our approach for measuring this metric into three parts:

• We consider two different types of reproducing a code: the one that checks whether a correct solution is reproduced across different attempts within a trial, which we call "Within a Trial", and the one that checks whether a correct solution of a problem is reproduced over a time window between two trials, which we call "Across Trials".

• To identify the correct solutions that are reproduced and to measure their similarity, we have used the Abstract Syntax Tree (AST) similarity method described in [44]. AST similarity is calculated by first building the AST of a code and then pruning the leaves that are related to variable or function names. Also, we ignore comments or any natural language text in each solution, as they are not part of the code itself.
AST similarity is bounded between 0 and 1, with 1 denoting structurally equivalent programs (regardless of their semantic similarity) and 0 denoting no equivalence between programs. It also returns 1 for "structurally equivalent reordered programs", where the programs are functionally identical but their instructions are executed in a different order, and for "structurally equivalent renamed identical programs", where the programs are structurally the same with different variable names.
Therefore, this similarity measure is not affected by different statement orders or different variable names. However, it will differ for semantically similar programs where the same concept is implemented in different ways. In Subsection 3.2.4, we explain in more detail why we need to apply this method to detect similar codes when we discuss Copilot's duplicate solutions.

• To apply this comparison to correct solutions "Within a Trial", we compare the pairs of correct solutions across the 3 different attempts within that trial. If at least one of the correct solutions in one attempt is reproduced in the two other attempts (similarity equals 1), or in other words if at least one of the correct solutions within a trial occurs in all its 3 attempts, we consider that Copilot is able to reproduce the correct solution for that problem "Within a Trial" [Yes]; otherwise, we consider that [No]. To apply it "Across Trials", we compare the pairs of correct solutions between the two trials. If at least one of the correct solutions in the first trial is reproduced in the second trial (similarity equals 1), we consider that Copilot is able to reproduce the correct solution for that problem "Across Trials" [Yes]; otherwise, we consider that [No].

Our observations show that in some cases Copilot's suggestions are not exactly the same but are very similar. For example, Figure 2 shows two code samples for "Insertion Sort". The differences between the two code samples are only syntactic and confined to a few lines. Code Sample #1 computes the length of the list inside the range() call of the for loop, whereas Code Sample #2 assigns the length of the list to a variable and then uses it to control the loop. Also, the comparison operator in the while loop condition differs between the two code samples; however, only the operands are switched and both apply the same comparison.
Therefore, in addition to "Code Reproducibility", we report "Code Similarity" as the average similarity between pairs of correct solutions for the different fundamental algorithmic problems. To calculate the similarity, we follow the same AST similarity measure as explained above. For "Within a Trial", we compare all pairs of correct solutions in different attempts within a trial, and for "Across Trials", we compare all pairs of correct solutions between the two trials. Finally, the average of these comparisons is reported for each problem.
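The AST similarity method of [44] is not reproduced in the paper; the following is a simplified Python sketch of the idea described above (normalize identifier names, ignore comment text, then compare the pruned trees). The scoring of the original method may differ from the difflib ratio used here.

# Simplified sketch of an AST-based similarity in the spirit of the method described above.
# It normalizes identifier names; the real method in [44] may score differently.
import ast
import difflib

class _Normalize(ast.NodeTransformer):
    def visit_FunctionDef(self, node):
        node.name = "func"
        self.generic_visit(node)
        return node
    def visit_Name(self, node):
        node.id = "var"
        return node
    def visit_arg(self, node):
        node.arg = "var"
        return node

def _normalized_dump(source: str) -> str:
    tree = _Normalize().visit(ast.parse(source))  # comments are already absent from the AST
    return ast.dump(tree)

def ast_similarity(code_a: str, code_b: str) -> float:
    """Return a similarity in [0, 1]; 1 means structurally equivalent after renaming."""
    a, b = _normalized_dump(code_a), _normalized_dump(code_b)
    if a == b:
        return 1.0
    return difflib.SequenceMatcher(None, a, b).ratio()

# Two renamed-identical snippets score 1.0, the "renamed identical programs" case above.
print(ast_similarity("def f(x):\n    return x + 1", "def g(item):\n    return item + 1"))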
Figure 2: Two different solutions suggested by Copilot for Insertion sort. A few lines in these two code samples are syntactically different, but both address the same functionality. Code Sample #1 computes the length of the list inside the range() call of the for loop. Code Sample #2 assigns the length of the list to a variable and then uses it to control the loop. The comparison operator in the while loop condition is different in the two code samples; however, only the operands are switched and both apply the same comparison.

3.2. RQ2: Copilot vs. Human
In this subsection, we describe our research method for RQ2, on how to compare Copilot's code with human-written code using different quantitative metrics. First, we introduce the dataset of programming tasks that we used in our experiments and explain why we selected this dataset. Then, we explain how we employ Copilot to generate solutions for each task in this dataset. After that, we present how we selected students' solutions for this comparison. Finally, we discuss the criteria to compare Copilot with students in solving Python programming tasks from different aspects.

3.2.1. Dataset: Python Programming Tasks
To address RQ2, as we already discussed in Subsection 3.1, we require a dataset of programming problems that Copilot can solve so that we can conduct further investigations on Copilot's suggestions. Considering the choice of programming tasks and in order to have a fair comparison, we compare Copilot with junior developers. Therefore, we choose a dataset from a Python programming course that includes students' submissions for 5 programming assignments².

2 https://github.com/githubhuyang/refactory

There are studies on Copilot that used easy but more practical programming tasks than course assignments, such as the tasks in [51] (i.e., editing a CSV file), but these types of tasks need less problem-solving effort compared to the assignments of a programming course in our selected dataset.
As our investigations in this study go beyond code correctness, there are other advantages to using this dataset. First, this dataset includes students' submissions, which support our research question on comparing Copilot with humans. Second, the task descriptions in this dataset are human-written, reducing the chance of memorization issues [7]. They are new tasks for testing Copilot and different from the tasks in the Codex test set, i.e., the HumanEval dataset [8]. In addition, this dataset includes different test cases for each task, alongside a tool for automatically checking the functional correctness of solutions and the repairing cost of buggy solutions.
This dataset includes 2442 "Correct" and 1783 "Buggy" student submissions for 5 Python programming assignments in a Python course. Another study also used this dataset for characterizing the benefit of adaptive feedback for errors generated by novice developers [1]. Table 1 shows the description of each programming task. Each task includes a description of the problem, one or more reference solutions, a varying number of submissions by students that includes "Correct" and "Buggy" solutions, and different unit tests for each task, with an average of 9 tests per problem, to evaluate the functional correctness of solutions.
This dataset also contains a tool named "Refactory" to automatically repair the buggy submissions of students, if applicable [28]. In our study, we use this tool to repair buggy solutions generated by Copilot and students, to evaluate the complexity of fixing bugs in code generated by Copilot compared to those of junior programmers. This tool matches each buggy program with the closest correct solution based on its AST structure. Then, it modifies different blocks of the incorrect program to repair its bug(s) and convert it into a correct solution, if possible. This tool shows better performance than other state-of-the-art methods for repairing buggy programs, such as Clara [26]. Unlike others that need a large and diverse range of correct solutions, this tool can repair buggy code even with one or two references (i.e., correct solutions).

3.2.2. Solving Programming Problems with Copilot
To generate solutions with Copilot, akin to Subsection 3.1.3, we feed the description of each programming task in Table 1, called the prompt, to Copilot. At each attempt, Copilot only returns the Top-10 solutions for a prompt. Thus, we do not have access to the rest of the potential suggestions. To examine Copilot's consistency in generating solutions, similar to the previous experiments, we repeat the process. In this setup, we repeat the process 5 times and each time collect its top 10 suggested solutions. Expressly, we ask Copilot to solve each programming problem in 5 different attempts and collect the top 10 suggested solutions in each one. Thus, in total, we collect 50 solutions from Copilot for each problem.
As we already explained in Subsection 3.2.1, there are different test cases per task. To evaluate the functional correctness of Copilot's solutions, a solution is considered "Correct" if it passes all the unit tests related to its problem. Otherwise, it is considered "Buggy".
Table 1: A summary of the dataset used to compare Copilot with humans in solving simple programming tasks. The dataset includes the assignments and submissions of a Python programming course, with students' submissions for 5 Python programming tasks [28]. The last two columns represent the number of students' submissions in the two categories "Correct" and "Buggy".

solutions on the following markers. In the rest of this section, we explain each metric in more detail:

1. Correct Ratio (pass@Topk)
2. Repairing Costs
3. Diversity
Figure 3: Three different solutions generated by Copilot for the q3: Duplicate Elimination task in one attempt ((a) Sample code 1, (b) Sample code 2, (c) Sample code 3). There is no difference between the approaches of these three solutions in solving the task. The only difference between (a) and (b) is in the variable names, "i" and "item". The difference between (c) and (b) is the additional comment in (c). The differences between (c) and (a) are in variable names and comments.
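Figure 3 itself is an image; purely as an illustration of the kind of near-duplicates it describes, the sketch below shows three hypothetical duplicate-elimination suggestions that differ only in a variable name and a comment. The actual Copilot outputs are in the replication package [18].

# Hypothetical near-duplicate suggestions for q3: Duplicate Elimination (illustrative only).
def remove_duplicates_a(lst):          # "Sample code 1": loop variable named i
    result = []
    for i in lst:
        if i not in result:
            result.append(i)
    return result

def remove_duplicates_b(lst):          # "Sample code 2": identical, loop variable renamed to item
    result = []
    for item in lst:
        if item not in result:
            result.append(item)
    return result

def remove_duplicates_c(lst):          # "Sample code 3": same as (b) plus an extra comment
    # keep only the first occurrence of each element
    result = []
    for item in lst:
        if item not in result:
            result.append(item)
    return result

assert remove_duplicates_a([1, 2, 2, 3]) == remove_duplicates_b([1, 2, 2, 3]) == remove_duplicates_c([1, 2, 2, 3])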
keywords and built-ins to solve the same problem. However, even though flexibility in completing a programming task is desired, it can impact the efficiency, readability, and even maintainability of code in some cases [34, 14]. These differences can also reflect developers' mastery of the programming language. For example, Figure 4 shows two different solutions to a simple programming task, q4: Sorting Tuples, from Table 1. Code Sample #1 has more diverse programming syntax keywords and built-in functions, but Code Sample #2 is easier to understand and more readable.
Cyclomatic Complexity (McCabe's Cyclomatic Complexity, C.C.) is another code quality metric that evaluates the understandability of a code snippet. C.C. measures the number of independent paths in a code component, specifically, the number of decisions that can be made in a source code [17, 45]. Measuring the understandability of code snippets allows us to estimate the required effort for adding new features to the code or modifying it [46].
There are studies that apply C.C. to measure the readability and understandability of small code snippets [19, 13, 38]. When comparing solutions for a problem, a lower C.C. indicates more readable and understandable code. For example, in Figure 4, the C.C. of code samples #1 and #2 are 4.13 and 1, respectively. While code sample #1 uses two nested for-loops to sort the list, code sample #2 simply calls sort and uses a lambda over the list. Such an approach is more Pythonic and also more understandable.
To evaluate whether Copilot's suggestions are as understandable as humans', we calculate the C.C. of Copilot's solutions and compare them to the C.C. of humans' solutions for the same problems. Thus, we can assess whether Copilot can provide understandable code that is easy to change and maintain (lower C.C.) if used as a pair programmer in a software project. We use a Python package, RADON³, to calculate it. A C.C. close to or above 10 is interpreted as not being best-practice code.

3 https://radon.readthedocs.io/en/latest

(5) Syntactic Mastery
As we already discussed in Subsection 3.2.4/(4), different syntax patterns and built-in functions, methods, and types used in solving the same problem can reflect developers' mastery, as novice developers may not be familiar with all possible programming keywords and features in a programming language. While diversity in the syntax patterns of a solution to a specific task shows familiarity with more programming keywords and built-ins, these diverse solutions may not necessarily be the best practice to solve a problem. One of the goals of pair programming in industrial projects is to transfer such experience from experts to novice developers [41, 32, 23]. So, as another evaluation criterion, we compare the diversity of programming keywords and Python built-in functions in Copilot's code to those of humans.
For example, the codes in Figure 4 are different solutions to the same programming task. Code sample #1 has more diverse programming syntax keywords, such as {'FunctionDef', 'List[None]', 'for', 'if', 'BoolOp', 'else', 'break', 'elif', 'return'}, and more diverse built-ins, such as {'append', 'range', 'insert'}. Code sample #2 includes programming syntax keywords such as {'FunctionDef', 'Lambda', 'NameConstant', 'Subscript[Num]'} and the built-in method 'sort', which are less diverse than the first sample but more advanced, yielding a less complex solution (in terms of cyclomatic complexity).
We follow the instructions suggested by [36] to collect programming syntax patterns. We convert each solution to its AST and then walk through the syntax tree to collect nodes as programming keywords.
To collect built-in functions within a code, we first need to distinguish built-in functions from other function calls, since all types of calls in Python, from built-ins to local or public library functions, are a subset of a node named "Call" in the AST. To do so, we extract a list of Python built-ins⁴. Then, we collect the name of a "Call" node if its "class name" is in the list of Python built-ins. We compare the diversity of the keywords and Python built-ins in Copilot's and humans' code to study their capabilities in using Python's keywords and built-ins.

4 https://docs.python.org/3/library/functions.html
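A minimal sketch of the collection procedure described above (walking the AST for node types and filtering calls against Python's built-in names) follows; the exact pattern naming of [36] is not reproduced, and using dir(builtins) as the built-in list is an assumption.

# Hedged sketch: collect syntax-pattern node types and built-in/method calls from a solution's AST.
import ast
import builtins

BUILTIN_NAMES = set(dir(builtins))  # stands in for the documented list of Python built-ins

def collect_keywords_and_builtins(source: str):
    tree = ast.parse(source)
    keywords, builtin_calls = set(), set()
    for node in ast.walk(tree):
        keywords.add(type(node).__name__)            # e.g. 'FunctionDef', 'For', 'If', 'Return'
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id in BUILTIN_NAMES:
                builtin_calls.add(func.id)           # plain built-in call such as range(...) or len(...)
            elif isinstance(func, ast.Attribute):
                builtin_calls.add(func.attr)         # method call such as lst.append(...) or lst.sort(...),
                                                     # counted like the 'append'/'insert'/'sort' examples above
    return keywords, builtin_calls

kw, bi = collect_keywords_and_builtins(
    "def f(xs):\n    out = []\n    for x in range(len(xs)):\n        out.append(xs[x])\n    return out")
print(sorted(bi))  # ['append', 'len', 'range']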
4. Empirical results
In this section, we present the results we obtained to answer our RQs, one by one.

Figure 4: Two different solutions to solve q4: Sorting Tuples. Code Sample #1 has more diverse syntax patterns and Python built-in functions compared to Code Sample #2, but #2 is more readable and less complex (in terms of C.C.) because it uses more advanced programming syntax and built-in methods. The C.C. of Code Sample #1 (written by a human) is 4.13, while it is 1 for Code Sample #2 (suggested by Copilot).

4.1. RQ1: Copilot on Algorithm Design
In this section, we assess the capability of Copilot to solve algorithmic problems. To highlight the difference between our two trials, which were conducted 30 days apart from each other, for each marker we report the results of the solutions "Within a Trial" separately as "First Trial" and "Second Trial". For this part of our study, we discuss the different evaluation criteria per category of problems, since our findings show there is a correlation between the difficulty of the categories and the results.

4.1.1. Sorting Algorithms
In this section, we discuss our findings on Sorting Algorithms. For those evaluation metrics where the manual inspection of the authors was required (Response Received, algorithm validation on Correct Ratio, and Code Optimality), the authors achieved a Kappa agreement of 89%. We discuss the details in the rest of this section.

(2) Correct Ratio
Copilot shows varying behavior in generating correct solutions for sorting algorithms. The difficulty of the problems impacts its ability to generate a correct solution and to use the correct algorithm for the implementation. Also, Copilot shows different behavior in the two trials. For example, in the first trial for Bubble and Bucket sort, which are two easy sorting algorithms, 100% and 85.71% of Copilot's suggestions were correct, respectively. However, in the second trial, it generates no correct solutions for these two sorting problems.
Since implementing heap sort requires implementing a max heap and then writing a sorting function, this algorithm is harder to implement. In the first trial, Copilot generates no correct solution for this problem. However, during our second trial, 9.09% of its suggestions for this problem are correct. In the second trial for Radix sort, Copilot showed improvement in solving the problem, as it generated code in all three attempts (Response Received), but none of the generated codes were correct.
Copilot shows some particular behavior for some of the sorting algorithms. For example, during the second trial, where we asked it to generate code for Bucket sort, some of the generated codes were calling a Quick sort function for sorting the buckets even though Quick sort had not been implemented in the code.
For validating the algorithm choice in solutions that passed all unit tests, the two authors disagreed on the result for Selection sort. The input prompt was summarized from the descriptions collected from the algorithm design book [11]. The given prompt for this algorithm was "Create a function that accepts a list as input. The function should create two lists named sorted and unsorted. The function sorts an array by repeatedly finding the minimum element (considering ascending order) from an unsorted list and putting it at the beginning of the sorted list. Finally, it should return the sorted array". Given this description, the second author only accepted solutions that followed this exact description, mainly those which created the two empty sorted and unsorted lists. Upon review, however, the first and third authors noted that some solutions followed the selection sort algorithm without following the exact steps mentioned in the description. After discussions, these solutions were considered correct as well.

Table 2
Algorithm | Response Received [0,3] (First / Second Trial) | Correct Ratio [%] (First / Second Trial) | Optimal [Yes/No] (First / Second Trial) | Reproduced [Yes/No] (First / Second / Across Trials)
Sorting Algorithms
Bubble Sort | 3 / 3 | 100 / 0 | Yes / - | Yes / - / -
Bucket Sort | 3 / 3 | 85.71 / 0 | Yes / - | Yes / - / -
Heap Sort | 1 / 3 | 0 / 9.09 | - / Yes | - / No / -
Insertion Sort | 3 / 3 | 100 / 100 | Yes / Yes | Yes / Yes / Yes
Merge Sort | 3 / 2 | 33.34 / 0 | Yes / - | No / - / -
Quick Sort | 3 / 3 | 16.67 / 16.67 | Yes / Yes | No / No / No
Radix Sort | 1 / 3 | 10 / 0 | Yes / - | No / - / -
Selection Sort | 3 / 3 | 14.28 / 13.34 | Yes / Yes | No / No / No
Binary Search Trees
Data Structure | 3 / 1 | 61.9 / 35.71 | Yes / Yes | No / No / Yes
Min and Max Values in Tree | 3 / 3 | 71.42 / 66.67 | Yes / Yes | No / Yes / No
In-order Tree Walk | 3 / 3 | 94.12 / 16.67 | Yes / Yes | Yes / No / Yes
Finding The Successor Node | 3 / 3 | 100 / 100 | No / Yes | No / Yes / Yes
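As an illustration of the selection sort prompt quoted above, the following is a sketch in the spirit of that description (maintaining explicit sorted and unsorted lists); it is not one of Copilot's recorded suggestions, which are available in the replication package [18].

# Illustrative selection sort in the spirit of the quoted prompt; not an actual Copilot output.
def selection_sort(values):
    # The prompt names the lists "sorted" and "unsorted"; we use sorted_list / unsorted_list
    # to avoid shadowing the built-in sorted().
    sorted_list = []
    unsorted_list = list(values)
    while unsorted_list:
        minimum = min(unsorted_list)     # repeatedly find the minimum of the unsorted part
        unsorted_list.remove(minimum)
        sorted_list.append(minimum)      # append it to the growing sorted list (ascending order)
    return sorted_list

print(selection_sort([29, 10, 14, 37, 13]))  # [10, 13, 14, 29, 37]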
Table 3: Similarity ratios of the ASTs of Copilot's correct suggestions on fundamental algorithmic problems. To calculate the similarity, we removed the duplicate correct solutions in each attempt (three attempts within a trial). The results show that although some of the correct solutions are not exactly reproduced in different attempts within a trial or between the two trials, they are very similar. The similarity is left blank, "-", if it cannot be calculated (i.e., no correct solution or only one correct solution).

Algorithm | First Trial | Second Trial | Across Trials
Sorting Algorithms
Bubble Sort | 0.93 | - | -
Bucket Sort | 1 | - | -
Heap Sort | - | - | -
Insertion Sort | 0.99 | 1 | 0.99
Merge Sort | - | - | -
Quick Sort | - | - | 0.99
Radix Sort | - | - | -
Selection Sort | 0.61 | - | 0.63
Binary Search Trees
Data Structure | 0.51 | 0.46 | 0.53
Min and Max Values in Tree | - | 1 | 0.83
In-order Tree Walk | 1 | - | 1
Finding The Successor Node | 0.33 | 0.99 | 0.55
Elementary Graph Algorithms
Simple Data Structure | 0.25 | - | -
Breadth First Search | 0.54 | 0.72 | 0.45
Depth First Search | 0.73 | - | -
Directed Acyclic Data Structure | 0.63 | - | -
Finding Reachable Vertices | 0.79 | 1 | 0.076
Greedy Algorithms
Activity Class | - | - | -
Comparing Activities | 1 | - | -
Adding Activities to a Set | 0.09 | 0.11 | 0.17
Generate All in one Prompt | - | - | -

the base BST data structure. In the next steps, we asked Copilot to generate functions for finding the maximum and minimum values in a tree, performing an in-order tree walk, and finding the successor node of a child node. We discuss the details of our results in the following.

(1) Response Received
Our results show that Copilot is capable of understanding the BST problems in both trials. Only in the second trial does Copilot struggle to suggest code, in 2 out of 3 attempts, for generating the data structure of a BST.

(2) Correct Ratio
Our results in Table 2 show that Copilot has inconsistent behavior in generating correct solutions for some BST problems in the two trials. For example, considering the "In-order Tree Walk", 94.12% of Copilot's suggestions are correct in the first trial, but in the second trial this reduces to 16.67%. However, for the two problems "Min and Max Values in Tree" and "Finding The Successor Node", the correct ratios in both trials are very close to each other. For example, for "Finding The Successor Node", 100% of Copilot's suggestions are correct in both trials.

(3) Code Optimality
It should be noted that, in the majority of cases, Copilot was able to generate code consistent with the optimal time complexities required for an efficient BST implementation. In addition, Copilot was able to generate multiple different versions (with iterative and recursive programming techniques) for the "Finding maximum and minimum values in the tree", "In-order tree walk", and "Finding successor nodes" problems. For the "In-order Tree Walk" problem, Copilot generated functions inside the main function responsible for executing the walk. These functions were duplicates of those generated for finding the minimum and maximum values in the tree. This is bad programming practice, as it over-complicates the code. However, since these functions were not used by the original function at all, the generated code was still optimal. Copilot tends to generate recursive functions when the problem can be solved using such an approach. For example, for the "In-order Tree Walk" and "Finding maximum and minimum values in the tree" problems, the generated codes are all recursive functions.
Thus, for all the BST problems in both trials, except for "Finding successor nodes" in the first trial, at least one of the correct solutions suggested by Copilot has optimal time complexity.

(4) Code Reproducibility and Similarity
As shown in Table 2, in the first trial Copilot exactly reproduces at least one of its correct solutions in 3 different attempts only for the "In-order tree walk" problem. Based on Table 3, the similarity between the pairs of its correct solutions is not greater than 0.51 for those correct solutions that are not exactly reproduced. For example, the similarities of correct solutions in different attempts of the first trial for "Data Structure" and "Finding successor nodes" are 0.51 and 0.33, respectively.
In the second trial, based on Table 2, the exact correct solutions are reproduced for the "Finding maximum and minimum values in the Tree" and "Finding successor nodes" problems. The similarity for the correct solutions of "Data Structure", which is not exactly reproduced in this trial, is 0.46.
Unlike sorting algorithms, reproducibility across the two trials was not an issue for the BST problems, as Copilot reproduces at least one of the correct solutions from the first trial in the second trial for all BST problems except "Finding maximum and minimum values in the Tree". However, Table 3 shows that the similarity of the correct solutions for this problem across the two trials is 0.83.

Summary of Results. In summary, Copilot is capable of understanding the description of the BST problems in both trials, except for the "Data Structure" problem in the second trial.
Copilot has inconsistent behavior in generating correct solutions in the two trials, as 81.86% of its solutions are correct in the first trial but the correct ratio equals 54.76% in the second trial. Copilot was able to generate optimal code for all the BST problems in both trials except for "Finding successor nodes" in the first trial.
Copilot struggled to exactly reproduce its correct solutions within each trial, and the similarity of those solutions that are not exactly reproduced is not above 0.51. However, Copilot reproduces at least one of its correct solutions from the first trial in the second trial (Across Trials) for all BST problems except "Finding maximum and minimum values in the Tree". Although the correct solutions for this problem are not exactly reproduced across the two trials, the similarity of its correct solutions is 0.83.
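To illustrate the recursive style mentioned above, the following is a hypothetical in-order tree walk in the form Copilot tended to produce (a simple recursive traversal of a BST node); it is not one of the recorded suggestions.

# Illustrative recursive in-order tree walk over a minimal BST node; not an actual Copilot output.
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def in_order_walk(node, visit=print):
    """Visit the keys of a BST in ascending order."""
    if node is None:
        return
    in_order_walk(node.left, visit)   # walk the left subtree first
    visit(node.key)                   # then the node itself
    in_order_walk(node.right, visit)  # finally the right subtree

root = Node(8); root.left = Node(3); root.right = Node(10); root.left.right = Node(6)
in_order_walk(root)  # prints 3, 6, 8, 10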
Figure 6: Code sample of operator overloading. "Operator overloading" is an advanced Python feature in which a Python built-in function is re-written by the programmer to behave differently depending on the arguments that it receives as input. "__contains__" and "__str__" are two Python native functions that Copilot re-wrote in the graph problems.

4.1.3. Elementary Graph Algorithms
In this section, we discuss our findings on Elementary Graph Algorithms. The Kappa agreement between the two authors on the metrics that needed manual inspection is 83%. As the algorithms become more complex, Copilot is required to generate code that uses the previous code it has generated. We discuss the details of our results in the following.

(1) Response Received
Our results in Table 2 show that, as with BSTs, Copilot is adept at generating code for elementary graph algorithms. In the first trial, Copilot generates code in all 3 attempts for all graph problems except "Simple Data Structure" and "Directed Acyclic Data Structure", and in the second trial it struggles only in one of the 3 attempts for "Simple Data Structure".

(2) Correct Ratio
As we can see in Table 2, as with the BST problems, Copilot shows inconsistent behavior in generating correct solutions for some graph problems in the two trials. For example, for "Simple Data Structure", "Depth First Search", and "Directed Acyclic Data Structure" in the first trial, 50%, 75%, and 86.37% of Copilot's suggestions are correct, respectively. However, in the second trial, Copilot is not able to generate correct solutions for these problems. For the "BFS" problem, 100% of Copilot's solutions are correct in both trials.
Our observation shows that during different attempts at generating code for BFS and DFS, Copilot generated code for both algorithms regardless of us asking it to do so for only one of them.
Even though Copilot was able to recognize our description and generate code for it, some of the generated codes had one flaw, and since successor methods use the previous methods, this bug was present in every piece of generated code. This snowballing effect has affected our Kappa score as well. The bug was a result of Copilot assuming that the nodes are named by integer numbers. As a result, if a node is created with a name that is not an integer (e.g., "A" or "Node1" instead of "1" or "2"), the code will fail to iterate through the list of nodes and raise an error at runtime. However, since the code functioned correctly given the normal usage, we labeled these solutions as correct.

(3) Code Optimality
In the first trial, Copilot generated one optimal solution for each of the graph problems. However, in the second trial, out of the 2 problems that Copilot addressed correctly, only one of them, BFS, includes the optimal solution among its correct solutions. Checking whether a graph is cyclic requires using a BFS or DFS approach. If Copilot does not use the code that it has generated for BFS and DFS when checking whether a graph is cyclic, we are left with code pieces that repeat the same operation over and over, which is a bad practice in programming. We consider those suggestions non-optimal.
We examined the solutions suggested by Copilot for constructing the graph data structure and observed that its solutions contain both list comprehensions and explicit "for" loops. In one of the correct solutions, the generated code constructs the nodes from the input using explicit "for" loops, and in another solution it does so using list comprehensions. We accept the code that uses list comprehensions as optimal since, if the input is large, there is a real running-time difference between these two approaches. We also observed that some of the generated codes use an advanced Python feature called "operator overloading", in which a native Python function is re-written by the programmer to behave differently depending on the arguments that it receives as input. Figure 6 shows an example of operator overloading generated by Copilot.

(4) Code Reproducibility and Similarity
As we can see in Table 2, in the first trial, Copilot is able to reproduce at least one of its correct solutions for only two graph problems, "Breadth First Search" and "Finding Reachable Vertices". However, for other problems, such as "Depth First Search" and "Directed Acyclic Data Structure", the correct solution is not exactly reproduced by Copilot, but their similarities equal 0.73 and 0.63, respectively. In the second trial, Copilot is able to reproduce the correct solutions for the two problems that it addressed correctly. Across trials, Copilot is able to exactly reproduce the correct solutions only for the BFS problem. The similarity between correct solutions of "Finding Reachable Vertices" is very low across the two trials, 0.076.

Summary of Results. Our results show Copilot is adept at generating code for elementary graph algorithms. However, as with BST, Copilot shows inconsistent behavior in generating correct solutions for some graph problems in the two trials. In the first trial, Copilot is able to generate correct solutions for all graph problems with an average correct ratio of 74.27%. However, in the second trial, it is able to generate correct solutions for only two problems, and 100% of its solutions for these two problems are correct. Copilot was able to generate optimal code for all problems that it addressed correctly in both trials, except for "Finding Reachable Vertices" in the second trial. In terms of reproducibility, it struggled to reproduce its correct solutions for all graph problems. However, the similarity between correct solutions for some problems is more than 0.6.
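To make the overloading in Figure 6 concrete, the following is a hypothetical graph class that overrides __contains__ and __str__ in the way the caption describes; it is illustrative, not the figure's exact code.

# Illustrative operator overloading on a small graph class; not the exact code shown in Figure 6.
class Graph:
    def __init__(self):
        self.adjacency = {}            # node -> list of neighbour nodes

    def add_edge(self, u, v):
        self.adjacency.setdefault(u, []).append(v)
        self.adjacency.setdefault(v, [])

    def __contains__(self, node):      # lets callers write: node in graph
        return node in self.adjacency

    def __str__(self):                 # lets callers write: print(graph)
        return "\n".join(f"{node} -> {neighbours}" for node, neighbours in self.adjacency.items())

g = Graph()
g.add_edge("A", "B")
print("A" in g)   # True, dispatched to __contains__
print(g)          # adjacency listing, dispatched to __str__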
4.1.4. Greedy Algorithms
In this section, we discuss our findings on the "activity selection" problem as a Greedy Algorithm. The Kappa agreement between the two authors on the metrics that needed manual inspection is 100%. The "activity selection" problem requires the programmer to define a class for "activities". Each activity has a start and end time. The goal of this problem is: given a set of activities, where each activity has its own start and ending time, return a set that contains the maximum number of activities that can be performed as long as they do not overlap. Overlapping is defined as follows:

• An activity's start time must be after a previous activity's end time.

• An activity should not happen during another activity.

For this problem, we asked Copilot to generate code for implementing the activity class, comparing activities, and finally checking for overlaps between activities, to investigate whether the generated solutions are "greedy".

(1) Response Received
Our results in Table 2 show that Copilot is capable of understanding what the underlying problem is and can generate code for it. Our observations show that Copilot can even generate code when we give it the entire problem definition (activity class, comparing activities, and adding activities to a set) in one go.

(2) Correct Ratio
Even though Copilot is capable of understanding what we ask from it, the codes that it generates for solving the problem are either buggy or incorrect. For example, given the prompt "implement a class called activity. Each instance of this class has two attributes: start-time and end-time. Both should be integer numbers between 0 and 24", the generated code has no functionality for checking the input types or their boundaries. In another problem, when we asked Copilot to implement a method for comparing activities, we gave it the following prompt: "implement a function for comparing two activities. the function should return True if the first activity ends before the second activity starts. if the inputs have overlapping start-times, return False". Here, Copilot implemented the description correctly. However, since this method depends on its inputs being instances of the activity class, the code will fail if the input is anything else. Type checking is an important and basic operation, which Copilot fails to do here. Finally, for adding activities to a set of activities, Copilot was asked to create a method which accepts a set of activities alongside the start time and end time of an activity. The method should first create a new activity instance with the given start and end time and then check that this new activity does not overlap with the activities in the set. Copilot was unable to generate the necessary code for this, no matter how detailed the description was.

(3) Code Optimality
As Copilot was not able to generate correct solutions to most of the problems, we could only analyze the optimality of the solutions generated for "Comparing Activities" and "Adding Activities to a Set". Here, the generated codes were simple (as was the underlying problem), and the solutions required only checking the boundaries of class attributes or whether the output of a function was true or not.

(4) Code Reproducibility and Similarity
As Tables 2 and 3 show, Copilot was only capable of reproducing solutions for the "Adding activities to a set" problem across trials, and these solutions were different from each other. As Table 3 shows, for the "Comparing Activities" problem, Copilot generated solutions which were exactly the same within the same trial. However, in the second trial it was not capable of even producing a correct solution.

Summary of Results. The activity selection problem was used as a proxy to see whether Copilot would be able to generate code for solving this problem with a greedy solution. However, Copilot was not able to generate solutions that satisfied the criteria of a correct solution. In particular, Copilot showed difficulties in handling type checking and variable boundary checking, even though such behaviors were explicitly required in the prompt.

Findings: Copilot is able to recognize fundamental algorithms by their names and generate correct, optimal code for them as long as the descriptions are short and concise. In some cases, developers may need to invoke Copilot multiple times in order to receive solutions that are correct and tailored to their descriptions.

Challenges: Copilot is unable to generate code for type-checking variables. It also generates needlessly complicated code for some simple descriptions. Hence, Copilot still needs to be improved to truly be considered as a pair programmer.
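As a contrast to the missing type and boundary checks described in the greedy-algorithm results above, the following sketch shows the kind of checking the prompt asked for; the class and function names are hypothetical and this is not one of Copilot's suggestions.

# Hypothetical Activity class with the type and boundary checks the prompt asked for; illustrative only.
class Activity:
    def __init__(self, start_time, end_time):
        for name, value in (("start_time", start_time), ("end_time", end_time)):
            if not isinstance(value, int):                      # type checking
                raise TypeError(f"{name} must be an integer")
            if not 0 <= value <= 24:                            # boundary checking
                raise ValueError(f"{name} must be between 0 and 24")
        self.start_time = start_time
        self.end_time = end_time

def ends_before(first, second):
    """Return True if `first` ends before `second` starts; False on overlapping start times."""
    if not isinstance(first, Activity) or not isinstance(second, Activity):
        raise TypeError("both arguments must be Activity instances")
    if first.start_time == second.start_time:
        return False
    return first.end_time < second.start_time

print(ends_before(Activity(9, 10), Activity(11, 12)))  # True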
4.2. RQ2: Copilot vs. Human in Solving Programming Problems
In this section, we discuss our findings to answer RQ2. We discuss the results for each criterion of our evaluation separately.

4.2.1. Correct ratio of Copilot's suggestions and students' submissions
As explained in Subsection 3.2.4, we calculate the pass@Topk for the solutions generated by Copilot for each programming task. The pass@Topk shows the fraction of correct solutions among the Topk solutions, collected from 5 different attempts. We normalized the values to report this metric for the programming tasks.
Figure 7a shows the normalized values of pass@Topk for each programming task for Copilot. Topk ranges between Top1 and Top10 because each attempt on Copilot includes only the Top-10 suggestions. Based on this result, Copilot cannot find correct solutions for "q2: Unique Dates Months". This task asks to "...solve the problem by implementing 3 different functions...". Copilot could not understand this point within the task description and tried to solve the problem in one function. Thus, all of Copilot's solutions for this task failed the test cases, because the unit tests of this task are based on implementing 3 different functions.
There are no correct solutions in Copilot's Top3 suggestions for "q4: Sorting Tuples" in the 5 different attempts. This increases to 0.02 in the set of Top4 solutions. For "q1", "q3", and "q5", the pass@Top1 is equal to 0.08, 0.13, and 0.13, respectively. For some questions, the pass@Topk, at different values of k, shows greater values than for the other questions. For example, "q5" has the greatest values for pass@Top4 and above. Also, "q4" has the lowest pass@Topk, for different values of k, after "q2".
In general, pass@Topk increases with increasing k. This means that collecting a larger number of solutions suggested by Copilot increases the number of correct solutions, and this growth can differ across programming tasks.
In addition, Figure 7b shows the Correct Ratio (CR) of solutions in each attempt independently. Although the distribution of CRs across attempts varies, adding new attempts can increase the average CR of solutions. For example, the average CR in the first attempt (atp1) is equal to 0.32, while it increases to 0.44 in the last attempt (atp5). This shows that if we ask Copilot to solve the same problem multiple times (here, 5 attempts), there is a chance of increasing the CR among the new Top-10 suggested solutions on average. However, this does not hold for all questions. For example, for "q1", the CR in "atp4" is 0.7 but it decreases to 0.4 in "atp5". But for "q5", the CR in the first attempt is equal to 0.7 and it increases to 0.9 in the last attempt.
Since we cannot calculate pass@Topk for students, in Table 4 we compare the CR of the solutions generated by Copilot with the CR of students' submissions. For this comparison, we calculate three different CRs for Copilot. The first, CR@Top1, reports the number of correct solutions out of all Top1 solutions in the 5 different attempts for each programming task. CR@Top5 calculates the fraction of correct solutions out of all Top5 solutions suggested by Copilot in the 5 different attempts. Finally, CR@Top10 represents the number of correct solutions generated by Copilot out of all its 50 solutions for a programming task. Collecting more solutions decreases the CR of Copilot, since it increases the fraction of wrong solutions. For some of the questions, the CR@Top1 and CR@Top5 of Copilot are greater than the students' CR. For all questions, the CR of students' submissions is greater than the CR@Top10 of Copilot's suggestions. On average over all the programming tasks, the Correct Ratio (CR) of students' submissions is greater than the CR of Copilot's suggestions.

Table 4: The Correct Ratio (CR) of Copilot's solutions while collecting Top1, Top5, and Top10 solutions in all 5 attempts, compared to the Correct Ratio (CR) of students' submissions.

Task | Copilot: CR@Top1 / CR@Top5 / CR@Top10 | Students: CR
q1 Sequential Search | 0.6 / 0.44 / 0.36 | 0.57
q2 Unique Dates Months | 0.00 / 0.00 / 0.00 | 0.40
q3 Duplicate Elimination | 1 / 0.72 / 0.56 | 0.64
q4 Sorting Tuples | 0.00 / 0.08 / 0.14 | 0.54
q5 Top-k Elements | 1 / 0.92 / 0.76 | 0.79
Total | 0.52 / 0.43 / 0.35 | 0.59

4.2.2. Repairing costs of buggy solutions generated by Copilot and students
In this part, we compare the repair cost of buggy solutions for Copilot with that of students. As we already discussed, our observations show that there are buggy solutions generated by Copilot that are very similar to correct solutions; a small change can convert them into a correct solution. Therefore, we attempt to quantify this observation by calculating the intersection between Copilot's correct and buggy solutions for each problem using the BLEU score [39]. The comparison is done in a pairwise manner between each correct and each buggy solution. For example, if out of 50 solutions, 40 are correct and 10 are buggy, we end up with 400 pairwise comparisons.
BLEU is used in evaluating program synthesis approaches such as text-to-code, code summarization, and code prediction. The BLEU score uses the n-gram overlap between the tokens of two contents and penalizes length differences. It returns a value between 0 and 1 [50]. BLEU measures how well two texts match or how similar they are to each other. Ren et al. [43] introduce a new metric, called CodeBLEU, that measures the BLEU score on the syntax and semantics of code. As a part of this new metric, they measure CodeBLEU between the ASTs of codes.
To measure the overlap between correct and buggy solutions, we measure the BLEU score between the ASTs of the buggy and correct solutions. We omit those buggy codes which have syntax errors and cannot be converted into an AST. For example, a BLEU score of more than 0.7 between the ASTs of several correct and buggy pairs of solutions implies a high similarity between these two solutions. It can give us an estimation of the number of changes that we need to apply to a buggy solution to repair it.
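Stepping back to the correct-ratio comparison in Table 4, the following is a minimal sketch of how CR@Topk can be computed from per-attempt pass/fail labels; the data layout is assumed, not taken from the replication package.

# Hedged sketch: CR@Topk as the fraction of correct suggestions among the Top-k of every attempt.
# The pass/fail labels below are hypothetical, not the study's measurements.
def cr_at_top_k(attempts, k):
    """attempts: list of attempts, each a list of booleans ordered Top1..Top10 (True = passed all tests)."""
    top_k = [passed for attempt in attempts for passed in attempt[:k]]
    return sum(top_k) / len(top_k)

# Five hypothetical attempts with 10 suggestions each.
attempts = [[True, False, True, False, False, False, False, False, False, False]] * 5
print(cr_at_top_k(attempts, 1))   # 1.0  -> CR@Top1
print(cr_at_top_k(attempts, 10))  # 0.2  -> CR@Top10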
(a) Normalized pass@Topk of 5 different attempts (b) CR of solutions in 5 attempts
Figure 7: Evaluation of correct solutions generated by Copilot. Plot (a) shows the normalized values for pass@Topk metrics against
different values of k. It shows the fraction of correct solutions between Topk solutions of 5 different attempts. Plot (b) shows the distribution,
average and standard deviation of the Correct Ratio (CR) in each attempt for different programming tasks.
AST of several correct and buggy pairs of solutions implies dent buggy submissions. The average repairing time for
a high similarity between these two solutions. It can give Copilot’s buggy solutions is 4.94 seconds while it is equal
us an estimation of the number of changes that we need to to 6.48 seconds for the students. The reason is that on
apply to a buggy solution to repair it. average, the Relative Patch Size (RPS) of Copilot’s buggy
Figure 8 shows the density distribution of the BLEU score among pairs of buggy and correct solutions generated by Copilot for different programming tasks. As we can see in this figure, there are pairs of correct and buggy solutions with BLEU scores of 0.75 or greater. It shows that sometimes a small change in a buggy solution generated by Copilot can easily convert it into a correct solution, for example, changing ">" (greater) to "≥" (greater or equal).
Now that some of the buggy solutions generated by Copilot are very similar to the correct solutions, we are interested in comparing the repairing cost of Copilot's buggy solutions with students' buggy submissions. As we have explained in Subsection 3.2.3, for this comparison, we need to downsample students' submissions to the same size as Copilot's suggestions. Figure 9 shows the distribution of repairing time for students' buggy submissions. There is a high number of submissions with low repairing time and few with high repairing time. Thus, to keep the distribution of repairing costs in the sample set close to the entire population, we repeat the downsampling process 5 times and report all repairing metrics for students' submissions based on the average over all 5 sample sets.
As we can find in Table 5, the average repair rate for Copilot's buggy solutions is greater than students', 0.95 versus 0.89. This means that, on average, 95% of buggy solutions generated by Copilot have been fixed after the repair process. For example, for "q4: Sorting Tuples" and "q5: Top-k Elements", all buggy solutions of Copilot (100%) have been fixed, while the repair rate of students' submissions for these two tasks is equal to 85%.
In addition, the average repair time for Copilot's buggy solutions is less than the students'. This means that not only can the repairing tool fix the majority of Copilot's buggy solutions, but it can also fix them faster than students' buggy submissions. The average repairing time for Copilot's buggy solutions is 4.94 seconds, while it is equal to 6.48 seconds for the students. The reason is that, on average, the Relative Patch Size (RPS) of Copilot's buggy solutions that need to be repaired is smaller than the students'. As we can find in Table 5, the average RPS for Copilot and students is 0.33 and 0.35, respectively.
We can conclude that although, on average, the CR of students' submissions is greater than that of Copilot's solutions, the repairing cost of Copilot's buggy solutions is lower than that of the students'. With a repairing tool, we can repair the majority of buggy solutions generated by Copilot and increase its CR.
Thus, if Copilot, as a pair programmer in a software project, suggests buggy solutions, it is less expensive to fix its bugs compared to bugs that may be produced by junior developers when solving the same programming task.

4.2.3. Diversity of Copilot's suggestions and students' submissions
The diversity of solutions shows the novelty of Copilot and students in solving different problems. It also shows whether, as increasing the number of sample codes increases the fraction of correct solutions, this increment is due to the diversity of correct solutions or to duplication. As we discussed in Subsection 3.2.4, we observe duplicate solutions in a single attempt and across multiple attempts on Copilot to solve a problem. On the other hand, we observe duplicate solutions among students' submissions as well. For example, for "q1: Sequential Search", after comparing the ASTs of students' correct submissions, 54.32% of their submissions are identified to be duplicated.
To compare the diversity among students' submissions and Copilot's solutions, we randomly downsample 10 student submissions in 5 different sample sets and consider them as 5 different attempts. Then, in each attempt on Copilot and for each sample set of students' submissions, we eliminate duplicate correct and buggy solutions.
Figure 8: Distribution of the BLEU score among pairs of correct and buggy solutions generated by Copilot. This chart shows a histogram of the BLEU score over pairs of correct and buggy solutions generated by Copilot. A BLEU score of 0.75 and above represents a high similarity between the ASTs of a correct and buggy pair. The BLEU score between several pairs of buggy and correct solutions is greater than 0.7, across different programming tasks. This supports our observation that several buggy solutions can be corrected with small changes.
Figure 9: The distribution of repairing time for students' buggy submissions. This chart shows a histogram of students' buggy submissions based on their repairing time. There are more buggy submissions with low repairing time than with high repairing time. We repeat the downsampling process on students' submissions 5 times to preserve this distribution in the sample sets.
Table 5: Comparing the repairing cost of Copilot's suggestions with students' submissions

                                        Copilot                               Students
Task                      Rep Rate  Avg Rep Time (sec)  Avg RPS   Rep Rate  Avg Rep Time (sec)  Avg RPS
q1 sequential search      0.94      9.61                0.48      0.98      2.58                0.40
q2 unique dates months    0.92      3.26                0.28      0.82      3.81                0.44
q3 duplicate elimination  0.91      0.64                0.26      0.96      4.35                0.30
q4 sorting tuples         1.00      0.78                0.15      0.85      8.82                0.29
q5 top-k elements         1.00      10.40               0.50      0.85      12.84               0.30
Total                     0.95      4.94                0.33      0.89      6.48                0.35
There are a few buggy solutions for Copilot and a few student submissions with syntax errors that cannot be converted into an AST (3 solutions). We consider them as non-duplicate buggy solutions.
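A minimal sketch of this elimination step, under the assumption that two solutions count as duplicates when their ASTs serialize to the same string (the exact canonicalization may differ):

import ast

def ast_fingerprint(code):
    # Serialize the AST; structurally identical programs map to the same string.
    return ast.dump(ast.parse(code))

def deduplicate(solutions):
    # Keep one representative per distinct AST.
    seen, unique = set(), []
    for code in solutions:
        key = ast_fingerprint(code)
        if key not in seen:
            seen.add(key)
            unique.append(code)
    return unique

attempt = [
    "def f(x):\n    return x + 1",
    "def f(x):\n    return x+1",      # same AST as the first: a duplicate
    "def f(x):\n    return 1 + x",    # different AST: kept
]
print(len(deduplicate(attempt)))      # 2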
Figure 10 shows the cumulative distribution of Correct (C) solutions, Non-Duplicate Correct (NDC) solutions, Buggy (B) solutions, and Non-Duplicate Buggy (NDB) solutions by Copilot and students across different tasks. In this figure, for example, in "q3: atp3", the number of Correct (C) solutions suggested by Copilot is 17, but the number of Non-Duplicate Correct (NDC) solutions is only 2. This means that after generating more solutions and running more attempts, Copilot repeats these 2 correct solutions several times. However, out of 14 Correct (C) solutions generated by students in the third attempt (atp3), 13 solutions are non-duplicate. The same observation holds for buggy solutions. Increasing the number of attempts on Copilot leads to a jump in the number of correct solutions for "q1" and "q5", from 2 to 18 and from 7 to 38, respectively. However, for "q3" and "q4", this growth is smaller. The number of Non-Duplicate Correct (NDC) solutions of Copilot is less than or equal to the number of Correct (C) solutions in each attempt for each task, and the same holds for buggy solutions. This shows that, despite Copilot's claim that it removes duplicate solutions, there are still duplicates among the Top 10 solutions of each attempt.
The difference between C and NDC in student submissions is smaller than in Copilot's solutions. For example, in "q3", the cumulative number of C solutions generated by Copilot across different attempts is greater than that of students' submissions across different sample sets; however, it is the opposite for NDC solutions. In "atp5", the cumulative number of C solutions generated by Copilot equals 28, and it equals 22 after the 5 sample sets of students' submissions. However, the cumulative NDC solutions at these attempts equal 2 (out of 28) for Copilot and 21 (out of 22) for students. This shows more diversity among the correct, and even the buggy, submissions of students compared to Copilot's solutions.
As another example for Copilot, there is no new NDC solution after "atp3" for "q3" and "q5". This means that, by increasing the number of solutions generated by Copilot for these two questions, the CR increases due to the duplication of correct solutions rather than the generation of new ones.
In general, the diversity of correct and buggy submissions for students is greater than Copilot's. While there is no guarantee that all non-duplicate solutions are optimized, students solved these 5 tasks with more diverse and novel solutions.

4.2.4. The Cyclomatic Complexity of Codes
In this section, we calculate the Cyclomatic Complexity (C.C.) of codes generated by Copilot and students. Table 6 shows the average and the standard deviation of C.C. for the correct solutions generated by Copilot and students. It is worth mentioning that we use the sampling method explained in Subsection 3.2.3 to collect students' correct solutions.
On average, the correct solutions suggested by Copilot are found to be more optimized than students' solutions. However, we should consider that, for example, Copilot has no correct solutions for "q2", and the CR of Copilot for "q4" is only 8%. In general, Copilot recommends less complex solutions than students for the same questions, except for "q1". But for "q1", the C.C. of Copilot's correct solutions has a lower standard deviation, meaning that its C.C. is less spread around the average. Also, for "q5", Copilot used the Python built-in functions "sort" and "sorted", although the description asked not to use them.

Table 6: The Cyclomatic Complexity (C.C.) of Copilot's solutions compared to students' submissions

Question                  C.C. Copilot   C.C. Students
q1 Sequential Search      5.8 ± 1.94     4.63 ± 2.1
q2 Unique Dates Months    -              4.18 ± 1.03
q3 Duplicate Elimination  3 ± 0.01       3.12 ± 0.5
q4 Sorting Tuples         1 ± 0          4.13 ± 1.03
q5 Top-k Elements         1.44 ± 0.69    3.3 ± 1.46
Total                     2.81           3.87

As already observed, we can conclude that the suggestions of Copilot can compete with students' solutions in terms of C.C.
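As an illustration (not necessarily the exact tooling behind Table 6), the C.C. of a single solution can be computed with the radon package; the two example solutions are hypothetical:

from statistics import mean, stdev
from radon.complexity import cc_visit

def solution_complexity(code):
    # cc_visit returns one block per function/class in the source;
    # we take the highest complexity found in the solution.
    return max(block.complexity for block in cc_visit(code))

solutions = [
    "def top_k(xs, k):\n    return sorted(xs)[-k:]",
    "def top_k(xs, k):\n    out = []\n    for x in xs:\n        if len(out) < k:\n            out.append(x)\n    return out",
]
scores = [solution_complexity(code) for code in solutions]
print(round(mean(scores), 2), round(stdev(scores), 2))  # average and spread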
4.2.5. Syntactic Mastery
As discussed in Subsection 3.2.4/(5), different developers can solve a programming task with different solutions. Consequently, this can impact the readability and maintainability of the code if it is not an efficient solution.
Figure 10: The cumulative distribution of solutions by Copilot and students. It shows the cumulative distribution of Correct (C), Non-Duplicate Correct (NDC), Buggy (B), and Non-Duplicate Buggy (NDB) solutions for Copilot and students. An attempt (atp) for students corresponds to a random sample set of their submissions. Each value on the stack represents the number of solutions in each of the 4 categories. The growth of NDC solutions for Copilot decreases or stops for some programming tasks while the number of its Correct (C) solutions increases. Students' submissions are more diverse than Copilot's solutions.
In this section, we compare the diversity of syntax keywords and the usage of built-in functions between the solutions generated by Copilot and those written by humans for different programming tasks. Figure 11 shows the diversity of syntax keywords and built-ins that we observed in both Copilot's and students' solutions, with normalized values. Students used more diverse keywords and built-ins in comparison to Copilot.
For example, for q3: Duplicate Elimination, the only Python built-in function in Copilot's solutions is "append". However, students included more diverse built-ins such as {'count', 'remove', 'index', 'copy', 'append', 'reverse', 'pop'} in their solutions. As another example, in q5: Top-k Elements, Copilot used {'sort', 'append', 'remove'} as built-in functions in all of its solutions, but students used {'copy', 'pop', 'remove', 'append', 'sort', 'extend', 'reverse', 'clear'}. The usage of programming keywords by Copilot and students follows a similar pattern to the built-ins. For example, for q4: Sorting Tuples, there are solutions provided by students that iterate over the list of tuples to sort them, producing diverse syntax patterns in their solutions such as {'Tuple', 'Lt', 'Add', 'Expr', 'Continue', 'Eq', 'Break', 'Gt', 'BoolOp', 'And', 'UnaryOp', 'USub', 'LtE'}. We cannot find these programming patterns in Copilot's solutions, as it only used the built-in function "sort" in the majority of its solutions.
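A rough sketch of how such sets of syntax patterns and called built-ins can be extracted from a solution (an illustration; the exact extraction used for Figure 11 may differ):

import ast

def syntax_patterns(code):
    # Distinct AST node-type names, e.g. {'For', 'If', 'Compare', 'Lt', ...}.
    return {type(node).__name__ for node in ast.walk(ast.parse(code))}

def called_names(code):
    # Names of called functions and methods, e.g. {'sorted', 'append', ...}.
    names = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                names.add(node.func.id)       # e.g. sorted(...)
            elif isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)     # e.g. result.append(...)
    return names

# Hypothetical q3-style solution that only relies on "append".
solution = (
    "def eliminate_duplicates(xs):\n"
    "    out = []\n"
    "    for x in xs:\n"
    "        if x not in out:\n"
    "            out.append(x)\n"
    "    return out"
)
print(called_names(solution))                 # {'append'}
print('For' in syntax_patterns(solution))     # True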
Students used more diverse syntax patterns and built-ins to solve the same problem compared to Copilot. This may be the result of students not being familiar with advanced Python features, as opposed to Copilot, which uses such features frequently. However, this diversity could also stem from the diversity of student submissions, as discussed in Subsection 4.2.3, or it could be the result of restrictions in some assignments' descriptions; for example, in q5: Top-k Elements, not using the built-in functions sort and sorted is requested, a restriction which, unlike the students, Copilot was not able to understand.

Findings: In general, Copilot suggests solutions that compete with students' submissions in different aspects. The correct ratio and diversity of students' submissions are greater than Copilot's. However, the cost of repairing buggy solutions generated by Copilot is less than students'. In addition, the complexity of Copilot's generated codes is less than students'.
Challenges: Copilot has difficulty understanding some requirements in the description of tasks. This affects the correct ratio of its solutions. However, students understand those details and consider them in their submissions.

5. Discussion and Limitation

In this section, we discuss the boundaries of Copilot and how to make it more beneficial in real programming tasks despite its limitations.

5.1. Description of Problems (Prompts)
Our results show that Copilot cannot understand some details in the description of problems that are understandable by humans.
Figure 11: Diversity of programming syntax patterns in solutions generated by Copilot and students. Plot (a) shows the normalized value for the distinct number of Python built-in functions in Copilot's solutions compared to students' for different questions. Plot (b) shows the normalized value for the distinct number of Python syntax keywords in Copilot's solutions compared to students'.
For example, in q5, "Top-k Elements", it is asked in the description to "... not use Python's built-in functions sort and sorted ...". Copilot cannot understand this detail and uses these two built-in functions in all of its correct solutions. However, the majority of students avoided using these built-in functions. Instead, they wrote a sorting algorithm and then called it for sorting tasks, or used other built-in functions such as "append", "remove" and "max" in their solutions. As our results in Subsection 4.1 show, Copilot suggests correct solutions for different sorting algorithms (meaning that Copilot is familiar with different sorting algorithms such as "Bubble Sort" or "Merge Sort"), but it did not use them in q5 because it could not figure out the requirements of the problem. Students, in contrast, apply their knowledge about sorting algorithms to solve this problem. Thus, since in the prompt we cannot limit Copilot to NOT using certain functions, it is better to clarify our task by defining the functions that it is allowed to use.
In q4, "Sorting Tuples", it is asked to return the list of tuples in an order where "... older people are at the front ...". Copilot cannot understand this part. In 92% of its suggestions, it returned the tuples sorted in the default order: ascending. However, students considered this point in their submissions. We even checked some of the buggy submissions by students; our observations show that even in buggy submissions, students considered the correct order of sorting. It means that they fully understood the point of sorting the tuples in a way that "...older people are at the front...".
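The snippet below contrasts the two orders; the (name, age) tuple layout is an assumption made for illustration and is not necessarily the assignment's exact format:

# Tuples are assumed to be (name, age) pairs for this illustration.
people = [("Ada", 36), ("Grace", 85), ("Alan", 41)]

# Default ascending sort by age: the order Copilot returned in 92% of its suggestions.
ascending = sorted(people, key=lambda person: person[1])

# "Older people are at the front", i.e. descending by age: the intended order.
descending = sorted(people, key=lambda person: person[1], reverse=True)
print(descending)  # [('Grace', 85), ('Alan', 41), ('Ada', 36)]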
Copilot shows similar limitations on algorithmic problems. For example, when asking Copilot to implement the "activity" class in Subsection 4.1.4, Copilot cannot understand putting limits on variables, even though it was asked to do so explicitly. Another limitation is its difficulty in understanding long descriptions, which is also observed by [31]. Throughout our testing in Subsections 4.1 and 4.2, we observed that Copilot might misunderstand the problem entirely if the description contains multiple sentences (whether short or long).

5.2. Experimental Suggestions
Furthermore, to explore further how to change the prompt to meet the target solution, we performed some experiments applying different scenarios and discussing their impacts on the results.
Scenario#1: In this scenario, we changed "...older people are at the front..." to "...descending order..." in the description of q4 and repeated the process with Copilot to generate solutions. This small change improves the CR from 14% to 79%. This improvement shows that there are some details/keywords in the description of problems that seem obvious to humans but that Copilot cannot understand in natural language. If we rephrase those details using programming-specific/technical keywords such as "descending", it can help Copilot recommend relevant solutions.
Scenario#2: We have a similar observation for q2, "Unique Birthday", where Copilot cannot understand the requirements mentioned in the description, although all students considered them. In this question, it is asked to "...implement 3 different functions unique day, unique month and contains unique day..." to address the problem. Copilot could not understand this condition. The unit tests for q2 test all 3 functions. Thus, the CR of Copilot for q2 equals zero, because all 50 solutions in the different attempts have failed on some of the unit tests.
So, in this scenario, we gave 3 separate descriptions to Copilot for the unique day, unique month, and contains unique day functions in the same source file. Here is the revised description that we used:
• unique day: Given a day and a list of possible birthday dates, return True if there is only one possible birthday with that day, and False otherwise.

• unique month: Given a month and a list of possible birthday dates, return True if there is only one possible birthday within that month, and False otherwise.

• contains unique day: Given a month and a list of possible birthday dates, return True if there is only one possible birthday with that month and day, and False otherwise.

We started with the description of unique day at the first line of the source file. Then, we accepted the first solution suggested by Copilot. We continued with the description of unique month on the next line, accepted the first suggested solution, and followed the same procedure for contains unique day. We repeated the process 50 times to generate 50 solutions that contain the 3 separate functions. Copilot even calls the unique day function in some of its suggestions for the contains unique day function; sample solutions can be found in the replication package. Since there are separate unit tests for each function, we ran the related tests against each function. In this scenario, the CR of unique day, unique month, and contains unique day is 88%, 0%, and 40%, respectively.
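The sketch below illustrates the layout of the resulting source file: each revised description is placed as a comment and is followed by a function standing in for the first accepted suggestion. The bodies shown here are illustrative placeholders written for this sketch, not Copilot's actual output, and they assume (month, day) tuples as in the unit test discussed in Scenario#3 below:

# unique day: Given a day and a list of possible birthday dates, return True
# if there is only one possible birthday with that day, and False otherwise.
def unique_day(day, possible_birthdays):
    return sum(1 for _, d in possible_birthdays if d == day) == 1

# unique month: Given a month and a list of possible birthday dates, return
# True if there is only one possible birthday within that month, and False otherwise.
def unique_month(month, possible_birthdays):
    return sum(1 for m, _ in possible_birthdays if m == month) == 1

# contains unique day: Given a month and a list of possible birthday dates,
# return True if there is only one possible birthday with that month and day,
# and False otherwise.
def contains_unique_day(month, possible_birthdays):
    days_in_month = [d for m, d in possible_birthdays if m == month]
    return any(unique_day(d, possible_birthdays) for d in days_in_month)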
While the original description was clear to students, Copilot could not understand it. Instead of asking Copilot to solve the problem with different functions, we divided the problem into 3 different problems. This increases the CR for unique day and contains unique day. However, the CR of unique month is still zero. In the following, we investigate this case with a different scenario.
Scenario#3: Since Copilot could not find any correct solutions for unique month, we manually checked its suggested solutions. We found that in all buggy solutions, Copilot refers to the second item of the "birthday" tuple in the list of birthday dates as the month. However, the unit tests consider the month to be the first item of the tuples when testing the functionality of the method. For example, consider the unit test below:

• unique month(Month = "January", Birthdays = [("January", "1"), ("January", "2")]).

In each tuple in the list of birthdays, for example, ("January", "1"), Copilot referred to the second item as the month; however, the first item in the tuple is the birthday month.
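Written as an executable check (a sketch assuming a pytest-style assertion and the unique_month signature from the sketch above; the course's exact test harness is not shown), the quoted case reads:

def test_unique_month_january_not_unique():
    # Two possible birthdays fall in January, so the month is not unique
    # and the expected result is False.
    birthdays = [("January", "1"), ("January", "2")]
    assert not unique_month("January", birthdays)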
In the description of "unique month", we added the above unit test as a sample input at the end of the description. This improves the CR of "unique month" from 0% to 91%. It shows that adding a sample input or a sample unit test to the description of a problem can help Copilot generate more correct solutions.
In addition, we randomly checked 20% of students' submissions (both correct and buggy). Our observation shows that none of them assumed any wrong structure for the input data, even though the structure of the input is not clear in the description of the question. Thus, we assume that there was some extra clarification between the students and the lecturer about the structure of the input.

6. Threats to Validity

We now discuss the threats to the validity of our study following the guidelines provided by Wohlin [54] for experimentation in software engineering.

6.1. Internal Validity
The threat to internal validity comes from the fact that Copilot is closed-source. We cannot analyze our results based on the characteristics (and expected behavior) of Copilot's trained model. This is also the case for Copilot's training data; hence, we are not able to indicate whether it memorized the solutions to these inquiries from its training set or whether it generates a unique solution. Similar to other researchers [38, 29, 51, 8], we can only investigate Copilot's functionality in suggesting code for the provided prompt.
Also, as our experiments have shown, Copilot's suggestions change over time and are not always consistent. This may come from the inconsistency stemming from the nature of LLMs and also from the continuous improvement of Copilot's engine as an ML product, perhaps by feeding it new code samples or learning from new queries submitted to Copilot. As a result, we cannot guarantee that other researchers will receive the same suggestions and results that we obtained by performing the same experiments.

6.2. External Validity
The lack of a dataset that comes from an industrial context and contains programming task statements along with their corresponding code drives us to follow the path of other research in software engineering using classical programming tasks to study Copilot's competence [51, 47, 38, 16, 49]. There are different advantages to these types of programming tasks, which we discussed in Subsections 3.1 and 3.2.1. To highlight two advantages: first, Copilot is able to generate answers corresponding to these task descriptions, so we could apply our assessments beyond the correctness of the suggested solutions; also, the task descriptions in our datasets are human-written, which decreases the possibility of the memorization issue in LLMs. But these programming tasks are not representative of the whole range of programming tasks in real software projects.
Considering the choice of programming tasks, and to have a fair comparison, we compared Copilot with students in a Python programming course. While we have no information about the background and characteristics of the participants, we assume that they are good representatives of junior developers in real software projects, but they may not be representatives of the whole population.
6.3. Conclusion Validity
To mitigate the threats to the validity of our conclusions, we chose different quantitative metrics, based on other studies in software engineering, to compare Copilot's code with humans' [19, 38, 30]. Even though these quantitative metrics reduce the chance of having biased conclusions, they do not enable us to conduct any qualitative assessment, such as how humans interact with the tool.

in improving their programming skills. Therefore, as future work, a tool or a layer on top of Copilot that can filter out buggy and non-optimal suggestions would reduce the liability of using this tool in software projects. Future work can also use our study design and explore more diverse programming tasks with heterogeneous participants in a human-centered study, to more comprehensively compare Copilot with humans as an AI pair programmer.
[19] S. Fakhoury, D. Roy, A. Hassan, and V. Arnaoudova. Improving source code readability: Theory and practice. In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pages 2–12. IEEE, 2019.
[20] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, et al. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155, 2020.
[21] J. Finnie-Ansley, P. Denny, B. A. Becker, A. Luxton-Reilly, and J. Prather. The robots are coming: Exploring the implications of openai codex on introductory programming. In Australasian Computing Education Conference, pages 10–19, 2022.
[22] N. Forsgren, M.-A. Storey, C. Maddila, T. Zimmermann, B. Houck, and J. Butler. The space of developer productivity: There's more to it than you think. Queue, 19(1):20–48, 2021.
[23] I. Fronza, A. Sillitti, and G. Succi. An interpretation of the results of the analysis of pair programming during novices integration in a team. In 2009 3rd International Symposium on Empirical Software Engineering and Measurement, pages 225–235. IEEE, 2009.
[24] Geeksforgeeks Team. Geeksforgeeks. https://www.geeksforgeeks.org, 2022.
[25] S. Gulwani. Dimensions in program synthesis. In Proceedings of the 12th International ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming, PPDP '10, page 13–24, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781450301329. doi: 10.1145/1836089.1836091. URL https://doi.org/10.1145/1836089.1836091.
[26] S. Gulwani, I. Radiček, and F. Zuleger. Automated clustering and program repair for introductory programming assignments. ACM SIGPLAN Notices, 53(4):465–480, 2018.
[27] C. B. Harris and I. G. Harris. Glast: Learning formal grammars to translate natural language specifications into hardware assertions. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 966–971. IEEE, 2016.
[28] Y. Hu, U. Z. Ahmed, S. Mechtaev, B. Leong, and A. Roychoudhury. Re-factoring based program repair applied to programming assignments. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 388–398. IEEE, 2019.
[29] S. Imai. Is github copilot a substitute for human pair-programming? an empirical study. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, pages 319–321, 2022.
[30] S. Kim and E. J. Whitehead Jr. How long did it take to fix bugs? In Proceedings of the 2006 International Workshop on Mining Software Repositories, pages 173–174, 2006.
[31] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, et al. Competition-level code generation with alphacode. arXiv preprint arXiv:2203.07814, 2022.
[32] K. M. Lui and K. C. Chan. Pair programming productivity: Novice–novice vs. expert–expert. International Journal of Human-Computer Studies, 64(9):915–925, 2006.
[33] Z. Manna and R. Waldinger. A deductive approach to program synthesis. ACM Transactions on Programming Languages and Systems (TOPLAS), 2(1):90–121, 1980.
[34] L. M. Maruping, X. Zhang, and V. Venkatesh. Role of collective ownership and coding standards in coordinating expertise in software project teams. European Journal of Information Systems, 18(4):355–371, 2009.
[35] R. Mihalcea, H. Liu, and H. Lieberman. Nlp (natural language processing) for nlp (natural language programming). In International Conference on Intelligent Text Processing and Computational Linguistics, pages 319–330. Springer, 2006.
[36] A. Moradi Dakhel, M. C. Desmarais, and F. Khomh. Assessing developer expertise from the statistical distribution of programming syntax patterns. In Evaluation and Assessment in Software Engineering, pages 90–99, 2021.
[37] E. A. Moroz, V. O. Grizkevich, and I. M. Novozhilov. The potential of artificial intelligence as a method of software developer's productivity improvement. In 2022 Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus), pages 386–390. IEEE, 2022.
[38] N. Nguyen and S. Nadi. An empirical evaluation of GitHub Copilot's code suggestions. In Proceedings of the 19th ACM International Conference on Mining Software Repositories (MSR), pages 1–5, 2022.
[39] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[40] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri. Asleep at the keyboard? Assessing the security of github copilot's code contributions. In 2022 IEEE Symposium on Security and Privacy (SP), pages 980–994, Los Alamitos, CA, USA, May 2022. IEEE Computer Society. doi: 10.1109/SP46214.2022.00057. URL https://doi.ieeecomputersociety.org/10.1109/SP46214.2022.00057.
[41] L. Plonka, H. Sharp, J. Van der Linden, and Y. Dittrich. Knowledge transfer in pair programming: An in-depth analysis. International Journal of Human-Computer Studies, 73:66–78, 2015.
[42] K. Rahit, R. H. Nabil, and M. H. Huq. Machine translation from natural language to code using long-short term memory. In Proceedings of the Future Technologies Conference, pages 56–63. Springer, 2019.
[43] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan, M. Zhou, A. Blanco, and S. Ma. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297, 2020.
[44] P. Salazar Paredes et al. Comparing Python programs using abstract syntax trees. Technical report, Uniandes, 2020.
[45] M. M. S. Sarwar, S. Shahzad, and I. Ahmad. Cyclomatic complexity: The nesting problem. In Eighth International Conference on Digital Information Management (ICDIM 2013), pages 274–279. IEEE, 2013.
[46] S. Scalabrino, G. Bavota, C. Vendome, M. Linares-Vasquez, D. Poshyvanyk, and R. Oliveto. Automatically assessing code understandability. IEEE Transactions on Software Engineering, 47(3):595–613, 2019.
[47] D. Sobania, M. Briesch, and F. Rothlauf. Choose your programming copilot: A comparison of the program synthesis performance of github copilot and genetic programming. arXiv preprint arXiv:2111.07875, 2021.
[48] D. Sobania, D. Schweim, and F. Rothlauf. Recent developments in program synthesis with evolutionary algorithms. arXiv preprint arXiv:2108.12227, 2021.
[49] L. Tang, E. Ke, N. Singh, N. Verma, and I. Drori. Solving probability and statistics problems by program synthesis. arXiv preprint arXiv:2111.08267, 2021.
[50] N. Tran, H. Tran, S. Nguyen, H. Nguyen, and T. Nguyen. Does BLEU score work for code migration? In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pages 165–176. IEEE, 2019.
[51] P. Vaithilingam, T. Zhang, and E. L. Glassman. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts, pages 1–7, 2022.
[52] W3schools Team. W3schools. https://www.w3schools.com, 2022.
[53] N. Wirth. Algorithms & data structures. Prentice-Hall, Inc., 1985.
[54] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén. Experimentation in software engineering. Springer Science & Business Media, 2012.
[55] F. Zhang, F. Khomh, Y. Zou, and A. E. Hassan. An empirical study on factors impacting bug fixing time. In 2012 19th Working Conference on Reverse Engineering, pages 225–234. IEEE, 2012.
[56] A. Ziegler, E. Kalliamvakou, S. Simister, G. Sittampalam, A. Li, A. Rice, D. Rifkin, and E. Aftandilian. Productivity assessment of neural code completion. arXiv preprint arXiv:2205.06537, 2022.