Benchmarks drive many areas of research forward, and this is certainly true of two areas I engage with: software engineering and machine learning. With the increasing emphasis on AI (especially LLMs) for coding, it is no surprise that benchmarks have played an important role at the intersection of the two. Researchers working on AI applied to code have recently been trying to improve the performance of large language models on SWE-bench, a dataset of GitHub issues; just a couple of years ago, the target was HumanEval, a dataset of programming problems meant to challenge the code-generation capabilities of these models.
Benchmarks are important for those of us who build software development products that incorporate AI. While the actual product experience in the hands of real users is the ultimate measure of success, that signal comes with a time lag and at some inconvenience to users. A reasonable offline proxy evaluation of the product is therefore important, and most companies invest in such evals, which I will treat here as a synonym for “benchmarks.” The offline evaluation judges whether our products that incorporate AI are getting better. Sometimes there is also a choice of models or agent frameworks to be made, and evals are needed for that as well.
Despite this importance, something is not quite right with the present state of benchmarks for AI applied to software engineering, and indeed this is the opinion I wish to express in this note.
When assessing a benchmark suite, we should ask the following:
- Does the suite represent the software engineering work, including its complexity and diversity, that we want AI to help with in the first place?
- Is there headroom for improvement (or does current AI already “ace it”)?
- Is the benchmark contaminated, in that the models have already seen the answers during training and may have memorized them?
- Is it simple and inexpensive to run the benchmark suite repeatedly?
- Do we have robust scoring techniques to judge whether the answers produced are good?
Unfortunately, most eval sets fail one or more of these criteria.
HumanEval, which was popular until recently, was used for a variety of AI-for-code evaluations, but its problems are primarily “coding puzzle”-like Python questions that do not represent the reality of any software engineer’s day-to-day work, except perhaps that of engineers preparing for coding interviews. The benchmark is now saturated, perhaps both because of contamination and because models have become a lot more competent over the past three years.
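To make the “coding puzzle” characterization concrete, here is an illustrative task written in the style of HumanEval (invented for this note, not taken from the benchmark): the model sees a function signature and docstring, generates the body, and is scored automatically by unit tests.

```python
# Illustrative task in the style of HumanEval (not an actual benchmark item).
# The model is given the signature and docstring and must produce the body.

def longest_common_prefix(words: list[str]) -> str:
    """Return the longest prefix shared by all strings in `words`.

    >>> longest_common_prefix(["flower", "flow", "flight"])
    'fl'
    >>> longest_common_prefix(["dog", "racecar", "car"])
    ''
    """
    # A candidate completion generated by the model:
    if not words:
        return ""
    prefix = words[0]
    for word in words[1:]:
        while not word.startswith(prefix):
            prefix = prefix[:-1]
            if not prefix:
                return ""
    return prefix


# Scoring is fully automatic: hidden unit tests decide pass/fail.
assert longest_common_prefix(["flower", "flow", "flight"]) == "fl"
assert longest_common_prefix(["dog", "racecar", "car"]) == ""
```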
SWE-bench is a newer, exciting benchmark about solving GitHub issues using AI. It has surged in popularity and has essentially spurred the move toward agentic techniques (as in Devin, OpenHands, or Jules), in which an LLM is invoked not just once but repeatedly, incrementally gaining information in its context through external tools; a minimal sketch of such a loop appears after the list below. SWE-bench, however, is also not perfect:
- It represents a narrow view of the coding world: it is drawn from just 12 Python repositories on GitHub
- Sometimes the solution sketch is present in the issue description itself
- Sometimes the provided tests do not thoroughly check that the issue has actually been solved (a problem the software engineering research community commonly faces when evaluating program repair systems!)
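As referenced above, here is a minimal, hypothetical sketch of the agentic loop that benchmarks like SWE-bench have encouraged. The message format, tool names, and stopping rule are illustrative assumptions, not the design of any particular agent.

```python
# A minimal, hypothetical sketch of an agentic loop for issue solving.
# `call_llm` is whatever chat-completion client is in use; the tool names,
# message format, and stopping rule here are illustrative assumptions.

def solve_issue(issue_text, call_llm, tools, max_steps=30):
    """Repeatedly query the model, feeding tool results back into its context."""
    history = [{"role": "user", "content": f"Fix this issue:\n{issue_text}"}]
    for _ in range(max_steps):
        reply = call_llm(history)            # model proposes the next action
        if reply.get("tool") is None:        # no tool requested: model is done
            return reply["content"]          # e.g., a final patch or summary
        result = tools[reply["tool"]](*reply.get("args", []))
        # The model's context grows incrementally with each tool result.
        history.append({"role": "assistant", "content": str(reply)})
        history.append({"role": "tool", "content": str(result)})
    return "step budget exhausted"


# Example tool set (stubs); a real agent would search, edit, and run tests.
tools = {
    "search_repo": lambda query: "...matching files...",
    "read_file":   lambda path: "...file contents...",
    "run_tests":   lambda: "...pytest output...",
}
```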
On SWE-bench-Live, a recent re-curation of the benchmark from issues created after the training cutoffs of the latest frontier LLMs, model performance is far below that on the original dataset, pointing to the possibility of contamination. Given all these observations, it is not clear that SWE-bench properly assesses the coding ability of LLMs more broadly.
Moreover, work by Rondon et al. showed that various complexity characteristics of bugs drawn from a bug database at Google were notably different from those of SWE-bench.
HumanEval and SWE-bench have taken hold in the ML community, and yet, as indicated above, neither necessarily reflects LLMs’ competence at everyday software engineering tasks. I conjecture that one of the reasons is the difference in the points of view of the two communities. The ML community prefers large-scale, automatically scored benchmarks, as long as they provide a “hill climbing” signal for improving LLMs. The business imperative for LLM makers to compete on popular leaderboards can relegate the broader user experience to a secondary concern.
On the other hand, the software engineering community needs benchmarks that closely capture specific product experiences. Because curation is expensive, the scale of these benchmarks is sufficient only to obtain a reasonable offline signal for the decision at hand (A/B testing is always carried out before a launch). Such benchmarks may also require complex setup to run, and their scoring is sometimes not automated; these shortcomings can be acceptable at a smaller scale. For exactly these reasons, such benchmarks are not useful to the ML community.
Much is lost because of these differing points of view. How the two communities could collaborate to bridge the gap between scale and meaningfulness, and create evals that work well for both, is an interesting question.
What’s the path forward? I offer the following challenges:
- Find a way—perhaps by collaborating with industry partners—to calibrate benchmarks to “real world” representativeness, diversity, and complexity in software engineering tasks.
- Set up processes to curate and refresh eval suites from the immense amount of software engineering data available from GitHub and other providers (not just static code, but code revisions, testing, and the “outer loop” stages as well). SWE-bench-Live and LiveCodeBench are efforts in this direction.
- Keeping in mind the dual purpose of evals, can curation take into account the need to run these evals at training time as well? This may require containerization and other techniques to build proper execution harnesses (a containerized-harness sketch follows this list).
- Develop automated scoring that closely approximates how a human judges a model’s output. In most software engineering applications, the human is in the loop, and perfect outputs from AI are not strictly necessary for it to be useful. All current scoring methods, whether functional tests, textual matching, or LLM-based judging, have limitations. I believe that ultimately there has to be a per-example reification of what the human is looking for in the answer, and that a model’s response should be scored against it (a rubric-scoring sketch also follows this list).
- There are no commonly adopted benchmarks for significant portions of the software development lifecycle, such as code transformation, code review, debugging, and reasoning about code, even though these are important to software engineers. (To be sure, some have been proposed, but none has been adopted widely.) This is a missed opportunity both for developers of software tools and for ML engineers.
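On the execution-harness point above, here is a hypothetical sketch of running a task’s tests hermetically in a container, so that the same eval could also be exercised at training time. The image name, paths, and commands are assumptions for illustration.

```python
# Hypothetical sketch: run a benchmark task's test command inside a container
# so the eval is hermetic and repeatable (e.g., usable at training time).
# The image name, repo path, and test command are illustrative assumptions.
import subprocess

def run_task_in_container(repo_dir: str, test_cmd: str,
                          image: str = "python:3.11") -> bool:
    """Mount the task's repo into a fresh container and run its tests."""
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{repo_dir}:/workspace", "-w", "/workspace",
         image, "bash", "-lc", test_cmd],
        capture_output=True, text=True, timeout=600,
    )
    return result.returncode == 0  # pass/fail signal for the scorer

# Example (hypothetical path and command):
# passed = run_task_in_container("/tmp/task_42", "pip install -e . && pytest -x")
```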
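And to make the per-example “reification” idea concrete, here is one possible sketch: each eval example carries an explicit rubric of what a human reviewer would look for, and a response is scored against that rubric. The specific checks below are crude, illustrative stand-ins.

```python
# A minimal sketch of per-example "reification": each eval example carries an
# explicit rubric of what a human reviewer would look for, and a response is
# scored against that rubric. The checks below are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    description: str                # what the human cares about
    check: Callable[[str], bool]    # automated approximation of that judgment
    weight: float = 1.0

def score_response(response: str, rubric: list[RubricItem]) -> float:
    """Weighted fraction of rubric items the response satisfies."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total if total else 0.0

# Example rubric for a hypothetical "fix the off-by-one in pagination" task.
rubric = [
    RubricItem("touches the pagination module", lambda r: "paginate" in r),
    RubricItem("adjusts the loop bound", lambda r: "range(" in r),
    RubricItem("adds or updates a test", lambda r: "def test_" in r, weight=2.0),
]

print(score_response("def paginate(...): ...\ndef test_paginate(): ...", rubric))
```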
All of the above have costs associated with them, which is why a community effort might be more tractable than one driven by a single entity.

Satish Chandra is a software engineer at Google. Opinions expressed here are his own.