Tackling The Abstraction and Reasoning Corpus (ARC) With Object-Centric Models and The MDL Principle
Sébastien Ferré
1 Introduction
Artificial Intelligence (AI) has made impressive progress in the past decade at specific
tasks, sometimes achieving super-human performance: e.g., image recognition [15],
board games [21], natural language processing [6]. However, AI still misses the gen-
erality and flexibility of human intelligence to adapt to novel tasks with little training.
To foster AI research beyond narrow generalization [11], F. Chollet [4,5] introduced a
measure of intelligence that values skill-acquisition efficiency over skill performance,
i.e. the amount of prior knowledge and experience that an agent needs to reach a rea-
sonably good level at a range of tasks (e.g., board games) matters more than its absolute
performance at any specific task (e.g., chess). Chollet also introduced the Abstraction
and Reasoning Corpus (ARC) benchmark in the form of a psychometric test to mea-
sure and compare the intelligence of humans and machines alike. ARC is a collection
of tasks that consist in learning how to transform an input colored grid into an output
colored grid, given only a few examples. It is a very challenging benchmark. While
humans can solve more than 80% of the tasks [14], the winner of a Kaggle contest1
could only solve 20% of the tasks (with a lot of hard-coded primitives and brute-force
search), and the winner of the more recent ARCathon’22 contest2 could only solve 6%
of the tasks.
1 https://www.kaggle.com/c/abstraction-and-reasoning-challenge
2 https://lab42.global/past-challenges/arcathon-2022/
The existing published approaches [10,3,23,2], and also the Kaggle winner, tackle
the ARC challenge as a program synthesis problem, where a program is a composi-
tion of primitive transformations, and learning is done by searching the large program
space. In contrast, psychological studies have shown that the natural programs pro-
duced by humans to solve ARC tasks are object-centric, and more declarative than
procedural [14,1]. When asked to verbalize instructions on how to solve a task, partici-
pants typically first describe what to expect in the input grid, and then how to generate
the output grid based on the elements found in the input grid.
We make two contributions w.r.t. existing work:
1. object-centric models that enable both parsing and generating grids in terms of object
patterns and computations on those objects;
2. an efficient search of object-centric models based on the Minimum Description
Length (MDL) principle [20].
A model for an ARC task combines two grid models, one for the input grid, and another
for the output grid. This closely matches the structure of natural programs. Compared
to the transformation-based programs that can only predict an output grid from an input
grid, our models can also provide a joint description for a pair of grids. They can also
create new pairs of grids, although this is not evaluated in this paper. They could also in
principle be adapted to tasks based on a single grid or on sequences of grids. All of this
is possible because grid models can be used both for parsing a grid and for generating
a grid.
The MDL principle comes from information theory, and says that “the model that
best describes the data is the model that compresses them the most” [20,12]. It has for
instance been applied to pattern mining [22,9]. The MDL principle is used at two levels:
(a) to choose the best parses of a grid according to a grid model, and (b) to efficiently
search the large model space by incrementally building more and more accurate models.
MDL at level (a) is essential because the segmentation of a grid into objects is task-
dependent and has to be learned along with the definition of the output objects as a
function of the input objects. The two contributions support each other because existing
search strategies could not handle the large number of elementary components of our
grid models, and because the transformation-based programs are not suitable to the
incremental evaluation required by MDL-based search.
We report promising results based on grid models that are still far from covering
all knowledge priors assumed by ARC. Correct models are found for 96/400 varied
training tasks with a 60s time budget. Many of those are similar to the natural programs
produced by humans. Moreover, we demonstrate the generality of our approach by ap-
plying it to the automatic filling of spreadsheet columns [13], where inputs and outputs
are rows of strings instead of grids.
The paper is organized as follows. Section 2 presents the ARC benchmark, and a
running example task. Section 3 discusses related work. Section 4 defines our object-
centric models, and Section 5 explains how to learn them with the MDL principle.
Section 6 reports on experimental results, comparing with existing approaches.
Fig. 1: Training tasks b94a9452 (top) and 23581191 (bottom), with 2-3 demonstration exam-
ples (left) and the input grid of a test case (right).
2 The ARC Benchmark
ARC is a collection of tasks3 , where each task is made of training examples (3.3 on
average) and test examples (1 in general). Each example is made of an input grid and
an output grid. Each grid is a 2D array (with size up to 30x30) filled with integers
coding for colors (10 distinct colors). For a given task, the size of grids can vary from
one example to another, and between the input and the output. Each task is a machine
learning problem, whose goal is to learn a model that can generate the output grid from
the input grid, and to do so from only a few training examples. Prediction is successful only
if the predicted output grid is strictly equal to the expected grid for all test examples;
there is no partial success. However, three trials are allowed for each test example to
compensate for potential ambiguities in the training examples. Figure 1 shows two ARC
tasks (with the expected test output grid missing). The first is used as a running example
in this paper.
We now more formally define grids, examples, and tasks.
Definition 1 (grid). A grid g ∈ C h×w is a matrix with values taken from a set of
colors C, with h > 0 rows (height), and w > 0 columns (width). A grid cell is iden-
tified by coordinates (i, j), where i selects a row, and j selects a column. The color
at coordinates (i, j) is denoted by either gij or g[i, j]. Coordinates range from (0, 0)
to (h − 1, w − 1).
Definition 2 (example). An example is a pair of grids e = (g^i, g^o), where g^i is the input grid and g^o is the output grid.
As illustrated by Figure 1, the output grid need not have the same size as the input
grid; it can be smaller or bigger.
Definition 3 (task). A task is a pair T = (E, F ), where E is the set of training exam-
ples, and F is the set of test examples.
3 Data and testing interface at https://github.com/fchollet/ARC
ARC tasks have 3.3 training examples on average, and 1 or 2 test examples (most
often 1). As illustrated by Figure 1, the different input grids of a task need not have the
same size, nor use the same colors. The same applies to test grids.
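To make these definitions concrete, the following sketch (in Python, not the paper's implementation language) shows one way to represent grids, examples, and tasks, assuming the JSON format of the public ARC repository (train/test lists of input/output integer matrices).

```python
import json
from dataclasses import dataclass
from typing import List

Grid = List[List[int]]          # h x w matrix of color codes 0..9

@dataclass
class Example:
    input: Grid                 # input grid g^i
    output: Grid                # expected output grid g^o

@dataclass
class Task:
    train: List[Example]        # training examples E
    test: List[Example]         # test examples F

def load_task(path: str) -> Task:
    """Load one ARC task from its JSON file (format of the fchollet/ARC repository)."""
    with open(path) as f:
        data = json.load(f)
    def to_examples(items):
        return [Example(e["input"], e["output"]) for e in items]
    return Task(train=to_examples(data["train"]), test=to_examples(data["test"]))
```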
ARC is composed of 1000 tasks in total: 400 “training tasks”4 , 400 evaluation tasks,
and 200 secret tasks for independent evaluation. Figure 1 shows two of the 400 training
tasks. Developers should only look at the training tasks, not at the evaluation tasks.
The latter should only be used to evaluate the broad generalization capability of the
developed systems.
3 Related Work
The ARC benchmark is recent and not many approaches have been published so far. All
those we know define a DSL (Domain-Specific Language) of programs that transform
an input grid into an output grid, and search for a program that is correct on the training
examples [10,3,23,2]. The differences mostly lie in the primitive transformations (prior
knowledge) and in the search strategy. It is tempting to define more and more primitives
like the Kaggle winner did, hence more prior knowledge, but this means a less intelli-
gent system according to Chollet’s measure. To guide the search in the huge program
space, those approaches use either grammatical evolution [10], neural networks [3],
search tree pruning with hashing and Tabu list [23], or stochastic search trained on
solved tasks [2]. A difficulty is that the output grids are generally only used to score
a candidate program, so that the search is largely blind. Alford et al. [3] improve on this
with a neural-guided bidirectional search that grows the program from both directions,
input and output. Xu et al. [23] compare the in-progress generated grid to the expected grid,
but this limits the approach to tasks whose output grids have the same size and
the same objects as the input grids. DSL-based approaches have a scaling issue because the
search space increases exponentially with the number of primitives. Ainooson et al. [2]
alleviate this difficulty by defining high-level primitives that embody specialized search
strategies. We compare and discuss their performance in the evaluation section.
Johnson et al. [14] report on a psychological study of ARC. It reveals that humans
use object-centric mental representations to solve ARC tasks. This is in contrast with
existing solutions that are based on grid transformations. Interestingly, the tasks that
are found the most difficult by humans are those based on logic (e.g., an exclusive-or
between grids) and symmetries (e.g., rotation), precisely those most easily solved by
DSL-based approaches. The study exhibits two challenges: (1) the need for a large set
of primitives, especially about geometry; (2) the difficulty of identifying objects, which may
be only partly visible due to overlap or occlusion. A valuable resource is LARC, for
Language-annotated ARC [1], collected by crowd-sourcing. It provides for most train-
ing tasks one or several natural programs that confirm the object-centric and declarative
nature of human representations. A natural program is a short textual description produced
by a participant that could be used by another participant to generate test output
grids (without access to the training examples).
4 The term “training tasks” may be misleading as their purpose is to train AI developers, not AI systems. Humans solve ARC tasks without training.
Beyond the ARC benchmark, a large body of work exists in the domain of
program synthesis, which is also known as program induction or programming by
examples (PbE) [17]. An early approach is Inductive Logic Programming (ILP) [19],
where target predicates are learned from symbolic representations. PbE is used in the
FlashFill feature of Microsoft Excel 2013 to learn complex string processing formulas
from a few examples [18]. Dreamcoder [8] alternates a wake phase that uses a neurally
guided search to solve tasks, and a sleep phase that extends a library of abstractions
to compress programs found during wake. Bayesian program learning was shown to
outperform deep learning at parsing and generating handwritten characters from the world’s alphabets [16].
4 Object-centric Models
The purpose of a grid model is to distinguish between invariant and variant elements
across the grids of a task. In task b94a9452 (Figure 1 top), all input grids con-
tain a square but the size, color, and position vary. This can be expressed by a pat-
tern Square(size:?, color:?, pos:?), where Square is called a constructor (here with
three arguments), and the question marks are called unknowns (similar to Prolog vari-
ables). There is also a constructor for positions as 2D vectors Vec(i:?, j:?), and prim-
itive values for sizes (e.g., 3) and colors (e.g., blue). Patterns can be nested, like in
Square(3,?,Vec(?,2)), which means “a square whose size is 3, and whose top left cor-
ner is on column 2”, in order to have models as specific as necessary. Fully grounded
patterns (without unknowns) are called descriptions: e.g., Square(3,blue,Vec(2,4)).
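As an illustration of patterns, unknowns, and descriptions, here is a minimal sketch (Python, with hypothetical data structures rather than the paper's implementation) where a description matches a pattern when it agrees with the pattern on every specified component.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Unknown:
    """The '?' placeholder of a pattern."""
    pass

UNK = Unknown()

@dataclass(frozen=True)
class Pat:
    """A constructor applied to arguments, e.g. Square(size, color, pos)."""
    ctor: str
    args: tuple

def Square(size, color, pos): return Pat("Square", (size, color, pos))
def Vec(i, j): return Pat("Vec", (i, j))

def matches(pattern: Any, description: Any) -> bool:
    """A description matches a pattern if it agrees with every specified part;
    unknowns ('?') match anything; primitive values must be equal."""
    if isinstance(pattern, Unknown):
        return True
    if isinstance(pattern, Pat) and isinstance(description, Pat):
        return (pattern.ctor == description.ctor
                and len(pattern.args) == len(description.args)
                and all(matches(p, d) for p, d in zip(pattern.args, description.args)))
    return pattern == description

# The description Square(3, blue, Vec(2, 4)) is an instance of Square(3, ?, Vec(2, ?)).
assert matches(Square(3, UNK, Vec(2, UNK)), Square(3, "blue", Vec(2, 4)))
```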
However, with patterns only, there is no way to make the output grid depend on the
input grid, which is key to solving ARC tasks. We therefore add two ingredients to grid
models (typically to output models): references to the components of a grid description
(typically the input one), and function applications to allow some output components
to be the result of a computation. For example, in task b94a9452, the model for the
small square in the output grids could be Square(!small.size, !large.color, !small.pos -
!large.pos), where for instance !small.size is a reference to the size of the small square
in the input grid, and ’-’ is the subtraction function. This model says that “the small
output square has the same size as the small input square, the same color as the large
input square, and its position is the difference between the positions of the two input
squares.”
Tables 1 and 2 respectively list the pattern constructors and the functions of
the grid models that we have used in our experiments. Each constructor/function
Table 1: Pattern constructors by type
type constructors
Grid Layers(size: Vector, color: Color, layers: Layer[])
Tiling(grid: Grid, size: Vector)
Layer Layer(pos: Vector, object: Object)
Object Colored(shape: Shape, color: Color)
Shape Point
Rectangle(size: Vector, mask: Mask)
Mask Bitmap(bitmap: Bitmap)
Full, Border, EvenCheckboard, OddCheckboard, ...
Vector Vec(i: Int, j: Int)
has a result type, and typed arguments. The argument types constrain which val-
ues/constructors/functions can be used in arguments. The names of constructor argu-
ments are used to reference the components of a grid model or grid description. Grid,
object and shape constructors have an implicit argument grid for their representation
as a raw grid. Point shapes have an implicit argument size, equal to Vec(1,1). Our grid
models describe a grid as either a stack of layers on top of a background having some
size and color, or as the tiling of a grid up to covering a grid of given size. A layer is an
object at some position. An object is so far limited to a one-color shape, where a shape
is either a point or some mask-specified shape fitting into a rectangle of some size. A
mask is either specified by a bitmap or by one of a few common shapes such as a full
rectangle or a rectangular border. Positions and sizes are 2D integer vectors. Four prim-
itive types are used: integers, colors, bitmaps (i.e., Boolean matrices), and grids (i.e.,
color matrices). The available functions essentially cover arithmetic operations on inte-
gers and on vectors, where vectors represent positions, sizes, and moves; and geometric
notions such as measures (e.g., area), translations, symmetries, scaling, and periodic
patterns (e.g., tiling). Unknowns are here limited to primitive types and vectors. References
and functions are so far only used in output grid models. They could be used in
the input models to express constraints, e.g. to state that different objects have the same
color.

Fig. 2: A task model (M^i, M^o) for task b94a9452.
M^i = Layers(?, black, [
        Layer(?, Colored(Rectangle(?, Full), ?)),
        Layer(?, Colored(Rectangle(?, Full), ?)) ])
M^o = Layers(!lay[1].object.shape.size, !lay[0].object.color, [
        Layer(!lay[0].pos - !lay[1].pos,
              coloring(!lay[0].object, !lay[1].object.color)) ])
A task model M = (M^i, M^o) is made of an input grid model M^i and an output grid
model M o . Figure 2 shows a correct model for task b94a9452, which in words says:
“There are two stacked full rectangles on a black background in the input grid. The size
of the output grid is the same as the bottom object (lay[1].object), and its background
color is the color of the top object (lay[0].object). The output grid has a copy of the top
object, recolored in the color of the bottom object, and whose position is the difference
between the top object position and the bottom object position.”
Given an environment, a grid model generates a grid description, which can then be
converted into a concrete grid. For example, the output model M^o of Figure 2, applied
with the description π^i of the first input grid as environment ε, generates the following
description π^o = Layers(Vec(4,4), yellow, [Layer(Vec(1,1), Colored(Rectangle(Vec(2,2),
Full), red))]). This description conforms to the expected output grid.
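The following sketch illustrates how such an output model can be evaluated: references are resolved against the environment (the input description) and functions are applied. The concrete environment values below are illustrative assumptions chosen to be consistent with the generated description π^o, and the helper names are hypothetical.

```python
env = {                                   # parsed input description (environment epsilon)
    "lay[0]": {"pos": (3, 3),             # top (small) square: assumed illustrative values
               "color": "yellow", "size": (2, 2)},
    "lay[1]": {"pos": (2, 2),             # bottom (large) square: assumed illustrative values
               "color": "red", "size": (4, 4)},
}

def ref(path: str):
    """Resolve a reference like '!lay[1].size' against the environment."""
    obj, field = path.lstrip("!").rsplit(".", 1)
    return env[obj][field]

def vec_sub(a, b):
    return (a[0] - b[0], a[1] - b[1])

# Output model of Figure 2 (simplified paths), with references resolved and functions applied:
output_description = {
    "size": ref("!lay[1].size"),                              # size of the bottom object
    "background": ref("!lay[0].color"),                       # color of the top object
    "layers": [{
        "pos": vec_sub(ref("!lay[0].pos"), ref("!lay[1].pos")),   # relative position
        "object": {"shape": ("Rectangle", ref("!lay[0].size"), "Full"),
                   "color": ref("!lay[1].color")},            # recolored copy of the top object
    }],
}

assert output_description["size"] == (4, 4)
assert output_description["layers"][0]["pos"] == (1, 1)
```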
An important point is that these two operations are multi-valued, i.e. may return
multiple descriptions. Indeed, there are often several ways of parsing a grid according
to a model, for example if the model mentions a single object while the grid contains
several. There are also several grids that can be generated by a model when it
contains unknowns.
The describe mode computes the joint descriptions of a pair of grids:
describe(M, g^i, g^o) = { (ρ^i, ρ^o, π^i, π^o) | (ρ^i, π^i) ∈ parse(M^i, nil, g^i),
                                                 (ρ^o, π^o) ∈ parse(M^o, π^i, g^o) }
The create mode makes it possible to create a new example of the task. It con-
sists of the successive generation of an input grid and an output grid, the latter being
conditioned by the former. This mode is not used in the ARC challenge but it could
contribute to the measurement of the intelligence of a system. Indeed, if an agent has
really understood a task, it should be able to produce new examples5 .
In all modes, the nil environment is used with the input model because the input
grid comes first, without any prior information. Note also that all modes inherit the
multi-valued property of parsing and generation. These three modes highlight an es-
sential difference between our object-centric models and the DSL-based programs of
existing approaches. The latter are designed for prediction (computation of the output
as a function of the input), they do not provide a description of the grids, nor a way
to create new input grids. A new example could be created by randomly generating an
input grid and applying the program, but in general, it would not respect most of the
task invariants: e.g., a random bitmap would be generated rather than a solid square.
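To summarize, the three modes can be seen as compositions of two assumed multi-valued primitives, parse and generate. The sketch below uses hypothetical signatures, not the actual API: parse(grid_model, env, grid) returns a list of (rank, description) candidates, generate(grid_model, env) returns candidate descriptions, and draw(description) renders a concrete grid.

```python
def describe(model, grid_in, grid_out, parse):
    """Joint descriptions of an (input, output) pair of grids."""
    mi, mo = model
    return [(ri, ro, pi, po)
            for (ri, pi) in parse(mi, None, grid_in)      # nil environment for the input
            for (ro, po) in parse(mo, pi, grid_out)]      # input description as environment

def predict(model, grid_in, parse, generate, draw):
    """Predict output grids from an input grid (up to three attempts in ARC)."""
    mi, mo = model
    return [draw(po)
            for (_, pi) in parse(mi, None, grid_in)
            for po in generate(mo, pi)]

def create(model, generate, draw):
    """Create a brand new example: generate an input, then an output conditioned on it."""
    mi, mo = model
    return [(draw(pi), draw(po))
            for pi in generate(mi, None)
            for po in generate(mo, pi)]
```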
5 MDL-based Model Learning
MDL-based learning works by searching for the model that compresses the data the
most. The data to be compressed is here the set of training examples. We have to define
two things: (1) the description lengths of models and examples, and (2) the search space
of models and the learning strategy.
A common approach in MDL is to define the overall description length (DL) as the sum
of two parts (two-part MDL): the model M, and the data D encoded according to the
model [12].
L(M, D) = L(M ) + L(D | M )
5 At school, teachers often ask pupils to produce their own examples of some concept to check their understanding.
In our case, the model is a task model composed of two grid models, and the data
is the set of training examples (pairs of grids). To compensate for the small number
of examples, and to allow for sufficiently complex models, we use a rehearsal factor
α ≥ 1, as if each example were seen α times.
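A sketch of the resulting two-part score, assuming helper functions dl_model and dl_example that return description lengths in bits:

```python
def total_description_length(model, examples, dl_model, dl_example, alpha=10):
    """Two-part MDL score L(M, D) = L(M) + L(D | M), where the data part sums the
    DL of each training example, counted alpha times (rehearsal factor)."""
    l_model = dl_model(model)
    l_data = alpha * sum(dl_example(model, e) for e in examples)
    return l_model + l_data
```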
The DL of an example is based on the most compressive joint description of the pair
of grids.
The term L(ρ) := L_N(ρ) − L_N(1) encodes the extra cost of not choosing the first
parsed description, penalizing higher ranks. L_N(n) is a classical universal encoding for
integers [7]. The term L(π | M, ε) measures the amount of information that must be
added to the model and the environment to encode the description, typically the values
of the unknowns. The term L(g | π) measures the differences between the original grid
and the grid produced by the description. A correct model is obtained when ρ^i = 1
and L(ρ^o, π^o, g^o | M^o, π^i) = 0 for all examples, i.e. when using the first description
for each input grid, there is nothing left to code for the output grids, and therefore the
output grids can be perfectly predicted from the input grids.
Three elementary model-dependent DLs have to be defined:
– L(M ): DL of a grid model;
– L(π | M, ε): DL of a grid description, according to the model and environment used
for parsing it;
– L(g | π): DL of a grid, relative to a grid description, i.e. the errors committed by the
description w.r.t. the grid.
We sketch those definitions for the grid models defined in Section 4. We recall that
description lengths are generally derived from probability distributions with the equation
L(x) = − log P (x), corresponding to an optimal coding [12].
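For instance, a uniform choice among n alternatives costs log2(n) bits, and an unbounded integer can be encoded with a universal code. The sketch below uses the Elias gamma code as one classical example; the paper only specifies that a universal code [7] is used.

```python
import math

def dl_of_choice(num_alternatives: int) -> float:
    """DL of one choice among equally likely alternatives: -log2(1/n) = log2(n) bits."""
    return math.log2(num_alternatives)

def dl_universal_int(n: int) -> int:
    """Length in bits of the Elias gamma code for n >= 1, one classical universal code
    for unbounded integers (an assumption; the paper only cites a universal code [7])."""
    assert n >= 1
    return 2 * int(math.floor(math.log2(n))) + 1

# Example: a uniform choice among 10 colors costs log2(10) ~= 3.32 bits,
# and encoding the integer 5 with Elias gamma costs 5 bits.
print(dl_of_choice(10), dl_universal_int(5))
```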
Defining L(M) amounts to encoding a syntax tree with constructors, values, un-
knowns, references, and functions as nodes. Because of types, only a subset of those
are actually possible at each node: e.g. type Layer has only one constructor. We use
uniform distributions across possible nodes, and universal encoding for non-bounded
ints. A reference is encoded according to a uniform distribution across all components
Table 3: Decomposition of L(M, D) for the model in Figure 2
              input    output      pair
L(M)           71.4      97.3     168.7
L(D | M)     2355.2       0.0    2355.2
L(M, D)      2426.6      97.3    2523.9
of the environment that have a compatible type. We give unknowns a lower probability
than constructors, and references/functions a higher probability, in order to encourage
models that are more specific, and that make the output depend on the input.
Defining L(π | M, ε) amounts to encoding the description components that are un-
knowns in the model. As those description components are actually grounded model
components, the above definitions for L(M ) can be reused, only adjusting the proba-
bility distributions to exclude unknowns, references and functions.
Defining L(g | π) amounts to encoding which cells in grid g are wrongly specified
by description π. For comparability with grid models and descriptions, we represent
each differing cell as a point object – Layer(Vec(i,j),Colored(Point,c)) – and encode it
like descriptions. We also have to encode the number of differing cells.
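A possible sketch of this error term, where the exact encoding of positions and colors is an assumption made for illustration:

```python
import math

def dl_grid_given_description(grid, predicted, dl_universal_int, num_colors=10):
    """Sketch of L(g | pi): encode the number of differing cells, then each differing
    cell as a point object (its position and color). Grids are assumed to have the
    same dimensions; the exact per-cell encoding is an illustrative assumption."""
    h, w = len(grid), len(grid[0])
    diffs = [(i, j) for i in range(h) for j in range(w)
             if grid[i][j] != predicted[i][j]]
    bits = dl_universal_int(len(diffs) + 1)            # number of differing cells
    for (i, j) in diffs:
        bits += math.log2(h) + math.log2(w)            # position of the point
        bits += math.log2(num_colors)                  # its color
    return bits
```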
Table 3 shows the decomposition of the description length L(M, D), for the model
in Figure 2 on task b94a9452, between the input and the output, and between the model
and the data encoded with the model. Recall that L(D | M) is for α = 10 copies of
each example. It shows that the DL of the output grids is zero bits, which means that
they are entirely determined by the grid models and the input grids. The proposed model
is therefore a solution to the task. The average DL of an input grid is 2355.2/10/3 = 78.5
bits, on a par with the DL of the input grid model.
For comparability across tasks, we use the normalized description length
L̂(M, D) = L(M^i, D^i) / L(M^i_init, D^i) + L(M^o, D^o) / L(M^o_init, D^o)  ∈ [0, 2],
where M_init is the initial model defined below.
Our initial model uses the unknown grid ? for both input and output: M_init = (?, ?).
The available refinements are the following:
Table 4: Learning trace for task b94a9452 (in.lay[1] is inserted before in.lay[0] because we
use the final layer indices for clarity). in/out denotes the input/output model, L̂ is the normalized
DL.
step refinement L̂
0 (initial model) 2.000
1 in ← Layers(?, ?, []) 1.117
2 out ← Layers(?, ?, []) 0.272
3 in.lay[1] ← Layer(?, Col.(Rect.(?, ?), ?)) 0.179
4 out.lay[0] ← Layer(?, Col.(Rect.(?, ?), ?)) 0.101
5 out.size ← !lay[1].object.shape.size 0.079
6 in.lay[0] ← Layer(?, Col.(Rect.(?, ?), ?)) 0.070
7 out.lay[0].object ← coloring(!lay[0].object, !lay[1].object.color) 0.045
8 out.color ← !lay[0].object.color 0.032
9 out.lay[0].pos ← !lay[0].pos−!lay[1].pos 0.020
10 in.lay[0].object.shape.mask ← Full 0.019
11 in.lay[1].object.shape.mask ← Full 0.019
12 in.color ← black 0.019
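The greedy search summarized by this trace can be sketched as follows, with refinements_of and normalized_dl as assumed helpers (not the implementation's actual API): starting from the initial model, the refinement that most decreases the normalized DL is applied at each step, until no refinement compresses further or the time budget is exhausted.

```python
import time

def learn_model(task, initial_model, refinements_of, normalized_dl, timeout_s=60):
    """Greedy MDL-based search, as in the trace of Table 4."""
    model = initial_model
    best_dl = normalized_dl(model, task)
    start = time.time()
    while time.time() - start < timeout_s:
        candidates = refinements_of(model, task)        # e.g. the 20 most promising ones
        scored = [(normalized_dl(m, task), m) for m in candidates]
        if not scored:
            break
        dl, refined = min(scored, key=lambda x: x[0])
        if dl >= best_dl:                               # no more compression: stop
            break
        model, best_dl = refined, dl
    return model, best_dl
```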
6 Evaluation
In this section, we first evaluate our approach on ARC, comparing it to existing ap-
proaches in terms of success rates, efficiency, model complexity, and model natural-
ness. We then evaluate the generality of our approach beyond ARC by applying it to a
different domain, spreadsheets, where inputs and outputs are rows of strings. Our ex-
periments were run with single-thread implementations on Fedora 32, Intel Core i7x12
with 16GB memory. We used one run per task set as there is no randomness involved.
We evaluated our approach on the 800 public ARC tasks, and we also took part in
the ARCathon 2022 challenge as team MADIL. The few parameters were set based
on the training tasks. To ensure a good balance of the computational time between
parsing and learning, we set some limits that remained stable across our experiments.
The number of descriptions produced by the parsing of a grid is limited to 64 and only
the 3 most compressive are retained for the computation of refinements. At each step, at
most 100,000 expressions are considered and only the 20 most promising refinements,
according to a DL estimate, are evaluated. The rehearsal rate α is set to 10. The tasks
are processed independently of each other, without learning from one to the other. The
results are given for a learning time per task limited to 60s plus 10s for the pruning
phase.

Table 5: Number of solved tasks (and percentage) and average learning time for solved tasks, for different methods on different task sets

task set      method                 solved tasks       runtime
ARC train.    Fischer et al, 2020      31    7.68%
(400 tasks)   Alford et al, 2021       22    5.50%
              Xu et al, 2022           57   14.25%
              Ainooson et al, 2023    104   26.00%      178.7s
              OURS                     96   24.00%        4.6s
ARC eval.     Ainooson et al, 2023     26    6.50%
(400 tasks)   OURS                     23    5.75%       11.4s
Kaggle'20     Icecuber (winner)              20.6%
(100 tasks)   Fischer et al                   3.0%
ARCathon'22   pablo (winner)            6       6%
(100 tasks)   Ainooson et al            2       2%
              OURS (4th ex-aequo)       2       2%
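For reference, the parsing and search limits mentioned above can be gathered in a single configuration object; the field names below are illustrative, only the values come from the text.

```python
from dataclasses import dataclass

@dataclass
class LearningConfig:
    """Hyperparameters used in the reported experiments (values from the text);
    the field names are illustrative, not the implementation's actual identifiers."""
    max_parse_descriptions: int = 64   # descriptions produced when parsing a grid
    keep_most_compressive: int = 3     # descriptions retained to compute refinements
    max_expressions: int = 100_000     # expressions considered at each step
    max_refinements: int = 20          # most promising refinements evaluated per step
    rehearsal_alpha: int = 10          # rehearsal factor alpha
    learning_timeout_s: int = 60       # learning budget per task
    pruning_timeout_s: int = 10        # additional budget for the pruning phase
```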
The learning and prediction logs and the screenshots of the solved training tasks are
available as supplementary materials.
Task sets and baselines. We consider four task sets for which results have been
reported: the 400 training and 400 evaluation public tasks, the 100 secret tasks of Kag-
gle’20, and the 100 secret tasks of ARCathon’22. We presume that those secret tasks
are taken from the 200 secret ARC tasks. As baselines, we consider published methods
that report results on the considered task sets [10,3,23,2]. We also include the winners
of the two challenges for reference. Unfortunately, the reported results are scarce, and
the papers do not provide their code. The code of our method is available as open source
on GitHub6 . Version 2.7 was used for the experiments reported here.
Success rates. On the training tasks, for which we have the most results to compare
with, our method solves 96 training tasks (24%), almost on par with the best method, by
Ainooson et al (26%). Both methods also solved a similar number of evaluation tasks
(23 vs 26 tasks), and both solved 2/100 tasks in ARCathon’22, and ranked 4th ex-aequo.
Comparing the different task sets, it appears that the evaluation tasks are significantly
more difficult than the training tasks, and the secret tasks of ARCathon seem even more
difficult as the winner could only solve 6 tasks. Icecuber managed to correctly predict an
amazing 20.6% of the test output grids in Kaggle’20, but at the cost of the hand-coding
of 142 primitives, 10k lines of code, and brute-force search (millions of computed grids
per task).
The ARC evaluation protocol allows for three predictions per test example. How-
ever, the first prediction of our method is actually correct in 90 of the 96 solved train-
ing tasks. This shows that our learned models are accurate in their understanding of
6 https://github.com/sebferre/ARC-MDL
the tasks. To better evaluate the generalization capability of learned models, we also
measured the generalization rate as the proportion of the models correct on the training
examples that are also correct on the test examples: 92% (94/102) on training tasks,
and 72% (23/32) on evaluation tasks. This again suggests that the evaluation tasks fea-
ture a higher generalization difficulty. Without the pruning phase, this rate decreases
to 89% (91/102) on training tasks. This shows that the pruning phase is useful, al-
though description-oriented model learning is already good at generalization. Reasons
for failures to generalize include: the test example has several objects while all training
examples have a single one; the training examples exhibit a misleading invariant.
Efficiency and model complexity. Intelligence is the efficiency at acquiring new
skills, according to Chollet. Although ARC enforces data efficiency by having only a
few training examples per task, and unique tasks, it does not enforce efficiency in the
amount of priors, nor in the computation resources. It is therefore useful to assess the
latter. We already mentioned Icecuber’s method that relies on a large number of primi-
tives, and intense computations. The method of Ainooson et al, which has comparable
performance to ours, uses 52 primitives and about 700s on average per solved task.
In comparison, our method uses 30 primitives and 4.6s per solved training task (21.7s
over all training tasks). Moreover, doubling the learning timeout to 120s does not lead
to solving more tasks, so 60s seems to be enough to find a solution if there is one.
Note also that our method does not stop learning when a solution is found but when no
more compression can be achieved.
Another way to evaluate efficiency is to look at the complexity of learned mod-
els, typically the number of primitives composing the model in program synthesis ap-
proaches. A good proxy for this complexity is the depth of search that was reached
in the allocated time. In our case, it is equal to the number of refinements applied to
the initial empty model. Few methods provide this information: Icecuber limits depth
to 4, and Ainooson’s best results are achieved with a brute-force search with maximum
depth 3. Methods based on DreamCoder [3] have similar limits but can learn more
complex programs by discovering and defining new operations as common composi-
tions of primitives, and reusing them from one task to another. Our method can dive
much deeper in less computation time, thanks to its greedy strategy. The number of
refinement steps achieved in a timeout of 60s on the training tasks ranges from 4 to 57,
with an average of 19 steps. This demonstrates the effectiveness of the MDL criterion
to guide the search towards correct models. This claim is reinforced by the fact that a
beam search (width=3) did not lead to solving more tasks.
Learned models. The learned models for solved tasks are very diverse despite the
simplicity of our models. They express various transformations: e.g., moving an object,
extending lines, putting one object behind another, ordering objects from largest to
smallest, removing noise, etc. Note that none of these transformations is a primitive in
our models; they are learned in terms of objects, basic arithmetic, simple geometry,
and the MDL principle.
We compared our learned models to the natural programs of LARC [1]. Remark-
ably, many of our models involve the same objects and similar operations as the
natural programs. For example, the natural program for task b94a9452 is: “[The input
has] a square shape with a small square centered inside the large square on a black
background. The two squares are of different colors. Make an output grid that is the
same size as the large square. The size and position of the small inner square should be
the same as in the input grid. The colors of the two squares are exchanged.” For other
tasks, our models miss some notions used by natural programs but manage to compensate
for them: e.g., topological relations such as “next to” or “on top” are compensated for by
the three attempts; the majority color is compensated for by the MDL principle selecting
the largest object. However, in most cases, the same objects are identified.
These observations demonstrate that our object-centric models align well with the
natural programs produced by humans, unlike approaches based on the composition of
grid transformations. An example of a program learned by [10] on task 23b5c85d
is strip black; split colors; sort Area; top; crop, which is a sequence
of grid-to-grid transformations, without explicit mention of objects.
Our approach carries over to the string domain: Table 6 lists the pattern constructors
used for rows of strings.

Table 6: Pattern constructors by type for strings
type    constructors
Row     Cell[]
Cell    Nil
        Factor(left: Cell, token: Token, right: Cell)
Token   Const(s: string)
        Regex(re: Regex)
Regex   Ident, Letters, Decimal, Digits, Spaces
T       Alt(if: Cond, then: T, else: T)
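The sketch below illustrates how a Factor pattern from Table 6 could split a cell around a token; the token regular expressions are plausible assumptions, not the implementation's exact definitions.

```python
import re
from typing import Optional, Tuple

TOKEN_REGEXES = {          # assumed definitions of the token classes of Table 6
    "Ident":   r"[A-Za-z_][A-Za-z0-9_]*",
    "Letters": r"[A-Za-z]+",
    "Decimal": r"[0-9]+\.[0-9]+",
    "Digits":  r"[0-9]+",
    "Spaces":  r"\s+",
}

def parse_factor(cell: str, token_class: str) -> Optional[Tuple[str, str, str]]:
    """Parse a cell as Factor(left, token, right) around the first match of the
    given token class; left and right are the remaining sub-cells."""
    m = re.search(TOKEN_REGEXES[token_class], cell)
    if m is None:
        return None
    return cell[:m.start()], m.group(0), cell[m.end():]

# Example: splitting 'report 2023.pdf' around its first Digits token.
print(parse_factor("report 2023.pdf", "Digits"))   # ('report ', '2023', '.pdf')
```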
For comparison, the DSL of FlashFill also uses predefined regular expressions but
uses them to locate positions in the string, rather than tokens. Their programs are condi-
tional expressions (switch), where each branch is a concatenation of substrings specified
by position, and constant strings. In contrast, our models allow for free nesting of con-
ditionals (Alt) and concatenation (Factor). However, their DSL has loops that have so
far no counterpart in our models.
Task set. For a preliminary evaluation, we used as a task set the 14 examples in [13].
Each task has one or two strings as inputs and one string as output, and 2-6 training ex-
amples (avg. 3.4). We complement them with 3-6 evaluation examples, some of which
feature some generalization difficulty. Those 14 tasks are available in the supplementary
materials in the same JSON format as ARC tasks.
Efficiency and success rates. Learning takes 1s or less, except for Task 1 and
Task 13 that have longer input strings and take respectively 9.9s and 5.2s. The depth
of search ranges from 11 to 76, and averages at 35 steps. For 11/14 tasks (all except
Tasks 4, 5, 9), the learned model correctly describes and predicts the training examples.
However, only 5 of those learned models generalize to all test examples: 3 models fail
on a single test example (e.g., in Task 3 the file extension .mp4 contains a digit unlike
other file extensions); in Task 8, the training examples are ambiguous because the input
string is a date in different formats, and the output string could either be the day or last
two digits of the year; in Task 11, there is a typo in the training examples (on purpose),
which makes the task under-specified. The (partial) failure for other tasks is explained
by missing features in our models, notably the counterpart of loops, or by a wrong se-
quence of refinements. For instance, Task 9 can be solved by delaying the insertion of
alternatives.
Learned models. Input models can be expressed as regular expressions with groups
on tokens and alternatives, and output models can be expressed as string interpolations
with group identifiers as variables. For Task 10, we obtain the following model.
M^i: \(.*\([0-9]+\).*\)?\([0-9]+\).*\([0-9]+\)
M^o: {if \1 then \2 else "425"}-\3-\4
The input is made of three integers, the first one being optional. The output is the
concatenation of those three integers, separated by dashes, and the first integer is 425
when missing in the input. In FlashFill, in general, a large number of programs is gen-
erated as an exhaustive search is performed. For Task 10, the program given as solution
in [13] is the following (A refers to the input column):
Switch((b1, e1), (b2, e2)), where
b1 ≡ Match(A, NumTok, 3),
b2 ≡ ¬Match(A, NumTok, 3),
e1 ≡ Concatenate(SubStr2(A, NumTok, 1), Const(”-”), SubStr2(A, NumTok, 2),
                 Const(”-”), SubStr2(A, NumTok, 3)),
e2 ≡ Concatenate(Const(”425-”), SubStr2(A, NumTok, 1),
                 Const(”-”), SubStr2(A, NumTok, 2))
This illustrates the difference in programming style between our pattern-based models
and the computation-based DSL programs.
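For illustration, the behaviour described above for Task 10 can be reproduced with a few lines of ordinary code; this is a plain re-implementation of the described behaviour, not the learned model itself, and the example inputs are made up.

```python
import re

def task10(s: str) -> str:
    """The input contains three integers, the first one being optional; the output
    joins them with dashes, substituting '425' when the first integer is missing."""
    nums = re.findall(r"[0-9]+", s)
    if len(nums) == 2:          # first integer missing
        nums = ["425"] + nums
    return "-".join(nums[:3])

print(task10("323 708 7700"))   # '323-708-7700'
print(task10("706 7709"))       # '425-706-7709' (first integer missing)
```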
References
1. Acquaviva, S., Pu, Y., Kryven, M., Sechopoulos, T., Wong, C., Ecanow, G., Nye, M., Tessler,
M., Tenenbaum, J.: Communicating natural programs to humans and machines. Advances in
Neural Information Processing Systems 35, 3731–3743 (2022)
2. Ainooson, J., Sanyal, D., Michelson, J.P., Yang, Y., Kunda, M.: An approach for solving
tasks on the abstract reasoning corpus. arXiv preprint arXiv:2302.09425 (2023)
3. Alford, S., Gandhi, A., Rangamani, A., Banburski, A., Wang, T., Dandekar, S., Chin, J.,
Poggio, T.A., Chin, S.P.: Neural-guided, bidirectional program search for abstraction and
reasoning. CoRR abs/2110.11536 (2021), https://arxiv.org/abs/2110.11536
4. Chollet, F.: On the measure of intelligence. arXiv preprint arXiv:1911.01547 (2019)
5. Chollet, F.: A definition of intelligence for the real world. Journal of Artificial General Intel-
ligence 11(2), 27–30 (2020)
6. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional trans-
formers for language understanding. In: Conf. North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, NAACL-HLT. pp. 4171–
4186. Assoc. Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-1423
7. Elias, P.: Universal codeword sets and representations of the integers. IEEE Trans. Informa-
tion Theory 21(2), 194–203 (1975)
8. Ellis, K., et al.: Dreamcoder: Bootstrapping inductive program synthesis with wake-sleep
library learning. In: ACM Int. Conf. Programming Language Design and Implementation.
pp. 835–850 (2021)
9. Faas, M., Leeuwen, M.v.: Vouw: geometric pattern mining using the MDL principle. In: Int.
Symp. Intelligent Data Analysis. pp. 158–170. Springer (2020)
10. Fischer, R., Jakobs, M., Mücke, S., Morik, K.: Solving Abstract Reasoning Tasks with Gram-
matical Evolution. In: LWDA. pp. 6–10. CEUR-WS 2738 (2020)
11. Goertzel, B.: Artificial general intelligence: concept, state of the art, and future prospects.
Journal of Artificial General Intelligence 5(1), 1 (2014)
12. Grünwald, P., Roos, T.: Minimum description length revisited. International journal of math-
ematics for industry 11(01) (2019)
13. Gulwani, S.: Automating string processing in spreadsheets using input-output examples. In:
Symp. Principles of Programming Languages. pp. 317–330. ACM (2011)
14. Johnson, A., Vong, W.K., Lake, B., Gureckis, T.: Fast and flexible: Human program induction
in abstract reasoning tasks. arXiv preprint arXiv:2103.05823 (2021)
15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional
neural networks. Advances in neural information processing systems 25, 1097–1105 (2012)
16. Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: Human-level concept learning through
probabilistic program induction. Science 350(6266), 1332–1338 (2015)
17. Lieberman, H.: Your Wish is My Command. The Morgan Kaufmann series in interactive
technologies, Morgan Kaufmann / Elsevier (2001). https://doi.org/10.1016/b978-1-55860-
688-3.x5000-3
18. Menon, A., Tamuz, O., Gulwani, S., Lampson, B., Kalai, A.: A machine learning framework
for programming by example. In: Int. Conf. Machine Learning. pp. 187–195. PMLR (2013)
19. Muggleton, S., Raedt, L.D.: Inductive logic programming: Theory and methods. Journal of
Logic Programming 19,20, 629–679 (1994)
20. Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)
21. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrit-
twieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of
Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
22. Vreeken, J., Van Leeuwen, M., Siebes, A.: Krimp: mining itemsets that compress. Data Min-
ing and Knowledge Discovery 23(1), 169–214 (2011)
23. Xu, Y., Khalil, E.B., Sanner, S.: Graphs, constraints, and search for the abstraction and rea-
soning corpus. arXiv preprint arXiv:2210.09880 (2022)
Supplementary Materials
We here describe the contents of the supplementary file accompanying the paper. A
public repository of the source code is also available at https://github.com/sebferre/
ARC-MDL for ARC and at https://github.com/sebferre/ARC-MDL-strings for Flash-
Fill. There are two main directories: one for ARC tasks and another for FlashFill tasks.
Task Sets
The public task sets of ARC are available online at https://github.com/fchollet/ARC.
There are two task sets: training tasks and evaluation tasks, each containing 400 tasks.
The task set of FlashFill is made of the 14 examples in [13]. We provide them as
JSON files in FlashFill/taskset/, using the same format as ARC tasks, except
that strings and arrays of strings are used instead of colored grids. For convenience, we
also provide the file FlashFill/taskset/all examples.json to allow for
browsing all examples in one file.
Results
We provide the learning and prediction logs for each task set:
– ARC/training tasks.log
– ARC/evaluation tasks.log
– FlashFill/tasks.log
Each log file starts with the hyperparameter values, and ends with global statistical
measures. For each task, it gives:
– the detailed DL (description length) of the initial model;
– the learning trace (including the pruning phase) as a sequence of refinements, and
showing the decrease of the normalized DL;
– the learned models before and after pruning and their detailed DL;
– the best joint description for each training example, except for ARC evaluation
tasks so as not to leak their contents to the AI developer (a recommendation made
by F. Chollet);
– the prediction for each training and test example;
– and finally a few measures for the task.
The measures given for each task and at the end are the following:
– runtime-learning: learning time in seconds (including the pruning phase);
– bits-train-error: the remaining error committed on output training grids, in
bits;
– acc-train-micro: the proportion of training output grids that are correctly
predicted;
– acc-train-macro: 1 if all training output grids are correctly predicted, 0 oth-
erwise;
– acc-train-mrr: Mean Reciprocal Rank (MRR) of correct predictions for train-
ing output grids, 1 if all first predictions are correct;
– acc-test-micro: the proportion of test output grids that are correctly pre-
dicted;
– acc-test-macro: 1 if all test output grids are correctly predicted, 0 otherwise;
– acc-test-mrr: Mean Reciprocal Rank (MRR) of correct predictions for test
output grids, 1 if all first predictions are correct.
The reference measure in ARC is acc-test-macro. The micro measures provide a
more fine-grained and more optimistic measure of success.
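These measures follow directly from their definitions; a small sketch (illustrative, not the implementation):

```python
from typing import List, Optional

def acc_micro(correct: List[bool]) -> float:
    """Proportion of output grids correctly predicted."""
    return sum(correct) / len(correct)

def acc_macro(correct: List[bool]) -> int:
    """1 if all output grids are correctly predicted, 0 otherwise."""
    return int(all(correct))

def acc_mrr(ranks: List[Optional[int]]) -> float:
    """Mean Reciprocal Rank: for each grid, 1/rank of the first correct prediction
    (rank 1, 2, or 3), or 0 if none of the three attempts is correct."""
    return sum((1.0 / r if r is not None else 0.0) for r in ranks) / len(ranks)

# Example: two test grids, the first solved at the first attempt, the second unsolved.
print(acc_micro([True, False]), acc_macro([True, False]), acc_mrr([1, None]))
```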
For convenience, we also provide in ARC/solved tasks a picture for each of
the 96 training ARC tasks that are solved by our approach. We kindly invite the reader
to browse them to get a quick idea of the diversity of the tasks that our approach can
solve. The pictures are screenshots from the UI provided along with ARC tasks.