
Tackling the Abstraction and Reasoning Corpus (ARC)

with Object-centric Models and the MDL Principle

Sébastien Ferré

Univ Rennes, CNRS, Inria, IRISA


Campus de Beaulieu, 35042 Rennes, France
Email: [email protected]

arXiv:2311.00545v1 [cs.AI] 1 Nov 2023

Abstract. The Abstraction and Reasoning Corpus (ARC) is a challenging bench-
mark, introduced to foster AI research towards human-level intelligence. It is a
collection of unique tasks about generating colored grids, specified by a few ex-
amples only. In contrast to the transformation-based programs of existing work,
we introduce object-centric models that are in line with the natural programs
produced by humans. Our models can not only perform predictions, but also pro-
vide joint descriptions for input/output pairs. The Minimum Description Length
(MDL) principle is used to efficiently search the large model space. A diverse
range of tasks are solved, and the learned models are similar to the natural pro-
grams. We demonstrate the generality of our approach by applying it to a different
domain.

1 Introduction

Artificial Intelligence (AI) has made impressive progress in the past decade at specific
tasks, sometimes achieving super-human performance: e.g., image recognition [15],
board games [21], natural language processing [6]. However, AI still misses the gen-
erality and flexibility of human intelligence to adapt to novel tasks with little training.
To foster AI research beyond narrow generalization [11], F. Chollet [4,5] introduced a
measure of intelligence that values skill-acquisition efficiency over skill performance,
i.e. the amount of prior knowledge and experience that an agent needs to reach a rea-
sonably good level at a range of tasks (e.g., board games) matters more than its absolute
performance at any specific task (e.g., chess). Chollet also introduced the Abstraction
and Reasoning Corpus (ARC) benchmark in the form of a psychometric test to mea-
sure and compare the intelligence of humans and machines alike. ARC is a collection
of tasks that consist in learning how to transform an input colored grid into an output
colored grid, given only a few examples. It is a very challenging benchmark. While
humans can solve more than 80% of the tasks [14], the winner of a Kaggle contest1
could only solve 20% of the tasks (with a lot of hard-coded primitives and brute-force
search), and the winner of the more recent ARCathon’22 contest2 could only solve 6%
of the tasks.
1 https://www.kaggle.com/c/abstraction-and-reasoning-challenge
2 https://lab42.global/past-challenges/arcathon-2022/
The existing published approaches [10,3,23,2], and also the Kaggle winner, tackle
the ARC challenge as a program synthesis problem, where a program is a composi-
tion of primitive transformations, and learning is done by searching the large program
space. In contrast, psychological studies have shown that the natural programs pro-
duced by humans to solve ARC tasks are object-centric, and more declarative than
procedural [14,1]. When asked to verbalize instructions on how to solve a task, partici-
pants typically first describe what to expect in the input grid, and then how to generate
the output grid based on the elements found in the input grid.
We make two contributions w.r.t. existing work:

1. object-centric models that make it possible to both parse and generate grids in terms of object
patterns and computations on those objects;
2. an efficient search of object-centric models based on the Minimum Description
Length (MDL) principle [20].

A model for an ARC task combines two grid models, one for the input grid, and another
for the output grid. This closely matches the structure of natural programs. Compared
to the transformation-based programs that can only predict an output grid from an input
grid, our models can also provide a joint description for a pair of grids. They can also
create new pairs of grids, although this is not evaluated in this paper. They could also in
principle be adapted to tasks based on a single grid or on sequences of grids. All of this
is possible because grid models can be used both for parsing a grid and for generating
a grid.
The MDL principle comes from information theory, and says that “the model that
best describes the data is the model that compresses them the most” [20,12]. It has for
instance been applied to pattern mining [22,9]. The MDL principle is used at two levels:
(a) to choose the best parses of a grid according to a grid model, and (b) to efficiently
search the large model space by incrementally building more and more accurate models.
MDL at level (a) is essential because the segmentation of a grid into objects is task-
dependent and has to be learned along with the definition of the output objects as a
function of the input objects. The two contributions support each other because existing
search strategies could not handle the large number of elementary components of our
grid models, and because the transformation-based programs are not suitable to the
incremental evaluation required by MDL-based search.
We report promising results based on grid models that are still far from covering
all knowledge priors assumed by ARC. Correct models are found for 96/400 varied
training tasks with a 60s time budget. Many of those are similar to the natural programs
produced by humans. Moreover, we demonstrate the generality of our approach by ap-
plying it to the automatic filling of spreadsheet columns [13], where inputs and outputs
are rows of strings instead of grids.
The paper is organized as follows. Section 2 presents the ARC benchmark, and a
running example task. Section 3 discusses related work. Section 4 defines our object-
centric models, and Section 5 explains how to learn them with the MDL principle.
Section 6 reports on experimental results, comparing with existing approaches.
Fig. 1: Training tasks b94a9452 (top) and 23581191 (bottom), with 2-3 demonstration exam-
ples (left) and the input grid of a test case (right).

2 Abstraction and Reasoning Corpus (ARC)

ARC is a collection of tasks3 , where each task is made of training examples (3.3 on
average) and test examples (1 in general). Each example is made of an input grid and
an output grid. Each grid is a 2D array (with size up to 30x30) filled with integers
coding for colors (10 distinct colors). For a given task, the size of grids can vary from
one example to another, and between the input and the output. Each task is a machine
learning problem, whose goal is to learn a model that can generate the output grid from
the input grid, and to do so from a few training examples only. Prediction is successful only
if the predicted output grid is strictly equal to the expected grid for all test examples;
there is no partial success. However, three trials are allowed for each test example to
compensate for potential ambiguities in the training examples. Figure 1 shows two ARC
tasks (with the expected test output grid missing). The first is used as a running example
in this paper.
We now more formally define grids, examples, and tasks.

Definition 1 (grid). A grid g ∈ C^{h×w} is a matrix with values taken from a set of
colors C, with h > 0 rows (height), and w > 0 columns (width). A grid cell is iden-
tified by coordinates (i, j), where i selects a row, and j selects a column. The color
at coordinates (i, j) is denoted by either g_{ij} or g[i, j]. Coordinates range from (0, 0)
to (h − 1, w − 1).

ARC grids use 10 colors, and have height/width up to 30.

Definition 2 (example). An example is a pair of grids e = (g^i, g^o), where g^i is called
the input grid, and g^o is called the output grid.

As illustrated by Figure 1, the output grid need not have the same size as the input
grid; it can be smaller or bigger.

Definition 3 (task). A task is a pair T = (E, F ), where E is the set of training exam-
ples, and F is the set of test examples.
3 Data and testing interface at https://github.com/fchollet/ARC
ARC tasks have 3.3 training examples on average, and 1 or 2 test examples (most
often 1). As illustrated by Figure 1, the different input grids of a task need not have the
same size, nor use the same colors. The same applies to test grids.
ARC is composed of 1000 tasks in total: 400 “training tasks”4 , 400 evaluation tasks,
and 200 secret tasks for independent evaluation. Figure 1 shows two of the 400 training
tasks. Developers should only look at the training tasks, not at the evaluation tasks.
The latter should only be used to evaluate the broad generalization capability of the
developed systems.
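For concreteness, each task is distributed as a JSON file with "train" and "test" lists of input/output grid pairs (in the public task sets, the test examples also carry their output grid). The following minimal Python sketch, not the paper's implementation, loads such a file and prints the grid sizes; the path is assumed to point at a task file from the public repository.

import json

def load_task(path):
    # An ARC task file contains two lists of examples, "train" and "test";
    # each example is a pair of grids, and a grid is a list of rows of color codes (0-9).
    with open(path) as f:
        task = json.load(f)
    train = [(ex["input"], ex["output"]) for ex in task["train"]]
    test = [(ex["input"], ex["output"]) for ex in task["test"]]
    return train, test

train, test = load_task("b94a9452.json")
for g_in, g_out in train:
    # grid sizes may differ between examples and between input and output
    print(len(g_in), "x", len(g_in[0]), "->", len(g_out), "x", len(g_out[0]))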

3 Related Work

The ARC benchmark is recent and not many approaches have been published so far. All
those we know of define a DSL (Domain-Specific Language) of programs that transform
an input grid into an output grid, and search for a program that is correct on the training
examples [10,3,23,2]. The differences mostly lie in the primitive transformations (prior
knowledge) and in the search strategy. It is tempting to define more and more primitives
like the Kaggle winner did, hence more prior knowledge, but this means a less intelli-
gent system according to Chollet’s measure. To guide the search in the huge program
space, those approaches use either grammatical evolution [10], neural networks [3],
search tree pruning with hashing and Tabu list [23], or stochastic search trained on
solved tasks [2]. A difficulty is that the output grids are generally only used to score
a candidate program, so that the search is largely blind. Alford [3] improves this with
a neural-guided bi-directional search that grows the program in both directions, from
input and output. Xu [23] compares the in-progress generated grid to the expected grid
but this limits the approach to the tasks whose output grids have the same size and
same objects as the input grids. DSL-based approaches have a scaling issue because the
search space increases exponentially with the number of primitives. Ainooson [2] al-
leviates this difficulty by defining high-level primitives that embody specialized search
strategies. We compare and discuss their performance in the evaluation section.
Johnson et al. [14] report on a psychological study of ARC. It reveals that humans
use object-centric mental representations to solve ARC tasks. This is in contrast with
existing solutions that are based on grid transformations. Interestingly, the tasks that
are found the most difficult by humans are those based on logics (e.g., an exclusive-or
between grids) and symmetries (e.g., rotation), precisely those most easily solved by
DSL-based approaches. The study exhibits two challenges: (1) the need for a large set
of primitives, especially about geometry; (2) the difficulty of identifying objects, which may
be only partially visible due to overlap or occlusion. A valuable resource is LARC, for
Language-annotated ARC [1], collected by crowd-sourcing. It provides for most train-
ing tasks one or several natural programs that confirm the object-centric and declarative
nature of human representations. A natural program is a short textual description pro-
duced by a participant that could be used by another participant to generate the test output
grids (without access to the training examples).
4 The term “training tasks” may be misleading as their purpose is to train AI developers, not AI
systems. Humans solve ARC tasks without training.
Beyond the ARC benchmark, a substantial amount of work has been done in the domain of
program synthesis, which is also known as program induction or programming by ex-
amples (PbE) [17]. An early approach is Inductive Logic Programming (ILP) [19],
where target predicates are learned from symbolic representations. PbE is used in the
FlashFill feature of Microsoft Excel 2013 to learn complex string processing formulas
from a few examples [18]. Dreamcoder [8] alternates a wake phase that uses a neurally
guided search to solve tasks, and a sleep phase that extends a library of abstractions
to compress programs found during wake. Bayesian program learning was shown to
outperform deep learning at parsing and generating handwritten characters from the world’s alphabets [16].

4 Object-centric Models for ARC Grids

We introduce object-centric models as a mix of patterns and functions, in contrast to
DSL-based programs that are only made of functions. We exemplify them with grid
models that describe ARC grids in terms of objects having different shapes, colors, sizes,
and positions. Such grid models are used to parse a grid, i.e. to understand its contents
according to the model, and also to generate a grid, using the model as a template. A
task model comprises two grid models that make it possible to predict an output grid, to describe
a pair of grids, or to create a new pair of grids for the given task.

4.1 Mixing Patterns and Functions

The purpose of a grid model is to distinguish between invariant and variant elements
across the grids of a task. In task b94a9452 (Figure 1 top), all input grids con-
tain a square but the size, color, and position vary. This can be expressed by a pat-
tern Square(size:?, color:?, pos:?), where Square is called a constructor (here with
three arguments), and the question marks are called unknowns (similar to Prolog vari-
ables). There is also a constructor for positions as 2D vectors Vec(i:?, j:?), and prim-
itive values for sizes (e.g., 3) and colors (e.g., blue). Patterns can be nested, like in
Square(3,?,Vec(?,2)), which means “a square whose size is 3, and whose top left cor-
ner is on column 2”, in order to have models as specific as necessary. Fully grounded
patterns (without unknowns) are called descriptions: e.g., Square(3,blue,Vec(2,4)).
However, with patterns only, there is no way to make the output grid depend on the
input grid, which is key to solving ARC tasks. We therefore add two ingredients to grid
models (typically to output models): references to the components of a grid description
(typically the input one), and function applications to allow some output components
to be the result of a computation. For example, in task b94a9452, the model for the
small square in the output grids could be Square(!small.size, !large.color, !small.pos -
!large.pos), where for instance !small.size is a reference to the size of the small square
in the input grid, and ’-’ is the subtraction function. This model says that “the small
output square has the same size as the small input square, the same color as the large
input square, and its position is the difference between the positions of the two input
squares.”
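To make these notions concrete, the following Python sketch is one possible rendering (ours, not the paper's implementation) of nested patterns, unknowns, and references: a pattern is a nested constructor, '?' marks an unknown, and a reference is a path into an input description; the small/large squares below form a hypothetical input description in the spirit of the running task.

UNKNOWN = "?"

def square(size, color, pos):          # constructor Square(size, color, pos)
    return {"ctor": "Square", "size": size, "color": color, "pos": pos}

def vec(i, j):                         # constructor Vec(i, j)
    return {"ctor": "Vec", "i": i, "j": j}

def resolve(path, env):
    # e.g. resolve(["small", "size"], env) plays the role of the reference !small.size
    value = env
    for field in path:
        value = value[field]
    return value

pattern = square(3, UNKNOWN, vec(UNKNOWN, 2))     # "a size-3 square on column 2"
env = {"small": square(2, "yellow", vec(2, 4)),
       "large": square(4, "red", vec(1, 3))}
print(resolve(["small", "size"], env))            # -> 2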
Tables 1 and 2 respectively list the pattern constructors and the functions of
the grid models that we have used in our experiments. Each constructor/function
Table 1: Pattern constructors by type
type constructors
Grid Layers(size: Vector, color: Color, layers: Layer[])
Tiling(grid: Grid, size: Vector)
Layer Layer(pos: Vector, object: Object)
Object Colored(shape: Shape, color: Color)
Shape Point
Rectangle(size: Vector, mask: Mask)
Mask Bitmap(bitmap: Bitmap)
Full, Border, EvenCheckboard, OddCheckboard, ...
Vector Vec(i: Int, j: Int)

Table 2: Functions by domain


Arithmetics: addition and subtraction; product and division by a small constant (2..3);
minimum, maximum and average of two integers; span between two positions (|x − y| + 1);
vectorized versions of the previous functions (e.g., (i1, j1) + (i2, j2) = (i1 + i2, j1 + j2));
projection of a vector on an axis.
Geometry: size and area of an object/shape; extremal and median positions of an object
along each axis (e.g., top and bottom, middle); stripping a grid from some background
color; cropping a grid at some frame; translation vector of an object against another;
scaling an object by a constant factor or relative to a size vector; extension of an ob-
ject/shape to some size, in agreement with a periodic pattern (e.g., checkerboard); tiling
an object/shape along the two axes; applying symmetries to objects/shapes (combining
rotations and reflections).
Other functions: recoloring an object; swapping two colors; color counts and majority
color; bitwise operations on masks.

has a result type, and typed arguments. The argument types constrain which val-
ues/constructors/functions can be used in arguments. The names of constructor argu-
ments are used to reference the components of a grid model or grid description. Grid,
object and shape constructors have an implicit argument grid for their representation
as a raw grid. Point shapes have an implicit argument size, equal to Vec(1,1). Our grid
models describe a grid as either a stack of layers on top of a background having some
size and color, or as the tiling of a grid up to covering a grid of given size. A layer is an
object at some position. An object is so far limited to a one-color shape, where a shape
is either a point or some mask-specified shape fitting into a rectangle of some size. A
mask is either specified by a bitmap or by one of a few common shapes such as a full
rectangle or a rectangular border. Positions and sizes are 2D integer vectors. Four prim-
itive types are used: integers, colors, bitmaps (i.e., Boolean matrices), and grids (i.e.,
color matrices). The available functions essentially cover arithmetic operations on inte-
gers and on vectors, where vectors represent positions, sizes, and moves; and geometric
notions such as measures (e.g., area), translations, symmetries, scaling, and periodic
patterns (e.g., tiling). Unknowns are here limited to primitive types and vectors. Refer-
ences and functions are so far only used in output grid models. They could be used in
the input models to express constraints, e.g. to state that different objects have the same
color.

M^i = Layers(?, black, [
        Layer(?, Colored(Rectangle(?, Full), ?)),
        Layer(?, Colored(Rectangle(?, Full), ?)) ])
M^o = Layers(!lay[1].object.shape.size, !lay[0].object.color, [
        Layer(!lay[0].pos - !lay[1].pos,
              coloring(!lay[0].object, !lay[1].object.color)) ])

Fig. 2: A correct model for task b94a9452.
A task model M = (M^i, M^o) is made of an input grid model M^i and an output grid
model M^o. Figure 2 shows a correct model for task b94a9452, which in words says:
“There are two stacked full rectangles on a black background in the input grid. The size
of the output grid is the same as the bottom object (lay[1].object), and its background
color is the color of the top object (lay[0].object). The output grid has a copy of the top
object, recolored in the color of the bottom object, and whose position is the difference
between the top object position and the bottom object position.”

4.2 Parsing and Generating Grids with a Grid Model


We introduce two operations that must be defined for any grid model M : the parsing
of a grid g into a description π and the generation of a grid description π, and thus of a
grid g. These operations are analogous to the parsing and generation of sentences from
a grammar, where syntactic trees correspond to our descriptions π.
In both operations, the references present in the model M are first resolved using a
description as the evaluation context, called environment and written ε. Concretely, each
reference is a path in ε and is replaced by the sub-description at the end of this path.
The functions applying to these references are then evaluated. The result is a reduced
model M ′ consisting only of patterns and values.
Parsing. The parsing of a grid g consists in replacing the unknowns of the reduced
model M ′ by descriptions corresponding to the content of the grid. It is not necessary
that the whole content of the grid be described, which allows for partial models. A grid
is analyzed from the top layer to the bottom layer to take into account overlapping ob-
jects. The analysis of an object is contextual, it depends on what remains to be covered
in the grid after the analysis of the upper layers. For efficiency reasons, each grid is
pre-processed to extract a collection of single-colored parts and the objects are parsed
as unions of these parts (see Figure 3). As the analysis of the grids can become com-
binatorial, we bound the number of descriptions produced by the parsing and we order
them according to the description length measures defined in Section 5. As an example,
the parsing of the first input grid of the running task with the model M^i of Figure 2
returns the following description: π^i = Layers(Vec(12,13), black, [Layer(Vec(2,4), Col-
ored(Rectangle(Vec(2,2), Full), yellow)), Layer(Vec(1,3), Colored(Rectangle(Vec(4,4), Full),
red)) ]).
Fig. 3: Parts, points and rectangles found in the first output grid.

Generation. The generation of a grid consists in replacing the remaining unknowns
in the reduced model M′ by random descriptions of the right type, in order to obtain
a grid description, which can then be converted into a concrete grid. For example, the
output model M^o of Figure 2 applied with, as environment ε, the above description π^i
of the first input grid generates the following description: π^o = Layers(Vec(4,4), yellow,
[Layer(Vec(1,1), Colored(Rectangle(Vec(2,2), Full), red))]). This description conforms to
the expected output grid.
An important point is that these two operations are multi-valued, i.e. may return
multiple descriptions. Indeed, there are often several ways of parsing a grid according
to a model, for example if the model mentions a single object while the grid contains
several. There are also several grids that can be generated by a model when it
contains unknowns.
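As a toy analogue of these two operations (far simpler than the grid models of Table 1, and only meant to illustrate their multi-valued, ranked nature), consider a one-unknown model that describes a grid as a single background color; parsing proposes one candidate description per color, ranked by how many cells it fails to explain, a crude stand-in for the description lengths of Section 5. This is an illustrative Python sketch, not the actual implementation.

from collections import Counter

def parse_monochrome(grid):
    # multi-valued parsing: one candidate description per color, best first
    counts = Counter(c for row in grid for c in row)
    total = sum(counts.values())
    return sorted(({"color": c, "errors": total - n} for c, n in counts.items()),
                  key=lambda d: d["errors"])

def generate_monochrome(description, height, width):
    # generation: render a concrete grid from a description
    return [[description["color"]] * width for _ in range(height)]

grid = [[0, 0, 3],
        [0, 0, 0]]
descriptions = parse_monochrome(grid)
print(descriptions[0])                              # {'color': 0, 'errors': 1}
print(generate_monochrome(descriptions[0], 2, 3))   # [[0, 0, 0], [0, 0, 0]]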

4.3 Predict, Describe, and Create with Task Models


We demonstrate the versatility of task models by showing that they can be used in three
different modes: to predict the output grid from the input grid, to describe a pair of
grids jointly, or to create a new pair of grids for the task. We use below the notation
ρ, π ∈ parse(M, ε, g) to say that π is the ρ-th parsing of the grid g according to the
model M and with the environment ε; and the notation ρ, π, g ∈ generate(M, ε) to
say that π is the ρ-th description generated by the model M with the environment ε,
and that g is the concrete grid described by π. The rank ρ is motivated by the fact that
parsing and generation are multi-valued.
The predict mode is used after a model has been learned, in the evaluation phase
with test cases, by predicting an output grid for the given input grid. It consists in first
parsing the input grid with the input model and the nil environment in order to get an
input description π^i, and then generating the output grid by using the output model and
the input grid description as the environment.

predict(M, g^i) = {(ρ^i, ρ^o, g^o) | ρ^i, π^i ∈ parse(M^i, nil, g^i),
                                     ρ^o, π^o, g^o ∈ generate(M^o, π^i)}
The describe mode is used in the learning phase of the model (see Section 5). It
makes it possible to obtain a joint description of a pair of grids. It consists in the parsing of the
input grid and the output grid. Let us note that the parsing of the output grid depends
on the result of the parsing of the input grid, hence the term ”joint description”.

describe(M, g^i, g^o) = {(ρ^i, ρ^o, π^i, π^o) | ρ^i, π^i ∈ parse(M^i, nil, g^i),
                                                ρ^o, π^o ∈ parse(M^o, π^i, g^o)}

The create mode makes it possible to create a new example of the task. It con-
sists of the successive generation of an input grid and an output grid, the latter being
conditioned by the former. This mode is not used in the ARC challenge but it could
contribute to the measurement of the intelligence of a system. Indeed, if an agent has
really understood a task, it should be able to produce new examples5 .

create(M) = {(ρ^i, ρ^o, g^i, g^o) | ρ^i, π^i, g^i ∈ generate(M^i, nil),
                                    ρ^o, π^o, g^o ∈ generate(M^o, π^i)}

In all modes, the nil environment is used with the input model because the input
grid comes first, without any prior information. Note also that all modes inherit the
multi-valued property of parsing and generation. These three modes highlight an es-
sential difference between our object-centric models and the DSL-based programs of
existing approaches. The latter are designed for prediction (computation of the output
as a function of the input); they do not provide a description of the grids, nor a way
to create new input grids. A new example could be created by randomly generating an
input grid and applying the program, but in general, it would not respect most of the
task invariants: e.g., a random bitmap would be generated rather than a solid square.
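The three modes translate almost literally into code. The Python sketch below is a direct transcription of the formulas above, assuming multi-valued parse(model, env, grid) and generate(model, env) functions that return ranked lists as in Section 4.2, with the nil environment rendered as None; it is an illustrative sketch, not the actual implementation.

def predict(parse, generate, Mi, Mo, grid_in):
    return [(ri, ro, g_out)
            for ri, desc_in in enumerate(parse(Mi, None, grid_in), start=1)
            for ro, (desc_out, g_out) in enumerate(generate(Mo, desc_in), start=1)]

def describe(parse, Mi, Mo, grid_in, grid_out):
    # joint description: the parsing of the output grid is conditioned on the input description
    return [(ri, ro, desc_in, desc_out)
            for ri, desc_in in enumerate(parse(Mi, None, grid_in), start=1)
            for ro, desc_out in enumerate(parse(Mo, desc_in, grid_out), start=1)]

def create(generate, Mi, Mo):
    return [(ri, ro, g_in, g_out)
            for ri, (desc_in, g_in) in enumerate(generate(Mi, None), start=1)
            for ro, (desc_out, g_out) in enumerate(generate(Mo, desc_in), start=1)]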

5 MDL-based Model Learning

MDL-based learning works by searching for the model that compresses the data the
most. The data to be compressed is here the set of training examples. We have to define
two things: (1) the description lengths of models and examples, and (2) the search space
of models and the learning strategy.

5.1 Description Lengths

A common approach in MDL is to define the overall description length (DL) as the sum
of two parts (two-part MDL): the model M, and the data D encoded according to the
model [12].
L(M, D) = L(M ) + L(D | M )
5 At school, teachers often ask pupils to produce their own examples of some concept to check
their understanding.
In our case, the model is a task model composed of two grid models, and the data
is the set of training examples (pairs of grids). To compensate for the small number
of examples, and to allow for sufficiently complex models, we use a rehearsal factor
α ≥ 1, as if each example were seen α times.

L(M) = L(M^i) + L(M^o)

L(D | M) = α · Σ_{(g^i, g^o) ∈ D} L(g^i, g^o | M)

The DL of an example is based on the most compressive joint description of the pair
of grids.

L(g^i, g^o | M) = min_{(ρ^i, ρ^o, π^i, π^o) ∈ describe(M, g^i, g^o)} [ L(ρ^i, π^i, g^i | M^i, nil) + L(ρ^o, π^o, g^o | M^o, π^i) ]

Terms of the form L(ρ, π, g | M, ε) denote the DL of a grid g encoded according to
a grid model M and an environment ε, via the ρ-th description π resulting from the
parsing. We can decompose these terms by using π as an intermediate representation
of the grid.

L(ρ, π, g | M, ε) = L(ρ) + L(π | M, ε) + L(g | π)

The term L(ρ) := L_N(ρ) − L_N(1) encodes the extra cost of not choosing the first
parsed description, penalizing higher ranks. L_N(n) is a classical universal encoding for
integers [7]. The term L(π | M, ε) measures the amount of information that must be
added to the model and the environment to encode the description, typically the values
of the unknowns. The term L(g | π) measures the differences between the original grid
and the grid produced by the description. A correct model is obtained when ρ^i = 1
and L(ρ^o, π^o, g^o | M^o, π^i) = 0 for all examples, i.e. when using the first description
for each input grid, there is nothing left to code for the output grids, and therefore the
output grids can be perfectly predicted from the input grids.
Three elementary model-dependent DLs have to be defined:
– L(M ): DL of a grid model;
– L(π | M, ε): DL of a grid description, according to the model and environment used
for parsing it;
– L(g | π): DL of a grid, relative to a grid description, i.e. the errors committed by the
description w.r.t. the grid.
We sketch those definitions for the grid models defined in Section 4. We recall that
description lengths are generally derived from probability distributions with the equation
L(x) = − log P(x), corresponding to an optimal coding [12].
Defining L(M) amounts to encoding a syntax tree with constructors, values, un-
knowns, references, and functions as nodes. Because of types, only a subset of those
are actually possible at each node: e.g. type Layer has only one constructor. We use
uniform distributions across possible nodes, and universal encoding for non-bounded
ints. A reference is encoded according to a uniform distribution across all components
of the environment that have a compatible type. We give unknowns a lower probability
than constructors, and references/functions a higher probability, in order to encourage
models that are more specific, and that make the output depend on the input.

Table 3: Decomposition of L(M, D) for the model in Figure 2
            input    output     pair
L(M)         71.4      97.3    168.7
L(D | M)   2355.2       0.0   2355.2
L(M, D)    2426.6      97.3   2523.9
Defining L(π | M, ε) amounts to encoding the description components that are un-
knowns in the model. As those description components are actually grounded model
components, the above definitions for L(M ) can be reused, only adjusting the proba-
bility distributions to exclude unknowns, references and functions.
Defining L(g | π) amounts to encoding which cells in grid g are wrongly specified
by description π. For comparability with grid models and descriptions, we represent
each differing cell as a point object – Layer(Vec(i,j),Colored(Point,c)) – and encode it
like descriptions. We also have to encode the number of differing cells.
Table 3 shows the decomposition of the description length L(M, D), for the model
in Figure 2 on task b94a9452, between the input and the output, and between the model
and the data encoded with the model. Recall that L(D | M) is for α = 10 copies of
each example. It shows that the DL of the output grids is zero bits, which means that
they are entirely determined by the grid models and the input grids. The proposed model
is therefore a solution to the task. The average DL of an input grid is 2355.2/10/3 = 78.5
bits, on a par with the DL of the input grid model.
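As a small illustration of how such code lengths can be computed, the Python sketch below uses a uniform code over the alternatives allowed by a type (L(x) = −log2 P(x)) and an Elias gamma code for unbounded positive integers; the gamma code is an assumed choice, the paper only states that a universal code in the sense of Elias [7] is used, and the actual distributions of the implementation may differ.

import math

def uniform_bits(n_alternatives):
    # cost of choosing one alternative among n equally likely ones
    return math.log2(n_alternatives)

def universal_bits(n):
    # Elias gamma code length for a positive integer n (assumed choice of universal code)
    return 2 * math.floor(math.log2(n)) + 1

def rank_bits(rho):
    # extra cost L(rho) = L_N(rho) - L_N(1) of using the rho-th parsed description
    return universal_bits(rho) - universal_bits(1)

print(uniform_bits(10))            # picking one of the 10 ARC colors: ~3.32 bits
print(rank_bits(1), rank_bits(3))  # 0 extra bits for rank 1, 2 extra bits for rank 3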

5.2 Search Space and Strategy


The search space for models is characterized by: (1) an initial model, and (2) a refine-
ment operator that returns a list of model refinements M1 . . . Mn given a model M . A
refinement can insert a new component, replace an unknown by a pattern (introducing
new unknowns for the constructor arguments), or replace a model component by an ex-
pression (a composition of references, values, and functions). The refinement operator
has access to the joint descriptions, so it can be guided by them. Similarly to previous
MDL-based approaches [22], we adopt a greedy search strategy based on the descrip-
tion length of models. At each step, starting with the initial model, the refinement that
reduces L(M, D) the most is selected. The search stops when no model refinement
reduces it. To compensate for the fact that the input and output grids may have very
different sizes, we actually use a normalized description length L̂ that gives the same
weight to the input and output components of the global DL, relative to the initial model.

L̂(M, D) = L(M^i, D^i) / L(M^i_init, D^i) + L(M^o, D^o) / L(M^o_init, D^o) ∈ [0, 2]

Our initial model uses the unknown grid ? for both input and output: M_init = (?, ?).
The available refinements are the following:
Table 4: Learning trace for task b94a9452 (in.lay[1] is inserted before in.lay[0] because we
use the final layer indices for clarity). in/out denotes the input/output model, L̂ is the normalized
DL.
step refinement L̂
0 (initial model) 2.000
1 in ← Layers(?, ?, []) 1.117
2 out ← Layers(?, ?, []) 0.272
3 in.lay[1] ← Layer(?, Col.(Rect.(?, ?), ?)) 0.179
4 out.lay[0] ← Layer(?, Col.(Rect.(?, ?), ?)) 0.101
5 out.size ← !lay[1].object.shape.size 0.079
6 in.lay[0] ← Layer(?, Col.(Rect.(?, ?), ?)) 0.070
7 out.lay[0].object ←
coloring(!lay[0].object, !lay[1].object.color) 0.045
8 out.color ← !lay[0].object.color 0.032
9 out.lay[0].pos ← !lay[0].pos−!lay[1].pos 0.020
10 in.lay[0].object.shape.mask ← Full 0.019
11 in.lay[1].object.shape.mask ← Full 0.019
12 in.color ← black 0.019

– the insertion of a new layer in the list of layers – one of Layer(?,Col.(Point,?)),
Layer(?,Col.(Rectangle(?,?),?)), Layer(?,!object), and !layer – where !object (resp.
!layer) is a reference to an input object (resp. an input layer);
– the replacement of an unknown at path p by a pattern P when for each example,
there is a parsed description π s.t. π.p matches P ;
– the replacement of a model component at path p by an expression e when for each
example, there is a description π s.t. π.p = e.
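The greedy loop itself fits in a few lines; the Python sketch below is schematic, with the refinement operator and the (normalized) description length of Section 5.2 passed in as functions, and a toy usage that merely "refines" an integer towards a target value.

def greedy_search(initial_model, data, refinements, dl):
    # repeatedly apply the refinement that most reduces dl(model, data);
    # stop when no refinement reduces it further
    model, best_dl = initial_model, dl(initial_model, data)
    while True:
        candidates = [(dl(m, data), m) for m in refinements(model, data)]
        if not candidates:
            return model
        new_dl, new_model = min(candidates, key=lambda c: c[0])
        if new_dl >= best_dl:
            return model
        model, best_dl = new_model, new_dl

# toy usage: models are integers, DL is the squared error to the data value
print(greedy_search(0, 7, lambda m, d: [m + 1, m - 1], lambda m, d: (m - d) ** 2))  # -> 7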
Table 4 shows the learning trace for task b94a9452, showing at each step the re-
finement that was found the most compressive. It reveals how the system learns about
the task (steps are given in brackets): “The input and output grids are made of layers
of objects over a background (1-2). There is a rectangle lay[1] in the input (3) and
a rectangle lay[0] in the output (4). The output grid is the size of lay[1] in input (5).
There is another rectangle lay[0] in the input, above lay[1] (6). We can use its color
for the background of the output (8). The output rectangle is the same as the input rect-
angle lay[0] but with the color of lay[1] (7), and its position is equal to the difference
between the positions of the two input rectangles (9). All rectangles are full (10-11) and
the input background is black (12).”

5.3 Pruning Phase


The learned model sometimes lacks generality, and fails on test examples. This is be-
cause the goal of MDL-based learning as defined above is to find the most compressive
task model on pairs of grids. This is relevant for the description mode, as well as for
the creation mode. However, in the prediction mode, the input grid model is used as a
pattern to match the input grid, and it should be as general as possible provided that
it captures the correct information for generating the output grid. For example, if all
input grids in training examples have height 10, then the model will fail on a test ex-
ample where the input grid has height 12, even if that height does not matter at all for
generating the output.
We therefore add a pruning phase as a post-processing of the learned model. The
principle is to start from this learned model, and to repeatedly apply inverse refinements
as long as this does not break correct predictions. Inverse refinements can remove a layer or
replace a constructor/value by an unknown. In order to have a uniform learning strat-
egy, we also use here an MDL-based strategy, only adapting the description length to
the prediction mode. In prediction mode, the input grid is given, and we therefore re-
place L(g^i, g^o | M) by L(g^o | M, g^i). Hence, the DL L(ρ^i, π^i, g^i | M^i, nil) becomes
L(ρ^i, π^i, g^i | M^i, nil, g^i), which is equal to L(ρ^i) because π^i is fully determined by
the input grid, the input grid model, and the parsing rank. This new definition makes
it possible to simplify the input model while decreasing the DL. Indeed, such simpli-
fications typically reduce L(M^i) but increase L(π^i | M^i, nil) and L(g^i | π^i). The two
latter terms are no longer counted in the prediction-oriented measure. The cost related
to ρ^i is important because choosing the wrong description of the input grid almost in-
variably leads to a wrong predicted output grid. The DL L(ρ^o, π^o, g^o | M^o, π^i) becomes
L(ρ^o, π^o, g^o | M^o, π^i, g^i), which is equal to the former as π^i is an abstract representa-
tion of g^i. Indeed, unlike for input grids, it is important to keep output grids as com-
pressed as possible.
On task b94a9452, starting from the model in Figure 2, the pruning phase performs
three generalization steps, replacing by unknowns: the Full mask of the two input rect-
angles, and the black color of the input background. Those generalizations happen not
to be necessary on the test examples of the task in ARC, but they make the model work
on input grids that would break invariants of training examples, e.g. a cross above a
rectangle above a blue background.
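Schematically, and with the same caveats as the search sketch above (the inverse-refinement operator and the prediction-oriented description length are assumed to be given), the pruning phase is another greedy loop:

def prune(model, data, inverse_refinements, prediction_dl):
    # greedily generalize the model (remove a layer, replace a constructor/value by an
    # unknown) as long as the prediction-oriented DL does not increase, i.e. as long as
    # the generalization does not break correct predictions
    best_dl = prediction_dl(model, data)
    improved = True
    while improved:
        improved = False
        for candidate in inverse_refinements(model):
            candidate_dl = prediction_dl(candidate, data)
            if candidate_dl <= best_dl:
                model, best_dl, improved = candidate, candidate_dl, True
                break
    return model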

6 Evaluation

In this section, we first evaluate our approach on ARC, comparing it to existing ap-
proaches in terms of success rates, efficiency, model complexity, and model natural-
ness. We then evaluate the generality of our approach beyond ARC by applying it to a
different domain, spreadsheets, where inputs and outputs are rows of strings. Our ex-
periments were run with single-thread implementations on Fedora 32, Intel Core i7x12
with 16GB memory. We used one run per task set as there is no randomness involved.

6.1 Abstraction and Reasoning Corpus (ARC)

We evaluated our approach on the 800 public ARC tasks, and we also took part in
the ARCathon 2022 challenge as team MADIL. The few parameters were set based
on the training tasks. To ensure a good balance of the computational time between
parsing and learning, we set some limits that remained stable across our experiments.
The number of descriptions produced by the parsing of a grid is limited to 64 and only
the 3 most compressive are retained for the computation of refinements. At each step, at
most 100,000 expressions are considered and only the 20 most promising refinements,
Table 5: Number of solved tasks (and percentage) and average learning time for solved tasks, for
different methods on different task sets
task set method solved tasks runtime
ARC train. Fischer et al, 2020 31 7.68%
(400 tasks) Alford et al, 2021 22 5.50%
Xu et al, 2022 57 14.25%
Ainooson et al, 2023 104 26.00% 178.7s
OURS 96 24.00% 4.6s
ARC eval. Ainooson et al, 2023 26 6.50%
(400 tasks) OURS 23 5.75% 11.4s
Kaggle’20 Icecuber (winner) 20.6%
(100 tasks) Fischer et al 3.0%
ARCathon’22 pablo (winner) 6 6%
(100 tasks) Ainooson et al 2 2%
OURS (4th ex-aequo) 2 2%

according to a DL estimate, are evaluated. The rehearsal rate α is set to 10. The tasks
are processed independently of each other, without learning from one to the other. The
results are given for a learning time per task limited to 60s plus 10s for the pruning
phase.
The learning and prediction logs and the screenshots of the solved training tasks are
available as supplementary materials.
Task sets and baselines. We consider four task sets for which results have been
reported: the 400 training and 400 evaluation public tasks, the 100 secret tasks of Kag-
gle’20, and the 100 secret tasks of ARCathon’22. We presume that those secret tasks
are taken from the 200 secret ARC tasks. As baselines, we consider published methods
that report results on the considered task sets [10,3,23,2]. We also include the winners
of the two challenges for reference. Unfortunately, the reported results are scarce, and
the papers do not provide their code. The code of our method is available as open source
on GitHub6 . Version 2.7 was used for the experiments reported here.
Success rates. On the training tasks, for which we have the most results to compare
with, our method solves 96 (24%) training tasks, almost on par with the best method, by
Ainooson et al (26%). Both methods also solved a similar number of evaluation tasks
(23 vs 26 tasks), and both solved 2/100 tasks in ARCathon’22, and ranked 4th ex-aequo.
Comparing the different task sets, it appears that the evaluation tasks are significantly
more difficult than the training tasks, and the secret tasks of ARCathon seem even more
difficult as the winner could only solve 6 tasks. Icecuber managed to correctly predict an
amazing 20.6% of the test output grids in Kaggle’20, but at the cost of the hand-coding
of 142 primitives, 10k lines of code, and brute-force search (millions of computed grids
per task).
The ARC evaluation protocol allows for three predictions per test example. How-
ever, the first prediction of our method is actually correct in 90 of the 96 solved train-
ing tasks. This shows that our learned models are accurate in their understanding of
6 https://github.com/sebferre/ARC-MDL
the tasks. To better evaluate the generalization capability of learned models, we also
measured the generalization rate as the proportion of models correct on the training
examples that are also correct on the test examples: 92% (94/102) on training tasks,
and 72% (23/32) on evaluation tasks. This again suggests that the evaluation tasks fea-
ture a higher generalization difficulty. Without the pruning phase, this rate decreases
to 89% (91/102) on training tasks. This shows that the pruning phase is useful, al-
though description-oriented model learning is already good at generalization. Reasons
for failures to generalize include: the test example has several objects while all training
examples have a single object; the training examples have a misleading invariant.
Efficiency and model complexity. According to Chollet, intelligence is the efficiency
of acquiring new skills. Although ARC enforces data efficiency by having only a
few training examples per task, and unique tasks, it does not enforce efficiency in the
amount of priors, nor in the computation resources. It is therefore useful to assess the
latter. We already mentioned Icecuber’s method that relies on a large number of primi-
tives, and intense computations. The method of Ainooson et al, which has comparable
performance to ours, uses 52 primitives and about 700s on average per solved task.
In comparison, our method uses 30 primitives and 4.6s per solved training task (21.7s
over all training tasks). Moreover, doubling the learning timeout to 120s does not lead
to solving more tasks, so 60s seems to be enough to find a solution if there is one.
Note also that our method does not stop learning when a solution is found but when no
more compression can be achieved.
Another way to evaluate efficiency is to look at the complexity of learned mod-
els, typically the number of primitives composing the model in program synthesis ap-
proaches. A good proxy for this complexity is the depth of search that was reached
in the allocated time. In our case, it is equal to the number of refinements applied to
the initial empty model. Few methods provide this information: Icecuber limits depth
to 4, and Ainooson’s best results are achieved with a brute-force search with maximum
depth 3. Methods based on DreamCoder [3] have similar limits but can learn more
complex programs by discovering and defining new operations as common composi-
tions of primitives, and reusing them from one task to another. Our method can dive
much deeper in less computation time, thanks to its greedy strategy. The number of
refinement steps achieved in a timeout of 60s on the training tasks ranges from 4 to 57,
with an average of 19 steps. This demonstrates the effectiveness of the MDL criterion
in guiding the search towards correct models. This claim is reinforced by the fact that a
beam search (width=3) did not lead to solving more tasks.
Learned models. The learned models for solved tasks are very diverse despite the
simplicity of our models. They express various transformations: e.g., moving an ob-
ject, extending lines, putting one object behind another, ordering objects from largest to
smallest, removing noise, etc. Note that none of these transformations is a primitive in
our models; they are learned in terms of objects, basic arithmetics, simple geometry,
and the MDL principle.
We compared our learned models to the natural programs of LARC [1]. Remark-
ably, many of our models involve the same objects and similar operations as the
natural programs. For example, the natural program for task b94a9452 is: “[The input
has] a square shape with a small square centered inside the large square on a black
Table 6: Pattern constructors by type for strings
type constructors
Row Cell[]
Cell Nil
Factor(left:Cell, token:Token, right:Cell)
Token Const(s:string)
Regex(re:Regex)
Regex Ident, Letters, Decimal, Digits, Spaces
T Alt(if:Cond, then:T , else:T )

background. The two squares are of different colors. Make an output grid that is the
same size as the large square. The size and position of the small inner square should be
the same as in the input grid. The colors of the two squares are exchanged.” For other
tasks, our models miss some notions used by natural programs but manage to compen-
sate for them: e.g., topological relations such as ”next to” or ”on top” are compensated by
the three attempts; the majority color is compensated by the MDL principle selecting
the largest object. However, in most cases, the same objects are identified.
These observations demonstrate that our object-centric models align well with the
natural programs produced by humans, unlike approaches based on the composition of
grid transformations. An example of a program learned by [10] on the task 23b5c85d
is strip black; split colors; sort Area; top; crop, which is a sequence
of grid-to-grid transformations, without explicit mention of objects.

6.2 From Grids to Strings (FlashFill)


A similar yet different kind of task, compared to ARC, is the automatic filling of some
columns in a spreadsheet given already filled columns, from only a few input-output
examples. A notable work in program synthesis [13] has led to a new feature in Mi-
crosoft Excel 2013, called FlashFill. As a simple example, consider a spreadsheet where
column A contains lastnames (e.g., Smith), column B contains firstnames (e.g., Jones
Paul), and column C is expected to contain the initial of the first firstname followed by
the lastname (e.g., J. Smith).
Like in ARC, each task comes with a few input-output pairs, and the output should
be predicted from the input. The main difference lies in the type of inputs and outputs,
here rows of strings instead of colored grids. The research hypothesis here is that by
changing only the definition of patterns, functions, and the model-specific DLs, our
approach is able to learn models that solve the tasks given as examples in the work
cited above.
Models. Table 6 lists the patterns of our models for rows of strings. A row model
describes a row of strings, and is simply an array of cell models. A cell model describes
the content of a spreadsheet cell, i.e. a string. It is either the empty string (Nil), or the
factorization of a string with a token in the middle and two substrings on each side. To-
kens here play the role of objects. A token model is either a constant string or a regular
expression, taken from a list of predefined ones: e.g., Digits = [0-9]+ matches con-
tiguous sequences of digits. Unknowns (?) can be used as cell models and as conditions
(Cond). Finally, the Alt constructor can be used in cell models and token models to
express alternatives (Alt(?,M1 ,M2 )), conditionals (Alt(expr,M1 ,M2 )), and optionals
(Alt(?,M ,Nil)). The available functions are so far limited: string length, filtering chars
(digits, letters, upper letters, lower letters), converting a string to uppercase or lower-
case, converting ints and bools to strings, equality to some constant value and logical
operators for conditions. Expressions and references are so far restricted to tokens.
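As a toy illustration of these constructors (our own Python rendering on a hypothetical input string, not the actual implementation), parsing with a Factor cell model whose token is Regex(Digits) is multi-valued: it returns one candidate factorization per match of the token, each with a left part, the token value, and a right part that nested Factors could parse further.

import re

DIGITS = re.compile(r"[0-9]+")

def parse_factor(string):
    # one description per possible placement of the Digits token
    return [{"left": string[:m.start()], "token": m.group(), "right": string[m.end():]}
            for m in DIGITS.finditer(string)]

print(parse_factor("order 425 of 2013"))
# two candidate factorizations, one per Digits match ('425' and '2013')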

For comparison, the DSL of FlashFill also uses predefined regular expressions but
uses them to locate positions in the string, rather than tokens. Their programs are condi-
tional expressions (switch), where each branch is a concatenation of substrings specified
by position, and constant strings. In contrast, our models allow for free nesting of con-
ditionals (Alt) and concatenation (Factor). However, their DSL has loops that have so
far no counterpart in our models.

Task set. For a preliminary evaluation, we used as a task set the 14 examples in [13].
Each task has one or two strings as inputs and one string as output, and 2-6 training ex-
amples (avg. 3.4). We complement them with 3-6 evaluation examples, some of which
feature some generalization difficulty. Those 14 tasks are available in the supplementary
materials in the same JSON format as ARC tasks.

Efficiency and success rates. Learning takes 1s or less, except for Task 1 and
Task 13 that have longer input strings and take respectively 9.9s and 5.2s. The depth
of search ranges from 11 to 76, and averages at 35 steps. For 11/14 tasks (all except
Tasks 4, 5, 9), the learned model correctly describes and predicts the training examples.
However, only 5 of those learned models generalize to all test examples: 3 models fail
on a single test example (e.g., in Task 3 the file extension .mp4 contains a digit unlike
other file extensions); in Task 8, the training examples are ambiguous because the input
string is a date in different formats, and the output string could either be the day or last
two digits of the year; in Task 11, there is a typo in the training examples (on purpose),
which makes the task under-specified. The (partial) failure for other tasks is explained
by missing features in our models, notably the counterpart of loops, or by a wrong se-
quence of refinements. For instance, Task 9 can be solved by delaying the insertion of
alternatives.

Learned models. Input models can be expressed as regular expressions with groups
on tokens and alternatives, and output models can be expressed as string interpolations
with group identifiers as variables. For Task 10, we obtain the following model.

M^i: \(.*\([0-9]+\).*\)?\([0-9]+\).*\([0-9]+\)
M^o: {if \1 then \2 else "425"}-\3-\4

The input is made of three integers, the first one being optional. The output is the
concatenation of those three integers, separated by dashes, and the first integer is 425
when missing in the input. In FlashFill, in general, a large number of programs is gen-
erated as an exhaustive search is performed. For Task 10, the program given as solution
in [13] is the following (A refers to the input column):
Switch((b1, e1), (b2, e2)), where
  b1 ≡ Match(A, NumTok, 3),
  b2 ≡ ¬Match(A, NumTok, 3),
  e1 ≡ Concatenate(SubStr2(A, NumTok, 1),
         Const("-"), SubStr2(A, NumTok, 2),
         Const("-"), SubStr2(A, NumTok, 3)),
  e2 ≡ Concatenate(Const("425-"), SubStr2(A, NumTok, 1),
         Const("-"), SubStr2(A, NumTok, 2))
This is illustrative of the different programming style between our pattern-based models
and the computation-based DSL programs.
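For illustration, the learned model for Task 10 can be replayed with an ordinary regular-expression engine. The Python sketch below uses a simplified PCRE-style transcription of M^i (three capture groups, the optional first integer and the two remaining ones, instead of the four groups of the BRE form above) and hypothetical inputs in the spirit of the task; it is not the actual implementation.

import re

PATTERN = re.compile(r"[^0-9]*(?:([0-9]+)[^0-9]+)?([0-9]+)[^0-9]+([0-9]+)[^0-9]*")

def predict_task10(s):
    first, second, third = PATTERN.fullmatch(s).groups()
    # default constant "425" when the optional first integer is missing
    return f"{first or '425'}-{second}-{third}"

for s in ["323-708-7700", "(425)-706-7709", "745-8139"]:   # hypothetical inputs
    print(s, "->", predict_task10(s))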

7 Conclusion and Perspectives


We have presented a novel and general approach to efficiently learn skills at tasks that
consist in generating structured outputs as a function of structured inputs. Following
Chollet’s measure of intelligence, efficient learning here means limited prior knowledge
for the target scope of tasks, only a few examples per task, and low computational
resources. Our approach is based on descriptive task models that combine object-centric
patterns and computations, and on the MDL principle for guiding the search for mod-
els. We have detailed an application to ARC tasks on colored grids, and sketched an
application to FlashFill tasks on strings. We have shown promising results, especially
in terms of efficiency, model complexity, and model naturalness.
Going further on ARC will require a substantial design effort as our current models
so far cover only a small subset of the knowledge priors that are required by ARC tasks
(e.g., goal-directedness). For FlashFill tasks, the addition of a counterpart for loops and
common functions is expected to suffice to match the state-of-the-art.

References
1. Acquaviva, S., Pu, Y., Kryven, M., Sechopoulos, T., Wong, C., Ecanow, G., Nye, M., Tessler,
M., Tenenbaum, J.: Communicating natural programs to humans and machines. Advances in
Neural Information Processing Systems 35, 3731–3743 (2022)
2. Ainooson, J., Sanyal, D., Michelson, J.P., Yang, Y., Kunda, M.: An approach for solving
tasks on the abstract reasoning corpus. arXiv preprint arXiv:2302.09425 (2023)
3. Alford, S., Gandhi, A., Rangamani, A., Banburski, A., Wang, T., Dandekar, S., Chin, J.,
Poggio, T.A., Chin, S.P.: Neural-guided, bidirectional program search for abstraction and
reasoning. CoRR abs/2110.11536 (2021), https://arxiv.org/abs/2110.11536
4. Chollet, F.: On the measure of intelligence. arXiv preprint arXiv:1911.01547 (2019)
5. Chollet, F.: A definition of intelligence for the real world. Journal of Artificial General Intel-
ligence 11(2), 27–30 (2020)
6. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional trans-
formers for language understanding. In: Conf. North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, NAACL-HLT. pp. 4171–
4186. Assoc. Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-1423
7. Elias, P.: Universal codeword sets and representations of the integers. IEEE Trans. Informa-
tion Theory 21(2), 194–203 (1975)
8. Ellis, K., et al.: Dreamcoder: Bootstrapping inductive program synthesis with wake-sleep
library learning. In: ACM Int. Conf. Programming Language Design and Implementation.
pp. 835–850 (2021)
9. Faas, M., Leeuwen, M.v.: Vouw: geometric pattern mining using the MDL principle. In: Int.
Symp. Intelligent Data Analysis. pp. 158–170. Springer (2020)
10. Fischer, R., Jakobs, M., Mücke, S., Morik, K.: Solving Abstract Reasoning Tasks with Gram-
matical Evolution. In: LWDA. pp. 6–10. CEUR-WS 2738 (2020)
11. Goertzel, B.: Artificial general intelligence: concept, state of the art, and future prospects.
Journal of Artificial General Intelligence 5(1), 1 (2014)
12. Grünwald, P., Roos, T.: Minimum description length revisited. International journal of math-
ematics for industry 11(01) (2019)
13. Gulwani, S.: Automating string processing in spreadsheets using input-output examples. In:
Symp. Principles of Programming Languages. pp. 317–330. ACM (2011)
14. Johnson, A., Vong, W.K., Lake, B., Gureckis, T.: Fast and flexible: Human program induction
in abstract reasoning tasks. arXiv preprint arXiv:2103.05823 (2021)
15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional
neural networks. Advances in neural information processing systems 25, 1097–1105 (2012)
16. Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: Human-level concept learning through
probabilistic program induction. Science 350(6266), 1332–1338 (2015)
17. Lieberman, H.: Your Wish is My Command. The Morgan Kaufmann series in interactive
technologies, Morgan Kaufmann / Elsevier (2001). https://doi.org/10.1016/b978-1-55860-688-3.x5000-3
18. Menon, A., Tamuz, O., Gulwani, S., Lampson, B., Kalai, A.: A machine learning framework
for programming by example. In: Int. Conf. Machine Learning. pp. 187–195. PMLR (2013)
19. Muggleton, S., Raedt, L.D.: Inductive logic programming: Theory and methods. Journal of
Logic Programming 19,20, 629–679 (1994)
20. Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)
21. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrit-
twieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of
Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
22. Vreeken, J., Van Leeuwen, M., Siebes, A.: Krimp: mining itemsets that compress. Data Min-
ing and Knowledge Discovery 23(1), 169–214 (2011)
23. Xu, Y., Khalil, E.B., Sanner, S.: Graphs, constraints, and search for the abstraction and rea-
soning corpus. arXiv preprint arXiv:2210.09880 (2022)

Supplementary Materials
We here describe the contents of the supplementary file accompanying the paper. A
public repository of the source code is also available at https://github.com/sebferre/
ARC-MDL for ARC and at https://github.com/sebferre/ARC-MDL-strings for Flash-
Fill. There are two main directories: one for ARC tasks and another for FlashFill tasks.

Task Sets
The public task sets of ARC are available online at https://github.com/fchollet/ARC.
There are two task sets: training tasks and evaluation tasks, each containing 400 tasks.
The task set of FlashFill is made of the 14 examples in [13]. We provide them as
JSON files in FlashFill/taskset/, using the same format as ARC tasks, except
that strings and arrays of strings are used instead of colored grids. For convenience, we
also provide the file FlashFill/taskset/all examples.json to allow for
browsing all examples in one file.

Results
We provide the learning and prediction logs for each task set:
– ARC/training tasks.log
– ARC/evaluation tasks.log
– FlashFill/tasks.log
Each log file starts with the hyperparameter values, and ends with global statistical
measures. For each task, it gives:
– the detailed DL (description length) of the initial model;
– the learning trace (including the pruning phase) as a sequence of refinements, and
showing the decrease of the normalized DL;
– the learned models before and after pruning and their detailed DL;
– the best joint description for each training example, except for ARC evaluation
tasks so as not to leak their contents to the AI developer (a recommendation made
by F. Chollet);
– the prediction for each training and test example;
– and finally a few measures for the task.
The measures given for each task and at the end are the following:
– runtime-learning: learning time in seconds (including the pruning phase);
– bits-train-error: the remaining error committed on output training grids, in
bits;
– acc-train-micro: the proportion of training output grids that are correctly
predicted;
– acc-train-macro: 1 if all training output grids are correctly predicted, 0 oth-
erwise;
– acc-train-mrr: Mean Reciprocal Rank (MRR) of correct predictions for train-
ing output grids, 1 if all first predictions are correct;
– acc-test-micro: the proportion of test output grids that are correctly pre-
dicted;
– acc-test-macro: 1 if all test output grids are correctly predicted, 0 otherwise;
– acc-test-mrr: Mean Reciprocal Rank (MRR) of correct predictions for test
output grids, 1 if all first predictions are correct.
The reference measure in ARC is acc-test-macro. The micro measures provide a
more fine-grained and more optimistic measure of success.
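Assuming the standard definition of MRR (the mean over examples of 1/rank of the first correct prediction, counted as 0 when none of the allowed predictions is correct), the *-mrr measures correspond to the following Python sketch:

def mean_reciprocal_rank(ranks_of_first_correct):
    # ranks are 1-based (at most 3 predictions per example in ARC); None means no correct prediction
    scores = [0.0 if r is None else 1.0 / r for r in ranks_of_first_correct]
    return sum(scores) / len(scores)

print(mean_reciprocal_rank([1, 1, 2, None]))   # -> 0.625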
For convenience, we also provide in ARC/solved tasks a picture for each of
the 96 training ARC tasks that are solved by our approach. We kindly invite the reader
to browse them to get a quick idea of the diversity of the tasks that our approach can
solve. The pictures are screenshots from the UI provided along with ARC tasks.
