
Using Contrastive Learning with Generative Similarity to Learn Spaces that Capture Human Inductive Biases

Raja Marjieh¹*, Sreejan Kumar²*, Declan Campbell², Liyi Zhang³, Gianluca Bencomo³, Jake Snell³, and Thomas L. Griffiths¹,³

arXiv:2405.19420v1 [cs.LG] 29 May 2024

¹ Department of Psychology, Princeton University
² Princeton Neuroscience Institute
³ Department of Computer Science, Princeton University
* Equal contribution

Abstract

Humans rely on strong inductive biases to learn from few examples and abstract
useful information from sensory data. Instilling such biases in machine learning
models has been shown to improve their performance on various benchmarks
including few-shot learning, robustness, and alignment. However, finding effective
training procedures to achieve that goal can be challenging as psychologically-
rich training data such as human similarity judgments are expensive to scale, and
Bayesian models of human inductive biases are often intractable for complex,
realistic domains. Here, we address this challenge by introducing a Bayesian
notion of generative similarity whereby two datapoints are considered similar if
they are likely to have been sampled from the same distribution. This measure can
be applied to complex generative processes, including probabilistic programs. We
show that generative similarity can be used to define a contrastive learning objective
even when its exact form is intractable, enabling learning of spatial embeddings
that express specific inductive biases. We demonstrate the utility of our approach by
showing how it can be used to capture human inductive biases for geometric shapes,
and to better distinguish different abstract drawing styles that are parameterized by
probabilistic programs.

1 Introduction
Human intelligence is characterized by strong inductive biases that enable humans to form meaningful
generalizations [1, 2], learn from few examples [2], and abstract useful information from sensory
data [3]. Instilling such biases into machine learning models has been at the center of numerous
recent studies [4–13], and has been shown to improve accuracy, few-shot learning, interpretability,
and robustness [14]. Key to this effort is the ability to find effective training procedures to imbue
neural networks with these inductive biases. Two prominent approaches for achieving that goal are
i) leveraging the extensive literature on modeling human inductive biases with Bayesian models
[1, 15] to specify a computational model for the bias of interest and then distilling it into the model,
usually via meta-learning [4, 5, 16], and ii) incorporating psychologically-rich human judgments in
the training objectives of models such as soft labels [17], categorization uncertainty [6, 18], language
descriptions [4, 19], and similarity judgments [11, 13, 20].
While both approaches are promising, they are not without limitations. Though Bayesian models
provide an effective description of human inductive biases, they are often computationally intractable
due to expensive Bayesian posterior computations that require summing over large hypothesis
spaces. This problem is particularly pronounced when considering symbolic models, which are key to modeling human behavior [21–25].

Figure 1: Schematic representation of generative similarity. A. Graphical models for the same and different data generation hypotheses. B. Example same and different quadrilateral shape pairs.

Likewise, incorporating human judgments in model objectives
may be intuitive, but it is often not scalable for the data needs of modern machine learning. For
example, while incorporating human similarity judgments (i.e., judgments of how similar pairs of stimuli are) has been shown to improve model behavior [11, 13, 20], collecting large amounts of these judgments at the scale of modern datasets is challenging, as the number of required judgments grows quadratically in the number of stimuli (though see [19] for proxies).
Here, we introduce a third approach based on the method of contrastive learning [26], a widely
used training procedure in machine learning. Contrastive learning uses the designation of datapoints
as being the “same” or “different” to learn a representation of those datapoints where the “same”
datapoints are encouraged to be closer together and “different” datapoints further apart. This approach
provides a way to go from a similarity measure to a representation. We define a principled notion of
similarity based on Bayesian inference (initially proposed in [27]) and show how it can be naturally
implemented in a contrastive learning framework even when its exact form is intractable. Specifically,
given a set of samples and a hierarchical generative model of the data from which data distributions
are first sampled and then individual samples are drawn (e.g., a Gaussian mixture), we define the
generative similarity between a pair of samples to be their probability of having been sampled
from the same distribution relative to that of them being sampled from two independently-drawn
distributions (Figure 1). By using Bayesian models to define similarity within a contrastive learning
framework, we provide a general procedure for instilling human inductive biases in machine models.
To demonstrate the utility of our approach, we apply it to three domains of increasing complexity.
First, we consider a Gaussian mixture example where similarity and embeddings are analytically
tractable, which we then further test with simulations. Then, we consider a generative model
for quadrilateral shapes where generative similarity can be computed in closed form and can be
incorporated explicitly in a contrastive objective. By training a model with this objective, we show
how it acquires human-like regularity biases in a geometric reasoning task. Finally, we consider
probabilistic programs, using two classes of probabilistic programs from DreamCoder [24, 28]. While
generative similarity is not tractable in this case, we show how it can be implicitly induced using a
Monte Carlo approximation applied to a triplet loss function, and how it leads to a representation
that better captures the structure of the programs compared to standard contrastive learning. Viewed
together, these results highlight a path towards alignment of human and machine intelligence by
instilling useful inductive biases from Bayesian models of cognition into machine models via a
scalable contrastive learning framework, and allowing neural networks to capture abstract domains
that previously were restricted to symbolic models.

2 Generative Similarity and Contrastive Learning

We begin by laying out the formulation of generative similarity and its integration within a contrastive learning framework. Given a set of samples D and an associated generative model of the data p(D) = ∫ p(D|θ)p(θ) dθ, where p(θ) is some prior over distribution parameters (e.g., a beta prior) and p(D|θ) is an associated likelihood function (e.g., a Bernoulli distribution), we define the generative similarity between a pair of samples, s_gen(x_1, x_2), to be the Bayesian probability odds ratio for the
probability that they were sampled from the same distribution to that of them being sampled from
two independent (or “different”) distributions
s_gen(x_1, x_2) = p(same | x_1, x_2) / p(different | x_1, x_2) = ∫ p(x_1|θ) p(x_2|θ) p(θ) dθ / ∫ p(x_1|θ_1) p(x_2|θ_2) p(θ_1) p(θ_2) dθ_1 dθ_2    (1)
where we assume that a priori p(same) = p(different) (i.e., the prior over the two hypotheses is
uniform). The same and different data generation hypotheses are shown in Figure 1A along with
example same and different pairs in Figure 1B in the case of a generative process of quadrilateral
shapes, with the same pair corresponding to two squares, and the different pair corresponding to a
square and a trapezoid.
Given the definition of generative similarity, we next distinguish between two scenarios. If s_gen is tractable, then its incorporation in a contrastive loss function is straightforward: given a parametric neural encoder ϕ_φ(x) and a prescription for deriving similarities from these embeddings, e.g., s_emb = s_0 e^{−d}, where d is a distance measure and s_0 is a constant, we can then directly optimize the embedding parameters such that the difference between the generative similarity and the corresponding embedding similarity is minimized, e.g.,

φ* = arg min_φ E_p [(s_emb(ϕ_φ(X_1), ϕ_φ(X_2)) − s_gen(X_1, X_2))²].    (2)
If, on the other hand, s_gen is not tractable, we can implicitly incorporate it in a neural network using individual triplet loss functions (here we focus on triplets for convenience, but our formalism can be easily adapted to larger sample tuples). Specifically, given a generative model of the data, we can define a corresponding contrastive generative model on data triplets (X, X⁺, X⁻) as follows

p_c(x, x⁺, x⁻) = ∫ p(x|θ⁺) p(x⁺|θ⁺) p(x⁻|θ⁻) p(θ⁺) p(θ⁻) dθ⁺ dθ⁻    (3)

and then, given a choice of a triplet contrast function ℓ, e.g., d(ϕ(x), ϕ(x⁺)) − d(ϕ(x), ϕ(x⁻)) or some monotonic function of it (alternatively, one could also use an embedding similarity measure such as the dot product [29]), we define the optimal embedding to be

φ* = arg min_φ E_{p_c} ℓ(ϕ_φ(X), ϕ_φ(X⁺), ϕ_φ(X⁻)).    (4)

Crucially, this function can be easily estimated using a Monte Carlo approximation with triplets sampled from the following process

θ⁺, θ⁻ ∼ p(θ)      (sample distributions)
x, x⁺ ∼ p(x|θ⁺)    (sample ‘same’ examples)    (5)
x⁻ ∼ p(x|θ⁻)       (sample ‘different’ example)
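To make this sampling process concrete, the following minimal Python sketch implements Equation (5); the samplers sample_theta and sample_x are placeholders for whatever hierarchical generative model is at hand, and the Gaussian mixture below is purely an illustrative choice.

import numpy as np

rng = np.random.default_rng(0)

def sample_triplet(sample_theta, sample_x):
    """Draw one (x, x+, x-) triplet following Equation (5)."""
    theta_pos, theta_neg = sample_theta(rng), sample_theta(rng)  # theta+, theta- ~ p(theta)
    x = sample_x(rng, theta_pos)      # anchor
    x_pos = sample_x(rng, theta_pos)  # 'same' example: shares theta+ with the anchor
    x_neg = sample_x(rng, theta_neg)  # 'different' example: independently drawn theta-
    return x, x_pos, x_neg

# Illustrative generative model: p(theta) is uniform over two Gaussian means.
means = [np.array([5.0, 5.0]), np.array([1.0, 1.0])]
x, x_pos, x_neg = sample_triplet(
    lambda r: means[r.integers(2)],   # sample a distribution (here, a mean)
    lambda r, mu: r.normal(mu, 1.0),  # sample a point from that distribution
)

Averaging a contrast function ℓ over many such triplets yields the Monte Carlo estimate of the objective in Equation (4).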
The functional in Equation (4) has some desirable properties. First, if ℓ is chosen to be convex and strictly increasing in ∆_ϕ(x, x⁺, x⁻) ≡ d(ϕ(x), ϕ(x⁺)) − d(ϕ(x), ϕ(x⁻)) (e.g., the softmax loss ℓ(∆_ϕ) = log(1 + exp(∆_ϕ)); [29]), then the optimal embedding that minimizes Equation (4) ensures that the expected distance between same pairs is strictly smaller than that of different pairs as defined by the processes in Figure 1A, i.e.,

E_{p_same} d(ϕ*(X), ϕ*(X⁺)) < E_{p_diff} d(ϕ*(X), ϕ*(X⁻))    (6)

(see Appendix A for a proof). Second, under suitable redefinitions, the special case of ℓ(∆) = ∆ can be related to the objective D_KL[s‖p_same] − D_KL[s‖p_diff], where D_KL is the Kullback-Leibler divergence, which is akin to contrastive divergence learning [30] (see Appendices B-D).
In what follows, we apply the framework that we just laid out to three domains of increasing
complexity, namely, Gaussian mixtures, geometric shapes, and finally probabilistic programs.

3 Experiments
3.1 Motivating Example: Generative Similarity of a Gaussian Mixture

To get a sense of the contrastive training procedure with generative similarity, we begin with a Gaussian mixture example. Gaussian mixtures are an ideal starting point because i) they are analytically and numerically tractable, and ii) they play a key role in the cognitive literature on models of categorization [31, 32].

Figure 2: Encoding the generative similarity of a Gaussian mixture. A. Optimal linear projection vector φ for a symmetric Gaussian mixture with means ±µ. B. Learned 1D embedding values from a two-layer perceptron for points sampled from a 2D Gaussian mixture. C. Mean generative similarity as a function of distance in the embedding space shown in B (discretized into 500 quantile bins). Shaded area indicates 95% CIs bootstrapped over data points.
3.1.1 Analytical Case: Linear Projections


Consider a data generative process that is given by a mixture of two Gaussians with means µ_{1,2}, equal variances σ², and a uniform prior p_{1,2} = 1/2. Without loss of generality, we can choose a coordinate system in which µ_1 = −µ_2 ≡ µ. Consider further a subfamily of embeddings that are specified by linear projections of the form ϕ_φ(x) = φ · x, where φ is a unit vector of choice, φ · φ = 1. A natural measure of contrastive loss in this case would be

E_{p_c} ℓ = E_{p_c} [ϕ_φ(X) · ϕ_φ(X⁻) − ϕ_φ(X) · ϕ_φ(X⁺)]    (7)

i.e., using the ‘dot’ product as a measure of embedding similarity [29] (for one-dimensional embeddings this is just a regular product). By plugging in the definition of the Gaussian mixture and simplifying using Gaussian moments (see Appendix E), Equation (7) boils down to

E_{p_c} ℓ ∝ −4‖µ‖²₂ cos² θ_{φµ}    (8)

where θ_{φµ} is the angle between φ and µ. This loss is minimized for cos θ_{φµ} = ±1, or equivalently φ* = ±µ̂, where µ̂ is the normalized version of µ. This means that the optimal linear mapping is simply one that projects different points onto the axis connecting the centers of the two Gaussians, µ_1 − µ_2 = 2µ, which is equivalent to a linear decision boundary that is orthogonal to the line µ_1 − µ_2 and passing through the origin, thus effectively recovering a linear classifier (Figure 2A).

3.1.2 Numerical Case: Two-Layer Perceptron


Next, to test the Monte Carlo approximation (5) and to see how well it tracks the theoretical generative similarity, we considered an embedding family that is parametrized by two-layer perceptrons. For the generative family, we chose as before a mixture of two Gaussians, this time with mean values of µ_1 = (5, 5) and µ_2 = (1, 1) and unit variance σ² = 1. As for the loss function, here we used a quadratic (Euclidean) loss of the form

L = (1/N_triplets) Σ_{x, x⁺, x⁻} [(ϕ(x) − ϕ(x⁺))² − (ϕ(x) − ϕ(x⁻))²]    (9)

where {x, x⁺, x⁻} are triplets sampled from Equation (5). We trained the perceptron model using 10,000 triplets (learning rate = 10⁻⁵, hidden layer size = 32, batch size = 256, and 300 epochs).
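A minimal PyTorch sketch of this training setup follows; details beyond those stated above (e.g., the choice of the Adam optimizer) are illustrative assumptions:

import torch
import torch.nn as nn

# Two-layer perceptron with a hidden size of 32 and a 1D embedding output
encoder = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-5)

def quadratic_triplet_loss(x, x_pos, x_neg):
    # Equation (9): squared embedding distance to the 'same' example
    # minus squared embedding distance to the 'different' example
    return ((encoder(x) - encoder(x_pos)) ** 2
            - (encoder(x) - encoder(x_neg)) ** 2).mean()

# Training loop over batches of triplets sampled via Equation (5):
# for x, x_pos, x_neg in triplet_loader:  # each of shape (batch, 2)
#     optimizer.zero_grad()
#     quadratic_triplet_loss(x, x_pos, x_neg).backward()
#     optimizer.step()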
The resulting model successfully learned to distinguish the two Gaussians as seen visually from the
embedding values in Figure 2B, and also from the test accuracy of 99.7%. Finally, we wanted to see how well the embedding distance tracked the theoretical generative similarity, which in this case can be derived in closed form by simply plugging the Gaussian distributions into Equation (1):

s(x_1, x_2) = [½ e^{−(x_1−µ_1)²/2σ²} e^{−(x_2−µ_1)²/2σ²} + ½ e^{−(x_1−µ_2)²/2σ²} e^{−(x_2−µ_2)²/2σ²}] / [(½ e^{−(x_1−µ_1)²/2σ²} + ½ e^{−(x_1−µ_2)²/2σ²}) (½ e^{−(x_2−µ_1)²/2σ²} + ½ e^{−(x_2−µ_2)²/2σ²})].    (10)
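For reference, Equation (10) is straightforward to evaluate numerically; the following sketch drops the Gaussian normalizing constants, which cancel in the ratio:

import numpy as np

def gen_sim_two_gaussians(x1, x2, mu1, mu2, sigma=1.0):
    """Closed-form generative similarity (Equation 10) for an equal-weight two-Gaussian mixture."""
    def lik(x, mu):  # unnormalized Gaussian likelihood
        return np.exp(-np.sum((np.asarray(x) - mu) ** 2) / (2 * sigma ** 2))
    p_same = 0.5 * lik(x1, mu1) * lik(x2, mu1) + 0.5 * lik(x1, mu2) * lik(x2, mu2)
    p_diff = ((0.5 * lik(x1, mu1) + 0.5 * lik(x1, mu2))
              * (0.5 * lik(x2, mu1) + 0.5 * lik(x2, mu2)))
    return p_same / p_diff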
The mean generative similarity as a function of embedding distance between pairs (grouped into 500 quantile bins) is shown in Figure 2C. We see that it is indeed a monotonically decreasing function of distance in embedding space (Spearman's ρ(498) = −0.99, p < 10⁻³). Moreover, we found that the average distance 95% (1.96-sigma) confidence interval (CI) for same pairs was [1.62, 1.66], whereas for different pairs it was [4.77, 4.91], consistent with the prediction of Equation (6). Overall, these results give us confidence in our formalism, and our goal next is to apply it to domains that are more challenging and realistic: geometric quadrilateral shapes and abstract drawing styles.

3.2 Instilling Human Geometric Shape Regularity Biases

[Figure 3 panels: A. An example oddball trial. B. Mean error rates (percent incorrect trials) across quadrilateral types, from square to random, for humans (ρ = 0.86, p = 0.0006), baboons (ρ = 0.34, p = 0.303), and pretrained CorNet (ρ = 0.35, p = 0.285) (top), and for finetuned CorNet models: Supervised (ρ = 0.472, p = 0.142) and Contrastive GenSim, ours (ρ = 0.88, p = 0.0003) (bottom). C. Error rate correlation (Pearson's R) between the finetuned models and baboons or humans.]

Figure 3: Generative similarity can instill human geometric regularity biases. A. The oddball
task of Sablé-Meyer et al. [23] used six quadrilateral stimulus images, in which five images were of
the same reference shape (differing in scale and rotation) and one was an oddball (highlighted in red)
that diverged from the reference shape’s geometric properties. In this example, the reference shape is
a rectangle; note that the oddball does not have four right angles like the rectangles. B. Sablé-Meyer
et al. [23] examined error rates for humans, monkeys, and pre-trained Convolutional Neural Networks
(CNNs) [33] (top) across quadrilaterals of decreasing geometric regularity. We compare the same
CNN model Sablé-Meyer et al. [23] use with different finetuning objectives (bottom). We report the
Spearman rank correlation between model performance and number of geometric regularities across
quadrilateral type (see Table 1 in Appendix G for these values). Error bars denote confidence intervals
over different subjects (humans, monkeys, or model training seeds). C. Correlation between finetuned
models’ error rates with human or monkey error rates. Error bars denote confidence intervals over 10
training runs.

3.2.1 Background
There is considerable evidence from psychological research that humans are uniquely sensitive to abstract geometric regularity [34, 35]. Sablé-Meyer et al. [23] compared diverse human groups (varying in education, cultural background, and age) to non-human primates on a simple oddball discrimination task. Participants were shown a set of five reference shapes and one “oddball” shape and were prompted to identify the oddball (Figure 3A). The reference shapes were generated using basic geometric regularities: parallel lines, equal sides, equal angles, and right angles, which can be
specified by a binary vector corresponding to the presence or absence of these specific geometric
features. There were 11 types of quadrilateral reference shapes with varying geometric regularity,
from squares (most regular) to random quadrilaterals containing no parallel lines, right angles, or
equal angles/sides (least regular; Figure 3B). In each trial, five different versions of the same reference
shape (e.g., a square) were shown in different sizes and orientations. The oddball shape was a modified
version of the reference shape, in which the lower right vertex was moved such that it violated the
regularity of the original reference shape (e.g., moving the lower right vertex of a trapezoid such that
it no longer has parallel sides). Figure 3A shows an example trial.
Sablé-Meyer et al. [23] found that humans were naturally sensitive to these geometric regularities
(right angles, parallelism, symmetry, etc.) whereas non-human primates were not. Specifically,
they found that human performance was best on the oddball task for the most regular shapes,
and systematically decreased as shapes became more irregular. Conversely, non-human primates
performed well above chance, but they performed worse than humans overall and, critically, exhibited
no influence of geometric regularity (Figure 3B). Additionally, they tested a pretrained convolutional
neural network (CNN) model, CorNet [33], on the task. CorNet (Core Object Recognition Network)
is a convolutional neural network model with an architecture that explicitly models the primate ventral
visual stream. It is pretrained on a standard supervised object recognition objective on ImageNet
and is one of the top-scoring models on “Brain-Score”, a benchmark designed to test models of the
visual system using both behavioral and neural data [36]. Like the monkeys, CorNet exhibited no
systematic relationship with the level of geometric regularity (Figure 3B).

3.2.2 Generative Similarity Experiment


Intuitive geometry serves as an ideal case study for our framework because i) it admits a generative
similarity measure that can be computed in closed form, and ii) we can use it to test whether our
contrastive training framework can induce the human inductive bias observed by Sablé-Meyer et al.
[23] in a neural network.
Recall that the shape categories of Sablé-Meyer et al. [23] can be specified by binary feature vectors
corresponding to the presence or absence of abstract geometric features (equal angles, equal sides,
parallel lines, and right angles of the quadrilateral) from which individual examples (or exemplars)
can be sampled. Formally, we can define a natural generative process for such shapes as follows:
given a set of binary geometric feature variables {F_1, . . . , F_n} ∈ {0, 1}ⁿ, we define a hierarchical distribution over shapes by first sampling Bernoulli parameters θ_i for each feature variable F_i from a prior Beta(α, β), then sampling feature values f = (f_1, . . . , f_n) from the resulting Bernoulli distributions Bern(θ_i), and then uniformly sampling a shape σ(f) from a (possibly large) list of available exemplars S(f) = {σ_1(f), . . . , σ_M(f)} that are consistent with the sampled feature vector f (the set could also be empty if the geometric features are not realizable due to geometric constraints). In other words, the discrete generative process is defined as

θ_i ∼ Beta(α, β)            (sample Bernoulli parameters)
f_i ∼ Bern(θ_i)             (sample discrete features)    (11)
σ(f) ∼ Uniform(S(f))        (sample shape exemplar)

This process covers both soft and definite categories, and our current setting corresponds to the special limit α = β → 0, in which case the Beta prior over Bernoulli parameters becomes concentrated around 0 and 1, so that the process becomes that of choosing a category specified by a set of geometric attributes and then sampling a corresponding exemplar. In Appendix F, we use the conjugacy relations between the Beta and Bernoulli distributions to derive the generative similarity associated with the process (11) in closed form, and we further show that in our limit of interest (α = β → 0) the corresponding generative similarity between shapes σ_1, σ_2 with feature vectors f^(1), f^(2) is given by

log s(σ_1(f^(1)), σ_2(f^(2))) ∝ − Σ_i (f_i^(1) − f_i^(2))²    (12)

We used this generative similarity measure to finetune CorNet, the same model Sablé-Meyer et al. [23] used in their experiments, to see whether our measure would induce the human geometric regularity bias (Figure 3). Specifically, given a random pair of quadrilateral stimuli from [23], we computed the above quantity (i.e., the Euclidean distance) between their respective binary geometric feature vectors (presence and absence of equal sides, equal angles, and right angles) and finetuned the pretrained CorNet model on a contrastive learning objective using these distances. This pushed quadrilaterals with similar geometric features together and pulled those with different geometric features apart in the model's representation (additional details regarding training are provided in Appendix G). Like Sablé-Meyer et al. [23], to test the model on the oddball task, we extract the embeddings for all six choice images and choose the oddball as the one that is furthest (in Euclidean distance) from the mean embedding.
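A schematic PyTorch sketch of this finetuning objective and the oddball readout follows; implementation details such as batching and the exact embedding layer are illustrative rather than exact:

import torch
import torch.nn.functional as F

def gen_sim_loss(model, images, feature_vectors):
    """MSE between pairwise embedding distances and pairwise geometric feature distances."""
    z = model(images)                          # (B, D) embeddings
    d_emb = torch.cdist(z, z)                  # pairwise Euclidean distances in embedding space
    d_feat = torch.cdist(feature_vectors.float(), feature_vectors.float())
    return F.mse_loss(d_emb, d_feat)

def pick_oddball(model, six_images):
    z = model(six_images)                      # (6, D) embeddings of the six choice images
    dist_to_mean = (z - z.mean(dim=0, keepdim=True)).norm(dim=1)
    return dist_to_mean.argmax().item()        # oddball = furthest from the mean embedding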

3.2.3 Results

The geometric regularity effect observed for humans in [23] was an inverse relationship between
geometric regularity and error rate (see green line in top plot of Figure 3B). For example, humans
performed best on the most regular shapes, such as squares and rectangles. This regularity effect was
absent in the monkey and pretrained CorNet error rates (Figure 3B; top panel). In Figure 3B (bottom
panel), we show error rates as a function of geometric regularity on a CorNet model finetuned on
generative similarity (GenSim CorNet; blue line bottom plot). We also show the performance of a
CorNet model finetuned on a supervised classification objective on the quadrilateral stimuli (grey line
bottom plot), where the model must classify which of the 11 categories a quadrilateral belongs to.
Note that Sablé-Meyer et al. [23] also finetuned CorNet on the same supervised classification objective
in their supplementary results, and we replicate their results here as a baseline for our proposed
method. Both models were trained for 13 epochs (the same number of epochs used by Sablé-Meyer
et al. [23]). Like humans, the model error rates for the GenSim CorNet model significantly increase
as the shapes become more irregular (Spearman’s ρ(9) = 0.88, p = 0.0003). This is not the case for
the model finetuned with supervised classification (Spearman’s ρ(9) = 0.472, p = 0.142). We also
correlated the error rates of the finetuned models with those of humans and monkeys (Figure 3C)
and see a double dissociation between the two models. Specifically, the generative similarity-trained
CorNet model’s error rates match human error rates significantly more than monkey error rates,
t(18) = 12.45, p < 0.0001, whereas those of the baseline supervised CorNet model match monkey
error rates significantly more than human error rates (Figure 3C), t(18) = 17.43, p < 0.0001. We
replicated the geometric regularity effect when using contrastive learning with generative similarity on
a different architecture (Supplementary Figure S1) and also saw that training on a standard contrastive
learning objective from SimCLR [26] does not yield the regularity effect (Supplementary Figure S2).

3.3 Learning Abstract Drawings using Generative Similarity over Probabilistic Programs

3.3.1 Background

Our final test case is based on a recent study on capturing human intuitions of psychological
complexity for abstract geometric drawings [24]. Sablé-Meyer et al. [24] framed geometric concept
learning as probabilistic program induction within the DreamCoder framework [28]. A base set of
primitives were defined such that motor programs that draw geometric patterns are generated through
recursive combination of these primitives within a Domain Specific Language (DSL, Figure 4A).
The DSL contains motor primitives, such as tracing a particular curve and changing direction, as
well as control primitives to recursively combine subprograms, such as Concat (concatenate two
subprograms together) and Repeat (repeat a subprogram n times). DreamCoder can be trained on
drawings to learn a grammar from which probabilistic programs can be sampled. These probabilistic
programs can then be rendered into images such as the ones seen in Figure 4. Sablé-Meyer et al. [24]
used a working memory task with these stimuli to show that people's intuitions about the psychological complexity of an image can be modeled through the complexity of the underlying program.
The study showed that DreamCoder can produce grammars of different abstract drawing styles
depending on its training data. For example, when trained on “Greek-style” drawings with highly
rectilinear structure, DreamCoder learns a grammar that synthesizes programs which capture this
drawing style (Figure 4B). Likewise, when trained on “Celtic-style” drawings that feature lots of
circles and curves, DreamCoder learns a different grammar that captures the Celtic drawing style
(Figure 4B). Both grammars use the same set of base primitives, but weight the primitives differently
and thus produce images that differ in their abstract drawing style.

3.3.2 Generative Similarity Experiment
We wanted to see whether generative similarity over probabilistic programs can allow neural networks
to capture the abstract structure that such programs represent. We tested this by training a neural
network using generative similarity over probabilistic programs from the different grammars discussed
in the previous section (Greek or Celtic, see Figure 4B). In this case, the generative similarity is
intractable, but we can apply a Monte Carlo approximation to Equation (4) by sampling from the
program grammar.
We employ the following technique to generate Monte Carlo triplet samples for the triplet contrastive
loss function. The anchor is randomly sampled from either the Celtic or Greek grammars that are
learned through DreamCoder (with equal probabilities). The positive example is sampled from the
same grammar as the anchor and the negative example is another random sample from either the
Celtic or Greek grammar with equal probability, consistent with Equation (5).
We used 20k examples from both the Celtic and Greek grammars (40k images in total) for training and
800 examples from each grammar for testing. Because of the similarity of the stimuli in Figure 4B to
handwritten characters, we used the same CNN architecture that [37] used on the Omniglot dataset
[21] with six convolutional blocks consisting of 64-filter 3 × 3 convolution, a batch normalization
layer, a ReLU nonlinearity, and a 2 × 2 max-pooling layer. As a baseline, we trained another model,
with the same architecture and training data, on a standard contrastive learning objective used in the
SimCLR paper [26]. The SimCLR objective produces augmented versions of an image (e.g. random
cropping, rotations, gaussian blurring, etc.) and trains representations of an image and its augmented
version to be as similar as possible, as well as representations of an image and different images to be
as dissimilar as possible. See Appendix H for more details on training.
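A minimal sketch of this triplet construction follows; sample_image stands in for rendering a random program drawn from a given grammar and is a placeholder, not an actual DreamCoder API:

import random

def sample_grammar_triplet(sample_image):
    """One (anchor, positive, negative) triplet over grammars, per Equation (5)."""
    grammars = ["celtic", "greek"]
    c = random.choice(grammars)                        # anchor grammar, equal probability
    anchor = sample_image(c)
    positive = sample_image(c)                         # same grammar as the anchor
    negative = sample_image(random.choice(grammars))   # independently drawn grammar
    return anchor, positive, negative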

[Figure 4 panels: A. Geometry LoT primitives of the DSL:
Program :=
| Program; Program  (Concatenate: run one program and then another)
| Repeat([Int=2]) { Program }  (Repeat a program a certain number of times)
| Subprogram { Program }  (Execute a program, then restore the original state)
| Trace([t=Int=1], [speed=Num=+1], [acceleration=Num=+0], [turningSpeed=Num=+0])  (Trace a curve by moving according to the parameters)
| Move([t=Num=+1])  (Move a certain distance without tracing anything)
| Turn(angle=Num)  (Rotate the current heading)
A sampled program, e.g., Repeat(2) { Trace(accel=((1 + 1) + 1), rotSpeed=1) }, is rendered into a corresponding image.
B. Example drawings from the Celtic and Greek grammars. C. Classification accuracy (Greek or Celtic). D. Regression performance (number of program primitives, R² score).]

Figure 4: Generative similarity helps contrastive learning models better represent probabilistic
programs. A. Primitives of the generative Language of Thought (LoT) DreamCoder model im-
plemented in [24]. Primitives are recursively composed to produce symbolic programs that can be
rendered into abstract geometric pattern stimuli. B. Sablé-Meyer et al. [24] trained DreamCoder on
two sets of drawings, Celtic and Greek, to produce two different grammars that produced qualitatively
different drawings. C. Performance (with CIs over 10 training runs) of embeddings on classifying images as from the Celtic or Greek grammars. D. Performance of embeddings on predicting the number of primitives of the program used to generate the image stimulus (see A).

3.3.3 Results
To compare the ability of the model’s embeddings to separate Celtic or Greek images after training
on either contrastive objective, we took the embeddings of all test images and trained a logistic
regression model to classify Celtic or Greek images. The logistic regression model was trained
and evaluated using five-fold cross-validation, where the regularization parameter was tuned using
nested folds within the training set. The mean test set classification accuracy from the embeddings
trained with contrastive generative similarity was 84% (95% CI [81.9, 84.9]), which was signifi-
cantly higher than that of the SimCLR contrastive learning baseline, 72.8% (95% CI [71.8, 73.8]),
t(18) = 12.81, p < .0001 (see Figure 4C). We replicated these results on a different architecture
(Supplementary Figure S3). We also found that the first two principal components of the model’s
embeddings can separate the two drawing styles (Figure 5), supporting the notion that generative
pretraining facilitates factorization of task-relevant dimensions explored in prior work [38].
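A sketch of this evaluation with scikit-learn; the regularization grid is an illustrative choice (only the five-fold scheme with nested tuning folds is specified above):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# embeddings: (n_images, emb_dim) array of test-image embeddings
# labels: 0 = Celtic, 1 = Greek
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # assumed grid
                     cv=5)                                      # nested tuning folds
accuracy = cross_val_score(inner, embeddings, labels, cv=5)     # outer five-fold CV
print(accuracy.mean())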

We also compared the ability of the learned embeddings to encode properties of the underlying
program (Figure 4D). For each test image, we counted the number of motor and control primitives
(Figure 4A) within the programs used to generate each image. Then, we trained a ridge regression
model to predict this number from the respective image embedding, regressing out average grey-level
of the image prior to training as a potential confound. The ridge regression model was trained
and evaluated using five-fold cross-validation, where the regularization parameter was tuned using
nested folds within the training set. The average test set score using the generative similarity model,
R2 = 0.50 (95% CI [0.46, 0.52]), was significantly higher than that of the SimCLR baseline model,
R2 = 0.23 (95% CI [0.21, 0.24]), t(18) = 17.48, p < 0.0001. This suggests that the embedding
space of the generative similarity model better encodes properties of the original programs.
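A corresponding sketch for this regression analysis; residualizing the target on the grey-level confound is one plausible implementation of the "regressing out" step, and the alpha grid is an illustrative choice:

import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score

# grey: (n_images,) average grey-levels; n_primitives: (n_images,) primitive counts
confound = grey.reshape(-1, 1)
target = n_primitives - LinearRegression().fit(confound, n_primitives).predict(confound)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13))  # regularization tuned internally
r2 = cross_val_score(ridge, embeddings, target, cv=5, scoring="r2")
print(r2.mean())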

Figure 5: PCA space of embeddings from model trained with generative similarity triplet loss.
The model clearly separates “Greek” (left) from “Celtic” (right) styles, with mixed styles in between.

4 Discussion
We have introduced a new framework for learning representations that capture human inductive biases
by combining a Bayesian notion of generative similarity with contrastive learning. Our framework is
very general and can be applied to any hierarchical generative process, even when the exact form of
inferences in generative similarity is intractable, allowing neural networks to capture domains that
were previously restricted to symbolic models. To demonstrate the utility of our approach we applied
it to three domains of increasing complexity. First, we investigated an analytically tractable case
that involved a mixture of two Gaussians and showed that mean generative similarity monotonically
decreases with respect to distance in the contrastively-learned embedding space (Figure 2). Second,
we examined a visual perception task with quadrilaterals used in cognitive science [23] and showed
that our procedure is able to produce the human geometric regularity bias in models that were
previously unable to (Figure 3). Third, we considered the domain of probabilistic programs that
synthesize abstract drawings of different styles and showed improved representations of the programs
compared to standard contrastive paradigms like SimCLR [26] (Figure 4).
There are some limitations to our work that point towards future research directions. First, there may be multiple candidate similarity measures reflecting different hypotheses concerning the generative model. Characteriz-
ing the kind of generative models underlying human judgments in various domains of interest (see
e.g., [39] for a recent study on skeletal models of shapes) is key to instilling the right inductive biases
in machine models. Second, while our experiments test an array of domains that vary in complexity
to highlight the flexibility of our approach, there is still room for testing richer generative processes.
In the domains we examined in this work, there is generally one main level of hierarchy (e.g., in
our Gaussian mixtures domain, which Gaussian one is sampling from, or in the drawing domain,
which grammar one is sampling from). However, many Bayesian models in cognitive science can
employ multiple layers of hierarchy [25] and it would be exciting to apply our framework for such
models in the future. Third, in our work we mainly focus on domains related to vision, but our
framework is general enough to be applicable for other modalities or applications. For example, Large
Language Models can often produce unpredictable failures in logical [40] and causal [41] reasoning.
Cognitive scientists have written models for human logical reasoning [42] or causal learning [43]
based on probabilistic Bayesian inference. Potential future work may involve contrastive learning
with generative similarity over Bayesian models of reasoning to imbue language models with logical
and causal reasoning abilities. Note that, although contrastive learning is most commonly used
in vision, there is precedent for its use in the language domain [44, 45]. Finally, not all human
inductive biases are desirable and some may even have adverse societal effects (e.g., in the context of
social judgments and categories [46]). Researchers should take the utmost care in deciding what they
choose to instill into their models, and what social implications that may entail.
Strong inductive biases are a hallmark of human intelligence [22]. Finding ways to imbue such
biases in machine models is key for developing more generally intelligent AI as well as for achieving
human-AI alignment. Our work takes an exciting step towards that goal, and we hope that it will
inspire others to pursue similar ideas further.
Acknowledgements. This work was supported by grant N00014-23-1-2510 from the Office of Naval
Research. S.K. is supported by a Google PhD Fellowship. We thank Mathias Sablé-Meyer for
assisting us with accessing the data in his work and general advice.

Appendix

A Separation in Expectation
Our goal is to show that for any triplet loss function ℓ(∆_ϕ) that is convex and strictly increasing in ∆_ϕ(x, x⁺, x⁻) = d_ϕ(x, x⁺) − d_ϕ(x, x⁻), where d_ϕ(x, y) = d(ϕ(x), ϕ(y)) is a given embedding distance measure (e.g., softmax loss or quadratic loss), the optimal embedding that minimizes Equation (4) ensures that the expected distance between same pairs is strictly smaller than that of different pairs as defined by the generative process in Figure 1A. To see that, let ϕ* denote the optimal embedding and let ℓ* denote its achieved loss. By definition, for any suboptimal embedding ϕ_sub which achieves ℓ_sub we have ℓ* < ℓ_sub. One such suboptimal embedding (assuming non-degenerate distributions) is the constant embedding which collapses all samples into a point, ϕ_sub = ϕ_0. In that case we have ∆ = 0 and hence ℓ* < ℓ(0). Now, using Jensen's inequality, we have

ℓ(0) > ℓ* = E_{p_c} ℓ(∆_{ϕ*}(X, X⁺, X⁻)) ≥ ℓ(E_{p_c} ∆_{ϕ*}(X, X⁺, X⁻)).    (A1)

Observe next that since ℓ is strictly increasing (and hence its inverse is well-defined and strictly increasing), it follows that E_{p_c} ∆_{ϕ*}(X, X⁺, X⁻) < 0. Finally, by noting that

E_{p_c} d_{ϕ*}(X, X⁺) = ∫ p(x|θ⁺) p(x⁺|θ⁺) p(x⁻|θ⁻) p(θ⁺) p(θ⁻) d_{ϕ*}(x, x⁺) dx dx⁺ dx⁻ dθ⁺ dθ⁻
                      = ∫ p(x|θ⁺) p(x⁺|θ⁺) p(θ⁺) d_{ϕ*}(x, x⁺) dx dx⁺ dθ⁺
                      = E_{p_same} d_{ϕ*}(X, X⁺)

and likewise

E_{p_c} d_{ϕ*}(X, X⁻) = ∫ p(x|θ⁺) p(x⁺|θ⁺) p(x⁻|θ⁻) p(θ⁺) p(θ⁻) d_{ϕ*}(x, x⁻) dx dx⁺ dx⁻ dθ⁺ dθ⁻
                      = ∫ p(x|θ⁺) p(x⁻|θ⁻) p(θ⁺) p(θ⁻) d_{ϕ*}(x, x⁻) dx dx⁻ dθ⁺ dθ⁻
                      = E_{p_diff} d_{ϕ*}(X, X⁻)

we arrive at the desired result

E_{p_same} d_{ϕ*}(X, X⁺) < E_{p_diff} d_{ϕ*}(X, X⁻).    (A2)

B Connections to Other Loss Functions


It is possible to show, under suitable regularization (to ensure that s_gen can be treated as a distribution; see Appendix C), that the generative similarity measure defined in Equation (1) can be derived as the minimizer of a functional that is reminiscent of contrastive divergence learning [30]

D_KL[s‖p_same] − D_KL[s‖p_diff] = ∫ [s(x, x′) log p_diff(x, x′) − s(x, x′) log p_same(x, x′)] dx dx′    (B1)

where D_KL[p‖q] = ∫ p(x) log[p(x)/q(x)] dx is the Kullback-Leibler (KL) divergence. While the left-hand side in Equation (B1) may seem rather different from Equation (4), the cancellation in the KL divergences yields a special case of Equation (4) with ℓ(∆) = ∆ upon minimal redefinitions, namely, recasting distance measures as similarities d(x, y) → s_0 − s(x, y) and substituting probabilities with their logarithm p → log p (i.e., applying a monotonic transformation; see Appendix D). Indeed, varying the functional in Equation (4) with respect to s along with a simple quadratic regularizer (see Appendix D) yields s*(x_1, x_2) ∝ p_same(x_1, x_2) − p_diff(x_1, x_2), which is equivalent to the generative similarity measure (1) up to a monotonic transformation of probabilities p → log p.

C Generative Similarity as an Optimal Solution


In what follows we will show that generative similarity (1) can be derived as the minimizer of the following functional

L[s] = D_KL[s‖p_same] − D_KL[s‖p_diff] − β⁻¹ H(s) + λ (∫ s(x, x′) dx dx′ − 1)    (C1)

where D_KL is the Kullback-Leibler divergence, H is entropy, and the linear integral is a Lagrangian constraint that ensures that s is normalized so that the other terms are well defined. Note that while λ > 0 is a Lagrange multiplier, β⁻¹ > 0 is a free parameter of our choice that controls the contribution of the entropy term, and we may set it to one or a small number if desired. In other words, the minimizer s* of L is the maximum-entropy (or entropy-regularized) solution that maximizes the contrast between p_same and p_diff in the D_KL sense (i.e., it seeks to assign high weight to pairs with high p_same but low p_diff, and low values for pairs with low p_same but high p_diff). To derive s*, observe that from the definition of the KL divergence we have

D_KL[s‖p_same] − D_KL[s‖p_diff] = ∫ [s(x, x′) log (s(x, x′)/p_same(x, x′)) − s(x, x′) log (s(x, x′)/p_diff(x, x′))] dx dx′
                                = ∫ [s(x, x′) log p_diff(x, x′) − s(x, x′) log p_same(x, x′)] dx dx′.

Next, varying the functional with respect to s we have

δL/δs = log p_diff(x_1, x_2) − log p_same(x_1, x_2) + β⁻¹ [log s(x_1, x_2) + 1] + λ = 0    (C2)

which yields

s*(x_1, x_2) = (1/Z(β, λ)) (p_same(x_1, x_2)/p_diff(x_1, x_2))^β    (C3)

where we defined Z(β, λ) ≡ e^{βλ+1}. Next, from the Lagrange multiplier equation δ_λ L = 0 we have

Z(β, λ) = ∫ (p_same(x_1, x_2)/p_diff(x_1, x_2))^β dx_1 dx_2    (C4)

which fixes λ as a function of β, assuming that the right-hand integral converges. Two possible sources of divergences are i) p_diff(x_1, x_2) approaches zero while p_same(x_1, x_2) remains finite, and ii) the integral is carried over an unbounded region without the ratio decaying fast enough. The latter issue can be resolved by simply assuming that the space is large but bounded and that the main probability mass of the generative model is far from the boundaries (which is plausible for practical applications). As for the former, observe that when p_diff(x_1, x_2) ≡ ∫ p(x_1|θ_1) p(x_2|θ_2) p(θ_1) p(θ_2) dθ_1 dθ_2 = 0, it implies (from non-negativity) that p(x_1|θ_1) p(x_2|θ_2) = 0 for all θ_{1,2} in the support of p(θ), which in turn implies that p_same(x_1, x_2) ≡ ∫ p(x_1|θ) p(x_2|θ) p(θ) dθ = 0. In other words, if p_diff vanishes then so does p_same (but not vice versa, e.g., if p(x_1|θ) and p(x_2|θ) have non-overlapping support as a function of θ). Likewise, the rate at which these approach zero is also controlled by the same factor p(x_1|θ_1) p(x_2|θ_2) → 0, and so we expect the ratio to be generically well behaved.

Finally, setting β = 1 we arrive at the desired Bayes odds relation

s*(x_1, x_2) ∝ p_same(x_1, x_2)/p_diff(x_1, x_2) = p(same|x_1, x_2)/p(different|x_1, x_2)    (C5)

where the second equality follows from the fact that we assumed that a priori p(same) = p(different). As a sanity check of the convergence assumptions, consider the case of a mixture of two one-dimensional Gaussians with means µ_1 = −µ_2 = µ and uniform prior, and as a test let us set σ = 1 and µ ≫ 1 so that the Gaussians do not overlap and are far from the origin. In this case, we assume that the space is finite, x ∈ [−Λ, Λ], such that Λ ≫ µ ≫ 1, so that the Gaussians are unaffected by the boundary. Then, for points that are far from the Gaussian centers, e.g., at the origin x_1 = x_2 = 0, for which the likelihoods are exponentially small, we have

p(x_1, x_2|same)/p(x_1, x_2|different) = [½ e^{−µ²/2} e^{−µ²/2} + ½ e^{−µ²/2} e^{−µ²/2}] / [(½ e^{−µ²/2} + ½ e^{−µ²/2}) (½ e^{−µ²/2} + ½ e^{−µ²/2})] = 1    (C6)

which is indeed finite.

D The Special Case of ℓ(∆) = ∆


We consider the triplet loss objective under the special case of ℓ(∆) = ∆, where ∆(x, x⁺, x⁻) = d(x, x⁺) − d(x, x⁻). Recasting the distance measures as similarities d(x, y) → s_0 − s(x, y) and unpacking Equation (4), we have

L[s] = E_{p_c} ∆(X, X⁺, X⁻)
     = E_{p_c} s(X, X⁻) − E_{p_c} s(X, X⁺)
     = E_{p_diff} s(X, X⁻) − E_{p_same} s(X, X⁺)
     = ∫ [s(x, x′) p_diff(x, x′) − s(x, x′) p_same(x, x′)] dx dx′

where the third equality follows from an identical derivation to the one found in Appendix A above Equation (A2).

Our goal next is to find the similarity function which minimizes L[s] by varying it with respect to s, i.e., δ_s L = 0. As before, since L is linear in s, we need to add a suitable regularizer to derive a solution (otherwise δ_s L = 0 has no solutions). Here we are no longer committed to a probabilistic interpretation of s, and so a natural choice would be a quadratic regularizer

L_reg[s] = L[s] + λ (∫ s²(x, x′) dx dx′ − Λ)    (D1)

for some constants Λ, λ > 0. Varying the Lagrangian with respect to the similarity measure we have

δL_reg/δs = p_diff(x_1, x_2) − p_same(x_1, x_2) + 2λ s(x_1, x_2) = 0    (D2)

This in turn implies that the optimal similarity measure is given by

s*(x_1, x_2) = (1/2λ) [p_same(x_1, x_2) − p_diff(x_1, x_2)]    (D3)

Likewise, for the Lagrange multiplier we have

δL_reg/δλ = ∫ s²(x, x′) dx dx′ − Λ = 0    (D4)

Plugging in the optimal solution we have

(1/4λ²) ∫ [p_same(x, x′) − p_diff(x, x′)]² dx dx′ − Λ = 0    (D5)

The integral is positive since it is the squared difference between two normalized probability distributions, and so, denoting its value as C_p > 0, we can solve for λ

λ = ½ √(C_p/Λ)    (D6)

Thus, putting everything together we have

s*(x_1, x_2) = √(Λ/C_p) [p_same(x_1, x_2) − p_diff(x_1, x_2)]    (D7)
E Gaussian Mixtures and Linear Projections
To derive Equation (8), we start by plugging the definition of the Gaussian mixture generative process (i.e., uniformly sampling a Gaussian and then sampling points from it) into Equation (4)

E_{p_c} ℓ(X, X⁺, X⁻) = Σ_{i=1,2} Σ_{j=1,2} (1/4) (1/((2π)^d σ^{2d})^{3/2}) ∫ ℓ(ϕ(x), ϕ(x⁺), ϕ(x⁻)) exp(−[(x − µ_i)² + (x⁺ − µ_i)² + (x⁻ − µ_j)²]/2σ²) dx dx⁺ dx⁻

Now, recall that

ℓ(x, x⁺, x⁻) = ϕ_φ(x) · ϕ_φ(x⁻) − ϕ_φ(x) · ϕ_φ(x⁺) = (φ · x)(φ · x⁻) − (φ · x)(φ · x⁺)

Substituting into the loss formula and integrating, we have

E_{p_c} ℓ ∝ Σ_{i,j} ∫ (φ · x)(φ · x⁻) exp(−[(x − µ_i)² + (x⁻ − µ_j)²]/2σ²) dx dx⁻
          − 2 Σ_i ∫ (φ · x)(φ · x⁺) exp(−[(x − µ_i)² + (x⁺ − µ_i)²]/2σ²) dx dx⁺

Note next that since x⁺ and x⁻ are dummy integration variables, we can further rewrite

E_{p_c} ℓ ∝ Σ_{i≠j} ∫ (φ · x)(φ · x⁻) exp(−[(x − µ_i)² + (x⁻ − µ_j)²]/2σ²) dx dx⁻
          − Σ_i ∫ (φ · x)(φ · x⁺) exp(−[(x − µ_i)² + (x⁺ − µ_i)²]/2σ²) dx dx⁺

Thus, using the fact that the distributions are separable, and standard Gaussian moment formulae, we arrive at

E_{p_c} ℓ ∝ Σ_{i≠j} (φ · µ_i)(φ · µ_j) − Σ_i (φ · µ_i)(φ · µ_i)    (E1)

Finally, plugging in µ_1 = −µ_2 = µ and using the fact that ‖φ‖²₂ = φ · φ = 1, we have

E_{p_c} ℓ ∝ −4(φ · µ)² = −4‖µ‖²₂ cos² θ_{φµ}.    (E2)

F Generative Similarity of Geometric Shape Distributions


Our goal is to derive the generative similarity measure associated with the process in Equation (11)

s(σ_1(f^(1)), σ_2(f^(2))) = p_same(σ_1, σ_2)/p_diff(σ_1, σ_2)    (F1)

Plugging the Beta, Bernoulli, and uniform distributions into the numerator of Equation (1), we have

p_same(σ_1, σ_2) = Σ_{f̂^(1,2)_{1···n}} ∫ [δ(f̂^(1) − f^(1))/|S(f̂^(1))|] [δ(f̂^(2) − f^(2))/|S(f̂^(2))|] × Π_i [θ_i^{f_i^(1)+f_i^(2)+α−1} (1 − θ_i)^{f̄_i^(1)+f̄_i^(2)+β−1} / B(α, β)] dθ_{1···n}

where we defined f̄_i = 1 − f_i and used the definition of the Bernoulli distribution, Bern(f_i; θ_i) = θ_i^{f_i} (1 − θ_i)^{f̄_i}, and the Beta distribution, Beta(θ_i; α, β) = θ_i^{α−1} (1 − θ_i)^{β−1}/B(α, β), where B is the Beta function, given by B(z_1, z_2) = ∫_0^1 t^{z_1−1} (1 − t)^{z_2−1} dt, which is well-defined for all positive numbers z_1, z_2 > 0. Note that the delta function δ(f̂ − f) simply enforces the fact that by definition each stimulus is consistent with only one set of feature values (otherwise there would be at least one feature of the stimulus that is both True and False, which is a contradiction). Likewise, |S(f̂)| is the cardinality of the exemplar set associated with the feature vector f̂, which accounts for uniform sampling. Likewise, for the denominator of Equation (1) we have

p_diff(σ_1, σ_2) = [Σ_{f̂^(1)_{1···n}} ∫ (δ(f̂^(1) − f^(1))/|S(f̂^(1))|) Π_i (θ_{(1)i}^{f_i^(1)+α−1} (1 − θ_{(1)i})^{f̄_i^(1)+β−1}/B(α, β)) dθ^(1)_{1···n}]
               × [Σ_{f̂^(2)_{1···n}} ∫ (δ(f̂^(2) − f^(2))/|S(f̂^(2))|) Π_j (θ_{(2)j}^{f_j^(2)+α−1} (1 − θ_{(2)j})^{f̄_j^(2)+β−1}/B(α, β)) dθ^(2)_{1···n}]

The above integrals might seem quite complicated at first, but the conjugacy relation between the Beta and Bernoulli distributions, as well as the delta functions, simplify things drastically. Indeed, the delta functions cancel the summation over features, and the cardinality factors cancel out in the ratio, so that we are left with a collection of Beta function factors (see the definition of the Beta function above)

s(σ_1(f^(1)), σ_2(f^(2))) = Π_i B(f_i^(1) + f_i^(2) + α, f̄_i^(1) + f̄_i^(2) + β) B(α, β) / [Π_i B(f_i^(1) + α, f̄_i^(1) + β) Π_j B(f_j^(2) + α, f̄_j^(2) + β)]    (F2)

Taking the logarithm and rearranging the terms, we have

log s(σ_1(f^(1)), σ_2(f^(2))) = Σ_i log [B(f_i^(1) + f_i^(2) + α, f̄_i^(1) + f̄_i^(2) + β) B(α, β) / (B(f_i^(1) + α, f̄_i^(1) + β) B(f_i^(2) + α, f̄_i^(2) + β))]    (F3)

Now, recall the following Beta function identities¹

B(x + 1, y) = [x/(x + y)] B(x, y);    B(x, y + 1) = [y/(x + y)] B(x, y)    (F4)

Using these identities we can group and simplify the different ratios contributing to the sum depending on the values of the features. If f_i^(1) = f_i^(2) = 1, then we have

log [B(2 + α, 0 + β) B(α, β) / (B(1 + α, 0 + β) B(1 + α, 0 + β))] = log [((α + 1)/(α + β + 1)) ((α + β)/α)]    (F5)

If on the other hand f_i^(1) = 1 and f_i^(2) = 0, or f_i^(1) = 0 and f_i^(2) = 1, then we have

log [B(1 + α, 1 + β) B(α, β) / (B(1 + α, 0 + β) B(0 + α, 1 + β))] = log [(β/(α + β + 1)) ((α + β)/β)]    (F6)

and finally for f_i^(1) = f_i^(2) = 0 we have

log [B(0 + α, 2 + β) B(α, β) / (B(0 + α, 1 + β) B(0 + α, 1 + β))] = log [((β + 1)/(α + β + 1)) ((α + β)/β)]    (F7)

Next, defining Σ_1 and Σ_2 to be the sets of features that hold true for stimuli σ_1 and σ_2, we can write

log s(σ_1(f^(1)), σ_2(f^(2))) = |Σ_1 ∩ Σ_2| log [((α + 1)/(α + β + 1)) ((α + β)/α)]
                             + (|Σ_1 − Σ_2| + |Σ_2 − Σ_1|) log [(β/(α + β + 1)) ((α + β)/β)]
                             + |Σ̄_1 ∩ Σ̄_2| log [((β + 1)/(α + β + 1)) ((α + β)/β)]

where |Σ_1 ∩ Σ_2| is the number of features that hold true for both stimuli, |Σ_i − Σ_j| is the number of features that hold true for σ_i but not for σ_j, and finally |Σ̄_1 ∩ Σ̄_2| is the number of features that hold neither for σ_1 nor for σ_2. Observe next that by definition |Σ̄_1 ∩ Σ̄_2| = n − |Σ_1 ∩ Σ_2| − |Σ_1 − Σ_2| − |Σ_2 − Σ_1|, where n is the overall number of features. From here it follows that

log s(σ_1(f^(1)), σ_2(f^(2))) = |Σ_1 ∩ Σ_2| log [((α + 1)/α) (β/(β + 1))]
                             + (|Σ_1 − Σ_2| + |Σ_2 − Σ_1|) log [β/(β + 1)]
                             + n log [((β + 1)/(α + β + 1)) ((α + β)/β)]

In the limit of α = β → 0 we have

log s(σ_1(f^(1)), σ_2(f^(2))) = n log 2 − log((β + 1)/β) (|Σ_1 − Σ_2| + |Σ_2 − Σ_1|)    (F8)

where the first term is simply a constant. Finally, observe that

|Σ_1 − Σ_2| + |Σ_2 − Σ_1| = Σ_i [f_i^(1)(1 − f_i^(2)) + f_i^(2)(1 − f_i^(1))]
                          = Σ_i [f_i^(1) − 2 f_i^(1) f_i^(2) + f_i^(2)]
                          = Σ_i (f_i^(1) − f_i^(2))²    (F9)

where the third equality follows from the fact that f² = f for binary features. In other words, the generative similarity reduces to a monotonically decreasing function of the Euclidean distance between the geometric features of shapes

log s(σ_1(f^(1)), σ_2(f^(2))) = n log 2 − log((β + 1)/β) Σ_i (f_i^(1) − f_i^(2))²    (F10)

which is the desired result.

¹ These follow from the fact that B(z_1, z_2) = Γ(z_1)Γ(z_2)/Γ(z_1 + z_2), where Γ is the Gamma function, which satisfies Γ(z + 1) = zΓ(z) for any z > 0 [47].
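The closed form (F2) and the limit (F10) can be checked against each other numerically; the following sketch uses SciPy, with arbitrary test feature vectors and an arbitrary small β:

import numpy as np
from scipy.special import betaln

def log_s_exact(f1, f2, alpha, beta):
    """Log of Equation (F2), computed feature-wise in log space."""
    f1, f2 = np.asarray(f1), np.asarray(f2)
    num = betaln(f1 + f2 + alpha, (1 - f1) + (1 - f2) + beta) + betaln(alpha, beta)
    den = betaln(f1 + alpha, (1 - f1) + beta) + betaln(f2 + alpha, (1 - f2) + beta)
    return float(np.sum(num - den))

def log_s_limit(f1, f2, beta, n):
    """Equation (F10), valid in the limit alpha = beta -> 0."""
    sq_dist = np.sum((np.asarray(f1) - np.asarray(f2)) ** 2)
    return n * np.log(2) - np.log((beta + 1) / beta) * sq_dist

f1, f2, b = [1, 0, 1, 1], [1, 1, 0, 1], 1e-4
print(log_s_exact(f1, f2, b, b))    # approx. -15.65
print(log_s_limit(f1, f2, b, n=4))  # approx. -15.65: the two agree as beta -> 0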

G Details on Quadrilateral Experiment


For our main experiments, we use the CorNet model (variant ‘S’) that was used in the original work introducing the Oddball task [23]. CorNet contains four “areas” corresponding to the areas of the
visual stream: V1, V2, V4, and IT. Each area contains convolutional and max pooling layers. There
are also biologically plausible recurrent connections between areas (e.g., V4 to V1). After IT, the
penultimate area in the visual stream, a linear layer is used to readout object categories. The model is
pretrained on ImageNet on a standard supervised object recognition objective. The pretrained CorNet
model’s performance on the Oddball task is reported in Figure 3B.
For finetuning the model on the supervised classification objective, we followed the protocol used
in the supplementary results of Sablé-Meyer et al. [23]. Specifically, 11 new object categories are
added to the model’s last layer, and the model is trained to classify a quadrilateral as one of the 11
categories shown in Figure 3B. For training data, we used quadrilaterals from all 11 categories with
different scales and rotations (though the specific quadrilateral images used in the test trials were held
out). We used a learning rate of 5e-6 using the Adam optimizer with a cross entropy loss. Training
was conducted on an NVIDIA Quadro P6000 GPU with 25GB of memory.
For finetuning the model on the generative similarity contrastive objective, we first calculated the
Euclidean distance of the model’s final layer embedding between different quadrilateral images, then
calculated the Euclidean distance between the quadrilaterals’ respective geometry feature vectors,
and finally used the mean squared error between the embeddings’ distance and the feature vectors’
distance as the loss. The geometric feature vectors were a set of 22 binary features encoding the following properties: 6 features (one per pair of edges) encoding whether the two edge lengths are equal, 6 features (one per pair of angles) encoding whether the two angles are equal, 6 features (one per pair of edges) encoding whether the two edges are parallel, and 4 features (one per angle) encoding whether the angle is a right angle. See Table 1 below for a list of these values.
Table 1: List of geometric regularities for each quadrilateral type (sorted from most regular to least)
shape rightAngles parallels symmetry equalSides equalAngles
square 4 2 4 4 4
rectangle 4 2 2 2 4
losange 0 0 2 4 2
parallelogram 0 2 1 2 2
rightKite 2 0 1 2 2
kite 0 0 1 2 2
isoTrapezoid 0 1 1 1 2
hinge 1 0 0 1 0
rustedHinge 0 0 0 1 0
trapezoid 0 1 0 0 0
random 0 0 0 0 0

Like the supervised model, we used training data from each category of quadrilaterals with different scales and rotations (though specific
images used in the test trials were held out). We used the Adam optimizer with a learning rate of
5e-4. We used the exact same training data, learning rate, and optimizer when running the control
experiments for finetuning CorNet on the SimCLR objective (Figure S2). Training was conducted on
an NVIDIA Quadro P6000 GPU with 25GB of memory.
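For concreteness, one way to construct such a feature vector from a quadrilateral's four vertices is sketched below; the tolerance thresholds and exact predicates are illustrative choices, not the precise definitions used in our experiments:

import numpy as np
from itertools import combinations

def quad_features(vertices, tol=1e-2):
    """22 binary geometric features of a quadrilateral (vertices: 4x2 array, in order)."""
    v = np.asarray(vertices, dtype=float)
    edges = [v[(i + 1) % 4] - v[i] for i in range(4)]
    lengths = [float(np.linalg.norm(e)) for e in edges]
    angles = []  # interior angle at each vertex, in degrees
    for i in range(4):
        a, b = -edges[i - 1], edges[i]
        cos_ang = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(np.degrees(np.arccos(np.clip(cos_ang, -1.0, 1.0))))
    f = []
    for i, j in combinations(range(4), 2):  # 6 features: equal edge lengths
        f.append(abs(lengths[i] - lengths[j]) < tol * max(lengths))
    for i, j in combinations(range(4), 2):  # 6 features: equal angles
        f.append(abs(angles[i] - angles[j]) < 1.0)
    for i, j in combinations(range(4), 2):  # 6 features: parallel edges
        cross = edges[i][0] * edges[j][1] - edges[i][1] * edges[j][0]
        f.append(abs(cross) < tol * lengths[i] * lengths[j])
    for i in range(4):                      # 4 features: right angles
        f.append(abs(angles[i] - 90.0) < 1.0)
    return np.array(f, dtype=int)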

H Details on Drawing Styles Experiment


We used the DreamCoder grammars Sablé-Meyer et al. [24] trained on Greek and Celtic drawings
respectively to obtain training data. Both models used the same DSL (see Figure 4A) but, because they
are trained on different images, they weigh those primitives differently and thus combine primitives
differently when sampling from the grammar. We obtained 20k images from both grammars (40k
images in total) and used 800 additional examples from each grammar for testing. Each image is a 128 × 128 gray-scale image. Images were normalized to have pixel values between 0 and 1 by dividing by 255.
Because of the similarity of the stimuli in Figure 4B to handwritten characters, we used the same CNN
architecture that [37] used on the Omniglot dataset [21]: six convolutional blocks, each consisting of a
64-filter 3 × 3 convolution, a batch normalization layer, a ReLU nonlinearity, and a 2 × 2 max-pooling
layer. This network outputs a 256-dimensional embedding. Our experiment was replicated with
another CNN architecture (CorNet), which yielded similar results (Figure S3). We used two different
training objectives: a standard contrastive learning objective from SimCLR [26] and one based on a
Monte-Carlo estimate of generative similarity. For both objectives, we used the Adam optimizer with a
learning rate of 1e-3 and a batch size of 128.
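A sketch consistent with this description is given below; padding and exact layer ordering are assumptions. With six 2 × 2 poolings, a 128 × 128 input reduces to 64 × 2 × 2 = 256 features after flattening.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # One block: 64-filter 3x3 conv, batch norm, ReLU, 2x2 max pool.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

# Six blocks halve 128x128 inputs down to 2x2 feature maps, so flattening
# the final 64-channel maps yields a 64 * 2 * 2 = 256-dimensional embedding.
encoder = nn.Sequential(
    conv_block(1, 64),
    *[conv_block(64, 64) for _ in range(5)],
    nn.Flatten(),
)

x = torch.randn(8, 1, 128, 128)  # batch of grayscale drawings
assert encoder(x).shape == (8, 256)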
For the SimCLR baseline objective, images in the batch were randomly augmented. The augmentations
were random resized crops, random horizontal flips, and random Gaussian blurs. The original
SimCLR paper also used color-distortion augmentations, which we omitted because our data are
already grayscale. Like SimCLR, we used the InfoNCE loss function
[48]. Let $v_i$ be the embedding of image $i$ and $v_i'$ be the embedding of image $i$'s augmented counterpart.
The loss is $-\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp f(v_i, v_i')}{\frac{1}{N}\sum_{j=1}^{N} \exp f(v_i, v_j')}$, where $f$ is a similarity function between embeddings (SimCLR
used cosine similarity). This effectively pushes representations of images and their augmented
counterparts to be more similar while also pushing representations of images and other images’
augmented counterparts to be more dissimilar. Training was conducted with one NVIDIA Tesla P100
GPU with 16GB of memory.
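As a concrete illustration, here is a minimal PyTorch sketch of this loss in a simplified form: one positive per image rather than the full 2N-view SimCLR formulation, with a temperature parameter that is a common addition not spelled out in the formula above. The function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(v: torch.Tensor, v_aug: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Row i of `v` and row i of `v_aug` embed an image and its augmented
    counterpart; all other rows serve as negatives."""
    v = F.normalize(v, dim=1)          # cosine similarity via normalized dot products
    v_aug = F.normalize(v_aug, dim=1)
    sim = v @ v_aug.t() / temperature  # (N, N); positives lie on the diagonal
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(sim, targets)  # -log softmax of each positive pair

loss = info_nce_loss(torch.randn(128, 256), torch.randn(128, 256))
```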
For the generative similarity objective, let a be the embedding of an anchor image sampled from a
random grammar c ∈ {Celtic, Greek}, let p be the embedding of a positive image, i.e., another image
sampled from c, and let n be the embedding of a negative image sampled at random from either
grammar (and therefore treated as an independent sample). From these embeddings we compute the
positive Euclidean distance d(p, a) and the negative Euclidean distance d(n, a) and minimize the loss
d(p, a) − d(n, a) (see Equation 5); a sketch is given below. Training was conducted with one NVIDIA
Tesla P100 GPU with 16GB of memory.
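A minimal sketch of this objective, assuming batched (N, D) embedding tensors for anchors, positives, and negatives; the function name is hypothetical.

```python
import torch

def gen_sim_loss(anchor: torch.Tensor, positive: torch.Tensor, negative: torch.Tensor) -> torch.Tensor:
    """anchor, positive: embeddings of two images from the same grammar;
    negative: embedding of an image sampled independently from either grammar.
    Minimizing d(p, a) - d(n, a) pulls same-grammar pairs together and pushes
    independent samples apart."""
    d_pos = (positive - anchor).norm(dim=1)  # d(p, a)
    d_neg = (negative - anchor).norm(dim=1)  # d(n, a)
    return (d_pos - d_neg).mean()

loss = gen_sim_loss(torch.randn(128, 256), torch.randn(128, 256), torch.randn(128, 256))
```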
I Reproduction of Geometric Regularity Effect with a Different Architecture

[Bar plot: percent of incorrect trials (0–0.5) for each quadrilateral type, from square to random; model: GenSim ResNet (ρ = 0.957, p < 0.0001).]

Figure S1: To show that the regularity effect generalizes to a different architecture, we finetuned
ResNet-101 [49] on the GenSim objective and found that it recovers the human geometric regularity
effect. Note that in their supplement, Sablé-Meyer et al. [23] also reported that a pretrained ResNet-
101 fails to produce the geometric regularity effect.

J No Geometric Regularity Effect for Standard Contrastive Learning

[Bar plot: percent of incorrect trials (0–0.06) for each quadrilateral type, from square to random; model: SimCLR CorNet (ρ = 0.424, p = 0.194).]

Figure S2: We finetune CorNet using the standard contrastive objective from SimCLR [26]. Specifically,
simple augmentations (cropping and resizing, rotations, etc.) were applied to individual quadrilateral
images, and the CNN was trained to push the representation of each image toward that of its augmented
counterpart (i.e., to be less distant) and to pull the representations of different quadrilateral images
apart (i.e., to be more distant). Finetuning on this objective does not produce the human geometric
regularity effect. Note that the performance of this network is much higher than that of the other
models, humans, and baboons. To understand why, recall that the choice images in the Oddball task
are rescaled and rotated versions of the reference image. Because the SimCLR objective applies similar
image augmentations when minimizing the distance between an image's embedding and that of its
augmented counterpart, this overlap between the training paradigm and the Oddball task allows the
network to excel at the task at a superhuman level. Crucially, however, the network is not human-like:
it lacks the regularity effect and therefore does not exhibit the human inductive bias we strive to instill
in this work.
K Reproducing Drawing Style Experiments with a Different Architecture

[Two-panel bar plot comparing the Triplet GenSim model and the Baseline (SimCLR) model. Left: classification accuracy for Greek or Celtic classification (y-axis: Accuracy). Right: decoding accuracy for the number of program primitives (y-axis: R² score).]

Figure S3: To show the generality of the results in Figure 4, we reproduced them with a different
CNN architecture (CorNet) than the one used in Figure 4. With this architecture, training on the
GenSim contrastive objective still yields embeddings that support better decoding of Greek versus
Celtic drawing style (left) and better prediction of the number of motor and control primitives used
(right). Error bars are 95% confidence intervals over different model training runs.

References
[1] Joshua B Tenenbaum, Charles Kemp, Thomas L Griffiths, and Noah D Goodman. How to grow a mind:
Statistics, structure, and abstraction. Science, 331(6022):1279–1285, 2011.
[2] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through
probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[3] Samuel J Gershman. On the blessing of abstraction. Quarterly Journal of Experimental Psychology, 70(3):
361–365, 2017.
[4] Sreejan Kumar, Carlos G Correa, Ishita Dasgupta, Raja Marjieh, Michael Y Hu, Robert Hawkins,
Jonathan D Cohen, Karthik Narasimhan, Thomas L Griffiths, et al. Using natural language and program
abstractions to instill human inductive biases in machines. Advances in Neural Information Processing
Systems, 35:167–180, 2022.
[5] R Thomas McCoy, Erin Grant, Paul Smolensky, Thomas L Griffiths, and Tal Linzen. Universal linguistic
inductive biases via meta-learning. arXiv preprint arXiv:2006.16324, 2020.
[6] Joshua C Peterson, Ruairidh M Battleday, Thomas L Griffiths, and Olga Russakovsky. Human uncertainty
makes classification more robust. In Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 9617–9626, 2019.
[7] Martin N Hebart, Charles Y Zheng, Francisco Pereira, and Chris I Baker. Revealing the multidimensional mental representations of natural objects underlying human similarity judgements. Nature Human
Behaviour, 4(11):1173–1185, 2020.
[8] Ilia Sucholutsky, Ruairidh M Battleday, Katherine M Collins, Raja Marjieh, Joshua Peterson, Pulkit Singh,
Umang Bhatt, Nori Jacoby, Adrian Weller, and Thomas L Griffiths. On the informativeness of supervision
signals. In Uncertainty in Artificial Intelligence, pages 2036–2046. PMLR, 2023.
[9] Ilia Sucholutsky and Thomas L Griffiths. Alignment with human representations supports robust few-shot
learning. Advances in Neural Information Processing Systems, 36, 2024.
[10] R Thomas McCoy and Thomas L Griffiths. Modeling rapid language learning by distilling Bayesian priors into artificial neural networks. arXiv preprint arXiv:2305.14701, 2023.
[11] Lukas Muttenthaler, Lorenz Linhardt, Jonas Dippel, Robert A Vandermeulen, Katherine Hermann, Andrew
Lampinen, and Simon Kornblith. Improving neural network representations using human similarity
judgments. Advances in Neural Information Processing Systems, 36, 2024.
[12] Jake C Snell, Gianluca Bencomo, and Thomas L Griffiths. A metalearned neural circuit for nonparametric
Bayesian inference. arXiv preprint arXiv:2311.14601, 2023.
[13] Aditi Jha, Joshua C Peterson, and Thomas L Griffiths. Extracting low-dimensional psychological representations from convolutional neural networks. Cognitive Science, 47(1):e13226, 2023.
[14] Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C
Love, Erin Grant, Jascha Achterberg, Joshua B Tenenbaum, et al. Getting aligned on representational
alignment. arXiv preprint arXiv:2310.13018, 2023.
[15] Thomas L Griffiths, Nick Chater, Charles Kemp, Amy Perfors, and Joshua B Tenenbaum. Probabilistic
models of cognition: Exploring representations and inductive biases. Trends in Cognitive Sciences, 14(8):
357–364, 2010.
[16] Marcel Binz, Ishita Dasgupta, Akshay K Jagadish, Matthew Botvinick, Jane X Wang, and Eric Schulz.
Meta-learned models of cognition. Behavioral and Brain Sciences, pages 1–38, 2023.
[17] Ilia Sucholutsky and Matthias Schonlau. Soft-label dataset distillation and text dataset distillation. In 2021
International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2021.
[18] Katherine Maeve Collins, Matthew Barker, Mateo Espinosa Zarlenga, Naveen Raman, Umang Bhatt,
Mateja Jamnik, Ilia Sucholutsky, Adrian Weller, and Krishnamurthy Dvijotham. Human uncertainty in
concept-based ai systems. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society,
pages 869–889, 2023.
[19] Raja Marjieh, Pol Van Rijn, Ilia Sucholutsky, Theodore Sumers, Harin Lee, Thomas L Griffiths, and Nori
Jacoby. Words are all you need? Language as an approximation for human similarity judgments. In The
Eleventh International Conference on Learning Representations, 2023.
[20] Philippe Esling, Axel Chemla-Romeu-Santos, and Adrien Bitton. Generative timbre spaces with variational
audio synthesis. In Proceedings of the International Conference on Digital Audio Effects (DAFx), pages
175–181, 2018.
[21] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. The Omniglot challenge: a 3-year
progress report. Current Opinion in Behavioral Sciences, 29:97–104, 2019.
[22] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines
that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017.
[23] Mathias Sablé-Meyer, Joël Fagot, Serge Caparos, Timo van Kerkoerle, Marie Amalric, and Stanislas
Dehaene. Sensitivity to geometric shape regularity in humans and baboons: A putative signature of human
singularity. Proceedings of the National Academy of Sciences, 118(16):e2023123118, 2021.
[24] Mathias Sablé-Meyer, Kevin Ellis, Josh Tenenbaum, and Stanislas Dehaene. A language of thought for the
mental representation of geometric shapes. Cognitive Psychology, 139:101527, 2022.
[25] Jake Quilty-Dunn, Nicolas Porot, and Eric Mandelbaum. The best game in town: The reemergence of the
language-of-thought hypothesis across the cognitive sciences. Behavioral and Brain Sciences, 46:e261,
2023.
[26] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for
contrastive learning of visual representations. In International Conference on Machine Learning, pages
1597–1607. PMLR, 2020.
[27] Charles Kemp, Aaron Bernstein, and Joshua B Tenenbaum. A generative theory of similarity. In
Proceedings of the 27th Annual Conference of the Cognitive Science Society, pages 1132–1137. Citeseer,
2005.
[28] Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sablé-Meyer, Lucas Morales, Luke Hewitt, Luc
Cary, Armando Solar-Lezama, and Joshua B Tenenbaum. DreamCoder: Bootstrapping inductive program
synthesis with wake-sleep library learning. In Proceedings of the 42nd ACM SIGPLAN International
Conference on Programming Language Design and Implementation, pages 835–850, 2021.
[29] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. Advances in Neural
Information Processing Systems, 29, 2016.
[30] Miguel A Carreira-Perpinan and Geoffrey Hinton. On contrastive divergence learning. In International
Workshop on Artificial Intelligence and Statistics, pages 33–40. PMLR, 2005.
[31] Yves Rosseel. Mixture models of categorization. Journal of Mathematical Psychology, 46(2):178–210,
2002.
[32] Adam N Sanborn, Thomas L Griffiths, and Daniel J Navarro. Rational approximations to rational models:
alternative algorithms for category learning. Psychological Review, 117(4):1144, 2010.
[33] Jonas Kubilius, Martin Schrimpf, Kohitij Kar, Rishi Rajalingham, Ha Hong, Najib Majaj, Elias Issa, Pouya
Bashivan, Jonathan Prescott-Roy, Kailyn Schmidt, et al. Brain-like object recognition with high-performing
shallow recurrent ANNs. Advances in Neural Information Processing Systems, 32, 2019.
[34] Christopher S Henshilwood, Francesco d’Errico, Karen L Van Niekerk, Yvan Coquinot, Zenobia Jacobs,
Stein-Erik Lauritzen, Michel Menu, and Renata García-Moreno. A 100,000-year-old ochre-processing
workshop at Blombos Cave, South Africa. Science, 334(6053):219–222, 2011.
[35] Aya Saito, Misato Hayashi, Hideko Takeshita, and Tetsuro Matsuzawa. The origin of representational
drawing: a comparison of human children and chimpanzees. Child Development, 85(6):2232–2246, 2014.
[36] Martin Schrimpf, Jonas Kubilius, Ha Hong, Najib J Majaj, Rishi Rajalingham, Elias B Issa, Kohitij Kar,
Pouya Bashivan, Jonathan Prescott-Roy, Franziska Geiger, et al. Brain-score: Which artificial neural
network for object recognition is most brain-like? BioRxiv, page 407007, 2018.
[37] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in
Neural Information Processing Systems, 30, 2017.
[38] Declan Campbell and Jonathan D Cohen. A relational inductive bias for dimensional abstraction in neural
networks. arXiv preprint arXiv:2402.18426, 2024.
[39] Nathan Destler, Manish Singh, and Jacob Feldman. Skeleton-based shape similarity. Psychological Review,
2023.
[40] Yuxuan Wan, Wenxuan Wang, Yiliu Yang, Youliang Yuan, Jen-tse Huang, Pinjia He, Wenxiang Jiao, and
Michael R Lyu. A & B == B & A: Triggering logical reasoning failures in large language models. arXiv
preprint arXiv:2401.00757, 2024.
[41] Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. Causal reasoning and large language models:
Opening a new frontier for causality. arXiv preprint arXiv:2305.00050, 2023.
[42] Steven T Piantadosi, Joshua B Tenenbaum, and Noah D Goodman. The logical primitives of thought:
Empirical foundations for compositional cognitive models. Psychological Review, 123(4):392, 2016.
[43] Noah D Goodman, Tomer D Ullman, and Joshua B Tenenbaum. Learning a theory of causality. Psychological Review, 118(1):110, 2011.
[44] Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821, 2021.
[45] Tongxu Luo, Jiahe Lei, Fangyu Lei, Weihao Liu, Shizhu He, Jun Zhao, and Kang Liu. MoELoRA: Contrastive
learning guided mixture of experts on parameter-efficient fine-tuning for large language models. arXiv
preprint arXiv:2402.12851, 2024.
[46] Susan T Fiske. Stereotype content: Warmth and competence endure. Current Directions in Psychological
Science, 27(2):67–73, 2018.
[47] Emil Artin. The gamma function. Courier Dover Publications, 2015.
[48] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive
coding. arXiv preprint arXiv:1807.03748, 2018.
[49] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.