
(Yet) Another Theoretical Model of Thinking

Patrick Virie
[email protected]
arXiv:1511.02455v1 [cs.AI] 8 Nov 2015

Abstract
This paper presents a theoretical, idealized model of the thinking process with the
following characteristics: 1) the model can produce complex thought sequences
and can be generalized to new inputs, 2) it can receive and maintain input information
indefinitely for the generation of thoughts and later use, and 3) it supports
learning while executing. The crux of the model lies in the concept of internal
consistency: the generated thoughts should always be consistent with the
inputs from which they are created. Its merit, apart from the capability to generate
creative new thoughts from an internal mechanism, lies in its potential to help
training generalize better. This is enabled by separating input information
into several parts to be handled by different processing components, with
a focus mechanism to fetch information for each. This modularized view with focus
relates the model to computationally capable Turing machines. As
a final remark, this paper constructively shows that the computational capability
of the model is at least, if not surpasses, that of a universal Turing machine.

1 Introduction
This paper presents a theoretical, idealized model of thinking. Thinking, as a mental process, is one
of the most sophisticated products of intelligence. It allows us to perform procedural simulations in
order to predict future outcomes of given present states. Having a model of thinking would enable
the explanation of our mind and would also facilitate the building of replicas of it.
Due to its deep inherent association with our minds, thinking is one of the fundamental concepts in
philosophy [1]. Perhaps most of the modern attempts to model the process of thinking are arguably
inspired by Alan Turing's work [2], which addresses many philosophical and technical questions
about machines that can think. From that point onward, the term "thinking" has been associated
with various meanings even within the context of computation. Our interpretation of the word
thinking is limited to the continuous process of generating data from selected inputs.
As the reader goes through the content, a question might come up in mind: "this does not seem to
be how the brain works." The purpose of this paper, however, is not to postulate the actual process
of thinking that happens in one's brain; we are only interested in a theoretical abstraction that captures the
ideal essence of thinking, such as how new ideas are composed from experience and how thinking
sequences generalize. From this motive, the model that we develop in this paper only has to maintain
the following characteristics: a) it must permit complex transformations within a sequence and
generalize to unseen inputs, b) it can receive and maintain information indefinitely and selectively
use it to generate future sequences, and c) it should support learning while executing. From which
necessities are these characteristics derived? And how do we address them?
If we manage to represent the information of any moment of thought as a structure within a math-
ematically defined space, the thinking process can be defined as a sequence of transformations between
structures within that space. To be able to capture real-world sequences, the transformations
should be sufficiently expressive. This problem can now be addressed by the re-popularized
concept of deep learning [3-5]. Deep learning allows stacking of simple transformations, which are
individually easy to analyze, to express a complex non-linear transformation.

Figure 1: Alternative representations of “5”: 2 + 3, 0 + 5, 1 + 4.

Even though thoughts are fluid and alternate from moment to moment, everything happens
in the mind. Any information that we generate or receive at some point in time shall be used at some
other time to generate new data. Therefore, to address the second characteristic, the model must be
able to memorize information. The success of recurrent models for time-related tasks shows that we
might use them as a prototype for our model [6, 7]. This will be discussed in Section 3. Furthermore,
to show that the complexity of the model could rival that of our mind, we could try to relate the
model to the behavior of a universal Turing machine [8]. Section 4 will contribute to this aim.
Finally, we aim to develop a model that allows the learning process to happen homogeneously along
the execution path and simultaneously at execution time. This requirement is not crucial to the
process of thinking, but it is useful for any actual system that implements our model to have some
kind of online-learning capability. We will show that learning while executing is possible within our
model in Section 3.
Before going to the model, we will first discuss a constraint that constitutes the essence of our
model, called internal consistency. We will relate this constraint to the notion of generalization
within the scope of thinking. What can we guarantee given that the world can change? How would we
define generalization in the context of thought? And how can we implement the constraint in a real
machine? This is where we start.

2 Internal consistency
A generative system that tries to model the world usually consists of two sides of data repre-
sentation: one that represents the visible states of the world, and the other that represents the hidden
states [9]. The hidden states are alternative representations of the visible states and, as the name
implies, each represents another way to represent a datum. For example, we say 2 + 3 is an
alternative representation of 5, and so are 1 + 4 and any other summation of two numbers that equals 5.
Why do we have to be bothered with alternative representations? Alternative representations usually
have some neat inter-basis characteristics. For example, the bases that contain information in the
hidden states can be made less correlated, less redundant, or sometimes completely independent.
Independent bases are desirable for generative models since we can produce the distribution of
the world efficiently by the product of distributions of individual bases [10]: $P(X_0, X_1, \ldots) = \prod_i P(X_i)$,
where the $X_i$ are random variables. It is up to the application to define the best set of bases for
the hidden states. This in fact conforms to what deep learning tries to achieve.
To further illustrate why alternative representations are necessary as the foundation of our thinking
model: what if, whenever we try to imagine the Mona Lisa painting, what comes out of our mind instead turns
out to be The Scream? Is it not frustrating? Though these two historical paintings are
of artistic merit, recalling the wrong one while we originally intend the other can hardly be
recognized as a trait of intelligence without proper reasons. In control theory, engineers attempt to
design systems with closed-loop sensory feedback to rectify undesirable outputs. Such systems can
immediately identify when their previous outputs deviate from the expectation in order to compute
control with a proper compensation scheme. But would it not be better if we could build systems that
inherently prevent all of these action-intention inconsistencies? We could say that such systems are
perfectly adaptable without the need for compensatory feedback. This is in fact the empirical trait
that any intelligent system should have, something that we humans have, at least to some degree.
Let us suppose for now that inputs of a generative system that tries to generate the visible states of
the world are never-been-seen-before hidden states. How can we guarantee that the system would
produce the correct visible states as the outputs? The generalization of alternative representations

Figure 2: The non-sharing property. The group on the left has the non-sharing property. The group
on the right does not. The cross-mark indicates which mapping is not valid.

would be achieved by making sure that the generated visible representations always conform
to the hidden states that cause them. For example, suppose that we have a system that tries to
produce the addition result of any two numbers, say A and B, as the visible state; we say the system
generalizes when its generative function is A + B. We call the phenomenon where the hidden
states of a system are always alternative representations of its visible states internal consistency.
This is the best we can do for generalization.
What does this heuristic mean computationally?

2.1 The non-sharing property and preservation of variance

Let $v$ represent a visible state from the set $V$, and $h$ represent a hidden state from the set $H$. A
system is said to be internally consistent when the mapping from a visible state to any hidden state
and the reconstruction from one of the reachable hidden states always results in the visible state itself:
$$\sum_h P(v|h)P(h|v) = 1 \quad \forall v \in V \qquad (2.1)$$
In other words, the mapping preserves variance in $V$. Please note that this paper uses probabilistic short forms;
namely, $P(h|v)$ is a short form of $P(\hat{h} = h|\hat{v} = v)$ where $\hat{h}, \hat{v}$ are random variables.
To better understand the connection between internal consistency and preservation of variance, we
need to consider the fact that an alternative representation of any datum may not be unique. A visible
state can have many alternative hidden representations, but those states must correspond
to none other than the visible state itself.
Definition 2.1. Let the set of alternative hidden states of a visible state be the set in which every element
can be generatively mapped into the visible state. The non-sharing property is satisfied if and only
if the forward mapping from the visible state only results in an element of this set.
The non-sharing property implies that, when we transform a hidden state into its corresponding
visible state, there is no other visible state that better matches the hidden state. Let us think
about the addition example again. Suppose that we want to produce the visible state of 2 + 3, which
is 5; how do we guarantee that the produced 5 is correct? It is by converting 5 back to a hidden state,
which may result in 1 + 4. Then we can verify that 1 + 4 and 2 + 3 belong to the same set.
Lemma 2.1. Preservation of variance implies the non-sharing property.

Proof. $\sum_h P(h|v)P(v|h)$ is a convex combination of the $P(v|h)$ over all $h$. If there exists an $h$ with non-zero $P(h|v)$ and $P(v|h) < 1$, then $\sum_h P(h|v)P(v|h) < 1$. Therefore, for any $h$ with non-zero $P(h|v)$, $P(v|h) = 1$.

Lemma 2.2. The non-sharing property is equivalent to variance preservation.

Proof. For any $h$ with non-zero $P(h|v)$, $\sum_h P(v|h)P(h|v) = 1$ when $P(v|h) = 1$ for all such $h$. Together with Lemma 2.1, this completes the proof.
From Lemma 2.2, for a system to be internally consistent, it must at least preserve the variance in
the visible states. Internal consistency may not be perfectly achieved in practice. We can, however,
approach it by gradually training the system to maximize the reconstruction chance:
$$\max \sum_h P(v|h)P(h|v) \quad \forall v \qquad (2.2)$$
Learning hidden representations while maximizing the reconstruction chance aligns with the goal of
autoencoder training [4, 11]. Preservation of variance is therefore another justification for learning
representations with autoencoders.

One nice consequence of the non-sharing property is that it can be stacked to create a deep, expressive
architecture that permits multi-layer hidden data transformation. This way, it is possible to find a
good complex representation for any data domain by having each layer gradually remove correla-
tion in the data and promote, little by little, independence of the data in the adjacent layer. Training
to build a deep internally consistent system is as simple as training to preserve the variance between
layers.
Lemma 2.3. Stacking of variance preservation satisfies the non-sharing property.

Proof. Let the prime notation on one-dimensional vectors represent their variants. Given a variance-preservation layer, $\sum_h P(v|h)P(h|v) = 1$, stacking another layer on top of it preserves the non-sharing property: $\sum_h P(v|h)\left(\sum_{h'} P(h|h')P(h'|h)\right)P(h|v) = \sum_h P(v|h)\cdot 1 \cdot P(h|v) = 1$. We can apply this step multiple times to build a multi-layer architecture. This completes the induction.

Preservation of variance suggests a way to generate innovative yet relevant visible states from new
hidden states, and therefore allows us to guarantee internal consistency, i.e., alternative representa-
tion for unseen inputs. To show this in a system, we require knowledge of the details of the system's
implementation. We will see in the next section a way to implement an internally consistent system.

2.2 Implementation in a linear system

We discuss a linear neural network as an internally consistent system. Each layer of the network can
be mathematically expressed as a matrix multiplication: $h = Wv$, where $W$ represents a forward
linear mapping weight matrix, $v$ is a visible state vector, and $h$ is a hidden state vector. To achieve
variance preservation, the generative mapping from a hidden state to a visible state, $W'$, must fulfill
this equation: $v = W'Wv$. For analytical purposes, it is even simpler to consider filling the entire
linear span of the visible state set, i.e., to make $W'W = I$. In this regard, the generative matrix
$W'$ must be the left inverse of the weight matrix $W$, preserving the variance of the visible states. The
following subsection shows that we can guarantee the non-sharing property for unseen hidden states
when the inverse exists.
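A small numerical check of this condition is sketched below (illustrative dimensions; the Moore-Penrose pseudoinverse is used as one convenient left inverse, an assumption rather than the paper's prescription):

```python
# A small sketch: with W'W = I, the mapping v -> h -> v preserves the visible
# state, and even an unseen hidden state reaches an equilibrium in one step
# (cf. the following subsection). Dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n_v, n_h = 4, 6                                # hidden dimension >= visible dimension
W = rng.normal(size=(n_h, n_v))                # forward mapping v -> h (full column rank)
W_gen = np.linalg.pinv(W)                      # generative mapping h -> v, W'W = I

v = rng.normal(size=n_v)
assert np.allclose(W_gen @ (W @ v), v)         # variance in V is preserved

h_unseen = rng.normal(size=n_h)                # a never-seen hidden state
v_gen = W_gen @ h_unseen
assert np.allclose(W_gen @ (W @ v_gen), v_gen) # equilibrium reached in one step
print("left inverse gives internal consistency for the sampled states")
```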

2.2.1 A linear neural layer that satisfies internal consistency


Let us define the concept of equilibrium in a linear neural layer.
Definition 2.2. An equilibrium is a setting where a visible state and one and only one of its hidden states
correspond to each other.

In other words, for any $h \in H_v$, the set of alternative hidden representations of $v$, every $P(v|h) = 1$, and only one element $\hat{h} \in H_v$ receives all the probabilistic mass, $P(\hat{h}|v) = 1$. As a short notation, $\hat{h} \iff v$. This is partly due to the determinism of linear systems, which allows no more than one $h$ to correspond to $Wv$ and only one $v$ to reciprocally correspond to $W'h$.
Lemma 2.4. A linear system that allows the hidden-visible transformation to reach the equilibrium in
one step from any hidden state has the non-sharing property.

Proof. To prove this statement, we show that, when the non-sharing property does not hold, there
always exists a path longer than one step. Let $v' \to h \to v$ be a path. Such a path violates the
non-sharing property when $v' \neq v$. Suppose there exists an $h'$ such that $h' \to v'$. $h'$ can never be
$h$, because $h \to v$ and $h \to v'$ cannot both be true at the same time; this would contradict the determinism of
linear systems. From here we can conclude that it takes more than one step from $h'$ to reach
the closest equilibrium, $h' \to v' \to h \iff v$.
Proposition 2.5. A linear system whose forward mapping has a left inverse satisfies the non-sharing
property for unseen hidden states.

Proof. Consider a path $h' \to v' \to h'' \to v'' \to \ldots$ in its linear form: $h' \to W'h' \to WW'h' \to W'WW'h' \to \ldots$ Since $W'W = I$, the path can be truncated: $h' \to W'h' \to WW'h' \to IW'h' = W'h' \iff WW'h'$. From any $h'$, the system reaches the equilibrium $W'h' \iff WW'h'$ in one step. Thus, it satisfies the non-sharing property according to Lemma 2.4.
Corollary 2.6. Stacking of linear internally consistent systems satisfies the non-sharing property
for unseen hidden states.

Figure 3: An internally consistent model with a hidden residue state representation.

Proof. This is true by Lemma 2.3. It can also be seen that any upward pass in the stack always
maintains the equilibrium. Since the stack only takes a single downward pass from a hidden state
to generate its visible state, it satisfies the non-sharing property according to Lemma 2.4.

2.2.2 Missing variance


The discussion of alternative representations leads us to a class of problems where, in some applica-
tions, we might want to construct visible states purely from given hidden states, i.e., constructing $v$
from $h$. To name a few: given an artistic style and an abstract shape, how to create a detailed painting
containing the shape with the style [12, 13], or the reconstruction of visual images from brain-reading
signals [14]; these problems belong to this category. The difficulty of these problems, apart from
finding the generative mapping, lies mostly in the fact that it is almost impossible to perfectly pro-
vide all the variance for the generative construction. The construction requires that all the variance
in the hidden state be filled; otherwise the resulting visible state may lack sufficient
detail. This explains why it is sometimes hard for us to imagine the precise details of some
concepts. Before discussing how we could fill the missing variance, we need a way to represent it
first.
We introduce a hidden residue as the part of a hidden state that is not given as input. Let $W$ repre-
sent an assumed-given visible-to-hidden mapping weight, and $U$ a weight for the visible-to-
residue mapping. To satisfy internal consistency, we must have the left inverse $W'|U'$ of $W|U$ such
that
$$(W'|U')(W|U)v = (W'W + U'U)v = v \qquad (2.3)$$
$A|B$ denotes the concatenation of the matrices $A$ and $B$ along their shared dimension.
We can construct $v$ from $h$ via this relationship:
$$v = W'h + U'r \qquad (2.4)$$
where $r$ is the hidden residue state that has to be inferred. The process of inference must be done automat-
ically by the system. In the next section, we explore strategies to train a system with such
capability following the internal consistency constraint.
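The sketch below illustrates Equations 2.3 and 2.4 under a simplifying assumption that $W$ and $U$ are complementary orthonormal row blocks, so the concatenated mapping has an exact inverse; the dimensions are arbitrary.

```python
# A sketch under simplifying assumptions (not the paper's implementation):
# split the visible space between a given mapping W and a residue mapping U so
# that W'W + U'U = I (Equation 2.3), then rebuild v = W'h + U'r (Equation 2.4).
import numpy as np

rng = np.random.default_rng(2)
n_v, n_h = 6, 3
Q, _ = np.linalg.qr(rng.normal(size=(n_v, n_v)))   # random orthonormal basis of V
W, U = Q[:, :n_h].T, Q[:, n_h:].T                   # forward mappings v -> h and v -> r
W_gen, U_gen = W.T, U.T                             # [W_gen | U_gen] inverts the stacked [W; U]

v = rng.normal(size=n_v)
h, r = W @ v, U @ v                                 # given part and residue part
assert np.allclose(W_gen @ W + U_gen @ U, np.eye(n_v))   # Equation 2.3
assert np.allclose(W_gen @ h + U_gen @ r, v)             # Equation 2.4
print("v is recovered exactly once the residue r is supplied")
```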

2.3 Training internal consistency in a linear system

The goal of the training is to search for a set of bases that provides good alternative representations
and preserves the variance in the data. This is unsupervised learning. Given a set of data $v \in V$ as
the visible states, we wish to find the forward mapping weight $W$ and its generative weight $W'$,
perhaps along with the residue mapping $U$ and its inverse $U'$, subject to Equation 2.3.
Although the exact solution for the weights can be found via a decomposition process,
in applications with large amounts of data it is more favorable to use an iterative algorithm.
Reconstruction ICA (RICA) is a good candidate for reconstruction training [15]. Consider one form of its
objective (without the residue weight), $\min_W \sum_{v \in \text{training data}} \|W^\top W v - v\|_2^2 + \lambda \|Wv\|_1$; Le et
al. show that

Lemma 2.7 (Le et al., 2011). The reconstruction term in RICA's objective is equivalent to
$\|(W^\top W - I)E\Lambda^{\frac{1}{2}}\|_2^2$, the orthonormality cost with weights in the space rotated by the eigenvectors
and scaled by the eigenvalues.

$\Lambda$ is a diagonal eigenvalue matrix, and $E$ is a matrix whose columns are the eigenvectors of the covari-
ance matrix of the training data. $\lambda$ in RICA's objective is a sparsity coefficient that controls how
much of the learning effort contributes to finding independent bases. When the eigenvalues of the training
data are real and positive, we can see that the solutions to RICA are ones where $W^\top W$ approaches
the identity, which in turn makes a system that implements RICA satisfy internal consistency.
Although RICA is originally derived with the generative weight as the transpose of the forward
weight and without residue weights, we can extend it to support a non-transpose system by
updating the gradients for $W$ and $W'$ separately, and by expanding the weight into the form $W|U$,
which includes the residue weight.
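A rough sketch of such iterative training in the transpose form is given below (it follows the RICA objective above only loosely; the data, dimensions, and step size are illustrative assumptions):

```python
# A rough sketch of RICA-style iterative training: minimize the reconstruction
# cost ||W^T W v - v||^2 plus an L1 sparsity penalty lambda * ||W v||_1 by
# batch gradient descent, using the transpose as the generative weight.
import numpy as np

rng = np.random.default_rng(3)
V = rng.normal(size=(2000, 10))               # training visible states (assumed whitened)
n_h, n_v, lam, lr = 16, 10, 0.1, 3e-3
W = rng.normal(scale=0.1, size=(n_h, n_v))

for _ in range(500):
    H = V @ W.T                               # hidden activations W v
    E = H @ W - V                             # reconstruction error W^T W v - v
    grad = 2 * W @ (V.T @ E + E.T @ V) / len(V)   # gradient of the reconstruction term
    grad += lam * np.sign(H).T @ V / len(V)       # subgradient of the sparsity term
    W -= lr * grad

print("mean ||W^T W v - v||^2 :", np.mean(np.sum((V @ W.T @ W - V) ** 2, axis=1)))
```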

2.3.1 Linear transpose bases


In deep learning, it has been a common approach to use the transpose of the forward weight as the
generative weight. The transpose acts as a good learning regulator, as originally presented in Oja's rule [16].
For a linear neural layer with transpose, the weights that suffice for internal consistency
follow
$$v = (W^\top W + U^\top U)\,v$$
which also implies that
$$U^\top U v = (I - W^\top W)v$$
$$W^\top W v = (I - U^\top U)v$$
When $v$ can be any vector from the entire linear span of the visible state set, we can see that
$$U^\top U = I - W^\top W$$
$$W^\top W = I - U^\top U$$
Unless we allow complex values in the neural network, the last two equations suggest that only
weights that make $I - W^\top W$ and $I - U^\top U$ have real positive eigenvalues can be decomposed into
$U^\top U$ and $W^\top W$ respectively. If the weights are valid and the square roots of their eigenvalues exist,
knowing one weight allows us to immediately extract the other via an eigendecomposition
of the form:
$$U^\top U = I - W^\top W = E\Lambda E^\top$$
$$U = \Lambda^{\frac{1}{2}} E^\top$$
$\Lambda$ is a diagonal eigenvalue matrix, and $E$ is a matrix whose columns are the eigenvectors of $I - W^\top W$,
and vice versa for $W$.
Since each eigenvalue represents the variance along a principal axis of the visible states,
each of the weights, $W$ and $U$, cannot extract more information than what is present in the
visible states.
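For concreteness, the following sketch (with assumed dimensions and an arbitrary scaling of $W$ so that $I - W^\top W$ stays positive semidefinite) extracts a residue weight $U$ from a given $W$ by the eigendecomposition above:

```python
# A sketch: given a forward weight W whose transpose serves as the generative
# weight, recover a residue weight U from the eigendecomposition of I - W^T W,
# so that W^T W + U^T U = I as required in Section 2.3.1.
import numpy as np

rng = np.random.default_rng(4)
n_v, n_h = 6, 3
A = rng.normal(size=(n_h, n_v))
W = A / np.linalg.norm(A, ord=2) / 1.01        # scale so I - W^T W is positive semidefinite

M = np.eye(n_v) - W.T @ W
eigval, E = np.linalg.eigh(M)                  # real eigenvalues since M is symmetric
U = np.diag(np.sqrt(np.clip(eigval, 0.0, None))) @ E.T   # U = Lambda^{1/2} E^T

assert np.allclose(W.T @ W + U.T @ U, np.eye(n_v))
print("each eigenvalue of I - W^T W is the residual variance along one principal axis")
```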

2.3.2 Addressing missing variance


When some parts of the hidden states are not given, we can fill them with priors from training. This
suggests a system with some form of internal memory that can remember the information provided
in the hidden states during training. The memory allows the system to later infer the residue
states conditioned on the available part of the given hidden states.
We choose to fill the role of the memory with a belief network stacked on top of the hid-
den units, due to its simplicity [3] among other generative techniques. A belief network is a
generative model that can be trained to generate the training data's distribution $P(v)$ where $v \in$
the training set. It is a stack of restricted Boltzmann machines trained with the contrastive
divergence algorithm, a variant of gradient ascent with the following update rule:
$$\nabla W \propto \sum_v P(v) \sum_h P(h|v)\,h v^\top - \sum_{h',v'} P(h',v')\,h'v'^\top$$
At convergence, the values of the hidden units $h$ conditioned on the visible units $v$ should match the
stationary distribution, due to the bipartite nature of the machines. Inference in a belief network creates
a Markov chain whose stationary distribution conforms to $P(v)$.
In our case, we use the top-layer belief network to model the distribution of the hidden and the residue
states, i.e., the $v$ of the belief network is simply the $h|r$ of our system. Given $h$, running inference in
the form of repetitive sampling should provide us the missing variance conditioned on it. Then the
hidden state and the residue state are fed down to construct the visible state following Equation 2.4.
We will later see that this variance-filling mechanism can be used as a memory extension for our
thinking model. This augmentation increases the model's capacity to quickly access
memory and also facilitates the learning-while-executing procedure.
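A toy sketch of this residue-filling inference is given below, using a single binary RBM in place of a full belief network; the weights, sizes, and clamping scheme are illustrative assumptions, and biases are omitted for brevity.

```python
# A toy sketch (illustrative, not the paper's implementation): fill a missing
# residue r conditioned on a given hidden part h by clamped Gibbs sampling in a
# single binary RBM whose visible layer is the concatenation h|r.
import numpy as np

rng = np.random.default_rng(9)
n_h, n_r, n_top = 4, 4, 12
W_rbm = rng.normal(scale=0.1, size=(n_top, n_h + n_r))    # assumed pre-trained weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fill_residue(h, n_steps=50):
    x = np.concatenate([h, rng.integers(0, 2, n_r).astype(float)])   # h clamped, r random
    for _ in range(n_steps):
        top = (sigmoid(W_rbm @ x) > rng.random(n_top)).astype(float)          # sample top units
        x_down = (sigmoid(W_rbm.T @ top) > rng.random(n_h + n_r)).astype(float)
        x = np.concatenate([h, x_down[n_h:]])                                  # keep h clamped
    return x[n_h:]                                                             # the inferred residue

h_given = np.array([1.0, 0.0, 1.0, 1.0])
print("inferred residue:", fill_residue(h_given))
```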

Figure 4: A stack of internally consistent layers with its inference steps to recover the missing vari-
ance. The blackened figures indicate a belief network with r as the hidden residue state representation
of this model.

2.4 Extension to non-linear

In some applications, having hidden states that capture some non-linear traits of the visible states has
practical advantages. We can use the rectified linear activation function as a
means to achieve this [17]. The non-linear behavior of rectified linear units can convey non-linear
information from visible states to hidden states while satisfying internal consistency, given the pres-
ence of mirror bases.
Lemma 2.8. A rectified linear layer with mirror bases satisfies internal consistency in the same
manner as one with linear bases.

Proof. Let $Wu$ be represented by $\rho(Wu) + \rho(-Wu)$, where $\rho$ is a rectified linear projection of every
row of $Wu$. Let $k, u$ be column vectors of the same length:
$$\rho(k^\top u) = \begin{cases} k^\top u & \text{when } k^\top u > 0 \\ 0 & \text{otherwise} \end{cases}$$
Let $W'$ represent the left inverse of $W$. It can be seen that for any $v$, $W'\rho(Wv) + (-W')\rho(-Wv) = W'Wv = v$.

A pair of mirror bases contains mutually exclusive linear polars tied together to form a linear basis.
Although within a layer each mirror pair behaves like a linear basis and follows internal consistency,
we can have non-linear transfer at the adjacent layer by assigning a different weight value to each of
the rectified linear bases in the pair.
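A minimal numerical sketch of Lemma 2.8 follows (dimensions are illustrative, and the pseudoinverse stands in for an arbitrary left inverse):

```python
# A minimal sketch of Lemma 2.8: represent W v by the rectified pair
# (relu(W v), relu(-W v)); the generative pass recombines them with W' and -W',
# recovering v exactly while the hidden code itself is non-linear.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(5)
n_v, n_h = 4, 6
W = rng.normal(size=(n_h, n_v))
W_gen = np.linalg.pinv(W)                 # left inverse, W' W = I

v = rng.normal(size=n_v)
h_pos, h_neg = relu(W @ v), relu(-W @ v)  # mirror bases: mutually exclusive polars
v_rec = W_gen @ h_pos - W_gen @ h_neg     # W' relu(Wv) + (-W') relu(-Wv)

assert np.allclose(v_rec, v)
print("the mirror rectified-linear pair reconstructs v exactly")
```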

2.4.1 A more general way to choose non-linear bases


Here we suggest a method to modify an existing linear mapping into a non-linear one that, of course,
satisfies internal consistency.

Figure 5: A rectified linear mirror pair is equivalent to a linear function.

Let us start from the basic linear internal consistency. For any $v$,
$$v = W'Wv = W'IWv = W'\,[I\; I\; \ldots]\begin{bmatrix} S_0 \\ S_1 \\ \vdots \end{bmatrix} Wv$$
The last expression requires that
$$\sum_i S_i = I \qquad (2.5)$$
Note that each basis set $S_i$ is not only a constant matrix but can also be a collection of functions
whose resolved values depend on the values they are being multiplied with.
Example 2.1. Let $\sigma$ represent a step function at 0:
$$\sigma.x = \begin{cases} 1 \cdot x = x & \text{if } x \geq 0 \\ 0 \cdot x = 0 & \text{otherwise} \end{cases}$$
We can see that $\sigma.x = \rho(x)$.
The step function we choose behaves like a rectified linear function. The mirror pair of $\sigma$ is in fact
$1 - \sigma$. Therefore we have the option to choose two bases as follows:
$$S_0 = \begin{bmatrix} \sigma & 0 & \ldots \\ 0 & \sigma & \ldots \\ \vdots & \vdots & \ddots \end{bmatrix}, \qquad S_1 = \begin{bmatrix} 1-\sigma & 0 & \ldots \\ 0 & 1-\sigma & \ldots \\ \vdots & \vdots & \ddots \end{bmatrix}$$
which satisfy Equation 2.5.
The aggregation of step functions is useful because we can represent any function with it [18].
For a layer, we can expressively have
$$h = \begin{bmatrix} S_0 \\ S_1 \\ \vdots \end{bmatrix} Wv, \qquad v = W'\,[I\; I\; \ldots]\,h.$$
The adjacent layer can assign any weight value to every element of $h$, permitting non-linear trans-
fer across layers. It is also worth noting that the resulting generative and forward mappings are not
symmetric. The generative mapping is just a summation before being multiplied by the left inverse,
while the forward activation is non-linear, following the choice of basis sets.
The application of this proposal is that we can perform the following steps when training internal
consistency in a neural layer:

1. Train its linear weight matrices following $v = W'Wv$.

2. Choose non-linear basis sets $\begin{bmatrix} S_0 \\ S_1 \\ \vdots \end{bmatrix}$ such that $\sum_i S_i = I$.

Figure 6: A rectified linear unit can be factorized into a step function and a linear function.

This technique can be applied many times in a stack to form alternating layers of linear and non-
linear transfer functions.
A variation of this technique can be used in a convolutional network [19] as well:
$$v = W' * W * v = W' * \delta * W * v = W' * \left\{\begin{matrix} S_0 \\ S_1 \\ \vdots \end{matrix}\right\} * W * v \qquad (2.6)$$
$\delta$ is the Dirac delta function, the identity of the convolution operation. For short notation we define
the quantity $\left\{\begin{matrix} S_0 \\ S_1 \\ \vdots \end{matrix}\right\}$ as an alternative representation of the identity $I$ belonging to any tensor
operation $\star$ that has the distributive property, such that
$$A \star \left\{\begin{matrix} S_0 \\ S_1 \\ \vdots \end{matrix}\right\} \star B = A \star I \star B, \qquad \sum_i S_i = I$$
$A$ and $B$ belong to the domain of the $\star$ operation.


From Equation 2.6 we can have
$$h_i = S_i * W * v \qquad (2.7)$$
$$v = W' * \sum_i h_i$$
In practice, however, it is not convenient to convolve a tensor with a basis set that depends on the
value of its operand as in Equation 2.7. We can alleviate this with the convolution theorem and element-wise
products:
$$v = \mathcal{F}'(\mathcal{F}(W') \cdot \mathcal{F}(\mathcal{F}'(\mathcal{F}(W) \cdot \mathcal{F}(v))))$$
$$v = \mathcal{F}'(\mathcal{F}(W') \cdot \mathcal{F}(W) \cdot \mathcal{F}(v))$$
$$v = \mathcal{F}'(\mathcal{F}(W') \cdot 1 \cdot \mathcal{F}(W) \cdot \mathcal{F}(v))$$
$$v = \mathcal{F}'\left(\mathcal{F}(W') \cdot \left\{\begin{matrix} S_0 \\ S_1 \\ \vdots \end{matrix}\right\} \cdot \mathcal{F}(W) \cdot \mathcal{F}(v)\right)$$
$\mathcal{F}$ and $\mathcal{F}'$ are the Fourier transform and its inverse respectively.

We can now multiply each element individually as in the linear example:
$$h_i = \mathcal{F}'(S_i \cdot \mathcal{F}(W) \cdot \mathcal{F}(v)) \qquad (2.8)$$
Again, we can first train the convolution kernels $W$ and $W'$ before choosing the non-linear transfer
functions.
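A sketch of Equation 2.8 in one dimension is given below, assuming circular convolution and a kernel whose Fourier transform is nowhere zero so that the inverse kernel exists; the basis split used here is an illustrative, data-dependent choice.

```python
# A sketch (assumed 1-D circular convolution): apply the basis split in the
# frequency domain as in Equation 2.8. The kernel is an illustrative assumption;
# the inverse kernel is built from the reciprocal of its Fourier transform.
import numpy as np

rng = np.random.default_rng(6)
n = 16
v = rng.normal(size=n)
w = rng.normal(size=n)                        # forward convolution kernel W
Fw = np.fft.fft(w)
Fw_inv = 1.0 / Fw                             # F(W'), so that F(W') . F(W) = 1

Fv = np.fft.fft(v)
# split the identity into two frequency-domain basis sets, S0 + S1 = 1
S0 = (np.abs(Fw * Fv) >= np.median(np.abs(Fw * Fv))).astype(float)
S1 = 1.0 - S0
h0 = np.fft.ifft(S0 * Fw * Fv)                # h_i = F'(S_i . F(W) . F(v)), Eq. 2.8
h1 = np.fft.ifft(S1 * Fw * Fv)

v_rec = np.fft.ifft(Fw_inv * np.fft.fft(h0 + h1)).real   # v = W' * (h0 + h1)
assert np.allclose(v_rec, v)
print("the frequency-domain basis split preserves the visible state")
```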

2.5 Temporal internal consistency

A recurrent neural network belongs to a class of expressive architectures that we can use to build a
generative temporal system for modeling temporal sequences [20]. If we treat hidden states as the
theme of a temporal sequence, we can have visible states that can change through time by treating
past states as the condition for the present time step. For each past configuration, the current visible
state will satisfy the non-sharing property. But does the non-sharing property apply for unseen
conditions as well?

Figure 7: One layer of a temporal model conditioned on two past steps.

Let $v_t$ represent the visible state at a time $t$. $\vec{v}_{\setminus t}$ denotes a collection of visible states over many
time steps into the past, excluding $v_t$. Consider the expectation of temporal variance preservation
conditioned on $\vec{v}_{\setminus t}$: $\sum_{\vec{v}_{\setminus t}} P(\vec{v}_{\setminus t}|v_t) \sum_h P(v_t|h, \vec{v}_{\setminus t})\,P(h|v_t, \vec{v}_{\setminus t}) = 1$. We can rearrange the terms
to treat the past data as a part of the hidden state; and together with the original hidden state, we can
use them to produce the visible state of the present time step:
$$\sum_{\vec{v}_{\setminus t}} \left[\sum_h P(v_t|\vec{v}_{\setminus t}, h)\,P(h|\vec{v}_{\setminus t}, v_t)\right] P(\vec{v}_{\setminus t}|v_t) = \sum_{\vec{v}_{\setminus t}} \sum_h P(v_t|\vec{v}_{\setminus t}, h)\,P(h, \vec{v}_{\setminus t}|v_t) = 1$$

Lemma 2.9. Following the proof of Lemma 2.1, satisfying the conditional non-sharing property is
equivalent to satisfying the non-sharing property when the condition is accounted as a part of the
hidden state.

Therefore in any conditional alternative representation system, we can consider a condition as a part
of a hidden state. As time progresses, each visible state will always be alternatively represented by
the combination of a hidden state and some past visible states, satisfying the non-sharing property
for any condition and thus the internal consistency constraint.
To build a temporal internally consistent system, the training algorithm has to preserve the variance
found in the present visible state conditioned on given past visible states. Due to the relation from
Lemma 2.9, we propose another proposition:
Proposition 2.10. If an algorithm that is used to train the conditional non-sharing property shares
the same routines with one that is used to train the non-sharing property when the condition is
accounted as a part of the hidden state, that algorithm will generalize the non-sharing property to
unseen conditions.

It remains up to the system implementation to provide sufficient criteria that can guarantee
the generality of the algorithm to unseen conditions. Analogously to the non-conditional case, it is
straightforward to see that conditional RICA can be regarded as such an algorithm.

We can also extend the expressions to describe a multilayer system:
$$\sum_{\vec{v}^0_{\setminus t}} \sum_{v^1_t} P(v^0_t|\vec{v}^0_{\setminus t}, v^1_t) \left[\sum_{\vec{v}^1_{\setminus t}} \sum_{v^2_t} P(v^1_t|\vec{v}^1_{\setminus t}, v^2_t)\,(\ldots)\,P(v^2_t, \vec{v}^1_{\setminus t}|v^1_t)\right] P(v^1_t, \vec{v}^0_{\setminus t}|v^0_t) =$$
$$\sum_{\vec{v}^0_{\setminus t}} \left[\sum_{v^1_t} P(v^0_t|\vec{v}^0_{\setminus t}, v^1_t) \left[\sum_{\vec{v}^1_{\setminus t}} \sum_{v^2_t} P(v^1_t|\vec{v}^1_{\setminus t}, v^2_t)\,(\ldots)\,P(v^2_t, \vec{v}^1_{\setminus t}|v^1_t)\right] P(v^1_t|\vec{v}^0_{\setminus t}, v^0_t)\right] P(\vec{v}^0_{\setminus t}|v^0_t)$$
The superscripts denote the layers to which the associated states belong.


The temporal model allows visible states to change despite corresponding to a fixed
hidden state. Hidden states can also change following the temporal progression of visible states. We
could stop here and propose these two alternating transformations as the model of thinking, were it not
for our wish to incorporate the concept of attention into our thinking paradigm.

3 Yet another model of thinking

Our model of thinking is a temporal apparatus that continually generates data to form a sequence
of thoughts, using information from somewhere and some time in the sequence itself. In a
manner similar to other alternative representation models, we represent the generated thoughts with
visible states, and we use hidden states to relate information in the thought sequence. Hidden states
are divided and grouped to form a finite number of processing modules which we call components.
Each component acts as an information cache whose content is fetched depending on a spatio-
temporal cue, or simply a focus. Components provide the means to choose information
from various sources and combine it in a creative yet deducible way to form the sequence of
thinking.
While recurrent neural networks are useful, their recurrent connections have a disadvantage com-
pared to focuses. To generate a thought using recurrent connections when it requires information
from far back in the sequence, the recurrent connections have to cover many
past steps of the sequence. But this runs into a shortcoming when we consider our minds'
capability to switch between different thoughts. At one time, a song is just an earworm in our mind,
but at another time, we have no problem switching to another song. If it were extensive tem-
poral recurrent connections that we have in our minds, switching thoughts would not be so simple; the
cue that signals the switch would have to compete with many others. This idea suggests that the
model should permit only short temporal connections, and should rather rely on another mechanism
to fetch information, such as the focus. With the focus, the model can access past information at any
time while retaining the ability to abruptly change thoughts.
The impression of the model is best illustrated visually in Figure 8.

3.1 The model and its behavior

In our model, the degrees of freedom that change between consecutive thought steps derive only
from focuses. In particular, a focus controls from where and when information should be fetched
into a component. A focus is similar to a pointer in the context of programming, namely, an address
that points to a memory content in a program. Though the memory content can be dynamically
altered, the part of the program that executes and controls the pointer remains static. It is this
concept that we bring into our model with the focus, and it allows us to generalize the model to some
degree. The focus also allows the model to choose which portion of information should and shall
be processed at a time, yielding a biologically inspired capability to recognize structures in noisy
environments.
Let us define the focus of a component as a tuple of a selective focus $s$ and a generative focus $g$,
$f = (s, g)$. The selective focus defines from where in the previous visible states the information
content of the component is fetched, and the generative focus defines upon which portion of the new
visible state the information content will be placed.

Figure 8: An illustration of our model. Here we have three visible states depicted horizontally at the
bottom. From the middle state, which is our current step, the model infers a selective and a generative
focus s and g. s points to a location in the past, and the information from that location is fetched
and transformed into one of the components depicted vertically. The resulting hidden state along with
the generative focus is used to produce a new visible state. c is a context component. u is an augmented
external input as a part of a visible state representation.

At the beginning of each time step $t$, the model starts by activating the focus of each component
using the most recent thoughts:
$$v_t, v_{t-1}, f_{t-1}, \ldots, v_{t-\tau}, f_{t-\tau} \;\to\; f = (s, g) \;\to\; h \qquad (3.1)$$
The activation of the focus is unidirectional, from a finite number $\tau$ of the latest visible states in the
sequence, optionally with the same number of previous focuses, to the current focus of each compo-
nent. The lack of a backward mapping means the non-sharing property between the visible states and
the focus is not defined. Yet when the activation is deterministic, the non-sharing property holds
when we consider an implicit inverse mapping.
Once the focus of every component is generated, the model then fetches the content $h$ for each com-
ponent, corresponding to each individual's selective focus $s$, and uses it together with the contents
of the other components to generate a new visible state, a new thought for the thought sequence,
following the internal consistency constraint:
$$\sum_{\mathbf{h}} P(v|\mathbf{h}, \mathbf{g})\,P(\mathbf{h}|v, \mathbf{g}^{-1}) = 1 \quad \forall v \qquad (3.2)$$
h
We call this the combination rule, i.e., the generated thought must preserve all the information pro-
vided in the components. $\mathbf{h}$ represents a collection of all components' contents. $\mathbf{g}$ is a collection of
the generative focuses of all components that dictates how $\mathbf{h}$ should be combined. In order to com-
plete the equation, we require the inverses of the generative focuses, $\mathbf{g}^{-1}$. They act as the selective
focuses that choose information from the newly generated thought back into the components.
Depending on the generative focuses, the components' contents, when combined into the new
thought, may sometimes not overlap one another. To make the combination rule always valid, the
content mapping parameter $W$ of each individual component must satisfy preservation of variance
such that for each component $i$
$$\sum_{h_i} P(v_i|h_i)\,P(h_i|v_i) = 1 \quad \forall v_i \qquad (3.3)$$
This equation serves as a constraint that explains the behavior of the mapping parameter of each
individual component. It is not mandated to hold for every portion of the generated thought; there
can be times when this equation contradicts Equation 3.2, e.g., when the generative focuses place
the mutually contradictory contents of two or more components on the same portion of the thought.
Nevertheless, it must hold for the mapping parameter of each component.
All of these equations constitute the thinking process of our model.
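To make the interplay of focuses and the combination rule concrete, here is a toy sketch; all mappings, dimensions, and focus rules below are illustrative assumptions, not the paper's implementation.

```python
# A toy sketch of one thinking step with two components. Each component fetches
# a past visible state through its selective focus, encodes it with a variance-
# preserving mapping W_i, and the generative focuses G_i place the decoded
# contents on disjoint halves of the new thought (Equations 3.2-3.3).
import numpy as np

rng = np.random.default_rng(7)
n_v = 8                                            # size of a visible state
past = [rng.normal(size=n_v) for _ in range(5)]    # the thought sequence so far

def orthonormal(n):                                # content mapping with W_i' W_i = I
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

W = [orthonormal(n_v // 2), orthonormal(n_v // 2)]

# generative focuses: component 0 writes the first half, component 1 the second
G = [np.eye(n_v)[:, : n_v // 2], np.eye(n_v)[:, n_v // 2 :]]

# selective focuses (illustrative): component 0 attends 3 steps back, component 1
# attends to the most recent state; each reads the half it will later overwrite
selected = [past[-3][: n_v // 2], past[-1][n_v // 2 :]]

h = [W[i] @ selected[i] for i in range(2)]                   # fetched contents
v_new = sum(G[i] @ (W[i].T @ h[i]) for i in range(2))        # combination rule

# internal consistency: the new thought still encodes exactly what was selected
assert np.allclose(G[0].T @ v_new, selected[0])
assert np.allclose(G[1].T @ v_new, selected[1])
print("new thought:", v_new)
```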

3.2 Advantages

For the reader, it is best to pause here and discuss the advantages this model can offer.

First, the multi-component model allows each component to store information from a different
source, ready to be combined with the others' to form a new thought. As we mentioned in the pre-
vious section, when the model has been trained to follow internal consistency, we have a
generalization guarantee for the alternative representation of any new combination.
Sir Isaac Newton postulated in his seminal work [21] that every surrounding change in the
environment comes from either motion or transformation. Some objects may alter their intrinsic
properties, but the rest only move. Using our model, it is possible to have a representation of any
environment state that separates motion from the background, and we let the focuses handle the
mechanics of the moving part. This way we aim for better generalization when training the model.
The last reason is in fact a means to reduce hypothesis variance with a limited amount of training
data. To see roughly why separating focuses and contents can help reduce hypothesis variance, we
can count the number of training examples required by two neural network implementations, i.e.,
with and without focuses. Let $M$ be the number of our model's components, $T$ the total number
of past states kept by the network model without focuses, $N$ the number of content bits per
state, and $K$ the number of values per bit. Finally, we let each individual bit of focus have two
values, i.e., focus or not focus. In the worst-case scenario, where the networks can only memorize the
input-output pairs and do not generalize them, the lower bound on the number of examples required
to memorize the input-output mappings is the size of the domain times the size of the co-domain. For the
network model without focus, the required number of examples is the size of the total past states
kept times the size of the new state, $T(K^N) \times K^N$. We will let the reader work out the required
number of examples for our model, which is $M(K^{N/M} \times K^{N/M}) + (K^N \times 2^N)$. We can see that
for the case $M = 2$, $K \geq 2$, and $N \geq 2$, the number of examples for training the model without focuses
is greater than the number required by our model. For a fair comparison we usually let $T = M$,
i.e., the number of states kept by the network model without focuses equals the number of
components of our model:
$$T(K^N) \times K^N = TK^{2N} = 2K^{2N} > M\left(K^{N/M} \times K^{N/M}\right) + (K^N \times 2^N) = MK^{2N/M} + (2K)^N = (2^N + 2)K^N$$
This crude estimate only provides an intuition of why separating focuses can help reduce the hy-
pothesis variance of a neural network, as we take into account neither the ability of the compared models
to generalize nor the dependencies in the training data.
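The inequality can be checked directly; the snippet below evaluates both counts for a few illustrative values of $K$ and $N$ with $M = T = 2$ (like the estimate above, it ignores generalization ability and data dependencies).

```python
# A quick check of the counting argument for M = T = 2 (illustrative only).
def without_focus(T, K, N):
    return T * (K ** N) * (K ** N)                              # T K^{2N}

def with_focus(M, K, N):
    return M * (K ** (N // M)) ** 2 + (K ** N) * (2 ** N)       # M K^{2N/M} + (2K)^N

for K in (2, 3):
    for N in (2, 4, 8):
        assert without_focus(2, K, N) > with_focus(2, K, N)
print("for M = T = 2, K >= 2, N >= 2 the focus-based model needs fewer examples")
```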

3.3 Existence

Here we show that in general, we can always build a multicomponent-multilayer internally consis-
tent system that allows non-linear representation of the visible state sequences following our model’s
behavior.
Proposition 3.1. There exist non-linear implementations of the model that satisfy Equation 3.2 and
Equation 3.3.
Giving some examples will take care of the proof.
Example 3.1. In the first example, we show that a linear, single-layer, internally consistent imple-
mentation of the model exists. Then, by Lemma 2.3 and Lemma 2.8, we can put any desired number
of non-linear layers at the bottom of it to create a stack of non-linear internal consistency layers.
We interpret Equation 3.2 in a linear form:
$$v = GW'WG^{-1}v \quad \forall v \qquad (3.4)$$
which can be expanded as below:
$$v = [G_0\;\; G_1\;\; G_2]\begin{bmatrix} W'_0 & 0 & 0 \\ 0 & W'_1 & 0 \\ 0 & 0 & W'_2 \end{bmatrix}\begin{bmatrix} W_0 & 0 & 0 \\ 0 & W_1 & 0 \\ 0 & 0 & W_2 \end{bmatrix}\begin{bmatrix} G^{-1}_0 \\ G^{-1}_1 \\ G^{-1}_2 \end{bmatrix} v$$
$G_i$ and $G^{-1}_i$ are the generative focus matrix of a component $i$ and its inverse respectively. $W_i$ and
$W'_i$ are the forward content mapping matrix of a component $i$ and the corresponding generative
mapping matrix. To satisfy Equation 3.3, we require that
$$x = W'_i W_i x \quad \text{for any } x = G^{-1}_i v \qquad (3.5)$$
hence,
$$v = GG^{-1}v \qquad (3.6)$$

If we limit ourselves to allowing each generative focus matrix $G_i$ to only contain 0 or 1, and to only
be formed by a combinatorial basis shuffling of the identity such that $GG^{-1} = I$, it is implied
that the individual $G_i$ must not intersect one another. From here, we can see that there are at least
as many settings as the factorial of the dimension of $v$.
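A concrete sketch of this construction with assumed dimensions (the pseudoinverse again stands in for an arbitrary left inverse of each content mapping):

```python
# A sketch of Example 3.1: three components, generative focuses G_i formed from
# disjoint, shuffled columns of the identity, and content mappings W_i with left
# inverses, so that v = G W' W G^{-1} v holds (Equation 3.4).
import numpy as np

rng = np.random.default_rng(8)
n_v, M = 9, 3
perm = rng.permutation(n_v)                           # shuffle the identity's columns
blocks = np.split(perm, M)                            # non-intersecting index blocks

G = [np.eye(n_v)[:, b] for b in blocks]               # generative focus matrices
G_inv = [g.T for g in G]                              # their inverses (select back)

W = [rng.normal(size=(5, n_v // M)) for _ in range(M)]       # forward content mappings
W_gen = [np.linalg.pinv(w) for w in W]                        # left inverses, W_i' W_i = I

v = rng.normal(size=n_v)
v_rec = sum(G[i] @ W_gen[i] @ W[i] @ G_inv[i] @ v for i in range(M))
assert np.allclose(v_rec, v)                          # Equation 3.4 holds
print("one valid setting out of (dim v)! possible focus assignments")
```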
Example 3.2. As an alternative to Example 3.1, if we fix $G$ to $I$, the number of valid settings
depends on the choice of the individual $W_i$ that satisfy $W'Wv = v$. The task is left for us to choose
the bases of each individual $W_i$ such that they are not correlated with those of the other components.
Example 3.3. In a convolutional neural network with a pooling layer [22], the generative focus is
always the residue of the pooling layer that satisfies variance preservation:
$$W * v^0 \to h^1 \xrightarrow{\text{pool}} (h^2, g^2)$$
Again, the superscripts denote the layers to which the associated states belong. The generative focus
here indicates from where the pooling result has taken its input. If $W$ has an inverse $W'$ such that
$W' * W = \delta$, then we can have
$$v^0 = W' * (GG^{-1}(W * v^0))$$
where for each component $i$
$$h^2_i = G_i(W * v^0)$$
For a multilayer implementation of the model, the mechanics of components and focuses should stay
at the top of the stack as an executive function that controls thought. We let the lower layers act as
the non-linear transfer function between the bottommost visible states and the topmost hidden states
in the components.

3.4 Learning while executing

Thoughts are creative, and yet no one but ourselves can teach us to think, with nothing but executed
input sequences as the examples. A good thinking model should allow learning while executing.
Like other machine learning paradigms, the model works in two phases: executing and training.
The difference is that in our model both phases use the same execution path, with the only inputs
being example sequences of visible states. The system that implements the model should learn to
generate the sequences and also generalize them. We are allowed, however, to devise special versions of
such sequences specifically for the sake of training.
We impose that any machine can learn while executing if a) the learning happens along the path of
executing, and b) the learning happens in the direction of executing. These are the conditions for
learning while executing. This type of training prohibits more than one step of the back-propagation-
through-time algorithm [23] that is used to train recurrent neural networks, though our model
could definitely benefit from having some kind of long-term guide, especially for training the
selective focus.
During the training phase of our model, we advise training the content parameters of the components first.

3.4.1 Content training


The content mapping parameter of each component can be trained in an unsupervised manner with data supplied
by an initial selective focus and a generative focus, and with specially designed visible state sequences to
leverage them.

Let us consider a linear executing step of the model:
$$v = GW'WS\vec{v} \quad \forall v \qquad (3.7)$$
In a manner similar to Equation 3.4, $S$ is a collection of the selective focuses of all components.
$\vec{v}$ represents a collection of past visible states.

Also, for each component, according to Equation 3.3,

$$S_i\vec{v} = W'_i W_i S_i \vec{v} \quad \forall i, v \qquad (3.8)$$

Merging the executing equations yields

$$v = GS\vec{v} \qquad (3.9)$$
Equation 3.9 suggests how to design the training sequences, i.e., the next-step visible state $v$ in the
sample sequence should somehow represent the aggregation of all the selected information from the
past, $S\vec{v}$, according to $G$.
Given a focus $f = (s, g)$, each component will passively generate a hidden state with the current
parameter values and then produce a visible state. The process of learning can utilize this path:
$$S\vec{v} \xrightarrow{\text{tie}} WS\vec{v} = h \xrightarrow{\text{tie}} W'h = G^{-1}v$$
$a \xrightarrow{\text{tie}} b$ denotes supervised learning that learns the mapping from $a$ to $b$, following its direction.
We can see that learning the content parameters does not break the learning-while-executing
conditions.

3.4.2 Focus training


This section shows that it is possible to train our model's focuses while satisfying the
learning-while-executing conditions.
The generative focus is straightforward to train since it can be obtained directly from content
optimization. Equation 3.9 implies that we have to choose the generative focuses $G$ of all components
such that their combined content best matches $v$, fulfilling the equation. Alternatively, in a convolutional
network, Example 3.3 shows that the generative focus $g$ of a component can be extracted along with
the content while training the content parameters. Then we can immediately tie the recent thoughts
to it:
$$v_t, v_{t-1}, f_{t-1}, \ldots, v_{t-\tau}, f_{t-\tau} \xrightarrow{\text{tie}} g$$

Once the content parameters and the generative focus have been trained, the selective focus can be
trained by an activate-and-tie mechanism, optionally with reinforcement learning as a guide [24, 25].
If a randomly chosen selective focus allows the model to predict the next-step visible state in
the given sample sequence, according to Equation 3.9, the activation of that focus is subsequently
enhanced. We will show in the next section that with the presence of a memory mechanism this can
be made easier.

3.5 Extension

3.5.1 Memory component


Memory is the actual implementation of a component that allows its selective focus to fetch a visible
state from the past. The selective focus can either be in the form of a) a temporal cue that contains
the location in time relative to the present, or b) a part of a previous hidden state that allows us to
fetch the residue information of that state. In the latter case, the memory acts as a most-recent hash
that only allows the latest content associated with a cue to be retrieved. A variance-filling system
such as a belief network, mentioned in Section 2.3.2, would serve this purpose.
When the selective focus is a part of some hidden state, we can use this fact to help train it.

Example 3.4. During a focus training phase, we can design a sample sequence so that the focus of
an immediate previous step can be trained one by one. Given an initial selective focus $s^0$ that always
points to the current visible state, we can derive a hidden state that contains the information of the
supposed selective focus for the previous step together with its residue:
$$v_t \xrightarrow{s^0_t} s_{t-1}|r$$
We can then tie past states to the produced focus:
$$v_{t-1}, v_{t-2}, f_{t-2}, \ldots, v_{t-\tau}, f_{t-\tau} \xrightarrow{\text{tie}} s_{t-1}$$

3.5.2 Context component


Section 2.5 discusses the possibility of having an alternative representation model that passes tem-
poral conditions between steps while satisfying internal consistency. This suggests that sometimes
the model must be allowed to carry extra degrees of freedom between steps of a thought sequence
in the form of contexts. This is in order to represent the dynamics of some applications; to name a
few, generating a sequence of music with a fixed theme, modeling a world object that changes its
appearance while moving, or addressing the transformation part of the environment according to
Newton's postulation.
Instead of allowing direct recursive links between steps, we can use components to pass on contexts.
We can consider a context component as a special component whose selective focus directly trans-
forms a visible state into the component's content. For example, the model can cache the visible state
of the immediately previous step in a context component for current use.
Despite the flexibility provided by contexts, there can be a model without contexts which is, in terms
of complexity, equivalent to its context counterpart. When a context and its originating visible state
are alternative representations of each other, we can always replace a context component's direct
transformation with a normal selective focus mechanism that fetches context-equivalent data from
predefined locations in the thought sequence that hold them. We shall use this fact to facilitate the
elaboration of the proof of our model's complexity.

3.5.3 External inputs


In some applications, the inputs of the model are not only those generated in the sequence but also
those received externally during execution. Cabessa et al. present a theoretical framework of
Super-Turing machines [26]. These are interactive Turing machines capable of handling external inputs
during program execution. We, on the other hand, do not treat external inputs as separate entities
but rather as a part of the visible states that is generated by an external mechanism. The augmented visible
state $(v, u)$ therefore comprises the part $v$ that is generated by the model and the other part $u$ that
is written onto the state by the environment. The augmented part should work seamlessly with the
focuses, just like a normal visible state.

4 The model as a universal simulator

We use computers to achieve much, ranging from calculating the total of a shopping cart to putting
men on the moon. They are also potent simulators, simulating the trajectory of a robot
or computing the motion of planets, for example, with limited versatility but accuracy not less
than that of our brains. Because in theory we can regard computers as universal Turing machines
[8], perhaps the way to prove that our model can sustain the thought process is to show that the model can
simulate any Turing machine.
There have been many efforts to relate neural networks to Turing machines. Take Siegelmann and
Sontag's work [27], for example, as one among the originally cited. And at the time we compose
this work, two such efforts stand prominent amid the hype in deep learning. Graves et al. presented
neural Turing machines, which were carefully designed using Long Short-Term Memory (LSTM)
[6, 28] and were tested on a few algorithmic tasks. Zaremba and Sutskever further extended
the work with various techniques, notably reinforcement learning, to address the mechanism of
Turing machine heads [25].

Turing machines' heads are comparable to our selective focuses in a sense, considering the ability
to fetch and utilize information from the past. Our model also bears a resemblance to LSTM in
this very aspect. While Graves et al. used LSTM to build Turing machines from neural networks,
their work is not a proof that LSTM by itself can simulate any Turing machine, but rather the use of
LSTM and other neural networks to implement Turing machine components that could. This paper
starts from the concept of alternative representation, develops an underlying theory that guarantees
generalization to unseen inputs, and integrates the concept of focus to allow the model to manipulate
information in a way that resembles a Turing machine's behavior. If we show that our model can
simulate any Turing machine, we could say that this work bridges the gap between LSTM and
Turing machines, providing further evidence that perhaps a neural network can also be viewed as a
universal Turing machine.
Before we go to our proof, we consider another simple lemma that relates a Turing machine's symbols
to our state representations.
Lemma 4.1. There always exists a set of alternative symbols for a Turing machine's transition
function,
$$(s, t_h) \to (s', t'_h, h')$$
that satisfies the non-sharing property from any current state $s$ and the tape content $t_h$ associated
with any current head $h$ to a new state $s'$, a new head location $h'$, and a new content $t'_h$ to be written
over the current head.

Since the transition function of a Turing machine is one-directional and deterministic, we can readily
implement it in a system with the non-sharing property. Proposition 3.1 suggests that we can poten-
tially use a multi-layer implementation to generatively map any current state and the tape content
associated with any current head to a new state, a new head location, and a new content to be written
over the current head. Here the lemma certifies that we can always find a set of alternative sym-
bols that satisfies Equation 3.2, and now we are ready to show example algorithms for simulating a
Turing machine on our model with, of course, their complexity analysis.

4.1 An algorithm

When running a program on a Turing machine, the dynamics of the program during execution
prevent us from training the focus to directly point to the Turing machine's head location. We
can resort to searching for recent head contents, generatively produced somewhere in the visible state
sequence. At the start of each Turing machine step, the algorithm makes the model look back into
the sequence to resolve the current head content using only a fixed number of recent thoughts.
Consider a Turing machine's transition function, $(s, t_h) \to (s', t'_h, h')$. Since our original model
does not have memory, it is required to write down all the symbols onto the visible state sequence.
Let a square bracket represent a visible state of our model's sequence. Here is a portion of the
sequence involved in one Turing machine step:
$$[\ldots, m_{-1}],\;\; [\beta, s, h, (t_{h_{-1}}, h_{-1}), t_{h_{-1}}, *, *],\;\; [\alpha, s, h, *, h_{f_{-1}}, t_{h_{f_{-1}}}, f_{-1}, 1],\;\; \ldots,\;\; [\alpha, s, h, *, h, t_h, f_{-m}, m]$$
The sequence comprises a pivot step, tagged with $\beta$, and several search steps, tagged with $\alpha$. At
the pivot step, $s$ is the current state and $h$ is the current head. The parenthesis in the pivot groups the new
content of the previous head location with that head itself. Next in the step is the yet-to-be-resolved
$t_{h_{-1}}$ content for the current head. To resolve this value, the search steps will jump from one pivot
step to another, looking for the head's content in the parentheses. Each search step, starting with $\alpha$,
carries $s, h$ to be used to compute the symbols of the next Turing machine step at the end of the
search. The search progresses by updating these four parameters: the found head, the content of
the found head, the focus location of the found head, and the step counter of the search. The model
keeps track of the found head to conditionally decide when to end the search. The focus location
allows the model to track the current search location. The step counter allows the model to compute
how many steps to jump to find the next pivot step. And it is already obvious why we keep the found
head content. These four parameters allow the model to evaluate the selective focus using only the
two most recent visible states. $*$ represents a wild-card symbol in which we are not interested at the time
it is written. We encourage the reader to become familiar with this sequence before moving forward.

To generate a new visible state, the model utilizes its components. Each individual component copies
information from the sequence following the focus, satisfying internal consistency, and combines it
with that of the others to generate symbols (Lemma 4.1). For the sake of simplicity, we also use
context components in this algorithm's illustration. It is not hard to see that the information kept
in each context component satisfies the non-sharing property. Therefore, we can entirely replace
each context with some initial symbols at the very beginning of the sequence and a focus mechanism
to fetch and combine them, without affecting the algorithm's performance (Section 3.5.2). At each
search step (with $\alpha$), the model decides whether to stop or continue the search and puts symbols on
the components. The following describes the templates of the information carried in the
components:
\[
\big[\,\alpha,\ s,\ h,\ t_{h_{-1}},\ h_{f_{-m}},\ t_{h_{f_{-m}}},\ f_{-m},\ m\,\big]
\;\rightarrow\;
\begin{array}{l}
\text{step tag (context)} \\
\text{Turing machine's states (copy)} \\
\text{current search head (copy)} \\
\text{current search content (copy)} \\
\text{current search focus (context)} \\
\text{search iteration (context)}
\end{array}
\;\rightarrow\; \text{a new visible state}
\]
i.e., when $h_{f_{-m}} \in \{h, \phi\}$, the templates collect these symbols:
\[
\left.\begin{array}{l}
\beta \\
s,\ h \\
h_{f_{-m}} = h \\
t_{h_{f_{-m}}} = t_h \\
* \\
*
\end{array}\right\}
\;\rightarrow\;
\big[\,\beta,\ s',\ h',\ (t'_h, h),\ t_{h'},\ *,\ *\,\big]
\;\rightarrow\; \ldots
\]

otherwise,
\[
\left.\begin{array}{l}
\alpha \\
s,\ h \\
h_{f_{-m-1}} \\
t_{h_{f_{-m-1}}} \\
f_{-m-1} = f_{-m} - m_{-m} - 1 \\
m + 1
\end{array}\right\}
\;\rightarrow\;
\big[\,\alpha,\ s,\ h,\ *,\ h_{f_{-m-1}},\ t_{h_{f_{-m-1}}},\ f_{-m-1},\ m+1\,\big]
\;\rightarrow\; \ldots
\]
φ represents an empty or any unrecognized symbol at the start of the sequence. For the first search
step, after a pivot step:
\[
\big[\ldots,\ m_{-1}\big],\ \big[\,\beta,\ s,\ h,\ (t'_{h_{-1}}, h_{-1}),\ t_h,\ *,\ *\,\big]
\;\rightarrow\;
\left.\begin{array}{l}
\alpha \\
s,\ h \\
h_{f_{-1}} \\
t_{h_{f_{-1}}} \\
f_{-1} = -m_{-1} - 1 \\
1
\end{array}\right\}
\;\rightarrow\;
\big[\,\alpha,\ s,\ h,\ *,\ h_{f_{-1}},\ t_{h_{f_{-1}}},\ f_{-1},\ 1\,\big]
\;\rightarrow\; \ldots
\]
$f_{-1}$ is fixed to point to the previous pivot step. The components we show here are merely the templates of
the real components. They do not correspond injectively to the symbols on a visible state; rather,
each symbol on a visible state derives from only some of the information present in them. For
example, the new state $s'$ would require the old state and the found head content to be generated.
More importantly, the reader can verify that the identities of the symbols in each component template
can be determined from at most the two most recent visible states. Since the first visible state in the
sequence has to be given as the input, this completes the induction for each transformation step of a
Turing machine.
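To make the bookkeeping above concrete, the following sketch (our own simplification; the focus and component machinery are abstracted into ordinary list indexing, and the illustrative TRANSITION table above is reused) simulates one Turing machine step at a time by searching backwards through the recorded pivots for the most recent write at the current head. This is exactly the per-step search whose cost is analysed next.

```python
# A minimal sketch of the Section 4.1 strategy: every finished Turing-machine step
# appends a pivot record (what was written, where, and at which step), and the
# content under the current head is resolved by a backward search over the pivots.

def resolve_head(pivots, h):
    """Return (content under head h, number of search steps used)."""
    steps = 0
    for written, at_head, _step in reversed(pivots):
        steps += 1
        if at_head == h:                        # most recent write at h
            return written, steps
    return BLANK, steps                         # never written: blank tape symbol

def simulate(transition, s0, tape0=None, h0=0, max_steps=1000):
    """transition: dict (state, symbol) -> (state', symbol', head move); halts on a missing rule."""
    # seed the pivot sequence with the initial tape contents (treated as pseudo-writes)
    pivots = [(sym, pos, -1) for pos, sym in (tape0 or {}).items()]
    s, h, searched = s0, h0, 0
    for n in range(max_steps):
        t_h, cost = resolve_head(pivots, h)     # the per-step backward search
        searched += cost
        rule = transition.get((s, t_h))
        if rule is None:
            return s, pivots, searched
        s_next, t_write, dh = rule
        pivots.append((t_write, h, n))          # this step's pivot
        s, h = s_next, h + dh
    return s, pivots, searched

# Usage with the illustrative TRANSITION: append a 1 to a block of three 1s.
state, pivots, cost = simulate(TRANSITION, 'scan', tape0={0: '1', 1: '1', 2: '1'})
print(state, cost)            # 'halt', with a search cost that grows with the sequence
```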

4.1.1 Complexity analysis


Let $N$ be the current number of Turing machine steps counting from the beginning of program
execution. The worst-case time complexity of the algorithm to execute a step, contributed by the
search procedure, is bounded by this number, assuming that the step is required to search back to the
beginning of the sequence, which is $O(N)$.

What is the average time complexity to execute a step?
For a vanilla Turing machine, the head only moves left or right. Under the assumption that there is
an equal probability for the head to move left or right over all possible programs, we can plot
a random walk graph of the head's locations up until it reaches the current location (see Figure 9).
Consider the head's location of a Turing machine after $N$ steps of execution; the total number of
paths to that point is equal to $2^N = |P_\infty| + |P_0| + |P_1| + \ldots + |P_{N-2}|$, where $|P_n|$ is the number of
paths whose most recent visit of the current location was at the $n$-th step, and $|P_\infty|$ is the number of
paths that never visit the location.

Figure 9: A graph of all possible paths a Turing machine's head can take to a location. The red
dashed line illustrates a path that, from step $n = 2$, never again reaches the location until the
current step.

The expected total of the search steps is given by
\[
E[\text{search steps at } N] \;=\; \frac{1}{2^N}\left[\sum_{n=0}^{N-1} (N-n)\,|P_n|\right] \;+\; \frac{N}{2^N}\left[2^N - \sum_{n=0}^{N-1} |P_n|\right] \qquad (4.1)
\]

The first term in Equation 4.1 is the expected number of steps from the last visit of the current location.
$|P_n|$ can be computed by counting the number of paths, on the left and the right of Pascal's
semi-triangle, that avoid the current location (see Figure 9). It is given by the Catalan number:

\[
|P_n| = 2\,C(N-n-2,\, 0)\, 2^n
\]
\[
C(2m, 0) = \frac{1}{m+1}\binom{2m}{m}, \qquad C(2m+1, 0) = 0
\]

Using Stirling's approximation on the Catalan number, we can compute the asymptote:
\[
|P_n| = O\!\left(\frac{2^N}{(N-n)^{3/2}}\right)
\]


The asymptote of the first term is therefore given by $O(\sqrt{N})$:

\[
\frac{1}{2^N}\sum_{n=0}^{N-1} (N-n)\,|P_n|
\;=\; \sum_{n=0}^{N-2} 2(N-n)\,C(N-n-2,\, 0)\,\frac{2^n}{2^N}
\;=\; \sum_{n=0}^{N-2} O\!\left(\frac{1}{(N-n)^{1/2}}\right)
\]

Because $\frac{1}{(N-n)^{1/2}}$ is monotonically increasing in $n$, we can bound the sum via the integral test:
\[
\int_0^{N-1} \frac{j}{(N-n+1)^{1/2}}\, dn
\;<\; \sum_{n=0}^{N-2} O\!\left(\frac{1}{(N-n)^{1/2}}\right)
\;<\; \int_0^{N-1} \frac{k}{(N-n)^{1/2}}\, dn
\qquad \exists\, j, k \in \mathbb{R}^+
\]
\[
\Big[-2j(N-n+1)^{1/2}\Big]_0^{N-1}
\;<\; \sum_{n=0}^{N-2} O\!\left(\frac{1}{(N-n)^{1/2}}\right)
\;<\; \Big[-2k(N-n)^{1/2}\Big]_0^{N-1}
\]
\[
O(\sqrt{N})
\;<\; \sum_{n=0}^{N-2} O\!\left(\frac{1}{(N-n)^{1/2}}\right)
\;<\; O(\sqrt{N})
\]
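As an informal sanity check (our own script, not part of the analysis), the first term of Equation 4.1 can be estimated empirically under the same equal-probability assumption on head moves; the estimate grows roughly like $\sqrt{N}$, doubling when $N$ quadruples.

```python
# Monte Carlo estimate of the first term of Equation 4.1: the average look-back
# distance (N - n) to the most recent previous visit of the final head location,
# counted as 0 when that location was never visited before (the second term's case).
import random

def first_term_estimate(N, trials=2000):
    total = 0
    for _ in range(trials):
        pos, visits = 0, {0: 0}                 # location -> most recent step
        prev_visit = None
        for n in range(1, N + 1):
            pos += random.choice((-1, 1))
            prev_visit = visits.get(pos)        # only the final iteration's value is used
            visits[pos] = n
        if prev_visit is not None:
            total += N - prev_visit
    return total / trials

for N in (100, 400, 1600):
    print(N, round(first_term_estimate(N), 1))  # roughly doubles as N quadruples
```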

The second term in Equation 4.1 accounts for the paths where the current head location has never
been visited before. In this case, the model has to search the entire sequence, yielding an asymptote
of $O(N)$ for the term. Fortunately, we can modify the algorithm to get rid of this
term by adding two extra copy components that keep track of the leftmost and rightmost bounds of
all the visited head locations. When the head location exceeds one of the bounds, the model updates
the bound, skips the search, and immediately writes the empty symbol as the head's content. This
augmentation improves the average time complexity of the algorithm to $O(\sqrt{N})$.
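In code, the augmentation amounts to a bounds check in front of the search. A minimal sketch on top of the resolve_head sketch above (the two bound variables stand in for the two extra copy components):

```python
# If the head lies outside every location visited so far, its content must be the
# blank symbol, so the backward search can be skipped entirely.
def resolve_head_bounded(pivots, h, left_bound, right_bound):
    if h < left_bound or h > right_bound:
        return BLANK, 0                  # no search needed
    return resolve_head(pivots, h)

# The caller maintains the bounds as the head moves:
#   left_bound, right_bound = min(left_bound, h), max(right_bound, h)
```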
Can we do better?

4.2 A constant time algorithm

Figure 10: An illustration of a Turing machine’s head location from start after moving left, right,
right. The dashed arrows point from the current to the previous left of the current head location, to
the previous right of it, or to itself.

The idea of this algorithm is to let every step maintain the step counts from the current step to the
previous visit of the location left of the current head, to the previous visit of the location right of it,
and to the previous visit of the current location itself. Let us now consider three steps of a visible state
sequence from the start of a program. From step $n = 0$, the head moves left, left, then right, producing the following sequence:
\[
[\ \text{Tag},\ s,\ h,\ t_h,\ t'_h,\ d_{n-1\to n},\ l_n,\ c_n,\ r_n\ ]
\]
\[
\begin{array}{l}
[\ \beta,\ s_0,\ h_0,\ t_{h_0},\ *,\ *,\ 0,\ 0,\ 0\ ], \\
[\ \gamma,\ s_1,\ h_1,\ *,\ t'_{h_0},\ L,\ *,\ *,\ *\ ], \\
[\ \alpha,\ *,\ *,\ *,\ *,\ *,\ *,\ -1,\ -1\ ], \\
[\ \beta,\ s_1,\ h_1,\ t_{h_1},\ *,\ *,\ -1,\ -1,\ -1\ ], \\
[\ \gamma,\ s_2,\ h_2,\ *,\ t'_{h_1},\ L,\ *,\ *,\ *\ ], \\
[\ \alpha,\ *,\ *,\ *,\ *,\ *,\ *,\ -2,\ -1\ ], \\
[\ \beta,\ s_2,\ h_2,\ t_{h_2},\ *,\ *,\ -2,\ -2,\ -1\ ], \\
[\ \gamma,\ s_3,\ h_3,\ *,\ t'_{h_2},\ R,\ *,\ *,\ *\ ], \\
[\ \alpha,\ *,\ *,\ *,\ *,\ *,\ -1,\ -2,\ *\ ], \\
[\ \beta,\ s_3,\ h_3,\ t_{h_3},\ *,\ *,\ -1,\ -2,\ -3\ ], \\
[\ \ldots\ ]
\end{array}
\]

$d_{n-1\to n}$ represents the head direction from step $n-1$ to $n$. The step counts to the previous visit of the
location left of the current head, to the previous visit of the location right of it, and to the previous visit of the
current location are $l_n$, $r_n$, and $c_n$ respectively. A Turing machine step starts at $\beta$ with
its states and the previous step counts. At the $\gamma$ step, the next Turing machine states are produced. At the
$\alpha$ step, the step count to the previous visit of the current head location and one of the left or right counts are
produced, depending on the head direction. At the next $\beta$, our model finishes the Turing machine step with the
fetched head content and the last step count. The generation of the step counts follows these rules:

\[
c_n = \begin{cases} l_{n-1} - 1 & \text{if } d_{n-1\to n} = L \\ r_{n-1} - 1 & \text{otherwise} \end{cases}
\]
\[
l_n = \begin{cases} c_n + l_{(n+c_n)} & \text{if } d_{n-1\to n} = L \\ -1 & \text{otherwise} \end{cases}
\]
\[
r_n = \begin{cases} c_n + r_{(n+c_n)} & \text{if } d_{n-1\to n} = R \\ -1 & \text{otherwise} \end{cases}
\]
(Moving left makes the new head location the old head's left neighbor, so $c_n$ chains from $l_{n-1}$; moving right chains it from $r_{n-1}$, which agrees with the example sequence above.)
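A small sketch (ours, with the focus machinery again replaced by list indexing) of these update rules over a list of head moves; it reproduces the counts of the example sequence above.

```python
# Track the step counts l, c, r for a sequence of head moves. Counts are negative
# offsets in steps; the chain rules l_n = c_n + l_{n+c_n} and r_n = c_n + r_{n+c_n}
# reuse the counts stored at the step where the current location was last visited.
def track_counts(moves):
    """moves: list of 'L'/'R' head directions; returns per-step (l, c, r)."""
    counts = [(0, 0, 0)]                        # step 0: l_0, c_0, r_0
    for n, d in enumerate(moves, start=1):
        l_prev, _c_prev, r_prev = counts[n - 1]
        # c_n: steps back to the previous visit of the new head location
        c = (l_prev - 1) if d == 'L' else (r_prev - 1)
        if d == 'L':
            l = c + counts[n + c][0]            # chain through the earlier visit
            r = -1                              # the location we just left
        else:
            r = c + counts[n + c][2]
            l = -1
        counts.append((l, c, r))
    return counts

# Reproduces the example sequence (moves L, L, R):
# [(0, 0, 0), (-1, -1, -1), (-2, -2, -1), (-1, -2, -3)]
print(track_counts(['L', 'L', 'R']))
```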
The component templates are as follows:
\[
\big[\,\text{Tag},\ s,\ h,\ t_h,\ t'_h,\ d_{n-1\to n},\ l_n,\ c_n,\ r_n\,\big]
\;\rightarrow\;
\begin{array}{l}
\text{step tag (context)} \\
\text{Turing machine's states (copy)} \\
\text{previous head content (copy)} \\
\text{head direction (context)} \\
\text{step counts (context)}
\end{array}
\;\rightarrow\; \text{a new visible state}
\]
Again, the reader can verify that the identities of the symbols in each component template can
be determined, this time, from only the most recent visible state. This completes the algorithm. Because
each Turing machine step only requires two extra steps to be simulated in our model, we
can conclude that the time complexity of this algorithm is $O(1)$.
The existence of this algorithm by itself allows us to state the main theorem of this paper.
Theorem 4.2. A model with arbitrary depth and a finite number of components can simulate
any vanilla Turing machine while being slowed down by at most a constant factor compared to the machine
it simulates.

4.3 With memory components

We can augment the components with the most-recent-hash memory mechanism like the one we introduced
in Section 3.5.1. This enhancement allows the tape content associated with any head location,
not just the ones reachable by the left-right-only movement of a vanilla Turing machine, to be read or
written in a single step. An execution step of a Turing machine simulated on our model with the
most-recent-hash memory is manifested by these expressions:
\[
[\beta, h, *, s]
\;\rightarrow\;
\begin{array}{c}
\alpha \\
h \xrightarrow{\ \text{read}\ } t_h \\
s
\end{array}
\;\rightarrow\;
[\alpha, h', t'_h, s']
\;\rightarrow\;
\begin{array}{c}
\beta \\
t'_h \xrightarrow{\ \text{write}\ } h \\
s'
\end{array}
\;\rightarrow\;
[\beta, h', t'_h, s']
\;\rightarrow\; \ldots
\]
For each Turing machine step, our model with memory only needs one extra step to complete the
read-write cycle, with one memory component to read and write the tape contents. It infers the head
content $t_h$ at the $\beta \to \alpha$ step and memorizes the new head content $t'_h$ at the $\alpha \to \beta$ step, both while
focusing on the head symbol at the $\beta$ step. The efficiency is readily apparent.
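A minimal sketch of this read-write cycle, with the most-recent-hash memory approximated by a last-write-wins map keyed on head locations (an assumption of ours; BLANK and the transition-table format are reused from the sketches above, and a head move may now be an arbitrary offset):

```python
# With hashed memory, each Turing-machine step costs O(1): one keyed read under the
# focused head location and one keyed write of the new content.
def simulate_with_memory(transition, s0, tape0=None, h0=0, max_steps=1000):
    tape = dict(tape0 or {})                 # most recent content per head location
    s, h = s0, h0
    for _ in range(max_steps):
        t_h = tape.get(h, BLANK)             # read under focus: one memory access
        rule = transition.get((s, t_h))
        if rule is None:
            return s, tape
        s, t_write, dh = rule
        tape[h] = t_write                    # memorize the newly written content
        h += dh
    return s, tape
```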

5 Discussion
To state it one last time, this paper presents a theoretical thinking model. The model consists of
sub-components, all of which combine the information that each fetches from some part of the former
thoughts in order to creatively compose a new one. Each component has the ability to invariantly
extract information from any time and place with its selective focus, which in turn depends on and
is driven only by the most recent thoughts. The combination mechanism is governed by the concept
of internal consistency. Internal consistency is especially useful for a system that operates through
time, has the capacity to alternatingly and repeatedly recognize and generate data, and requires
that the newly generated data are relevant to the cause from which they originate, such as the
thinking process.
The explicit use of focuses brings an advantage to our model. It happens that, in our world, physical
interactions between objects tend to be correlated with their spatio-temporal locations. Because of this,
physical transformation can hypothetically be simulated using a hierarchy of execution within a system.
Our model allows the focus mechanism to handle the higher order of execution, while having
each component handle the transformation within the detail it is responsible for. This way we might be able
to achieve generalization with limited training data. It is also a means to reduce the hypothesis variance
of the model's implementation while, as much as possible, preserving the degrees of freedom
of thinking.
The problem this model tries to solve is in fact the inverse of filtering in signal processing. In
filtering, we try to discover a recognition function, or simply an estimate in the context of signal
processing, while taking the generative function, or the input control, and process noise into account.
In this work, however, we attempt to find a generative function that is simply the perfect inverse of
the recognition function. Furthermore, the approach of filtering has also been widely applied to find
distributed weights for combining a series of measurements that signify the same information, proportionally
to each measurement's confidence. In our work, however, the distributed
weights are from hidden states to visible states, and are formed by training with the only limitation being
the preservation of variance.
Throughout the length of this work, we have discussed several models. The first model, which we
introduce in Section 2, is the hidden-visible bipartite model with its hidden state representations
and its visible state representations that are always alternative representations of each other. Applications
that involve this basic model include factor analysis and some knowledge representation
where preservation of variance is required. Adding temporal support to the first model gives rise to
a temporal version of it (Section 2.5). The possibility of further enhancing the temporal model with
indefinitely long memory retrieval capability has eventually led the discussion to our thinking model
in Section 3. We dedicate Section 4 to showing that the model can simulate any Turing machine with
a computational complexity rivaling that of a universal Turing machine. Table 1 summarizes all
the models we have discussed in this work.
Model                    | Hidden state                 | Visible state                | Complexity
Hidden-visible bipartite | fixed                        | fixed                        | bidirectional mapping
Temporal bipartite       | changed by recurrent context | changed by recurrent context | a recurrent model
Thinking model           | changed by focuses           | contents follow the focuses  | a universal Turing machine

Table 1: Summary of the models discussed in this paper.

To make an intelligent machine with generative capability, one essentially requires a decent
way to internally represent the world. Deep learning and techniques such as sparse representation
were particularly designed to address this requirement. In this paper, we present another important
factor that allows newly generated data to be guaranteed "relevant", at least to all the
data accumulated during the course of the machine's execution. Internal consistency is not merely a
hypothetical trait of intelligent machines, like us, when they manipulate data; we believe that when
we can factorize knowledge into parts, identify similarity and distinction in its information, and
combine it with others to form a new idea, that is when we truly understand it. This conviction serves as
the very motive of this work.

References
[1] George Boole. An investigation of the laws of thought: on which are founded the mathematical
theories of logic and probabilities. Dover Publications, 1854.
[2] Alan M Turing. Computing machinery and intelligence. Mind, pages 433–460, 1950.
[3] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep
belief nets. Neural computation, 18(7):1527–1554, 2006.
[4] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine
Learning, 2(1):1–127, 2009.

[5] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks,
61:85–117, 2015.
[6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[7] Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and
Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recogni-
tion. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009.
[8] Fred C Hennie and Richard Edwin Stearns. Two-tape simulation of multitape turing machines.
Journal of the ACM (JACM), 13(4):533–546, 1966.
[9] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques.
MIT press, 2009.
[10] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications.
Neural networks, 13(4):411–430, 2000.
[11] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Man-
zagol. Stacked denoising autoencoders: Learning useful representations in a deep network
with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408,
2010.
[12] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions.
arXiv preprint arXiv:1409.4842, 2014.
[13] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style.
arXiv preprint arXiv:1508.06576, 2015.
[14] Shinji Nishimoto, An T Vu, Thomas Naselaris, Yuval Benjamini, Bin Yu, and Jack L Gal-
lant. Reconstructing visual experiences from brain activity evoked by natural movies. Current
Biology, 21(19):1641–1646, 2011.
[15] Quoc V Le, Alexandre Karpenko, Jiquan Ngiam, and Andrew Y Ng. Ica with reconstruction
cost for efficient overcomplete feature learning. In Advances in Neural Information Processing
Systems, pages 1017–1025, 2011.
[16] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of mathemat-
ical biology, 15(3):267–273, 1982.
[17] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann ma-
chines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10),
pages 807–814, 2010.
[18] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of
control, signals and systems, 2(4):303–314, 1989.
[19] Les E Atlas, Toshiteru Homma, and Robert J Marks II. An artificial neural network for spatio-
temporal bipolar patterns: Application to phoneme classification. In Proc. Neural Information
Processing Systems (NIPS), page 31, 1988.
[20] Graham W Taylor, Geoffrey E Hinton, and Sam T Roweis. Modeling human motion using
binary latent variables. In Advances in neural information processing systems, pages 1345–
1352, 2006.
[21] Isaac Newton. The principia: mathematical principles of natural philosophy. Univ of Califor-
nia Press, 1999.
[22] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[23] AJ Robinson and Frank Fallside. The utility driven dynamic error propagation network. Uni-
versity of Cambridge Department of Engineering, 1987.
[24] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1.
MIT press Cambridge, 1998.
[25] Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines. arXiv
preprint arXiv:1505.00521, 2015.

[26] Jérémie Cabessa and Hava T Siegelmann. The computational power of interactive recurrent
neural networks. Neural Computation, 24(4):996–1019, 2012.
[27] Hava T Siegelmann and Eduardo D Sontag. Turing computability with neural nets. Applied
Mathematics Letters, 4(6):77–80, 1991.
[28] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual predic-
tion with lstm. Neural computation, 12(10):2451–2471, 2000.

