Supervised Neural Networks for the Classification of Structures
I. INTRODUCTION
…mains, are given. Specifically, we propose a generalization of a recurrent neuron, the generalized recursive neuron, which is able to build a map from a domain of structures to the set of reals. This newly defined neuron allows the formalization of several supervised models for structures which stem very naturally from well-known models, such as backpropagation through time networks, real-time recurrent networks, simple recurrent networks, recurrent cascade-correlation networks, and neural trees.

This paper is organized as follows. In Section II we introduce structured domains and some preliminary concepts on graphs and neural networks. The general problem of how to encode labeled graphs for classification is discussed in Section III. The generalized recursive neurons are defined in Section IV, where some related concepts are discussed. Several supervised learning algorithms, derived from standard and well-known learning techniques, are presented in Section V. Simulation results for a subset of the proposed algorithms are reported in Section VI, and conclusions are drawn in Section VII.

Fig. 2. Chemical structures represented as trees.

A. Preliminaries

Graphs: We consider finite directed vertex-labeled graphs without multiple edges. For a set of labels I, a graph G (over I) is specified by a finite set V of vertices, a set E ⊆ V × V of ordered couples (called the set of edges), and a function l from V to I (called the labeling function). Note that graphs may have loops, and labels are not restricted to being binary. Specifically, labels may also be real-valued vectors. A graph G' = (V', E', l') (over I) is a subgraph of G if V' ⊆ V, E' ⊆ E, and l' is the restriction of l to V'. For a finite set S, #S denotes its cardinality. Given a graph G and any vertex v ∈ V, the function outdegree(v) returns the number of edges leaving from v, i.e., #{(v, u) ∈ E | u ∈ V}, while the function indegree(v) returns the number of edges entering v, i.e., #{(u, v) ∈ E | u ∈ V}. Given a total order on the edges leaving from v, the vertex out_i(v) is the vertex pointed by the ith pointer leaving from v. The valence of a graph is defined as max_{v ∈ V} outdegree(v). A labeled directed acyclic graph (labeled DAG) is a graph, as defined above, without loops. A vertex s ∈ V is called a supersource for G if every vertex in G can be reached by a path starting from s. The root of a tree (which is a special case of a directed graph) is always the (unique) supersource of the tree.

Structured Domain, Target Function, and Training Set: We define a structured domain D (over I) as any (possibly infinite) set of graphs (over I). The valence of a domain D is defined as the maximum among the out-degrees of the graphs belonging to D. Since we are dealing with learning, we need to define the target function we want to learn. In approximation tasks, a target function over D is defined as any function τ: D → R^k, where k is the output dimension, while in (binary) classification tasks we have τ: D → {0, 1} (or {-1, +1}). A training set on a domain D is defined as a set of couples (G, τ(G)), where G ∈ D and τ is a target function defined on D.

Standard and Recurrent Neurons: The output of a standard neuron is given by a nonlinear function (typically a sigmoid) applied to the weighted sum of its inputs.
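To make the graph-theoretic definitions above concrete, the following minimal Python sketch (not taken from the paper; the class and method names are chosen here for illustration only) represents a labeled directed graph with ordered outgoing edges and implements the out-degree, the valence, and the supersource test.

from collections import deque

class Graph:
    """A finite directed vertex-labeled graph with ordered outgoing edges."""
    def __init__(self):
        self.label = {}     # vertex -> label (a symbol or a real-valued vector)
        self.children = {}  # vertex -> ordered list of pointed vertices (edges)

    def add_vertex(self, v, label):
        self.label[v] = label
        self.children.setdefault(v, [])

    def add_edge(self, v, u):
        self.children[v].append(u)   # the position in the list is the pointer index

    def outdegree(self, v):
        return len(self.children[v])

    def valence(self):
        return max((self.outdegree(v) for v in self.label), default=0)

    def is_supersource(self, s):
        """True if every vertex can be reached by a path starting from s."""
        seen, todo = {s}, deque([s])
        while todo:
            v = todo.popleft()
            for u in self.children[v]:
                if u not in seen:
                    seen.add(u)
                    todo.append(u)
        return len(seen) == len(self.label)

For a tree built this way, is_supersource(root) returns True, in agreement with the remark that the root of a tree is always its unique supersource.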
Fig. 3. Bubble chamber events: (a) coded events and (b) corresponding tree representation.
Fig. 5. Classification of graphs by neural networks. The input graph is encoded as a fixed-size vector which is given as input to a feedforward neural
network for classification.
…valence one; in that case, the position of a vertex within the list corresponds to the time of processing.
Fig. 6. Neuron models for different input domains. The standard neuron is suited for the processing of unstructured patterns, the recurrent neuron for the
processing of sequences of patterns, and finally the proposed generalized recursive neuron can deal very naturally with structured patterns.
Fig. 7. The encoding network for an acyclic graph. The graph is represented by the output of the encoding network.
Supersource: the graph must have a reference supersource. Notice that if the graph does not have a supersource, it is still possible to define a convention for adding to the graph an extra vertex s (with a minimal number of outgoing edges), such that s is a supersource for the new graph (see Appendix A for an algorithm).

If the above conditions are satisfied, we can adopt the convention that a graph is represented by the output of the generalized recursive neuron computed for the supersource s. Consequently, due to the recursive nature of (3), it follows that the neural representation for an acyclic graph is computed by a feedforward network (the encoding network) obtained by replicating the same generalized recursive neuron and connecting these copies according to the topology of the structure (see Fig. 7). If the structure contains cycles, then the resulting encoding network is recurrent (see Fig. 8), and the neural representation is considered to be well formed only if it converges to a stationary value.

The encoding network fully describes how the representation of the structure is computed, and it will be used in the following to derive the learning rules for the generalized recursive neurons.

When considering a structured domain, the number of groups of recursive connections of the generalized recursive neuron must be equal to the valence of the domain. The extension to a set of generalized neurons is trivial: if the valence of the domain is v, each generalized neuron will have v groups of recursive connections, each group collecting the outputs of all the generalized neurons computed on the corresponding pointed vertex.
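Equation (3) is only cited by number in this excerpt; the sketch below therefore assumes a common formulation of the generalized recursive neuron, in which the representation of a vertex is a squashed affine combination of its label and of the representations of the pointed vertices. It is meant only to illustrate how the encoding network of an acyclic graph amounts to replicating the same weights over the vertices and wiring the copies according to the topology of the structure (the Graph sketch given earlier is assumed).

import numpy as np

def encode_dag(graph, W, W_rec, f=np.tanh):
    """Reduced representation of a DAG: one copy of the generalized
    recursive neurons per vertex, connected according to the topology.

    graph.label[v]    : label vector of vertex v (length n_label)
    graph.children[v] : ordered children of v (at most len(W_rec) of them)
    W                 : (m, n_label) label weights, shared by all copies
    W_rec             : list of (m, m) recursive weight matrices, one per pointer
    Returns the output of the copy associated with the supersource.
    """
    memo = {}

    def x(v):  # representation of the sub-DAG rooted at v
        if v not in memo:
            net = W @ np.asarray(graph.label[v], dtype=float)
            for k, child in enumerate(graph.children[v]):
                net += W_rec[k] @ x(child)   # missing pointers contribute nothing here
            memo[v] = f(net)
        return memo[v]

    supersource = next(s for s in graph.label if graph.is_supersource(s))
    return x(supersource)

The memoized recursion visits every vertex once, which is exactly the feedforward encoding network obtained by unrolling the same unit over the DAG.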
Fig. 8. The encoding network for a cyclic graph. In this case, the encoding network is recurrent and the graph is represented by the output of the encoding network at a fixed point of the network dynamics.

Fig. 9. Optimization of the training set: the set of structures (in this case, trees) is transformed into the corresponding minimal DAG, which is then used to generate the sorted training set. The sorted training set is then transformed into a set of sorted vectors using the numeric codes for the labels and targets, and used as a training set for the network.
B. Optimized Training Set for Directed Acyclic Graphs

When considering DAG's, the training set can be organized so as to improve the computational efficiency of both the computation of the reduced representations and the learning rules. In fact, given a training set of DAG's, if there are graphs which share a common subgraph, then we need to represent that subgraph explicitly only once. The optimization of the training set can be performed in two stages (see Fig. 9 for an example).

1) All the DAG's in the training set are merged into a single minimal DAG, i.e., a DAG with a minimal number of vertices.
2) A topological sort on the vertices of the minimal DAG is performed to determine the updating order of the vertices for the network.

Both stages can be done in linear time with respect to the size of all the DAG's and the size of the minimal DAG, respectively.² Specifically, stage 1) can be done by removing all the duplicate subgraphs through a special subgraph-indexing mechanism (which can be implemented in linear time).

The advantage of having a sorted training set is that all the reduced representations (and also their derivatives with respect to the weights, as we will see when considering learning) can be computed by a single ordered scan of the training set. Moreover, the use of the minimal DAG leads to a considerable reduction in space complexity. In some cases, such reduction can even be exponential.

² This analysis was done by C. Goller.
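A minimal sketch of the two-stage optimization for a training set of trees, such as those of Fig. 9 (the subgraph-indexing mechanism mentioned above is not reproduced; here a simple dictionary keyed on a vertex label and the identities of its children is enough to merge duplicated subtrees):

def optimize_training_set(trees):
    """Merge a list of trees into a minimal DAG and return its vertices
    sorted so that every vertex appears after all of its children.

    Each tree is a nested tuple: (label, child_tree, child_tree, ...).
    Returns (nodes, roots) where nodes[i] = (label, tuple_of_child_indices)
    and roots[t] is the index of the root of the t-th tree.
    """
    index = {}   # (label, child indices) -> vertex index in the minimal DAG
    nodes = []   # vertex index -> (label, tuple_of_child_indices)

    def insert(tree):
        label, children = tree[0], tree[1:]
        key = (label, tuple(insert(c) for c in children))
        if key not in index:          # shared subtrees are represented only once
            index[key] = len(nodes)
            nodes.append(key)
        return index[key]

    roots = [insert(t) for t in trees]
    # Children are always inserted before their parents, so `nodes` is already
    # sorted: a single ordered scan computes all reduced representations bottom-up.
    return nodes, roots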
C. Well-Formedness of Neural Representations

When considering cyclic graphs, to guarantee that each encoded graph gets a proper representation through the encoding network, we have to guarantee that for every initial state the trajectory of the encoding network converges to an equilibrium; otherwise it would be impossible to process a nonstationary representation. This is particularly important when considering cyclic graphs. In fact, acyclic graphs are guaranteed to get a convergent representation because the resulting encoding network is feedforward, while cyclic graphs are encoded by using a recurrent network. Consequently, the well-formedness of representations can be obtained by defining conditions that guarantee the convergence of the encoding network. Regarding this issue, it is well known that if the weight matrix is symmetric, an additive network with first-order connections possesses a Lyapunov function and is convergent ([5], [15]). Moreover, Almeida [1] proved a more general symmetry condition than the symmetry of the weight matrix: a system satisfying the detailed balance condition (5) is also guaranteed to converge.
Case I (DAG's): This case has been treated by Goller and Küchler in [12]. Since the training set contains only DAG's, the gradient can be computed by backpropagating the error from the feedforward network through the encoding network of each structure. As in backpropagation through time, the gradient contributions of corresponding copies of the same weight are collected for each structure. The total amount is then used to change all the copies of the same weight. If learning is performed by structure, then the weights are updated after the presentation of each individual structure; otherwise, the gradient contributions are collected through the whole training set and the weights are changed after all the structures in the training set have been presented to the network.

Case II (Cyclic Graphs): When considering cyclic graphs, the gradient can only be computed by resorting to recurrent backpropagation. In fact, if the input structure contains cycles, then the resulting encoding network is cyclic.

In the standard formulation, a recurrent backpropagation network with n units is defined as

(10)

where I is the input vector for the network and W the weight matrix. The learning rule for a weight of the network is given by

(11)

where all the quantities are taken at a fixed point of the recurrent network, the error term measures the difference between the current output of each output unit and the desired one (a Kronecker delta selects the output units), and the quantity

(12)

can be computed by relaxing the adjoint network, i.e., a network obtained from the original one by reversing the direction of the connections. The weight from neuron j to neuron i is replaced by the weight from neuron i to neuron j in the adjoint network. The activation functions of the adjoint network are linear, and the output units of the original network become input units of the adjoint network, with the error as input.

Given a cyclic graph G, let m be the number of generalized neurons and x(v, t) the output at time t of these neurons for vertex v. Then we can define

(13)

as the global state vector for our recurrent network, where, by convention, the first block is the output of the neurons representing the supersource of G.

To account for the labels, we have to slightly modify (10)

(14)

where

(15)

with

(16)

containing repetitions of the recursive weight matrices, arranged according to the topology of the graph.

In this context, the input of the adjoint network is the error vector, and the learning rules become

(17)

(18)

To hold the constraint on the weights, all the changes referring to the same weight are added and then all copies of the same weight are changed by the total amount. Note that each structure gives rise to a new adjoint network and independently contributes to the variations of the weights. Moreover, the above formulation is more general than the one previously discussed, and it can also be used for DAG's, in which case the adjoint network becomes a feedforward network representing the backpropagation of errors.
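Equations (10) through (12) are cited here only by number; as a generic point of reference (this is the textbook Pineda-Almeida scheme, not necessarily the paper's exact notation), standard recurrent backpropagation can be sketched as a relaxation of the network to a fixed point followed by a relaxation of the linear adjoint network obtained by reversing the connections and injecting the error at the output units:

import numpy as np

def recurrent_backprop_step(W, I, target, out_mask, lr=0.1, iters=200):
    """One weight update of standard recurrent backpropagation (sketch).
    W        : (n, n) weight matrix of the recurrent network
    I        : (n,) external input vector
    target   : (n,) desired outputs (only entries with out_mask = 1 are used)
    out_mask : (n,) 1 for output units, 0 otherwise
    """
    n = len(I)
    # 1) Relax the network to a fixed point x = tanh(W x + I).
    x = np.zeros(n)
    for _ in range(iters):
        x = np.tanh(W @ x + I)
    fprime = 1.0 - x**2                 # derivative of tanh at the fixed point
    e = out_mask * (target - x)         # error, injected only at the output units

    # 2) Relax the adjoint network: linear units, reversed connections,
    #    error as input:  y = f'(net) * (W^T y + e).
    y = np.zeros(n)
    for _ in range(iters):
        y = fprime * (W.T @ y + e)

    # 3) Gradient-descent update taken at the fixed point.
    W += lr * np.outer(y, x)
    return W, x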
An important variation to the learning rules presented above is constituted by teacher forcing, a technique proposed by Pineda [20] to force the creation of new fixed points in standard recurrent networks. While the effect of this technique is well understood in standard recurrent networks, it is not clear to us how it can influence the dynamics of the encoding networks. In fact, while a new encoding network is generated for each graph in the training set, all the encoding networks are interdependent since the weights on the connections are shared. A more accurate study of how this sharing of resources affects the dynamics of the system when using teacher forcing is needed.

B. Extension of Real-Time Recurrent Learning

The extension of real-time recurrent learning [34] to generalized recursive neurons does not present particular problems when considering graphs without cycles. Cyclic graphs, instead, give rise to a training algorithm that can be considered only loosely in real time.

In order to be concise, we will only show how the derivatives can be computed in real time, leaving to the reader the development of the learning rules according to the chosen network architecture and error function.

Case I (DAG's): Let m be the number of generalized neurons, s the supersource of G, and

(19)

be the representation of G according to the encoding network, where

(20)
and v is the valence of the domain. The derivatives of the representation with respect to the label weights and to the recursive weights can be computed from (19)

(21)

(22)

where δ is the Kronecker delta and the remaining factor is the corresponding row of the recursive weight matrix.

Note that both the strongly connected components of a graph G and its component graph SCC(G) can be computed by using two depth-first searches in time O(#V + #E) [6]. Moreover, a strongly connected component of G corresponds to a maximal subset of interconnected units in the encoding network for G, while the component graph describes the functional dependences between these subsets of units. Thus, the computation of the derivatives in real time can be done by observing the following.

1) Equations (21) and (22) are still valid.
2) Each strongly connected component of the graph corresponds to a set of interdependent equations.
3) The component graph defines the dependences between the sets of equations.
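A sketch of the computation of the strongly connected components and of the component graph by two depth-first searches (Kosaraju's variant of the algorithm referenced above); the components are numbered so that every edge of SCC(G) goes from a lower to a higher index, which is exactly the order in which the sets of interdependent derivative equations can be solved:

def scc_and_component_graph(children):
    """children: dict vertex -> list of successors (every vertex is a key).
    Returns (comp, comp_edges): comp[v] is the component index of v, and
    comp_edges is the edge set of the acyclic component graph SCC(G).
    Edges of SCC(G) always go from a lower to a higher component index."""
    order, seen = [], set()

    def dfs1(v):                      # first depth-first search: finishing order
        seen.add(v)
        for u in children[v]:
            if u not in seen:
                dfs1(u)
        order.append(v)

    for v in children:
        if v not in seen:
            dfs1(v)

    reverse = {v: [] for v in children}
    for v in children:
        for u in children[v]:
            reverse[u].append(v)

    comp = {}
    def dfs2(v, c):                   # second search, on the reversed graph
        comp[v] = c
        for u in reverse[v]:
            if u not in comp:
                dfs2(u, c)

    c = 0
    for v in reversed(order):         # decreasing finishing time
        if v not in comp:
            dfs2(v, c)
            c += 1

    comp_edges = {(comp[v], comp[u])
                  for v in children for u in children[v] if comp[v] != comp[u]}
    return comp, comp_edges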
For example, let us consider a single generalized recursive neuron and the graph G shown in Fig. 12(a).
Fig. 12. (a) A labeled directed graph G. The strongly connected components of G are shown as shaded regions. (b) The acyclic component graph SCC(G) obtained by shrinking each strongly connected component of G to a single vertex.

Fig. 13. LRAAM-based networks for the classification of structures.
In this type of network (see Fig. 13), the first part of the network is constituted by an LRAAM (note the double arrows on the connections), whose task is to devise a compressed representation for each structure. This compressed representation is obtained by using the standard learning algorithm for the LRAAM [32]. The classification task is then performed in the second part of the network, through a multilayer feedforward network with one or more hidden layers or through a simple sigmoidal neuron (network B).

Several options for training these networks are available. The options can be characterized by the proportion between the two different learning rates (for the classifier and for the LRAAM) and by the different degrees, x and y, of the following two basic features.

• The training of the classifier is not started until x percent of the training set has been correctly encoded and subsequently decoded by the LRAAM.
• The error from the classifier is backpropagated across y levels of the structures encoded by the LRAAM.³

Note that, even if the training of the classifier is started only when all the structures in the training set are properly encoded and decoded, the classifier's error can still change the compressed representations, which, however, are maintained consistent⁴ by learning in the LRAAM.

The reason for allowing different degrees of interaction between the classification and the representation tasks is the necessity of having different degrees of adaptation of the compressed representations to the requirements of the classification task. If no interaction at all is allowed, i.e., the LRAAM is trained first and then its weights are frozen, the compressed representations will be such that similar representations correspond to similar structures, while if full interaction is allowed, i.e., the LRAAM and the classifier are trained simultaneously, the compressed representations will be such that structures in the same class get very similar representations.⁵ On the other hand, when considering DAG's, by setting y equal to the number of vertices traversed by the longest path in the structures, the classifier error will be backpropagated across the whole structure, thus implementing the backpropagation through structure defined in Section V-A.

It is interesting to note that the SRN by Elman can be obtained as a special case of network B. In fact, when considering network B (with x = 0 and y = 1) for the classification of lists (sequences), the same architecture is obtained, with the difference that there are connections from the hidden layer of the LRAAM back to the input layer,⁶ i.e., the decoding part of the LRAAM. Thus, when considering lists, the only difference between an SRN and network B is in the unsupervised learning performed by the LRAAM. However, when forcing the learning parameters for the LRAAM to be null, we obtain the same learning algorithm as in the SRN. Consequently, we can claim that the SRN is a special case of network B. This can be better understood by looking at the right-hand side of Fig. 14, which shows network B in terms of elements of an SRN.

³ The backpropagation of the error across several levels of the structures can be implemented by unfolding the encoder of the LRAAM (the set of weights from the input to the hidden layer) according to the topology of the structures.
⁴ A consistent compressed representation is a representation of a structure which contains all the information sufficient for the reconstruction of the whole structure.
⁵ Moreover, in this case, there is no guarantee that the LRAAM will be able to encode and to decode consistently all the structures in the training set, since training is stopped when the classification task has been performed correctly.
⁶ The output layer of the LRAAM can be considered as being the same as the input layer.
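The following training skeleton is only illustrative: train_lraam_epoch, train_classifier_epoch, and encode_decode_accuracy are hypothetical placeholders, not procedures defined in the paper. It shows how the two controls discussed above interact: the classifier is started only after x percent of the training set is correctly encoded and decoded, and its error is backpropagated across y levels of the encoder.

# Hypothetical stubs so that the skeleton runs as-is.
def train_lraam_epoch(training_set, lr):
    pass

def train_classifier_epoch(training_set, lr, unfold_levels):
    pass

def encode_decode_accuracy(training_set):
    return 100.0

def train_lraam_classifier(training_set, x=100.0, y=1, epochs=1000,
                           lr_lraam=0.1, lr_classifier=0.1):
    """Training skeleton for an LRAAM-based classifier (illustrative only).
    x : percentage of structures that must be correctly encoded/decoded
        before the classifier starts learning.
    y : number of levels of the structures across which the classifier
        error is backpropagated (by unfolding the LRAAM encoder).
    """
    classifier_started = False
    for epoch in range(epochs):
        # Unsupervised part: one epoch of standard LRAAM learning, which keeps
        # the compressed representations decodable (consistent).
        train_lraam_epoch(training_set, lr=lr_lraam)

        if not classifier_started:
            classifier_started = encode_decode_accuracy(training_set) >= x
            continue

        # Supervised part: the classifier error is propagated into the encoder
        # for y levels of each structure.
        train_classifier_epoch(training_set, lr=lr_classifier, unfold_levels=y)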
Fig. 14. (a) The network B, with x = 0 and y = 1, can be considered as (b) a generalization of the simple recurrent network by Elman.
Of course, the copy function for network B is not as simple as the one used in an SRN, since the correct relationships among the components of the structures to be classified must be preserved.⁷

(24)

where the weights of the first group connect the hidden unit to the outputs of the hidden units computed on the components pointed by the current vertex, and the weights of the second group connect it to the (frozen) hidden units computed on the current vertex. The output of the output neuron is computed as

(25)

(28)

where f' is the derivative of the activation function f. The above equations are recurrent on the structures and can be computed by observing that, for a leaf vertex, (26) reduces to its nonrecursive part, and all the remaining derivatives are null. Consequently, we only need to store the output values of the unit and its derivatives for each component of a structure. Fig. 15 shows the evolution of a network with two pointer fields.⁸

Note that, if the hidden units have self-recurrent connections only, the matrix defined by the recursive weights is not triangular, but diagonal.

⁷ The copy function needs a stack for the memorization of compressed representations. The control signals for the stack are defined by the encoding-decoding task.
⁸ Since the maximization of the correlation is obtained using a gradient ascent technique on a surface with several maxima, a pool of hidden units is trained and the best one selected.
Fig. 15. The evolution of a network with two pointer fields. The units in the label are fully connected with the hidden units and the output unit.
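Equations (23) through (28) are not reproduced in this excerpt; the sketch below assumes a typical form for a candidate hidden unit of cascade-correlation for structures with two pointer fields, as in Fig. 15: the unit reads the label of the current vertex, the outputs of the already frozen hidden units on the same vertex, and its own outputs recursively computed on the two pointed subtrees.

import numpy as np

def hidden_unit_output(vertex, tree, frozen_outputs, w_label, w_frozen, w_rec,
                       f=np.tanh):
    """Output of a (new) recursive hidden unit on a binary-tree vertex.
    tree.label[v], tree.children[v] : label and (up to two) pointed vertices
    frozen_outputs[j][v]            : output of the j-th frozen unit on vertex v
    w_label  : weights for the label of the current vertex
    w_frozen : weights for the frozen hidden units (same vertex)
    w_rec    : two self-recursive weights, one per pointer field
    For a leaf vertex the recursive contributions vanish, which is the base
    case of the recurrence on the structure.
    """
    net = np.dot(w_label, tree.label[vertex])
    net += sum(w_frozen[j] * frozen_outputs[j][vertex]
               for j in range(len(w_frozen)))
    for k, child in enumerate(tree.children[vertex]):
        net += w_rec[k] * hidden_unit_output(child, tree, frozen_outputs,
                                             w_label, w_frozen, w_rec, f)
    return f(net)

With self-recurrent connections only, as in this sketch, the recursive weight matrix is diagonal; a triangular matrix would in addition feed each new unit with the outputs of the frozen units computed on the pointed vertices.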
Fig. 17. The learning and classification algorithms for binary neural trees.
TABLE I
DESCRIPTION OF A SET OF CLASSIFICATION PROBLEMS INVOLVING LOGIC TERMS
…that the total number of subterms for the training set will be equal to the number of vertices of the minimal DAG obtained after the optimization of the training set. Finally, the maximal depth of the terms, both in the training and test sets, is six.

For each problem, about the same number of positive and negative examples is given. Both positive and negative examples were generated randomly. Training and test sets are disjoint and were generated by the same algorithm. Some examples of logic terms represented as trees are shown in Figs. 18–20.

Note that the set of proposed problems ranges from the detection of a particular atom (label) in a term to the satisfaction of a specific unification pattern. Specifically, in the unification patterns for the problems inst1 and inst1-long, the same variable occurs twice, making these problems much more difficult than inst4-long, because any classifier for these problems would have to compare arbitrary subterms corresponding to that variable.

As a final remark, it must be pointed out that the number of training examples is a very small fraction of the total number of terms characterizing each problem. In fact, the total number of distinct terms belonging to a problem is determined both by the number and arity of the symbols and by the maximal depth allowed for the terms.
Fig. 18. Samples of trees representing terms for problem inst4-long. The positive terms are shown on the left, while negative terms are shown on the right. The vertices and edges characterizing the positive terms are drawn in bold.

Fig. 19. Samples of trees representing terms for problem inst1-long. The positive term is shown on the left, while the negative term is shown on the right.
Making some calculations, it is not difficult to compute that the total number of terms allowed for problem inst1 is equal to 21 465. This number, however, grows exponentially with the maximal depth; e.g., if the maximal depth is set to four, the total number of distinct terms is already several orders of magnitude larger. This gives an idea of how small the training set for problem inst1-long is, where the maximal depth is six.
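The growth of the number of distinct terms follows a simple recurrence on the maximal depth: terms of depth at most d are obtained by applying each symbol to terms of depth at most d-1. The signature used below is hypothetical (it is not the signature of inst1), so the figures it prints only illustrate the exponential growth.

def count_terms(arities, max_depth):
    """Number of distinct ground terms of depth at most max_depth
    over a signature given as {symbol: arity} (arity 0 = constant)."""
    n = 0
    for _ in range(max_depth):
        # After the i-th iteration, n counts the terms of depth <= i
        # (constants contribute n**0 = 1 at every step).
        n = sum(n ** k for k in arities.values())
    return n

# Hypothetical signature: three constants, one unary and two binary symbols.
signature = {"a": 0, "b": 0, "c": 0, "f": 1, "g": 2, "h": 2}
for d in range(1, 7):
    print(d, count_terms(signature, d))   # the count explodes with the depth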
B. LRAAM-Based Networks

Table III reports the best results we obtained for the majority of the problems described in Table I, over four different network settings (both in the number of hidden units for the LRAAM and in the learning parameters), for the LRAAM-based network with a single unit as classifier. The simulations were stopped after 30 000 epochs (80 000 epochs for problem inst1-long), or earlier when the classification problem over the training set had been completely solved. We made no extended effort to optimize the size of the network and the learning parameters, thus it should be possible to improve on the reported results. The first column of the table shows the name of the problem, the second the number of units used to represent the labels, the third the number of hidden units, the fourth the learning parameters (the learning parameter for the LRAAM, the learning parameter for the classifier, and the momentum for the LRAAM), the fifth the percentage of terms in the training set which the LRAAM was able to properly encode and decode, the sixth the percentage of terms in the training set correctly classified, the seventh the percentage of terms in the test set correctly classified, and the eighth the number of epochs the network employed to reach the reported performances.

The results highlight that some problems get a very satisfactory solution even if the LRAAM performs poorly. Moreover, this behavior does not seem to be related to the complexity of the classification problem, since both problems involving the simple detection of an atom (label) in the terms…
Fig. 20. Samples of trees representing terms for problem termocc1-very-long. The positive term is shown on the left, while the negative term is shown on the right.
TABLE III
THE BEST RESULTS OBTAINED FOR SOME OF THE CLASSIFICATION PROBLEMS BY AN LRAAM-BASED NETWORK
…distinct reduced representations, since they must be decoded into different terms.

The above considerations on the final representations for the terms are valid only if the LRAAM reaches a good encoding-decoding performance on the training set. However, as reported in Table III, some classification problems can be solved even if the LRAAM performs poorly. In this case, the reduced representations contain almost exclusively information about the classification task. Figs. 26 and 27 report the results of a principal components analysis of the representations developed for the problems inst4-long and inst7, respectively. In the former, the first and second principal components suffice for a correct solution of the classification problem. In the latter, the second principal component alone gives enough information for the solution of the problem. Moreover, notice how the representations developed for inst7 are clustered with smaller variance than the representations developed for inst4-long, and how this is in accordance with the better performance in encoding-decoding of the latter with respect to the former. Of course, this does not constitute enough evidence to conclude that a relationship between the variance of the clusters and the performance of the LRAAM is demonstrated. However, it does seem to be enough to call for a more accurate study of this issue.
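An analysis of this kind can be reproduced in a few lines; the sketch below uses scikit-learn and assumes that the reduced representations have been collected into an array with one row per term. It projects the representations onto their first principal components, as in Figs. 26 and 27.

import numpy as np
from sklearn.decomposition import PCA

def project_representations(representations, labels, n_components=2):
    """representations: (n_terms, hidden_size) array of reduced representations.
    labels: (n_terms,) class of each term (e.g., 0/1).
    Returns the projections onto the first principal components, per class."""
    pca = PCA(n_components=n_components)
    proj = pca.fit_transform(np.asarray(representations))
    labels = np.asarray(labels)
    return {c: proj[labels == c] for c in set(labels.tolist())}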
C. Cascade-Correlation for Structures

The results obtained by cascade-correlation for structures, shown in Table IV, are obtained for a subset of the problems using a pool of eight units. The networks used have both triangular and diagonal recursive connection matrices and no connections between hidden units. We decided to remove the connections between hidden units to reduce the probability of overfitting.

We made no extended effort to optimize the learning parameters and the number of units in the pool, thus it should be possible to significantly improve on the reported results.
VII. CONCLUSION

We have proposed a generalization of the standard neuron, namely the generalized recursive neuron, for extending the computational capability of neural networks to the processing of structures. On the basis of the generalized recursive neuron, we have shown how several of the learning algorithms defined for standard neurons can be adapted to deal with structures. We believe that other learning procedures, besides those covered in this paper, can be adapted as well.

The proposed approach to learning in structured domains can be adopted for automatic inference in syntactic and structural pattern recognition [10], [13]. Specifically, in this paper we have demonstrated the possibility of performing classification tasks involving logic terms.
Fig. 24. The first, second, and third principal components of the reduced representations, devised by a basic LRAAM on the training and test sets of the inst1 problem, yield a nice three-dimensional view of the terms' representations.
Fig. 25. Results of the principal components analysis (first, second, and third principal components) of the reduced representations developed by the
proposed network (LRAAM + Classifier) for the inst1 problem.
Note that automatic inference can also be obtained by using inductive logic programming [17]. The proposed approach, however, has its own specific peculiarity, since it can approximate functions from a structured domain (which may have real-valued vectors as labels) to the reals. Specifically, we believe that the proposed approach can fruitfully be applied to molecular biology and chemistry (QSPR and QSAR), where it can be used for the automatic determination of topological indexes [14], [22], which are usually designed through a very expensive trial-and-error approach.

In conclusion, the proposed architectures extend the processing capabilities of neural networks, allowing the processing of structured patterns which can be of variable size and complexity.
Fig. 26. Results of the principal components analysis (first and second principal components) of the reduced representations developed by the proposed
network (LRAAM + Classifier) for the inst4-long problem. The resulting representations are clearly linearly separable.
Fig. 27. Results of the principal components analysis (first and second principal components) of the reduced representations developed by the proposed network (LRAAM + Classifier) for the inst7 problem. The resulting representations can be separated using only the second principal component.
However, some of the proposed architectures do nevertheless have computational limitations. For example, cascade-correlation for structures has computational limitations due to the fact that frozen hidden units cannot receive input from hidden units introduced after their insertion into the network. These limitations, in the context of standard recurrent cascade-correlation (RCC), have been discussed in [11], where it is demonstrated that certain finite state automata cannot be implemented by networks built by RCC using monotone activation functions. Since our algorithm reduces to standard RCC when considering sequences, it follows that it has limitations as well.
TABLE IV
RESULTS OBTAINED ON THE TEST SETS FOR SOME OF THE CLASSIFICATION PROBLEMS USING BOTH NETWORKS WITH TRIANGULAR AND
DIAGONAL RECURSIVE CONNECTION MATRICES. THE SAME NUMBER OF UNITS (8) IN THE POOL WAS USED FOR ALL NETWORKS
APPENDIX
ALGORITHM FOR THE ADDITION OF A SUPERSOURCE

In the following we define the algorithm for the addition of a supersource to a graph. The algorithm uses the concepts of strongly connected components of a graph and the related definition of a component graph (see Section V-B.2 for formal definitions).

A. Algorithm Supersource

Input: A graph G = (V, E).
Output: A graph G' = (V', E') with supersource s.
Begin
• Compute the component graph SCC(G) of G;
• Define the set Z of the vertices of SCC(G) with zero in-degree;
• Define for each vertex u ∈ Z the set S_u containing all the vertices (in G) within the strongly connected component corresponding to u;
• For each u ∈ Z define r_u = select(S_u), i.e., r_u is a vertex representing the strongly connected component corresponding to u;
• if #Z = 1 (i.e., Z = {u}) then r_u is the supersource, G' = G, and s = r_u;
else
• Create a new vertex s (the supersource) and set V' = V ∪ {s};
• Set E' = E ∪ {(s, r_u) | u ∈ Z}.
End.

In the algorithm above we left the procedure select() undefined, since it can be defined in different ways according to the complexity of the structured domain. For example, if the labels which appear in a graph are all distinct, then we can define a total order on I (the set of labels) and note that the set of vertices of S_u carrying the smallest label contains a single element, which can be used as representative (i.e., returned by select). If the same label can have multiple occurrences, we can restrict the set by using information on the out-degree of the vertices, and further on the in-degree of the vertices. If the set still contains more than one vertex, then it can be refined by resorting to more sophisticated criteria (see [31]).

Theorem 1: The algorithm Supersource applied to a graph G adds to G the minimum number of edges needed to create a supersource.

Proof: If #Z = 1, no edge is added by the algorithm Supersource. Otherwise, a new vertex s, the supersource, and a new edge for each strongly connected component corresponding to a vertex of SCC(G) with zero in-degree are created. The vertex s is a supersource since, by definition, all the vertices in G' can be reached by a path starting from s. Moreover, if one edge leaving from s is removed, say (s, r_u), the vertices belonging to the strongly connected component S_u cannot be reached from any other vertex, since u has zero in-degree in SCC(G). This demonstrates that all the added edges are needed.
LABELING RAAM

The labeling RAAM (LRAAM) [27], [28], [30] is an extension of the RAAM model [21] which allows one to encode…
Fig. 28. The result of algorithm Supersource applied to the graph shown in Fig. 12(a).

Fig. 29. The network for a general LRAAM. The first layer in the network implements an encoder; the second layer, the corresponding decoder.
REFERENCES

[2] A. Atiya, "Learning on a general network," in Neural Information Processing Systems, D. Z. Anderson, Ed. New York: AIP, 1988, pp. 22–30.
[3] L. Atlas et al., "A performance comparison of trained multilayer perceptrons and trained classification trees," Proc. IEEE, vol. 78, pp. 1614–1619, 1992.
[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth, 1984.
[5] M. A. Cohen and S. Grossberg, "Absolute stability of global pattern formation and parallel memory storage by competitive neural networks," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, pp. 815–826, 1983.
[6] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. Cambridge, MA: MIT Press, 1990.
[7] J. L. Elman, "Finding structure in time," Cognitive Sci., vol. 14, pp. 179–211, 1990.
[8] S. E. Fahlman, "The recurrent cascade-correlation architecture," Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-CS-91-100, 1991.
[9] S. E. Fahlman and C. Lebiere, "The cascade-correlation learning architecture," in Advances in Neural Information Processing Systems, vol. 2, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 524–532.
[10] K. S. Fu, Syntactical Pattern Recognition and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1982.
[11] C. L. Giles, D. Chen, G. Z. Sun, H. H. Chen, Y. C. Lee, and M. W. Goudreau, "Constructive learning of recurrent neural networks: Limitations of recurrent cascade correlation and a simple solution," IEEE Trans. Neural Networks, vol. 6, pp. 829–836, 1995.
[12] C. Goller and A. Küchler, Learning Task-Dependent Distributed Structure-Representations by Backpropagation Through Structure, Institut für Informatik, Technische Universität München, Germany, AR Rep. AR-95-02, 1995.
[13] R. C. Gonzalez and M. G. Thomason, Syntactical Pattern Recognition. Reading, MA: Addison-Wesley, 1978.
[14] L. H. Hall and L. B. Kier, "The molecular connectivity chi indexes and kappa shape indexes in structure-property modeling," in Reviews in Computational Chemistry. New York: VCH, 1991, ch. 9, pp. 367–422.
[15] J. J. Hopfield, "Neurons with graded response have collective computational properties like those of two-state neurons," in Proc. Nat. Academy Sci., 1984, pp. 3088–3092.
[16] T. Li, L. Fang, and A. Jennings, "Structurally adaptive self-organizing neural trees," in Proc. Int. Joint Conf. Neural Networks, 1992, pp. 329–334.
[17] S. Muggleton and L. De Raedt, "Inductive logic programming: Theory and methods," J. Logic Programming, vol. 19, no. 20, pp. 629–679, 1994.
[18] M. P. Perrone, "A soft-competitive splitting rule for adaptive tree-structured neural networks," in Proc. Int. Joint Conf. Neural Networks, 1992, pp. 689–693.
[19] M. P. Perrone and N. Intrator, "Unsupervised splitting rules for neural tree classifiers," in Proc. Int. Joint Conf. Neural Networks, 1992, pp. 820–825.
[20] F. J. Pineda, "Dynamics and architecture for neural computation," J. Complexity, vol. 4, pp. 216–245, 1988.
[21] J. B. Pollack, "Recursive distributed representations," Artificial Intell., vol. 46, nos. 1–2, pp. 77–106, 1990.
[22] D. H. Rouvray, Computational Chemical Graph Theory. New York: Nova Sci., 1990, p. 9.
[23] D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: MIT Press, 1986.
[24] A. Sankar and R. Mammone, "Neural tree networks," in Neural Networks: Theory and Applications. New York: Academic, 1991, pp. 281–302.
[25] I. K. Sethi, "Entropy nets: From decision trees to neural networks," Proc. IEEE, vol. 78, pp. 1605–1613, 1990.
[26] J. A. Sirat and J.-P. Nadal, "Neural trees: A new tool for classification," Network, vol. 1, pp. 423–438, 1990.
[27] A. Sperduti, "Encoding of labeled graphs by labeling RAAM," in Advances in Neural Information Processing Systems, vol. 6, J. D. Cowan, G. Tesauro, and J. Alspector, Eds. San Mateo, CA: Morgan Kaufmann, 1994, pp. 1125–1132.
[28] A. Sperduti, "Labeling RAAM," Connection Sci., vol. 6, no. 4, pp. 429–459, 1994.
[29] A. Sperduti, "Stability properties of labeling recursive autoassociative memory," IEEE Trans. Neural Networks, vol. 6, pp. 1452–1460, 1995.
[30] A. Sperduti and A. Starita, "An example of neural code: Neural trees implemented by LRAAM's," in Proc. Int. Conf. Neural Networks and Genetic Algorithms, Innsbruck, Austria, 1993, pp. 33–39.
[31] A. Sperduti and A. Starita, "Encoding of graphs for neural network processing," Dipartimento di Informatica, Università di Pisa, Italy, Tech. Rep., 1996.
[32] A. Sperduti, A. Starita, and C. Goller, "Fixed length representation of terms in hybrid reasoning systems, report I: Classification of ground terms," Dipartimento di Informatica, Università di Pisa, Italy, Tech. Rep. TR-19/94, 1994.
[33] A. Sperduti, A. Starita, and C. Goller, "Learning distributed representations for the classification of terms," in Proc. Int. Joint Conf. Artificial Intell., 1995, pp. 509–515.
[34] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computa., vol. 1, pp. 270–280, 1989.

Alessandro Sperduti received the "laurea" and Doctoral degrees in 1988 and 1993, respectively, both in computer science from the University of Pisa, Italy.
In 1993 he spent a period at the International Computer Science Institute, Berkeley, supported by a postdoctoral fellowship. In 1994 he returned to the Computer Science Department, University of Pisa, where he is presently Assistant Professor. His research interests include data sensory fusion, image processing, neural networks, and hybrid systems. In the field of hybrid systems his work has focused on the integration of symbolic and connectionist systems.
Dr. Sperduti has contributed to the organization of several workshops on this subject and has also served on the program committee of conferences on neural networks.

Antonina Starita (A'91–M'96) received the "laurea" degree in physics from the University of Naples, Italy, and the Doctoral degree in computer science from the University of Pisa, Italy.
She was a Research Fellow of the Italian National Council of Research at the Information Processing Institute of Pisa and then became a Researcher for the same institution. Since 1980 she has been Associate Professor at the Computer Science Department of the University of Pisa, where she is in charge of the two regular courses "Knowledge Acquisition and Expert Systems" and "Neural Networks." She has been active in the areas of biomedical signal processing, rehabilitation engineering, and biomedical applications of signal processing, and has been responsible for many research projects at the national and international level. Her current scientific interests have shifted to AI methodologies, robotics, neural networks, hybrid systems, and sensory integration in artificial systems.
Dr. Starita is a member of the Italian Medical and Biological Engineering Society, the International Medical and Biological Engineering Society, and the INNS (International Neural Network Society).