
Under review as a conference paper at ICLR 2019

FEATURE TRANSFORMERS: A UNIFIED REPRESENTATION LEARNING FRAMEWORK FOR LIFELONG LEARNING

Anonymous authors
Paper under double-blind review

ABSTRACT

Despite the recent advances in representation learning, lifelong learning continues to be one of the most challenging and unconquered problems. Catastrophic forgetting and data privacy constitute two of the important challenges for a successful lifelong learner. Further, existing techniques are designed to handle only specific manifestations of lifelong learning, whereas a practical lifelong learner is expected to switch and adapt seamlessly to different scenarios. In this paper, we present a single, unified mathematical framework for handling the myriad variants of lifelong learning, while alleviating these two challenges. We utilize an external memory to store only the features representing past data and learn richer and newer representations incrementally through transformation neural networks - feature transformers. We define, simulate and demonstrate exemplary performance on a realistic lifelong experimental setting using the MNIST rotations dataset, paving the way for practical lifelong learners. To illustrate the applicability of our method in data-sensitive domains like healthcare, we study the pneumothorax classification problem from X-ray images, achieving near gold-standard performance. We also benchmark our approach against a number of state-of-the-art methods on the MNIST rotations and iCIFAR100 datasets, demonstrating superior performance.

1 INTRODUCTION
Deep learning algorithms have achieved tremendous success on various challenging tasks like object detection, language translation, medical image segmentation, etc. Lifelong learning - the ability to adapt, benefit and sustain performance post deployment with more data and feedback - is an important goal for artificial intelligence (AI) and extremely crucial for the sustained utility of these algorithms in domains like healthcare (learning from rare cases), self-driving (learning new object detectors), etc. While there is no single standard definition of lifelong learning, most of the research in this field can be classified into one of the following sub-categories:
1. Incremental learning - encountering new data, with no change in distribution, for the same task
2. Domain adaptation - data from modified target distributions but for the same task
3. New-task learning - data from tasks that were not presented before
The majority of successful techniques study these variants in isolation. However, a realistic lifelong learning scenario would involve a mixture of these manifestations over time. Another major impediment to a successful lifelong learner is data privacy. In domains like healthcare, it is nearly impossible to have data access beyond scope (in both time and geography), which leads to catastrophic forgetting (McCloskey & Cohen, 1989) - the inability of machine learning models to retain past knowledge while learning from new data. In this paper, we provide a unified framework - feature transformers - for practical lifelong learning to handle data privacy and catastrophic forgetting.
In summary, the major contributions of our work are as follows:
• Define a realistic and challenging lifelong learning scenario and formulate a unique and generic mathematical framework - feature transformers - to work in this setting successfully.
• Ensure data privacy by storing only features from previous episodes while successfully combating catastrophic forgetting.
• Principled way of utilizing additional neural compute to tackle complex incremental tasks.
• Demonstrate state-of-the-art results on common continuous learning benchmarks for new-task learning like MNIST rotations and the iCIFAR dataset, and on a challenging healthcare problem.
• Demonstrate exemplary performance even under severe constraints on memory and compute, thus operating under the assumptions of practical lifelong learning.
Following a review of related work in section 2, we describe the mathematical formulation of feature transformers and their practical realization in section 3. Section 4 contains experiments and results, followed by a discussion in section 5.
2 RELATED WORK
A conventional deep learning classifier can be viewed as an automatic feature extractor followed by a classifier, trained jointly. For lifelong learning, the methods in the literature can be broadly classified into: 1) learning with fixed feature representations and 2) incrementally evolving representations.
Fixed Representations: In this class of approaches, a representation for the data is learnt from the initial task and remains frozen for ensuing tasks, while only the classification layers are modified. The simplest baseline approaches include fine-tuning (Oquab et al., 2014); Mensink et al. (2012) extended the idea of fixed representations further by using a nearest class mean classifier. In spite of their simplicity, these algorithms do not perform well in practice due to the limiting constraint of a fixed representation throughout the incremental learning phase.
Incrementally evolving representations: The last few years have witnessed renewed interest in developing methods that allow the data representation to change with the addition of new tasks. Naive methods of retraining with only new data suffer from catastrophic forgetting (McCloskey & Cohen, 1989). To overcome this problem, most methods have attempted different manifestations of rehearsal (Robins, 1995) - replaying data from the previous tasks. However, since access to all the previous data is usually not feasible due to size, compute and privacy concerns, researchers have attempted to reproduce the past through proxy information, an approach known as pseudorehearsal. Pseudorehearsal techniques can be further classified depending on how the previous information is stored/used:

• Knowledge distillation based approaches: An example method for lifelong learning using knowledge distillation (Hinton et al., 2015) is learning without forgetting (LwF) (Li & Hoiem, 2017), where a distillation loss was added to match the output of the updated network to that of the old network on the old task output variables. LwF and its extensions do not scale to a large number of new tasks, suffer from catastrophic forgetting and, importantly, do not address incremental learning or domain adaptation settings.
• Rehearsal using exemplar sets: Rebuffi et al. (2017) propose iCaRL, an incremental learning approach that stores an exemplar set of data from the previous tasks and augments it with the new data. In more recent works, Javed & Paracha (2018) and Castro et al. (2018) have argued that a decoupled nearest mean classifier from Rebuffi et al. (2017) is not essential and have proposed joint learning of feature extraction and classification. Lopez-Paz et al. (2017) propose Gradient Episodic Memory, a technique which stores data from previous classes and constrains the gradient update while learning new tasks.
• Rehearsal using generative models: Triki Rannen et al. (2017) propose to reproduce past knowledge by using task-specific under-complete autoencoders. When a new task is presented, optimization is constrained by preventing the reconstructions of these autoencoders from changing, thereby ensuring that the features necessary for the previous tasks are not destroyed. Shin et al. (2017); He et al. (2018); Venkatesan et al. (2017) employ generative adversarial networks for recreating the history from previous tasks.
• Network regularization strategies: Methods belonging to this class aim to identify and selectively modify parts of the neural network which are critical to remembering past knowledge, or to explicitly penalize loss of performance on old tasks. Kirkpatrick et al. (2017); Liu et al. (2018) use the Fisher information matrix to identify the weights most crucial for prediction of a given task and lower the learning rate for these weights. Lee et al. (2017) propose a dynamically expanding network to increase the capacity for new tasks if the previous architecture is insufficient to represent the data.


Our framework lies at the intersection of pseudorehearsal methods and progressive neural networks,
with scope for judiciously utilizing extra capacity, while resisting catastrophic forgetting.
3 LIFE-LONG LEARNING VIA FEATURE TRANSFORMATIONS
Before we present the feature transformer method, we introduce the terminology and notation. We consider training a deep neural network which classifies an input to one of the classes c ∈ [C] ≜ {1, 2, · · · , C}. We refer to the operation of classifying an input to a particular class as a task. To this end, the classifier is trained with a training dataset (X, Y), drawn from a joint distribution (𝒳, 𝒴). We view the deep neural network, defined by the parameters (θ, κ), as the composition of a feature extractor Φ_θ : 𝒳 → ℱ and a classifier Ψ_κ:

Ψ_κ ◦ Φ_θ : 𝒳 → [C],    (1)

where 𝒳 is the space of input data and ℱ is a space of low-dimensional feature vectors. We concisely denote the training of the neural network by TRAIN(θ, κ; D), which minimizes a loss function on training data D = (X, Y) and produces the network parameters (θ, κ). Let us also define the set of all computed features on the input, F = Φ_θ(X).¹

¹ Though Φ_θ is a mapping which acts on individual vectors, we abuse the notation here by using it with sets.
When the loss function only penalizes misclassification, the network is expected to learn only the class separation boundaries in the feature space. However, as we demonstrate experimentally, good separation of class-specific features enables stable learning of representations, which directly has a bearing on the performance of life-long learning. Therefore, in all our training procedures, we also use a feature loss which promotes feature separation:

model loss = classification loss(θ, κ) + λ · feature loss(θ).    (2)

In the lifelong learning context, we denote the time-varying aspect of the network, training data and the classes by using the time symbol t ∈ N ≜ {0, 1, 2, · · · } as a superscript on these objects. Realistically, at any time t > 0, the classifier encounters any of the following canonical situations:

1. The number of classes for the classifier remains the same as at t − 1. The model encounters new training data for a subset of classes, without change in their distribution: ∀t > 0, C^(t) = C^(t−1), T_1^(t) ⊆ [C^(t)], and ∀τ ∈ T_1^(t), (X_τ^(t), Y_τ^(t)) ∼ (𝒳_τ^(t), 𝒴_τ^(t)) = (𝒳_τ^(t−1), 𝒴_τ^(t−1)).
2. The number of classes for the classifier remains the same as at t − 1. The model encounters new training data from modified input distribution(s), for a subset of classes: ∀t > 0, C^(t) = C^(t−1), T_2^(t) ⊆ [C^(t)], and ∀τ ∈ T_2^(t), (X_τ^(t), Y_τ^(t)) ∼ (𝒳_τ^(t), 𝒴_τ^(t)) ≠ (𝒳_τ^(t−1), 𝒴_τ^(t−1)).
3. The model encounters new class(es) and corresponding new data: ∀t > 0, C^(t) > C^(t−1), T_3^(t) = [C^(t)] \ [C^(t−1)], and ∀τ ∈ T_3^(t), (X_τ^(t), Y_τ^(t)) ∼ (𝒳_τ^(t), 𝒴_τ^(t)).

In the most generic scenario, a combination of all three situations can occur at any time index t, with training data available for the classes in the set T^(t) ≜ T_1^(t) ∪ T_2^(t) ∪ T_3^(t). However, it is important to note that at every index t, the classifier is trained to classify all the C^(t) classes.
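To make the notation concrete, here is an illustrative sketch (not from the paper; all names and values are hypothetical) of how one episode at time t, mixing the three situations, could be represented:

```python
# Hypothetical episode at time t: the classifier already knows C^(t) = 10 classes.
episode_t = {
    "T1": {0, 2},        # same classes, same distribution, fresh data (incremental learning)
    "T2": {8, 6},        # same classes, shifted input distribution (domain adaptation)
    "T3": set(),         # newly introduced classes (new-task learning), empty here
    "data": {c: ("X_c_t", "Y_c_t") for c in (0, 2, 8, 6)},   # training data only for T^(t)
}
# T^(t) is the union of the three subsets; evaluation still covers all C^(t) classes.
task_set_t = episode_t["T1"] | episode_t["T2"] | episode_t["T3"]
```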
3.1 FEATURE TRANSFORMATION BY AUGMENTING NETWORK CAPACITY
At any time t − 1, the classifier is optimized to classify all the classes [C^(t−1)], and the set of features F^(t−1) is well separated according to classes. At t, when new training data D^(t) = ∪_{τ∈T^(t)} (X_τ^(t), Y_τ^(t)) is encountered, the features extracted using the previous feature extractor,

∂F^(t) = ∪_{τ∈T^(t)} Φ_{θ^(t−1)}(X_τ^(t)),    (3)

are not guaranteed to be optimized for classifying the new data and new classes. In order to achieve good performance on new data and classes, we propose to change the feature representation at time t, just before the classification stage. We achieve this by defining a feature transformer

Φ_{∆θ^(t)} : ℱ^(t−1) → ℱ^(t),    (4)

parameterized by ∆θ^(t), which maps any feature extracted by Φ_{θ^(t−1)} to a new representation. The new feature extractor is now given by Φ_{θ^(t)} ≜ Φ_{∆θ^(t)} ◦ Φ_{θ^(t−1)}, where θ^(t) ≜ θ^(t−1) ∪ ∆θ^(t). Practically, this is realized by augmenting the capacity of the feature extractor at each time t using one or more fully connected layers². It is however possible that Φ_{∆θ^(t)} could simply be an identity transform and feature transformers learnt in previous episodes could be adapted for new data. This helps in controlling the growth of network capacity over time; this aspect of our work is described in section 4.5.
The feature transformer is trained, along with a new classifier layer, using the composite loss function of the form in equation 2, by invoking TRAIN(∆θ^(t), κ^(t); D^(t)), with D^(t) = (∂F^(t), Y^(t))³. This ensures that the classifier performs well on the new data. Strikingly, training a feature transformer at t does not involve changing the feature extractor Φ_{θ^(t−1)} at all, and this helps us in alleviating catastrophic forgetting by efficiently making use of the already computed features F^(t−1) through a memory module.
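As a concrete illustration of this composition, the following is a minimal PyTorch sketch; the 256-dimensional feature size and the stand-in old extractor are assumptions for illustration, not the paper's exact architecture. The old extractor Φ_{θ^(t−1)} stays frozen, and only the new fully connected transformer layers and classifier head receive gradients.

```python
import torch
import torch.nn as nn

class FeatureTransformer(nn.Module):
    """Fully connected layers mapping features in F^(t-1) to the new space F^(t)."""
    def __init__(self, feat_dim=256, num_layers=2):
        super().__init__()
        blocks = []
        for _ in range(num_layers):
            blocks += [nn.Linear(feat_dim, feat_dim), nn.ReLU()]
        self.net = nn.Sequential(*blocks)

    def forward(self, f):
        return self.net(f)

# Stand-in for the previously learnt extractor Phi_{theta^(t-1)}; it is frozen from now on.
old_extractor = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU())
for p in old_extractor.parameters():
    p.requires_grad = False

transformer_t = FeatureTransformer()          # the new parameters Delta theta^(t)
classifier_t = nn.Linear(256, 10)             # the new classifier head Psi_{kappa^(t)}

def new_extractor(x):
    """Phi_{theta^(t)} = Phi_{Delta theta^(t)} o Phi_{theta^(t-1)}."""
    with torch.no_grad():
        f_old = old_extractor(x)              # old features; also what gets written to memory
    return transformer_t(f_old)
```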
3.2 REMEMBERING HISTORY VIA MEMORY
The set of all extracted features F^(t−1) serves as a good abstraction of the model, for all the tasks and data encountered till t − 1. Therefore, if F^(t−1) is available to the model when it encounters new tasks and data, then the feature transformer at t can take advantage of this knowledge to retain the classification performance on previous tasks and data as well. To this end, we assume the availability of an un-ending memory module M, equipped with READ(), WRITE() and ERASE() procedures, that can store F^(t−1) and retrieve the same at t. In situations where memory is scarce, only a relevant subset of F^(t−1) can be stored and retrieved.
We train the feature transformer at any t > 0 by invoking TRAIN(∆θ^(t), κ^(t); D^(t)), with the combined set of features

D^(t) = (∂F^(t) ∪ F^(t−1), ∪_{t′∈[1,2,··· ,t]} Y^(t′)), ∀t > 0.    (5)

We then obtain the new set of features

F^(t) = Φ_{∆θ^(t)}(∂F^(t)) ∪ Φ_{∆θ^(t)}(F^(t−1)),    (6)

and replace F^(t−1) in memory by (a subset of) F^(t).
With the assumption of infinite memory and capacity augmentation at every episode, our feature
transformers framework is presented in algorithm 1.
3.3 FEATURE TRANSFORMERS IN ACTION
Before we present experimental results to demonstrate the ability of feature transformers for lifelong learning of new tasks, we first show how feature transformers operate when simply new episodes of data are presented. As described in section 3, we use a composite loss comprising a classification loss and a feature loss to train the classifier. To promote feature clustering/separation, we propose to use the center loss described in Wen et al. (2016).
Dropping the time index t for brevity, with the convention that the ground truth class for each x is encoded using a one-hot vector ȳ = [y_1, y_2, · · · , y_C]^T, and (Ψ_κ ◦ Φ_θ(x))_c is the c-th component of the classifier output, we use the following loss function for all the classifier training procedures:

loss(θ, κ) = − Σ_{(x,ȳ)∈D} Σ_{c∈[C]} y_c · log((Ψ_κ ◦ Φ_θ(x))_c) + λ · Σ_{(x,ȳ)∈D} Σ_{c∈[C]} ‖Φ_θ(x) − µ_c‖²,    (7)

where D = ∪_{τ∈T} (X_τ, Y_τ) is the given training data set, and µ_c is the centroid of all features corresponding to input data labelled as c.
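A minimal PyTorch sketch of this composite loss is given below; it follows the center-loss form of Wen et al. (2016), where each feature is pulled toward the centroid of its ground-truth class, and it assumes the centroids µ_c are maintained outside the function (e.g. as running means). This is an illustrative rendering, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def composite_loss(features, logits, labels, centroids, lam=0.2):
    """Cross-entropy plus a center-loss style feature-separation term, in the spirit of equation 7.

    features:  (N, d) outputs of the feature extractor Phi_theta
    logits:    (N, C) outputs of the classifier Psi_kappa (pre-softmax)
    labels:    (N,)   ground-truth class indices
    centroids: (C, d) per-class feature centroids mu_c
    """
    cls_loss = F.cross_entropy(logits, labels)                              # -sum_c y_c log p_c
    feat_loss = ((features - centroids[labels]) ** 2).sum(dim=1).mean()     # ||Phi(x) - mu_y||^2
    return cls_loss + lam * feat_loss
```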
Figure 1 provides a snapshot of the feature transformation algorithm when a new episode of data is encountered. We consider X-ray lung images (from Wang et al. (2017a)) consisting of two classes: (i) normal and (ii) pneumothorax.
² There is no specific restriction on the kind of layers to be used, but in our present work we use only fully connected layers.
³ Y^(t) = ∪_{τ∈T^(t)} Y_τ^(t)


Input: Task set T^(t), and training data ∪_{τ∈T^(t)} (X_τ^(t), Y_τ^(t)), ∀t ≥ 0
Output: (θ^(t), κ^(t)), ∀t
t ← 0, ERASE(M)                                          /* Set initial time, erase memory */
D^(0) ← ∪_{τ∈T^(0)} (X_τ^(0), Y_τ^(0))                    /* Obtain initial tasks and training data */
TRAIN(θ^(0), κ^(0); D^(0))                                /* Train initial network */
F^(0) ← ∪_{τ∈T^(0)} Φ_{θ^(0)}(X_τ^(0))                    /* Compute features */
WRITE(M, (F^(0), Y^(0)))                                  /* Write features to memory */
while TRUE do
    t ← t + 1, obtain T^(t), ∪_{τ∈T^(t)} (X_τ^(t), Y_τ^(t))   /* Obtain current tasks and training data */
    Compute ∂F^(t) using equation 3                       /* Compute old model features on new data */
    (F^(t−1), Y^(t−1)) ← READ(M)                          /* Read previously computed features from memory */
    Form D^(t) using equation 5                           /* Form composite training data */
    TRAIN(∆θ^(t), κ^(t); D^(t))                           /* Train feature transformer */
    Φ_{θ^(t)} ← Φ_{∆θ^(t)} ◦ Φ_{θ^(t−1)}                   /* Obtain new feature extractor */
    Compute F^(t) using equation 6                        /* Compute new features */
    ERASE(M), WRITE(M, (F^(t), ∪_{t′∈[1,··· ,t]} Y^(t′)))  /* Erase & write new features */
end
Algorithm 1: The life-long learning framework
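For concreteness, a self-contained PyTorch sketch of this loop is shown below with toy dimensions and random stand-in data; plain cross-entropy replaces the full loss of equation 7, and all sizes are illustrative rather than those used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, num_classes = 32, 4    # toy sizes, purely illustrative

def train(modules, inputs, labels, forward, epochs=3):
    """TRAIN(.): optimise only the given modules, here with plain cross-entropy."""
    opt = torch.optim.Adam([p for m in modules for p in m.parameters()], lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(forward(inputs), labels)
        loss.backward()
        opt.step()

# t = 0: train the initial extractor and classifier on raw inputs, then store the features
extractor = nn.Sequential(nn.Linear(16, feat_dim), nn.ReLU())        # Phi_{theta^(0)}
classifier = nn.Linear(feat_dim, num_classes)                        # Psi_{kappa^(0)}
X0, Y0 = torch.randn(256, 16), torch.randint(0, num_classes, (256,))
train([extractor, classifier], X0, Y0, lambda x: classifier(extractor(x)))
memory_F, memory_Y = extractor(X0).detach(), Y0                      # WRITE(M, (F^(0), Y^(0)))

# subsequent episodes with random stand-in data
episodes = [(torch.randn(128, 16), torch.randint(0, num_classes, (128,))) for _ in range(3)]
for Xt, Yt in episodes:
    with torch.no_grad():
        dF = extractor(Xt)                                           # eq. 3: old features on new data
    transformer = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())   # Phi_{Delta theta^(t)}
    classifier = nn.Linear(feat_dim, num_classes)                    # new head kappa^(t)

    F_all, Y_all = torch.cat([dF, memory_F]), torch.cat([Yt, memory_Y])     # eq. 5: D^(t)
    train([transformer, classifier], F_all, Y_all, lambda f: classifier(transformer(f)))

    with torch.no_grad():
        memory_F = transformer(F_all)                                # eq. 6: re-express the history
    memory_Y = Y_all                                                 # ERASE(M), WRITE(M, ...)
    extractor = nn.Sequential(extractor, transformer)                # Phi_{theta^(t)}: composed extractor
```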

[Figure 1 schematic: a trained and fixed feature extractor Φ_{θ^(t−1)}, a trainable feature transformer Φ_{∆θ^(t)} and classifier, and a memory module M accessed via READ(), WRITE() and ERASE(); panels (A)-(D) show t-SNE plots of the features.]
Figure 1: Visual depiction of feature transformation process on new episodes.

At a time index (t − 1), the classifier model is trained on 6000 images with the loss in equation 7. As shown by the t-SNE plot (A), the feature extractor Φ_{θ^(t−1)} produces features which are well separated, and these features get stored in memory M. However, at time t, when a set of 2000 new images is encountered, Φ_{θ^(t−1)} produces features that are scattered (t-SNE plot (B)). To improve the separation on the new data, the feature transformer is trained using the (well-separated) features in M as well as the poorly separated features from the new data, with the loss function promoting good separation in the new representation. This ensures that all the 8000 images seen until time t are well separated (t-SNE plots (C) and (D)). This is repeated for all time indices t. Thus, the feature transformer, along with an appropriate loss function, continuously changes the representation to ensure good classification performance.
4 EXPERIMENTAL RESULTS
In this section, we benchmark our algorithm’s performance in various scenarios on relevant datasets.

• Realistic lifelong scenario along with traditional multi-task (MT) learning on MNIST rotations
• Incremental learning on pneumothorax identification (data-sensitive domain)
• New-task learning on iCIFAR100

4.1 PRACTICAL LIFELONG LEARNER - MNIST ROTATIONS DATASET


To simulate a realistic lifelong learning scenario as described in Section 2, we use the MNIST rotations dataset - each task contains digits rotated by a fixed angle between 0 and 180 degrees. Firstly, we randomly permute the rotation angles (to test domain adaptation). For every rotation angle, we randomly permute the class labels (0-9) and divide them into two subsets of 5 classes each (learning new tasks). We divide the number of training samples into two different subsets (incremental learning). Table 1 details the episodes of a lifelong learning experiment from the described procedure.

Episode No | Rotation Angle | Class Labels    | Samples seen so far / Total available | Description
1          | ∠(5)           | [0, 2, 3, 9, 7] | 12.5k/50k                             | Start
2          | ∠(5)           | [6, 4, 5, 1, 8] | 25k/50k                               | New-task Learning
3          | ∠(5)           | [0, 2, 3, 9, 7] | 37.5k/50k                             | Incremental Learning
4          | ∠(5)           | [6, 4, 5, 1, 8] | 50k/50k                               | Incremental Learning
5          | ∠(270)         | [2, 8, 6, 4, 7] | 62.5k/100k                            | Domain Adaptation
6          | ∠(270)         | [5, 0, 1, 3, 9] | 75k/100k                              | Domain Adaptation
7          | ∠(270)         | [2, 8, 6, 4, 7] | 87.5k/100k                            | Domain Adaptation + Incremental Learning
8          | ∠(270)         | [5, 0, 1, 3, 9] | 100k/100k                             | Domain Adaptation + Incremental Learning
...        | ...            | ...             | ...                                   | ...

Table 1: Episode descriptions in one realistic life-long learning experiment scenario
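As an illustration of this construction (an assumed helper, not the authors' released code), a schedule like the one in Table 1 can be generated as follows:

```python
import random

def make_schedule(angles, seed=0):
    """Generate episodes: shuffled rotation angles; per angle, shuffled digit labels split into
    two groups of five classes, each group seen twice via two halves of the training samples."""
    rng = random.Random(seed)
    angles = angles[:]
    rng.shuffle(angles)                              # random angle order (domain adaptation)
    schedule = []
    for angle in angles:
        labels = list(range(10))
        rng.shuffle(labels)
        groups = [labels[:5], labels[5:]]            # two new-task groups of 5 classes each
        for half in (0, 1):                          # two data halves (incremental learning)
            for group in groups:
                schedule.append({"angle": angle, "classes": group, "data_half": half})
    return schedule

episodes = make_schedule([5, 270, 45, 120])          # episode 1 starts, episode 2 is new-task, ...
```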

4.1.1 EXPERIMENT DETAILS AND RESULTS


We used a basic CNN architecture composed of 3 conv layers with 32 filters and kernel size (3x3), followed by 2 dense layers "fc1" and "fc2" of feature length 256 and a softmax. Our feature transformer network at every episode aims to transform the "fc1" features using 2 additional dense layers of feature length 256. The feature transformer from the previous episode serves as initialization for the current episode, and these models were optimized for the cumulative loss (equation 7), with λ = 0.2. All the feature transformers were trained for only 3 epochs, with a batch size of 32.
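A plausible PyTorch rendering of this architecture is sketched below; the activation and padding choices are not spelled out in the text, so they are assumptions here.

```python
import torch.nn as nn

# Base network: 3 conv layers (32 filters, 3x3 kernels), then "fc1" (the stored/transformed features)
base = nn.Sequential(
    nn.Conv2d(1, 32, 3), nn.ReLU(),
    nn.Conv2d(32, 32, 3), nn.ReLU(),
    nn.Conv2d(32, 32, 3), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 22 * 22, 256), nn.ReLU(),   # "fc1": 28x28 input shrinks to 22x22 without padding
)
head = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),            # "fc2"
    nn.Linear(256, 10),                        # softmax over the 10 digit classes (via cross-entropy)
)
# Per-episode feature transformer on top of the "fc1" features: two dense layers of width 256
feature_transformer = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
)
```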
We compare our results to the two obvious lifelong learners - naive and cumulative training - which serve as the lower and upper bounds of performance, respectively, in the absence of other competitive algorithms for this setting. While the naive learner finetunes the entire network on the latest episode's data, the cumulative learner accumulates data from all the episodes seen so far and retrains the model. Fig. 2a highlights the remarkable performance of the proposed approach. We show the performance evolution over the first 25 episodes in Fig. 2a, while performance comparisons over 80 episodes across multiple experiment runs are averaged and shown in Table 2b. We demonstrate that just by storing features in memory and learning feature transformers at every episode, we achieve performance very close to the gold-standard result. Further, it also underlines the applicability of our approach as a single framework to combat the different requirements of lifelong learning.
Finally, we also compare our results (figure 3) in the conventional MT setting described in Lopez-Paz et al. (2017), whose open source implementation we used for our experiments. For details on the terminology and compared methods, readers are encouraged to refer to Lopez-Paz et al. (2017). Fig. 3a demonstrates the clear superiority of our approach over the multiple methods compared. Additionally, backward transfer (BWT) - a quantitative metric that models the deterioration of performance on older tasks while learning new tasks - is also negligible for our method, which shows resistance to catastrophic forgetting. Fig. 3b highlights this further, as shown by the accuracy on the first task after learning subsequent tasks.
4.2 PNEUMOTHORAX IDENTIFICATION FROM X-RAY IMAGES
We simulate a practical manifestation of lifelong learning where a model trained to detect pneumothorax is deployed in a hospital with data arriving incrementally. We utilize a subset of the ChestXRay dataset (Wang et al., 2017b), which consists of chest X-rays labeled with corresponding diagnoses. We simulated incremental learning by providing the 8k training images in incremental batches of 2k and measured the performance on a held-out validation set of 2k images. Fig. 4a establishes the baselines for the experiment. As in the previous experiment, naive and cumulative training define the performance bounds. To clearly highlight the value of our feature transformer, we also add another strong baseline - a naive learner with center loss, which learns on the most recent batch but with an augmented loss function (equation 7). In spite of a gain of 5% due to center loss, there is still a loss of 4% performance in the incremental learning set-up on the fourth batch of 2k.
Experimental Details and Results: We used a pre-trained VGG network (Simonyan & Zisserman, 2014) as the base network and explored the use of features from different layers of the VGG network, namely post the two pooling layers and the fully connected layers. The feature transformer network essentially had one additional dense layer per step and was optimized for equation 7.

[Figure 2 plots: validation accuracy vs. episode index for the Naive Learner, Feature Transformer and Cumulative Learner on the MNIST rotations lifelong simulator.]
Figure 2: (a) Performance evolution over first 25 episodes and (b) Average performance across
multiple runs at the end of lifelong experiments

[Figure 3 plots on MNIST rotations: (a) ACC and BWT for single, independent, multimodal, EWC, GEM and Feature Transformer; (b) accuracy of the first task as subsequent tasks are learnt.]
(a) Comparison with state-of-the-art methods (b) Evolution of 1st task’s accuracy over time

Figure 3: Comparison of proposed approach in conventional multi-task setting


[Figure 4 plots: validation accuracy vs. number of training samples for the Naive, Naive + Centerloss, Feature Transform and Cumulative learners on pneumothorax classification from X-ray images.]

Figure 4: (a) Performance comparison on the validation dataset and (b) comparison of feature transformers from different base layers

Fig. 4a captures the performance of the feature transformer with the base features extracted from the first pooling layer - block3_pool. After the fourth batch of data, the feature transformer result almost matches the performance of cumulative training. This performance is achieved despite not having access to the full images but only the stored features. Table 4b also presents the performance of the feature transformer depending upon the base features used. It can be noted that performance is lowest for the layer that is closest to the classification layer - fc_2. This is intuitively satisfying because the later layers in a deep neural network are more finely tuned towards the specific task, which deprives the feature transformer of any general features.


[Figure 5 plots: (a) ACC and BWT on CIFAR-100 for single, independent, iCaRL, EWC, AR1, GEM and Feature Transformer; (b) validation accuracy over 21 batches for the Naive Learner, Feature Transformer and Cumulative Learner in the single-incremental-task setting.]

(a) Multi-task setting (b) Single-incremental task setting

Figure 5: Comparison with state-of-the-art methods - iCIFAR100 dataset

4.3 ICIFAR100 DATASET


We present the 100 classes from the CIFAR100 dataset in a sequence of 20 tasks comprising 5 classes each. Similar to Lomonaco & Maltoni (2017), we use the definitions of MT and single-incremental task (SIT). In an MT setting, evaluation is performed only on the new tasks exposed to the learner in the current episode. We start with a VGG-type architecture, pretrained on the iCIFAR10 dataset as suggested by Lopez-Paz et al. (2017) and Lomonaco & Maltoni (2017). Base features are extracted from the flatten layer (before the fully connected layers) and our feature transformers include two dense layers with feature length 256. Similar to our earlier experiments, the feature transformer module from the previous episode initializes the current episode's transformer and is optimized for the cumulative loss, for 30 epochs with a batch size of 32.
4.3.1 MULTI-TASK SETTING
Fig 5a demonstrates the superiority of our approach by a significant margin of >10%. Further, we
have negligible Backward Transfer.
4.3.2 SINGLE INCREMENTAL TASK SETTING
Fig 5b captures the performance over 20 batches of data for our method along with the cumulative and naive learners. Unsurprisingly, the naive learner performs very poorly, while the feature transformer displays an exemplary validation accuracy of 40% after encountering all 20 episodes, compared to the cumulative learner at 50%. Our method significantly outperforms AR1 (Lomonaco & Maltoni, 2017), while iCaRL (Rebuffi et al., 2017) achieves best-in-class performance close to the gold-standard cumulative learner. This is not surprising because iCaRL is an explicit rehearsal technique, where exemplar images from previous episodes are stored and replayed while learning new tasks, whereas we only store low-dimensional features.
4.4 EFFECT OF LIMITED MEMORY
One of the drawbacks of the proposed approach is the assumption of infinite memory and the need to store features computed on all samples observed so far. To understand the extent of this limitation, we performed ablation experiments limiting the amount of history replayed, as well as computing the storage requirements involved.

Storing all features is not necessary: We studied the effect of the size of memory by limiting the number of samples stored to a smaller percentage. We observed that performance dropped from 97% to 94% when we reduced the memory size to only 25% of the original size on MNIST rotations (Table 2). We performed similar experiments on the pneumothorax classification problem and observed similar trends, as shown in Table 3, clearly demonstrating the robustness of the proposed method.

Storing all features is not prohibitive: Additionally, calculations of size-on-disk suggest that storing features of the entire history is not prohibitive. A typical natural/medical image is 256*256*3 integers or more, whereas our representation is only 4096 floats (16 KB). Even the largest available medical image repository of 100k X-ray images takes 1.6 GB, which is not huge. These are conservative estimates: a standard medical image can be of much larger size (1024*1024) and in 3-D (minimum 10 slices). Any exemplar-based method (e.g. iCaRL) will have more severe storage limitations than our method. Additionally, storing 50 low-dimensional features occupies the same memory as storing one exemplar image. This directly leads to storing more history compactly while addressing catastrophic forgetting and privacy.

Table 2: MNIST rotations
% of samples stored from history | Feature Transformer Val Acc | Cumulative Learner Val Acc
25                               | 94%                         | 95%
50                               | 94.8%                       | 98%
75                               | 96%                         | 98.5%
100                              | 97%                         | 99%

Table 3: Pneumothorax dataset
% of samples stored from history | Feature Transformer Val Acc
25                               | 80.95%
50                               | 82.95%
75                               | 85.85%
100                              | 86.94%

Table: Performance comparisons with limited memory budget

4.5 CONTROLLING THE GROWTH OF NETWORK CAPACITY
In the description of the feature transformer framework in section 3.1 and section 3.2, we provided a generic treatment of the method where, at the end of each episode, the features are transformed up to date and then stored in memory for the next episode. With this scheme, it becomes imperative to always augment the capacity of the network in order to learn new representations, resulting in ever-growing network capacity. However, this problem can be easily alleviated by partitioning the entire network into a base feature extractor and feature transformer layers. The base feature extractor always remains fixed, and it is only the output of the base feature extractor that is stored in memory. With this scheme, the feature transformer layers need not grow in capacity, and a few already existing layers can simply be adapted for the new data. When the capacity of the feature transformer layers is not sufficient, it can be augmented by adding one or more extra layers. In either case, the stored base features are sufficient for training.

Effect of varying additional capacity

We varied the size of the feature transformers and observed the difference in performance. Table 4 shows that halving the additional capacity does not change the performance on the MNIST rotations dataset at all. In addition, we froze the capacity of the feature transformers after the 5th episode and adapted them till the end of the 80 episodes. It is striking that performance is still high. Similarly, for pneumothorax classification, Table 5 shows the performance comparisons with a varying capacity of 2, 1 and zero fc layers post the third episode. These experiments (along with Sec 4.4) clearly demonstrate that the power of the proposed approach comes from learning separable representations continually, and not necessarily from storing all features or from additional capacity.

Table 4: MNIST rotations
Incremental capacity added per episode     | Feature Transformer Val Acc
2 dense layers                             | 96.4%
1 dense layer                              | 96.5%
No additional capacity (after 5th episode) | 96.2%

Table 5: Pneumothorax dataset
Incremental capacity added per episode     | Feature Transformer Val Acc
2 dense layers                             | 86.94%
1 dense layer                              | 86.43%
No additional capacity (after 3rd episode) | 86.41%

Table: Performance comparisons with limited incremental compute

5 DISCUSSION
In the final section, we discuss various points concerning our proposed approach.

Bayesian interpretation of Feature transformers


Our feature transformers framework - learning a new representation with every new episode of data - can be interpreted as maximum-a-posteriori (MAP) representation learning, with the previous representation acting as a prior. The implementation described in this paper using the composite loss (equation 7) is one instantiation of the general incremental representation learning possible in our framework. We have cast the MAP estimation problem as a tractable optimization problem constrained by a center loss. In future, we plan to explore other manifestations of our approach with different loss functions that can ensure better separability.

Information Loss, Incremental Capacity, Data Privacy


As shown in Table 4b, the feature transformer becomes less effective if the base features do not contain enough relevant information. This also means that the additional capacity that every feature transformer adds may not help, or may in fact be counter-productive. If the base features are extracted from layers close to the input image, there will be a problem of traceability, which violates the data privacy requirement we want to satisfy. We feel this is a potential trade-off between performance and data privacy which we will investigate in future.

Model compaction for cascade of feature transformers


Another approach to control the growth of network capacity is model compaction, which will be our
future work. At any point in time, the entire set of feature transformer layers can be replaced by a
smaller and simpler network (possibly using distillation techniques) and then again allowed to grow
subsequently. This cycle of grow-and-purge can be used to effectively manage the overall capacity
of the network.
REFERENCES
F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari. End-to-End Incremental Learning. ArXiv e-prints, July 2018.
Chen He, Ruiping Wang, Shiguang Shan, and Xilin Chen. Exemplar-supported generative repro-
duction for class incremental learning. In 29th British Machine Vision Conference (BMVC 2018),
3–6 Sep 2018.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv
preprint arXiv:1503.02531, 2015.
Khurram Javed and Muhammad Talha Paracha. Incremental classifier & representation learning.
Learning, 3:4, 2018.
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A
Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcom-
ing catastrophic forgetting in neural networks. Proceedings of the national academy of sciences,
pp. 201611835, 2017.
Jeongtae Lee, Jaehong Yoon, Eunho Yang, and Sung Ju Hwang. Lifelong learning with dynamically
expandable networks. CoRR, abs/1708.01547, 2017.
Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 2017.
Xialei Liu, Marc Masana, Luis Herranz, Joost Van de Weijer, Antonio M Lopez, and Andrew D
Bagdanov. Rotate your networks: Better weight consolidation and less catastrophic forgetting.
arXiv preprint arXiv:1802.02950, 2018.
Vincenzo Lomonaco and Davide Maltoni. Core50: a new dataset and benchmark for continuous object recognition. In Sergey Levine, Vincent Vanhoucke, and Ken Goldberg (eds.), Proceedings of the 1st Annual Conference on Robot Learning, volume 78 of Proceedings of Machine Learning Research, pp. 17-26. PMLR, 13-15 Nov 2017. URL http://proceedings.mlr.press/v78/lomonaco17a.html.
David Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural
Information Processing Systems, pp. 6467–6476, 2017.


Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The
sequential learning problem. In Psychology of learning and motivation, volume 24, pp. 109–165.
Elsevier, 1989.
Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Metric learning for large
scale image classification: Generalizing to new classes at near-zero cost. In Computer Vision–
ECCV 2012, pp. 488–501. Springer, 2012.
Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level im-
age representations using convolutional neural networks. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pp. 1717–1724, 2014.
Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, and Christoph H. Lampert. iCaRL: Incremental
classifier and representation learning. 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 5533–5542, 2017.
Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7:
123–146, 1995.
Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative
replay. In Advances in Neural Information Processing Systems, pp. 2990–2999, 2017.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
Amal Triki Rannen, Rahaf Aljundi, Mathew B Blaschko, and Tinne Tuytelaars. Encoder based
lifelong learning. IEEE International Conference of Computer Vision, 2017.
Ragav Venkatesan, Hemanth Venkateswara, Sethuraman Panchanathan, and Baoxin Li. A strategy
for an uncompromising incremental learner. arXiv preprint arXiv:1705.00744, 2017.
X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers. Chestx-ray8: Hospital-scale chest
x-ray database and benchmarks on weakly-supervised classification and localization of common
thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 3462–3471, July 2017a. doi: 10.1109/CVPR.2017.369.
X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers. Chestx-ray8: Hospital-scale chest
x-ray database and benchmarks on weakly-supervised classification and localization of common
thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 3462–3471, July 2017b. doi: 10.1109/CVPR.2017.369.
Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach
for deep face recognition. In ECCV, 2016.
