Feature Transformers: A Unified Representation Learning Framework for Lifelong Learning
Anonymous authors
Paper under double-blind review
ABSTRACT
1 INTRODUCTION
Deep learning algorithms have achieved tremendous success on challenging tasks such as object detection, language translation and medical image segmentation. Lifelong learning - the ability to adapt, benefit from and sustain performance with more data and feedback after deployment - is an important goal for artificial intelligence (AI), and is crucial for the sustained utility of these algorithms in domains like healthcare (learning from rare cases) and self-driving (learning new object detectors). While there is no single standard definition of lifelong learning, most research in this field can be classified into one of the following sub-categories:
1. Incremental learning - encountering new data, with no change in distribution, for the same task
2. Domain adaptation - data from modified target distributions, but for the same task
3. New-task learning - data from tasks that were not presented before
The majority of successful techniques study these variants in isolation. However, a realistic lifelong learning scenario involves a mixture of these manifestations over time. Another major impediment to a successful lifelong learner is data privacy. In domains like healthcare, it is nearly impossible to retain data access beyond scope (both in time and geography), which leads to catastrophic forgetting (McCloskey & Cohen, 1989) - the inability of machine learning models to retain past knowledge while learning from new data. In this paper, we provide a unified framework - feature transformers - for practical lifelong learning that handles both data privacy and catastrophic forgetting.
In summary, the major contributions of our work are as follows:
• Define a realistic and challenging lifelong learning scenario and formulate a unique and generic mathematical framework - feature transformers - that works successfully in this setting.
• Ensure data privacy by storing only features from previous episodes, while successfully combating catastrophic forgetting.
2 RELATED WORK
• Knowledge distillation based approaches: An example of lifelong learning using knowledge distillation (Hinton et al., 2015) is learning without forgetting (LwF) (Li & Hoiem, 2017), where a distillation loss is added to match the output of the updated network to that of the old network on the old task output variables. LwF and its extensions do not scale to a large number of new tasks, suffer from catastrophic forgetting and, importantly, do not address the incremental learning or domain adaptation settings.
• Rehearsal using exemplar sets: Rebuffi et al. (2017) propose iCaRL, an incremental learning approach that stores an exemplar set of data from previous tasks and augments it with the new data. In more recent work, Javed & Paracha (2018) and Castro et al. (2018) argue that the decoupled nearest-mean classifier of Rebuffi et al. (2017) is not essential and propose joint learning of feature extraction and classification. Lopez-Paz et al. (2017) propose Gradient Episodic Memory, a technique which stores data from previous classes and constrains the gradient update while learning new tasks.
• Rehearsal using generative models: Triki Rannen et al. (2017) propose to reproduce past
knowledge by using task-specific under-complete autoencoders. When a new task is pre-
sented, optimization is constrained by preventing the reconstructions of these autoencoders
from changing, thereby ensuring that the features necessary for the previous tasks are not
destroyed. Shin et al. (2017); He et al. (2018); Venkatesan et al. (2017) employ generative
adversarial networks for recreating the history from previous tasks.
• Network regularization strategies: Methods in this class aim to identify and selectively modify the parts of the neural network that are critical for remembering past knowledge, or explicitly penalize loss of performance on old tasks. Kirkpatrick et al. (2017); Liu et al. (2018) use the Fisher information matrix to identify the weights most crucial for prediction of a given task and lower the learning rate for these weights. Lee et al. (2017) propose a dynamically expanding network to increase the capacity for new tasks if the previous architecture is insufficient to represent the data.
Our framework lies at the intersection of pseudorehearsal methods and progressive neural networks,
with scope for judiciously utilizing extra capacity, while resisting catastrophic forgetting.
3 LIFE-LONG LEARNING VIA FEATURE TRANSFORMATIONS
Before we present the feature transformer method, we introduce the terminology and notation. We consider training a deep neural network which classifies an input into one of the classes $c \in [C] \triangleq \{1, 2, \cdots, C\}$. We refer to the operation of classifying an input to a particular class as a task. To this end, the classifier is trained with a training dataset $(X, Y)$, drawn from a joint distribution $(\mathcal{X}, \mathcal{Y})$. We view the deep neural network, defined by the parameters $(\theta, \kappa)$, as the composition of a feature extractor $\Phi_\theta : \mathcal{X} \to \mathcal{F}$ and a classifier $\Psi_\kappa$:
$$\Psi_\kappa \circ \Phi_\theta : \mathcal{X} \to [C], \qquad (1)$$
where $\mathcal{X}$ is the space of input data and $\mathcal{F}$ is a space of low-dimensional feature vectors.
We concisely denote the training of the neural network by $\text{TRAIN}(\theta, \kappa; D)$, which minimizes a loss function on training data $D = (X, Y)$ and produces the network parameters $(\theta, \kappa)$. Let us also define the set of all computed features on the input, $F = \Phi_\theta(X)$.¹
When the loss function only penalizes misclassification, the network is expected to learn only the class separation boundaries in the feature space. However, as we demonstrate experimentally, good separation of class-specific features enables stable learning of representations, which has a direct bearing on the performance of life-long learning. Therefore, in all our training procedures, we also use a feature loss which promotes feature separation:
$$\text{model loss} = \text{classification loss}(\theta, \kappa) + \lambda \cdot \text{feature loss}(\theta). \qquad (2)$$
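For concreteness, equation 2 can be instantiated with a cross-entropy classification term and a center-loss-style feature separation term in the spirit of Wen et al. (2016) (cf. the centroids µc in section 3.2 and the center-loss constraint noted in section 5). The PyTorch sketch below is illustrative only; the class name CenterFeatureLoss and the weight lam = 0.1 are our own assumptions, not values prescribed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterFeatureLoss(nn.Module):
    """Center-loss-style separation term (Wen et al., 2016): pulls each feature
    towards a learnable centroid mu_c of its class."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # mean squared distance between each feature and its class centroid
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

def model_loss(logits, features, labels, feature_loss_fn, lam=0.1):
    # Equation (2): classification loss + lambda * feature (separation) loss
    return F.cross_entropy(logits, labels) + lam * feature_loss_fn(features, labels)
```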
In the lifelong learning context, we denote the time-varying aspect of the network, the training data and the classes by using the time symbol $t \in \mathbb{N} \triangleq \{0, 1, 2, \cdots\}$ as a superscript on these objects. Realistically, at any time $t > 0$, the classifier encounters any of the following canonical situations:
1. The number of classes remains the same for the classifier as at $t-1$. The model encounters new training data for a subset of classes, without change in their distribution: $\forall t > 0$, $C^{(t)} = C^{(t-1)}$, $T_1^{(t)} \subseteq [C^{(t)}]$, and $\forall \tau \in T_1^{(t)}$, $(X_\tau^{(t)}, Y_\tau^{(t)}) \sim (\mathcal{X}_\tau^{(t)}, \mathcal{Y}_\tau^{(t)}) = (\mathcal{X}_\tau^{(t-1)}, \mathcal{Y}_\tau^{(t-1)})$.
2. The number of classes remains the same for the classifier as at $t-1$. The model encounters new training data from modified input distribution(s), for a subset of classes: $\forall t > 0$, $C^{(t)} = C^{(t-1)}$, $T_2^{(t)} \subseteq [C^{(t)}]$, and $\forall \tau \in T_2^{(t)}$, $(X_\tau^{(t)}, Y_\tau^{(t)}) \sim (\mathcal{X}_\tau^{(t)}, \mathcal{Y}_\tau^{(t)}) \neq (\mathcal{X}_\tau^{(t-1)}, \mathcal{Y}_\tau^{(t-1)})$.
3. The model encounters new class(es) and corresponding new data: $\forall t > 0$, $C^{(t)} > C^{(t-1)}$, $T_3^{(t)} = [C^{(t)}] \setminus [C^{(t-1)}]$, and $\forall \tau \in T_3^{(t)}$, $(X_\tau^{(t)}, Y_\tau^{(t)}) \sim (\mathcal{X}_\tau^{(t)}, \mathcal{Y}_\tau^{(t)})$.
In the most generic scenario, a combination of all three situations can occur at any time index $t$, with training data available for the classes in the set $T^{(t)} \triangleq T_1^{(t)} \cup T_2^{(t)} \cup T_3^{(t)}$. However, it is important to note that at every index $t$, the classifier is trained to classify all the $C^{(t)}$ classes.
3.1 FEATURE TRANSFORMATION BY AUGMENTING NETWORK CAPACITY
At any time $t-1$, the classifier is optimized to classify all the classes $[C^{(t-1)}]$, and the set of features $F^{(t-1)}$ is well separated according to classes. At $t$, when new training data $D^{(t)} = \cup_{\tau \in T^{(t)}} (X_\tau^{(t)}, Y_\tau^{(t)})$ is encountered, the features extracted using the previous feature extractor,
$$\partial F^{(t)} = \cup_{\tau \in T^{(t)}} \Phi_{\theta^{(t-1)}}(X_\tau^{(t)}), \qquad (3)$$
are not guaranteed to be optimized for classifying the new data and new classes. In order to achieve good performance on new data and classes, we propose to change the feature representation at time $t$, just before the classification stage. We achieve this by defining a feature transformer
$$\Phi_{\Delta\theta^{(t)}} : F^{(t-1)} \to F^{(t)}, \qquad (4)$$
¹ Though $\Phi_\theta$ is a mapping which acts on individual vectors, we abuse the notation here by using it with sets.
parameterized by $\Delta\theta^{(t)}$, which maps any feature extracted by $\Phi_{\theta^{(t-1)}}$ to a new representation. The new feature extractor is now given by $\Phi_{\theta^{(t)}} \triangleq \Phi_{\Delta\theta^{(t)}} \circ \Phi_{\theta^{(t-1)}}$, where $\theta^{(t)} \triangleq \theta^{(t-1)} \cup \Delta\theta^{(t)}$. Practically, this is realized by augmenting the capacity of the feature extractor at each time $t$ with one or more fully connected layers². It is, however, possible that $\Phi_{\Delta\theta^{(t)}}$ is simply an identity transform and that feature transformers learnt in previous episodes are adapted for the new data. This helps control the growth of network capacity over time; this aspect of our work is described in section 4.5.
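A minimal PyTorch sketch of this composition is given below; the names make_feature_transformer and ExtendedFeatureExtractor, and the single Linear + ReLU block, are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

def make_feature_transformer(in_dim, out_dim):
    # One fully connected block playing the role of Phi_{Delta theta^(t)};
    # width and activation are illustrative choices.
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

class ExtendedFeatureExtractor(nn.Module):
    """Phi_{theta^(t)} = Phi_{Delta theta^(t)} o Phi_{theta^(t-1)}: the old
    extractor is frozen; only the appended transformer layers are trainable."""
    def __init__(self, old_extractor, transformer):
        super().__init__()
        self.old_extractor = old_extractor
        for p in self.old_extractor.parameters():
            p.requires_grad = False          # theta^(t-1) is never modified
        self.transformer = transformer       # Delta theta^(t)

    def forward(self, x):
        with torch.no_grad():
            f_old = self.old_extractor(x)    # features in F^(t-1)
        return self.transformer(f_old)       # transformed features in F^(t)
```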
The feature transformer is trained, along with a new classifier layer, using a composite loss function of the form in equation 2, by invoking $\text{TRAIN}(\Delta\theta^{(t)}, \kappa^{(t)}; D^{(t)})$ with $D^{(t)} = (\partial F^{(t)}, Y^{(t)})$³. This ensures that the classifier performs well on the new data. Strikingly, training a feature transformer at $t$ does not involve changing the feature extractor $\Phi_{\theta^{(t-1)}}$ at all, which helps us alleviate catastrophic forgetting by efficiently reusing the already computed features $F^{(t-1)}$ through a memory module.
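A hedged PyTorch sketch of this training step, operating purely on feature tensors rather than images, is given below; train_transformer is our own illustrative name, the layer widths, optimizer and epoch count are assumptions, and the optional feature-separation term can reuse the CenterFeatureLoss module sketched after equation 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_transformer(feats, labels, out_dim, num_classes, epochs=10, lam=0.1,
                      feature_loss_fn=None):
    """Sketch of TRAIN(Delta theta^(t), kappa^(t); D^(t)): a transformer block and a
    fresh classifier head are optimized on stored or newly extracted features only."""
    transformer = nn.Sequential(nn.Linear(feats.shape[1], out_dim), nn.ReLU())
    classifier = nn.Linear(out_dim, num_classes)
    params = list(transformer.parameters()) + list(classifier.parameters())
    if feature_loss_fn is not None:
        params += list(feature_loss_fn.parameters())   # e.g. learnable class centroids
    opt = torch.optim.Adam(params)
    for _ in range(epochs):
        opt.zero_grad()
        new_feats = transformer(feats)
        loss = F.cross_entropy(classifier(new_feats), labels)
        if feature_loss_fn is not None:                # optional feature loss (eq. 2)
            loss = loss + lam * feature_loss_fn(new_feats, labels)
        loss.backward()
        opt.step()
    return transformer, classifier
```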
3.2 REMEMBERING HISTORY VIA MEMORY
The set of all extracted features $F^{(t-1)}$ serves as a good abstraction of the model for all the tasks and data encountered until $t-1$. Therefore, if $F^{(t-1)}$ is available to the model when it encounters new tasks and data, the feature transformer at $t$ can take advantage of this knowledge to retain the classification performance on previous tasks and data as well. To this end, we assume the availability of an un-ending memory module $\mathcal{M}$, equipped with READ(), WRITE() and ERASE() procedures, that can store $F^{(t-1)}$ and retrieve it at $t$. In situations where memory is scarce, only a relevant subset of $F^{(t-1)}$ can be stored and retrieved.
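A minimal sketch of such a memory module is shown below; the class name FeatureMemory and the in-RAM tensor storage are illustrative assumptions (features could equally be serialized to disk).

```python
import torch

class FeatureMemory:
    """Minimal sketch of the memory module M with ERASE()/WRITE()/READ()."""
    def __init__(self):
        self.features, self.labels = None, None

    def erase(self):
        self.features, self.labels = None, None

    def write(self, features, labels):
        self.features = features.detach().cpu()
        self.labels = labels.detach().cpu()

    def read(self):
        return self.features, self.labels
```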
We train the feature transformer at any $t > 0$ by invoking $\text{TRAIN}(\Delta\theta^{(t)}, \kappa^{(t)}; D^{(t)})$ with the combined set of features
$$D^{(t)} = \left(\partial F^{(t)} \cup F^{(t-1)},\; \cup_{t' \in \{1, 2, \cdots, t\}} Y^{(t')}\right), \quad \forall t > 0, \qquad (5)$$
where $D = \cup_{\tau \in T}(X_\tau, Y_\tau)$ is the given training data set, and $\mu_c$ is the centroid of all features corresponding to input data labelled as $c$.
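Forming the composite training set of equation 5 then amounts to concatenating feature and label tensors; a minimal sketch, reusing the FeatureMemory class above, is:

```python
import torch

def form_composite_dataset(new_features, new_labels, memory):
    """Equation (5): concatenate dF^(t) (old-model features of the new data) with the
    stored features F^(t-1), together with the corresponding labels."""
    old_features, old_labels = memory.read()
    features = torch.cat([new_features, old_features], dim=0)
    labels = torch.cat([new_labels, old_labels], dim=0)
    return features, labels
```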
Figure 1 provides a snapshot of the feature transformation algorithm when a new episode of data is encountered. We consider X-ray lung images (from Wang et al. (2017)) consisting of two classes: (i) normal and (ii) pneumothorax. At time index $t-1$, the classifier model is trained on 6000 images with the loss in equation 7.
² There is no specific restriction on the kind of layers to be used, but in our present work we use only fully connected layers.
³ $Y^{(t)} = \cup_{\tau \in T^{(t)}} Y_\tau^{(t)}$
Input: Task set T(t), and training data ∪τ∈T(t) (Xτ(t), Yτ(t)), ∀t ≥ 0
Output: (θ(t), κ(t)), ∀t
t ← 0, ERASE(M)  /* Set initial time, erase memory */
D(0) ← ∪τ∈T(0) (Xτ(0), Yτ(0))  /* Obtain initial tasks and training data */
TRAIN(θ(0), κ(0); D(0))  /* Train initial network */
F(0) ← ∪τ∈T(0) Φθ(0)(Xτ(0))  /* Compute features */
WRITE(M, (F(0), Y(0)))  /* Write features to memory */
while TRUE do
    t ← t + 1, obtain T(t), ∪τ∈T(t) (Xτ(t), Yτ(t))  /* Obtain current tasks and training data */
    Compute ∂F(t) using equation 3  /* Compute old-model features on new data */
    (F(t−1), Y(t−1)) ← READ(M)  /* Read previously computed features from memory */
    Form D(t) using equation 5  /* Form composite training data */
    TRAIN(∆θ(t), κ(t); D(t))  /* Train feature transformer */
    Φθ(t) ← Φ∆θ(t) ◦ Φθ(t−1)  /* Obtain new feature extractor */
    Compute F(t) using equation 6  /* Compute new features */
    ERASE(M), WRITE(M, (F(t), ∪t′∈{1,2,··· ,t} Y(t′)))  /* Erase memory & write new features */
end
Algorithm 1: The life-long learning framework
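For readers who prefer code, the loop below is a Python sketch of Algorithm 1 built from the earlier sketches; all function names (train_initial, train_transformer, wrap_extractor) are illustrative placeholders for the paper's TRAIN, Φ and memory procedures, not an official implementation.

```python
import torch

def lifelong_learning_loop(episodes, train_initial, train_transformer, memory,
                           wrap_extractor):
    """`episodes` is an iterator yielding (X, Y) per time index; wrap_extractor
    composes the new transformer onto the old extractor (e.g. the
    ExtendedFeatureExtractor sketch from section 3.1)."""
    memory.erase()
    X0, Y0 = next(episodes)                                   # initial tasks and data
    extractor, classifier = train_initial(X0, Y0)             # TRAIN(theta^(0), kappa^(0); D^(0))
    with torch.no_grad():
        memory.write(extractor(X0), Y0)                       # store F^(0)

    for X_t, Y_t in episodes:                                 # t = 1, 2, ...
        with torch.no_grad():
            dF_t = extractor(X_t)                             # eq. (3): old-model features of new data
        F_prev, Y_prev = memory.read()
        feats = torch.cat([dF_t, F_prev])                     # eq. (5): composite training set
        labels = torch.cat([Y_t, Y_prev])
        transformer, classifier = train_transformer(feats, labels)   # TRAIN(Delta theta^(t), kappa^(t))
        extractor = wrap_extractor(extractor, transformer)    # Phi_{theta^(t)}
        with torch.no_grad():
            memory.erase()
            memory.write(transformer(feats), labels)          # new features F^(t) and all labels seen
    return extractor, classifier
```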
[Figure 1: schematic of the feature transformer framework — the previously trained and fixed feature extractor Φθ(t−1), the trainable feature transformer Φ∆θ(t) and classifier (θ(t), κ(t)), and the memory module holding features of old data D(t−1) alongside new data D(t); panels (A)–(D) show t-SNE plots of the features discussed below.]
As shown by the t-SNE plot (A), the feature extractor $\Phi_{\theta^{(t-1)}}$ produces features which are well separated, and these features get stored in the memory $\mathcal{M}$. However, at time $t$, when a set of 2000 new images is encountered, $\Phi_{\theta^{(t-1)}}$ produces features that are scattered (t-SNE plot (B)). To improve the separation on the new data, the feature transformer is trained using the (well-separated) features in $\mathcal{M}$ as well as the poorly separated features from the new data, with the loss function promoting good separation in the new representation. This ensures that all 8000 images seen until time $t$ are well separated (t-SNE plots (C) and (D)). This is repeated for all time indices $t$. Thus, the feature transformer, along with an appropriate loss function, continuously changes the representation to ensure good classification performance.
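Such t-SNE diagnostics are straightforward to reproduce; a generic scikit-learn sketch (assuming features is an (N, d) NumPy array and labels an (N,) array of class indices) is:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def tsne_plot(features, labels, title):
    """Project feature vectors to 2-D and colour by class, as in plots (A)-(D)."""
    embedding = TSNE(n_components=2).fit_transform(features)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=4)
    plt.title(title)
    plt.show()
```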
4 EXPERIMENTAL RESULTS
In this section, we benchmark our algorithm’s performance in various scenarios on relevant datasets.
randomly permute the rotation angles (to test domain adaptation). For every rotation angle, we randomly permute the class labels (0–9) and divide them into two subsets of 5 classes each (learning new tasks). We divide the training samples into two different subsets (incremental learning). Table 1 details the episodes of a lifelong learning experiment constructed from the described procedure.
Figure 2: (a) Performance evolution over first 25 episodes and (b) Average performance across
multiple runs at the end of lifelong experiments
namely, after the two pooling layers and the fully connected layers. The feature transformer network essentially had one additional dense layer per step and was optimized using equation 7.
[Figure 4a plot: validation accuracy (60–85%) versus number of samples (2000–8000) for Naive, Naive + Centerloss, Feature Transform and Cumulative training.]
Figure 4: (a) Performance Comparison on validation dataset and (b) Comparison of feature trans-
formers from different base layers
Fig. 4a captures the performance of the feature transformer with the base features extracted from the first pooling layer, block3_pool. After the fourth batch of data, the feature transformer almost matches the performance of cumulative training. This performance is achieved despite not having access to the full images but only the stored features. Fig. 4b also presents the performance of the feature transformer depending on the base features used. It can be noted that performance is lowest for the layer that is closest to the classification layer, fc2. This is intuitively satisfying because the later layers in a deep neural network are more finely tuned towards the specific task, depriving the feature transformer of general features.
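For illustration, the layer names block3_pool and fc2 match Keras's VGG16 naming, so the following tf.keras sketch shows, under that assumption, how frozen base features can be extracted at a chosen layer; it is a sketch, not the authors' exact setup.

```python
import numpy as np
import tensorflow as tf

# Base feature extractor frozen at an intermediate layer of VGG16.
vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=True)
base_extractor = tf.keras.Model(inputs=vgg.input,
                                outputs=vgg.get_layer("block3_pool").output)
base_extractor.trainable = False

images = np.zeros((1, 224, 224, 3), dtype=np.float32)   # placeholder preprocessed batch
base_features = base_extractor.predict(images)           # shape (1, 28, 28, 256) for block3_pool
```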
[Figure: comparison with baselines — validation accuracy over 21 batches and ACC/BWT bars for independent training, iCaRL, EWC, AR1, GEM and the proposed Feature Transformer.]
Storing all features is not necessary. We studied the effect of memory size by limiting the number of stored samples to a smaller percentage. We observed that performance dropped from 97% to 94% when we reduced the memory size to 25% of the original size on MNIST rotations (Table 2). We performed similar experiments on the pneumothorax classification problem and observed similar trends, as shown in Table 3, clearly demonstrating the robustness of the proposed method to reduced memory.
Storing all features is not prohibitive. Additionally, size-on-disk calculations suggest that storing the features of the entire history is not prohibitive. A typical natural/medical image is 256×256×3 integers or more, whereas our representation is only 4096 floats (16 KB). Even the largest available medical image repository of 100k X-ray images takes only 1.6 GB, which is not huge. These are conservative estimates: a standard medical image can be much larger (1024×1024) and 3-D (at least 10 slices). Any exemplar-based method (e.g., iCaRL) will face more severe storage limitations than our method. Additionally, storing 50 low-dimensional features occupies the same memory as storing one exemplar image. This directly leads to storing more history compactly while addressing catastrophic forgetting and privacy.
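The arithmetic behind these estimates is easy to verify; a minimal check, assuming 4-byte floats, is:

```python
# Back-of-the-envelope check of the size-on-disk figures quoted above.
feature_bytes = 4096 * 4                       # 4096 float32 values per stored feature
print(feature_bytes / 1024)                    # -> 16.0 KB per feature
print(100_000 * feature_bytes / 1e9)           # -> ~1.6 GB for 100k stored features
print(256 * 256 * 3)                           # -> 196,608 bytes for one 8-bit 256x256x3 image
```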
4.5 CONTROLLING THE GROWTH OF NETWORK CAPACITY
In the description of the feature transformer framework in sections 3.1 and 3.2, we provided a generic treatment of the method where, at the end of each episode, the features are transformed up to date and then stored in memory for the next episode. With this scheme, it becomes imperative to always augment the capacity of the network in order to learn new representations, resulting in ever-growing network capacity. However, this problem can be easily alleviated by partitioning the entire network into a base feature extractor and feature transformer layers. The base feature extractor always remains fixed, and it is only the output of the base feature extractor that is stored in memory. With this scheme, the feature transformer layers need not grow in capacity, and a few already existing layers can simply be adapted for the new data. When the capacity of the feature transformer layers is not sufficient, it can be augmented by adding one or more extra layers. In either case, the stored base features are sufficient to train.
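A hedged PyTorch sketch of this partitioning is given below; the class name PartitionedNetwork, the two-layer default and the equal layer widths are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class PartitionedNetwork(nn.Module):
    """Section 4.5 sketch: a permanently frozen base feature extractor (whose
    outputs are what get stored in memory) plus a fixed-size stack of feature
    transformer layers that are re-adapted, rather than grown, on new episodes."""
    def __init__(self, base_extractor, feat_dim, num_transformer_layers=2):
        super().__init__()
        self.base = base_extractor
        for p in self.base.parameters():
            p.requires_grad = False             # base features never change
        self.transformer = nn.Sequential(*[
            nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
            for _ in range(num_transformer_layers)
        ])                                      # re-trained each episode; grown only if insufficient

    def forward(self, x):
        return self.transformer(self.base(x))
```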
5 DISCUSSION
In this final section, we discuss various points concerning our proposed approach.
Our feature transformers framework - learning a new representation with every new episode of data - can be interpreted as maximum-a-posteriori (MAP) representation learning, with the previous representation acting as a prior. The implementation described in this paper using the combination loss (equation 7) is one instantiation of the general incremental representation learning possible in our framework. We have cast the MAP estimation problem as a tractable optimization problem constrained by a center loss. In future work, we plan to explore other manifestations of our approach with different loss functions that can ensure better separability.
Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The
sequential learning problem. In Psychology of learning and motivation, volume 24, pp. 109–165.
Elsevier, 1989.
Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Metric learning for large
scale image classification: Generalizing to new classes at near-zero cost. In Computer Vision–
ECCV 2012, pp. 488–501. Springer, 2012.
Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level im-
age representations using convolutional neural networks. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pp. 1717–1724, 2014.
Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, and Christoph H. Lampert. iCaRL: Incremental
classifier and representation learning. 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 5533–5542, 2017.
Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7:
123–146, 1995.
Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative
replay. In Advances in Neural Information Processing Systems, pp. 2990–2999, 2017.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
Amal Triki Rannen, Rahaf Aljundi, Mathew B Blaschko, and Tinne Tuytelaars. Encoder based lifelong learning. In IEEE International Conference on Computer Vision, 2017.
Ragav Venkatesan, Hemanth Venkateswara, Sethuraman Panchanathan, and Baoxin Li. A strategy
for an uncompromising incremental learner. arXiv preprint arXiv:1705.00744, 2017.
X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3462–3471, July 2017. doi: 10.1109/CVPR.2017.369.
Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach
for deep face recognition. In ECCV, 2016.