
18 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO. 1, JANUARY 2014

Anomaly Detection and Localization in Crowded Scenes

Weixin Li, Student Member, IEEE, Vijay Mahadevan, Member, IEEE, and Nuno Vasconcelos, Senior Member, IEEE

Abstract—The detection and localization of anomalous behaviors in crowded scenes is considered, and a joint detector of temporal
and spatial anomalies is proposed. The proposed detector is based on a video representation that accounts for both appearance and
dynamics, using a set of mixture of dynamic textures models. These models are used to implement 1) a center-surround discriminant
saliency detector that produces spatial saliency scores, and 2) a model of normal behavior that is learned from training data and
produces temporal saliency scores. Spatial and temporal anomaly maps are then defined at multiple spatial scales, by considering the
scores of these operators at progressively larger regions of support. The multiscale scores act as potentials of a conditional random
field that guarantees global consistency of the anomaly judgments. A data set of densely crowded pedestrian walkways is introduced
and used to evaluate the proposed anomaly detector. Experiments on this and other data sets show that the latter achieves state-of-the-art anomaly detection results.

Index Terms—Video analysis, surveillance, anomaly detection, crowded scene, dynamic texture, center-surround saliency

1 INTRODUCTION

SURVEILLANCE video is extremely tedious to monitor when events that require follow-up have very low probability. For crowded scenes, this difficulty is compounded by the complexity of normal crowd behaviors. This has motivated a surge of interest in anomaly detection in computer vision [1], [2], [3], [4], [5], [6], [7], [8], [9]. However, this effort is hampered by general difficulties of the anomaly detection problem [10]. One fundamental limitation is the lack of a universal definition of anomaly. For crowds, it is also infeasible to enumerate the set of anomalies that are possible in a given surveillance scenario. This is compounded by the sparseness, rarity, and discontinuity of anomalous events, which limit the number of examples available to train an anomaly detection system.

One common solution to these problems is to define anomalies as events of low probability with respect to a probabilistic model of normal behavior. This enables a statistical treatment of anomaly detection, which conforms with the intuition of anomalies as events that deviate from the expected [10]. However, it introduces a number of challenges. First, it makes anomalies dependent on the scale at which normalcy is defined. A normal behavior at a fine visual scale may be perceived as highly anomalous when a larger scale is considered, or vice versa. Hence, normalcy models must be defined at multiple scales. Second, different tasks may require different models of normalcy. For instance, a detector of freeway speed limit violations will rely on normalcy models based on speed features. On the other hand, appearance is more important for the detection of carpool lane violators, i.e., single-passenger vehicles in carpool lanes. Third, crowded scenes require normalcy models robust to complex scene dynamics, involving many independently moving objects that occlude each other in complex ways and can have low resolution.

As a result, anomaly detection can be extremely challenging. While this has motivated a great diversity of solutions, it is usually quite difficult to objectively compare different methods. Typically, these combine different representations of motion and appearance with different graphical models of normalcy, which are usually tailored to specific scene domains. Abnormalities are themselves defined in a somewhat subjective form, sometimes according to what the algorithms can detect. In some cases, different authors even define different anomalies on common data sets. Finally, experimental results can be presented on data sets of very different characteristics (e.g., traffic intersection versus subway entrance), frequently proprietary, and with widely varying levels of crowd density.

________
W. Li and N. Vasconcelos are with the Electrical and Computer Engineering Department, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093. E-mail: {wel017, nvasconcelos}@ucsd.edu.
V. Mahadevan is with Yahoo! Labs, Embassy Golf Links Business Park, Bengaluru 560071, India. E-mail: [email protected].
Manuscript received 15 Apr. 2012; revised 26 Feb. 2013; accepted 14 May 2013; published online 12 June 2013. Recommended for acceptance by G. Mori. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-2012-04-0294. Digital Object Identifier no. 10.1109/TPAMI.2013.111.
0162-8828/14/$31.00 © 2014 IEEE. Published by the IEEE Computer Society.
________

In this work, we propose an integrated solution to all these problems. We start by introducing normalcy models that jointly account for the appearance and dynamics of complex crowd scenes. This is done by resorting to a video representation based on dynamic textures (DTs) [11]. This representation is then used to design models of normalcy over both space and time. Temporal normalcy is modeled with a mixture of DTs (MDT) [12] and enables the detection of behaviors that deviate from those observed in the past. Spatial normalcy is measured with a discriminant saliency detector [13] based on MDTs, enabling the detection of behaviors that deviate from those of the surrounding crowd. The integration of spatial and temporal normalcy
with respect to either appearance or dynamics leads to a flexible model of normalcy, applicable to the detection of anomalies of relevance to various surveillance tasks.

To address the scale problem, MDTs are learned at multiple spatial scales. This is done with an efficient hierarchical model, where layers of MDTs with successively larger regions of video support are learned recursively. The local measures of spatial and temporal abnormality are then integrated into a globally coherent anomaly map, by probabilistic inference. This is implemented with a conditional random field (CRF), whose single-node potentials are classifiers of local measures of spatial and temporal abnormality, collected over a range of spatial scales. They are complemented by a novel set of interaction potentials, which account for spatial and temporal context, and integrate anomaly information across the visual field.

Finally, to address the difficulties of empirical evaluation of anomaly detectors on crowded scenes, we introduce a data set of video from walkways on the campus of the University of California, San Diego (UCSD), depicting crowds of varying densities. The data set contains 98 video sequences and five well-defined abnormal categories. These are not "synthetic" or "staged," but abnormal events that occur naturally, for example, bicycle riders that cross pedestrian walkways. Ground truth is provided for abnormal events, as well as a protocol to evaluate detection performance.

The remainder of the paper is organized as follows: Section 2 reviews previous work on anomaly detection in computer vision. The problems of temporal and spatial anomaly detection in crowded scenes are discussed in Section 3. This is followed by the mathematical characterization of multiscale anomaly maps in Section 4, and the proposed CRF for integration of spatial and temporal anomalies across different spatial scales in Section 5. Finally, an extensive experimental evaluation is discussed in Section 6 and some conclusions are presented in Section 7.

2 PRIOR WORK

Recent advances in anomaly detection address event representation and globally consistent statistical inference. Contributions of the first type define features and models for the discrimination of normal and anomalous patterns. Models of normal and abnormal behavior are then learned from training data, and anomalies detected with a minimum probability of error decision rule. Although there are some exceptions [5], the distribution of abnormal patterns is usually assumed uniform, and abnormal events formulated as events of low probability under the model of normalcy.

One intuitive representation for event modeling is based on object trajectories. It involves either explicitly or implicitly segmenting and tracking each object in the scene, and fitting models to the resulting object tracks [14], [15], [16], [6], [17], [18]. While capable of identifying abnormal behaviors of high-level semantics (e.g., unusual long-term trajectories), these procedures are both difficult and computationally expensive for crowded or cluttered scenes. A number of promising alternatives, which avoid processing individual objects, have been recently proposed. These include the modeling of motion patterns with histograms of pixel change [5], histograms of optical flow [19], [8], [20], or optical flow measures [3], [4], [17], [1]. Among these, [3] models local optical flow with a mixture of probabilistic principal component analysis (PCA) models, [4] and [17] draw inspiration from classical studies of crowd behavior [21] to characterize flow with interaction features (e.g., the social force model), and [1] learns the representative flow of groups by clustering optical flow-based particle trajectories. These approaches emphasize dynamics, ignoring anomalies of object appearance and, thus, anomalous behavior without outlying motion. Optical flow, pixel change histograms, and other classical background subtraction features are also difficult to extract from crowded scenes, where the background is by definition dynamic and there is abundant clutter and occlusion. More complete representations account for both appearance and motion. For example, [2] models temporal sequences of spatiotemporal gradients to detect anomalies in densely crowded scenes, [22] declares as abnormal spatiotemporal patches that cannot be reconstructed from previous frames, and [23] pools appearance and motion features over spatial neighborhoods, using a distance to the nearest spatially colocated feature vector among all training video clips to quantify abnormality. Object-based representations, based on location, blob shape, and motion [7] or optical flow magnitude, gradients, location, and scale [9], have also been proposed. Other representations include a bag-of-words over a set of manually annotated event classes [24]. Various methods have also been used to produce anomaly scores. While simple spatial filtering suffices for some applications [19], crowded scenes require more sophisticated graphical models and inference. For example, [6] and [1] adopt Gaussian mixture models (GMMs) to represent trajectories of normal behavior. Cong et al. [8] and Zhao et al. [20] learn a sparse basis and define unusual events as those that can only be reconstructed with either large error or the combination of a large number of basis vectors.

Contributions of the second type address the integration of local anomaly scores, which can be noisy, into a globally consistent anomaly map. The authors of [2], [25], and [7] guarantee temporally consistent inference by modeling normal temporal sequences with hidden Markov models (HMMs). While this enforces consistency along the temporal dimension, there have also been efforts to produce spatially consistent anomaly maps. For example, latent Dirichlet allocation (LDA) has been applied to force flow features in the model of spatial crowd interactions of [4]. On the other hand, [5] and [3] rely on Markov random fields (MRFs) to enforce global spatial consistency. In the realm of sparse representations, [20] guarantees consistency of reconstruction coefficients over space and time by inclusion of smoothness terms in the underlying optimization problem. Finally, [9] models object relationships, using Bayesian networks to implement occlusion reasoning.

It should be noted that most of these methods have not been tested on the densely crowded scenes considered in this work. It is unclear that many of them could deal with the complex motion and object interactions prevalent in such scenes. Furthermore, while most methods include some mechanism to encourage spatial and temporal consistency of anomaly judgments (MRF, LDA, etc.), the underlying decision rule tends to be either predominantly temporal (e.g., trajectories, GMMs, HMMs, or sparse representations learned over time) or spatial
(e.g., interaction models), but is rarely discriminant with respect to both space and time. This makes it difficult to infer whether spatial or temporal modeling is critically important by itself, or what benefits are gained from their joint modeling. Furthermore, the role of scale is rarely considered. These issues motivate the contributions of the following sections.

3 ANOMALY DETECTION

We start by proposing an anomaly detector that accounts for scene appearance and dynamics, spatial and temporal context, and multiple spatial scales.

3.1 Mathematical Formulation
A classical formulation of anomaly detection, which we adopt in this work, equates anomalies to outliers. A statistical model p_X(x) is postulated for the distribution of a measurement X under normal conditions. Abnormalities are defined as measurements whose probability is below a threshold under this model. This is equivalent to a statistical test of hypotheses:

• H0: x is drawn from p_X(x);
• H1: x is drawn from an uninformative distribution p_X(x) ∝ 1.

The minimum probability of error rule for this test is to reject the null hypothesis H0 if p_X(x) < β, where β is the normalization constant of the uninformative distribution. As usual in the literature, we consider the problem of anomaly detection from localized video measurements x, where x is a spatiotemporal patch of small dimensions.

3.2 Spatial versus Temporal Anomalies
The normalcy model p_X(x) can have both a temporal and a spatial component. Temporal normalcy reflects the intuition that normal events are recurrent over time, i.e., previous observations establish a contextual reference for normalcy judgments. Consider a highway lane where cars move with a certain orientation and speed. Bicycles or cars heading in the opposite direction are easily identified as abnormal because they give rise to observations x substantially different from those collected in the past. In this sense, temporal normalcy detection is similar to background subtraction [26]. A model of normal behavior is learned over time, and measurements that it cannot explain are denoted temporal anomalies.

Spatial normalcy reflects the intuition that some events that would not be abnormal per se are abnormal within a crowd. Since the crowd places physical or psychological constraints on individual behavior, behaviors feasible in isolation can have low probability in a crowd context. For example, while there is nothing abnormal about an ambulance that rides at 50 mph in a stretch of highway, the same observation within a highly congested highway is abnormal. Note that the only indication of abnormality is the difference between the crowd and the object at the time of the observation, not that the ambulance moves at 50 mph. Since the detection of such abnormalities is mostly based on spatial context, they are denoted spatial anomalies. Their detection does not depend on memory. Instead, it is based on a continuously evolving, instantaneously adaptive, definition of normalcy. In this sense, the detection of spatial anomalies can be equated to saliency detection [27].

3.3 Roles of Crowds and Scale
Most available background subtraction and saliency detection solutions are not applicable to crowded scenes, where backgrounds can be highly dynamic. In this case, it is not sufficient to detect variations of image intensity, or even optical flow, to detect anomalous events. Instead, normalcy models must rely on sophisticated joint representations of appearance and dynamics. In fact, even such models can be ineffective. Since crowds frequently contain distinct subentities, for example, vehicles or groups of people moving in different directions, anomaly detection requires modeling multiple video components of different appearance and dynamics. A model that has been shown successful in this context is the mixture of DTs [12]. This is the representation adopted in this work.

Another challenging aspect of anomaly detection within crowds is scale. Spatial anomalies are usually detected at the scale of the smallest scene entities, typically people. However, a normal event at this scale may be anomalous at a larger scale, and vice versa. For example, while a child that rides a bicycle appears normal within a group of bicycle-riding children, the group is itself anomalous on a crowded pedestrian sidewalk. Local anomaly detectors, with small regions of interest, cannot detect such anomalies. To address this, we represent crowded scenes with a hierarchy of MDTs that cover successively larger regions. This is done with a computationally efficient hierarchical model, where MDT layers are estimated recursively.

A similar challenge holds for temporal anomalies. While their detection is usually based on a small number of video frames, certain anomalies can only be detected over long time spans. For example, while it is normal for two pedestrian trajectories to converge or diverge at any point in time, a cyclical convergence and divergence is probably abnormal. Anomaly detection across time scales is, however, more complex than across spatial scales, due to constraints of instantaneous detection and implementation complexity. Since video has to be buffered before anomalies can be detected, large temporal windows imply long detection delays and storage of many video frames. Due to this, we do not consider multiple temporal scales in this work. A single scale is chosen, using acceptable values of delay and storage complexity, and used throughout our experiments. Note that, like their spatial counterparts, temporal anomaly maps are computed at multiple spatial scales. Hence, in what follows, the term "scale" refers to the spatial support of anomaly detection, for both spatial and temporal anomalies.

4 NORMALCY AND ANOMALY MODELING

In this section, we review the MDT model, discuss the design of temporal and spatial models of normalcy, and formulate the computation of anomaly maps.

4.1 Mixture of Dynamic Textures
The MDT models a sequence of τ video frames x_{1:τ} = [x_1, x_2, ..., x_τ] as a sample from one of K dynamic textures [11]:

p(x_{1:τ}) = Σ_{i=1}^{K} α_i p(x_{1:τ} | z = i).    (1)

The mixture components p(x_{1:τ} | z = i) are linear dynamic systems (LDS) defined by

s_{t+1} = A_z s_t + n_t,    (2a)
x_t = C_z s_t + m_t,    (2b)

where z is a multinomial random variable of parameters α_i (α_i ≥ 0, Σ_i α_i = 1), which indexes the mixture component from which x_t is drawn. s_t is a hidden state variable that encodes scene dynamics, and x_t the vector of pixels in video frame t. A_z, C_z are the transition and observation matrices of component z, whose initial condition is s_1 ~ N(μ_z, S_z), and noise processes are defined by n_t ~ N(0, Q_z) and m_t ~ N(0, R_z). The model parameters are learned by maximum-likelihood estimation (MLE) from a collection of video patches, with the expectation-maximization (EM) algorithm of [12], which is reviewed in Appendix A.1, available in the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2013.111.

[Fig. 1. Temporal anomaly detection. An MDT is learned per scene subregion, at training time. A temporal anomaly map is produced by measuring the negative log probability of each video patch under the MDT of the corresponding region.]

4.2 Temporal Anomaly Detection
Temporal anomaly detection is inspired by the popular background subtraction method of [26]. This uses a GMM per image location to model the distribution of image intensities. Observations of low probability under these GMMs are declared foreground. For anomaly detection in crowds, the GMM is replaced by an MDT, and the pixel grid replaced by one of preset displacement. Grid locations define the centers of video cells, from which video patches are extracted. The patches extracted from a subregion (group of cells) are used to learn an MDT, during a training phase, as illustrated in Fig. 1. After this phase, subregion patches of low probability under the associated MDT are considered anomalies. Given patch x_{1:τ}, the distribution of the hidden state sequence s_{1:τ} under the ith DT component, p_{S|X}(s_{1:τ} | x_{1:τ}, z = i), is estimated with a Kalman filter and smoother [28], [29], as discussed in Appendix A.2, available in the online supplemental material. The value of the temporal anomaly map at location l is the negative log probability of the most-likely state sequence for the patch at l:

T(l) = −log [ Σ_{i=1}^{K} α_i p(s_{1:τ}^{(i)}(l) | z = i) ],    (3)

where s_{1:τ}^{(i)}(l) = argmax_{s_{1:τ}} p(s_{1:τ} | x_{1:τ}(l), z = i). We note that this generalizes the mixture of PCA models of optical flow [3]. The matrix C_z of (2b) is a PCA basis for patches drawn from mixture component z, but the PCA decomposition pertains to patch appearance, not optical flow. Patch dynamics are captured by the hidden state sequence s_{1:τ}, which is a trajectory in PCA space. Hence, unlike mixtures of optical flow, the representation is temporally smooth. The joint representation of appearance and dynamics makes the MDT a better representation for crowd video than the mixture of PCA.

4.3 Spatial Anomaly Detection
Spatial anomaly detection is inspired by previous work in saliency detection [27], [13]. Saliency is defined in a center-surround manner. Given a set of features, salient locations are those of substantial feature contrast with their immediate surround. Spatial anomalies are then defined as locations whose saliency is above some threshold. In this work, we rely on the discriminant saliency criterion of [13].

4.3.1 Discriminant Saliency
Discriminant saliency formulates the saliency problem as a hypothesis test between two classes: a class of salient stimuli, and a background class of stimuli that are not salient. Two windows are defined at each scene location l: a center window W_l^1, with label C(l) = 1, containing the location, and a surrounding annular window W_l^0, with label C(l) = 0, containing background. A set of feature responses X is computed for each of the windows W_l^c, c ∈ {0, 1}, and S(l), the saliency of location l, is defined as the extent to which they discriminate between the two classes. This is quantified by the mutual information (MI) between feature responses and class label [13]:

S(l) = Σ_{c=0}^{1} p_{C(l)}(c) KL[ p_{X|C(l)}(x | c) ‖ p_X(x) ],    (4)

where p_{X|C(l)}(x | c) are class-conditional densities and KL(p ‖ q) = ∫ p_X(x) log( p_X(x) / q_X(x) ) dx is the Kullback-Leibler (KL) divergence between p_X(x) and q_X(x) [30]. Locations of maximal saliency are those where the discrimination between center and surround can be made with highest confidence, i.e., where (4) is maximal.

The discriminant saliency principle can be applied to many features [31]. When X consists of optical flow, it generalizes the force flow model of [4], where saliency is defined as the difference between the optical flow at l and the average flow in its neighborhood (see [4, (8)]). This is a simplified form of discriminant saliency, which replaces the MI of (4) by a difference to the mean background response.

4.3.2 Center-Surround Saliency with MDTs
Optical flow methods provide a coarse representation of dynamics and ignore appearance. For background subtraction, this problem has been addressed with the combination of DTs and discriminant saliency [32]. While using a more powerful representation than force flow, this method learns a single DT from both center and surround windows. This assumes a homogeneity of appearance and dynamics within the two windows that does not hold for crowds, where foregrounds and backgrounds can be quite diverse.
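To make the center-surround test of (4) concrete, the following toy sketch (our illustration, not the authors' implementation) evaluates the discriminant saliency score for scalar Gaussian center and surround models, with the KL terms computed by numerical integration. In the paper, the class-conditional densities are instead mixtures of dynamic textures over spatiotemporal patches, for which the KL divergence must be approximated.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    # Density of N(mu, var) evaluated on the grid x.
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def saliency(mu_c, var_c, mu_s, var_s, p_center=0.2):
    """Toy 1D analogue of (4): S = sum_c p(c) KL(p(x|c) || p(x)),
    where p(x) is the prior-weighted mixture of the center (c=1) and
    surround (c=0) class-conditional densities. KL divergences are
    evaluated by numerical integration over hypothetical scalar
    feature responses."""
    x = np.linspace(-20.0, 20.0, 8001)
    dx = x[1] - x[0]
    priors = np.array([1.0 - p_center, p_center])     # p(c=0), p(c=1)
    cond = np.vstack([gaussian_pdf(x, mu_s, var_s),   # p(x | c=0), surround
                      gaussian_pdf(x, mu_c, var_c)])  # p(x | c=1), center
    marginal = priors @ cond                          # p(x), a 2-component mixture
    eps = 1e-300                                      # guard against log(0)
    kls = [np.sum(c * np.log((c + eps) / (marginal + eps))) * dx for c in cond]
    return float(priors @ np.array(kls))
```

When center and surround coincide, the score vanishes; it grows with the feature contrast between the two windows, which is exactly the behavior used to flag spatial anomalies.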
In this work, we adopt the MDT as the probability distribution p_{X|C(l)}(x_{1:τ} | c) from which spatiotemporal patches x_{1:τ} are drawn. We note that, under assumptions of Gaussian initial conditions and noise, patches x_{1:τ} drawn from a DT have a Gaussian probability distribution [33],

x_{1:τ} ~ N(γ, Φ),    (5)

whose parameters follow from those of the LDS (2). When the class-conditional distributions of the center and surround classes, c ∈ {0, 1}, at location l are mixtures of K_c DTs, it follows that

p_{X|C(l)}(x_{1:τ} | c) = Σ_{i=1}^{K_c} α_i^c N(x_{1:τ}; γ_i^c, Φ_i^c) = Σ_{i=1}^{K_c} α_i^c p_{X|C(l)}^i(x_{1:τ} | c),    (6)

for c ∈ {0, 1}. The marginal distribution is then

p_X(x_{1:τ}) = Σ_{c=0}^{1} p_{C(l)}(c) p_{X|C(l)}(x_{1:τ} | c)
             = Σ_{c=0}^{1} p_{C(l)}(c) Σ_{i=1}^{K_c} α_i^c N(x_{1:τ}; γ_i^c, Φ_i^c)
             = Σ_{i=1}^{K_0+K_1} ω_i N(x_{1:τ}; γ_i, Φ_i)
             = Σ_{i=1}^{K_0+K_1} ω_i p_X^i(x_{1:τ}),    (7)

and the saliency measure of (4) requires the KL divergence between (6) and (7). This is problematic because there is no closed-form solution for the KL divergence between two MDTs. However, because the MDT components are Gaussian, it is possible to rely on popular approximations to the KL divergence between Gaussian mixtures. We adopt the variational approximation of [34]:

KL(p_{X|C} ‖ p_X) ≈ Σ_i α_i^C log [ Σ_{j=1}^{K_C} α_j^C exp(−KL(p_{X|C}^i ‖ p_{X|C}^j)) / Σ_{j=1}^{K_0+K_1} ω_j exp(−KL(p_{X|C}^i ‖ p_X^j)) ].    (8)

Each term of (8) contains a KL divergence between DTs, which can be computed in closed form [35]. For example, for the terms in the denominator,

KL(p_{X|C}^i ‖ p_X^j) = (1/2) [ log( |Φ_j| / |Φ_i^C| ) + Tr( Φ_j^{−1} Φ_i^C ) + ‖γ_i^C − γ_j‖²_{Φ_j} − τm ],    (9)

where m is the number of pixels per frame and ‖z‖²_Σ = z^T Σ^{−1} z. Numerator terms are computed similarly. All computations can be performed recursively [35].

[Fig. 2. Spatial anomaly detection using center-surround saliency with MDT models.]

4.3.3 Spatial Anomaly Map
The spatial anomaly map is a map of the saliency S(l) at locations l. Given a location, this requires 1) learning MDTs from center and surround windows, and 2) computing a weighted average of these mixtures to obtain (7). Since learning MDTs per location is computationally prohibitive, we resort to the following approximation. A dense collection of overlapping spatiotemporal patches is first extracted from V(t), a 3D video volume temporally centered at the current frame. A single MDT with K^g mixture components, denoted {γ_i^g, Φ_i^g}_{i=1}^{K^g}, is learned from this patch collection. Each patch is then assigned to the mixture component of largest posterior probability. This segments the volume into superpixels, as shown in Fig. 2.

At location l, the MDTs of (6) and (7) are derived from the global mixture model. The DT components are assumed equal to those of the latter and only the mixing proportions are recomputed, using the ratio of pixels assigned to each component in the respective windows:

p_{X|C(l)}(x_{1:τ} | c) = Σ_{i=1}^{K^g} [ ( Σ_{l′∈W_l^c} M_i^{l′} ) / ( Σ_{l′∈W_l^c} 1 ) ] N(x_{1:τ}; γ_i^g, Φ_i^g),    (10)

for c ∈ {0, 1}, where M_i^{l′} = 1 if l′ is assigned to mixture component i and 0 otherwise. The prior probabilities for center and surround, p_C(c), are proportional to the ratio of volumes of the center and surround windows. S(l) is computed with (4), using (8) and (9). Note that the KL divergence terms in (8) only require the computation of (K^g choose 2) KL divergences between the K^g mixture components, and these are computed only once per frame because all mixture components are shared (i.e., the terms exp(−KL(p ‖ q)) in (8) are fixed per frame). This procedure is repeated for every frame in the test video, as illustrated in Fig. 2.

4.4 Multiscale Anomaly Maps
To account for anomalies at multiple spatial scales, we rely on a hierarchical mixture of dynamic textures (H-MDT). This is a model with various MDT layers, learned from regions of different spatial support. At the finest scale, a video sequence is divided into n_L subregions (e.g., 5 × 8
LI ET AL.: ANOMALY DETECTION AND LOCALIZATION IN CROWDED SCENES 23

Fig. 3. Computation of temporal anomaly maps with multiscale spatial supports using the H-MDT. MDTs of increasingly larger spatial support are
estimated recursively, with the H-EM algorithm. Their application to a query video produces temporal anomaly maps based on supports of various
spatial scales.

subregions). $n_l$ MDT models $\{M_i^l\}_{i=1}^{n_l}$ are then learned from patches extracted from each of the subregions. At the coarsest scale, the whole visual field is represented with a global MDT. This results in a hierarchy of MDT models $\{\{M_i^1\}_{i=1}^{n_1}, \ldots, M_1^L\}$, where $M_j^s$, the $j$th model at scale $s$, is learned from subregion $R_j^s$. The hierarchy of support windows $\{\{R_i^1\}_{i=1}^{n_1}, \ldots, R^L\}$ resembles the spatial pyramid structure of [36]. H-MDT models can be learned efficiently with the hierarchical expectation-maximization (H-EM) algorithm of [37]. Rather than collecting patches anew from larger regions, it estimates the models at a given layer directly from the parameters of the MDT models at the layer of immediately higher resolution.

For anomaly detection, each model is applied to the corresponding window. This produces $L$ anomaly maps, $\{T^1, \ldots, T^L\}$, as illustrated in Fig. 3. A hierarchy of spatial anomaly maps, $\{S^1, \ldots, S^L\}$, is also computed. For all $s$, the computation of $S^s$ relies on a global mixture model $M$. The mixing proportions of (10) are computed using surround windows of size identical to $\{R_i^s\}$ and center windows of constant size, as summarized in Algorithm 1 (see Appendix B for all algorithms, available in the online supplemental material).

5 GLOBALLY CONSISTENT ANOMALY MAPS

In this section, we introduce a layer of statistical inference to fuse anomaly information across time, space, and scale in a globally consistent manner.

5.1 Discriminative Model

The anomaly maps of the previous section span space, time, and spatial scale. Being derived from local measurements, they can be noisy. A principled framework is required to 1) integrate anomaly scores from the individual maps, 2) eliminate noise, and 3) guarantee spatiotemporal consistency of anomaly judgments throughout the visual field. For this, we rely on a conditional random field [38] inspired by the discriminative random field (DRF) of [39]. An anomaly label $y_i \in \{-1, 1\}$ is defined at each location $i$ in a set $S$ of observation sites. Given a video clip $\mathbf{x}$, the conditional likelihood of observing a configuration of anomaly labels $\mathbf{y} = \{y_i \mid i \in S\}$ is

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z} \exp\Bigg\{ \sum_{i \in S} \Bigg[ A(y_i, \mathbf{x}) + \frac{1}{|N_i|} \sum_{j \in N_i} I(y_i, y_j, \mathbf{x}, i, j) \Bigg] \Bigg\}, \tag{11}$$

where $Z$ is a partition function and $N_i$ the neighborhood of site $i$. The single-site and interaction potentials of (11),

$$A(y_i, \mathbf{x}) = \log \sigma\big( y_i \mathbf{w}^T \mathbf{f}_i \big), \tag{12}$$

where $\sigma(x) = (1 + e^{-x})^{-1}$ is the sigmoid function, and

$$I(y_i, y_j, \mathbf{x}, i, j) = y_i y_j \, \mathbf{v}^T \boldsymbol{\mu}(\mathbf{f}_i, \mathbf{f}_j, i, j), \tag{13}$$

are based on a feature vector $\mathbf{f}_i$ that concatenates the spatial and temporal anomaly scores of site $i$ at the $L$ spatial scales, plus a bias term (set to 1):

$$\mathbf{f}_i = \big[1, T^1(i), \ldots, T^L(i), S^1(i), \ldots, S^L(i)\big]^T. \tag{14}$$

$\mathbf{w}, \mathbf{v}$ are parameter vectors and $\boldsymbol{\mu}$ a compound feature:

$$\boldsymbol{\mu}(\mathbf{f}_i, \mathbf{f}_j, i, j) = e^{-\beta|i-j|} \exp(-\mathbf{h}_{i,j}), \tag{15}$$

where $|i - j|$ is the euclidean distance between sites $i, j$, and $\exp(-\mathbf{h}_{i,j})$ the entry-wise exponential of $-\mathbf{h}_{i,j}$. The vector $\mathbf{h}_{i,j}$ contains the diagonal entries of $(\mathbf{f}_i - \mathbf{f}_j)(\mathbf{f}_i - \mathbf{f}_j)^T$.

The single-site potential of (12) reflects the anomaly belief at site $i$. Using it alone, i.e., without (13), (11) is a logistic regression model. In this case, the detection of each anomaly is based on information from site $i$ exclusively. The addition of the interaction potential of (13) enables the model to take into account information from site $i$'s neighborhood $N_i$. This smoothes the single-site prediction, encouraging consistency of neighboring labels. The interaction potential can be interpreted as a classifier that predicts whether two neighboring sites have the same label. Note that because $\mathbf{f}$ contains anomaly scores at different spatial scales, $\mathbf{h}_{i,j}$ (or $\boldsymbol{\mu}_{i,j}$) accounts for the
similarity between the two observations in anomaly spaces of different scale (i.e., under different spatial normalcy contexts). The interaction potentials adaptively modulate the intensity of intersite smoothing according to these similarity measures (and how they are weighted by $\mathbf{v}$). The parameters $\mathbf{w}$ and $\mathbf{v}$ encode the relative importance of different features.

5.2 Online CRF Filter

The model of (11) requires inference over the entire video sequence. This is not suitable for online applications. An online version can be implemented by conditioning the anomaly label $\mathbf{y}^{(\tau)}$ at time $\tau$ on 1) observations for $t \leq \tau$, and 2) anomaly labels for $t < \tau$, leading to

$$P\big(\mathbf{y}^{(\tau)} \,\big|\, \{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}, \{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, \boldsymbol{\theta}\big) = \frac{1}{Z}\exp\Bigg\{\sum_{i\in S^{\tau}}\Bigg[A\big(y_i^{(\tau)},\mathbf{x}^{(\tau)}\big) + \frac{1}{|N_i^S|}\sum_{j\in N_i^S} I^S\big(y_i^{(\tau)},y_j,\mathbf{x}^{(\tau)},i,j\big) + \frac{1}{|N_i^T|}\sum_{k\in N_i^T} I^T\big(y_i^{(\tau)},y_k,\mathbf{x},i,k\big)\Bigg]\Bigg\}, \tag{16}$$

where $S^{\tau}$ is the set of observations at time $\tau$ (pixels of the current frame). Two neighborhoods are defined per location $i$: spatial $N_i^S$ ($N_i^S \subset S^{\tau}$) and temporal $N_i^T$ ($N_i^T \subset \bigcup_{t=1}^{\tau-1} S^t$). The graphical model is shown at the top of Fig. 4, and these neighborhoods at the bottom. The parameters $\boldsymbol{\theta} = \{\mathbf{w}, \mathbf{v}_T, \mathbf{v}_S, \beta_T, \beta_S\}$ are estimated during training.

Fig. 4. CRF filter. Top: Graphical model. Bottom: Spatial and temporal neighborhoods.

5.2.1 Learning

Both (11) and (16) can be learned with standard optimization techniques, such as gradient descent or the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. To improve generalization, the model is regularized with a Gaussian prior of standard deviation $\sigma_\theta$, for all parameters. Given $N$ independent training samples $\{\mathbf{x}^{(n)}, \mathbf{y}^{(n)}\}_{n=1}^N$, the gradients of the regularized log-likelihood with respect to $\mathbf{w}$, $\mathbf{v}$, and $\beta$ are

$$\frac{\partial}{\partial \mathbf{w}} \log p\big(\{\mathbf{y}^{(n)}\}_{n=1}^N \,\big|\, \{\mathbf{x}^{(n)}\}_{n=1}^N\big) = \sum_{n=1}^N \Bigg\{ \sum_{i \in S} \sigma\big({-y_i^{(n)}} \mathbf{w}^T \mathbf{f}_i^{(n)}\big) y_i^{(n)} \mathbf{f}_i^{(n)} - \mathbb{E}\Bigg[ \sum_{i \in S} \sigma\big({-y_i} \mathbf{w}^T \mathbf{f}_i^{(n)}\big) y_i \mathbf{f}_i^{(n)} \Bigg] \Bigg\} - \frac{1}{\sigma_w^2} \mathbf{w}, \tag{17}$$

$$\frac{\partial}{\partial \mathbf{v}} \log p\big(\{\mathbf{y}^{(n)}\}_{n=1}^N \,\big|\, \{\mathbf{x}^{(n)}\}_{n=1}^N\big) = \sum_{n=1}^N \Bigg\{ \sum_{i \in S} \frac{1}{|N_i|} \sum_{j \in N_i} e^{-\beta|i-j|} y_i^{(n)} y_j^{(n)} \exp\big({-\mathbf{h}_{i,j}^{(n)}}\big) - \mathbb{E}\Bigg[ \sum_{i \in S} \frac{1}{|N_i|} \sum_{j \in N_i} e^{-\beta|i-j|} y_i y_j \exp\big({-\mathbf{h}_{i,j}^{(n)}}\big) \Bigg] \Bigg\} - \frac{1}{\sigma_v^2} \mathbf{v}, \tag{18}$$

and

$$\frac{\partial}{\partial \beta} \log p\big(\{\mathbf{y}^{(n)}\}_{n=1}^N \,\big|\, \{\mathbf{x}^{(n)}\}_{n=1}^N\big) = \sum_{n=1}^N \Bigg\{ -\sum_{i \in S} \frac{1}{|N_i|} \sum_{j \in N_i} I\big(y_i^{(n)}, y_j^{(n)}, \mathbf{x}^{(n)}, i, j\big) |i - j| + \mathbb{E}\Bigg[ \sum_{i \in S} \frac{1}{|N_i|} \sum_{j \in N_i} I\big(y_i, y_j, \mathbf{x}^{(n)}, i, j\big) |i - j| \Bigg] \Bigg\} - \frac{1}{\sigma_\beta^2} \beta, \tag{19}$$

where the expectation is evaluated with distribution $p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})$. The conditional expectations of (17)-(19) require evaluation of the partition function $Z$, a problem known to be NP-hard. As is common in the literature, this difficulty is avoided by estimating expectations through sampling. Although sampling methods such as Markov chain Monte Carlo (MCMC) can converge to the true distribution, this usually requires many iterations. Since the procedure must be repeated per gradient ascent step, these methods are impractical. On the other hand, approximations such as contrastive divergence minimization (which runs MCMC a limited number of times with specific starting points) have been shown to be successful for vision applications [40], [41]. We adopt these approximations for CRF learning.

This leverages the fact that, denoting any of the parameters $\mathbf{w}, \mathbf{v}_T, \mathbf{v}_S, \beta_T, \beta_S$ by $\theta$, the partial gradients of (17)-(19) are

$$\frac{\partial}{\partial \theta} \log p\big(\{\mathbf{y}^{(n)}\}_{n=1}^N \,\big|\, \{\mathbf{x}^{(n)}\}_{n=1}^N, \boldsymbol{\theta}\big) = \sum_{n=1}^N \Big( F_\theta\big(\mathbf{y}^{(n)}, \mathbf{x}^{(n)}\big) - \mathbb{E}_{(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})}\big[ F_\theta\big(\mathbf{y}, \mathbf{x}^{(n)}\big) \big] \Big) - \frac{1}{\sigma_\theta^2} \theta, \tag{20}$$

where $F_\theta(\mathbf{y}, \mathbf{x})$ is the sum of the terms in the summations of (17), (18), or (19) that depend on $\theta$. Contrastive divergence approximates the intractable conditional
expectation $\mathbb{E}_{(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})}[F_\theta(\mathbf{y}, \mathbf{x}^{(n)})]$ by $F_\theta(\hat{\mathbf{y}}, \mathbf{x}^{(n)})$, where $\hat{\mathbf{y}}$ is the "evil twin" of the ground-truth label field $\mathbf{y}^{(n)}$ [41]. $\hat{\mathbf{y}}$ is drawn by MCMC, using the inference procedure discussed in Section 5.2.2, the current parameter estimates, and the ground-truth labels $\mathbf{y}^{(n)}$ as a starting point.

Given the estimate of the partial gradients, the gradient ascent rule for parameter updates reduces to

$$\theta \leftarrow \theta + \eta \Bigg[ \sum_{n=1}^N \Big( F_\theta\big(\mathbf{y}^{(n)}, \mathbf{x}^{(n)}\big) - F_\theta\big(\hat{\mathbf{y}}^{(n)}, \mathbf{x}^{(n)}\big) \Big) - \frac{1}{\sigma_\theta^2} \theta \Bigg], \tag{21}$$

where $\eta$ is a learning rate. In our implementation, this rule is initialized with $\mathbf{v}_T = \mathbf{v}_S = \mathbf{1}$ and $\beta_T = \beta_S = 0$. The initial value of $\mathbf{w}$ is learned, assuming a logistic regression model ($\mathbf{v}_T = \mathbf{v}_S = \mathbf{0}$ in (16)), with the procedure of [43].

5.2.2 Inference

The inference problem is to determine the most likely anomaly prediction $\mathbf{y}^\star$ for a query frame $\mathbf{x}^{(\tau)}$, given previous predictions $\{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}$ and observations $\{\mathbf{x}^{(t)}\}_{t=1}^{\tau}$:

$$\mathbf{y}^\star = \operatorname*{argmax}_{\mathbf{y}} \log p\big(\mathbf{y} \,\big|\, \{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}, \{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, \boldsymbol{\theta}\big) = \operatorname*{argmax}_{\mathbf{y}} \sum_{i \in S} \Bigg[ A\big(y_i, \mathbf{x}^{(\tau)}\big) + \frac{1}{|N_i|} \sum_{j \in N_i} I(y_i, y_j, \mathbf{x}, i, j) \Bigg]. \tag{22}$$

Again, exact inference is intractable. We rely on Gibbs sampling to approximate the optimal prediction. This consists of drawing labels from the conditional distribution:

$$p\big(y_i \,\big|\, \{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, \{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}, \mathbf{y}_{-i}, \boldsymbol{\theta}\big) = \frac{p\big(y_i, \mathbf{y}_{-i} \,\big|\, \{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, \{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}, \boldsymbol{\theta}\big)}{p\big(\mathbf{y}_{-i} \,\big|\, \{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, \{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}, \boldsymbol{\theta}\big)} = \frac{1}{Z_i} \exp\Big\{ F_i\big(\{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, \{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}, \mathbf{y}_{-i}, y_i, \boldsymbol{\theta}\big) \Big\}, \tag{23}$$

where $F_i(\{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, \{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}, \mathbf{y}_{-i}, y_i, \boldsymbol{\theta})$ is the sum of potential functions that depend on site $i$ (i.e., its "Markov blanket"):

$$F_i\big(\{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, \{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}, \mathbf{y}_{-i}, y_i, \boldsymbol{\theta}\big) = A\big(y_i, \mathbf{x}^{(\tau)}\big) + \frac{1}{|N_i|} \sum_{j \in N_i} I\big(y_i, y_j, \{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, i, j\big) + \sum_{j : i \in N_j} \frac{1}{|N_j|} I\big(y_j, y_i, \{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, j, i\big), \tag{24}$$

and $Z_i$ the corresponding partition function:

$$Z_i = \sum_{y_i'} \exp\Big[ F_i\big(\{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, \{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}, \mathbf{y}_{-i}, y_i', \boldsymbol{\theta}\big) \Big]. \tag{25}$$

The procedure is detailed in Algorithms 2 and 3, available in the online supplemental material, where we present the online CRF filter used to estimate the label field. During learning, the filter is initialized with the ground-truth labels ($\mathbf{y}_0 = \mathbf{y}^{(\tau)}$). During testing, this initialization relies on the predictions of the single-site classifiers ($\mathbf{v}_T = \mathbf{v}_S = \mathbf{0}$). In our implementation, the filter is run for $N_s = 10$ iterations. Again, the complete anomaly detection procedure is summarized in Algorithm 4, available in the online supplemental material.

Fig. 5. Exemplar normal/abnormal frames in Ped1 (top) and Ped2 (bottom). Anomalies (red boxes) include bikes, skaters, carts, and wheelchairs.

6 EXPERIMENTS

In this section, we introduce a new data set and an experimental protocol for the evaluation of anomaly detection in crowded environments, and use them to evaluate the proposed anomaly detector.

6.1 UCSD Pedestrian Anomaly Data Set

In the literature, anomaly detection has frequently been evaluated by visual inspection [19], [7], [3], or with coarse ground truth, for example, frame-level annotation of abnormal events [4], [1]. This does not completely address the anomaly detection problem, where it is usually desired to localize anomalies in both space and time. To enable this, we introduce a data set¹ of crowd scenes with precisely localized anomalies and metrics for the evaluation of their detection. The data set consists of video clips recorded with a stationary camera mounted at an elevation, overlooking pedestrian walkways on the UCSD campus. The crowd density in the walkways is variable, ranging from sparse to very crowded. In the normal setting, the video contains only pedestrians. Abnormal events are due to either 1) the circulation of nonpedestrian entities in the walkways, or 2) anomalous pedestrian motion patterns. Commonly occurring anomalies include bikers, skaters, small carts, and people walking across a walkway or in the surrounding grass. A few instances of wheelchairs are also recorded. All abnormalities occur naturally, i.e., they were not staged or synthesized for data set collection.

The data set is organized into two subsets, corresponding to the two scenes of Fig. 5. The first, denoted "Ped1," contains clips of 158 x 238 pixels, which depict groups of people walking toward and away from the camera, with some amount of perspective distortion. The second, denoted "Ped2," has spatial resolution of 240 x 360 pixels and depicts a scene where most pedestrians move horizontally. The video footage of each scene is sliced into clips of 120-200 frames. A number of these (34 in Ped1 and 16 in Ped2) are to be used as training set for the condition of

1. Available from http://www.svcl.ucsd.edu/projects/anomaly/dataset.html.
TABLE 1
Composition of UCSD Anomaly Data Set

a. Number of clips/number of anomaly instances. b. Some clips contain more than one type of anomaly.

normalcy. The test set contains clips (36 for Ped1 and 12 for Ped2) with both normal (around 5,500) and abnormal (around 3,400) frames. The abnormalities of each set are summarized in Table 1.

Frame-level ground-truth annotation, indicating whether anomalies occur within each frame, and manually collected pixel-level binary anomaly masks, which identify the pixels containing anomalies, are available per test clip. We note that this includes ground truth on Ped1 contributed by Antic and Ommer [9], and supersedes the ground truth available on an earlier version of this work [43]. We denote the current ground truth by "full annotation" and the previous one by "partial annotation." Unless otherwise noted, the results of the subsequent sections correspond to the full annotation.

6.2 Evaluation Methodology

Two criteria are used to evaluate anomaly detection accuracy: a frame-level criterion and a pixel-level criterion. Both are based on true-positive rates (TPRs) and false-positive rates (FPRs), denoting "an anomalous event" as "positive" and "the absence of anomalous events" as "negative." A frame containing anomalies is denoted a positive, otherwise a negative. The true and false positives under the two criteria are:

- Frame-level criterion. An algorithm predicts which frames contain anomalous events. This is compared to the clip's frame-level ground-truth anomaly annotations to determine the number of true- and false-positive frames.
- Pixel-level criterion. An algorithm predicts which pixels are related to anomalous events. This is compared to the pixel-level ground-truth anomaly annotation to determine the number of true-positive and false-positive frames. A frame is a true positive if 1) it is positive and 2) at least 40 percent of its anomalous pixels are identified; a frame is a false positive if it is negative and any of its pixels are predicted as anomalous.

The two measures are combined into a receiver operating characteristic (ROC) curve of TPR versus FPR:

$$\text{TPR} = \frac{\#\text{ of true-positive frames}}{\#\text{ of positive frames}}, \qquad \text{FPR} = \frac{\#\text{ of false-positive frames}}{\#\text{ of negative frames}}.$$

Performance is also summarized by the equal error rate (EER), the ratio of misclassified frames at which FPR = 1 - TPR, for the frame-level criterion, or the rate of detection (RD), i.e., 1 - EER, for the pixel-level criterion.

Note that, although widely used in the literature, the frame-level criterion only measures temporal localization accuracy. This enables errors due to "lucky co-occurrences" of prediction errors and true abnormalities. For example, it assigns a perfect score to an algorithm that identifies a single anomaly at a random location of a frame with anomalies. The pixel-level criterion is much stricter and more rigorous. By evaluating both the temporal and spatial accuracy of the anomaly predictions, it rules out these "lucky co-occurrences." We believe that the pixel-level criterion should be the predominant criterion for the evaluation of anomaly detection algorithms.

6.3 Experimental Setup

Unless otherwise noted, observation sites are a video sub-lattice with spatial interval of four pixels and temporal interval of five frames. Temporal anomaly maps rely on patches of 13 x 13 x 15 pixels. The temporal extent of 15 frames provides a reasonable compromise between the ability to detect anomalies and the delay (1.5 s) and storage (15 video frames) required for anomaly detection. To minimize computation, patches of variance smaller than 500 are discarded.² Temporal H-MDT models are learned from fine to coarse scale. At the finest scale, there are 6 x 10 windows $R_i^1$ on Ped1 (8 x 11 for Ped2), each covering a 41 x 41 pixel area and overlapping by 25 percent with each of its four neighbors. An MDT of five components is learned per window. At coarser spatial scales, an MDT is estimated from the MDTs of the four regions that it covers at the immediately finer resolution. Each estimated MDT has one more component than its ancestor MDTs. Overall, there are 10 scales in Ped1 and 11 in Ped2. Spatial anomaly maps use a 31 x 31 center window and surround windows of size equivalent to $R_i^s$. For segmentation, 7 x 7 x 10 patches are extracted from the 40 frames surrounding that under analysis. There are five DT components at all levels of the spatial hierarchy. Both temporal and spatial MDTs have an eight-dimensional state space. The sensitivity of the proposed detector to some of these parameters is discussed in Appendix C.2, available in the online supplemental material.

6.4 Descriptor Comparison

The first experiment evaluated the benefits of MDT-based over optical flow descriptors. The optical flow descriptors considered were the local motion histogram (LMH) of [19], the force flow descriptor of [4], and the mixture of optical flow models (MPPCA) of [3]. LMH uses statistics of local motion, and is representative of traditional background subtraction representations; force flow is a descriptor for spatial anomaly detection; and MPPCA a temporal anomaly detector. For the MDT, only the anomaly maps of finest temporal and coarsest spatial scale were considered here. Since the goal was to compare descriptors, the high-level components of the models in which they were proposed, for example, the LDA of [4], the MRF of [3], and the proposed CRF, were not used. Instead, anomaly predictions were smoothed with a simple

2. This variance threshold is quite conservative, only eliminating regions of very little motion. For the data sets used in our experiments, this has not led to the elimination of any objects from further consideration. In other contexts, for example, scenes where objects are static for periods of time, this could happen. In this case, the threshold should be set to zero.
TABLE 2
Descriptor Performance on UCSD Anomaly Data Set

Numbers outside/inside parentheses are results by full/partial annotation (same for the rest of the paper).

20 x 20 x 10 Gaussian filter. Anomaly predictions were generated by thresholding the filtered anomaly maps, and ROC curves by varying thresholds.

The performance of the different descriptors, under both the frame-level (EER) and pixel-level (RD) criteria (using both full and partial annotation in Ped1), is summarized in Table 2. The corresponding ROC curves are presented in Appendix C.1 (Fig. 13), available in the online supplemental material. Examples of detected anomalies are shown in Fig. 6. Under the frame-level criterion, temporal MDT has the best performance in both scenes. Spatial MDT performs worse than others in Ped1 but ranks second in Ped2. However, for the more precise pixel-level criterion, spatial MDT is the top or second best performer. In this case, both MDTs significantly outperform all optical flow descriptors. The gap between corresponding competitors (e.g., temporal MDT versus MPPCA or LMH, spatial MDT versus force flow) is of at least 10 percent RD. These results show that there is a definite benefit to the joint representation of appearance and dynamics of the MDT.

This is not totally surprising, given the limitations of optical flow. First, the brightness constancy assumption is easily violated in crowded scenes, where stochastic motion and occlusions prevail. Second, optical flow measures instantaneous displacement, while the DT is a smooth motion representation with extended temporal support. Finally, while optical flow is a bandpass measure, which eliminates most of the appearance information, the DT models both appearance and dynamics. The last two properties are particularly important for crowded scenes, where objects occlude and interact in complicated manners.

TABLE 3
Filter Performance on the UCSD Anomaly Data Set

Overall, although optical flow can signal fast moving anomalous subjects, it leads to too many false positives in regions of complex motion, occlusion, and so on. More interesting is the lack of advantage for either spatial or temporal anomaly detection, both among MDT maps and prior techniques (no clear advantage to either force flow or MPPCA). In fact, as shown in Fig. 6, temporal and spatial anomalies tend to be different objects. This suggests the combination of the two strategies.

Fig. 6. Anomaly predictions of temporal MDT, spatial MDT, MPPCA, force flow, and LMH (from left to right). Red regions are abnormal pixels. All predictions generated with thresholds such that the different approaches have similar FPR under the frame-level protocol (these settings apply to all the subsequent figures unless otherwise stated).

6.5 Scale and Globally Consistent Prediction

We next investigated the benefits of information fusion across space and scale, with the proposed CRF. We started with a single-scale description (S-MDT), using only the anomaly maps at finest temporal and coarsest spatial scales, i.e., a 3D feature per site. We next considered a multiscale description, using the whole H-MDT. In both cases, inference was performed with logistic regression, i.e., the interaction term of (16) turned off, and the Gaussian filter of the previous section. In each trial, the logistic classifier was trained by Newton's method [42]. Finally, we considered the full blown CRF, denoted CRF filter. The dimensions of the spatial and temporal CRF neighborhoods were set to $|N^S| = 6$, $|N^T| = 3$. ROC curves were generated by varying the threshold for prediction.

Table 3 presents a comparison of the three approaches. The corresponding ROC curves are shown in Appendix C.1 (Fig. 14), available in the online supplemental material. Under the pixel-level criterion, the multiscale maps have higher accuracy than their single-scale counterparts, demonstrating the benefits of modeling anomalies in scale space (improvement of RD by as much as 11 percent). The CRF
Fig. 7. Examples of anomaly localization with Gaussian smoothing (in blue) and the CRF filter (in red). The latter predicts more accurately the spatiotemporal support of anomalies in crowded regions, where occlusion is prevalent.

TABLE 4
Performance of Various Methods (RD/Seconds per Frame) by Pixel-Level Criterion on UCSD Anomaly Data Set

Implementation: C/2.8-GHz CPU/2-GB RAM; C++ and Matlab (feature extraction and model inference)/2.6-GHz CPU/2-GB RAM; Matlab/dual-core 2.7-GHz CPU/8-GB RAM.

filter further improves performance (improvement of RD by as much as 3 percent), demonstrating the gains of globally consistent inference. As shown in Fig. 7, the visual improvements are even more substantial.³ Simple filtering does not take into account interactions between neighboring sites and smooths the anomaly maps uniformly. On the other hand, the CRF adapts the degree of smoothing to the spatiotemporal structure of the anomalies, increasing the precision of anomaly localization. Note how, in Fig. 7, the CRF filter successfully excludes occluded but normally behaving pedestrians from anomaly regions. These improvements are not always captured by the frame-level criterion. In fact, there is little EER difference between S-MDT and H-MDT. The inconsistency between frame- and pixel-level results in Tables 2 and 3 shows that the former is not a good measure of anomaly detection performance. Henceforth, only the pixel-level criterion is used in the remaining experiments on this data set.

6.6 Anomaly Detection Performance

We next evaluated the performance of the complete anomaly detector. For this, we selected two detectors from the recent literature, with state-of-the-art performance for temporal [8] and combined spatial and temporal anomaly detection [9]. The RD of the various methods is summarized in Table 4, for both partial and full annotation. The corresponding ROC curves are shown in Fig. 8. Table 4 also presents the processing time per video frame of each method. Missing entries indicate unavailable results for the particular data set and/or annotation type. A discussion of the detection errors made by the detector is given in Appendix C.3, available in the online supplemental material.

On Ped1, the temporal component of the proposed detector substantially outperforms the temporal detector of [8]. A multiple-scale temporal anomaly map with CRF filtering increases the 46 percent RD⁴ of [8] to 52 percent. A similar implementation of the spatial anomaly detector (a multiple-scale map plus CRF filtering) achieves 58 percent. Combining both maps and multiple spatial scales further improves the RD to 65 percent. Computationally, the proposed detector is also much more efficient. For implementations on similar hardware (see footnotes of Table 4), it requires 1.11 s/frame, as compared to the 3.8 s/frame reported for [8].

Like the proposed detector, the Bayesian video parsing (BVP) of [9] combines spatial and temporal anomaly detection, using a more complex video representation, parsing of the video to extract all the objects in the scene, a support vector machine classifier for detection of temporal anomalies, a graphical model with seven nodes per site (and multiple nonparametric models for location, scale, and velocity) for detection of spatial anomalies, and occlusion reasoning. This is an elegant solution, which achieves slightly better RD than the proposed detector (2 percent for full and 3 percent for partial annotation), but at substantially higher computational cost (5 to 10 times slower). We believe that when both accuracy and computation are considered, the proposed detector is a more effective solution. However, these results suggest that gains could be achieved by expanding the proposed CRF, as [9] trades a much simpler representation of video dynamics (optical flow versus MDT) for more sophisticated inference. It would be interesting to consider CRF extensions with some of the properties of the graphical model of [9], namely, explicit occlusion reasoning. This is left for subsequent research.

3. More results at http://www.svcl.ucsd.edu/projects/anomaly/results.html.
4. These numbers refer to partial annotation, the only annotation available for [8].

Fig. 8. ROC curves of pixel-level criterion on Ped1.
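To make the evaluation behind these RD numbers concrete, the pixel-level decision rule of Section 6.2 can be sketched as follows. This is a hedged illustration, not the authors' evaluation code: masks are small binary nested lists rather than full video frames, and the helper names are ours:

```python
def pixel_level_outcome(pred_mask, gt_mask):
    # Classify one frame under the pixel-level criterion: a positive frame
    # is a true positive only if at least 40 percent of its ground-truth
    # anomalous pixels are flagged; a negative frame is a false positive
    # if any of its pixels is flagged.
    gt_pixels = sum(v for row in gt_mask for v in row)
    hits = sum(p and g for prow, grow in zip(pred_mask, gt_mask)
               for p, g in zip(prow, grow))
    flagged = any(v for row in pred_mask for v in row)
    if gt_pixels > 0:                       # frame contains an anomaly
        return "TP" if hits >= 0.4 * gt_pixels else "miss"
    return "FP" if flagged else "TN"        # anomaly-free frame

def rates(frames):
    # frames: list of (prediction mask, ground-truth mask) pairs.
    # Returns (TPR, FPR) over the whole test set.
    outcomes = [pixel_level_outcome(p, g) for p, g in frames]
    pos = sum(o in ("TP", "miss") for o in outcomes)
    neg = len(outcomes) - pos
    tpr = outcomes.count("TP") / pos if pos else 0.0
    fpr = outcomes.count("FP") / neg if neg else 0.0
    return tpr, fpr

gt_pos = [[1, 1], [0, 0]]          # positive frame: two anomalous pixels
gt_neg = [[0, 0], [0, 0]]          # negative frame
tpr, fpr = rates([
    ([[1, 0], [0, 0]], gt_pos),    # 50% of anomalous pixels found -> TP
    ([[0, 0], [1, 0]], gt_pos),    # anomaly missed entirely -> miss
    ([[0, 1], [0, 0]], gt_neg),    # spurious detection -> FP
    ([[0, 0], [0, 0]], gt_neg),    # clean rejection -> TN
])
```

Sweeping the detection threshold and recomputing (TPR, FPR) at each setting traces an ROC curve like those of Fig. 8, from which RD is read off as 1 - EER.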
Fig. 9. Impact of context on anomaly maps. First three columns: Temporal anomalies; cell coverage at different H-MDT layers shown in blue. Last two columns: Spatial anomalies; example center (surround) windows shown in blue (light yellow).

6.7 Role of Context in Anomaly Judgments

We next investigated the impact of normalcy context on anomaly judgments. For temporal anomalies, context is determined by the subregion size: As the latter increases, temporal models become more global. Fig. 9 shows that the scale of normalcy context significantly impacts anomaly scores. For example, the two cyclists in the left-most columns of the figure are missed at small scales but detected by the more global models. On the other hand, a leftward-heading pedestrian in the third column has a high anomaly score at the finest scale but is not anomalous in larger contexts. In summary, no single context is effective for all scenes. Due to the stochastic arrangements of people within crowds, two crowds of the same size can require different context sizes. In general, the optimal size depends on the crowd configuration and the anomalous event.

A similar observation holds for spatial anomalies, where context is set by the size of the surround window. For example, in the fourth column of Fig. 9, the subject walking on the grass is very salient when compared to her immediate neighbors, and anomaly detection benefits from a narrower context. For larger contexts, she becomes less unique than a man who walks in the direction opposite to his neighbors. On the other hand, the cart and bike of the last column only pop out when the surround window is large enough to cover some pedestrians. In summary, anomalies depend strongly on scene context, and this dependence can vary substantially from scene to scene. It is, thus, important to fuse anomaly information across spatial scales.

6.8 Performance on Other Benchmark Data Sets

The detection of anomalous events in crowded scenes can be evaluated on a few data sets other than UCSD. These have various limitations in terms of size, saliency of the anomalies, evaluation criteria, and so on. They are discussed in this section where, for completeness, we also present the results of the proposed anomaly detector.

UMN. The UMN data set⁵ contains three escape scenes. Normal events depict individuals wandering around or organized in groups. Abnormal events depict a crowd escaping in panic. Each scene contains several normal-abnormal events (e.g., seconds of normalcy followed by a short abnormal event). The main limitations of this data set are that

1. it is relatively small (scenes 1, 2, and 3 contain two, six, and three anomaly instances),
2. it has no pixel-level ground truth,
3. the anomalies are staged, and
4. it produces very salient changes in the average motion intensity of the scene.

As a result, several methods achieve near-perfect detection. The proposed detector was based on 3 x 3 subregions of size 180 x 180 at the finest spatial scale and a 3-scale anomaly map for both the temporal and spatial components. One normal-abnormal instance of each scene was used to train the temporal normalcy model and CRF filter, and the remaining instances for testing. A comparison to previous results in the literature, under the frame-level criterion, is presented in Table 5 and Fig. 11. Due to the salient motion discontinuities, the temporal component (99.2 percent AUC) substantially outperforms the spatial component (97.9 percent). Nevertheless, the complete detector achieves the best performance (99.5 percent). This is nearly perfect, and comparable to the previous best results in the literature.

TABLE 5
Anomaly Detection Performance in AUC/EER (Percent)

Subway. The Subway data set [19] consists of two sequences recorded from the entrance (1 h and 36 min, 144,249 frames) and exit (43 min, 64,900 frames) of a

5. http://mha.cs.umn.edu/Movies/Crowd-Activity-All.avi.
For temporal anomaly detection, MDTs were learned


using 20  20  30 patches from 3  4 subregions covering
the intersection. This was the finest level of a 3-scale
hierarchical model. For spatial anomaly detection, segmenta-
tion was computed with a 5-component MDT learned from
15  15  30 patches extracted from 45 consecutive frames.
An observation lattice of step 15  15  10 was used to
evaluate anomaly scores, and the neighborhood size of the
CRF filter was 2. The performance of the detector is
summarized in Table 5 and Fig. 11. Due to the sparsity of
Fig. 10. Anomalies detected by H-MDT CRF on the UMN (left), Subway the scenes (not enough spatial context around cars making
(center), and U-turn (right) data sets. illegal turns to establish them as anomalous) the performance
of the spatial anomaly detector is quite weak. However, the
subway station. Normal behaviors include people entering combination of the spatial and temporal anomaly maps again
and exiting the station; abnormal events consist of people moving in the wrong direction (exiting through the entrance or entering through the exit) or avoiding payment. The main limitations of this data set are: 1) reduced number of anomalies, and 2) predictable spatial localization (entrance and exit regions). The original 512 × 384 frames were downsampled to 320 × 240, and 2 × 3 subregions of size 90 × 90, covering either the entrance or exit regions, were used at the finest spatial scale. A 3-scale anomaly map was computed for both spatial and temporal anomalies. Video patches were of size 15 × 15 × 15, and 10 min of video from each sequence were used to train the temporal normalcy model and CRF filters, while the remaining video was used for testing. Table 5 and Fig. 11 present a comparison of the proposed detector against recently published results on this data set. Again, the temporal component outperforms its spatial counterpart, but the best performance is obtained by the combination of both temporal and spatial anomaly maps (H-MDT CRF). This achieves the best result among all methods, outperforming the sparse reconstruction of [8] and the local statistical aggregates of [23]. Note that, for this data set, the gains in both AUC and EER are substantial.

U-turn. The U-turn data set [5] consists of one video sequence (roughly 6,000 frames of size 360 × 240) recorded by a static camera overlooking the traffic at a road intersection. The video is split into two clips of equal length for cross validation, and anomalies consist of illegal vehicle motion at the intersection. The main limitations of this data set are: 1) the limited size, 2) absence of pixel-level ground truth, and 3) sparseness of the scenes. The latter enables the use of object-based operations, for example, tracking and analysis of object trajectories [5], which we do not exploit. On this data set, the spatial channel outperforms the temporal channel, achieving the best performance. Overall, the proposed detector has the best AUC on this data set. Examples of detected anomalies, for this and the other two data sets, are shown in Fig. 10.

7 CONCLUSION

In this work, we proposed an anomaly detector that spans time, space, and spatial scale, using a joint representation of video appearance and dynamics and globally consistent inference. For this, we modeled crowded scenes with a hierarchy of MDT models, equated temporal anomalies to background subtraction and spatial anomalies to discriminant saliency, and integrated anomaly scores across time, space, and scale with a CRF. It was shown that the MDT representation substantially outperforms classical optical flow descriptors, that spatial and temporal anomaly detection are complementary processes, that there is a benefit to defining anomalies with respect to various normalcy contexts, i.e., in anomaly scale space, and that it is important to guarantee globally consistent inference across space, time, and scale. We have also introduced a challenging anomaly detection data set, composed of complex scenes of pedestrian crowds involving stochastic motion, complex occlusions, and object interactions. This data set provides both frame-level and pixel-level ground truth, and a protocol for the evaluation of anomaly detection algorithms. The proposed anomaly detector was shown to be effective on both this and a number of previous data sets. When compared to previous methods, it outperformed the various state-of-the-art approaches, either in absolute performance or in terms of the tradeoff between anomaly detection accuracy and complexity.
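To make the patch-based setup used in these experiments concrete (e.g., 15 × 15 × 15 video patches, with anomaly maps computed over three spatial scales), the sketch below tiles a video volume into spatio-temporal patches at several scales. The function names, strides, and averaging-based down-sampling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def extract_patches(video, patch=(15, 15, 15), stride=(15, 15, 15)):
    """Tile a (T, H, W) video volume into spatio-temporal patches.

    Returns the stacked patches and the (t, y, x) origin of each one,
    so per-patch anomaly scores can later be painted back into a map.
    """
    T, H, W = video.shape
    pt, ph, pw = patch
    st, sy, sx = stride
    patches, origins = [], []
    for t in range(0, T - pt + 1, st):
        for y in range(0, H - ph + 1, sy):
            for x in range(0, W - pw + 1, sx):
                patches.append(video[t:t + pt, y:y + ph, x:x + pw])
                origins.append((t, y, x))
    return np.stack(patches), origins

def multiscale_grids(video, n_scales=3):
    """Patch grids at several spatial scales, halving the resolution
    each time by averaging 2 x 2 pixel blocks (a stand-in for whatever
    pyramid the real system uses)."""
    grids, v = [], video.astype(np.float32)
    for _ in range(n_scales):
        grids.append(extract_patches(v))
        v = 0.25 * (v[:, 0::2, 0::2] + v[:, 1::2, 0::2]
                    + v[:, 0::2, 1::2] + v[:, 1::2, 1::2])
    return grids

# example: a random clip at the 320 x 240 working resolution
grids = multiscale_grids(np.random.rand(30, 240, 320), n_scales=3)
```

Keeping the patch origins around is what allows the per-patch scores at each scale to be composed into the spatially registered anomaly maps that the CRF then integrates.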

Fig. 11. ROC curves of the frame-level criterion on the UMN (left), Subway (center), and U-turn (right) data sets.
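The frame-level criterion behind these curves declares a frame anomalous when its anomaly score exceeds a threshold; sweeping the threshold traces the ROC, from which AUC and EER are read. A minimal sketch under that assumption (not the authors' evaluation code):

```python
import numpy as np

def frame_level_roc(scores, labels):
    """Frame-level ROC: a frame is flagged anomalous when its score
    exceeds a threshold; sweeping the threshold over all observed
    scores yields the (FPR, TPR) operating points."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores)             # descending score
    tp = np.cumsum(labels[order] == 1)      # true positives so far
    fp = np.cumsum(labels[order] == 0)      # false positives so far
    tpr = tp / max(int(labels.sum()), 1)
    fpr = fp / max(int((labels == 0).sum()), 1)
    return np.r_[0.0, fpr], np.r_[0.0, tpr]  # prepend the (0, 0) point

def auc_and_eer(fpr, tpr):
    """Area under the ROC (trapezoidal rule) and equal error rate,
    the operating point where the miss rate (1 - TPR) equals the FPR."""
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1])) / 2)
    idx = int(np.argmin(np.abs((1.0 - tpr) - fpr)))
    return auc, float(fpr[idx])
```

With perfectly separated scores the sketch returns AUC = 1.0 and EER = 0.0; lower AUC or higher EER reflects overlap between the normal and anomalous score distributions.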

REFERENCES

[1] S. Wu, B. Moore, and M. Shah, "Chaotic Invariants of Lagrangian Particle Trajectories for Anomaly Detection in Crowded Scenes," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[2] L. Kratz and K. Nishino, "Anomaly Detection in Extremely Crowded Scenes Using Spatio-Temporal Motion Pattern Models," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[3] J. Kim and K. Grauman, "Observe Locally, Infer Globally: A Space-Time MRF for Detecting Abnormal Activities with Incremental Updates," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[4] R. Mehran, A. Oyama, and M. Shah, "Abnormal Crowd Behavior Detection Using Social Force Model," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[5] Y. Benezeth, P. Jodoin, V. Saligrama, and C. Rosenberger, "Abnormal Events Detection Based on Spatio-Temporal Co-Occurences," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[6] A. Basharat, A. Gritai, and M. Shah, "Learning Object Motion Patterns for Anomaly Detection and Improved Object Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[7] T. Xiang and S. Gong, "Video Behavior Profiling for Anomaly Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 5, pp. 893-908, May 2008.
[8] Y. Cong, J. Yuan, and J. Liu, "Sparse Reconstruction Cost for Abnormal Event Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011.
[9] B. Antic and B. Ommer, "Video Parsing for Abnormality Detection," Proc. IEEE Int'l Conf. Computer Vision, 2011.
[10] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, article 15, 2009.
[11] G. Doretto, A. Chiuso, Y. Wu, and S. Soatto, "Dynamic Textures," Int'l J. Computer Vision, vol. 51, no. 2, pp. 91-109, 2003.
[12] A. Chan and N. Vasconcelos, "Modeling, Clustering, and Segmenting Video with Mixtures of Dynamic Textures," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 5, pp. 909-926, May 2008.
[13] D. Gao and N. Vasconcelos, "Decision-Theoretic Saliency: Computational Principles, Biological Plausibility, and Implications for Neurophysiology and Psychophysics," Neural Computation, vol. 21, no. 1, pp. 239-271, 2009.
[14] C. Stauffer and W. Grimson, "Learning Patterns of Activity Using Real-Time Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747-757, Aug. 2000.
[15] T. Zhang, H. Lu, and S. Li, "Learning Semantic Scene Models by Object Classification and Trajectory Clustering," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[16] N. Siebel and S. Maybank, "Fusion of Multiple Tracking Algorithms for Robust People Tracking," Proc. European Conf. Computer Vision, 2006.
[17] X. Cui, Q. Liu, M. Gao, and D.N. Metaxas, "Abnormal Detection Using Interaction Energy Potentials," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011.
[18] F. Jiang, J. Yuan, S.A. Tsaftaris, and A.K. Katsaggelos, "Anomalous Video Event Detection Using Spatiotemporal Context," Computer Vision and Image Understanding, vol. 115, no. 3, pp. 323-333, 2011.
[19] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz, "Robust Real-Time Unusual Event Detection Using Multiple Fixed-Location Monitors," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 3, pp. 555-560, Mar. 2008.
[20] B. Zhao, L. Fei-Fei, and E. Xing, "Online Detection of Unusual Events in Videos via Dynamic Sparse Coding," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011.
[21] D. Helbing and P. Molnár, "Social Force Model for Pedestrian Dynamics," Physical Rev. E, vol. 51, no. 5, pp. 4282-4286, 1995.
[22] O. Boiman and M. Irani, "Detecting Irregularities in Images and in Video," Int'l J. Computer Vision, vol. 74, no. 1, pp. 17-31, 2007.
[23] V. Saligrama and Z. Chen, "Video Anomaly Detection Based on Local Statistical Aggregates," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012.
[24] R. Hamid, A. Johnson, S. Batta, A. Bobick, C. Isbell, and G. Coleman, "Detection and Explanation of Anomalous Activities: Representing Activities as Bags of Event N-Grams," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[25] D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan, "Semi-Supervised Adapted HMMs for Unusual Event Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[26] C. Stauffer and W. Grimson, "Adaptive Background Mixture Models for Real-Time Tracking," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1999.
[27] L. Itti, C. Koch, and E. Niebur, "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, Nov. 1998.
[28] R. Shumway and D. Stoffer, "An Approach to Time Series Smoothing and Forecasting Using the EM Algorithm," J. Time Series Analysis, vol. 3, no. 4, pp. 253-264, 1982.
[29] S. Roweis and Z. Ghahramani, "A Unifying Review of Linear Gaussian Models," Neural Computation, vol. 11, no. 2, pp. 305-345, 1999.
[30] S. Kullback, Information Theory and Statistics. Dover Publications, 1968.
[31] D. Gao, V. Mahadevan, and N. Vasconcelos, "On the Plausibility of the Discriminant Center-Surround Hypothesis for Visual Saliency," J. Vision, vol. 8, no. 7, pp. 1-18, 2008.
[32] V. Mahadevan and N. Vasconcelos, "Background Subtraction in Highly Dynamic Scenes," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[33] A. Chan and N. Vasconcelos, "Probabilistic Kernels for the Classification of Auto-Regressive Visual Processes," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[34] J.R. Hershey and P.A. Olsen, "Approximating the Kullback-Leibler Divergence between Gaussian Mixture Models," Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, 2007.
[35] A.B. Chan and N. Vasconcelos, "Efficient Computation of the KL Divergence between Dynamic Textures," Technical Report SVCL-TR-2004-02, Dept. of Electrical and Computer Eng., Univ. of California San Diego, 2004.
[36] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.
[37] A. Chan, E. Coviello, and G. Lanckriet, "Clustering Dynamic Textures with the Hierarchical EM Algorithm," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[38] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," Proc. 18th Int'l Conf. Machine Learning, 2001.
[39] S. Kumar and M. Hebert, "Discriminative Fields for Modeling Spatial Dependencies in Natural Images," Proc. Advances in Neural Information Processing Systems, 2004.
[40] X. He, R. Zemel, and M. Carreira-Perpinán, "Multiscale Conditional Random Fields for Image Labeling," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2004.
[41] G.E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence," Neural Computation, vol. 14, pp. 1771-1800, 2002.
[42] T. Minka, "A Comparison of Numerical Optimizers for Logistic Regression," technical report, Microsoft Research, 2003.
[43] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, "Anomaly Detection in Crowded Scenes," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[44] K.-K. Sung and T. Poggio, "Example-Based Learning for View-Based Human Face Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 39-51, Jan. 1998.

Weixin Li received the bachelor's degree from Tsinghua University, Beijing, China, in 2008, the MSc degree in electrical engineering from the University of California, San Diego, in 2011, and is currently working toward the PhD degree. His research interests primarily include computational vision and machine learning, with a specific focus on the visual analysis of human behavior, activities, and events, and on models with latent variables and their applications. He is a student member of the IEEE.

Vijay Mahadevan received the BTech degree from the Indian Institute of Technology, Madras, in 2002, the MS degree from Rensselaer Polytechnic Institute, Troy, New York, in 2003, and the PhD degree from the University of California, San Diego, in 2011, all in electrical engineering. From 2004 to 2006, he was with the Multimedia group at Qualcomm Inc., San Diego, California. He is currently with Yahoo! Labs, Bengaluru. His interests include computer vision and machine learning and their applications. He is a member of the IEEE.

Nuno Vasconcelos received the licenciatura degree in electrical engineering and computer science from the Universidade do Porto, Portugal, and the MS and PhD degrees from the Massachusetts Institute of Technology. He is a professor in the Electrical and Computer Engineering Department, University of California, San Diego, where he heads the Statistical Visual Computing Laboratory. He has received a US National Science Foundation (NSF) CAREER award and a Hellman Fellowship, and has authored more than 150 peer-reviewed publications. He is a senior member of the IEEE.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
