Anomaly Detection and Localization
1, JANUARY 2014
Abstract—The detection and localization of anomalous behaviors in crowded scenes is considered, and a joint detector of temporal
and spatial anomalies is proposed. The proposed detector is based on a video representation that accounts for both appearance and
dynamics, using a set of mixture of dynamic textures models. These models are used to implement 1) a center-surround discriminant
saliency detector that produces spatial saliency scores, and 2) a model of normal behavior that is learned from training data and
produces temporal saliency scores. Spatial and temporal anomaly maps are then defined at multiple spatial scales, by considering the
scores of these operators at progressively larger regions of support. The multiscale scores act as potentials of a conditional random
field that guarantees global consistency of the anomaly judgments. A data set of densely crowded pedestrian walkways is introduced
and used to evaluate the proposed anomaly detector. Experiments on this and other data sets show that the latter achieves state-of-
the-art anomaly detection results.
Index Terms—Video analysis, surveillance, anomaly detection, crowded scene, dynamic texture, center-surround saliency
1 INTRODUCTION
with respect to either appearance or dynamics leads to a flexible model of normalcy, applicable to the detection of anomalies of relevance to various surveillance tasks.

To address the scale problem, MDTs are learned at multiple spatial scales. This is done with an efficient hierarchical model, where layers of MDTs with successively larger regions of video support are learned recursively. The local measures of spatial and temporal abnormality are then integrated into a globally coherent anomaly map, by probabilistic inference. This is implemented with a conditional random field (CRF), whose single-node potentials are classifiers of local measures of spatial and temporal abnormality, collected over a range of spatial scales. They are complemented by a novel set of interaction potentials, which account for spatial and temporal context, and integrate anomaly information across the visual field.

Finally, to address the difficulties of empirical evaluation of anomaly detectors on crowded scenes, we introduce a data set of video from walkways on the campus of the University of California, San Diego (UCSD), depicting crowds of varying densities. The data set contains 98 video sequences, and five well-defined abnormal categories. These are not “synthetic,” or “staged,” but abnormal events that occur naturally, for example, bicycle riders that cross pedestrian walkways. Ground truth is provided for abnormal events, as well as a protocol to evaluate detection performance.

The remainder of the paper is organized as follows: Section 2 reviews previous work on anomaly detection in computer vision. The problems of temporal and spatial anomaly detection in crowded scenes are discussed in Section 3. This is followed by the mathematical characterization of multiscale anomaly maps in Section 4, and the proposed CRF for integration of spatial and temporal anomalies across different spatial scales in Section 5. Finally, an extensive experimental evaluation is discussed in Section 6 and some conclusions are presented in Section 7.

2 PRIOR WORK

Recent advances in anomaly detection address event representation and globally consistent statistical inference. Contributions of the first type define features and models for the discrimination of normal and anomalous patterns. Models of normal and abnormal behavior are then learned from training data, and anomalies detected with a minimum probability of error decision rule. Although there are some exceptions [5], the distribution of abnormal patterns is usually assumed uniform, and abnormal events formulated as events of low probability under the model of normalcy.

One intuitive representation for event modeling is based on object trajectories. It consists of either explicitly or implicitly segmenting and tracking each object in the scene, and fitting models to the resulting object tracks [14], [15], [16], [6], [17], [18]. While capable of identifying abnormal behaviors of high-level semantics (e.g., unusual long-term trajectories), these procedures are both difficult and computationally expensive for crowded or cluttered scenes. A number of promising alternatives, which avoid processing individual objects, have been recently proposed. These include the modeling of motion patterns with histograms of pixel change [5], histograms of optical flow [19], [8], [20], or optical flow measures [3], [4], [17], [1]. Among these, [3] models local optical flow with a mixture of probabilistic principal component analysis (PCA) models, [4] and [17] draw inspiration from classical studies of crowd behavior [21] to characterize flow with interaction features (e.g., social force model), and [1] learns the representative flow of groups by clustering optical flow-based particle trajectories. These approaches emphasize dynamics, ignoring anomalies of object appearance and, thus, anomalous behavior without outlying motion. Optical flow, pixel change histograms, or other classical background subtraction features are also difficult to extract from crowded scenes, where the background is by definition dynamic and there is substantial clutter and occlusion. More complete representations account for both appearance and motion. For example, [2] models temporal sequences of spatiotemporal gradients to detect anomalies in densely crowded scenes, [22] declares as abnormal spatiotemporal patches that cannot be reconstructed from previous frames, and [23] pools appearance and motion features over spatial neighborhoods, using a distance to the nearest spatially colocated feature vector among all training video clips, to quantify abnormality. Object-based representations, based on location, blob shape, and motion [7] or optical flow magnitude, gradients, location, and scale [9], have also been proposed. Other representations include a bag-of-words over a set of manually annotated event classes [24]. Various methods have also been used to produce anomaly scores. While simple spatial filtering suffices for some applications [19], crowded scenes require more sophisticated graphical models and inference. For example, [6] and [1] adopt Gaussian mixture models (GMM) to represent trajectories of normal behavior. Cong et al. [8] and Zhao et al. [20] learn a sparse basis and define unusual events as those that can only be reconstructed with either large error or the combination of a large number of basis vectors.

Contributions of the second type address the integration of local anomaly scores, which can be noisy, into a globally consistent anomaly map. The authors of [2], [25], and [7] guarantee temporally consistent inference by modeling normal temporal sequences with hidden Markov models (HMMs). While this enforces consistency along the temporal dimension, there have also been efforts to produce spatially consistent anomaly maps. For example, latent Dirichlet allocation (LDA) has been applied to force flow features, in the model of spatial crowd interactions of [4]. On the other hand, [5] and [3] rely on Markov random fields (MRF) to enforce global spatial consistency. In the realm of sparse representations, [20] guarantees consistency of reconstruction coefficients over space and time by inclusion of smoothness terms in the underlying optimization problem. Finally, [9] models object relationships, using Bayesian networks to implement occlusion reasoning.

It should be noted that most of these methods have not been tested on the densely crowded scenes considered in this work. It is unclear whether many of them could deal with the complex motion and object interactions prevalent in such scenes. Furthermore, while most methods include some mechanism to encourage spatial and temporal consistency of anomaly judgments (MRF, LDA, etc.), the underlying decision rule tends to be either predominantly temporal (e.g., trajectories, GMMs, HMMs, or sparse representations learned over time) or spatial
20 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO. 1, JANUARY 2014
(e.g., interaction models) but is rarely discriminant with respect to both space and time. This makes it difficult to infer whether spatial or temporal modeling is critically important by itself, or what benefits are gained from their joint modeling. Furthermore, the role of scale is rarely considered. These issues motivate the contributions of the following sections.

3 ANOMALY DETECTION

We start by proposing an anomaly detector that accounts for scene appearance and dynamics, spatial and temporal context, and multiple spatial scales.

3.1 Mathematical Formulation

A classical formulation of anomaly detection, which we adopt in this work, equates anomalies to outliers. A statistical model $p_X(\mathbf{x})$ is postulated for the distribution of a measurement $X$ under normal conditions. Abnormalities are defined as measurements whose probability is below a threshold under this model. This is equivalent to a statistical test of hypotheses:

• $H_0$: $\mathbf{x}$ is drawn from $p_X(\mathbf{x})$;
• $H_1$: $\mathbf{x}$ is drawn from an uninformative distribution $p_X(\mathbf{x}) \propto 1$.

The minimum probability of error rule for this test is to reject the null hypothesis $H_0$ if $p_X(\mathbf{x})$ falls below a threshold equal to the normalization constant of the uninformative distribution. As usual in the literature, we consider the problem of anomaly detection from localized video measurements $\mathbf{x}$, where $\mathbf{x}$ is a spatiotemporal patch of small dimensions.

3.2 Spatial versus Temporal Anomalies

The normalcy model $p_X(\mathbf{x})$ can have both a temporal and a spatial component. Temporal normalcy reflects the intuition that normal events are recurrent over time, i.e., previous observations establish a contextual reference for normalcy judgments. Consider a highway lane where cars move with a certain orientation and speed. Bicycles or cars heading in the opposite direction are easily identified as abnormal because they give rise to observations $\mathbf{x}$ substantially different from those collected in the past. In this sense, temporal normalcy detection is similar to background subtraction [26]. A model of normal behavior is learned over time, and measurements that it cannot explain are denoted temporal anomalies.

Spatial normalcy reflects the intuition that some events that would not be abnormal per se are abnormal within a crowd. Since the crowd places physical or psychological constraints on individual behavior, behaviors feasible in isolation can have low probability in a crowd context. For example, while there is nothing abnormal about an ambulance that travels at 50 mph on a stretch of highway, the same observation within a highly congested highway is abnormal. Note that the only indication of abnormality is the difference between the crowd and the object at the time of the observation, not that the ambulance moves at 50 mph. Since the detection of such abnormalities is mostly based on spatial context, they are denoted spatial anomalies. Their detection does not depend on memory. Instead, it is based on a continuously evolving, instantaneously adaptive, definition of normalcy. In this sense, the detection of spatial anomalies can be equated to saliency detection [27].

3.3 Roles of Crowds and Scale

Most available background subtraction and saliency detection solutions are not applicable to crowded scenes, where backgrounds can be highly dynamic. In this case, it is not sufficient to detect variations of image intensity, or even optical flow, to detect anomalous events. Instead, normalcy models must rely on sophisticated joint representations of appearance and dynamics. In fact, even such models can be ineffective. Since crowds frequently contain distinct subentities, for example, vehicles or groups of people moving in different directions, anomaly detection requires modeling multiple video components of different appearance and dynamics. A model that has been shown successful in this context is the mixture of DTs [12]. This is the representation adopted in this work.

Another challenging aspect of anomaly detection within crowds is scale. Spatial anomalies are usually detected at the scale of the smallest scene entities, typically people. However, a normal event at this scale may be anomalous at a larger scale, and vice versa. For example, while a child that rides a bicycle appears normal within a group of bicycle riding children, the group is itself anomalous in a crowded pedestrian sidewalk. Local anomaly detectors, with small regions of interest, cannot detect such anomalies. To address this, we represent crowded scenes with a hierarchy of MDTs that cover successively larger regions. This is done with a computationally efficient hierarchical model, where MDT layers are estimated recursively.

A similar challenge holds for temporal anomalies. While their detection is usually based on a small number of video frames, certain anomalies can only be detected over long time spans. For example, while it is normal for two pedestrian trajectories to converge or diverge at any point in time, a cyclical convergence and divergence is probably abnormal. Anomaly detection across time scales is, however, more complex than across spatial scales, due to constraints of instantaneous detection and implementation complexity. Since video has to be buffered before anomalies can be detected, large temporal windows imply long detection delays and storage of many video frames. Due to this, we do not consider multiple temporal scales in this work. A single scale is chosen, using acceptable values of delay and storage complexity, and used throughout our experiments. Note that, like their spatial counterparts, temporal anomaly maps are computed at multiple spatial scales. Hence, in what follows, the term “scale” refers to the spatial support of anomaly detection, for both spatial and temporal anomalies.

4 NORMALCY AND ANOMALY MODELING

In this section, we review the MDT model, discuss the design of temporal and spatial models of normalcy, and formulate the computation of anomaly maps.

4.1 Mixture of Dynamic Textures

The MDT models a sequence of video frames $\mathbf{x}_{1:\tau} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_\tau]$ as a sample from one of $K$ dynamic textures [11]:
LI ET AL.: ANOMALY DETECTION AND LOCALIZATION IN CROWDED SCENES 21
Fig. 3. Computation of temporal anomaly maps with multiscale spatial supports using the H-MDT. MDTs of increasingly larger spatial support are
estimated recursively, with the H-EM algorithm. Their application to a query video produces temporal anomaly maps based on supports of various
spatial scales.
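As a rough illustration of the outlier test of Section 3.1 that underlies each such map, the sketch below scores localized measurements by their log-likelihood under a normalcy model and rejects $H_0$ when the score falls below a threshold. It is only a sketch under stated assumptions: a diagonal Gaussian stands in for the MDT, the feature vectors are synthetic, and the function names and the percentile-based threshold are hypothetical choices, not part of the proposed detector.

```python
import numpy as np

def fit_normalcy_model(train_features):
    """Fit a diagonal Gaussian to features of normal training patches.

    Assumption: this simple density is a stand-in for the MDT normalcy model.
    """
    mean = train_features.mean(axis=0)
    var = train_features.var(axis=0) + 1e-6  # guard against zero variance
    return mean, var

def log_likelihood(features, model):
    """Log-density of each feature vector (last axis) under the model."""
    mean, var = model
    diff = features - mean
    return -0.5 * np.sum(diff ** 2 / var + np.log(2.0 * np.pi * var), axis=-1)

def anomaly_map(features, model, log_threshold):
    """Reject H0 (flag an anomaly) where the log-likelihood is below the threshold."""
    return log_likelihood(features, model) < log_threshold

rng = np.random.default_rng(0)
normal_train = rng.normal(0.0, 1.0, size=(5000, 4))  # synthetic "normal" features
model = fit_normalcy_model(normal_train)

# Hypothetical threshold: 1st percentile of the training log-likelihoods.
log_threshold = np.percentile(log_likelihood(normal_train, model), 1.0)

query = rng.normal(0.0, 1.0, size=(6, 8, 4))  # 6 x 8 grid of observation sites
query[2, 3] += 8.0                            # one strongly outlying site
flags = anomaly_map(query, model, log_threshold)
```

Repeating the same scoring with models fitted on progressively larger regions of support would give one map per spatial scale, which is the role the H-MDT plays in the figure.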
In this section, we introduce a layer of statistical inference to fuse anomaly information across time, space, and scale in a globally consistent manner.

5.1 Discriminative Model

The anomaly maps of the previous section span space, time, and spatial scale. Being derived from local measurements, they can be noisy. A principled framework is required to 1) integrate anomaly scores from the individual maps, 2) eliminate noise, and 3) guarantee spatiotemporal consistency of anomaly judgments throughout the visual field. For this, we rely on a conditional random field [38] inspired by the discriminative random field (DRF) of [39]. An anomaly label $y_i \in \{-1, 1\}$ is defined at each location $i$ in a set $S$ of observation sites. Given a video clip $\mathbf{x}$, the

where $|i - j|$ is the Euclidean distance between sites $i, j$, and $\exp(-\mathbf{h}_{i,j})$ the entry-wise exponential of $-\mathbf{h}_{i,j}$. The vector $\mathbf{h}_{i,j}$ contains the diagonal entries of $(\mathbf{f}_i - \mathbf{f}_j)(\mathbf{f}_i - \mathbf{f}_j)^T$.

The single-site potential of (12) reflects the anomaly belief at site $i$. Using it alone, i.e., without (13), (11) is a logistic regression model. In this case, the detection of each anomaly is based on information from site $i$ exclusively. The addition of the interaction potential of (13) enables the model to take into account information from site $i$’s neighborhood $N_i$. This smooths the single-site prediction, encouraging consistency of neighboring labels. The interaction potential can be interpreted as a classifier that predicts whether two neighboring sites have the same label. Note that because $\mathbf{f}$ contains anomaly scores at different spatial scales, $\mathbf{h}_{i,j}$ accounts for the
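similarity between the two observations in anomaly spaces of different scale (i.e., under different spatial normalcy contexts). The interaction potentials adaptively modulate the intensity of intersite smoothing according to these similarity measures (and how they are weighted by $\mathbf{v}$). The parameters $\mathbf{w}$ and $\mathbf{v}$ encode the relative importance of different features.

Fig. 4. CRF filter. Top: Graphical model. Bottom: Spatial and temporal neighborhoods.

5.2 Online CRF Filter

The model of (11) requires inference over the entire video sequence. This is not suitable for online applications. An online version can be implemented by conditioning the anomaly label $\mathbf{y}^{(\tau)}$ at time $\tau$ on 1) observations for $t \le \tau$, and 2) anomaly labels for $t < \tau$, leading to

$$P\big(\mathbf{y}^{(\tau)} \mid \{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}, \{\mathbf{x}^{(t)}\}_{t=1}^{\tau}\big) = \frac{1}{Z} \exp\Big\{ \sum_{i \in S^{\tau}} \Big[ A\big(y_i^{(\tau)}, \mathbf{x}^{(\tau)}\big) + \frac{1}{|N_i^S|} \sum_{j \in N_i^S} I_S\big(y_i^{(\tau)}, y_j^{(\tau)}, \mathbf{x}^{(\tau)}, i, j\big) + \frac{1}{|N_i^T|} \sum_{k \in N_i^T} I_T\big(y_i^{(\tau)}, y_k, \mathbf{x}, i, k\big) \Big] \Big\}, \quad (16)$$

where $S^{\tau}$ is the set of observations at time $\tau$ (pixels of the current frame). Two neighborhoods are defined per location $i$: spatial $N_i^S$ ($N_i^S \subset S^{\tau}$) and temporal $N_i^T$ ($N_i^T \subset \{S^{t}\}_{t=1}^{\tau-1}$). The graphical model is shown at the top of Fig. 4, and these neighborhoods at the bottom.

5.2.1 Learning

Both (11) and (16) can be learned with standard optimization techniques, such as gradient descent or the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method.

$$\frac{\partial}{\partial \mathbf{v}} \log p\big(\{\mathbf{y}^{(n)}\}_{n=1}^{N} \mid \{\mathbf{x}^{(n)}\}_{n=1}^{N}\big) = \sum_{n=1}^{N} \Big\{ \sum_{i \in S} \frac{1}{|N_i|} \sum_{j \in N_i} e^{-\gamma |i-j|}\, y_i^{(n)} y_j^{(n)} \exp\big(-\mathbf{h}_{i,j}^{(n)}\big) - \mathbb{E}\Big[ \sum_{i \in S} \frac{1}{|N_i|} \sum_{j \in N_i} e^{-\gamma |i-j|}\, y_i y_j \exp\big(-\mathbf{h}_{i,j}^{(n)}\big) \Big] \Big\} - \frac{1}{\sigma_v^2} \mathbf{v}, \quad (18)$$

and

$$\frac{\partial}{\partial \gamma} \log p\big(\{\mathbf{y}^{(n)}\}_{n=1}^{N} \mid \{\mathbf{x}^{(n)}\}_{n=1}^{N}\big) = \sum_{n=1}^{N} \Big\{ -\sum_{i \in S} \frac{1}{|N_i|} \sum_{j \in N_i} I\big(y_i^{(n)}, y_j^{(n)}, \mathbf{x}^{(n)}, i, j\big)\, |i-j| + \mathbb{E}\Big[ \sum_{i \in S} \frac{1}{|N_i|} \sum_{j \in N_i} I\big(y_i, y_j, \mathbf{x}^{(n)}, i, j\big)\, |i-j| \Big] \Big\} - \frac{1}{\sigma_\gamma^2} \gamma, \quad (19)$$

where the expectation is evaluated with distribution $p(Y \mid X; \theta)$. The conditional expectations of (17)-(19) require evaluation of the partition function $Z$, a problem known to be NP-hard. As is common in the literature, this difficulty is avoided by estimating expectations through sampling. Although sampling methods such as Markov chain Monte Carlo (MCMC) can converge to the true distribution, this usually requires many iterations. Since the procedure must be repeated per gradient ascent step, these methods are impractical. On the other hand, approximations such as contrastive divergence minimization (which runs MCMC a limited number of times with specific starting points) have been shown to be successful for vision applications [40], [41]. We adopt these approximations for CRF learning.

This leverages the fact that, denoting any of the parameters $\mathbf{w}, \mathbf{v}_T, \mathbf{v}_S, \gamma_T, \gamma_S$ by $\theta$, the partial gradients of (17)-(19) are

$$\frac{\partial}{\partial \theta} \log p\big(\{\mathbf{y}^{(n)}\}_{n=1}^{N} \mid \{\mathbf{x}^{(n)}\}_{n=1}^{N}; \theta\big) = \sum_{n=1}^{N} \Big\{ F_{\partial\theta}\big(\mathbf{y}^{(n)}, \mathbf{x}^{(n)}\big) - \mathbb{E}\big[ F_{\partial\theta}\big(\mathbf{y}, \mathbf{x}^{(n)}\big) \big] \Big\} - \frac{1}{\sigma_\theta^2} \theta, \quad (20)$$

where $F_{\partial\theta}$ denotes the partial derivative, with respect to $\theta$, of the sum of potentials. Contrastive divergence replaces the expectation of $F_{\partial\theta}(\mathbf{y}, \mathbf{x}^{(n)})$ by $F_{\partial\theta}(\hat{\mathbf{y}}, \mathbf{x}^{(n)})$, where $\hat{\mathbf{y}}$ is the “evil twin” of the ground-truth label field $\mathbf{y}^{(n)}$ [41]. $\hat{\mathbf{y}}$ is drawn by MCMC, using the inference procedure discussed in Section 5.2.2, the current parameter estimates, and the ground-truth labels $\mathbf{y}^{(n)}$ as a starting point.

Given the estimate of the partial gradients, the gradient ascent rule for parameter updates reduces to

$$\theta \leftarrow \theta + \eta \Big[ \sum_{n=1}^{N} \Big( F_{\partial\theta}\big(\mathbf{y}^{(n)}, \mathbf{x}^{(n)}\big) - F_{\partial\theta}\big(\hat{\mathbf{y}}^{(n)}, \mathbf{x}^{(n)}\big) \Big) - \frac{1}{\sigma_\theta^2} \theta \Big],$$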
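To make the learning rule concrete, the following sketch runs contrastive-divergence updates on a deliberately simplified model. It is an illustration under stated assumptions, not the proposed CRF: sites lie on a 1D chain, the single-site potential is $\log\sigma(y_i \mathbf{w}^T \mathbf{f}_i)$, the interaction potential is reduced to a scalar weight `v` times $y_i y_j$, and the names, step size `lr`, and prior variance `prior_var` are all hypothetical. The “evil twin” is drawn with a single Gibbs sweep started at the ground-truth labels.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def neighbors(i, n):
    """Chain neighborhood (assumption): sites i-1 and i+1 when they exist."""
    return [j for j in (i - 1, i + 1) if 0 <= j < n]

def prob_positive(i, y, feats, w, v):
    """P(y_i = +1 | all other labels): sigmoid of the gap F_i(+1) - F_i(-1)."""
    gap = feats[i] @ w + 2.0 * v * sum(y[j] for j in neighbors(i, len(y)))
    return sigmoid(gap)

def energy_gradient(y, feats, w, v):
    """Gradient of the summed potentials with respect to (w, v)."""
    margins = y * (feats @ w)
    grad_w = ((1.0 - sigmoid(margins)) * y) @ feats
    grad_v = float(np.sum(y[:-1] * y[1:]))
    return grad_w, grad_v

def gibbs_sweep(y, feats, w, v, rng):
    """One sequential Gibbs pass over a copy of the label field."""
    y = y.copy()
    for i in range(len(y)):
        y[i] = 1.0 if rng.random() < prob_positive(i, y, feats, w, v) else -1.0
    return y

def cd_step(y_true, feats, w, v, rng, lr=0.01, prior_var=10.0):
    """Gradient ascent step with the expectation replaced by the evil twin."""
    y_twin = gibbs_sweep(y_true, feats, w, v, rng)  # chain starts at ground truth
    gw_data, gv_data = energy_gradient(y_true, feats, w, v)
    gw_twin, gv_twin = energy_gradient(y_twin, feats, w, v)
    w_new = w + lr * (gw_data - gw_twin - w / prior_var)
    v_new = v + lr * (gv_data - gv_twin - v / prior_var)
    return w_new, v_new

rng = np.random.default_rng(0)
n = 200
y_true = np.where(np.arange(n) % 40 < 10, 1.0, -1.0)  # runs of anomalous sites
feats = np.column_stack([y_true + rng.normal(0.0, 1.0, n), np.ones(n)])
w, v = np.zeros(2), 0.0
for _ in range(200):
    w, v = cd_step(y_true, feats, w, v, rng)
```

Starting the chain at the ground truth is what keeps the number of MCMC iterations small: the update vanishes when a sweep from the correct labels tends to reproduce them.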
$$= \frac{1}{Z_i} \exp\Big[ F_i\big(\{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, \{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}, \mathbf{y}_{\setminus i}, y_i; \theta\big) \Big],$$

where $F_i(\{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, \{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}, \mathbf{y}_{\setminus i}, y_i; \theta)$ is the sum of potential functions that depend on site $i$ (i.e., its “Markov blanket”):

$$F_i\big(\{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, \{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}, \mathbf{y}_{\setminus i}, y_i; \theta\big) = A\big(y_i, \mathbf{x}^{(\tau)}\big) + \frac{1}{|N_i|} \sum_{j \in N_i} I\big(y_i, y_j, \{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, i, j\big) + \sum_{j:\, i \in N_j} \frac{1}{|N_j|} I\big(y_j, y_i, \{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, j, i\big), \quad (24)$$

and $Z_i$ the corresponding partition function:

$$Z_i = \sum_{y_i'} \exp\Big[ F_i\big(\{\mathbf{x}^{(t)}\}_{t=1}^{\tau}, \{\mathbf{y}^{(t)}\}_{t=1}^{\tau-1}, \mathbf{y}_{\setminus i}, y_i'; \theta\big) \Big]. \quad (25)$$

The procedure is detailed in Algorithms 2 and 3, available in the online supplemental material, where we present the online CRF filter used to estimate the label field. During learning, the filter is initialized with the ground-truth labels ($\mathbf{y}^{0} = \mathbf{y}^{(\tau)}$). During testing, this initialization relies on the predictions of the single-site classifiers ($\mathbf{v}_T = \mathbf{v}_S = 0$). In our

1. Available from http://www.svcl.ucsd.edu/projects/anomaly/dataset.html.

pedestrian walkways on the UCSD campus. The crowd density in the walkways is variable, ranging from sparse to very crowded. In the normal setting, the video contains only pedestrians. Abnormal events are due to either 1) the circulation of nonpedestrian entities in the walkways, or 2) anomalous pedestrian motion patterns. Commonly occurring anomalies include bikers, skaters, small carts, and people walking across a walkway or in the surrounding grass. A few instances of wheelchairs are also recorded. All abnormalities occur naturally, i.e., they were not staged or synthesized for data set collection.

The data set is organized into two subsets, corresponding to the two scenes of Fig. 5. The first, denoted “Ped1,” contains clips of 158 × 238 pixels, which depict groups of people walking toward and away from the camera, and some amount of perspective distortion. The second, denoted “Ped2,” has spatial resolution of 240 × 360 pixels and depicts a scene where most pedestrians move horizontally. The video footage of each scene is sliced into clips of 120-200 frames. A number of these (34 in Ped1 and 16 in Ped2) are to be used as training set for the condition of
normalcy. The test set contains clips (36 for Ped1 and 12 for Ped2) with both normal (around 5,500) and abnormal (around 3,400) frames. The abnormalities of each set are summarized in Table 1.

TABLE 1
Composition of UCSD Anomaly Data Set
a. Number of clips/number of anomaly instances. b. Some clips contain more than one type of anomaly.

Frame-level ground-truth annotation, indicating whether anomalies occur within each frame, and manually collected pixel-level binary anomaly masks, which identify the pixels containing anomalies, are available per test clip. We note that this includes ground truth on Ped1 contributed by Antic and Ommer [9], and supersedes the ground truth available on an earlier version of this work [43]. We denote the current ground truth by “full annotation” and the previous one by “partial annotation.” Unless otherwise noted, the results of the subsequent sections correspond to the full annotation.

6.2 Evaluation Methodology

Two criteria are used to evaluate anomaly detection accuracy: a frame-level criterion and a pixel-level criterion. Both are based on true-positive rates (TPR) and false-positive rates (FPR), denoting “an anomalous event” as “positive” and “the absence of anomalous events” as “negative.” A frame containing anomalies is denoted a positive, otherwise a negative. The true and false positives under the two criteria are:

• Frame-level criterion. An algorithm predicts which frames contain anomalous events. This is compared to the clip’s frame-level ground-truth anomaly annotations to determine the number of true- and false-positive frames.
• Pixel-level criterion. An algorithm predicts which pixels are related to anomalous events. This is compared to the pixel-level ground-truth anomaly annotation to determine the number of true-positive and false-positive frames. A frame is a true positive if 1) it is positive and 2) at least 40 percent of its anomalous pixels are identified; a frame is a false positive if it is negative and any of its pixels are predicted as anomalous.

The two measures are combined into a receiver operating characteristic (ROC) curve of TPR versus FPR:

$$\mathrm{TPR} = \frac{\#\text{ of true-positive frames}}{\#\text{ of positive frames}}, \qquad \mathrm{FPR} = \frac{\#\text{ of false-positive frames}}{\#\text{ of negative frames}}.$$

Performance is also summarized by the equal error rate (EER), the ratio of misclassified frames at which $\mathrm{FPR} = 1 - \mathrm{TPR}$, for the frame-level criterion, or rate of detection (RD), i.e., 1-EER, for the pixel-level criterion.

Note that, although widely used in the literature, the frame-level criterion only measures temporal localization accuracy. This enables errors due to “lucky co-occurrences” of prediction errors and true abnormalities. For example, it assigns a perfect score to an algorithm that identifies a single anomaly at a random location of a frame with anomalies. The pixel-level criterion is much stricter and more rigorous. By evaluating both the temporal and spatial accuracy of the anomaly predictions, it rules out these “lucky co-occurrences.” We believe that the pixel-level criterion should be the predominant criterion for evaluation of anomaly detection algorithms.

6.3 Experimental Setup

Unless otherwise noted, observation sites are a video sublattice with spatial interval of four pixels and temporal interval of five frames. Temporal anomaly maps rely on patches of 13 × 13 × 15 pixels. The temporal extent of 15 frames provides a reasonable compromise between the ability to detect anomalies and the delay (1.5 s) and storage (15 video frames) required for anomaly detection. To minimize computation, patches of variance smaller than 500 are discarded.2 Temporal H-MDT models are learned from fine to coarse scale. At the finer scale, there are 6 × 10 windows $R_i^1$ on Ped1 (8 × 11 for Ped2), each covering a 41 × 41 pixel area and overlapping by 25 percent with each of its four neighbors. An MDT of five components is learned per window. At coarser spatial scales, an MDT is estimated from the MDTs of the four regions that it covers at the immediately finer resolution. Each estimated MDT has one more component than its ancestor MDTs. Overall, there are 10 scales in Ped1 and 11 in Ped2. Spatial anomaly maps use a 31 × 31 center window and surround windows of size equivalent to $R_i^s$. For segmentation, 7 × 7 × 10 patches are extracted from the 40 frames surrounding that under analysis. There are five DT components at all levels of the spatial hierarchy. Both temporal and spatial MDTs have an eight-dimensional state space. The sensitivity of the proposed detector to some of these parameters is discussed in Appendix C.2, available in the online supplemental material.

6.4 Descriptor Comparison

The first experiment evaluated the benefits of MDT-based over optical flow descriptors. The optical flow descriptors considered were the local motion histogram (LMH) of [19], the force flow descriptor of [4], and the mixture of optical flow models (MPPCA) of [3]. LMH uses statistics of local motion, and is representative of traditional background subtraction representations, force flow is a descriptor for spatial anomaly detection, and MPPCA a temporal anomaly detector. For the MDT, only the anomaly maps of finest temporal and coarsest spatial scale were considered here. Since the goal was to compare descriptors, the high-level components of the models in which they were proposed, for example, the LDA of [4], the MRF of [3], and the proposed CRF, were not used. Instead, anomaly predictions were smoothed with a simple

2. This variance threshold is quite conservative, only eliminating regions of very little motion. For the data sets used in our experiments, this has not led to the elimination of any objects from further consideration. In other contexts, for example, scenes where objects are static for periods of time, this could happen. In this case, the threshold should be set to zero.
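The evaluation protocol of Section 6.2 can be sketched in a few lines. The code below is a minimal illustration, assuming per-frame anomaly scores (higher means more anomalous), binary frame labels with both classes present, and boolean per-frame masks; the function names and the linear interpolation used to locate the EER are choices made for this sketch, not taken from the evaluation code released with the data set.

```python
import numpy as np

def frame_roc(scores, labels):
    """FPR and TPR of the frame-level criterion for every score threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # Sweep thresholds from "nothing detected" down to "everything detected".
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    tpr = np.array([(scores[labels] >= t).mean() for t in thresholds])
    fpr = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    return fpr, tpr

def equal_error_rate(fpr, tpr):
    """Error rate at which FPR = 1 - TPR, interpolated along the ROC curve."""
    gap = fpr - (1.0 - tpr)       # -1 at the first point, +1 at the last
    k = int(np.argmax(gap >= 0))  # first point at or past the crossing
    s = -gap[k - 1] / (gap[k] - gap[k - 1])
    return float(fpr[k - 1] + s * (fpr[k] - fpr[k - 1]))

def pixel_level_rates(pred_masks, gt_masks, min_overlap=0.4):
    """TPR and FPR under the pixel-level criterion for one set of predictions."""
    tp = fp = positives = negatives = 0
    for pred, gt in zip(pred_masks, gt_masks):
        if gt.any():
            positives += 1
            # True positive: at least 40 percent of anomalous pixels identified.
            if (pred & gt).sum() >= min_overlap * gt.sum():
                tp += 1
        else:
            negatives += 1
            # False positive: any pixel predicted anomalous in a negative frame.
            if pred.any():
                fp += 1
    return tp / positives, fp / negatives

fpr, tpr = frame_roc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
eer = equal_error_rate(fpr, tpr)
```

The rate of detection of the pixel-level criterion follows by computing `pixel_level_rates` for masks obtained at each threshold, locating the EER of the resulting curve in the same way, and reporting 1-EER.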
20 × 20 × 10 Gaussian filter. Anomaly predictions were generated by thresholding the filtered anomaly maps and ROC curves by varying thresholds.

The performance of the different descriptors, under both the frame-level (EER) and pixel-level (RD) criteria (using both full and partial annotation in Ped1), is summarized in Table 2. The corresponding ROC curves are presented in Appendix C.1 (Fig. 13), available in the online supplemental material. Examples of detected anomalies are shown in Fig. 6. Under the frame-level criterion, temporal MDT has the best performance in both scenes. Spatial MDT performs worse than others in Ped1 but ranks second in Ped2. However, for the more precise pixel-level criterion, spatial MDT is the top or second best performer. In this case, both MDTs significantly outperform all optical flow descriptors. The gap between corresponding competitors (e.g., temporal MDT versus MPPCA or LMH, spatial MDT versus force flow) is of at least 10 percent RD. These results show that there is a definite benefit to the joint representation of appearance and dynamics of the MDT.

TABLE 2
Descriptor Performance on UCSD Anomaly Data Set
Numbers outside/inside parentheses are results by full/partial annotation (same for the rest of the paper).

This is not totally surprising, given the limitations of optical flow. First, the brightness constancy assumption is easily violated in crowded scenes, where stochastic motion and occlusions prevail. Second, optical flow measures instantaneous displacement, while the DT is a smooth motion representation with extended temporal support. Finally, while optical flow is a bandpass measure, which eliminates most of the appearance information, the DT models both appearance and dynamics. The last two properties are particularly important for crowded scenes, where objects occlude and interact in complicated manners.

Overall, although optical flow can signal fast moving anomalous subjects, it leads to too many false positives in regions of complex motion, occlusion, and so on. More interesting is the lack of advantage for either spatial or temporal anomaly detection, both among MDT maps and prior techniques (no clear advantage to either force flow or MPPCA). In fact, as shown in Fig. 6, temporal and spatial anomalies tend to be different objects. This suggests the combination of the two strategies.

6.5 Scale and Globally Consistent Prediction

We next investigated the benefits of information fusion across space and scale, with the proposed CRF. We started with a single-scale description (S-MDT), using only the anomaly maps at finest temporal and coarsest spatial scales, i.e., a 3D feature per site. We next considered a multiscale description, using the whole H-MDT. In both cases, inference was performed with logistic regression, i.e., the interaction term of (16) turned off, and the Gaussian filter of the previous section. In each trial, the logistic classifier was trained by Newton’s method [42]. Finally, we considered the full-blown CRF, denoted CRF filter. The dimensions of the spatial and temporal CRF neighborhoods were set to $|N^S| = 6$, $|N^T| = 3$. ROC curves were generated by varying the threshold for prediction.

TABLE 3
Filter Performance on the UCSD Anomaly Data Set

Table 3 presents a comparison of the three approaches. The corresponding ROC curves are shown in Appendix C.1 (Fig. 14), available in the online supplemental material. Under the pixel-level criterion, the multiscale maps have higher accuracy than their single-scale counterparts, demonstrating the benefits of modeling anomalies in scale space (improvement of RD by as much as 11 percent). The CRF

Fig. 6. Anomaly predictions of temporal MDT, spatial MDT, MPPCA, force flow, and LMH (from left to right). Red regions are abnormal pixels. All predictions generated with thresholds such that the different approaches have similar FPR under frame-level protocol (these settings apply to all the subsequent figures unless otherwise stated).
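The baseline post-processing of Section 6.4, Gaussian smoothing of an anomaly map followed by thresholding, can be sketched as follows. This is an illustration under assumptions: the text only gives the 20 × 20 × 10 filter size, so the standard deviation (a quarter of the kernel size along each axis), the zero padding at the borders, and the synthetic input volume are hypothetical choices, and the input is assumed to be at least as large as the kernel along every axis.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 1D Gaussian kernel with `size` taps."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_map(volume, sizes=(20, 20, 10), sigma_scale=0.25):
    """Separable Gaussian smoothing of a (height, width, time) anomaly map.

    Borders are zero padded; even kernel sizes introduce a half-sample offset.
    """
    out = np.asarray(volume, dtype=float)
    for axis, size in enumerate(sizes):
        kernel = gaussian_kernel(size, sigma_scale * size)
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), axis, out
        )
    return out

def predict(smoothed, threshold):
    """Binary anomaly prediction: sites whose smoothed score reaches the threshold."""
    return smoothed >= threshold

# Synthetic anomaly map with one small high-score blob.
raw = np.zeros((40, 60, 20))
raw[18:22, 28:32, 8:12] = 1.0
smoothed = smooth_map(raw)

# Sweeping the threshold yields one operating point of the ROC curve per value.
thresholds = np.linspace(0.0, smoothed.max(), 11)
predictions = [predict(smoothed, t) for t in thresholds]
mid = predict(smoothed, 0.5 * smoothed.max())
```

Because the kernel is fixed, the same amount of smoothing is applied everywhere in the map, regardless of what the neighboring sites contain.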
Fig. 7. Examples of anomaly localization with Gaussian smoothing (in blue) and CRF filter (in red). The latter predicts more accurately the
spatiotemporal support of anomalies in crowded regions, where occlusion is prevalent.
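Fig. 7 contrasts Gaussian smoothing with the CRF filter. To make the baseline concrete, the following minimal sketch applies a separable Gaussian filter to a toy anomaly map and thresholds the result. The function names, the kernel radius, the 0.05 threshold, and the 5x5 map are assumptions for illustration and are not taken from the paper.

```python
import math

def gaussian_kernel(sigma, radius):
    # 1-D Gaussian weights, normalized to sum to 1.
    k = [math.exp(-(i * i) / (2.0 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def smooth_rows(grid, kernel):
    # Convolve every row with the kernel, replicating border values.
    radius = len(kernel) // 2
    out = []
    for row in grid:
        n = len(row)
        out.append([
            sum(kernel[j + radius] * row[min(max(i + j, 0), n - 1)]
                for j in range(-radius, radius + 1))
            for i in range(n)
        ])
    return out

def gaussian_smooth(grid, sigma=1.0, radius=2):
    # Separable filter: smooth rows, transpose, smooth rows again, transpose back.
    kernel = gaussian_kernel(sigma, radius)
    rows_done = smooth_rows(grid, kernel)
    transposed = [list(col) for col in zip(*rows_done)]
    cols_done = smooth_rows(transposed, kernel)
    return [list(col) for col in zip(*cols_done)]

# Toy 5x5 anomaly map: a single high-scoring site surrounded by normal sites.
raw = [[0.0] * 5 for _ in range(5)]
raw[2][2] = 1.0
smoothed = gaussian_smooth(raw)
mask = [[v > 0.05 for v in row] for row in smoothed]
```

The single anomalous site leaks into its neighbors whatever their own scores are, because the same kernel is applied at every location. This is the uniform smoothing that the text contrasts with the CRF, which adapts to the structure of the anomaly.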
TABLE 4
Performance of Various Methods (RD/Seconds per Frame) by Pixel-Level Criterion on UCSD Anomaly Data Set
Implementation: } C/2.8-GHz CPU/2-GB RAM; \ C++ and Matlab (feature extraction and model inference)/2.6GHz CPU/2GB RAM; #Matlab/dual-
core 2.7GHz CPU/8GB RAM.
filter further improves performance (improvement of RD by as much as 3 percent), demonstrating the gains of globally consistent inference. As shown in Fig. 7, the visual improvements are even more substantial.³ Simple filtering does not take into account interactions between neighboring sites and smooths the anomaly maps uniformly. On the other hand, the CRF adapts the degree of smoothing to the spatiotemporal structure of the anomalies, increasing the precision of anomaly localization. Note how, in Fig. 7, the CRF-filter successfully excludes occluded but normally behaving pedestrians from anomaly regions. These improvements are not always captured by the frame-level criterion. In fact, there is little EER difference between S-MDT and H-MDT. The inconsistency between frame- and pixel-level results in Tables 2 and 3 shows that the former is not a good measure of anomaly detection performance. Henceforth, only the pixel-level criterion is used in the remaining experiments on this data set.

6.6 Anomaly Detection Performance
We next evaluated the performance of the complete anomaly detector. For this, we selected two detectors from the recent literature, with state-of-the-art performance for temporal [8] and combined spatial and temporal anomaly detection [9]. The RD of the various methods is summarized in Table 4, for both partial and full annotation. The corresponding ROC curves are shown in Fig. 8. Table 4 also presents the processing time per video frame of each method. Missing entries indicate unavailable results for the particular data set and/or annotation type. A discussion of the detection errors made by the detector is given in Appendix C.3, available in the online supplemental material.
On Ped1, the temporal component of the proposed detector substantially outperforms the temporal detector of [8]. A multiple-scale temporal anomaly map with CRF filtering increases the 46 percent RD⁴ of [8] to 52 percent. A similar implementation of the spatial anomaly detector (a multiple-scale map plus CRF filtering) achieves 58 percent. Combining both maps and multiple spatial scales further improves the RD to 65 percent. Computationally, the proposed detector is also much more efficient. For implementations on similar hardware (see footnotes of Table 4), it requires 1.11 s/frame, as compared to the 3.8 s/frame reported for [8].
Like the proposed detector, the Bayesian video parsing (BVP) of [9] combines spatial and temporal anomaly detection, using a more complex video representation, parsing of the video to extract all the objects in the scene, a support vector machine classifier for detection of temporal anomalies, a graphical model with seven nodes per site (and multiple nonparametric models for location, scale, and velocity) for detection of spatial anomalies, and occlusion reasoning. This is an elegant solution, which achieves slightly better RD than the proposed detector (2 percent for full and 3 percent for partial annotation), but at substantially higher computational cost (5 to 10 times slower). We believe that when both accuracy and computation are considered, the proposed detector is a more effective solution. However, these results suggest that gains could be achieved by expanding the proposed CRF, as [9] trades a much simpler representation of video dynamics (optical flow versus MDT) for more sophisticated inference. It would be interesting to consider CRF extensions with some of the properties of the graphical model of [9], namely, explicit occlusion reasoning. This is left for subsequent research.
Fig. 9. Impact of context on anomaly maps. First three columns: Temporal anomalies, cell coverage at different HMDT layers shown in blue. Last two
columns: Spatial anomalies, example center (surround) windows shown in blue (light yellow).
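Fig. 9 illustrates how the size of the context changes an anomaly judgment. The toy sketch below makes the same point with a deliberately simplified stand-in: the absolute difference between the mean of a center window and the mean of its surround, on a 1-D row of features. This is not the MDT-based discriminant saliency used by the detector, and the function name, window sizes, and feature values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def center_surround_contrast(signal, center, half_center, half_surround):
    """Absolute difference between the mean of a center window and the
    mean of the surrounding ring in a 1-D signal (a stand-in for a row
    of per-site motion or appearance features)."""
    lo_c, hi_c = center - half_center, center + half_center + 1
    lo_s = max(0, center - half_surround)
    hi_s = min(len(signal), center + half_surround + 1)
    center_vals = signal[lo_c:hi_c]
    surround_vals = signal[lo_s:lo_c] + signal[hi_c:hi_s]
    return abs(mean(center_vals) - mean(surround_vals))

# Hypothetical feature row: the site at index 5 differs from its immediate
# neighbors, but similar values appear further away (indices 1 and 9).
row = [0.1, 0.9, 0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.9, 0.1]

narrow = center_surround_contrast(row, center=5, half_center=0, half_surround=2)
wide = center_surround_contrast(row, center=5, half_center=0, half_surround=5)
```

Here the narrow surround yields the larger contrast, since the wide surround also covers sites that resemble the center. Other arrangements favor the wide surround, which is why no single context size can be expected to work for all scenes.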
6.7 Role of Context in Anomaly Judgments
We next investigated the impact of normalcy context in anomaly judgments. For temporal anomalies, context is determined by the subregion size: As the latter increases, temporal models become more global. Fig. 9 shows that the scale of normalcy context significantly impacts anomaly scores. For example, the two cyclists on the left-most columns of the figure are missed at small scales but detected by the more global models. On the other hand, a leftward heading pedestrian in the third column has high anomaly score at the finest scale but is not anomalous in larger contexts. In summary, no single context is effective for all scenes. Due to the stochastic arrangements of people within crowds, two crowds of the same size can require different context sizes. In general, the optimal size depends on the crowd configuration and the anomalous event.
A similar observation holds for spatial anomalies, where context is set by the size of the surround window. For example, in the fourth column of Fig. 9, the subject walking on the grass is very salient when compared to her immediate neighbors, and anomaly detection benefits from a narrower context. For larger contexts, she becomes less unique than a man that walks in the direction opposite to his neighbors. On the other hand, the cart and bike of the last column only pop out when the surround window is large enough to cover some pedestrians. In summary, anomalies depend strongly on scene context, and this dependence can vary substantially from scene to scene. It is, thus, important to fuse anomaly information across spatial scales.

abnormal events (e.g., seconds of normalcy followed by a short abnormal event). The main limitations of this data set are that
1. it is relatively small (scenes 1, 2, and 3 contain two, six, and three anomaly instances),
2. it has no pixel-level ground truth,
3. the anomalies are staged, and
4. it produces very salient changes in the average motion intensity of the scene.
As a result, several methods achieve near perfect detection. The proposed detector was based on 3 × 3 subregions of size 180 × 180 at the finest spatial scale and a 3-scale anomaly map for both the temporal and spatial components. One normal-abnormal instance of each scene was used to train the temporal normalcy model and CRF filter, and the remaining instances for testing. A comparison to previous results in the literature, under the frame-level criterion, is presented in Table 5 and Fig. 11. Due to the salient motion discontinuities, the temporal component (99.2 percent AUC) substantially outperforms the spatial component (97.9 percent). Nevertheless, the complete detector achieves the best performance (99.5 percent). This is nearly perfect, and comparable to the previous best results in the literature.
Subway. The Subway data set [19] consists of two sequences recorded from the entrance (1 h and 36 min, 144,249 frames) and exit (43 min, 64,900 frames) of a
5. http://mha.cs.umn.edu/Movies/Crowd-Activity-All.avi.
Fig. 11. ROC curves of frame-level criterion on the UMN (left), Subway (center), and U-turn (right) data sets.
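Frame-level ROC curves such as those in Fig. 11 are obtained by sweeping a threshold over per-frame anomaly scores. The sketch below shows one plain way to compute the curve, its area (AUC), and the equal error rate (EER). The function names and the toy scores and labels are assumptions for illustration, and the authors' exact evaluation protocol may differ.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over scores.

    scores: per-frame anomaly scores (higher means more anomalous).
    labels: 1 for abnormal frames, 0 for normal frames.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all frames tied at this score are counted.
        last = rank == len(order) - 1
        if last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr

def auc(fpr, tpr):
    # Area under the ROC curve by the trapezoidal rule.
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
               for k in range(1, len(fpr)))

def equal_error_rate(fpr, tpr):
    # Operating point where the false positive rate is closest to the miss rate.
    k = min(range(len(fpr)), key=lambda j: abs(fpr[j] - (1.0 - tpr[j])))
    return (fpr[k] + (1.0 - tpr[k])) / 2.0

# Toy example: eight frames, four of them abnormal.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
fpr, tpr = roc_curve(scores, labels)
area = auc(fpr, tpr)            # 0.8125 for this toy data
eer = equal_error_rate(fpr, tpr)  # 0.25 for this toy data
```

The same functions apply to pixel-level evaluation once each frame has been labeled as detected or missed under that criterion.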
Vijay Mahadevan received the BTech degree from the Indian Institute of Technology, Madras, in 2002, the MS degree from Rensselaer Polytechnic Institute, Troy, New York, in 2003, and the PhD degree from the University of California, San Diego, in 2011, all in electrical engineering. From 2004 to 2006, he was with the Multimedia group at Qualcomm Inc., San Diego, California. He is currently with Yahoo! Labs, Bengaluru. His interests include computer vision and machine learning and their applications. He is a member of the IEEE.

Nuno Vasconcelos received the licenciatura degree in electrical engineering and computer science from the Universidade do Porto, Portugal, and the MS and PhD degrees from the Massachusetts Institute of Technology. He is a professor in the Electrical and Computer Engineering Department, University of California, San Diego, where he heads the Statistical Visual Computing Laboratory. He has received a US National Science Foundation (NSF) CAREER award, a Hellman Fellowship, and has authored more than 150 peer-reviewed publications. He is a senior member of the IEEE.