Accurate Silhouette Extraction of Multiple Moving Objects For Free Viewpoint Sports Video Synthesis
978-1-4673-7478-1/15/$31.00 ©2015 IEEE

Abstract—In this paper, we propose a new method for automatic, highly accurate silhouette extraction of multiple moving objects for free viewpoint stadium sports video synthesis. The proposed method consists of three parts: a global extraction based on temporal background subtraction, a classification step based on constraints on the extracted object candidates, and a local refinement based on the statistical information of the chrominance component of each extracted object. Experimental results show that the proposed method outperforms the temporal background subtraction model and the Gaussian Mixture Model (GMM) in both objective and subjective evaluations. In addition, the quality of the synthesized free viewpoint sports video is enhanced by adopting the more accurate silhouettes extracted by the proposed method. Furthermore, because the proposed method involves no manual operation, fully automatic extraction of multiple silhouettes is realized.

I. INTRODUCTION

In recent years, Free Viewpoint Video (FVV) has gained popularity because it provides a beyond-3D experience in which the virtual viewpoint can be selected freely and moved around, back and forth as well as up and down [1]. The system gives the audience an immersive and ultra-realistic feeling, and this experience is called “walk-through” and “fly-through” [2].
Regarding applications of the FVV system, scenes of dynamic sports games, such as soccer, tennis, and baseball, are very attractive and suitable because the system enables people to select different viewpoints actively, which enhances the interest and pleasure of watching games. There are three main types of methods for generating free viewpoint video: depth-based, image-based, and 3D model-based. Both the depth-based and image-based methods require a dense camera array, and the viewpoint selection is limited. In contrast, the 3D model-based method places no restriction on the selection of virtual viewpoints in 3D space.
In the previous work [3], the 3D model-based method was adopted and developed for the generation of the FVV system. Furthermore, a simplified 3D model, also named the billboard model [4], has been adopted in [5]. Typically, the billboard model is constructed from the shapes of the silhouettes of objects extracted from camera images. For instance, each object, such as a player in a sports game, is represented as a billboard, and the visual texture acquired from the multi-view cameras is mapped to the virtual viewpoint selected by the viewers. The accuracy of the billboard model has a great impact on the quality of video generation, and an inaccurate silhouette of the object, such as an extracted silhouette with a shadow area attached, will ruin the synthesis quality of the free viewpoint video. Therefore, accurate extraction of objects is very significant for generating high-quality free viewpoint sports video. In addition, for practical real-time applications, an automatic extraction method is also necessary.

II. RELATED WORKS

Object extraction is a traditional research topic in the field of computer vision, and it is of great significance in numerous applications, such as traffic detection, human tracking, pattern recognition, and object segmentation. In this paper, with the application of synthesizing free viewpoint sports video in mind, we focus on the accurate silhouette extraction of multiple moving objects.
To extract moving objects from a single camera with a static background, the background subtraction method [6] is simple and effective. One typical approach models several consecutive frames of a video as one Gaussian model and sets a threshold value to extract moving objects. This method is called temporal background subtraction. However, it relies greatly on the learned background model and the threshold value. Because the result is very sensitive to the threshold value and shadows move with the objects, many undesired areas, such as shadows, are extracted together with the objects. Therefore, simple temporal background subtraction cannot obtain clear silhouettes of the moving objects. In addition, the Gaussian Mixture Model (GMM), proposed in [7] and modified in [8], has also proved effective in object extraction. This method is also based on a learning process, and each pixel in the image gets one Gaussian mixture model. However, when pixels of an object are similar to the background pixels, they may be regarded as background and discarded in object extraction, resulting in many missing areas in the extracted objects. In recent years, other methods have also been proposed for object extraction. The method in [9] tries to detect the shadow area in order to obtain a precise extracted object. However, this work only serves
for a single object. When there are multiple objects, the method does not work well in local areas. In addition, the method in [10] integrates several saliencies for automatic object extraction. However, it does not consider shadow removal. Moreover, the authors in [5] proposed a method based on 3D model projections with the likelihood of removing the shadow areas. However, this method only works in a densely arranged multiple camera system and is suitable for neither a sparsely arranged multiple camera system nor a single camera system. The method in [11] is able to extract a precise object based on an optimization procedure. However, this method mainly serves foreground extraction in a still image, and it requires manual operation to assign initial points that indicate foreground and background before extraction. Thus, fully automatic object extraction cannot be realized.
Therefore, according to the related works mentioned above,

[Flowchart: Video sequence as input -> Temporal background subtraction -> Silhouette detection for each extracted candidate -> Obtain silhouette candidate set by classification according to size, shape and luminance consistency information -> for each silhouette candidate: Identification of refinement area -> Refinement -> End of silhouette candidate set? (No: next silhouette candidate)]
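As a rough illustration of the temporal background subtraction step that opens the flow above, the following Python sketch fits one Gaussian per pixel from a few consecutive background frames and thresholds the deviation of a new frame. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `learn_background` and `foreground_mask`, the threshold factor `k`, and the floor `min_std` are hypothetical choices made for this example, and the frames are assumed to be single-channel arrays.

```python
import numpy as np


def learn_background(frames):
    """Fit one Gaussian per pixel from a stack of consecutive frames.

    frames: sequence of equally sized 2-D arrays (assumed grayscale).
    Returns the per-pixel mean and standard deviation.
    """
    stack = np.asarray(frames, dtype=np.float64)  # shape (T, H, W)
    return stack.mean(axis=0), stack.std(axis=0)


def foreground_mask(frame, mean, std, k=2.5, min_std=2.0):
    """Mark pixels that deviate from the background mean by more than k sigma.

    k and min_std are illustrative assumptions, not values from the paper.
    min_std keeps the threshold from collapsing on near-constant pixels.
    """
    sigma = np.maximum(std, min_std)
    diff = np.abs(np.asarray(frame, dtype=np.float64) - mean)
    return diff > k * sigma


# Synthetic example: a flat background, a bright "player", and a darker
# "shadow" row directly below it.
background_frames = [
    np.full((6, 6), 99.0 if i % 2 == 0 else 101.0) for i in range(10)
]
mean, std = learn_background(background_frames)

frame = np.full((6, 6), 100.0)
frame[1:3, 1:3] = 200.0  # object pixels
frame[3, 1:3] = 90.0     # shadow pixels, darker than the background

mask = foreground_mask(frame, mean, std)
```

In this toy case the shadow row passes the threshold together with the object, which is the weakness of plain temporal background subtraction noted in Section II; raising `k` far enough to suppress the shadow would also start to discard object pixels that are close to the background.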
Fig. 7. Comparison of the three methods for silhouette extraction (central camera). In each row the panels show, from left to right, temporal background subtraction, the GMM-based method [8], and the proposed method: (a)-(c) are the extracted masks, and (d)-(f) and (g)-(i) are the corresponding close-up results for detailed comparison.
Fig. 8. Comparison of the three methods for silhouette extraction (side camera). In each row the panels show, from left to right, temporal background subtraction, the GMM-based method [8], and the proposed method: (a)-(c) are the extracted masks, and (d)-(f) and (g)-(i) are the corresponding close-up results for detailed comparison.