
Accurate Silhouette Extraction of Multiple Moving Objects for Free Viewpoint Sports Video Synthesis

Qiang Yao, Hiroshi Sankoh, Houari Sabirin, Sei Naito
KDDI R&D Laboratories, Inc.
Ohara 2-1-15, Fujimino, Saitama, Japan
{qi-yao, sankoh, ho-sabirin, sei}@kddilabs.jp

Abstract—In this paper, we propose a new method for the automatic, highly accurate silhouette extraction of multiple moving objects for free viewpoint stadium sports video synthesis. The proposed method is composed of three parts: a global extraction based on temporal background subtraction, a classification step based on constraints on the extracted object candidates, and a local refinement based on statistical information from the chrominance component of each extracted object. Experimental results show that the proposal outperforms the temporal background subtraction model and the Gaussian Mixture Model (GMM) in both objective and subjective evaluations. In addition, the quality of the synthesized free viewpoint sports video is enhanced by adopting the more accurate object silhouettes extracted by our method. Furthermore, as the proposed method involves no manual operation, fully automatic extraction of multiple silhouettes is realized.

I. INTRODUCTION

In recent years, Free Viewpoint Video (FVV) has gained increasing popularity since it provides a beyond-3D experience in which the virtual viewpoint can be selected freely and moved around, back and forth, as well as up and down [1]. The system brings an immersive and ultra-realistic feeling to the audience, an experience called "walk-through" and "fly-through" [2].

Regarding applications of the FVV system, the scenes of dynamic sports games, such as soccer, tennis, and baseball, are very attractive and suitable because the system enables people to select different viewpoints actively, enhancing the interest and pleasure of watching the games. Basically, there are three types of methods for generating free viewpoint video: depth-based, image-based, and 3D model-based. Both the depth-based and image-based methods require a densely arranged camera array, and the viewpoint selection is limited. In contrast, the 3D model-based method places no restriction on the selection of virtual viewpoints in 3D space.

In the previous work [3], the 3D model-based method was adopted and developed for the generation of the FVV system. Furthermore, a simplified 3D model, also called the billboard model [4], has been adopted in [5]. Typically, the billboard model is constructed from the silhouettes of objects extracted from the camera images. For instance, each object, such as a player in a sports game, is represented as a billboard, and the visual texture acquired from multi-view cameras is mapped to the virtual viewpoint selected by the viewer. It is noted that the accuracy of the billboard model has a great impact on the quality of video generation, and an inaccurate object silhouette, such as one with an attached shadow area, will ruin the synthesis quality of the free viewpoint video. Therefore, accurate extraction of objects is essential for generating high-quality free viewpoint sports video. In addition, for practical real-time application, an automatic extraction method is also necessary.

II. RELATED WORKS

Object extraction is a traditional research topic in the field of computer vision, and it is of great significance in numerous applications, such as traffic detection, human tracking, pattern recognition, and object segmentation. In this paper, especially for the application of synthesizing free viewpoint sports video, we focus on the accurate silhouette extraction of multiple moving objects.

Basically, to extract moving objects from a single camera with a static background, the background subtraction method [6] is simple and effective. One typical method is to model several consecutive frames of a video with one Gaussian model per pixel and to set a threshold value to extract moving objects. This method is called temporal background subtraction. However, it relies heavily on the learned background model and the threshold value. Because the threshold value is very sensitive and shadows move together with the objects, many undesired areas, such as shadows, are also extracted with the objects. Therefore, simple temporal background subtraction cannot obtain clean silhouettes of the moving objects. In addition, the Gaussian Mixture Model (GMM), proposed in [7] and modified in [8], has also proved effective for object extraction. This method is likewise based on a learning process, and each pixel in the image receives one Gaussian mixture model. However, when the pixels of an object are similar to the pixels of the background, those pixels may be regarded as background and discarded during extraction, resulting in many missing areas in the extracted objects. In recent years, other methods have also been proposed for object extraction. The method in [9] tries to detect the shadow area in order to obtain a precisely extracted object. However, this work only handles a single object; when there are multiple objects, the method does not work well in local areas. In addition, the method in [10] integrates several saliency cues for automatic object extraction, but it does not consider shadow removal. Moreover, the authors in [5] proposed a method based on 3D model projections with likelihood that can remove the shadow areas. However, this method only works in a densely arranged multiple-camera system, and is suitable for neither a sparsely arranged multiple-camera system nor a single-camera system. The method in [11] is able to extract a precise object based on an optimization procedure. However, it mainly serves foreground extraction in a still image, and it requires manual operation to assign initial points indicating foreground and background before extraction. Thus, fully automatic object extraction cannot be realized.

Therefore, according to the related works mentioned above, the core issue for the accurate extraction of multiple silhouettes is how to simultaneously and automatically remove the shadow areas attached to the silhouettes and restore the missing areas inside each silhouette after extraction. In this paper, based on this core issue, we propose a new method to extract accurate silhouettes of multiple objects automatically by considering various features of shadows together with local statistical chrominance information. The remaining part of this paper is organized as follows. In Section III, the proposed method is described in detail. The experimental results are presented in Section IV to illustrate the effectiveness of the proposal. Finally, a brief conclusion is provided in the last section.
III. PROPOSED METHOD

[Fig. 1: flowchart of the proposed method. A video sequence is taken as input; temporal background subtraction is applied; a silhouette is detected for each extracted candidate; the silhouette candidate set is obtained by classification according to size, shape, and luminance-consistency information; each candidate then undergoes identification of its refinement area, followed by refinement; once the candidate set is exhausted, the silhouettes of the objects in the frame are output, and the process repeats until the end of the sequence.]

The flowchart of the proposal is illustrated in Figure 1. Roughly speaking, the procedure is composed of a global extraction part, a classification part, and a local refinement part. First of all, in order to reduce the missing areas inside the extracted objects, a low threshold value is set in the temporal background subtraction. After the background subtraction, unlike other methods for shadow detection and removal, we do not detect the shadow area directly; instead, we observe that there are two types of shadows: independent shadows, which are detached from the extracted objects, and dependent ones, which are attached to them. Based on this observation, a classification is proposed to remove the independent shadows by considering constraints on the shadows' size, shape, and luminance consistency. Finally, a refinement is conducted in the local area to remove the dependent shadows. We observe that the shadow of an object is cast by the floodlights in the stadium and shares similar chrominance with the background. Besides, it is assumed that the color difference between the objects (e.g. the players' uniforms) and the background is at least recognizable by human eyes. Thus, a threshold is calculated from the histogram of the chrominance information in each local bounding box of the object to separate the object from the background. Therefore, the shadows can be removed and the missing areas inside the objects can be restored. In the following subsections, after briefly describing the temporal background subtraction, we present the proposed method in detail.

A. Global background subtraction for rough extraction

In the global background subtraction, a background model is first learned at the pixel level from several consecutive frames of a video sequence. I(t) is defined as the frame image at moment t, and T is the total number of frames used in learning the background model. In addition, M and N are defined as the width and height of each frame I(t). A color image I(t) is further decomposed into three color components, such as RGB or YUV; without loss of generality, we use the YUV format in this paper, written as I(t) = {I^y(t), I^u(t), I^v(t)}. Next, each pixel I^y_{i,j} along the period T, {I^y_{i,j}(1), I^y_{i,j}(2), ..., I^y_{i,j}(T)}, is approximated by one Gaussian model, and the pixels in one frame are assumed to be independent of each other.¹ Therefore, there is one estimated mean µ^y_{i,j} and one estimated standard deviation σ^y_{i,j} over the learning period T for each pixel I^y_{i,j} in the background model, written as

\mu^{y}_{i,j} = \frac{1}{T} \sum_{t=1}^{T} I^{y}_{i,j}(t) \qquad (1)

and

\sigma^{y}_{i,j} = \Big\{ \frac{1}{T} \sum_{t=1}^{T} \big( I^{y}_{i,j}(t) - \mu^{y}_{i,j} \big)^{2} \Big\}^{1/2}, \qquad (2)

where 1 ≤ i ≤ M, 1 ≤ j ≤ N.

¹ Please note that the assumption and the procedure used to process the two remaining image components, U and V, are similar to those used to process the Y component.

Based on the learned background model, the residual of each frame after background subtraction is represented as R(t) = |I(t) − µ|. In addition, there is a threshold th for enhancing the robustness against changes of luminance. If the absolute residual value R_{i,j}(t) of the residual image is no larger than σ_{i,j} + th, written as |R_{i,j}(t)| ≤ σ_{i,j} + th, the pixel is regarded as background. (Please note that the method is applied in the YUV space in this paper, with µ = {µ^y, µ^u, µ^v}, σ = {σ^y, σ^u, σ^v}, and th = {th^y, th^u, th^v}; however, the method is also applicable to images in other color spaces or to gray-scale images.) After the background subtraction, a median filter is adopted for denoising.
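To make the above concrete, here is a minimal NumPy/OpenCV sketch of the per-pixel model of equations (1) and (2) and of the subtraction rule |R_{i,j}(t)| ≤ σ_{i,j} + th. It is an illustrative sketch rather than the authors' implementation: the array layout, the way the three YUV channel tests are fused, and the median-filter kernel size are assumptions on our part.

```python
import numpy as np
import cv2

def learn_background(frames):
    """Per-pixel Gaussian background model, eqs. (1)-(2).

    frames: (T, N, M, 3) uint8 array of YUV frames.
    Returns the per-pixel, per-channel mean and standard deviation.
    """
    stack = frames.astype(np.float32)
    mu = stack.mean(axis=0)        # eq. (1): temporal mean per pixel
    sigma = stack.std(axis=0)      # eq. (2): temporal standard deviation
    return mu, sigma

def subtract_background(frame, mu, sigma, th=(5, 5, 5)):
    """A pixel is background when |R| <= sigma + th in every YUV channel
    (this per-channel AND fusion is our assumption, not stated in the paper)."""
    residual = np.abs(frame.astype(np.float32) - mu)
    bg = residual <= sigma + np.asarray(th, dtype=np.float32)
    fg_mask = np.where(bg.all(axis=2), 0, 255).astype(np.uint8)
    return cv2.medianBlur(fg_mask, 5)   # denoising; kernel size assumed
```

With, for example, the first 600 frames of a sequence passed to `learn_background`, `subtract_background` yields the rough foreground mask that feeds the classification step below.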
B. Classification to exclude false silhouette candidates

[Fig. 2: grouping of the silhouette candidates after background subtraction.]

[Fig. 3: illustration of the classification steps that remove the independent shadows.]

In the classification part, one bounding box is first assigned to each extracted candidate o_k. In determining the bounding box of o_k, the top-most, bottom-most, left-most, and right-most pixel positions of o_k are found, from which the four corner vertices of the bounding box are calculated. (There will be further refinement inside each bounding box, which is discussed in the next subsection.) After the assignment of the bounding boxes, all the extracted candidates are grouped into one set Ω = {o_k}, k ∈ {1, 2, ..., K}, where K is the total number of candidates in the set Ω, as shown in Figure 2.

Next, the size, shape, and luminance-consistency information of the silhouette candidates is used to separate the set Ω into silhouette candidates Ω_obj and background candidates Ω_bg, where Ω = Ω_obj ∪ Ω_bg. The classification is composed of three main steps, as shown in Figure 3. In the first step, the size information is utilized. All the o_k in Ω are sorted by size, measured as the total number of pixels in each o_k, and the largest αK candidates, 0 < α < 1, are preserved in Ω_obj, because silhouette candidates are not supposed to be too small, such as "obj 1" and "obj 2" in Figure 2; the remaining candidates are classified into Ω_bg as background. (For example, "obj 5" in Figure 2 will be classified into Ω_bg.)² In the second step, the shape information is utilized. Because our target object is a player and the shadow is always projected onto the ground from one or two sides of each object, the aspect ratio β between the width w_k and the height h_k of o_k is expected to lie in a reasonable range. Thus, if the aspect ratio β = w_k/h_k is in the range [τ1, τ2], i.e. τ1 < β < τ2, then o_k is classified into Ω_obj; otherwise it is classified into Ω_bg. (For example, "obj 3" in Figure 2 will be classified into Ω_bg.) In the third step, the luminance-consistency information is considered, computed as the pixel variance within each o_k. If o_k contains only background area, the pixel variance in o_k should be small, because the background is assumed to be smooth and textureless. Thus, if Var_y(o_k) > γ, then o_k is classified into Ω_obj; otherwise it is classified into Ω_bg. (For example, "obj 4" in Figure 2 will be classified into Ω_bg.)

² The ball might also be discarded in this step. However, we focus only on the accurate extraction of the players' silhouettes in this paper; the ball can be extracted by other methods and synthesized into the final image.
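As an illustration of the three classification steps, the sketch below treats the connected components of the rough foreground mask as the candidates o_k. The parameter names mirror α, τ1, τ2, and γ above; computing the variance over the whole bounding-box patch (rather than over the silhouette pixels only) is our simplification, and the aspect-ratio orientation follows the β = w_k/h_k convention stated in the text.

```python
import numpy as np
import cv2

def classify_candidates(fg_mask, luma, alpha=0.8, tau1=1.0, tau2=2.1, gamma=8.0):
    """Return bounding boxes (x, y, w, h) of candidates kept in Omega_obj."""
    num, _, stats, _ = cv2.connectedComponentsWithStats(fg_mask, connectivity=8)
    cands = [tuple(stats[k][:5]) for k in range(1, num)]   # label 0 is background

    # Step 1 (size): keep only the largest alpha*K candidates by pixel count.
    cands.sort(key=lambda c: c[4], reverse=True)           # c[4] is CC_STAT_AREA
    cands = cands[:max(1, int(alpha * len(cands)))]

    kept = []
    for x, y, w, h, _area in cands:
        # Step 2 (shape): aspect ratio beta = w_k / h_k must lie in (tau1, tau2).
        beta = w / float(h)
        if not (tau1 < beta < tau2):
            continue
        # Step 3 (luminance consistency): a candidate containing only smooth,
        # textureless background has a low luminance variance Var_y(o_k).
        patch = luma[y:y + h, x:x + w].astype(np.float32)
        if patch.var() > gamma:
            kept.append((x, y, w, h))
    return kept
```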
C. Refinement to extract precise silhouettes

After the false silhouette candidates are rejected, a histogram-based thresholding method is adopted to refine each silhouette candidate in Ω_obj, because local statistical information provides more robust cues for segmentation and extraction. However, the histogram is calculated not from the luminance component but from the chrominance component, because the shadow and the background share similar chrominance information, as shown in Figure 4, and the histogram of chrominance can clearly distinguish the silhouette of the object from the background in the local region P_k, which is obtained from the original image using the position information of each bounding box.

[Fig. 4: one example of a refinement candidate. (This image is for illustration purposes only; please refer to the supplementary material for the actual image.)]

One example of such a histogram is illustrated in Figure 5. There is an obvious valley between the two peaks in Figure 5(b), where the larger peak indicates the background while the smaller one indicates the silhouette in P_k. In contrast, there is no clear valley in Figure 5(a), because the luminance variation within each P_k is not large.

[Fig. 5: the histogram of one refinement object in the (a) luminance and (b) chrominance domains.]

Next, based on the local chrominance histogram, a threshold is calculated by adopting the Otsu method [12], which maximizes the between-class variance between the silhouette pixels and the background pixels of P_k. Once the threshold ρ is found, a binary mask B is employed to identify whether each local pixel p_i belongs to the background or to the silhouette, written as

b_i = \begin{cases} 1, & \text{if } p_i \geq \rho \\ 0, & \text{if } p_i < \rho, \end{cases} \qquad (3)

where b_i ∈ B. Finally, once all the binary masks are collected, the multiple silhouettes of the moving objects have been extracted automatically and with high accuracy.
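The local refinement can be sketched with OpenCV's Otsu thresholding [12] applied to the chrominance patch of one bounding box. Deciding which side of ρ is the silhouette is scene-dependent; following the observation above that the background forms the larger histogram peak, the sketch keeps the minority class, which is our heuristic reading rather than the paper's exact rule.

```python
import numpy as np
import cv2

def refine_silhouette(chroma, box):
    """Refine one candidate inside its box; eq. (3) with rho from Otsu [12].

    chroma: one chrominance plane (U or V) of the original frame, uint8.
    box: (x, y, w, h) bounding box from the classification step.
    """
    x, y, w, h = box
    patch = chroma[y:y + h, x:x + w]           # local region P_k
    # Otsu picks rho by maximizing the between-class variance of the local
    # chrominance histogram; mask is 255 where p_i >= rho, else 0 (eq. (3)).
    rho, mask = cv2.threshold(patch, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # The larger histogram peak is assumed to be the background, so keep the
    # minority class as the silhouette (heuristic: invert if the mask covers
    # more than half of the patch).
    if np.count_nonzero(mask) > mask.size // 2:
        mask = cv2.bitwise_not(mask)
    return mask
```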
IV. EXPERIMENTAL RESULTS

[Fig. 6: illustration of the camera positions in the baseball stadium.]

In this part, the experimental results are presented to show the effectiveness of our proposed method. The configuration of the experiments is as follows. Our system is a static dual-camera system in which two cameras are sparsely fixed at two different positions in the baseball stadium, as shown in Figure 6. We adopt 4K cameras with a frame rate of 30 fps and a resolution of 3840 × 2160. For each camera, the first 600 frames are used to learn the background model, and a further 300 frames are employed to test the accuracy of silhouette extraction. The thresholds (th^y, th^u, th^v) in the temporal background subtraction are set to (5, 5, 5) and (20, 5, 5) for the two videos, respectively. In our proposal, the additional parameters of the classification part are set to α = 0.8, τ1 = 1, τ2 = 2.1, and γ = 8 for both sequences. For the GMM-based method, we directly use the OpenCV implementation with its default parameter values.
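For reference, the GMM baseline of [7], [8] is available in OpenCV; a minimal sketch of running it with default parameters follows (the input file name is illustrative):

```python
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2()   # Zivkovic's GMM [8], defaults

cap = cv2.VideoCapture("sequence.mp4")              # illustrative file name
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)               # per-pixel mixture update
cap.release()
```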
A. The quantitative evaluation of silhouette extraction

First of all, the quantitative evaluation of the accuracy of silhouette extraction is presented. Three metrics, recall, precision, and F-measure, are adopted for the evaluation, and they are defined [13] as

\text{Recall} = \frac{TP}{TP + FN}, \qquad (4)

\text{Precision} = \frac{TP}{TP + FP}, \qquad (5)

\text{F-Measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}, \qquad (6)

where TP is the total number of true positive pixels, FN is the total number of false negative pixels, and FP is the total number of false positive pixels. Without loss of generality, we regard shadow areas as false positives (FP) and missing areas in the extracted silhouettes as false negatives (FN). Three methods are compared: the temporal background subtraction method, the GMM-based object extraction, and the proposed method. In addition, we also extract the silhouettes of the objects by manual operation and take the result as the ground truth.
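The three metrics can be computed directly from binary masks, as in the small sketch below (the function and variable names are ours, not from the paper):

```python
import numpy as np

def mask_metrics(pred, truth):
    """Pixel-level Recall, Precision, and F-Measure, eqs. (4)-(6).

    pred, truth: boolean arrays where True marks silhouette pixels.
    """
    tp = np.logical_and(pred, truth).sum()    # true positive pixels
    fp = np.logical_and(pred, ~truth).sum()   # e.g. extracted shadow areas
    fn = np.logical_and(~pred, truth).sum()   # e.g. missing object areas
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure
```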
TABLE I. QUANTITATIVE EVALUATION OF THREE METHODS FOR SILHOUETTE EXTRACTION

Metric    | Temporal background subtraction | GMM [8] | Proposal
Recall    | 0.9750                          | 0.9385  | 0.9115
Precision | 0.6074                          | 0.7156  | 0.9416
F-Measure | 0.7484                          | 0.8113  | 0.9262

The quantitative comparison of the three methods is given in Table I; the results are calculated at the pixel level. Generally speaking, a higher Recall value corresponds to fewer missing regions, a higher Precision value reflects fewer undesired regions (shadows), and a higher F-Measure value indicates a more robust extraction result. The temporal background subtraction method achieves the highest Recall value because we set a low threshold in the background subtraction. However, its Precision is quite low because, due to the low threshold, large parts of the shadows are extracted together with the objects. As for the GMM method, the Precision value is still not high because the color of the court lines differs considerably from the background, so the court lines are also extracted as objects. Regarding our proposal, the Precision value is the highest, which proves that the shadow areas are largely removed. In addition, the proposal achieves the highest F-Measure, illustrating that the proposed method is the most robust one. Although the proposed method does not achieve the highest Recall value, the missing parts of the extracted objects lie almost entirely along the object boundaries, which has only a minor effect on the final free viewpoint video synthesis. Please also refer to the subjective evaluation part and the supplementary material.
B. The subjective evaluation of silhouette extraction

In the following part, the silhouettes extracted by the different methods are compared in Figures 7 and 8.³ (Please refer to the supplementary material for more experimental results showing the robustness of our proposal.) In the detailed comparison of shadow removal in (d)-(f) of Figures 7 and 8, the shadow is attached to the extracted silhouette in (d), obtained by the temporal background subtraction method, whereas the shadow is eliminated by the GMM-based method and by the proposed method. However, the GMM method considers global information when establishing its model, so the court lines, whose color differs from the background color, are also regarded as objects. In the proposed method, since the refinement is conducted in a local area, global information about such undesired parts (e.g. the court lines) is not considered, and high extraction accuracy is achieved.

Moreover, in the comparison of missing-area restoration in (g)-(i) of Figures 7 and 8, there are many missing areas in the silhouettes extracted by the temporal background subtraction method, because this method only considers temporal information and the threshold in the global extraction is very sensitive. When an object remains stationary (or moves only slightly) over several consecutive frames, parts of the object have a high probability of being regarded as background and not being extracted, resulting in undesired missing areas. Concerning the GMM-based method, the faces and arms of the players are quite easily regarded as background because their color is similar to the color of the background, as shown in Figure 4. In the proposed method, based on the more robust local statistical information, the extraction is conducted inside each bounding box of the silhouette in the original image during the refinement process. Therefore, most of the missing areas caused by temporal background subtraction can be restored.

³ Due to the image copyright of the baseball games, we have converted the color images to binary images. Please refer to the supplementary material for better visualization.

Finally, we conduct synthesis to generate free viewpoint sports video in order to check the improvement of the synthesized view when using the more accurate silhouettes generated by our proposal. Due to the copyright of the baseball game images, please refer to the supplementary material and the demo video for the visual comparison. In this comparison, it is confirmed that the quality of the synthesized view is enhanced by adopting the more accurate silhouettes.

V. CONCLUSIONS

In this paper, we proposed a new method for the automatic, highly accurate silhouette extraction of multiple moving objects for free viewpoint stadium sports video synthesis. After the background subtraction, the proposed method removed the independent shadow areas by classification and removed the dependent shadow areas by conducting a local refinement for each extracted silhouette. Experimental results showed that the proposed method is quite effective in shadow removal and missing-area restoration. More significantly, the visual quality of the image synthesized from a virtual viewpoint was also enhanced by the proposed method. In addition, the proposed method requires no manual operation, so fully automatic extraction of multiple silhouettes is realized.

REFERENCES

[1] Kanade T., Rander P., and Narayanan P. J., "Virtualized reality: Constructing virtual worlds from real scenes," IEEE Multimedia, vol. 4, no. 1, pp. 34-47, 1997.
[2] Ishikawa A., Panahpour Tehrani M., Naito S., Sakazawa S., and Koike A., "Free viewpoint video generation for walk-through experience using image-based rendering," in Proceedings of the 16th ACM International Conference on Multimedia, pp. 1007-1008, 2008.
[3] Ohta Y., Kitahara I., Kameda Y., Ishikawa H., and Koyama T., "Live 3D video in soccer stadium," International Journal of Computer Vision, vol. 75, no. 1, pp. 173-187, 2007.
[4] Decoret X., Durand F., Sillion F. X., and Dorsey J., "Billboard clouds for extreme model simplification," ACM Transactions on Graphics (TOG), vol. 22, no. 3, pp. 689-696, 2003.
[5] Sankoh H., Ishikawa A., Naito S., and Sakazawa S., "Robust background subtraction method based on 3D model projections with likelihood," in IEEE International Workshop on Multimedia Signal Processing (MMSP), pp. 171-176, 2010.
[6] Elgammal A., Harwood D., and Davis L., "Non-parametric model for background subtraction," in Computer Vision - ECCV 2000, pp. 751-767, 2000.
[7] Stauffer C. and Grimson W. E. L., "Adaptive background mixture models for real-time tracking," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 1999.
[8] Zivkovic Z., "Improved adaptive Gaussian mixture model for background subtraction," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 2, pp. 28-31, 2004.
[9] Horprasert T., Harwood D., and Davis L. S., "A robust background subtraction and shadow detection," in Proc. ACCV, pp. 983-988, 2000.
[10] Li W. T., Chang H. S., Lien K. C., Chang H. T., and Wang Y. F., "Exploring visual and motion saliency for automatic video object extraction," IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2600-2610, 2013.
[11] Rother C., Kolmogorov V., and Blake A., "GrabCut: Interactive foreground extraction using iterated graph cuts," ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 309-314, 2004.
[12] Otsu N., "A threshold selection method from gray-level histograms," Automatica, vol. 11, pp. 285-296, 1975.
[13] Landabaso J. L., Pardas M., and Casas J. R., "Shape from inconsistent silhouette," Computer Vision and Image Understanding, vol. 111, no. 2, pp. 210-224, 2008.
[Fig. 7: comparison among the three methods for silhouette extraction (central camera), where (a)-(c) are the masks extracted by (a) temporal background subtraction, (b) the GMM-based method [8], and (c) the proposed method, and (d)-(i) are the corresponding close-up results for detailed comparison.]

[Fig. 8: comparison among the three methods for silhouette extraction (side camera), with panels (a)-(i) arranged as in Fig. 7.]
