2012 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2012)
Video Based Head Detection and Tracking
Surveillance System
Di Xie, Lu Dang, Ruofeng Tong
Department of Computer Science and Technology
Zhejiang University
Hangzhou, China
Abstract—In this paper, we present a novel system that acquires clear human face or head images in the video stream from a single surveillance camera and applies several state-of-the-art computer vision algorithms to generate real-time human head detection and tracking results. Detections are obtained using the histogram of gradients (HoG) feature accelerated on the GPU, and false positives are filtered out by motion and appearance likelihoods. All true positives are then tracked by a tracking algorithm based on the earth mover's distance (EMD) and SURF points. Our experimental results show that the proposed system achieves robust detection and tracking performance in most surveillance scenes.
Index Terms—Histogram of gradients, head detection,
Bayesian theory, multi-object tracking, EMD, SURF.
I. INTRODUCTION
Nowadays, visual surveillance systems are becoming an essential part of our daily life. Since such systems are necessary to ensure public security, almost every public facility has its own surveillance system. With the development of low-cost surveillance video cameras and high-speed computer networks, it is technologically feasible and economically affordable to provide such systems for crime prevention and crime scene investigation. Visual surveillance is a long-standing field of computer vision in which many efforts and tremendous progress have been made. The CMU VSAM[1] system receives video inputs and collects and disseminates real-time information about the scene; the W4[2] real-time visual surveillance system and the ADVISOR[3] system detect and track multiple people and monitor their activities in outdoor scenes such as metro stations; Pfinder[4] also provides real-time performance in people tracking and behavior interpretation.
Although several video-based surveillance systems exist, none of them is capable of acquiring clear human face images or sequences. Clear human faces give the police more information about criminals and help to track them down as soon as possible. Many face- or body-detection-based methods cannot achieve this because human faces appear in multiple views and postures in different surveillance scenes. Moreover, environment illumination variation is another main factor that causes these methods to fail.
978-1-4673-0024-7/10/$26.00 ©2012 IEEE
To solve this problem we propose a novel and robust visual surveillance system framework built on two main components: human head detection and object tracking. Human heads are first detected using an adaptive head detector based on the histogram of gradients (HoG) feature[5]. Simultaneously, motion and appearance information is extracted from the video sequence. Based on Bayesian theory, we use two likelihoods to evaluate the probability that a detected region represents an actual human head. False positives are eliminated, and true positives are tracked by the EMD tracking algorithm with SURF points. The head detection and tracking results can further be used to control the camera's PTZ to capture proper, high-resolution snapshots. The rest of the paper is structured as follows. After summarizing related work in Section 2, we describe our system architecture in Section 3. System components and implementation details are presented in Section 4. Experimental results and discussions are given in Section 5. Finally, we conclude the paper.
II. RELATED WORK
A. Human Face and Body Detection
Pedestrian detection has long been a hot research area in computer vision and has reached an impressive level[5], [6]. Many available approaches already perform well on individual images. These existing methods generally consist of two steps: feature extraction and training. Viola et al.[6] proposed a pedestrian detection framework that uses Haar-like features inherited from [7] and then applies the AdaBoost learning algorithm to obtain satisfactory classification results. However, such methods have mainly been applied to detecting human faces, which limits their popularity in actual surveillance systems. Another well-known pedestrian detector, developed by Dalal et al.[5], feeds a set of overlapping histogram of gradients (HoG) descriptors into a support vector machine to achieve good detection results. In this paper, we adopt the latter approach because it is easy to extend to head detection and is robust to the various poses of human heads. However, computing the HoG feature descriptor is time-consuming, which limits its application in real-time systems.
B. Object Tracking
In video tracking, mean shift[8], [9] and particle filters[10], [11], [12] are the two main categories of tracking methods. These methods utilize objects' appearance features such as color, gradient, optical flow and texture, but they are not robust because they only consider bin-to-bin comparisons. Instead of using illumination-variant features, or applying transforms to make the features illumination-invariant, Zhao et al.[13] approach illumination changes by using the EMD[14] as a similarity measure to match color distributions. Their method uses a color signature as the matching feature, but since a clustering algorithm must regenerate the feature repeatedly at each iteration step, it wastes computational resources. Moreover, the color signature feature itself is not robust in real surveillance scenes because it lacks position and orientation information.
Extracting the HoG descriptor takes considerable time and computational resources, so we propose a GPU-based algorithm for real-time processing. Most of the time is consumed in gradient computation. When extracting HoG features, a sliding window traverses the frames; for each position and scale, the extraction operation is independent, which gives the problem excellent parallelism. We use CUDA for the GPU-based algorithm. Since one block contains 4 cells, and the size of one cell is 8 × 8, one histogram corresponds to the gradients of 256 pixels. 4 blocks are processed in a sliding window, but the 4 blocks overlap with each other, so in total 9 cells, i.e. 576 pixels, need gradient computation. A CUDA block has at most 512 threads, so we let each CUDA block process one detection window; within each block, one thread in every eight processes one extra pixel, which solves the fewer-threads-than-pixels problem. The acceleration strategy is shown in Fig. 2.
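The window, cell and thread counts above can be checked with a short calculation. This is a pure-Python sketch of the arithmetic only (the constant names are ours), not the CUDA kernel itself:

```python
# Arithmetic behind the CUDA thread mapping described above.
CELL = 8            # cell side in pixels
WINDOW = 24         # detection-window side in pixels
MAX_THREADS = 512   # maximum threads per CUDA block assumed above

cells_per_side = WINDOW // CELL              # 3 cells along each axis
num_cells = cells_per_side ** 2              # 9 cells per detection window
total_pixels = num_cells * CELL * CELL       # 576 pixels needing a gradient

# 512 threads must cover 576 pixels, leaving 64 "extra" pixels,
# so one thread out of every eight handles a second pixel.
extra_pixels = total_pixels - MAX_THREADS
threads_per_extra_pixel = MAX_THREADS // extra_pixels

print(num_cells, total_pixels, extra_pixels, threads_per_extra_pixel)
```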
III. SYSTEM ARCHITECTURE
Fig. 1 depicts the system architecture. First, video signals are captured by the camera and passed to the preprocessing module (e.g. gamma correction, color balancing). Then the HoG algorithm is applied in the HoG Head Detection module to obtain initial head detections. To lower the false detection rate, we adopt motion and appearance filtering to eliminate static false positives. The improved head detection results are then fed into the EMD Tracking module to obtain the actual physical positions of the detected human heads. Simultaneously, based on the detection results, the system proactively controls the PTZ camera to capture high-resolution human head snapshots.
Fig. 1. System architecture and process diagram.
Fig. 2. The diagram of the GPU-based HoG feature extraction algorithm. Different colors represent different cells in the sliding window. In the CUDA blocks, different colors indicate the threads used for processing the corresponding pixels in the cells, and the white boxes represent the extra pixels.
IV. ALGORITHM SPECIFICATION
A. Human Head Detection
Since human face detection methods cannot adapt to multiple views and postures, or to illumination variation, in this paper we apply the HoG feature descriptor proposed by Dalal et al.[5]. This feature describes a local region's edge or gradient information, so it is insensitive to local geometry and illumination variation.
In our implementation of HoG, we use the following configuration: input image in RGB color space (without any gamma correction); image gradients computed by applying a [-1,0,1] filter along the x- and y-axes without smoothing; linear gradient voting into 9 orientation bins over 0°-180°; 16 × 16 pixel blocks containing 2 × 2 cells of 8 × 8 pixels; blocks spaced with a stride of 8 pixels; a 24 × 24 detection window; and a linear SVM classifier. After voting, the histograms are normalized and finally form a high-dimensional vector in feature space.
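The gradient-and-voting step of this configuration can be sketched for a single cell in pure Python. The function below is our illustrative simplification (not the paper's GPU implementation), and border pixels are skipped for brevity:

```python
import math

def hog_cell_histogram(patch, n_bins=9):
    """Histogram of gradients for one cell, following the configuration
    above: [-1,0,1] derivative filter, 9 bins over 0-180 degrees,
    magnitude-weighted voting, then L2 normalization.
    `patch` is a 2-D list of grayscale values (e.g. one 8x8 cell)."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]   # [-1,0,1] along x
            gy = patch[y + 1][x] - patch[y - 1][x]   # [-1,0,1] along y
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned gradient
            hist[int(ang / (180.0 / n_bins)) % n_bins] += mag
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0       # L2 normalization
    return [v / norm for v in hist]

# A vertical step edge: left half dark, right half bright. The gradient
# points along x (orientation 0 degrees), so all votes land in bin 0.
cell = [[0] * 4 + [10] * 4 for _ in range(8)]
hist = hog_cell_histogram(cell)
```

For such an edge, all gradient energy falls into the 0° bin, as expected for an unsigned-gradient histogram.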
When all threads finish computing gradients, we save the results in the device's shared memory, vote to generate the histogram with a parallel reduction technique, and normalize it, as shown in Fig. 3.
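The parallel reduction used for the histogram voting (sketched in Fig. 3) can be simulated on the CPU as follows. This is a hedged Python sketch assuming a power-of-two element count; the inner loop stands in for the additions that run concurrently on the GPU:

```python
def parallel_reduce(values):
    """CPU simulation of the tree reduction in Fig. 3: at each level,
    element i accumulates element i+offset, the offset halves, and on a
    GPU all threads would synchronize before the next level begins.
    Assumes len(values) is a power of two."""
    vals = list(values)
    offset = len(vals) // 2
    while offset > 0:
        for i in range(offset):      # these additions run in parallel on a GPU
            vals[i] += vals[i + offset]
        offset //= 2                 # a __syncthreads() barrier would go here
    return vals[0]

# Reducing eight elements takes three levels: offsets 4, 2, 1.
total = parallel_reduce([1, 2, 3, 4, 5, 6, 7, 8])
print(total)  # 36, the sum of 1..8
```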
In real applications, using the HoG feature alone results in a high false-positive rate, because the shapes of certain categories of objects are similar to those of human heads. To improve the detection results, we propose to include two extra sources of information to filter the false positives: motion and appearance. Most of a human head is hair or skin, so human heads have a specific appearance. Furthermore, even when a human body is static, the head may still move, while most head-like objects remain static, so motion information is another good filter. We use the Bayesian posterior to represent the probability that a detected region belongs to the actual human head regions:

p(ωj|x) ∝ p(x|ωj)p(ωj),  j = 1, 2    (1)

in which x denotes the detected region, and ω1 and ω2 denote the human head class and the non-human head class respectively.
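As a concrete illustration of the motion likelihood defined below in Eq. (3), i.e. the overlap ratio between the motion mask M and the candidate region Ω, here is a minimal sketch; the function name and the toy mask are our own, and the mask generation of [15] is not reproduced:

```python
def motion_likelihood(mask, region):
    """p_m(x|w1) as an overlap ratio: the fraction of the candidate head
    region covered by the motion mask M. `mask` is a 2-D list of 0/1
    motion flags from frame differencing; `region` is (x, y, w, h)."""
    x, y, w, h = region
    moving = sum(mask[r][c] for r in range(y, y + h) for c in range(x, x + w))
    return moving / float(w * h)

# A 4x4 candidate region of which 8 of 16 pixels are flagged as moving.
mask = [[0] * 8 for _ in range(8)]
for r in range(2, 4):
    for c in range(2, 6):
        mask[r][c] = 1
p_m = motion_likelihood(mask, (2, 2, 4, 4))
print(p_m)  # 0.5
```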
p(ωj) is the prior, and we set p(ω1) = p(ω2) for simplicity. p(x|ωj) is the likelihood, formulated as follows:

p(x|ω1) = pa(x|ω1) pm(x|ω1)    (2)

in which pm(x|ω1) is defined as:

pm(x|ω1) = ( Σ_{xi ∈ M ∩ Ω} δ(xi) ) / |Ω|    (3)

in which M is the mask generated from time-differential motion detection[15], Ω is the current head candidate region, so |Ω| is the area of the current head candidate region, and δ(xi) is the Kronecker delta function. The whole likelihood is the ratio between the areas of the motion region and the head candidate region.

Meanwhile, pa(x|ω1) is computed as:

pa(x|ω1) = π1 N(x|µ1, Σ1) + π2 N(x|µ2, Σ2)    (4)

in which N(•) represents a Gaussian distribution with mean µ and covariance matrix Σ. π1 and π2 are weight coefficients with π1 + π2 = 1. All the parameters can be estimated by the expectation-maximization algorithm[16].

Finally, we have the probability to evaluate a candidate region:

p*(x|ω1) = min{ (1/λ) p(x|ω1), 1 }    (5)

The factor λ is designed to prevent the product of two high confidences from being undervalued. If the final confidence is greater than 0.5, the candidate region is considered a real human head region.

Fig. 3. The diagram of the parallel reduction algorithm. At the first level, the ith element accumulates the (i + offset)th element; at each subsequent level the offset is halved, so at level l the ith element accumulates the (i + offset/2^l)th element. After finishing each level's computation, all the related threads are synchronized. The process is repeated until only one element remains.

B. Object Tracking
After detecting a human head and controlling the camera's PTZ to capture the object, we track it continuously until it leaves the screen. We propose a robust tracking method based on the EMD and the SURF feature.
Compared with SIFT[17], SURF has almost the same matching accuracy and rotation-scale invariance. Moreover, SURF has additional advantages in real applications: 1) computing the SURF feature is much faster than computing SIFT because of the integral image method; 2) it can catch points that have high contrast at different scales. A typical SURF point detection result is shown in Fig. 4. A more detailed description of SURF can be found in [18].

Fig. 4. SURF feature. a) Original image. b) The extracted SURF feature points at different locations and scales. One can see that all the prominent regions (stamens or pistils) of the sunflowers have been detected.

We formulate the tracking problem as finding the minimum EMD given a set of costs and flows, which is equivalent to solving a linear programming problem. The goal is to find the location that corresponds to the smallest EMD:

C(fij) = Σ_{i=1}^{m} Σ_{j=1}^{n} d(pi, qj) fij    (6)

in which d(pi, qj) is the Euclidean distance between the ith SURF point in the object's template and the jth SURF feature point in the candidate region. fij is the flow between the two classes of feature points, subject to:

fij ≥ 0,  1 ≤ i ≤ m, 1 ≤ j ≤ n    (7)

Σ_{j=1}^{n} fij = wpi,  1 ≤ i ≤ m    (8)

Σ_{i=1}^{m} fij = wqj,  1 ≤ j ≤ n    (9)

Σ_{i=1}^{m} Σ_{j=1}^{n} fij = 1    (10)

in which wpi is the response weight of the ith SURF feature point in the object's template and wqj is the response weight of the jth SURF point in the candidate region. In our implementation, these weights are generated by choosing the k feature points with the highest response values, normalizing these responses and sorting them in decreasing order. The simplex method can then be applied to solve this problem.
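The cost/flow structure of Eqs. (6)-(10) can be illustrated with a small sketch. Note that this uses a greedy transportation heuristic rather than the exact simplex solution applied above, so in general it only approximates the minimum EMD; all names are ours:

```python
import math

def greedy_emd(p_pts, p_wts, q_pts, q_wts):
    """Greedy transportation approximation to Eqs. (6)-(10): repeatedly
    ship as much flow as possible along the cheapest remaining
    (template point, candidate point) pair, respecting the supply
    (template weights) and demand (candidate weights) constraints."""
    supply, demand = list(p_wts), list(q_wts)
    pairs = sorted((math.dist(p, q), i, j)
                   for i, p in enumerate(p_pts)
                   for j, q in enumerate(q_pts))
    cost = 0.0
    for d, i, j in pairs:
        f = min(supply[i], demand[j])    # ship the cheapest flow first
        supply[i] -= f
        demand[j] -= f
        cost += d * f                    # accumulate C(f_ij), Eq. (6)
    return cost

# Identical SURF-point layouts: every unit of flow travels distance 0.
pts = [(0.0, 0.0), (1.0, 0.0)]
emd = greedy_emd(pts, [0.5, 0.5], pts, [0.5, 0.5])

# Candidate points shifted by one unit along x: the mass must move.
shifted = [(1.0, 0.0), (2.0, 0.0)]
emd_shift = greedy_emd(pts, [0.5, 0.5], shifted, [0.5, 0.5])
```

For identical point sets the cost is zero; shifting the candidate points forces probability mass to move over nonzero distances, and the tracker looks for the window position minimizing this cost.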
Before using the EMD to match patches, we first use KLT optical flow[19] to obtain a coarse location of the object. Instead of tracking each point according to the gradients in the x- and y-directions, we track each point by its SURF responses. Based on the optical flow result, we move the track window to that location and compute the EMD measurement. If the computed EMD is less than the current EMD, we take this location as the current tracked location, and we iterate this process until the computed EMD becomes larger than the current EMD. When tracking multiple objects, once occlusion is detected we apply a Bayesian framework to identify which object each extracted feature point belongs to.
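The coarse-to-fine step above, starting from the KLT estimate, moving the window while the EMD keeps decreasing and stopping when it rises, can be sketched as a 1-D local descent. This is our simplification of the 2-D window search, and `emd_at` is a stand-in for the real EMD computation:

```python
def refine_location(start, emd_at, step=1, max_iter=50):
    """Local descent after the KLT coarse estimate: move the track
    window to the neighboring position with the smallest EMD, and stop
    as soon as no neighbor improves on the current EMD.
    `emd_at` maps a (1-D, for illustration) window position to its EMD."""
    pos = start
    best = emd_at(pos)
    for _ in range(max_iter):
        candidates = [pos - step, pos + step]
        nxt = min(candidates, key=emd_at)
        if emd_at(nxt) >= best:          # EMD increased: keep current location
            return pos
        pos, best = nxt, emd_at(nxt)
    return pos

# Toy EMD surface with its minimum at position 7.
tracked = refine_location(3, lambda p: (p - 7) ** 2)
print(tracked)  # 7
```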
Our proposed method is fast and efficient because it extracts SURF points only once per frame, while the color signature feature must be recomputed at every iteration step within a frame. Moreover, the SURF feature is more robust than color, so our method performs better than the one based on color signatures.
V. EXPERIMENT RESULTS
Extensive comparative experiments are reported in this section. We first show the human head detection results, comparing detection performance with and without the motion and appearance filters as well as CPU versus GPU runtime. We then show the EMD-and-SURF tracker's performance and compare it with the method proposed by Zhao et al.[13]. All experiments were performed on the following hardware: a Core2 Duo 2.83 GHz CPU, 4 GB RAM and a GeForce GTX 460.

A. Human Head Detection
Firstly, we compare the speedup of our GPU algorithm over the original CPU HoG implementation. We chose 4 video sequences of different resolutions, measured the average runtime, and list the results in Table I.

TABLE I
HoG-based head detection performance on GPU and CPU. GPU time includes the data-transfer latency between host and device. Our GPU implementation performs well and can nearly meet real-time requirements when the video resolution is CIF (352 × 288).

         Resolution   CPU Time(ms)  GPU Time(ms)  Speedup
Video1   352 × 288    3600          77            46.8
Video2   528 × 384    7800          162           48.1
Video3   704 × 576    20000         296           67.6
Video4   1280 × 720   45000         664           67.8

Secondly, we test our algorithm on different surveillance videos and calculate the precision and recall, shown in Table II.

TABLE II
The performance of the proposed algorithm. TF is the number of total frames; TP, FP and FN represent true positives, false positives and false negatives, respectively. Precision is computed as TP/(TP + FP) and recall as TP/(TP + FN).

         TF     TP     FP    FN    Precision(%)  Recall(%)
Video1   1520   291    60    29    82.9          90.9
Video2   3936   2912   69    324   97.7          90.0
Video3   480    214    19    22    91.8          90.7

From Table II we can see that in most situations the precision and recall are both above 90%, except for video sequence 1. Video 1 contains some head-shaped objects that are not totally static, so its false-positive rate is relatively high. The false negatives arise because the boundary pixels of foreground objects are sometimes similar to the background, so the computed gradients are nearly zero and the edge information is missing.

Thirdly, we compare the HoG head detection algorithm with and without the motion and appearance filters, which shows the importance of the likelihood filtering step; the experimental results confirm that this step is necessary in real surveillance applications. Comparison results are shown in Fig. 5.

Fig. 5. The comparison of the HoG algorithm with and without motion and appearance filtering. Some head-like objects detected by the HoG algorithm alone can be eliminated.

Finally, we show some correctly detected results and their corresponding edge-gradient information in Fig. 6.
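The precision and recall columns of Table II follow directly from the stated formulas; for example, reproducing the Video1 row (the helper name is ours):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP) and recall = TP/(TP+FN), in percent."""
    return 100.0 * tp / (tp + fp), 100.0 * tp / (tp + fn)

# Video1 row of Table II: TP = 291, FP = 60, FN = 29.
prec, rec = precision_recall(291, 60, 29)
print(round(prec, 1), round(rec, 1))  # 82.9 90.9, matching the table
```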
Fig. 6. HoG detection results. The first row shows original image patches and the second row shows the corresponding gradient features. One can see that the more prominent the difference between foreground and background, the richer the gradient information.
B. Object Tracking
In this part we show the robust performance of our proposed
tracking algorithm. Meanwhile, we compare it with the color
signature feature proposed by Zhao et al[13].
In Fig. 7, we show the tracking result on one sequence
comes from PETS video data. Since the color of the tracked
human’s head, body and legs is similar with the background,
color signature based method cannot catch the object’s motion
and finally mistakes the background subregion as the object.
Other experiments and tests show that the proposed algorithm
is available in all kinds of scenes from real surveillance
2835
our method over other methods and the availability in real
surveillance system.
ACKNOWLEDGMENT
Our work is supported by the Science and Technology Bureau of Zhejiang Province, China (2010C13023), the National High-Tech Research and Development Program of China (No. 2009AA01Z330) and the Zhejiang Provincial Natural Science Foundation of China (No. Y1100018).
Fig. 7. Frames 1, 25, 50 and 75 from one of the PETS sequences. a) The color-signature-based method loses track of the man because of the low resolution and because the clustering algorithm discards information useful for tracking. b) Our tracking algorithm maintains a robust focus on the object throughout the sequence with the support of SURF points.
REFERENCES
[1] R. T. Collins, A. J. Lipton, and T. Kanade, "A system for video surveillance and monitoring," Tech. Rep., 2000.
[2] I. Haritaoglu, D. Harwood, and L. Davis, "W4: Real-time surveillance of people and their activities," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 809–830, Aug. 2000.
[3] N. T. Siebel and S. J. Maybank, "The ADVISOR visual surveillance system," in ECCV 2004 Workshop on Applications of Computer Vision (ACV), 2004.
[4] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780–785, Jul. 1997.
[5] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, Jun. 2005, pp. 886–893.
[6] P. Viola, M. Jones, and D. Snow, "Detecting pedestrians using patterns of motion and appearance," in Ninth IEEE International Conference on Computer Vision, Oct. 2003, pp. 734–741.
[7] P. A. Viola and M. J. Jones, "Robust real-time face detection," in International Conference on Computer Vision, 2001.
[8] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," in IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2000, pp. 142–149.
[9] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, pp. 564–577, 2003.
[10] Z. Khan, T. Balch, and F. Dellaert, "An MCMC-based particle filter for tracking multiple interacting targets," in Proc. ECCV, 2003, pp. 279–290.
[11] Y. Li, H. Ai, T. Yamashita, S. Lao, and M. Kawade, "Tracking in low frame rate video: A cascade particle filter with discriminative observers of different lifespans," in IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[12] E. Maggio, E. Piccardo, C. Regazzoni, and A. Cavallaro, "Particle PHD filtering for multi-target visual tracking," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2007, pp. 1101–1104.
[13] Q. Zhao, S. Brennan, and H. Tao, "Differential EMD tracking," in IEEE 11th International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
[14] Y. Rubner, "Perceptual metrics for image database navigation," Ph.D. dissertation, Stanford University, 1999.
[15] D. Xie, R. Tong, and H. Wu, "A method to distinguish the fire and flickering vehicle light," in CSIE (6), 2009, pp. 355–359.
[16] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[17] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.
[18] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," Computer Vision and Image Understanding, vol. 110, pp. 346–359, 2008.
[19] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in International Joint Conference on Artificial Intelligence, 1981.
Tracking results on several further surveillance sequences, shown in Fig. 8, demonstrate that the tracker is robust and general.
Fig. 8. Tracking results on several surveillance sequences.
VI. CONCLUSION
To retrieve clear human head or face images from real surveillance systems, we proposed a novel approach to this problem. To the best of our knowledge, this is the first work to use HoG, the SURF feature and the EMD together for visual tracking in a surveillance system. The HoG feature combined with motion and appearance information reduces false positives significantly. The SURF- and EMD-based tracker delivers robust tracking performance. The GPU acceleration design and the coarse-to-fine tracking strategy make real-time processing of video streams possible. Experiments demonstrate the advantages of our method over other methods and its applicability in real surveillance systems.