Representing Videos using Mid-level
Discriminative Patches
CVPR 2013 Poster
Outline
 Introduction
 Mining Discriminative Patches
 Analyzing Videos
 Experimental Evaluation & Conclusion
1. Introduction
 Q.1: What does it mean to understand this video?
 Q.2: How might we achieve such an understanding?
1. Introduction
 Video: prior work represents a video either as a single feature vector, or in terms of semantic entities (actions, objects) and their bits and pieces.
 General framework: detect objects and primitive actions, then compose them (e.g., with Bayesian networks) into a storyline.
1. Introduction
Drawback:
computational models for identifying semantic entities are not
robust enough to serve as a basis for video analysis
1. Introduction
 Represent a video with discriminative spatio-temporal patches, not with a global feature vector or a set of semantic entities.
 The discriminative spatio-temporal patches are mined automatically from training data consisting of hundreds of videos; each may correspond to a semantic object, a primitive human action, a human-object pair, or simply a random but informative patch.
1. Introduction
 The spatio-temporal patches act as a discriminative vocabulary for action classification.
 They establish strong correspondences between patches in training and test videos; using label-transfer techniques, we can align the videos and perform tasks such as object localization and finer-level action detection.
2. Mining Discriminative Patches
 Two conditions for a discriminative patch:
(1) It occurs frequently within a class.
(2) It is distinct from patches in other classes.
 Challenges:
(1) The space of potential spatio-temporal patches is extremely large, given that patches can occur over a range of scales.
(2) The overwhelming majority of video patches are uninteresting.
2. Mining Discriminative Patches
 Paradigm: bag-of-words (see the sketch after this slide)
Step 1: Sample a few thousand patches and perform k-means clustering to find representative clusters.
Step 2: Rank these clusters based on their membership in different action classes.
 Major drawbacks:
(1) High-dimensional distance metric
(2) Partitioning
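A minimal sketch of this baseline, assuming patch descriptors have already been extracted into a NumPy array together with the action label of the video each patch came from; the purity-based ranking and all function and parameter names here are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def bag_of_words_baseline(descriptors, labels, n_clusters=200):
    """Baseline: cluster sampled patch descriptors with k-means, then rank
    clusters by how strongly their membership concentrates on one action class.

    descriptors : (n_patches, d) array of patch features (e.g., HOG3D)
    labels      : (n_patches,) action-class label of the source video of each patch
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(descriptors)
    assignments = km.labels_

    scores = []
    for c in range(n_clusters):
        member_labels = labels[assignments == c]
        if len(member_labels) == 0:
            scores.append(0.0)
            continue
        # Purity of the cluster: fraction of members from its dominant class.
        _, counts = np.unique(member_labels, return_counts=True)
        scores.append(counts.max() / counts.sum())

    # Clusters sorted from most to least class-specific.
    ranking = np.argsort(scores)[::-1]
    return km, ranking, np.array(scores)
```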
2. Mining Discriminative Patches
(1) High-dimensional distance metric
K-means relies on a standard distance metric (e.g., Euclidean distance or normalized cross-correlation), which does not work well in high-dimensional spaces.
※ We use HOG3D descriptors, which are high-dimensional.
2. Mining Discriminative Patches
(2) Partitioning
Standard clustering algorithms partition the entire feature space: every data point is assigned to one of the clusters during the clustering procedure. However, assigning cluster memberships to rare background patches is hard, and these forced assignments significantly diminish the purity of the good clusters they end up in.
2. Mining Discriminative Patches
 To resolve these issues:
1. Use an exemplar-based clustering approach.
2. Consider every patch as a possible cluster center.
 An Exemplar-SVM (e-SVM) is learned for each cluster center.
Drawback: training an e-SVM for every patch is computationally infeasible.
Resolution: prune candidate patches using motion, then select exemplars with a nearest-neighbor step.
2. Mining Discriminative Patches
Training videos are split into two partitions:
Training partition: used to form clusters.
Validation partition: used to rank the clusters based on representativeness.
Procedure (see the sketch after this list):
(ⅰ) Use a simple nearest-neighbor approach on each candidate patch (typically k = 20).
(ⅱ) Score each patch and rank them.
(ⅲ) Select a few patches per action class and train an e-SVM for each.
(ⅳ) The e-SVMs are used to form clusters.
(ⅴ) Re-rank the clusters.
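A minimal sketch of steps (i)-(iii), assuming descriptors and labels from the training partition are available as NumPy arrays. The k = 20 neighbors comes from the slide; the per-class budget, the neighbor-agreement score, and the e-SVM hyperparameters (single positive vs. all other-class patches, heavy positive class weight) are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC

def mine_exemplars(train_desc, train_labels, k=20, per_class=50):
    """(i)-(iii): score every candidate patch via nearest neighbours, keep the
    top-ranked patches per class, and train one e-SVM per selected exemplar."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(train_desc)
    _, idx = nn.kneighbors(train_desc)          # idx[:, 0] is the patch itself
    neighbor_labels = train_labels[idx[:, 1:]]  # labels of the k neighbours

    # (i)-(ii) score = fraction of the k neighbours from the same action class
    scores = (neighbor_labels == train_labels[:, None]).mean(axis=1)

    exemplar_svms = []
    for cls in np.unique(train_labels):
        cls_idx = np.where(train_labels == cls)[0]
        top = cls_idx[np.argsort(scores[cls_idx])[::-1][:per_class]]
        for i in top:
            # (iii) e-SVM: single positive (the exemplar) vs. patches of other classes
            neg = train_desc[train_labels != cls]
            X = np.vstack([train_desc[i:i + 1], neg])
            y = np.concatenate([[1], np.zeros(len(neg), dtype=int)])
            clf = LinearSVC(C=0.1, class_weight={1: 100.0, 0: 0.01}).fit(X, y)
            exemplar_svms.append((cls, i, clf))
    return exemplar_svms, scores
```

Steps (iv)-(v), forming clusters from e-SVM detections and re-ranking, would run the trained detectors back over the validation partition.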
2. Mining Discriminative Patches
 Goal: a smaller dictionary (set of representative patches)
 Criteria:
(a) Appearance consistency: a consistency score over the members of each cluster.
(b) Purity: a tf-idf style score comparing firings on the same class vs. different classes.
※ All patches are ranked using a linear combination of the two scores (see the sketch below).
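A minimal sketch of such a ranking, assuming the per-cluster appearance-consistency score and the same-class/other-class firing counts have already been computed; the exact form of the tf-idf style purity and the mixing weight alpha are assumptions for illustration.

```python
import numpy as np

def rank_dictionary(consistency, same_class_fires, other_class_fires, alpha=0.5):
    """Rank clusters by a linear combination of (a) appearance consistency and
    (b) a tf-idf style purity: same-class firings discounted by firings on
    other classes. All inputs are 1-D arrays with one entry per cluster."""
    purity = same_class_fires / (1.0 + same_class_fires + other_class_fires)
    # Normalise both scores to [0, 1] before mixing so alpha is meaningful.
    c = (consistency - consistency.min()) / (np.ptp(consistency) + 1e-8)
    p = (purity - purity.min()) / (np.ptp(purity) + 1e-8)
    combined = alpha * c + (1 - alpha) * p
    return np.argsort(combined)[::-1], combined
```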
3. Analyzing Videos
 Action Classification (see the sketch after this slide)
Input: a test video.
Run the top-n e-SVM detectors and collect their responses into a feature vector.
An SVM classifier on this feature vector outputs the action class.
 Beyond Classification: Explanation via Discriminative Patches
Q. How can we use detections of discriminative patches to establish correspondences between training and test videos?
Q. Which detections should be selected for establishing correspondence?
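A minimal sketch of the classification pipeline, reusing the (class, index, classifier) tuples from the mining sketch above: each video is summarised by the maximum response of every e-SVM detector, and a linear SVM maps that vector to an action class. Variable names such as train_video_patches are hypothetical placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def video_feature(patch_descriptors, exemplar_svms):
    """Represent a video by the maximum response of each of the top-n e-SVM
    detectors over all spatio-temporal patches sampled from the video."""
    return np.array([clf.decision_function(patch_descriptors).max()
                     for _, _, clf in exemplar_svms])

# Training: one feature vector per training video, then a multi-class linear SVM.
# X_train = np.vstack([video_feature(p, exemplar_svms) for p in train_video_patches])
# action_clf = LinearSVC(C=1.0).fit(X_train, train_video_labels)
# predicted = action_clf.predict(video_feature(test_video_patches, exemplar_svms)[None, :])
```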
3. Analyzing Videos
 Context-dependent Patch Selection
Vocabulary size: N
Candidate detections: {D1, D2, …, DN}
xi: whether or not the detection of e-SVM i is selected
Appearance term (Ai): the e-SVM score for patch i.
Class-consistency term (Cli): promotes selection of certain e-SVMs over others given the action class. For example, for the weightlifting class it prefers patches showing a man and a bar with vertical motion. Cl is learned from the training data by counting the number of times each e-SVM fires for each class (see the sketch after this slide).
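A minimal sketch of how such a class-consistency table could be built by counting; the per-class normalisation is an assumption for illustration, not taken from the paper.

```python
import numpy as np

def class_consistency(firings, n_esvms, n_classes):
    """Cl[i, c]: how often e-SVM i fires on training videos of action class c.

    firings : (n_detections, 2) int array of (esvm_index, video_class) pairs,
              one row per detection on the training set.
    """
    counts = np.zeros((n_esvms, n_classes))
    np.add.at(counts, (firings[:, 0], firings[:, 1]), 1.0)
    # Normalise per class so classes with more training videos do not dominate.
    return counts / (counts.sum(axis=0, keepdims=True) + 1e-8)
```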
3. Analyzing Videos
Penalty term (Pij): the penalty for selecting a pair of detections together. It is high when
(1) e-SVMs i and j do not fire frequently together in the training data, or
(2) e-SVMs i and j are trained on different action classes.
 Optimization
The resulting integer program is NP-hard; it is solved approximately with the IPFP algorithm, which typically converges in 5~10 iterations (see the sketch below).
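A minimal sketch of the selection step, assuming the problem takes the common quadratic pseudo-Boolean form maximize uᵀx − xᵀPx over x ∈ {0,1}^N with u_i = A_i + Cl_i; the exact objective and constraints in the paper may differ, and this IPFP-style solver (linearise, jump to the best binary point, exact line search) is a generic adaptation rather than the authors' code.

```python
import numpy as np

def ipfp_select(u, P, n_iter=10):
    """IPFP-style solver for: maximize f(x) = u.x - x.P.x over x in {0,1}^N.

    u : (N,) unary scores (appearance + class consistency)
    P : (N, N) symmetric pairwise penalties
    """
    N = len(u)
    x = np.full(N, 0.5)                      # relaxed starting point
    for _ in range(n_iter):
        grad = u - 2.0 * P @ x               # gradient of f at x (P symmetric)
        b = (grad > 0).astype(float)         # best binary point for the linearised objective
        d = b - x
        # Exact line search: f(x + t d) = const + a1*t + a2*t^2 on t in [0, 1].
        a2 = -(d @ P @ d)
        a1 = u @ d - 2.0 * (x @ P @ d)
        if a2 < 0:
            t = np.clip(-a1 / (2.0 * a2), 0.0, 1.0)
        else:
            t = 1.0 if a1 + a2 > 0 else 0.0
        x = x + t * d
        if np.allclose(d, 0):                # no better binary point: converged
            break
    return (x > 0.5).astype(int)             # final binarisation
```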
4. Experimental Evaluation
 Datasets: UCF-50, Olympic Sports
 Implementation details:
※ Our current implementation considers only cuboid patches.
※ Patches are represented with HOG3D features (4×4×5 cells with 20 discrete orientations).
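A quick check of the descriptor size these settings imply (the cell layout and orientation count are from the slide; concatenating one orientation histogram per cell is the standard HOG3D construction):

```python
# HOG3D descriptor for a cuboid patch: 4 x 4 spatial cells x 5 temporal cells,
# each cell described by a histogram over 20 discrete 3-D gradient orientations.
cells_x, cells_y, cells_t, n_orientations = 4, 4, 5, 20
descriptor_dim = cells_x * cells_y * cells_t * n_orientations
print(descriptor_dim)  # 1600
```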
 Classification Results
4. Experimental Evaluation
 Correspondence and Label Transfer
4. Experimental Evaluation
 Conclusion
1. A new representation for videos based on discriminative spatio-temporal patches.
2. These patches are mined automatically using an exemplar-based clustering approach.
3. They establish strong correspondences and align videos for transferring annotations.
4. They serve as a vocabulary that achieves state-of-the-art results for action classification.