Download B P - O

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
BUILDING PART-BASED OBJECT DETECTORS VIA 3D GEOMETRY
[Pepik et al. 2012]
[Azizpour et al. 2012]
What is the right way to select parts?
Geometry-driven Parts (gDPM)
Parts based on consistent underlying 3D geometry
Advantages:
a)  Better optimization and learning
§  Leverage large-scale RGBD data for constraints
b)  3D Scene Understanding
RGB Input
gDPM Detection
Predicted Geometry
Check Geometric
Consistency and Merge.
Dictionary (D) of Geometrically-consistent 3D elements (100s clusters, 3 Aspect Ratio)
From 3D Parts to Object Hypothesis
Greedily select N parts based on frequency and geometric consistency to
make an object hypothesis: p = [p , . . . , p ], where p : (e , l ). (N ~ [6, 12])
Learning gDPM
Enforcing geometric constraints using large-scale RGBD data. Input - x = {I, I , l}, y 2 { 1, 1}
Table
Sofa
gDPM Detection Predicted Geometry
Bed
Bad Clusters
Good Clusters
DPM Detection
Table
of object instances along with their underlying geometry in
are
3. Overview
Overview
are
defined
based
their
appearance
and
underlying
ge- defined based on their appearance and underlying ge3.
terms
of depth
data.onWe
convert
the depth
data
into surface
ometry. We argue that using a geometric representation in
K-Means on Surface Normal
ometry.
argue
that using
a geometric
normals We
using
the standard
procedure
fromrepresentation
[25]. Our goalin
As input
input to
to the
the system,
system, at
at training,
conjunction with appearance based deformable parts model
As
training, we
we use
use RGB
RGB images
images
conjunction
appearance
based deformable
is to learn awith
deformable
part-based
model whereparts
the model
parts
and
Relative
Depth
of object
object instances
instances along
along with
with their
in
not
only
allows
us
to
have
a
better
initialization
but
also
proof
their underlying
underlying geometry
geometrynot
in
allows
us to
better initialization
but also geproare only
defined
based
onhave
theira appearance
and underlying
terms of
of depth
depth data.
data. We
We convert
convert the
vides additional constraints during the latent update steps.
terms
the depth
depth data
data into
into surface
surface
(10000s
clusters)
vides
additional
constraints
the latent
update steps.
ometry.
We argue
that usingduring
a geometric
representation
in
normals using
using the
the standard
standard procedure
Specifically, our learning procedure ensures not only that
normals
procedure from
from [25].
[25]. Our
Our goal
goal
Specifically,
our appearance
learning procedure
ensures not
that
conjunction with
based deformable
partsonly
model
(3 Aspect Ratios)
to learn
learn aa deformable
deformable part-based
part-based model
the
isis to
model where
where the
the parts
parts
the
updates
consistent
the appearance
space
but latent updates are consistent in the appearance space but
notlatent
only allows
us are
to have
a betterininitialization
but also
proare defined
defined based
based on
on their
their appearance
appearance and
also that the geometry predicted by underlying parts is conFigure 2. A few elements from the dictionary after the initialization
are
and underlying
underlying gegealso
the geometry
predicted
by the
underlying
parts steps.
is convidesthat
additional
constraints
during
latent update
Figure 2. A few elements from the dictionary
after the initialization
Rejected
sistent with the ground truth geometry. Hence, the depth
ometry. We
We argue
argue that
that using
using aa geometric
in
step. They are ordered to highlight the over-completeness of our
ometry.
geometric representation
representationsistent
in
with the
geometry.
Hence,
the depth
step. They are ordered to highlight the over-completeness of our
Specifically,
ourground
learningtruth
procedure
ensures
not only
that
data is not used as an extra feature, but instead provides
initial dictionary.
conjunction with
with appearance
appearance based
conjunction
based deformable
deformable parts
parts model
model
data
is notupdates
used as
extra feature,
but insteadspace
provides
initial dictionary.
the latent
arean
consistent
in the appearance
but
weak supervision during the latent update steps.
notonly
only allows
allows us
us to
to have
have aa better
better initialization
not
initialization but
butalso
alsoproproweak
supervision
during
the latent
update steps.
also that
the geometry
predicted
by underlying
parts is conFigure 2. A few elements from the dictionary after the initialization
vides
additional
constraints
during
the
latent
update
steps.
vides additional constraints during the latent update steps.
paper,
system for of our
sistent with the ground truth geometry. Hence, the depth In this
step.
Theywe
arepresent
ordered atoproof-of-concept
highlight the over-completeness
In this paper, we present a proof-of-concept system for
Specifically,
our
learning
procedure
ensures
not
only
that
Specifically, our learning procedure ensures not only that
buildinginitial
gDPM.
We limit our focus on man-made indoor
data is not used as an extra feature, but instead provides
dictionary.
building
gDPM.
We
limit
our
focus
on
man-made
indoor
thelatent
latent updates
updates are
are consistent
consistent in
rigid objects, such as bed, sofa etc., for three reasons: (1)
the
in the
the appearance
appearance space
spacebut
but
weak supervision during the latent update steps.
rigid
objects,
such
as
bed,
sofa
etc.,
for
three
reasons:
(1)
also that
that the
the geometry
geometry predicted
predicted by
Figure 2. A few elements from the dictionary after the initialization
These
classes are primarily defined based on their physalso
by underlying
underlying parts
parts isisconconFigure
2.
A
few
elements
from thebased
dictionary
after the
initialization
In
this
paper,
we
present
a
proof-of-concept
system
for
These
classes
are
primarily
defined
on
their
physsistent with
with the
the ground
ground truth
truth geometry.
step. They are ordered to highlight the over-completeness
our
ical of
properties,
learning a geometric model
sistent
geometry. Hence,
Hence, the
the depth
depth
2
step.
They
are
ordered
to
highlight
the
over-completeness
of
ourmin(and2 ,therefore
building
gDPM.
We
limit
our
focus
on
man-made
indoor
)
Increasing
ical
properties,
and
therefore
learning
a
geometric
model
data is not used as an extra feature, but instead provides
initial dictionary.
y intuitive sense; (2) These classes
for these categoriesxmakes
data is not used as an extra feature, but instead provides
initial dictionary.
rigid
objects,
such as
bed, intuitive
sofa etc.,sense;
for three
reasons:
(1)
for
these
categories
makes
(2) These
classes
weak supervision during the latent update steps.
have high intra-class variation and are challenging for any
weak supervision during the latent update steps.
These
classes
are
primarily
defined
based
on
their
physhave high intra-class variation and are challenging for any
deformable parts model. We would like to demonstrate that
In this paper, we present a proof-of-concept system for
ical properties,
and
therefore
learning
a
geometric
model
In this paper, we present a proof-of-concept system deformable
for
parts model. We would like to demonstrate that
a joint geometric and appearance based representation gives
building gDPM. We limit our focus on man-made indoor
for
these
categories
makes
intuitive
sense;
(2)
These
classes
building gDPM. We limit our focus on man-made indoor
a joint geometric and appearance based representation gives
us a powerful tool to model intra-class variations; (3) Firigid objects, such as bed, sofa etc., for three reasons: (1)
have
high
intra-class
variation
and
are
challenging
for
any
Figure 3. A few examples of resulting elements in dictionary after
rigid objects, such as bed, sofa etc., for three reasons: us
(1) a powerful tool to model intra-class variations; (3) nally,
Fi- due to the availability of Kinect, data collection for
These classes are primarily defined based on their physFigure 3. A few examples of resulting elements in dictionarythe
after
deformable
parts
model.
We would
like todata
demonstrate
that
refinement procedure.
These classes are primarily defined based on their physnally,
due
to
the
availability
of
Kinect,
collection
for
these
categories
has
become
simpler
and
efficient.
In
our
ical properties, and therefore learning a geometric model
the refinement procedure.
a jointcategories
geometric has
and become
appearance
based and
representation
gives
ical properties, and therefore learning a geometric model
these
simpler
efficient.
In
our
case, we use the NYU v2 dataset [25], which has 1449
4.1. Geometry-driven Dictionary of 3D Elements
for these categories makes intuitive sense; (2) These classes
us
a
powerful
tool
to
model
intra-class
variations;
(3)
Fifor these categories makes intuitive sense; (2) These classes
case, we use the NYU v2 dataset [25], which has 1449
4.1. 3.
Geometry-driven
Dictionary
Elements
RGBD images.
Figure
A few examples of resulting
elementsof
in 3D
dictionary
after Given a set of labeled training images and their correhave high intra-class variation and are challenging for any
nally,
due
to
the
availability
of
Kinect,
data
collection
for
have high intra-class variation and are challenging for any
RGBD images.
the refinement
deformable parts model. We would like to demonstrate that
Given a procedure.
set of labeled training images and their corresponding surface normal data, our goal is to discover a dicthese
categories
has
become
simpler
and
efficient.
In
our
deformable parts model. We would like to demonstrate that
a joint geometric and appearance based representation gives
sponding
surface normal
data, our goal
is toElements
discover ationary
dic- of elements capturing 3D information that can act
case, we use the NYU v2 dataset [25], which has 1449
4.1.
Geometry-driven
Dictionary
of 3D
4.
Technical
Approach
a joint geometric and appearance based representation gives
us a powerful tool to model intra-class variations; (3) Fitionary of elements capturing 3D information that can
act
as
parts in DPM. Our elements should be: 1) representaRGBD
images.
4.
Technical
Approach
us a powerful tool to model intra-class variations; (3) FiFigure 3. A few examples of resulting elements in dictionary after Given a set of labeled training images and their correnally, due to the availability of Kinect, data collection for
as
partsset
inofDPM.
Ourobject
elements
should
be:form
1) representaFigure
3. A few
examples of resulting elements in dictionary
after
tive: jfrequent amongjthe object categories in question; 2)
Given
a large
training
instances
in the
the
refinement
procedure.
nally, due to the availability of Kinect, data collection for
1
N
sponding
surface
normal
data,
our
goal
is
to
discover
a
dicthese categories has become simpler and efficient. In our Giventhe
among
object acategories
in question;
2) consistent
refinement
a large
set ofprocedure.
training object instances in the form
spatially
of RGBDtive:
data,frequent
our goal
is tothe
discover
set of candii with respect to the object. (e.g., a horithese
categories
has
become
simpler
and
efficient.
In
our
tionary
of
elements
capturing
3D
information
that
can
act
case, we use the NYU v2 dataset [25], which has 1449
4. RGBD
Technical
Approach
consistent
with
respect togeometry,
the object.and
(e.g., azontal
hori- surface always occurs on the top of a table and bed,
4.1.data,
Geometry-driven
of
our goal is to Dictionary
discover a of
set3D
of Elements
candidate partsspatially
based on
consistent
underlying
case,
we
use
the
NYU
v2
dataset
[25],
which
has
1449
as
parts
in
DPM.
Our
elements
should
be:
1)
representa4.1.
Geometry-drivenunderlying
Dictionary
of 3D Elements
RGBD images.
occurs ondeformable
the top of aparttable andwhile
bed, it occurs at center of a chair and a sofa). To obtain a
date parts Given
based aonsetconsistent
geometry,
and
use correthese zontal
parts tosurface
learn aalways
geometry-driven
of
labeled
training
images
and
their
RGBD images.
tive:
frequent
among
the
object
categories
in
question;
2)
Given a Given
large set
of
training
object
instances
in theand
form
a
set
of
labeled
training
images
their
correwhile(gDPM).
it occursToat obtain
center such
of a chair
a sofa). To obtain
a
dictionary
of candidate elements which satisfy these propuse these
parts
to
learn
a
geometry-driven
deformable
partbased
model
a set and
of candidate
sponding
surface
normal
data,
our
goal
is
to
discover
a
dicspatially
consistent
with
respect
to
the
object.
(e.g.,
a
horiof RGBD
data, our
goalnormal
is to discover
agoal
set isoftocandiBed
Model
1 erties,
sponding
surface
data,
our
discover
awe
dicdictionary
of candidate
elements
which
satisfy these
prop- we use a two step process: first we initialize our dicparts,
first discover
a dictionary
ofgDPM
geometric
elements
based model
(gDPM).
To
obtain
such
a
set
of candidate
tionary
of
elements
capturing
3D
information
that
can
act
zontal
surface
always
occurs
on
the
top
of
a
table
and
bed,
date parts
basedofonelements
consistent
underlying
geometry, and
4. Technical Approach
tionary
capturing
3D
information
that
can
erties,depth
we use
a two step
process:
initialize ourtionary
dic- by an over-complete set of elements, each satisfying
based
onact
their
information
(section
4.1)first
in awe
categoryparts,
we
first
discover
a
dictionary
of
geometric
elements
4. Technical Approach
as
parts
in
DPM.
Our
elements
should
be:
1)
representawhile
it
occurs
at
center
of
a
chair
and
a
sofa).
To
obtain
a
use theseasparts
toin
learn
a geometry-driven
deformable
partparts
DPM.
Our
elements
should
be:
1)
representathe representativeness property; and then we refine the dicfree
manner
(pooling
the data from set
all ofcategories).
A satisfying
tionary
by an over-complete
elements, each
based
on
their
depth
information
(section
4.1)
in
a
categorytive:
frequent
among
the
object
categories
in
question;
2)
Given a large set of training object instances in the form
dictionary
of
candidate
elements
which
satisfy
these
propbased model
(gDPM). among
To obtain
such
a categories
set of candidate
tive:
frequent
the
object
in(e.g.,
question;
2) representativeness
Given
a
large
set
of
training
object
instances
in
the
form
category-free
dictionary allows property;
us to share
elements
the
andthe
then
we refine thetionary
dic- elements based on their relative spatial location with
free
manner
(pooling
the
data
from
all
categories).
Aa horispatially
consistent
with
respect
to
the
object.
of RGBD data, our goal is to discover a set of candierties,
we
use
a
two
step
process:
first
we
initialize
our
dicparts, wespatially
first discover
a dictionary
of geometric
elements
consistent
with
respect
to
the
object.
(e.g.,
abed,
horirespect
to the object.
of
RGBD
data,
our
goal
is
to
discover
a
set
of
candiacross
multiple
object
categories.
tionary
elements
based
on
their
relative
spatial
location
with
category-free
dictionary
allows
us
to
share
the
elements
zontal
surface
always
occurs
on
the
top
of
a
table
and
date parts based on consistent underlying geometry, and
tionary by an over-complete set of elements, each satisfying
based on their depth information (section 4.1) in a categoryzontal
surface
always
occurs
on
the
top
of
a
table
and
bed,
date
parts
based
on
consistent
underlying
geometry,
and
Initializing the dictionary: We sample hundreds of thouto the object.
across
multiple
object
categories.
while it
occurs at
a chair
a sofa). ToAobtain
arespect
use these parts to learn a geometry-driven deformable partWe use
this
dictionary
to choose
a setand
of then
partswe
forrefine
every the dicthe
representativeness
property;
free manner
(pooling
thecenter
data offrom
all and
categories).
while it occurs
at center
of a chair
and satisfy
a sofa).these
To obtain
a
use
these
parts(gDPM).
to learn aTo
geometry-driven
deformable
partsands
Initializing
dictionary:
We sample
hundreds
thou- of patches, in 3 different aspect-ratios (AR), from the
dictionary
of
candidate
elements
which
propbased
model
obtain such a set
of candidate
object
category
based the
onbased
frequency
of occurrence
and
con- of
tionary elements
on their
relative
spatial
location
with
category-free
dictionary
allows
us to
share
the elements
We use
this
dictionary
to
choose
a
set
of
parts
for
every
Sample
Objects
Frequent
Parts
Selected
Consistent
dictionary
of acandidate
elements
which
satisfy these
propobject
bounding Parts
boxes in the training images (100 − 500
based
obtain such
a set of candidate
sands
of
patches,
in
3
different
aspect-ratios
(AR),
from
the
erties,
we
use
two
step
process:
first
we
initialize
our
dicparts, model
we first(gDPM).
discover aTodictionary
of geometric
elements
sistency
in
the
relative
location
with
respect
to
the
object
respect to the object.
across category
multiple object
categories.
object
based
on
frequency
of
occurrence
and
conSofa
gDPM
Model
1 −patches
erties, we
useover-complete
a two step process:
first we initialize
our dicparts,
first depth
discover
a dictionary
of geometric
elements
object
bounding
boxes
in
the
training
images
(100
500 per object bounding box). We represent these
tionary
by
an
set
of
elements,
each
satisfying
based we
on their
information
(section
4.1) in a categorybounding-boxes.
Finally,
we
use
these
parts
to
initialize
and
Initializing
the
dictionary:
We
sample
hundreds
of
thousistency
in this
the dictionary
relative
location
with
respect
to for
the each
object
We
use
to
choose
a
set
of
parts
every
patches
tionary
by
an
over-complete
set
of
elements,
satisfying
based
on
their
depth
information
(section
4.1)
in
a
categorypatches
per object
bounding
box).
We(AR),
represent
these in terms of their raw surface normal maps. To exthe representativeness
property;
and
then
we
refine
the
dicfree manner (pooling the data from all categories). bounding-boxes.
A
learn
our
gDPM
using
latent
updates
and
hard
mining.
We
sands
of
patches,
in
3
different
aspect-ratios
from
the
Finally,
we use
these
parts
tothen
initialize
and the dicobject
category
based
on
frequency
of
occurrence
and
contract
the
representativeness
property;
and
we
refine
free
manner
(pooling
the
data
from
all
categories).
A
patches
in terms
ofoftheir
rawtraining
surface
normal
maps.
To
ex-a representative set of elements for each AR, we pertionary
elements
based
on
their
relative
spatial
location
with
category-free dictionary allows us to share the elements
exploit
the
geometric
nature
our
parts
and
use
them
to
enobject
bounding
boxes
in
the
images
(100
−
500
learn
ourtionary
gDPM
using latent
updates
and
hardtomining.
We with
sistency
in
the
relative
location
with
respect
the
object
elements
based
on
their
relative
spatial
location
category-free
dictionary
allows
us
to
share
the
elements
tract a geometrical
representative
set of elements
forrepresent
each
AR,these
weform
per-clustering using simple k-means (with k ∼ 1000), in
respect
to
the
object.
across multiple object categories.
force
additional
constraints
at
the
latent
update
patches
per
object
bounding
box).
We
exploit
the
geometric
nature
parts
andtouse
them toand
enbounding-boxes.
Finally,
we of
useour
these
parts
initialize
rawin
surface normal space. This clustering process leads to
respect
to
the
object.
across multiple object categories.
form 4.3).
clustering
using
simple
k-means
(with
k ∼ To
1000),
Initializing
the
dictionary:
We
sample
hundreds
of
thousteps
(section
patches
in
terms
of
their
raw
surface
normal
maps.
exforce
constraints
athard
the mining.
latent update
We use this dictionary to choose a set of parts for every
learn additional
ourInitializing
gDPMgeometrical
using
latent
updates and
We of thouthe
dictionary:
We
sample
hundreds
rawasurface
normalset
space.
This clustering
sands
of
patches,
in
3
different
aspect-ratios
(AR),
from the
tract
representative
of elements
for each process
AR, we leads
per- to
We
use
this
dictionary
to
choose
a
set
of
parts
for
every
steps
(section
4.3).
object category based on frequency of occurrence and conexploit the
geometric
nature
our parts aspect-ratios
and use them to
en- from the
sands
of patches,
inof
3 different
(AR),
object
bounding
boxes
in
the
training
images
(100
− 500
form clustering using simple k-means (with k ∼ 1000), in
object
category
based onlocation
frequency
occurrence
consistency
in the relative
withofrespect
to theand
object
force additional
geometrical
constraints
at
the
latent
update
object bounding
in thebox).
training
images
(100 these
− raw
500 surface normal space.
G
patches
per objectboxes
bounding
We
represent
ThisgDPM
clusteringModel
process leads
Table
1 to
sistency
in
the
relative
location
with
respect
to
the
object
bounding-boxes. Finally, we use these parts to initialize and
steps (section
4.3).
patchesinper
object
bounding
box).normal
We represent
patches
terms
of their
raw surface
maps. To these
exbounding-boxes.
Finally,
we
use
these
parts
to
initialize
and
learn our gDPM using latent updates and hard mining. We
patches
in terms of their
surfacefornormal
maps.
tract
a representative
set ofraw
elements
each AR,
we To
per-exlearn
our
gDPM
using
latent
updates
and
hard
mining.
We
exploit the geometric nature of our parts and use them to enG for each AR, we pertract
a
representative
set
of
elements
form clustering using simple k-means (with k ∼ 1000), in
exploit
the geometric
nature of
our parts at
and
themupdate
to enforce additional
geometrical
constraints
theuse
latent
form
clustering
simple
(with
k ∼ leads
1000),toin
Gusing
raw
surface
normal
space.
Thisk-means
clustering
process
force
geometrical constraints at the latent update
steps additional
(section 4.3).
raw surface normal space. This clustering process leads to
- Model Parameters,
N
steps (section 4.3).
[Endres et al. 2012]
Input Image
Test Set: NYU v2 RGB Images
Table
[Felzenszwalb et al. 2009]
Desired
properties:
As input to the system, at training, we use RGB imagesAs input to the system, at training, we use RGB images
ofinobject instances along with their underlying geometry in
of object instances
along with their underlying geometry
§ 
Representative:
frequent
among
objects
3.
Overview
terms
of
depth
data.
We
convert
the
depth data into surface
terms of depth data. We convert the depth data into surface
normals using the standard procedure from [25]. Our goal
normals
using
the
standardatprocedure
from
[25].
Our
goal
As input
to
the
system,
training,
we
use
RGB
images
§  Spatially consistent
w.r.t
the object
is to learn
a deformable
part-based model where the parts
is to learn a deformable part-based model where the parts
3. Overview
3. Overview
Qualitative Results
Bed gDPM Model 2
Sofa
Key-point/part annotation, e.g.,
anatomical.
Geometry-driven Dictionary of 3D Elements
Experimental Results
Bed gDPM Model 3
Table
Heuristic initialization, e.g., gradient
magnitudes.
Supervised Parts
Effectiveness of DPMs + Richness of Geometric Representation
Sofa
(cluster 1)
Unsupervised Parts
Our Approach
Sofa gDPM Model 3
Sofa gDPM Model 2
Table gDPM Model 2
where
I denotes
an RGB
image,
I - RGB
Image,
I Figure
– Surface
Normal
Map, l -models
bounding-box
location.
7. Learned gDPM
for classes
bed, sofa and table. The first visualization in each template represents the learned appearance
Table gDPM Model 3
Sofa
Discriminative Part-based Models
Abhinav Shrivastava and Abhinav Gupta
I denotes the surface normal map and
root filter,
the
second
visualization
contains
learned
part
filters
super-imposed
on
the
root
filter,
the
third
visualization
is
the
surface
normal
X
1 bounding
2
l
is
location
of
the
box,
and
.
z - Root & Part Locations,
Minimize: LD ( map
) = corresponding
k k + C tomax(0,
1 and
yi fthe
(xfourth
i ))
each
part
visualization
is
of
the
learned
deformation
penalty.
c - Cluster Membership.
2
i=1
Latent
the latent update step on positives uses
f
from
(5)
to
estiβ
Standard Negative Hard-mining
8
mate the latent variables; then we apply SGD to solve for β
>
(I, z, c )
if y = 1,
<SAppearance
by using standard nfcβ (4) and hard-negative mining. At test
X
f (x) =
i
G
time,
we
only
use
the
standard
scoring
function
(2)
(which
>
S
(e
,
!(I
,
l
)),
if
y
=
+1.
S
(I,
z,
)
+
Appearance
c
G
i
:
is also equivalent i=1
to setting λ = 0 in (5)).
Standard Appearance Score
of root and part filters.
Geometry constrained
Positive Latent Update
Geometry constraints on part
placements.
!(·) – Raw surface normal at
We now present
experimental
results
to
demonstrate
the
SG (·) – Similarity function between
surface
normal maps representation and coneffectiveness of adding
geometric
5. Experiments
Quantitative Results
Test Set: NYU v2 RGB Images
Table 1. AP performance on the task of object detection.
Bed
Chair
M.+TV
Sofa
Table
DPM (No Parts)
20.94
10.69
6.38
5.51
2.73
DPM
22.39
14.44
8.10
7.16
3.53
DPM (Our Parts, No Latent)
DPM (Our Parts)
26.59
29.15
5.71
11.43
2.35
4.17
6.82
8.30
3.41
1.76
gDPM
33.39
13.72
9.28
11.04
4.05