Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
BUILDING PART-BASED OBJECT DETECTORS VIA 3D GEOMETRY [Pepik et al. 2012] [Azizpour et al. 2012] What is the right way to select parts? Geometry-driven Parts (gDPM) Parts based on consistent underlying 3D geometry Advantages: a) Better optimization and learning § Leverage large-scale RGBD data for constraints b) 3D Scene Understanding RGB Input gDPM Detection Predicted Geometry Check Geometric Consistency and Merge. Dictionary (D) of Geometrically-consistent 3D elements (100s clusters, 3 Aspect Ratio) From 3D Parts to Object Hypothesis Greedily select N parts based on frequency and geometric consistency to make an object hypothesis: p = [p , . . . , p ], where p : (e , l ). (N ~ [6, 12]) Learning gDPM Enforcing geometric constraints using large-scale RGBD data. Input - x = {I, I , l}, y 2 { 1, 1} Table Sofa gDPM Detection Predicted Geometry Bed Bad Clusters Good Clusters DPM Detection Table of object instances along with their underlying geometry in are 3. Overview Overview are defined based their appearance and underlying ge- defined based on their appearance and underlying ge3. terms of depth data.onWe convert the depth data into surface ometry. We argue that using a geometric representation in K-Means on Surface Normal ometry. argue that using a geometric normals We using the standard procedure fromrepresentation [25]. Our goalin As input input to to the the system, system, at at training, conjunction with appearance based deformable parts model As training, we we use use RGB RGB images images conjunction appearance based deformable is to learn awith deformable part-based model whereparts the model parts and Relative Depth of object object instances instances along along with with their in not only allows us to have a better initialization but also proof their underlying underlying geometry geometrynot in allows us to better initialization but also geproare only defined based onhave theira appearance and underlying terms of of depth depth data. data. We We convert convert the vides additional constraints during the latent update steps. terms the depth depth data data into into surface surface (10000s clusters) vides additional constraints the latent update steps. ometry. We argue that usingduring a geometric representation in normals using using the the standard standard procedure Specifically, our learning procedure ensures not only that normals procedure from from [25]. [25]. Our Our goal goal Specifically, our appearance learning procedure ensures not that conjunction with based deformable partsonly model (3 Aspect Ratios) to learn learn aa deformable deformable part-based part-based model the isis to model where where the the parts parts the updates consistent the appearance space but latent updates are consistent in the appearance space but notlatent only allows us are to have a betterininitialization but also proare defined defined based based on on their their appearance appearance and also that the geometry predicted by underlying parts is conFigure 2. A few elements from the dictionary after the initialization are and underlying underlying gegealso the geometry predicted by the underlying parts steps. is convidesthat additional constraints during latent update Figure 2. A few elements from the dictionary after the initialization Rejected sistent with the ground truth geometry. Hence, the depth ometry. We We argue argue that that using using aa geometric in step. They are ordered to highlight the over-completeness of our ometry. geometric representation representationsistent in with the geometry. Hence, the depth step. They are ordered to highlight the over-completeness of our Specifically, ourground learningtruth procedure ensures not only that data is not used as an extra feature, but instead provides initial dictionary. conjunction with with appearance appearance based conjunction based deformable deformable parts parts model model data is notupdates used as extra feature, but insteadspace provides initial dictionary. the latent arean consistent in the appearance but weak supervision during the latent update steps. notonly only allows allows us us to to have have aa better better initialization not initialization but butalso alsoproproweak supervision during the latent update steps. also that the geometry predicted by underlying parts is conFigure 2. A few elements from the dictionary after the initialization vides additional constraints during the latent update steps. vides additional constraints during the latent update steps. paper, system for of our sistent with the ground truth geometry. Hence, the depth In this step. Theywe arepresent ordered atoproof-of-concept highlight the over-completeness In this paper, we present a proof-of-concept system for Specifically, our learning procedure ensures not only that Specifically, our learning procedure ensures not only that buildinginitial gDPM. We limit our focus on man-made indoor data is not used as an extra feature, but instead provides dictionary. building gDPM. We limit our focus on man-made indoor thelatent latent updates updates are are consistent consistent in rigid objects, such as bed, sofa etc., for three reasons: (1) the in the the appearance appearance space spacebut but weak supervision during the latent update steps. rigid objects, such as bed, sofa etc., for three reasons: (1) also that that the the geometry geometry predicted predicted by Figure 2. A few elements from the dictionary after the initialization These classes are primarily defined based on their physalso by underlying underlying parts parts isisconconFigure 2. A few elements from thebased dictionary after the initialization In this paper, we present a proof-of-concept system for These classes are primarily defined on their physsistent with with the the ground ground truth truth geometry. step. They are ordered to highlight the over-completeness our ical of properties, learning a geometric model sistent geometry. Hence, Hence, the the depth depth 2 step. They are ordered to highlight the over-completeness of ourmin(and2 ,therefore building gDPM. We limit our focus on man-made indoor ) Increasing ical properties, and therefore learning a geometric model data is not used as an extra feature, but instead provides initial dictionary. y intuitive sense; (2) These classes for these categoriesxmakes data is not used as an extra feature, but instead provides initial dictionary. rigid objects, such as bed, intuitive sofa etc.,sense; for three reasons: (1) for these categories makes (2) These classes weak supervision during the latent update steps. have high intra-class variation and are challenging for any weak supervision during the latent update steps. These classes are primarily defined based on their physhave high intra-class variation and are challenging for any deformable parts model. We would like to demonstrate that In this paper, we present a proof-of-concept system for ical properties, and therefore learning a geometric model In this paper, we present a proof-of-concept system deformable for parts model. We would like to demonstrate that a joint geometric and appearance based representation gives building gDPM. We limit our focus on man-made indoor for these categories makes intuitive sense; (2) These classes building gDPM. We limit our focus on man-made indoor a joint geometric and appearance based representation gives us a powerful tool to model intra-class variations; (3) Firigid objects, such as bed, sofa etc., for three reasons: (1) have high intra-class variation and are challenging for any Figure 3. A few examples of resulting elements in dictionary after rigid objects, such as bed, sofa etc., for three reasons: us (1) a powerful tool to model intra-class variations; (3) nally, Fi- due to the availability of Kinect, data collection for These classes are primarily defined based on their physFigure 3. A few examples of resulting elements in dictionarythe after deformable parts model. We would like todata demonstrate that refinement procedure. These classes are primarily defined based on their physnally, due to the availability of Kinect, collection for these categories has become simpler and efficient. In our ical properties, and therefore learning a geometric model the refinement procedure. a jointcategories geometric has and become appearance based and representation gives ical properties, and therefore learning a geometric model these simpler efficient. In our case, we use the NYU v2 dataset [25], which has 1449 4.1. Geometry-driven Dictionary of 3D Elements for these categories makes intuitive sense; (2) These classes us a powerful tool to model intra-class variations; (3) Fifor these categories makes intuitive sense; (2) These classes case, we use the NYU v2 dataset [25], which has 1449 4.1. 3. Geometry-driven Dictionary Elements RGBD images. Figure A few examples of resulting elementsof in 3D dictionary after Given a set of labeled training images and their correhave high intra-class variation and are challenging for any nally, due to the availability of Kinect, data collection for have high intra-class variation and are challenging for any RGBD images. the refinement deformable parts model. We would like to demonstrate that Given a procedure. set of labeled training images and their corresponding surface normal data, our goal is to discover a dicthese categories has become simpler and efficient. In our deformable parts model. We would like to demonstrate that a joint geometric and appearance based representation gives sponding surface normal data, our goal is toElements discover ationary dic- of elements capturing 3D information that can act case, we use the NYU v2 dataset [25], which has 1449 4.1. Geometry-driven Dictionary of 3D 4. Technical Approach a joint geometric and appearance based representation gives us a powerful tool to model intra-class variations; (3) Fitionary of elements capturing 3D information that can act as parts in DPM. Our elements should be: 1) representaRGBD images. 4. Technical Approach us a powerful tool to model intra-class variations; (3) FiFigure 3. A few examples of resulting elements in dictionary after Given a set of labeled training images and their correnally, due to the availability of Kinect, data collection for as partsset inofDPM. Ourobject elements should be:form 1) representaFigure 3. A few examples of resulting elements in dictionary after tive: jfrequent amongjthe object categories in question; 2) Given a large training instances in the the refinement procedure. nally, due to the availability of Kinect, data collection for 1 N sponding surface normal data, our goal is to discover a dicthese categories has become simpler and efficient. In our Giventhe among object acategories in question; 2) consistent refinement a large set ofprocedure. training object instances in the form spatially of RGBDtive: data,frequent our goal is tothe discover set of candii with respect to the object. (e.g., a horithese categories has become simpler and efficient. In our tionary of elements capturing 3D information that can act case, we use the NYU v2 dataset [25], which has 1449 4. RGBD Technical Approach consistent with respect togeometry, the object.and (e.g., azontal hori- surface always occurs on the top of a table and bed, 4.1.data, Geometry-driven of our goal is to Dictionary discover a of set3D of Elements candidate partsspatially based on consistent underlying case, we use the NYU v2 dataset [25], which has 1449 as parts in DPM. Our elements should be: 1) representa4.1. Geometry-drivenunderlying Dictionary of 3D Elements RGBD images. occurs ondeformable the top of aparttable andwhile bed, it occurs at center of a chair and a sofa). To obtain a date parts Given based aonsetconsistent geometry, and use correthese zontal parts tosurface learn aalways geometry-driven of labeled training images and their RGBD images. tive: frequent among the object categories in question; 2) Given a Given large set of training object instances in theand form a set of labeled training images their correwhile(gDPM). it occursToat obtain center such of a chair a sofa). To obtain a dictionary of candidate elements which satisfy these propuse these parts to learn a geometry-driven deformable partbased model a set and of candidate sponding surface normal data, our goal is to discover a dicspatially consistent with respect to the object. (e.g., a horiof RGBD data, our goalnormal is to discover agoal set isoftocandiBed Model 1 erties, sponding surface data, our discover awe dicdictionary of candidate elements which satisfy these prop- we use a two step process: first we initialize our dicparts, first discover a dictionary ofgDPM geometric elements based model (gDPM). To obtain such a set of candidate tionary of elements capturing 3D information that can act zontal surface always occurs on the top of a table and bed, date parts basedofonelements consistent underlying geometry, and 4. Technical Approach tionary capturing 3D information that can erties,depth we use a two step process: initialize ourtionary dic- by an over-complete set of elements, each satisfying based onact their information (section 4.1)first in awe categoryparts, we first discover a dictionary of geometric elements 4. Technical Approach as parts in DPM. Our elements should be: 1) representawhile it occurs at center of a chair and a sofa). To obtain a use theseasparts toin learn a geometry-driven deformable partparts DPM. Our elements should be: 1) representathe representativeness property; and then we refine the dicfree manner (pooling the data from set all ofcategories). A satisfying tionary by an over-complete elements, each based on their depth information (section 4.1) in a categorytive: frequent among the object categories in question; 2) Given a large set of training object instances in the form dictionary of candidate elements which satisfy these propbased model (gDPM). among To obtain such a categories set of candidate tive: frequent the object in(e.g., question; 2) representativeness Given a large set of training object instances in the form category-free dictionary allows property; us to share elements the andthe then we refine thetionary dic- elements based on their relative spatial location with free manner (pooling the data from all categories). Aa horispatially consistent with respect to the object. of RGBD data, our goal is to discover a set of candierties, we use a two step process: first we initialize our dicparts, wespatially first discover a dictionary of geometric elements consistent with respect to the object. (e.g., abed, horirespect to the object. of RGBD data, our goal is to discover a set of candiacross multiple object categories. tionary elements based on their relative spatial location with category-free dictionary allows us to share the elements zontal surface always occurs on the top of a table and date parts based on consistent underlying geometry, and tionary by an over-complete set of elements, each satisfying based on their depth information (section 4.1) in a categoryzontal surface always occurs on the top of a table and bed, date parts based on consistent underlying geometry, and Initializing the dictionary: We sample hundreds of thouto the object. across multiple object categories. while it occurs at a chair a sofa). ToAobtain arespect use these parts to learn a geometry-driven deformable partWe use this dictionary to choose a setand of then partswe forrefine every the dicthe representativeness property; free manner (pooling thecenter data offrom all and categories). while it occurs at center of a chair and satisfy a sofa).these To obtain a use these parts(gDPM). to learn aTo geometry-driven deformable partsands Initializing dictionary: We sample hundreds thou- of patches, in 3 different aspect-ratios (AR), from the dictionary of candidate elements which propbased model obtain such a set of candidate object category based the onbased frequency of occurrence and con- of tionary elements on their relative spatial location with category-free dictionary allows us to share the elements We use this dictionary to choose a set of parts for every Sample Objects Frequent Parts Selected Consistent dictionary of acandidate elements which satisfy these propobject bounding Parts boxes in the training images (100 − 500 based obtain such a set of candidate sands of patches, in 3 different aspect-ratios (AR), from the erties, we use two step process: first we initialize our dicparts, model we first(gDPM). discover aTodictionary of geometric elements sistency in the relative location with respect to the object respect to the object. across category multiple object categories. object based on frequency of occurrence and conSofa gDPM Model 1 −patches erties, we useover-complete a two step process: first we initialize our dicparts, first depth discover a dictionary of geometric elements object bounding boxes in the training images (100 500 per object bounding box). We represent these tionary by an set of elements, each satisfying based we on their information (section 4.1) in a categorybounding-boxes. Finally, we use these parts to initialize and Initializing the dictionary: We sample hundreds of thousistency in this the dictionary relative location with respect to for the each object We use to choose a set of parts every patches tionary by an over-complete set of elements, satisfying based on their depth information (section 4.1) in a categorypatches per object bounding box). We(AR), represent these in terms of their raw surface normal maps. To exthe representativeness property; and then we refine the dicfree manner (pooling the data from all categories). bounding-boxes. A learn our gDPM using latent updates and hard mining. We sands of patches, in 3 different aspect-ratios from the Finally, we use these parts tothen initialize and the dicobject category based on frequency of occurrence and contract the representativeness property; and we refine free manner (pooling the data from all categories). A patches in terms ofoftheir rawtraining surface normal maps. To ex-a representative set of elements for each AR, we pertionary elements based on their relative spatial location with category-free dictionary allows us to share the elements exploit the geometric nature our parts and use them to enobject bounding boxes in the images (100 − 500 learn ourtionary gDPM using latent updates and hardtomining. We with sistency in the relative location with respect the object elements based on their relative spatial location category-free dictionary allows us to share the elements tract a geometrical representative set of elements forrepresent each AR,these weform per-clustering using simple k-means (with k ∼ 1000), in respect to the object. across multiple object categories. force additional constraints at the latent update patches per object bounding box). We exploit the geometric nature parts andtouse them toand enbounding-boxes. Finally, we of useour these parts initialize rawin surface normal space. This clustering process leads to respect to the object. across multiple object categories. form 4.3). clustering using simple k-means (with k ∼ To 1000), Initializing the dictionary: We sample hundreds of thousteps (section patches in terms of their raw surface normal maps. exforce constraints athard the mining. latent update We use this dictionary to choose a set of parts for every learn additional ourInitializing gDPMgeometrical using latent updates and We of thouthe dictionary: We sample hundreds rawasurface normalset space. This clustering sands of patches, in 3 different aspect-ratios (AR), from the tract representative of elements for each process AR, we leads per- to We use this dictionary to choose a set of parts for every steps (section 4.3). object category based on frequency of occurrence and conexploit the geometric nature our parts aspect-ratios and use them to en- from the sands of patches, inof 3 different (AR), object bounding boxes in the training images (100 − 500 form clustering using simple k-means (with k ∼ 1000), in object category based onlocation frequency occurrence consistency in the relative withofrespect to theand object force additional geometrical constraints at the latent update object bounding in thebox). training images (100 these − raw 500 surface normal space. G patches per objectboxes bounding We represent ThisgDPM clusteringModel process leads Table 1 to sistency in the relative location with respect to the object bounding-boxes. Finally, we use these parts to initialize and steps (section 4.3). patchesinper object bounding box).normal We represent patches terms of their raw surface maps. To these exbounding-boxes. Finally, we use these parts to initialize and learn our gDPM using latent updates and hard mining. We patches in terms of their surfacefornormal maps. tract a representative set ofraw elements each AR, we To per-exlearn our gDPM using latent updates and hard mining. We exploit the geometric nature of our parts and use them to enG for each AR, we pertract a representative set of elements form clustering using simple k-means (with k ∼ 1000), in exploit the geometric nature of our parts at and themupdate to enforce additional geometrical constraints theuse latent form clustering simple (with k ∼ leads 1000),toin Gusing raw surface normal space. Thisk-means clustering process force geometrical constraints at the latent update steps additional (section 4.3). raw surface normal space. This clustering process leads to - Model Parameters, N steps (section 4.3). [Endres et al. 2012] Input Image Test Set: NYU v2 RGB Images Table [Felzenszwalb et al. 2009] Desired properties: As input to the system, at training, we use RGB imagesAs input to the system, at training, we use RGB images ofinobject instances along with their underlying geometry in of object instances along with their underlying geometry § Representative: frequent among objects 3. Overview terms of depth data. We convert the depth data into surface terms of depth data. We convert the depth data into surface normals using the standard procedure from [25]. Our goal normals using the standardatprocedure from [25]. Our goal As input to the system, training, we use RGB images § Spatially consistent w.r.t the object is to learn a deformable part-based model where the parts is to learn a deformable part-based model where the parts 3. Overview 3. Overview Qualitative Results Bed gDPM Model 2 Sofa Key-point/part annotation, e.g., anatomical. Geometry-driven Dictionary of 3D Elements Experimental Results Bed gDPM Model 3 Table Heuristic initialization, e.g., gradient magnitudes. Supervised Parts Effectiveness of DPMs + Richness of Geometric Representation Sofa (cluster 1) Unsupervised Parts Our Approach Sofa gDPM Model 3 Sofa gDPM Model 2 Table gDPM Model 2 where I denotes an RGB image, I - RGB Image, I Figure – Surface Normal Map, l -models bounding-box location. 7. Learned gDPM for classes bed, sofa and table. The first visualization in each template represents the learned appearance Table gDPM Model 3 Sofa Discriminative Part-based Models Abhinav Shrivastava and Abhinav Gupta I denotes the surface normal map and root filter, the second visualization contains learned part filters super-imposed on the root filter, the third visualization is the surface normal X 1 bounding 2 l is location of the box, and . z - Root & Part Locations, Minimize: LD ( map ) = corresponding k k + C tomax(0, 1 and yi fthe (xfourth i )) each part visualization is of the learned deformation penalty. c - Cluster Membership. 2 i=1 Latent the latent update step on positives uses f from (5) to estiβ Standard Negative Hard-mining 8 mate the latent variables; then we apply SGD to solve for β > (I, z, c ) if y = 1, <SAppearance by using standard nfcβ (4) and hard-negative mining. At test X f (x) = i G time, we only use the standard scoring function (2) (which > S (e , !(I , l )), if y = +1. S (I, z, ) + Appearance c G i : is also equivalent i=1 to setting λ = 0 in (5)). Standard Appearance Score of root and part filters. Geometry constrained Positive Latent Update Geometry constraints on part placements. !(·) – Raw surface normal at We now present experimental results to demonstrate the SG (·) – Similarity function between surface normal maps representation and coneffectiveness of adding geometric 5. Experiments Quantitative Results Test Set: NYU v2 RGB Images Table 1. AP performance on the task of object detection. Bed Chair M.+TV Sofa Table DPM (No Parts) 20.94 10.69 6.38 5.51 2.73 DPM 22.39 14.44 8.10 7.16 3.53 DPM (Our Parts, No Latent) DPM (Our Parts) 26.59 29.15 5.71 11.43 2.35 4.17 6.82 8.30 3.41 1.76 gDPM 33.39 13.72 9.28 11.04 4.05