A Repeatability Test for Two Orientation Based Interest Point Detectors

Björn Johansson and Robert Söderberg

April 26, 2004

Technical report LiTH-ISY-R-2606, ISSN 1400-3902
Computer Vision Laboratory, Department of Electrical Engineering,
Linköping University, SE-581 83 Linköping, Sweden
[email protected], [email protected]

Abstract

This report evaluates the stability of two image interest point detectors: star pattern points and points based on the fourth order tensor. The Harris operator is also included for comparison. Different image transformations are applied, and the repeatability of points between a reference image and each of the transformed images is computed. The transformations are plane rotation, change in scale, change in view, and change in lighting conditions. We conclude that the result largely depends on the image content. The star pattern points and the fourth order tensor model the image as locally straight lines, while the Harris operator is based on simple/non-simple signals. The two methods evaluated here perform equally well or better than the Harris operator when the model is valid, and worse otherwise.

Contents

1 Introduction
2 Interest point detectors
  2.1 Harris, nms
  2.2 Harris, subpixel
  2.3 Fourth order tensors
  2.4 Star patterns
3 Experimental setup
  3.1 Repeatability criterion
  3.2 Transformation of scale, rotation and view
  3.3 Transformation of light
4 Experimental results
  4.1 Comparison of the two Harris versions
  4.2 Rotation, Scale, and View
  4.3 Variation of illumination
5 Conclusions and discussion

1 Introduction

In the last few years, a number of experiments have been performed to evaluate the stability of interest point detectors and local descriptors, see e.g. [11, 8]. Stable interest points are useful for example in object recognition applications, see e.g. [6, 7, 5], where a local image content descriptor is computed at every interest point. These descriptors can then be used in a feature matching algorithm to find objects and object poses. How crucial point stability is depends on what is done with the points, e.g. on the choice of descriptor, but in general stable points should make the system more robust.

Among the detectors evaluated in Schmid et al. [11], the Harris operator was found to be the most stable one. This report evaluates two other methods for detecting interest points and compares them with the Harris operator. The experiments and philosophy are very much the same as in [11], and we refer to that reference for more details. Basically, different image transformations are applied, and the repeatability of points between a reference image and each of the transformed images is computed. In [11], the transformations are generated by actually moving the camera, and the homographies are estimated by registering the images using a pattern projected onto the scene by a projector.
We use a simpler approach in this paper: the transformations are simulated in the computer. This approach has the drawbacks that it introduces interpolation noise and that the simple camera model may be unrealistic. On the other hand, we do not have to estimate the homography.

We first describe the methods that are evaluated in this report, then explain some details of the experimental setup, and finally present the results.

2 Interest point detectors

This section gives a short description of the methods included in the evaluation. Two versions of the Harris operator are considered. The parameters and thresholds for the methods are chosen such that they are all based on roughly the same region size and give approximately the same number of points.

2.1 Harris, nms

The Harris function is computed as

  H = det(T) − α trace²(T),   (1)

where α = 0.04 and T is the structure tensor

  T = ∫ g(x) ∇I ∇I^T dx.   (2)

Here g is a Gaussian window function with σ = 2, and the image gradient ∇I is computed using differentiated Gaussian filters with σ = 1. Local maxima points are found by non-maximum suppression. All filters can be made separable.

2.2 Harris, subpixel

Same as the previous method, except that the local maxima are found with subpixel accuracy. A second order polynomial is fitted to the Harris image around each of the maxima pixels found by the previous method, and the local maximum of the polynomial gives the subpixel position.
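To make the two variants concrete, the following is a minimal sketch in Python/NumPy. It is our own illustration rather than the code used in the experiments: the threshold value is arbitrary, and the subpixel step uses a separable one dimensional quadratic fit along each axis as an approximation of the full second order polynomial fit described above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_response(image, sigma_d=1.0, sigma_i=2.0, alpha=0.04):
    """Harris function H = det(T) - alpha*trace^2(T), eqs (1)-(2)."""
    # Image gradient from differentiated Gaussian filters (sigma = 1).
    Ix = gaussian_filter(image, sigma_d, order=(0, 1))
    Iy = gaussian_filter(image, sigma_d, order=(1, 0))
    # Structure tensor elements averaged with a Gaussian window (sigma = 2).
    Txx = gaussian_filter(Ix * Ix, sigma_i)
    Tyy = gaussian_filter(Iy * Iy, sigma_i)
    Txy = gaussian_filter(Ix * Iy, sigma_i)
    return Txx * Tyy - Txy ** 2 - alpha * (Txx + Tyy) ** 2

def harris_nms(H, radius=3, threshold=1e-6):
    """Section 2.1: interest points as local maxima of H."""
    local_max = (H == maximum_filter(H, size=2 * radius + 1)) & (H > threshold)
    return np.argwhere(local_max)  # integer (row, col) positions

def harris_subpixel(H, points):
    """Section 2.2: refine each maximum with a quadratic fit per axis."""
    refined = []
    for r, c in points:
        rr, cc = float(r), float(c)
        if 0 < r < H.shape[0] - 1 and 0 < c < H.shape[1] - 1:
            # Newton step -H'/H'' of a parabola through three samples.
            d2r = H[r + 1, c] - 2 * H[r, c] + H[r - 1, c]
            d2c = H[r, c + 1] - 2 * H[r, c] + H[r, c - 1]
            if d2r < 0:
                rr -= 0.5 * (H[r + 1, c] - H[r - 1, c]) / d2r
            if d2c < 0:
                cc -= 0.5 * (H[r, c + 1] - H[r, c - 1]) / d2c
        refined.append((rr, cc))
    return np.array(refined)
```

With this separable fit, the correction along each axis is at most half a pixel, which is the scale of accuracy probed by the ε = 0.5 experiments below.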
[Figure 1: Local features detected by the tensor, from left to right: crossing, T-crossing, corner, non-parallel lines, and parallel lines.]

2.3 Fourth order tensors

This method uses the tensor representation explained in [9]. The tensor can represent one, two, or three line segments. If the tensor is reshaped to a matrix in a proper way, the number of line segments corresponds to the rank of the matrix; for example, a tensor representing a corner has rank two. Using this property, the tensor representation can be used as an interest point detector, where the interest points are the local features illustrated in figure 1. The detection process is explained in detail in [10], but the basics are:

1. Compute the image gradient, using a differentiated Gaussian filter with σ = 1 (same as for the Harris methods).

2. Improve the image gradient to suit the tensor representation using a method described in [4], where the responses from edges and lines are amplified and made more concentrated.

3. Estimate an orientation tensor T from the improved image gradient.

4. Estimate the fourth order tensor by applying a number of separable filters to the elements of the orientation tensor. These filters represent a subset of monomials up to the fourth order.

5. Reshape the fourth order tensor to a matrix and calculate a measure c_2 for rank two (a code sketch is given after this list):

  c_2 = (qt − 9d) / (3d − 3qt + t³),   where
  t = σ_1 + σ_2 + σ_3,   q = σ_1σ_2 + σ_2σ_3 + σ_3σ_1,   d = σ_1σ_2σ_3,

and σ_i are the three largest singular values of the matrix.

6. Select interesting tensors by picking each tensor corresponding to a local maximum in the rank-two image weighted with the tensor norm. The interest point position is then calculated as the crossing between the two line segments.
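As a sketch of step 5, the rank-two measure can be computed directly from the singular values of the reshaped tensor. The helper below follows our reading of the formula above; the guard against a vanishing denominator is our addition.

```python
import numpy as np

def rank_two_measure(M):
    """Rank-two measure c_2 of step 5 from the singular values of M.

    c_2 is 0 for a rank-one or isotropic rank-three matrix and reaches 1
    for a rank-two matrix with two equal singular values.
    """
    s = np.linalg.svd(M, compute_uv=False)[:3]  # three largest singular values
    t = s[0] + s[1] + s[2]                      # elementary symmetric polynomials
    q = s[0] * s[1] + s[1] * s[2] + s[2] * s[0]
    d = s[0] * s[1] * s[2]
    denom = 3 * d - 3 * q * t + t ** 3
    return (q * t - 9 * d) / denom if denom > 0 else 0.0
```

For example, singular values (1, 1, 0) give c_2 = 1, while (1, 0, 0) and (1, 1, 1) both give c_2 = 0, consistent with the intended use as a detector of two crossing line segments.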
2.4 Star patterns

The method we use to find star patterns is a combination of the ideas in [2, 3, 1]. It is explained in detail in [5, 4]. The basics are:

1. Compute the image gradient ∇I, using a differentiated Gaussian filter with σ = 1 (same as for the Harris methods).

2. Find star patterns as local maxima of the function

  S_star = ∫ g(x) ⟨∇I, x^⊥⟩² dx,   (3)

where g is a Gaussian window function with σ = 2. S_star is made more selective by inhibition with a measure for simple signals. Local maxima points are then found by non-maximum suppression.

3. Improve the point positions by minimizing the circle pattern function

  S_circle(p) = ∫ g(x) ⟨∇I, x − p⟩² dx,   (4)

computed around each of the local maxima points.

The algorithm needs to compute a subset of monomials (or derivatives) up to the second order on the three images I_x², I_y², and I_xI_y. All filters can be made separable.

3 Experimental setup

3.1 Repeatability criterion

The repeatability criterion is the same as in [11]; we give a short summary here. Let I_r denote the reference image and let I_i denote an image that has been transformed. Let {x_r} denote the interest points in the reference image I_r, and let {x_i} denote the interest points in the transformed image I_i. For two corresponding points x_r and x_i in images I_r and I_i we have

  x_i = H_ri x_r,   (5)

where H_ri denotes the homography between the two images (the points are here represented in homogeneous coordinates). As in [11], we remove the points that do not lie in the common scene part of images I_r and I_i. Let R_i(ε) denote the set of corresponding point pairs within ε-distance, i.e.

  R_i(ε) = {(x_r, x_i) | dist(H_ri x_r, x_i) < ε}.   (6)

The repeatability rate for image I_i is defined as

  r_i(ε) = |R_i(ε)| / min(n_r, n_i),   (7)

where n_r = |{x_r}| and n_i = |{x_i}| are the numbers of points detected in the common part of the two images. Note that 0 ≤ r_i ≤ 1.
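The criterion translates directly into code. The sketch below (Python/NumPy, our own helper rather than code from [11]) approximates the common scene part by keeping only the reference points whose mapped positions land inside the transformed image, whereas [11] restricts both point sets.

```python
import numpy as np

def repeatability(points_ref, points_i, H_ri, eps, shape_i):
    """Repeatability rate r_i(eps) of eqs (5)-(7).

    points_ref, points_i : (N, 2) and (M, 2) arrays of (x, y) positions.
    H_ri                 : 3x3 homography mapping reference points to image i.
    shape_i              : (height, width) of the transformed image.
    """
    if len(points_ref) == 0 or len(points_i) == 0:
        return 0.0
    # Map reference points into image i in homogeneous coordinates, eq (5).
    ones = np.ones((len(points_ref), 1))
    mapped = (H_ri @ np.hstack([points_ref, ones]).T).T
    mapped = mapped[:, :2] / mapped[:, 2:3]
    # Keep only points that land in the common part of the two images.
    h, w = shape_i
    inside = ((mapped[:, 0] >= 0) & (mapped[:, 0] < w) &
              (mapped[:, 1] >= 0) & (mapped[:, 1] < h))
    mapped = mapped[inside]
    if len(mapped) == 0:
        return 0.0
    # Eq (6): reference points with a detection within eps-distance.
    dists = np.linalg.norm(mapped[:, None, :] - points_i[None, :, :], axis=2)
    n_matched = int(np.count_nonzero(dists.min(axis=1) < eps))
    # Eq (7): normalize by the smaller point count.
    return n_matched / min(len(mapped), len(points_i))
```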
3.2 Transformation of scale, rotation and view

The homographies for the rotation, scale, and view transformations can be found in many textbooks, but we include a short derivation for the sake of completeness. The relation between a point X = (X, Y, Z)^T in the 3D world and the corresponding point x = (x, y)^T in the image is

  λ (x; 1) = P (X; 1),   (8)

where (x; 1) denotes the homogeneous vector obtained by appending a 1, and P = K [R|t] is the camera (projection) matrix. The matrix K contains the camera parameters. We assume the simple model

  K = [ f·I  x_0 ; 0  1 ]   (9)

(in block form, with I the 2×2 identity), where f is the focal length and x_0 is the origin of the image coordinate system. The matrix R and the vector t define the transformation of the 3D coordinate system. For the reference image we assume that the optical axis of the camera is orthogonal to the image plane in the 3D world and that the distance between the camera and the image is d. This gives R = I and t = 0, and from (8) we then get

  X = d K⁻¹ (x_r; 1).   (10)

We now use (8) and (10) to compute the relation between the point x_r in the reference image and the corresponding point x in another image taken with the camera in a different position, i.e. for a general choice of R and t. The relation becomes

  λ (x; 1) = K [R|t] (X; 1) = K R X + K t
           = K R d K⁻¹ (x_r; 1) + K t
           = (d K R K⁻¹ + [0|Kt]) (x_r; 1),   (11)

and we identify the general homography between the reference image and a transformed image as

  H = d K R K⁻¹ + [0|Kt],   (12)

where [0|Kt] denotes the 3×3 matrix whose first two columns are zero and whose third column is Kt. We get the following homographies for the special cases of rotation, scale, and view:

• Plane rotation:

  H = K R K⁻¹,  where R = [ cos ϕ  −sin ϕ  0 ; sin ϕ  cos ϕ  0 ; 0  0  1 ].   (13)

• Scale change:

  H = d·I + [0|Kt],  where t = (0, 0, d′)^T.   (14)

• Viewpoint change: Equation (12) with

  R = [ 1  0  0 ; 0  cos ϕ  −sin ϕ ; 0  sin ϕ  cos ϕ ],  t = (I − R)(0, 0, 1)^T.   (15)
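The three special-case homographies above translate directly into code. The sketch below assumes, as the experiment parameters listed next imply, that the reference distance is normalized to d = 1; the function names are ours.

```python
import numpy as np

def camera_matrix(f, x0, y0):
    """K of eq (9) with focal length f and image origin (x0, y0)."""
    return np.array([[f, 0.0, x0],
                     [0.0, f, y0],
                     [0.0, 0.0, 1.0]])

def rotation_homography(K, phi):
    """Plane rotation, eq (13): H = K R K^-1."""
    c, s = np.cos(phi), np.sin(phi)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return K @ R @ np.linalg.inv(K)

def scale_homography(K, c, d=1.0):
    """Scale change, eq (14): H = d I + [0|Kt] with t = (0, 0, d')^T."""
    d_prime = 1.0 / c - 1.0  # c is the resulting image scale factor
    H = d * np.eye(3)
    H[:, 2] += K @ np.array([0.0, 0.0, d_prime])
    return H

def view_homography(K, phi, d=1.0):
    """Viewpoint change, eqs (12) and (15): rotation about a scene point
    at distance d on the optical axis."""
    c, s = np.cos(phi), np.sin(phi)
    R = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
    t = (np.eye(3) - R) @ np.array([0.0, 0.0, d])
    H = d * K @ R @ np.linalg.inv(K)
    H[:, 2] += K @ t
    return H
```

As a quick check, scale_homography(K, c) maps the image origin x_0 to itself and scales the image around it by the factor c, consistent with the scale parameters listed below.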
The list below contains the data for the experiments:

• Number of test images: 6, see figure 2.

• Plane rotation: 18 images evenly spread between 0° and 180° (ϕ = πk/(N−1), k = 0, 1, …, N−1, N = 18). The first image is used as reference.

• Scale change: 9 images with a scale change (non-evenly spread) up to three times the original size (d′ = 1/c − 1, where c = 2k/(N−1) + 1, k = 0, …, N−1, N = 9). The first image is used as reference.

• View change: 21 images with a change in view between −45° and 45° (ϕ = πk/(4N), k = −N, …, N, N = 10). The middle image is used as reference.

• Two choices of ε are used: ε = 0.5 and ε = 1.5.

Figure 2 shows the test images used for the rotation, scale, and view transformations. They range from real images to synthetic images. Figure 3 shows an example of each of the transformations for one of the test images. The images have been expanded by zero padding before the transformation. This helps to avoid loss of points in the transformations (note, however, that the padding is not sufficient for the scale transformation).

[Figure 2: Test images for the rotation, scale, and view transformations. Panels: Toy car, Aerial image, Corner test image, Toy monastery, Picasso painting, Semi-synthetic room.]

[Figure 3: An example of the transformations rotation, scale, and view.]

3.3 Transformation of light

The test images for the light transformations are not simulated. These images are taken of 3D scenes with a stationary camera, and either the camera shutter or the light source is changed. Three sequences were taken, shown in figure 4. In the first sequence we change the camera shutter, and the middle image is used as reference. The last two sequences are taken by changing the position of the light source, and the first image is used as reference in both cases. The scenes are not planar, so we do not really have a ground truth in the last two cases. We believe that the evaluation is still relevant, however, since similar situations appear for example in object recognition applications, where the training data for an object is acquired under different lighting conditions than the query data, see e.g. [5].

[Figure 4: Test sequences for the light transformations. The first sequence (change in camera shutter) contains 13 frames; the last two sequences (change of light source position) contain 7 frames each.]

4 Experimental results

We show the average results over all test images, but we also show the individual results for each test image, since the results differ depending on the type of test image.

4.1 Comparison of the two Harris versions

From the results it was found that Harris with subpixel accuracy performed overall much better than Harris using only non-maximum suppression; figure 5 shows one example. The difference is most obvious for ε smaller than 1 pixel, as would be expected. Because of this result we only include the subpixel Harris from now on, to keep the presentation uncluttered.

[Figure 5: Average results of the two Harris versions on the test images in figure 2 for the rotation, scale, and view transformations (ε = 0.5 and ε = 1.5).]

4.2 Rotation, Scale, and View

Figure 6 contains the average results for all methods except Harris nms. The results are somewhat inconclusive, but if we examine each test image separately we see that subpixel Harris performs best for natural images. The star patterns and the fourth order tensors perform equally well or better than Harris on images that better resemble their models, i.e. straight lines and sharp corner points, especially for the scale transformation (figures 9 and 12).

[Figure 6: Average results on the test images in figure 2 for the rotation, scale, and view transformations.]

[Figures 7–12: Results on test images 1–6 for the rotation, scale, and view transformations.]

4.3 Variation of illumination

The results on the light transformation sequences are shown in figure 13. We conclude that the differences between the methods are not significant.

[Figure 13: Results on the light change sequences in figure 4.]

5 Conclusions and discussion

From the experiments we may conclude that the choice of operator depends on the image content. The star pattern method and the fourth order tensor assume a somewhat more advanced model than the Harris operator. These methods have the best performance when their corresponding models can describe the image content, as would be expected. If the model is less valid, it seems better to use a cruder model, as in the Harris operator.

The operators described here are intended to be used in object recognition tasks. Other interest point operators have been used in this application; one of the most successful methods in recent years is to find local maxima in DoG (Difference of Gaussians) scale space, see e.g. [6, 7]. These points were also evaluated (using the implementation by Lowe) on the test images in this report, but the results were poor. However, it is not fair to evaluate the stability of an operator that is computed at several scales. The larger scales may be less stable, but that may not matter if the descriptor is computed in a region that is proportional to the scale of the interest point.

Acknowledgments

We gratefully acknowledge the support from the Swedish Research Council through a grant for the project A New Structure for Signal Processing and Learning, and from the European Commission through the VISATEC project IST-2001-34220 [12].

References

[1] J. Bigün. Pattern recognition in images by symmetries and coordinate transformations. Computer Vision and Image Understanding, 68(3):290–307, 1997.

[2] W. Förstner. A framework for low level feature extraction. In Proceedings of the Third European Conference on Computer Vision, volume II, pages 383–394, Stockholm, Sweden, May 1994.

[3] Björn Johansson. Multiscale curvature detection in computer vision. Lic. Thesis LiU-Tek-Lic-2001:14, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, March 2001. Thesis No. 877, ISBN 91-7219-999-7.

[4] Björn Johansson and Anders Moe. Patch-duplets for object recognition and pose estimation. Technical Report LiTH-ISY-R-2553, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, November 2003.

[5] Björn Johansson and Anders Moe. Patch-duplets for object recognition and pose estimation. In Ewert Bengtsson and Mats Eriksson, editors, Proceedings SSBA'04 Symposium on Image Analysis, pages 78–81, Uppsala, March 2004. SSBA.

[6] David G. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV'99, 1999.

[7] David G. Lowe. Local feature view clustering for 3D object recognition. In Proc. CVPR'01, 2001.

[8] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In IEEE Conference on Computer Vision and Pattern Recognition, pages 257–263, June 2003.

[9] K. Nordberg. A fourth order tensor for representation of orientation and position of oriented segments. Technical Report LiTH-ISY-R-2587, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, February 2004.

[10] Klas Nordberg and Robert Söderberg. Detection and estimation of features for estimation of position. In Ewert Bengtsson and Mats Eriksson, editors, Proceedings SSBA'04 Symposium on Image Analysis, pages 74–77, Uppsala, March 2004. SSBA.

[11] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. Int. Journal of Computer Vision, 37(2):151–172, 2000.

[12] VISATEC project. URL: http://www.visatec.info.