An Empirical Study on Large-Scale Content-Based Image Retrieval
Group Meeting, Presented by Wyman, 23-1-2007

Introduction
• Content-based Image Retrieval (CBIR)
  – The process of searching for digital images in large databases based on image content
  – It generally consists of four modules:
    • data acquisition and processing
    • feature representation
    • data indexing
    • query and feedback processing

Introduction
• Content-based Image Retrieval (CBIR)
  – Extensively studied in both academia and industry
  – Yet traditional data indexing techniques do not scale to high-dimensional data
  – Image content lies in a high-dimensional feature space
  – Thus traditional CBIR systems are inefficient

Contribution
• Propose a scalable CBIR scheme by applying locality-sensitive hashing (LSH)
  – LSH is well suited to indexing high-dimensional data
• Conduct a comprehensive empirical evaluation of CBIR over half a million images
  – There are very few empirical studies of CBIR systems at this scale
• Address some challenges in building scalable CBIR systems on large-scale data
  – Some innovative ideas are suggested for tackling these issues

Hashing Algorithm
• Hashing is commonly employed for indexing large-scale databases
  – Due to its fast database lookup capability: O(1) on average
• However, a conventional hash index does not support similarity search
  – It is built for finding an identical copy of the query key, not for finding near neighbors

Locality-Sensitive Hashing
• An emerging indexing algorithm
  – Proposed to solve the high-dimensional near-neighbor search problem in Euclidean space (l2)
  – It can answer queries in sublinear time
  – Each near neighbor is reported with a fixed probability
• Principles
  – Two close points are likely to share the same hash value
  – By looking into the hash bucket of the query point, we obtain many near neighbors of the query point
  – A large fraction of the data points is never examined

Locality-Sensitive Hashing
• We employ the E2LSH (Exact Euclidean LSH) package
• The probability of finding near neighbors is controlled by two parameters, L and k
  – L: the number of hash functions g(v) = (h1(v), …, hk(v))
    • Larger L increases the probability of finding all R-near neighbors
  – k: larger k reduces the chance of hitting data points that are not R-near neighbors
• Each h(v) is controlled by two parameters, a and b; two points receive the same h if their projection a·v + b falls within the same interval [(n-1)w, nw)
  – a: a d-dimensional vector whose entries are chosen independently from a Gaussian distribution
  – b: a real number chosen uniformly from the range [0, w]

Main Problem of E2LSH
• E2LSH is a memory-based implementation
  – All data points and data structures are stored in main memory
  – The maximum database size is limited by the amount of free main memory available

Our Scalable Implementation
• We propose a multi-partition indexing approach
  – We divide the whole database into multiple partitions
  – Each partition is small enough to be loaded into main memory
  – Queries are then processed on each partition in turn

The Detailed Procedure
1. Divide the database into n partitions, where n = ceiling(database size / maximum partition size).
2. Run E2LSH indexing on each partition to create its hash tables.
3. Given a query q, load the pre-built hash tables T of one partition into main memory.
4. Run the E2LSH query on T to retrieve the top-k ranking images with respect to q.
5. Repeat steps 3 and 4 until all partitions have been processed.
6. Merge the results from all partitions and return the top-k ranking results with respect to q.
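As an illustration of the ideas above, here is a minimal sketch of E2LSH-style hash functions h(v) = floor((a·v + b)/w) and the multi-partition top-k merge. All function names and signatures here are hypothetical; the actual E2LSH package exposes a different API.

```python
import heapq
import math
import random

def make_h(d, w):
    """One LSH hash function h(v) = floor((a.v + b) / w).

    a: d-dimensional vector with i.i.d. Gaussian entries;
    b: real number drawn uniformly from [0, w).
    Points whose projection a.v + b lands in the same interval
    [(n-1)w, nw) receive the same hash value."""
    a = [random.gauss(0.0, 1.0) for _ in range(d)]
    b = random.uniform(0.0, w)
    return lambda v: math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)

def make_g(d, w, k):
    """g(v) = (h1(v), ..., hk(v)); concatenating k hashes makes buckets
    more selective, reducing hits on points that are not R-near."""
    hs = [make_h(d, w) for _ in range(k)]
    return lambda v: tuple(h(v) for h in hs)

def query_partitions(partitions, query, top_k, query_partition):
    """Multi-partition query: run the per-partition query on each
    partition in turn, then merge all candidate (id, distance) pairs
    into a single global top-k list (smallest distance first)."""
    candidates = []
    for part in partitions:
        candidates.extend(query_partition(part, query, top_k))
    return heapq.nsmallest(top_k, candidates, key=lambda c: c[1])
```

The merge step mirrors step 6 of the procedure: because each partition already returns its local top-k, the global top-k is guaranteed to be among the collected candidates.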
Some Critical Issues
• Disk-access overhead for loading the hash tables into main memory
• We can consider parallel solutions in future work to overcome this overhead and speed up the overall system

Feature Representation
• We represent an image with three types of visual feature: color, shape, and texture
  – For color, we use Grid Color Moment
  – For shape, we employ an edge direction histogram
  – For texture, we adopt the Gabor feature
• In total, a 238-dimensional feature vector represents each image in our image database

Empirical Evaluation
• Experimental Testbed
  – A testbed containing 500,000 images crawled from the Web
  – 5,000 images from the COREL image data set are used as the query set
    • Contains 50 categories
    • Each category consists of exactly 100 randomly selected images
    • Every category represents a different semantic topic, such as butterfly, car, cat, dog, etc.

Performance Metrics
• Two standard performance metrics: precision and recall
  – An image is relevant if it is one of the 100 COREL images in the database sharing the same category as the query image
• Efficiency is evaluated by the average CPU time elapsed per query

Experimental Setup
• Form a query set of 1,000 image examples by randomly selecting 20 images from each of the 50 COREL image categories
• Prepare 10 image databases of different sizes, ranging from 50,000 to 500,000 images
• A database of size N contains:
  – 5,000 COREL images
  – N − 5,000 other images selected from our testbed
• Extract the image features using the techniques discussed above

Experimental Setup
• Perform a series of experiments with our LSH-indexed CBIR system on all 10 databases
  – LSH parameters: L = 550 and k = 34
  – Retrieve the top-20 and top-50 ranking images for each query image
  – Calculate recall and precision
• Simulate two rounds of relevance feedback
  – Re-query the database with relevant examples with respect to the given query
• Environments
  – 3.2 GHz Intel Pentium 4 PC with 2 GB memory running Linux kernel 2.6
  – All implementations are written in C++
• The same experiments are repeated with exhaustive search (ES) indexing

Average Precision of TOP20
• The LSH results are very close to the ES results
• Their maximum difference is no more than 5% at any database size

Average Recall of TOP50
• Average recall and average precision of our CBIR system decrease with the database size, yet the rate of decrease diminishes as the database grows

Average Query Time
• The query time for ES is linear in the database size, while that for LSH is sublinear

Time Performance of LSH over ES on Different Databases
• The LSH approach is much faster than the ES solution, with an average speedup greater than 4x
• The gap in time performance between them grows as the database size increases

Conclusion and Future Work
• Proposed a scalable CBIR scheme based on a fast high-dimensional indexing technique, LSH
• Conducted extensive empirical evaluations on a large testbed of half a million images
  – Our scalable CBIR system is more efficient than traditional exhaustive linear search methods
  – Our system scales to large-scale CBIR applications
• Addressed some limitations and challenges in our current solution
• Will consider parallel solutions in future work

Q&A
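For reference, the precision and recall metrics described in the evaluation (an image is relevant if it shares the query's COREL category, and each category has exactly 100 images in the database) can be computed as in this small sketch; the function name is illustrative.

```python
def precision_recall(retrieved, relevant, num_relevant_total=100):
    """Precision@k and recall@k for one query.

    retrieved: ranked list of image ids returned by the system (top-k);
    relevant: set of ids sharing the query image's COREL category;
    num_relevant_total: 100 in this study (the COREL category size)."""
    hits = sum(1 for img in retrieved if img in relevant)
    precision = hits / len(retrieved)
    recall = hits / num_relevant_total
    return precision, recall
```

For example, if 8 of the top-20 retrieved images share the query's category, precision@20 is 8/20 = 0.4 and recall@20 is 8/100 = 0.08, which is why recall values in the TOP50 plots stay well below the corresponding precision values.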