A System for Large-scale, Content-based Web Image Retrieval - and the Semantics within
Till Quack
Thursday, May 27, 2004

Task
• Create a content-based image retrieval system for the WWW
• Large-scale: one order of magnitude larger than existing systems, i.e. O(10^6) items
• Relevance feedback
• Explore and exploit the semantics within
• Take large-scale, content-based image retrieval one step closer to commercial applications

Outline
• Content-based Image Retrieval on the WWW
• PART I: A System for Image Retrieval on the WWW
  • Features
  • Retrieval
  • Relevance Feedback
  • Software Design
• PART II: The Semantics within
  • Identifying a Method to find Semantics
  • Data Mining for Semantic Clues
  • Frequent Itemset Mining and Association Rules
  • The Visual Link
• Discussion & Demonstration
• Conclusions & Outlook

Content-based Image Retrieval on the WWW
• Characteristics of the data repository
  • Size: 4.2 billion documents in Google's index
  • Diversity: documents in any context and language
  • Control: anybody can publish anything
  • Dynamics: ever changing
• System requirements: FAST, SCALABLE, and make use of all the information available
• Motivation for a new system: existing systems are either pure text (Google) or pure content-based and not large-scale

PART I: A System for Large-scale, Content-based Image Retrieval on the WWW
Ullrich Moenich, Till Quack, Lars Thiele

System Overview
[Diagram: an Image Spider crawls the WWW starting from DMOZ data; offline, keywords and visual features are extracted, keywords are indexed in a mySQL inverted index (keyid | imageid), and the visual features are partitioned into clusters 1..n; at retrieval time a keyword request returns matching images, the user picks relevant ones, and a nearest neighbor search over the clustered features refines the results.]

Visual Features describe the Images
• Global features from the MPEG-7 standard
• Currently no
segmentation; reasons: scalability and the diversity of the data
• Texture features
  • Edge Histogram Descriptor (EHD): histogram of quantized edge directions, 80 dimensions
  • Homogeneous Texture Descriptor (HTD): output of a Gabor filter bank, 62 dimensions
• Color features
  • Scalable Color Descriptor (SCD): color histogram, 256, 128, 64 or 32 dimensions
  • Dominant Color Descriptor (DCD): up to 8 dominant colors (3D color space) and their percentages - 32 "dimensions", with the "bins" defined per image

Collateral Text as an additional Feature
• ALT tag and collateral text around images
• VERY uncontrolled annotation
• Stemming with the Porter stemmer, e.g. training -> train
• More matching terms for boolean queries, but also some new ambiguities, e.g. train: to train [verb] / the train [noun]

Retrieval in 2 Steps
[Diagram: 1. text retrieval - a keyword request is answered from the inverted index (keyid | imageid) and returns matching images; 2. visual nearest neighbor search - the user picks relevant images and a nearest neighbor search over the clustered visual features refines the matches.]

Retrieval: Text Options
• Boolean query on inverted index
• Vector space model
• LSI etc.
Choice
• Ranked boolean queries on the inverted index
• Ranking: tf*idf
• Reasons: speed, and sparsity of the data (600 000 keywords in total; one document has only 10-50 words)

  Keyword  Keyid  ImageId  tf
  shoe     124    1233     1
  sport    341    1233     1
  red      345    1233     1
  banana   445    1234     1
  fruit    75     1234     2
  order    875    1234     1

Retrieval - Visual Features (MPEG-7)
• K-Nearest Neighbor search (K-NN): find the K closest candidates c_i to the query image q in a vector space
• Distance: Minkowski metrics for the distance d(c_i, q), namely the L1 and L2 norms
• Most MPEG-7 descriptors are high-dimensional vectors, so the "dimensionality curse" applies: high-dimensional spaces behave "weirdly", and in particular the distances are not too meaningful

Retrieval - Challenges for Visual Features
• We have several (visual) feature types. How can we combine them?
• Our database is very large. How can we search it fast enough, i.e. how can we avoid comparing the query vector with each database entry?

A Combined Distance for the MPEG-7 Features
• We use a combined distance over all the visual feature types
• The individual distances occupy different ranges in different distributions
• The distributions are transformed to a normal distribution in the range [0,1]
• The distances are then combined linearly

Clustering speeds up the search
• Problem: millions of items in the DB; a linear search over the whole dataset is too slow, and we are only looking for the K nearest neighbors anyway
• (One) solution: partition the data into clusters, each identified by a representative, the centroid; only search the cluster whose centroid is closest to the query q
• K-Means clustering algorithm: not the best, in particular in high-dimensional spaces, but fast!
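The cluster-pruned K-NN search just described can be sketched in a few lines of Python. This is an illustrative toy, not the Cortina implementation: the plain-list feature store, the function names, and the sample data are all assumptions.

```python
import math

def l2(a, b):
    # Euclidean (L2) Minkowski distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(q, centroids):
    # Pick the cluster whose centroid is closest to the query vector q
    return min(range(len(centroids)), key=lambda i: l2(q, centroids[i]))

def knn_in_cluster(q, clusters, centroids, k):
    # Search only the single closest cluster instead of the whole database
    c = nearest_centroid(q, centroids)
    members = clusters[c]  # list of (image_id, vector) pairs
    members_sorted = sorted(members, key=lambda m: l2(q, m[1]))
    return [image_id for image_id, _ in members_sorted[:k]]

# Toy data: two clusters around (0,0) and (10,10)
centroids = [[0.0, 0.0], [10.0, 10.0]]
clusters = [
    [(1, [0.1, 0.2]), (2, [0.3, -0.1]), (3, [1.0, 1.0])],
    [(4, [9.9, 10.1]), (5, [10.5, 9.5])],
]
print(knn_in_cluster([0.2, 0.1], clusters, centroids, 2))  # [1, 2]
```

Only the members of one cluster are ever compared with the query, which is exactly why a query near a cluster border can miss true nearest neighbors - the problem the next slide addresses.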
Problem with Clustering
• A query at the border of a cell does not find all the nearest neighbors
• Simple solution: overlapping clusters
• Problem: scalability - original data 7 GB, overlapping data 50 GB

  Imageid  Primary Descriptor  Secondary Descriptor 1  Secondary Descriptor 2  Secondary Descriptor 3
  122      ehd                 htd                     scd                     dcd
  45233    ehd                 htd                     scd                     dcd
  6688     ehd                 htd                     scd                     dcd

Relevance Feedback Improves the Results
• Relevance feedback: user input improves the search results iteration by iteration, i.e. the user selects "good matches"
• We obtain the following information:
  1. A new query vector, a combination of the relevant images = query vector movement
  2. The ratios for the combination of the feature types

Relevance Feedback: Query Vector Movement
• Construct the query vector q_n from the images selected in iteration n, e.g. as the average over the relevant images: q_n,k^f = (1/M) * sum_{i=1..M} c_i,k^f, with vector component k, feature type f (EHD, SCD, HTD), and relevant images i = 1...M
• The final, new query vector is q = 0.75 * q_n + 0.25 * q_{n-1}, i.e. move from the old query vector towards the new vector

Relevance Feedback: Weight Adaptation
• Which feature is most important for the given query? The one for which all the relevant images are closest
• Determine the ratios for the combination from the average distance per feature type, e.g. weight the EHD inversely to the average EHD distance of the relevant images, and set the weights so that they sum to 1

Implementation - Software and Hardware
• Languages: C++ and Perl, with Inline::CPP to connect the layers
• WWW: Apache and CGI
• Relational DB: mySQL
• Operating system: OS X
• Hardware: dual 2 GHz Apple G5, 2 GB RAM, Teran terabyte disk array

Part II: The Semantics Within

Semantics: Combining Text and Visual Features
• Our dataset is multi-modal: keywords and several visual features
• This is not only valid for WWW data: video (image + speech), bio-imagery (image + microscope settings, cell coloring fluid)
• Goal: try to jointly use the different modes. Do semantic relations between the modes exist?
• Learn something about these semantic relations
• Improve the retrieval precision based on them
• Challenges in our project: large scale, noisy and uncontrolled data, only global visual features

Identifying a Method to find the Semantics
• Related work
  • Latent Semantic Indexing (LSI) [Westerveld 2000]: problem O(N^2 m^3), N = documents + terms, m = concept space
  • Statistical models [Barnard, Forsyth 2001-2004]: problem "several hours for several thousand images"; problem: it is a (rather strict, hierarchical) model
  • Others: neural networks (SOM etc.), Hidden Markov Models
• Often: classification. But we don't know our classes - or there are just too many - and we can't train them either (the data is too diverse and noisy)
• Most of the methods above were only tested on relatively small, supervised datasets
• There is one more option ...

Method: Data Mining for Semantic Clues
• Mine the data for patterns, find them only where they exist, and deduce rules from them
• Scalable methods available: Frequent Itemset Mining and Association Rules
• Classic applications: market baskets, census data ...
• Some works on multimedia data
  • [Zaïane 98]: datacubes with appended keywords
  • [Tešić et al.
03]: perceptual associations (texture) within images

Frequent Itemsets and Association Rules
• Itemset I, transaction T, database D
• Support of an itemset A: supp(A) = |{T in D : A ⊆ T}| / |D|
• A is called frequent if supp(A) >= minsupp
• Rule A -> B
• Support of a rule: supp(A -> B) = supp(A ∪ B) - its statistical significance
• Confidence of a rule: conf(A -> B) = supp(A ∪ B) / supp(A) - the strength of the implication; the maximum likelihood estimate that B is true given that A is true

Example & Advantages
• Example: market baskets, with the rule {Diaper, Milk} -> {Beer}

  TID  Items
  1    Bread, Milk
  2    Beer, Diaper, Bread, Eggs
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Bread, Diaper, Milk

• Advantages: human readable, can be edited, fast algorithms available
• Note: associations are not correlations - the same concept, just simpler; on associations and correlations see [Brin, Motwani, Silverstein 98]

Using FIMI to find the Itemsets
• Frequent Itemset Mining (FIMI): find frequent itemsets with support > minsupp; the minimal support minsupp is given by "an expert"
• First algorithm: APriori [Agrawal et al. 93]
  • Basic idea: if an itemset is frequent, all its subsets must be frequent (monotonicity)
  • k passes over the dataset for itemsets of length k; ~O(knp) for n transactions, p items, itemsets of length k
• Today's algorithms rely on the same basic principle but are much faster (main reason: data structures), usually with only 2 database passes and ~linear runtime
• State-of-the-art algorithm overview: FIMI'03. We used fpmax* [Grahne, Zhu: Nov 03]

Diapers and Beer !!?
• Application to the domain of multimedia data:
  1. Formulate images as transactions
  2. Low-level clusters serve as a dimensionality reduction for the visual features
  3. We find associations of visual features (clusters) and keywords
  4. From these associations we deduce semantic rules
• Advantages: comparably low computational complexity; other data sources can be integrated in the same manner (e.g.
long-term relevance feedback)
• Challenges: noisy, uncontrolled data; associations within keywords are much stronger than associations between keywords and visual features; uneven distribution of cluster sizes (a K-Means problem)

Characteristics of the Itemsets and Rules
• There are associations
  • Within text: {shoe} -> {walk}
  • Within visual clusters: {EHD 14} -> {SCD 12}
  • Between text and visual clusters: {shoe} -> {EHD 14}
• Measure for interestingness, or choice of rules from the frequent itemsets: confidence? statistical criteria? background knowledge? (Example: pregnant -> woman, 100% confidence)
• Our "background knowledge": rules that connect keywords and low-level features are more interesting. Since this is known, the mining can be adapted and made even faster

Exploiting the Itemsets and Rules

Selecting Interesting Low-Level Clusters based on Rules
• Clusters were introduced to partition the visual feature vector data and search only certain clusters
• Problem: we miss certain nearest neighbors if the images for a concept are spread over several clusters
• Unsatisfactory solution: overlapping clusters
• But association rules might find and solve this situation: clusters are re-united if the number of images for the concept in both clusters is > minsupp
• Example: {shirt} -> {ehd249, ehd310} reunites these clusters for the initial keyword query "shirt"!
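The support and confidence behind such keyword-to-cluster rules can be illustrated with a brute-force miner over toy image transactions. Everything here is illustrative - the cluster labels mirror the {shirt} example above, and a real system would use an algorithm like fpmax* rather than enumerating all combinations:

```python
from itertools import combinations

# Each image is a transaction of its keywords plus its visual-cluster labels
# (labels like "ehd249" are illustrative, echoing the example above)
transactions = [
    {"shirt", "ehd249", "scd12"},
    {"shirt", "ehd310", "scd12"},
    {"shirt", "ehd249"},
    {"shoe", "ehd14"},
    {"shoe", "walk", "ehd14"},
]

def support(itemset, db):
    # Fraction of transactions containing every item of the itemset
    return sum(1 for t in db if itemset <= t) / len(db)

def frequent_itemsets(db, minsupp, max_len=3):
    # Brute-force enumeration; fast miners exploit monotonicity instead
    items = sorted(set().union(*db))
    result = {}
    for n in range(1, max_len + 1):
        for combo in combinations(items, n):
            s = support(set(combo), db)
            if s >= minsupp:
                result[frozenset(combo)] = s
    return result

def confidence(antecedent, consequent, db):
    # conf(A -> B) = supp(A u B) / supp(A)
    return support(antecedent | consequent, db) / support(antecedent, db)

print(confidence({"shirt"}, {"ehd249"}, transactions))  # ~0.67: 2 of 3 shirt images
```

With minsupp = 0.4, {shirt, ehd249} comes out frequent, which is the kind of keyword-cluster itemset the rules above are deduced from.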
• This is scalable - unlike overlapping clusters
• Another benefit: more images labeled with the original keyword are "injected" into the results of the K-NN search
• Currently: one keyword as the high-level semantic concept. Future: find high-level semantic concepts by mining associations within text first

The Visual Link
• Another contribution, NOT related to Frequent Itemset Mining and Association Rules ...
• Since the search concept suggests visual nearest neighbor search with relevance feedback after the initial keyword search, it would be nice to have a diverse selection of images for a given keyword on the first page of results: images sorted not only by keyword ranking, but also based on visual feature information
• Basic idea: for a given keyword query, build groups of images that are visually close; larger groups are more important; show only one representative per group

The Visual Link: A Graph-Based Approach
• Let I(Q) be the set of images matching a keyword query Q
• Define a graph G(V,E) over I(Q): images are visually linked (joined by an edge) if the distance between them is lower than a given threshold
• Do a connected component analysis to find the connected components C
• For each component C find the "best" representative r_C
• Re-rank the results based on the representatives r_C

The Visual Link: An Example

The Visual Link: An Approximation
• Problem: the distance calculations for the graph take too long; the clusters cannot be used, and loading individual vectors takes a lot of time
• Solution: an approximate distance. Idea: if two images are in the same cluster and in the same distance range to the centroid, the probability that they are "close" is high
• New definition of visually linked: in the same cluster and in the same range of relative distance to its centroid
• This can be encoded in a relational DB!
• And it comes at nearly no extra cost in creation

  Imageid  Clusterid  2ndClusterid  Reldist
  1        221        122           0.6
  2        342        345           0.8
  3        223        42            0.2
  4        12         126           0.4

Discussion & Demo

Discussion: Precision
• Measuring the quality of such a large-scale system is difficult
• A precision/recall measure is not possible, since the ground truth is not known (C: correct results, D: desired results, A: actual results)
• We measure the precision based on user questioning

Before we continue ... some numbers
• Number of images: 3 006 660
• Size of image data: 111 GB
• Feature extraction: 15 days (dual 2 GHz CPU, 2 GB RAM)
• Number of distinct keywords: 680 256
• Size of the inverted keyword index table: 50 260 345 lines
• MySQL database size: 23 GB

And now ... the moment you've all been waiting for ... The Demo of Cortina

Conclusions
• A system with over 3 million items was implemented - probably the largest CBIR system to date?
• A retrieval concept was introduced: a keyword query followed by relevance feedback and visual nearest neighbor search; superior to existing retrieval concepts (query by keyword or query by example)
• Data mining to explore and exploit semantics in large-scale systems was introduced

Questions

Outlook
• Many extensions and improvements are possible
• Segmentation - or maybe rather some simple tiling
• Indexing: K-Means should be replaced; suggestion: a VA-File based approach [Manjunath, Tesic 03]
• Association rule mining: a multilevel approach - first keywords for high-level semantic concepts, then visual features

Thanks
Ullrich Moenich and Lars Thiele

Which Rules are of Interest?
• There are associations
  • Within text: {shoe} -> {walk}
  • Within visual clusters: {EHD 14} -> {SCD 12}
  • Between text and visual clusters: {shoe} -> {EHD 14, SCD 12}
• There are long and short rules: short rules have higher support by the nature of the problem; long rules contain more (precise) information about the semantics
• Measure for interestingness, or choice of rules from the frequent itemsets: confidence? statistical criteria? background knowledge? (Example: pregnant -> woman)

Characteristics and Challenges
• Chosen criteria
  • Mainly interested in rules {keywords} -> {visual feature clusters} (our "background knowledge")
  • Support, confidence
  • Mine long and short rules
• Restriction of the problem: mine for frequent itemsets per keyword, i.e. over all images (= transactions) for a given keyword. This means we avoid being distracted by associations within keywords, and the method is made even more scalable
• The keyword serves as a placeholder for a semantic concept, but a keyword does not always stand for a single semantic concept
• Proposal for future versions - a multi-level approach: first mine {keywords} -> {keywords} rules to identify "real" semantic concepts, then mine itemsets per identified concept

Characteristics of the Itemsets and Rules - Overall

Why keyword filtering of the results does not work

Proposal: Semantic Clusters
• Ultimate goal: search some kind of "semantic clusters" instead of visual feature clusters
• Proposal based on the approach of Ester et al. 2002, 2003: clustering based on frequent itemsets, originally for text; clustering criterion: minimize overlap
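The per-keyword restriction from "Characteristics and Challenges" - mine only over the transactions of one keyword - can be sketched as follows. The data is a toy; a real miner such as fpmax* would replace the brute-force counting, and the labels are illustrative:

```python
from collections import Counter
from itertools import combinations

# Each image: a set of keywords and visual-cluster labels (illustrative data)
images = [
    {"shirt", "ehd249", "scd12"},
    {"shirt", "ehd310", "scd12"},
    {"shirt", "ehd249", "htd7"},
    {"shoe", "ehd14"},
]

def mine_per_keyword(images, keyword, minsupp_count=2, max_len=2):
    # Restrict the transaction database to the images matching one keyword,
    # then count itemsets over the remaining items. This sidesteps the strong
    # keyword-keyword associations and shrinks the mining problem per concept.
    db = [img - {keyword} for img in images if keyword in img]
    counts = Counter()
    for t in db:
        for n in range(1, max_len + 1):
            for combo in combinations(sorted(t), n):
                counts[combo] += 1
    return {c: n for c, n in counts.items() if n >= minsupp_count}

# Frequent co-occurring labels among the "shirt" images only
print(mine_per_keyword(images, "shirt"))
```

Here ehd249 and scd12 each occur in two of the three "shirt" transactions and survive the absolute minimum-support count, while the unrelated "shoe" image never enters the mining at all.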