Lecture 6: Multimedia IR: Indexing and Searching

Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 6 (book chapter 12): Multimedia IR: Indexing and Searching
Alexander Gelbukh
www.Gelbukh.com

Previous Chapter: Conclusions
- Basically, images are handled through the text that describes them
- Namely, feature vectors (or feature hierarchies)
- Context can be used, when available, to determine features
- Also, queries by example are common
- From the point of view of DBMS, integration with IR and multimedia-specific techniques is needed
- Object-oriented technology is adequate

Previous Chapter: Research topics
- How can the similarity function be defined?
- What features of images (video, sound) are there?
- How to better specify the importance of individual features? (Give me similar houses: similar = size? color? structure? architectural style?)
- How to determine the objects in an image?
- Integration with DBMSs and SQL for fast access and rich semantics
- Integration with XML
- Ranking: by similarity, taking into account history, profile

The problem
- Data examples:
  - 2D/3D color/grayscale images: e.g., brain scans, scientific databases of vector fields
  - (2D) video
  - (1D) voice/music; (1D) time series: e.g., financial/marketing time series; DNA/genomic databases
- Query examples:
  - find photographs with the same color distribution as this one
  - find companies whose stock prices move like this one
  - find brain scans with the texture of a tumor
- Applications: search; data mining

Solution
- Reduce the problem to search for multi-dimensional points (feature vectors, but vector space is not used)
- Define a distance measure
  - for time series: e.g., Euclidean distance between vectors
  - for images: e.g., color distribution (Euclidean distance); another approach: mathematical morphology
- Other features as vectors
- For search within distance, the vectors are organized in R-trees
- Clustering plays an important role

Types of queries
- All within given distance
  - Find all images that are within distance 0.05 from this one
- Nearest-neighbor
  - Find 5 stocks most similar to IBM
- All pairs within given distance
  - Further: clustering
- Whole object vs. sub-pattern match
  - Find parts of image that are...
  - E.g., in 512 × 512 brain scans, find pieces similar to the given 16 × 16 typical X-ray of a tumor
  - Like passage retrieval for text documents

Neighbor and pairs types of queries
- The objects are organized in R-trees
- For neighbor queries: branch-and-bound algorithm
- For pairs: recently discovered algorithms
- These types of queries are not discussed here

Desiderata for a method
- Fast
  - No sequential search with all objects
- Correct
  - 100% recall
  - Precision is less important, though false alarms should be kept few; they are easy to discard manually
- Little space overhead
- Dynamic
  - easy to insert, delete, update

Types of methods
- Linear quadtrees
  - Complexity = hypersurface of the query region
  - Grows exponentially with dimensionality
- Grid files
  - Complexity grows exponentially with dimensionality
- R-tree methods, such as R*-trees
  - Most used due to lower complexity

R-tree
- Objects and parts of images are represented as Minimal Bounding Rectangles (MBRs)
  - Can overlap for different objects
- Larger objects contain smaller objects
  - MBRs are nested
- MBRs are arranged into a tree
- In storage, an index of disk blocks is maintained
  - Disk blocks are fetched at once at hardware level
- For better insertion/deletion, tight MBRs are needed
  - Good clustering is needed

File structure of R-tree
- Corresponds to disk blocks
- Fanout = 3: number of parts to group

R-tree (figure)

Search in R-tree
- Range queries: find objects within distance ε from the query object
  = Find MBRs that intersect with the query's MBR
- Determine the MBR of the query
- Descend the tree
- Discard all MBRs that do not intersect with the query's MBR
- Many variations of the R-tree method have been proposed

Indexing
- Only whole-match queries are considered here
  - Given a collection of objects and a distance function
  - Find objects within given distance ε from given object Q
- Problems:
  1. Slow comparison of two objects
  2. Huge database
- GEMINI approach
  - GEneric Multimedia object INdexIng
  - Attempts to solve both problems

GEMINI indexing
- Quick-and-dirty test to quickly discard bad objects
- Uses clusters to avoid sequential search
- Quick test
  - Single-valued feature, e.g., average for series. Averages differ much ⇒ objects differ much
  - Not vice versa. False alarms are OK
  - Several features, but fewer than all data. E.g., deviation for series

Algorithm
- Map the actual objects into f-dimensional feature space
- Use clusters (e.g., R-trees) to search
- Retrieve objects, compute the actual distances, and discard false alarms

Feature selection
- Features should reflect distances
- Allow no misses (100% recall)
  - features should make things look closer
- Lower Bound lemma:
  - If distance in feature space ≤ actual distance, then 100% recall
  - (we speak about whole-match queries)
  - Holds for distance search, nearest-neighbor, pair search

Algorithm (more detail)
- Determine distance
- Choose features
- Prove that distance in feature space ≤ distance for actual objects
- Use quick method (R-tree) to search in feature space
- For found objects, compute the actual distances (this can be expensive)
- Discard false alarms
  - objects with greater actual distances, even if in feature space the distance is OK
  - Example: similar averages, but different series

Discussion
- The method does NOT improve quality
  - Provides SAME quality as sequential search, but faster
- Distance definition requires domain/application expert
  - How much do the two images differ?
  - What is important/unimportant for the specific application?
- Feature selection requires a good knowledge engineer
  - Choose the most characteristic feature: discriminative
  - If needed, choose the second best, etc.
  - Good features should be orthogonal: combination adds info

Example: Time series
- In yearly stock movements, find ones similar to IBM
- Distance: Euclidean (365-D vectors); others exist
- Features:
  - First feature is average
  - If needed, Discrete Fourier Transform (DFT) coefficients
  - Or Discrete Cosine Transform, wavelet transform, etc.
- Lower-bound lemma:
  - Parseval theorem: DFT preserves distances (DCT, WT too)
  - First several coefficients give ≤ distance
  - Transforms "concentrate energy" in the first coefficients
  - Thus, the more realistic prediction of distance

Time series: Applications
- Such feature selection is effective for many skewed spectrum distributions
- Colored noises: the energy decreases as F^(-b)
  - b = 0: white spectrum: unpredictable. Method useless.
  - b = 1: pink noise: works of art
  - b = 2: brown noise: stock movements
  - b > 2: black noise: river levels, rainfall patterns
- The greater b, the better the first coefficients of the transform predict the actual distance
- Some other n-D signals show similar properties
  - JPEG compression ignores higher coefficients

Time series: Performance
- Fewer features ⇒ more false alarms ⇒ time lost
- More features ⇒ more complex computation
- Optimal number of features proves to be about 1..3
  - for skewed enough distributions
  - JPEG compression shows that photographs have it

Time series: Sub-pattern search
- Use sliding window
- Encode each window with few features

Example: Color images
- Give me images with a texture of tumor like this one
- Give me images with blue at top and red at bottom
- Handles color, texture, shape, position, dominant edges

Color images: Color representation
- Compute color histogram
- Distance: use color similarity matrix
- Very expensive computationally: cross-talk between features (compare all to all features)

Color images: Feature mapping
- The GEMINI question again: What single feature is the most representative?
- Take average R, G, B
- Lower-bound?
  - Yes: Quadratic Distance Bounding theorem

Automatic feature selection
- Features can be selected automatically
- In texts: Latent semantic indexing (LSI)
- Many methods
- Principal component analysis (= LSI), ...
- In fact, they can reduce features, but not define them
  - Of colors, one can select characteristic combinations
  - But not classify into faces and flowers
- So description of the objects is still on human researchers

Research topics
- Object detection (pattern and image recognition)
- Automatic feature selection
- Spatial indexing data structures (more than 1D)
- New types of data
  - What features to select? How to determine them?
- Mixed-type data (e.g., webpages, or images with sound and description)
- What clustering/IR methods are better suited for what features? (What features for what methods?)
- Similar methods in data mining, ...

Conclusions
- How to accelerate search? Same results as sequential
- Ideas:
  - Quick-and-dirty rejection of bad objects, 100% recall
  - Fast data structure for search (based on clustering)
  - Careful check of all found candidates
- Solution: mapping into fewer-D feature space
  - Condition: lower-bounding of the distance
  - Assumption: skewed spectrum distribution
  - Few coefficients concentrate energy, rest are less important

Thank you!
Till Tuesday 11, 6 pm
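
Supplementary note on the Lower Bound lemma. The slides state the lemma only in words, so it may help to write it out. The notation below is introduced here for illustration and is not from the lecture: D is the actual distance, D_feat the distance in feature space, F the feature mapping, and X_f, Y_f the DFT coefficients of series x and y under the orthonormal (1/sqrt(n)) scaling. Only ε and Q appear in the slides. This is a sketch of the standard argument, not a full proof.

```latex
% Lower Bound lemma (whole-match range query with radius \varepsilon):
% if the feature distance never exceeds the actual distance,
\[
  D_{\mathrm{feat}}\bigl(F(O_1), F(O_2)\bigr) \le D(O_1, O_2)
  \quad \text{for all objects } O_1, O_2,
\]
% then every true answer survives the filter (100% recall):
\[
  D(O, Q) \le \varepsilon
  \;\Longrightarrow\;
  D_{\mathrm{feat}}\bigl(F(O), F(Q)\bigr) \le \varepsilon .
\]
% For time series of length n with the orthonormal DFT
\[
  X_f = \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t \, e^{-2\pi i f t / n},
\]
% Parseval's theorem gives equality over all n coefficients, and
% keeping only the first k of them can only drop non-negative terms:
\[
  \sum_{t=0}^{n-1} \lvert x_t - y_t \rvert^2
  = \sum_{f=0}^{n-1} \lvert X_f - Y_f \rvert^2
  \ge \sum_{f=0}^{k-1} \lvert X_f - Y_f \rvert^2 .
\]
```

The last inequality is exactly the lower-bounding condition for Euclidean distance with the first k DFT coefficients as features. With k = 1 the single feature is the series sum divided by sqrt(n), which is the average multiplied by sqrt(n).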
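
Supplementary sketch of the GEMINI steps. The algorithm slides (map objects into feature space, search there, compute actual distances, discard false alarms) can be illustrated for the time-series example as follows. This is a minimal sketch under stated assumptions, not the lecture's implementation: the function names, the toy data, and the plain linear scan over feature vectors (standing in for the R-tree) are all hypothetical choices made here.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[complex], b: Sequence[complex]) -> float:
    """Euclidean distance; works for real series and complex features."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))


def dft_features(series: Sequence[float], k: int) -> List[complex]:
    """First k coefficients of the orthonormal DFT (scaled by 1/sqrt(n)).

    With this scaling Parseval's theorem holds with equality over all n
    coefficients, so the distance on the first k is a lower bound.
    """
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
            for t, x in enumerate(series)) / math.sqrt(n)
        for f in range(k)
    ]


def gemini_range_query(
    database: Sequence[Sequence[float]],
    query: Sequence[float],
    eps: float,
    k: int = 2,
) -> Tuple[List[int], List[int]]:
    """Return (candidate indices, final hit indices) for a range query."""
    q_feat = dft_features(query, k)
    # Quick-and-dirty test in feature space. A real system would store the
    # feature vectors in an R-tree; a linear scan is used here (assumption).
    candidates = [
        i for i, s in enumerate(database)
        if euclidean(dft_features(s, k), q_feat) <= eps
    ]
    # Careful check: compute the actual distances, discard false alarms.
    hits = [i for i in candidates if euclidean(database[i], query) <= eps]
    return candidates, hits


# Toy data (hypothetical): series 2 and 3 have the same average as the
# query but a different shape; series 4 has a very different average.
QUERY = [1, 2, 3, 4, 5, 6, 7, 8]
DATABASE = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 2, 3, 4, 5, 6, 7, 9],
    [8, 7, 6, 5, 4, 3, 2, 1],
    [4.5] * 8,
    [10] * 8,
]

if __name__ == "__main__":
    for k in (1, 2):
        cands, hits = gemini_range_query(DATABASE, QUERY, eps=1.5, k=k)
        print(f"k={k}: candidates={cands}, hits={hits}")
```

With k = 1 the only feature is the scaled average, so the reversed and the constant series pass the filter as false alarms and are removed only in the refinement step ("similar averages, but different series"). Adding a second coefficient prunes them already in feature space, while the final hits stay the same, as the lower-bounding condition requires.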
