Download Slide 1

Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 7 (book chapter 9): Parallel and Distributed IR Alexander Gelbukh www.Gelbukh.com Previous Chapter: Conclusions  How to accelerate search? Same results as sequential  Ideas:  Quick-and-dirty rejection of bad objects, 100% recall  Fast data structure for search (based on clustering)  Careful check of all found candidates  Solution: mapping into fewer-D feature space  Condition: lower-bounding of the distance  Assumption: skewed spectrum distribution  Few coefficients concentrate energy, rest are less important 2 Previous Chapter: Research topics     Object detection (pattern and image recognition) Automatic feature selection Spatial indexing data structures (more than 1D) New types of data.  What features to select? How to determine them?  Mixed-type data (e.g., webpages, or images with sound and description)  What clustering/IR methods are better suited for what features? (What features for what methods?)  Similar methods in data mining, ... 3 The problem  Very large document collections  Google: 4,000,000,000 pages  Slow response?  Solution: parallel computing  Google: 10,000 computers 4 Parallel architectures Data stream Instruction stream Single Multiple Single SISD classical SIMD simple Multiple MISD (rare) MIMD many SISD 5 MIMD architecture  The most common  Can be  tightly coupled  loosely coupled  Distributed  Many computers interacting via network  PC Clusters  Similar to MIMD computers, but greater cost of communication  very loosely coupled  More coarse-grained programs 6 Performance improvement Time: speedup S  Ideally, N times (number of processors)  In practice impossible  The problem does not decompose into N equal parts  Communication and control overhead  < 1 / f, where f is the largest separable fraction of the problem Cost  Per processor: S / N 7 Two approaches to parallelism  Build new algorithms  E.g., neural nets  Naturally parallel  Problem: to define the retrieval task  Adapt the existing techniques to parallelism  Allows relying on well-studied approaches  We will consider this option 8 Ways to use parallelism  Multitasking  N search engines  Good for processing many queries Problems:  A single query is not speeded up  Bottleneck: disk access (index)  Possible solution: replicating (part of) data. RAIDs  Parallel algorithms  IR = data. Main question: how to partition the data  Document / index term matrix (terms can be LSI dimensions, signature bits, etc) 9 Possible partitionings  Horizontal: document partitioning. Union of results  Vertical: term partitioning. Basically, intersect results 10 Inverted files: Logical partitioning  Logical vs. physical document partitioning  Logical: for each term, use pointers into inverted file data for each processor, to indicate its portion 11 Inverted files: Logical partitioning Construction and updating  Also parallel Construction  Assign docs to processors  Order docs such that each processor has an interval  Process in parallel  Merge. Each piece is ordered already 12 Inverted files: Physical document partitioning     Several separate collections, one per processor Separate indices Then the lists are merged (they are already ordered) Priority queue is used     The result is not sorted; Insertion is quick The maximal element can be found quickly First k elements can be found rather quickly Details in the book  Consistent scores are needed  Global statistics is needed. Can be computed at index time 13 Logical or physical partitioning?  Logical requires less communication  Faster  Physical is more flexible. Simpler implementation  Simpler conversion of existing systems 14 Inverted files: Term partitioning  Each processor processes a part of the inverted file  The results are intersected (for AND)  (or as appropriate for Boolean operations, OR and NOT)  When term distribution in user queries is skewed, then document partitioning is better  When uniform, term partitioning is better.  Twice for long queries, 5 – 10 times for short (Web-like) 15 Suffix arrays  Array construction can be parallelized  merges are parallel  Document partitioning is applied straightforwardly  Each processor maintains its own suffix array  Term partitioning can be applied  Each processor owns a branch of the tree (lexicographic interval)  Bottleneck: all processors need access to the entire text 16 17 Signature files  Document partitioning: straightforward  Create query signature, distribute to each processor  Merge results (using Boolean operations if needed)  Term partitioning: shorter signatures  Merging and eliminating false drops is slow  This method is not recommended 18 SIMD computers  Single Instruction, Multiple data  Uncommon  Good for simple operations  Bit operations in signature files  Details in the book  Ranking is supported in hardware in some computers  If signature file does not fit into memory, can be processed in batches  I/O overhead  Use multiple queries with the same batch  This improves throughput, but not response time 19 … SIMD computers  Inverted files are difficult to adapt to SIMD  The inverted file is restructured  Details in the book 20 Distributed IR  MIMD with  Slow communication  Not all nodes are used for a given query  Encryption issues  Document partitioning is usually used  Term partitioning imposes greater communication overhead  Document clustering can be useful (to distribute docs by processors)  Index clusters and then search only the best ones  Another approach: use training queries, then similarity of the user query to these 21 Research topics     How to evaluate the speedup New algorithms Adaptation of existing algorithms Merging the results is a bottleneck  Meta search engines  Creating large collections with judgements  Is recall important? 22 Conclusions  Parallel computing can improve  response time for each query and/or  throughput: number of queries processed with same speed  Document partitioning is simple  good for distributed computing  Term partitioning is good for some data structures  Distributed computing is MIMD computing with slow communication  SIMD machines are good for Signature files  Both are out of favor now 23 Thank you! Till May 17? 18?, 6 pm 24

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Slide 1