* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Slide 1
Survey
Document related concepts
Transcript
Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 7 (book chapter 9): Parallel and Distributed IR Alexander Gelbukh www.Gelbukh.com Previous Chapter: Conclusions How to accelerate search? Same results as sequential Ideas: Quick-and-dirty rejection of bad objects, 100% recall Fast data structure for search (based on clustering) Careful check of all found candidates Solution: mapping into fewer-D feature space Condition: lower-bounding of the distance Assumption: skewed spectrum distribution Few coefficients concentrate energy, rest are less important 2 Previous Chapter: Research topics Object detection (pattern and image recognition) Automatic feature selection Spatial indexing data structures (more than 1D) New types of data. What features to select? How to determine them? Mixed-type data (e.g., webpages, or images with sound and description) What clustering/IR methods are better suited for what features? (What features for what methods?) Similar methods in data mining, ... 3 The problem Very large document collections Google: 4,000,000,000 pages Slow response? Solution: parallel computing Google: 10,000 computers 4 Parallel architectures Data stream Instruction stream Single Multiple Single SISD classical SIMD simple Multiple MISD (rare) MIMD many SISD 5 MIMD architecture The most common Can be tightly coupled loosely coupled Distributed Many computers interacting via network PC Clusters Similar to MIMD computers, but greater cost of communication very loosely coupled More coarse-grained programs 6 Performance improvement Time: speedup S Ideally, N times (number of processors) In practice impossible The problem does not decompose into N equal parts Communication and control overhead < 1 / f, where f is the largest separable fraction of the problem Cost Per processor: S / N 7 Two approaches to parallelism Build new algorithms E.g., neural nets Naturally parallel Problem: to define the retrieval task Adapt the existing techniques to parallelism Allows relying on well-studied approaches We will consider this option 8 Ways to use parallelism Multitasking N search engines Good for processing many queries Problems: A single query is not speeded up Bottleneck: disk access (index) Possible solution: replicating (part of) data. RAIDs Parallel algorithms IR = data. Main question: how to partition the data Document / index term matrix (terms can be LSI dimensions, signature bits, etc) 9 Possible partitionings Horizontal: document partitioning. Union of results Vertical: term partitioning. Basically, intersect results 10 Inverted files: Logical partitioning Logical vs. physical document partitioning Logical: for each term, use pointers into inverted file data for each processor, to indicate its portion 11 Inverted files: Logical partitioning Construction and updating Also parallel Construction Assign docs to processors Order docs such that each processor has an interval Process in parallel Merge. Each piece is ordered already 12 Inverted files: Physical document partitioning Several separate collections, one per processor Separate indices Then the lists are merged (they are already ordered) Priority queue is used The result is not sorted; Insertion is quick The maximal element can be found quickly First k elements can be found rather quickly Details in the book Consistent scores are needed Global statistics is needed. Can be computed at index time 13 Logical or physical partitioning? Logical requires less communication Faster Physical is more flexible. Simpler implementation Simpler conversion of existing systems 14 Inverted files: Term partitioning Each processor processes a part of the inverted file The results are intersected (for AND) (or as appropriate for Boolean operations, OR and NOT) When term distribution in user queries is skewed, then document partitioning is better When uniform, term partitioning is better. Twice for long queries, 5 – 10 times for short (Web-like) 15 Suffix arrays Array construction can be parallelized merges are parallel Document partitioning is applied straightforwardly Each processor maintains its own suffix array Term partitioning can be applied Each processor owns a branch of the tree (lexicographic interval) Bottleneck: all processors need access to the entire text 16 17 Signature files Document partitioning: straightforward Create query signature, distribute to each processor Merge results (using Boolean operations if needed) Term partitioning: shorter signatures Merging and eliminating false drops is slow This method is not recommended 18 SIMD computers Single Instruction, Multiple data Uncommon Good for simple operations Bit operations in signature files Details in the book Ranking is supported in hardware in some computers If signature file does not fit into memory, can be processed in batches I/O overhead Use multiple queries with the same batch This improves throughput, but not response time 19 … SIMD computers Inverted files are difficult to adapt to SIMD The inverted file is restructured Details in the book 20 Distributed IR MIMD with Slow communication Not all nodes are used for a given query Encryption issues Document partitioning is usually used Term partitioning imposes greater communication overhead Document clustering can be useful (to distribute docs by processors) Index clusters and then search only the best ones Another approach: use training queries, then similarity of the user query to these 21 Research topics How to evaluate the speedup New algorithms Adaptation of existing algorithms Merging the results is a bottleneck Meta search engines Creating large collections with judgements Is recall important? 22 Conclusions Parallel computing can improve response time for each query and/or throughput: number of queries processed with same speed Document partitioning is simple good for distributed computing Term partitioning is good for some data structures Distributed computing is MIMD computing with slow communication SIMD machines are good for Signature files Both are out of favor now 23 Thank you! Till May 17? 18?, 6 pm 24