MetaSearch
R. Manmatha,
Center for Intelligent Information Retrieval,
Computer Science Department,
University of Massachusetts, Amherst.
Introduction
• MetaSearch / Distributed Retrieval
  – Well-defined problem.
  – Language models are a good way to solve these problems.
• Grand Challenge
  – Massively distributed multi-lingual retrieval.
MetaSearch
• Combine results from different search engines.
  – Single database, or highly overlapping databases.
    » Example: the Web.
  – Multiple databases or multi-lingual databases.
• Challenges
  – Incompatible scores even if the same search engine is used for different databases.
    » Collection differences and engine differences.
  – Document scores depend on the query. Combination on a per-query basis makes training difficult.
• Current solutions involve learning how to map scores between different systems.
  – An alternative approach involves aggregating ranks.
Current Solutions for MetaSearch – Single Database Case
• Solutions
  – Reasonable solutions involve mapping scores, either by simple normalization, by equalizing score distributions, or by training.
  – Rank-based methods, e.g. Borda counts, Markov chains.
  – Mapped scores are usually combined using linear weighting (a sketch follows this slide).
  – Performance improvement of about 5 to 10%.
  – Search engines need to be similar in performance.
    » May explain why simple normalization schemes work.
• Other Approaches
  – A Markov chain approach has been tried. However, results on standard datasets are not available for comparison.
  – It shouldn't be difficult to try more standard LM approaches.
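A minimal sketch, in Python, of the "map scores, then combine with linear weighting" idea above: min-max normalize each engine's scores for a query, then fuse them with a weighted sum (CombSUM-style). The engine results and weights below are illustrative, not from the talk.

def minmax_normalize(results):
    """results: dict mapping doc_id -> raw score from one engine."""
    lo, hi = min(results.values()), max(results.values())
    if hi == lo:
        return {doc: 1.0 for doc in results}
    return {doc: (s - lo) / (hi - lo) for doc, s in results.items()}

def linear_combine(runs, weights):
    """runs: one dict (doc_id -> raw score) per engine.
    weights: one weight per engine, e.g. learned from training queries."""
    fused = {}
    for run, w in zip(runs, weights):
        for doc, score in minmax_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + w * score
    # Highest fused score first.
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Two hypothetical engines on the same query, with incompatible score scales.
engine_a = {"d1": 12.3, "d2": 9.1, "d3": 4.0}
engine_b = {"d2": 0.82, "d3": 0.75, "d4": 0.10}
print(linear_combine([engine_a, engine_b], weights=[0.6, 0.4]))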
Challenges – MetaSearch for Single Databases
• Can one effectively combine search engines which differ a lot in performance?
  – Improve performance even when using poorly performing engines? How?
  – Or use a resource-selection-like approach to eliminate poorly performing engines on a per-query basis.
• Techniques from other fields
  – Techniques from economics and the social sciences for vote aggregation may be useful (Borda count, Condorcet, ...); a Borda-count sketch follows this slide.
• LM approaches
  – Could possibly improve performance by characterizing the scores at a finer granularity than, say, score distributions.
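A minimal Borda-count sketch of the rank-aggregation idea mentioned above: each engine contributes a ranked list, a document earns (list length minus rank) points per list, and points are summed. The ranked lists below are illustrative.

def borda_count(ranked_lists):
    """ranked_lists: one list of doc_ids per engine, best first."""
    points = {}
    for ranking in ranked_lists:
        n = len(ranking)
        for rank, doc in enumerate(ranking):
            # Top document gets n points, next gets n - 1, and so on;
            # documents an engine did not return get no points from its list.
            points[doc] = points.get(doc, 0) + (n - rank)
    return sorted(points.items(), key=lambda x: x[1], reverse=True)

# Three hypothetical engines ranking the same query.
print(borda_count([["d1", "d2", "d3"],
                   ["d2", "d1", "d4"],
                   ["d2", "d3", "d1"]]))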
Multiple Databases
• Two main factors determine variation in document scores
– Search engine scoring functions.
– Collection variations which essentially change the IDF.
• Effective score normalization requires:
  – Disregarding databases which are unlikely to have the answer.
    » Resource selection.
  – Normalizing out collection variations on a per-query basis.
  – In practice, mostly ad hoc normalizing functions.
• Language Models
  – Resource descriptions already provide language models for collections.
  – Could use these to factor out collection variations (a sketch follows this slide).
  – Tricky to do this for different search engines.
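A minimal sketch of resource selection with collection language models, as suggested above: rank databases by the query's log-likelihood under a smoothed unigram model built from each resource description. The smoothing scheme (Jelinek-Mercer), its parameter, and the toy term counts are assumptions for illustration.

import math

def query_log_likelihood(query_terms, coll_counts, coll_size,
                         global_counts, global_size, lam=0.5):
    """Jelinek-Mercer smoothed query likelihood under one collection model."""
    score = 0.0
    for t in query_terms:
        p_coll = coll_counts.get(t, 0) / coll_size
        p_global = global_counts.get(t, 0) / global_size
        score += math.log(lam * p_coll + (1 - lam) * p_global + 1e-12)
    return score

def rank_collections(query_terms, collections, global_counts, global_size):
    """collections: name -> (term_counts, total_terms); best database first."""
    scored = [(name, query_log_likelihood(query_terms, counts, size,
                                          global_counts, global_size))
              for name, (counts, size) in collections.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Toy resource descriptions: per-collection term counts and sizes.
colls = {"news":   ({"bank": 50, "river": 5}, 10000),
         "nature": ({"bank": 10, "river": 80}, 10000)}
background = {"bank": 60, "river": 85}
print(rank_collections(["river", "bank"], colls, background, 20000))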
Multi-lingual Databases
• Normalizing scores across multiple databases.
– Difficult Problem
• Possibility (a rough sketch follows this slide):
– Create language models for each database.
– Use simple translation models to map across databases.
– Use this to normalize scores.
– Difficult.
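A minimal sketch of the simple-translation-model idea above: expand the query into target-language terms via a table of translation probabilities, then score each target-language collection model with the translated query. The translation table, probabilities, and counts are purely illustrative assumptions.

import math

def translate_query(query_terms, trans_table):
    """trans_table: source term -> {target term: P(target | source)}."""
    weights = {}
    for s in query_terms:
        # Fall back to the untranslated term if no translations are known.
        for t, p in trans_table.get(s, {s: 1.0}).items():
            weights[t] = weights.get(t, 0.0) + p
    return weights

def score_collection(trans_weights, coll_counts, coll_size, eps=1e-6):
    """Weighted log-likelihood of the translated query under one collection model."""
    return sum(w * math.log(coll_counts.get(t, 0) / coll_size + eps)
               for t, w in trans_weights.items())

# Hypothetical English -> French translation table and a French collection.
table = {"house": {"maison": 0.9, "foyer": 0.1}}
q = translate_query(["house"], table)
print(score_collection(q, {"maison": 40, "foyer": 5}, 10000))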
Distributed Web Search
• Distribute web search over multiple sites/servers.
– Localized / regional.
– Domain dependent.
– Possibly no central coordination.
– Server selection / database selection, with or without explicit queries.
• Research Issues
– Partial representations of the world.
– Trust, Reliability.
• Peer to peer.
Challenges
• Formal Methods for Resource Descriptions, Ranking, and Combination
  – Example: language modeling.
  – Beyond collections as big documents.
• Multi-lingual retrieval
  – Combining the outputs of systems searching databases in many languages.
• Peer to Peer Systems
  – Beyond broadcasting simple keyword searches.
  – Non-centralized.
  – Networking considerations, e.g. availability, latency, transfer time.
• Distributed Web Search
• Data, Web Data.