Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Digital Libraries: What Should We Expect from Search Engines Dr. John M. Lervik, CEO FAST ECDL 2003 Fast Search & Transfer (FAST) • Fast Search & Transfer (FAST) – Search technology company founded 1997 (background from NTNU) – ~ 200 employees – Profitable and well funded (> $90m cash) – Oslo Stock Exchange: 'FAST.NO' • Strong focus on publishing Norway – Publishers: Elsevier, LexisNexis, etc. Chicago San Francisco Boston Washington London Paris Munich Rome – Digital Libraries: Norwegian Nat’l Library, Univ Library Bielefeld, etc. FAST is the creator of the real-time integrated search and filter solution behind the scenes at some of the world's best known companies with the world's most demanding search problems Tokyo Outline • What is a search engine? • Third Generation Search Engines: Architecture • Third Generation Search Engines: Relevance • Example: Scirus • Summary Digital Library Challenges • Digital libraries face an information management challenge – Huge and increasing amount of digital data – Data/content aggregation, data store (repository), information retrieval & discovery, etc • Increasing amount of digital data – Media types: Books, magazines, CDs, ... – Media formats: Text/numbers (incl metadata), audio files, images, video – Must support various access patterns, copyright, etc • Need flexible and efficient interfaces between information and users • Search engine as digital information management platform Third Generation Search Engines: Architecture Traditional View of Search Engines - 1st and 2nd Generation Analyze Content Analyze Query Matching Unstructured Data User Query Result Set Search Engines vs Databases... • • • Search Engines now also provide: • Data consistency • Fault tolerance/high availability • Distributed architectures • Low latency incremental indexing • ... • 3-D scalability – Data volume: 10 – 100 TB – Number of users: > 1,000 QPS – Latency: < 1 sec data input and query latency Data & content diversity – Both structured and unstructured data – Manage multiple data & content repositories Understand content and users – Linguistic methodology to understand textual content – Query analysis On-the-fly data analysis – Reduce dependency on upfront data modeling – Powerful content discovery and result navigation Databases: Transaction processing, Structured data, Upfront data modeling Search Engines: Aggregate & Retrieve data, Structured & Unstructured data, On-the-fly data analysis Search Engine How It Works Open, modular, scalable architecture TUNING, ADMINISTRATION Web Content WEB CRAWLER Pipeline CONNECTORS FILTER QUERY & RESULT PROCESSING Databases DATABASE CONNECTOR DOCUMENT PROCESSING CONNECTORS Multimedia FILE TRAVERSER Query Pipeline SEARCH Files, Documents Custom Applications Pipeline Vertical Applications Portals Results Custom Front-Ends Alert Mobile Devices Content Push Index Files Search Engine How It Works • Connect to content sources and get data – – – – Web pages (e.g. XML, HTML, WML): Crawler Files, documents (e.g. Word, Excel, pdf): File traverser Database content (e.g. Oracle, DB2): Database connectors Applications (e.g. Notes, Exchange, CMS/DMS): Application connectors TUNING, ADMINISTRATION Web Content SEARCH Databases Custom Applications DATABASE CONNECTOR DOCUMENT PROCESSING CONNECTORS Multimedia FILE TRAVERSER Query Pipeline FILTER CONNECTORS Files, Documents Pipeline QUERY & RESULT PROCESSING WEB CRAWLER Vertical Applications Portals Results Custom Front-Ends Alert Mobile Devices Content Push Index Files Search Engine How It Works • Analyze and index content to make it searchable – Convert and process content through pre-processing pipeline: • Lemmatization, entity extraction, taxonomy classification, ontology • Custom logic (e.g. adding special tags) – Write content to index files TUNING, ADMINISTRATION Web Content WEB CRAWLER Pipeline CONNECTORS FILTER QUERY /RESULT PROCESSING Databases DATABASE CONNECTOR DOCUMENT PROCESSING CONNECTORS Multimedia FILE TRAVERSER Query Pipeline SEARCH Files, Documents Custom Applications Pipeline Vertical Applications Portals Results Custom Front-Ends Alert Mobile Devices Content Push Index Files Search Engine How It Works • Analyze query – Use query language or query API – Convert and process query through query pipeline: • Linguistic processing • Custom logic (e.g. query term modification/addition) TUNING, ADMINISTRATION Web Content WEB CRAWLER SEARCH Custom Applications FILTER CONNECTORS Databases DATABASE CONNECTOR DOCUMENT PROCESSING CONNECTORS Multimedia FILE TRAVERSER Query Pipeline QUERY PROCESSING Files, Documents Pipeline Vertical Applications Portals Results Custom Front-Ends Alert Mobile Devices Content Push Index Files Search Engine How It Works • Match query to content index – Query- and content adaptive matching – Exploit all information and structure in the data TUNING, ADMINISTRATION Web Content FILTER Pipeline CONNECTORS Databases DATABASE CONNECTOR DOCUMENT PROCESSING CONNECTORS Multimedia FILE TRAVERSER Query Pipeline SEARCH Files, Documents Custom Applications Pipeline QUERY /RESULT PROCESSING WEB CRAWLER Vertical Applications Portals Results Custom Front-Ends Alert Mobile Devices Content Push Index Files Search Engine How It Works • Return results to user – Convert and process results through result pipeline: • Resort, filter for security, organize for dynamic drilldown – Pass results on to application (generated or through API) – Push results to alert engine and then external environment (e.g. mail, queue) TUNING, ADMINISTRATION Web Content SEARCH Databases DATABASE CONNECTOR FILTER Pipeline CONNECTORS Multimedia FILE TRAVERSER DOCUMENT PROCESSING CONNECTORS Files, Documents Custom Applications Query Pipeline RESULT PROCESSING WEB CRAWLER Vertical Applications Portals Results Custom Front-Ends Alert Mobile Devices Content Push Index Files Search Engine Features Relevant, Organized Information • Linguistic Analysis – – – – – – Auto-language detection Natural language processing Approximate matching (spelling) Lemmatization (grammar) Entity extraction, anti-phrasing Multiple dictionaries, thesauri • Taxonomy and Classification – – – – – – Structured, unstructured data Supervised, unsupervised categorization Dynamic classification Auto-taxonomy generation (terms, Web) Taxonomy toolkit Ontologies • Tuning relevancy – – – – Absolute and relative query boosting Relative document boosting Custom processing logic (pre-index, query) Business Manager’s Control Panel • Powerful Search Language – – – – – Exact matches, wildcards, multiple terms “more like this” (query by example), “near” Text, integer, Boolean expressions Integer comparisons (>, , =, <, , ) Infinite level of parentheses • Flexible Search and Sort – – – – – Range searching Default sort, sort by field Static, dynamic teasers, any field Full inclusion, exclusion URI control Robot aware • Navigation and Drill-Down Tools – – – – Structure, unstructured data Dynamic drill-down (faceted browsing) Results-based binning Statistical analysis Search Engines: An Ideal Information Management Platform • Software system for overall information management – – – – Universal data aggregation: public & proprietary Central content repository: source data & metadata, ... Efficient information access (through seach interface) – including push (alert) Powerful data mining and discovery Third Generation Search Engines: Relevance Relevance Model 3rd Generation Search Engines • Algorithmic – Mining • Preparing unstructured information for intelligent information discovery • Information/entity extraction, structural document analysis, classification, NLP, ... – Matching • Tuning of recall: Spell-check, query analysis, linguistic analysis ... – Ranking • Tuning of precision • Assigning of key parameters – The CASQF framework • Relevancy benchmark framework – Optimization through machine learning - NLP – Navigation / Information discovery • Extensive result analysis through LiveAnalytics • Adaptive answering/discovery capability Not only query results; Also supporting query Result driven info discovery • Rule Based – Rules: Override algorithmic search, preferred content – Automatic detection of “correct answers” Relevance Mining: Content Refinement • Algorithmic – Mining • Preparing unstructured information for intelligent information discovery • Information/entity extraction and structural document analysis, classification, NLP, ... – Matching • Tuning of recall: Spell-check, query analysis, linguistic analysis ... – Ranking • Tuning of precision • Assigning of key parameters – The CASQF framework • Relevancy benchmark framework – Optimization through machine learning - NLP – Navigation / Information discovery • Extensive result analysis through LiveAnalytics • Adaptive answering/discovery capability Not only query results; Also supporting query Result driven info discovery • Rule Based – Rules: Override algorithmic search, preferred content – Automatic detection of “correct answers” Relevance Mining: Entity Extraction • What is it? – Techniques to automatically detect, extract and normalize ”entities” from documents’ text, e.g., names of people or companies, noun phrases, product names or codes, adresses, chemical compounds, acronyms, ... • Why bother? – Make unstructured data more structured • Enhance relevance (precision searching:”it” vs. ”IT”, boosting based on entities, improved vectorization, ...) • Enables navigation, i.e., attributes to drill down along – Dictionary compilation • Spellchecking, classification, query associations, etc. – Improve classification quality: entities are more specific and less ambiguous • How is it done? – Realized as document processors that annotate/modify the indexed document and/or otherwise persist the entities. – Guided by dictionaries, syntactical cues, grammars and/or document structure Entities... PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes. The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said. A first round loser last year, Williams is hoping to progress beyond the quarter-finals for the first time in her career. ... As Opposed to Words 11, 25, 3, 3, 6, 6, 65, a, a, advance, american, and, and, anything, aside, at, become, being, beyond, bianka, blustery, breezed, brushing, career, center, champion, court, d, do, finals, first, first, first, for, french, french, garros, german, her, here, here, hoping, i, i, i, in, in, into, is, lamade, last, loser, love, love, love, million, minutes, monday, more, of, on, open, open, open, paris, past, progress, quarter, raced, reuters, roland, round, round, s, said, second, second, seed, seeded, than, the, the, the, the, the, the, the, the, the, time, to, to, to, to, u, venus, well, williams, williams, wimbledon, year (Wordlist of article above) Automatically Extracted Entities Relevance Mining: Structural Document Analysis • Benefits – Sensitive to Journal of Cancer Research Issue 5, 2003 -12 Investigations in E. coli • Content • Structure • Formatting – Document classification • Improve precision in information discovery – Document segmentation • Segmentation used in tunable relevance model • Enabling contextual entity extraction B. C. Abracadabra Department of Molecular Medicine University of Wisconsin S. Miheev Analytical Laboratory Russian Academy of Scieneces Moscow Abstract In this study we investigate……… 1. Introduction 2. Materials and Methods 9. References [1] Abracadabra, B.C., Discovery of E. coli for Genetic Research, Conf. Canc. 1997, 231-245 [2] Tomason T., Latest Developments in Cancer Research, Int J Med, 1999, 23, 12-16 [3] Zoralek Q.W., Geneteics. A Personal View. Int. Conf. Gen., 1993, 3-12 Relevance Mining: Structural Document Analysis Journal Title Journal of Cancer Research Issue 5, 2003 -12 Investigations in E. coli Article Title B. C. Abracadabra Department of Molecular Medicine University of Wisconsin Author Section Text Block Types Section Heading S. Miheev Analytical Laboratory Russian Academy of Scieneces Moscow Abstract In this study we investigate……… 1. Introduction “Readable” Text paragraph 2. Materials and Methods Bibliography heading Bibliography line Complex Block Types 9. References [1] Abracadabra, B.C., Discovery of E. coli for Genetic Research, Conf. Canc. 1997, 231-245 [2] Tomason T., Latest Developments in Cancer Research, Int J Med, 1999, 23, 12-16 [3] Zoralek Q.W., Geneteics. A Personal View. Int. Conf. Gen., 1993, 3-12 Doc Type Relevance Matching: Recall optimization • Algorithmic – Mining • Preparing unstructured information for intelligent information discovery • Information/entity extraction, structural document analysis, classification, NLP... – Matching • Tuning of recall: Spell-check, query analysis, linguistic analysis ... – Ranking • Tuning of precision • Assigning of key parameters – The CASQF framework • Relevancy benchmark framework – Optimization through machine learning - NLP – Navigation / Information discovery • Extensive result analysis through LiveAnalytics • Adaptive answering/discovery capability Not only query results; Also supporting query Result driven info discovery • Rule-Based – Rules: Override algorithmic search, preferred content – Automatic detection of “correct answers” Relevance Matching: Reasons for inadequate matching • Ortographical analysis (incl. typos + other spelling variations) – Typos (in queries, documents) – Official variants: E.g. German (Dutch) spelling reform • Morphological analysis (incl. all forms of a given word) – Handled via linguistic normalization (lemmatization) • Syntactic analysis (including handling natural language) – Entity/phrase extraction, anti-phrases, etc. • Semantic analysis (understanding the intention behind the words) – Handled by a combination of general and specific thesauri and ontologies, as well as automatic phrasing, and anti-phrasing, etc. Reasons for Inadequate Matching Ortographical Analysis hewlet packard hewlettpackard hewlitt packard hewllet packard hewlwtt packard hewett packard hewitt packard hewlette packard hewlett packerd hewelett packard hawlett packard hewlet-packard hewlett pakard hewllett packard hewlit packard hewlett parkard helwett packard hewletpackard hewlett packart hewlitt-packard hewlett pachard hewlett pacard hewlett packhard hewlett packert hewlet packerd hewelt packard hewlet pakard hewelet packard hawlet packard hwelett packard heweltt packard hewellet packard hewlatt packard helwet packard hewleet packard hewlwt packard hewlwtt-packard helett packard Hewlittpackard hawlett-packard hewlett parckard hewlet packart hulett packard hewlert packard hewlet pacard hewletpackard com hewlet pachard .... • And quite a few more variations... • Taking this into account lead to a 500% increase in recall! • This observation applies to all queries involving proper names, product names, technical terms, etc Relevance Matching: Linguistic Query Analysis Tokenizer Make sure - special characters are treated correctly - on demand: no lower casing - etc.. Character Normalization Language specific stop word lists - configurable for other languages - can be switched off at query time Tokenizer Phrasing Spellchecker Anti-Phrasing & Stopwords Natural Language Analysis Normalization NLQ Adaptive Query Evaluation Baseform red. Synonyms Temp. Morph/syn Expansion Customer’s QT Adaptive Query Evaluation Select: Content Ranking profiles Lemmatization + Synonym: Reduction to base form, represented symbolically: <lang> <concept> <lemma> 012 319827 002 For lemmatization / synonym dictionaries added by customer without wanting to re-index: Query Expansion E.g. special thesaurus support - for narrower & broader terms - for special phonetic search - …. Relevance Matching: Adaptive Query Evaluation General Queries Hybrid Queries Specific Queries ‘New York’ ‘C source code downloads’ ‘HP printer driver LP 6j’ Content Format Reference Relevance Ranking: The CASQF Framework • Algorithmic – Mining • Preparing unstructured information for intelligent information discovery • Information/entity extraction and structural document analysis, classification, NLP... – Matching • Tuning of recall: Spell-check, query analysis, linguistic analysis ... – Ranking • Tuning of precision • Assigning of key parameters – The CASQF framework • Relevancy benchmark framework – Optimization through machine learning - NLP – Navigation / Information discovery • Extensive result analysis through LiveAnalytics • Adaptive answering/discovery capability Not only query results; Also supporting query Result driven info discovery • Rule Based – Rules: Override algorithmic search, preferred content – Automatic detection of “correct answers” Relevance Ranking: The CASQF Framework • Completeness – How well does the query match superior contexts like the title or the url? – Example: query=”Mexico”, Is ”Mexico” or ”University of New Mexico” best? • Authority – Is the document considered an authority for this query? – Examples: Web link cardinality, article references, product revenue, page impressions, ... • Statistics – How well does the contents of this document on overall match the query? – Examples: Proximity, context weights, tf-idf, degree of linguistic normalization,++ • Quality – What is the quality of the document? – Examples: Homepage?, Entry point to product group?, Press release?, ... • Freshness – How fresh is the document compared to the time of the query? Orthogonal attributes aggregated from search relevance primitives Relevance Ranking: Machine Learning Optimization Relevance Primitives Auto-language detection Approximate matching Lemmatization (grammar) Phrase detection Anti-phrasing (stop words) Multiple dictionaries thesauri Natural language (who, what, where, etc.) Proximity search Spatial relevance Contextual tagging … • Transfer functions CASQF Attributes – – – – – Completeness Authority Statistics Quality Freshness Attribute weights • Correlations – Complete & Fresh? – Authority & Frequent query – …. Correlations Relevance Navigation & Information Discovery • Algorithmic – Mining • Preparing unstructured information for intelligent information discovery • Information/entity extraction and structural document analysis, classification, NLP... – Matching • Tuning of recall: Spell-check, query analysis, linguistic analysis ... – Ranking • Tuning of precision • Assigning of key parameters – The CASQF framework • Relevancy benchmark framework – Optimization through machine learning - NLP – Navigation / Information discovery • Extensive result analysis through LiveAnalytics • Adaptive answering/discovery capability Not only query results; Also supporting query Result driven info discovery • Rule Based – Rules: override algorithmic search, preferred content – Automatic detection of “correct answers” Form of Result Sets • Traditional: Results sets are typically lists of document identifiers • 3rd generation: Result set depending on the query intentions – – – – Traditional result set lists Dynamic clustering: Supervised and unsupervised Dynamic drill-down: Live analytics ... Intelligent Organization The search bar 2 ways to search: - “I know what I want, but I don’t know where it is” - “I’m not sure what I’m looking for but I know how to get there” The hierarchical tree Medline Demo - 12 million journal articles from > 4,000 journals Dynamic Drill-Down in Auto-Extracted Entities Dynamic Drill-Down • MESH keywords • Publication year • Journal Title • Author(s) • Substances • Etc LiveAnalytics™: Tool for Dynamic Drill-Down • Multi-dimensional navigation and binning tool – Multiple ways to drill down through structured data – E.g. database rows, metadata, automatically extracted attributes (entities) • Data properties may be: – Textual: author, publication, … – Numeric: date, price, … – Free text: abstract, ... • Source may be historical or live feeds Relevance Rule Based • Algorithmic – Mining • Preparing unstructured information for intelligent information discovery • Information/entity extraction and structural document analysis, classification, NLP... – Matching • Tuning of recall: Spell-check, query analysis, linguistic analysis ... – Ranking • Tuning of precision • Assigning of key parameters – The CASQF framework • Relevancy benchmark framework – Optimization through machine learning - NLP – Navigation / Information discovery • Extensive result analysis through LiveAnalytics • Adaptive answering/discovery capability Not only query results; Also supporting query Result driven info discovery • Rule Based – Rules: override algorithmic search, preferred content – Automatic detection of “correct answers” Relevance Model Rule Based: Business Logic Example Example: Scirus Scirus Scirus is the leading online search engine for scientific content Proprietary Databases Scientific Web Pages 18M article records (Medline, SciencDirect, …) 120 million Web pages (.edu, .gov, .org, .com, …) Value Added Twice winner of SEW Best Specialty Search Engine award Functionalities • Large-scale content aggregation • Automatic content & page classificat. • Query refinements (1-D drilldown) Summary • Search engines can do more than just search… – – – – Total information management system Open, scalable and modular architecture: Allows for customization Adapts to content and queries Powerful data discovery and navigation • Many exciting technology developments to come – – – – More advanced content and query analysis Adaptive query- & content-sensitive matching Dynamic result set presentation and navigation … Questions?