Transcript
Surfacing Information in Large Text Collections
Eugene Agichtein, Microsoft Research

Example: Angina treatments
A user's questions ("guideline for unstable angina", "unstable angina management", "herbal treatment for angina pain", "medications for treating angina", "alternative treatment for angina pain", "treatment for angina", "angina treatments") must today be answered by piecing together scattered sources: structured databases (e.g., drug info, the WHO drug adverse effects DB, etc.), MedLine, the PDR, other medical reference and literature, and web search results.

Research Goal
Seamless, intuitive, efficient, and robust access to knowledge in unstructured sources. Some approaches:
• Retrieve the relevant documents or passages
• Question answering
• Construct domain-specific "verticals" (e.g., MedLine)
• Extract entities and relationships
• Build a network of relationships: the Semantic Web

Semantic Relationships "Buried" in Unstructured Text
"… A number of well-designed and executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris …"

RecommendedTreatment
  Drug      Condition
  statins   recurrent myocardial infarction
  statins   strokes
  statins   unstable angina pectoris

Sources include the web, newsgroups, and web logs; text databases (PubMed, CiteSeer, etc.); and newspaper archives. Classic tasks from the Message Understanding Conferences (MUC) include corporate mergers, succession, and location, as well as terrorist attacks.

What Structured Representation Can Do for You
Converting a large text collection into a structured relation can:
• allow precise and efficient querying
• allow returning answers instead of documents
• support powerful query constructs
• allow data integration with (structured) RDBMSs
• provide useful content for the Semantic Web

Challenges in Information Extraction
Portability:
• Reduce the effort needed to tune for new domains and tasks
• MUC systems: experts would take 8-12 weeks to tune
Scalability, efficiency, access:
• Enable information extraction over large collections
• 1 sec/document * 5 billion docs = 158 CPU years
Approach: learn from data ("bootstrapping"):
• Snowball: partially supervised information extraction
• Querying large text databases for efficient information extraction

The Snowball System: Overview
Snowball takes a text database as input and produces a table of extracted tuples with confidence scores, e.g., organizations and their locations:

  Organization         Location        Conf
  Microsoft            Redmond         1
  IBM                  Armonk          1
  Intel                Santa Clara     1
  AG Edwards           St Louis        0.9
  Air Canada           Montreal        0.8
  7th Level            Richardson      0.8
  3Com Corp            Santa Clara     0.8
  3DO                  Redwood City    0.7
  3M                   Minneapolis     0.7
  MacWorld             San Francisco   0.7
  ...                  ...             ...
  157th Street         Manhattan       0.52
  15th Party Congress  China           0.3
  15th Century Europe  Dark Ages       0.1

Snowball: Getting User Input [ACM DL 2000]
User input:
• a handful of example instances, e.g.

  Organization  Headquarters
  Microsoft     Redmond
  IBM           Armonk
  Intel         Santa Clara

• integrity constraints on the relation, e.g., Organization is a "key", Age > 0, etc.
Snowball then iterates over the collection: get examples, find example occurrences in text, tag entities, generate extraction patterns, extract tuples, and evaluate tuples (a sketch of one such iteration follows the EM-Spy slide below).

Evaluating Patterns and Tuples: Expectation Maximization
The EM-Spy algorithm:
• "Hide" the labels of some seed tuples (the "spies")
• Iterate the EM algorithm to convergence on tuple/pattern confidence values
• Set the confidence threshold t such that more than 90% of the spy tuples score above t
• Re-initialize Snowball using the new seed tuples

  Organization         Headquarters    Initial  Final
  Microsoft            Redmond         1        1
  IBM                  Armonk          1        0.8
  Intel                Santa Clara     1        0.9
  AG Edwards           St Louis        0        0.9
  Air Canada           Montreal        0        0.8
  7th Level            Richardson      0        0.8
  3Com Corp            Santa Clara     0        0.8
  3DO                  Redwood City    0        0.7
  3M                   Minneapolis     0        0.7
  MacWorld             San Francisco   0        0.7
  ...                  ...             ...      ...
  157th Street         Manhattan       0        0.52
  15th Party Congress  China           0        0.3
  15th Century Europe  Dark Ages       0        0.1
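To make the Snowball loop above concrete, here is a minimal sketch in Python of a single bootstrapping iteration. It is an illustration, not the actual Snowball implementation: the seed set and the entity-tagged occurrences are made up, a "pattern" is just the literal middle context between the two entities (the real system uses weighted term vectors around the tagged entities), and the confidence formulas are simplified variants of those used in the system.

    # Sketch of one Snowball-style bootstrapping iteration (simplified, illustrative).
    # We assume sentences have already been entity-tagged, so each occurrence is
    # a triple (organization, middle context, location).
    from collections import defaultdict

    seed_tuples = {("Microsoft", "Redmond"), ("IBM", "Armonk"), ("Intel", "Santa Clara")}

    # Hypothetical tagged occurrences drawn from a text database.
    occurrences = [
        ("Microsoft", ", whose headquarters are in", "Redmond"),
        ("IBM", ", whose headquarters are in", "Armonk"),
        ("Intel", ", based in", "Santa Clara"),
        ("3Com Corp", ", based in", "Santa Clara"),
        ("Air Canada", ", whose headquarters are in", "Montreal"),
        ("157th Street", ", a busy corner in", "Manhattan"),
    ]

    def bootstrap_iteration(seeds, occurrences):
        # 1. Generate extraction patterns from the contexts of known seed tuples.
        pattern_hits = defaultdict(lambda: [0, 0])      # pattern -> [seed matches, total]
        for org, context, loc in occurrences:
            pattern_hits[context][1] += 1
            if (org, loc) in seeds:
                pattern_hits[context][0] += 1
        # Pattern confidence: fraction of a pattern's occurrences that match a seed tuple.
        pattern_conf = {p: pos / total for p, (pos, total) in pattern_hits.items() if pos > 0}

        # 2. Extract candidate tuples with the learned patterns; a tuple is as believable
        #    as the patterns that extracted it: conf(t) = 1 - prod(1 - conf(p)).
        miss_prob = defaultdict(lambda: 1.0)
        for org, context, loc in occurrences:
            if context in pattern_conf:
                miss_prob[(org, loc)] *= 1.0 - pattern_conf[context]
        return {t: 1.0 - miss for t, miss in miss_prob.items()}

    if __name__ == "__main__":
        extracted = bootstrap_iteration(seed_tuples, occurrences)
        for (org, loc), conf in sorted(extracted.items(), key=lambda kv: -kv[1]):
            print(f"{org:12s} {loc:14s} {conf:.2f}")

In the full system, the highest-confidence tuples extracted in one iteration are added to the seed set for the next, which is exactly where the EM-Spy evaluation above earns its keep: it keeps unreliable patterns from flooding the seed set with noise.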
Adapting Snowball for New Relations
Large parameter space:
• Initial seed tuples (randomly chosen, multiple runs)
• Acceptor features: words, stems, n-grams, phrases, punctuation, POS
• Feature selection techniques: OR, NB, Freq, "support", combinations
• Feature weights: TF*IDF, TF, TF*NB, NB
• Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
Automatically estimate parameter values:
• Estimate operating parameters based on occurrences of the seed tuples
• Run cross-validation on hold-out sets of seed tuples for optimal performance
• Seed occurrences that do not have close "neighbors" are discarded

Example Task 1: DiseaseOutbreaks [SDM 2006]
Proteus: 0.409, Snowball: 0.415

Example Task 2: Bioinformatics [ISMB 2003]
• 100,000+ gene and protein synonyms extracted from 50,000+ journal articles
• Approximately 40% of the confirmed synonyms were not previously listed in the curated authoritative reference (SWISS-PROT)
• Examples: "APO-1, also known as DR6…", "MEK4, also called SEK1…"

Snowball Used in Various Domains
• News: NYT, WSJ, AP [DL '00, SDM '06]: CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
• Medical literature: PDRHealth, Micromedex, … [Ph.D. thesis]: AdverseEffects, DrugInteractions, RecommendedTreatments
• Biological literature: GeneWays corpus [ISMB '03]: gene and protein synonyms

Limits of Bootstrapping for Extraction [CIKM 2005]
The task is "easy" when the context term distributions diverge from the background, e.g., the contexts around "President George W Bush's three-day visit to India".
[Figure: term frequency distribution; x-axis terms include "the", "to", "and", "said", "'s", "company", "mrs", "won", "president"; y-axis frequency 0 to 0.07]
Quantify the divergence as relative entropy (Kullback-Leibler divergence) between the context language model LM_C and the background language model LM_BG:

  KL(LM_C || LM_BG) = \sum_{w \in V} LM_C(w) \log \frac{LM_C(w)}{LM_BG(w)}

After calibration, the metric predicts whether bootstrapping is likely to work (a short computational sketch follows the "Executing a Text-Centric Task" slide below).

Extracting All Relation Instances From a Text Database
The text database is fed through an information extraction system to produce a structured relation. The brute-force approach of feeding all documents to the extraction system is expensive for large collections:
• Often only a tiny fraction of the documents are useful
• Many databases are not crawlable
• Often a search interface is available, with an existing keyword index
How do we identify the "useful" documents?

Accessing Text DBs via Search Engines
Documents now reach the information extraction system through a search engine over the text database. Search engines impose limitations:
• A limit on the documents retrieved per query
• Support for simple keywords and phrases only
• Ignoring "stopwords" (e.g., "a", "is")

Text-Centric Task I: Information Extraction
Information extraction applications extract structured relations from unstructured text. For example, from The New York Times:
"May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"
An information extraction system (e.g., NYU's Proteus) produces the Disease Outbreaks relation:

  Date       Disease Name     Location
  Jan. 1995  Malaria          Ethiopia
  July 1995  Mad Cow Disease  U.K.
  Feb. 1995  Pneumonia        U.S.
  May 1995   Ebola            Zaire

(See also yesterday's Information Extraction tutorial by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan.)

Executing a Text-Centric Task
1. Retrieve documents from the database
2. Process the documents
3. Extract the output tokens
Steps 1-3 are similar to the relational world. There are two major execution paradigms:
• Scan-based: retrieve and process the documents sequentially
• Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results
The underlying data distribution dictates which is best. Unlike the relational world, the index is only "approximate": it is built on keywords, not on the tokens of interest, so the choice of execution plan affects output completeness, not only speed.
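As a concrete reading of the relative-entropy metric from the "Limits of Bootstrapping for Extraction" slide above, here is a minimal sketch in Python, assuming we already have unigram term counts for the contexts of seed occurrences (LM_C) and for the background collection (LM_BG). The counts, the shared vocabulary, and the add-alpha smoothing are illustrative assumptions, not details of the published CIKM 2005 method.

    # Sketch of KL(LM_C || LM_BG) = sum_w LM_C(w) * log(LM_C(w) / LM_BG(w)),
    # the bootstrapping-applicability metric from the slide above.
    import math
    from collections import Counter

    def unigram_lm(counts, vocabulary, alpha=0.01):
        """Maximum-likelihood unigram model with add-alpha smoothing so every
        vocabulary word gets nonzero probability (alpha is an assumption)."""
        total = sum(counts.values()) + alpha * len(vocabulary)
        return {w: (counts.get(w, 0) + alpha) / total for w in vocabulary}

    def kl_divergence(lm_c, lm_bg):
        return sum(p * math.log(p / lm_bg[w]) for w, p in lm_c.items() if p > 0)

    # Hypothetical term counts: contexts of seed-tuple occurrences vs. the whole corpus.
    context_counts = Counter({"president": 40, "visit": 25, "minister": 20,
                              "the": 90, "to": 60})
    background_counts = Counter({"the": 5000, "to": 3000, "and": 2800, "said": 900,
                                 "company": 400, "president": 120, "visit": 60,
                                 "minister": 30})

    vocab = set(context_counts) | set(background_counts)
    lm_c = unigram_lm(context_counts, vocab)
    lm_bg = unigram_lm(background_counts, vocab)
    print(f"KL(LM_C || LM_BG) = {kl_divergence(lm_c, lm_bg):.3f} nats")
    # Larger values mean the contexts look less like ordinary background text,
    # which, after calibration, suggests bootstrapping is more likely to work.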
QXtract: Querying Text Databases for Robust Scalable Information EXtraction
User-provided seed tuples, e.g.

  DiseaseName  Location  Date
  Malaria      Ethiopia  Jan. 1995
  Ebola        Zaire     May 1995

drive query generation: the generated queries go to the search engine, the promising documents that come back are processed by the information extraction system, and the extracted relation grows beyond the seeds (adding, e.g., Mad Cow Disease / The U.K. / July 1995 and Pneumonia / The U.S. / Feb. 1995).
Problem: learn keyword queries that retrieve "promising" documents.

Learning Queries to Retrieve Promising Documents
1. Get a document sample with "likely negative" and "likely positive" examples, by issuing the user-provided seed tuples as queries against the text database's search engine.
2. Label the sample documents using the information extraction system as the "oracle".
3. Train classifiers to "recognize" useful documents.
4. Generate queries from the classifier model/rules.

Demonstration [SIGMOD 2003]

Querying Graph
The querying graph is a bipartite graph containing tokens and documents: each token (transformed into a keyword query) retrieves documents, and documents contain tokens. For example, the tokens t1 <SARS, China>, t2 <Ebola, Zaire>, t3 <Malaria, Ethiopia>, t4 <Cholera, Sudan>, and t5 <H5N1, Vietnam> are linked to the documents d1-d5 they retrieve and appear in.

Sizes of Connected Components
How many tuples are in the largest Core + Out? The reachability graph decomposes into In, Core (strongly connected), and Out components.
Conjecture:
• The degree distribution in reachability graphs follows a power law.
• Then the reachability graph has at most one giant component.
Define reachability as the fraction of tuples in the largest Core + Out (a small sketch of this computation follows below).
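A minimal sketch of the reachability computation just defined. The bipartite data below (which documents each token retrieves, and which tokens each document contains) is made up, and taking the largest set of tokens reachable from any single token is only a simple proxy for the largest Core + Out, which strictly calls for a strongly-connected-component decomposition of the reachability graph.

    # Sketch: estimate reachability on a toy querying graph.
    # retrieves[t] = documents returned when token t is issued as a keyword query;
    # contains[d]  = tokens that appear in document d. Both are invented examples.
    from collections import deque

    retrieves = {
        "<SARS, China>": ["d1"],
        "<Ebola, Zaire>": ["d1", "d2"],
        "<Malaria, Ethiopia>": ["d3"],
        "<Cholera, Sudan>": ["d4"],
        "<H5N1, Vietnam>": ["d5"],
    }
    contains = {
        "d1": ["<SARS, China>", "<Ebola, Zaire>"],
        "d2": ["<Malaria, Ethiopia>"],
        "d3": ["<Ebola, Zaire>", "<Malaria, Ethiopia>"],
        "d4": ["<Cholera, Sudan>"],
        "d5": ["<H5N1, Vietnam>", "<Cholera, Sudan>"],
    }

    def reachable_tokens(start):
        """All tokens reachable from `start` by alternating query/containment steps."""
        seen, queue = {start}, deque([start])
        while queue:
            token = queue.popleft()
            for doc in retrieves.get(token, []):
                for other in contains.get(doc, []):
                    if other not in seen:
                        seen.add(other)
                        queue.append(other)
        return seen

    tokens = set(retrieves)
    largest = max((reachable_tokens(t) for t in tokens), key=len)
    # The largest reachable set stands in for the largest Core + Out; reachability
    # is then the fraction of all tokens that fall inside it.
    print(f"reachability ~ {len(largest) / len(tokens):.2f}: {sorted(largest)}")

On the toy data this prints a reachability around 0.6; on the real NYT graphs the analogous fraction is what the MaxResults=10 vs. MaxResults=50 comparison on the next slides reports.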
NYT Reachability Graph: Outdegree Distribution
The outdegree distribution matches a power-law distribution, for both MaxResults=10 and MaxResults=50.

NYT: Component Size Distribution
With MaxResults=10 the relation is largely not "reachable": CG / |T| = 0.297. With MaxResults=50 it becomes "reachable": CG / |T| = 0.620.

Connected Components Visualization
DiseaseOutbreaks, New York Times 1995.

Estimate Cost of Retrieval Methods [SIGMOD 2006]
Alternatives: Scan, Filtered Scan, Tuples, QXtract. A general cost model for text-centric tasks (information extraction, summary construction, etc.) estimates the expected cost of each access method:
• A parametric model describes all retrieval steps
• The analysis extends to arbitrary degree distributions
• Parameter estimates can be "piggybacked" at runtime
The cost estimates can be provided to a query optimizer for nearly optimal execution (a toy version is sketched at the end of this document).

Optimized Execution of Text-Centric Tasks
[Figure: execution cost of Scan, Filtered Scan, and Tuples]

Current Research Agenda
Seamless, intuitive, and robust access to knowledge in biological and medical sources. Some research problems:
• Robust query processing over unstructured data
• Intelligently interpreting user information needs
• Text mining for bio- and medical informatics
• Modeling implicit network structures: entity graphs in Wikipedia, protein-protein interaction networks, semantic maps of MedLine

Deriving Actionable Knowledge from Unstructured (Text) Data
Extract actionable rules from medical text (Medline, patient reports, …):
• Joint project (early stages) with the medical school, GT
• Epidemiology surveillance (w/ SPH)
Query processing over unstructured data:
• Tune extraction for the query workload
• Index structures to support effective extraction
• Queries over extracted and "native" tables

Text Mining for Bioinformatics
• Impossible to keep up with the literature and experimental notes
• Automatically update ontologies and indexes
• Automate the tedious work of post-wetlab search
• Identify (and assign text labels to) DNA structures

Mining Text and Sequence Data [PSB 2004]
[Figure: ROC50 scores for each class and method]
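To close, here is a toy illustration of the kind of access-method cost comparison the "Estimate Cost of Retrieval Methods" slide describes. The closed-form expressions and every parameter value below are made-up simplifications (a uniformly mixed corpus and a fixed-quality document classifier), not the actual SIGMOD 2006 cost model; the point is only to show how parametric estimates let an optimizer choose between Scan and Filtered Scan for a given target recall.

    # Toy cost comparison of two access methods for a text-centric task.
    # All formulas and parameter values are illustrative assumptions.

    def scan_cost(num_docs, useful_fraction, target_recall, cost_per_doc=1.0):
        """Scan: process documents sequentially until the target fraction of the
        useful documents has been seen (corpus assumed uniformly mixed).
        useful_fraction is unused here; it is kept for a uniform signature."""
        return num_docs * target_recall * cost_per_doc

    def filtered_scan_cost(num_docs, useful_fraction, target_recall,
                           classifier_recall=0.85, classifier_precision=0.4,
                           cost_classify=0.05, cost_per_doc=1.0):
        """Filtered Scan: cheaply classify every scanned document and fully
        process only the documents the classifier accepts."""
        if target_recall > classifier_recall:
            return float("inf")        # the filter discards too many useful documents
        docs_scanned = num_docs * target_recall / classifier_recall
        docs_accepted = docs_scanned * useful_fraction * classifier_recall / classifier_precision
        return docs_scanned * cost_classify + docs_accepted * cost_per_doc

    params = dict(num_docs=1_000_000, useful_fraction=0.01)
    for recall in (0.1, 0.5, 0.9):
        s = scan_cost(target_recall=recall, **params)
        f = filtered_scan_cost(target_recall=recall, **params)
        print(f"target recall {recall:.1f}:  Scan ~{s:,.0f}   Filtered Scan ~{f:,.0f}")

With these particular numbers Filtered Scan wins easily at low and moderate recall targets but cannot reach a 0.9 recall at all, which is the kind of crossover that makes runtime cost estimation (with parameters "piggybacked" onto execution, as on the slide) worthwhile.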