Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Accessing, Managing, and Mining Unstructured Data Eugene Agichtein 1 The Web 20B+ of machine-readable text (some of it useful) (Mostly) human-generated for human consumption Both “artificial” and “natural” phenomenon Still growing? Local and global structure (links) Headaches: Dynamic vs. static content People figured out how to make money Positives: Everything (almost) is on the web People (eventually) can find info People (on average) are not evil 2 Wait, there is more Blogs, wikipedia Hidden web: > 25 million databases Accessible via keyword search interfaces E.g., MedLine, CancerLit, USPTO, … 100x more data than surface web (Transcribed) speech from Classified Genetic sequence annotations Biological & Medical literature Medical records, reports, alerts, 911 calls 3 Outline Unstructured data (text, web, …) is Important (really!) Not so unstructured Main tasks/requirements and challenges Example problem: query optimization for text-centric tasks Fundamental research problems/directions 4 Unstructured data = natural language text (for this talk) Incredibly powerful and flexible means of communicating knowledge Local structures: syntax Papers, news, web pages, lecture notes, patient records, shopping lists… English syntax HTML layout Semantics implicit, ambiguous, subjective I saw a man with a chainsaw Need incredibly powerful and flexible decoder 5 Some more structure Explicit link structure Web, Blogs, Wikipedia, citations Implicit link structure Co-occurrence of entities within same document/context implies link between entities Occurrence of same entity in multiple documents implies link between documents Physical location Page primarily “about” Atlanta User somewhere around N. Decatur Rd E-mail sender is two floors down More on this later 6 Global Problem Space Crawling (accessing) the data Storing (multiple version of) data “Understanding” the data information Indexing information Integration from multiple sources User-driven information retrieval Exploiting unstructured data in applications System-driven knowledge discovery Building a nuclear/hydro/wind/ power plant 7 To Search or to Crawl? Towards a Query Optimizer for TextCentric Tasks, [Ipeirotis, Agichtein, Jain, Gavano, SIGMOD 2006] Information extraction applications extract structured relations from unstructured text May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis… Disease Outbreaks in The New York Times Information Extraction System (e.g., NYU’s Proteus) Date Disease Name Location Jan. 1995 Malaria Ethiopia July 1995 Mad Cow Disease U.K. Feb. 1995 Pneumonia U.S. May 1995 Ebola Zaire 8 An Abstract View of Text-Centric Tasks Output Tokens Text Database … Extraction System 1. Retrieve documents from database 2. Process documents 3. Extract output tokens Task Token Information Extraction Relation Tuple Database Selection Word (+Frequency) Focused Crawling Web Page about a Topic For the rest of the talk 9 Executing a Text-Centric Task Output Tokens Text Database Extraction … System 1. Retrieve documents from database Similar to relational world 2. Process documents 3. Extract output tokens Two major execution paradigms Scan-based: Retrieve and process documents sequentially Index-based: Query database (e.g., [case fatality rate]), retrieve and process documents in results →underlying data distribution dictates what is best Indexes are only “approximate”: index is on keywords, not on tokens of interest Choice of execution plan affects output completeness (not only speed) Unlike the relational world 10 Execution Plan Characteristics Output Tokens Text Database 1. Question: How do we choose the… Extraction fastest execution plan for reaching System a target recall ? Retrieve documents from database 2. Process documents 3. Extract output tokens Execution Plans have two main characteristics: Execution Time Recall (fraction of tokens retrieved) “What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?” 11 Outline Description and analysis of crawl- and query-based plans Scan Crawl-based Filtered Scan Iterative Set Expansion Automatic Query Generation Query-based (Index-based) Optimization strategy Experimental results and conclusions 12 Scan Output Tokens Text Database Extraction … System 1. Retrieve docs from database 2. Process documents 3. Extract output tokens Scan retrieves and processes documents sequentially (until reaching target recall) Execution time = |Retrieved Docs| · (R + P) Question: How many documents does Scan retrieve to reach target recall? Time for retrieving a document Time for processing a document Filtered Scan uses a classifier to identify and process only promising documents (details in paper) 13 Estimating Recall of Scan <SARS, China> Modeling Scan for Token t: What is the probability of seeing t (with frequency g(t)) after retrieving S documents? A “sampling without replacement” process Token t d1 d2 S documents ... After retrieving S documents, frequency of token t follows hypergeometric distribution Recall for token t is the probability that frequency of t in S docs > 0 dS ... dN D Probability of seeing token t after retrieving S documents g(t) = frequency of token t Sampling for t 14 Estimating Recall of Scan <SARS, China> <Ebola, Zaire> Modeling Scan: Multiple “sampling without replacement” processes, one for each token Overall recall is average recall across tokens Tokens t1 t2 Sampling for t1 Sampling for t2 ... tM d1 d2 → We can compute number of documents required to reach target recall d3 ... Execution time = |Retrieved Docs| · (R + P) dN D Sampling for tM 15 Outline Description and analysis of crawl- and query-based plans Scan Crawl-based Filtered Scan Iterative Set Expansion Automatic Query Generation Query-based Optimization strategy Experimental results and conclusions 16 Iterative Set Expansion Output Tokens Text Database … Extraction Query System 1. Query database with seed tokens Generation 2. Process retrieved documents 3. Extract tokens from docs (e.g., <Malaria, Ethiopia>) 4. Augment seed tokens with new tokens (e.g., [Ebola AND Zaire]) Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q Question: How many queries and how many documents does Iterative Set Expansion need to reach target recall? Time for retrieving a Time for processing document a document Time for answering a query17 Querying Graph Tokens The querying graph is a bipartite graph, containing tokens and documents t1 Documents d1 <SARS, China> t2 d2 <Ebola, Zaire> Each token (transformed to a keyword query) retrieves documents Documents contain tokens t3 d3 <Malaria, Ethiopia> t4 d4 t5 d5 <Cholera, Sudan> <H5N1, Vietnam> 18 Using Querying Graph for Analysis We need to compute the: Number of documents retrieved after sending Q tokens as queries (estimates time) Number of tokens that appear in the retrieved documents (estimates recall) Tokens t1 Documents d1 <SARS, China> t2 d2 <Ebola, Zaire> To estimate these we need to compute the: Degree distribution of the tokens discovered by retrieving documents Degree distribution of the documents retrieved by the tokens (Not the same as the degree distribution of a randomly chosen token or document – it is easier to discover documents and tokens with high degrees) t3 d3 <Malaria, Ethiopia> t4 d4 t5 d5 <Cholera, Sudan> <H5N1, Vietnam> 19 Elegant analysis framework based on generating functions – details in the paper Recall Limit: Reachability Graph Tokens Documents t1 d1 t2 d2 t3 d3 t4 d4 t5 d5 Reachability Graph t1 t2 t3 t5 t4 t1 retrieves document d1 that contains t2 Upper recall limit: determined by the size of the biggest connected component 20 Automatic Query Generation Iterative Set Expansion has recall limitation due to iterative nature of query generation Automatic Query Generation avoids this problem by creating queries offline (using machine learning), which are designed to return documents with tokens Details in the papers 21 Outline Description and analysis of crawl- and query-based plans Optimization strategy Experimental results and conclusions 22 Summary of Cost Analysis Our analysis so far: Takes as input a target recall Gives as output the time for each plan to reach target recall (time = infinity, if plan cannot reach target recall) Time and recall depend on task-specific properties of database: Token degree distribution Document degree distribution Next, we show how to estimate degree distributions on-the-fly 23 Estimating Cost Model Parameters Token and document degree distributions belong to known distribution families Task Document Distribution Token Distribution Information Extraction Power-law Power-law Content Summary Construction Lognormal Power-law (Zipf) Focused Resource Discovery Uniform Uniform 10000 100000 y = 43060x-3.3863 10000 1000 y = 5492.2x-2.0254 Number of Tokens Number of Documents 1000 100 10 1 1 10 Document Degree 100 100 10 1 1 10 100 Token Degree 1000 24 Can characterize distributions with only a few parameters! Parameter Estimation Naïve solution for parameter estimation: Start with separate, “parameter-estimation” phase Perform random sampling on database Stop when cross-validation indicates high confidence We can do better than this! No need for separate sampling phase Sampling is equivalent to executing the task: →Piggyback parameter estimation into execution 25 On-the-fly Parameter Estimation Correct (but unknown) distribution Pick most promising execution plan for target recall assuming “default” parameter values Start executing task Update parameter estimates during execution Switch plan if updated statistics indicate so Initial, default estimation Updated estimation Updated estimation Important Only Scan acts as “random sampling” 26 All other execution plan need parameter adjustment (see paper) Outline Description and analysis of crawl- and query-based plans Optimization strategy Experimental results and conclusions 27 Correctness of Theoretical Analysis 100,000 Execution Time (secs) 10,000 Scan 1,000 Filt. Scan Automatic Query Gen. Iterative Set Expansion 100 0.00 0.10 0.20 0.30 0.40 0.50 Recall 0.60 Solid lines: Actual time Dotted lines: Predicted time with correct parameters 0.70 0.80 0.90 1.00 Task: Disease Outbreaks Snowball IE system 182,531 documents from NYT 28 16,921 tokens Experimental Results (Information Extraction) 100,000 Execution Time (secs) 10,000 Scan Filt. Scan 1,000 Iterative Set Expansion Automatic Query Gen. OPTIMIZED 100 0.00 0.10 0.20 0.30 0.40 0.50 Recall 0.60 0.70 0.80 0.90 1.00 Solid lines: Actual time Green line: Time with optimizer (results similar in other experiments – see paper) 29 Conclusions Common execution plans for multiple text-centric tasks Analytic models for predicting execution time and recall of various crawl- and query-based plans Techniques for on-the-fly parameter estimation Optimization framework picks on-the-fly the fastest plan for target recall 30 Global Problem Space Crawling (accessing) the data “Understand” the data information Indexing information Integration from multiple sources User-driven information retrieval Exploiting unstructured data in applications System-driven knowledge discovery 31 Some Research Directions Modeling explicit and Implicit network structures Knowledge Discovery from Biological and Medical Data Automatic sequence annotation bioinformatics, genetics Actionable knowledge extraction from medical articles Robust information extraction, retrieval, and query processing Modeling evolution of explicit structure on web, blogspace, wikipedia Modeling implicit link structures in text, collections, web Exploiting implicit & explicit social networks (e.g., for epidemiology) Integrating information in structured and unstructured sources Robust search/question answering for medical applications Confidence estimation for extraction from text and other sources Detecting reliable signals from (noisy) text data (e.g.,: medical surveillance) Accuracy (!=authority) of online sources Information diffusion/propagation in online sources Information propagation on the web In collaborative sources (wikipedia, MedLine) 32 Page Quality: In Search of an Unbiased Web Ranking [Cho, Roy, Adams, SIGMOD 2005] “popular pages tend to get even more popular, while unpopular pages get ignored by an average user” 33 Sic Transit Gloria Telae: Towards an Understanding of the Web’s Decay [Bar-Yossef, Broder, Kumar, Tomkins, WWW 2004] 34 Modeling Social Networks for Epidemiology, security, … Email exchange mapped onto cubicle locations. 35 Some Research Directions Modeling explicit and Implicit network structures Knowledge Discovery from Biological and Medical Data Automatic sequence annotation bioinformatics, genetics Actionable knowledge extraction from medical articles Robust information extraction, retrieval, and query processing Modeling evolution of explicit structure on web, blogspace, wikipedia Modeling implicit link structures in text, collections, web Exploiting implicit & explicit social networks (e.g., for epidemiology) Integrating information in structured and unstructured sources Query processing over unstructured text Robust search/question answering for medical applications Confidence estimation for extraction from text and other sources Detecting reliable signals from (noisy) text data (e.g.,: medical surveillance) Information diffusion/propagation in online sources Information propagation on the web In collaborative sources (wikipedia, MedLine) 36 ISMB 2003 Applying Text Mining for Bioinformatics 100,000+ gene and protein synonyms extracted from 50,000+ journal articles Approximately 40% of confirmed synonyms not previously listed in curated authoritative reference (SWISSPROT) “APO-1, also known as DR6…” “MEK4, also called SEK1…” 37 Examples of Entity-Relationship Extraction „We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.“ CBF-A CBF-B interact complex associates CBF-C CBF-A-CBF-C complex 38 Another Example Z-100 is an arabinomannan extracted from Mycobacterium tuberculosis that has various immunomodulatory activities, such as the induction of interleukin 12, interferon gamma (IFN-gamma) and beta-chemokines. The effects of Z-100 on human immunodeficiency virus type 1 (HIV-1) replication in human monocyte-derived macrophages (MDMs) are investigated in this paper. In MDMs, Z-100 markedly suppressed the replication of not only macrophage-tropic (M-tropic) HIV-1 strain (HIV-1JR-CSF), but also HIV-1 pseudotypes that possessed amphotropic Moloney murine leukemia virus or vesicular stomatitis virus G envelopes. Z-100 was found to inhibit HIV-1 expression, even when added 24 h after infection. In addition, it substantially inhibited the expression of the pNL43lucDeltaenv vector (in which the env gene is defective and the nef gene is replaced with the firefly luciferase gene) when this vector was transfected directly into MDMs. These findings suggest that Z-100 inhibits virus replication, mainly at HIV-1 transcription. However, Z100 also downregulated expression of the cell surface receptors CD4 and CCR5 in MDMs, suggesting some inhibitory effect on HIV-1 entry. Further experiments revealed that Z-100 induced IFN-beta production in these cells, resulting in induction of the 16-kDa CCAAT/enhancer binding protein (C/EBP) beta transcription factor that represses HIV-1 long terminal repeat transcription. These effects were alleviated by SB 203580, a specific inhibitor of p38 mitogen-activated protein kinases (MAPK), indicating that the p38 MAPK signalling pathway was involved in Z-100-induced repression of HIV-1 replication in MDMs. These findings suggest that Z-100 might be a useful immunomodulator for control of HIV-1 infection. 39 Query AliBaba, Ulf Leser, http://wbi.informatik.hu-berlin.de:80 Extracted info PubMed visualized Links to databases 40 Agichtein & Eskin, PSB 2004 Mining Text and Sequence Data ROC50 scores for each class and method 41 Some Research Directions Modeling explicit and Implicit network structures Knowledge Discovery from Biological and Medical Data Automatic sequence annotation bioinformatics, genetics Actionable knowledge extraction from medical articles Robust information extraction, retrieval, and query processing Modeling evolution of explicit structure on web, blogspace, wikipedia Modeling implicit link structures in text, collections, web Exploiting implicit & explicit social networks (e.g., for epidemiology) Integrating information in structured and unstructured sources Robust search/question answering for medical applications Confidence estimation for extraction from text and other sources Detecting reliable signals from (noisy) text data (e.g.,: medical surveillance) Accuracy (!=authority) of online sources Information diffusion/propagation in online sources Information propagation on the web In collaborative sources (wikipedia, MedLine) 42 Structure and evolution of blogspace [Kumar, Novak, Raghavan, Tomkins, CACM 2004, KDD 2006] Fraction of nodes in components of various sizes within Flickr and Yahoo! 360 timegraph, by week. 43 Structure of implicit entity-entity networks in text [Agichtein&Gravano, ICDE 2003] Connected Components Visualization DiseaseOutbreaks, 44 New York Times 1995 Some Research Directions Modeling explicit and Implicit network structures Knowledge Discovery from Biological and Medical Data Automatic sequence annotation bioinformatics, genetics Actionable knowledge extraction from medical articles Robust information extraction, retrieval, and query processing Modeling evolution of explicit structure on web, blogspace, wikipedia Modeling implicit link structures in text, collections, web Exploiting implicit & explicit social networks (e.g., for epidemiology) Integrating information in structured and unstructured sources Robust search/question answering for medical applications Confidence estimation for extraction from text and other sources Detecting reliable signals from (noisy) text data (e.g.,: medical surveillance) Accuracy (!=authority) of online sources Information diffusion/propagation in online sources Information propagation on the web, news In collaborative sources (wikipedia, MedLine) 45 Thank You Details: http://www.mathcs.emory.edu/~eugene/ 46