The Web's Many Models
Michael J. Cafarella, University of Michigan
AKBC, May 19, 2010

Web Information Extraction
* Much recent research on information extractors that operate over Web pages:
  * Snowball (Agichtein and Gravano, 2001)
  * TextRunner (Banko et al., 2007)
  * Yago (Suchanek et al., 2007)
  * WebTables (Cafarella et al., 2008)
  * DBPedia, ExDB, Freebase (make use of IE data)
* Web crawl + domain-independent IE should allow comprehensive Web KBs with:
  * Very high, "web-style" recall
  * "More-expressive-than-search" query processing
* But where is it?

Omnivore
* "Extracting and Querying a Comprehensive Web Database." Michael Cafarella. CIDR 2009. Asilomar, CA.
* Suggested remedies for data ingestion and user interaction
* This talk says why the ideas in that paper might already be out of date, and gives alternative ideas
* If there are mistakes here, you have a chance to save me years of work!

Outline
* Introduction
* Data Ingestion
  * Previously: Parallel Extraction
  * Alternative: The Data-Centric Web
* User Interaction
  * Previously: Model Generation for Output
  * Alternative: Data Integration as UI
* Conclusion

Parallel Extraction
* Previous hypothesis:
  * There are many data models for interesting data, e.g., relational tables, E/R graphs, etc.
  * Should build a large integration infrastructure to consume many extraction streams

Database Construction
1. Start with a single large Web crawl
2. Each of k extractors emits output that:
   * Has an extractor-dependent model
   * Has an extractor-and-Web-page-dependent schema
3. For each extractor output, unfold into a common entity-relation model
4. Unify results
5. Emit final database

Potential Problems
* Pressing problems:
  * Recall
  * Simple intra-source reconciliation
  * Time
* Tables and entities are probably OK for now
  * Many data sources (DBPedia, Facebook, IMDB) already match one of these two pretty well
* One possible different direction: the Data-Centric Web
  * Addresses recall only

The Data-Centric Web
* (series of illustrative slides; figures not reproduced)

Data-Centric Lists
* Lists of Data-Centric Entities give hints:
  * About what the target entity contains
  * That all members of the set are DCEs, or not
  * That members of the set belong to a class or type (e.g., program committee members)

Build the Data-Centric Web
1. Download the Web
2. Train classifiers to detect DCEs and DCLs
3. Filter out all pages that fail both tests
4. Use lists to fix up incorrect Data-Centric Entity classifications
5. Run attr/val extractors on DCEs
* Yields an E/R dataset, for insertion into DBPedia, YAGO, etc.
* In progress now with student Ashwin Balakrishnan; entity detector >95% accuracy

Research Question 1
* How many useful entities…
  * Lack a page in the Data-Centric Web? (That means no homepage, no Amazon page, no public Facebook page, etc.)
  * AND are otherwise well-described enough online that IE can recover an entity-centric view?
* Put differently: does every entity worth extracting already have a homepage on the Web?
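The five-step recipe above can be sketched in code. Everything here is a toy stand-in: `is_dce` and `is_dcl` are trivial keyword stubs where the talk uses trained classifiers, `extract_attrs` is a hypothetical attribute/value extractor, and step 4 (list-based fixup) is omitted.

```python
# Sketch of the Build-the-Data-Centric-Web pipeline over a toy "crawl".
# Classifier and extractor names are illustrative, not from the talk.

def is_dce(page):
    # Hypothetical Data-Centric Entity detector (a trained classifier in practice).
    return "homepage of" in page["text"]

def is_dcl(page):
    # Hypothetical Data-Centric List detector.
    return page["text"].count("- ") >= 2

def extract_attrs(page):
    # Stand-in for attribute/value extraction over a DCE page.
    attrs = {}
    for line in page["text"].splitlines():
        if ":" in line:
            key, _, val = line.partition(":")
            attrs[key.strip()] = val.strip()
    return attrs

def build_data_centric_web(crawl):
    # Step 3: keep only pages that pass at least one classifier.
    kept = [p for p in crawl if is_dce(p) or is_dcl(p)]
    # Step 5: run attr/val extractors on the DCE pages, yielding E/R-style data.
    return {p["url"]: extract_attrs(p) for p in kept if is_dce(p)}

crawl = [
    {"url": "http://example.org/a", "text": "homepage of Ann\nAffiliation: UM\nRole: student"},
    {"url": "http://example.org/b", "text": "just a blog post"},
]
print(build_data_centric_web(crawl))
```

The second page fails both tests and is filtered out, so only the entity page survives with its extracted attribute/value pairs.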
Research Question 2
* Does a single real-world entity have more than one "authoritative" URL?
* Note that Wikipedia provides pretty minimal assistance in choosing the right entity, but does a good job here

Outline
* Introduction
* Data Ingestion
  * Previously: Parallel Extraction
  * Alternative: The Data-Centric Web
* User Interaction
  * Previously: Model Generation for Output
  * Alternative: Data Integration as UI
* Conclusion

Model Generation for Output
* Previous hypothesis:
  * Many different user applications built against a single back-end database
  * The difficult task is translating from the back-end data model to the application's data model

Query Processing
1. Query arrives at the system
2. Entity-relation database processor yields entity results
3. Query Renderer chooses an appropriate output schema
4. User corrections are logged and fed into later iterations of database construction

Potential Problems
* Many plausible front-end applications, but none yet totally compelling and novel
  * Ad- and search-driven ones are not novel
  * Freebase and Wolfram Alpha are not compelling
  * Raw input to learners: useful, but not an end-user application
* Need to explore possible applications rather than build multi-app infrastructure
* One possible different direction: data integration as a user primitive

Data Integration as UI
* Can we combine tables to create new data sources?
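The four query-processing steps above can be sketched as a tiny pipeline. This is a minimal illustration under assumed names (`query_er_store`, `choose_schema`, `render` are all hypothetical): entity results come back from an E/R store, and the renderer picks an output schema from the attributes the results share.

```python
# Toy sketch of steps 1-3: query -> entity results -> rendered output schema.
# The "store" is a list of attribute dicts standing in for an E/R database.

def query_er_store(store, predicate):
    # Step 2: the E/R processor yields matching entities.
    return [e for e in store if predicate(e)]

def choose_schema(entities):
    # Step 3: render with the attributes common to every result.
    common = set(entities[0])
    for e in entities[1:]:
        common &= set(e)
    return sorted(common)

def render(entities):
    schema = choose_schema(entities)
    return [tuple(e[a] for a in schema) for e in entities]

store = [
    {"name": "serge abiteboul", "affil": "inria", "year": 1996},
    {"name": "gustavo alonso", "affil": "etz zurich"},
]
results = query_er_store(store, lambda e: "affil" in e)
print(choose_schema(results))  # ['affil', 'name']
print(render(results))
```

Step 4 (feeding logged user corrections back into database construction) is a training loop, not shown here.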
* Many existing "mashup" tools, but they ignore realities of Web data:
  * A lot of useful data is not in XML
  * The user cannot know all sources in advance
  * Transient integrations
  * Dirty data

Interaction Challenge
* Try to create a database of all "VLDB program committee members"

Octopus
* Provides a "workbench" of data integration operators to build the target database
* Most operators are not correct/incorrect, but high/low quality (like search)
* Also offers prosaic traditional operators
* Originally ran on WebTable data [VLDB 2009, Cafarella, Khoussainova, Halevy]

Walkthrough - Operator #1: SEARCH
* SEARCH("VLDB program committee members") returns ranked extracted tables, e.g.:

  serge abiteboul   | inria
  michael adiba     | …grenoble
  antonio albano    | …pisa
  …

  serge abiteboul   | inria
  anastassia ail…   | carnegie…
  gustavo alonso    | etz zurich
  …

Walkthrough - Operator #2: CONTEXT
* CONTEXT() recovers relevant data from each table's source page; here, the committee year:

  serge abiteboul   | inria       | 1996
  michael adiba     | …grenoble   | 1996
  antonio albano    | …pisa       | 1996
  …

  serge abiteboul   | inria       | 2005
  anastassia ail…   | carnegie…   | 2005
  gustavo alonso    | etz zurich  | 2005
  …

Walkthrough - UNION
* Union() combines the datasets into one table:

  serge abiteboul   | inria       | 1996
  michael adiba     | …grenoble   | 1996
  antonio albano    | …pisa       | 1996
  …
  serge abiteboul   | inria       | 2005
  anastassia ail…   | carnegie…   | 2005
  gustavo alonso    | etz zurich  | 2005
  …

Walkthrough - Operator #3: EXTEND
* Adds a column to the data; similar to "join", but the join target is a topic
* EXTEND("publications", col=0):

  serge abiteboul   | inria       | 1996 | "Large Scale P2P Dist…"
  michael adiba     | …grenoble   | 1996 | "Exploiting bitemporal…"
  antonio albano    | …pisa       | 1996 | "Another Example of a…"
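Two of the operators in the walkthrough can be sketched directly on the toy rows above. This is a minimal illustration, not the Octopus implementation: the real CONTEXT operator recovers the year from the source page's text, while here it is simply passed in.

```python
# Minimal sketch of CONTEXT and UNION over toy program-committee tables.

def context(table, value):
    # Append a recovered context value (here, the committee year) to every row.
    return [row + (value,) for row in table]

def union(*tables):
    # Concatenate tables that share a positional schema, dropping exact duplicates.
    seen, out = set(), []
    for t in tables:
        for row in t:
            if row not in seen:
                seen.add(row)
                out.append(row)
    return out

pc_1996 = [("serge abiteboul", "inria"), ("michael adiba", "grenoble")]
pc_2005 = [("serge abiteboul", "inria"), ("gustavo alonso", "etz zurich")]

combined = union(context(pc_1996, 1996), context(pc_2005, 2005))
print(combined)
```

Note that the same person appears twice with different years, matching the walkthrough: after CONTEXT, the two "serge abiteboul" rows are no longer duplicates, so UNION keeps both.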
  serge abiteboul   | inria       | 2005 | "Large Scale P2P Dist…"
  anastassia ail…   | carnegie…   | 2005 | "Efficient Use of the…"
  gustavo alonso    | etz zurich  | 2005 | "A Dynamic and Flexible…"
  …

* The user has integrated data sources with little effort
* No wrappers; the data was never intended for reuse

CONTEXT Algorithms
* Input: a table and its source page
* Output: data values to add to the table
* SignificantTerms: sorts terms on the source page by "importance" (tf-idf)
* Related View Partners: looks for different "views" of the same data

CONTEXT Experiments
* (results figure not reproduced)

Data Integration as UI
* Compelling for database researchers, but will large numbers of people use it?

Conclusion
* Automatic Web KBs are rapidly progressing
  * Recall still not good enough for many tasks, but progress is rapid
* Not clear what those tasks should be, and progress there is much slower
  * Difficult to predict what's useful
  * Sometimes difficult to write a "new app" paper
* Omnivore's approach was not wrong, but it did not directly address these problems
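The SignificantTerms idea above can be illustrated with a small tf-idf scorer: rank the terms of one source page against a background corpus, so page-specific terms (like a committee year) float to the top. This is a toy sketch with made-up documents; the real algorithm scores against a Web-scale corpus.

```python
# Toy SignificantTerms: rank source-page terms by tf-idf against a tiny corpus.
import math
from collections import Counter

def significant_terms(page, corpus, k=3):
    docs = [doc.lower().split() for doc in corpus]
    words = page.lower().split()
    tf = Counter(words)
    n = len(docs)
    scores = {}
    for term, count in tf.items():
        df = sum(term in d for d in docs)          # document frequency
        idf = math.log((1 + n) / (1 + df)) + 1.0   # smoothed idf
        scores[term] = (count / len(words)) * idf
    # Return the k highest-scoring terms.
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

corpus = [
    "the vldb 1996 program committee",
    "the sigmod program committee",
    "a database conference homepage",
]
page = "vldb 1996 program committee members 1996"
print(significant_terms(page, corpus))  # ['1996', 'members', 'vldb']
```

Common terms like "program" and "committee" are discounted by idf, while the repeated year and the corpus-rare "members" score highest; those are exactly the values CONTEXT would want to attach to the table.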