SIGMOD2009 Overview
Web group
Li Yukun

Outline
  Overview of SIGMOD2009
  Overview of two selected papers
    Optimizing Complex Extraction Programs over Evolving Text Data
    Exploiting Context Analysis for Combining Multiple Entity Resolution Systems

Sessions of SIGMOD2009
  Research Session 1: Security I
  Research Session 2: Databases on Modern Hardware
  Research Session 3: Information Extraction
  Research Session 4: Security II
  Research Session 5: Large-Scale Data Analysis
  Research Session 6: Entity Resolution
  Research Session 7: Testing and Security
  Research Session 8: Column Stores
  Research Session 9: Data on the Web
  Research Session 10: Probabilistic Databases I
  Research Session 11: Database Optimization
  Research Session 12: Probabilistic Databases II
  Research Session 13: Skyline Query Processing
  Research Session 14: Understanding Data and Queries
  Research Session 15: Nearest Neighbor Search
  Research Session 16: Query Processing on Semi-structured Data
  Research Session 17: Data Integration
  Research Session 18: Keyword Search
  Research Session 19: Semi-structured Data Management
  Research Session 20: Data Management Pearls
  Research Session 21: Indexing

SIGMOD keynote talks
  Enterprise Applications - OLTP and OLAP - Share One Database Architecture
    Hasso Plattner (Hasso-Plattner-Institute for IT Systems Engineering)
  Transforming Data Access Through Public Visualization
    Fernanda B. Viegas (IBM)
    Martin Wattenberg (IBM)

Web-based visualizations, ranging from political art projects to news stories, have reached audiences of millions. Meanwhile, new initiatives in government, aimed at all citizens, point to an era of increased transparency. The talk centers on a "living laboratory" web site where people may upload their own data, create interactive visualizations, and carry on conversations. Political discussions, citizen activism, religious discussions, game playing, and educational exchanges all happen on the site.
Further supporting these scenarios, and the users they represent, will require continued innovation in data presentation and interaction.

SIGMOD INVITED SESSIONS
  Special Invited Session on Human-Computer Interaction with Information
    Design for Interaction
      Daniel Tunkelang (Endeca)
    Voyagers and Voyeurs: Supporting Social Data Analysis
      Jeffrey Heer (Stanford University)
    Augmented Social Cognition
      Ed H. Chi (PARC)
  Special Invited Session on Systems Research and Information Management
    Storage Class Memory: Technology, Systems and Applications
      Richard F. Freitas (IBM)
    Distributed Data-Parallel Computing Using a High-Level Programming Language
      Michael Isard (Microsoft Research)
      Yuan Yu (Microsoft Research)

SIGMOD TUTORIALS
  Large-Scale Uncertainty Management Systems: Learning and Exploiting Your Data
  FPGA: What's in it for a Database?
  Keyword Search on Structured and Semi-Structured Data
  Database Research in Computer Games
  Anonymized Data: Generation, Models, Usage

Summary
  Hot words: Probabilistic, Semi-structured, Security, Search & Query, Extraction & Resolution

DataSpace Framework
  [Figure: DataSpace Framework diagram. Labels: User Interaction; Browsing; Query; Kd search; Association DB; Evolution; Entity; Association; Extraction; Domain resolution; Integration; Email; Memo; User logs; Users; Documents; Web pages; Blogs; DB]

Future work on DataSpace
  Managing entities and associations
    Entity identification and resolution
    Data extraction and cleaning
    Pay-as-you-go integration
    Uncertain data mapping
    Update of entities and associations
  Query & search in dataspace
    Keyword search
    Approximate query
    Facet-based search in dataspace

Selected readings
  Data integration
    Top-K Generation of Integrated Schemas Based on Directed and Weighted Correspondences
    Core Schema Mappings
  Entity Resolution
    Exploiting Context Analysis for Combining Multiple Entity Resolution Systems
    Entity Resolution with Iterative Blocking
    A Grammar-based Entity Representation Framework for Data Cleaning
  Data on the Web
  Indexing
    A Revised R*-tree in Comparison with Related Index Structures
  Understanding Data and Queries
    Why Not?
    Query by Output
    Detecting and Resolving Unsound Workflow Views for Correct Provenance Analysis
  Query processing on Semi-structured data
    Scalable Join Processing on Very Large RDF Graphs
    Optimizing Complex Extraction Programs over Evolving Text Data
    Robust Web Extraction: An Approach Based on a Probabilistic Tree-Edit Model
    Combining Keyword Search and Forms for Ad Hoc Querying of Databases

Outline
  Overview of SIGMOD2009
  Two selected papers
    Optimizing Complex Extraction Programs over Evolving Text Data
    Exploiting Context Analysis for Combining Multiple Entity Resolution Systems

Paper 1

Introduction
  Motivation
    Traditional IE methods: static
    Practical conditions: dynamic corpora
      DBLife (10000+ URLs, 120+ MB corpus snapshot)
      Enterprise intranets
  Problem
    How to efficiently extract information from dynamic corpora

Problem Definition
  Concepts
    Data pages, extractors, mentions
    An extractor E: p → R(a_1, a_2, …, a_n) extracts mentions of relation R from page p.
    A mention of R is a tuple (m_1, m_2, …, m_n) such that each m_i is either a mention of attribute a_i or nil.
  Examples
  Assumptions
    Mentions are extracted from each single data page

Methods
  Concepts
    Extractor scope
      Let s.start and s.end be the start and end character positions of a string s in a page p. We say an extractor E has scope α iff for any mention m = (m_1, …, m_n) produced by E, (max_i m_i.end − min_i m_i.start) < α, where m_i.start and m_i.end are the start and end character positions of attribute mention m_i in page p.
    Extractor context
      The β-context of mention m in page p is the string p[(m.start − β)..(m.end + β)], i.e., the string of m extended on both sides by β characters. We say extractor E has context β iff for any m and p′ obtained by perturbing the text of p outside the β-context of m, applying E to p′ still produces m as a mention.
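
The scope and context definitions above can be made concrete with a small Python sketch. It is an illustration only, not code from the paper: the names `Span`, `find_span`, `mention_extent`, `within_scope` and `beta_context` are hypothetical, end positions are assumed to be exclusive (Python slice convention), and the β-context is assumed to be clamped at the page boundaries.

```python
from typing import NamedTuple, Optional, Sequence


class Span(NamedTuple):
    """Character span of one attribute mention (end is exclusive: an assumption)."""
    start: int
    end: int


Mention = Sequence[Optional[Span]]  # one entry per attribute; None stands for nil


def find_span(page: str, text: str) -> Span:
    """Hypothetical helper: span of the first occurrence of text in page."""
    start = page.index(text)
    return Span(start, start + len(text))


def mention_extent(mention: Mention) -> int:
    """max_i m_i.end - min_i m_i.start, taken over the non-nil attribute mentions."""
    spans = [s for s in mention if s is not None]
    return max(s.end for s in spans) - min(s.start for s in spans)


def within_scope(mention: Mention, alpha: int) -> bool:
    """True if this mention is compatible with an extractor of scope alpha."""
    return mention_extent(mention) < alpha


def beta_context(page: str, mention: Mention, beta: int) -> str:
    """The mention's text extended by beta characters on both sides."""
    spans = [s for s in mention if s is not None]
    start = min(s.start for s in spans)
    end = max(s.end for s in spans)
    # Clamping to the page boundaries is an assumption of this sketch.
    return page[max(0, start - beta):min(len(page), end + beta)]


page = "Contact: Alice Smith, phone 555-1234, office 12."
# A mention of a hypothetical relation R(name, phone, email); email is nil.
m = (find_span(page, "Alice Smith"), find_span(page, "555-1234"), None)
print(mention_extent(m))         # extent of the mention in characters
print(beta_context(page, m, 3))  # mention text plus 3 characters on each side
```

If an extractor has context β, text outside the string returned by `beta_context` can change without affecting whether the mention is produced, which is what makes reuse of earlier extraction results possible on an evolving page.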
Challenges
  Matchers (find overlaps)

Solutions
  CAPTURING IE RESULTS
  REUSING CAPTURED IE RESULTS
    Level of reuse
    IE results to capture
    Storing captured IE results
    Scope of mention reuse
    Overall processing algorithm
    Identifying reuse with matchers
  SELECTING A GOOD IE PLAN
    Searching for good plans
    Cost model

Evaluation (data set)

Experimental Results

Paper 2

Introduction
  What is entity resolution
    To identify and group references that co-refer, that is, refer to the same entity.
    Examples: Jone Smith, J. Smith, John.Smith, J.Smith
    The output: a clustering of references, where each cluster is supposed to represent one distinct entity.
  Motivation
    New data characteristics

Problem definition
  Entity Resolution
    The ER problem has been studied in several research areas under many names, such as coreference resolution, deduplication, object uncertainty, record linkage, and reference reconciliation.
    A wide variety of techniques have been developed for the ER problem.
  Methods
    Similarity (metrics, textual, attributes, etc.)
    Blocking
    Voting
  Problem
    Existing methods pay little attention to context features

Problem Definition
  To identify the co-refer relationship between two mentions

Context-based framework
  Context features
    Effectiveness
    Generality
    Number of clusters

Overview of the approaches
  Meta-level classification
  Context-extended classification
  Context-weighted classification
  Creating final clusters

Experiments
  Web domain
    Data set from WWW'05 [Bekkerman et al.]
    Contains web pages of 12 different persons
    Created by searching the web using Google
  RealPub domain
    11682 publications
    14590 authors
    3084 departments
    1494 organizations

Experimental results on Web domain

Summary
  Managing uncertain data and unstructured data is becoming a hot topic.
  It is also an important problem for DataSpace.
  Promising topics should be selected on this basis.

Thanks
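
Appendix sketch for Paper 2: the slides say the output of entity resolution is a clustering of references, but they do not show how the final clusters are created. The Python sketch below uses plain transitive closure over pairwise co-reference decisions (a union-find structure) as a simple stand-in; this is an assumption for illustration and not necessarily the clustering step used in the paper. The function name `cluster_references` and the reference "A. Jones" are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_references(
    references: List[str],
    coref_pairs: Iterable[Tuple[str, str]],
) -> List[List[str]]:
    """Group references into clusters by transitive closure of co-refer pairs."""
    parent: Dict[str, str] = {r: r for r in references}

    def find(r: str) -> str:
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in coref_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for r in references:
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())


references = ["Jone Smith", "J. Smith", "John.Smith", "J.Smith", "A. Jones"]
# Pairwise decisions, e.g. as produced by a classifier judging each pair.
coref_pairs = [
    ("Jone Smith", "J. Smith"),
    ("John.Smith", "J.Smith"),
    ("J. Smith", "J.Smith"),
]
for cluster in cluster_references(references, coref_pairs):
    print(cluster)
```

Transitive closure merges clusters on any single positive decision, so one wrong pairwise decision can join two distinct entities; that sensitivity is one reason a real system may prefer a more conservative clustering step.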