DataSpaces: A New Abstraction for Data Management
Alon Halevy*
DASFAA 2006, Singapore
*Joint work with Mike Franklin and David Maier

Outline
• Dataspaces:
  – collections of heterogeneous (un)structured data
  – examples and characteristics
• Dataspace Support Platforms:
  – “Pay-as-you-go” data management
• Getting there:
  – Recent progress and research challenges

Shrapnels in Baghdad
Story courtesy of Phil Bernstein

Personal Information Management [Semex: Dong et al.]
[Figure: association types in a personal information space: AttachedTo, Recipient, ConfHomePage, ExperimentOf, CourseGradeIn, PublishedIn, Sender, Cites, EarlyVersion, ArticleAbout, PresentationFor, FrequentEmailer, CoAuthor, BudgetOf, OriginatedFrom, HomePage, AddressOf]

Google Base

What do they have in common?
• All dataspaces contain >20% porn.
• The rest is spam.

What do they have in common?
• Must manage all the data in the space.
• Need best-effort services with no setup time.
• Data is heterogeneous,
  – possibly unstructured
• Do not have control over the data, just access.

Isn’t this Data Integration?
[Figure: schema entities (Phenotype, Gene, Sequenceable Entity, Protein, Experiment, Nucleotide Sequence, Microarray Experiment, Structured Vocabulary) and data sources (OMIM, SwissProt, HUGO, GeneClinics, LocusLink, GO, Entrez, GEO)]

No, it’s Data Co-existence
• Data integration systems require semantic mappings.

Semantic Mappings
Inventory Database A
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  Authors(ISBN, FirstName, LastName)
  BookCategories(ISBN, Category)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)

No, it’s Data Co-existence
• Data integration systems require semantic mappings.
• Dataspaces are “pay-as-you-go”:
  – Provide some services immediately
  – Create tighter integrations as needed.

The Cost of Semantics
• Schema first vs. schema last
[Figure: benefit plotted against investment (time, cost), with one curve for dataspaces and one for data integration solutions]

Why Now?
• Data management is moving towards dataspace-like applications.
• Prediction:
  – Data management is about people, not enterprises.
  – In 5 years our community will figure it out.
• We’ve made relevant progress:
  – Combining DB & IR
  – Creation and management of semantic mappings
  – Uncertainty, lineage, inconsistency

Dataspaces Fundamentals: Participants and Relationships
[Figure: participants (RDB, SDB, XML, WSDL, java, sensor) connected by relationships (snapshot, 1hr updates, manually created, schema mapping, view, replica)]

DSSP Components
• Catalog
• Local Store & Index
• Search & Query
• Administration
• Discovery & Enhancement
[Figure: the participant diagram surrounded by the five components and annotated with their services, such as heterogeneous index, reference reconciliation, additional associations, cache for performance & availability, seamless flow from search to query, lineage, uncertainty, completeness, maintain catalog, set up workflows]

DSSP Components
[Figure: the same component diagram without the service annotations]

Querying a Dataspace
• Best effort, based on:
  – Approximate semantic mappings
  – Other mechanisms
• Example: searching for Beng Chin’s phone number:
  – Keyword search for “beng chin ooi”
  – Examine attributes of tuples/XML elements
  – Match attributes to ‘address’

Querying a Dataspace
• Best effort
• Combine structured and unstructured data

Two Kinds of Data
Structured Data (or XML) [Dong, Liu, Halevy]
Unstructured Data

Querying a Dataspace
• Best effort
• Combine structured and unstructured data
• Rank: answers, sources
[Examples: Volvo Palo alto; Acura integra Palo alto]

Querying a Dataspace
• Best effort
• Combine structured and unstructured data
• Rank: answers, sources
• Iterative: sequences of queries:
  – Used cars palo alto
  – Saab for sale
  – Classified ads
Classified ads for Saabs near Palo Alto

Querying a Dataspace
• Best effort
• Combine structured and unstructured data
• Rank: answers, sources
• Iterative: sequences of queries
• Reflective: need to explain and expose confidence in answers.

Dataspace Reflection
• Sources of uncertainty:
  – Unreliable sources
  – Data obtained by imprecise extractions
  – Inconsistent data
  – Approximate query answering (mappings)
• A DSSP must:
  – Expose the lineage of an answer
    • Web search engines already do this
  – Reason about relationship between sources

LUI Introspection
• Currently, three distinct formalisms:
  – Uncertainty
  – Lineage
  – Inconsistency
• Single formalism should do it all:
  – Uncertainty and lineage (Trio @ Stanford)
  – Lineage & inconsistency (Orchestra @ U.Penn)

ULDBs [Benjelloun, Das Sarma, Halevy, Widom]
• Combine uncertainty and lineage
  – Based on x-tuples: {(t1 | t2 | t3)}
• Queries can be answered with no additional complexity
  – You can do even better than uncertain DBs.
• Because of lineage, you can sometimes obtain completeness.
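
The x-tuple idea on the ULDB slide can be made concrete with a small sketch. It is not the Trio implementation: the class names, the `maybe` flag, the `select` helper and the car-listing data are all assumptions chosen for illustration. An x-tuple holds mutually exclusive alternatives, and each alternative produced by a query records which base alternative it was derived from (its lineage).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alternative:
    values: tuple
    # Lineage: the (x-tuple id, alternative index) pairs this alternative derives from.
    lineage: frozenset = frozenset()


@dataclass
class XTuple:
    xid: str
    alternatives: list   # mutually exclusive alternatives: (t1 | t2 | t3)
    maybe: bool = False  # True if the x-tuple may be absent altogether


def select(relation, predicate, prefix="r"):
    """Keep the alternatives satisfying the predicate and record their lineage."""
    result = []
    for xt in relation:
        kept = [
            Alternative(alt.values, frozenset({(xt.xid, i)}))
            for i, alt in enumerate(xt.alternatives)
            if predicate(alt.values)
        ]
        if kept:
            # If some alternative was filtered out, the result may not exist at all.
            dropped = len(kept) < len(xt.alternatives)
            result.append(XTuple(f"{prefix}{len(result) + 1}", kept, xt.maybe or dropped))
    return result


# Hypothetical data: an imprecise extraction is unsure whether listing l1 is a Volvo or a Saab.
listings = [
    XTuple("l1", [Alternative(("Volvo", "Palo Alto")), Alternative(("Saab", "Palo Alto"))]),
    XTuple("l2", [Alternative(("Acura", "Palo Alto"))]),
]

saabs = select(listings, lambda v: v[0] == "Saab")
for xt in saabs:
    for alt in xt.alternatives:
        print(xt.xid, alt.values, sorted(alt.lineage), "maybe" if xt.maybe else "certain")
```

Because every result alternative points back to a base alternative, a system built this way can explain an answer and reason about which answers can hold together, which is the role lineage plays on the slide.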
DSSP Components
[Figure: the five components (Catalog, Local Store & Index, Search & Query, Administration, Discovery & Enhancement) around the participant diagram]

[Figure: a Personal Information Space in which Alon Halevy and Luna Dong are each linked to the paper “Semex: …” by author / authoredPaper edges, next to a Departmental Database tuple (StuID 1000001, LastName Xin, FirstName Dong); in the inverted list, rows are keywords, columns are instances, and a 1 marks an occurrence]
Inverted List: Alon, Dong, Halevy, Luna, Semex, Xin

Inverted List: Alon, Dong, Halevy, Luna, Semex, Xin
Query: Dong

Inverted List: Alon, Dong, Halevy, Luna, Semex, Xin
Query: FirstName “Dong”

Inverted List: Alon&&name&&, Dong&&FirstName&&, Dong&&name&&, Halevy&&name&&, Luna&&name&&, Semex&&title&&, Xin&&LastName&&
Query: FirstName “Dong”

Inverted List: Alon&&name&&, Dong&&FirstName&&, Dong&&name&&, Halevy&&name&&, Luna&&name&&, Semex&&title&&, Xin&&LastName&&
Query: name “Dong”

Inverted List: Alon&&name&&, Dong&&name&&FirstName&&, Dong&&name&&, Halevy&&name&&, Luna&&name&&, Semex&&title&&, Xin&&name&&LastName&&
Query: name “Dong”
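
The attribute-tagged inverted list built up on these slides can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the instance ids, the `PARENT` hierarchy table and the prefix-scan lookup are assumptions. Only the `&&` key format and the example values come from the slides.

```python
from collections import defaultdict

DELIM = "&&"

# Assumed attribute hierarchy: FirstName and LastName are sub-attributes of name.
PARENT = {"FirstName": "name", "LastName": "name"}


def attribute_path(attribute):
    """Return the attribute preceded by its ancestors, e.g. ['name', 'FirstName']."""
    path = [attribute]
    while path[0] in PARENT:
        path.insert(0, PARENT[path[0]])
    return path


def make_key(keyword, attribute):
    """Build an index key such as 'Dong&&name&&FirstName&&'."""
    return keyword + DELIM + DELIM.join(attribute_path(attribute)) + DELIM


def build_index(instances):
    """Map each key to {instance id: occurrence count}."""
    index = defaultdict(lambda: defaultdict(int))
    for instance_id, attributes in instances.items():
        for attribute, text in attributes.items():
            for keyword in text.split():
                index[make_key(keyword, attribute)][instance_id] += 1
    return index


def lookup(index, keyword, attribute=None):
    """A keyword query scans the prefix 'kw&&'; an attribute query scans 'kw&&path&&'."""
    prefix = keyword + DELIM if attribute is None else make_key(keyword, attribute)
    hits = set()
    for key, postings in index.items():
        if key.startswith(prefix):
            hits.update(postings)
    return sorted(hits)


# Instances from the slides; the ids are hypothetical.
INSTANCES = {
    "person:alon": {"name": "Alon Halevy"},
    "person:luna": {"name": "Luna Dong"},
    "paper:semex": {"title": "Semex"},
    "student:1000001": {"LastName": "Xin", "FirstName": "Dong"},
}

index = build_index(INSTANCES)
print(lookup(index, "Dong"))               # plain keyword query
print(lookup(index, "Dong", "name"))       # also reaches FirstName via the hierarchy
print(lookup(index, "Dong", "FirstName"))  # only the departmental tuple
```

Putting the ancestor attribute first in the key is what lets a query on name find a value stored under FirstName with a single prefix scan. A production index would keep the keys sorted instead of scanning all of them.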
[Figure: the personal information space and departmental database example]
Inverted List: Alon&&name&&, Dong&&name&&FirstName&&, Dong&&name&&, Halevy&&name&&, Luna&&name&&, Semex&&title&&, Xin&&name&&LastName&&
Query: Paper author “Dong”

Inverted List: Alon&&author&&, Alon&&name&&, Dong&&author&&, Dong&&name&&FirstName&&, Dong&&name&&, Halevy&&name&&, Luna&&name&&, Semex&&authoredPaper&&, Semex&&title&&, Xin&&name&&LastName&&
Query: Paper author “Dong”

DSSP Components
[Figure: the five components (Catalog, Local Store & Index, Search & Query, Administration, Discovery & Enhancement) around the participant diagram]

Enhancing a Dataspace
• Creating associations
  – [Dong & Halevy, CIDR 2005]
• Reference reconciliation
  – [Dong et al., SIGMOD 2005]
  – Very active field.
[Figure: association types: AttachedTo, Recipient, ConfHomePage, ExperimentOf, CourseGradeIn, PublishedIn, Cites, EarlyVersion, ArticleAbout, Sender, ComeFrom, PresentationFor, FrequentEmailer, CoAuthor, OriginatedFrom, BudgetOf, HomePage, AddressOf]

DSSP Components
[Figure: the five components (Catalog, Local Store & Index, Search & Query, Administration, Discovery & Enhancement) around the participant diagram]

Reusing Human Attention
• Human attention is most expensive.
• Reuse whenever possible. E.g.:
  – Manual schema mapping
  – Annotations
  – Queries written on data
  – Temporary collections of items
  – Operations on the data (cut & paste)
• Solicit semantic information selectively
  – The ESP game: [von Ahn et al.]

Learning from Past Matches [Doan et al., Transformic]
• Every manual map is a learning example.
• Learn models for elements in mediated schema.
• Use multi-strategy learning.
• Thousands of maps in very little time.
Reuse for a closely related task.

Corpus-based Matching [Madhavan et al.]
Product(productID, name, price, salePrice)
  (0X7630AB12, The Concert in Central Park, $13.99, $11.99)
Music(ASIN, title, artists, recordLabel, discountPrice)
  (no tuples)

Obtaining More Evidence
Product(productID, name, price, salePrice)
  (0X7630AB12, The Concert in Central Park, $13.99, $11.99)
Corpus:
MusicCD(ASIN, album, artistName, price, discountPrice)
  (4Y3026DF23, The Best of the Doors, The Doors, $16.99, $12.99)
CD(prodID, albumName, artists, recordCompany, price, salePrice)
  (9R4374FG56, Saturday Night Fever, The Bee Gees, Columbia, $14.99, $9.99)
[Figure: corpus elements CD.albumName and CD.prodID shown alongside Product]

Comparing with More Evidence
[Figure: Product, augmented with corpus evidence from CD (albumName, prodID), compared against Music, augmented with corpus evidence from MusicCD (album, artistName, and sample values such as The Best of the Doors, The Doors)]

Challenges
• Learn from other kinds of user activities
• Create other kinds of relationships between participants
• Identify higher-level goals from user actions
• Develop a formal framework for reusing human attention.

Conclusion
“Dataspaces: because that’s the size of the problem”
• Pay-as-you-go data integration
• Data management for the masses
• Key: reuse of human attention
• Need to be very data driven in our research

Some References
• www.cs.washington.edu/homes/alon
• SIGMOD Record, December 2005:
  – Original dataspace vision paper
• PODS 2006:
  – Specific technical challenges for dataspace research
• Semex: CIDR 2005, SIGMOD 2005
• Teaching integration to undergraduates:
  – SIGMOD Record, September 2003.

1. Build initial models Ms and Mt for elements s (schema S) and t (schema T), each from Name, Instances, Type, …
2. Find similar elements in the corpus of schemas and mappings
3. Build augmented models M’s and M’t
4. Match using augmented models
5. Use additional statistics (ICs) to refine the match
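
Steps 1 to 4 above can be walked through on the Product / Music example. The sketch below is a deliberate simplification: the slides describe learned models, whereas this code swaps in plain token sets compared with Jaccard similarity, and the `Element` class, the stopword list and the "take the single most similar corpus element" rule are all assumptions. Step 5 (refining with integrity constraints) is not covered.

```python
import re
from dataclasses import dataclass

STOPWORDS = {"the", "in", "of"}  # assumed; keeps filler words out of the evidence


@dataclass(frozen=True)
class Element:
    table: str
    name: str
    instances: tuple = ()


def tokens(text):
    """Split camelCase and whitespace into lowercase tokens, minus stopwords."""
    found = re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", text)
    return {t.lower() for t in found} - STOPWORDS


def model(element):
    """Step 1: a model is a set of name (n:), instance (i:) and table-context (c:) tokens."""
    features = {"n:" + t for t in tokens(element.name)}
    features |= {"c:" + t for t in tokens(element.table)}
    for value in element.instances:
        features |= {"i:" + t for t in tokens(value)}
    return features


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def most_similar(element, corpus):
    """Step 2: pick the corpus element whose model overlaps most, if any overlaps."""
    best = max(corpus, key=lambda c: jaccard(model(element), model(c)))
    return best if jaccard(model(element), model(best)) > 0 else None


def augmented_model(element, corpus):
    """Step 3: add the evidence of the most similar corpus element."""
    neighbour = most_similar(element, corpus)
    return model(element) | model(neighbour) if neighbour else model(element)


# Elements from the slides.
s = Element("Product", "name", ("The Concert in Central Park",))
t = Element("Music", "title")  # no tuples
corpus = [
    Element("MusicCD", "album", ("The Best of the Doors",)),
    Element("CD", "albumName", ("Saturday Night Fever",)),
]

initial = jaccard(model(s), model(t))
# Step 4: match using the augmented models.
augmented = jaccard(augmented_model(s, corpus), augmented_model(t, corpus))
print(f"initial={initial:.3f} augmented={augmented:.3f}")
```

With only the two schemas, Product.name and Music.title share no evidence at all. After each borrows from its nearest corpus element (CD.albumName and MusicCD.album respectively), the two models overlap, which mirrors the "Comparing with More Evidence" slide.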