Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Part One XML and Databases Soumen Chakrabarti CSE, IIT Bombay Form and content • The Web today – HTML generated by hand, wysisyg editors, ‘webified’ databases – HTML specifies rendering for human reading – Screen scraping required to consolidate data • The Web in the future – Common interchange format (XML) – Concentrate on content, not form – Represent data class broader than relations Role of databases • Contribute – Data storage and indexing – Query processing and optimization – Views, transformations, integration • Adopt – Search modalities – Content-based approximate search – Linguistic analysis Features of semi-structured data • • • • • No explicit schema, or volatile schema Schema size comparable to data size Structure changes without notice Heterogeneous, deeply nested, irregular Has nature of documents rather than tables Semi-structured data model example Bib &o1 complex object paper paper book references &o12 &o24 references author title year &o29 references author http page author title publisher title author author author &o43 &25 &96 1997 last firstname lastname atomic object firstname lastname &243 “Serge” “Abiteboul” “Victor” Object Exchange Model (OEM) first &206 “Vianu” 122 133 Syntax { paper: { author: “Abiteboul”, author: { firstname: “Victor”, lastname: “Vianu”}, title: “Regular path queries …”, page: { first: 122, last: 133 } } } Some observations • • • • Missing or additional attributes Multiple attributes Different types in different objects Heterogeneous collections Object ID’s and references <person id=“o555”> o456 <name>Jane </name> children children </person> mother <person id=“o456”> o555 o123 <name>Mary</name> <children idref=“o123 o555”/> </person> <person id=“o123” mother=“o456”> <name>John</name> </person> Names and acronyms • OEM (Object Exchange Model): a semistructured data model from Stanford, 1995 • Lore: a system for storing data adhering to the OEM • Lorel: a query language for Lore • XML (eXtensible Markup Language): a simplification of SGML and a generalization of HTML • XML-QL: Query language for XML Lorel query examples select Bib.paper.title from Bib.paper where Bib.paper.year >1995 Alternative select X.title from Bib.paper X, Bib.(paper|book) Y where Y.author.lastname? = “Ullman” and Y.reference+ X Transitive closure Navigating partially known structures XML-QL query examples where <book language=“french”> <publisher><name>Morgan Kaufmann</name> </publisher> <author> $a </author> </book> in “www.a.b.c/bib.xml” construct $a where <book language = $l> <author> $a </> </> in “www.a.b.c/bib.xml” construct <result><author>$a</><lang>$l</></> XML storage in ternary relation Ref S o u rc e &o1 & & & & & paper &o2 title &o3 author author &o4 “The Calculus” “…” year &o5 “…” &o6 “1986” • Too many joins • Label name storage redundant o1 o2 o2 o2 o2 Val N ode & & & & o3 o4 o5 o6 L abel D est paper title a u th o r a u th o r year & & & & & o2 o3 o4 o5 o6 V a lu e T h e C a lc u lu s … … 1986 Storage optimization through mining paper paper paper Paper1 paper year author title author authortitle authortitleauthor title fn ln fn ln fn ln fn fn 1 ln 1 fn 2 ln 2 title year X X X X X X X - X - X X X X - ln Paper2 • Inline common cases • Tolerate a few nulls a u th o r X title X Schema extraction • Schema: a template for type/semantics specification • Conformance – Does that data conform to a given schema ? • Classification – If so, which objects belong to what classes/types? • Applications – Storage and query optimization Graph simulation Given two edge-labeled graphs G1 and G2, a simulation is a relation R between nodes such that if (x1, x2) is in R, and (x1, a, y1) is in G1, then there exists (x2, a, y2) in G2 (same label) such that (y1,y2) is in R x1 G1 R x2 a a y1 R y2 G2 Upper and lower bound schema • Lower bound schema – Conformance: find simulation R from S to D – Classification: check if (c,x) in R – Used in storage optimization • Upper bound schema (data guides) – Conformance: find simulation R from D to S – Classification: check if (x,c) in R – Used in path index generation and query optimization Sample data &r employee employee employee employee employee manages employee employee manages manages manages manages &p1 &p2 managedby &p3 company worksfor worksfor &p4 &p5 &p6 &p7 managedby managedby managedby worksfor employee managedby worksfor worksfor worksfor worksfor worksfor &c &p8 Lower bound schema Root &r employee company employee Bosses &p1,&p4,&p6 worksfor Company &c manages managedby worksfor Regulars &p2,&p3,&p5,&p7,&p8 Storage using lower bound schema Lower-bound schema person Root company works-for managed-by Company Employee c.e.o. name address name string Employee o id … … nam e … … m a n a g e d -b y … … w o rk s-fo r … … Store rest in overflow graph Upper bound schema (DataGuides) Root &r company managedby employee Employees &p1,&p1,&p3,P4 &p5,&p6,&p7,&p8 manages worksfor Bosses &p1,&p4,&p6 worksfor Company &c manages managedby worksfor Regulars &p2,&p3,&p5,&p7,&p8 Query optimization issues Select x from A.B x where exists y in x.C: y=5 A A D A D B D D B B B B B B B C C C C C C C C C 5 5 5 4 4 5 4 4 5 B What makes the problem difficult • • • • Selectivity estimation Index selection Access cost models Clustering choices Part Two Information Retrieval and Databases Soumen Chakrabarti CSE, IIT Bombay Information retrieval (IR) • Search – ‘Inverted’ index – Boolean match – Relevance ranking cat dog D5: 3, 37, 50 D7: 9, 20 D7: 7, 90, 400 D20: 22, 533 • Classification – Learn topics from examples • Clustering – Discover topics from a document collection • Never done inside a relational database Current style of loose integration • RDBMS provides hooks • Declare some columns as textual with keyword index • Inserts, updates, and deletes trigger external program, e.g., Verity search engine • Search engine maintains separate indices • Simple query rewriting to combine relational and text-match where-clauses Reasons • Space – BLOB vs. pure relational representation – Average English word is only 5 bytes • Time – Most text engines are resigned to flexible (i.e., no) model for data consistency – Much faster read-only access than relational database lookups New features desired • Operations that are more complex than keyword search can benefit from tighter coupling with RDBMS • Approximate search is essential (Anand Rajaraman, Amazon.com, SIGMOD 99) – Misspelling book title, author name common – Variant of OEM edge label (author/writer/poet) • Similarity extends to structure as well (‘Travolta’ NEAR ‘Cage’ = ‘Face/Off’) Case study: generalized ‘like’ • SQL has limited string matching constructs – like ‘%x’, ‘x%’, ‘%x%’ – x must be exact match • Need more lenient match – Applications: LDAP, IR • String edit distance is not suitable – “Given query, order strings in database in increasing order of edit distance and pick top 5” Sliding-window matching nas asc sce cen ent pas sca cal nascent pascal ras rascal • Given a query, scan to get a set of 3-grams • Similarity of string in database to query = number of shared 3-grams Issues • • • • • Minimally disruptive architecture Low storage overheads Fast query processing Good selectivity estimates Combining with other predicates for ranking • Efficiently handling updates