* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Author
Survey
Document related concepts
Transcript
Managing XML and Semistructured Data Part 1: Preliminaries, Motivation and Overview COMP630L Topics in DB Systems: Managing Web Data Fall, 2007 Dr Wilfred Ng Acknowledgement: Part of the materials in this set of XML slides are extracted from Prof. Dan Suciu’s course materials. Thanks for his permission of using them 1 HTML XML SGML a W3C standard to complement HTML origins: structured text SGML motivation: • HTML describes presentation • XML describes content HTML4.0 XML SGML http://www.w3.org/TR/2000/REC-xml-20001006 (version 2, 10/2000) 2 HTML <h1> Bibliography </h1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995 <p> <i> Data on the Web </i> Abiteoul, Buneman, Suciu <br> Morgan Kaufmann, 1999 HTML describes the presentation 3 XML <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … XML describes the content </bibliography> 4 SGML and XML eXtensible Markup Language - XML XML 1.0 – a recommendation from W3C, 1998 XML – related standards: W3C group – XSL, XQuery,… Roots: SGML [http://www.w3.org/MarkUp/SGML/] • Markup = encoding: possibilities for expressing information • Information modelling freedom, Reusability, provability, validity • But a very nasty language • SGML is an international standard for device-independent, system-independent methods of representing texts in electronic form After the roots: a format for sharing data (on the Web) 5 Basic XML Terminology tags: book, title, author, … start tag: <book>, end tag: </book> elements: <book>…<book>,<author>…</author> elements are nested empty element: <red></red> abbrv. <red/> an XML document: single root element 6 The Role of XML Data XML is designed for data exchange, not to replace relational or E/R data Sources of XML data: • Created manually with text editors: not really data • Generated automatically from relational data • Text files, replacing older data formats: Web server logs, scientific data (biological, astronomical) • Stored/processed in native XML engines: very few applications need that today 7 XML Advantages for Web Data Over SGML • Supported by mainstream browsers such as IE and Netscape • Standard Stylesheet • Standard linking Over HTML • • • • Interchangable Searchable Reusable Enables Automation 8 Why XML is of Interest to Us HTML fails to meet structure specification – tags for appearance only XML is a language syntax for data • Note: we have no langauge syntax for relational data • But XML is not relational: semistructured This is exciting because: • • • • Can translate any data to XML Can ship XML over the Web (HTTP) Can input XML into any application Thus: make data sharing and exchange on the Web possible! 40% annual growth and accelerating: publishers, government, education,… 9 Relational Data in XML person nam e XML: row phone name John 3634 Sue 6343 D ic k 6363 person “John” row row phone name phone 3634 “Sue” name 6343 “Dick” phone 6363 <person> <row> <name>John</name> <phone> 3634</phone></row> <row> <name>Sue</name> <phone> 6343</phone> <row> <name>Dick</name> <phone> 6363</phone></row> </person> 10 XML Data: more expressive Missing attributes: <person> <name> John</name> <phone>1234</phone> </person> <person> <name>Joe</name> </person> Could represent in a table with nulls no phone ! name phone John 1234 Joe 11 XML Data: more expressive Repeated attributes <person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone> </person> Impossible in tables: name phone Mary 2345 two phones ! 3456 ??? 12 XML Data: more expressive Attributes with different types in different objects <person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone> </person> structured name ! Nested collections (no 1NF) Heterogeneous collections: • <db> contains both <book>s and <publisher>s 13 XML Data Sharing and Exchange application application object-relational Integrate XML Data Transform WEB (HTTP) Warehouse application relational data Specific data management tasks legacy data 14 From Relational Data to XML Data XML – Data tree and its doc persons name row row row phone name John Sue Dick persons 3634 6343 6363 “John” phone name phone 3634 “Sue” name 6343 “Dick” phone 6363 <persons> <row> <name>John</name> <phone> 3634</phone></row> <row> <name>Sue</name> <phone> 6343</phone> <row> <name>Dick</name> <phone> 6363</phone></row> </persons> 15 XML Data XML is self-describing Schema elements become part of the data • Relational schema: persons(name,phone) • In XML <persons>, <name>, <phone> are part of the data, and are repeated many times ! Consequence: XML is much more flexible XML data is semistructured data 16 XML from/to Relational Data XML publishing: • relational data XML XML storage: Relational data is regular in XML tree representation. But XML data can be nonrelational! • XML relational data Supplier ? Supplier ? Common XML Illusion Broker 17 XML Publishing: ideas sketch XML Tuple streams XML publishing Relational Database SQL Web Application XPath/ XQuery 18 XML Publishing Exporting the data is relatively easier: we do this already for HTML Translating XQuery SQL is hard XML publishing systems: Research: Experanto (IBM/DB2), SilkRoute (AT&T Labs and UW, software unavailable) now SilkRoute downloadable • XQuery SQL Commercial: SQL Server, Oracle • Only XPath SQL and with restrictions A middle-ware approach 19 XML Publishing Will follow the idea in SilkRoute [ACM TODS 27(4), 2002] student enroll course Backend relational engine Relational schema: • Student(sid, name, address) • Course(cid, title, room) • Enroll(sid, cid, grade) Schemas are often proprietary but XML schemas are public 20 XML Publishing <xmlview> <course> <title> Operating Systems </title> <room> MGH084 </room> <student> <name> John </name> <address> Seattle </address > <grade> 3.8 </grade> </student> <student> …</student> … </course> <course> <title> Database </title> <room> EE045 </room> <student> <name> Mary </name> <address> Shoreline </address > <grade> 3.9 </grade> </student> <student> …</student> … </course> … </xmlview> • Group by courses: Redundant representation of students • Other representations possible too: tags may not be the same as attributes in general 21 XML Publishing First thing to do: design the DTD (public for users) <!ELEMENT xmlview (course*)> <!ELEMENT course (title,room,student*)> <!ELEMENT student (name,address,grade)> <!ELEMENT name (#PCDATA)> <!ELEMENT address (#PCDATA)> <!ELEMENT grade (#PCDATA)> Second thing to do: develop an XML canonical view under db root for each relation {t1,…tk} with schema R(A1,…,An) <R><row><A1>a11</A1>…<An>a11</An></row>… <row><A1>ak1</A1>…<An>ak1</An></row> </R> 22 Now we write an XQuery to export relational data XML Note: result should conform to DTD (slide 20 for the relational Schema; the generated result is shown in slide 21.) <xmlview> { FOR $x IN /db/Course/row RETURN <course> <title> { $x/title/text() } </title> <room> { $x/room/text() } </room> { FOR $y IN /db/Enroll/row[cid/text() = $x/cid/text()]/row $z IN /db/Student/row[sid/text() = $y/sid/text()]/row RETURN <student> <name> { $z/name/text() } </name> <address> { $z/address/text() } </address> <grade> { $y/grade/text() } </grade> </student> } </course> } </xmlview> 23 XML Publishing Query: find Mary’s grade in Operating Systems XQuery over public view FOR $x IN /xmlview/course[title/text()=“Operating Systems”], $y IN $x/student/[name/text()=“Mary”] RETURN <answer> $y/grade/text() </answer> SilkRoute does this automatically SQL over database schema SELECT Enroll.grade FROM Student, Enroll, Course WHERE Student.name=“Mary” and Course.title=“OS” and Student.sid = Enroll.sid and Enroll.cid = Course.cid 24 XML Publishing How do we choose the output structure ? Determined by agreement, with our partners, or dictated by committees in XML dialects (called applications) and generate DTDs The DTD is selective wrt the underlying databases XML Data is often nested, irregular, etc – that’s why we need the xmlview query in slide 23 to do the transformation No agreed normal forms for XML but a few work on it… [M. Arenas and L. Libkin, A Normal Form for XML Documents] 25 XML Storage Often the XML data is small and is parsed directly into the application (DOM API) – file based management (pros and cons?) Sometimes it is big, and we need to store it in a database Much harder than XML publishing (why ?) A fundamental XML storage problem: • How do we choose the schema of the database ? Possible solutions: • Schema derived from DTD – structure mapping • Storing XML as a graph such as “Edge relation” – model mapping • Native Approach – native XML/SSD engine 26 XML Storage in a Relational DB Use generic schema • [Florescu, Kossman 1999] Use DTD to derive schema • [Shanmugasundaram, et al. 1999] Use data mining to derive schema • [Deutsch, Fernandez, Suciu 1999] Use the Path table • [T.Amagasa, T.Shimura, S.Uemura 2001] 27 XML Storage: Edge Relation [Florescu, Kossman 1999] Monet [Schmidt et al. WebDB 00] Structured-based approach Mapping tree to relations Use generic relational schemas (independent on the XML schema): DOM Tree Ref(source,label,dest) Val(node,value) 28 XML Storage: Edge Relation Ref &o1 S o u rc e paper & & & & & &o2 title author &o3 author &o4 “The Calculus” “…” year &o5 “…” [Florescu, Kossman 1999] &o6 “1986” o1 o2 o2 o2 o2 Val N ode & & & & o3 o4 o5 o6 L abel D est paper title a u th o r a u th o r year & & & & & o2 o3 o4 o5 o6 V a lu e T h e C a lc u lu s … … 1986 29 XML Storage: Edge Relation In practice may need more tables for reference links and nodes: RefTag1 Source Dest RefTag1(source,dest) RefTag2(source,dest) … IntVal(node,intVal) RealVal(node,realVal) … &o2 &o5 &o6 &o1 &o7 &o8 &o8 paper title &o3 &o7 &o2 author author &o4 year &o5 &o8 &o6 30 XML Storage: DTD to Schema [Christophides, Abiteboul, Cluet, Scholl 1994] [Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton 1999] Basic idea: use the XML schema to derive the relational schema DTD: <!ELEMENT paper (title, author*, year?)> <!ELEMENT author (firstName, lastName)> Relational schemas: Paper(pid, title, year) Author(aid, pid, firstName, lastName) 31 XML Storage: DTD to Schema Each Element corresponds to a relation Each Attribute of Element corresponds to a column of relation Connect elements using foreign keys But the problems: fragmentation! Example: How many relations should be used for finding an address of an author? Paper(pid, <!ELEMENT title, year) paper (title, author*, year?)> Author(aid, <!ELEMENT pid, firstName, author (firstName,lastName,address)> lastName) Address <!ELEMENT (addid, aid, address country) (street, city, country)> City (cid, <!ELEMENT addid, postcode, city (postcode?, cityname) cityname)> (streetno?, streetname)> Street<!ELEMENT (sid, addid, street postcode, streetname) 32 XML Storage: Path Relations [T.Amagasa, T.Shimura, S.Uemura 2001] ACM TOIT 2001 1(1) Store paths as strings (model-based approach) XPath expressions become the SQL like operator Additional information for parent/child, ancestor/descendant relationship XRel: table schemas for paths, elements and val XParent:table schemas for elements, labelpath, parent (alternative but similar) 33 XML Storage: Path Relations Path pathID Pathexpr 1 #/bib 2 #/bib#/paper 3 #/bib#/paper#/author 4 #/bib#/paper#/title 5 #/bib#/paper#/year 6 #/bib#/book#/author 7 #/bib#/book#/title 8 #/bib#/book#/publisher One entry for every path in the database Relatively small 34 XML Storage: Path Relations Positions in doc Element Val Node ID Path ID Start 1 1 0 1000 - 2 2 5 200 1 3 3 8 20 2 4 3 21 30 2 5 3 31 100 2 6 3 101 150 2 7 4 151 180 2 8 2 300 500 1 End Parent ID ... One entry for every element in the database relatively large NodeID Val 3 Smith 4 Vance 5 Tim 6 Wallace 7 The Best… 6 3 7 4 8 2 ... One entry for every leaf in the database 35 Relatively large XRel Example: Path Table root 1 Contains all path string PLAY 2 • Path(pathID, pathexp) PathID PathExp SCENE 4 0 #/PLAY 1 #/PLAY#/ACT 2 #/PLAY#/ACT#/SCENE 3 #/PLAY#/ACT#/SCENE#/@id 4 #/PLAY#/ACT#/SCENE#/TITLE 5 #/PLAY#/ACT#/SCENE#/SPEECH … … ACT 3 id 5 “000” TITLE 6 “Intro” ACT … … SCENE … SPEECH 7 SPEAKER 8 “CURIO” TEXT 9 “This is …” … Path Table 36 XRel Example: Other Tables Element Table 0 6 11 27 49 57 91 Example docID PathID start end index 21 <PLAY> <ACT> <SCENE id=“000”> <TITLE> Intro </TITLE> <SPEECH> <SPEAKER>CURIO</SPEAKER> <TEXT>This is …</TEXT> </SPEECH> </SCENE> </ACT> </PLAY> 1 0 0 … 1 1 1 1 6 … 1 1 48 1 2 11 … 1 1 90 1 4 27 48 1 1 … … … … … … Attribute Table docID PathID start end 1 3 0 … PathExp #/PLAY … 21 21 value “000” Text Table Path Table PathID reindex docID PathID start end value 1 4 34 40 “Intro” … … … … … 37 Storing XML as a Graph Every XML instance is a tree Hence we can store it as any graph, using an Edge table In addition we need a Value table to store the data values (#PCDATA) Edge relation summary: Same relational schema for every XML document: • Edge(Source, Tag, Dest) • Value(Source, Val) Generic: works for every XML instance But inefficient: • Repeat tags multiple times • Need many joins to reconstruct data 38 Storing XML as a Graph 0 db 2 book 4 3 title Edge 5 author 6 title 1 9 book 7 publisher 8 author author 11 10 title state “Complete “Morgan “Transaction “Chamberlin” “Bernstein” “Newcomer” Guide Kaufman” Processing” to DB2” Source Tag Dest 0 db 1 1 book 2 2 title 3 2 author 4 1 book 5 5 title 5 ... Value Source Val 3 Complete guide . . . 6 4 Chamberlin author 7 6 ... ... ... ... ... 39 “CA” Storing XML as a Graph What happens to queries: FOR $x IN /db/book[author/text()=“Chamberlin”] RETURN $x/title SELECT vtitle.value FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle, Value vauthor, Value vtitle WHERE xdb.source=0 and xdb.tag = ‘db’ and xdb.dest = xbook.source and xbook.tag = ‘book’ and xbook.dest = xauthor.source and xauthor.tag = ‘author’ and xbook.dest = xtitle.source and xtitle.tag = ‘title’ and xauthor.dest = vauthor.source and vauthor.value = ‘Chamberin and xtitle.dest = vtitle.source 40 XML Storage: Data Mining to Schema [Deutsch, Fernandez, Suciu 1999] Given: • One large XML data instance • No schema/DTD • Query workload Problem: find a “good” relational schema for it Notice: even when a DTD is present, it may be imprecise: • E.g. when a person may have 1-3 phones: phone* 41 XML Storage: Data Mining to Schema A paper D B C paper paper paper Paper1 fn 1 year author title author authortitle authortitleauthor title A X B X D X fn ln fn ln fn fn ln ln ln 1 fn 2 ln 2 title year X X X X - X - X X X X - Paper2 a u th o r X C [Deutsch, Fernandez, Suciu 1999] title X 42 Useful References Data on the Web: from Relations, to Semistructured Data and XML, Abiteboul, Buneman, Suciu • W3C homepage, www.w3.org • For foundations For current standards F. Tian et al., The Design and Performance Evaluation of Alternative XML Storage Strategies, SIGMOD Record, 2002 XParent: An Efficient RDBMS-Based XML Database System, In Proc. of ICDE, 2002 Tatarinov, Stroring and Querying Ordered XML Using a Relational Database System, In Proc. of SIGMOD, 2002 Yoshikawa et al., XRel: A Path-Based Approach to Storage and Retrieval of XML Documents Using Relational Databases, ACM Trans. on Internet Technology, Vol. 1, No. 1, pp 110-141, 2001 D. Florescu, D. Kossman, Storing and Querying XML Data using an RDBMS. IEEE Data Engineering Bulletin 22(3), 1999 J. Shanmugasundaram et al. Relational Databases for Querying XML Documents: Limitations and Opportunities, In Proc. of VLDB, 1999 C. Zhang et al., On Supporting Containment Queries in Relational Databases Management Systems, In Proc. of SIGMOD, 2001 43