Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CS331: Advanced Database Systems: Semistructured Data Management Norman Paton University of Manchester [email protected] Semistructured Data Two views of data: Databases: structured, modelled, queried, programmed. Documents: partially structured, authored, read, navigated. Semistructured data management is at the confluence of these two views. XML is the principal data representation notation for semistructured data. XML can be seen as: An extensible markup language for documents. A data model for hierarchical data. A notation for communicating data with its structure. See also: COMP30352 – IR, Hypermedia and the Web XML Language Space XML (Extensible Markup Language) is just that: a markup language with an extensible collection of tags. XML is associated with many related standards within the W3C (World Wide Web Consortium): http://www.w3.org/. XML Related Standards: XPath: navigation. XQuery: queries. XSLT: transformations. XML Schema: document description. DOM: modelling documents as objects. ... ... and underpins: Web Services. The Semantic Web. Markup Markup is the inclusion of symbols with special meaning in a text document. Languages with markup: LaTeX. HTML. RTF. XML. <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <title>The Silver Pigs</title> </head> <body bgcolor="#ffffff" text="#000000“> <h1>Rome</h1> \documentclass{llncs} \begin{document} \title{An Experimental Performance Evaluation of Join Algorithms for Parallel Object Databases} \author{Sandra Sampaio\inst{1} \and Jim Smith\inst{2}\and Norman W. Paton\inst{1}\and Paul Watson\inst{2}} ... XML Markup In XML, content is structured using tags. Tags are distinguished by the characters ‘<‘ and ‘>’. Tags often come in pairs, round some content, as start and end tags. Start tag <html> <head> <title>The Silver Pigs</title> </head> <body bgcolor="#ffffff" text="#000000“> <h1>Rome</h1> ... </body> </html> End tag Elements An element is a meaningful unit of content enclosed by tags. An application may be able to interpret an element. Elements may be ordered or nested. Context matters, especially given nesting. <station> <name>Oxford Road</name> <city>Manchester</city> </station> Element Hierarchies <country> An XML document <name>United Kingdom</name> essentially <people> <pop>60094648 </pop> represents <maleLE>75.74</maleLE> hierarchical data. <femaleLE>80.7</femaleLE> </people> The elements in a </country> well formed XML country document will match name people and respect the hierarchy. pop maleLE femaleLE Attributes Attributes provide auxiliary information about elements. Attributes are embedded within start tags, and have the form “name = value”. <station updatedBy = “Fred Bloggs” validUntil = “22/06/2005”> <name>Oxford Road</name> <city>Manchester</city> </station> <country source=“http://www.cia.gov”> <name>United Kingdom</name> <people> <pop>60094648 </pop> <maleLE>75.74</maleLE> <femaleLE>80.7</femaleLE> </people> </station> Models An XML file may be able to contain any old tags, in any order or combination (while remaining well-formed). Restrictions on the legal tags and values a document can contain may be specified using a DTD or an XML Schema. A DTD (Document Type Definition) provides a concise syntax for modelling documents (but is on the way out). An XML Schema definition is itself an XML document, which provides a wide range of modelling constructs for constraining other XML documents. Trains in XML Hierarchical models can capture most cycle-free data fairly naturally. Hierarchical models, however, promote some concepts and demote others. The relational model, by contrast, treats all concepts as (broadly) equal. <train> <tno>3107101</tno> <source>Edinburgh</source> <destination>London</destination> <visit> <name>Edinburgh</name> <time>06:00</time> </visit> <visit> <name>York</name> <time>08:00</time> </visit> <visit> <name>London</name> <time>10:00</time> </visit> </train> Tools XML is widely used, and many software systems can read/write XML formats. Generic tools have also been developed for designing/editing XML. XML Spy showing a valid XML file as text. XML Spy XML Spy supports: Editing. Data modelling. Validation. Transformation. … XML Spy showing an XML schema document as a tree. Oxygen Oxygen supports: Editing. Data modelling. Validation. Transformation. Querying. … Oxygen showing an XML schema document as a Tree and text. XML Databases Native XML Databases: Store XML in the database directly (“native”). Make XML Schema the optional schema definition language. Query the database using XML query languages (XPath/XQuery). Program database data as XML data structures (XML:DB, DOM, ...). An XPath query and result in eXist XML Databases Native XML databases: Tamino - Software AG: eXist - Open Source: http://www.softwareag.com/tamino/. http://exist.sourceforge.net/. Standard APIs: XML:DB Initiative: http://www.xmldb.org/; Both Tamino and eXist provide XML:DB APIs. XQJ: XQuery API for Java; Java community standard. XML and Relational Databases Storage options: Decomposed: store an XML document in relational tables, and reconstruct on retrieval. Composed: store an XML document as an attribute of a relational table. Retrieval options: Represent relational tables as XML (e.g. Java WebRowSet). Relational vendors: Tend to support both composed and decomposed storage models. Provide APIs that accommodate XML for data transport or display (e.g., in Web Services or for Web interface generation). Summary XML is becoming increasingly ubiquitous for data representation for: files, transport, storage, metadata. Data management systems must support storage, querying and communication using XML. Soon everything will be stored using XML? Don’t believe it! Further Reading S. Abiteboul, P. Buneman, D. Suchi, Data on the Web, Morgan-Kaufmann, 1999. N. Bradley, The XML Companion (3rd Edition), Addison-Wesley, 2002. Data Modelling in XML XML Schema XML Schema is a W3C standard for modelling using XML. An XML Schema definition is itself an XML document – there is an XML Schema for XML Schema! XML Schema files have a .xsd suffix; XML data files have a .xml suffix. An XML Schema can specify: Which elements are mandatory/optional. Which attributes are mandatory/optional. Element/attribute types. Cardinalities. Relative ordering. Role of XML Schema Unlike in relational/object databases: An XML database need not have a schema. An XML schema may not be very prescriptive in terms of what can or cannot be stored. Train Model sequence recurring sequence Train ComplexType <xs:complexType name="TrainType"> <xs:sequence> <xs:element name="tno" type="xs:string"/> <xs:element name="source" type="xs:string"/> <xs:element name="destination" type="xs:string"/> <xs:sequence maxOccurs="unbounded"> <xs:element name="visit"> <xs:complexType> <xs:sequence> <xs:element name="name" type="xs:string"/> <xs:element name="time" type="xs:time"/> </xs:sequence> </xs:complexType> </xs:element> </xs:sequence> </xs:sequence> </xs:complexType Elements Elements are defined thus: <element name = “the-name”>. Attributes associated with elements: type: specifies the kind of content that an element with no attributes or sub-elements can have. Default imposes no constraints. minOccurs, maxOccurs: the number of times an element can occur. A value of unbounded allows open-ended cardinality. Default is once and only once. Built-in Types There are many built-in types: string. integer, positiveInteger, negativeInteger. short, long. date, dateTime, time. id, idref. anyURI Complex Elements Any element with sub-elements or attributes is declared to have a complex type. <xs:element name="visit"> <xs:complexType> <xs:sequence> <xs:element name="name" type=“..."/> <xs:element name="time" type=“..."/> </xs:sequence> </xs:complexType> </xs:element> Sequence: the sub- elements must appear in the given order. Choice: a selection is made from the subelements. Both sequence and choice can have minOccurs/maxOccurs. Attributes Attributes can be defined within complex types. Attributes are optional unless use=“required”. <xs:complexType name="TrainType"> <xs:sequence> <xs:element name="tno" type="xs:string"/> ... </xs:sequence> <xs:attribute name="engine" type="xs:string"/> </xs:complexType> The resulting data file can populate the attribute. <train ... xsi:type="TrainType“ engine="125"> <tno>3107101</tno> <source>Edinburgh</source> <destination>London</destination> <visit> ... </visit> </train> Building on Existing Types New types can be constructed from existing types by: Extension: for complex types, this means that new types can be defined that add attributes or elements to the type on which they are based. Restriction: for complex types, this means that new types can be defined with fewer attributes or elements, reduced cardinalities, etc. Type Extensions Example <xs:complexType name="StationType"> <xs:sequence> <xs:element name="name" type="xs:string"/> <xs:element name="type" type="xs:string"/> </xs:sequence> </xs:complexType> <xs:complexType name="DistrictStationType"> <xs:complexContent> <xs:extension base="StationType"> <xs:sequence> <xs:element name="main" type="xs:string"/> </xs:sequence> </xs:extension> </xs:complexContent> </xs:complexType> Type Extensions in Use <xs:element name="stations"> <xs:complexType> <xs:choice maxOccurs="unbounded"> <xs:element name="station" type="StationType"/> <xs:element name="districtStation" type="DistrictStationType"/> </xs:choice> </xs:complexType> models <stations> <station> <name>London</name> <type>main</type> </station> <districtStation> <name>York</name> <type>district</type> <main>London</main> </districtStation> </stations> Cross References Hierarchal models do not naturally support shared components. In documents, crossreferences and hyperlinks are very common. XML has several crossreferencing schemes (e.g., ID/IDREF, XPointer). Within an XML document: A value of type ID must be unique. A value of type IDREF must match some ID within the document. ID and IDREF for Trains ID used to identify station IDREF used to reference station XML Schema for ID/IDREF <xs:element name="station"> <xs:complexType> <xs:sequence> <xs:element name="city" type="xs:string"/> <xs:element name="name" type="xs:ID"/> </xs:sequence> </xs:complexType> </xs:element> ... <xs:element name="visit"> <xs:complexType> <xs:sequence> <xs:element name="name" type="xs:IDREF"/> <xs:element name="time" type="xs:time"/> </xs:sequence> </xs:complexType> </xs:element> Example: What is the schema? <shiporder orderid="889923“> <orderperson>John Smith</orderperson> <shipto> <name>Ola Nordmann</name> <address>Langgt 23</address> <city>4000 Stavanger</city> <country>Norway</country> </shipto> <item> <title>Empire Burlesque</title> <quantity>1</quantity> <price>10.90</price> </item> <item> …</item> </shiporder> Summary XML data can be parsed, transmitted and queried in the absence of any formal description of its structure. Many applications need to be able to make assumptions about the structure of documents they process. XML Schema provides a wide range of modelling facilities for defining XML documents. Further Reading The W3C Consortium Tutorial is short but informative: D. Fallside, XML Schema Part 0: Primer, 2001: http://www.w3schools.com/schema/ http://www.w3.org/TR/xmlschema-0/ N. Bradley, The XML Companion (3rd Edition), Addison-Wesley, 2002. Querying XML Documents using XPath and XQuery XPath and XQuery XPath is a W3C standard for addressing parts of an XML document. XPath is widely used in XML languages and tools, including XQuery and XSLT. XPath is not especially expressive. XQuery is a W3C standard for accessing and restructuring XML documents. XQuery is supported by several XML databases, but is less widely deployed than XPath. XPath is used within XQuery. Trains Model The following diagram shows the XML schema of the data queried in the following example queries. XPath XPath uses path expressions to describe routes through documents. These expressions address locations in a hierarchy in a way that is familiar from file systems (/books/chapters/chapter1). XPath also includes a function library (shared with XQuery) for manipulating numerical, string, date, node and sequence values. Standardisation: XPath 1.0 has been a W3C standard since 1999; XPath 2.0 became a W3C standard in January 2007. XPath Terminology Nodes include elements, documents (root elements) and attributes; nodes can be addressed using XPath. XPath includes constructs for exploring relationships between nodes, such as parent, child, ancestor and descendent. XPath Syntax Expression Description nodename Selects child nodes of the current node. / Selects the root node. // Selects any node in the document. . Selects the current node. .. Selects the parent of the current node. Selects attributes. @ Simple Paths - 1 What are the numbers of the trains? /trains/train/tno //tno <tno>22403101</tno> <tno>22407101</tno> <tno>22403102</tno> <tno>22446301</tno> … Simple Paths - 2 Where do trains start? /trains/train/source/ <source>aberdeen</source> <source>aberdeen</source> <source>aberdeen</source> <source>aberdeen</source> <source>aberdeen</source> … Duplicates Where do trains start (no duplicates)? (or tags) distinct-values(/trains/train/source) aberdeen banchory edinburgh aberdour … Predicates What are the destinations of trains that start in Aberdeen? /trains/train[source="aberdeen"]/destination <destination>edinburgh</destination> <destination>edinburgh</destination> <destination>edinburgh</destination> <destination>inverness</destination> <destination>inverness</destination>… Extracting Node Values What are the destinations of trains that start in Aberdeen (no tags)? /trains/train[source="aberdeen"] /destination/text() edinburgh edinburgh edinburgh inverness … Aggregates How many trains start from aberdeen? count(/trains/train[source="aberdeen"]) 9 Parents What are the sources of trains that visit York? /trains/train[visit/name="york"]/source /trains/train/visit[name="york"]/../source <source>edinburgh</source> <source>edinburgh</source> <source>edinburgh</source> <source>london</source>… Positions in Sequences Where does the last train (in the document) from aberdeen visit? /trains/train[source="aberdeen"][last()]/visit <visit> <name>banchory</name> <time>18:20:00</time> </visit> … XQuery XQuery is intended to play the same sort of role for XML data as SQL plays for relational data. XQuery is a functional language, and thus is declarative and compositional. XQuery can be used to restructure XML data, as well as to ask questions about it. XQuery became a W3C standard in January 2007. XQuery Syntax XQuery is based on FLOWR (pronounced “flower”) expressions: for clauses bind variables to values in streams of tuples; let clauses bind variables to the complete result of an expression; where clauses filter values in tuple streams; order by clauses sort tuple streams; and return clauses construct results. Iteration and Filtering What are the destinations of trains that start in Aberdeen? for $i in /trains/train where $i/source = "aberdeen" return $i/destination <destination>edinburgh</destination> <destination>edinburgh</destination> <destination>edinburgh</destination>… Identifying Documents What are the destinations of trains that start in Aberdeen? for $i in doc("/db/trains/Trains.xml")/trains/train where $i/source = "aberdeen“ return $i/destination <visit> <name>banchory</name> <time>18:20:00</time> </visit> … Constructing Results How many visits are made by each train? for $t in /trains/train let $v := $t/visit return <numvisits> {$t/tno} <count> {count($v)} </count> </numvisits> <numvisits> <tno>22403101</tno> <count>2</count> </numvisits> … Ordering Results How many visits are made by each train, ordered by the number of visits? for $t in /trains/train let $c := count($t/visit) order by $c descending return <numvisits> {$t/tno} <count> {$c} </count> </numvisits> <numvisits> <tno>46303101</tno> <count>7</count> </numvisits> … Quantifiers Which trains visit York? for $t in /trains/train where some $v in $t//visit satisfies $v/name = "york" return $t/tn <tno>3107101</tno> <tno>46307101</tno> <tno>22407101</tno> … Implicit Quantification Which trains visit York? for $t in /trains/train where $t//visit/name = "york" return $t/tn <tno>3107101</tno> <tno>46307101</tno> <tno>22407101</tno>… Reorganisation Reorganise the document to nest the trains inside the stations. for $s in distinct-values(//visit/name) return <station> <name> {$s} </name> {for $t in /trains/train, $v in $t//visit where $v/name = $s return <visit> {($t/tno, $v/time)} </visit>} </station <station> <name>edinburgh</name> <visit> <tno>22403101</tno> <time>09:50:00</time> </visit> … </station> … Examples Write XQueries that: Return the number of visits paid to the most frequently visited station. Return the names of the stations that received the most visits. Further Reading The W3C Consortium Tutorials are short but informative; you should read these: http://www.w3schools.com/xpath/ http://www.w3schools.com/xquery/ There are many books on querying XML documents, e.g.: Howard Katz (ed), XQuery from the Experts, Addison Wesley, 2004. XML Databases Native XML Databases Native XML databases support collections of XML documents. Native XML databases provide light weight support for management of large document collections. Such databases typically support: XPath and XQuery for querying. XUpdate for manipulating documents. The XML:DB API for programmatic access. Accessing eXist The top level of an eXist database is a collection; collections in eXist are not typed. Like a file system, a collection can contain other collections or documents. The eXist client supports management of collections, uploading of documents, etc. XML:DB XML:DB provides a Java API similar in concept to JDBC. Class cl = Class.forName(“org.exist.xmldb.DatabaseImpl”); Database database = (Database)cl.newInstance(); DatabaseManager.registerDatabase(database); Collection col = DatabaseManager.getCollection( "xmldb:exist://localhost:8080/exist/xmlrpc/db" ); XPathQueryService service = (XPathQueryService) col.getService("XPathQueryService", "1.0"); ResourceSet result = service.query(args[0]); ResourceIterator i = result.getIterator(); while(i.hasMoreResources()) { Resource r = i.nextResource(); System.out.println((String)r.getContent()); } XML In Oracle 10g Oracle was not designed to store XML data. Extensions to support XML occur: In the server, so that Oracle can store XML data using tables or built-in types. Around the server, so that XML data can be transferred to and from Oracle applications. XML Type Server datatype: Stored in the database. Configurable storage options (CLOBS or by mapping to object relational structures). Accessible through SQL, PL/SQL, Java. Operations to: Create XML Type values. Map to/from XML Type values. Test properties of XML Type values. Query the content of XML Type values using XPath. Storing XMLType Values XMLType values can be stored as column attributes. Stored either through a mapping onto tables or as Character Large Objects (CLOBs). create table franchise ( company varchar(40) not null, franchise varchar(40) not null, trains SYS.XMLTYPE not null, primary key (company, franchise)) Operations on XMLTypes SYS_XMLGEN createXML String Relational Construct XMLType getStringVal() extract Number getNumberVal() Populating XMLTypes Standard table and column modification operations apply to complete values of XMLType: Insert. Delete. Update. The updateXML operation is provided for making changes within XMLType values. insert into franchise values ( 'GNER', 'East Coast Main Line', sys.XMLType.createXML( '<trains> <train> <tno>3107101</tno> <source>Edinburgh</source> <destination> London </destination> <visit> …</visit> </train> </trains>') ); XMLType Methods existsNode() returns 0 if the XPath expression does not match the document, else 1: existsNode(xpath IN varchar2) RETURN number. extract() returns the fragment that matches an XPath expression as an XMLType: extract(xpath IN varchar2) RETURN sys.XMLType. Selecting XML From SQL The function sys.XMLType.getStringVal converts its XMLType parameter into a string. Retrieve details of GNER trains: select f. franchise, sys.XMLType.getStringVal(f.trains) from franchise f where f.company='GNER' East Coast Main Line <trains> ... </train> Extracting Elements The extract() method on XMLType can be used to access individual elements. Retrieve the src and dst of GNER trains: select f.franchise, f.trains.extract('//source').getStringVal() as src, f.trains.extract('//destination').getStringVal() as dst from franchise f where f.company='GNER' East Coast Main Line <source>Edinburgh</source><source>York</source> <destination>London</destination><destination>London</destination> Extracting Element Values The text() function can extract element contents. select f.franchise, f.trains.extract('//source/text()').getStringVal() as src, f.trains.extract('//destination/text()').getStringVal() as dst from franchise f where f.company='GNER' East Coast Main Line EdinburghYork LondonLondon Extracting for Comparison Comparisons can be carried out either in XPath or in SQL. select f.company from franchise f where f.trains.existsnode('//visit[name="York"]') = 1 select f.company from franchise f where f.trains.extract('//visit/name/text()').getStringVal() like '%York%' GNER Testing Before Retrieval To avoid including empty values in results, the existsNode() method can test for documents that match an Xpath expression. select f.franchise, f.trains.extract('//train/tno').getStringVal() as tno from franchise f where f.trains.existsnode('//train/tno') = 1 East Coast Main Line <tno>3107101</tno> <tno>4630710</tno> XML In Relational Databases XML is increasingly ubiquitous: Data for Web display. Data in domainspecific standards. Data for communication between web services. Relational vendors want: Web pages to be generated from databases. Their storage managers to support all forms of data management. Database applications to be exposed as Web Services. This means full support for XML input, output and storage. Further Reading Oracle 10g, XML DB Developers Guide [Chapter 3: Using Oracle XML DB]. Full documentation for eXist is available along with the software from: http://exist.sourceforge.net/ Extra Slides