Semistructural databases
Database lectures for mathematics students
Zbigniew Jurkiewicz, Institute of Informatics UW
May 29, 2016
Semistructural data
The dramatic growth of the WWW and the Internet has made it
easy to publish information on the net and make it publicly
available.
It is natural to try to use this information as a database.
However, data stored in HTML or XML files
has an irregular structure.
Such data sources came to be called
semistructural (more commonly: semistructured) data.
Disadvantage: chaotic query languages, reminiscent of
procedural query languages such as Cobol 30 years ago.
Semistructural data
Data model based on trees
Flexible representation of data: directed graph
Schema included in the data: “self-describing” data
Useful for information integration (e.g. virtual data
warehouses)
Good model for storing XML
Example (figure not preserved in this transcript)
Semistructural graph
Nodes = objects
Arc labels = object attributes
Atomic values in tree leaves
Flexibility: no restrictions on
labels of outgoing arcs
number of descendants
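A minimal sketch of such a graph in Python, assuming a hypothetical bibliography. The labels biblio, book, article, author and title are borrowed from the query examples in these slides; the entries themselves are invented for illustration.

```python
# Nodes are dicts, each key is an arc label, a list holds several arcs
# with the same label, and strings or numbers are the atomic leaf values.
# All entries are illustrative assumptions, not data from the lecture.
biblio = {
    "book": [
        {"author": ["Jeffrey Ullman"], "title": "Example Book", "year": 1980},
        {"author": ["A. Writer", "B. Writer"], "title": "Another Book"},
    ],
    "article": [
        {"author": ["Jeffrey Ullman"], "title": "Example Article", "journal": "J"},
    ],
}

def labels(node):
    """Return the set of labels on the arcs leaving a node."""
    return set(node) if isinstance(node, dict) else set()

# Flexibility: two objects reached by the same label may have different arcs.
book_labels = [labels(b) for b in biblio["book"]]
```

The two books carry different sets of labels and different numbers of descendants, and nothing in the representation forbids that.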
Queries
Query languages are based on the concept of path expressions
(e.g. Lorel).
Path expression = regular expression describing a path
from root.
Example paths
biblio.book|article.author
biblio._*.author
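How such path expressions could be evaluated can be sketched in Python. This is an illustration, not the Lorel implementation. It assumes that `_` stands for any single label, that `_*` stands for any sequence of labels, and that `a|b` offers alternatives for one step. The helper names and the sample data are hypothetical.

```python
import re

def walk(node, path):
    """Yield (label_path, node) for the node and everything below it."""
    yield path, node
    if isinstance(node, dict):
        for label, child in node.items():
            for c in (child if isinstance(child, list) else [child]):
                yield from walk(c, path + (label,))

def path_regex(expr):
    """Translate a path expression into a regex over 'label/label/...'."""
    parts = []
    for step in expr.split("."):
        if step == "_*":                      # any sequence of labels
            parts.append(r"(?:[^/]+/)*")
        elif step == "_":                     # exactly one arbitrary label
            parts.append(r"[^/]+/")
        else:                                 # one label, possibly a|b
            alternatives = "|".join(re.escape(a) for a in step.split("|"))
            parts.append("(?:" + alternatives + ")/")
    return re.compile("".join(parts))

def select(root_name, root, expr):
    """Return all nodes whose path from the root matches the expression."""
    rx = path_regex(expr)
    return [node for path, node in walk(root, (root_name,))
            if rx.fullmatch("".join(label + "/" for label in path))]

# Illustrative data only.
biblio = {
    "book": [{"author": "A", "title": "T1"}],
    "article": [{"author": "B"}],
    "thesis": [{"advisor": {"author": "C"}}],
}
authors_1 = select("biblio", biblio, "biblio.book|article.author")
authors_2 = select("biblio", biblio, "biblio._*.author")
```

The first expression reaches only authors of books and articles; the second also reaches the author nested deeper under thesis.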
Example queries
All book authors:
Query z1
select author: x
from biblio.book.author x;
All items having Jeffrey Ullman as author:
Query z2
select item: x
from biblio._ x
where "Jeffrey Ullman" in x.author;
Example queries
Authors of items which have “database” occurring in the title
Query z3
select author: y
from biblio._ x, x.author y, x.title z
where ".*(D|d)atabase.*" ~ z;
Authors and titles of all books.
Query z4
select item: title: y, author: z
from biblio.book x, x.title y, x.author z;
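For comparison, queries z2 and z3 can be mimicked with Python comprehensions over a nested structure. This is only a sketch of their meaning; the data are invented and the variable names are hypothetical.

```python
import re

# Illustrative data: arc labels are keys, lists hold repeated arcs.
biblio = {
    "book": [{"author": ["Jeffrey Ullman"], "title": "Database Basics"}],
    "article": [
        {"author": ["A. Writer"], "title": "On trees"},
        {"author": ["Jeffrey Ullman", "B. Writer"], "title": "A database survey"},
    ],
}

# "biblio._ x": every object one arc below the root, whatever its label.
items = [x for children in biblio.values() for x in children]

# z2: items that have "Jeffrey Ullman" among their authors.
z2 = [x for x in items if "Jeffrey Ullman" in x.get("author", [])]

# z3: authors of items whose title matches the regular expression.
pattern = re.compile(r".*(D|d)atabase.*")
z3 = [y
      for x in items if pattern.fullmatch(x.get("title", ""))
      for y in x.get("author", [])]
```

Note that z3 lists an author once per matching item, just as the query binds y once for every (x, y, z) combination.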
XML
The rapid expansion of the WWW, which was built on pages
written in HTML, led to a rediscovery of the advantages of
a parenthesized representation of structured data.
XML (eXtensible Markup Language) is a step
towards standardizing the representation of data
stored in text files and sent over the network.
Because XML is a simplified version of SGML, the main XML
object is traditionally called a document, even though it often
contains structured data rather than ordinary text.
Most CASE tools can import and
export files in XML; some of them even use it as their native
storage representation.
XML
eXtensible Markup Language
Documents marked with tags
Extensible user-defined semantic tags, e.g. <student>
HTML has only a fixed set of “presentational” tags (i.e. useful
for formatting and display), e.g. <blockquote>
WWW Consortium page http://www.w3.org/XML/
XML
In general, XML can be used to represent any data that has
structure.
An XML expression is a fully parenthesized form of data
representation.
Parentheses in XML carry labels (in other words, they
correspond to phrase markers from formal grammars), e.g.
instead of writing the list of numbers 3, 5 and 4 in the
simple form
(3 5 4)
in XML we write
<list>3 5 4</list>
The constructs <list> and </list> are called the start tag
and the end tag, but you can think of them simply as
opening and closing parentheses.
XML
In XML you can put tags around a nearly arbitrary sequence of
characters.
A pair of matching tags together with the text between them is
called an XML element.
The character sequence inside the tag is the element’s
name, and the text between the tags is the element’s content.
Moreover, the start tag may contain attributes, e.g. the
following element
<list title="grades" date="2004-10-22">
3 5 4
</list>
has two attributes: title and date. Their values are the
strings "grades" and "2004-10-22".
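The three parts of an element (name, attributes, content) can be inspected with an XML parser. A small sketch using Python's standard xml.etree.ElementTree module on the element above:

```python
import xml.etree.ElementTree as ET

element = ET.fromstring(
    '<list title="grades" date="2004-10-22">\n3 5 4\n</list>'
)

name = element.tag            # the element's name
attributes = element.attrib   # the attributes as a dictionary of strings
# The content is plain text; turning it into numbers is up to the program.
numbers = [int(token) for token in element.text.split()]
```

Attribute values and content always arrive as strings; any further typing (dates, integers) is the job of the application or of a schema.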
XML
The following examples show two ways of representing
a grades register:
As a parenthesized list. This could be an internal
representation used in a program written in Lisp, Scheme,
Dylan or Ruby.
As an XML expression.
Note that in the XML form the subject and
semester are given as attributes of the element <exam>,
which helps with information exchange.
Similarly, the index of a student is given as an attribute of the
element <grades>. Additionally, the grade for each
exercise is placed in a separate element.
List
("Databases" "Spring/2009"
("201" 78 88 69)
("202" 88 87 86)
("203" 99 88 88)
("204" 77 78 77)
("205" 90 89 81)
("206" 67 78 81))
Representing exam protocol as a list
XML
<exam subject="Databases" semester="Spring 2009">
<grades index="201">
<pts>78</pts> <pts>88</pts> <pts>69</pts>
</grades>
<grades index="202">
<pts>88</pts> <pts>87</pts> <pts>86</pts>
</grades>
<grades index="203">
<pts>99</pts> <pts>88</pts> <pts>88</pts>
</grades>
<grades index="204">
<pts>77</pts> <pts>78</pts> <pts>77</pts>
</grades>
<grades index="205">
<pts>90</pts> <pts>89</pts> <pts>81</pts>
</grades>
<grades index="206">
<pts>67</pts> <pts>78</pts> <pts>81</pts>
</grades>
</exam>
Representing exam protocol in XML
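The step from the list form to the XML form is mechanical. A sketch in Python with the standard xml.etree.ElementTree module, shortened to two students; the function name to_xml and the tuple layout are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# The list form: subject, semester, then one row per student.
register = ("Databases", "Spring 2009",
            [("201", 78, 88, 69), ("202", 88, 87, 86)])

def to_xml(register):
    """Build an <exam> element from the parenthesized-list form."""
    subject, semester, rows = register
    exam = ET.Element("exam", subject=subject, semester=semester)
    for index, *points in rows:
        grades = ET.SubElement(exam, "grades", index=index)
        for p in points:
            ET.SubElement(grades, "pts").text = str(p)
    return exam

exam = to_xml(register)
text = ET.tostring(exam, encoding="unicode")
```

Positions in the list become named elements and attributes, so a reader of the XML no longer has to know that the first item is the subject and the second the semester.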
XML
XML clearly generalizes parenthesized list
notation: the parentheses are named (labeled), and each
parenthesized element may have additional attributes.
The schema of an XML document (i.e. its hierarchical
structure and the tags used for elements and attributes) should
be defined in advance.
This is done in a separate document.
Initially DTD (Document Type Definition) was used; the more
modern solution is XML Schema.
A document schema can be seen as a
definition of the vocabulary used (the so-called “semantic
web” is a more global approach).
Element
Element = any document fragment enclosed in a
matching pair of tags, for example
<actor> ...
</actor>
Empty elements, e.g. <br /> in HTML, are an exception to
this rule. They have no content and do not occur in
pairs.
Note: some HTML browsers require a space before the slash.
Auxiliary tags
Comments <!-- ... -->
Processing instructions
<?name ... ?>
for example, an XML document should start with the instruction
<?xml version="1.0" ?>
Document
XML Document = single element
May be preceded by an optional prolog containing
XML declaration
<?xml version="1.0" ?>
document type definition (DTD), usually given by a reference to a
separate file
<!DOCTYPE main-element-name SYSTEM "file.dtd">
XML vs HTML
Tag names are case-sensitive: lower-case and upper-case letters are distinct.
Attribute values must always be in quotes.
No implicit closing of tags (e.g. </p> cannot be omitted).
XML correctness levels
Well-formed: syntactically correct, with paired tags: each
opening tag (e.g. <student>) must have a corresponding
closing tag (e.g. </student>); no DTD is needed.
<?xml version="1.0" standalone="yes" ?>
<body>
...
</body>
Valid: described by some DTD (Document Type
Definition) and consistent with it ;-)
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE Student SYSTEM "student.dtd">
<Student>
...
</Student>
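Well-formedness can be checked by any XML parser. A minimal sketch with Python's standard xml.etree.ElementTree module; the function name is_well_formed is hypothetical. This parser does not validate against a DTD, so checking validity would need a validating parser instead.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok = is_well_formed("<student><name>x</name></student>")
unclosed = is_well_formed("<student><name>x</student>")    # <name> never closed
wrong_case = is_well_formed("<Student>x</student>")        # tags differ in case
```

The last call also illustrates that tag names are case-sensitive: <Student> is not closed by </student>.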
DTD — tags
<!DOCTYPE main-tag [ element ...
]>
<!ELEMENT tag (component,...)>
Example
<!DOCTYPE Students [
<!ELEMENT Students (Student*)>
<!ELEMENT Student (firstname,lastname,address,year)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
...
]>
DTD – tags
?           optional element
*           closure (occurs 0 or more times)
|           alternative
#PCDATA     any text without tags
CDATA       any text
#REQUIRED   required attribute
DTD — example use
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE Students SYSTEM "student.dtd">
<Students>
<Student>
<firstname>Onufry</firstname>
<lastname>Zagłoba</lastname>
<address>Dzikie Pola</address>
<year>1648</year>
</Student>
<Student>
...
</Student>
...
</Students>
Example DTD
<!DOCTYPE exchange [
<!ELEMENT exchange (title?, rate*)>
<!ELEMENT rate (#PCDATA)>
<!ATTLIST rate
currency CDATA #REQUIRED
type (sale|purchase|average) "average">
...
]>
DTD usage
<?xml version="1.0" ?>
<exchange>
<title>Exchange rates</title>
<rate currency="USD">4,235</rate>
...
</exchange>
DTD — attributes
Placed in the opening tag.
Form: attribute="value"
Also used as links for connecting elements
Declared by
<!ATTLIST element
attribute type default
...>
DTD – example with attributes
<!DOCTYPE Students [
<!ELEMENT Students (Student*)>
<!ELEMENT Student (firstname,lastname,address,year)>
<!ATTLIST Student
studentID ID #IMPLIED
attends IDREFS #IMPLIED>
<!ELEMENT lastname (#PCDATA)>
...
]>
Another example
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE Students SYSTEM "student.dtd">
<Students>
<Student studentID="OZ" attends="ms gpp">
<firstname>Onufry</firstname>
<lastname>Zagłoba</lastname>
<address>Dzikie Pola</address>
<year>1648</year>
</Student>
<Student>
...
</Student>
...
</Students>
XML Schema
Nowadays the newer XML Schema notation is often used
instead of DTD, as it is more expressive.
Example description of student element
<xsd:element name="student">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="first-name" type="xsd:string"/>
<xsd:element name="last-name" type="xsd:string"/>
<xsd:element name="address" type="xsd:string"/>
<xsd:element name="year" type="xsd:int"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
Linking attributes
Type ID marks an identifying attribute, for use in other
elements.
Type IDREF is a reference to the value of an ID attribute in
another element.
No “type discipline”: a reference may point to an element of any type!
This is the simplest mechanism; there are more
advanced ones: XLink and XPointer.
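Following such references can be sketched in Python with the standard xml.etree.ElementTree module. This parser does not read the DTD, so the names of the ID attributes are listed by hand. The <School> and <Course> elements and the courseID attribute are hypothetical, added only so that the references have targets.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<School>
  <Course courseID="ms"/>
  <Course courseID="gpp"/>
  <Student studentID="OZ" attends="ms gpp"/>
</School>
""")

# Assumption: these attributes are the ones declared with type ID.
ID_ATTRIBUTES = ("studentID", "courseID")

# One index for all IDs: ID values are unique within the whole document.
by_id = {e.get(a): e for e in doc.iter() for a in ID_ATTRIBUTES if e.get(a)}

student = by_id["OZ"]
# An IDREFS value is a whitespace-separated list of ID values.
attended = [by_id[ref] for ref in student.get("attends").split()]
```

Because all IDs share one index, attends="OZ" would be accepted just as well, pointing a student at a student: the DTD cannot say that attends must refer to courses.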
Connections between documents
The XML sublanguage XLL (eXtensible Link Language), also known
as XLink.
More expressive than HTML references.
One can create a reference to many documents (with
selection by the user) and to groups of documents (similar to
a frame with a list).
Displaying
A document written in XML can be transformed into another
form or placed on a WWW server.
To display it in a browser, however, we must describe
how it should be displayed.
For simple applications it is enough to use Cascading Style
Sheets (CSS).
The CSS sheet to be used is declared in the document header
with
<link rel="stylesheet" type="text/css" href="name.css">
The style of an element may also be specified directly with
the style attribute
<li style="color: red">
Displaying
SGML uses DSSSL for transformations.
In HTML, cascading style sheets (CSS) change the
standard way tags are displayed.
In XML we may also specify transformations for tags
with XSL (eXtensible Stylesheet Language), e.g.
Display tag <exchange> as an HTML table (<table>).
Display tag <rate> as a table row (<tr>).
Using an appropriate XML parser we can convert a document
into any other form (e.g. TeX).
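Such a conversion can also be written by hand with an ordinary parser. The following Python sketch uses the standard xml.etree.ElementTree module, not XSL itself, and turns <exchange> into a table with one row per <rate>. The EUR rate is invented for illustration.

```python
import xml.etree.ElementTree as ET

source = ET.fromstring("""
<exchange>
  <title>Exchange rates</title>
  <rate currency="USD">4,235</rate>
  <rate currency="EUR">4,400</rate>
</exchange>
""")

# <exchange> becomes <table>, every <rate> becomes one <tr>.
table = ET.Element("table")
for rate in source.findall("rate"):
    row = ET.SubElement(table, "tr")
    ET.SubElement(row, "td").text = rate.get("currency")
    ET.SubElement(row, "td").text = rate.text

html = ET.tostring(table, encoding="unicode")
```

An XSL stylesheet expresses the same two rules declaratively, as templates matched against the tags, instead of as an explicit loop.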
XML for information exchange
An important and typical application of XML is
exchanging information between different CASE tools.
The structure of an application modeled in UML in a tool
such as Rational Rose can be exported as an XML document and
read into another tool, e.g. a generator for special applications,
or simply put on a WWW server.
General mapping rules:
objects ↔ XML documents
classes ↔ XML schemas
XMI
OMG (Object Management Group) www.omg.org
proposed a standard format for information exchange called
XMI (XML Metadata Interchange)
http://cgi.omg.org/cgi-bin/doc?ad/01-06-12.
Much interesting information can also be found at
http://XMLmodeling.com.
Umbrello uses XMI.
Programming
XPath is a language (or, more precisely, a notation for
patterns) for selecting a set of nodes within a hierarchical
document.
XQuery is an XPath-based query language for querying
XML documents. A query is a search statement to retrieve
specific portions of a document that conform to a specified
search criterion.
XPath
A notation for describing a set of elements (nodes).
In the simplest case an expression is a sequence of element
names, for example
/Students/Student/name
selects all <name> elements of students.
To fetch the name of the first student we can use an index
/Students/Student[1]/name
We can use functions, e.g.
count(/Students/Student)
returns the number of students.
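These expressions can be tried out in Python. The standard xml.etree.ElementTree module supports only a subset of XPath: paths are relative to the element they are applied to, and functions such as count() are not available, so the count is taken in Python. The second student is invented for illustration.

```python
import xml.etree.ElementTree as ET

students = ET.fromstring("""
<Students>
  <Student><name>Onufry</name></Student>
  <Student><name>Jan</name></Student>
</Students>
""")

# The root <Students> is the starting point, so /Students/Student/name
# is written here as the relative path "Student/name".
names = [e.text for e in students.findall("Student/name")]

# Positions are counted from 1, as in XPath.
first = students.find("Student[1]/name").text

# Stand-in for count(/Students/Student).
count = len(students.findall("Student"))
```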
Namespaces
The term namespace corresponds to modules or
packages in ordinary programming languages.
Namespaces are usually named with URL addresses, e.g.
http://www.w3.org/2001/XMLSchema
Remark: the use of a URL address does not imply that there
has to be a document at this address. The XML
parser does not try to look at it. Usually, however, a
document at that address contains some description of the namespace.
As URLs are long, aliases (prefixes) are usually declared, e.g.
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
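That the prefix is only an alias for the URL can be seen in how a parser reports names. A sketch with Python's standard xml.etree.ElementTree module; the one-element schema is a made-up fragment.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

schema = ET.fromstring(
    '<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">'
    '<xsd:element name="year" type="xsd:int"/>'
    '</xsd:schema>'
)

# The parser replaces the prefix by the full namespace name in braces.
tag = schema.tag

# When searching we may choose our own alias ("x" here) for the same URL.
elements = schema.findall("x:element", {"x": XSD})
```

The parser never fetches the URL; it only compares it as a string.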
SQL/XML
A new part of SQL:2003
A new built-in type, XML.
4 built-in operators:
XMLPARSE: returns a value of XML type given an SQL
character string expression
XMLSERIALIZE: returns a value of character string type
given an XML expression
XMLROOT: modifies the root information item of an XML
value and returns the modified value.
XMLCONCAT: concatenates two or more XML values and
returns the resulting value.
SQL/XML
A predicate, IS DOCUMENT, to test whether an XML value
has a single root element.
5 “publishing functions” that generate values of XML type
from SQL expressions:
XMLELEMENT
XMLFOREST
XMLATTRIBUTES
XMLNAMESPACES
XMLAGG
Host language bindings for values of XML type.
XML type
A new SQL built-in type
Can be used wherever an SQL data type is allowed: as the
type of a column of a table, a parameter of a routine, an attribute
of a UDT, or an SQL variable.
Strongly-typed — values of XML type are distinct from their
textual representation.
Semantics of operations on values of XML type is specified
by assuming a tree-based internal representation based on
the XML Information Set Recommendation (Infoset).
The Infoset model is modified in one significant way: the
document information item of Infoset is replaced by a new
kind of information item, XML root information item.
XML in Postgres
PostgreSQL offers XML-related functionality based on
the SQL/XML standard.
In version 8.3 it covered XML syntax checking and XPath
queries.
XML columns are declared using the xml data type, e.g.
CREATE TABLE test (a xml, b xml);
Specialized functions have been added for working with XML
values, e.g.
SELECT xmlelement(name test, xmlattributes(a, b))
FROM test;
PostgreSQL 9.2 added a native data type for
JSON values. JSON is another
semistructural representation, which originated in the JavaScript
programming language.
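For comparison, a student record written in JSON and read with Python's standard json module. The field names mirror the student examples above; the record layout itself is an assumption made for illustration.

```python
import json

text = """
{"firstname": "Onufry", "lastname": "Zagłoba",
 "year": 1648, "attends": ["ms", "gpp"]}
"""

student = json.loads(text)

# As in a semistructural graph: keys act as arc labels,
# values are atomic leaves or nested subtrees.
courses = student["attends"]
year = student["year"]
```

Unlike XML content, JSON distinguishes numbers from strings, so year arrives as an integer without any schema.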