Download Native XML Databases - DAMA-MN

Native XML Databases
Ronald Bourret
[email protected]
http://www.rpbourret.com
Copyright 2001, Ronald Bourret, http://www.rpbourret.com
Overview
• What is a native XML database?
• Native XML database architectures
• When should I use a native XML database?
• Normalization, referential integrity, scalability, and performance
• Native XML database features
What is a Native XML Database?
Blame Software AG
• Software AG coined the term “native XML database” ...
• ... and used it to market Tamino ...
• ... without ever defining it
• For a long time
» Everybody knew Tamino was a “native XML database”
» Nobody knew what Tamino did or how it worked
What is a native XML database?
• A database that stores XML documents as XML
• Defines a (logical) model for an XML document
• Fundamental unit of (logical) storage is a document
• Can have any physical storage
Example: Storing a sales order
Store data
Orders
  ...    ...        ...
  1234   29.10.00   Gallagher Industries
  ...    ...        ...
Items
  ...    ...   ...    ...   ...
  1234   1     A-10   12    10.95
  1234   2     B-43   600   3.99
  ...    ...   ...    ...   ...
Customers
  ...
  Gallagher Industries   ...
  ...
Parts
  ...
  B-43   ...
  A-10   ...
  ...
Store documents as text
<Order>
  <Number>1234</Number>
  <Customer>Gallagher Industries</Customer>
  <Date>29.10.00</Date>
  <Item Number="1">
    <Part>A-10</Part>
    <Quantity>12</Quantity>
    <Price>10.95</Price>
  </Item>
  <Item Number="2">
    <Part>B-43</Part>
    <Quantity>600</Quantity>
    <Price>3.99</Price>
  </Item>
</Order>
Store documents as DOM objects
Element
  Element
    Text
  Element
    Text
  Element
    Text
  Element
    Attr
    Element ...
Logical model of XML document
• Must include elements, attributes, PCDATA, and
document order
• Examples are XPath data model, XML Infoset, DOM,
and model implied by SAX 1.0
• Documents stored and retrieved according to the model
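As a rough illustration of what such a model has to capture, the sketch below walks a small document and lists its elements, attributes, and PCDATA in document order. It uses Python's standard `xml.etree.ElementTree` purely as a convenient stand-in; it is not one of the models named above, and the helper name `walk` is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<Order>
  <Number>1234</Number>
  <Item Number="1"><Part>A-10</Part></Item>
</Order>"""

def walk(elem, depth=0):
    """Yield (depth, kind, name, value) tuples in document order.

    Text that follows a child element (mixed content) is ignored here
    to keep the sketch short.
    """
    yield (depth, "element", elem.tag, None)
    for name, value in elem.attrib.items():
        yield (depth + 1, "attr", name, value)
    text = (elem.text or "").strip()
    if text:
        yield (depth + 1, "text", None, text)
    for child in elem:
        yield from walk(child, depth + 1)

nodes = list(walk(ET.fromstring(DOC)))
for node in nodes:
    print(node)
```

A database that stores and retrieves documents "according to the model" would persist roughly this information, whatever its physical layout.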
Fundamental unit of storage
• Fundamental unit of (logical) storage is a document
• Equivalent structure in a relational database is a row
• Document usually contains single set of data
• In future, unit of storage could be a fragment
Physical storage
• Can have any physical storage
• For example, can be built on a relational, hierarchical, or object-oriented database...
• ... or use a proprietary storage format such as indexed, compressed files
Native XML Database Architectures
Text-based storage
• Stores documents as text
• Can use file system, BLOB, proprietary storage, etc.
» XML-aware text engine in RDBMS is a native XML database
• Uses indexes heavily
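A minimal sketch of the text-based approach, assuming an in-memory dictionary in place of a file system or BLOB column: each document is kept verbatim as text, and an index built at store time maps (element name, value) pairs to document IDs. The class name `TextStore` and its methods are hypothetical, not the API of any product.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

class TextStore:
    """Toy text-based store: documents kept verbatim, plus a value index."""

    def __init__(self):
        self.docs = {}                    # doc_id -> original XML text
        self.index = defaultdict(set)     # (element name, text) -> doc_ids

    def put(self, doc_id, xml_text):
        # Note: re-storing an existing doc_id does not remove stale
        # index entries; a real system would have to handle that.
        self.docs[doc_id] = xml_text
        # Parse once at store time to build the index.
        for elem in ET.fromstring(xml_text).iter():
            if elem.text and elem.text.strip():
                self.index[(elem.tag, elem.text.strip())].add(doc_id)

    def find(self, tag, value):
        """Return the stored text of every document where <tag> equals value."""
        return [self.docs[d] for d in sorted(self.index.get((tag, value), ()))]

store = TextStore()
store.put("a1", "<Address><City>Chicago</City><State>IL</State></Address>")
store.put("a2", "<Address><City>Boston</City><State>MA</State></Address>")
hits = store.find("City", "Chicago")
print(hits)
```

The lookup never parses a document; parsing happens only when the index is built, which is why this style of storage leans so heavily on indexes.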
Text-based storage
<Address>
  <Street>123 Main St.</Street>
  <City>Chicago</City>
  <State>IL</State>
  <PostCode>60609</PostCode>
  <Country>USA</Country>
</Address>
(Figure: a stack of such Address documents, each stored as text)
Text-based databases
• Indexed files
» TextML
• Proprietary
» GoXML DB
Model-based storage
• Stores documents according to a specific model
• For example, maps DOM to relational database
• Underlying storage can be relational, object-oriented,
hierarchical, or proprietary
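One way to picture a model-to-relational mapping is a single table with one row per element. The sketch below is an assumption-laden toy, not the scheme used by any database listed in this talk: the `nodes` table layout and the `shred` / `rebuild` helpers are hypothetical, attributes are omitted for brevity, and SQLite stands in for the underlying relational engine.

```python
import itertools
import sqlite3
import xml.etree.ElementTree as ET

def shred(conn, doc_id, xml_text):
    """Store each element as a row: doc, node id, parent id, position, tag, text."""
    ids = itertools.count(1)

    def visit(elem, parent_id, pos):
        node_id = next(ids)
        conn.execute(
            "INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?)",
            (doc_id, node_id, parent_id, pos, elem.tag, (elem.text or "").strip()),
        )
        for i, child in enumerate(elem):
            visit(child, node_id, i)

    visit(ET.fromstring(xml_text), None, 0)

def rebuild(conn, doc_id):
    """Rebuild the tree; pos restores document order among siblings."""
    rows = conn.execute(
        "SELECT node_id, parent_id, tag, text FROM nodes "
        "WHERE doc_id = ? ORDER BY parent_id, pos",
        (doc_id,),
    ).fetchall()
    elems, root = {}, None
    for node_id, parent_id, tag, text in rows:
        if parent_id is None:
            elem = root = ET.Element(tag)
        else:
            # Parents are numbered before their children, so the
            # parent row has always been processed by this point.
            elem = ET.SubElement(elems[parent_id], tag)
        elem.text = text or None
        elems[node_id] = elem
    return root

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (doc_id TEXT, node_id INTEGER, parent_id INTEGER, "
    "pos INTEGER, tag TEXT, text TEXT)"
)
ADDRESS = "<Address><Street>123 Main St.</Street><City>Chicago</City><State>IL</State></Address>"
shred(conn, "addr1", ADDRESS)
result = ET.tostring(rebuild(conn, "addr1"), encoding="unicode")
print(result)
```

The document survives the trip only at the level of the model: element names, text, and sibling order come back, while anything the table does not record (here, attributes and whitespace) is lost.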
Model-based storage
<Address>
  <Street>123 Main St.</Street>
  <City>Chicago</City>
  <State>IL</State>
  <PostCode>60609</PostCode>
  <Country>USA</Country>
</Address>
Element
  Element
    Text
  Element
    Text
  Element
    Text
  Element
    Text
  Element
    Text
Model-based databases
• Pre-parsed DOM
» Infonyte (PDOM), dbXML, XDBM
• Proprietary
» Tamino, Birdstep, Lore, Neocore(?), SIM(?), Virtuoso(?),
XYZFind
• Relational
» Xfinity, DBDOM, eXist
• Object-oriented
» eXcelon, X-Hive, Ozone/Prowler, 4Suite
When Should I Use a Native XML Database?
Storing document-centric documents
• Saves physical info (entity references, CDATA, etc.)
• Stores document ID / name
• Supports document-centric queries
» Retrieve the first section containing a list in the third chapter
» Retrieve the headings of all chapters that contain hyperlinks
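The two queries above can be sketched against a small in-memory document. The element names (`book`, `chapter`, `title`, `section`, `list`, `link`) are invented for illustration, and Python's `xml.etree.ElementTree` stands in for a database query engine; a native XML database would normally express these in its own query language.

```python
import xml.etree.ElementTree as ET

BOOK = ET.fromstring("""
<book>
  <chapter><title>Intro</title>
    <para>See <link href="http://example.com">this</link>.</para>
  </chapter>
  <chapter><title>Models</title><para>No links here.</para></chapter>
  <chapter><title>Queries</title>
    <section><para>Plain text.</para></section>
    <section><list><item>one</item></list></section>
    <section><list><item>two</item></list></section>
  </chapter>
</book>
""")

chapters = BOOK.findall("chapter")

# Headings of all chapters that contain a hyperlink anywhere inside them.
linked = [c.findtext("title") for c in chapters if c.find(".//link") is not None]

# First section containing a list in the third chapter.
third = chapters[2]
first_list_section = next(
    s for s in third.findall("section") if s.find(".//list") is not None
)

print(linked)
print(first_list_section.findtext("list/item"))
```

Both queries depend on document order and nesting, which is exactly the information a table-oriented mapping tends to discard.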
“Natural” format is XML
• XHTML, DocBook, etc.
• Data stored temporarily as XML
» For example, in a message queue
• Common format of many documents is XML
» For example, Web search engine database
Retrieval speed is critical
• One hierarchical view must predominate
» Happens today: 15 billion gigabytes of data in IMS
» Relational queries are hierarchy-neutral
• Speed depends on:
» Query
» Underlying storage engine
» Output format (DOM, SAX, string)
Semi-structured data
• Structure is present, but not regular like tabular data
• For example, genealogical records or patient records
• Difficult to store in a relational database
» Choice is many tables or many nulls
• Structure might not be known at design time
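To make the "many nulls" side of that choice concrete, the sketch below flattens three invented patient records into one wide table. The records and field names are made up for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured records: each patient has different fields.
RECORDS = [
    "<Patient><Name>A. Jones</Name><Allergy>penicillin</Allergy></Patient>",
    "<Patient><Name>B. Smith</Name><BloodType>O+</BloodType>"
    "<Implant>pacemaker</Implant></Patient>",
    "<Patient><Name>C. Wu</Name></Patient>",
]

parsed = [{child.tag: child.text for child in ET.fromstring(r)} for r in RECORDS]

# One wide table needs a column for every field that appears in any record.
columns = sorted({tag for rec in parsed for tag in rec})
table = [[rec.get(col) for col in columns] for rec in parsed]

# Cells with no value become NULLs in the relational version.
nulls = sum(cell is None for row in table for cell in row)
print(columns)
print(nulls, "of", len(table) * len(columns), "cells are null")
```

Even with three records, half of the cells are empty; the alternative is one table per optional field.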
Well-formed documents
• No known schema
• Best example is documents stored by Web search engine
• Storing data in such documents is very inefficient
» Tables and mappings must be created at run-time
Normalization, Referential Integrity, Scalability, and Performance
Normalization
• Means that a given piece of data appears only once
• Reduces disk usage
• Reduces potential update errors
• Fundamental concept of relational databases
Normalization and native XML databases
• Concept same as in relational database
• Only difference is database model
» Relational tables are flat, can only store single values
» XML documents are hierarchical, can store multiple values
• Not required
Example: Sales order
• Requires two tables in RDBMS
• Can store in a single document in native XML database
• Both are “normalized”
Relational database:
Orders
  ...    ...        ...
  1234   29.10.00   Gallagher Industries
  ...    ...        ...
Items
  ...    ...   ...    ...   ...
  1234   1     A-10   12    10.95
  1234   2     B-43   600   3.99
  ...    ...   ...    ...   ...
XML document:
<Order>
  <Number>1234</Number>
  <Customer>Gallagher Industries</Customer>
  <Date>29.10.00</Date>
  <Item Number="1">
    <Part>A-10</Part>
    <Quantity>12</Quantity>
    <Price>10.95</Price>
  </Item>
  <Item Number="2">
    <Part>B-43</Part>
    <Quantity>600</Quantity>
    <Price>3.99</Price>
  </Item>
</Order>
Problem: Real sales order
• Real world not that simple
• Sales order probably contains customer information
» ID, name, bill-to address, ship-to address, etc.
<Order>
  <Number>1234</Number>
  <Customer>
    <CustID>020962</CustID>
    <CustName>Gallagher Industries</CustName>
    <BillToAddress>...</BillToAddress>
    <ShipToAddress>...</ShipToAddress>
  </Customer>
  <Date>29.10.00</Date>
  <Item Number="1">
    <Part>A-10</Part>
    <Quantity>12</Quantity>
    <Price>10.95</Price>
  </Item>
  <Item Number="2">
    <Part>B-43</Part>
    <Quantity>600</Quantity>
    <Price>3.99</Price>
  </Item>
</Order>
Solutions: Real sales order
• Normal: Store customer info in separate file
» Use XLinks or joins
» XLinks not widely supported (will be in future?)
» If normalized and flat, might as well use relational database
• Non-normal: Store customer info in each sales order
» Gains retrieval speed at the cost of query flexibility and more complex updates
» Real-world relational databases often not normal
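The two solutions can be contrasted in a few lines. This is only a sketch: the `CUSTOMERS` dictionary stands in for a separate customer file, the lookup by `CustID` stands in for a join (no XLink processing is attempted), and both helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Normalized: customer data lives in its own document; orders refer to it by ID.
CUSTOMERS = {
    "020962": ET.fromstring(
        "<Customer><CustID>020962</CustID>"
        "<CustName>Gallagher Industries</CustName></Customer>"
    )
}
ORDER = ET.fromstring(
    "<Order><Number>1234</Number><CustID>020962</CustID>"
    "<Date>29.10.00</Date></Order>"
)

def customer_name_normalized(order):
    """Join step: a second lookup is needed to reach the customer document."""
    customer = CUSTOMERS[order.findtext("CustID")]
    return customer.findtext("CustName")

# Non-normalized: the customer element is embedded in every order.
ORDER_EMBEDDED = ET.fromstring(
    "<Order><Number>1234</Number>"
    "<Customer><CustID>020962</CustID>"
    "<CustName>Gallagher Industries</CustName></Customer>"
    "<Date>29.10.00</Date></Order>"
)

def customer_name_embedded(order):
    """No join, but a name change must be applied to every stored order."""
    return order.findtext("Customer/CustName")

print(customer_name_normalized(ORDER))
print(customer_name_embedded(ORDER_EMBEDDED))
```

Both calls return the same name; the difference is where the work lands, at read time for the normalized form and at update time for the embedded one.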
Normalization and document-centric documents
• Often not worth doing
• For example, in a collection of user manuals
» Each contains copyright, company logo, company address
» Duplicate information not worth normalizing
• Matters only when there is significant overlap
» Procedures common to many models of same product
» List of worldwide customer support contacts
» ...
Referential integrity
• Refers to validity of pointers to other data
» For example, PartNumber in Items points to valid row in Parts
• Applies to XLinks and external entity references
• XLinks generally not supported => not an issue
• Probably not enforced for external entity references
• Needs support in the future
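Until the database enforces such constraints itself, a check of this kind has to live in the application. The sketch below shows the idea for the PartNumber example; the `PARTS` set and the `dangling_parts` helper are hypothetical, and a relational database would express the same rule as a foreign key.

```python
import xml.etree.ElementTree as ET

PARTS = {"A-10", "B-43"}  # hypothetical set of valid part numbers

def dangling_parts(order_xml, valid_parts):
    """Return part numbers in an order that do not point at a known part."""
    order = ET.fromstring(order_xml)
    return [p.text for p in order.iter("Part") if p.text not in valid_parts]

GOOD = "<Order><Item Number='1'><Part>A-10</Part></Item></Order>"
BAD = "<Order><Item Number='1'><Part>Z-99</Part></Item></Order>"

print(dangling_parts(GOOD, PARTS))
print(dangling_parts(BAD, PARTS))
```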
Scalability and performance
• Outside my area of expertise
• Native XML databases appear to scale / perform
» Much better than relational databases when retrieving whole documents or fragments
» Much worse than relational databases when retrieving unindexed data
» Slower(?) than relational databases when retrieving views of indexed data that don’t follow the storage hierarchy
• Benchmark data not yet available
Whole documents or fragments
• Text-based databases are very fast
» Data is contiguous on disk
» Retrieval requires index lookup and single disk read
1. Index lookup
2. Position disk head
3. Read to here
Whole documents or fragments (cont.)
• Model-based databases with proprietary storage are fast
» Generally use physical pointers between nodes
1. Index lookup
2. Position disk head
3. Follow pointers to here
(Figure: Node boxes linked by physical pointers)
• Model-based databases built on other DBs may be fast
» Depends on underlying database and implementation strategy
Views not following storage hierarchy
• Slower than hierarchical views?
• May require many index lookups or linear searches
» Pointers to parent nodes should help in model-based databases
• Relational databases are query neutral
Get the dates of all sales orders for part “A-10”
1. Index lookup for part “A-10”
2. Follow pointers to Order?
3. Search children for Date?
<Order>
  <Number>1234</Number>
  <Customer>Gallagher Industries</Customer>
  <Date>29.10.00</Date>
  <Item Number="1">
    <Part>A-10</Part>
    <Quantity>12</Quantity>
    <Price>10.95</Price>
  </Item>
  <Item Number="2">
    <Part>B-43</Part>
    <Quantity>600</Quantity>
    <Price>3.99</Price>
  </Item>
</Order>
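The same query, "get the dates of all sales orders for part A-10", is sketched below in both styles. The column names `OrderNumber` and `ItemNumber`, the `Orders` wrapper element, and the parent map are assumptions made for the example; SQLite and `xml.etree.ElementTree` stand in for the two kinds of database, and neither side uses an index here.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Relational form: the join does not care which table is "on top".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (Number INTEGER, Date TEXT, Customer TEXT);
    CREATE TABLE Items (OrderNumber INTEGER, ItemNumber INTEGER,
                        Part TEXT, Quantity INTEGER, Price REAL);
    INSERT INTO Orders VALUES (1234, '29.10.00', 'Gallagher Industries');
    INSERT INTO Items VALUES (1234, 1, 'A-10', 12, 10.95);
    INSERT INTO Items VALUES (1234, 2, 'B-43', 600, 3.99);
""")
sql_dates = [row[0] for row in conn.execute(
    "SELECT o.Date FROM Orders o JOIN Items i ON i.OrderNumber = o.Number "
    "WHERE i.Part = ?", ("A-10",))]

# Hierarchical form: find the Part, climb to its Order, then look for Date.
ORDERS = ET.fromstring("""
<Orders>
  <Order>
    <Number>1234</Number>
    <Date>29.10.00</Date>
    <Item Number="1"><Part>A-10</Part></Item>
    <Item Number="2"><Part>B-43</Part></Item>
  </Order>
</Orders>
""")
# ElementTree has no parent pointers, so build a child -> parent map first.
parent = {child: p for p in ORDERS.iter() for child in p}

xml_dates = []
for part in ORDERS.iter("Part"):
    if part.text == "A-10":
        order = parent[parent[part]]      # Part -> Item -> Order
        xml_dates.append(order.findtext("Date"))

print(sql_dates, xml_dates)
```

The XML version has to move against the storage hierarchy (up from Part, then back down to Date), which is the extra work the steps above put question marks on.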
Indexed data
• Native XML databases use indexes heavily
• Index lookup speed same as any database, but ...
• ... more index lookups may be required than by RDBMS
• Update times slower due to index updates
Unindexed data
• Slow for model-based databases
» Must read all elements, not just elements of a particular type
» Comparisons slower due to converting text
• Very slow for text-based databases
» Must parse document as well as comparing values
Find date 29.10.00
Relational database:
1. Search this column
Model-based native XML database:
1. Search all elements for Date elements
2. Search text for all Date elements
Orders
  ...    ...        ...
  1234   29.10.00   Gallagher Industries
  ...    ...        ...
Element
  Element
    Text
  Element
    Text
  Element
    Text
  Element
    Attr
    Element ...
Query return types
• String, DOM tree, SAX events
• Text-based databases
» Very fast returning strings
» Slow returning DOM trees or SAX events due to parsing
• Model-based databases
» Probably similar speed to relational databases for all types
Native XML Database Features
Document Collections
• Contain related documents
• Similar to
» Catalog/schema in relational database
» Directory in file system
• Some databases allow nested collections
Indexes
• All databases use indexes
• Some databases index everything
• Other databases allow user to specify what to index
Query Languages
• XPath and XQL are most common
» Usually include extensions for multi-document queries
• Many databases have proprietary languages
• XQuery will probably be standard in the future
Updates
• Many databases simply replace existing document
• Some databases allow updates through live DOM
• Other databases have fragment update language
• Best way to do updates still unclear
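Two of these strategies can be sketched side by side. The in-memory `store` dictionary and both function names are hypothetical and do not correspond to any product's update language; the fragment update here still rewrites the whole stored text, which a real database would presumably avoid.

```python
import xml.etree.ElementTree as ET

# Hypothetical document store: doc_id -> XML text.
store = {
    "order-1234": '<Order><Number>1234</Number>'
                  '<Item Number="1"><Quantity>12</Quantity></Item></Order>'
}

def replace_document(doc_id, new_xml):
    """Simplest strategy: overwrite the stored document wholesale."""
    ET.fromstring(new_xml)          # reject text that is not well-formed
    store[doc_id] = new_xml

def update_fragment(doc_id, path, new_text):
    """Fragment-style update: change one node, then write the document back."""
    root = ET.fromstring(store[doc_id])
    node = root.find(path)
    if node is None:
        raise KeyError(path)
    node.text = new_text
    store[doc_id] = ET.tostring(root, encoding="unicode")

update_fragment("order-1234", "Item[@Number='1']/Quantity", "24")
print(store["order-1234"])
```

Whole-document replacement is trivial to implement but forces the client to fetch, edit, and resend everything; fragment updates move that work into the database.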
Transactions, Locking, and Concurrency
• Most databases support transactions
• Locking often at document (not fragment) level
• Whether this is an issue depends on
» What is stored in a single document
» Number of concurrent users
• Fragment locking probably more common in future
APIs
• Most databases have proprietary APIs
» XML:DB is database-neutral API
» Standard API (XML:DB or other) likely in future
• APIs similar to ODBC
» Query language is separate from API
» Methods to connect, execute queries, retrieve results, commit
transactions
» Results returned as single document or set of documents
» Documents returned as string, DOM tree, or SAX events
• Most databases support HTTP
Round-tripping
• All native XML databases can round-trip documents
• Round-trip level depends on database
• Text-based databases usually do exact round-tripping
• Model-based databases round-trip at level of model
» Minimum is elements, attributes, PCDATA, and document
order
» May be less than canonical XML (comments and processing
instructions discarded)
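The difference between exact and model-level round-tripping is easy to see with a parser whose default model ignores comments and processing instructions. In the sketch below, Python's `xml.etree.ElementTree` plays the role of a model-based store and a plain string plays the role of a text-based one; the sample document is invented.

```python
import xml.etree.ElementTree as ET

ORIGINAL = '<!-- draft --><a><?render fast?><b x="1">t</b></a>'

# Text-based storage: what goes in is what comes out.
text_round_trip = ORIGINAL

# Model-based storage: only what the model records survives.
# ElementTree's default parser drops comments and processing instructions.
model_round_trip = ET.tostring(ET.fromstring(ORIGINAL), encoding="unicode")

print(text_round_trip)
print(model_round_trip)
```

Elements, attributes, PCDATA, and document order come back intact in both cases; the comment and the processing instruction survive only in the text-based version.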
External data
• Some databases can merge data from external databases,
such as with ODBC, OLE DB, JDBC
• Whether data is live depends on database
• In the future, most databases will probably support live
external data
External entity storage
• Not clear whether to store entity or URI
» Storing entity value is incorrect if URI points to live data
» Storing URI may be incorrect if entity meant as a snapshot
• Not sure how databases handle this problem
• Correct answer is probably to let user decide
Resources
• Ronald Bourret’s Papers Page
» http://www.rpbourret.com/xml/index.htm
• XML:DB.org’s Resources Page
» http://www.xmldb.org/resources.html
• XML:DB Mailing List
» http://www.xmldb.org/projects.html
Questions?
Ronald Bourret
[email protected]
http://www.rpbourret.com