XML Databases in BMI
CSE 300
UCONN Spring 2008, CSE 300: BMI
taught by: Prof. Steve Demurjian
presented by: James Lindsay
<ClinicalDocument
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:mif="urn:hl7-org:v3/mif" xmlns="urn:hl7-org:v3">
<realmCode code="US"/>
<typeId root="2.16.840.1.113883.1.3" extension="POCD_HD000040"/>
<!-- Conformant to NHSN Generic Constraints -->
<templateId root="2.16.840.1.113883.3.117.1.1.1" />
<!-- Conformant to the NHSN Constraints for BSI Numerator Report -->
<templateId root="2.16.840.1.113883.3.117.1.1.3.1" />
...
</ClinicalDocument>

Overview
What is XML: Overview, tags, schema.
XML query languages: XPath, XQuery.
XML data models: Data/document-centric, biomedical data.
Storage Strategy + XML DBMS: Relational, CMS, native.
Native XML DBMS: Pros / Cons.
Biomedical Information
BMI Databases: Overview, XML.
HL7 and CDA: Overview, examples.
Examples of BMI XML.
UCONN BMI XML.
Survey of Technology.

XML overview
eXtensible Markup Language
Similar to HTML
Meta-language that describes the content of the document (self-describing).
XML is primarily used as a data storage and interchange medium.
XML exists in plain-text format; however, it may be compressed or otherwise altered for transfer.

XML overview cont.
There are no predefined tags or grammar inherent in XML.
XML tags give an XML document structure and meaning.
Available tags are defined by a schema.
All tags in an XML document come in pairs, open and close (an empty element may use the self-closing form, as in <realmCode code="US"/>).
Tags are completely nested, and there is no ambiguity in their order.
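The pairing and nesting rules can be checked mechanically. A minimal sketch in Python, using the standard-library parser; the helper name is_well_formed and both sample strings are made up for illustration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested: every open tag is closed in reverse order of opening.
nested = "<order><item><title>Book</title></item></order>"
# Overlapping tags: <item> is closed before <title>, so nesting is broken.
overlapping = "<order><item><title>Book</item></title></order>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```

This only tests well-formedness; whether the tags are the allowed ones is a separate question answered by a schema.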

XML tags
XML tags may have attributes, which are used to store information within the tag. Meta-data.
Plain text can be placed between tags. This text is character data (PCDATA); only text inside a CDATA section is left unparsed.
The CDATA attribute type is character data. This means that any string of non-markup characters is legal as the value of the attribute.
The ENTITY attribute type indicates that the attribute value refers to an external entity declared for the document.
Use the ID attribute type if you want to specify a unique identifier for each element.
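To make the distinction between attributes and text content concrete, here is a small Python sketch. The element borrows the code attributes from the HL7 observation shown later in the deck; the text content and the note element are invented for the example:

```python
import xml.etree.ElementTree as ET

# Attributes live inside the open tag; text lives between the tags.
doc = '<code code="313193002" displayName="Peak flow">measured at rest</code>'
elem = ET.fromstring(doc)

tag_name = elem.tag               # "code"
attributes = dict(elem.attrib)    # metadata stored in the tag
text = elem.text                  # character data between the tags

# Inside a CDATA section, markup characters such as "<" are not parsed.
cdata_text = ET.fromstring("<note><![CDATA[a < b]]></note>").text

print(tag_name, attributes, text, cdata_text)
```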

XML Schema
The structure of an XML document is defined by its schema.
Dozens of languages exist to define an XML schema:
DTD
W3C XML Schema (XSD)
RELAX NG
A schema file can be used to validate any instance of an XML document against it.
This file, or schema, also defines the allowable tags.

Schema Example (XSD)
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="shiporder">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="orderperson" type="xs:string"/>
        <xs:element name="shipto">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="address" type="xs:string"/>
              <xs:element name="city" type="xs:string"/>
              <xs:element name="country" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="item" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="note" type="xs:string" minOccurs="0"/>
              <xs:element name="quantity" type="xs:positiveInteger"/>
              <xs:element name="price" type="xs:decimal"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="orderid" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

XML Structure
XML employs a tree structure model for representing data (see the schema on the previous slide).
shiporder
  orderid (attribute)
  orderperson
  shipto
    name
    address
    city
    country
  item
    title
    note
    quantity
    price
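The tree view can be reproduced by walking a parsed document. A rough sketch in Python; the instance document and all of its values are invented to match the shiporder schema, and outline is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

# Made-up instance of the shiporder schema (the optional note is omitted).
SHIPORDER = """<shiporder orderid="A-1">
  <orderperson>Example Person</orderperson>
  <shipto>
    <name>Example Recipient</name>
    <address>1 Example Street</address>
    <city>Storrs</city>
    <country>USA</country>
  </shipto>
  <item>
    <title>First title</title>
    <quantity>1</quantity>
    <price>10.90</price>
  </item>
</shiporder>"""

def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(SHIPORDER))
print("\n".join(tree_lines))
```

Attributes such as orderid are not child nodes in this API, so they do not appear in the outline; they are read with element.get("orderid").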

Querying XML - XPath
Many languages exist to query XML. We'll focus on XPath and XQuery as they are W3C standards.
XPath is a compact method of traversing the previous tree.
Designed to facilitate use within URLs/URIs.
/shiporder/item/title ← view all items' titles
Extensible to add user defined behaviors.
Treats each element as a node in the tree.
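As an illustration of path-based selection, the sketch below uses Python's ElementTree, which implements only a subset of XPath. It selects the title child that the schema defines for each item; the sample document is invented:

```python
import xml.etree.ElementTree as ET

# Made-up shiporder instance with two items.
SHIPORDER = """<shiporder orderid="A-1">
  <orderperson>Example Person</orderperson>
  <item><title>First title</title><quantity>1</quantity><price>10.90</price></item>
  <item><title>Second title</title><quantity>2</quantity><price>9.90</price></item>
</shiporder>"""

root = ET.fromstring(SHIPORDER)

# ElementTree paths are relative to the element they are called on, so the
# absolute XPath /shiporder/item/title becomes "item/title" from the root.
titles = [t.text for t in root.findall("item/title")]

# A simple predicate: items whose quantity child has the text "2".
bulk = [i.findtext("title") for i in root.findall("item[quantity='2']")]

print(titles)  # ['First title', 'Second title']
print(bulk)    # ['Second title']
```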

Querying XML - XQuery
Functional extension of XPath
XML equivalent of SQL
Navigate and manipulate document nodes.
Works on collections of documents, or even fragments.
for $b in doc("bib.xml")//book
where $b/publisher = "Morgan Kaufmann"
  and $b/year = "1998"
return $b/title
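For readers more familiar with procedural code, the query above reads roughly like the loop below. This is a Python approximation, not XQuery: the bib data is invented, titles_by is a hypothetical helper, and it returns title strings where the query returns title elements.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for bib.xml.
BIB = """<bib>
  <book><title>Book A</title><publisher>Morgan Kaufmann</publisher><year>1998</year></book>
  <book><title>Book B</title><publisher>Morgan Kaufmann</publisher><year>2001</year></book>
  <book><title>Book C</title><publisher>Other Press</publisher><year>1998</year></book>
</bib>"""

def titles_by(root, publisher, year):
    """Mimic the for / where / return clauses with a plain loop."""
    result = []
    for book in root.iter("book"):                      # for each book
        if (book.findtext("publisher") == publisher     # where publisher matches
                and book.findtext("year") == year):     # and year matches
            result.append(book.findtext("title"))       # return the title
    return result

matches = titles_by(ET.fromstring(BIB), "Morgan Kaufmann", "1998")
print(matches)  # ['Book A']
```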

XML Models
Naively, there are two models of XML use:
Data-centric
Document-centric
In reality, most XML use is a hybrid of the two.
More important is the database strategy used with XML.
Relational
Content Management
Native XML

Data – centric model
Information is generally stored in a relational database.
XML is a transport medium, nothing more.
It is irrelevant to the application that the data exists as XML for some period of time.
Characteristics:
Fine grained data.
Data relationship is insignificant.
Need to transfer relational information.
Means of storing new information.

Document – centric Model
XML is utilized solely as a document (for example, this presentation in Open Office).
The documents, in part or in full, are stored and retrieved.
Does not originate from relational database.
Document used for human consumption.
Usually the information is written by hand in a format like PDF or RTF, then converted to XML.

Reality: Hybrid Model
Most documents, like a PDF, will also contain fine grained information (last edited date, character set).
Data from a relational DB may even be a document, or require self description.
Various database technologies support all models.
It is important to understand your data and choose the DB technology that is most compatible.

Medical Data Model
Medical data is non-homogeneous.
But there exist general trends in medical data:
Fine grained data such as dates, times, images.
Documents and human generated descriptions and observations.
Human interaction creates semi-structured data.
Ability to transfer information is essential.
Medical data fits into the hybrid model.

Data – centric Comparison
Advantages:
Utilizes existing database software. (IBM, Oracle, MS)
Quick (existing DBs are already fast).
Dual role (not limited only to XML).
Many even support XQuery.
Disadvantages:
More configuration (mapping relational -> XML).
Slower when creating complex XML files due to the middle step.

Document – centric Comparison
Advantages:
Good integration into workflow.
Document management made easy.
Collaboration, and web publishing.
Disadvantages:
Not able to extract data from document directly.
Not designed for high availability, high load systems.
Non-uniformity in implementations.

Storage Strategy: Relational
Utilizing a relational database to store XML documents and data is very popular.
In a very data – centric application this approach is intuitive.
Most top tier database applications support XML in some way.
Oracle, SQL Server, IBM, etc.
Software is highly supported and well developed.

XML Schema mapping
Using a relational DB requires mapping XML schema to DB schema.
Table based:
Often implemented as a middleware layer.
Schema structure must follow row-column convention.
Object – relational:
XML is a tree of objects.
Mapped to DB using well established OR methods.
Natively supported in some DB apps.
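A table-based mapping can be sketched as a small middleware step that splits a document into rows. The example below is an assumption-laden illustration in Python with SQLite: the two-table layout, the column names and the sample order are all invented, and real mapping layers are considerably more general.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up shiporder instance; the second item carries the optional note.
SHIPORDER = """<shiporder orderid="A-1">
  <orderperson>Example Person</orderperson>
  <item><title>First title</title><quantity>1</quantity><price>10.90</price></item>
  <item><title>Second title</title><note>gift</note><quantity>2</quantity><price>9.90</price></item>
</shiporder>"""

conn = sqlite3.connect(":memory:")
# One table per repeating level of the tree (hypothetical layout).
conn.execute("CREATE TABLE shiporder (orderid TEXT PRIMARY KEY, orderperson TEXT)")
conn.execute("""CREATE TABLE item (
    orderid TEXT REFERENCES shiporder(orderid),
    title TEXT, note TEXT, quantity INTEGER, price REAL)""")

root = ET.fromstring(SHIPORDER)
orderid = root.get("orderid")
conn.execute("INSERT INTO shiporder VALUES (?, ?)",
             (orderid, root.findtext("orderperson")))
for item in root.findall("item"):
    # findtext returns None for a missing optional element, stored as NULL.
    conn.execute(
        "INSERT INTO item VALUES (?, ?, ?, ?, ?)",
        (orderid, item.findtext("title"), item.findtext("note"),
         int(item.findtext("quantity")), float(item.findtext("price"))))

rows = conn.execute(
    "SELECT title, note, quantity FROM item WHERE orderid = ? ORDER BY title",
    (orderid,)).fetchall()
print(rows)  # [('First title', None, 1), ('Second title', 'gift', 2)]
```

Rebuilding the original document from these tables needs a join between shiporder and item, plus code to re-nest the rows.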

Storage Strategy: CMS
Used exclusively in the document-centric model.
Various programs allow indexing, storage, manipulation, and publication of XML documents.
Application specific.
Numerous implementations, most recently Open Office and MS Word 2007.
Not very interesting or useful in the context of biomedical information.

Storage Strategy: Native
Semi – structured data.
Mapping to relational DB causes inflation and null space.
Need more functionality and granularity than CMS.
Performance increase over relational DB by avoiding joins.
Assuming data is in appropriate order on disk.
Only returns XML, need to convert for non-XML manipulation.
Development still in infancy as of Winter 2007.

Native XML Databases
Definition:
“A database that has an XML document as its fundamental unit of
(logical) storage and defines a (logical) model for an XML document,
as opposed to the data in that document, and stores and retrieves
documents according to that model. At a minimum, the model must
include elements, attributes, PCDATA, and document order.”
Data types: No support in XML, need a mapping.
Document or database schema can be used.
External user defined mapping.
Not necessary when only transferring data.
No requirement on underlying medium or implementation.
Two architectures: text based and model based.

Native: Text-based
Use any DB.
Rather than mapping schemas, store entire XML documents.
Usually involves saving the entire document as a BLOB / Character LOB.
Utilize various text field searches to retrieve info from the XML document.
Some DB text search features are being made XML aware.
Speed: the document is stored together on disk, which favors full or partial document retrieval.
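The text-based approach can be approximated with any database that holds a character column. A minimal sketch in Python with SQLite; the table name xml_docs, the document names and their contents are invented, and a plain LIKE stands in for the DB's text search:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Two invented documents, stored whole rather than mapped to a schema.
docs = {
    "order-1.xml": '<shiporder orderid="A-1"><orderperson>Example Person</orderperson></shiporder>',
    "order-2.xml": '<shiporder orderid="B-2"><orderperson>Another Person</orderperson></shiporder>',
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xml_docs (name TEXT PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO xml_docs VALUES (?, ?)", docs.items())

# Step 1: an ordinary text search locates candidate documents.
hits = conn.execute(
    "SELECT name, body FROM xml_docs WHERE body LIKE ?", ("%Another%",)).fetchall()

# Step 2: the application parses each hit to get at the XML structure.
found = [(name, ET.fromstring(body).get("orderid")) for name, body in hits]
print(found)  # [('order-2.xml', 'B-2')]
```

The search in step 1 knows nothing about tags, which is why XML-aware text searching matters for this architecture.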

Native: Model-based
Internal object model of the document schema.
Store this model in a database.
Relational / object-oriented database.
Proprietary.
Performance similar to chosen DB engine.
Still limited by hierarchy of XML data.
Retrieving all orderids from hundreds of docs is slow.
Support for common XML query languages
XPath, XQuery, etc.

Native XML: TLC
In the traditional database world, transactions, locking and concurrency are paramount.
Native XML databases aren't mature enough to support everything.
Most support transactions, but what about LC?
Document level locking is easy, but too coarse.
Only a few implementations support node level locking.
Commercial products generally support ACID, free ones are just starting to (2008).

Native XML: APIs
Ubiquity of ODBC interfaces.
Still applies to native XML databases.
Most implementations provide their own interface for a variety of languages.
Industry standardization:
XML:DB API from XML:DB.org, programming language neutral.
JSR 225: XQuery API for Java (XQJ). IBM and Oracle.
Native XML: The Rest
CSE
300
Referential
integrity is supported in an adhoc
manner at best.
Database cannot enforce user defined (via
schema) integrity.
Some
standard mechanisms allow it.
Eventually
both mechanisms will be supported.
Currently relies heavily on application for
normalization and integrity.
Certainly a drawback for medical applications.

Native XML: Scalability
A limitation of any DB is time spent seeking on the hard disk.
An XML DB only needs to find the pointer to the head of the doc.
Therefore an XML DB should scale well in the context of retrieving data.
The only caveat is if the retrieval breaks the document hierarchy.
More pointers must be followed, potentially slowing retrieval greatly.
Where there is money, there is a way.

Biomedical Information
Overview of the field.
Data storage and transfer problem.
XML as a solution.
BMI XML examples.
Next section: Choosing a native DB.

BMI Overview
The convergence of computation and biomedicine.
The NIH BMI Science and Tech Initiative:
Define biomedical computing as a science.
Many sources of information:
Clinical, surgical, genetics, drug design, biology.
Standardization in software.
Algorithm development, high speed computing.
All relies on efficient storage and transfer of information.

BMISTI: Databases
“Biomedical computing is entering an age where creative
exploration of huge amounts of data will lay the foundation of
hypotheses.” ~NIH Director
Problems:
Standards. Terminology, syntax and semantics need to be defined
and agreed upon to allow integration of data.
Curation. Database submissions need to be checked and cross-referenced to avoid the transitive propagation of error.
Interoperability. Data should be as consistent as possible across
databases so that researchers can compare and contrast it.
Computational and Systems issues:
Utilize and manipulate information.
Process large volumes of information.

BMI: XML
Data sharing and semantic interoperability.
Case study: Electronic Health Record.
The development and use of an integrated health record for a patient.
Heterogeneous data, e.g. clinical, clinical-trial, genomic data.
Primary Obstacle: Proprietary data formats.
Uniformity on technical level: Text file.
Step towards semantic goal.

XML in Clinical Data
HL7 standards organization.
V2: ASCII bar format. Example:
HL7V3|1|2.02
Message|2.16.840.1.113883.1122^CNTRL-3456|2002081614303516^- --->
06:00||3.0|2.16.840.1.113883^POLB_IN004410||P|I|ER|ER
respondTo|RSP|tel:555-555-5555^^WP
entityRsp|||{FAM^^Hippocrates~GIV^^Harold~GIV^^H~SFX^AC^MD}|tel:555-555-5555^^WP
sender|SND|nfs:127.127.127.255
device||2.16.840.1.113883.1122^GHH LAB|{GIV^^An Entity Name}^L|||tel:555-555-2005^^H
agencyFor
representedOrganization||\NOTH\
location|||2.16.840.1.113883.1122^ELAB-3|{^^GHH Lab}^TN
receiver|RCV|nfs:127.127.127.0
device|||2.16.840.1.113883.1122^GHH O E|{GIV^^An Entity Name}^L|||tel:555-555-2005^^H
agencyFor
representedOrganization|||2.16.840.1.113883.19.3.1001|{^^GHH Outpatient Clinic}^TN
location|||2.16.840.1.113883.1122^BLDG4|{^^GHH Outpatient Clinic}^TN
Awkward, inflexible, unclear meaning of values.
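To see why the meaning of values is unclear, consider what a program gets back after splitting one of the lines above. The sketch below is a simplified illustration in Python, not an HL7 parser: parse_segment is a hypothetical helper that only splits on the bar and caret characters.

```python
def parse_segment(line, field_sep="|", component_sep="^"):
    """Split one bar-delimited line into its name and a list of fields,
    each field being a list of caret-separated components."""
    name, *fields = line.split(field_sep)
    return name, [field.split(component_sep) for field in fields]

name, fields = parse_segment("respondTo|RSP|tel:555-555-5555^^WP")
print(name)    # respondTo
print(fields)  # [['RSP'], ['tel:555-555-5555', '', 'WP']]
```

The result is purely positional: nothing in the data says what the empty second component or the trailing WP means, so every consumer must know the layout in advance.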

HL7 V3 Specification
Built around the Reference Information Model (RIM):
Entity, Role, Participation, and Act
Utilizes dedicated vocabularies and data types.
Every specification must begin from RIM.
Clinical Document Architecture
Utilizes XML with tags like “observation, code, value and id”.
<observation classCode="OBS" moodCode="EVN">
  <id root="10.23.4573.15879"/>
  <code code="313193002" codeSystem="2.16.840.1.113883.6.96"
        codeSystemName="SNOMED CT" displayName="Peak flow"/>
  <effectiveTime value="20000407"/>
  <value xsi:type="RTO_PQ_PQ">
    <numerator value="260" unit="l"/>
    <denominator value="1" unit="min"/>
  </value>
</observation>
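In contrast to the bar format, the values in this observation can be read by name. A sketch in Python under one assumption: the fragment is wrapped with the two namespace declarations from the ClinicalDocument root shown on the title slide, since xsi:type cannot be parsed without a declared xsi prefix.

```python
import xml.etree.ElementTree as ET

# The slide's observation, with namespace declarations added (assumption).
OBSERVATION = """<observation xmlns="urn:hl7-org:v3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    classCode="OBS" moodCode="EVN">
  <id root="10.23.4573.15879"/>
  <code code="313193002" codeSystem="2.16.840.1.113883.6.96"
        codeSystemName="SNOMED CT" displayName="Peak flow"/>
  <effectiveTime value="20000407"/>
  <value xsi:type="RTO_PQ_PQ">
    <numerator value="260" unit="l"/>
    <denominator value="1" unit="min"/>
  </value>
</observation>"""

NS = {"hl7": "urn:hl7-org:v3"}
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

obs = ET.fromstring(OBSERVATION)
code = obs.find("hl7:code", NS)
value = obs.find("hl7:value", NS)
num = value.find("hl7:numerator", NS)
den = value.find("hl7:denominator", NS)

# Combine the named pieces into a readable summary line.
summary = "{}: {} {}/{} on {}".format(
    code.get("displayName"),
    float(num.get("value")) / float(den.get("value")),
    num.get("unit"), den.get("unit"),
    obs.find("hl7:effectiveTime", NS).get("value"))
print(summary)  # Peak flow: 260.0 l/min on 20000407
```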

XML in Clinical Trials
Example: Drug studies
Utilizing XML would eliminate manual transcription when moving data from one system to another.
XML is a universal datatype as it stores everything in text.
Therefore it can handle new tech. seamlessly.
Clinical Data Interchange Standards Consortium (CDISC).
Industry standardization.

CDISC: ODM
Operational Data Model:
XML based.
Facilitate moving data from any collection system to clinical trial sponsor.
Addresses real world issues:
Incomplete data
Partial data transfer
Versioning and branching.
ODM 1.1 current version.

ODM: Layout

XML in Genomic Data
Various groups export their data in XML
NCBI, EBI
They do not follow the same schema, which only allows partial semantic interoperability.
Microarray Gene Expression Group (MAGE) publishes a schema.
MAGE files are often several gigabytes.
Illustrates overhead of XML, however researchers still use it because of interoperability.

XML Complexity
Clinical Genomics Special Interest Group (HL7)
Use genomic data in clinical environment.
Utilize several models such as MAGE, BSML (for DNA seqs)
All information in raw models not necessary.
“Bubbling up” analyzes large raw data sets, extracts useful information.
Transfer useful information to new schema / model.
Bottom line, there exist complex workflows to

XML BMI Issues
Clinical information like a verbal description or advice is unstructured.
How do you query this?
Schemas and Models are extremely complex, with nesting, recursion and compound data types.
Difficult mapping to relational databases.
XML instances may be gigabytes in size.
What database solutions exist to handle such large files?

XML BMI Examples
A closer look at the Clinical Document Arch.
Mayo Clinic's implementation of CDA.
Case study using native XML database to facilitate research based upon clinical texts.
Tamino XML DB.
Querying native DB.
UCONN BMI, CSE 300 Spring 2008

XML BMI: CDA
www.hl7.de/iamcda2004/finalmat/day1/Calvin%20Beebe%20CDA%20Update.pdf
A clinical document has:
Persistence: exists for a defined time period.
Stewardship: Maintained by a designated caretaker.
Potential for authentication: May be legally authenticated.
It must be human readable on a standard web browser.
Utilizes standard XML syntax

XML BMI: CDA
www.hl7.de/iamcda2004/finalmat/day1/Calvin%20Beebe%20CDA%20Update.pdf
Mayo Clinic's use of CDA:

A Native XML Database Design for Clinical Document Research
Johnson, Campbell, et al.
Facilitate research, especially research on clinical text.
User needs to be accounted for:
Process queries against text.
Process queries against annotations.
Standard method for querying.
Non-hierarchical document selection (by patient, date,...)
Return varying level of document granularity.
A schema which adapts to new information without breaking old query formulations.
A schema which adapts to new annotations.

cont.
Tamino XML DBMS: A commercial product.
Supports XQuery and text search, which address many of the querying needs.
Utilizes the CDA for structuring metainformation.
A schema structures documents on a sentence-by-sentence level.
Allows high level of granularity.
Tags to link words to semantic and vocabulary library.

UCONN BMI
Utilize a native XML DB to store documents.
Documents could be PHR, health data / statistics, or system meta-data (registration).
Our goal is to provide secure submission and retrieval of a variety of XML data.
For spring 2008, only focusing on submitting registration data.

UCONN BMI: Overview
Current state:
[Diagram: User -> Browser: HTML Form (HTML) -> Java Server: Create XML document (Java) -> Submit to DB (XML)]
Data exists in three different domains:
It is in HTML, a text datatype, when the user enters it.
The server maps the HTML to Java strings to create the XML.
The XML is written to a file on the server, and submitted to the database via a Java API.
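The middle step, turning submitted form fields into an XML document, can be sketched as follows. The deck's server is written in Java; this Python version is only an illustration, and the registration root element, the field names and the helper registration_to_xml are all hypothetical.

```python
import xml.etree.ElementTree as ET

def registration_to_xml(form):
    """Build a flat XML document with one child element per form field."""
    root = ET.Element("registration")
    for field, value in form.items():
        ET.SubElement(root, field).text = value
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")

xml_text = registration_to_xml({"username": "jdoe", "email": "jdoe@example.org"})
print(xml_text)
# <registration><username>jdoe</username><email>jdoe@example.org</email></registration>
```

Building the document through an XML library rather than string concatenation means markup characters in user input (such as & or <) are escaped automatically.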

UCONN BMI: Problems
There are 2 transformations of data.
Each requires a hand coded mapping.
This leads to sloppy code, wasted resources.
Only handles XML as input, what about output?
The database is obtuse (Sedna), what other options exist?
Do we want to store / transmit application data?

UCONN BMI: Model (potential)
[Diagram: System / User -> Browser: HTML Form (HTML, js -> XML) -> Java Server (Java, XQuery) -> Submit to DB (XML)]
Utilize client side JS to create XML.
Use Java API to manipulate XML.
Problems:
Document verified through schema, and XQuery.
Awkward to cross reference input with any other data.
Advantages:
No server side data type conversion.
This model applies to user driven input and systems interactions.

UCONN BMI: Model retrieval
[Diagram: User / System (HTML, JS) -> Query (XQuery) -> Java Server (Java, XSLT) -> DB, with XML returned]
Client queries in XQuery or predefined query in server.
Server uses API to execute XQuery to DB.
Java Server is given XML document, it can:
Apply Java based XSLT and return to requestor. (more reliable)
Return raw document, client side JS applies XSLT. (less server load)
Both

UCONN BMI: Retrieval Problems
There is still no method of performing business logic outside the scope of XSLT or XQuery.
What types of data should be retrieved in XML:
Data that does not require complex logic, like login credential validation, or registration.
Health records and data which follow a defined schema.
Education, treatment, and research information which follow a defined schema.

UCONN BMI: XML Future
Focus on implementing XML features on the appropriate data.
Choose an XML database which offers high reliability and ease of use.
Develop XSLT templates for transforming XML data to appropriate format.

Survey of Native XML DBMS
Comprehensive List:
http://www.rpbourret.com/xml/XMLDatabaseProds.htm#native
Commercial:
Tamino XML Server.
Well developed, supported, many tools available.
Open Source:
Sedna: Fully supports ACID, XQuery.
eXist: Great management, documentation, indexing.

eXist
http://www.rpbourret.com/xml/ProdsNative.htm#exist
Proprietary data store (B+ trees).
Supports XQuery/XPath 2.0
Full text searches.
XML:DB API.
Document level concurrency.
Complete documentation.
Incomplete transaction support.

Sedna
http://www.rpbourret.com/xml/ProdsNative.htm#sedna
Underlying data storage based on DataGuide
Supports XQuery/XPath 2.0
Full text searches.
Custom API for various languages.
Command line admin.
Transaction support.

Questions?
Thank you.

References
“Canonical XML Version 1.0”. John Boyer. 15 March 2001. W3C
“XML Path Language (XPath) 2.0”. W3C Working Draft. 2 May 2003. W3C
“XML Schema”. XML Schema Working Group. 1 January 2008. W3C <http://www.w3.org/XML/Schema>
“XML Schema: Formal Description”. Brown, Fuchs, et al. 25 September 2001. W3C <http://www.w3.org/TR/xmlschema-formal/>
“Extensible Markup Language (XML)”. 1 January 2008. W3C <http://www.w3.org/XML/>
http://www.25hoursaday.com/StoringAndQueryingXML.html
http://www.nih.gov/about/director/060399.htm
http://www.research.ibm.com/journal/sj/452/shabo.html
“Overview of the CDISC Operational Data Model”. 26 April 2002. CDISC