Questions
1. Differentiate between attributes and elements in XML. List some of the important attributes used in
specifying elements in XML schema.
2. XML and HTML
3. XPath
4. XQuery
5. Differentiate between XML schema and XML DTD with a suitable example.
Unit 5: Database Related Standards
SQL Standards:
SQL is used to communicate with a database. According to ANSI (American National Standards Institute), it is the standard
language for relational database management systems. SQL statements are used to perform tasks such as updating data in a
database or retrieving data from a database.
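The two tasks named above, updating and retrieving data, can be sketched with Python's built-in sqlite3 module. This is only an illustration: the employee table and its rows are made up, and SQLite stands in for any relational DBMS.

```python
import sqlite3

# In-memory SQLite database; the table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (ssn TEXT PRIMARY KEY, name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("123456789", "Smith", 30000), ("453453453", "Joyce", 25000)],
)

# Update data in the database.
conn.execute("UPDATE employee SET salary = salary + 1000 WHERE ssn = ?", ("123456789",))

# Retrieve data from the database.
rows = conn.execute("SELECT name, salary FROM employee ORDER BY name").fetchall()
print(rows)
conn.close()
```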
Standards are important because of the complexity of database systems and their need for interoperation. Formal standards
exist for SQL. De facto standards, such as ODBC and JDBC, and standards adopted by industry groups, such as CORBA,
have played an important role in the growth of client–server database systems.
Since SQL is the most widely used query language, much work has been done on standardizing it. ANSI and ISO, with the
various database vendors, have played a leading role in this work. The SQL-86 standard was the initial version. The IBM
Systems Application Architecture (SAA) standard for SQL was released in 1987. As people identified the need for more
features, updated versions of the formal SQL standard were developed, called SQL-89 and SQL-92. The SQL:1999 version
of the SQL standard added a variety of features to SQL.
The SQL:2003 version of the SQL standard is a minor extension of the SQL:1999 standard. Some features such as the
SQL:1999 OLAP features were specified as an amendment to the earlier version of the SQL:1999 standard, instead of
waiting for the release of SQL:2003.
The SQL:2003 standard was broken into several parts:
• Part 1: SQL/Framework provides an overview of the standard.
• Part 2: SQL/Foundation defines the basics of the standard: types, schemas, tables, views, query and update statements,
expressions, security model, predicates, assignment rules, transaction management, and so on.
• Part 3: SQL/CLI (Call Level Interface) defines application program interfaces to SQL.
• Part 4: SQL/PSM (Persistent Stored Modules) defines extensions to SQL to make it procedural.
• Part 9: SQL/MED (Management of External Data) defines standards for interfacing an SQL system to external sources. By
writing wrappers, system designers can treat external data sources, such as files or data in nonrelational databases, as if they
were “foreign” tables.
• Part 10: SQL/OLB (Object Language Bindings) defines standards for embedding SQL in Java.
• Part 11: SQL/Schemata (Information and Definition Schema) defines a standard catalog interface.
• Part 13: SQL/JRT (Java Routines and Types) defines standards for accessing routines and types in Java.
• Part 14: SQL/XML defines XML-related specifications.
The missing numbers cover features such as temporal data, distributed transaction processing, and multimedia data, for
which there is as yet no agreement on the standards.
The latest versions of the SQL standard are SQL:2006, which added several features related to XML, and SQL:2008, which
introduces a number of extensions to the SQL language.
Database Connectivity Standards
The ODBC standard is a widely used standard for communication between client applications and database systems. ODBC
is based on the SQL Call Level Interface (CLI) standards developed by the X/Open industry consortium and the SQL
Access Group, but it has several extensions. The ODBC API defines a CLI, an SQL syntax definition, and rules about
permissible sequences of CLI calls. The standard also defines conformance levels for the CLI and the SQL syntax. For
example, the core level of the CLI has commands to connect to a database, to prepare and execute SQL statements, to get
back results or status values, and to manage transactions.
The next level of conformance (level 1) requires support for catalog information retrieval and some other features over and
above the core-level CLI; level 2 requires further features, such as ability to send and retrieve arrays of parameter values
and to retrieve more detailed catalog information. ODBC allows a client to connect simultaneously to multiple data sources
and to switch among them, but transactions on each are independent; ODBC does not support two-phase commit.
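The core-level operations described above (connect, execute statements, get back results or status values, manage transactions) can be sketched as follows. This is not ODBC: Python's sqlite3 module is used as a stand-in call-level interface, and the account table is hypothetical.

```python
import sqlite3

# 1. Connect to a data source (an in-memory SQLite database stands in here).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
cur.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()

# 2. Execute parameterised statements inside one transaction.
cur.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (30, 1))
cur.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (30, 2))

# 3. Read a status value: the number of rows affected by the last statement.
status = cur.rowcount

# 4. Manage the transaction: commit the transfer ...
conn.commit()

# ... and roll back an unwanted change.
cur.execute("UPDATE account SET balance = 0")
conn.rollback()

# Get back results as a set of rows.
cur.execute("SELECT id, balance FROM account ORDER BY id")
balances = cur.fetchall()
print(status, balances)
conn.close()
```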
A distributed system provides a more general environment than a client–server system. The X/Open consortium has also
developed the X/Open XA standards for interoperation of databases. These standards define transaction management
primitives (such as transaction begin, commit, abort, and prepare-to-commit) that compliant databases should provide; a
transaction manager can invoke these primitives to implement distributed transactions by two-phase commit.
The XA standards are independent of the data model and of the specific interfaces between clients and databases to
exchange data. Thus, we can use the XA protocols to implement a distributed transaction system in which a single
transaction can access relational as well as object-oriented databases, yet the transaction manager ensures global consistency
via two-phase commit.
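A minimal sketch of how a transaction manager might drive two-phase commit through prepare, commit, and abort primitives is given below. The Participant class and its method names are hypothetical and are not the XA interface; a real coordinator would also log its decisions and handle failures and timeouts.

```python
class Participant:
    """Hypothetical resource manager exposing prepare/commit/abort primitives."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote on whether this participant is able to commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare and collect the votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit everywhere only if all voted yes; otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


# One relational and one object-oriented source, both willing to commit.
ok = two_phase_commit([Participant("relational_db"), Participant("object_db")])
# A single "no" vote aborts the whole transaction.
failed = two_phase_commit([Participant("relational_db"), Participant("object_db", can_commit=False)])
print(ok, failed)
```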
There are many data sources that are not relational databases, and in fact may not be databases at all. Examples are flat files
and email stores. Microsoft’s OLE-DB is a C++ API with goals similar to ODBC, but for nondatabase data sources that
may provide only limited querying and update facilities. Just like ODBC, OLE-DB provides constructs for connecting to a
data source, starting a session, executing commands, and getting back results in the form of a rowset, which is a set of
result rows. However, OLE-DB differs from ODBC in several ways. To support data sources with limited feature support,
features in OLE-DB are divided into a number of interfaces, and a data source may implement only a subset of the
interfaces. An OLE-DB program can negotiate with a data source to find what interfaces are supported. In ODBC,
commands are always in SQL. In OLE-DB, commands may be in any language supported by the data source; while some
sources may support SQL, or a limited subset of SQL, other sources may provide only simple capabilities such as
accessing data in a flat file,
without any query capability. Another major difference of OLE-DB from ODBC is that a rowset is an object that can be
shared by multiple applications through shared memory. A rowset object can be updated by one application, and other
applications sharing that object will get notified about the change.
The Active Data Objects (ADO) API, also created by Microsoft, provides an easy-to-use interface to the OLE-DB
functionality, which can be called from scripting languages, such as VBScript and JScript. The newer ADO.NET API is
designed for applications written in the .NET languages such as C# and Visual Basic.NET. In addition to providing
simplified interfaces, it provides an abstraction called the DataSet that permits disconnected data access.
Object Database Standards
Standards in the area of object-oriented databases have so far been driven primarily by OODB vendors. The Object
Database Management Group (ODMG) was a group formed by OODB vendors to standardize the data model and language
interfaces to OODBs. The C++ language interface specified by ODMG was briefly outlined in Chapter 22. ODMG is no
longer active. JDO is a standard for adding persistence to Java. The Object Management Group (OMG) is a
consortium of companies, formed with the objective of developing a standard architecture for distributed software
applications based on the object-oriented model. OMG brought out the Object Management Architecture (OMA)
reference model. The Object Request Broker (ORB) is a component of the OMA architecture that provides message
dispatch to distributed objects transparently, so the physical location of the object is not important. The Common
Object Request Broker Architecture (CORBA) provides a detailed specification of the ORB, and includes an Interface
Description Language (IDL), which is used to define the data types used for data interchange. The IDL helps to support
data conversion when data are shipped
between systems with different data representations.
Microsoft introduced the Entity data model, which incorporates ideas from the entity-relationship and object-oriented data
models, and an approach to integrating querying with the programming language, called Language Integrated Querying or
LINQ. These are likely to become de facto standards.
XML-Based Standards
A wide variety of standards based on XML (see Chapter 23) have been defined for a wide variety of applications. Many of
these standards are related to e-commerce. They include standards promulgated by nonprofit consortia and
corporate-backed efforts to create de facto standards. RosettaNet, which falls into the former category, is an industry
consortium that uses XML-based standards to facilitate supply-chain management in the computer and information
technology industries. Supply-chain management refers to the purchases of material and services that an organization needs
to function. In contrast, customer-relationship management refers to the front end of a company’s interaction, dealing with
customers. Supply-chain management requires standardization of a variety of things such as:
• Global company identifier: RosettaNet specifies a system for uniquely identifying companies, using a 9-digit identifier
called Data Universal Numbering System (DUNS).
• Global product identifier: RosettaNet specifies a 14-digit Global Trade Item Number (GTIN) for identifying products and
services.
• Global class identifier: This is a 10-digit hierarchical code for classifying products and services called the United
Nations/Standard Product and Services Code (UN/SPSC).
• Interfaces between trading partners: RosettaNet Partner Interface Processes (PIPs) define business processes between
partners. PIPs are system-to-system XML-based dialogs: They define the formats and semantics of business documents
involved in a process and the steps involved in completing a transaction. Examples of steps could include getting product
and service information, purchase orders, order invoicing, payment, order status requests, inventory management, post-sales
support including service warranty, and so on. Exchange of design, configuration, process, and quality information is
also possible to coordinate manufacturing activities across organizations.
Participants in electronic marketplaces may store data in a variety of database systems. These systems may use different data
models, data formats, and data types. Furthermore, there may be semantic differences (metric versus English measure,
distinct monetary currencies, and so forth) in the data. Standards for electronic marketplaces include methods for wrapping
each of these heterogeneous systems with an XML schema. These XML wrappers form the basis of a unified view of data
across all of the participants in the marketplace.
Simple Object Access Protocol (SOAP) is a remote procedure call standard that uses XML to encode data (both parameters
and results), and uses HTTP as the transport protocol; that is, a procedure call becomes an HTTP request. SOAP is backed
by the World Wide Web Consortium (W3C) and has gained wide acceptance in industry. SOAP can be used in a variety of
applications. For instance, in business-to-business e-commerce, applications running at one site can access data from and
execute actions at other sites through SOAP.
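To make the idea of "a procedure call encoded in XML" concrete, the sketch below builds a SOAP 1.1 request envelope with Python's xml.etree.ElementTree. The operation name GetPrice, the parameter ProductId, and the service namespace are hypothetical; in practice the resulting text would be sent as the body of an HTTP POST request, which is omitted here.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.com/catalog"                # hypothetical service namespace


def build_request(operation, params):
    """Encode a procedure call (operation name plus parameters) as a SOAP envelope."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes a child element of the operation element.
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_request("GetPrice", {"ProductId": "12345678901234"})
print(request_xml)
```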
ODMG 3.0
ODMG 3.0 was developed by the Object Data Management Group (ODMG). The ODMG is a consortium of vendors and
interested parties that work on specifications for object database and object-relational mapping products.
ODMG 3.0 is a portability specification. It is designed to allow for portable applications that could run on more than one
product. ODMG 3.0 uses the Java, C++, and Smalltalk languages as much as possible, to allow for the transparent
integration of object programming languages.
The major components of ODMG 3.0 specification are:
Object Model. The common data model supported by ODMG implementations is based on the OMG Object Model. The
OMG core model was designed to be a common denominator for object request brokers, object database systems, object
programming languages, and other applications. In keeping with the OMG Architecture, a profile has been designed for
their model, adding components (e.g., relationships) to the OMG core object model to support the ODMG needs.
Object Specification Languages.
The two specification languages are the Object Definition Language (ODL) and Object Interchange Format (OIF)
languages.
ODL is a specification language used to define the object types that conform to the ODMG Object Model and is based on
the OMG IDL. OIF is a specification language used to dump and load from a file or set of files.
Object Query Language. This is a declarative (nonprocedural) language for querying and updating objects. SQL-92 was
used as the basis for OQL.
C++ Language Binding. This is the binding of ODMG implementations to C++. This is called the C++ OML, or object
manipulation language. The C++ binding also includes a version of the ODL that uses C++ syntax, a mechanism to invoke
OQL, and procedures for operations and transactions.
Smalltalk Language Binding. This is the binding of ODMG implementations to Smalltalk. It defines the binding in terms
of the mapping between ODL and Smalltalk, which is based on the OMG Smalltalk binding for IDL. The Smalltalk binding
also includes a mechanism to invoke OQL and procedures for operations on databases and transactions.
Java Language Binding. This is the binding between the ODMG Object Model (ODL and OML) and the Java
programming language as defined by the Java 2 Platform. The Java language binding also includes a mechanism to invoke
OQL and procedures for operations and transactions.
It is possible to read and write the same database from C++, Smalltalk, and Java, as long as the programmer stays within the
common subset of supported data types. Note that, unlike SQL in relational systems, the ODMG data manipulation
languages are tailored to specific application programming languages, in order to provide a single, integrated environment
for programming and data manipulation. This is called transparent persistence.
Such transparent persistence is illustrated by the following diagram and contrasts with the database sublanguage of SQL and
its variants. In this diagram, you only see the host programming language and no database sublanguage or call-level
interface as in JDBC.
SQL-92/CLI
In 1995 an addendum for a Call-level Interface, SQL-92/CLI (sometimes called CLI-95), was adopted. CLI-95 is a subset
of the popular (de facto standard) ODBC interface from Microsoft and others; it functions as a callable interface to an SQL
database system, providing a highly dynamic capability by contrast with the relatively static facility provided by embedded
SQL. CLI (and ODBC, of course) is primarily used by ad hoc applications, like decision support applications, whereas
embedded SQL is more likely to be used by “glass house” applications that are considerably less dynamic in their function.
SQL-92/PSM
In 1996 an addendum for Stored Procedures, SQL-92/PSM (Persistent Stored Modules, sometimes called PSM-96), was
adopted. SQL-92/PSM added the following types of features:
· Multi-statement procedures: groups of SQL statements can be executed together; flow-of-control statements,
local variables, and condition handlers are provided
· Stored routines and modules: procedures, functions, and modules can be stored in an SQL server
· External routines: functions and procedures written in host programming languages can be invoked from SQL
statements
SQL-92/PSM was discussed in this column in December 1996.
SQL3
The two committees are currently working on SQL3, which will replace SQL-92 when it is adopted. SQL3 began its final
CD ballot in October 1997. An editing meeting took place in March 1998. Additional editing meetings are scheduled for
June 1998 and November 1998. If these meetings are successful, then SQL3 could be adopted in early 1999. SQL3 extends
the data types of SQL-92 significantly. It adds some predefined data types, like BOOLEAN, CHARACTER LARGE
OBJECT, and ROW. It adds the collection type of ARRAY. The single largest addition to SQL3 is user defined data types
(UDT’s). Users will be able to define their own data types, each with a concrete representation, methods, and ordering
properties. These UDT’s can be used anywhere that a predefined data type can be used (as the data type of a column, for
example).
A UDT can also be used in a new way. It can be associated with a base table, so that each attribute of the UDT maps to a
column of the base table. A new data type, REF, can be used to refer to rows in such a table. Inheritance is supported for
both base tables and UDT’s. It remains to be seen whether it is single inheritance or multiple inheritance that is finally
adopted. Some of the other features that are provided by SQL3 are:
· Recursive Query – creates a result table from the traversal of rows that form a directed graph
· Similar Predicate – an extension of the LIKE predicate that allows regular expressions
· Roles – authorization may be granted to a role, and users may then take on different roles at different times
· Triggers – statements may be defined to execute each time insert, update, or delete statements are executed on a
particular table. The statements may execute once for the statement, or individually for each row that is affected.
· Holdable Cursors – cursors may be defined to stay open after a transaction commit
SQL3 will have a set of features that are required for conformance to the standard; this is being termed Core SQL3.
Additional packages of features will also be defined that require features in addition to those in Core SQL3. It is likely
that we will devote a column to SQL3 when it nears the end of its adoption process and has become more stable.
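The Recursive Query feature listed above can be illustrated with a WITH RECURSIVE query. The sketch below runs it in SQLite through Python's sqlite3 module; SQLite is used only because it is readily available, and the reports_to table and its rows are hypothetical. The query traverses the directed graph of "reports to" edges to find every manager above one employee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each row is one directed edge: employee -> manager (made-up data).
conn.execute("CREATE TABLE reports_to (employee TEXT, manager TEXT)")
conn.executemany(
    "INSERT INTO reports_to VALUES (?, ?)",
    [("Smith", "Wong"), ("Joyce", "Wong"), ("Wong", "Borg"), ("Zelaya", "Wallace")],
)

query = """
WITH RECURSIVE chain(name) AS (
    -- Base case: the direct manager of the starting employee.
    SELECT manager FROM reports_to WHERE employee = ?
    UNION
    -- Recursive step: the manager of anyone already in the chain.
    SELECT r.manager FROM reports_to AS r JOIN chain AS c ON r.employee = c.name
)
SELECT name FROM chain
"""
managers = [row[0] for row in conn.execute(query, ("Smith",))]
print(managers)
conn.close()
```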
Object Data Management Group (ODMG)
ODMG is a non-profit consortium that was formed in 1991 to “develop and promote standards for object storage”.
ODMG’s latest publication is “Object Database Standard: ODMG 2.0”. This specification contains an Object Model, an
Object Definition Language (ODL), an Object Query Language (OQL), and language bindings to Java, C++, and Smalltalk.
The Object Model is a superset of the OMG object model, adding relationships, extents, collection classes and concurrency
control.
The language bindings allow a programmer to do both application and database programming in the same environment.
ODMG 2.0 has extended previous versions of the standard with:
· a Java language binding
· a standard external form for both metadata and data
Transaction Processing Performance Council (TPC)
TPC is a non-profit corporation formed to “define transaction processing and database benchmarks and to disseminate
objective, verifiable TPC performance data to the industry”. TPC has published a number of benchmarks over time. Vendors
run the benchmarks, certified auditors review the tests and results, and the results are submitted to TPC, after which they can
then be published. There is a Technical Advisory Board (TAB) that reviews benchmark compliance challenges.
TPC-C
TPC-C is the current OLTP benchmark. It contains a mixture of read-only and update transactions to simulate a complex
OLTP application environment. The metrics for TPC-C are transactions-per-minute-C (tpmC) and price-per-tpm-C
($/tpmC). The current version of TPC-C is 3.3.2. V4.0, currently under development, might have the following changes:
· Increased cost for some of the transaction types
· Enforcement of referential integrity constraints
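The two TPC-C metrics are simple ratios: throughput is transactions completed per minute, and price/performance is the total system price divided by that throughput. The following sketch shows the arithmetic; all numbers are invented for illustration and are not benchmark results, and the function names are hypothetical.

```python
def throughput_per_minute(completed_transactions, elapsed_minutes):
    """Throughput in transactions per minute (the shape of the tpmC metric)."""
    return completed_transactions / elapsed_minutes


def price_performance(total_system_price, tpm):
    """Price per unit of throughput (the shape of the $/tpmC metric)."""
    return total_system_price / tpm


# Made-up measurement: 600,000 transactions in a 120-minute run on a $250,000 system.
tpm = throughput_per_minute(600_000, 120)
dollars_per_tpm = price_performance(250_000, tpm)
print(tpm, dollars_per_tpm)
```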
TPC-D
TPC-D is a benchmark for a complex decision support environment. The queries in the benchmark involve multi-table joins,
sorting, and aggregation. TPC-D benchmarks may be run with one of a specified set of database sizes that range from 1GB
to 10,000GB. The current TPC-D benchmark is V1.3.1. A query stream contains 17 queries. The benchmark contains two
metrics. The Power metric (QppD@size) is based on a single-stream Power test. The Throughput metric (QthD@size) may
be based on an actual multi-stream run, or it may be calculated from the single-stream results. V2.0 may be approved in
the beginning of 1999. Some of the changes being considered for V2.0 are:
· 6 new queries (including left outer join, use of the SUBSTRING function)
· A required multi-stream throughput test, with a minimum number of streams for each database size
TPC-W
TPC-W is a benchmark for a retail eCommerce environment on the web. It is currently under development, with possible
approval in early 1999. The benchmark models a storefront on the web. The benchmark may be run with one of a set of
database sizes that range from 1K items to 1M items. The benchmark measures interactions seen by a browser, allowing for
some user-interrupted transfers. The primary metrics will be Web Interactions Per Second (WIPS@size) and
price-per-WIPS ($/WIPS@size).
SQLJ
SQLJ is an informal group of companies that has been investigating the ways that SQL and Java can be used together. This
effort has spawned three documents that are about to be submitted to formal standards bodies for consideration.
SQLJ Part 0 - SQL Embedded in Java
SQL-92 defines the embedding of SQL statements in programming languages such as C or COBOL. This part of SQLJ
defines the embedding of static SQL statements in a Java program. The expression of a user’s queries in SQLJ will usually
be more compact and readable than their expression in JDBC. The SQLJ statements are introduced in Java programs by
“#sql”, which is not a valid token in the Java language. Variables and expressions from the Java language can be used to
provide values to SQL and to retrieve values from SQL. The variables and expressions are prefixed with “:” to distinguish
them from SQL identifiers. The JDBC mapping of Java data types to SQL data types has been used by SQLJ. A SQLJ program
may then be passed to a SQLJ translator, which can generate a pure Java program that might contain JDBC calls, or it might
contain other Java
statements that communicate with a SQL database. The SQLJ translator is able to perform some validation of the SQL
statements. The translator may be given the actual schema that will be used at runtime or an “exemplar” schema (one that is
the same as the schema that will be used at runtime), in which case additional validation of the SQL statements can be done.
SQLJ Part 0 has just been submitted to NCITS H2 for adoption as a new part of SQL, SQL/OLB (Object Language
Bindings).
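SQLJ itself is a Java technology that requires a translator, so it is not reproduced here. As a loose analogy only, the sketch below uses the named-placeholder style of Python's sqlite3 module, in which values from host-language variables are bound to names prefixed with ":" inside the SQL text. The works_on table and its data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE works_on (ssn TEXT, project TEXT, hours REAL)")
conn.executemany(
    "INSERT INTO works_on VALUES (?, ?, ?)",
    [
        ("123456789", "ProductX", 32.5),
        ("453453453", "ProductX", 20.0),
        ("123456789", "ProductY", 7.5),
    ],
)

# Host-language variables that supply values to the SQL statement.
project = "ProductX"
min_hours = 10.0

# ":project" and ":min_hours" mark where host values are bound; the colon
# distinguishes them from SQL identifiers such as the column names.
rows = conn.execute(
    "SELECT ssn, hours FROM works_on "
    "WHERE project = :project AND hours >= :min_hours ORDER BY ssn",
    {"project": project, "min_hours": min_hours},
).fetchall()
print(rows)
conn.close()
```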
Difference between HTML and XML
Key Difference: HTML is a markup language that is used to design web pages. It is written in predefined tag elements. Its
primary purpose is to display data with focus on how the data looks. XML is a markup language whose primary purpose is
to transport and store data. It is a language that can be used to develop new languages and define other languages. It does not
have a predefined set of tags, and allows the developer to customize tags.
HyperText Markup Language (HTML) is a well-known markup language used to develop web pages. It has been around for
a long time and is commonly used in webpage design. XML, or Extensible Markup Language, defines a set of rules for
encoding documents in a format that can be read by both humans and computers.
HTML is written using HTML elements, which consist of tags, primarily an opening tag and a closing tag. The data between
these tags is usually the content. The main objective of HTML is to allow web browsers to interpret and display the content
written between the tags. The tags are designed to describe the page content. HTML comes with predefined tags. These
days, web pages are rarely designed using HTML alone.
XML, on the other hand, is a newer markup language. It was first drafted in 1996 as an adaptation of SGML
(Standard Generalized Markup Language), and XML 1.0 became a W3C Recommendation in 1998. The main purpose of XML is
to be a hardware-independent tool used to transport and store data, with a focus on what the data is. XML removes the
constraint of sticking to predefined tags and gives developers the freedom to design new tags. It was developed to create
standardized specifications for creating custom markup languages. XML-based languages include RSS, Atom, and XHTML.
XML is neither a programming language nor a presentation language. It is known as a meta-language, that is, a language
that can be used to define other languages.
An XML document must be well-formed, and XML has a strict set of rules. Well-formed generally means that the document
satisfies a list of syntax rules provided in the XML specification. Examples of these rules are that the document contains
only properly encoded legal Unicode characters, that special syntax characters such as < and & appear only in markup
(or are escaped), and that element tags are correctly nested. A document may also include a declaration that states the
type of document it is and what processing rules should be applied.
In HTML, when a page is created, the processor tries to make sense of the page and renders its content. It does not
require strict rules regarding the format of the page, and it will render the page to the best of its ability even with
errors present. In XML, however, if the document violates the rules or the processor cannot comprehend something, the
processor reports an error and stops processing the file. This error-handling behavior is referred to as ‘draconian’.
Further specifications of both languages, including limitations, are listed in the table below.
Definition
    HTML: Markup language for displaying web pages in a web browser. Designed to display data with focus on how the
    data looks.
    XML: Markup language that defines a set of rules for encoding documents that can be read by both humans and
    machines. Designed with focus on storing and transporting data.
Date when invented
    HTML: 1990
    XML: 1996
Extended from
    HTML: SGML
    XML: SGML
Type
    HTML: Static
    XML: Dynamic
Usage
    HTML: Display a web page.
    XML: Transport data between the application and the database. To develop other markup languages.
Processing/Rules
    HTML: No strict rules. Browser will still generate data to the best of its ability.
    XML: Strict rules must be followed or processor will terminate processing the file.
Language type
    HTML: Presentation
    XML: Neither presentation nor programming
Tags
    HTML: Predefined
    XML: Custom tags can be created by the author
White Space
    HTML: Cannot preserve white space
    XML: Preserves white space
Limitations
    HTML: Data does not know itself very well. Data cannot change in response to environment. Data cannot be easily
    maintained. Cannot store or call on variables. Lacks the capability to define new structures by defining
    relationships between classes. Tags are not useful for exchanging the document between applications.
    XML: Cannot be used as a subtype of a sql_variant instance. Does not support casting or converting to either text or
    non text. Does not support the following column and table constraints. XML provides its own encoding. Collations
    apply to string types only. Cannot be compared or sorted. Cannot be used in Distributed Partitioned Views. Not well
    supported by browsers.
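The difference between lenient HTML processing and ‘draconian’ XML processing can be observed with Python's standard library. In this sketch the same malformed markup is fed to html.parser.HTMLParser (a tolerant tokenizer, used here as a rough stand-in for a browser) and to xml.etree.ElementTree (a conforming XML parser). The TagCollector class is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The <b> element is never closed, so the tags are not correctly nested.
malformed = "<p>Unclosed <b>bold text</p>"


class TagCollector(HTMLParser):
    """Records the start tags seen while tolerating malformed input."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


# The HTML tokenizer carries on despite the error.
collector = TagCollector()
collector.feed(malformed)
collector.close()

# The XML parser rejects the document and stops.
try:
    ET.fromstring(malformed)
    xml_ok = True
except ET.ParseError as exc:
    xml_ok = False
    print("XML parser stopped:", exc)

print(collector.start_tags, xml_ok)
```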
XML
The common method of specifying the contents and formatting of Web pages is through the use of hypertext documents.
There are various languages for writing these documents, the most common being HTML (HyperText Markup Language).
Although HTML is widely used for formatting and structuring Web documents, it is not suitable for specifying structured
data that is extracted from databases. A new language—namely, XML (Extensible Markup Language)—has emerged as the
standard for structuring and exchanging data over the Web. XML can be used to provide information about the structure and
meaning of the data in the Web pages rather than just specifying how the Web pages are formatted for display on the screen.
The formatting aspects are specified separately—for example, by using a formatting language such as XSL (Extensible
Stylesheet Language) or a transformation language such as XSLT (Extensible Stylesheet Language for Transformations or
simply XSL Transformations). Recently, XML has also been proposed as a possible model for data storage and retrieval,
although only a few experimental database systems based on XML have been developed so far.
Structured, Semi-structured, and Unstructured Data
The information stored in databases is known as structured data because it is represented in a strict format. For structured
data, it is common to carefully design the database schema using established data-modeling techniques. The DBMS then
checks to ensure that all data
follows the structures and constraints specified in the schema. However, not all data is collected and inserted into carefully
designed structured databases. In some applications, data is collected in an ad hoc manner before it is known how it will be
stored and managed. This data may have a certain structure, but not all the information collected will have the identical
structure. Some attributes may be shared among the various entities, but other attributes may exist only in a few entities.
Moreover, additional attributes can be introduced in some of the newer data items at any time, and there is no predefined
schema. This type of data is known as semistructured data. A number of data models have been introduced for
representing semistructured data, often based on using tree or graph data structures rather than the flat relational model
structures.
A key difference between structured and semistructured data concerns how the schema constructs (such as the names of
attributes, relationships, and entity types) are handled. In semistructured data, the schema information is mixed in with the
data values, since each data object can have different attributes that are not known in advance. Hence, this type of data is
sometimes referred to as self-describing data. Consider the following example. We want to collect a list of bibliographic
references related to a certain research project. Some of these may be books or technical reports, others may be research
articles in journals or conference proceedings, and still others may refer to complete journal issues or conference
proceedings. Clearly, each of these may have different attributes and different types of information. Even for the same type
of reference—say, conference articles—we may have different information. For example, one article citation may be quite
complete, with full information about author names, title, proceedings, page numbers, and so on, whereas another citation
may not have all the information available. New types of bibliographic sources may appear in the future—for instance,
references to Web pages or to conference tutorials, and these may have new attributes that describe them. Semistructured
data may be displayed as a directed graph, as shown in Figure 12.1. As we can see, this model somewhat resembles the
object model in its ability to represent complex objects and nested structures. In Figure 12.1, the labels or tags on the
directed edges represent the schema names: the names of attributes, object types (or entity types or classes), and
relationships. The internal nodes represent individual objects or composite attributes. The leaf nodes represent actual
data values of simple (atomic) attributes.
There are two main differences between the semistructured model and the object model:
1. The schema information (names of attributes, relationships, and classes or object types) in the semistructured model is
intermixed with the objects and their data values in the same data structure.
2. In the semistructured model, there is no requirement for a predefined schema to which the data objects must conform,
although it is possible to define a schema if necessary.
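The self-describing nature of semistructured data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reference records, their field names, and the helper functions attribute_names and to_edges are hypothetical and do not come from Figure 12.1. Each object carries its own attribute names, and the nested structure is flattened into labeled edges whose leaves are atomic values.

```python
# Hypothetical bibliographic references; every title and field name is made up
# for illustration. Note that no two objects have the same set of attributes.
references = [
    {"type": "book", "title": "Database Systems", "authors": ["A. Author"], "year": 2010},
    {"type": "conference_article", "title": "Query Processing",
     "proceedings": "Proc. X", "pages": "10-20"},
    {"type": "web_page", "title": "XML Notes", "url": "http://example.org/xml"},
]


def attribute_names(objects):
    """Collect the attribute names that occur in the data itself (no schema needed)."""
    names = set()
    for obj in objects:
        names.update(obj.keys())
    return names


def to_edges(value, node="root"):
    """Flatten a nested object into (parent, edge label, child) triples.

    Edge labels are the schema names; internal nodes are identified by a path,
    and leaf children are the atomic data values.
    """
    edges = []
    for label, child in value.items():
        children = child if isinstance(child, list) else [child]
        for i, item in enumerate(children):
            if isinstance(item, dict):
                child_node = f"{node}/{label}[{i}]"
                edges.append((node, label, child_node))
                edges.extend(to_edges(item, child_node))
            else:
                edges.append((node, label, item))  # leaf: atomic value
    return edges


print(sorted(attribute_names(references)))
print(to_edges(references[2]))
```

Because the attribute names travel with each object, a new kind of reference can be added without changing any predefined schema.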
XML Hierarchical (Tree) Data Model
The basic object in XML is the XML document. Two main structuring concepts are used to construct an XML
document: elements and attributes. It is important to note that the term attribute in XML is not used in the same
manner as is customary in database terminology, but rather as it is used in document description languages
such as HTML and SGML. Attributes in XML provide additional information that describes elements, as we
will see. There are additional concepts in XML, such as entities, identifiers, and references, but first we
concentrate on describing elements and attributes to show the essence of the XML model.
Figure 12.3 shows an example of an XML element called <Projects>. As in HTML, elements are identified in a
document by their start tag and end tag. The tag names are enclosed between angled brackets < ... >, and end
tags are further identified by a slash, </ ... >.
<?xml version="1.0" standalone="yes"?>
<Projects>
<Project>
<Name>ProductX</Name>
<Number>1</Number>
<Location>Bellaire</Location>
<Dept_no>5</Dept_no>
<Worker>
<Ssn>123456789</Ssn>
<Last_name>Smith</Last_name>
<Hours>32.5</Hours>
</Worker>
<Worker>
<Ssn>453453453</Ssn>
<First_name>Joyce</First_name>
<Hours>20.0</Hours>
</Worker>
</Project>
<Project>
<Name>ProductY</Name>
<Number>2</Number>
<Location>Sugarland</Location>
<Dept_no>5</Dept_no>
<Worker>
<Ssn>123456789</Ssn>
<Hours>7.5</Hours>
</Worker>
<Worker>
<Ssn>453453453</Ssn>
<Hours>20.0</Hours>
</Worker>
<Worker>
<Ssn>333445555</Ssn>
<Hours>10.0</Hours>
</Worker>
</Project>
</Projects>
Figure 12.3
A complex XML element called <Projects>.
Complex elements are constructed from other elements hierarchically, whereas simple elements contain data
values. A major difference between XML and HTML is that XML tag names are defined to describe the meaning
of the data elements in the document, rather than to describe how the text is to be displayed. This makes it
possible to process the data elements in the XML document automatically by computer programs. Also, the XML
tag (element) names can be defined in another document, known as the schema document, to give a semantic
meaning to the tag names that can be exchanged among multiple users. In HTML, all tag names are predefined
and fixed; that is why they are not extensible. It is straightforward to see the correspondence between the XML
textual representation shown in Figure 12.3 and the tree structure shown in Figure 12.1. In the tree
representation, internal nodes represent complex elements, whereas leaf nodes represent simple elements. That
is why the XML model is called a tree model or a hierarchical model. In Figure 12.3, the simple elements are the
ones with the tag names <Name>, <Number>, <Location>, <Dept_no>, <Ssn>, <Last_name>, <First_name>,
and <Hours>. The complex elements are the ones with the tag names <Projects>, <Project>, and <Worker>. In
general, there is no limit on the levels of nesting of elements. It is possible to characterize three main types of
XML documents:
■ Data-centric XML documents. These documents have many small data items that follow a specific structure
and hence may be extracted from a structured database. They are formatted as XML documents in order to
exchange them over or display them on the Web. These usually follow a predefined schema that defines the tag
names.
■ Document-centric XML documents. These are documents with large amounts of text, such as news articles
or books. There are few or no structured data elements in these documents.
■ Hybrid XML documents. These documents may have parts that contain structured data and other parts that
are predominantly textual or unstructured. They may or may not have a predefined schema.
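The distinction drawn earlier between complex elements (internal nodes) and simple elements (leaf nodes) can be checked mechanically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module on a shortened copy of the <Projects> document; the helper name classify is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the <Projects> document: one project with two workers.
DOC = """<?xml version="1.0" standalone="yes"?>
<Projects>
  <Project>
    <Name>ProductX</Name>
    <Number>1</Number>
    <Worker>
      <Ssn>123456789</Ssn>
      <Last_name>Smith</Last_name>
      <Hours>32.5</Hours>
    </Worker>
    <Worker>
      <Ssn>453453453</Ssn>
      <First_name>Joyce</First_name>
      <Hours>20.0</Hours>
    </Worker>
  </Project>
</Projects>"""


def classify(root):
    """Split tag names into complex (has child elements) and simple (leaf)."""
    complex_names, simple_names = set(), set()
    for elem in root.iter():          # visits the root and all descendants
        if len(elem) > 0:             # len() counts child elements
            complex_names.add(elem.tag)
        else:
            simple_names.add(elem.tag)
    return complex_names, simple_names


root = ET.fromstring(DOC)
complex_names, simple_names = classify(root)
print("complex:", sorted(complex_names))
print("simple:", sorted(simple_names))
```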
XML Documents, DTD, and XML Schema
An XML document is well formed if it follows a few conditions. In particular, it should start with an XML declaration
to indicate the version of XML being used as well as any other relevant attributes, as shown in the first line in
Figure 12.3 (XML 1.0 treats the declaration as recommended rather than mandatory). It must also follow the syntactic guidelines of the tree data model. This means that there should be a
single root element, and every element must include a matching pair of start and end tags within the start and
end tags of the parent element. This ensures that the nested elements specify a well-formed tree structure. A
well-formed XML document is syntactically correct. This allows it to be processed by generic processors that
traverse the document and create an internal tree representation. A standard model with an associated set of
API (application programming interface) functions called DOM (Document Object Model) allows programs to
manipulate the resulting tree representation corresponding to a well-formed XML document. However, the whole
document must be parsed beforehand when using DOM in order to convert the document to that standard DOM
internal data structure representation. Another API called SAX (Simple API for XML) allows processing of XML
documents on the fly by notifying the processing program through callbacks whenever a start or end tag is
encountered. This makes it easier to process large documents and allows for processing of so-called streaming
XML documents, where the processing program can process the tags as they are encountered. This is also
known as event-based processing.
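The difference between the two APIs can be seen in a short Python sketch that uses the standard xml.dom.minidom and xml.sax modules. The sample document and the HoursHandler class are assumptions made for this illustration: the DOM part builds the whole tree before querying it, whereas the SAX part only reacts to callbacks as tags are encountered.

```python
import xml.dom.minidom
import xml.sax

# A small sample document (illustrative, modeled on the <Projects> example).
DOC = (
    "<Projects><Project><Name>ProductX</Name>"
    "<Worker><Hours>32.5</Hours></Worker>"
    "<Worker><Hours>20.0</Hours></Worker>"
    "</Project></Projects>"
)

# DOM style: the whole document is parsed into a tree first, then navigated.
dom = xml.dom.minidom.parseString(DOC)
dom_worker_count = len(dom.getElementsByTagName("Worker"))


class HoursHandler(xml.sax.ContentHandler):
    """SAX style: the parser calls these methods as it meets each tag."""

    def __init__(self):
        super().__init__()
        self.events = []        # record of start/end callbacks, in order
        self.total_hours = 0.0
        self._in_hours = False
        self._buffer = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))
        if name == "Hours":
            self._in_hours = True
            self._buffer = []

    def characters(self, content):
        if self._in_hours:      # text may arrive in several chunks
            self._buffer.append(content)

    def endElement(self, name):
        self.events.append(("end", name))
        if name == "Hours":
            self.total_hours += float("".join(self._buffer))
            self._in_hours = False


handler = HoursHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)

print(dom_worker_count, handler.total_hours)
```

The SAX handler never holds more than the current Hours text in memory, which is why this style suits large or streaming documents.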
A well-formed XML document can be schemaless; that is, it can have any tag names for the elements within the
document. In this case, there is no predefined set of elements (tag names) that a program processing the
document knows to expect. This gives the document creator the freedom to specify new elements, but limits the
possibilities for automatically interpreting the meaning or semantics of the elements within the document.
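A generic, non-validating parser is enough to test well-formedness, and it accepts any tag names as long as the tree rules hold. The sketch below relies on the fact that xml.etree.ElementTree raises ParseError for documents that are not well formed; the function name is_well_formed and the sample strings are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a single, properly nested XML tree."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<Projects><Project><Name>ProductX</Name></Project></Projects>"
bad_nesting = "<Projects><Project><Name>ProductX</Project></Name></Projects>"
two_roots = "<Project/><Project/>"
any_tags = "<Whatever><made_up_tag>ok</made_up_tag></Whatever>"  # schemaless but fine

for sample in (good, bad_nesting, two_roots, any_tags):
    print(is_well_formed(sample), sample)
```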
A stronger criterion is for an XML document to be valid. In this case, the document must be well formed, and it
must follow a particular schema. That is, the element names used in the start and end tag pairs must follow the
structure specified in a separate XML DTD (Document Type Definition) file or XML schema file. Figure 12.4
shows a simple XML DTD file, which specifies the elements (tag names) and their nested structures. Any valid
documents conforming to this DTD should follow the specified structure. A special syntax exists for specifying
DTD files, as illustrated in Figure 12.4. First, a name is given to the root tag of the document, which is called
Projects in the first line in Figure 12.4. Then the elements and their nested structure are specified.
When specifying elements, the following notation is used:
■ A * following the element name means that the element can be repeated zero or more times in the document.
This kind of element is known as an optional multivalued (repeating) element.
■ A + following the element name means that the element can be repeated one or more times in the document.
This kind of element is a required multivalued (repeating) element.
■ A ? following the element name means that the element can be repeated zero or one times. This kind is an
optional single-valued (nonrepeating) element.
■ An element appearing without any of the preceding three symbols must appear exactly once in the document.
This kind is a required single-valued (nonrepeating) element.
■ The type of the element is specified via parentheses following the element. If the parentheses include names
of other elements, these latter elements are the children of the element in the tree structure. If the parentheses
include the keyword #PCDATA or one of the other data types available in XML DTD, the element is a leaf node.
PCDATA stands for parsed character data, which is roughly similar to a string data type.
■ The list of attributes that can appear within an element can also be specified via the keyword !ATTLIST. In
Figure 12.4, the Project element has an attribute ProjId. If the type of an attribute is ID, then it can be referenced
from another attribute whose type is IDREF within another element. Notice that attributes can also be used to hold
the values of simple data elements of type #PCDATA.
■ Parentheses can be nested when specifying elements.
■ A bar symbol ( e1 | e2 ) specifies that either e1 or e2 can appear in the document.
Figure 12.4
An XML DTD file called Projects.
<!DOCTYPE Projects [
<!ELEMENT Projects (Project+)>
<!ELEMENT Project (Name, Number, Location, Dept_no?, Workers)>
<!ATTLIST Project
ProjId ID #REQUIRED>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Number (#PCDATA)>
<!ELEMENT Location (#PCDATA)>
<!ELEMENT Dept_no (#PCDATA)>
<!ELEMENT Workers (Worker*)>
<!ELEMENT Worker (Ssn, Last_name?, First_name?, Hours)>
<!ELEMENT Ssn (#PCDATA)>
<!ELEMENT Last_name (#PCDATA)>
<!ELEMENT First_name (#PCDATA)>
<!ELEMENT Hours (#PCDATA)>
]>
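The occurrence symbols *, +, and ? behave much like the same operators in regular expressions, and that analogy can be made concrete. The sketch below is a toy checker, not a DTD validator: it assumes the four element-content models from the DTD are copied by hand into a dictionary, it ignores #PCDATA leaves and the ProjId attribute, and the names CONTENT_MODELS, model_to_regex, children_conform, and first_violation are made up for this illustration. Python's standard XML parsers do not validate against a DTD, so a real check would need a validating parser.

```python
import re
import xml.etree.ElementTree as ET

# Element-content models copied by hand from the DTD (leaf elements omitted).
CONTENT_MODELS = {
    "Projects": "(Project+)",
    "Project": "(Name, Number, Location, Dept_no?, Workers)",
    "Workers": "(Worker*)",
    "Worker": "(Ssn, Last_name?, First_name?, Hours)",
}


def model_to_regex(model):
    """Turn a content model into a regex over space-terminated child tag names."""
    compact = re.sub(r"\s+", "", model)
    # Wrap each element name in a group so that *, + and ? apply to the whole name.
    grouped = re.sub(r"[A-Za-z_][\w.-]*",
                     lambda m: "(?:" + m.group(0) + " )", compact)
    # A comma means "followed by", which is plain concatenation in a regex.
    return re.compile(grouped.replace(",", ""))


def children_conform(elem):
    """Check one element's sequence of child tags against its content model."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:           # leaf elements are not checked in this sketch
        return True
    child_string = "".join(child.tag + " " for child in elem)
    return model_to_regex(model).fullmatch(child_string) is not None


def first_violation(root):
    """Return the tag of the first element whose children break its model, or None."""
    for elem in root.iter():
        if not children_conform(elem):
            return elem.tag
    return None


conforming = ET.fromstring(
    "<Projects><Project ProjId='P1'><Name>ProductX</Name><Number>1</Number>"
    "<Location>Bellaire</Location><Workers><Worker><Ssn>123456789</Ssn>"
    "<Hours>32.5</Hours></Worker></Workers></Project></Projects>"
)
# Worker placed directly under Project, with no enclosing Workers element.
nonconforming = ET.fromstring(
    "<Projects><Project><Name>ProductX</Name><Number>1</Number>"
    "<Location>Bellaire</Location><Dept_no>5</Dept_no><Worker><Ssn>123456789</Ssn>"
    "<Hours>32.5</Hours></Worker></Project></Projects>"
)

print(first_violation(conforming), first_violation(nonconforming))
```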
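The tree structure in Figure 12.1 and the XML document in Figure 12.3 broadly follow the structure described by the XML DTD
in Figure 12.4; to be strictly valid, however, the document in Figure 12.3 would also need a ProjId attribute on each
Project element and a Workers element enclosing the Worker elements. To require that an XML document be checked for
conformance to a DTD, we must specify this in the declaration of the document. For example, we could change the first
line in Figure 12.3 to the following: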
<?xml version="1.0" standalone="no"?>
<!DOCTYPE Projects SYSTEM "proj.dtd">
When the value of the standalone attribute in an XML document is “no”, the document needs to be checked
against a separate DTD document or XML schema document (see below). The DTD file shown in Figure 12.4
should be stored in the same file system as the XML document, and should be given the file name proj.dtd.
Alternatively, we could include the DTD document text at the beginning of the XML document itself to allow the
checking. Although XML DTD is quite adequate for specifying tree structures with required, optional, and
repeating elements, and with various types of attributes, it has several limitations. First, the data types in DTD
are not very general. Second, DTD has its own special syntax and thus requires specialized processors. It would
be advantageous to specify XML schema documents using the syntax rules of XML itself so that the same
processors used for XML documents could process XML schema descriptions. Third, all DTD elements are
always forced to follow the specified ordering of the document, so unordered elements are not permitted. These
drawbacks led to the development of XML schema, a more general but also more complex language for
specifying the structure and elements of XML documents.
XML Schema
The XML schema language is a standard for specifying the structure of XML documents. It uses the same syntax
rules as regular XML documents, so that the same processors can be used on both. To distinguish the two types
of documents, we will use the term XML instance document or XML document for a regular XML document, and
XML schema document for a document that specifies an XML schema. Figure 12.5 shows an XML schema
document corresponding to the COMPANY database shown in Figures 3.5 and 7.2. Although it is unlikely that
we would want to display the whole database as a single document, there have been proposals to store data in
native XML format as an alternative to storing the data in relational databases. The schema in Figure 12.5 would
serve the purpose of specifying the structure of the COMPANY database if it were stored in a native XML
system. As with XML DTD, XML schema is based on the tree data model, with elements and attributes as the
main structuring concepts. However, it borrows additional concepts from database and object models, such as
keys, references, and identifiers. Here we describe the features of XML schema in a step-by-step manner,
referring to the sample XML schema document in Figure 12.5 for illustration. We introduce and describe some of
the schema concepts in the order in which they are used in Figure 12.5.
Figure 12.5
An XML schema file called company.
<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation>
<xsd:documentation xml:lang="en">Company Schema (Element Approach) - Prepared by Babak
Hojabri</xsd:documentation>
</xsd:annotation>
<xsd:element name="company">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="department" type="Department" minOccurs="0" maxOccurs="unbounded" />
<xsd:element name="employee" type="Employee" minOccurs="0" maxOccurs="unbounded">
<xsd:unique name="dependentNameUnique">
<xsd:selector xpath="employeeDependent" />
<xsd:field xpath="dependentName" />
</xsd:unique>
</xsd:element>
<xsd:element name="project" type="Project" minOccurs="0" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
<xsd:unique name="departmentNameUnique">
<xsd:selector xpath="department" />
<xsd:field xpath="departmentName" />
</xsd:unique>
<xsd:unique name="projectNameUnique">
<xsd:selector xpath="project" />
<xsd:field xpath="projectName" />
</xsd:unique>
<xsd:key name="projectNumberKey">
<xsd:selector xpath="project" />
<xsd:field xpath="projectNumber" />
</xsd:key>
<xsd:key name="departmentNumberKey">
<xsd:selector xpath="department" />
<xsd:field xpath="departmentNumber" />
</xsd:key>
<xsd:key name="employeeSSNKey">
<xsd:selector xpath="employee" />
<xsd:field xpath="employeeSSN" />
</xsd:key>
<xsd:keyref name="departmentManagerSSNKeyRef" refer="employeeSSNKey">
<xsd:selector xpath="department" />
<xsd:field xpath="departmentManagerSSN" />
</xsd:keyref>
<xsd:keyref name="employeeDepartmentNumberKeyRef"
refer="departmentNumberKey">
<xsd:selector xpath="employee" />
<xsd:field xpath="employeeDepartmentNumber" />
</xsd:keyref>
<xsd:keyref name="employeeSupervisorSSNKeyRef" refer="employeeSSNKey">
<xsd:selector xpath="employee" />
<xsd:field xpath="employeeSupervisorSSN" />
</xsd:keyref>
<xsd:keyref name="projectDepartmentNumberKeyRef" refer="departmentNumberKey">
<xsd:selector xpath="project" />
<xsd:field xpath="projectDepartmentNumber" />
</xsd:keyref>
<xsd:keyref name="projectWorkerSSNKeyRef" refer="employeeSSNKey">
<xsd:selector xpath="project/projectWorker" />
<xsd:field xpath="SSN" />
</xsd:keyref>
<xsd:keyref name="employeeWorksOnProjectNumberKeyRef"
refer="projectNumberKey">
<xsd:selector xpath="employee/employeeWorksOn" />
<xsd:field xpath="projectNumber" />
</xsd:keyref>
</xsd:element>
<xsd:complexType name="Department">
<xsd:sequence>
<xsd:element name="departmentName" type="xsd:string" />
<xsd:element name="departmentNumber" type="xsd:string" />
<xsd:element name="departmentManagerSSN" type="xsd:string" />
<xsd:element name="departmentManagerStartDate" type="xsd:date" />
<xsd:element name="departmentLocation" type="xsd:string" minOccurs="0" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Employee">
<xsd:sequence>
<xsd:element name="employeeName" type="Name" />
<xsd:element name="employeeSSN" type="xsd:string" />
<xsd:element name="employeeSex" type="xsd:string" />
<xsd:element name="employeeSalary" type="xsd:unsignedInt" />
<xsd:element name="employeeBirthDate" type="xsd:date" />
<xsd:element name="employeeDepartmentNumber" type="xsd:string" />
<xsd:element name="employeeSupervisorSSN" type="xsd:string" />
<xsd:element name="employeeAddress" type="Address" />
<xsd:element name="employeeWorksOn" type="WorksOn" minOccurs="1" maxOccurs="unbounded" />
<xsd:element name="employeeDependent" type="Dependent" minOccurs="0" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Project">
<xsd:sequence>
<xsd:element name="projectName" type="xsd:string" />
<xsd:element name="projectNumber" type="xsd:string" />
<xsd:element name="projectLocation" type="xsd:string" />
<xsd:element name="projectDepartmentNumber" type="xsd:string" />
<xsd:element name="projectWorker" type="Worker" minOccurs="1" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Dependent">
<xsd:sequence>
<xsd:element name="dependentName" type="xsd:string" />
<xsd:element name="dependentSex" type="xsd:string" />
<xsd:element name="dependentBirthDate" type="xsd:date" />
<xsd:element name="dependentRelationship" type="xsd:string" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Address">
<xsd:sequence>
<xsd:element name="number" type="xsd:string" />
<xsd:element name="street" type="xsd:string" />
<xsd:element name="city" type="xsd:string" />
<xsd:element name="state" type="xsd:string" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Name">
<xsd:sequence>
<xsd:element name="firstName" type="xsd:string" />
<xsd:element name="middleName" type="xsd:string" />
<xsd:element name="lastName" type="xsd:string" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Worker">
<xsd:sequence>
<xsd:element name="SSN" type="xsd:string" />
<xsd:element name="hours" type="xsd:float" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="WorksOn">
<xsd:sequence>
<xsd:element name="projectNumber" type="xsd:string" />
<xsd:element name="hours" type="xsd:float" />
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
1. Schema descriptions and XML namespaces. It is necessary to identify the specific set of XML schema
language elements (tags) being used by naming the namespace that defines them. The second line in Figure
12.5 specifies the namespace used in this example, which is http://www.w3.org/2001/XMLSchema. This is the
standard namespace for XML schema commands. Each such definition is called an XML namespace, because it
defines the set of commands (names) that can be used. Although the namespace name looks like a Web address, it
serves as a unique identifier rather than as a file that must be retrieved. The namespace name is bound to the prefix xsd (XML
schema description) using the attribute xmlns (XML namespace), and this prefix is attached to all XML
schema commands (tag names). For example, in Figure 12.5, when we write xsd:element or xsd:sequence, we
are referring to the definitions of the element and sequence tags in the namespace
http://www.w3.org/2001/XMLSchema.
2. Annotations, documentation, and language used. The next couple of lines in Figure 12.5 illustrate the XML
schema elements (tags) xsd:annotation and xsd:documentation, which are used for providing comments and other
descriptions in the XML document. The attribute xml:lang of the xsd:documentation element specifies the language
being used, where en stands for the English language.
3. Elements and types. Next, we specify the root element of our XML schema. In XML schema, the name
attribute of the xsd:element tag specifies the element name, which is called company for the root element in our
example (see Figure 12.5). The structure of the company root element can then be specified, which in our
example is xsd:complexType. This is further specified to be a sequence of departments, employees, and projects
using the xsd:sequence structure of XML schema. It is important to note here that this is not the only way to specify
an XML schema for the COMPANY database.
4. First-level elements in the COMPANY database. Next, we specify the three first-level elements under the
company root element in Figure 12.5. These elements are named employee, department, and project, and each is
specified in an xsd:element tag. Notice that if a tag has only attributes and no further subelements or data within it,
it can be ended with the slash symbol (/>) directly instead of having a separate matching end tag. These are
called empty elements; examples are the xsd:element elements named department and project in Figure 12.5.
5. Specifying element type and minimum and maximum occurrences. In XML schema, the attributes type,
minOccurs, and maxOccurs in the xsd:element tag specify the type and multiplicity of each element in any document
that conforms to the schema specifications. If we specify a type attribute in an xsd:element, the structure of the
element must be described separately, typically using the xsd:complexType element of XML schema. This is
illustrated by the employee, department, and project elements in Figure 12.5. On the other hand, if no type attribute is
specified, the element structure can be defined directly following the tag, as illustrated by the company root
element in Figure 12.5. The minOccurs and maxOccurs attributes are used for specifying lower and upper bounds on the
number of occurrences of an element in any XML document that conforms to the schema specifications. If they
are not specified, the default is exactly one occurrence. These serve a similar role to the *, +, and ? symbols of
XML DTD.
6. Specifying keys. In XML schema, it is possible to specify constraints that correspond to unique and primary
key constraints in a relational database, as well as foreign keys (or referential integrity) constraints. The xsd:unique
tag specifies elements that correspond to unique attributes in a relational database. We can give each such
uniqueness constraint a name, and we must specify xsd:selector and xsd:field tags for it to identify the element type
that contains the unique element and the element name within it that is unique via the xpath attribute. This is
illustrated by the departmentNameUnique and projectNameUnique elements in Figure 12.5. For specifying primary
keys, the tag xsd:key is used instead of xsd:unique, as illustrated by the projectNumberKey, departmentNumberKey, and
employeeSSNKey elements in Figure 12.5. For specifying foreign keys, the tag xsd:keyref is used, as illustrated by
the six xsd:keyref elements in Figure 12.5. When specifying a foreign key, the attribute refer of the xsd:keyref tag
specifies the referenced primary key, whereas the tags xsd:selector and xsd:field specify the referencing element
type and foreign key (see Figure 12.5).
7. Specifying the structures of complex elements via complex types. The next part of our example specifies
the structures of the complex elements Department, Employee, Project, and Dependent, using the tag xsd:complexType.
We specify each of these as a sequence of subelements corresponding to the database attributes of each entity
type by using the xsd:sequence and xsd:element tags of XML schema. Each element is given a name and type via
the attributes name and type of xsd:element. We can also specify minOccurs and maxOccurs attributes if we need to
change the default of exactly one occurrence. For (optional) database attributes where null is allowed, we need
to specify minOccurs = 0, whereas for multivalued database attributes we need to specify maxOccurs = "unbounded"
on the corresponding element. Notice that if we were not going to specify any key constraints, we could have
embedded the subelements within the parent element definitions directly without having to specify complex
types. However, when unique, primary key, and foreign key constraints need to be specified, we must define
complex types to specify the element structures.
8. Composite (compound) attributes. Composite attributes are also specified as complex types, as illustrated
by the Address, Name, Worker, and WorksOn complex types. These could have been directly embedded within their
parent elements. This example illustrates some of the main features of XML schema. There are other features,
but they are beyond the scope of our presentation.
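The key and foreign key constraints described in item 6 can be mimicked by hand to see what a schema processor would check. The sketch below is only an approximation under stated assumptions: the instance document is deliberately partial (it omits most required elements), the department name is made up, and the helpers key_values and dangling_refs are hypothetical names that imitate xsd:key and xsd:keyref with a selector path and a field name.

```python
import xml.etree.ElementTree as ET

# A deliberately partial company instance; only the key-related elements appear.
DOC = """<company>
  <department>
    <departmentName>Research</departmentName>
    <departmentNumber>5</departmentNumber>
    <departmentManagerSSN>333445555</departmentManagerSSN>
  </department>
  <employee>
    <employeeSSN>333445555</employeeSSN>
    <employeeDepartmentNumber>5</employeeDepartmentNumber>
  </employee>
  <employee>
    <employeeSSN>123456789</employeeSSN>
    <employeeDepartmentNumber>4</employeeDepartmentNumber>
  </employee>
</company>"""


def key_values(root, selector, field):
    """Imitate xsd:key: the field must exist in every selected element and be unique."""
    values = []
    for elem in root.findall(selector):
        value = elem.findtext(field)
        if value is None:
            raise ValueError(f"{selector} element lacks key field {field}")
        values.append(value)
    if len(values) != len(set(values)):
        raise ValueError(f"duplicate key values for {selector}/{field}")
    return set(values)


def dangling_refs(root, selector, field, referenced):
    """Imitate xsd:keyref: list referencing values that match no key value."""
    problems = []
    for elem in root.findall(selector):
        value = elem.findtext(field)
        if value is not None and value not in referenced:
            problems.append(value)
    return problems


root = ET.fromstring(DOC)
employee_ssn_key = key_values(root, "employee", "employeeSSN")
department_number_key = key_values(root, "department", "departmentNumber")

# departmentManagerSSNKeyRef refers to employeeSSNKey
manager_problems = dangling_refs(root, "department", "departmentManagerSSN", employee_ssn_key)
# employeeDepartmentNumberKeyRef refers to departmentNumberKey
dept_problems = dangling_refs(root, "employee", "employeeDepartmentNumber", department_number_key)

print(manager_problems, dept_problems)
```

Here the second employee refers to department number 4, which no department element declares, so it is reported as a dangling reference.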
XML Languages
There have been several proposals for XML query languages, and two query language standards have
emerged. The first is XPath, which provides language constructs for specifying path expressions to identify
certain nodes (elements) or attributes within an XML document that match specific patterns. The second is
XQuery, which is a more general query language. XQuery uses XPath expressions but has additional constructs.
XPath: Specifying Path Expressions in XML
An XPath expression generally returns a sequence of items that satisfy a certain pattern as specified by the
expression. These items are either values (from leaf nodes) or elements or attributes. The most common type of
XPath expression returns a collection of element or attribute nodes that satisfy certain patterns specified in the
expression. The names in the XPath expression are node names in the XML document tree that are either tag
(element) names or attribute names, possibly with additional qualifier conditions to further restrict the nodes
that satisfy the pattern. Two main separators are used when specifying a path: single slash (/) and double slash
(//). A single slash before a tag specifies that the tag must appear as a direct child of the previous (parent) tag,
whereas a double slash specifies that the tag can appear as a descendant of the previous tag at any level. Let
us look at some examples of XPath as shown in Figure 12.6. The first XPath expression in Figure 12.6 returns
the company root node and all its descendant nodes, which means that it returns the whole XML document. We
should note that it is customary to include the file name in the XPath query. This allows us to specify any local file
name or even any path name that specifies a file on the Web. For example, if the COMPANY XML document is
stored at the location www.company.com/info.XML, then the first XPath expression in Figure 12.6 can be written as
doc("www.company.com/info.XML")/company
This prefix would also be included in the other examples of XPath expressions.
Figure 12.6
Some examples of XPath expressions on XML documents that follow the XML schema file company in Figure 12.5.
1. /company
2. /company/department
3. //employee [employeeSalary gt 70000]/employeeName
4. /company/employee [employeeSalary gt 70000]/employeeName
5. /company/project/projectWorker [hours ge 20.0]
The second example in Figure 12.6 returns all department nodes (elements) and their descendant subtrees.
Note that the nodes (elements) in an XML document are ordered, so the XPath result that returns multiple nodes
will do so in the same order in which the nodes are ordered in the document tree. The third XPath expression in
Figure 12.6 illustrates the use of //, which is convenient to use if we do not know the full path name we are
searching for, but do know the name of some tags of interest within the XML document. This is particularly
useful for schemaless XML documents or for documents with many nested levels of nodes.
The third expression returns all employeeName nodes that are direct children of an employee node, such that the
employee node has another child element employeeSalary whose value is greater than 70000. This illustrates the use
of qualifier conditions, which restrict the nodes selected by the XPath expression to those that satisfy the
condition. XPath has a number of comparison operations for use in qualifier conditions, including standard
arithmetic, string, and set comparison operations. The fourth XPath expression in Figure 12.6 should return the
same result as the previous one, except that we specified the full path name in this example. The fifth expression
in Figure 12.6 returns all projectWorker nodes and their descendant nodes that are children under a path
/company/project and have a child node hours with a value greater than or equal to 20.0 hours. When we need to include
attributes in an XPath expression, the attribute name is prefixed by the @ symbol to distinguish it from element
(tag) names. It is also possible to use the wildcard symbol *, which stands for any element, as in the following
example, which retrieves all elements that are child elements of the root, regardless of their element type:
/company/*
When wildcards are used, the result can be a sequence of different types of items. The examples above
illustrate simple XPath expressions, where we can only move down in the tree structure from a given node. A
more general model for path expressions has been proposed. In this model, it is possible to move in multiple
directions from the current node in the path expression. These are known as the axes of an XPath expression.
Our examples above used only three of these axes: child of the current node (/), descendant or self at any level
of the current node (//), and attribute of the current node (@). Other axes include parent, ancestor (at any level),
previous sibling (any node at same level to the left in the tree), and next sibling (any node at the same level to
the right in the tree). These axes allow for more complex path expressions. The main restriction of XPath path
expressions is that the path that specifies the pattern also specifies the items to be retrieved. Hence, it is difficult
to specify certain conditions on the pattern while separately specifying which result items should be retrieved.
The XQuery language separates these two concerns, and provides more powerful constructs for specifying
queries.
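Some of the path expressions discussed above can be tried out with Python's standard xml.etree.ElementTree module, with two caveats that should be read as assumptions of this sketch: ElementTree implements only a subset of XPath and has no gt or ge comparison in predicates, so the qualifier conditions are applied in Python instead; and paths are written relative to the root element, so /company/department becomes ./department. The sample document and its values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up, partial company document; names and salaries are illustrative.
DOC = """<company>
  <department><departmentName>Research</departmentName></department>
  <employee>
    <employeeName><firstName>John</firstName><lastName>Smith</lastName></employeeName>
    <employeeSalary>30000</employeeSalary>
  </employee>
  <employee>
    <employeeName><firstName>Joyce</firstName><lastName>English</lastName></employeeName>
    <employeeSalary>80000</employeeSalary>
  </employee>
  <project>
    <projectWorker><SSN>123456789</SSN><hours>32.5</hours></projectWorker>
    <projectWorker><SSN>453453453</SSN><hours>7.5</hours></projectWorker>
  </project>
</company>"""

# Expression 1, /company: fromstring returns the <company> root element itself.
company = ET.fromstring(DOC)

# Expression 2, /company/department: direct children (single slash).
departments = company.findall("./department")

# Expression 3, //employee[employeeSalary gt 70000]/employeeName:
# the double slash finds employee at any depth; the salary qualifier is
# applied in Python because ElementTree predicates cannot compare numbers.
well_paid_names = [
    emp.find("employeeName")
    for emp in company.findall(".//employee")
    if float(emp.findtext("employeeSalary")) > 70000
]

# Expression 5, /company/project/projectWorker[hours ge 20.0].
busy_workers = [
    worker
    for worker in company.findall("./project/projectWorker")
    if float(worker.findtext("hours")) >= 20.0
]

# Wildcard, /company/*: every child element of the root, in document order.
child_tags = [child.tag for child in company.findall("./*")]

print(len(departments), well_paid_names[0].findtext("firstName"))
print([w.findtext("SSN") for w in busy_workers], child_tags)
```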
XQuery: Specifying Queries in XML
XPath allows us to write expressions that select items from a tree-structured XML document. XQuery permits the
specification of more general queries on one or more XML documents. The typical form of a query in XQuery is
known as a FLWR expression (written FLWOR in the XQuery standard, which adds an ORDER BY clause), which stands for
the four main clauses of XQuery discussed here and has the following
form:
FOR <variable bindings to individual nodes (elements)>
LET <variable bindings to collections of nodes (elements)>
WHERE <qualifier conditions>
RETURN <query result specification>
There can be zero or more instances of the FOR clause, as well as of the LET clause in a single XQuery. The
WHERE clause is optional, but can appear at most once, and the RETURN clause must appear exactly once. The clause
keywords are capitalized in the text for emphasis; XQuery is case-sensitive, and in an actual query they are written
in lowercase. Let us illustrate these clauses with the following simple example of an XQuery.
let $d := doc("www.company.com/info.xml")
for $x in $d/company/project[projectNumber = 5]/projectWorker,
$y in $d/company/employee
where $x/hours gt 20.0 and $y/employeeSSN = $x/SSN
return <res> { $y/employeeName/firstName, $y/employeeName/lastName, $x/hours } </res>
1. Variables are prefixed with the $ sign. In the above example, $d, $x, and $y are variables.
2. The LET clause assigns a variable to a particular expression for the rest of the query. In this example, $d is
assigned to the document file name. It is possible to have a query that refers to multiple documents by assigning
multiple variables in this way.
3. The FOR clause assigns a variable to range over each of the individual items in a sequence. In our example,
the sequences are specified by path expressions. The $x variable ranges over elements that satisfy the path
expression $d/company/project[projectNumber = 5]/projectWorker. The $y variable ranges over elements that satisfy the
path expression $d/company/employee. Hence, $x ranges over projectWorker elements, whereas $y ranges over
employee elements.
4. The WHERE clause specifies additional conditions on the selection of items. In this example, the first condition
selects only those projectWorker elements that satisfy the condition (hours gt 20.0). The second condition specifies a
join condition that combines an employee with a projectWorker only if they have the same ssn value.
5. Finally, the RETURN clause specifies which elements or attributes should be retrieved from the items that
satisfy the query conditions. In this example, it will return a sequence of elements each containing <firstName,
lastName, hours> for employees who work more than 20 hours per week on project number 5.
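The roles of the four clauses can be mirrored in ordinary Python, which may help readers who have no XQuery processor at hand. This is a sketch of the same logic, not XQuery itself: the sample document is made up, the function name workers_over_20_hours is an assumption, and the result is returned as tuples rather than as <res> elements.

```python
import xml.etree.ElementTree as ET

# A made-up, partial company document used only for this illustration.
DOC = """<company>
  <employee>
    <employeeName><firstName>John</firstName><lastName>Smith</lastName></employeeName>
    <employeeSSN>123456789</employeeSSN>
  </employee>
  <employee>
    <employeeName><firstName>Joyce</firstName><lastName>English</lastName></employeeName>
    <employeeSSN>453453453</employeeSSN>
  </employee>
  <project>
    <projectNumber>5</projectNumber>
    <projectWorker><SSN>123456789</SSN><hours>32.5</hours></projectWorker>
    <projectWorker><SSN>453453453</SSN><hours>20.0</hours></projectWorker>
  </project>
  <project>
    <projectNumber>2</projectNumber>
    <projectWorker><SSN>453453453</SSN><hours>40.0</hours></projectWorker>
  </project>
</company>"""


def workers_over_20_hours(d, project_number="5"):
    """Join projectWorker and employee elements on SSN, as in the query above."""
    results = []
    worker_path = f"./project[projectNumber='{project_number}']/projectWorker"
    for x in d.findall(worker_path):              # FOR $x: one projectWorker at a time
        for y in d.findall("./employee"):         # FOR $y: one employee at a time
            more_than_20 = float(x.findtext("hours")) > 20.0
            same_ssn = y.findtext("employeeSSN") == x.findtext("SSN")
            if more_than_20 and same_ssn:         # WHERE: filter plus join condition
                results.append((                  # RETURN: what to keep from each pair
                    y.findtext("employeeName/firstName"),
                    y.findtext("employeeName/lastName"),
                    x.findtext("hours"),
                ))
    return results


d = ET.fromstring(DOC)                            # LET $d: bind the document once
print(workers_over_20_hours(d))
```

The nested loops make the cost of the join visible: every projectWorker is compared with every employee, which a real query processor would try to avoid.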
Figure 12.7 includes some additional examples of queries in XQuery that can be specified on XML instance
documents that follow the XML schema document in Figure 12.5. The first query retrieves the first and last
names of employees who earn more than $70,000. The variable $x is bound to each employeeName element that
is a child of an employee element, but only for employee elements that satisfy the qualifier that their employeeSalary
value is greater than $70,000. The result retrieves the firstName and lastName child elements of the selected
employeeName elements. The second query is an alternative way of retrieving the same elements retrieved by
the first query. The third query illustrates how a join operation can be performed by using more than one
variable. Here, the $x variable is bound to each projectWorker element that is a child of project number 5, whereas
the $y variable is bound to each employee element. The join condition matches ssn values in order to retrieve the
employee names. Notice that this is an alternative way of specifying the same query in our earlier example, but
without the LET clause. XQuery has very powerful constructs to specify complex queries. In particular, it can
specify universal and existential quantifiers in the conditions of a query, aggregate functions, ordering of query
results, selection based on position in a sequence, and even conditional branching. Hence, in some ways, it
qualifies as a full-fledged programming language. This concludes our brief introduction to XQuery. The interested
reader is referred to www.w3.org, which contains documents describing the latest standards related to
XML and XQuery. The next section briefly discusses some additional languages and protocols related to XML.
1. for $x in
doc("www.company.com/info.xml")//employee[employeeSalary gt 70000]/employeeName
return <res> { $x/firstName, $x/lastName } </res>
2. for $x in
doc("www.company.com/info.xml")/company/employee
where $x/employeeSalary gt 70000
return <res> { $x/employeeName/firstName, $x/employeeName/lastName } </res>
3. for $x in
doc("www.company.com/info.xml")/company/project[projectNumber = 5]/projectWorker,
$y in doc("www.company.com/info.xml")/company/employee
where $x/hours gt 20.0 and $y/ssn = $x/ssn
return <res> { $y/employeeName/firstName, $y/employeeName/lastName, $x/hours } </res>
Figure 12.7
Some examples of XQuery queries on XML documents that follow the XML schema file company in Figure 12.5.
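As a rough cross-check of the first two queries in Figure 12.7, the same selection can be sketched in Python. The sample data and the function name high_earners are assumptions. The limited XPath subset in the standard library's ElementTree has no numeric comparison such as gt, so the salary test is written as ordinary Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document; values are made up for illustration.
COMPANY_XML = """<company>
  <employee>
    <employeeName><firstName>Ann</firstName><lastName>Lee</lastName></employeeName>
    <employeeSalary>82000</employeeSalary>
  </employee>
  <employee>
    <employeeName><firstName>Bob</firstName><lastName>Ray</lastName></employeeName>
    <employeeSalary>55000</employeeSalary>
  </employee>
</company>"""


def high_earners(xml_text, threshold=70000):
    company = ET.fromstring(xml_text)
    names = []
    # Bind a variable to each employee element (as in the second query).
    for employee in company.findall("employee"):
        # Qualifier / WHERE condition: employeeSalary gt 70000
        if float(employee.findtext("employeeSalary")) > threshold:
            name = employee.find("employeeName")
            # RETURN: the firstName and lastName children of employeeName
            names.append((name.findtext("firstName"), name.findtext("lastName")))
    return names


print(high_earners(COMPANY_XML))  # [('Ann', 'Lee')]
```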
1. Differentiate between attributes and elements in XML? List some of the important attributes used in
specifying elements in XML schema.
Elements can be parents of other elements, can carry attributes, and can be repeated within the same level of an XML
document. An element usually consists of an opening (start) tag, some content, and a closing (end) tag; an empty
element can also be written as a single tag such as <test/>. Elements are the building blocks of an XML document.
An element would look like:
<test>someValue</test>
Here, "test" would be an element.
An attribute is additional information attached to an element; it is an "add-on" to the element and can never exist
alone. An attribute is a name-value pair placed in an element's start tag. Attribute values must be enclosed in
single or double quotes. Attribute names must be unique within a single element occurrence.
<test id="5">somevalue</test>
"id" is an attribute.
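To see how a parser exposes the two, here is a small sketch using Python's standard xml.etree.ElementTree module on the example above. The variable name element is just for illustration.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring('<test id="5">somevalue</test>')

print(element.tag)        # test       (the element name)
print(element.text)       # somevalue  (the element content)
print(element.attrib)     # {'id': '5'} (attributes as name-value pairs)
print(element.get("id"))  # 5          (always a string at this level)
```

The element content and the attribute value arrive through different accessors, which mirrors the distinction drawn above: the attribute exists only as part of its element.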
The decision to use Elements versus Attributes is mostly an architectural one; however, there are some key
differences between Elements and Attributes:
1. Elements can occur more than once (repeating) within the same level, while a given attribute name can appear
only once on an element. For example, it is okay to have:
<test><item>a</item><item>b</item></test>
But it would be invalid to have:
<test id="5" id="6">somevalue</test>
2. Elements can be defined to be in a certain order, while attributes can appear in any order.
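Both differences can be checked with a parser. The sketch below uses Python's standard xml.etree.ElementTree; the helper name is_well_formed and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Difference 1: repeated elements are fine, a repeated attribute name is not.
repeated_elements = "<test><item>a</item><item>b</item></test>"
repeated_attribute = '<test id="5" id="6">somevalue</test>'
print(is_well_formed(repeated_elements))   # True
print(is_well_formed(repeated_attribute))  # False

# Difference 2: attribute order carries no meaning, element order does.
first = ET.fromstring('<t a="1" b="2"/>')
second = ET.fromstring('<t b="2" a="1"/>')
print(first.attrib == second.attrib)       # True

ordered = ET.fromstring("<t><x/><y/></t>")
print([child.tag for child in ordered])    # ['x', 'y']
```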
Some of the important attributes used in specifying elements in XML schema (these are the attributes of an
xs:attribute declaration):
Attribute: Description
default: Optional. Specifies a default value for the attribute. Default and fixed attributes cannot both be present.
fixed: Optional. Specifies a fixed value for the attribute. Default and fixed attributes cannot both be present.
form: Optional. Specifies the form for the attribute. The default value is the value of the attributeFormDefault
attribute of the element containing the attribute. Can be set to one of the following:
  "qualified" - indicates that this attribute must be qualified with the namespace prefix and the
  no-colon-name (NCName) of the attribute
  "unqualified" - indicates that this attribute is not required to be qualified with the namespace prefix
  and is matched against the (NCName) of the attribute
id: Optional. Specifies a unique ID for the element.
name: Optional. Specifies the name of the attribute. Name and ref attributes cannot both be present.
ref: Optional. Specifies a reference to a named attribute. Name and ref attributes cannot both be present. If ref
is present, simpleType element, form, and type cannot be present.
type: Optional. Specifies a built-in data type or a simple type. The type attribute can only be present when the
content does not contain a simpleType element.
use: Optional. Specifies how the attribute is used. Can be one of the following values:
  "optional" - the attribute is optional (this is the default)
  "prohibited" - the attribute cannot be used
  "required" - the attribute is required
any attributes: Optional. Specifies any other attributes with non-schema namespace.
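As an illustration of how name, type, use, default, and fixed appear in practice, the sketch below parses a small schema fragment with Python's standard library and lists each attribute declaration. The fragment, the type name ItemType, and the function attribute_declarations are hypothetical. The code only reads the declarations; it does not validate anything.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema fragment written for this example.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="ItemType">
    <xs:attribute name="code" type="xs:string" use="required"/>
    <xs:attribute name="lang" type="xs:string" default="en"/>
    <xs:attribute name="version" type="xs:string" fixed="1.0"/>
  </xs:complexType>
</xs:schema>"""


def attribute_declarations(schema_text):
    """Collect the settings of every xs:attribute declaration, in document order."""
    root = ET.fromstring(schema_text)
    declarations = []
    for decl in root.iter(f"{{{XS}}}attribute"):
        declarations.append({
            "name": decl.get("name"),
            "type": decl.get("type"),
            # "optional" is the default when use is not written out.
            "use": decl.get("use", "optional"),
            "default": decl.get("default"),
            "fixed": decl.get("fixed"),
        })
    return declarations


for declaration in attribute_declarations(SCHEMA):
    print(declaration)
```

Note that no declaration in the fragment carries both default and fixed, in line with the rule that the two cannot both be present.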
Differentiate between XML schema and XML DTD with suitable example.
The critical difference between DTDs and XML Schema is that XML Schema uses an XML-based syntax, whereas DTDs
have a unique syntax held over from SGML DTDs. Although DTDs are often criticized because of this need to learn a new
syntax, the syntax itself is quite terse. The opposite is true for XML Schema, which is verbose but is written with tags
and XML, so authors of XML should find its syntax less intimidating.
The goal of DTDs was to retain a level of compatibility with SGML for applications that might want to convert SGML
DTDs into XML DTDs. XML Schema, however, follows one of the design goals of XML, "terseness in XML markup is of
minimal importance," so there is no real concern with keeping the syntax brief.
LIST1 is an example using DTD and providing a schema definition for the content above, while LIST2 is an example using
XML Schema to provide a schema definition (employee.xs).
LIST1: Employee Information DTD
<!ELEMENT Employee_Info (Employee)*>
<!ELEMENT Employee (Name, Department, Telephone, Email)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Department (#PCDATA)>
<!ELEMENT Telephone (#PCDATA)>
<!ELEMENT Email (#PCDATA)>
<!ATTLIST Employee Employee_Number CDATA #REQUIRED>
LIST2: Employee Information XML Schema (employee.xs)
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Employee_Info" type="EmployeeInfoType" />
  <xs:complexType name="EmployeeInfoType">
    <xs:sequence>
      <xs:element ref="Employee" minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>
  <xs:element name="Employee" type="EmployeeType" />
  <xs:complexType name="EmployeeType">
    <xs:sequence>
      <xs:element ref="Name" />
      <xs:element ref="Department" />
      <xs:element ref="Telephone" />
      <xs:element ref="Email" />
    </xs:sequence>
    <xs:attribute name="Employee_Number" type="xs:int" use="required"/>
  </xs:complexType>
  <xs:element name="Name" type="xs:string" />
  <xs:element name="Department" type="xs:string" />
  <xs:element name="Telephone" type="xs:string" />
  <xs:element name="Email" type="xs:string" />
</xs:schema>
As we see, the syntax is completely different between the two. The DTD is written in its own unique syntax, whereas the
XML Schema is written in XML format conforming to XML 1.0 syntax. LIST3 is an example of a valid XML document
for the LIST2 XML Schema (employee.xml).
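To make the shared constraints concrete, the sketch below spells out by hand what LIST1 and LIST2 both require of a document. It is a hand-written check using only Python's standard library, not a DTD or XML Schema validator. The sample document is hypothetical (it is not LIST3), and the function name check_employee_info is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Child elements of Employee, in the order both LIST1 and LIST2 require.
EXPECTED_CHILDREN = ["Name", "Department", "Telephone", "Email"]

# Hypothetical sample document; every value is made up.
SAMPLE = """<Employee_Info>
  <Employee Employee_Number="105">
    <Name>Example Name</Name>
    <Department>Sales</Department>
    <Telephone>000-0000</Telephone>
    <Email>name@example.com</Email>
  </Employee>
</Employee_Info>"""


def check_employee_info(xml_text):
    """Return a list of problems found by these hand-written checks."""
    root = ET.fromstring(xml_text)
    if root.tag != "Employee_Info":
        return ["root element must be Employee_Info"]
    problems = []
    for position, employee in enumerate(root, start=1):
        if employee.tag != "Employee":
            problems.append(f"child {position}: expected Employee, found {employee.tag}")
            continue
        if [child.tag for child in employee] != EXPECTED_CHILDREN:
            problems.append(f"Employee {position}: children must be {EXPECTED_CHILDREN} in order")
        number = employee.get("Employee_Number")
        if number is None:
            # Required by both the DTD (#REQUIRED) and the schema (use="required").
            problems.append(f"Employee {position}: Employee_Number is required")
        elif not re.fullmatch(r"[+-]?[0-9]+", number.strip()):
            # Rough stand-in for type="xs:int" (no range check); the DTD's
            # CDATA type would accept any text here.
            problems.append(f"Employee {position}: Employee_Number must be an integer")
    return problems


print(check_employee_info(SAMPLE))  # []
```

The last check shows one practical difference between the two listings: the DTD declares Employee_Number only as CDATA, while the schema can restrict it to a data type.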
For a DTD, a DOCTYPE declaration is used to associate the DTD with the XML document. In the case of XML Schema, the
specification does not prescribe how a schema is associated with an XML document, so the association follows whatever
method the validation tool in use implements. However, the XML Schema specification does define a way to write a hint
that associates a schema with the XML document. The following content is inserted into the root element of the XML
document.