Questions
1. Differentiate between attributes and elements in XML. List some of the important attributes used in specifying elements in XML schema.
2. XML and HTML
3. XPath
4. XQuery
5. Differentiate between XML schema and XML DTD with a suitable example.

Unit 5: Database Related Standards

SQL Standards

SQL is used to communicate with a database. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems. SQL statements are used to perform tasks such as updating data in a database or retrieving data from a database.

Standards are important because of the complexity of database systems and their need for interoperation. Formal standards exist for SQL. De facto standards, such as ODBC and JDBC, and standards adopted by industry groups, such as CORBA, have played an important role in the growth of client-server database systems.

Since SQL is the most widely used query language, much work has been done on standardizing it. ANSI and ISO, with the various database vendors, have played a leading role in this work. The SQL-86 standard was the initial version. The IBM Systems Application Architecture (SAA) standard for SQL was released in 1987. As people identified the need for more features, updated versions of the formal SQL standard were developed, called SQL-89 and SQL-92.

The SQL:1999 version of the SQL standard added a variety of features to SQL. The SQL:2003 version of the SQL standard is a minor extension of the SQL:1999 standard. Some features, such as the SQL:1999 OLAP features, were specified as an amendment to the earlier version of the SQL:1999 standard instead of waiting for the release of SQL:2003.

The SQL:2003 standard was broken into several parts:
• Part 1: SQL/Framework provides an overview of the standard.
• Part 2: SQL/Foundation defines the basics of the standard: types, schemas, tables, views, query and update statements, expressions, security model, predicates, assignment rules, transaction management, and so on.
• Part 3: SQL/CLI (Call Level Interface) defines application program interfaces to SQL.
• Part 4: SQL/PSM (Persistent Stored Modules) defines extensions to SQL to make it procedural.
• Part 9: SQL/MED (Management of External Data) defines standards for interfacing an SQL system to external sources. By writing wrappers, system designers can treat external data sources, such as files or data in nonrelational databases, as if they were “foreign” tables.
• Part 10: SQL/OLB (Object Language Bindings) defines standards for embedding SQL in Java.
• Part 11: SQL/Schemata (Information and Definition Schema) defines a standard catalog interface.
• Part 13: SQL/JRT (Java Routines and Types) defines standards for accessing routines and types in Java.
• Part 14: SQL/XML defines XML-related specifications.

The missing numbers cover features such as temporal data, distributed transaction processing, and multimedia data, for which there is as yet no agreement on the standards. The latest versions of the SQL standard are SQL:2006, which added several features related to XML, and SQL:2008, which introduces a number of extensions to the SQL language.

Database Connectivity Standards

The ODBC standard is a widely used standard for communication between client applications and database systems. ODBC is based on the SQL Call Level Interface (CLI) standards developed by the X/Open industry consortium and the SQL Access Group, but it has several extensions. The ODBC API defines a CLI, an SQL syntax definition, and rules about permissible sequences of CLI calls. The standard also defines conformance levels for the CLI and the SQL syntax.
For example, the core level of the CLI has commands to connect to a database, to prepare and execute SQL statements, to get back results or status values, and to manage transactions. The next level of conformance (level 1) requires support for catalog information retrieval and some other features over and above the core-level CLI; level 2 requires further features, such as the ability to send and retrieve arrays of parameter values and to retrieve more detailed catalog information.

ODBC allows a client to connect simultaneously to multiple data sources and to switch among them, but transactions on each are independent; ODBC does not support two-phase commit.

A distributed system provides a more general environment than a client-server system. The X/Open consortium has also developed the X/Open XA standards for interoperation of databases. These standards define transaction management primitives (such as transaction begin, commit, abort, and prepare-to-commit) that compliant databases should provide; a transaction manager can invoke these primitives to implement distributed transactions by two-phase commit. The XA standards are independent of the data model and of the specific interfaces between clients and databases to exchange data. Thus, we can use the XA protocols to implement a distributed transaction system in which a single transaction can access relational as well as object-oriented databases, yet the transaction manager ensures global consistency via two-phase commit.

There are many data sources that are not relational databases, and in fact may not be databases at all. Examples are flat files and email stores. Microsoft’s OLE-DB is a C++ API with goals similar to ODBC, but for nondatabase data sources that may provide only limited querying and update facilities. Just like ODBC, OLE-DB provides constructs for connecting to a data source, starting a session, executing commands, and getting back results in the form of a rowset, which is a set of result rows.
However, OLE-DB differs from ODBC in several ways. To support data sources with limited feature support, features in OLE-DB are divided into a number of interfaces, and a data source may implement only a subset of the interfaces. An OLE-DB program can negotiate with a data source to find what interfaces are supported. In ODBC, commands are always in SQL. In OLE-DB, commands may be in any language supported by the data source; while some sources may support SQL, or a limited subset of SQL, other sources may provide only simple capabilities such as accessing data in a flat file, without any query capability.

Another major difference between OLE-DB and ODBC is that a rowset is an object that can be shared by multiple applications through shared memory. A rowset object can be updated by one application, and other applications sharing that object will be notified about the change.

The Active Data Objects (ADO) API, also created by Microsoft, provides an easy-to-use interface to the OLE-DB functionality, which can be called from scripting languages such as VBScript and JScript. The newer ADO.NET API is designed for applications written in the .NET languages such as C# and Visual Basic.NET. In addition to providing simplified interfaces, it provides an abstraction called the DataSet that permits disconnected data access.

Object Database Standards

Standards in the area of object-oriented databases have so far been driven primarily by OODB vendors. The Object Database Management Group (ODMG) was a group formed by OODB vendors to standardize the data model and language interfaces to OODBs. The C++ language interface specified by ODMG was briefly outlined in Chapter 22. ODMG is no longer active. JDO is a standard for adding persistence to Java.

The Object Management Group (OMG) is a consortium of companies, formed with the objective of developing a standard architecture for distributed software applications based on the object-oriented model.
OMG brought out the Object Management Architecture (OMA) reference model. The Object Request Broker (ORB) is a component of the OMA architecture that provides message dispatch to distributed objects transparently, so the physical location of the object is not important. The Common Object Request Broker Architecture (CORBA) provides a detailed specification of the ORB, and includes an Interface Description Language (IDL), which is used to define the data types used for data interchange. The IDL helps to support data conversion when data are shipped between systems with different data representations.

Microsoft introduced the Entity data model, which incorporates ideas from the entity-relationship and object-oriented data models, and an approach to integrating querying with the programming language, called Language Integrated Querying or LINQ. These are likely to become de facto standards.

XML-Based Standards

A wide variety of standards based on XML (see Chapter 23) have been defined for a wide variety of applications. Many of these standards are related to e-commerce. They include standards promulgated by nonprofit consortia and corporate-backed efforts to create de facto standards.

RosettaNet, which falls into the former category, is an industry consortium that uses XML-based standards to facilitate supply-chain management in the computer and information technology industries. Supply-chain management refers to the purchases of material and services that an organization needs to function. In contrast, customer-relationship management refers to the front end of a company’s interaction, dealing with customers. Supply-chain management requires standardization of a variety of things such as:
• Global company identifier: RosettaNet specifies a system for uniquely identifying companies, using a 9-digit identifier called Data Universal Numbering System (DUNS).
• Global product identifier: RosettaNet specifies a 14-digit Global Trade Item Number (GTIN) for identifying products and services.
• Global class identifier: This is a 10-digit hierarchical code for classifying products and services called the United Nations/Standard Product and Services Code (UN/SPSC).
• Interfaces between trading partners: RosettaNet Partner Interface Processes (PIPs) define business processes between partners. PIPs are system-to-system XML-based dialogs: they define the formats and semantics of business documents involved in a process and the steps involved in completing a transaction. Examples of steps could include getting product and service information, purchase orders, order invoicing, payment, order status requests, inventory management, post-sales support including service warranty, and so on. Exchange of design, configuration, process, and quality information is also possible to coordinate manufacturing activities across organizations.

Participants in electronic marketplaces may store data in a variety of database systems. These systems may use different data models, data formats, and data types. Furthermore, there may be semantic differences (metric versus English measure, distinct monetary currencies, and so forth) in the data. Standards for electronic marketplaces include methods for wrapping each of these heterogeneous systems with an XML schema. These XML wrappers form the basis of a unified view of data across all of the participants in the marketplace.

Simple Object Access Protocol (SOAP) is a remote procedure call standard that uses XML to encode data (both parameters and results), and uses HTTP as the transport protocol; that is, a procedure call becomes an HTTP request. SOAP is backed by the World Wide Web Consortium (W3C) and has gained wide acceptance in industry. SOAP can be used in a variety of applications.
For instance, in business-to-business e-commerce, applications running at one site can access data from and execute actions at other sites through SOAP.

ODMG 3.0

ODMG 3.0 was developed by the Object Data Management Group (ODMG). The ODMG is a consortium of vendors and interested parties that work on specifications for object database and object-relational mapping products.

ODMG 3.0 is a portability specification. It is designed to allow for portable applications that could run on more than one product. ODMG 3.0 uses the Java, C++, and Smalltalk languages as much as possible, to allow for the transparent integration of object programming languages. The major components of the ODMG 3.0 specification are:

Object Model. The common data model supported by ODMG implementations is based on the OMG Object Model. The OMG core model was designed to be a common denominator for object request brokers, object database systems, object programming languages, and other applications. In keeping with the OMG Architecture, a profile has been designed for their model, adding components (e.g., relationships) to the OMG core object model to support the ODMG needs.

Object Specification Languages. The two specification languages are the Object Definition Language (ODL) and the Object Interchange Format (OIF). ODL is a specification language used to define the object types that conform to the ODMG Object Model and is based on the OMG IDL. OIF is a specification language used to dump the current state of a database to, and load it from, a file or set of files.

Object Query Language. This is a declarative (nonprocedural) language for querying and updating objects. SQL-92 was used as the basis for OQL.

C++ Language Binding. This is the binding of ODMG implementations to C++. This is called the C++ OML, or object manipulation language. The C++ binding also includes a version of the ODL that uses C++ syntax, a mechanism to invoke OQL, and procedures for operations and transactions.

Smalltalk Language Binding.
This is the binding of ODMG implementations to Smalltalk. It defines the binding in terms of the mapping between ODL and Smalltalk, which is based on the OMG Smalltalk binding for IDL. The Smalltalk binding also includes a mechanism to invoke OQL and procedures for operations on databases and transactions.

Java Language Binding. This is the binding between the ODMG Object Model (ODL and OML) and the Java programming language as defined by the Java 2 Platform. The Java language binding also includes a mechanism to invoke OQL and procedures for operations and transactions.

It is possible to read and write the same database from C++, Smalltalk, and Java, as long as the programmer stays within the common subset of supported data types. Note that, unlike SQL in relational systems, the ODMG data manipulation languages are tailored to specific application programming languages, in order to provide a single, integrated environment for programming and data manipulation. This is called transparent persistence. Such transparent persistence is illustrated by the following diagram and contrasts with the database sublanguage of SQL and its variants. In this diagram, you only see the host programming language and no database sublanguage or call-level interface as in JDBC.

SQL-92/CLI

In 1995 an addendum for a Call-Level Interface, SQL-92/CLI (sometimes called CLI-95), was adopted. CLI-95 is a subset of the popular (de facto standard) ODBC interface from Microsoft and others; it functions as a callable interface to an SQL database system, providing a highly dynamic capability by contrast with the relatively static facility provided by embedded SQL. CLI (and ODBC, of course) is primarily used by ad hoc applications, like decision support applications, whereas embedded SQL is more likely to be used by “glass house” applications that are considerably less dynamic in their function.
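
The "highly dynamic capability" of a call-level interface can be sketched in code. The example below is not SQL/CLI or ODBC itself; it assumes Python's built-in sqlite3 module as a stand-in, since its connect/execute/fetch calls follow the same call-level style. The `sales` table and the helper `run_dynamic_query` are hypothetical names invented for illustration.

```python
import sqlite3

def run_dynamic_query(conn, column, min_amount):
    """Build and run a query whose shape is decided only at run time."""
    # Identifiers cannot be bound as parameters, so restrict them to a
    # known set before splicing them into the SQL text.
    allowed = {"region", "product"}
    if column not in allowed:
        raise ValueError(f"unsupported grouping column: {column}")
    sql = (
        f"SELECT {column}, SUM(amount) FROM sales "
        f"WHERE amount >= ? GROUP BY {column} ORDER BY {column}"
    )
    # The SQL string is handed to the database as data through a function
    # call; values are supplied separately as bound parameters.
    return conn.execute(sql, (min_amount,)).fetchall()

# Hypothetical in-memory data, used only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "disk", 100.0), ("east", "tape", 40.0), ("west", "disk", 75.0)],
)
conn.commit()

by_region = run_dynamic_query(conn, "region", 50.0)
by_product = run_dynamic_query(conn, "product", 0.0)
print(by_region)
print(by_product)
```

With embedded SQL, by contrast, the statement text is fixed in the program source and checked by a precompiler, which is why it suits applications whose queries rarely change.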
SQL-92/PSM

In 1996 an addendum for Stored Procedures, SQL-92/PSM (Persistent Stored Modules, sometimes called PSM-96), was adopted. SQL-92/PSM added the following types of features:
• Multi-statement procedures: groups of SQL statements can be executed together; flow-of-control statements, local variables, and condition handlers are provided
• Stored routines and modules: procedures, functions, and modules can be stored in an SQL server
• External routines: functions and procedures written in host programming languages can be invoked from SQL statements

SQL-92/PSM was discussed in this column in December 1996.

SQL3

The two committees are currently working on SQL3, which will replace SQL-92 when it is adopted. SQL3 began its final CD ballot in October 1997. An editing meeting took place in March 1998. Additional editing meetings are scheduled for June 1998 and November 1998. If these meetings are successful, then SQL3 could be adopted in early 1999.

SQL3 extends the data types of SQL-92 significantly. It adds some predefined data types, like BOOLEAN, CHARACTER LARGE OBJECT, and ROW. It adds the collection type of ARRAY. The single largest addition to SQL3 is user-defined data types (UDTs). Users will be able to define their own data types, each with a concrete representation, methods, and ordering properties. These UDTs can be used anywhere that a predefined data type can be used (as the data type of a column, for example).

A UDT can also be used in a new way. It can be associated with a base table, so that each attribute of the UDT maps to a column of the base table. A new data type, REF, can be used to refer to rows in such a table. Inheritance is supported for both base tables and UDTs. It remains to be seen whether it is single inheritance or multiple inheritance that is finally adopted.
Some of the other features that are provided by SQL3 are:
• Recursive Query: creates a result table from the traversal of rows that form a directed graph
• Similar Predicate: an extension of the LIKE predicate that allows regular expressions
• Roles: authorization may be granted to a role, and users may then take on different roles at different times
• Triggers: statements may be defined to execute each time insert, update, or delete statements are executed on a particular table. The statements may execute once for the statement, or individually for each row that is affected.
• Holdable Cursors: cursors may be defined to stay open after a transaction commit

SQL3 will have a set of features that are required for conformance to the standard; this is being termed Core SQL3. Additional packages of features will also be defined that require features in addition to those in Core SQL3. It is likely that we will devote a column to SQL3 when it nears the end of its adoption process and has become more stable.

Object Data Management Group (ODMG)

ODMG is a non-profit consortium that was formed in 1991 to “develop and promote standards for object storage”. ODMG’s latest publication is “Object Database Standard: ODMG 2.0”. This specification contains an Object Model, an Object Definition Language (ODL), an Object Query Language (OQL), and language bindings to Java, C++, and Smalltalk. The Object Model is a superset of the OMG object model, adding relationships, extents, collection classes, and concurrency control. The language bindings allow a programmer to do both application and database programming in the same environment.
ODMG 2.0 has extended previous versions of the standard with:
· a Java language binding
· a standard external form for both metadata and data

Transaction Processing Performance Council (TPC)

TPC is a non-profit corporation formed to “define transaction processing and database benchmarks and to disseminate objective, verifiable TPC performance data to the industry”. TPC has published a number of benchmarks over time. Vendors run the benchmarks, certified auditors review the tests and results, and the results are submitted to TPC, after which they can be published. There is a Technical Advisory Board (TAB) that reviews benchmark compliance challenges.

TPC-C

TPC-C is the current OLTP benchmark. It contains a mixture of read-only and update transactions to simulate a complex OLTP application environment. The metrics for TPC-C are transactions-per-minute-C (tpmC) and price-per-tpmC ($/tpmC). The current version of TPC-C is 3.3.2. V4.0, currently under development, might have the following changes:
· Increased cost for some of the transaction types
· Enforcement of referential integrity constraints

TPC-D

TPC-D is a benchmark for a complex decision support environment. The queries in the benchmark involve multi-table joins, sorting, and aggregation. TPC-D benchmarks may be run with one of a specified set of database sizes that range from 1GB to 10,000GB. The current TPC-D benchmark is V1.3.1. A query stream contains 17 queries. The benchmark contains two metrics. The Power metric (QppD@size) is based on a single-stream Power test. The Throughput metric (QthD@size) may be based on an actual multi-stream run, or it may be calculated from the single-stream results. V2.0 may be approved in the beginning of 1999.
Some of the changes being considered for V2.0 are:
· 6 new queries (including left outer join, use of the SUBSTRING function)
· A required multi-stream throughput test, with a minimum number of streams for each database size

TPC-W

TPC-W is a benchmark for a retail eCommerce environment on the web. It is currently under development, with possible approval in early 1999. The benchmark models a storefront on the web. The benchmark may be run with one of a set of database sizes that range from 1K items to 1M items. The benchmark measures interactions seen by a browser, allowing for some user-interrupted transfers. The primary metrics will be Web Interactions Per Second (WIPS@size) and price-per-WIPS ($/WIPS@size).

SQLJ

SQLJ is an informal group of companies that has been investigating the ways that SQL and Java can be used together. This effort has spawned three documents that are about to be submitted to formal standards bodies for consideration.

SQLJ Part 0 - SQL Embedded in Java

SQL-92 defines the embedding of SQL statements in programming languages such as C or COBOL. This part of SQLJ defines the embedding of static SQL statements in a Java program. The expression of a user’s queries in SQLJ will usually be more compact and readable than their expression in JDBC.

The SQLJ statements are introduced in Java programs by “#sql”, which is not a valid token in the Java language. Variables and expressions from the Java language can be used to provide values to SQL and to retrieve values from SQL. The variables and expressions are prefixed with “:” to distinguish them from SQL identifiers. The JDBC mapping of Java data types to SQL data types has been used by SQLJ.

A SQLJ program may then be passed to a SQLJ translator, which can generate a pure Java program that might contain JDBC calls, or it might contain other Java statements that communicate with a SQL database. The SQLJ translator is able to perform some validation of the SQL statements.
The translator may be given the actual schema that will be used at runtime or an “exemplar” schema (one that is the same as the schema that will be used at runtime), in which case additional validation of the SQL statements can be done.

SQLJ Part 0 has just been submitted to NCITS H2 for adoption as a new part of SQL, SQL/OLB (Object Language Bindings).

Difference between HTML and XML

Key Difference: HTML is a markup language that is used to design web pages. It is written in predefined tag elements. Its primary purpose is to display data with a focus on how the data looks. XML is a markup language whose primary purpose is to transport and store data. It is a language that can be used to develop new languages and define other languages. It does not have a predefined set of tags, and allows the developer to customize tags.

HyperText Markup Language (HTML) is a well-known markup language used to develop web pages. It has been around for a long time and is commonly used in web page design. XML, or Extensible Markup Language, defines a set of rules for encoding documents in a format that can be read by both humans and computers.

HTML is written using HTML elements, which consist of tags, primarily an opening tag and a closing tag. The data between these tags is usually the content. The main objective of HTML is to allow web browsers to interpret and display the content written between the tags. The tags are designed to describe the page content. HTML comes with predefined tags. These days, web pages are rarely designed using HTML alone.

XML, on the other hand, is a markup language that is fairly new; it was launched in 1996 as an adaptation of SGML (Standard Generalized Markup Language). The main purpose of XML is to be a hardware-independent tool used to transport and store data, with a focus on what the data is. XML removes the constraint of sticking to pre-designed tags and gives developers the freedom to design new tags.
It was developed to create standardized specifications for creating custom markup languages. XML-based languages include RSS, Atom, and XHTML. It is neither a programming language nor a presentation language. It is known as a meta-language, or a language that can be used to define other languages.

XML documents must be well-formed, which means following a strict set of rules. Well-formed generally means that the document satisfies a list of syntax rules provided in the XML specification. A few examples of these syntax rules: the document contains only properly encoded legal Unicode characters, special syntax characters do not appear outside markup, and element tags are correctly nested. The document also includes a declaration that states the type of document it is and what processing rules should be applied.

In HTML, when a page is created, the processor tries to make sense of the page and generates the content of the page. It does not require strict rules regarding the format of the page. It will generate the page to the best of its ability even with errors present. In XML, however, if certain rules are violated or the processor cannot comprehend something, the processor will generate an error code and stop processing the file. This error-handling mechanism is referred to as ‘draconian’.

Further specifications of both languages, including limitations, are listed in the table below.

Definition
  HTML: Markup language for displaying web pages in a web browser. Designed to display data with focus on how the data looks.
  XML: Markup language that defines a set of rules for encoding documents that can be read by both humans and machines. Designed with focus on storing and transporting data.
Date when invented
  HTML: 1990
  XML: 1996
Extended from
  HTML: SGML
  XML: SGML
Type
  HTML: Static
  XML: Dynamic
Usage
  HTML: Display a web page.
  XML: Transport data between the application and the database. To develop other markup languages.
Processing/Rules
  HTML: No strict rules. Browser will still generate data to the best of its ability.
  XML: Strict rules must be followed or processor will terminate processing the file.
Language type
  HTML: Presentation
  XML: Neither presentation, nor programming
Tags
  HTML: Predefined
  XML: Custom tags can be created by the author
White Space
  HTML: Cannot preserve white space
  XML: Preserves white space
Limitations
  HTML: Data does not know itself very well. Data cannot change in response to environment. Data cannot be easily maintained. Cannot store or call on variables. Lacks the capability to define new structures by defining relationships between classes. Tags are not useful for exchanging the document between applications.
  XML: Cannot be used as a subtype of a sql_variant instance. Does not support casting or converting to either text or non text. Does not support the following column and table constraints. XML provides its own encoding. Collations apply to string types only. Cannot be compared or sorted. Cannot be used in Distributed Partitioned Views. Not well supported by browsers.

XML

The common method of specifying the contents and formatting of Web pages is through the use of hypertext documents. There are various languages for writing these documents, the most common being HTML (HyperText Markup Language). Although HTML is widely used for formatting and structuring Web documents, it is not suitable for specifying structured data that is extracted from databases. A new language, namely XML (Extensible Markup Language), has emerged as the standard for structuring and exchanging data over the Web. XML can be used to provide information about the structure and meaning of the data in the Web pages rather than just specifying how the Web pages are formatted for display on the screen. The formatting aspects are specified separately, for example by using a formatting language such as XSL (Extensible Stylesheet Language) or a transformation language such as XSLT (Extensible Stylesheet Language for Transformations or simply XSL Transformations).
Recently, XML has also been proposed as a possible model for data storage and retrieval, although only a few experimental database systems based on XML have been developed so far.

Structured, Semi-structured, and Unstructured Data

The information stored in databases is known as structured data because it is represented in a strict format. For structured data, it is common to carefully design the database schema using database design techniques. The DBMS then checks to ensure that all data follows the structures and constraints specified in the schema.

However, not all data is collected and inserted into carefully designed structured databases. In some applications, data is collected in an ad hoc manner before it is known how it will be stored and managed. This data may have a certain structure, but not all the information collected will have the identical structure. Some attributes may be shared among the various entities, but other attributes may exist only in a few entities. Moreover, additional attributes can be introduced in some of the newer data items at any time, and there is no predefined schema. This type of data is known as semistructured data. A number of data models have been introduced for representing semistructured data, often based on using tree or graph data structures rather than the flat relational model structures.

A key difference between structured and semistructured data concerns how the schema constructs (such as the names of attributes, relationships, and entity types) are handled. In semistructured data, the schema information is mixed in with the data values, since each data object can have different attributes that are not known in advance. Hence, this type of data is sometimes referred to as self-describing data. Consider the following example. We want to collect a list of bibliographic references related to a certain research project.
Some of these may be books or technical reports, others may be research articles in journals or conference proceedings, and still others may refer to complete journal issues or conference proceedings. Clearly, each of these may have different attributes and different types of information. Even for the same type of reference, say conference articles, we may have different information. For example, one article citation may be quite complete, with full information about author names, title, proceedings, page numbers, and so on, whereas another citation may not have all the information available. New types of bibliographic sources may appear in the future, for instance references to Web pages or to conference tutorials, and these may have new attributes that describe them.

Semistructured data may be displayed as a directed graph, as shown in Figure 12.1. As we can see, this model somewhat resembles the object model in its ability to represent complex objects and nested structures. In Figure 12.1, the labels or tags on the directed edges represent the schema names: the names of attributes, object types (or entity types or classes), and relationships. The internal nodes represent individual objects or composite attributes. The leaf nodes represent actual data values of simple (atomic) attributes.

Two properties of the semistructured model are worth noting:
1. The schema information (names of attributes, relationships, and classes or object types) in the semistructured model is intermixed with the objects and their data values in the same data structure.
2. In the semistructured model, there is no requirement for a predefined schema to which the data objects must conform, although it is possible to define a schema if necessary.

XML Hierarchical (Tree) Data Model

The basic object in XML is the XML document. Two main structuring concepts are used to construct an XML document: elements and attributes.
It is important to note that the term attribute in XML is not used in the same manner as is customary in database terminology, but rather as it is used in document description languages such as HTML and SGML. Attributes in XML provide additional information that describes elements, as we will see. There are additional concepts in XML, such as entities, identifiers, and references, but first we concentrate on describing elements and attributes to show the essence of the XML model. Figure 12.3 shows an example of an XML element called <Projects>. As in HTML, elements are identified in a document by their start tag and end tag. The tag names are enclosed between angled brackets < ... >, and end tags are further identified by a slash, </ ... >.
<?xml version="1.0" standalone="yes"?>
<Projects>
  <Project>
    <Name>ProductX</Name>
    <Number>1</Number>
    <Location>Bellaire</Location>
    <Dept_no>5</Dept_no>
    <Worker>
      <Ssn>123456789</Ssn>
      <Last_name>Smith</Last_name>
      <Hours>32.5</Hours>
    </Worker>
    <Worker>
      <Ssn>453453453</Ssn>
      <First_name>Joyce</First_name>
      <Hours>20.0</Hours>
    </Worker>
  </Project>
  <Project>
    <Name>ProductY</Name>
    <Number>2</Number>
    <Location>Sugarland</Location>
    <Dept_no>5</Dept_no>
    <Worker>
      <Ssn>123456789</Ssn>
      <Hours>7.5</Hours>
    </Worker>
    <Worker>
      <Ssn>453453453</Ssn>
      <Hours>20.0</Hours>
    </Worker>
    <Worker>
      <Ssn>333445555</Ssn>
      <Hours>10.0</Hours>
    </Worker>
  </Project>
</Projects>
Figure 12.3 A complex XML element called <Projects>.
Complex elements are constructed from other elements hierarchically, whereas simple elements contain data values. A major difference between XML and HTML is that XML tag names are defined to describe the meaning of the data elements in the document, rather than to describe how the text is to be displayed. This makes it possible to process the data elements in the XML document automatically by computer programs.
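As a minimal sketch of such automatic processing, the following Python fragment reads a shortened copy of the <Projects> element from Figure 12.3 and totals the hours per project. It uses the standard-library xml.etree.ElementTree module; the function name total_hours_per_project is an illustrative choice and does not come from the text.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <Projects> document in Figure 12.3 (first project only).
PROJECTS_XML = """<?xml version="1.0" standalone="yes"?>
<Projects>
  <Project>
    <Name>ProductX</Name>
    <Number>1</Number>
    <Location>Bellaire</Location>
    <Dept_no>5</Dept_no>
    <Worker><Ssn>123456789</Ssn><Last_name>Smith</Last_name><Hours>32.5</Hours></Worker>
    <Worker><Ssn>453453453</Ssn><First_name>Joyce</First_name><Hours>20.0</Hours></Worker>
  </Project>
</Projects>"""


def total_hours_per_project(xml_text):
    """Return {project name: total hours} by walking the element tree."""
    root = ET.fromstring(xml_text)            # root is the <Projects> element
    totals = {}
    for project in root.findall("Project"):   # complex elements: contain other elements
        name = project.findtext("Name")       # simple element: holds a data value
        hours = [float(w.findtext("Hours")) for w in project.findall("Worker")]
        totals[name] = sum(hours)
    return totals


print(total_hours_per_project(PROJECTS_XML))
```

Because the tag names describe the data rather than its appearance, the program can locate the values it needs by name alone.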
Also, the XML tag (element) names can be defined in another document, known as the schema document, to give a semantic meaning to the tag names that can be exchanged among multiple users. In HTML, all tag names are predefined and fixed; that is why they are not extendible. It is straightforward to see the correspondence between the XML textual representation shown in Figure 12.3 and the tree structure shown in Figure 12.1. In the tree representation, internal nodes represent complex elements, whereas leaf nodes represent simple elements. That is why the XML model is called a tree model or a hierarchical model. In Figure 12.3, the simple elements are the ones with the tag names <Name>, <Number>, <Location>, <Dept_no>, <Ssn>, <Last_name>, <First_name>, and <Hours>. The complex elements are the ones with the tag names <Projects>, <Project>, and <Worker>. In general, there is no limit on the levels of nesting of elements. It is possible to characterize three main types of XML documents:
■ Data-centric XML documents. These documents have many small data items that follow a specific structure and hence may be extracted from a structured database. They are formatted as XML documents in order to exchange them over or display them on the Web. These usually follow a predefined schema that defines the tag names.
■ Document-centric XML documents. These are documents with large amounts of text, such as news articles or books. There are few or no structured data elements in these documents.
■ Hybrid XML documents. These documents may have parts that contain structured data and other parts that are predominantly textual or unstructured. They may or may not have a predefined schema.
XML Documents, DTD, and XML Schema
An XML document is well formed if it follows a few conditions. In particular, it should start with an XML declaration to indicate the version of XML being used as well as any other relevant attributes, as shown in the first line in Figure 12.3 (the declaration is optional in XML 1.0, but recommended).
It must also follow the syntactic guidelines of the tree data model. This means that there should be a single root element, and every element must include a matching pair of start and end tags within the start and end tags of the parent element. This ensures that the nested elements specify a well-formed tree structure. A well-formed XML document is syntactically correct. This allows it to be processed by generic processors that traverse the document and create an internal tree representation. A standard model with an associated set of API (application programming interface) functions called DOM (Document Object Model) allows programs to manipulate the resulting tree representation corresponding to a well-formed XML document. However, the whole document must be parsed beforehand when using DOM in order to convert the document to that standard DOM internal data structure representation. Another API called SAX (Simple API for XML) allows processing of XML documents on the fly by notifying the processing program through callbacks whenever a start or end tag is encountered. This makes it easier to process large documents and allows for processing of so-called streaming XML documents, where the processing program can process the tags as they are encountered. This is also known as event-based processing. A well-formed XML document can be schemaless; that is, it can have any tag names for the elements within the document. In this case, there is no predefined set of elements (tag names) that a program processing the document knows to expect. This gives the document creator the freedom to specify new elements, but limits the possibilities for automatically interpreting the meaning or semantics of the elements within the document. A stronger criterion is for an XML document to be valid. In this case, the document must be well formed, and it must follow a particular schema.
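The two processing styles described above, DOM (build the whole tree first) and SAX (react to events as tags are read), can be contrasted with a short sketch. It uses Python's standard-library xml.dom.minidom and xml.sax modules. The tiny document DOC and the helper names are illustrative assumptions, not part of the text. The last helper treats "well formed" as "the parser accepts it", which is a simplification.

```python
import xml.dom.minidom
import xml.sax
from xml.parsers.expat import ExpatError

# Hypothetical two-project document used only for this sketch.
DOC = ("<Projects><Project><Name>ProductX</Name></Project>"
       "<Project><Name>ProductY</Name></Project></Projects>")


def names_with_dom(xml_text):
    """DOM style: parse the whole document into a tree, then navigate it."""
    dom = xml.dom.minidom.parseString(xml_text)
    return [node.firstChild.data for node in dom.getElementsByTagName("Name")]


class NameHandler(xml.sax.ContentHandler):
    """SAX style: callbacks fire as start tags, text, and end tags are read."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, tag, attrs):
        if tag == "Name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, tag):
        if tag == "Name":
            self.names.append("".join(self._buffer))
            self._in_name = False


def names_with_sax(xml_text):
    handler = NameHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.names


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as a well-formed document."""
    try:
        xml.dom.minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False
```

Both functions return the same names. The SAX version never holds the full tree in memory, which is why it suits large or streaming documents.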
That is, the element names used in the start and end tag pairs must follow the structure specified in a separate XML DTD (Document Type Definition) file or XML schema file. Figure 12.4 shows a simple XML DTD file, which specifies the elements (tag names) and their nested structures. Any valid documents conforming to this DTD should follow the specified structure. A special syntax exists for specifying DTD files, as illustrated in Figure 12.4. First, a name is given to the root tag of the document, which is called Projects in the first line in Figure 12.4. Then the elements and their nested structure are specified. When specifying elements, the following notation is used:
■ A * following the element name means that the element can be repeated zero or more times in the document. This kind of element is known as an optional multivalued (repeating) element.
■ A + following the element name means that the element can be repeated one or more times in the document. This kind of element is a required multivalued (repeating) element.
■ A ? following the element name means that the element can be repeated zero or one times. This kind is an optional single-valued (nonrepeating) element.
■ An element appearing without any of the preceding three symbols must appear exactly once in the document. This kind is a required single-valued (nonrepeating) element.
■ The type of the element is specified via parentheses following the element. If the parentheses include names of other elements, these latter elements are the children of the element in the tree structure. If the parentheses include the keyword #PCDATA or one of the other data types available in XML DTD, the element is a leaf node. PCDATA stands for parsed character data, which is roughly similar to a string data type.
■ The list of attributes that can appear within an element can also be specified via the keyword !ATTLIST. In Figure 12.4, the Project element has an attribute ProjId.
If the type of an attribute is ID, then it can be referenced from another attribute whose type is IDREF within another element. Notice that attributes can also be used to hold the values of simple data elements of type #PCDATA.
■ Parentheses can be nested when specifying elements.
■ A bar symbol ( e1 | e2 ) specifies that either e1 or e2 can appear in the document.
<!DOCTYPE Projects [
  <!ELEMENT Projects (Project+)>
  <!ELEMENT Project (Name, Number, Location, Dept_no?, Workers)>
  <!ATTLIST Project ProjId ID #REQUIRED>
  <!ELEMENT Name (#PCDATA)>
  <!ELEMENT Number (#PCDATA)>
  <!ELEMENT Location (#PCDATA)>
  <!ELEMENT Dept_no (#PCDATA)>
  <!ELEMENT Workers (Worker*)>
  <!ELEMENT Worker (Ssn, Last_name?, First_name?, Hours)>
  <!ELEMENT Ssn (#PCDATA)>
  <!ELEMENT Last_name (#PCDATA)>
  <!ELEMENT First_name (#PCDATA)>
  <!ELEMENT Hours (#PCDATA)>
]>
We can see that the tree structure in Figure 12.1 and the XML document in Figure 12.3 follow the structure specified by the XML DTD in Figure 12.4, except that Figure 12.3 as shown places the Worker elements directly under Project rather than inside a Workers element and omits the required ProjId attribute. To require that an XML document be checked for conformance to a DTD, we must specify this in the declaration of the document. For example, we could change the first line in Figure 12.3 to the following:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE Projects SYSTEM "proj.dtd">
When the value of the standalone attribute in an XML document is "no", the document needs to be checked against a separate DTD document or XML schema document (see below). The DTD file shown in Figure 12.4 should be stored in the same file system as the XML document, and should be given the file name proj.dtd. Alternatively, we could include the DTD document text at the beginning of the XML document itself to allow the checking. Although XML DTD is quite adequate for specifying tree structures with required, optional, and repeating elements, and with various types of attributes, it has several limitations. First, the data types in DTD are not very general. Second, DTD has its own special syntax and thus requires specialized processors.
It would be advantageous to specify XML schema documents using the syntax rules of XML itself so that the same processors used for XML documents could process XML schema descriptions. Third, all DTD elements are always forced to follow the specified ordering of the document, so unordered elements are not permitted. These drawbacks led to the development of XML schema, a more general but also more complex language for specifying the structure and elements of XML documents. XML Schema The XML schema language is a standard for specifying the structure of XML documents. It uses the same syntax rules as regular XML documents, so that the same processors can be used on both. To distinguish the two types of documents, we will use the term XML instance document or XML document for a regular XML document, and XML schema document for a document that specifies an XML schema. Figure 12.5 shows an XML schema document corresponding to the COMPANY database shown in Figures 3.5 and 7.2. Although it is unlikely that we would want to display the whole database as a single document, there have been proposals to store data in native XML format as an alternative to storing the data in relational databases. The schema in Figure 12.5 would serve the purpose of specifying the structure of the COMPANY database if it were stored in a native XML system. As with XML DTD, XML schema is based on the tree data model, with elements and attributes as the main structuring concepts. However, it borrows additional concepts from database and object models, such as keys, references, and identifiers. Here we describe the features of XML schema in a step-by-step manner, referring to the sample XML schema document in Figure 12.5 for illustration. We introduce and describe some of the schema concepts in the order in which they are used in Figure 12.5. Figure 12.5 An XML schema file called company. 
<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation>
    <xsd:documentation xml:lang="en">Company Schema (Element Approach) - Prepared by Babak Hojabri</xsd:documentation>
  </xsd:annotation>
  <xsd:element name="company">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="department" type="Department" minOccurs="0" maxOccurs="unbounded" />
        <xsd:element name="employee" type="Employee" minOccurs="0" maxOccurs="unbounded">
          <xsd:unique name="dependentNameUnique">
            <xsd:selector xpath="employeeDependent" />
            <xsd:field xpath="dependentName" />
          </xsd:unique>
        </xsd:element>
        <xsd:element name="project" type="Project" minOccurs="0" maxOccurs="unbounded" />
      </xsd:sequence>
    </xsd:complexType>
    <xsd:unique name="departmentNameUnique">
      <xsd:selector xpath="department" />
      <xsd:field xpath="departmentName" />
    </xsd:unique>
    <xsd:unique name="projectNameUnique">
      <xsd:selector xpath="project" />
      <xsd:field xpath="projectName" />
    </xsd:unique>
    <xsd:key name="projectNumberKey">
      <xsd:selector xpath="project" />
      <xsd:field xpath="projectNumber" />
    </xsd:key>
    <xsd:key name="departmentNumberKey">
      <xsd:selector xpath="department" />
      <xsd:field xpath="departmentNumber" />
    </xsd:key>
    <xsd:key name="employeeSSNKey">
      <xsd:selector xpath="employee" />
      <xsd:field xpath="employeeSSN" />
    </xsd:key>
    <xsd:keyref name="departmentManagerSSNKeyRef" refer="employeeSSNKey">
      <xsd:selector xpath="department" />
      <xsd:field xpath="departmentManagerSSN" />
    </xsd:keyref>
    <xsd:keyref name="employeeDepartmentNumberKeyRef" refer="departmentNumberKey">
      <xsd:selector xpath="employee" />
      <xsd:field xpath="employeeDepartmentNumber" />
    </xsd:keyref>
    <xsd:keyref name="employeeSupervisorSSNKeyRef" refer="employeeSSNKey">
      <xsd:selector xpath="employee" />
      <xsd:field xpath="employeeSupervisorSSN" />
    </xsd:keyref>
    <xsd:keyref name="projectDepartmentNumberKeyRef" refer="departmentNumberKey">
      <xsd:selector xpath="project" />
      <xsd:field xpath="projectDepartmentNumber" />
    </xsd:keyref>
    <xsd:keyref name="projectWorkerSSNKeyRef" refer="employeeSSNKey">
      <xsd:selector xpath="project/projectWorker" />
      <xsd:field xpath="SSN" />
    </xsd:keyref>
    <xsd:keyref name="employeeWorksOnProjectNumberKeyRef" refer="projectNumberKey">
      <xsd:selector xpath="employee/employeeWorksOn" />
      <xsd:field xpath="projectNumber" />
    </xsd:keyref>
  </xsd:element>
  <xsd:complexType name="Department">
    <xsd:sequence>
      <xsd:element name="departmentName" type="xsd:string" />
      <xsd:element name="departmentNumber" type="xsd:string" />
      <xsd:element name="departmentManagerSSN" type="xsd:string" />
      <xsd:element name="departmentManagerStartDate" type="xsd:date" />
      <xsd:element name="departmentLocation" type="xsd:string" minOccurs="0" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Employee">
    <xsd:sequence>
      <xsd:element name="employeeName" type="Name" />
      <xsd:element name="employeeSSN" type="xsd:string" />
      <xsd:element name="employeeSex" type="xsd:string" />
      <xsd:element name="employeeSalary" type="xsd:unsignedInt" />
      <xsd:element name="employeeBirthDate" type="xsd:date" />
      <xsd:element name="employeeDepartmentNumber" type="xsd:string" />
      <xsd:element name="employeeSupervisorSSN" type="xsd:string" />
      <xsd:element name="employeeAddress" type="Address" />
      <xsd:element name="employeeWorksOn" type="WorksOn" minOccurs="1" maxOccurs="unbounded" />
      <xsd:element name="employeeDependent" type="Dependent" minOccurs="0" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Project">
    <xsd:sequence>
      <xsd:element name="projectName" type="xsd:string" />
      <xsd:element name="projectNumber" type="xsd:string" />
      <xsd:element name="projectLocation" type="xsd:string" />
      <xsd:element name="projectDepartmentNumber" type="xsd:string" />
      <xsd:element name="projectWorker" type="Worker" minOccurs="1" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Dependent">
    <xsd:sequence>
      <xsd:element name="dependentName" type="xsd:string" />
      <xsd:element name="dependentSex" type="xsd:string" />
      <xsd:element name="dependentBirthDate" type="xsd:date" />
      <xsd:element name="dependentRelationship" type="xsd:string" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Address">
    <xsd:sequence>
      <xsd:element name="number" type="xsd:string" />
      <xsd:element name="street" type="xsd:string" />
      <xsd:element name="city" type="xsd:string" />
      <xsd:element name="state" type="xsd:string" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Name">
    <xsd:sequence>
      <xsd:element name="firstName" type="xsd:string" />
      <xsd:element name="middleName" type="xsd:string" />
      <xsd:element name="lastName" type="xsd:string" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Worker">
    <xsd:sequence>
      <xsd:element name="SSN" type="xsd:string" />
      <xsd:element name="hours" type="xsd:float" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="WorksOn">
    <xsd:sequence>
      <xsd:element name="projectNumber" type="xsd:string" />
      <xsd:element name="hours" type="xsd:float" />
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
1. Schema descriptions and XML namespaces. It is necessary to identify the specific set of XML schema language elements (tags) being used by specifying a file stored at a Web site location. The second line in Figure 12.5 specifies the file used in this example, which is http://www.w3.org/2001/XMLSchema. This is a commonly used standard for XML schema commands. Each such definition is called an XML namespace, because it defines the set of commands (names) that can be used. The file name is assigned to the variable xsd (XML schema description) using the attribute xmlns (XML namespace), and this variable is used as a prefix to all XML schema commands (tag names). For example, in Figure 12.5, when we write xsd:element or xsd:sequence, we are referring to the definitions of the element and sequence tags as defined in the file http://www.w3.org/2001/XMLSchema.
2.
Annotations, documentation, and language used. The next couple of lines in Figure 12.5 illustrate the XML schema elements (tags) xsd:annotation and xsd:documentation, which are used for providing comments and other descriptions in the XML document. The attribute xml:lang of the xsd:documentation element specifies the language being used, where en stands for the English language.
3. Elements and types. Next, we specify the root element of our XML schema. In XML schema, the name attribute of the xsd:element tag specifies the element name, which is called company for the root element in our example (see Figure 12.5). The structure of the company root element can then be specified, which in our example is xsd:complexType. This is further specified to be a sequence of departments, employees, and projects using the xsd:sequence structure of XML schema. It is important to note here that this is not the only way to specify an XML schema for the COMPANY database.
4. First-level elements in the COMPANY database. Next, we specify the three first-level elements under the company root element in Figure 12.5. These elements are named employee, department, and project, and each is specified in an xsd:element tag. Notice that if a tag has only attributes and no further subelements or data within it, it can be ended with the slash symbol (/>) directly instead of having a separate matching end tag. These are called empty elements; examples are the xsd:element elements named department and project in Figure 12.5.
5. Specifying element type and minimum and maximum occurrences. In XML schema, the attributes type, minOccurs, and maxOccurs in the xsd:element tag specify the type and multiplicity of each element in any document that conforms to the schema specifications. If we specify a type attribute in an xsd:element, the structure of the element must be described separately, typically using the xsd:complexType element of XML schema.
This is illustrated by the employee, department, and project elements in Figure 12.5. On the other hand, if no type attribute is specified, the element structure can be defined directly following the tag, as illustrated by the company root element in Figure 12.5. The minOccurs and maxOccurs attributes are used for specifying lower and upper bounds on the number of occurrences of an element in any XML document that conforms to the schema specifications. If they are not specified, the default is exactly one occurrence. These serve a similar role to the *, +, and ? symbols of XML DTD.
6. Specifying keys. In XML schema, it is possible to specify constraints that correspond to unique and primary key constraints in a relational database, as well as foreign keys (or referential integrity) constraints. The xsd:unique tag specifies elements that correspond to unique attributes in a relational database. We can give each such uniqueness constraint a name, and we must specify xsd:selector and xsd:field tags for it to identify the element type that contains the unique element and the element name within it that is unique via the xpath attribute. This is illustrated by the departmentNameUnique and projectNameUnique elements in Figure 12.5. For specifying primary keys, the tag xsd:key is used instead of xsd:unique, as illustrated by the projectNumberKey, departmentNumberKey, and employeeSSNKey elements in Figure 12.5. For specifying foreign keys, the tag xsd:keyref is used, as illustrated by the six xsd:keyref elements in Figure 12.5. When specifying a foreign key, the attribute refer of the xsd:keyref tag specifies the referenced primary key, whereas the tags xsd:selector and xsd:field specify the referencing element type and foreign key (see Figure 12.5).
7. Specifying the structures of complex elements via complex types. The next part of our example specifies the structures of the complex elements Department, Employee, Project, and Dependent, using the tag xsd:complexType.
We specify each of these as a sequence of subelements corresponding to the database attributes of each entity type by using the xsd:sequence and xsd:element tags of XML schema. Each element is given a name and type via the attributes name and type of xsd:element. We can also specify minOccurs and maxOccurs attributes if we need to change the default of exactly one occurrence. For (optional) database attributes where null is allowed, we need to specify minOccurs = "0", whereas for multivalued database attributes we need to specify maxOccurs = "unbounded" on the corresponding element. Notice that if we were not going to specify any key constraints, we could have embedded the subelements within the parent element definitions directly without having to specify complex types. However, when unique, primary key, and foreign key constraints need to be specified, we must define complex types to specify the element structures.
8. Composite (compound) attributes. Composite attributes are also specified as complex types, as illustrated by the Address, Name, Worker, and WorksOn complex types. These could have been directly embedded within their parent elements.
This example illustrates some of the main features of XML schema. There are other features, but they are beyond the scope of our presentation.
XML Languages
There have been several proposals for XML query languages, and two query language standards have emerged. The first is XPath, which provides language constructs for specifying path expressions to identify certain nodes (elements) or attributes within an XML document that match specific patterns. The second is XQuery, which is a more general query language. XQuery uses XPath expressions but has additional constructs.
XPath: Specifying Path Expressions in XML
An XPath expression generally returns a sequence of items that satisfy a certain pattern as specified by the expression. These items are either values (from leaf nodes) or elements or attributes.
The most common type of XPath expression returns a collection of element or attribute nodes that satisfy certain patterns specified in the expression. The names in the XPath expression are node names in the XML document tree that are either tag (element) names or attribute names, possibly with additional qualifier conditions to further restrict the nodes that satisfy the pattern. Two main separators are used when specifying a path: single slash (/) and double slash (//). A single slash before a tag specifies that the tag must appear as a direct child of the previous (parent) tag, whereas a double slash specifies that the tag can appear as a descendant of the previous tag at any level. Let us look at some examples of XPath as shown in Figure 12.6. The first XPath expression in Figure 12.6 returns the company root node and all its descendant nodes, which means that it returns the whole XML document. We should note that it is customary to include the file name in the XPath query. This allows us to specify any local file name or even any path name that specifies a file on the Web. For example, if the COMPANY XML document is stored at the location www.company.com/info.XML then the first XPath expression in Figure 12.6 can be written as
doc("www.company.com/info.XML")/company
This prefix would also be included in the other examples of XPath expressions.
1. /company
2. /company/department
3. //employee [employeeSalary gt 70000]/employeeName
4. /company/employee [employeeSalary gt 70000]/employeeName
5. /company/project/projectWorker [hours ge 20.0]
Figure 12.6 Some examples of XPath expressions on XML documents that follow the XML schema file company in Figure 12.5.
The second example in Figure 12.6 returns all department nodes (elements) and their descendant subtrees. Note that the nodes (elements) in an XML document are ordered, so the XPath result that returns multiple nodes will do so in the same order in which the nodes are ordered in the document tree.
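To make the path expressions of Figure 12.6 concrete, here is a rough Python sketch using the standard-library xml.etree.ElementTree module. This module supports only a subset of XPath and has no gt operator, so the salary qualifier is applied in ordinary Python code rather than inside the path. The sample document and its values are invented for illustration and only loosely follow the company schema.

```python
import xml.etree.ElementTree as ET

# Invented sample data; real documents would follow the schema in Figure 12.5.
COMPANY_XML = """<company>
  <department><departmentName>Research</departmentName></department>
  <department><departmentName>Administration</departmentName></department>
  <employee>
    <employeeName><firstName>John</firstName><lastName>Smith</lastName></employeeName>
    <employeeSalary>30000</employeeSalary>
  </employee>
  <employee>
    <employeeName><firstName>James</firstName><lastName>Borg</lastName></employeeName>
    <employeeSalary>75000</employeeSalary>
  </employee>
</company>"""

root = ET.fromstring(COMPANY_XML)   # root is the <company> element

# Expression 2, /company/department: direct children of the root,
# returned in the order in which they appear in the document.
departments = [d.findtext("departmentName") for d in root.findall("department")]


def high_earner_names(company, threshold=70000):
    """Approximate //employee[employeeSalary gt 70000]/employeeName."""
    result = []
    for emp in company.iter("employee"):          # // : employee at any depth
        if int(emp.findtext("employeeSalary")) > threshold:
            result.append(emp.find("employeeName"))
    return result


print(departments)
print([n.findtext("lastName") for n in high_earner_names(root)])
```

A full XPath 2.0 or XQuery processor would evaluate the qualifier inside the path itself. The split used here is only a workaround for the limited standard-library support.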
The third XPath expression in Figure 12.6 illustrates the use of //, which is convenient to use if we do not know the full path name we are searching for, but do know the name of some tags of interest within the XML document. This is particularly useful for schemaless XML documents or for documents with many nested levels of nodes. The expression returns all employeeName nodes that are direct children of an employee node, such that the employee node has another child element employeeSalary whose value is greater than 70000. This illustrates the use of qualifier conditions, which restrict the nodes selected by the XPath expression to those that satisfy the condition. XPath has a number of comparison operations for use in qualifier conditions, including standard arithmetic, string, and set comparison operations. The fourth XPath expression in Figure 12.6 should return the same result as the previous one, except that we specified the full path name in this example. The fifth expression in Figure 12.6 returns all projectWorker nodes and their descendant nodes that are children under a path /company/project and have a child node hours with a value greater than or equal to 20.0 hours. When we need to include attributes in an XPath expression, the attribute name is prefixed by the @ symbol to distinguish it from element (tag) names. It is also possible to use the wildcard symbol *, which stands for any element, as in the following example, which retrieves all elements that are child elements of the root, regardless of their element type:
/company/*
When wildcards are used, the result can be a sequence of different types of items. The examples above illustrate simple XPath expressions, where we can only move down in the tree structure from a given node. A more general model for path expressions has been proposed. In this model, it is possible to move in multiple directions from the current node in the path expression. These are known as the axes of an XPath expression.
Our examples above used only three of these axes: child of the current node (/), descendant or self at any level of the current node (//), and attribute of the current node (@). Other axes include parent, ancestor (at any level), previous sibling (any node at same level to the left in the tree), and next sibling (any node at the same level to the right in the tree). These axes allow for more complex path expressions. The main restriction of XPath path expressions is that the path that specifies the pattern also specifies the items to be retrieved. Hence, it is difficult to specify certain conditions on the pattern while separately specifying which result items should be retrieved. The XQuery language separates these two concerns, and provides more powerful constructs for specifying queries.
XQuery: Specifying Queries in XML
XPath allows us to write expressions that select items from a tree-structured XML document. XQuery permits the specification of more general queries on one or more XML documents. The typical form of a query in XQuery is known as a FLWR expression, which stands for the four main clauses of XQuery and has the following form:
FOR <variable bindings to individual nodes (elements)>
LET <variable bindings to collections of nodes (elements)>
WHERE <qualifier conditions>
RETURN <query result specification>
There can be zero or more instances of the FOR clause, as well as of the LET clause in a single XQuery. The WHERE clause is optional, but can appear at most once, and the RETURN clause must appear exactly once. Let us illustrate these clauses with the following simple example of an XQuery.
LET $d := doc("www.company.com/info.xml")
FOR $x IN $d/company/project[projectNumber = 5]/projectWorker,
    $y IN $d/company/employee
WHERE $x/hours gt 20.0 AND $y/employeeSSN = $x/SSN
RETURN <res> { $y/employeeName/firstName, $y/employeeName/lastName, $x/hours } </res>
1. Variables are prefixed with the $ sign. In the above example, $d, $x, and $y are variables.
2.
The LET clause assigns a variable to a particular expression for the rest of the query. In this example, $d is assigned to the document file name. It is possible to have a query that refers to multiple documents by assigning multiple variables in this way.
3. The FOR clause assigns a variable to range over each of the individual items in a sequence. In our example, the sequences are specified by path expressions. The $x variable ranges over elements that satisfy the path expression $d/company/project[projectNumber = 5]/projectWorker. The $y variable ranges over elements that satisfy the path expression $d/company/employee. Hence, $x ranges over projectWorker elements, whereas $y ranges over employee elements.
4. The WHERE clause specifies additional conditions on the selection of items. In this example, the first condition selects only those projectWorker elements that satisfy the condition (hours gt 20.0). The second condition specifies a join condition that combines an employee with a projectWorker only if they have the same ssn value.
5. Finally, the RETURN clause specifies which elements or attributes should be retrieved from the items that satisfy the query conditions. In this example, it will return a sequence of elements each containing <firstName, lastName, hours> for employees who work more than 20 hours per week on project number 5.
Figure 12.7 includes some additional examples of queries in XQuery that can be specified on XML instance documents that follow the XML schema document in Figure 12.5. The first query retrieves the first and last names of employees who earn more than $70,000. The variable $x is bound to each employeeName element that is a child of an employee element, but only for employee elements that satisfy the qualifier that their employeeSalary value is greater than $70,000. The result retrieves the firstName and lastName child elements of the selected employeeName elements.
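The way the FOR, WHERE, and RETURN clauses cooperate in the join example (employees who work more than 20 hours on project number 5) can be imitated with nested loops in Python. This is only a sketch of the evaluation idea, not how an XQuery processor is implemented. The sample document, its values, and the function name workers_over are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample data loosely following the company schema in Figure 12.5.
COMPANY_XML = """<company>
  <employee>
    <employeeName><firstName>John</firstName><lastName>Smith</lastName></employeeName>
    <employeeSSN>123456789</employeeSSN>
  </employee>
  <employee>
    <employeeName><firstName>Joyce</firstName><lastName>English</lastName></employeeName>
    <employeeSSN>453453453</employeeSSN>
  </employee>
  <project>
    <projectNumber>5</projectNumber>
    <projectWorker><SSN>123456789</SSN><hours>32.5</hours></projectWorker>
    <projectWorker><SSN>453453453</SSN><hours>20.0</hours></projectWorker>
  </project>
</company>"""


def workers_over(company, project_number, min_hours):
    """Nested-loop imitation of the FOR / WHERE / RETURN join query."""
    results = []
    for project in company.findall("project"):
        if project.findtext("projectNumber") != project_number:
            continue                                   # [projectNumber = 5]
        for x in project.findall("projectWorker"):     # FOR $x IN .../projectWorker
            for y in company.findall("employee"):      # FOR $y IN .../employee
                hours = float(x.findtext("hours"))
                same_ssn = y.findtext("employeeSSN") == x.findtext("SSN")
                if hours > min_hours and same_ssn:      # WHERE ... AND ...
                    results.append((                    # RETURN <res> ... </res>
                        y.findtext("employeeName/firstName"),
                        y.findtext("employeeName/lastName"),
                        hours,
                    ))
    return results


root = ET.fromstring(COMPANY_XML)
print(workers_over(root, "5", 20.0))
```

The worker with exactly 20.0 hours is excluded because the condition uses a strict greater-than comparison, matching gt in the query.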
The second query is an alternative way of retrieving the same elements retrieved by the first query. The third query illustrates how a join operation can be performed by using more than one variable. Here, the $x variable is bound to each projectWorker element that is a child of project number 5, whereas the $y variable is bound to each employee element. The join condition matches ssn values in order to retrieve the employee names. Notice that this is an alternative way of specifying the same query in our earlier example, but without the LET clause. XQuery has very powerful constructs to specify complex queries. In particular, it can specify universal and existential quantifiers in the conditions of a query, aggregate functions, ordering of query results, selection based on position in a sequence, and even conditional branching. Hence, in some ways, it qualifies as a full-fledged programming language. This concludes our brief introduction to XQuery. The interested reader is referred to www.w3.org, which contains documents describing the latest standards related to XML and XQuery. The next section briefly discusses some additional languages and protocols related to XML.
1. FOR $x IN doc("www.company.com/info.xml")//employee [employeeSalary gt 70000]/employeeName
   RETURN <res> { $x/firstName, $x/lastName } </res>
2. FOR $x IN doc("www.company.com/info.xml")/company/employee
   WHERE $x/employeeSalary gt 70000
   RETURN <res> { $x/employeeName/firstName, $x/employeeName/lastName } </res>
3. FOR $x IN doc("www.company.com/info.xml")/company/project[projectNumber = 5]/projectWorker,
       $y IN doc("www.company.com/info.xml")/company/employee
   WHERE $x/hours gt 20.0 AND $y/employeeSSN = $x/SSN
   RETURN <res> { $y/employeeName/firstName, $y/employeeName/lastName, $x/hours } </res>
Figure 12.7 Some examples of XQuery queries on XML documents that follow the XML schema file company in Figure 12.5.
1. Differentiate between attributes and elements in XML?
List some of the important attributes used in specifying elements in XML schema.

Elements can be parents of other elements and/or attributes and can be repeated within the same level of an XML document. They usually have start and end tags. An element consists of an opening tag, some content, and a closing tag; elements are the building blocks of an XML document. An element looks like:
<test>someValue</test>
Here, "test" is an element.

An attribute is additional information attached to a tag: an add-on to an element that can never exist alone. An attribute consists of a name-value pair attached to an element start-tag. Attribute values must be enclosed in single or double quotes. Attribute names must be unique within a single element occurrence.
<test id="5">somevalue</test>
Here, "id" is an attribute.

The decision to use elements versus attributes is mostly an architectural one; however, there are some key differences between them:
1. Elements can occur more than once (repeating) within the same level, while an attribute can appear only once on a given element. For example, it is okay to have two <test> elements under the same parent, but it would be invalid to have two id attributes on the same start-tag, as in <test id="5" id="6">.
2. Elements can be defined to be in a certain order, while attributes can appear in any order.

Some of the important attributes used in specifying elements in XML schema:

Attribute        Description
default          Optional. Specifies a default value for the attribute. The default and fixed attributes cannot both be present.
fixed            Optional. Specifies a fixed value for the attribute. The default and fixed attributes cannot both be present.
form             Optional. Specifies the form for the attribute. The default value is the value of the attributeFormDefault attribute of the element containing the attribute. Can be set to one of the following:
                 qualified - indicates that this attribute must be qualified with the namespace prefix and the no-colon-name (NCName) of the attribute
                 unqualified - indicates that this attribute is not required to be qualified with the namespace prefix and is matched against the NCName of the attribute
id               Optional. Specifies a unique ID for the element.
name             Optional. Specifies the name of the attribute. The name and ref attributes cannot both be present.
ref              Optional. Specifies a reference to a named attribute. The name and ref attributes cannot both be present. If ref is present, the simpleType element, form, and type cannot be present.
type             Optional. Specifies a built-in data type or a simple type. The type attribute can only be present when the content does not contain a simpleType element.
use              Optional. Specifies how the attribute is used. Can be one of the following values:
                 optional - the attribute is optional (this is the default)
                 prohibited - the attribute cannot be used
                 required - the attribute is required
any attributes   Optional. Specifies any other attributes with a non-schema namespace.

Differentiate between XML schema and XML DTD with suitable example.

The critical difference between DTDs and XML Schema is that XML Schema uses an XML-based syntax, whereas DTDs have a unique syntax held over from SGML DTDs. DTDs are often criticized because of this need to learn a new syntax, although the syntax itself is quite terse. The opposite is true for XML Schema: it is verbose, but because it is written with tags in XML itself, authors of XML should find its syntax less intimidating. The goal of DTDs was to retain a level of compatibility with SGML for applications that might want to convert SGML DTDs into XML DTDs. However, in keeping with one of the goals of XML, "terseness in XML markup is of minimal importance," there is no real concern with keeping the syntax brief.
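The point that XML Schema is written in XML, while a DTD is not, can be checked with any generic XML parser. The sketch below is only an illustration under stated assumptions: the two-element schema fragment and the two DTD declarations are hypothetical, and the helper names declared_elements and is_well_formed_xml are invented here. It parses the schema text as an ordinary XML document and shows that the DTD text is rejected as a standalone XML document.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"  # the XML Schema namespace

# Hypothetical schema fragment: a schema is itself an XML document.
SCHEMA_TEXT = """<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Name" type="xs:string"/>
  <xs:element name="Email" type="xs:string"/>
</xs:schema>"""

# Hypothetical DTD declarations: these use the DTD's own syntax, not XML elements.
DTD_TEXT = "<!ELEMENT Name (#PCDATA)>\n<!ELEMENT Email (#PCDATA)>"


def declared_elements(schema_text):
    """Read the top-level xs:element declarations with an ordinary XML parser."""
    root = ET.fromstring(schema_text)
    return [el.get("name") for el in root.findall(f"{{{XS}}}element")]


def is_well_formed_xml(text):
    """Return True if the text parses as a standalone XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(declared_elements(SCHEMA_TEXT))
print(is_well_formed_xml(SCHEMA_TEXT))
print(is_well_formed_xml(DTD_TEXT))
```

Being well-formed XML is what allows schema documents to be processed with the same tools used for the documents they describe; it says nothing about whether the schema itself is valid.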
LIST1 is an example using DTD and providing a schema definition for the content above, while LIST2 is an example using XML Schema to provide a schema definition (employee.xs).

LIST1: Employee Information DTD

    <!ELEMENT Employee_Info (Employee)*>
    <!ELEMENT Employee (Name, Department, Telephone, Email)>
    <!ELEMENT Name (#PCDATA)>
    <!ELEMENT Department (#PCDATA)>
    <!ELEMENT Telephone (#PCDATA)>
    <!ELEMENT Email (#PCDATA)>
    <!ATTLIST Employee Employee_Number CDATA #REQUIRED>

LIST2: Employee Information XML Schema (employee.xs)

    <?xml version="1.0"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <xs:element name="Employee_Info" type="EmployeeInfoType" />

      <xs:complexType name="EmployeeInfoType">
        <xs:sequence>
          <xs:element ref="Employee" minOccurs="0" maxOccurs="unbounded" />
        </xs:sequence>
      </xs:complexType>

      <xs:element name="Employee" type="EmployeeType" />

      <xs:complexType name="EmployeeType">
        <xs:sequence>
          <xs:element ref="Name" />
          <xs:element ref="Department" />
          <xs:element ref="Telephone" />
          <xs:element ref="Email" />
        </xs:sequence>
        <xs:attribute name="Employee_Number" type="xs:int" use="required"/>
      </xs:complexType>

      <xs:element name="Name" type="xs:string" />
      <xs:element name="Department" type="xs:string" />
      <xs:element name="Telephone" type="xs:string" />
      <xs:element name="Email" type="xs:string" />

    </xs:schema>

As we see, the syntax is completely different between the two. The DTD is written in its own unique syntax, whereas the XML Schema is written in XML format conforming to XML 1.0 syntax. LIST3 is an example of a valid XML document for the LIST2 XML Schema (employee.xml). With a DTD, a DOCTYPE declaration associates the DTD with the XML document; the XML Schema specification, by contrast, does not mandate any particular way of associating a schema with an XML document.
Accordingly, the association follows whatever method the validation tool actually in use implements. However, the XML Schema specification does define a way to write a hint that associates a schema with the XML document. The following content is inserted into the root element of the XML document.