Discussion of Architectural Issues Related to the ‘primary’ Exchange and
Storage Format of HL7 v3 MIF-based Artifacts
Len Gallagher
Draft distributed to Tooling Committee
June 28, 2004
Background
During the past several Tooling Committee meetings, we have derived a list of “Requirements”
and a list of “Nice-to-Have” features related to the exchange and storage format of HL7 v3 MIF-based artifacts. Given agreement on format and storage mechanism, the Tooling Committee then
intends to choose a ‘primary’ format and a ‘primary’ exchange mechanism that best satisfies the
requirements and desired features. During the discussion we did not try to limit the desired
features strictly to just formats; instead, we considered the combination of the pure formats along
with basic tools that might create, modify, combine and extract artifact representations from the
‘primary’ exchange and storage format chosen. The most recent version of this Architectural
Decisions document was distributed by Lloyd McKenzie to the Tooling Committee email list on
June 19, 2004.
An Issue
I think we are having a number of communication problems because each of us is thinking of
requirements and features in terms of exchange and storage formats that we’ve already had some
experience with. My own experience is with a centralized repository with access to objects in the
repository as a web service; Geoffrey is most familiar with the CVS tool and storage he has set up
for HL7 use and is probably viewing requirements and features in terms of what that tool, and its
client side Eclipse tool, can do; Woody is certainly familiar with the Access database storage and
exchange mechanism that we’ve been using for definition and exchange of static models up to
this point; Lloyd is most familiar with the XML structures of the MIF schema and the tools he
has available to him for managing those types of structures. Others are certainly viewing
requirements and features through their own experienced eyes. Unfortunately, I think that our
interpretations of requirements and features are quite different, because of our different
experiences, so that even when we agree on a requirement, or a desirable feature, we’re not
really agreeing on exactly what is meant. I know that I’m evaluating requirements and features
against what I know we can do, or not do, in the NIST sponsored HL7 Artifact Registry, and I’m
continually surprised when someone else is interpreting that same requirement or feature quite
differently than I am.
Abstract
This paper addresses generic properties of MIF-based artifact definitions such as updatability
constraints, merging and packaging rules, deconstruction into smaller MIF-based artifacts, etc.,
as well as transformations into representations other than the ‘primary’ storage format. It
attempts to address each of the requirements and features we’ve already accepted against some
of the options being considered as the ‘primary’ exchange and storage format. In many cases I
consider interpretations of the requirements and features against the storage and exchange
mechanism that I’m most familiar with, i.e. a centralized database repository of MIF-based
artifacts that maintains a clear distinction between the document that is registered (i.e. the MIF
representation of the artifact) and the various metadata that helps to identify, describe, categorize
and associate a single artifact or associations among artifacts. I consider both the advantages and
disadvantages of universal web services for all access to the MIF-based registry/repository.
Introduction
Begin by assuming the existence of a primary source MIF-based representation of each
registered HL7 artifact. The primary source MIF is the unique TC-approved current definition of
the artifact and must be kept distinguished from all other MIF-based representations of that
artifact. Other MIFs may be copies of the primary source MIF, with or without some added
restrictions, or may have one or more primary source MIFs embedded within them, but there is only
one primary source MIF for each artifact and both tool builders and tool users must know when
they are working on a primary source MIF and when they are working on MIFs that are copies or
other derivations of the primary source MIF.
The MIF schema is very flexible, so an artifact may be represented in an equivalent manner by
several different non-identical MIFs. For example, the MIF schema allows one to reference
another artifact or to import its definition into the current MIF definition. Importation of other
MIFs is very helpful for readability, but makes it difficult to update a MIF if it’s not clear which
parts are source material and which parts are imported from other primary sources. For this
reason a primary source MIF should follow a few very simple rules:
1. A primary source MIF may include other primary source MIFs only by reference. It may
not copy the content of another primary source MIF into itself and retain its status as the
primary source MIF representation of the underlying artifact. For example, a primary
source MIF may include MIF definitions of data types, classes, associations, triggers,
interactions and vocabulary only if there do not exist other HL7-identified primary source
MIFs for those artifacts.
2. A primary source MIF that is derived from another primary source MIF may not copy
annotations or history items from that artifact into itself unless they are clearly marked as
copied from that other MIF and not editable in this MIF. If definitions or constraints are
copied verbatim, e.g. from an HMD primary source to a message type primary source,
then in practice they must be treated as two separate definitions or constraints, one for the
HMD and one for the MessageType. An edit of either definition or constraint will not
automatically flow to the other. A constraint on the parent model could invalidate the
specification of a descendant model unless one rewrites the descendant to honor the
tighter restrictions of the parent.
3. If a primary source MIF references a primary source vocabulary value set, then it may not
re-define those vocabulary items within itself. Instead, it could copy the codes into an
annotation or description that is clearly marked as derived from another source and not
editable. If instead, the MIF re-defines the value set, it must be treated as a new and
distinct value set subject to all constraint consistency rules between a MIF and the parent
MIF it is derived from.
At some point HL7 should define an appropriate granularity for primary source MIF
representations. For example, does it make sense for all value sets to have separate primary
source MIF representations, or should there be just one primary source MIF-defined package of
value sets for all of the structural attributes in the RIM?
Existing HL7 Primary Source Artifacts
Given the notion of primary source defined above, it makes sense to apply that notion to the
current paradigm for model development. Suppose Woody is holding the Access DB Composite
Repository that was compiled from all artifact information leading up to Ballot 6. At this point
the Composite Repository is the primary source artifact for the entire HL7 v3 specification. It
holds the definitive RIM artifacts (RIM v2.02), the definitive Vocabulary (Voc v209), the
definitive Naming conventions (Naming v17), and all DMIMs, RMIMs, HMDs and
MessageTypes collected from HL7 technical committees as of December 6, 2003. It also
contains information merged in from the PubDB as to the structure of Ballot 6 documentation.
This database was then used to publish the official HL7 v3 specification for balloting in Ballot 6.
At this point one could consider the HTML ballot to be the primary source for the HL7 v3
specification or one could consider the Ballot 6 Composite Repository to be the primary source.
Let’s assume that the Composite 6 Repository is the primary source for the entire HL7 v3
specification, and that the HTML pages are just a transformation of it into human readable
format. Any errors found in the HTML ballot documents would have to be found and corrected
in the Composite 6 Repository.
In the time period after Ballot 6 and leading up to Ballot 7, the primary source Composite 6
Repository gets modified in several ways. First, the existing Composite 6 Repository is archived
for posterity – it remains as the primary source specification for Ballot 6. Then a copy of it is
stripped of domain specific artifacts and renamed to something like rim0202d-emptyRepository-20031231.mdb. At this point we have two primary source Repositories. The archived Composite
6 Repository is the primary source for all domain specific artifacts, i.e. DMIMs, RMIMs, HMDs
and MessageTypes, while the new emptyRepository v2.02 becomes the primary source for all
ongoing RIM, Naming and Vocabulary modifications.
Apparently there were no changes to the RIM during this period because the Ballot 7 Composite
Repository (dated 21 March 2004) still references RIM v2.02. However, there were several
vocabulary and naming updates, since the Ballot 7 Composite Repository now includes Vocab
v211 and Naming v18. Each of these vocabulary and naming modifications was made on the
single primary source emptyRepository v2.02 artifact, so there was no possibility of ambiguity.
The actions are even documented in the database itself! The primary source copy was held very
closely by several people and was only updated by editors one person at a time as HL7 decisions
were made and approved by appropriate HL7 technical committees. At this point Woody (I’m
making some assumptions here!) archived the emptyRepository v2.02 as the persistent primary
source for RIM and made a copy that was published in the Tools section of the HL7 website.
Then the existing DMIMs, RMIMs, HMDs and MessageTypes specific to a given Domain were
copied from the primary source Composite 6 Repository into copies of the emptyRepository and
distributed to every Domain committee responsible for domain artifacts under ballot. At this
point, the Domain specific contents of each repository sent to a technical committee become the
primary source artifact for the next version (i.e. Ballot 7) of all DMIMs, RMIMs, HMDs and
MessageTypes in that repository. The previous versions remain fixed with their existing
definitions, labeled as Ballot 6 specifications.
We now have a large number of primary source artifacts, many with duplicate information that
carries a high risk of ambiguous modification. The primary source for RIM, Formal Naming and
Vocabulary is the emptyRepository that has been archived, but since every Domain committee
has a non-primary source copy of it, with the tools to make changes to it, there is some risk that
we’ll have different RIM constraints, different vocabularies, and different naming conventions
floating around in the various copies. Another issue is that there is no way to be certain during
this period whether changes have been made to a specific DMIM, RMIM, HMD or
MessageType. Thus we must assume that some changes may have been made to all such models
and all must be re-balloted as if they have been modified. (NOTE: I’m sure there are some
exceptions to this where it is known that no changes have been made to some models!) But in
general, voters must assume that they’re going to see revised versions of every model sent back
out to the technical committees.
At the end of HL7 Working Group and Interim meetings, technical committees send their revised
repository databases back to HL7. The Publication committee, with assistance from HL7 staff,
merges all the information back into a new Repository called the Ballot 7 Composite Repository.
Any discrepancies that may have crept into the RIM, Vocabulary, and Naming artifacts are
removed at this point. Also any static models that do not satisfy appropriate naming and/or
vocabulary restrictions may get sent back to the domain committee for confirmation on what
should be balloted. This revised Composite 7 Repository now becomes the primary source for all
RIM v3 artifacts and is used to produce the Ballot 7 HTML specification. Finally this new
Composite 7 Repository is added to the existing Composite 6 Repository in the archive and a
new round of development and balloting begins.
The most recent emptyRepository was published by Woody on March 31, 2004, under the title
rim0203d-emptyRepository-20040331.mdb. It identifies its content as RIM v2.03, Naming v18,
and Vocabulary v221 (http://www.hl7.org/library/data-model/V3Tooling/RimRepos.zip). It has
probably already been populated with copies of Domain-specific models from the source
artifacts in the Composite 7 Repository and sent out to the domain technical committees for the
next set of revisions.
Proposed Future HL7 Primary Source Artifacts
The Tooling Committee is proceeding under the assumption that HL7 desires a change from its
existing primary source Artifact management to a new system based on MIF-derived artifacts.
The current Model Interchange Format (MIF) is a collection of XML Schemas with embedded
Schematron rules to enforce HL7 policy. The most recent MIF schemas are available to HL7
Members in a CVS repository at the NIAT research labs of UNLV (see Geoffrey Roberts for
access codes). They can be reached at node
cvs.hl7.nscee.edu:/home/hl7/cvs/HEAD/VSROOT/v3Schemas/uml/schemas. I downloaded a collection of schemas on June 17, 2004, and the most
recent date on any of the individual schemas was January 13, 2004. We know that more recent
schema pieces are still under development, especially vocabulary and dynamic model, but my
comments below are based on this June 17 extraction.
Every important HL7 artifact is or should be representable in one of the MIF schemas. The
schemas form a hierarchy with lower level schemas being imported into higher level ones as
follows:
    mifModelInterfaces
             \
              \
       mifStaticModelFlat     mifStaticModelSerialized
                  \                  /
                   \                /
                  mifStaticModelBase     mifDatatype
                           \                /
                            \              /
                             mifStaticBase     mifVocabulary
                                    \             /
                                     \           /
                                        mifBase
                                           |
                                   mifExtendedMarkup
                                           |
                                    pubDisplayMarkup
                                           |
                                   mifReferencedCodes
                                           |
                                    mifPatternTypes
The Tooling Committee is proceeding under the assumption that every HL7 artifact can be
represented as some XML element defined by the above schema hierarchy and thus would
validate to one of these schemas. The architectural issues now facing the Tooling Committee are
how to store and exchange these MIF-based artifacts during various stages of their development.
At one extreme, one could think of the primary source HL7 v3 specification being a single very
large XML file that contains elements representing each and every HL7 artifact. One could
extract elements from that file for each artifact, process that element in some HL7 tool, and then
add a new element or replace the existing element to effect a new constrained artifact or a
revision of the existing artifact. Under this scenario there is only one primary source v3
specification. All extractions would be non-primary source copies. The single primary source
specification would get updated at some point in time and the new XML document would
become the new primary source. The frequency of such updates is an architectural issue.
At another extreme, one could think of each artifact as having its own primary source MIF
representation. The collection of all such MIF representations would be the overall definitive
HL7 v3 specification. Under this scenario, one could remove an artifact representation from the
collection and have that MIF document retain its status as the primary source MIF representation
of that artifact. Each new MIF-representation added to the collection would carry an effective
date. Under this scenario one could extract a sub-collection of artifacts each with a different
effective date. The effective date of the sub-collection would then be the maximum of the
individual effective dates.
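
The effective-date rule for a sub-collection can be illustrated with a minimal Python sketch. The artifact identifiers and dates below are hypothetical, and the function name is an assumption made for illustration; it is not part of any HL7 tool.

```python
from datetime import date


def subcollection_effective_date(effective_dates):
    """Return the effective date of a sub-collection.

    `effective_dates` maps an artifact identifier to the effective date
    of its MIF representation. The sub-collection takes the latest one.
    """
    if not effective_dates:
        raise ValueError("a sub-collection needs at least one artifact")
    return max(effective_dates.values())


# Hypothetical identifiers and dates, for illustration only.
example = {
    "ARTIFACT-A": date(2003, 12, 6),
    "ARTIFACT-B": date(2004, 3, 21),
    "ARTIFACT-C": date(2004, 3, 31),
}
print(subcollection_effective_date(example))  # 2004-03-31
```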
In practice, there may not be that much difference between these two extremes. In the collection
example, the collection itself could be represented as a MIF package consisting of each of the
individual MIF elements. The MIF package could then be considered as the large XML file that
represents the entire HL7 v3 specification. For example, if each HL7 artifact is represented as a
single staticModel in the MIF, then the collection of staticModels would represent the entire HL7
v3 specification for static models. The only significant differences between these extremes might
be how the primary source representations are identified and maintained.
The management issues one faces in the proposed MIF-based representations of HL7 artifacts
are similar to those that currently exist with the Composite Repository approach. When
individual pieces, or large subsets, of the specification are extracted from the primary source,
then one has to know if the extractions are primary source objects or derived objects. Similarly,
when one extracts a primary source model, it may become a derived copy of the artifact it
models, but it may also become the primary source of a new constrained model that could
eventually become part of the whole HL7 v3 specification.
A Web-based Artifact Repository
Many of the comments I make below will be based on my experience in working with NIST
colleagues to define and populate the NIST sponsored HL7 Artifact Registry. The primary intent
of this registry is to allow HL7 member organizations to register conformance profiles and/or
templates of HL7 standards. However, in order to register profiles or templates it is necessary to
be able to reference and have easy access to the definitions of many HL7 artifacts, especially the
static models, data types, and vocabulary. For this reason we have populated the Registry
(http://www.nist.gov/hl7xreg) with a number of artifacts, most not yet formally adopted as HL7
standards. We believe that the Registry is essentially complete with static DMIM, RMIM, HMD,
and MessageType models from the Composite 7 Repository and from Naming v18 and
Vocabulary v221. Woody has kindly provided MIF representations for all HMDs and
MessageTypes in Ballots 6 and 7, so those registered items in the Registry have an associated
MIF-based repository item. The other artifacts are described, with links to graphics copied from
the HL7 Ballot, but have no MIF-based repository item.
One of the weaknesses of the Registry is the lack of automated version control, so I am not
proposing that the existing Artifact Registry be used for the day-to-day development of HL7
models, where intense version control is necessary during the development process. Instead, I
simply use this registry as a basis of comparison for further understanding the requirements and
features desired in any tool or tools we choose for MIF-based artifact development.
For comparison purposes, I’m characterizing the NIST Artifact Registry as a centralized
repository that holds all primary source MIF artifacts. In addition it may hold non-primary
source MIF artifacts that are derived from primary source artifacts either as packages or in some
other well-defined manner. I’m assuming universal ebXML Registry Services or simpler web
services access to the Registry for all submissions, queries and retrievals. I’m also assuming that
metadata describing Registry content is separate from the content of the MIF documents that are
registered. Even though names, descriptions, classifications and associations may be derived
from MIF content as Registry metadata, one is not able to use registry services directly to update
MIF content. Instead, all MIF updating will be considered as retrieval of a MIF document,
followed by editing of the MIF document via a non-registry tool, followed by re-submission of a
revised MIF document for that artifact. Of course, the MIF editing tool could be so completely
integrated with the Registry that the user sees it as a single edit in place, but we’ll be very careful
to distinguish edits on metadata from edits on the MIF document.
The Architectural Decisions Document
As mentioned in the background section above, my comments are based on the June 19, 2004,
version of the Architectural Decisions document under discussion by the Tooling Committee.
Currently this document has 4 sections, with each section being a question posed by the author
that needs a response from the Tooling Committee. At this point in time, only the first question
has been discussed by committee members in teleconference sessions.
The question is: What is the ‘primary’ exchange and storage format of HL7 v3 MIF-based
artifacts?
During these teleconferences we have agreed on a list of Requirements for the answer to this
question, a list of Nice-to-Have features, and a list of Possibilities. I discuss each requirement
and feature below.
Req #1: Must be MIF-based (use the MIF-defined data structure, not necessarily XML)
This is a non-controversial requirement because everyone agrees that all artifacts will have a
MIF representation. To say that the storage structure and exchange mechanism must be “MIF-based” could mean different things to different people. In my mind it simply means that the MIF-representation of each artifact, and each MIF-implied package of artifacts, must be derivable
from the storage or exchange mechanism in a straightforward manner. For example, if a user
knows the ArtifactId of a given artifact, then knowledge of that ID should be sufficient to enable
extraction of the artifact’s corresponding MIF representations from the storage mechanism or
from the exchange mechanism.
Under this interpretation a storage mechanism might be a file system, a database, or a master
XML document. Similarly, an exchange mechanism might be a single file, a collection of files, a
compressed (i.e. zipped) collection of files, a database, etc. The ArtifactId must always be easily
transformable either to a file name or to a query that allows extraction of the MIF
representations. Multiple representations are sometimes possible, especially if the storage
mechanism holds multiple versions of an artifact under development.
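
As a minimal sketch of this looser interpretation, the following Python fragment maps an ArtifactId either to a file name or to a query. The directory layout, the table and column names, and the identifier EXAMPLE_HD000001 are all assumptions made for illustration; none of them comes from an HL7 tool or from the NIST Registry.

```python
import sqlite3
from pathlib import PurePosixPath


def artifact_file(artifact_id, root="artifacts"):
    """Map an ArtifactId to a file name (hypothetical layout: <root>/<id>.mif)."""
    return PurePosixPath(root) / f"{artifact_id}.mif"


def artifact_versions(conn, artifact_id):
    """Return (version, mif) rows for an ArtifactId, oldest first.

    Several rows may come back when the store holds multiple versions
    of an artifact under development.
    """
    cur = conn.execute(
        "SELECT version, mif FROM artifact "
        "WHERE artifact_id = ? ORDER BY version",
        (artifact_id,),
    )
    return cur.fetchall()


# Hypothetical in-memory store with two versions of one artifact.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artifact (artifact_id TEXT, version INTEGER, mif TEXT)")
conn.executemany(
    "INSERT INTO artifact VALUES (?, ?, ?)",
    [
        ("EXAMPLE_HD000001", 1, "<staticModel/>"),
        ("EXAMPLE_HD000001", 2, "<staticModel/>"),
    ],
)

print(artifact_file("EXAMPLE_HD000001"))  # artifacts/EXAMPLE_HD000001.mif
print(len(artifact_versions(conn, "EXAMPLE_HD000001")))  # 2
```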
Under a more rigid interpretation of this requirement, the storage and exchange mechanisms
themselves must be MIF-based. Under the existing MIF schema, the only way this could be
accomplished literally would be for the storage and exchange mechanisms to be representable as
XML documents or transformable to XML documents that validate to one of the MIF schemas.
As seen in the discussion of the two extremes above, even this more rigid interpretation of the
requirement could be satisfied quite easily because a carefully constructed collection of MIF
documents can always be represented as a single MIF document that is a MIF “package” of the
individual items.
The NIST Artifact Registry does not yet satisfy this requirement, for two reasons. First, we only
have MIF representations of HMDs and of MessageTypes, so it would be necessary to populate the
Registry with agreed MIF representations for all of the other HL7 artifacts. Second, one is able to
extract a MIF-representation only one artifact at a time; it is not yet possible to extract a
collection of artifacts. Instead, one would write a query that returns a collection of artifact
identifiers, and then would write individual queries to extract each MIF-representation
separately. It is possible to extend Registry features to return a collection of MIFs as the result of
a query, probably as a MIF package, but that is not yet accomplished.
An important consideration under the Registry model for storage is that the Registry always
holds the primary source MIF representation. I believe this would also be the case under the CVS
model of storage. Any MIF document extracted from the Registry would be a non-primary
source representation. However, invocation of the Registry “Replace” service would allow
replacement of a given MIF-representation with a new MIF-representation. The replacement
would be marked with a new effective date and would become the new primary source of the
given artifact. The replaced version would no longer be available. Alternatively, invocation of
the Registry “Supersedes” service would allow a new version of a MIF representation to be
added to the Registry and the original version would point to the new version. Both versions
would remain available to users. Superseded versions of the representation would have separate
internal Registry identifiers, but all could hold the same ArtifactId as an external identifier. The
Registry could add a constraint to ensure that each ArtifactId has at most one most recent version
and could always return the most recent version unless earlier versions are specifically requested
as part of a query. The NIST Artifact Registry enforces this superseded notion by labeling the
items with a Ballot number and by maintaining a SupersededBy association from the older
version to the newer version.
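
The difference between the “Replace” and “Supersedes” behaviors described above can be sketched in a few lines of Python. This is an illustrative in-memory model only: the class and method names are my own assumptions, not the ebXML Registry Services API, and the artifact identifiers are hypothetical.

```python
import uuid
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Entry:
    registry_id: str          # internal Registry identifier, one per version
    artifact_id: str          # external ArtifactId, shared by all versions
    mif: str
    effective: date
    superseded_by: Optional[str] = None


class Registry:
    def __init__(self):
        self.entries = {}     # registry_id -> Entry

    def submit(self, artifact_id, mif, effective):
        entry = Entry(str(uuid.uuid4()), artifact_id, mif, effective)
        self.entries[entry.registry_id] = entry
        return entry

    def current(self, artifact_id):
        """Return the single most recent (non-superseded) version."""
        live = [e for e in self.entries.values()
                if e.artifact_id == artifact_id and e.superseded_by is None]
        if len(live) != 1:
            raise LookupError(f"expected one current version of {artifact_id}")
        return live[0]

    def replace(self, artifact_id, mif, effective):
        """Replace: the old version is no longer available."""
        old = self.current(artifact_id)
        del self.entries[old.registry_id]
        return self.submit(artifact_id, mif, effective)

    def supersede(self, artifact_id, mif, effective):
        """Supersede: both versions stay; the old one points to the new one."""
        old = self.current(artifact_id)
        new = self.submit(artifact_id, mif, effective)
        old.superseded_by = new.registry_id
        return new


reg = Registry()
reg.submit("ART-1", "<v1/>", date(2003, 12, 6))
reg.replace("ART-1", "<v2/>", date(2004, 3, 21))
reg.submit("ART-2", "<v1/>", date(2003, 12, 6))
reg.supersede("ART-2", "<v2/>", date(2004, 3, 21))

versions = lambda aid: sum(e.artifact_id == aid for e in reg.entries.values())
print(versions("ART-1"), versions("ART-2"))  # 1 2
```

In both cases a lookup by ArtifactId returns the newest version; the two services differ only in whether the earlier version can still be requested.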
Req #2: Must support the MIF package structure
This requirement was also non-controversial, likely because the MIF schemas allow
representation of collections of items as a single MIF-package. Thus a collection of artifacts can
always be represented as a single MIF package, and a MIF package can always (if it’s a package
of artifact representations!) be split apart into a collection of individual MIF-representations of
the artifacts. Some very simple transforms would accomplish the necessary representations.
Issues may arise as to when and how the transforms are invoked, but their existence is taken for
granted.
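
The two transforms might look like the following minimal Python sketch. The element names `package` and `staticModel` are placeholders chosen for illustration; the real MIF schemas use their own namespaces and element names, which are not reproduced here.

```python
import xml.etree.ElementTree as ET


def combine(models):
    """Wrap individual artifact elements in a single package element."""
    package = ET.Element("package")   # placeholder element name
    package.extend(models)
    return package


def split(package):
    """Return the individual artifact elements held by a package."""
    return [child for child in package if child.tag == "staticModel"]


a = ET.fromstring('<staticModel name="A"/>')
b = ET.fromstring('<staticModel name="B"/>')

pkg = combine([a, b])
print(ET.tostring(pkg, encoding="unicode"))
print([m.get("name") for m in split(pkg)])  # ['A', 'B']
```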
However, a more strict reading of this requirement may imply that the storage mechanism must
be able to support the same flexible construction of packages as does the MIF. For example, a
use case for a MIF package might be an identified artifact, plus all of the other artifacts that it
references in its definition. An RMIM may reference a collection of CMETs and a natural
package is the RMIM together with the RMIMs of all of the referenced CMETs. Another natural
package is an HMD together with all MessageTypes that are derived from it. Another natural
package is a static information model together with all of the parent models it is derived from.
Many different types of users may have well-established requirements for different packages of
artifacts.
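
A package of this “artifact plus everything it references” kind amounts to a transitive closure over references. The sketch below assumes the references have already been extracted into a simple dictionary; the identifiers are hypothetical and the function is illustrative, not an existing tool.

```python
from collections import deque


def package_members(root_id, references):
    """Return root_id plus every artifact reachable through references.

    `references` maps an artifact identifier to the identifiers it
    references directly. Members are listed in breadth-first order.
    """
    seen = {root_id}
    order = [root_id]
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        for ref in references.get(current, ()):
            if ref not in seen:
                seen.add(ref)
                order.append(ref)
                queue.append(ref)
    return order


# Hypothetical RMIM that references two CMETs, which share a third.
references = {
    "RMIM-1": ["CMET-A", "CMET-B"],
    "CMET-A": ["CMET-C"],
    "CMET-B": ["CMET-C"],
    "CMET-C": [],
}
print(package_members("RMIM-1", references))
# ['RMIM-1', 'CMET-A', 'CMET-B', 'CMET-C']
```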
It would be difficult in any hierarchical file system storage mechanism to anticipate all of these
different kinds of packages. At best one would have to choose a finite number of static package
definitions that could be supported naturally by the file system. More dynamic or more flexible
package specification would have to be based on queries or some other flexible way to identify
the components of the package.
If general purpose query support is available, then the file structure for storage or exchange is
almost irrelevant. If the primary source HL7 v3 specification is considered to be a very large
XML document, then an easily written W3C standard XPath or XQuery can usually be
constructed to return an appropriate subset of elements to represent the desired package of
artifacts. Similarly, if the primary source HL7 v3 specification is considered to be a database of
individual artifacts, then an ISO standard SQL query can usually be written to achieve the same
effect. The Tooling Committee does not want to choose a storage mechanism solely on the basis
of features in different query languages. The best storage mechanism would support both
XQuery and SQL types of queries. In the best possible use case, the SQL query could be used to
identify a small collection of artifacts that satisfy certain administrative metadata constraints and
then the XQuery could be used to do further filtering on the content of each MIF. It should be
straightforward to specify a virtual SQL schema and an equivalent virtual XML structure that
would support both types of queries. It is important to note that the next version of the ebXML
Registry standard intends (never guaranteed!) to specify support for both SQL and XQuery
retrieval.
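
The two-stage use case (SQL over administrative metadata, then a content filter on each MIF) can be sketched as follows. The table layout, the element and attribute names, and the identifiers are assumptions for illustration. Python's standard library offers only a limited XPath subset and no XQuery, so the second stage here is a simple XPath stand-in for what a real XQuery engine would do.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical metadata table; the MIF content is held as text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifact ("
    "artifact_id TEXT PRIMARY KEY, kind TEXT, ballot INTEGER, mif TEXT)"
)
conn.executemany(
    "INSERT INTO artifact VALUES (?, ?, ?, ?)",
    [
        ("HMD-1", "HMD", 7, '<staticModel><class name="Patient"/></staticModel>'),
        ("HMD-2", "HMD", 7, '<staticModel><class name="Order"/></staticModel>'),
        ("HMD-3", "HMD", 6, '<staticModel><class name="Patient"/></staticModel>'),
    ],
)


def find_artifacts(conn, kind, ballot, class_name):
    """Stage 1: SQL on metadata. Stage 2: XPath filter on MIF content."""
    rows = conn.execute(
        "SELECT artifact_id, mif FROM artifact "
        "WHERE kind = ? AND ballot = ? ORDER BY artifact_id",
        (kind, ballot),
    )
    hits = []
    for artifact_id, mif in rows:
        root = ET.fromstring(mif)
        if root.findall(f".//class[@name='{class_name}']"):
            hits.append(artifact_id)
    return hits


print(find_artifacts(conn, "HMD", 7, "Patient"))  # ['HMD-1']
```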
The NIST Artifact Registry has two different mechanisms for supporting packages. The first is
an explicit ebXML “RIM Package” (not identical to MIF package) that maintains the elements of
a package as HasMember associations. Each package is considered to be a collection of other
items in the Registry, including other packages. A registered item can be a member of an
arbitrary number of packages. Items can be added to or removed from packages simply by
creating or deleting HasMember associations. The second mechanism is SQL query. The Registry
has an SQL schema that allows submission of robust SQL queries to identify collections of
artifacts. If the result of a query is a collection of artifacts, then that collection could be returned to
the user as a single XML document that is a MIF package consisting of that same collection of
artifacts. As stated above, however, the NIST Registry does not yet support the combination of individual
artifacts into a single MIF document to represent a collection, but this could be a future activity.
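
The first mechanism, packages maintained as HasMember associations, can be modeled very simply. The sketch below is an illustrative in-memory model with hypothetical identifiers; it does not reproduce the ebXML Registry Information Model or the Registry's actual schema.

```python
class PackageIndex:
    """Packages represented purely as (package, member) associations."""

    def __init__(self):
        self.has_member = set()

    def add(self, package, member):
        self.has_member.add((package, member))

    def remove(self, package, member):
        self.has_member.discard((package, member))

    def members(self, package):
        return sorted(m for p, m in self.has_member if p == package)

    def packages_of(self, member):
        return sorted(p for p, m in self.has_member if m == member)


index = PackageIndex()
index.add("PKG-BALLOT7", "HMD-1")
index.add("PKG-BALLOT7", "PKG-CMETS")   # a package can contain a package
index.add("PKG-CMETS", "CMET-A")
index.add("PKG-REVIEW", "HMD-1")        # one item, several packages

print(index.members("PKG-BALLOT7"))     # ['HMD-1', 'PKG-CMETS']
print(index.packages_of("HMD-1"))       # ['PKG-BALLOT7', 'PKG-REVIEW']

index.remove("PKG-REVIEW", "HMD-1")
print(index.packages_of("HMD-1"))       # ['PKG-BALLOT7']
```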
Req #3: Must allow the storage of non-complete, non-valid MIF artifacts
Again, this requirement is non-controversial, likely because HL7 artifact developers will insist
upon it. An HL7 technical committee may develop a MIF-based artifact specification that it
knows is incomplete. It may be waiting for a single issue to be resolved before the specification
can be completed. A requirement of this committee is that it must have a mechanism for storage
and exchange of this incompletely defined artifact while the issue is being resolved. The
unresolved issue may even be referred to a ballot, so even the final ballot storage structures must
allow for the storage of non-complete, non-valid MIF artifacts.
This requirement is almost a requirement that the metadata describing a MIF document be
separate from the MIF document itself. That way the document can be described in a manner that
is structurally valid and stable even if the document itself is not.
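
A minimal sketch of this separation, assuming a hypothetical record layout: the metadata record is always structurally complete, while the MIF content it describes is stored as-is and may be incomplete or not even well-formed XML.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Record:
    artifact_id: str     # stable identifier, the collection point for metadata
    status: str          # e.g. "draft"; the status values are assumptions
    content: str         # MIF text, stored regardless of its validity
    well_formed: bool    # recorded as metadata, never used to reject content


def register(artifact_id, content, status="draft"):
    """Store content unconditionally; note whether it parses as XML."""
    try:
        ET.fromstring(content)
        ok = True
    except ET.ParseError:
        ok = False
    return Record(artifact_id, status, content, ok)


complete = register("ART-1", "<staticModel><class name='A'/></staticModel>")
unfinished = register("ART-2", "<staticModel><class name='A'>")

print(complete.well_formed, unfinished.well_formed)  # True False
```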
In the CVS storage model, the metadata describing a document in the repository consists
essentially of just three items:
• a file name for the document,
• an automatic method for version maintenance, and
• a path of names for the location of the file in the storage hierarchy.
So long as the file and path names are stable, this type of storage satisfies the requirement that
MIF documents can be in any state of completion.
In the single large XML MIF document model, the Schematron structure rules can be a substitute
for structure rules embedded in the XML descriptions. Thus it may be possible for a MIF to
validate as an XML document while still failing to comply with all of the Schematron rules to be
enforced for artifact specification. However, it may be very difficult to find tools that will
manage MIF documents that do not validate to some minimal XML structure. If we choose a
MIF document as the primary source HL7 v3 specification, then we will also have to describe
the storage structure in some way as an XML document that validates to some minimal structural
definition, even if many of the elements in the structure that represent artifacts do not validate to
a MIF element for that artifact.
In a database storage model, the MIF may be part of the database structure or the MIF may be
treated as if it were a separate file. Storing the MIF in the database as a Blob is essentially
equivalent to storing it as a separate file. If the MIF itself is broken apart and stored in database
structures, with appropriate database integrity constraints to enforce required artifact structure,
then this model is essentially equivalent to the XML model with schematron rules. It will be
impossible to store incomplete or invalid MIFs in the database unless the MIF structure rules can
be turned on and off and kept separate from the other database structure rules. This may be
difficult to achieve in some relational database systems.
The advantage of combining artifact storage structures with the required artifact structure is that
query can be extended to cover both metadata about the registered item and content of the
registered item. In the large MIF XML model, an XPath or XQuery constraint or query
specification can integrate metadata with artifact content. The same advantage is available in the
database model where the MIF is separated and represented by database structures. However, in
both cases this advantage is lost if the MIF representations of the artifacts do not validate to the
underlying XML or database schema.
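As a rough illustration of the large XML model, the sketch below filters on metadata and on artifact content in the same pass. The element and attribute names (package, artifact, status, staticModel, class) are invented for the example and are not taken from the MIF schemas. Python's standard ElementTree module supports only a subset of XPath, so the two conditions are combined with a small amount of Python rather than in one expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical package layout; none of these names come from the MIF schemas.
PACKAGE = """
<package>
  <artifact id="A1" status="draft">
    <staticModel name="PatientRegistration"><class name="Patient"/></staticModel>
  </artifact>
  <artifact id="A2" status="final">
    <staticModel name="LabOrder"><class name="Patient"/><class name="Order"/></staticModel>
  </artifact>
  <artifact id="A3" status="final">
    <staticModel name="Schedule"><class name="Appointment"/></staticModel>
  </artifact>
</package>
"""


def queryable(xml_text):
    """A document can be queried this way only if it at least parses."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(PACKAGE)

# Metadata constraint: only artifacts whose status attribute is 'final'.
final_artifacts = root.findall("./artifact[@status='final']")

# Content constraint: the static model must contain a class named 'Patient'.
matches = [
    a.get("id")
    for a in final_artifacts
    if a.find("./staticModel/class[@name='Patient']") is not None
]
```

The queryable helper shows the other half of the argument: a package containing a malformed artifact cannot be parsed at all, so neither the metadata nor the content of any artifact in it can be queried.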
The NIST Artifact Registry treats a registered item as something completely separate from the
metadata that describes it. The MIF-representation may even be missing, as we’ve seen for most
of the artifacts currently registered in this Registry. As in the CVS model, a MIF representation
may be in any state of construction. The only requirement is that it have some sort of an
identifier to be the collection point for metadata. In this Registry, a uuid identifier may be
supplied when metadata describing the item is submitted, or by default, a uuid identifier is
automatically generated by the Registration process. In most cases, even an incomplete or invalid
artifact will have an HL7-specific ArtifactId that is part of the metadata. An artifact will likely
retain the same ArtifactId as it passes through the various stages of the ballot or development
process. If it is desirable to retain copies of the specification at each step of that process, then an
ArtifactId may necessarily identify multiple registered items, each with a different uuid internal
identifier and each with a user visible ballotStatus or developmentStatus property that allows a
human (or a human written query) to distinguish among registered items having the same
ArtifactId.
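The separation described above can be sketched in a few lines. This is not the Registry's implementation; the class names, the field names and the ArtifactId value are hypothetical. The point is only that metadata is accepted whether or not the MIF content is present or valid, and that several registered items may share one ArtifactId while differing in uuid and ballotStatus.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RegisteredItem:
    artifact_id: str               # HL7-specific ArtifactId, shared across versions
    ballot_status: str             # user-visible property that distinguishes versions
    content: Optional[str] = None  # MIF text; may be missing, incomplete or invalid
    # A uuid is generated by default when the submitter does not supply one.
    item_uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


class Registry:
    def __init__(self) -> None:
        self._items: Dict[str, RegisteredItem] = {}

    def register(self, item: RegisteredItem) -> str:
        # The content is never inspected, so its state cannot block registration.
        self._items[item.item_uuid] = item
        return item.item_uuid

    def find(self, artifact_id: str, ballot_status: Optional[str] = None) -> List[RegisteredItem]:
        return [
            i for i in self._items.values()
            if i.artifact_id == artifact_id
            and (ballot_status is None or i.ballot_status == ballot_status)
        ]


registry = Registry()
# "EXAMPLE-ARTIFACT-1" is a made-up ArtifactId used only for this sketch.
registry.register(RegisteredItem("EXAMPLE-ARTIFACT-1", "draft"))  # no MIF yet
registry.register(RegisteredItem("EXAMPLE-ARTIFACT-1", "ballot", content="<staticModel>"))  # incomplete MIF
```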
Req #4: Must allow content to be stored in source control, or easily translatable to a format
that can be stored in source control.
When discussing HL7 artifacts, we’ve already agreed that the “source” is an XML document that
validates, or is close to validating, to one of the MIF XML schemas. Thus “stored in source
control” must mean stored as an XML document, or stored in a format that is easily translatable
to an XML document.
I think this requirement is intended to rule out very sophisticated database or other data structure
representations of an artifact that then require extensive processing to produce the MIF XML
representation. At first this seems like a non-controversial requirement, with the only issue being
what “easily translatable” means. However, consider the HL7 vocabulary model as defined in the
HL7 Common Terminology Services (CTS) standard. The model itself is defined in UML with
the primitive elements being Concept, CodeSystem, CodedConcept, ConceptRelationship,
ValueSet, ValueSetConstructor, and VocabularyDomain.
The RoseTree tool for building RMIMs and HMDs uses a relational database model of the
above vocabulary concepts and is able to produce hierarchical representations of each
VocabularyDomain. RoseTree will produce an XML representation of the hierarchy if requested
and it is this XML representation that is used to produce the HL7 ballot materials for structural
attributes in the RIM. Thus one could conclude that the hierarchical XML representation is the
primary source for VocabularyDomain artifacts. I have not yet studied the MIF-based
representation of vocabulary information, but for now let’s assume that it is analogous to the
existing XML representations. My point here is that I think it would be a mistake to consider the
XML hierarchy as the primary source for vocabulary. This is because the XML hierarchical
representation, while of immense value to the end user, does not satisfy the requirement that a
primary source be editable: the Concept codes and definitions used in the
hierarchy are not owned by the hierarchy. Instead, they are owned by the CodeSystem that
defines them. A Concept name, code, and definition may appear at many different points in
many different hierarchies, so it will be impossible to edit them in the source hierarchy
document. Instead, it is necessary to edit these items in the place where they are defined, namely
the relational representation of the UML model.
One might conclude that the MIF representation of ValueSet should be the primary source for
vocabulary information. Then the attributes in each class of a static model could reference the
appropriate ValueSet. The primary source of the static model would then be separable from the
primary source of the ValueSet, making it easier to edit the static model while only referencing a
ValueSet. But how does one edit a ValueSet? The value set may contain Concepts from multiple
CodeSystems and may have relationships defined by both ConceptRelationships and
ValueSetConstructors, each with a different owner. I think we cannot consider value sets as
the primary source for vocabulary information because the Concepts and relationships among the
Concepts will not normally be editable by a single editor in any complete XML representation of
the entire hierarchy of the ValueSet.
My purpose in bringing up this example is to show that it may be necessary to conclude that the
primary sources for vocabulary information be reduced to the primitive UML concepts listed
above. We may have a MIF-based primary source for CodeSystems, a separate MIF-based
primary source for ConceptRelationships, a separate MIF-based primary source for
ValueSetConstructor, etc. None of these primary sources will be very useful, by themselves, to
the end user. Thus it may be necessary to consider the MIF-representations of ValueSets and
VocabularyDomains to always be derived representations and never the primary source, and thus
never editable! If a ValueSetConstructor is a binary relationship between two ValueSets owned
by different organizations, how does one decide which ValueSet owns the association and is able
to update, delete, or replace it?
The NIST Registry handles vocabulary information in much the same way as does the existing
Composite Repository, i.e. each of the primitive UML concepts listed above is represented by
some structure in the Registry. Concepts and CodeSystems are separately registered, with the
requirement that a Concept be derived from a CodeSystem. A CodeSystem will then be
represented as a set of Concepts. ConceptRelationships are represented as HasSubtype
associations from a source Concept to a target Concept from the same CodeSystem. ValueSets
are separately registered with ValueSetConstructors represented as DerivedFrom associations
from a source ValueSet to a target ValueSet. A value set may also have HasMember associations
with Concepts. A VocabularyDomain has a DerivedFrom association with a ValueSet and a
UsesVocabulary association with a structural attribute of a RIM class. A VocabularyDomain is
represented in the Registry as a ClassificationScheme, so it has the same hierarchical structure as
that presented by RoseTree. Like RoseTree, the hierarchical structure of a VocabularyDomain is
derived from other more primitive concepts and is never considered to be the primary source for
vocabulary information.
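To make the "derived, never primary" point concrete, the following sketch stores only the association types named above and computes the expansion of a ValueSet from them. The concept codes and ValueSet names are invented. I am also assuming that a member Concept brings its HasSubtype descendants with it and that a ValueSet includes the members of every ValueSet it is DerivedFrom; the Registry itself may apply different rules.

```python
from collections import defaultdict

# Each association is (type, source, target). All identifiers are hypothetical.
ASSOCIATIONS = [
    ("HasSubtype", "A", "A1"),         # Concept -> Concept in the same CodeSystem
    ("HasSubtype", "A1", "A1a"),
    ("HasSubtype", "B", "B1"),
    ("HasMember", "VS_X", "A1"),       # ValueSet -> Concept
    ("HasMember", "VS_Y", "B"),
    ("DerivedFrom", "VS_ALL", "VS_X"),  # ValueSet -> ValueSet
    ("DerivedFrom", "VS_ALL", "VS_Y"),
]


def index(associations):
    """Group association targets by type and then by source."""
    out = defaultdict(lambda: defaultdict(list))
    for kind, source, target in associations:
        out[kind][source].append(target)
    return out


def expand_value_set(value_set, idx, _seen=None):
    """Return the set of Concept codes implied by a ValueSet."""
    seen = _seen if _seen is not None else set()
    if value_set in seen:  # guard against cyclic DerivedFrom chains
        return set()
    seen.add(value_set)

    concepts = set()
    stack = list(idx["HasMember"].get(value_set, []))
    while stack:  # follow HasSubtype transitively from each direct member
        code = stack.pop()
        if code not in concepts:
            concepts.add(code)
            stack.extend(idx["HasSubtype"].get(code, []))

    for other in idx["DerivedFrom"].get(value_set, []):
        concepts |= expand_value_set(other, idx, seen)
    return concepts


idx = index(ASSOCIATIONS)
expansion = expand_value_set("VS_ALL", idx)
```

Nothing in the expansion of VS_ALL is owned by VS_ALL except its two DerivedFrom associations, which is why an edit to the expanded form has no obvious home.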
The NIST Registry does not yet produce a MIF representation for any of these vocabulary
concepts, but it could do that if desired. However, the MIF representations of
VocabularyDomains and ValueSets would probably never be considered as “source control” for
these artifacts. Instead, the “source control” is embedded in the DerivedFrom, HasMember,
HasSubtype, UsesVocabulary, etc. associations with other vocabulary artifacts. An update of a
ValueSet or a VocabularyDomain may have to be considered as a collection of updates on other
vocabulary primitives that would then produce a new representation of the ValueSet or
VocabularyDomain.
Taking ValueSet as an example, it’s possible that a primary source MIF representation of the
ValueSet could be defined, but it would have to be clear that the ConceptCodes, Concept names,
Concept descriptions, and other ValueSets involved in the construction of the given ValueSet are
derived from those other sources and cannot be updated for “source control”. The only updatable
parts might be the direct HasMember and DerivedFrom associations with externally defined
Concepts and other ValueSets. The indirect relationships derived from these other artifacts
would not be updatable. In some sense a “useful” and “complete” MIF representation of a
ValueSet will likely not be editable as a primary source because it would be inheriting
HasSubtype relationships from externally defined Concepts and HasMember and DerivedFrom
relationships from other ValueSets.
If a “source control” document is not editable, then what good is this requirement? When an HL7
user desires to see a VocabularyDomain or a ValueSet, it certainly makes sense to require that
they be given a MIF-based representation of that artifact with fully expanded definitions of the
underlying Concepts and the underlying relationships. But to call that document “source control”
implies that it can be edited, and this I think is a mistake.
In summary, I think this is a relatively easy requirement to satisfy if we can agree that “source
control” representations of HL7 artifacts may not necessarily be “useful” or “complete”. Instead,
they may only represent the parts of the artifact that are updatable. For example, a primary
source MIF representation of the ActClass VocabularyDomain may consist simply of a name, a
text definition, a link to an attribute of the Act class, and a link to a single ValueSet. Even the
link to Act.classCode may be excluded because that link could be considered to be owned by the
attribute rather than by the VocabularyDomain.
It seems to me that this “source control” requirement is really a requirement on the tooling
architecture rather than a requirement on the storage mechanism for HL7 artifacts. If the tooling
architecture defines two MIF representations for each HL7 artifact: 1) the required content of a
primary source MIF, and 2) the required content of a complete derived MIF representation, then
it makes sense to require that storage mechanisms be able to accept primary source MIFs as new
artifacts, or as updates to existing artifacts, and then be able to produce both primary source
MIFs and/or complete derived MIFs as requested by a user. A later “nice-to-have” feature may
be a tool that can process edits to a complete derived MIF and produce the necessary
replacements for the underlying primary source MIFs.
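A minimal sketch of such a storage interface follows, assuming that a primary source MIF can be modelled as a dictionary whose "refs" entry lists links to other artifacts. The class, the method names, the ValueSet name and the placeholder definition text are all hypothetical, and references are assumed to be acyclic.

```python
class ArtifactStore:
    """Accepts primary source representations; derives complete ones on request."""

    def __init__(self):
        self._primary = {}

    def submit_primary(self, artifact_id, primary):
        """Store a new primary source or replace an existing one."""
        self._primary[artifact_id] = dict(primary)

    def get_primary(self, artifact_id):
        """Return only the updatable parts, with references left unresolved."""
        return dict(self._primary[artifact_id])

    def get_derived(self, artifact_id):
        """Return a complete representation with each reference expanded.

        Assumes references never form a cycle.
        """
        primary = self._primary[artifact_id]
        derived = {k: v for k, v in primary.items() if k != "refs"}
        for role, target_id in primary.get("refs", {}).items():
            derived[role] = self.get_derived(target_id)
        return derived


store = ArtifactStore()
# The ValueSet name and member codes are made up for the example.
store.submit_primary("VS_ActClass", {"name": "VS_ActClass", "members": ["X1", "X2"]})
store.submit_primary("ActClass", {
    "name": "ActClass",
    "definition": "(text definition goes here)",
    "refs": {"valueSet": "VS_ActClass"},
})

complete = store.get_derived("ActClass")
```

Replacing the primary source of VS_ActClass changes what get_derived returns for ActClass without ActClass itself being edited, which is the behaviour argued for above.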
Req #5: Must be easy for users to pass around
The term “easy” is further explained as having the following capabilities:
• encapsulatable,
• not overly large (emailable < 1MB zipped for commonly exchanged content),
• not corruptible when being exchanged,
• should have ‘similar’ number of units to pass around regardless of the number of artifacts
being communicated, i.e. always 1 or 2 files, not 10 sometimes, 20 sometimes, etc.
This requirement seems to be making the assumption that copies of the storage mechanism will
be passed around, rather than just a single artifact or collections of artifacts. For example, the
storage mechanism for the current HL7 v3 specification is a single Access database file that
encapsulates all of the artifact definitions. During development the database gets partitioned with
various pieces worked in parallel for the various domain committees. It can get fairly large,
especially when some larger CodeSystems are included in the file, but in general the pieces
communicated to the domain committees meet the desired size constraints. The database may
have integrity constraints to keep it self-consistent, but there’s always the possibility of
corruption when being handled by many users. If only part of the database is passed around, then
integrity constraints that span the parts will be ineffective and corruption can occur, or copies of
the common parts can get modified inadvertently by the parallel actions.
In the CVS model of storage this requirement seems to be satisfied because there is only one
primary source for the specification. Nodes or branches of the CVS structure can be extracted
and passed around with confidence that they shouldn’t get corrupted. However, whole nodes or
whole branches are unlikely to satisfy the desired support for “MIF packages” as discussed in
Req #2. If a package is defined as a collection of files from different branches of the CVS model,
then corruption may occur when trying to re-construct the path hierarchy.
The large XML file model of storage also assumes a single persistent primary source, with
subsets broken off for different users to work on. As seen above, the pieces that get broken off are
likely collections of artifacts, so they can always be represented as a MIF package and
transported as a single MIF file. But as with the existing Access database, it is likely that these
various pieces that get shipped around will have many common parts, and in the absence of
protection against modification these common parts can be modified differently by the domain
committees, thereby making it difficult to re-combine all of the parts in a consistent manner.
The centralized repository model of storage assumes that the primary source MIF-defined
artifacts are all stored at a central location, thus there is no danger of corruption as pieces get
passed around. The downside is that applications must communicate with the central repository
and they are not able to control the primary source MIF for anything other than completely new
artifacts. Whenever a subset of the repository is extracted it is no longer the primary source for
anything still in the repository. Whatever tool is used to make modifications to the extracted
pieces must then be able to update the repository to effect those modifications on the primary
source MIF artifacts.
As with the CVS and large XML file models, the NIST Registry finesses this requirement by assuming that there
is only one persistent primary source for artifact specification. It has the disadvantages of a
centralized repository in that external applications can update existing artifacts only by extracting
their primary source MIF representation, updating it locally, and then replacing or superseding it
with the modified version. In this manner the primary source storage format is easy to pass
around because it doesn’t get passed around. To be most effective this model assumes that all
applications will have realtime access to the repository. Primary source MIFs would get
downloaded one at a time and all references to other artifacts from within that MIF would be
downloaded only as needed during realtime processing. Thus the effectiveness of this model
depends critically on the ability to access the repository and retrieve artifacts as needed in real
time. Our real world communication links may not yet be up to this task – but I think we’re
getting close.
Req #6: Must permit multiple versions to exist on the same machine.
During artifact development different versions of an artifact will have the same name and the
same ArtifactId. The only way to distinguish among versions is to have a separate unique
identifier for each version or to have a special attribute of the artifact that distinguishes among
versions. It will often be the case that a user will want to access the existing version and the
potential new version of the artifact simultaneously, either to copy certain portions or to compare
differences.
A pure file system would fail this requirement because the only way to distinguish versions is by
file name. Thus the file names would be constantly changing and one could never identify all
versions of the file strictly by name; instead, one would have to invent some sort of name
extension to identify versions or carry along additional metadata with each name. A storage
model like CVS hides this versioning from the user. Database storage mechanisms can avoid this
problem by defining unique identifiers for each version and providing metadata attributes to help
distinguish among versions. It is interesting to note that the existing Access database Composite
Repository model uses uuid’s to distinguish among artifacts even while relying upon the
ArtifactId as a unique identifier when versioning is not an issue. At present, domain committees
only work on one version of the model at a time, but that could be relaxed even in the existing
process by relying more fully on the existing uuid’s.
If a CVS model of storage is used, then there are client-side applications (e.g. Eclipse) that can
communicate directly with the repository, hide all version identifiers from the user, yet ensure
that versioning is not corrupted when files are copied back into CVS storage after modification.
Strict enforcement of this requirement would likely mean hiding version management from the
user. This is the strong point of CVS repositories and probably the weakest point of all of the
other approaches we are considering. The large XML file storage model would probably have to
add element identifiers in order to distinguish among artifacts that are identical except for a
minor change of the wording in a description. All of the database solutions, XML or relational,
would have to add version identifiers for each artifact or provide a special attribute for version
management. In all cases this is straight-forward to do, but it is not always easy to shelter version
management from the end user. The user may have to manage uuid’s that are not very human
readable and would have to properly set any version attributes if they exist. This may be very
difficult to do correctly when multiple users are working on the same artifact in parallel.
The NIST Registry is based on the ebXML Registry model, which has automatic version
management as one of its goals. However, version 2.1 of the ebXML specification that NIST is
using does not enforce automatic version management. Instead, the ebXML model provides a
special attribute for userVersion and NIST is using this metadata attribute to help distinguish
among versions, e.g. a different value for each ballot cycle. NIST has also added Slots to the
model (Slot is the ebXML term for user-defined attributes) to help distinguish the ballotStatus of
HL7 artifacts. New metadata attributes could be added to help manage versions during artifact
development. As with all of the other database storage solutions, management of these version
attributes is the responsibility of the artifact submitter or modifier and thereby subject to human
error and confusion.
Nice-to-Have Features
The Architecture Decisions document lists a number of nice-to-have features for the storage and
exchange format of HL7 v3 MIF-based artifacts. Each feature carries a tag of 1, 2, or 3 to
indicate its relative priority. We re-state each feature and discuss it below.
Feature #1 – Priority Tag 1
Should be operating-system independent. I.e. It should be possible for applications running on
any operating system used by HL7 members to create and read the format. Expected operating
systems include Windows, Macintosh and Unix.
The existing Composite Repository is an Access database file, portions of which are passed
around among users. As such it runs only on Windows operating systems. However, even though
proprietary, the file format is well-known and some other tools are able to open and manipulate
the contents. Individual tables can be extracted in a variety of formats and reloaded into any
relational database. Full blown query and update capabilities are probably only available on
Windows operating systems.
A CVS repository resides in only one place with access through Internet protocols. As such
access is operating system independent. The existing HL7 CVS repository at UNLV is running
on a Unix server. However, client side tools associated with CVS (e.g. Eclipse) can execute on
multiple operating systems thereby removing the operating system of the server as an issue.
File systems can usually be compressed and exchanged as zipped files. Zipped formats are
somewhat interchangeable so it is reasonable to assume that file systems of artifacts could be
exchanged, and that most applications would be able to create and read the format.
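As a sketch of this kind of exchange, the Python standard library can bundle a small file system of artifacts into a single zip archive and check it on receipt. The file paths and XML snippets are placeholders, and the 1 MB limit is taken from the "emailable" guideline in Req #5.

```python
import io
import zipfile

MAX_BYTES = 1_000_000  # "emailable < 1MB zipped", per Req #5


def pack(artifacts):
    """Bundle {relative path: XML text} into one in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path, text in sorted(artifacts.items()):
            zf.writestr(path, text)  # str data is written as UTF-8
    return buf.getvalue()


def unpack(data):
    """Check every member's CRC, then return {relative path: XML text}."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        bad = zf.testzip()  # None when all members pass their CRC check
        if bad is not None:
            raise ValueError(f"corrupt member: {bad}")
        return {name: zf.read(name).decode("utf-8") for name in zf.namelist()}


# Placeholder paths and content, not real HL7 artifacts.
artifacts = {
    "domains/example/model1.mif": "<staticModel name='Model1'/>",
    "vocabulary/valueset1.mif": "<valueSet name='ValueSet1'/>",
}
archive = pack(artifacts)
restored = unpack(archive)
```

The archive is always one unit regardless of how many artifacts it holds, and each artifact's path travels inside the archive rather than depending on where the archive is saved.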
The NIST Registry storage format is a relational database with access through HTML browsers
or ebXML Registry Services. The query language is ISO standard SQL embedded in an ebXML
envelope. As such access is operating system independent. Currently the implementation runs
only on the Linux operating system, but that is not important because all access is via
standardized Internet protocols. Client-side tools that access the database would have to be able
to send and receive Internet protocols and parse the XML-based ebXML Registry Services.
XML databases may operate only on specific operating systems and access may be application
dependent. Although W3C standards, the XPath and XQuery access languages are read-only, so
updates would still be application specific. In addition, protocols for external access across the
internet would still need to be established.
Feature #2 – Priority Tag 1
Should not allow corruption of the data stored by simple tasks such as copy, paste or move. E.g.
If information is associated with the position of a file in a hierarchy, then moving the file could
change the information, thereby corrupting it.
Corruption of data can be ameliorated by enforcement of integrity constraints. So long as the
entire HL7 v3 MIF-based specification is kept together it is possible to enforce all integrity
constraints. However, as soon as portions of the specification are split off and distributed to
various committees or individuals for processing, constraints across pieces cannot be enforced so
corruption can creep in.
Any storage or exchange format that allows users to pass around primary source MIF-based
artifacts is subject to data corruption. Corruption can be reduced if primary source MIFs are
stored in a protected place and if modifications are checked for consistency as they come in.
The NIST Registry is centralized so it is possible to enforce constraints among all the pieces and
thereby reduce possibilities for data corruption. When a MIF-based artifact definition is extracted
from the Registry it is no longer considered to be a primary source. If it is later re-submitted to
the Registry, either as a new artifact or as a replacement for an existing artifact, integrity
constraints can be re-checked and re-verified as appropriate.
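One such re-check might look like the sketch below: before a submission is accepted, every reference it contains must resolve either to an item already in the Registry or to another artifact in the same submission. The data layout and the identifiers are hypothetical, and a real registry would enforce many more constraints than this one.

```python
def dangling_references(submission, registered_ids):
    """Return (artifact id, reference) pairs that resolve nowhere.

    submission maps an artifact id to a dict with an optional "refs" list;
    registered_ids is the set of ids already held centrally.
    """
    known = set(registered_ids) | set(submission)
    missing = []
    for artifact_id, artifact in sorted(submission.items()):
        for ref in artifact.get("refs", []):
            if ref not in known:
                missing.append((artifact_id, ref))
    return missing


# Hypothetical identifiers for illustration only.
registered = {"CodeSystem-1"}
submission = {
    "ValueSet-1": {"refs": ["CodeSystem-1", "ValueSet-2"]},
    "ValueSet-2": {"refs": ["CodeSystem-9"]},  # CodeSystem-9 is not known anywhere
}
problems = dangling_references(submission, registered)
```

A distributed copy cannot run this check reliably because it does not know the full set of registered identifiers, which is the weakness of the pass-around formats noted above.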
Feature #3 – Priority Tag 1
Should be natively amenable to source control (text is better than binary)
We’ve already discussed some of the implications of “source control” in Req #4 above. If
“source” is always considered to be a primary source XML MIF-based document, then “source
control” via simple text editors is possible. However, if the primary source is an XML or
relational database, then source control may be dependent upon the update and manipulation
facilities of the database management system.
Reconsider the Vocabulary examples discussed in Req #4 above. User friendly and complete
MIF-based representations of vocabulary artifacts may not be updatable, so the primary source
MIF-based artifact definitions may involve a large number of references to other primary source
MIF-based documents. Although each of these documents is subject to “source control” via a
simple text editor, keeping track of all of the primary source MIF documents may require
features of a database management system.
The best use of the NIST Registry assumes that all primary source MIF-based artifact definitions
will be kept in the Registry and that updates of the HL7 v3 specification will be through tools
that access the primary source MIF-based definitions in realtime interactions with the database
using ebXML standard Registry Services.
Feature #4 – Priority Tag 1
The format should be usable and exchangeable by all HL7 members. This means that neither the
format itself, nor any libraries or software required to use or interpret the format, should
impose fees or other barriers to its use.
The storage format for MIF-based artifacts is important only if that storage format is passed
around as primary source pieces of the HL7 v3 specification. This is not an issue in centralized
storage formats. In centralized storage formats, the main issues are the access and manipulation
protocols used to submit and modify the MIF-based artifacts.
The existing format is an Access database file, subsets of which are passed around. Thus the
storage format is an issue and it should be usable and exchangeable by all HL7 members. This
does not rule out the use of proprietary storage mechanisms provided that the MIF-based artifacts
can be extracted from that mechanism using tools that are obtainable and usable by all HL7
members.
Feature #5 – Priority Tag 2
Should permit multiple versions to be open within tool instances simultaneously. Examples
include the need to display an old version while working on changes to a new version;
performing maintenance on a published standard, while also working on developing a future
release.
This feature is very similar to Req #6 above regarding the existence of multiple versions on the
same machine. It differs only in that it requires that multiple versions be accessible within tool
instances simultaneously. As discussed above in Req #6, it is fairly straight-forward to adopt
identifiers or metadata attributes to distinguish among versions even if the artifact name and
ArtifactId remain fixed. The real problem is providing software support to hide the
complexities of version management from the end user.
Feature #6 – Priority Tag ??
Should allow splitting, distribution and recombination of component artifacts. Artifacts are any
element that has traditionally been assigned a publication identifier. Essentially this
requirement means that the format should allow distribution of ‘subsets’ of artifacts, and allow
support tooling to move artifacts from one storage location to another for easy exchange
between HL7 members.
This feature is closely related to Req #2 above saying that exchange and storage formats must
support the “MIF package” structure. This feature extends that notion to features that allow the
storage mechanism itself to provide flexible splitting, distribution and recombination of
packages. We concluded above that this kind of flexibility probably requires support for a
general purpose query and update language over the storage structures themselves. Since the
only standardized query languages are XQuery, XPath, SQL, and OCL, full support for this
feature probably requires full implementation of at least one of those languages. And since three
of these languages are read-only, the storage mechanism may still require that applications use a
proprietary mechanism for submission of new artifacts and maintenance of existing ones.
The NIST Registry allows the use of a streamlined version of SQL within an ebXML Request to
return collections of artifacts that satisfy the request. However, it requires use of other
standardized ebXML Registry Services for submitting new items to be registered, for updating
any metadata that describes a registered item, and for maintaining classification of artifacts and
associations among artifacts.
Returning to our notion of a primary source MIF representation of an artifact versus a complete
derived MIF representation, this feature requires support for the ability to break apart a complete
derived MIF into a self-contained and consistent collection of primary source MIF primitives.
It’s not clear if this should be a property of the storage and exchange format mechanisms or a
desired feature of other HL7 tools, e.g. a later version of RoseTree, that access the MIF-based
artifacts.
Feature #7 – Priority Tag 2
Should be amenable to search & replace and manual editing. (Primarily needed for power
users)
This feature applies more to the storage mechanism than it does to the exchange mechanism. It is
independent of whether the storage mechanism is centralized or local to the user’s system. It
requires that it be straight-forward to find the artifacts of interest, extract them, modify them with
some tool, and then re-submit them to the storage mechanism.
The search mechanism could be a browsing mechanism that makes it easy to navigate through
the storage structure to find the artifacts of interest or artifacts that are linked to the artifacts of
interest. The search mechanism could also involve support for query, but that is addressed by the
next feature below.
The desired manual editing feature is similar to Feature #3 above, but discussion of that feature
reveals that editing of a MIF representation requires that the representation be updatable. We’ve
seen that updatable MIFs may not always be complete. However, if our MIF architecture
requires that every artifact have a primary source MIF representation, then that artifact will
always be manually editable by text edits on the XML MIF representation. The downside is that
many complete MIF representations, like the full hierarchy implied by the recursive definition of
a ValueSet, will not be directly editable, thus requiring that modification of a ValueSet may
result in the modification of a larger number of vocabulary primitives.
The NIST Registry seems to satisfy this requirement even in its present form. It allows
identification of a collection of artifacts by SQL query, and then allows browsing between
artifacts that are linked to one another by different types of associations. It assumes that a
primary source MIF representation is directly available in the Registry and that such artifacts can
be copied from the Registry, modified by an external tool, or manually edited by a simple text
editor, and then submitted back to the registry either as a replacement for the copied artifact or as
a new artifact. It also assumes that compound MIF representation could be extracted from the
Registry, worked on locally for maintenance, then split apart into primary source MIF
representations and returned to the Registry either as replacements or new versions.
Feature #8 – Priority Tag 3
Should allow query and search capabilities intrinsic to the file format. (Note: query and search
capability is critical, it’s just a question of how amenable the raw format is)
This feature implies support for robust query in a language that is available to and understood by
a large number of tool developers. It probably requires that the query language be standardized
and supported by multiple vendors or multiple open source implementations. In my mind this
limits query support to languages such as XQuery, XPath, SQL and OCL.
Each of these languages is amenable to the raw format of the storage structure; in fact, they
essentially require that the raw format be viewable, at least virtually, as satisfying a static schema
definition. XQuery requires existence of an implied XML schema, SQL requires existence of an
implied SQL schema, and OCL requires existence of an implied object model.
The ebXML Registry standard allows embedding of a query language within a Registry Request.
The open source implementation that NIST uses only supports embedding of a reduced (but still
robust) subset of SQL. If SQL is supported, then the ebXML Registry standard requires the
existence of an implied SQL schema consisting of about 20 defined tables. Since SQL is used
only for query, not for update, it doesn’t matter if these tables are updatable base tables or non-updatable view representations, thereby making it much easier for various kinds of database
products to claim conformance. The Registry Services for submission or update are XML
elements that correspond to new insertions into these virtual tables; the difference is that an
entire Registry object is considered as a whole rather than as individual insertions into SQL
tables. For the current version of the standard, update is viewed as the replacement of a registry
object (i.e. rows in one or more tables) by a new registry object. Keep in mind that we are
talking only about metadata structures here; the item being registered, i.e. the MIF representation
of an artifact, is self contained and stored as a whole.
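The point about base tables versus views can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for whatever database a Registry implementation might use. The table, view, and column names are assumptions made up for this illustration; they are not the tables defined by the ebXML Registry standard, and the sample rows are invented.

```python
import sqlite3

# Hypothetical, heavily simplified stand-in for an implied SQL schema.
# All names below are illustrative assumptions, not ebXML Registry definitions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stored_object (id TEXT PRIMARY KEY, name TEXT, object_type TEXT);
    CREATE TABLE stored_association (source_id TEXT, target_id TEXT, assoc_type TEXT);

    -- Queries are written against these views only; whether they are backed
    -- by base tables or by something else is invisible to the query author.
    CREATE VIEW RegistryObjectView AS
        SELECT id, name, object_type FROM stored_object;
    CREATE VIEW AssociationView AS
        SELECT source_id, target_id, assoc_type FROM stored_association;
""")

# Invented sample metadata: one HMD that uses one RMIM and one CMET.
conn.executemany(
    "INSERT INTO stored_object VALUES (?, ?, ?)",
    [
        ("hmd-1", "Example HMD", "HMD"),
        ("rmim-1", "Example RMIM", "RMIM"),
        ("cmet-1", "Example CMET", "CMET"),
    ],
)
conn.executemany(
    "INSERT INTO stored_association VALUES (?, ?, ?)",
    [("hmd-1", "rmim-1", "uses"), ("hmd-1", "cmet-1", "uses")],
)


def artifacts_used_by(connection, source_id):
    """Return (id, name) of every artifact linked from source_id by a 'uses' association."""
    return connection.execute(
        """
        SELECT o.id, o.name
        FROM RegistryObjectView AS o
        JOIN AssociationView AS a ON a.target_id = o.id
        WHERE a.source_id = ? AND a.assoc_type = 'uses'
        ORDER BY o.id
        """,
        (source_id,),
    ).fetchall()
```

The query touches only metadata; the registered MIF representation itself would be retrieved as a whole once its identifier is known.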
Conclusions
This paper was written to help the author understand the real meaning behind some of the stated
requirements and nice-to-have features of the ‘primary’ exchange and storage format of HL7 v3
MIF-based artifacts. It helps the author to understand the requirements by comparing them
against the NIST Registry, which does claim to hold relevant HL7 artifacts. The primary purpose
of the NIST Registry is to support accessibility to HL7 conformance profiles and thus it must be
able to reference all ‘final HL7 standard’ artifacts. The NIST Registry was not designed to
support artifact development and intensive, pre-final version management, but it does make
sense to compare its capabilities against the stated requirements and features to see how it
measures up.
The first conclusion is that the Tooling Committee needs to make a distinction in its discussions
between primary source and non-primary source MIF-based representations. Primary source
MIFs will be updatable, but they may not be complete and exhaustive representations of an HL7
artifact, because the artifact may be recursively defined from many other primary source pieces. I
suggest that we define the contents of two types of MIF-based artifact representations 1) a
primary source MIF representation, which will only contain the parts of an artifact that are
updatable, and 2) a complete derived MIF representation, which will import the relevant parts
from other primary source MIF representations to present a complete and user friendly description
of the entire artifact structure, including those imported parts that cannot be updated. A primary
source MIF will only reference other artifacts, while a complete derived MIF may import
relevant parts of other artifacts that it uses or is dependent upon. For example, a complete
derived MIF of an HMD may import relevant parts of the RMIM or other information models it
depends upon, the CMETs it uses, and the datatype descriptions and value sets that are necessary
to construct and understand a valid MessageType.
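The distinction between the two kinds of representation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the key names, and the function `derive_complete` are hypothetical and do not reflect the actual MIF schemas. A primary source entry holds its own content plus references; the derived form follows those references and imports the referenced parts, marking them as not updatable.

```python
def derive_complete(artifact_id, store, path=()):
    """Build a complete derived representation from primary source entries.

    store maps an artifact id to a hypothetical primary source record:
    {"content": <own updatable parts>, "references": [<ids of other artifacts>]}.
    path holds the ids of the artifacts currently being expanded, so that a
    recursive definition (e.g. a value set defined in terms of itself) stops
    instead of looping forever.
    """
    if artifact_id in path:
        return {"id": artifact_id, "recursive_reference": True}
    source = store[artifact_id]
    return {
        "id": artifact_id,
        "content": source["content"],
        # Only the parts of the artifact that was asked for are updatable;
        # everything imported from other primary sources is read-only here.
        "updatable": not path,
        "imports": [
            derive_complete(ref, store, path + (artifact_id,))
            for ref in source["references"]
        ],
    }
```

Applied to an HMD entry that references an RMIM and a CMET, the result would nest copies of both under `imports`, while the stored primary source for the HMD would still contain only the two references.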
A second conclusion is that the Tooling Committee should make an up-front decision about the
pros and cons of a centralized artifact storage structure versus a storage structure that will be
passed among multiple users and tool developers. Many of the requirements and features seem to
assume that the artifact storage structure itself will be passed around rather than just MIF
representations of the artifacts or collections of artifacts. Once we know if the primary source
storage structures are centralized or passed around, discussion of the requirements and features
becomes much more focused. Depending on the answer to these alternatives, tool developers can
concentrate on MIF management versus MIF storage management.
A third conclusion is that the answer to the second conclusion may depend upon the complexity
of the storage model. If the ‘primary’ storage model is simply a collection of MIF-based artifact
definitions, then it is easy to think of MIF packages being passed around as a single zipped file
that when opened gives the whole collection of MIF artifact definitions. However, if the storage
structures assume the complexities of generalized query and other features of a database
management system, then it is much more difficult to think of passing around such systems and a
centralized storage repository makes much more sense. Note that a virtual centralized storage
mechanism doesn’t prohibit distribution; the centralized repository could be a distributed
database with replications in as many locations as are necessary to support efficient access. But
management of the replication would become a repository problem and not an HL7 user or tool
developer problem.
A fourth conclusion is that the Tooling Committee has not yet spent enough time talking about
‘ownership’ of MIF-based artifact definitions. It makes sense to assume that every HL7 artifact
is ‘owned’ by some HL7 technical committee and that only a designated representative of that
committee, i.e. an editor, can make updates to it. If this is the case, then the artifact storage
structure may necessarily become a bit more complicated in order to group those items that are
owned by the same technical committee. We may need a rule to say that an updatable package of
MIF artifacts may only contain primary source MIFs that are owned by the same user, thereby
making that user responsible for the entire contents while holding possession. The CVS storage
model makes this assumption as each node of the storage structure can be owned by a different
user and only the owner of a node may make modifications to the artifacts in that node. In all of
the database solutions discussed above, there is an assumption of different users or roles, each
with potentially different access and update privileges, so no changes to the models are necessary
to accommodate ‘ownership’ control of the artifact or metadata describing the artifact.
Possibilities
The first section of the Tooling Committee’s Architectural Decisions document lists a number of
possibilities for the ‘primary’ storage and exchange format of HL7 v3 MIF-based artifacts,
including:
• Directory structure of MIF files,
• Zip-file containing directory structure of MIF files,
• One big XML file,
• Relational database,
• XML database,
• Other ???
The Tooling Committee has not yet discussed these possibilities, and may end up changing them
significantly. However, I think we can eliminate a lot of debate if we agree to focus on the word
‘primary’ and regard multiple possibilities as derivations of ‘primary’.
Suppose we think of the entire HL7 v3 specification as being a collection of MIF-based artifacts.
To avoid confusion, let’s also assume that each of these artifacts is a primary source MIF, or
easily decomposable into a collection of primary source MIFs, all owned by the same technical
committee. Thus each MIF is updateable or replaceable by its owner. Suppose the MIFs are
groupable by owner into a directory structure of two levels, so that the first level is the entire
specification and each node at the next level is owned by a technical committee and consists only
of MIFs that can be updated by an editor of that committee. If necessary, we could allow
additional levels for sub-committees, or groupings into different kinds of models, but that seems
to be a fairly straightforward extension.
I’m suggesting that we adopt the first bullet above, i.e. directory structure of MIF files, as the
‘primary’ storage format for MIF representations of HL7 artifacts. This would have to be a static
directory structure, defined by the Tooling Committee, and well-understood by all users and all
tool developers. Users or tools would be able to ask for copies (i.e. non-primary sources) of all
MIFs stored at any node in the directory structure. Only the owners of that node would be able to
re-submit new versions of MIFs under that node. Adoption of such a ‘primary’ storage format
does not preclude other storage formats so long as they can virtually support the ‘primary’
format.
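A minimal sketch of the two operations just described, copying from a node and owner-only resubmission, might look as follows. Everything specific here is an assumption for illustration: the `.mif` file extension, the committee names, the `EDITORS` table, and the function names are hypothetical, and a real mechanism would also need versioning and validation.

```python
from pathlib import Path

# Hypothetical mapping from each committee node to its designated editors.
EDITORS = {"pharmacy": {"alice"}, "lab": {"bob"}}


def copy_node(root: Path, committee: str) -> dict:
    """Return non-primary copies of every MIF file stored under one committee node."""
    node = root / committee
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(node.glob("*.mif"))
    }


def resubmit(root: Path, committee: str, user: str, files: dict) -> None:
    """Store new versions of MIF files under a node; only that node's editors may do so."""
    if user not in EDITORS.get(committee, set()):
        raise PermissionError(f"{user} is not an editor for node '{committee}'")
    node = root / committee
    node.mkdir(parents=True, exist_ok=True)
    for name, text in files.items():
        (node / name).write_text(text, encoding="utf-8")
```

Any user can call `copy_node`, but `resubmit` raises `PermissionError` unless the caller is listed as an editor for that node, which is the ownership rule in its simplest form.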
Suppose we also adopt the second bullet above, i.e. zip-file containing directory structure of MIF
files, as a required feature of any HL7-condoned primary source MIF storage format. A user
should be able to request a zip file of any node in the directory structure and receive a single file
containing all of the necessary information to reconstruct the directory structure locally. The
Tooling Committee may have to choose one or more zip formats that satisfy this requirement. I
don’t know if GNU zip, Windows zip, tar files, etc. are mutually translatable. Owners of a node
should be able to resubmit a package of MIFs in the same zipped format and have them
expanded and properly acted upon by the storage mechanism.
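As a rough illustration of the zip requirement, the sketch below uses Python's standard zipfile module to package one node of a directory structure and to reconstruct it elsewhere. The function names are hypothetical, and the choice of the ordinary zip format with deflate compression is an assumption, not a recommendation about which format the Tooling Committee should pick.

```python
import zipfile
from pathlib import Path


def zip_node(node: Path, archive: Path) -> None:
    """Write every file under node into a single zip archive.

    Archive member names start with the node's own directory name, so the
    directory structure can be rebuilt from the archive alone. Empty
    directories are not preserved in this simple sketch.
    """
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(node.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(node.parent).as_posix())


def unzip_node(archive: Path, target_root: Path) -> None:
    """Recreate the zipped node, with its sub-directories, under target_root."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_root)
```

The archive should be written outside the node being zipped; otherwise a previously written archive would be picked up as one of the node's files.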
The third bullet above, i.e. one big XML file, could also be adopted as a required feature of any
HL7-condoned primary source MIF storage format. In this case it would be necessary for the
Tooling Committee to define a new XML format that would validate to the MIF schemas, but
that would also contain sufficient information to allow re-construction of the directory structure
specified as ‘primary’. I think the flexibility of the MIF schemas would allow this to be done
with each node of the ‘primary’ directory structure being a specific package of MIF artifacts. As
with the zip requirement, a user should be able to request this specialized MIF package for any
node in the ‘primary’ directory structure and the owner of that node should be able to re-submit
new versions using the same MIF package structure.
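To make the idea concrete, here is a sketch of packing one node into a single XML document and unpacking it again, using Python's standard xml.etree.ElementTree module. The `package` and `artifact` element names, the `name` and `path` attributes, and the `.mif` extension are all invented for this illustration. The real format would have to be defined by the Tooling Committee so that it validates against the MIF schemas, which this sketch makes no attempt to do.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def pack_node(node: Path) -> ET.Element:
    """Combine every MIF file under node into one hypothetical <package> element.

    Each file's relative path is recorded in a 'path' attribute so the
    directory structure can be rebuilt later.
    """
    package = ET.Element("package", {"name": node.name})
    for path in sorted(node.rglob("*.mif")):
        wrapper = ET.SubElement(
            package, "artifact", {"path": path.relative_to(node).as_posix()}
        )
        wrapper.append(ET.parse(str(path)).getroot())
    return package


def unpack_node(package: ET.Element, target_root: Path) -> None:
    """Rebuild the directory structure described by a <package> element."""
    node = target_root / package.get("name")
    for wrapper in package.findall("artifact"):
        out = node / wrapper.get("path")
        out.parent.mkdir(parents=True, exist_ok=True)
        # The first child of each wrapper is the original document's root element.
        ET.ElementTree(wrapper[0]).write(str(out), encoding="utf-8", xml_declaration=True)
```

The round trip preserves element structure and attributes rather than the exact bytes of each file, which is one of the trade-offs of the single-file option compared with the zip option.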
The other bullets above, e.g. relational database, XML database, etc. could be format options
offered by any HL7-condoned primary source MIF storage mechanism. They would not be
required or precluded. It’s possible that different repositories may cooperate with one another to
hold identical collections of MIF-based artifacts (using repository replication services) while
offering different database or other format options to different sets of users.
The Other 3 Sections of the Architectural Decisions document
I have still not thought about the requirements and features discussed in these three sections.
However, if others find my deliberations useful, I could expand this paper to cover those topics.