A Framework for Information Interoperability
Len Seligman and Arnon Rosenthal, MITRE
To rapidly respond to new opportunities and threats, both government and industry are looking for
faster, cheaper ways of sharing information via computer systems. In response, vendors offer a new
“solution” every few years, such as data warehouses, Web services, “enterprise information integration”
tools, and ontologies. While these are all potentially useful when applied appropriately, none is a silver
bullet. Each organization has to assess its own needs and the best approach to meet them.
In all cases, the goal is to make available information that sources have and are willing to export. The
framework presented in this article can be used to evaluate interoperability approaches. We'll describe
the most common architectures available for achieving interoperability and the challenges that still stand
in the way.
There are two main types of information interoperability:
Exchange, in which a producer (such as the Department of Defense) provides information to a
consumer (such as NATO), and the information is transformed to suit the consumer’s needs.
Integration, in which information from multiple sources is not only transformed but also
correlated and fused. In general, the consumer sees a single, coherent view rather than all the
systems’ opinions.
Exchange requires addressing the first three problem levels in Figure 1. Integration requires that all
four levels be addressed. You can use these problem levels to help you analyze proposed
interoperability solutions.
Level 1: Overcome geographic distribution and infrastructure heterogeneity.
Data can be widely distributed geographically. In addition, to access the data you must overcome several
types of infrastructure heterogeneity, including:
Different data-structuring primitives, such as relational database tables versus XML versus
objects
Different data manipulation languages (such as SQL or XQuery), proprietary data languages, and
sources with no query language that require use of a general-purpose programming language
(e.g., Java)
Different platforms, operating systems, networks, etc.
Level 1 challenges are not as resource-consuming as the others because off-the-shelf middleware
products handle most of these challenges. In certain environments (e.g., tactical military applications),
however, significant engineering is still required at this level.
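As a rough illustration of what such middleware does at Level 1, the following minimal sketch hides two different data-structuring primitives (a relational table and an XML document) behind one record interface. The `Source` protocol, the class names, and the `unit` demo data are all hypothetical and are not taken from any product discussed here.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, List, Protocol


class Source(Protocol):
    """Hypothetical uniform interface: every source yields plain dict records."""

    def records(self) -> Iterator[Dict[str, str]]: ...


class RelationalSource:
    """Wraps a relational table that is read with SQL."""

    def __init__(self, conn: sqlite3.Connection, table: str) -> None:
        self.conn = conn
        self.table = table

    def records(self) -> Iterator[Dict[str, str]]:
        # The table name comes from trusted demo code, not from user input.
        cursor = self.conn.execute(f"SELECT * FROM {self.table}")
        columns = [col[0] for col in cursor.description]
        for row in cursor:
            yield {name: str(value) for name, value in zip(columns, row)}


class XmlSource:
    """Wraps an XML document whose elements named record_tag are the records."""

    def __init__(self, xml_text: str, record_tag: str) -> None:
        self.root = ET.fromstring(xml_text)
        self.record_tag = record_tag

    def records(self) -> Iterator[Dict[str, str]]:
        for element in self.root.iter(self.record_tag):
            yield {child.tag: (child.text or "") for child in element}


def build_demo_sources() -> List[Source]:
    """Create one relational and one XML source holding made-up data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE unit (name TEXT, strength INTEGER)")
    conn.execute("INSERT INTO unit VALUES ('Alpha', 120)")
    xml_text = "<units><unit><name>Bravo</name><strength>95</strength></unit></units>"
    return [RelationalSource(conn, "unit"), XmlSource(xml_text, "unit")]


def read_all(sources: List[Source]) -> List[Dict[str, str]]:
    """Read every source through the same interface, ignoring how it is stored."""
    return [record for source in sources for record in source.records()]
```

Calling `read_all(build_demo_sources())` returns both records as dictionaries with the same keys. The sketch only removes the structural differences; whether `strength` means the same thing in both sources is a Level 2 question.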
Figure 1: The four levels of information integration
Level 2: Match semantically compatible attributes.
Some independently developed information systems use the same terms for the same concepts, but
many don’t. Sometimes, these differences in meaning are quite subtle. For example, in one system,
“number-of-employees” may include full-time and part-time employees but not contractors, whereas in
another system, it includes all full-time workers, regardless of whether they are regular employees or
contractors. If users combine results across systems without understanding these details, the resulting
data is unlikely to satisfy the needs of the application.
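One way to make such differences visible is to record, next to each attribute, what it actually counts. The sketch below is only an illustration under that assumption: the `AttributeDefinition` structure, the system names, and the category labels are hypothetical, and real semantic matching usually needs far richer descriptions.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AttributeDefinition:
    """Hypothetical metadata describing what an attribute really counts."""

    system: str
    name: str
    populations: FrozenSet[str]  # worker categories included in the count


def semantically_compatible(a: AttributeDefinition, b: AttributeDefinition) -> bool:
    """Two attributes match only if they count the same populations."""
    return a.populations == b.populations


def describe_mismatch(a: AttributeDefinition, b: AttributeDefinition) -> str:
    """List the categories that only one of the two systems includes."""
    only_a = sorted(a.populations - b.populations)
    only_b = sorted(b.populations - a.populations)
    return f"{a.system} only: {only_a}; {b.system} only: {only_b}"


# The two definitions from the example above, with made-up system names.
HR = AttributeDefinition(
    system="hr_system",
    name="number-of-employees",
    populations=frozenset({"full-time employee", "part-time employee"}),
)
FINANCE = AttributeDefinition(
    system="finance_system",
    name="number-of-employees",
    populations=frozenset({"full-time employee", "full-time contractor"}),
)
```

Here `semantically_compatible(HR, FINANCE)` is false even though both attributes share the name "number-of-employees", and `describe_mismatch` shows why: one system adds part-time employees, the other adds full-time contractors.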
Level 3: Mediate between diverse representations.
Integrators often must reconcile different representations of the same concept. For example, one system
might measure altitude in meters from the earth’s surface while another measures it in miles from the
earth’s center. In the future, application developers may define interfaces in terms of
abstract attributes using “self-description”—for example, Altitude (datatype=integer, units=miles).
Mediators can use these descriptions to shield users from the representational details.
Levels 2 and 3 can be addressed by developing mappings across systems.
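The altitude example can be sketched as a small mediator that reads a self-describing value and converts it to whatever representation the consumer asks for. This is a simplified illustration, not a prescribed design: the `Altitude` structure is hypothetical, and the conversion assumes a spherical Earth with a mean radius of 6,371 km, which a real system would replace with a proper geodetic model.

```python
from dataclasses import dataclass

METERS_PER_MILE = 1609.344    # international mile, exact by definition
EARTH_RADIUS_M = 6_371_000.0  # assumption: mean radius of a spherical Earth


@dataclass(frozen=True)
class Altitude:
    """Hypothetical self-describing attribute: a value plus its representation."""

    value: float
    units: str      # "meters" or "miles"
    reference: str  # "surface" or "center"


def to_meters_above_surface(alt: Altitude) -> float:
    """Normalize any supported representation to meters above the surface."""
    if alt.units == "meters":
        meters = alt.value
    elif alt.units == "miles":
        meters = alt.value * METERS_PER_MILE
    else:
        raise ValueError(f"unsupported units: {alt.units}")
    if alt.reference == "surface":
        return meters
    if alt.reference == "center":
        return meters - EARTH_RADIUS_M
    raise ValueError(f"unsupported reference: {alt.reference}")


def convert(alt: Altitude, units: str, reference: str) -> Altitude:
    """Re-express an altitude in the representation the consumer expects."""
    meters = to_meters_above_surface(alt)
    if reference == "center":
        meters += EARTH_RADIUS_M
    elif reference != "surface":
        raise ValueError(f"unsupported reference: {reference}")
    if units == "miles":
        value = meters / METERS_PER_MILE
    elif units == "meters":
        value = meters
    else:
        raise ValueError(f"unsupported units: {units}")
    return Altitude(value, units, reference)
```

For instance, `convert(Altitude(10_000, "meters", "surface"), "miles", "center")` yields roughly 3,965 miles from the center under the stated radius assumption. Routing every conversion through one normalized form keeps the number of mappings linear in the number of representations rather than quadratic.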
Level 4: Merge instances from multiple sources.
You can do this through data correlation and data-value reconciliation (sometimes called fusion). Data
correlation determines if two objects, usually from different data sources, refer to the same real-world
object. For example, if the Criminal Records database has “John Public, armed robber, born 1 Jan. 1970”
and the Motor Vehicle Registry database has “John Public Sr., license plate JP-1, born 9 Sept. 1939,”
might a police query consider these to refer to the same person and return “John Public, armed robber”?
Data correlation can identify different sources that disagree about particular facts. Suppose three sources
report John Public’s height to be 180, 187, and 0 centimeters, respectively. Data-value reconciliation can
be used to determine what values the search should return to the application. This capability requires
detailed application knowledge. Vendors and researchers are increasing their efforts in the
“data-cleaning” area to help administrators specify the desired policy, semi-automatically identify candidate
objects to be merged, and—if cost-justified—resolve individual instances. Reconciliation rules should be
flexible, modular, and displayable to domain experts who lack programming skills.
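To make the two steps concrete, the sketch below applies one possible policy to the examples above. Both rules are assumptions chosen for illustration: correlation requires the normalized name and the birth date to agree, and reconciliation discards physically implausible heights and returns the median of the rest. An actual application would set these policies from its own domain knowledge.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Iterable, Optional

NAME_SUFFIXES = {"sr", "jr", "ii", "iii"}  # assumption: suffixes ignored in matching


@dataclass(frozen=True)
class PersonRecord:
    source: str
    name: str
    born: date


def normalize_name(name: str) -> str:
    """Lowercase the name and drop punctuation and generational suffixes."""
    tokens = [token.strip(".,").lower() for token in name.split()]
    return " ".join(t for t in tokens if t and t not in NAME_SUFFIXES)


def same_person(a: PersonRecord, b: PersonRecord) -> bool:
    """Illustrative correlation rule: names and birth dates must both agree."""
    return normalize_name(a.name) == normalize_name(b.name) and a.born == b.born


def reconcile_height_cm(
    values: Iterable[float], low: float = 50.0, high: float = 250.0
) -> Optional[float]:
    """Illustrative reconciliation rule: median of the plausible values."""
    plausible = [v for v in values if low <= v <= high]
    return median(plausible) if plausible else None


CRIMINAL = PersonRecord("Criminal Records", "John Public", date(1970, 1, 1))
VEHICLE = PersonRecord("Motor Vehicle Registry", "John Public Sr.", date(1939, 9, 9))
```

Under this policy `same_person(CRIMINAL, VEHICLE)` is false, because the names agree after normalization but the birth dates do not, and `reconcile_height_cm([180, 187, 0])` returns 183.5 after discarding the 0. Keeping each rule in a small, separately named function is one way to keep reconciliation logic modular and easy to show to a domain expert.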
Typically, you must address these challenges in order, from lowest to highest. For example, unless an
interoperability effort first meets the challenges of geographic distribution and diverse infrastructures, addressing
higher levels will yield little benefit. For information exchange, levels 1–3 are sufficient, while integration
efforts also require information merging.
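The ordering rule can be written down as a small checklist for evaluating a proposed solution. The following is a minimal sketch of that reasoning; the enum member names and the function names are hypothetical labels for the four levels described above.

```python
from enum import IntEnum
from typing import Set


class Level(IntEnum):
    DISTRIBUTION_AND_INFRASTRUCTURE = 1
    SEMANTIC_MATCHING = 2
    REPRESENTATION_MEDIATION = 3
    INSTANCE_MERGING = 4


def highest_level_reached(addressed: Set[Level]) -> int:
    """Count levels in order, lowest first; the first gap stops the count."""
    reached = 0
    for level in Level:
        if level not in addressed:
            break
        reached = level.value
    return reached


def supported_interoperability(addressed: Set[Level]) -> str:
    """Exchange needs levels 1 to 3; integration needs all four."""
    reached = highest_level_reached(addressed)
    if reached >= Level.INSTANCE_MERGING:
        return "integration"
    if reached >= Level.REPRESENTATION_MEDIATION:
        return "exchange"
    return "neither"
```

A solution that covers levels 2 to 4 but skips level 1 evaluates to "neither", which mirrors the point that work at the higher levels yields little benefit while a lower level remains unaddressed.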
This issue of The Edge discusses how we have addressed these levels through several approaches.
General architecture approaches for information interoperability include:
Integration within the application. An application or Web portal communicates directly with
each source using that source’s native interface and reconciles the data it receives. While
common, this approach has serious drawbacks: it places great demands on the application
developer, who must stay knowledgeable about each of several data interfaces. In addition,
information combination becomes part of the code base that must be maintained, making it
difficult to leverage commercial database management or middleware products.
Data warehouses. Administrators define a global schema (i.e., a template) for the shared
data. They provide the derivation logic to reconcile data and pump it into one system, typically
with the help of extract-transform-load tools. Typically, the warehouse is read-only, with
updates made directly on the source systems. As a variation, data marts give individual
communities their own subsets of the global data.
Federated databases. These virtual data warehouses do not populate the global schema.
Instead, the source systems retain the physical data and a middleware layer translates all
requests to run against the source systems. Commercial companies call this “enterprise
information integration.”
Messaging. One application or database uses structured messages to pass data to others.
Often, however, the sender and receiver use different terms for the same concepts, so that the
data must be transformed to meet the needs of the receiver. Enterprise application integration
(EAI) products support message-based interoperability.
Parameter passing. One application invokes another and passes data as parameters. Web services are
an example of this architecture, in which services are invoked and described using standard Web
languages and protocols. EAI products also support this architecture.
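The difference between a data warehouse and a federated database can be illustrated with a small sketch. It assumes two made-up source databases and a hypothetical global schema of (org_name, headcount); the table names, queries, and function names are invented for the example and do not describe any commercial product.

```python
import sqlite3
from typing import List, Sequence, Tuple

Row = Tuple[str, int]                          # hypothetical global schema
Derivation = Tuple[sqlite3.Connection, str]    # a source plus its mapping query


def make_source(ddl: str, insert: str, rows: Sequence[Row]) -> sqlite3.Connection:
    """Create an in-memory source database with its own local schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert, rows)
    return conn


def build_demo_derivations() -> List[Derivation]:
    """Two made-up sources, each paired with a query onto the global schema."""
    source_a = make_source(
        "CREATE TABLE org (name TEXT, staff INTEGER)",
        "INSERT INTO org VALUES (?, ?)",
        [("Alpha", 120)],
    )
    source_b = make_source(
        "CREATE TABLE unit (title TEXT, members INTEGER)",
        "INSERT INTO unit VALUES (?, ?)",
        [("Bravo", 95)],
    )
    return [
        (source_a, "SELECT name, staff FROM org"),
        (source_b, "SELECT title, members FROM unit"),
    ]


def federated_query(derivations: List[Derivation]) -> List[Row]:
    """Federated style: translate each request to run against the sources."""
    rows: List[Row] = []
    for conn, sql in derivations:
        rows.extend(conn.execute(sql).fetchall())
    return sorted(rows)


def load_warehouse(derivations: List[Derivation]) -> sqlite3.Connection:
    """Warehouse style: extract, transform, and load a physical copy."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE organization (org_name TEXT, headcount INTEGER)")
    warehouse.executemany(
        "INSERT INTO organization VALUES (?, ?)", federated_query(derivations)
    )
    return warehouse


def warehouse_query(warehouse: sqlite3.Connection) -> List[Row]:
    """Read only the warehouse copy; the sources are not contacted."""
    sql = "SELECT org_name, headcount FROM organization"
    return sorted(warehouse.execute(sql).fetchall())
```

If a source is updated after `load_warehouse` has run, `federated_query` reflects the change immediately, while `warehouse_query` keeps returning the loaded copy until the next load. That is the basic trade-off between the two architectures: the warehouse answers from its own store, and the federation always depends on the sources being reachable.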
Challenges
The technical issues of these approaches revolve around heterogeneity, distribution, and multiple
versions. In general, the greatest challenges lie in semantics.
The framework presented in this article can be used to evaluate interoperability approaches.
For example, if a vendor describes their product as being “the answer” to information
interoperability, you can ask them which of the four levels their product addresses. If they say
“all of them,” be suspicious!