Draft
The Confederation Web: A Net-Centric Data Service
For Large-Scale, Rapid Integration
Douglas E. Dyer, PhD
Active Computing
http://www.activecomputing.org
12 March 2005
Introduction
DARPA’s Integrated Battle Command program (IBC) aims to integrate an array of
intelligent tools to help military commanders in battlefield engagements as well as in
peace-keeping and stability operations where political, economic, and social factors can
dominate. To support rapid integration of new components, IBC has developed the idea
of a confederation, a loosely-coupled set of decision aids, visualization tools, and
simulation models that speed understanding and assessment even for domains unfamiliar
to warfighters. This document describes the simple architecture, motivated by the
World-Wide Web, that makes a confederation possible: the Confederation Web. The
Confederation Web is meant to be a structured version of the World-Wide Web. Once
requirements are carefully analyzed, most design choices follow directly from that great
example.
Requirements
For command and control (and most other domains), practical integration of software
components requires:
• A simple method of publishing
• Knowledge that software components and data exist
• The ability to physically access information from available software components
• A way to interpret/understand information provided by software components
• A way to prevent unauthorized users from seeing sensitive information
• Fast, reliable performance
Requirements Analysis and Design Choices for the Confederation Web
A simple method of publishing. Sharing information implies publication because
information that is not public may not be accessed or shared. But what information
should be published? In what format? And by what mechanism? And using which
semantic definitions and relationships? In my experience, when several developers unite
to build an integrated application, these questions can be very difficult to answer, and
debate extends well into the development cycle. Defining needed functions is not very
difficult and neither is allocating them to components. In contrast, creating a data model
and an ontology usually takes months. Mechanisms and formats can be problematic
because everyone has a vested interest in using the methods they know and prefer. The
more tightly-coupled the system, the more difficulties arise. Also, the available
development environment is dynamic. New data formats and service technologies
always loom on the horizon, promising useful features and greater facility. Developers
strive to stay in the mainstream while software development vendors have many
economic incentives to keep changing the “stream” regardless of technical utility. All
this uncertainty leads to delay, and the project’s growth may be stunted as a result.
To avoid uncertainty and delay, the Confederation Web specifies and provides the
minimal infrastructure necessary to enable application developers to immediately publish
whatever content they choose in whatever format they wish. The Confederation Web
defines the information sharing infrastructure, the uniform indexing method used to find
information, and certain metadata.
In the Confederation Web, the method of publishing is pre-defined. Using SQL and a
database interface library, applications write information elements (variable values in
context, plus metadata) to pre-defined relational database tables whose attributes match
those of an information element. Let’s discuss this informally for a moment, first by
describing an information element.
The essence of integration is sharing information, and basically, an information element
is a self-contained unit of information. Its components are:
• A variable
• Its value in some context
• Associated metadata in the same context
A key concept is context. For example, x is a variable and its value may be “4,” but in
what context? Perhaps it’s in the context of the equation, x + y = 13 or perhaps x refers
to the distance, in centimeters, that a robot hand is from a door knob at some time t.
Because there is a huge number of relevant variables in the world and an even greater
number of contexts, it’s helpful to partition both variables and contexts. For variables, a
convenient and practical method is to use the application associated with the variable. If
so, then applications create namespaces. For example, the origin of an aircraft movement
in SOFTools is different from the origin of a flight in Expedia. So the application name
is an important part of a variable’s context in the Confederation Web. The rest of the
context is associated with a problem-solving instance. For example, SOFTools may be
used to author many different plans, and the variables for each plan are completely
independent from those in other plans. Plans created by different authors are different as
are plans for different purposes created by the same author. This suggests a method of
defining a problem-solving instance. For information elements, we specify the
problem-solving instance with the user identity (such as an email address) and a problem-solving
instance number (integer) for that user. For example, the first time I book a flight on
Expedia, the problem-solving instance is identified by a unique user identifier,
“[email protected]” and “1.” The second time I book a flight, the
instance number is “2” and so on [1]. The application name, user identity, and user’s
problem-solving instance number constitute a complete context for the variable in an
information element. The context makes it possible to retrieve all information elements
for a particular context while ignoring all other information elements. The other parts of
an information element include the variable’s value and associated metadata: the time the
variable value was set, the source of the value (which could be the user, an agent, or an
algorithm acting on the user’s behalf, an important distinction), and “more information,”
which can be anything the developer wishes to insert but was originally intended for
user-provided information.
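As a minimal sketch, an information element can be modeled as a small record. The field names mirror the parts of an information element described above; the Python class, the `context()` helper, and the placeholder identity `user@example.org` are illustrative assumptions, not part of the Confederation Web itself.

```python
from dataclasses import dataclass
import time as _time


@dataclass
class InformationElement:
    """One variable value in context, plus metadata (illustrative only)."""
    application: str      # namespace for the variable
    userIdentity: str     # e.g., an email address
    instance: int         # the user's problem-solving instance number
    variable: str
    value: str
    source: str = "user"  # the user, an agent, or an algorithm acting for the user
    more: str = ""        # free-form "more information"
    time: int = 0         # Unix time at which the value was set

    def context(self):
        """Return the complete context: application, user, instance."""
        return (self.application, self.userIdentity, self.instance)


element = InformationElement(
    application="Travel",
    userIdentity="user@example.org",  # placeholder identity (assumption)
    instance=1,
    variable="Visiting",
    value="JFCOM",
    time=int(_time.time()),
)
print(element.context())
```

Grouping the three context fields behind one helper reflects the point made above: the context alone is enough to retrieve every information element of one problem-solving instance.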
To help make this more concrete, let’s look at an example. Here is an application used
for travel planning:
[Figure: screenshot of the Travel application]
The application name is “Travel” and let’s suppose the user is Doug Dyer, identified by
email address, [email protected]. The lower right part of the application
indicates this is the first trip of three (1/3) that Doug has planned with this application.
So, in the Confederation Web, the information for this trip could be found from
something like:
application: Travel
userIdentity: [email protected]
instance: 1
[1] There are often many ways to identify a user’s problem-solving instance uniquely using application
variables. To avoid having to select some set of these and ensure uniqueness, we just use a number. Then,
when we want to recall all the information about the first flight ever booked on Expedia, we just use
enough variable values to get the instance number, e.g., “the flight whose destination was Amman, Jordan
and whose departure date was sometime in March 2005.”
Suppose Doug began planning this trip in October when told there would be travel
required to perform Experiment B at JFCOM in November. The complete information
element for the variable “Visiting” might be:
application: Travel
userIdentity: [email protected]
instance: 1
variable: Visiting
value: JFCOM
source: user
more:
time: 1097525186 [2]
In the Confederation Web’s relational database this looks similar:
+-------------+-------------------------------+----------+----------+-------+--------+------+------------+
| application | userIdentity                  | instance | variable | value | source | more | time       |
+-------------+-------------------------------+----------+----------+-------+--------+------+------------+
| Travel      | [email protected]             | 1        | Visiting | JFCOM | user   |      | 1097525186 |
+-------------+-------------------------------+----------+----------+-------+--------+------+------------+
In the Confederation Web, this simple method is used to publish current values.
Applications use the element table to store current variable values. SQL’s INSERT is
used for new information elements and UPDATE is used for existing information
elements whenever the value changes. Once an information element appears, any
authorized, connected user may read it, but no one besides the publishing application may
write it. To support a complete digital history, applications also INSERT information
elements into another table, history, with the same schema as element. This enables
anyone to query the history table, sorting results on time, to obtain the change history for a
particular information element. This is how applications share information [3].
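The publishing steps just described can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the column types, the primary key on `element`, the function names, the placeholder identity, and the second value (`SOCOM`) are all hypothetical, and any relational database and interface library could be substituted.

```python
import sqlite3

COLUMNS = "application, userIdentity, instance, variable, value, source, more, time"
DDL = """
CREATE TABLE {name} (
    application TEXT, userIdentity TEXT, instance INTEGER, variable TEXT,
    value TEXT, source TEXT, more TEXT, time INTEGER{extra}
)
"""
# Assumed key: one current value per variable per context.
KEY = ", PRIMARY KEY (application, userIdentity, instance, variable)"


def create_tables(db):
    db.execute(DDL.format(name="element", extra=KEY))  # current values
    db.execute(DDL.format(name="history", extra=""))   # every value ever set


def publish(db, app, user, instance, variable, value, source, more, when):
    """UPDATE an existing element or INSERT a new one, then log to history."""
    with db:  # one transaction covers both tables
        cur = db.execute(
            "UPDATE element SET value=?, source=?, more=?, time=? "
            "WHERE application=? AND userIdentity=? AND instance=? AND variable=?",
            (value, source, more, when, app, user, instance, variable),
        )
        if cur.rowcount == 0:  # nothing to update, so this element is new
            db.execute(
                f"INSERT INTO element ({COLUMNS}) VALUES (?,?,?,?,?,?,?,?)",
                (app, user, instance, variable, value, source, more, when),
            )
        db.execute(
            f"INSERT INTO history ({COLUMNS}) VALUES (?,?,?,?,?,?,?,?)",
            (app, user, instance, variable, value, source, more, when),
        )


def change_history(db, app, user, instance, variable):
    """Return (time, value) pairs for one element, oldest first."""
    return db.execute(
        "SELECT time, value FROM history WHERE application=? AND userIdentity=? "
        "AND instance=? AND variable=? ORDER BY time",
        (app, user, instance, variable),
    ).fetchall()


db = sqlite3.connect(":memory:")
create_tables(db)
publish(db, "Travel", "user@example.org", 1, "Visiting", "JFCOM", "user", "", 1097525186)
publish(db, "Travel", "user@example.org", 1, "Visiting", "SOCOM", "user", "", 1097611586)
current = db.execute("SELECT value FROM element").fetchall()
history = change_history(db, "Travel", "user@example.org", 1, "Visiting")
```

After the two calls, `element` holds only the latest value while `history` holds both, which is the split between current state and digital history described above.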
Because the only requirement for integration is writing to and reading from a relational
database, developers are free to use any programming language, library, or database
interface (e.g., ODBC). This reduces the need for developers to agree on infrastructure
and allows them to innovate freely or stick with tested technologies. Because developers
are vested in the programming languages, libraries, and technology they have chosen,
freedom to use what they wish is a significant advantage. In addition, this freedom
promotes innovation while providing system architects and managers the complete control and
visibility needed to ensure a good integrated system.
[2] In the Confederation Web, Unix-style time is used: seconds since 1970.
[3] Developers may also store descriptive metadata about variables and applications in a third table,
elementDescription. But that’s about as complex as it gets. The Confederation Web has only these three
pre-defined tables.
Application developers are free to choose the content and formats they think are most
useful to others. This is a key design decision justified in the Appendix, but it does not
necessarily preclude decision by consensus. Developers may choose to create a
community of interest and build a consensus on desired content and format; or they might
consult only with their primary customer; or they may unilaterally decide these issues.
Some may worry that authorizing developers to dictate the information published will not
satisfy system needs. This might be true if system architects and managers decide not to
allow competition between developers (the old way of doing business). With
competition, no one has a monopoly on technology. Any component can be
re-implemented, improved, and offered as an alternative. Knowing this, developers are
likely to produce good products because if they don’t, someone else will come along and
“build a better mouse trap.” If so, then there are multiple components from which to
choose. Market forces can quickly eliminate bad choices, either by helping developers
see the need for re-design or by “naturally selecting” applications that don’t evolve to
satisfy customers. Ultimately, market forces are stronger, more enduring, and more
efficient than even the best centrally planned program. Computers can measure use
patterns and market demand easily.
Semantics is typically a difficult issue. Trying to define a relatively complete domain
ontology can take a long time and can delay progress if required before writing
applications. Regardless of integration issues, every application naturally defines its own
ontology (this is true in any event, regardless of whether the Confederation Web is used
and whether the ontology is formally documented or not). When one application uses
information from another, its ontology is extended. By integrating a set of applications,
it’s possible to create a family of associated ontologies that more or less cover a domain
of interest (again, whether or not the ontology is documented). It’s often useful to
document the ontology to gain greater insight, make applications more closely represent
the domain, and facilitate the development of new functions and new applications.
However, developers are not generally domain experts and thus have trouble defining the
ontology initially. Luckily, the rapid prototyping approach implies frequent feedback
from users familiar with the domain. Rather than trying to define a complete ontology a
priori, I advocate using technology to assist users and developers in discovering and
documenting the ontology iteratively, changing the application to reflect new insight. As
is the case for most other constructive works, ontologies improve with iterations.
Following the example of the World-Wide Web, the Confederation Web makes
publishing simple. Simple ideas include reliance on mature relational database technology
and the notions of an information element, a pre-defined database schema, and authority
for application developers to unilaterally select the content and format published. The
intent is to support bottom-up development. Rather than waiting for agreement on a
top-down design (which tends to arrive late in the development schedule), developers can
publish immediately. A simple, pre-defined method of publication helped the
World-Wide Web attain exponential growth, and the same method will help the Confederation
Web grow too.
Knowledge that software components and data exist. A common problem identified
by the DOD’s Net Centric Data Strategy is that users and developers generally do not
know all the software components, services, and data elements that are available. Given
the rate of technical progress, number of new commercial products, and number of
Government programs aimed at component development and database upgrades, this is
not at all surprising. We need technology to help find software components and data
applicable to our current problem. Luckily, it’s easy to adapt the search engine
technology that helps us manage unstructured information [5] to manage structured
information as well. The single-server version of the Confederation Web already has a
“search engine for structured information.” This tool can help in the following ways:
• Identify relevant applications and data
• Find power-users based on use frequency and recency and access them via email
• Provide semantic clues on the meaning of variables and functions of applications
• Monitor the software development process
The search engine for structured information is documented in a paper on the Active
Computing web site. By exploiting the special structure available in an information
element, this search engine can help anyone find and understand the tools and data they
need to solve a problem. From a developer’s perspective, the search engine helps find
reusable components and useful data. From a project manager’s view, the search engine
makes it easy to track development progress. All of these benefits accrue from applying
search to the Confederation Web [6].
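To illustrate why a uniform schema makes this kind of search straightforward, here is a rough sketch of two such queries over the history table. It is not the author's search engine, which is documented separately; the function names, ranking rule, and sample rows are assumptions made only for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE history (application TEXT, userIdentity TEXT, instance INTEGER, "
    "variable TEXT, value TEXT, source TEXT, more TEXT, time INTEGER)"
)
rows = [  # hypothetical sample data
    ("Travel", "a@example.org", 1, "Visiting", "JFCOM", "user", "", 100),
    ("Travel", "a@example.org", 2, "Visiting", "DARPA", "user", "", 300),
    ("Travel", "b@example.org", 1, "Visiting", "SOCOM", "user", "", 200),
    ("Planner", "b@example.org", 1, "origin", "Tampa", "agent", "", 250),
]
db.executemany("INSERT INTO history VALUES (?,?,?,?,?,?,?,?)", rows)


def find_variables(db, keyword):
    """List (application, variable) pairs whose names or values mention keyword."""
    like = f"%{keyword}%"
    return db.execute(
        "SELECT DISTINCT application, variable FROM history "
        "WHERE application LIKE ? OR variable LIKE ? OR value LIKE ? "
        "ORDER BY application, variable",
        (like, like, like),
    ).fetchall()


def power_users(db, application):
    """Rank users of an application by use frequency, then by recency."""
    return db.execute(
        "SELECT userIdentity, COUNT(*) AS uses, MAX(time) AS last_used "
        "FROM history WHERE application=? GROUP BY userIdentity "
        "ORDER BY uses DESC, last_used DESC",
        (application,),
    ).fetchall()


matches = find_variables(db, "visit")
ranking = power_users(db, "Travel")
```

Because every application writes to the same columns, one query serves all of them; no per-application adapter is needed.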
The ability to physically access information from available software components.
Because the Confederation Web is implemented with relational databases, physical
access is possible from anywhere on the net. To query, the only requirement is access
information (e.g., database server, user, password) and an appropriate index for finding
information elements. To find a particular information element, the Confederation Web
uses a “URL for structured information.” For the single-server version of the
Confederation Web, the URL for structured information is the variable name, application,
user, and user’s problem-solving instance. For a multi-server version of the
Confederation Web, it’s necessary to define the database server as well by IP address or
name. The URL for structured information provides a uniform index that, with the
relational database server, enables authorized users to get information from the
Confederation Web from anywhere on the net using any appropriate client software.
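A minimal sketch of the “URL for structured information” as a lookup key follows. The `StructuredURL` type, the `fetch` function, the column types, and the placeholder identity are assumptions for illustration; an in-memory SQLite database stands in for a networked database server.

```python
import sqlite3
from typing import NamedTuple, Optional


class StructuredURL(NamedTuple):
    """Index for one information element (names are illustrative)."""
    variable: str
    application: str
    userIdentity: str
    instance: int
    server: Optional[str] = None  # needed only in a multi-server deployment


def fetch(db, url):
    """Return the current value the index points to, or None if absent."""
    row = db.execute(
        "SELECT value FROM element WHERE variable=? AND application=? "
        "AND userIdentity=? AND instance=?",
        (url.variable, url.application, url.userIdentity, url.instance),
    ).fetchone()
    return row[0] if row else None


db = sqlite3.connect(":memory:")  # stands in for a database server on the net
db.execute(
    "CREATE TABLE element (application TEXT, userIdentity TEXT, instance INTEGER, "
    "variable TEXT, value TEXT, source TEXT, more TEXT, time INTEGER)"
)
db.execute(
    "INSERT INTO element VALUES ('Travel', 'user@example.org', 1, 'Visiting', "
    "'JFCOM', 'user', '', 1097525186)"
)
url = StructuredURL("Visiting", "Travel", "user@example.org", 1)
value = fetch(db, url)
missing = fetch(db, url._replace(instance=2))
```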
[5] Unstructured information is content which is currently interpretable only by a human. Examples include
natural language text, graphics, video, and sound. In contrast, structured information refers to variable
values in context (application, user, user’s problem-solving instance). Structured information is generally
machine-interpretable and can be used to automate business rules, trigger actions, support machine
reasoning, etc.
[6] Such a search engine is not possible without a uniform database schema like the one used in the
Confederation Web.
A way to interpret/understand information provided by software components.
Typically, software developers provide some form of descriptive meta-data to help other
people understand the domain of interest, data definitions, and software functions.
Descriptive meta-data include source code comments, user and system documentation,
data dictionaries, schema, models, and ontologies. All of these require effort on the part
of the developer to produce, and I refer to them as “semantics by declaration.” The
Confederation Web has a table for storing and serving descriptive meta-data. However,
the Confederation Web also enables a second method which requires no a priori effort on
the part of developers but arises naturally as software applications are used. “Semantics
by example” exploits example values to gain insight into the meaning of a variable. To
understand this concept, consider the Travel application introduced earlier. One variable
in that application is visiting and its meaning might reasonably be any of at least three
things:
• A person
• A place (physical location)
• An organization
However, suppose you find these example values for visiting:
• University of Pittsburgh
• IBM Almaden Institute
• SOCOM
• DARPA
Clearly, these examples do not refer to a person, and they don’t seem to refer to a
physical location (for example, DARPA could choose to move its facility to a different
office building). For these reasons, most people would be able to use these example
values to infer that visiting refers to an organization, not a person or a place. This
understanding, arising from inference based on example values, is “semantics by
example.”
As the Confederation Web is used, a large number of example values will be stored.
After a short time, we can expect the Confederation Web to include all possible values
for many variables that have enumerated values. For these variables and perhaps others,
the range of possible values available in the Confederation Web should result in valuable
insight into the meaning of each variable. Moreover, the distribution of values in the
context of other relevant variables should provide a sense of normal behavior, thus
facilitating recognition of abnormal, and possibly erroneous, information.
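As a small illustration of “semantics by example,” the sketch below gathers the values published for one variable and counts how often each occurs. The helper name, the column types, the placeholder identity, and the repeated `DARPA` row are assumptions; the other example values come from the list above.

```python
import sqlite3
from collections import Counter

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE history (application TEXT, userIdentity TEXT, instance INTEGER, "
    "variable TEXT, value TEXT, source TEXT, more TEXT, time INTEGER)"
)
examples = ["University of Pittsburgh", "IBM Almaden Institute", "SOCOM", "DARPA", "DARPA"]
db.executemany(
    "INSERT INTO history VALUES "
    "('Travel', 'user@example.org', ?, 'visiting', ?, 'user', '', ?)",
    [(i + 1, v, 1097525186 + i) for i, v in enumerate(examples)],
)


def example_values(db, application, variable):
    """Count how often each value has been published for a variable."""
    rows = db.execute(
        "SELECT value FROM history WHERE application=? AND variable=?",
        (application, variable),
    )
    return Counter(value for (value,) in rows)


counts = example_values(db, "Travel", "visiting")
for value, n in counts.most_common():
    print(f"{value}: {n}")
```

A reader (or a program) scanning this output sees organization names, and the counts give a first impression of which values are typical and which are rare.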
For many applications, just knowing the associated variables can facilitate understanding
the function of the application. For example, if variables include origin and destination,
then movement is implied. Knowing variable values in the context of a particular
problem is similar to having a case (as in case-based reasoning) and enables hypotheses
about the ontological relationships that apparently exist between variables.
“Semantics by example” is a powerful method for understanding applications and their
variables. It requires no effort from developers, arising instead through normal use of
applications. It can be used separately or in conjunction with the more traditional
“semantics by declaration.”
A way to prevent unauthorized users from seeing sensitive information. Security is
important in many domains including military command and control. One form of
security is preventing unauthorized users from gaining access to sensitive information.
Applications operating on client workstations and the Confederation Web servers are
assumed to be under physical control, but what about the network? If the network is
unprotected, hackers may be able to get sensitive information by sniffing packets or by
simply accessing the database. Both of these problems are solved using off-the-shelf
technologies. To prevent packet sniffing, network traffic between applications and the
relational database hosting the Confederation Web should be encrypted using Secure
Sockets Layer (SSL). To prevent
direct access to the database, we can just use the database’s access control scheme which
generally involves user accounts and passwords protecting databases, tables, and even
columns. The database normally has logging capability which enables manual analysis
as well as a variety of AI-based agents for checking out-of-norm or sensitive queries,
providing another layer of protection for at least detecting hackers. These standard
methods are difficult to improve upon. Adding additional components, services, and
interfaces may add complexity without any real security benefit. The more complex the
interfaces and services provided, the more difficult it is to ensure the system is secure.
Using a small, well-defined interface makes it easier to analyze and address possible
methods of attack.
Fast, reliable performance. Performance is always desirable, but designing for
performance involves many considerations. For serving structured data, the relational
database provides a good foundation for reliable performance. Viable,
commercial-quality databases perform well by definition (otherwise they would have been culled by
market forces). Query optimizers improve each year based on market demand.
If, for whatever reason, database performance is not adequate, one possible solution is
replicated servers. Typical replication provides a master database that serves writes and
multiple slave databases that serve reads. Synchronization may be an issue in some
domains (and if so, the master may be used for immediate reads).
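The read/write split just described can be sketched as a tiny router. The class, its round-robin policy, and the server names are hypothetical; real replication is configured in the database product itself, and this only shows where each statement would be sent.

```python
import itertools


class ReplicatedRouter:
    """Send writes to the master and spread reads across replicas (sketch)."""

    def __init__(self, master, replicas):
        self.master = master
        # With no replicas, the master serves reads too.
        self._replicas = itertools.cycle(replicas or [master])

    def server_for(self, sql, need_latest=False):
        """Pick a server name for a statement.

        need_latest=True forces a read to the master, for domains where
        synchronization delay matters.
        """
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and not need_latest:
            return next(self._replicas)
        return self.master


router = ReplicatedRouter("db-master", ["db-replica-1", "db-replica-2"])
targets = [
    router.server_for("INSERT INTO element VALUES (...)"),
    router.server_for("SELECT value FROM element"),
    router.server_for("SELECT value FROM element"),
    router.server_for("SELECT value FROM element", need_latest=True),
]
```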
Another performance-enhancing possibility enabled by the design of the Confederation
Web is to split the information served, hosting some information on one server and the
rest on another. Information may also be replicated in a splitting scheme. If additional
performance is needed, more splitting can be provided. The URL for structured information
provides a uniform indexing method that makes splitting possible. The architecture of
the multi-server version of the Confederation Web naturally mirrors that of the
World-Wide Web and internet hosts in general:
[Figure: multi-server architecture of the Confederation Web]
In the architecture, both clients and servers are connected to the internet and routing is
handled in the normal, robust way. For large loads (e.g., replicated servers), it makes
sense to pay some attention to routing and distance, but otherwise, the architecture is
unconstrained. Servers each have the infrastructure of the Confederation Web, namely a
relational database with the three tables, element, history, elementDescription, the first
two of which hold information elements. Information elements are indexed using the
“URL for structured information” and may be found using the “search engine for
structured information.” This architecture has proven to be scalable for the World-Wide
Web servers and seems well-suited for rapidly integrating components and operating
them for command and control in today’s global environment.
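One way to picture splitting is a lookup that adds a server name to the single-server index. The paper does not say how information is assigned to servers, so splitting by application, the host names, and the function name below are assumptions made only for illustration.

```python
# Hypothetical split: which server hosts which application's elements.
SPLIT = {
    "Travel": "cw1.example.org",
    "SOFTools": "cw2.example.org",
}
DEFAULT_SERVER = "cw0.example.org"  # everything not listed above


def multi_server_url(variable, application, user_identity, instance):
    """Build a multi-server index: the single-server index plus a server."""
    server = SPLIT.get(application, DEFAULT_SERVER)
    return (server, application, user_identity, instance, variable)


url = multi_server_url("Visiting", "Travel", "user@example.org", 1)
other = multi_server_url("origin", "Expedia", "user@example.org", 2)
print(url)
print(other)
```

Adding a server later means adding one entry to the split table; clients keep using the same index fields.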
In the Confederation Web, speed is further enhanced by a design that requires no joins.
Joining tables is a costly operation in a relational database.
Transaction processing provides a good foundation for reliability because SQL
statements either run to completion or fail, ensuring consistency in the database.
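A short sketch of this all-or-nothing behavior, using Python's built-in `sqlite3` module: the two-column tables are a deliberately simplified stand-in for the real schema, and the NOT NULL constraint exists only to force a failure.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE element (variable TEXT PRIMARY KEY, value TEXT)")
db.execute("CREATE TABLE history (variable TEXT, value TEXT NOT NULL)")
db.execute("INSERT INTO element VALUES ('Visiting', 'JFCOM')")
db.commit()

try:
    with db:  # commits on success, rolls back if an exception escapes
        db.execute("UPDATE element SET value='SOCOM' WHERE variable='Visiting'")
        # This violates NOT NULL, so the UPDATE above is rolled back as well.
        db.execute("INSERT INTO history VALUES ('Visiting', NULL)")
except sqlite3.IntegrityError:
    pass

value_after = db.execute("SELECT value FROM element").fetchone()[0]
history_rows = db.execute("SELECT COUNT(*) FROM history").fetchone()[0]
```

Grouping the element update and the history insert in one transaction keeps the two tables from drifting apart when a write fails partway through.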
Viruses, Trojans, worms, spyware, and other mal-ware are currently not as large a
problem for relational databases as for web browsers, email, chat, and client machines in
general. One reason is that the command and control applications that operate in the
Confederation Web arrive from well-known, trusted sources and do not automatically
download or otherwise receive and run arbitrary programs. Another reason is that the
relational database server has a simple interface that supports only writing data to and
reading data from the database. Clients read the data but do not attempt to execute it. Current
mal-ware is extremely difficult to eradicate from an infected client, so these features have
the potential to prevent a very painful and difficult problem that can affect both reliability
and performance.
Denial of service attacks dramatically affect reliability of any server and are difficult to
prevent. Good databases can be set to deny access to all but specific hosts. Although a
denial of service attack may still occur, a combination of replication, network
partitioning, and access control can help reduce its effect.
Performance is difficult to deliver if it is unaffordable, an especially relevant
consideration for global deployment. Experience suggests technology with the least cost
and fewest restrictions is easiest to transition. Government contracting is a slow process,
and many decision-makers are concerned with justifying expensive systems, two factors
that may combine to slow the momentum in any program. Fortunately, there are low-cost
options for relational databases that run well on commodity hardware. For many purposes,
the open source MySQL database rivals Oracle, the world’s most expensive database. It
is certainly possible to spend a lot for software, but careful elimination of unnecessary
components, performance testing, and an analysis of alternatives are certainly worthwhile.
Summary
“The most important thing we do in an ACTD is Transition New
Capability to the warfighter!”
-- Joint Coordinated Real-time Engagement ACTD Brief, 2 Mar 05
Warfighters gain new capability from applications, not infrastructure. The information
service infrastructure is there to support applications. The Confederation Web was
designed to facilitate application development and support on-going operations given the
dynamic availability of new decision aids, situation awareness tools, and simulation
models. Rapid development and the growth of structured content are promoted using a
simple publishing method based on information elements, conventional relational
database technology, a straightforward, predefined schema, and authorization for
publishers to unilaterally choose content and format. An innovative “search engine for
structured information” allows anyone to know the applications and data currently
available, along with other useful information such as who the power-users are. A “URL
for structured information,” together with database servers, allows any authorized user to
connect with any appropriate client to easily access and share information. “Semantics
by example” is a compelling new tool for understanding applications and their variables.
It augments the more traditional “semantics by declaration” that may also be included in
the Confederation Web. The Confederation Web uses conventional methods to prevent
unauthorized users from accessing sensitive information: SSL to protect the network,
access control, logging, and AI agents to protect the server and detect malicious acts or
attempts. Fast, reliable performance is fundamental to the relational database market and
databases provide a good foundation for performance. Architectural designs that use
replication or splitting (enabled by the Confederation Web design) can increase
performance with additional hardware. Furthermore, the Confederation Web schema
requires no joins. Mal-ware plaguing other kinds of information systems is currently not
a problem for relational databases and, by virtue of the restricted interfaces of clients and
servers, problems are not anticipated. Denial of service attacks can be a problem for any
server, and both conventional and innovative new techniques can help. Finally, high
performance and new military capabilities are not possible if the system does not
transition. Every good design should be affordable and future costs should be predictable
to maximize the number of users and minimize the factors that make decision-makers
hesitate.
Comments are solicited.
Appendix: Justifying choice of infrastructure and publishing authority in the
Confederation Web
The Confederation Web specifies and provides the minimal infrastructure necessary to
enable application developers to immediately publish whatever content they choose in
whatever format they wish. The Confederation Web defines the information sharing
infrastructure, the uniform indexing method used to find information, and certain
metadata. The information below justifies the choice of infrastructure and the key
decision to allow developers to choose their content and formats.
Often, when a team of developers comes together to produce an integrated system, there
is a lot of uncertainty about how the components will interface. To speed things up,
certain decisions can be specified a priori to minimize required software interfaces,
infrastructure, complexity, and cost and support innovation, portability, and flexibility.
Also, the method of making design decisions may be chosen specifically to speed
development. In the Confederation Web, we exploit both of these concepts, choosing the
relational database as our information server and allowing developers to unilaterally
choose published content and format, i.e., which variables to publish and the units, types,
and file formats (e.g., XML) of their values. Relational database technology is mature,
standard and widely available with interface libraries available for a wide variety of
programming languages and operating systems. Relational databases are optimized for
serving structured information and perform well for this purpose. Commercial-quality
databases feature access control, transaction processing, logging, replication, and many
other important features. Using a relational database to transfer information from one
software component to another provides complete visibility into information flows,
greatly reducing “finger pointing” between developers when trying to determine which
component is responding poorly. Although the Confederation Web could have been
implemented using different technology, all of these features make the relational database
an attractive choice.
Allowing developers to choose the information they publish and its format raises the real
possibility that other developers will need to translate formats (and that the information
sought is not available---let’s defer that problem for a moment). The need to translate is
often perceived as a major problem by software architects who often depict the problem
as the need for n^2 interfaces, as connoted in the following diagram, an example for seven
components:
[Figure: seven components, each pair joined by a double-ended arrow]
In the figure, each double-ended arrow represents a bi-directional software interface.
Without a common ontology and common data formats, integration potentially requires
n-1 interfaces for each of n components, in other words (approximately) n^2 interfaces. If
a single new component is added, then n more bi-directional interfaces are required, and
because n could be a big number, the impact on scalability seems obvious. Concern
about this problem has resulted in design emphasis on common ontologies and common
data formats as a precursor to other development activities, but this approach is normally
a lengthy one and, as it turns out, unnecessary. For two practical reasons, the “n^2
interfaces” problem never arises. First, typically each component is implemented by a
different developer. If there are n developers, then each developer must only implement
n-1 interfaces even though the overall system requires, at worst, about n^2 interfaces. In
other words, the system is not built by one developer but by n developers, and so the
difficulty of creating interfaces is, for each developer, linear in the number of components, not
quadratic. Second, in practice, most components require information from only a few
of the others: the components are sparsely connected, not fully connected. Therefore,
the feared problem of having to maintain “many interfaces” never achieves its potential.
For these reasons, maintaining interfaces for translation is not as difficult as many
software architects may think.
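The counting argument can be checked with a few lines of arithmetic. This sketch counts each double-ended arrow once, so a fully connected system of n components has n(n-1)/2 bi-directional interfaces, which still grows on the order of n^2. The function names and the sample links are hypothetical.

```python
def fully_connected_interfaces(n):
    """Bi-directional interfaces if every pair of n components is linked."""
    return n * (n - 1) // 2


def interfaces_per_developer(n):
    """Worst case for one developer: an interface to each other component."""
    return n - 1


def sparse_interfaces(links):
    """Count interfaces actually needed, given (consumer, producer) pairs."""
    return len({frozenset(pair) for pair in links})


n = 7
total = fully_connected_interfaces(n)
per_dev = interfaces_per_developer(n)
# Hypothetical sparse system: only a few components exchange information.
needed = sparse_interfaces([("A", "B"), ("B", "A"), ("C", "A"), ("D", "E")])
print(total, per_dev, needed)
```

For seven components the worst case is 21 arrows system-wide but only 6 per developer, and a sparsely connected system needs far fewer than either bound suggests.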
Giving developers authority to make unilateral choices on published content and formats
speeds their design and implementations because it reduces the need for consensus. In
our experience, a translator can be written in a day while achieving design consensus can
take weeks. For these reasons, developers can choose the information content and
format they publish to the Confederation Web.
Appendix: The power of a free market in software development
In traditional Government acquisition programs, the Government opens a solicitation to
competition from a variety of vendors. Once all the proposals are in, the best is selected
and a contract is awarded. If the vendor delivers as promised, then the Government gets
a good deal. However, the competition is effectively over once a proposal is selected.
For complex development programs, the Government sometimes spends a lot of money
to help ensure and verify that the product will be as promised.
For software development, the barrier to entering the market or producing a different
software component is relatively low. Software development programs are always
complex. For these reasons, it is not always a good idea to end competition with proposal
selection.
When management controls restrict competition, developers have a monopoly and are
free to produce whatever they wish. Program managers and system architects have
limited options for ensuring quality (and these options are all more difficult than simply
choosing the best alternative). Luckily, smart program managers can opt out of this old
method. While Government contracting and traditional program management tend to
favor centralized planning and developer monopoly, agile program managers can work
within existing contracting limitations to promote competition even after proposal
selection, thus providing continuing incentives for quality and innovation. Capitalism
beats communism every time, and incentives are effective at raising the quality of the
products the Government acquires.
When spending the taxpayer’s money, it’s a great idea to avoid duplication of effort.
However, for research and development or any complex system, the right approach is
not always clear. If two developers are initially funded to create the same component,
then, after some development, the program manager can choose between them to select
whichever is best. This selection process, like any customer’s choice, drives quality and
innovation. Avoiding duplication of effort should be balanced with the need for quality
enabled by having multiple choices among competing products in a free market. After all,
an alternative approach may cost more anyway. If only one product is funded, it may
fall below minimum quality requirements and have to be re-implemented, wasting both
time and money.