Draft

The Confederation Web: A Net-Centric Data Service For Large-Scale, Rapid Integration

Douglas E. Dyer, PhD
Active Computing
http://www.activecomputing.org
12 March 2005

Introduction

DARPA’s Integrated Battle Command program (IBC) aims to integrate an array of intelligent tools to help military commanders in battlefield engagements as well as in peace-keeping and stability operations where political, economic, and social factors can dominate. To support rapid integration of new components, IBC has developed the idea of a confederation, a loosely-coupled set of decision aids, visualization tools, and simulation models that speed understanding and assessment even for domains unfamiliar to warfighters. This document describes the simple architecture, motivated by the World-Wide Web, that makes a confederation possible: the Confederation Web. The Confederation Web is meant to be a structured version of the World-Wide Web and, once the requirements are carefully analyzed, most design choices follow directly from that great example.

Requirements

For command and control (and most other domains), practical integration of software components requires:

- A simple method of publishing
- Knowledge that software components and data exist
- The ability to physically access information from available software components
- A way to interpret/understand information provided by software components
- A way to prevent unauthorized users from seeing sensitive information
- Fast, reliable performance

Requirements Analysis and Design Choices for the Confederation Web

A simple method of publishing. Sharing information implies publication because information that is not public cannot be accessed or shared. But what information should be published? In what format? And by what mechanism? And using which semantic definitions and relationships?
In my experience, when several developers unite to build an integrated application, these questions can be very difficult to answer, and debate extends well into the development cycle. Defining needed functions is not very difficult and neither is allocating them to components. In contrast, creating a data model and an ontology usually takes months. Mechanisms and formats can be problematic because everyone has a vested interest in using the methods they know and prefer. The more tightly-coupled the system, the more difficulties arise. Also, the available development environment is dynamic. New data formats and service technologies always loom on the horizon, promising useful features and greater facility. Developers strive to stay in the mainstream while software development vendors have many economic incentives to keep changing the “stream” regardless of technical utility. All this uncertainty leads to delay, and the project’s growth may be stunted as a result.

To avoid uncertainty and delay, the Confederation Web specifies and provides the minimal infrastructure necessary to enable application developers to immediately publish whatever content they choose in whatever format they wish. The Confederation Web defines the information sharing infrastructure, the uniform indexing method used to find information, and certain metadata.

In the Confederation Web, the method of publishing is pre-defined. Using SQL and a database interface library, applications write information elements (variable values in context, plus metadata) to pre-defined relational database tables whose attributes match those of an information element. Let’s discuss this informally for a moment, first by describing an information element. The essence of integration is sharing information, and basically, an information element is a self-contained unit of information.
Its components are:

- A variable
- Its value in some context
- Associated metadata in the same context

A key concept is context. For example, x is a variable and its value may be “4,” but in what context? Perhaps it’s in the context of the equation x + y = 13, or perhaps x refers to the distance, in centimeters, between a robot hand and a door knob at some time t. Because there is a huge number of relevant variables in the world and an even greater number of contexts, it’s helpful to partition both variables and contexts. For variables, a convenient and practical method is to use the application associated with the variable. Applications then create namespaces. For example, the origin of an aircraft movement in SOFTools is different from the origin of a flight in Expedia. So the application name is an important part of a variable’s context in the Confederation Web.

The rest of the context is associated with a problem-solving instance. For example, SOFTools may be used to author many different plans, and the variables for each plan are completely independent from those in other plans. Plans created by different authors are different, as are plans for different purposes created by the same author. This suggests a method of defining a problem-solving instance. For information elements, we specify the problem-solving instance with the user identity (such as an email address) and a problem-solving instance number (integer) for that user. For example, the first time I book a flight on Expedia, the problem-solving instance is identified by a unique user identifier, “[email protected]”, and “1.” The second time I book a flight, the instance number is “2,” and so on¹.

The application name, user identity, and user’s problem-solving instance number constitute a complete context for the variable in an information element. The context makes it possible to retrieve all information elements for a particular context while ignoring all other information elements.
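As a rough illustration of the three-part context described above, the following Python sketch models a context as a small immutable record and selects only the variables published under one context. The names Context and in_context, the placeholder user identity, and the example values are assumptions made for this sketch; they are not part of the Confederation Web itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Three-part context of a variable, as described in the text."""
    application: str
    user_identity: str   # e.g., an email address
    instance: int        # per-user problem-solving instance number


def in_context(elements, context):
    """Return the (variable, value) pairs published under `context`."""
    return [(var, val) for ctx, var, val in elements if ctx == context]


# Hypothetical data: two flights booked by the same user in one application.
user = "user@example.org"          # placeholder identity, not from the paper
trip1 = Context("Expedia", user, 1)
trip2 = Context("Expedia", user, 2)
elements = [
    (trip1, "origin", "Pittsburgh"),
    (trip2, "destination", "Tampa"),
    (trip1, "destination", "Amman"),
]

print(in_context(elements, trip1))
```

Because the context is a plain value, two elements belong together exactly when their application, user identity, and instance number all match.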
¹ There are often many ways to identify a user’s problem-solving instance uniquely using application variables. To avoid having to select some set of these and ensure uniqueness, we just use a number. Then, when we want to recall all the information about the first flight ever booked on Expedia, we just use enough variable values to get the instance number, e.g., “the flight whose destination was Amman, Jordan and whose departure date was sometime in March 2005.”

The other parts of an information element include the variable’s value and associated metadata: the time the variable value was set, the source of the value (which could be the user, an agent, or an algorithm acting on the user’s behalf, an important distinction), and “more information,” which can be anything the developer wishes to insert but was originally intended for user-provided information.

To help make this more concrete, let’s look at an example. Here is an application used for travel planning:

[Figure: screenshot of the Travel application]

The application name is “Travel” and let’s suppose the user is Doug Dyer, identified by email address, [email protected]. The lower right part of the application indicates this is the first trip of three (1/3) that Doug has planned with this application. So, in the Confederation Web, the information for this trip could be found from something like:

    application: Travel
    userIdentity: [email protected]
    instance: 1

Suppose Doug began planning this trip in October when told there would be travel required to perform Experiment B at JFCOM in November.
The complete information element for the variable “Visiting” might be:

    application: Travel
    userIdentity: [email protected]
    instance: 1
    variable: Visiting
    value: JFCOM
    source: user
    more:
    time: 1097525186²

In the Confederation Web’s relational database this looks similar:

+-------------+-------------------+----------+----------+-------+--------+------+------------+
| application | userIdentity      | instance | variable | value | source | more | time       |
+-------------+-------------------+----------+----------+-------+--------+------+------------+
| Travel      | [email protected] |        1 | Visiting | JFCOM | user   |      | 1097525186 |
+-------------+-------------------+----------+----------+-------+--------+------+------------+

In the Confederation Web, this simple method is used to publish current values. Applications use the element table to store current variable values. SQL’s INSERT is used for new information elements and UPDATE is used for existing information elements whenever the value changes. Once an information element appears, any authorized, connected user may read it, but no one besides the publishing application may write it. To support a complete digital history, applications also INSERT information elements into another table, history, with the same schema as element. This enables anyone to query the history table, sorting results on time, to obtain the change history for a particular information element. This is how applications share information³.

Because the only requirement for integration is writing to and reading from a relational database, developers are free to use any programming language, library, or database interface (e.g., ODBC). This reduces the need for developers to agree on infrastructure and allows them to innovate freely or stick with tested technologies. Because developers are vested in the programming languages, libraries, and technology they have chosen, freedom to use what they wish is a significant advantage.
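The publishing steps described above (UPDATE or INSERT into element, plus an INSERT into history) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: it uses the sqlite3 module from the Python standard library as a stand-in for whatever relational database hosts the Confederation Web, the column types are guesses, the helper names create_tables and publish are hypothetical, and the user identity and the second value are placeholders. Access control is not modeled.

```python
import sqlite3
import time

# Column names follow the example table in the text; the types are assumptions.
SCHEMA = """
CREATE TABLE {name} (
    application  TEXT,
    userIdentity TEXT,
    instance     INTEGER,
    variable     TEXT,
    value        TEXT,
    source       TEXT,
    more         TEXT,
    time         INTEGER   -- Unix time, seconds since 1970
)
"""


def create_tables(db):
    """Create the element and history tables with identical schemas."""
    for name in ("element", "history"):
        db.execute(SCHEMA.format(name=name))


def publish(db, application, user, instance, variable, value,
            source="user", more="", now=None):
    """Store the current value in element and append it to history."""
    now = int(time.time()) if now is None else now
    key = (application, user, instance, variable)
    # UPDATE the existing information element, if there is one ...
    cur = db.execute(
        "UPDATE element SET value = ?, source = ?, more = ?, time = ? "
        "WHERE application = ? AND userIdentity = ? "
        "AND instance = ? AND variable = ?",
        (value, source, more, now) + key)
    # ... otherwise INSERT it as a new information element.
    if cur.rowcount == 0:
        db.execute("INSERT INTO element VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                   key + (value, source, more, now))
    # Every published value is also appended to the change history.
    db.execute("INSERT INTO history VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
               key + (value, source, more, now))
    db.commit()


db = sqlite3.connect(":memory:")
create_tables(db)
user = "user@example.org"   # placeholder identity, not from the paper
publish(db, "Travel", user, 1, "Visiting", "JFCOM", now=1097525186)
publish(db, "Travel", user, 1, "Visiting", "SOCOM", now=1097525200)

current = db.execute(
    "SELECT value FROM element WHERE variable = 'Visiting'").fetchall()
changes = db.execute(
    "SELECT value, time FROM history ORDER BY time").fetchall()
print(current)
print(changes)
```

After the two calls, element holds only the latest value while history holds both, which is the split between current values and change history that the text describes.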
In addition, this freedom promotes innovation while providing system architects and managers the complete control and visibility needed to ensure a good integrated system.

² In the Confederation Web, Unix-style time is used, seconds since 1970.

³ Developers may also store descriptive metadata about variables and applications in a third table, elementDescription. But that’s about as complex as it gets. The Confederation Web has only these three, pre-defined tables.

Application developers are free to choose the content and formats they think are most useful to others. This is a key design decision justified in the Appendix, but it does not necessarily preclude decision by consensus. Developers may choose to create a community of interest and build a consensus on desired content and format; or they might consult only with their primary customer; or they may unilaterally decide these issues.

Some may worry that authorizing developers to dictate the information published will not satisfy system needs. This might be true if system architects and managers decide not to allow competition between developers, the old way of doing business. With competition, no one has a monopoly on technology. Any component can be reimplemented, improved, and offered as an alternative. Knowing this, developers are likely to produce good products because if they don’t, someone else will come along and “build a better mouse trap.” If so, then there are multiple components from which to choose. Market forces can quickly eliminate bad choices, either by helping developers see the need for re-design or by “naturally selecting” against applications that don’t evolve to satisfy customers. Ultimately, market forces are stronger, more enduring, and more efficient than even the best centrally planned program. Computers can measure use patterns and market demand easily.

Semantics is typically a difficult issue.
Trying to define a relatively complete domain ontology can take a long time and can delay progress if required before writing applications. Every application naturally defines its own ontology, regardless of whether the Confederation Web is used and whether the ontology is formally documented. When one application uses information from another, its ontology is extended. By integrating a set of applications, it’s possible to create a family of associated ontologies that more or less cover a domain of interest (again, whether or not the ontology is documented). It’s often useful to document the ontology to gain greater insight, make applications more closely represent the domain, and facilitate the development of new functions and new applications. However, developers are not generally domain experts and thus have trouble defining the ontology initially. Luckily, the rapid prototyping approach implies frequent feedback from users familiar with the domain. Rather than trying to define a complete ontology a priori, I advocate using technology to assist users and developers in discovering and documenting the ontology iteratively, changing the application to reflect new insight. As is the case for most other constructive works, ontologies improve with iterations.

Following the example of the World-Wide Web, the Confederation Web makes publishing simple. Simple ideas include reliance on mature relational database technology and the notions of an information element, a pre-defined database schema, and authority for application developers to unilaterally select the content and format published. The intent is to support bottom-up development. Rather than waiting for agreement on a top-down design (which tends to arrive pretty late in the development schedule), developers can publish immediately.
A simple, pre-defined method of publication helped the World-Wide Web attain exponential growth, and the same method will help the Confederation Web grow too.

Knowledge that software components and data exist. A common problem identified by the DOD’s Net Centric Data Strategy is that users and developers generally do not know all the software components, services, and data elements that are available. Given the rate of technical progress, number of new commercial products, and number of Government programs aimed at component development and database upgrades, this is not at all surprising. We need technology to help find software components and data applicable to our current problem. Luckily, it’s easy to adapt the search engine technology that helps us manage unstructured information⁵ so that it helps manage structured information as well. The single-server version of the Confederation Web already has a “search engine for structured information.” This tool can help in the following ways:

- Identify relevant applications and data
- Find power-users based on use frequency and recency and access them via email
- Provide semantic clues on the meaning of variables and functions of applications
- Monitor the software development process

The search engine for structured information is documented in a paper on the Active Computing web site. By exploiting the special structure available in an information element, this search engine can help anyone find and understand the tools and data they need to solve a problem. From a developer’s perspective, the search engine helps find reusable components and useful data. From a project manager’s view, the search engine makes it easy to track development progress. All of these benefits accrue from applying search to the Confederation Web⁶.

The ability to physically access information from available software components.
Because the Confederation Web is implemented with relational databases, physical access is possible from anywhere on the net. The only requirements for querying are access information (e.g., database server, user, password) and an appropriate index for finding information elements. To find a particular information element, the Confederation Web uses a “URL for structured information.” For the single-server version of the Confederation Web, the URL for structured information is the variable name, application, user, and user’s problem-solving instance. For a multi-server version of the Confederation Web, it’s necessary to identify the database server as well, by IP address or name. The URL for structured information provides a uniform index that, with the relational database server, enables authorized users to get information from the Confederation Web from anywhere on the net using any appropriate client software.

⁵ Unstructured information is content which is currently interpretable only by a human. Examples include natural language text, graphics, video, and sound. In contrast, structured information refers to variable values in context (application, user, user’s problem-solving instance). Structured information is generally machine-interpretable and can be used to automate business rules, trigger actions, support machine reasoning, etc.

⁶ Such a search engine is not possible without a uniform database schema like the one used in the Confederation Web.

A way to interpret/understand information provided by software components. Typically, software developers provide some form of descriptive meta-data to help other people understand the domain of interest, data definitions, and software functions. Descriptive meta-data include source code comments, user and system documentation, data dictionaries, schema, models, and ontologies.
All of these require effort on the part of the developer to produce, and I refer to them as “semantics by declaration.” The Confederation Web has a table for storing and serving descriptive meta-data. However, the Confederation Web also enables a second method which requires no a priori effort on the part of developers but arises naturally as software applications are used. “Semantics by example” exploits example values to gain insight into the meaning of a variable. To understand this concept, consider the Travel application introduced earlier. One variable in that application is visiting and its meaning might reasonably be any of at least three things:

- A person
- A place (physical location)
- An organization

However, suppose you find these example values for visiting:

- University of Pittsburgh
- IBM Almaden Institute
- SOCOM
- DARPA

Clearly, these examples do not refer to a person, and they don’t seem to refer to a physical location (for example, DARPA could choose to move its facility to a different office building). For these reasons, most people would be able to use these example values to infer that visiting refers to an organization, not a person or a place. This understanding, arising from inference based on example values, is “semantics by example.”

As the Confederation Web is used, a large number of example values will be stored. After a short time, we can expect the Confederation Web to include all possible values for many variables that have enumerated values. For these variables and perhaps others, the range of possible values available in the Confederation Web should result in valuable insight into the meaning of each variable. Moreover, the distribution of values in the context of other relevant variables should provide a sense of normal behavior, thus facilitating recognition of abnormal, and possibly erroneous, information. For many applications, just knowing the associated variables can facilitate understanding the function of the application.
For example, if variables include origin and destination, then movement is implied. Knowing variable values in the context of a particular problem is similar to having a case (as in case-based reasoning) and enables hypotheses about the ontological relationships that apparently exist between variables.

“Semantics by example” is a powerful method for understanding applications and their variables. It requires no effort from developers, arising instead through normal use of applications. It can be used separately or in conjunction with the more traditional “semantics by declaration.”

A way to prevent unauthorized users from seeing sensitive information. Security is important in many domains including military command and control. One form of security is preventing unauthorized users from gaining access to sensitive information. Applications operating on client workstations and the Confederation Web servers are assumed to be under physical control, but what about the network? If the network is unprotected, hackers may be able to get sensitive information by sniffing packets or by simply accessing the database. Both of these problems are solved using off-the-shelf technologies. To prevent packet sniffing, network traffic between applications and the relational database hosting the Confederation Web should be encrypted using Secure Sockets Layer (SSL). To prevent direct access to the database, we can just use the database’s access control scheme, which generally involves user accounts and passwords protecting databases, tables, and even columns. The database normally has logging capability, which enables manual analysis as well as a variety of AI-based agents for checking out-of-norm or sensitive queries, providing another layer of protection for at least detecting hackers. These standard methods are difficult to improve upon. Adding additional components, services, and interfaces may add complexity without any real security benefit.
The more complex the interfaces and services provided, the more difficult it is to ensure the system is secure. Using a small, well-defined interface makes it easier to analyze and address possible methods of attack.

Fast, reliable performance. Performance is always desirable, but designing for performance involves many considerations. For serving structured data, the relational database provides a good foundation for reliable performance. Viable, commercial-quality databases perform well by definition (otherwise they would have been culled by market forces). Query optimizers improve each year based on market demand. If, for whatever reason, database performance is not adequate, one possible solution is replicated servers. Typical replication provides a master database that serves writes and multiple slave databases that serve reads. Synchronization may be an issue in some domains (and if so, the master may be used for immediate reads).

Another performance-enhancing possibility enabled by the design of the Confederation Web is to split the information served, hosting some information on one server and the rest on another. Information may also be replicated in a splitting scheme. If additional performance is needed, more splitting can be provided. The URL for structured information provides a uniform indexing method that makes splitting possible. The architecture of the multi-server version of the Confederation Web naturally mirrors that of the World-Wide Web and internet hosts in general:

[Figure: clients and Confederation Web servers connected through the internet]

In the architecture, both clients and servers are connected to the internet and routing is handled in the normal, robust way. For large loads (e.g., replicated servers), it makes sense to pay some attention to routing and distance, but otherwise, the architecture is unconstrained.
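To make the idea of splitting concrete, here is a small Python sketch of a multi-server “URL for structured information” and one possible rule for deciding which server hosts a given context. The paper does not specify a splitting rule, so the hash-based assignment below is purely an assumption for illustration; the server names, the StructuredURL record, and the helper functions are likewise hypothetical.

```python
import zlib
from typing import NamedTuple


class StructuredURL(NamedTuple):
    """Multi-server index: database server plus the single-server fields."""
    server: str
    application: str
    user_identity: str
    instance: int
    variable: str


# Hypothetical server names; a real deployment would list its own hosts.
SERVERS = ("cw1.example.org", "cw2.example.org", "cw3.example.org")


def server_for(application: str, user_identity: str) -> str:
    """Assumed splitting rule: hash application and user onto one server."""
    key = f"{application}|{user_identity}".encode("utf-8")
    return SERVERS[zlib.crc32(key) % len(SERVERS)]


def make_url(application, user_identity, instance, variable):
    """Build the full index for one information element."""
    return StructuredURL(server_for(application, user_identity),
                         application, user_identity, instance, variable)


url = make_url("Travel", "user@example.org", 1, "Visiting")
other = make_url("Travel", "user@example.org", 2, "Origin")
print(url)
print(other.server == url.server)
```

Under this assumed rule, every information element for a given application and user lands on the same server, so a client can query one host for a whole context. Other rules (for example, a fixed table mapping applications to servers) would work equally well with the same index.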
Servers each have the infrastructure of the Confederation Web, namely a relational database with three tables (element, history, and elementDescription), the first two of which hold information elements. Information elements are indexed using the “URL for structured information” and may be found using the “search engine for structured information.” This architecture has proven to be scalable for World-Wide Web servers and seems well-suited for rapidly integrating components and operating them for command and control in today’s global environment.

In the Confederation Web, speed is further enhanced by a design that requires no joins. Joining tables is a costly operation in a relational database. Transaction processing provides a good foundation for reliability because SQL statements either run to completion or fail, ensuring consistency in the database.

Viruses, Trojans, worms, spyware, and other malware are currently not as large a problem for relational databases as for web browsers, email, chat, and client machines in general. One reason is that the command and control applications that operate in the Confederation Web arrive from well-known, trusted sources and do not automatically download or otherwise receive and run arbitrary programs. Another reason is that the relational database server has a simple interface that supports only writing data to and reading data from the database. Clients read the data but do not attempt to execute it. Current malware is extremely difficult to eradicate from an infected client, so these features have the potential to prevent a very painful and difficult problem that can affect both reliability and performance.

Denial of service attacks dramatically affect the reliability of any server and are difficult to prevent. Good databases can be set to deny access to all but specific hosts.
Although a denial of service attack may still occur, a combination of replication, network partitioning, and access control can help reduce its effect.

Performance is difficult to deliver if it is unaffordable, an especially relevant consideration for global deployment. Experience suggests technology with the least cost and fewest restrictions is easiest to transition. Government contracting is a slow process, and many decision-makers are concerned with justifying expensive systems, two factors that may combine to slow the momentum in any program. Fortunately, there are low-cost options for relational databases that run well on commodity hardware. For many purposes, the open source MySQL database rivals Oracle, the world’s most expensive database. It is certainly possible to spend a lot for software, but careful elimination of unnecessary components, performance testing, and an analysis of alternatives are certainly worthwhile.

Summary

“The most important thing we do in an ACTD is Transition New Capability to the warfighter!”
-- Joint Coordinated Real-time Engagement ACTD Brief, 2 Mar 05

Warfighters gain new capability from applications, not infrastructure. The information service infrastructure is there to support applications. The Confederation Web was designed to facilitate application development and support on-going operations given the dynamic availability of new decision aids, situation awareness tools, and simulation models.

Rapid development and the growth of structured content are promoted using a simple publishing method based on information elements, conventional relational database technology, a straightforward, predefined schema, and authorization for publishers to unilaterally choose content and format. An innovative “search engine for structured information” allows anyone to know the applications and data currently available, along with other useful information such as who the power-users are.
A “URL for structured information,” together with database servers, allows any authorized user to connect with any appropriate client to easily access and share information. “Semantics by example” is a compelling new tool for understanding applications and their variables. It augments the more traditional “semantics by declaration” that may also be included in the Confederation Web. The Confederation Web uses conventional methods to prevent unauthorized users from accessing sensitive information: SSL to protect the network, and access control, logging, and AI agents to protect the server and detect malicious acts or attempts.

Fast, reliable performance is fundamental to the relational database market, and databases provide a good foundation for performance. Architectural designs that use replication or splitting (enabled by the Confederation Web design) can increase performance with additional hardware. Furthermore, the Confederation Web schema requires no joins. Malware plaguing other kinds of information systems is currently not a problem for relational databases and, by virtue of the restricted interfaces of clients and servers, problems are not anticipated. Denial of service attacks can be a problem for any server, and both conventional and innovative new techniques can help.

Finally, high performance and new military capabilities are not possible if the system does not transition. Every good design should be affordable, and future costs should be predictable, to maximize the number of users and minimize the factors that make decision-makers hesitate.

Comments are solicited.

Appendix: Justifying choice of infrastructure and publishing authority in the Confederation Web

The Confederation Web specifies and provides the minimal infrastructure necessary to enable application developers to immediately publish whatever content they choose in whatever format they wish.
The Confederation Web defines the information sharing infrastructure, the uniform indexing method used to find information, and certain metadata. The information below justifies the choice of infrastructure and the key decision to allow developers to choose their content and formats.

Often, when a team of developers comes together to produce an integrated system, there is a lot of uncertainty about how the components will interface. To speed things up, certain decisions can be specified a priori to minimize required software interfaces, infrastructure, complexity, and cost, and to support innovation, portability, and flexibility. Also, the method of making design decisions may be chosen specifically to speed development. In the Confederation Web, we exploit both of these concepts, choosing the relational database as our information server and allowing developers to unilaterally choose published content and format, i.e., which variables to publish and the units, types, and file formats (e.g., XML) of their values.

Relational database technology is mature, standard, and widely available, with interface libraries for a wide variety of programming languages and operating systems. Relational databases are optimized for serving structured information and perform well for this purpose. Commercial-quality databases feature access control, transaction processing, logging, replication, and many other important features. Using a relational database to transfer information from one software component to another provides complete visibility into information flows, greatly reducing “finger pointing” between developers when trying to determine which component is responding poorly. Although the Confederation Web could have been implemented using different technology, all of these features make the relational database an attractive choice.
Allowing developers to choose the information they publish and its format raises the real possibility that other developers will need to translate formats (and that the information sought is not available; let’s defer that problem for a moment). The need to translate is often perceived as a major problem by software architects, who often depict the problem as the need for n² interfaces, as connoted in the following diagram, an example for seven components:

[Figure: seven components, fully connected by double-ended arrows]

In the figure, each double-ended arrow represents a bi-directional software interface. Without a common ontology and common data formats, integration potentially requires n-1 interfaces for each of n components, in other words (approximately) n² interfaces. If a single new component is added, then n more bi-directional interfaces are required, and because n could be a big number, the impact on scalability seems obvious. Concern about this problem has resulted in design emphasis on common ontologies and common data formats as a precursor to other development activities, but this approach is normally a lengthy one… and, as it turns out, unnecessary.

For two practical reasons, the “n² interfaces” problem never arises. First, typically each component is implemented by a different developer. If there are n developers, then each developer must only implement n-1 interfaces even though the overall system requires, at worst, about n² interfaces. In other words, the system is not built by one developer but by n developers, and so the difficulty of creating interfaces is, for each developer, linear in the number of components, not quadratic. Second, in practice, most components require information from only a few of the others: the components are sparsely connected, not fully connected. Therefore, the feared problem of having to maintain “many interfaces” never achieves its potential. For these reasons, maintaining interfaces for translation is not as difficult as many software architects may think.
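The interface counts in the argument above can be written out explicitly. The worked numbers below use n = 7, matching the seven-component example; the symbol k, the largest number of other components that any one component actually draws information from, is introduced here only for illustration and does not appear in the original argument.

```latex
% Fully connected case with n components
\[
\underbrace{n-1}_{\text{interfaces per component}}
\qquad
\underbrace{\frac{n(n-1)}{2}}_{\text{distinct bi-directional interfaces}}
\qquad
\underbrace{n(n-1) \approx n^{2}}_{\text{interface endpoints to implement}}
\]

% Worked example for n = 7
\[
n-1 = 6, \qquad \frac{7 \cdot 6}{2} = 21, \qquad 7 \cdot 6 = 42
\]

% Adding one component to n existing components
\[
\frac{(n+1)\,n}{2} - \frac{n(n-1)}{2} = n
\]

% Sparse case: each component uses information from at most k others
\[
\text{endpoints per developer} \le k, \qquad
\text{endpoints in total} \le n k
\]
```

With seven components, each developer writes at most six translators even in the fully connected worst case, and far fewer when k is small.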
Giving developers authority to make unilateral choices on published content and formats speeds their design and implementation because it reduces the need for consensus. In our experience, a translator can be written in a day while achieving design consensus can take weeks. For these reasons, developers can choose the information content and format they publish to the Confederation Web.

Appendix: The power of a free market in software development

In traditional Government acquisition programs, the Government opens a solicitation to competition from a variety of vendors. Once all the proposals are in, the best is selected and a contract is awarded. If the vendor delivers as promised, then the Government gets a good deal. However, the competition is effectively over once a proposal is selected. For complex development programs, the Government sometimes spends a lot of money to help ensure and verify that the product will be as promised.

For software development, the barrier to entering the market or producing a different software component is relatively low. Software development programs are always complex. For these reasons, it is not always a good idea to end competition with proposal selection. When management controls restrict competition, developers have a monopoly and are free to produce whatever they wish. Program managers and system architects have limited options for ensuring quality (and these options are all more difficult than simply choosing the best alternative).

Luckily, smart program managers can opt out of this old method. While Government contracting and traditional program management tend to favor centralized planning and developer monopoly, agile program managers can work within existing contracting limitations to promote competition even after proposal selection, thus providing continuing incentives for quality and innovation.
Capitalism beats communism every time, and incentives are effective at raising the quality of the products the Government acquires.

When spending the taxpayer’s money, it’s a great idea to avoid duplication of effort. However, for research and development or any complex system, the right approach is not always clear. If two developers are initially funded to create the same component, then, after some development, the program manager can choose between them to select whichever is best. This selection process, like any customer’s choice, drives quality and innovation. Avoiding duplication of effort should be balanced with the need for quality enabled by having multiple choices among competing products in a free market. After all, an alternative approach may cost more anyway. If only one product is funded, it may fall below minimum quality requirements and have to be re-implemented, wasting both time and money.