Download Integrating Diverse Information Repositories: A Distributed Hypertext

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Microsoft Jet Database Engine wikipedia , lookup

Database wikipedia , lookup

Microsoft SQL Server wikipedia , lookup

Relational model wikipedia , lookup

Clusterpoint wikipedia , lookup

Concurrency control wikipedia , lookup

Database model wikipedia , lookup

Versant Object Database wikipedia , lookup

Transcript
Integrating Diverse Information Repositories:
A Distributed Hypertext Approach 1
John Noll and Walt Scacchi
University of Southern California
Los Angeles, CA 90089-1421 USA
(213) 740-4782, (213) 740-8494 Fax
(jnoll, scacchi)@pollux.usc.edu
Abstract
Today's networked software engineering environment is characterized by a multitude
of autonomous, heterogeneous information repositories, a variety of incompatible user
interfaces, diverse, unconventional data types, including text, graphics, and possibly
video and sound, rapid change, both in structure and content, and multiple ways of
viewing relationships among the same information items. Existing information storage
mechanisms fail to combine diverse data types/models, complex objects and storage
structures, personal views and organizations of shared objects, access to distributed,
heterogeneous repositories, and ease of evolution. This paper examines these issues
and describes a Distributed Hypertext architecture that provides transparent access to
autonomous, heterogeneous software object repositories, resulting in both a powerful
organizational tool and a simple yet eective integration mechanism.
Keywords: Hypertext, Distributed Systems, Heterogeneous Distributed Object
Management, Distributed Software Engineering Environments, Distributed Hypertext
1
Revised version appears in IEEE COMPUTER, Vol.24(12), December 1991, pp.38-45.
1 Introduction
This paper addresses the problem of integration: how to provide transparent access to heterogeneous information repositories, while maintaining their autonomy.
In today's networked computing environment, powerful workstations enable a variety
of sophisticated, graphics oriented applications; local area networks (LANs) connect these
workstations to le servers; these LANs are themselves linked by wide area networks, enabling access to information around the globe. This abundance leads to a diversity of access
protocols, storage managers, data formats, and user interfaces, rendering much data inaccessible to any single user simply because there are too many things to know before one
can even begin. Furthermore, nding an item once is not necessarily the same as nding it
again. The original discovery process may be lengthy and involved, or even accidental; once
discovered, interesting objects should be organized so they can be easily retrieved again.
The key problems we address are how to support multiple, heterogeneous repositories,
each under separate and autonomous administration; a variety of incompatible interfaces;
diverse, unconventional data types; and dierent ways of viewing relationships among the
same information items. In this paper, we present a radically dierent solution to these
problems. Based on our Distributed Hypertext (DHT) Architecture, our solution combines
transparent access to autonomous, heterogeneous information repositories with a powerful,
exible organization technique. This is accomplished without change to the structure or
content of participating repositories.
2 Related work
To what extent do existing systems address this problem? They can be broadly classied
into two approaches: distributed lesystems and heterogeneous databases. Distributed le
systems provide access to les distributed among a network of le servers, which may be of
dierent architectures. The resulting le space can be extremely large.
A problem with le systems is that a le's name must serve several functions: it indicates
the location of a le in the le system hierarchy; it expresses the le's purpose; it often
contains an indication of the le's type in the form of le name extensions; and it may
include some notion of the relationship of a le to other les in the same directory. This
overloading can lead to complex, cumbersome lenames that are dicult to manage and
evolve.
Attributed le systems [10] attempt to solve this problem by attaching attributes to les;
these enable additional information, such as type, creation and access methods, relationships,
etc. to be expressed. In addition, they provide a way of locating relevant les by associative
search against attribute values.
File systems share the common drawback that relationships among objects can only be
expressed by grouping them into directories, or, in the case of attribute based systems, by
attaching attributes that point to other les. Directories can only express membership in a
set; attributes force relationships that are really separate entities to be expressed as part of
one or more other objects.
Database management systems address this relationship problem by providing primitive
to model relationships explicitly. Heterogeneous database systems add the ability to integrate
1
many separate databases into logically unied worlds. This can result in a single global
schema, or many application specic schemas, as in federated architectures [5].
The chief problem with heterogeneous database systems is how to provide transactions
while preserving autonomy. A transaction is typically controlled by a transaction manager. If
the transaction involves objects from more than one database, then some or all participating
databases become subordinate to the transaction manager, violating their autonomy.
Summarizing, we notice that le systems oer exibility and ease of use, but organization
is limited to hierarchies of directories. Databases oer many high level features, including
powerful organization primitives and multiple views, but must sacrice autonomy for transactions. Yet, there appears to be a gap between the two; this gap can be lled by DHT, as
we describe next.
3 Distributed Hypertext: A solution
Hypertext is a simple concept for organizing and viewing information. Chunks of text
are stored in objects called nodes, which may be individual les, entries in a database, or
possibly generated on demand at each access. These nodes may be logically connected with
named relationships called links. Links may represent such concepts as semantic relationships
between nodes, a logical progression from one node to another, citations in an article, or cross
references. They may be anchored, in which case the endpoints of the link are represented
by an icon or other indicator in the contents of the linked nodes (see Figure 2). Usually,
the links are one-way; the resulting structure forms a directed graph called a corpus. Users
navigate through the corpus by following links from node to node.
Our integration solution is based on the hybrid nature of hypertext. Conklin describes
the \essence" of Hypertext as a combination of three components: a \database method", a
particular way of accessing information by following links to information nodes; a \representation scheme", a technique for structuring information, and an \interface modality", a style
of user interaction based on direct manipulation of link \buttons" [2][p.33]. We exploit these
three features of hypertext to achieve integration while preserving autonomy. Specically,
we address the following issues.
Organization and Flexibility. Hypertext oers a simple, natural method for organizing textual data; it can be extended to handle multiple media types (i.e., Hypermedia2). User
interaction is simple and straightforward, based on a common set of commands applying to
all object types. Yet, the data elements can be organized into complex structures by linking
them together. Furthermore, data elements typically have attributes, so the hypertext can
be searched by querying against node attributes, links, or contents.
Transparency. Schatz describes three types of transparency that must be provided by
an information environment: type transparency, the ability to manipulate an object using the
same interaction technique independent of its type; location transparency, whereby an object
should be accessible in the same fashion whether local or remote; and scale transparency,
which requires objects to behave the same whether there are 100 or 100,000 objects in the
system [8]. In a heterogeneous environment, source transparency must also be addressed;
objects should be accessed uniformly regardless of the type of repository that manages them.
2
Henceforth, we will use the terms hypertext and hypermedia interchangeably.
2
Hypermedia systems have demonstrated type transparency by presenting seamless access
to diverse media types, including text, graphics, sound, and images. Distributed hypertext
systems have demonstrated location transparency (see Section 4). It remains to be seen
whether hypertext can achieve scale transparency in a very large system containing hundreds
of thousands of objects. Our approach is discussed in Section 5.
Autonomy. Hypertext transactions are simpler than those of databases. A hypertext
transaction typically involves reading or editing a node, or traversing a link. These operations
are performed on a single object: a node or a link. Thus, managing any single transaction
is the responsibly of a single repository. This makes hypertext ideally suited to integrating
autonomous information sources, since transactions do not cross administrative boundaries
and thus can be supported without global control.
Hypertext nodes and links can be combined to form complex structures. This capability
can be exploited to accommodate design autonomy, whereby local databases preserve the
ability to maintain their structure and content to support existing applications. By providing dynamic gateway processes, local objects can be transformed into hypertext structures,
and hypertext operations can be transformed into local queries. The underlying database
structure remains unchanged.
4 Sidebar: Survey of Distributed Hypertext Systems
Most hypertext systems are single-user, centralized designs [2]. There are several examples,
however, of distributed hypertext systems. Most of these employ a single server to store
the links and nodes for a hypertext; client programs running on workstations connected by
a local-area network access this server to manipulate portions of the hypertext. Examples
include Neptune [3], which stores nodes and links in a single database, and Intermedia [11],
DIF [4], and KMS [1], which store links in a database and nodes in les.
Multiple server designs allow nodes to reside on several dierent server machines, and
links to be made between such nodes. This category includes Sun's Link Service [7], Distributed DIF, and VNS [9], which store links and node locations in a database, and PlaneText [2], which stores both links and nodes in Unix les.
The ultimate category includes systems like Telesophy [8], and Xanadu [6]. These are
intended to link the knowledge base of a large organization, community, or nation.
5 An Architecture for Distributed Hypertext
To implement the solution outlined above, we begin with a DHT architecture to serve as
the infrastructure for distributed information management. This architecture is based on a
client-server model, and includes the following four components:
A Common Hypertext Data Model. From the client point of view, all objects in the
information space are described in terms of a common model, that denes both the structure
of information and the operations allowed on objects. The data model has three basic object
components: nodes, the container objects; links, representing relationships between nodes
or sections within nodes; and attributes of links or nodes, which serve to identify various
properties of objects.
3
Client applications
LAN
Gateway
Software
Archive
LAN
Gateway
OODB
Bibliography
Database
Gateway
Gateway
Ingres RDB
Wide Area
Network
in-memory
database
file system
Name
Server
LAN
Gateway
Gateway
Client applications
Link
Server
B-tree
B-tree
Annotation
Server
Figure 1: Components of the Distributed Hypertext Architecture
A set of operations on these components denes how clients access nodes and links.
These operations are: Create an object; Update the contents or attributes of a node, or the
endpoints or attributes of links; Delete nodes and links; and Traverse a link from one node
to another.
A Communication Protocol. All components communicate via a common application protocol, which implements the operations dened by the data model and provides a
mechanism for moving objects between clients and servers.
Servers. The content and structure of the hypertext are managed by servers. Servers
have two components: a gateway process that transforms hypertext operations into local access operations, and local information objects into hypertext nodes or links; and an information repository that contains the information to be accessed. These information repositories
are the entities to be integrated; they may be le systems, databases, or special purpose storage managers. From the repository point of view, the gateway process appears to be another
local application that accesses the repository's data. In this manner, existing databases can
be incorporated into the DHT architecture without having to copy their data or modify their
schemas. Also, existing applications continue to function as before.
Client Applications. Clients implement applications and specic styles of user interaction; they serve as a user interface to the primitive operations provided by the hypertext
data model.
These components are shown in Figure 1.
5.1 Summary
The architecture described above solves the integration problem by implementing hypertext
in a distributed, heterogeneous environment. By doing so, we take advantage of the transparency, organization, and exibility inherent to hypertext, while exploiting the hypertext
data access modality to preserve repository autonomy.
Heterogeneity and Integration. Integration is accomplished by the common data
model, which allows diverse information from diverse sources to be interpreted in a uniform
manner by client applications. One of the key advantages to this integration strategy is that
4
a source of information can participate without having to coordinate with other sources.
This is because of the nature of hypertext nodes as separate and distinct entities, and
the representation of relations via links that are separate from nodes. The result is that
integrating a new repository is only a task of translating objects into the common data
model, and implementing a server to accept and process protocol messages. Adding a new
server does not aect existing nodes, links, or servers.
Transparency. The hypertext data model describes the structure of objects and the
operations on objects; the communication protocol implements these operations. Therefore,
any client can manipulate any object using the same set of operations, regardless of source
or type.
Transactions and Autonomy. Repositories maintain complete control over the type
and number of objects exported, who may access them, and what operations may be performed. Only those objects exported are known outside of the repository; there is no requirement to export a repository's entire schema. Also, there is no need for a particular server
to provide write access to its database, either to users or system functions. Furthermore,
existing applications access the database in the same way, because the server handles all
translations between local and common object descriptions.
Transactions are restricted to read or update operations on a single node or link, and
create operations that invoke a server's constructor procedure. Therefore, there is no need
for a global transaction manager that might infringe upon local control and authority, since
any transaction is managed solely by the aected server.
The result of this approach is to render a complex, evolving object space comprehensible
by applying a simple set of structuring operations.
6 Implementation
This section describes how the Distributed Hypertext System (DHT) works in our prototype implementation. We start with an overview of the implementation, then proceed to
a discussion of how hypertext functions are carried out from the client and server point of
view.
6.1 System Overview
The goal of our prototype system is to demonstrate how existing heterogeneous repositories can be integrated using the DHT architecture. Therefore, we built several servers
to provide access to a variety of repositories ranging from simple hash tables to objectoriented databases. Specically, servers were constructed to access an existing software
project database implemented using the Ingres relational database management system; an
object-oriented database containing bibliographic entries; a name server to managing addresses of the other servers in the system; and the Unix le system.
Additionally, we implemented link and annotation servers using a B-tree storage manager.
The link and annotation servers store annotations linked to other objects, as well as user's
personal links. These servers enable user's to attach comments to other nodes in the system,
and build personal link networks. We have several clients that provide users with access to
the servers listed above, including a hypertext browser providing a means for navigating links,
5
viewing and editing nodes, and creating annotations to nodes, and several shell commands
for creating and retrieving nodes from specic servers.
See Figure 1 for an overview of these components, and the accompanying sidebar for
details on implementing a DHT server.
6.2 Using the System
User's engage in three basic activities: browsing links, viewing and editing nodes, and creating new objects. These tasks are carried out using a window-oriented browser/editor.
Servers support these activities by exporting data objects, providing type descriptions of
those objects, and creating new objects in their storage managers in response to user's (or
client's) requests. We will examine each of these activities below.
6.2.1 Browsing
Browsing involves repeated execution of the following steps by a client process:
1. Locate appropriate link servers. As discussed previously, links are not necessarily
stored as part of the nodes that they link. Therefore, a client managing the browsing
process must rst determine what server to contact to retrieve the list of links originating at
a node. This server may be specied with each command, or in a default conguration le.
By convention, each user has a default conguration that species, among other things,
where to look for links.
2. Retrieve the list of links associated with a node. Once the link server has been specied,
the client simply sends a message to the server requesting the list of links emanating from
the desired node.
3. Filter the list according to attributes specied by the user. Applications and users may
be interested in only a subset of links associated with a node. Therefore, a user may specify
a lter to apply to the list of links.
4. Present the resulting links to the user. This step is application dependent as well as
link dependent. Some links may simply represent a relationship between two nodes; others
are \anchored", representing a relationship between a subcomponent or section of one node
to one in another node. The latter style of link is best presented by emphasizing the link's
\anchor" in the displayed node (see Figure reflink-example, where anchors are shown in
boldface type), while the former can be selected from a menu-style list.
5. Retrieve and display the endpoint of a chosen link. Once the links have been presented,
either as highlighted anchors or menu items, the user can select one for traversal. This step
can take place only after a client determines that it has sucient local resources to display a
node. This is because certain node types require special hardware or software; a video object,
for example, requires special playback and display hardware whereas text can be shown on
any terminal. To facilitate this step, object identiers contain a type eld, so clients know
an object's type before retrieving the whole object. Once it determines that it can present
the contents of a node, the client process sends a message to the server that stores the node,
requesting that the it be returned.
6
Figure 2: Displaying Links
6.2.2 Creating New Objects
Object (node or link) creation is complicated by the need for servers to have absolute control
over the creation process in order to ensure consistency of the local repository. This means
that users cannot arbitrarily create nodes or links at any server. Rather, to allow servers to
control access and meet repository consistency constraints, we provide object \constructors",
local procedures that create complex objects from client-supplied arguments.
Constructor arguments are specied in object type descriptions. When a user requests
to create an object, he must specify the type of object to create. The client then retrieves
the appropriate type description, and interprets the constructor argument specication. The
user is then asked to supply values for each of the constructor arguments. Figure 3 shows a
type descriptions with constructor arguments.
7 Sidebar: Implementing a server
Servers must perform three tasks to support browsing: export data objects, provide type
descriptions, and execute object constructors. Here we describe how these tasks were implemented for our Software Archive server.
The server provides access to an existing Ingres database containing components of software projects including specications, manuals, and source code. This database has been
used to archive student projects collected over the past three years. The archive is partitioned
into more than 80 Projects, each of which is then further partioned into seven subsections
called Forms. Each Form corresponds to a phase in the software lifecycle and contains
numerous Modules that are the actual contents of a project.
The database schema comprises two relations: a Projects relation which stores the
project's name, le system location, and login names of users having access to the project;
and a Modules relation which stores information about each module in a project. These
7
Relation:
column name
project
form
name
instance
type
heading
author
status
Modules
type length
c
20
c
32
c
12
integer
4
c
12
c
64
c
16
integer
4
defclass Module {DHT Node}
{attributes
{{name
type:string
{instance type:int
{type
type:string
{heading
type:string
{status
type:int
{author
type:string
{constructor
{{name
type:string
{instance type:int
{heading
type:string
{contents type:string
{parent
type:oid
Relation:
column name
name
home
Projects
type length
c
20
text
1024
Relation:
column name
project
name
Engineers
type length
c
20
c
16
defclass Project {DHT Composite}
{attributes
{{name
type:string
{home
type:string
{engineers type:string
{constructor
{{name
type:string
{home
type:string
{engineers type:string
defclass Form {DHT Composite}
{attributes
{{name
type:string
{constructor {{}}}
card:
card:
card:
card:
card:
card:
1
1
1
1
1
1
desc:
desc:
desc:
desc:
desc:
desc:
"module name"}
"instance number"}
"module type"}
"module description"}
"module status"}
"module author"}} }
card:
card:
card:
card:
card:
1
1
1
1
1
desc:
desc:
desc:
desc:
desc:
"module name"}
"instance number"}
"module description"}
"module contents"}
"module parent"}} }
card: 1 desc: "project name"}
card: 1 desc: "root directory"}
card: * desc: "project developers"}} }
card: 1 desc: "project name"}
card: 1 desc: "root directory"}
card: * desc: "project developers"}} }
card: 1 desc: "Form name"}}}
Figure 3: Software Archive Relations and their associated DHT type denitions.
8
relations are depicted in Figure 3.
This database, though simple, presents a number of interesting problems to an integrator.
All projects have the same Forms, and each Form has a set of \core" modules that must
always be present. This core set may be augmented by additional modules at the discretion
of project team members. Any module added to the database must be associated with a
Project and Form within that project. These constraints can cause update anomalies if users
are allowed unrestricted authority to create views of the database.
DHT solves this problem through object constructors that create components of complex
objects in a locally consistent manner. The Module constructor, for example, requires that
the client specify the \parent" Form that is to contain the Module, as well as values for some
of the Modules attributes. The constructor procedure supplies values for the remaining
attributes (status, type, and author), and determines appropriate values for the project and
form els of the Modules relation from the parent argument supplied by the client. The
resulting values are appended to the Modules relation. If the parent object does not exist,
however, or the user is not assigned to the project, the creation fails and an error message
is returned.
The process of creating a server for this database involved four steps:
1. Deciding the types of nodes and links to be exported, and writing the type descriptions.
2. Determining which operations to support.
3. Writing SQL queries to implement these operations, which will be called from the server
process. These include the constructor queries to create new projects and modules.
4. Linking these components with the server library, and registering the new server with
the DHT nameserver.
There are three basic types to export from the archive database: Projects, Forms, and
Modules. The rst two are directory objects that serve as the source for links to their
children. Modules are container objects that have editable contents. The type descriptions
for each of these are shown in Figure 3.
The archive server supports the full range of DHT read operations, and creation of
Project and Module objects. Queries implementing these operations must therefore map
DHT concepts and objects into SQL queries against record attributes. For example, a \get
links" request for a form object is translated into an SQL query that retrieves all modules
that are part of the the specied object:
SELECT DISTINCT project, form, name, instance
FROM modules
WHERE project =:_project
AND form =:_form;
Of particular interest is the implementation of the Module constructor. From the DHT
perspective, this operation creates a new node, and links it to the specied form. From the
database perspective, it adds a new record to the Modules table, with the parent eld set to
the specied form.
These queries are embedded into a set of C language functions that are linked into a
server template provided as part of the DHT library. Once this process is complete, the
9
new server is ready to run; the last step is to add its name to the nameserver so that it can
register its address upon startup.
The entire development process was completed in two days, and comprises about 300
lines of C and SQL code.
8 Conclusion
We began this paper with a statement of four requirements for information integration:
transparency, autonomy, exibility, and organization. Our Distributed Hypertext architecture meets these requirements by extending hypertext to a distributed, heterogeneous
environment.
We chose hypertext as a solution strategy because of the its multifaceted nature: it
combines a user interaction technique, a data representation method, and a data storage
mechanism [2].
The user interaction facet provides transparency; users interact with information by
creating nodes and links and by browsing the resulting linked information space. Browsing
is conducted using a common set of tools regardless of the types of nodes contained in the
information space.
The data representation facet provides the organization and exibility to construct multiple views and complex structures by linking nodes. This simple organization technique is
both powerful and extremely exible; views can be changed or removed entirely simply by
adding or removing links. Yet the linked nodes remain unaected by such changes. Contrast
this with database management systems, in which dierent views can produce critical update
problems, and changes in schema structure can aect numerous data instances.
Finally, the data storage facet provides the key to implementing an integrated hypertext
in an autonomous and heterogeneous environment. This is because transactions under the
hypertext data storage model can be supported without compromising the autonomy of
participating information repositories. Furthermore, a hypertext storage mechanism can
manage diverse data types and can be implemented on a variety of storage managers.
Thus, the hypertext based approach results in two gains: a simple integration strategy
that preserves repository autonomy, and a powerful organizational tool that allows many
users and applications to construct personal views of shared objects.
The chief weakness of our solution is the necessity to implement a new gateway for each
new repository; this is a weakness we share with other integration strategies. However, our
experience has been that new gateways can be implemented in as little time as one day, so
this is less of a drawback than it might appear.
Our future work focuses reusable mechanisms for integrating new servers as well as continuing to gather experience with the prototype system. We also plan to implement more
sophisticated clients to perform software process modeling, and servers to integrate repositories for software engineering environments.
Acknowlegements. Christina Chen, Hongyan Luo, and Vien Chan participated in the
implementation of the prototype system. We would like to thank Peter Danzig, Deborah
Estrin, Doug Fang, Shahram Ghandeharizadeh, Steve Hotz, and Kim Korner for their many
helpful comments. This work has been supported in part by contracts and grants from
10
AT&T, Northrop Inc., Pacic Bell, and the Oce of Naval Technology through the Naval
Ocean Systems Center. No endorsement is implied.
References
[1] Robert M. Akscyn, Donald L. McCracken, and Elise A. Yoder. KMS: A distributed
hypermedia system for managing knowledge in organizations. Communications of the
ACM, 31(7):820{835, July 1988.
[2] Je Conklin. Hypertext: An introduction and survey. Computer, 20(9):17{41, September 1987.
[3] Norman M. Delisle and Meyer D. Schwartz. Neptune: a hypertext system for CAD
applications. In SIGMOD record, v15, n2 (June): pp. 132-142, pages 168{186, New
York, 1986. ACM.
[4] Pankaj K. Garg and Walt Scacchi. A hypertext system for software life cycle documents.
IEEE Software, 7(3):90{99, May 1990.
[5] Dennis Heimbigner and Dennis McLeod. A federated architecture for information management. ACM Trans. on Oce Information Systems, 3(3):253{278, July 1985.
[6] Ted Nelson. Managing immense storage. Byte, 13(1):225{233, 1988.
[7] Amy Pearl. Sun's link service: a protocol for open linking. In Hypertext '89 Proceedings,
pages 137{146, New York, 1989. ACM.
[8] Bruce R Schatz. Telesophy: A system for manipulating the knowledge of a community.
In GLOBECOM '87 Proceedings, Tokyo, November, 1987, pages 1181{1186, New York,
1989. ACM.
[9] Frank M. Shipman III. Distributed hypertext for collaborative research: the Virtual
Notebook system. In Hypertext '89 Proceedings, pages 29{135, New York, 1989. ACM.
[10] Marvin Theimer, Luis-Felipe Cabrera, and Jim Wyllie. QuickSilver support for access
to data in large, geographically dispersed systems. In The 9th International Conference
on Distributed Computing Systems, pages 28{35. IEEE Computer Society, June 1989.
[11] Nicole Yankelovich, Bernard J. Haan, Norman K. Meyrowitz, and Steven M. Drucker.
Intermedia: the concept and the construction of a seamless information environment.
Computer, 21(1):81{96, January 1988.
11