Project P817-PF
Database Technologies for Large Scale
Databases in Telecommunication
Deliverable 1
Overview of Very Large Database Technologies and Telecommunication
Applications using such Databases
Volume 1 of 5: Main Report
Suggested readers:
- Users of very large database information systems
- IT managers responsible for database technology within the PNOs
- Database designers, developers, testers, and application designers
- Technology trend watchers
- People employed in innovation units and R&D departments.
For full publication
March 1999
Deutsche Telekom AG
Koninklijke KPN N.V.
Tele Danmark A/S
Telia AB
Telefonica S.A.
Portugal Telecom S.A.
This document contains material which is the copyright of certain EURESCOM
PARTICIPANTS, and may not be reproduced or copied without permission.
All PARTICIPANTS have agreed to full publication of this document.
The commercial use of any information contained in this document may require a
license from the proprietor of that information.
Neither the PARTICIPANTS nor EURESCOM warrant that the information
contained in the report is capable of use, or that use of the information is free from
risk, and accept no liability for loss or damage suffered by any person using this
information.
This document has been approved by EURESCOM Board of Governors for
distribution to all EURESCOM Shareholders.
© 1999 EURESCOM Participants in Project P817-PF
(Edited by EURESCOM Permanent Staff)
The Project will investigate different database technologies to support high
performance and very large databases. It will focus on state-of-the-art, commercially
available database technology, such as data warehouses, parallel databases, multidimensional databases, real-time databases and replication servers. Another important
area of concern will be the overall architecture of the database and the application
tools, and the different interaction patterns between them. Special attention will be
given to service management and service provisioning, covering issues such as data
warehouses to support customer care and market intelligence, and database technology
for web-based applications (e.g. Electronic Commerce).
The Project started in January 1998 and will end in December 1999. It is a partially
funded Project with an overall budget of 162 MM and additional costs of around
20.000 ECU. The Participants of the Project are BT, DK, DT, NL, PT, ST and TE.
The Project is led by Professor Willem Jonker from NL.
This is the first of four Deliverables of the Project and is titled: “Overview of very
large scale Database Technologies and Telecommunication Applications using such
Databases”. The Deliverable consists of five Volumes, of which this Main Report is
the first. The other Volumes contain the Annexes. Other Deliverables are: D2
“Architecture and Interaction Report”, D3 “Experiments: Definition” and D4
“Experiments: Results and Conclusions”.
This Deliverable contains an extensive state-of-the-art technological overview of very
large database technologies. It addresses low-cost hardware to support very large
databases, multimedia databases, web-related database technology and data
warehouses. It also contains a first mapping of technologies onto applications in the
service management and service provisioning domain.
Executive Summary
Developments in information and telecommunication technology have led to a
situation where telecommunications service management and service provisioning
have become much more data intensive. For example, modern switches allow the
detailed recording of individual calls, leading to huge amounts of data which form the
input to core processes like billing. In addition, new network architectures like
Intelligent Networks and TINA, and also mobile networks are more data intensive
than traditional telephony. Large datastores are an integral part of these new
architectures and offer support for services like number portability, tracking and
tracing, and roaming. Also, the shift from traditional telephony to IP-based services
and to multimedia broadband services will make service provisioning more data
intensive, especially when operators enter areas such as Web hosting and E-commerce.
At the same time there is the liberalisation of the telecommunication market. As a
result the operators will face competition, which makes service management
(including customer care) and fast introduction of new services strategic assets to
compete. A central element in customer care is information on the customer. This
information is derived from all kinds of data available on the customers. The need for
this information is currently driving many initiatives within operators to build large
data warehouses that contain an integrated view of the customer base.
While we see an increasing need for data management in telecommunication service
management and service provisioning, at the same time we see a large number of
emerging database technologies. This makes the selection of the right database
technologies to support the telecommunication-specific needs very difficult. This
report will help readers to better understand recent developments in database technology
and to position them with respect to each other. It will also help in identifying
the database technologies that are relevant to specific applications in the service
management and service provisioning areas.
In order to focus the work, it was decided to concentrate on the areas of service
provisioning and service management. The reason is that these areas in particular will
require substantial investments in database technology in the near future, due to the
developments mentioned above. In addition, there is a focus on technology supporting
very large databases, motivated by the fact that service management and service
provisioning usually involve large customer bases and increasingly complex
services, resulting in very large databases. A rough picture of the Project focus is
given in this figure.
The report contains an extensive technological overview. Here we summarise only
the technologies that are, in our opinion, the most crucial: hardware to
support very large databases, multimedia databases and Web-related database
technology, and data warehouses.
As far as hardware platform support for Very Large Data Bases is concerned, we see
the following situation. For very large operational databases, i.e. databases that
require heavy updating, mainframe technology (mostly MPP architectures, Massively
Parallel Processors) is by far the most dominant technology. For data warehouses, on
the other hand, which mostly support retrieval, we see a strong position for the high-end
UNIX SMP (Symmetric Multi Processor) architectures. The big question with respect
to the future is about the role of Windows NT on Intel. Currently there is no role in
very large databases for these technologies; however, this may change in the coming
years. There are two main streams with respect to NT and Intel: on the one hand
NUMA (Non Uniform Memory Architecture, a kind of extended SMP architecture)
with Intel processors, and on the other hand clustered Intel machines. NUMA is more
mature and supports major databases like Oracle and Informix. However, NUMA is
still based on Unix, although suppliers are working on NT implementations. Database technology
supporting NT clusters is not really available yet, with the exception of IBM DB2.
This area will be closely followed by the Project and actual experiments may be
planned to assess this technology.
Multimedia databases and Web-related database technology are developing very fast.
All major database vendors support Web connectivity nowadays. There is a strong
focus on database-driven Web sites and E-commerce servers for the Web. The
support for multimedia data is rather rudimentary. Although vendors like
Oracle, Informix and IBM have made a lot of noise about Universal Servers that support
multimedia data, the proposed extensible architectures turned out to be relatively
closed and unstable. Current practice is still mainly to handle multimedia data
outside the database.
Data warehouse technology is one of the most dynamic areas nowadays. All database
vendors and mainframe vendors are active in this area. One has to be very careful here: a
data warehouse is not simply a large database. There is a lot of additional technology
for data extraction, metadata management, and architectures. Of course all major
vendors have their own methodology, and care has to be taken not to be locked in. A
rather new development is that of operational datastores: data warehouses
with limited update capabilities. In particular, no stable solutions exist yet for the
propagation of these updates back to the originating databases. Therefore great care
has to be taken when embarking on operational datastores.
Telecommunication services are becoming more and more data intensive; as a result,
the role of database technology will only increase. Therefore, decisions with respect
to database technology become crucial for maintaining control over the data
management around those services, and also for maintaining a strong, flexible and
competitive position.
This report is a state-of-the-art overview of database technology with a first mapping
of technologies onto applications in the service management and service provisioning
domain. The Project will deliver a further report on guidelines for the construction of
very large databases for the most dominant applications in the above domains. That
report will focus not only on the database but also on the embedding of the database in
the overall application architecture. Finally, the Project will report on a number of
hands-on experiments that will be carried out during 1999 to validate the guidelines
and to assess the database technology involved.
List of Authors
Jeroen Wijnands (overall editor)
KPN Research, The Netherlands
Wijnand Derks
KPN Research, The Netherlands
Willem Jonker
KPN Research, The Netherlands
1 Introduction
1.1 Technical introduction to Deliverable 1
1.2 Guidelines for reading Deliverable 1
1.3 Division of labour among partners
1.4 Introduction to the main report
2 Very large database definition
2.1 The definition
2.2 Examples of VLDB systems
3 Database technologies for telecommunication applications
3.1 Database Server Architectures
3.1.1 Hardware architectures
3.1.2 Data placement
3.1.3 Commercial database servers
3.1.4 Analysis
3.2 Retrieval and Manipulation
3.2.1 Query Processing in a distributed environment
3.2.2 Query processing in Federated Databases
3.2.3 Commercial database products
3.2.4 Analysis
3.3 Backup and Recovery
3.3.1 Security
3.3.2 Backup and recovery strategies
3.3.3 Commercial products
3.3.4 Analysis
3.4 Benchmarking
3.4.1 Available benchmarks
3.4.2 TPC benchmarks
3.4.3 Analysis
3.5 Performability modelling
3.5.1 Performability
3.5.2 Tools for performability modelling
3.5.3 Guidelines for measures for the experiments
3.5.4 Analysis
3.6 Data warehousing
3.6.1 Data warehouse architectures
3.6.2 Design Strategies
3.6.3 Data Cleansing, Extraction, Transformation, Load Tools
3.6.4 Target Databases
3.6.5 On-Line Analytical Processing (OLAP) Technology and Tools
3.6.6 Data Mining
3.6.7 Data Warehousing on the Web
3.6.8 Analysis
3.7 Transaction processing
3.7.1 Commercial TP Monitors
3.7.2 Analysis
3.8 Multimedia databases
3.8.1 Querying and content retrieval in MMDBs
3.8.2 Transactions, concurrence and versioning in MMDBs
3.8.3 Multi-media objects in relational databases
3.8.4 Multimedia Objects in Object-Oriented Databases
3.8.5 Analysis
3.9 Databases and the World Wide Web
3.9.1 The Internet and the World Wide Web
3.9.2 Database gateway architectures
3.9.3 Web databases and security
3.9.4 Web database products
3.9.5 Analysis
4 Mapping of telecommunication applications on database technologies
4.1 Service management applications
4.2 Service provisioning applications
4.3 The telecommunication applications/database technologies matrix
5 General analysis and recommendations
CDR Call Detail Record
CSCW Computer Supported Co-operative Work
DT Deutsche Telekom
DBMS DataBase Management System
ERP Enterprise Resource Planning
KPN Koninklijke PTT Nederland (Royal PTT Netherlands)
MPP Massively Parallel Processing (or Processors)
NUMA Non-Uniform Memory Architecture
PIR Project Internal Result
SD Shared-Disk (architecture)
SM Shared-Memory (architecture)
SMP Symmetric Multi Processing (or Processors)
SS Shared-Something (architecture)
TD Tele Danmark
TPC Transaction Processing Performance Council
UPS Uninterruptible Power Supply
VLDB Very Large DataBase
WinTel Windows Intel (platform)
1 Introduction
This Deliverable contains the results of activities carried out in task 2, entitled
“Overview of Very Large Database Technologies and Telecommunications
Applications using such databases”, of the P817 Project. This introductory chapter
provides some technical information, guidelines for reading the Deliverable as a
whole, the assignment of activities to partners and finally a guideline for reading this
main report.
1.1 Technical introduction to Deliverable 1
Telecommunication services are becoming more and more data intensive in several
areas such as network control, network management, billing services, traffic analysis,
service management, service marketing, customer care and fraud detection. For
several of these applications, fast access and secure data storage over a longer period
of time is crucial. Given the increasing amount of network traffic, registering and
retrieving these data requires very large databases, in the order of terabytes.
Successful exploitation of high performance database technology will enable PNOs to
optimise the quality of the telecommunication services at lower costs. It will also
enable them to expand the use of their services. To facilitate this, knowledge of the
key database technologies involved is crucial. Proper assessment of the database
technology involved in the telecommunication domain is a joint interest of the PNOs that
has a broad scope, a technical focus, and requires serious financial investments.
This Deliverable provides the first step of this process. The objective is to define the
term “Very Large Database”, give an overview of the telecommunication applications
using such databases and give an overview of the database technologies used in such
databases. Concerning the telecommunication applications the focus will be on
service management and service provisioning.
To achieve this goal the following four activities have been defined:
Define what a “Very Large Database” is (PIR 2.1)
Determine telecommunication applications using such databases (PIR 2.2)
Describe the state-of-art on database technology used with such databases (PIR
Make a telecommunication applications/database technologies matrix (PIR 2.4)
The results of these activities are recorded in this Deliverable.
1.2 Guidelines for reading Deliverable 1
Because of the large amount of information produced for this Deliverable, it was
decided to cluster the information into several volumes. This clustering also benefits the
disclosure of information to readers with different roles. Depending on the role, a
more or less detailed document can be read.
The Deliverable is composed of the following volumes:
Volume 1, the main report (this document), positioning and summarising the
knowledge gained in task 2. This volume is intended for readers who want a
high-level overview of the results of task 2. This volume is also a guideline for
determining relevant annexes.
Volume 2, Annex 1 entitled “Architectural and performance issues” contains
detailed descriptions on the subjects “Database Server Architectures”,
“Performability modelling and analysis/simulation of very large databases” and
“Benchmarks”. This volume is intended for specialists in the mentioned areas.
Volume 3, Annex 2 entitled “Data manipulation and management issues”
contains detailed descriptions on the subjects “Transaction Processing
Monitors”, “Retrieval and Manipulation” and “Backup and Recovery”. This
volume is intended for specialists in the mentioned areas.
Volume 4, Annex 3 entitled “Advanced database technologies” contains detailed
descriptions on the subjects “Web related database technology”, “Multi media
databases” and “Data warehousing”. This volume is intended for specialists in
the mentioned areas.
Volume 5, Annex 4 entitled “Database technologies in telecommunication
applications” contains detailed descriptions on “Telecommunication applications
using very large databases” and “Application requirements versus available
database capabilities”. This volume provides a bridge between telecommunication
applications and very large database technologies.
1.3 Division of labour among partners
The writing of this Deliverable has been a joint effort of KPN Research (task
leadership), DT, TD, TE and Telia, with the following division of labour:
Volume 1: KPN Research
Volume 2: KPN Research, Telefonica
Volume 3: Deutsche Telekom, Tele Danmark, Telefonica
Volume 4: Telia, Tele Danmark
Volume 5: Telia, Deutsche Telekom
1.4 Introduction to the main report
This volume is the main report of Deliverable 1. It provides a summary and overview
of very large database technologies relevant for telecommunication applications. To
be more specific, the application areas “Service Management” and “Service
Provisioning” have been defined as the areas of main interest.
Chapter 2 starts with the definition of a very large database in the context of this
Project.
Chapter 3 provides detailed descriptions of a number of very large database
technologies.
Chapter 4 provides a bridge between the database technologies described in chapter 3
and the telecommunication applications where they can be applied. First, the Service
Management and Service Provisioning applications are described. Service
management is a key issue for PNOs. For handling different services, management
systems such as billing systems, customer ordering systems, customer/user management
systems, etc. are available. These management systems rely heavily on very large
database systems. Service Provisioning involves the direct offering of services to
customers. Some examples are E-commerce, Video-on-demand, Tele-education and
hosting services. The increasing amount of data involved in these services also
requires very large database technology.
Finally, this volume ends with chapter 5, describing the general analysis and
recommendations.
2 Very large database definition
The purpose of this chapter is to define a Very Large DataBase (VLDB) as agreed
upon by the participants of the P817 Project. The definition of a VLDB is intended to
suit the needs of P817, and may or may not ignore databases usually considered very
large by others. Note that it is not our intention to give a mathematically precise definition
of a VLDB, because we think that is impossible. The definition is used to get a
common understanding among the P817 participants and Deliverable readers of the
term VLDB.
2.1 The definition
Before continuing, we have to clarify what we mean by the terms Database (DB),
DataBase Management System (DBMS) and Database System. A DB is a collection
of data with relations between the data as defined in a data schema. A DBMS is the
software that:
- supports the storage of very large amounts of data over a long period of time,
  keeping it secure from accident or unauthorised use, and allows efficient access to
  the data for queries and database modifications;
- controls access to the data of many users at once, without allowing actions of one
  user to affect the actions of other users and without allowing simultaneous
  accesses that could corrupt the data;
- allows users to create new databases and specify their schema (logical structure
  of the data), using a specialised language called a Data Definition Language;
- gives users the ability to query the data and modify the data, using an appropriate
  language, called a query language or Data Manipulation Language.
A DB system is the combination of a DB with a DBMS.¹
A Very Large DataBase system is mainly characterised by two issues:
- Size: the number of bytes needed to store the data, indexes, etc. It should be noted
  that the concept of a large size is a time-dependent and technology-dependent
  issue. First, storage systems are getting cheaper and cheaper, so a large size today is
  already a smaller size tomorrow. Second, 1 TB on a WinTel platform is called very
  large nowadays, while the same amount of data is regular for a mainframe.
- Workload: the number of concurrent users and the size of their transactions. Note
  that, according to this definition, a heavy workload is not by definition the same
  as “a large number of users”; e.g. a small number of concurrent users with large
  transactions (typical in a data warehouse environment) can also be a heavy
  workload.
We consider a database system to be a very large database system when it has a large
size and a heavy workload. The area of interest in the context of P817 is depicted in the
following figure.
¹ In practice, the terms Database, DBMS and Database System are often used
interchangeably.
Above, size and workload are used to characterise a VLDB system. When assessing
VLDB technology in the context of P817, the following issues will be addressed:
a) Scaleability:
expressed in two definitions: speed-up and scale-up. Speed-up is faster execution
with constant problem size. Scale-up is the ability of an N-times larger system to
perform an N-times larger job in the same elapsed time as the original system.
b) Performance:
defined as the absolute execution characteristics of the system. This includes
execution times, latency and throughput of the interconnect and disk I/O speed.
c) Manageability:
defined as the ease with which the total system is configured and changed.
Manageability addresses issues such as configuration, loading, backup and change
d) Robustness:
defined as how well the system can handle both software and hardware failures.
e) Costs:
defined as the costs required for setting up and maintaining the system.
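The two scaleability measures under a) can be made concrete with a small calculation. The sketch below is illustrative only: expressing each measure as a ratio of elapsed times is an assumption based on the common formulation of these terms, and the function names and timings are hypothetical, not taken from this report.

```python
def speed_up(elapsed_original: float, elapsed_larger: float) -> float:
    """Speed-up: the SAME problem run on the original and on the larger system.

    On an N-times larger system, linear (ideal) speed-up equals N.
    """
    return elapsed_original / elapsed_larger


def speed_up_efficiency(elapsed_original: float, elapsed_larger: float, n: int) -> float:
    """Fraction of the ideal N-fold speed-up that is actually achieved."""
    return speed_up(elapsed_original, elapsed_larger) / n


def scale_up(elapsed_original: float, elapsed_larger: float) -> float:
    """Scale-up: an N-times larger job on an N-times larger system.

    Ideal scale-up is 1.0, i.e. the same elapsed time as the original system.
    """
    return elapsed_original / elapsed_larger


# Hypothetical timings in seconds, invented for illustration only.
N = 4
print(speed_up(120.0, 40.0))                # same job, 4x larger system
print(speed_up_efficiency(120.0, 40.0, N))  # share of the ideal 4x speed-up
print(scale_up(120.0, 150.0))               # 4x job on 4x system; below 1.0 is sub-linear
```

A speed-up close to N and a scale-up close to 1.0 indicate that adding hardware pays off almost linearly; values well below that point to contention, for example on the interconnect or on shared disks.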
2.2 Examples of VLDB systems
So far we have not defined what we mean by “very large size” and “heavy
workload”. The main reason for this is the time dependence of these terms:
computers become faster and storage systems become larger. Giving absolute
numbers would make the definition obsolete within a short period of time. Instead,
we have decided to give some numbers applying to state-of-the-art (beginning
of 1998) running commercial database systems.
The systems we have chosen to use as examples are systems that can handle a certain
workload when the load is measured using Transaction Processing Performance
Council (TPC) standards. We realise that the systems used for these benchmarks are
tuned for optimal performance on this specific test, and that real-life business systems will
never reach the resulting performance, but the figures present a practical
performance and size upper limit for current systems. Of course there are many other
examples of VLDB systems, but TPC standards are well known and give a clear
picture of which classes of systems we consider to be VLDB systems. Section 3.4.2
gives a top four of current TPC figures.
3 Database technologies for telecommunication applications
The previous chapter provided a generally applicable definition of a very large
database. This chapter continues with detailed descriptions of relevant database
technologies. Readers for whom the material in this main report is already
too detailed can quickly scan the chapter by reading only the “Analysis” sections at
the end of each database technology. Readers for whom the material is not
detailed enough are referred to the annexes of this Deliverable ([1], [2], [3]).
3.1 Database Server Architectures
An important characteristic of the PNO business in general, and of PNO applications
in particular, is the large number of customers involved. Several million customers is
no exception. This large number of customers results in very large amounts of
data that have to be stored and processed. The very large database systems supporting
these applications need parallel architectures to meet the requirements.
First some theoretical architectures are described. This information is used to position
the commercially available architectures described in the subsequent section. For more
detailed information on this subject, the reader is referred to [1].
3.1.1 Hardware architectures
Parallelism should provide Performance, Availability and Scaleability. To meet these
needs, many hardware architectures are described in the literature, but the following four
represent the mainstream:
Shared-memory (SM) systems are complex hardware systems with multiple
processors, connected by a high bandwidth interconnect through which shared
memory and disks can be accessed. Each CPU has full access to both memory and
disks.
The shared-disk (SD) configuration consists of several nodes. Each node has one CPU
with private memory. Unlike the SM architecture, the shared-disk architecture has no
memory sharing across nodes. Each node communicates via a high speed interconnect
with the shared disks.
The shared-nothing (SN) architecture does not have any shared resource (except for
the interconnect). Each node consists of a CPU with private memory and a private
storage device. These nodes are interconnected by a network. This network is
typically standard technology. Shared-nothing systems are called loosely-coupled
systems.
The architectures described above are three pure architectures. All architectures have
advantages and disadvantages when looking at performance, availability and
scaleability. Therefore it makes sense to combine these three
architectures into what is called the Shared-Something (SS) architecture.
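The difference between the three pure architectures comes down to which resources the processors share. The following minimal sketch records this as data; the class and function names are hypothetical and introduced only for illustration, and the Shared-Something combination is deliberately not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Architecture:
    name: str
    shares_memory: bool  # do all CPUs access one common memory?
    shares_disks: bool   # do all CPUs/nodes access the same disks?


# The three pure architectures as described in the text above.
PURE_ARCHITECTURES = {
    "SM": Architecture("shared-memory", shares_memory=True, shares_disks=True),
    "SD": Architecture("shared-disk", shares_memory=False, shares_disks=True),
    "SN": Architecture("shared-nothing", shares_memory=False, shares_disks=False),
}


def shared_resources(arch: Architecture) -> list[str]:
    """List the shared resources; the interconnect is shared in every case."""
    resources = ["interconnect"]
    if arch.shares_memory:
        resources.append("memory")
    if arch.shares_disks:
        resources.append("disks")
    return resources


for key, arch in PURE_ARCHITECTURES.items():
    print(key, arch.name, shared_resources(arch))
```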
To summarise, Table 1 gives an overview of the hardware architectures in terms of
cost, DBMS complexity, performance, availability and scaleability.
Table 1: Comparison of hardware architectures (legend: + = high, 0 = moderate, - = low)
3.1.2 Data placement
When buying database hardware and software, the underlying architecture and
functionality are a given fact. This is not the case for the applications, data and
datamodel. Solid knowledge of how parallel processing influences these issues will
increase the performance, availability and scaleability. First, the type of transactions
performed on the database is important. Read-transactions, for example, will have
a different impact than write-transactions or read/write transactions. Next, the placement
of data over the available resources (nodes, CPUs, disks) in the parallel database is
important. On the one hand, splitting up data will increase the degree of parallelism
and thus increase the performance. On the other hand, the same splitting will have
negative consequences for data that has to be joined.
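The trade-off described above can be illustrated with a minimal sketch. It assumes hash partitioning over four nodes and a hypothetical table of call records that has to be joined with a customer table partitioned on cust_id; the table, column and function names are illustrative only and do not come from any product discussed in this report.

```python
from collections import defaultdict

def hash_partition(rows, key, n_nodes):
    """Place each row on the node selected by hashing its partitioning key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[hash(row[key]) % n_nodes].append(row)
    return nodes

def rows_to_ship(partitions, join_key, n_nodes):
    """Count rows that are not already on the node owning their join key."""
    return sum(
        1
        for node, rows in partitions.items()
        for row in rows
        if hash(row[join_key]) % n_nodes != node
    )

N_NODES = 4
# Hypothetical call records; integer keys keep the hash placement reproducible.
calls = [{"call_id": i, "cust_id": (i // 3) % 8} for i in range(24)]

# Option 1: partition on the join key, so calls sit next to their customer rows.
by_customer = hash_partition(calls, "cust_id", N_NODES)
# Option 2: partition on call_id, which spreads scans evenly but scatters customers.
by_call = hash_partition(calls, "call_id", N_NODES)

shipped_colocated = rows_to_ship(by_customer, "cust_id", N_NODES)
shipped_scattered = rows_to_ship(by_call, "cust_id", N_NODES)
```

With the first placement the join needs no data shipping, whereas the second placement forces part of the call records across the interconnect before the join can run.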
Commercial database servers
Parts of the theoretical hardware architectures described in section 3.1.1 are visible in
commercially available systems. Again, many architectures are available, but we limit
ourselves to four major architectures.
Symmetric Multi Processing (SMP)
SMP systems use a shared memory model and all resources are equally accessible.
The operating system and the hardware are organised so that each processor is
theoretically identical. The main performance constraint of SMP systems is the
performance of the interconnect. Applications running on a single processor system
can easily be migrated to SMP systems without adaptations. However, the workload
must be suitable to take advantage of the SMP power. All major hardware vendors
have SMP systems, with Unix or NT operating systems, in their portfolio. These systems
differ in the maximum number and capacity of processors, the maximum amount of
memory and the maximum storage capacity. SMP machines are already very common.
Clusters
The scaleability limits of SMP systems combined with the need for increased
resilience led to the development of clustered systems. A cluster combines two or
more separate computers, usually called nodes, into a single system. Each node can be
a uni-processor system or an SMP system, has its own main memory and local
peripherals and runs its own copy of the operating system. For this reason the cluster
architecture is also called a Shared Nothing architecture. The nodes are connected by
a relatively high speed interconnect. Commercially available cluster solutions
differ in the maximum number of nodes and the type of interconnect. In the
world of Unix, several cluster solutions are already available (e.g. the SUN Enterprise
Cluster). In the world of Windows NT, clusters are still in their infancy. Microsoft,
together with parties such as Compaq (especially the business units Tandem and Digital), is
working on clustering software (MS Cluster Server, codename “Wolfpack”). Up to
now, only two-node failover is available (e.g. NCR’s LifeKeeper) but the strong
position of Microsoft and the dominant presence of Windows NT will increase the
importance of NT clustering.
Massively Parallel Processing (MPP)
A Massively Parallel Processing system consists of a large number of processing
nodes connected to a very high-speed interconnect. MPP systems are considered as
Shared-Nothing, that is, each node has its own private memory, local disk storage and
a copy of the operating system and of the database software. Data are spread across
the disks connected to the nodes. MPP systems are very well suited for supporting
VLDBs but they are very expensive because they need special versions of the
operating system, database software and compilers, as well as a fundamentally
different approach to software design. For this reason, only a small top-end of the
market uses these systems. Only a few vendors have MPP systems in their portfolio,
among them IBM, with its RS/6000 Scaleable POWER parallel system, and Compaq
(Tandem) with its Himalaya systems.
Non-Uniform Memory Architecture (NUMA)
A NUMA system consists of multiple SMP processing nodes that share a common
global memory, in contrast to the MPP model where each node only has direct access
to the private memory attached to the node. The non-uniformity in NUMA describes
the way that memory is accessed: access time depends on whether the memory is local
to the processor's own node or attached to another node. In a sense, the NUMA
architecture is a hybrid of SMP and MPP technologies. As it uses a single shared memory and a
single instance of the operating system, the same applications as in an SMP machine
run without modifications. This makes NUMA a significant
competitor to pure MPP machines. Several vendors have computers based on NUMA
in their portfolio among which Data General, IBM, ICL, NCR, Pyramid, Sequent and
Silicon Graphics.
For all types of very large database applications, several parallel platforms are already
commercially available. Whereas at first only dedicated and expensive hardware and
software were available, new architectures such as NUMA
and Windows NT clusters are now appearing. These new architectures try to achieve the desired
scaleability, performance and manageability by using commodity components in
parallel, which should result in lower cost of ownership. At this moment, these
new architectures still have to prove themselves.
Retrieval and Manipulation
Using efficient data manipulation and retrieval algorithms becomes very important
when the database size and workload take on Very Large proportions. This section
describes some issues that are characteristic of data retrieval in distributed database
environments. Also, an overview is provided of the major commercial database
products in the VLDB segment.
For more detailed information on retrieval and manipulation or commercial database
products, the reader is referred to [2].
Query Processing in a distributed environment
The steps to be executed for query processing are in general: parsing a request into an
internal form, validating the query against meta-data information (schemas or
catalogues), expanding the query using different internal views and finally building an
optimised execution plan to retrieve the requested data objects.
In a distributed system the query execution plans have to be optimised in such a way that
query operations can be executed in parallel, avoiding costly shipping of data.
Several forms of parallelism may be implemented: Inter-query-parallelism allows the
execution of multiple queries concurrently on a database management system.
Another form of parallelism, intra-query parallelism, is based on the fragmentation of
queries into sets of database operations (e.g. selection, join, intersection, collecting) and
on parallel execution of these fragments, pipelining the results between the processes.
Intra-query parallelism may be used in two forms, either to execute producers and
consumers of intermediate results in pipelines (vertical inter-operator parallelism) or
to execute independent subtrees in a complex query execution plan concurrently
(horizontal inter-operator parallelism).
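The two forms of inter-operator parallelism can be sketched as follows. This is only an illustration of the structure, assuming a toy plan with a scan and a selection per subtree and an intersection on top; the operator names are hypothetical, and in CPython the threads show the concurrency pattern rather than a real speed-up.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

DONE = object()  # end-of-stream marker passed between operators

def scan(rows, out):
    """Producer operator: emits rows one by one."""
    for row in rows:
        out.put(row)
    out.put(DONE)

def select(pred, inp, out):
    """Consumer operator: filters rows while the producer is still running."""
    while (row := inp.get()) is not DONE:
        if pred(row):
            out.put(row)
    out.put(DONE)

def run_pipeline(rows, pred):
    """Vertical parallelism: scan and select run concurrently, linked by queues."""
    q1, q2 = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=scan, args=(rows, q1)),
        threading.Thread(target=select, args=(pred, q1, q2)),
    ]
    for w in workers:
        w.start()
    result = []
    while (row := q2.get()) is not DONE:
        result.append(row)
    for w in workers:
        w.join()
    return result

def run_subtrees(left_rows, right_rows, pred_left, pred_right):
    """Horizontal parallelism: two independent subtrees run at the same time."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(run_pipeline, left_rows, pred_left)
        right = pool.submit(run_pipeline, right_rows, pred_right)
        return set(left.result()) & set(right.result())  # intersection operator
```

For example, run_subtrees(range(10), range(5, 15), lambda x: x % 2 == 0, lambda x: x < 9) evaluates both selections concurrently and intersects their results.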
Query processing in Federated Databases
A federated database is conceptually just a mapping of a set of (possibly
heterogeneous) databases. When the word federated is used, it indicates that the
federated database is a mapping of a set of databases not originally designed for a
mutual purpose. This gives rise to special problems.
A solution can be found in using a distributed query processor consisting of a query
mediator and a number of query agents, one for each local database. The query
mediator is responsible for decomposing global queries given by multi database
applications into multiple subqueries to be evaluated by the query agents. It also
assembles the subquery results returned by the query agents and further processes the
assembled results in order to compute the final query result. Query agents transform
subqueries into local queries that can be directly processed by the local database
systems. The local query results are properly formatted before they are forwarded to
the query mediator. By dividing the query processing tasks between query mediator
and query agents, concurrent processing of subqueries on local databases is possible,
reducing the query response time. This architectural design further enables the query
mediator to focus on global query processing and optimisation, while the query agents
handle the transformation of the subqueries decomposed by the query mediator into local
queries on heterogeneous local schemas. The heterogeneous query interfaces of the local
database systems are also hidden from the query mediator by the query agents.
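The division of work between query mediator and query agents can be sketched as follows. This is a minimal illustration, assuming two in-memory SQLite databases with different local schemas and a single global query ("customers in a given city"); the class, table and column names are hypothetical and do not describe any particular federated database product.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

class QueryAgent:
    """Wraps one local database and hides its schema from the mediator."""

    def __init__(self, conn, local_sql):
        self.conn = conn
        self.local_sql = local_sql  # local form of the subquery, one parameter

    def run(self, city):
        rows = self.conn.execute(self.local_sql, (city,)).fetchall()
        # Format local results into the common (name, city) form.
        return [(str(name), str(town)) for name, town in rows]

class QueryMediator:
    """Decomposes the global query, runs subqueries concurrently, merges results."""

    def __init__(self, agents):
        self.agents = agents

    def customers_in(self, city):
        with ThreadPoolExecutor(max_workers=len(self.agents)) as pool:
            partials = list(pool.map(lambda agent: agent.run(city), self.agents))
        merged = {row for part in partials for row in part}  # remove duplicates
        return sorted(merged)

def make_db(ddl, insert_sql, rows):
    # check_same_thread=False lets a worker thread use the connection.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn

billing = make_db("CREATE TABLE subscriber (full_name TEXT, town TEXT)",
                  "INSERT INTO subscriber VALUES (?, ?)",
                  [("Ann", "Oslo"), ("Bob", "Lisbon")])
crm = make_db("CREATE TABLE cust (nm TEXT, city TEXT)",
              "INSERT INTO cust VALUES (?, ?)",
              [("Cid", "Oslo"), ("Ann", "Oslo")])

mediator = QueryMediator([
    QueryAgent(billing, "SELECT full_name, town FROM subscriber WHERE town = ?"),
    QueryAgent(crm, "SELECT nm, city FROM cust WHERE city = ?"),
])
oslo_customers = mediator.customers_in("Oslo")
```

The mediator never sees the local table or column names; it only receives results in the common form, which is the separation of concerns described above.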
Commercial database products
Non-Stop Clusters software combines a standards-based version of SCO UnixWare
2.1.2 with Tandem’s single system image (SSI) clustering software technology. SSI
simplifies system administration tasks with a consistent, intuitive view of the cluster.
This helps migrating current UNIX system applications to a clustered environment. It
also allows transparent on-line maintenance such as hot plug-in of disks, facilitates
the addition of more nodes to the cluster, and provides automatic failover and
Oracle8 is a data server from Oracle. Oracle8 is based on an object-relational model. In
this model it is possible to create a table with a column whose datatype is another
Oracle products run on Microsoft Windows 3.x/95/NT, Novell NetWare, Solaris, HP-UX
and Digital UNIX platforms.
Many operational and management issues must be considered in designing a very
large database under Oracle8 or migrating from an Oracle7 (the major predecessor of
Oracle8) database. If the database is not designed properly, the customer will not be
able to take full advantage of Oracle8’s new features. In annex 2, issues are discussed
related to designing a VLDB under Oracle8 or migrating from an Oracle7 database.
Informix Dynamic Server
Informix Dynamic Server is a multithreaded relational database server that exploits
single-processor or symmetric multiprocessor (SMP) systems and a dynamic scaleable
architecture (DSA) to deliver database scaleability, manageability and performance.
Informix Dynamic Server runs on various hardware platforms, including
UNIX and Microsoft Windows NT based systems.
IBM delivered its first phase of object-relational capabilities with Version 2 of DB2
Common Server in July, 1995. In addition, IBM released several packaged Relational
Extenders for text, images, audio, and video. The DB2 Universal Database combines
Version 2 of DB2 Common Server, including object-relational features, with the
parallel processing capabilities and scaleability of DB2 Parallel Edition on SMP,
MPP, and cluster platforms. DB2 Universal Database, for example, will execute
queries and UDFs in parallel.
The DB2 product family spans AS/400 systems, RISC System/6000 hardware, IBM
mainframes, non-IBM machines from Hewlett-Packard and Sun Microsystems, and
operating systems such as OS/2, Windows (95 & NT), AIX, HP-UX, SINIX, SCO
OpenServer, and Sun Solaris.
The Sybase Computing Platform includes a broad array of products and features for
Internet/middleware architecture support, decision support, mass-deployment, and
legacy-leveraging IS needs, bundled into an integrated architecture. The Adaptive
Server DBMS product family, the core engine of the Sybase Computing Platform,
supports new application data needs like mass-deployment, enterprise-scale OLTP,
and terabyte-scale data warehousing.
The Sybase Computing Platform allows developers to create applications that run
without change on all major platforms and architectures, scaling up from the laptop to
the enterprise server or Web server. These applications can take advantage of the
scaleability of Adaptive Server and PowerDynamo, the flexibility and programmer
productivity of Powersoft's Java tools, and the legacy interoperability of Sybase's
Enterprise CONNECT middleware.
Microsoft SQL Server Enterprise Edition 6.5 is a high-performance database
management system designed specifically for the largest, highly available Microsoft
Windows NT operating system applications. It extends the capabilities of SQL Server
by providing higher levels of scaleability, performance, built-in high-availability, and
a comprehensive platform for deploying distributed, mission-critical database
applications. When this report was released, Microsoft introduced its new Microsoft
SQL Server 7.0. This DBMS has been redesigned from scratch, resulting in a
scaleable parallel architecture intended to conquer the very large database market. The use of
the so-called “zero administration concept” (that is, minimising the human intervention
needed to maintain the database) should increase the acceptance of the product.
NCR Teradata
Beginning with the first shipment of the Teradata RDBMS, NCR has over 16 years of
experience in building and supporting data warehouses worldwide. Today, NCR’s
Scaleable Data Warehousing (SDW) delivers solutions in the data warehouse
marketplace, from entry-level data marts to very large production warehouses with
hundreds of Terabytes. Data warehousing from NCR is a complete solution that
combines Teradata parallel database technology, scaleable hardware, experienced
data warehousing consultants, and industry tools and applications available on the
market today.
Different database architectures offer various options for retrieving and
manipulating data and for finding optimal solutions for database applications. In recent years many
architectural options have been discussed in the field of distributed and federated
databases and various algorithms have been implemented to optimise the handling of
data and the methodologies used to implement database applications. Nevertheless,
retrieval and manipulation in different architectures apply similar theoretical
principles for optimising the interaction between applications and database systems.
Efficient query and request execution is an important criterion when retrieving large
amounts of data.
There are a number of commercial database products competing in the VLDB
segment, most of which run on various hardware platforms. The DBMSs are generally
supported by a range of tools for e.g. data replication and data retrieval.
Backup and Recovery
More and more, databases have to be available 24 hours a day, seven days a week.
But as no system is free of failure, special actions have to be taken to guarantee this
availability. Regularly making backups of the data and being able to recover from
these backups are examples of these special actions. This section describes the issues
related to backup and recovery of very large databases. For more detailed information
on this subject, the reader is referred to [2].
Backup facilities are needed to be able to recover from lost data. Losing data is not
only the result of hard- and software failures but also of wrong user actions.
Unauthorised access to the system could also result in (wilful) destruction or
modification of data. For this reason, some security issues are treated in this section.
Data security involves two main aspects viz.:
Data protection: required to prevent unauthorised users from understanding the
physical content of data. This function is typically provided by data encryption.
Authorisation control: to guarantee that only authorised users perform operations
they are allowed to perform on the database.
The remainder of this section concentrates on data loss caused by wrong user actions or
hard- and software failures.
Backup and recovery strategies
In general, one can classify failures resulting in possible loss of data in the following
six categories:
User failures: caused by a user deleting or changing data improper or
Operation failures: caused by an illegal operation resulting in the DBMS
reacting with an error message.
Process failures: caused by an abnormal ending process.
Network failures: caused by an interruption of the network between the
database client and the database server.
Instance failures: caused by a failing database instance with accompanying
background processes.
Media failures: caused by read or write errors due to defective physical hardware.
As the first four types of failures can be handled by the DBMS, backup and recovery
strategies are mostly used to deal with the consequences of the last two types of
failures. One way to be resistant against these types of failures is by using
Uninterruptible Power Supplies (UPS) and redundant hardware (e.g. mirroring of
disks and power supplies). Another way is the so-called “hot stand-by” solution where
a copy of the complete system exists (preferably situated at another location) and all
operations on the system are also performed at the hot stand-by site. When one of the
systems crashes, the other system can take over. Despite these facilities, backups of
data are still important.
Backup and recovery can be done at two levels viz. operating system level and
database level. At operating system level, operating system (and third-party) tools are
used to back up files and raw disks to backup media (e.g. tape). These tools have little
knowledge of the data they are handling, but copying is rather fast. At the database level,
knowledge of the structure of the database and transactions performed on it are
available, enabling more flexibility in copying parts of the database and recovering
The following backup & recovery strategies can be used to recover from an erroneous
database state:
Dump and restart: where the entire database is regularly copied to a backup
device and completely restored from this device in the event of failure.
Undo-redo processing (also called roll-back and re-execute): where an audit
trail of all performed transactions is used to undo all (partially) performed
transactions to a known correct point in time. From that point on, the transactions
are re-executed to yield a correct database state. This strategy can be used when
partially completed processes are aborted.
Roll-forward processing (also called reload and re-execute): where all or part of
a previous correct state is reloaded after which the recently recorded transactions
from the transaction audit trail are re-executed to obtain a correct state. It is
typically used when (part of) the physical media has been damaged.
Restore and repeat: a variation on the previous strategy where the net result of
all transactions in the audit trail is given to the database.
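The strategies listed above can be sketched on a toy key-value store. This is a minimal illustration under simplifying assumptions: the audit trail keeps a before-image and an after-image per write, None marks a key that did not exist (so None cannot be stored as a value), and only the most recent transaction touching a key is undone. The class and method names are hypothetical and do not correspond to any DBMS interface.

```python
import copy

class TinyDB:
    def __init__(self):
        self.data = {}
        self.dump = {}          # last full copy on the backup device
        self.audit_trail = []   # (txn_id, key, before_image, after_image)

    def write(self, txn_id, key, value):
        self.audit_trail.append((txn_id, key, self.data.get(key), value))
        self.data[key] = value

    def take_dump(self):
        """Dump: copy the entire database and start a fresh audit trail."""
        self.dump = copy.deepcopy(self.data)
        self.audit_trail.clear()

    def restart_from_dump(self):
        """Dump and restart: restore the copy; later transactions are lost."""
        self.data = copy.deepcopy(self.dump)
        self.audit_trail.clear()

    def undo(self, txn_id):
        """Roll back one transaction via its before-images, newest first."""
        for txn, key, before, _ in reversed(self.audit_trail):
            if txn == txn_id:
                if before is None:
                    self.data.pop(key, None)
                else:
                    self.data[key] = before
        self.audit_trail = [e for e in self.audit_trail if e[0] != txn_id]

    def roll_forward(self):
        """Reload the dump, then re-execute every logged write in order."""
        self.data = copy.deepcopy(self.dump)
        for _, key, _, after in self.audit_trail:
            self.data[key] = after

    def restore_and_repeat(self):
        """Reload the dump, then apply only the net result of the audit trail."""
        net = {key: after for _, key, _, after in self.audit_trail}
        self.data = copy.deepcopy(self.dump)
        self.data.update(net)
```

Roll-forward and restore-and-repeat reach the same state; the latter applies each key once instead of replaying every logged write.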
When the database is off-line during the backup process one calls it a Cold Backup.
When the database is on-line (and users can use it during the backup process) one
calls it a Warm Backup.
The selection of backup and recovery strategies is driven by quality and business
guidelines. The qualities of backup and recovery strategies are Consistency,
Reliability, Extensibility/scaleability, Support of heterogeneous environments and
Usability. The business guidelines are Speed, Application load, Backup testing
resources, Restoration time and Type of system.
Commercial products
As mentioned above backup and recovery can be performed at operating system level
and at database level. In the following section some products for both levels are listed.
Some criteria for selecting a product are for example:
Capabilities for long-term data archives and Hierarchical Storage Management
(HSM) operations to automatically move infrequently used data to other (less
expensive) devices.
Support for a range of hardware platforms.
Extensive storage device support.
Capabilities for data compression to reduce network traffic and transmission time.
Multitasking capability.
On-line (Warm) and off-line (Cold) database backup and archive support.
Security capabilities.
Operating System level backup and recovery
The backup and recovery products at the operating system level depend on the type of
operating system and the architecture and software used to make the storage system.
The products are system dependent but DBMS independent. The user of the product
should provide the knowledge of which files are datafiles, index files, transaction logs
For the PC-oriented database servers, the following list represents the most commonly
used vendors/products:
Arcada Software/Storage Exec., Cheyenne Software/ArcServe, Conner Storage
Systems/Backup Exec, Emerald Systems/Xpress Librarian, Fortunet/NSure
NLM/AllNet, Hewlett Packard/ Omniback II, IBM/ADSM (Adstar Distributed
Storage Manager), Legato/ Networker, Mountain Network Solutions/FileSafe,
Palindrome/Backup Director, Performance Technology/PowerSave, Systems
Enhancement/Total Network Recall.
For the Unix-oriented database servers, the following list represents the most
commonly used vendors/products:
APUnix/FarTool, Cheyenne/ArcServe, Dallastone/D-Tools, Delta MycroSystems
(PDC)/BudTool, Epoch Systems/Enterprise Backup, IBM/ADSM, Hewlett
Packard/Omniback II, Legato/Networker, Network Imaging Systems, Open
Vision/AXXion Netbackup, Software Moguls/SM-arch, Spectra Logic/Alexandria.
Database level backup and recovery
All major DBMSs provide backup and recovery tools. These tools support the backup
and recovery strategies described in section 3.3.2. Both cold and warm backups are
supported. In [2], detailed descriptions of the capabilities of DB2, Oracle 7, Oracle 8,
Informix, Sybase and SQLServer are given. Furthermore, [2] contains an appendix
with figures of a terabyte database backup and recovery benchmark. Hot backups at a
rate of between 500 Gb and 1 Tb per hour are reached with a total system overhead of
only a few percent (thus leaving over 90% of the system resources for “normal”
database users).
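To put the quoted rates into perspective, the backup window can be estimated with simple arithmetic. The sketch below assumes that "Gb" and "Tb" in the benchmark denote gigabytes and terabytes, that 1 TB equals 1000 GB and that the rate is sustained for the whole backup; the function name is illustrative only.

```python
def backup_hours(db_size_gb, rate_gb_per_hour):
    """Elapsed time of a full backup at a sustained transfer rate."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return db_size_gb / rate_gb_per_hour

# A 1000 GB database at the lower and upper rates quoted above.
slow = backup_hours(1000, 500)    # 2.0 hours
fast = backup_hours(1000, 1000)   # 1.0 hour
```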
As no system (not even a fault-tolerant one) is free of failures, making backups of
data is essential for an organisation. Moreover, errors are not only caused by hard-
and software failures but also by (un)wilful wrong user actions. Some types of
failures can be corrected by the DBMS immediately (e.g. wrong user operations) but
others need a recovery action from a backup device (e.g. disk crashes). Depending on
issues like the type of system, the availability requirements, the size of the database
etc., one can choose from two levels of backup and recovery. The first is on the
operating system level and the second on the database level. Products of the former
are often operating system dependent and DBMS independent and products of the
latter the other way around. Which product to choose depends on the mentioned
A database benchmark is a method of doing a quantitative comparison of different
database management systems (DBMS) and hardware platforms in terms of
performance and price/performance metrics. These metrics are obtained by means of
the execution of a performance test on applications. Customers use benchmarks to
choose among vendors. Vendors often use benchmarks for marketing purposes. For
more detailed information, the reader is referred to [1].
Available benchmarks
Several benchmarks are available nowadays, but only a few are generally accepted and
used. Examples are the OO7 OODBMS Benchmark, from the University of
Wisconsin, for object oriented databases; the HyperModel Benchmark, a DBMS
performance test suite based upon a hypertext application model and the TPC
benchmarks, a family of benchmarks to model “real” business applications. As the
TPC benchmarks are the most commonly used, we will give a more detailed
description in the next sections.
TPC benchmarks
The Transaction Processing Performance Council (TPC) is a non-profit corporation founded to
define transaction processing and database benchmarks and to disseminate objective,
verifiable performance data to the industry. It was founded in 1988 by a consortium of
hardware and software vendors in response to a confusion caused by benchmark
problems. While the majority of TPC members are computer system vendors, the TPC
also has several DBMS vendors. In addition, the TPC membership also includes
market research firms, system integrators, and end-user organisations. There are over
40 members world-wide. Some of those members are:
Database system vendors like Oracle, Informix, Sybase, etc.
Hardware platforms manufacturers like HP, Sun, IBM, Bull, Intel, Silicon
Graphics, Acer, etc.
Software vendors like BEA Systems, Computer Associates, etc.
The TPC has developed a series of benchmarks. Currently, the so called TPC-C
benchmark is used for OLTP systems and the TPC-D benchmark is used for DSS
TPC-C simulates a complete computing environment where a population of terminal
operators executes transactions against a database. The benchmark is centred around
the principal activities (transactions) of an order-entry environment. The metrics
obtained from the TPC-C benchmark are the Transactions-per-minute (tpmC) and the
Costs-per-tpmC ($/tpmC). As an example, the following table shows some results of
TPC-C metrics (representing the situation at the beginning of 1998).
System                                              DBMS
IBM RS/6000 SP Model 309 (12 node x 8 way)          Enterprise Edit’n 8.0.4
HP 9000 V2250 (16-way)                              Sybase ASE 11.5 EBF 7532
Sun Ultra Enterprise 6000 c/s (2 node x 22 way)     Oracle8 Enterprise Edit’n
HP 9000 V2200 (16 way)                              Sybase ASE 11.5 EBF 7532
We can see here that the IBM/Oracle combination has the highest overall performance
but the HP V2250/Sybase combination has the lowest costs per transaction.
TPC-D is the Transaction Processing Performance Council’s benchmark for Decision Support
Systems. It consists of a suite of business oriented ad-hoc queries and concurrent
updates. TPC-D has more metrics than TPC-C and the size of the database is an extra
parameter. The metrics of TPC-D are:
the Power metric (QppD@Size): indicating the query processing power at the
selected database size.
the Throughput metric (QthD@Size): indicating the throughput at the selected
database size.
the composite Query-per-hour metric (QphD@Size): combining the two previous
the Costs/Performance metric ($/QphD@Size): giving the costs per QphD for the
selected database size.
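The relation between the four metrics can be made concrete with a short calculation. The sketch assumes, following the TPC-D specification as we understand it, that the composite metric is the geometric mean of the power and throughput metrics and that the costs/performance metric divides the total system price by the composite. The figures used are made up for illustration and are not benchmark results.

```python
import math

def qphd(qppd, qthd):
    """Composite query-per-hour metric: geometric mean of power and throughput."""
    return math.sqrt(qppd * qthd)

def price_performance(total_price, qppd, qthd):
    """Costs per QphD for the selected database size."""
    return total_price / qphd(qppd, qthd)

# Hypothetical system: QppD = 400, QthD = 100, total price = 1,000,000.
composite = qphd(400, 100)
cost_per_qphd = price_performance(1_000_000, 400, 100)
```

Because the composite is a geometric mean, a system cannot obtain a high QphD from a strong power result alone; a weak throughput result pulls the composite down.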
As an example, the following table shows some results of TPC-D metrics for a one
terabyte database (representing the situation at the beginning of 1998).
System                                              DBMS
Sun Ultra Enterprise 6000 (4 x 24-way               Svr AD/XP 8.21
NCR WorldMark 5150 (32 x 4-way nodes)               Teradata V2R2.1
Sun Ultra Enterprise 10000 (64 way)                 Oracle8 v8.0.4.2
IBM RS/6000 SP Model 309 (32 x 8-way nodes)         DB2 UDB for AIX, V5
We can see here that the Sun Ultra 6000/Informix combination is superior in all
At the moment, the TPC benchmarks are the most generally accepted database
benchmarks. TPC-C is used for OLTP applications whereas TPC-D is used for
Decision Support/data warehouse applications. But as benchmark systems are
optimised for performing the (predefined) benchmark queries, the resulting figures are
not representative of real-life applications. For this reason, the benchmark figures are
indicative and should only be used to get an idea of the key players in the database
area, their performance and price potential. Finally, by comparing the TPC results
over time, one can analyse trends such as the relation between Windows NT
platforms and Unix platforms and the increasing size of the databases for TPC-D
During the final phase of preparing this Deliverable, the TPC-W benchmark was
announced. This benchmark represents web environments with transaction
components, Web page consistency and dynamic Web page generation. The
benchmark will be based on a browsing/shopping/buying application. The primary
metric will be user Web interactions per second (WIPS) and a price per WIPS.
Performability modelling
Although performability modelling is not directly related to state-of-the-art of
telecommunication applications and database technologies, we think it is a good idea
to put this section in the document. As mentioned earlier, scaleability and
performance are important characteristics of a very large database. To examine these
characteristics, experiments will be carried out. But most of the systems used for these
experiments will not be at the top-end of the very-large-database-scale. To be able to
say something sensible about systems that are at the top-end of that scale,
performability models will be built to extrapolate information obtained from the
experiments. This section provides some information on the concept of performability
For more information on this subject, the reader is referred to [1].
When looking at the performance of (database) systems, possible failures of that
system are often not taken into account. This may be easier to model, but it is not
correct. From a user's point of view, performance of a system is the performance of a
system subject to failure and distortion. Modelling the performance of a system,
taking into account possible failures of that system, is called “performability
modelling”.
Performability modelling can be done in two ways:
an integrated approach: where one overall model of the system, containing all
possible events (e.g. arriving of jobs, server breakdown etc.), is created and
an approach based on behavioural decomposition: where only likely states and
associated probabilities of the system are modelled resulting in a smaller state
space size.
As the second approach gives satisfying results within a smaller amount of time, this
approach will be used in the P817 Project.
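The idea behind behavioural decomposition can be illustrated with a small calculation. The sketch assumes a hypothetical two-node database in which each node fails independently, takes only three likely states into account and weights the throughput in each state by the probability of that state. All figures are made up for illustration and are not results from the P817 experiments.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state probability that a single node is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

def expected_throughput(states):
    """states: (probability, throughput) pairs, one per modelled system state."""
    total = sum(p for p, _ in states)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(p * t for p, t in states)

a = availability(mttf_hours=99, mttr_hours=1)   # 0.99 per node

states = [
    (a * a, 1000.0),             # both nodes up: full throughput
    (2 * a * (1 - a), 500.0),    # one node down: degraded throughput
    ((1 - a) ** 2, 0.0),         # both nodes down: no service
]
performability = expected_throughput(states)
```

A pure performance model would report 1000 transactions per time unit for this system; taking failures into account lowers the expected figure to roughly 990.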
Tools for performability modelling
Doing performability modelling is not just a manual job. Several tools are available to
support the modeller in making models and performing analysis and simulations. We
have distinguished the following five categories:
Model specification and construction tools: for (graphically) creating models of a
Analytical model solving tools: for solving a created model in an analytical way.
Simulation tools: for solving a created model by means of simulation (often used
when the model is too complex to solve in an analytical way).
Performance monitoring tools: for obtaining performance measurements from the
system that is modelled. These measures are used to tune the model.
Load emulation tools: for generating workload for the system that is modelled.
A selection of actual software products to be used for the experiments has not been
made yet.
Guidelines for measures for the experiments
When performing experiments, it is important to realise that there are two types of
experiments; the experiments with the database itself (the database experiments) and
experiments with the model of that database (the modelling experiments). Results
from both types of experiments should be comparable, although they are obtained
differently. With the database experiments the results are obtained by measuring,
using performance monitoring tools, at the real system. For the modelling
experiments, results are obtained by evaluating the model using analytical or
simulation techniques. In [1], a framework is provided that enables comparison of
results from both types of experiments. This framework provides both a layered
approach for measuring relevant parameters at different layers (Application, DBMS,
Operating System, Hardware) and a definition of the types and units of parameters to
Performability modelling is not a goal in itself. It is used to give a better
understanding of the performance of a, not always failure free, very large database,
before it is actually built. After a first version of a performability model has been
created, experiments with a small size database are necessary to improve the model in
an iterative way. It is important to define the experiments in such a way that the
results can serve as input for the performability model of the database. Tools both for
monitoring the database and for making and analysing the model are necessary.
Data warehousing
This section describes data warehousing concepts, products and market developments.
For more detailed information, the reader is referred to [3].
Data warehouse architectures
The term data warehouse is defined as "a subject oriented, integrated, time varying,
non-volatile collection of data that is primarily used in organisational decision
making". As data is collected from several sources and added to the data already
held in the data warehouse (i.e. historical data is retained), the potential size of
a data warehouse is enormous.
The architecture chosen for a data warehouse has a great impact on how the data
warehouse is built. The main components of a data warehousing architecture are:
source databases; data extraction, transformation and loading tools; data modelling
tools (including import and export facilities); target databases; and end-user data
access and analysis tools. The main architectures for building a data warehouse are:
Virtual Data Warehouse; the end-user tools operate directly on the source
databases, that is, there is no target database holding the physical warehouse
data. The main advantages of this architecture are that it is easy to build and
does not require large investments. The main drawbacks are that no history can be
stored, the queries interfere with the operational processes and the source
systems must have on-line access (e.g. RDBMS);
[Figure: virtual data warehouse architecture; labels: Retail Data, Financial Data, Legacy Systems, External source DB, No Target DB, End User Tools]
Data Mart in a Box; all end-user tools operate on a single physical warehouse
target database. The data from the source databases are extracted, cleaned,
scrubbed and integrated into a physical warehouse target database. Local meta
data are stored in the target database. The main advantage of this architecture is
that it is directly supported by packaged products and hence provides an easy
entry to data warehouse technology. The approach can be dangerous because it
does not include meta-data integration, which may result in a stove-pipe data
mart, i.e. a standalone data mart that cannot be integrated with an enterprise
data warehouse (see the Multiple, Non-Integrated Data Marts architecture below);
[Figure: data mart in a box architecture; labels: Retail Data, Financial Data, Legacy Systems, External source DB, Target DataBase, End User Tools]
Multiple, Non-Integrated Data Marts; the data from the source databases are
extracted, cleaned, scrubbed and integrated into physical data mart target
databases (one for each department) by multiple extraction processes. The
end-user tools operate on the separate physical data marts, but the data
marts are not integrated with each other. The advantage of this approach is its
rapid deployment, but a severe drawback is the increasing complexity of the
architecture as the data warehouse evolves, which results in huge maintenance
cost. Another disadvantage is that data marts may not be consistent with each
other (i.e. multiple, incompatible views of the truth);
[Figure: multiple, non-integrated data marts; source databases (Retail Data, Financial Data, Legacy Systems, External source DB, Human Resources DB), data extraction, data marts (Sales Data Mart, Financial Data Mart, Human Resources Data Mart), data access and analysis]
Multiple Architected Data Marts; the integration problem is solved because the
multiple architected data marts share common meta-data, although they have no
central data warehouse in common. Sharing the same meta-data implies that the data
marts are built in the same way but serve different business areas. The main
advantage of this approach is that the central meta-data repository ensures a
consistent view on the data for all data marts. The main disadvantage is that
different products must be able to integrate with the central meta-data repository.
[Figure: multiple architected data marts; source databases (Retail Data, Financial Data, Legacy Systems, External source DB), data extraction, transformation and load (Data Cleansing, Data Modelling, Meta Data), architected data marts with Local Meta Data, data access and analysis]
Enterprise DWH architecture; includes a large data warehouse driving
multiple data marts. There are multiple source databases. The central data
warehouse stores detail data and supports organisation-wide, consolidated
analysis/reporting. The architecture also includes multiple architected data marts
based on RDBMS and/or MDB. There is central co-ordination and management
based on access to a central meta data repository. This kind of architecture is
required when end-users need access to detailed data. The main advantages of
this approach are the availability of detail data and support for organisation-wide
consolidated analysis/reporting. The main disadvantage is the complexity of the
environment, which means high development cost and risk.
[Figure: enterprise data warehouse architecture; source databases, data extraction and load (Data Cleansing, Data Modelling, Admin. Tools, Meta Data), data warehouse, data subsetting and distribution, architected data marts with Local Meta Data, data access and analysis]
Operational Data Store Feeding Data Warehouse; an Operational Data Store
(ODS) consolidates data from multiple transaction systems and provides a near
real-time, integrated view of volatile (i.e. changeable, not permanent), current
data. The ODS is used for operational decisions and the data warehouse is used
by business analysts for tactical/strategic decisions. The ODS can be
implemented with package software. The ODS may also be used as a staging area
to drive one or more data warehouses (i.e. the ODS is placed between the source DB
and the DW as a staging area, with a pull or push process that fetches the data
from the source DB; this staging area can then feed more than one DW). This is the
most complex data warehouse environment, which results in very high cost and
risk. The other parts of the architecture are described in the previous sections.
The main advantage of this approach is that the ODS presents a current, near
real-time and integrated view of volatile enterprise data to the end-user, which
the data warehouse cannot provide. Hence it can be used for complex operational
support. The main disadvantage is its high cost and high development risk.
[Figure: operational data store feeding data warehouse; source databases, data extraction and transformation (Data Cleansing, Data Modelling, Admin. Tools, Meta Data), ODS and central data warehouse, data subsetting and distribution, data marts, data access]
Design Strategies
There are two main philosophies of how to build a data warehouse: top-down and
bottom-up. The top-down approach means that all source data are included from the
start of the data warehouse project and that a full-size data warehouse is designed
which is capable of handling all source data and end-users from the start of
operation. The bottom-up approach is to design a data warehouse for a very small,
isolated part of the business, i.e. a data mart, and, once that is working, to
extend it step by step until it can handle all source data and end-users. The way
to build a data warehouse successfully lies somewhere in between these two
strategies. Start with a small, isolated part of the business, but define from the
start all data models, data semantics and definitions across business areas, as
well as meta data handling, for the whole enterprise data warehouse.
Data Models
The data model provided to the users must meet their specific needs. Data marts
typically use multi-dimensional modelling, where the dimensions reflect the
interest of the user (e.g. products, sales, period). Star schemas and snowflake
schemas are widely used for this because they approximate the user's way of
thinking and because of performance reasons. In the central data warehouse
consistency is more important and hence a normalised data model is often used there.
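As an illustration of multi-dimensional modelling, the following minimal sketch builds a small star schema in an in-memory SQLite database with Python's standard sqlite3 module. The table names (dim_product, dim_period, fact_sales), the column names and the sample figures are hypothetical and chosen only for this example; a production data mart would typically have more dimensions and far larger fact tables.

```python
import sqlite3

def build_star_schema(conn):
    """Create one fact table surrounded by two dimension tables (a star)."""
    conn.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dim_period  (period_id  INTEGER PRIMARY KEY, quarter TEXT);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            period_id  INTEGER REFERENCES dim_period(period_id),
            amount     REAL
        );
    """)
    # Hypothetical sample data.
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                     [(1, "ISDN"), (2, "Mobile")])
    conn.executemany("INSERT INTO dim_period VALUES (?, ?)",
                     [(1, "Q1"), (2, "Q2")])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     [(1, 1, 100.0), (1, 2, 150.0), (2, 1, 200.0), (2, 2, 50.0)])

def sales_by_product(conn):
    """Aggregate the fact table along the product dimension."""
    rows = conn.execute("""
        SELECT p.name, SUM(f.amount)
        FROM fact_sales f JOIN dim_product p ON p.product_id = f.product_id
        GROUP BY p.name ORDER BY p.name
    """).fetchall()
    return dict(rows)

conn = sqlite3.connect(":memory:")
build_star_schema(conn)
print(sales_by_product(conn))
```

Each analysis joins the central fact table with only the dimension tables it needs, which is one reason why this layout is close to the user's way of thinking.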
Meta Data
Meta data includes schema information of the source data, central data warehouse and
target data, calculation functions for derived data, transformation and conversion rules
and batch processing information. Meta data is stored in a central meta data repository
and may be distributed to local meta data repositories. Centralisation of meta data
maintenance ensures consistency.
The meta data area is characterised by incompatible standards from different
organisations. The Meta Data Coalition, with the members IBM, Informix, Sybase,
ETI, Business Objects, Arbor Software and Cognos, has produced the Meta Data
Interchange Specification (MDIS), but this standard only specifies a flat file
format for exchanging information about data, and the effort has not really been
successful. A few other standards exist, but the most promising one is the Microsoft
Open Information Model (OIM). Over 65 vendors have agreed to back OIM
as a meta data standard. The Meta Data Coalition has pledged to build translators
between the Meta Data Interchange Specification (MDIS) and OIM.
Data Cleansing, Extraction, Transformation, Load Tools
Moving data from the operational databases to the data warehouse needs to be done
via extraction tools. The following functions are essential:
Extraction and Transport; The data extraction component provides the ability
to transparently access all types of (legacy) databases regardless of their location.
Data extractors also provide a means of transport from the source data platform
to the target.
Data modelling and transformations; Data models, or metadata, define data
structures and may incorporate rules for business processes. Transformations of
data are needed because a data warehouse is organised differently from a
traditional transaction system. Data cleaning and conversion is also needed
because much of the source data comes from legacy systems which may contain
missing or overlapping data.
Loading; most relational database vendors offer bulk load utilities that provide a
high-speed way of loading large volumes of data by disabling index creation,
logging, and other operations that slow down the process.
Data Cleansing; denotes the process of filtering, merging, decoding, and
translating source data to create validated data for the data warehouse. In the
search for "best practices" many organisations try for a quick fix only to discover
that the data cleansing issue is enormously complex and requires industrial
strength solutions.
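The functions listed above can be illustrated with a minimal sketch in Python. The source records, the decoding table COUNTRY_NAMES and the target table customer_revenue are hypothetical; the single-transaction insert is only a simple stand-in for a vendor's bulk load utility, not a description of any particular product.

```python
import sqlite3

# Hypothetical source records, including a duplicate and a row with missing data.
SOURCE_ROWS = [
    {"customer": " Alice ", "country": "DK", "revenue": "120.5"},
    {"customer": "Bob", "country": "SE", "revenue": None},       # missing value
    {"customer": "alice", "country": "DK", "revenue": "120.5"},  # overlapping record
    {"customer": "Carol", "country": "PT", "revenue": "80"},
]
COUNTRY_NAMES = {"DK": "Denmark", "SE": "Sweden", "PT": "Portugal"}  # decoding table

def cleanse(rows):
    """Filter incomplete rows, decode country codes and merge duplicates."""
    seen, clean = set(), []
    for row in rows:
        if row["revenue"] is None:           # filtering
            continue
        name = row["customer"].strip().title()
        key = (name, row["country"])
        if key in seen:                      # merging of overlapping data
            continue
        seen.add(key)
        # decoding and translating into the target representation
        clean.append((name, COUNTRY_NAMES[row["country"]], float(row["revenue"])))
    return clean

def load(conn, rows):
    """Load all rows in one transaction (stand-in for a bulk load utility)."""
    conn.execute("CREATE TABLE IF NOT EXISTS customer_revenue "
                 "(customer TEXT, country TEXT, revenue REAL)")
    with conn:
        conn.executemany("INSERT INTO customer_revenue VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, cleanse(SOURCE_ROWS))
print(conn.execute("SELECT * FROM customer_revenue").fetchall())
```

Even this toy example needs explicit rules for missing values, duplicates and code translation, which hints at why cleansing real legacy data is so labour-intensive.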
No single tool package addresses the extremely large number of tasks
pertaining to data extraction, cleaning, and transport. Different tools specialise in
addressing different issues. W.H. Inmon, in his books on data warehousing, estimates
that, on average, 80% of the effort of building a data warehouse goes into these tasks.
The following tools facilitate the construction of data warehouses:
Enterprise/Integrator (Carleton), Data Propagator (IBM), PowerMart (Informatica),
DECISIVE (InfoSAGE), IRE Marketing (Mercantile Software Systems), Info
Transport (WarehousePlatinum Technology), OmniLoader (Fast Load Praxis
International), SAS/WarehouseAdministrator Software (SAS Institute), STRATEGY
Distributor (ShowCase), Smart DB Workbench (Smart Corporation), DataImport
(Spalding Software), ORCHESTRATE Development Environment (Torrent Systems),
Warehouse Directory (Prism Solutions).
Only a few vendors offer a total end-to-end data warehousing solution based on
their own products. Most vendors depend on one or more third-party components for
an end-to-end data warehouse solution. Two examples of companies that
provide an end-to-end data warehouse solution are SAS Institute Inc. (SAS Data
warehouse - Orlando II) and Information Builders.
Target Databases
Target databases are the databases to which data from the source databases is
transferred and in which it is stored. Typically they hold historic, non-volatile
(read-only) data.
The data in the target database are also subject oriented and integrated, and
include both detail and summary data. Three types of target databases are used:
Relational databases (RDB); conventional RDBs in combination with relational
OLAP (ROLAP) tools support most data warehousing requirements and can
handle very large target databases. The central data warehouse is almost always a
conventional RDB. The main advantages include excellent scaleability
(Terabytes), mature technology, openness and extensive support by the industry.
Multi-dimensional databases (MDB); contain pre-calculated results from the
base data in order to provide fast response times to the user. The main advantages
of MDBs are high performance together with sophisticated multi-dimensional
calculations. The most important limitations of an MDB are its scaleability in
terms of size (max. 30 GB) and number of dimensions, and its inflexibility.
Hybrid databases use the relational component to support large databases for
storage of detailed data and ad-hoc data access, and the multidimensional
component for fast, read/write OLAP analysis and calculations. The combination
of RDBMS and MDB is controlled by an OLAP server. The result set from the
RDBMS may be processed on-the-fly in the Server.
On-Line Analytical Processing (OLAP) Technology and Tools
On-Line Analytical Processing (OLAP) is defined as the process of slicing the data
using multi-/cross-dimensional calculation. OLAP provides capabilities for
consolidating and summarising data along different complex hierarchies on
dimensions, which involves grouping/classification and summarisation/aggregation
for both business and statistical analyses. The following types of OLAP exist:
Relational OLAP (ROLAP) is based on relational technology and uses RDBMS
tables as data source for the analyses. The main advantage of ROLAP is its
RDBMS base with its scaleability characteristic. However, ROLAP requires
on-line calculation, which may have a severe impact on response times. Products
include: DSS Agent/Server (MicroStrategy), DecisionSuite (Information
Advantage), InfoBeacon (Platinum technology), MetaCube (Informix Software).
Multi-dimensional OLAP (MOLAP) is based on the principle that all answers
(aggregates) to the questions/queries made to the system are pre-calculated and
stored in a multidimensional database or data cube before the end-user starts to
interact with the system. The MOLAP approach requires a dedicated data structure
and the pre-calculation of all possible aggregates of the dimensions. The advantage
of MOLAP is its instant response, but MOLAP requires extensive pre-calculation.
Products include: Oracle Express for OLAP (Oracle Inc.), EssBase and IBM
DB2 OLAP Server (Arbor Software Corp.), TM1 (APPLIX Inc.), Holos (Seagate
Software), SAS Multi-dimensional Database Server (MDDB), GentiaDB (Gentia
Hybrid OLAP (HOLAP) is a combination of relational OLAP and pre-calculated
aggregates stored in multidimensional structures (MOLAP). This hybrid approach
is perhaps the most complete solution for
providing decision support to end-users with different levels of requirements on
the information. Products include: Media MR (Speedware Corp.), Plato OLAP
Server (Microsoft).
Desktop OLAP (DOLAP) is ROLAP, MOLAP and/or HOLAP technology
implemented and operated in a desktop environment. Products include: Brio
Enterprise (Brio Technology Inc.), Business Objects (Business Objects),
Impromptu/PowerPlay (Cognos), IQ/Vision (IQ Software Corp.).
The Microsoft OLAP Server (Plato), which is integrated with MS SQL Server 7.0, will
probably have a great impact on the market for OLAP tools when it is released
during the second half of 1998. The MS OLAP server is based on HOLAP, i.e. a hybrid
solution with both an MDB for MOLAP-based analyses and a relational database for
ROLAP-based analyses.
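The trade-off between ROLAP and MOLAP described in this section can be sketched in a few lines of Python. The detail records and the two dimensions (product, region) are hypothetical, and the functions rolap_query and build_cube are illustrative names rather than part of any product. The sketch computes an aggregate on-line from the detail rows (ROLAP style) and, alternatively, pre-calculates the aggregates for every combination of dimensions (MOLAP style).

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical detail records: (product, region, amount)
FACTS = [
    ("ISDN", "North", 100), ("ISDN", "South", 150),
    ("Mobile", "North", 200), ("Mobile", "South", 50),
]
DIMENSIONS = ("product", "region")

def rolap_query(facts, group_by):
    """ROLAP style: aggregate the detail rows at query time."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

def build_cube(facts):
    """MOLAP style: pre-calculate the aggregates for every subset of dimensions."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for group_by in combinations(DIMENSIONS, size):
            cube[group_by] = rolap_query(facts, group_by)
    return cube

cube = build_cube(FACTS)                 # expensive step, done before users query
print(cube[("product",)])                # answered by a lookup, no scan of FACTS
print(rolap_query(FACTS, ("region",)))   # same kind of answer, computed on-line
```

With n dimensions the cube holds 2^n groupings, which illustrates why MOLAP pre-calculation becomes expensive as the number of dimensions grows, while ROLAP pays the aggregation cost at every query instead.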
Data Mining
Data mining technology allows organisations to leverage their existing investments in
data storage and data acquisition. Through the effective use of data mining
technologies, organisations discover actionable information from raw data.
Production data mining systems automatically identify and act upon this information.
Production data mining has three basic requirements. First, it needs to operate on
large volumes of data - often hundreds of gigabytes. Second, the data mining system
needs to be able to handle very high throughput within fixed time constraints.
Tomorrow’s inventory forecast is useless if it is not generated in time to take action.
Finally, production data mining systems must be able to efficiently process thousands
of models, each using thousands of variables.
Data mining tools examine the historical detail transactions to identify trends, to
establish and reveal hidden relationships, and to predict future behaviour. The tools
are available in the following categories: case-based reasoning, data visualisation,
fuzzy query and analysis, knowledge discovery and neural networks.
Data Warehousing on the Web
The explosive growth of the World Wide Web and its powerful new computing
paradigm offer a compelling client/server platform for corporate developers and DSS
architects. Two key advantages motivate the integration of World Wide Web and data
warehouse technology: (1) it expands the role of data warehouses in the
organisation; (2) it extends the use of data warehouses cost-effectively.
To disclose data warehouses via the Web, a multi-tiered architecture is essential as it
greatly improves response times of accessing data. Java computing brings interactivity
to the web client and reduces cost because no fat clients are necessary. Server
solutions must meet stringent performance, scaleability, and availability requirements
while providing comprehensive services for security and simplified management.
Many companies are providing effective, so called Webhousing solutions for the
Internet, intranets and the WWW: Active OLAP Suite (ACG), GQL (Andyne
Computing, Ltd.), Essbase (Arbor Software), brio.web.warehouse (Brio Technology),
WebIntelligence (Business Objects), DecisionWeb (Comshare Commander), CorVu
Web Server (CorVu), NetMirror/400 (DataMirror), WebPlan (Enterprise Planning
Systems), Net.Data (IBM Corporation), Aperio OLAP, Aperio Report Gallery
(Influence Software), WebOLAP (Information Advantage), WebFocus (Information
Builders), Webcharts, WebSeQueL, Web Warehouse (InfoSpace), ALICE d'ISoft
(Isoft), IQ/LiveWeb (IQ Software), Business WEB (Management Science
Associates), DSS Web (Microstrategy), Shadow Direct, Shadow Web Server (Neon
Systems), DataDriller (OpenAir Software), Designer/2000, Discoverer v3.0, Express
Web Agent (Oracle Corporation), Internet Publisher (Pilot Software), Beacon, Forest
and Trees, InfoReports, DataShopper (Platinum Technology), Warehouse Directory
(Prism Solutions), VISTA PLUS (Quest Software), Data Mart Solutions (Sagent
Technology), Web Enabled Tools (SAS Institute), DBConnect for the Web (Silvon),
Media/Web (Speedware), Environment Manager Toolset (White Light Technology,
Ltd.), Zanza Web Reports (Zanza).
The market for data warehousing is growing: DW software, hardware and services
are expected to continue growing at a compound rate of 40 % through 1998, from
2 billion US dollars in 1996 to 8 billion US dollars in 1998. The share of
organisations building data warehouses of 1 Terabyte or larger will increase from
7 % to 17 % during 1998/1999. There is a trend towards using data marts as a
starting point for data warehousing, which also makes it possible to build the data
warehouse incrementally. Tools for all parts of the data warehouse architecture,
as well as for data warehouse administration, are being developed. The inclusion of
Operational Data Stores in the data warehouse architecture adds the value of
presenting a current view of data for operational decisions, which is not provided
by the data warehouse. The lack of a meta data standard may be solved by the
de-facto standard introduced by Microsoft (Open Information Model), which may help
to automate meta data handling, which at present is performed manually in 90 %
of all cases. OLAP servers that provide multi-cube facilities, i.e. the hybrid
approach that supports both relational and multi-dimensional OLAP, tend to be the
most all-round solution for meeting different levels of end-user requirements. Few
commercial products support the important integration of local meta data with a
central meta data repository for the whole data warehouse. Furthermore, only a few
companies provide an end-to-end data warehouse solution; most vendors depend on a
third-party vendor for providing a total data warehouse solution. A Web interface
on the end-user access products is almost the rule. Dominance of Microsoft SQL
Server 7.0 for building data marts is foreseen.
The following recommendations can be given:
Build the data warehouse incrementally, one business area at the time, but define
the structure for the whole data warehouse and architected data marts from the
Buy only components that integrate with central meta data repository and ensure
that the data warehouse is not populated with dirty data.
Support a mix of RDBMS, MDB, and hybrid target database and ensure that the
tools can provide the same functions on a LAN as on the World Wide Web.
Support mobile users with off-line query, reporting and OLAP functions.
Ensure that the system scales with increases in users and database size and
that it provides powerful security and warehouse administration functions.
Transaction processing
Transactions are fundamental in all software applications, especially in distributed
database applications. They provide a basic model of success or failure by ensuring
that a unit of work must be completed in its entirety. For more detailed information on
this subject, the reader is referred to [2].
From a technical point of view we define a transaction as "a collection of actions that
is governed by the ACID-properties". The ACID properties describe the key features
of transactions:
Atomicity. Either all changes to the state happen or none does. This includes
changes to databases, message queues and all other actions under transaction
control.
Consistency. The transaction as a whole is a correct transformation of the state.
The actions undertaken do not violate any of the integrity constraints associated
with the state.
Isolation. Each transaction runs as though there are no concurrent transactions.
Durability. The effects of a committed transaction survive failures.
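The atomicity and consistency properties can be demonstrated with a minimal sketch using Python's standard sqlite3 module. The account table, its CHECK constraint and the transfer function are hypothetical examples. The credit is deliberately executed before the debit, so that a failing debit shows the earlier change being undone.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint was violated; the credit above was rolled back too.
        return False

def balances(conn):
    """Return the current balance of every account."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))

conn = sqlite3.connect(":memory:")
# Integrity constraint (consistency): a balance may never become negative.
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()

failed = transfer(conn, "A", "B", 500)   # the debit would overdraw account A
after_failure = balances(conn)
ok = transfer(conn, "A", "B", 30)
after_success = balances(conn)
print(failed, after_failure)
print(ok, after_success)
```

After the failed transfer both balances are unchanged, although the credit to account B had already been executed inside the transaction.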
Database and TP systems both provide these ACID properties. They use locks, logs,
multiversions, two-phase-commit, on-line dumps, and other techniques to provide this
simple failure model. The two-phase commit protocol is currently the accepted
standard protocol to achieve the ACID properties in a distributed transaction
Transaction processing in a distributed environment is supported by TP Monitors. The
job of the TP Monitor is to ensure the ACID properties even in a distributed resource
environment while maintaining a high transaction throughput. A TP Monitor is good
at efficient transaction and process management.
[Figure: resource usage with a TP Monitor; labels: 50 shared, 50 Processes, 25 MB of RAM, 500 open files]
Efficient transaction and process management is achieved by sharing server resources.
Keeping connections to the shared application (e.g. database) for each client is very
expensive. Instead, the TP Monitor maintains pools of pre-started application
processes or threads (called server classes) which are shared by multiple clients. In
addition the TP Monitor provides dynamic load balancing. These functionalities
provide better scaleability than the traditional two-tier architectures.
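The idea of a server class shared by many clients can be sketched with Python's standard concurrent.futures module. The sizes POOL_SIZE and NUM_CLIENTS and the function handle_request are hypothetical. The sketch only shows resource sharing through a bounded pool; note that Python's executor creates its threads lazily rather than pre-starting them, and that it does not implement the dynamic load balancing of a real TP Monitor.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 5      # upper bound on server threads (the "server class")
NUM_CLIENTS = 50   # clients sharing the pool

def handle_request(client_id):
    """Stand-in for a service routine that would use a shared database connection."""
    return client_id, threading.current_thread().name

def serve(num_clients, pool_size):
    """Queue one request per client and let the bounded pool work through them."""
    with ThreadPoolExecutor(max_workers=pool_size,
                            thread_name_prefix="server") as pool:
        return list(pool.map(handle_request, range(num_clients)))

results = serve(NUM_CLIENTS, POOL_SIZE)
workers_used = {worker for _, worker in results}
print(len(results), "requests handled by", len(workers_used), "worker threads")
```

However many clients submit requests, the number of server threads (and hence of open connections in a real system) never exceeds the pool size.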
[Figure: comparison without and with a TP Monitor, plotted against the number of clients]
In addition to efficiency, a TP Monitor also provides robustness. In case an
application crashes, the TP Monitor can re-establish connections or even restart the
failed application or process. Also, requests can be redirected to other servers. Hence
fail-over functionality is provided. TP Monitors also provide for deadlock detection
and resolution. TP Monitors provide facilities for security and centralised
management of the distributed applications.
A TP Monitor should not be used when only a few users have access to a (potentially
very large) database, because it introduces architectural complexity. Once a TP
Monitor is implemented, it is not easy to switch to another TP Monitor, which
creates a strong dependence on a specific vendor. The Standish Group recommends
the use of TP Monitors for any client/server application that has more than 100
clients, processes more than five TPC-C type transactions per minute, uses three
or more physical servers and/or uses two or more databases.
Before we discuss the commercial TP Monitor products, we identify the components
of a TP Monitor. The Open Group's Distributed Transaction Processing Model
(1994), which has achieved wide acceptance in the industry, defines the following
components:
The application program contains the business logic. It defines the transaction
boundaries through calls it makes to the transaction manager. It controls the
operations performed against the data through calls to the resource managers.
[Figure: Distributed Transaction Processing model; Application Program (AP), Transaction Manager (TM), Resource Manager (RM), Communication Resource Managers (CRM)]
Resource managers are components that provide ACID access to shared
resources like databases, file systems, message queuing systems, application
components and remote TP Monitors.
The transaction manager creates transactions, assigns transaction identifiers to
them, monitors their progress and coordinates their outcome.
The Communication Resource Manager controls the communications between
distributed applications.
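The interplay between the transaction manager and the resource managers during two-phase commit can be sketched as follows. The classes ResourceManager and TransactionManager and their methods are hypothetical names for this illustration; they are not the XA interface of the Open Group model. The sketch also omits logging, time-outs and recovery, which a real implementation needs.

```python
class ResourceManager:
    """Hypothetical resource manager that votes in phase 1 and obeys in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        """Phase 1: vote on whether the local work can be made durable."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


class TransactionManager:
    """Assigns transaction identifiers and coordinates the outcome."""

    def __init__(self):
        self.next_id = 1

    def run(self, resource_managers):
        txn_id, self.next_id = self.next_id, self.next_id + 1
        # Phase 1: ask every resource manager to prepare and collect the votes.
        votes = [rm.prepare() for rm in resource_managers]
        # Phase 2: commit only if all voted yes, otherwise roll back everywhere.
        if all(votes):
            for rm in resource_managers:
                rm.commit()
            return txn_id, "committed"
        for rm in resource_managers:
            rm.rollback()
        return txn_id, "aborted"


tm = TransactionManager()
good = [ResourceManager("database"), ResourceManager("message queue")]
mixed = [ResourceManager("database"), ResourceManager("message queue", can_commit=False)]
print(tm.run(good))
print(tm.run(mixed))
```

A single negative vote is enough to abort the transaction at every participant, which is how atomicity is preserved across several resource managers.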
Commercial TP Monitors
In general, all of the products follow the Open Group standard DTP architecture.
The following products are investigated:
BEA Systems Inc.'s Tuxedo,
IBM's TXSeries (Transarc's Encina),
Microsoft Transaction Server MTS,
Itautec's Grip.
In the following sections, we briefly discuss when to use these products. For a
detailed discussion we refer to [2].
BEA Systems Inc.'s Tuxedo
You are developing object-based applications. Tuxedo works with non-object
based applications, but it is especially suited to object based ones. In fact, you
cannot implement object-based applications involving transactions without
under-pinning them with a distributed TP Monitor like Tuxedo. (ORBs without
TP underpinning are not secure.)
You have a large number of proprietary and other platforms which you need to
integrate. You use Oracle, DB2/6000, Microsoft SQL Server, Gresham ISAM-XA,
Informix DBMSs or MQSeries, and you need to build transaction processing
systems that update all these DBMSs/resource managers concurrently.
You want to integrate Internet/Intranet based applications with in-house
applications for commercial transactions.
IBM's TXSeries (Transarc's Encina)
Your programmers are familiar with C, C++ or Corba OTS
You use Oracle, DB2/6000, MS SQL Server, Sybase, CA-Ingres, Informix,
ISAM-XA, MQSeries and/or any LU6.2-based mainframe transaction and you
need to build transaction processing systems that update or inter-operate with all
You need to build applications that enable users to perform transactions or access
files over the Internet.
You are already a big user of CICS and intend to be, in the future, a user of
predominantly IBM machines and operating systems. Note that if you intend to
use non-IBM operating systems you must be prepared to use DCE.
You are attracted by IBM's service and support network.
You are prepared to dedicate staff to performing all the administrative tasks
needed to ensure CICS is set up correctly and performs well across the company.
You do not need an enterprise-wide, self-administering solution supporting a
large range of different vendors' machines.
Microsoft Transaction Server MTS
You are building a multi-tier (especially Internet) application based on
Microsoft Backend Server Suites.
Your system architecture is built upon the DCOM architecture.
Your developers have good Visual Basic, Visual J++ and NT knowledge.
You are building a complex business application which stays in the Microsoft
world only.
NCR's TOP END
You need a strategic, high performance, high availability middleware product
that combines support for synchronous and asynchronous processing with
message queuing, to enable you to support your entire enterprise.
You use TCP/IP, Unix (AIX, HP-UX, Solaris, Dynix, Sinix, IRIX, NCR SvR4,
U6000 SvR4, Olivetti SvR4, SCO UnixWare, Digital Unix, Pyramid DC/OSx),
Windows (NT, 95 or 3), OS/2, MVS, AS/400 or TPF.
You need distributed transaction processing support for Oracle, Informix,
Sybase, Teradata, CA-Ingres, Gresham's ISAM-XA, Microsoft SQL Server or
Your programmers use C, Cobol or C++, Oracle Developer/2000, NatStar,
Natural, PowerBuilder, Informix 4GL, SuperNova, Visual Basic, Visual C++ (or
any other ActiveX compliant tool), Java and Web browsers.
Itautec's Grip
You need a TPM which is capable of supporting a cost-effective, stand-alone or
locally distributed application, which may exchange data with a central
You want to develop these applications on Windows NT or NetWare servers.
Your hardware and network configurations are relatively stable.
The DBMSs you intend to use are Oracle, Sybase, SQL Server, Btrieve or
The most mature products are Tuxedo, Encina, TOP END and CICS. Grip and MTS
lack some features and standards support.
If you are looking for enterprise-wide capacity, consider TOP END and Tuxedo. If
your project is medium sized, consider Encina as well. If you look for a product to
support a vast number of different platforms then Tuxedo may be the product to
choose. If DCE is already used as underlying middleware then Encina should be
MTS and Grip are low-cost solutions. If cost is not an issue then consider Tuxedo,
TOP END and Encina. Internet integration is best for MTS, Encina, Tuxedo and TOP
END.
Regarding support of objects or components MTS is clearly leading the field with a
tight integration of transaction concepts into the COM component model. Tuxedo and
Encina will support the competing CORBA object model from the OMG.
There seems to be a consolidation in the market for TP Monitors. On the one hand,
Microsoft has discovered the TP Monitor market and will certainly gain a big portion
of the NT server market. On the other hand, the former TP Monitor competitors are
merging, which leaves only IBM (CICS and Encina), BEA Systems (Tuxedo) and
NCR (TOP END) among the established vendors.
The future will heavily depend on the market decision about object and component
models such as DCOM, CORBA and JavaBeans and the easy access to integrated
development tools.
Multimedia databases
In this report we adopt the following definition of a multimedia database [3]:
A multimedia database (MMDB) is a high-capacity/high-performance database
system (including its management system) that supports multimedia data types, as
well as other basic alphanumeric types, and handles very large volumes of
(multimedia) information.
This definition can be divided into five objectives. A multimedia database system:
supports multimedia data types. Supporting a data type means that the data
type consists of a structure on which specific operations can be invoked.
Examples of multimedia data types are text/documents, pictures, audio, video,
graphics and classical data. The multimedia data types audio and video require
specific operations like fast-forward, rewind or pause.
has the capacity to handle a very large number of multimedia objects. The
objects are mostly large (e.g. audio and video) and multimedia libraries may
contain a huge number of objects.
management. High-performance storage is required for real-time responses (e.g.
stream processing) and handling huge objects. High-capacity is required because
the objects can bePrinter
very large.
objects have stricter requirements
than other objects
read-only), cost-effective
Station Station
storage management may be achieved by a hierarchical storage manager. In a
hierarchical storage manager data is spread over on-line, near-line and off-line
storage devices, depending on the requirements (mostly access frequency).
Content Database Hierarchical
Retrieval Server Storage
Multimedia Database Management
has conventional database capabilities. These include persistence, versioning,
transactions, concurrence control, recovery, querying, integrity, security and
performance. Transactions, concurrence control and querying will be discussed
in more detail hereafter.
has information-retrieval capabilities of multimedia data. Multimedia data is
typically not queried on exact match, but searched for information (e.g. pattern
matching, query by example) which requires probabilistic techniques. Querying
is therefore often an iterative process.
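The hierarchical storage idea in the third objective can be illustrated with a minimal Python sketch. The tier names follow the text; the access-frequency thresholds, the `MediaObject` type and the function `choose_tier` are assumptions made purely for illustration, not part of any product discussed in this report.

```python
from dataclasses import dataclass

# Hypothetical thresholds (accesses per month); real policies are site-specific.
ONLINE_MIN = 100
NEARLINE_MIN = 5


@dataclass
class MediaObject:
    name: str
    size_mb: int
    accesses_per_month: int


def choose_tier(obj: MediaObject) -> str:
    """Pick a storage tier from the access frequency alone."""
    if obj.accesses_per_month >= ONLINE_MIN:
        return "on-line"      # frequently used objects stay directly accessible
    if obj.accesses_per_month >= NEARLINE_MIN:
        return "near-line"    # occasionally used objects, e.g. in a jukebox
    return "off-line"         # rarely used objects, e.g. shelved volumes


catalogue = [
    MediaObject("news_clip.mpg", 700, 450),
    MediaObject("training_video.mpg", 1200, 12),
    MediaObject("archive_1995.wav", 300, 0),
]
placement = {obj.name: choose_tier(obj) for obj in catalogue}
```

A real hierarchical storage manager would also weigh object size, media cost and migration cost; the sketch only shows the frequency-driven placement mentioned in the text.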
Querying and content retrieval in MMDBs
With the increased complexity of the data objects in an MMDB, exact matches on MM
objects are rare. Instead, queries on an MMDB concentrate on the information (i.e.
the content) of the object rather than on the data itself. For example, pictures can be
queried for round red shapes instead of counting the number of red pixels. Querying
in MMDBs therefore incorporates fuzzy values and answers with degrees of
probability. This makes querying an iterative process rather than a single step.
Typical of querying MMDBs is ‘query by example’ (finding objects similar to a
presented object).
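As a rough illustration of query by example, the following Python sketch ranks stored objects by their similarity to a presented object. It assumes each object has already been reduced to a small feature vector (here a hypothetical three-value colour histogram) and uses cosine similarity as the score; both choices are assumptions for illustration, not a technique prescribed by the report.

```python
import math


def cosine_similarity(a, b):
    """Similarity between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query_by_example(example, library, top_k=2):
    """Return the top_k most similar objects together with their scores."""
    scored = [(cosine_similarity(example, feats), name)
              for name, feats in library.items()]
    scored.sort(reverse=True)
    return [(name, round(score, 3)) for score, name in scored[:top_k]]


# Hypothetical (red, green, blue) colour shares per picture.
library = {
    "sunset.jpg": [0.8, 0.1, 0.1],
    "forest.jpg": [0.1, 0.8, 0.1],
    "tomato.jpg": [0.9, 0.05, 0.05],
}
results = query_by_example([1.0, 0.0, 0.0], library)
```

The answer is a ranked list with scores rather than an exact match, which is why the user typically refines the example and queries again.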
Content based retrieval is very complex when it is based only on the base data of the
MM object. Therefore MM objects can be annotated with data to enable quick and
accurate retrieval later on. Annotation can be done manually, automatically or a
combination of both.
Indexing MM objects is very important, because it provides high performance access
to the large objects. Three types of indexes are widely used in RDBMSs and
- single-key index structures
- multi-key index structures
- content-based index structures.
A single-key index enables fast access to objects based on a single attribute of the
object. An example is the primary key. Single-key index structures can be defined
on large MM objects, which are stored as Binary Large Objects (BLOB). An
interesting single-key index structure is the positional B+ tree structure, where the
BLOB is partitioned into equally sized blocks which can be accessed through a tree
Multi-key index structures provide fast access to objects based on multiple
attributes, which are examined at the same time. Multi-key indexes are especially
important for MM objects, because a search typically has to check multiple attributes.
A number of multidimensional index structures exist, e.g. k-d trees,
multi-dimensional trees, grid files, point-quad trees and R-trees.
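To make the idea of a multi-key index concrete, here is a deliberately simplified Python sketch in the spirit of a grid file. It uses a fixed cell size, whereas a real grid file adapts its scales and buckets to the data; the class name `GridIndex` and the two attributes (a duration and a year) are hypothetical.

```python
from collections import defaultdict


class GridIndex:
    """Toy two-attribute index: each grid cell lists the objects that fall in it."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, obj_id, x, y):
        self.cells[self._cell(x, y)].append((obj_id, x, y))

    def range_query(self, x_min, x_max, y_min, y_max):
        """Visit only the cells overlapping the query rectangle."""
        cx_min, cy_min = self._cell(x_min, y_min)
        cx_max, cy_max = self._cell(x_max, y_max)
        hits = []
        for cx in range(cx_min, cx_max + 1):
            for cy in range(cy_min, cy_max + 1):
                for obj_id, x, y in self.cells.get((cx, cy), []):
                    if x_min <= x <= x_max and y_min <= y <= y_max:
                        hits.append(obj_id)
        return sorted(hits)


index = GridIndex(cell_size=10)
index.insert("clip_a", 12, 95)   # (duration in minutes, year), both hypothetical
index.insert("clip_b", 47, 98)
index.insert("clip_c", 15, 99)
found = index.range_query(10, 20, 90, 99)
```

Both attributes are constrained in one pass over a few cells, instead of scanning one single-key index and filtering on the second attribute afterwards.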
Content-based index structures concentrate on the content rather than on attributes
describing the object. Two important types for content-based indexing are: inverted
indexes and signature indexes. An inverted index is a list of pairs (value, set) where
the set includes all relevant objects associated with the value. Signature indexes
associate each object with a signature. The signature is a complex string which
encodes information about the object. Hence to identify relevant objects, only their
signatures must be scanned. Although signature indexes are much more efficient in
storage, they require more complex algorithms.
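The two content-based structures can be contrasted with a small Python sketch. The annotation sets are invented, and the signature encoding shown (one bit per term, OR-ed together, often called superimposed coding) is only one possible way to build signatures; it is chosen here as an assumption for illustration.

```python
import zlib
from collections import defaultdict

# Hypothetical annotations attached to three pictures.
annotations = {
    "img1": {"red", "round", "fruit"},
    "img2": {"blue", "sky"},
    "img3": {"red", "car"},
}

# Inverted index: a list of pairs (value, set of objects containing the value).
inverted = defaultdict(set)
for obj_id, terms in annotations.items():
    for term in terms:
        inverted[term].add(obj_id)

SIGNATURE_BITS = 64


def signature(terms):
    """OR together one deterministically chosen bit per term."""
    sig = 0
    for term in terms:
        sig |= 1 << (zlib.crc32(term.encode("utf-8")) % SIGNATURE_BITS)
    return sig


signatures = {obj_id: signature(terms) for obj_id, terms in annotations.items()}


def signature_candidates(query_terms):
    """Scan only the signatures; different terms may share a bit (false drops)."""
    query_sig = signature(query_terms)
    return sorted(o for o, s in signatures.items() if s & query_sig == query_sig)


def signature_search(query_terms):
    """Remove false drops by checking the candidates against the real data."""
    wanted = set(query_terms)
    return [o for o in signature_candidates(query_terms) if wanted <= annotations[o]]
```

The signature table needs one integer per object, which is the storage advantage noted above, while the extra verification step is part of the added algorithmic complexity.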
Transactions, concurrency and versioning in MMDBs
Typically an MMDB consists of multiple components (e.g. RDBMS, hierarchical
storage manager, full text retrieval engine), so transactions, concurrency control and
versioning are complex tasks in MMDBs. These three topics are discussed next.
Traditional applications often involve short transactions (such as debit/credit
transactions), while multimedia database applications involve long as well as short
transactions. Long transactions are particularly common in graphics and
computer-aided design applications. A number of transaction management techniques
have been developed especially for object-oriented and multimedia database
applications that involve long-duration transactions. One of the earliest transaction
models for handling these was the nested transaction model.
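The core idea of nested transactions can be sketched in a few lines of Python. This is a toy model over an in-memory dictionary, with no concurrency or durability; the class `Transaction` and its methods are hypothetical names. It only shows that a subtransaction commits into its parent and that aborting a subtransaction leaves the parent's work intact.

```python
class Transaction:
    """Toy nested transaction over an in-memory dict."""

    def __init__(self, store, parent=None):
        self.store = store
        self.parent = parent
        self.writes = {}

    def begin_nested(self):
        return Transaction(self.store, parent=self)

    def read(self, key):
        # Look at our own writes first, then at the enclosing transactions.
        txn = self
        while txn is not None:
            if key in txn.writes:
                return txn.writes[key]
            txn = txn.parent
        return self.store.get(key)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        if self.parent is not None:
            self.parent.writes.update(self.writes)  # visible to the parent only
        else:
            self.store.update(self.writes)          # top level: make permanent
        self.writes = {}

    def abort(self):
        self.writes = {}                            # parent is not affected


store = {"drawing": "v1"}
top = Transaction(store)
top.write("drawing", "v2")

failed_step = top.begin_nested()
failed_step.write("layer", "draft")
failed_step.abort()              # only this step is undone

retry_step = top.begin_nested()
retry_step.write("layer", "final")
retry_step.commit()

top.commit()
```

In a long design session this means that one failed step does not force hours of earlier work in the enclosing transaction to be rolled back.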
Concurrency control is the activity of synchronising the actions of transactions
occurring at the same time, thereby preventing mutual interference. There exist a
number of concurrency control algorithms such as locking, time-stamp ordering,
commit-time certification etc. The most widely used algorithms in multimedia
database implementations (as well as in commercial database implementations) are
the locking-based algorithms. The concept of locking is that each persistent object
has a lock associated with it. Before a transaction can access an object, it must request
the lock. If another transaction is currently holding that lock, the transaction
has to wait until the lock is released. The database system can provide locking at
different granularities (e.g. at the instance, record, compound object, class, page,
storage area or database level). Multimedia databases have several categories of
granules, i.e. physical storage organisation, classes and instances for object
orientation, class hierarchy locking, and complex or composite object locking.
Fine-grained locking compromises performance, whereas coarse-grained locking
compromises concurrency and increases the danger of deadlocks.
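The locking rule described above can be sketched as a small lock table in Python. This is a simplified model with exclusive locks only, where `try_lock` returns `False` instead of blocking; the class `LockManager` and the object identifiers are hypothetical.

```python
class LockManager:
    """Toy lock table: one exclusive lock per persistent object."""

    def __init__(self):
        self.owners = {}  # object id -> id of the transaction holding the lock

    def try_lock(self, txn_id, obj_id):
        holder = self.owners.get(obj_id)
        if holder is None or holder == txn_id:
            self.owners[obj_id] = txn_id
            return True
        return False      # held by another transaction: the caller has to wait

    def release_all(self, txn_id):
        for obj_id in [o for o, t in self.owners.items() if t == txn_id]:
            del self.owners[obj_id]


locks = LockManager()
first = locks.try_lock("T1", "video:42")    # T1 obtains the lock
second = locks.try_lock("T2", "video:42")   # T2 is refused and must wait
locks.release_all("T1")                     # T1 finishes
third = locks.try_lock("T2", "video:42")    # now T2 can proceed
```

The granularity is decided by what the object identifier stands for: locking `"video:42"` is fine-grained, whereas a single identifier for a whole class or storage area would be coarse-grained.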
In many applications that involve complex objects, references between objects need to
be consistently maintained. Some applications use the concept of generations of data
based on historic versions. This concept is rather straightforward for multimedia
applications supported by multimedia databases. The multimedia storage manager
must make sure to propagate the older versions of objects to write-once-read-many
(WORM) media and to store the volumes of much older versions off-line.
Multimedia objects in relational databases
Relational databases support variable-length fields in records. These data types are
supported with the intention of providing direct and easy definition of variable-length
types such as text, digitised audio/video and pictures (black-and-white, colour).
Products include, among others, Borland's InterBase, Sybase SQL Server and Plexus XDP.
InterBase is a relational database system from Borland with built-in support for
BLOBs. BLOBs are stored in collections of segments. To access and manipulate
the database, InterBase uses GDML, a proprietary high-level language.
The Sybase SQL Server supports two variable-length data types, TEXT and IMAGE,
where each such field can be large (2 GB). Via an API call, the database designer can
place the text (or multimedia data) on a device or volume separate from the database
The Plexus XDP database is actually based on the INFORMIX-Turbo RDBMS.
Plexus has extended the INFORMIX engine with a number of multimedia database
features. It provides support for two variable-length data types, TEXT and BYTES.
Here, the BYTES and TEXT data can be stored on magnetic, erasable optical, or
WORM devices where the optical drives are managed directly by XDP. Unlike most
RDBMSs, the XDP system manages optical jukeboxes and volumes directly (vs.
storing an operating system path in a character length field). XDP provides consistent
transaction management for both records and long variable-length fields (IMAGE and
Multimedia Objects in Object-Oriented Databases
There are many ways and approaches for integrating object-oriented and database
capabilities, including a number of standardisation efforts (by ODMG and OMG).
There are three main approaches that can be outlined that are especially relevant to
multimedia database applications:
- object-relational databases, i.e. databases that extend the relational model with
object-oriented capabilities (e.g. UniSQL and Illustra).
- object-oriented databases that extend or integrate with an object-oriented
language supporting persistence and other database capabilities (often C++ is
used) (e.g. GemStone DBMS, Versant, ObjectStore).
- application-specific approaches that might use any underlying DBMS but
concentrate on a specific application area. Examples are face-retrieval systems,
satellite imaging, earth and material science, medical imaging, etc. These systems
are often built as "one solution per application" and are therefore isolated from
each other.
Today, most multimedia databases are specific applications developed with
commercial DBMSs, HSMs or information-retrieval technologies. The state-of-the-art
is to rely on third-party vendors for each component and integrate these together (at
least some commercial DBMSs are starting to incorporate optical storage and
multimedia server support). Although a number of "general-purpose" multimedia
development tools with various multimedia editing, querying, and retrieval
capabilities have started to appear, the successful multimedia implementations
pertain to specific applications.
Databases and the World Wide Web
Nowadays, we cannot imagine a world without the Internet and the World
Wide Web (WWW). As the WWW is in fact one very large distributed multimedia
information system, very large databases play an important role. This section first
gives a short introduction to Internet and WWW concepts before describing the
possible ways of embedding databases in the architecture. As security is a major issue
on the Internet, this is also treated. For more detailed information, the reader is
referred to [3].
The Internet and the World Wide Web
The Internet is a huge set of autonomous systems connected in a network. In
principle, each system can communicate, directly or indirectly via routers, with all
other systems connected to the Internet. Each system can act as a client or as a server.
For communication to be possible, unique addressing of all systems is necessary. On
the Internet, the so-called Internet Protocol (IP) address is used for this. An IP
(version 4) address is a 32-bit number, usually written as four decimal numbers
between 0 and 255 separated by dots (at most twelve digits), and address allocation is
co-ordinated by a central Internet authority. The Internet is a packet-based network,
and the Internet Protocol by itself is not sufficient to enable real communication. A
number of other protocols on top of it enable the actual end-to-end communication
(see figure below). For detailed descriptions of all components the reader is referred
to [3], but what one can see is that the WWW (or simply the Web) is one of the many
applications (indicated by ovals in the figure) on the Internet.
[Figure: Important Internet protocols and applications (surviving labels: File transfer; Hardware - Physical network)]
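As a small illustration of IP addressing, the sketch below uses Python's standard `ipaddress` module to show that a dotted IPv4 address is simply a readable notation for one 32-bit number. The address used is taken from a range reserved for documentation and is an arbitrary example.

```python
import ipaddress

# 192.0.2.0/24 is reserved for documentation, so this address is only an example.
addr = ipaddress.IPv4Address("192.0.2.17")

as_int = int(addr)                                  # the underlying 32-bit number
octets = [int(part) for part in str(addr).split(".")]

# Rebuild the number by hand from the four decimal parts.
rebuilt = (octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]
round_trip = ipaddress.IPv4Address(rebuilt)
```

Each of the four parts occupies eight bits, which is why no part can exceed 255.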
Information on the WWW is mostly structured in documents using the Hyper Text
Mark-up Language (HTML). HTML enables the structuring of multimedia
information in a uniform way, so that so-called Web browsers (e.g. from Netscape and
Microsoft) can render it to the user. Another important facility of HTML is the
possibility to point from one document to another document via so-called Hyper
Links. By means of these hyper links users can navigate transparently over the whole
The rather static presentation of HTML document-based information on the Internet
proved insufficient to satisfy the needs of users. A more dynamic way of
generating information on request was needed. This is where databases can play an
important role.
Database gateway architectures
At the server side, several architectures are possible for connecting databases to the
WWW. In the following the most important ones are described.
The figure below shows a "Database gateway as a CGI executable program". The
Webserver receives requests from Internet users and starts a Common
Gateway Interface (CGI) process for each user request. Each process sets up a session
with the database, passing on any parameters. The database then retrieves the data and
returns it to the requesting CGI process, which in turn returns it to the webserver. The
webserver returns the data to the requesting Internet user.
[Figure: Database gateway as a CGI executable program]
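A database gateway written as a CGI program could look roughly like the Python sketch below. It assumes a Web server that passes the request parameters in the `QUERY_STRING` environment variable (as CGI does) and uses SQLite as a stand-in database; the file name `customers.db`, the table `customers` and the function names are hypothetical.

```python
import html
import os
import sqlite3
from urllib.parse import parse_qs

DB_PATH = "customers.db"   # hypothetical database file


def handle_request(query_string, db_path=DB_PATH):
    """Answer one request: open a session, query, close, return the response."""
    params = parse_qs(query_string)
    customer_id = params.get("id", [""])[0]

    conn = sqlite3.connect(db_path)      # a new database session per request
    try:
        row = conn.execute(
            "SELECT name FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
    finally:
        conn.close()

    name = row[0] if row else "unknown"
    body = "<html><body>Customer: %s</body></html>" % html.escape(name)
    return "Content-Type: text/html\r\n\r\n" + body


def main():
    # The Web server starts the program once per request and reads its output.
    print(handle_request(os.environ.get("QUERY_STRING", "")))

# When installed as a CGI script, the file would end with a call to main().
```

Every request pays for a process start and a fresh database session, which is the cost the next paragraph refers to.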
Although the concept is simple, the main drawbacks are the large amount of system
resources needed to host a separate CGI process for each user and the relatively long
start-up time of a database session for each user. To overcome these drawbacks, the
"CGI application server architecture" was devised.
In the CGI application server architecture, an Application Server on a middle tier
takes care of the efficient set-up of a pool of sessions with the database. For each
Internet user request, the Webserver starts a very small CGI process (the dispatcher)
to pass parameters to the application server, which in turn chooses an already existing
database session from a session pool to answer the request. Data from the database is
returned along the same route in the opposite direction.
[Figure: CGI application server architecture]
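The session pool at the heart of the application server can be sketched as follows. This is a minimal illustration with SQLite connections standing in for database sessions; the class `SessionPool` and its `run` method are hypothetical names, and a real application server would add time-outs, error recovery and the dispatcher protocol.

```python
import queue
import sqlite3


class SessionPool:
    """Keeps a fixed number of database sessions open and lends them out."""

    def __init__(self, db_path, size):
        self._idle = queue.Queue()
        for _ in range(size):
            # Sessions are opened once, when the application server starts.
            self._idle.put(sqlite3.connect(db_path, check_same_thread=False))

    def run(self, sql, params=()):
        conn = self._idle.get()          # waits if all sessions are in use
        try:
            return conn.execute(sql, params).fetchall()
        finally:
            self._idle.put(conn)         # hand the session back to the pool


pool = SessionPool(":memory:", size=2)
rows = pool.run("SELECT ? + 1", (41,))
```

Requests no longer pay the session start-up cost; they only wait when all pooled sessions are busy.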
An even more efficient, but Webserver-proprietary, solution is the "Server API
architecture". In this architecture the Webserver comes with an API to extend its
functionality. By adding database connection functionality to the Webserver, a very
efficient route from the Internet user via the webserver to the database is created. A
drawback of this solution is that errors in the code of the database gateway may
corrupt the working of the whole web server.
[Figure: Server API architecture (Web server with server API)]
Web databases and security
In principle, each system on the Internet can connect to any other system connected to
the Internet. For this reason, security is a major concern when connecting databases
to the Internet. Only authorised users, both inside and outside the organisation,
should be able to get data into or out of a database. The two main approaches to
provide this are firewalls and encryption techniques.
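The figure below gives a typical architecture for connecting a database to the Internet
in a secure way.
[Figure: typical architecture for securely connecting a database to the Internet (surviving label: LAN network)]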
The firewall is a gatekeeper that examines network data coming into and going out of
the organisation's internal LAN. There are several types of firewalls, each with
its own way of examining network data, but the three major types are:
- Screening Routers: low network level systems that look inside the IP packets to
determine whether a packet may pass, depending on, among other things, sender
address and receiver address.
- Proxy Server Gateways: operate at a higher level in the protocol stack (e.g.
HTTP) to provide more means of monitoring and controlling access to the
internal network (e.g. hiding the IP addresses of the internal systems from the
outside world, thus preventing intruders from directly connecting to an internal
system).
- Stateful Inspection Firewalls: compare bit patterns of passing packets with the
bit patterns of packets that are already known to be trusted. In contrast with
proxy server gateways, this type of firewall does not need to evaluate a lot of
information in a lot of packets, resulting in much less overhead.
The first type is the simplest but offers less security than the other two types.
The second type is the most commonly used, but generates a lot of overhead. The third
type is now emerging. Note that no matter how secure a firewall can theoretically
be, a good firewall policy has to be implemented in an organisation to keep intruders
out. Employees by-passing the firewall with, for example, a private modem connection
to the Internet are a threat to the safety of the internal network.
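The decision a screening router takes for each packet can be sketched in Python with the standard `ipaddress` module. The rule table, the address ranges and the function `screen` are assumptions for illustration; the sketch looks only at sender and receiver address, with the first matching rule deciding and everything else being rejected.

```python
import ipaddress

net = ipaddress.ip_network

# Hypothetical policy: (action, allowed sender range, allowed receiver range).
RULES = [
    ("allow", net("192.0.2.0/24"), net("10.0.0.80/32")),  # partner -> web server
    ("deny",  net("0.0.0.0/0"),    net("10.0.0.0/8")),    # anything else inbound
    ("allow", net("10.0.0.0/8"),   net("0.0.0.0/0")),     # internal LAN outbound
]


def screen(sender, receiver):
    """Return True if a packet with these addresses may pass."""
    src = ipaddress.ip_address(sender)
    dst = ipaddress.ip_address(receiver)
    for action, src_net, dst_net in RULES:
        if src in src_net and dst in dst_net:
            return action == "allow"     # first matching rule decides
    return False                         # default: reject


partner_to_web = screen("192.0.2.9", "10.0.0.80")
stranger_to_web = screen("198.51.100.1", "10.0.0.80")
internal_to_outside = screen("10.0.0.7", "198.51.100.1")
```

Because only addresses are inspected, such a filter is cheap, which matches the remark above that this type is the simplest but also the least secure.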
Apart from firewalls, encryption techniques can be used to encrypt data sent over the
Internet, thus preventing intruders from analysing this data and breaking into the
internal network. The Secure Socket Layer (SSL) is an open protocol specification for
providing generally accepted encrypted data transport over the Internet. At the moment,
SSL is implemented as part of Netscape’s and Microsoft’s proprietary webserver
Web database products
In principle nothing has to be changed at the database side to connect it to the web, so
all existing databases can be connected to the web. From a database point of view, it
does not matter whether a user connects via the Internet or via a direct connection.
But, as described above, a webserver is often used in the case of an Internet
connection. All major database vendors like Oracle, Informix, IBM and Microsoft
offer software to connect their databases (and the databases of competitors) to the
web. Although such a connection is possible, one should take into account the special
demands posed on the database by the Internet, such as large numbers of users
(scalability), contents in HTML format (multimedia support), 7x24 use (availability)
and a lot of read transactions (data warehousing). Database vendors have tuned their
databases to meet these typical Internet needs.
Although it is still difficult to quantify the profits of the Internet, it is difficult to
imagine a world without it. Together with the increasing demand for (personalised)
information via the Web, the role of very large (multimedia) databases increases. All
major database vendors and other dedicated Internet vendors already offer products
(e.g. webservers) to connect their databases to the web in a secure and efficient way.
Vendors tune their DBMSs to meet the special requirements of Internet use, like
handling large numbers of concurrent users, processing multimedia contents and
providing high availability.
Mapping of telecommunication applications onto database technologies
As mentioned earlier, this Project focuses on Service Management applications and
Service Provisioning applications. In this chapter we list some applications we have in
mind when talking about Service Management and Service Provisioning. Second, the
mentioned telecommunication applications are mapped onto the database technologies
from chapter 3 by means of a matrix. Given a service management or a service
provisioning application, this matrix gives an indication of the relevant database
technologies. For more detailed information on the realisation of the matrix, the
reader is referred to [4].
Service management applications
The management of telecommunication services is a key issue for PNOs. For handling
different services, there are management systems such as billing systems, customer
ordering systems, customer/user management systems, etc. These management systems
rely heavily on very large database systems. Examples of these applications are:
- Billing: both for customers (customer billing) and for providers (provider billing).
Billing is a core process to get money for the services offered to customers.
- Session management: for monitoring sessions of customers and storing related
information (e.g. call detail records (CDR)).
- Customer registration: for knowing who each customer is and what he wants.
So-called "Customer Profiles" play an important role in customer
personalisation. For each customer all relevant information (e.g. address, call
behaviour, installed base etc.) is stored to satisfy the needs of individual
customers. So-called "1-to-1 (database) marketing" may give PNOs a competitive
edge in the near future.
- Number portability: to enable customers to keep the same telephone number
when they change from one operator to another.
- Home location registering of mobile phones: for keeping track of which
base station is nearest to which mobile phone.
- Service Order Entry: for supporting the order processing within the
- Enterprise Resource Planning (ERP): for enterprise-wide support of process
flow and related data. ERP promises one database, one application and one user
interface for the entire enterprise, where once disparate systems ruled
manufacturing, finance, distribution and sales.
- Business Intelligence: for analysing customer behaviour and starting
appropriate marketing campaigns.
Service provisioning applications
Applications for supporting Service Provisioning rely heavily on very large database
technology, and their number and size are increasing. Whereas Service Management
focuses on managing processes within the company, Service Provisioning focuses on
offering services to the customers. Examples of these applications are:
- Search Engines: offer customers facilities for searching in information bases
(e.g. the Internet).
- Electronic Commerce: offers a variety of services for buying and selling goods
and services electronically via the Internet.
- Hosting Multimedia: for hosting multimedia information (e.g. audio, video,
image, text, etc.).
- Video On Demand Service: for streaming a video to a customer on demand.
- Audio on Demand Service: offers the customer the possibility to download
- On-line Publishing: for on-line publishing of (multimedia) content on e.g. the
- Digital Library: for disclosure of digitised collections of books, articles and
other published media information.
- Tele-Education: incorporates all telecom-based educational services,
ranging from one-way educational material broadcast to customers to
electronic classrooms displayed with VR technology.
- Mobile Information Service: for providing the customer with information on
restaurants, shops, offers, etc. located close to the customer, based on
the customer's interests and current position.
- Fleet Management Service: for providing the customer with information on
the position of people and vehicles.
- Computer Supported Co-operative Work (CSCW): supports people
separated by long physical distances in working together. The service provides the
customer with all necessary help for communication and effective work, e.g.
video/audio conferencing, sharing of documents/white-boards, e-mail facilities,
common document database etc.
- Monitoring related services: for giving key customers the possibility to monitor
and partially control their own use of telecom services.
The telecommunication applications/database technologies
After having defined the relevant telecommunication applications in the previous
section, we now continue with mapping these applications onto the database
technologies discussed in chapter 3.
In the matrix below the degree of support a database technology provides to an
application is indicated by the following grades:
Required (req.)
Applying Technology y to realise the Application x is required, otherwise it is
difficult to realise Application x.
To realise the Application x the usage of Technology y is helpful. Other
technologies may offer the same capabilities.
This Technology is not directly needed to realise Application x, but the decision
to use this technology depends on the expected external influences like expected
traffic, number of users, workload, existing communication environment, legacy
systems to be integrated etc.
No influence, not applicable.
The matrix gives a first impression of which database technologies are relevant for
which applications. The matrix is the result of combining detailed information given
in [4].
[Table: Mapping of telecommunication applications and database technologies (surviving labels: Parallel DB; Fleet mngm, Hosting MM, Online Publ)]
Underlined = Service Management
Normal = Service Provisioning
General analysis and recommendations
In this main report we have summarised the developments on the database
technologies for the construction of very large databases. Further details can be found
in the following parts of the Deliverable.
For conclusions on individual technologies we refer to the analysis sections in chapter
3. Here we only recall a few conclusions on those technologies that we feel are
currently most relevant. The main drivers for these developments nowadays are low cost
database platforms, data warehouses, Web applications, and the issue of data
As far as hardware platform support for Very Large Databases is concerned, we see
the following situation. For very large operational databases, i.e. databases that
require heavy updating, mainframe technology (mostly MPP architectures, Massively
Parallel Processors) is by far the most dominant technology. For data warehouses,
on the other hand, which mostly support retrieval, we see a strong position for the
high-end UNIX SMP (Symmetric Multi Processor) architectures. The big question with
respect to the future is about the role of Windows NT on Intel. Currently these
technologies play no role in very large databases; however, this may change in the
coming years. There are two main streams with respect to NT and Intel: on the one
hand NUMA (Non Uniform Memory Architecture) with Intel processors, and on the other
hand clustered Intel machines. NUMA is more mature and supports major databases
like Oracle and Informix. However, NUMA is still based on Unix, although suppliers
are working on NT implementations. Database technology supporting NT clusters is not
really available yet, with the exception of IBM DB2. This area will be closely followed
by the Project and actual experiments may be planned to assess this technology.
Multimedia database and Web-related database technology are developing very fast.
All major database vendors support Web connectivity nowadays. There is a strong
focus on database-driven Web sites and E-commerce servers for the Web. The
support for multimedia data, however, is rather rudimentary. Although vendors like
Oracle, Informix and IBM have made a lot of noise about Universal Servers that support
multimedia data, the proposed extensible architectures turned out to be relatively
closed and unstable. Current practice is still mainly to handle multimedia data
outside the database.
Data warehouse technology is one of the most dynamic areas nowadays. All database
vendors and mainframe vendors are active in this area. One has to be very careful
here: a data warehouse is not simply a large database. There is a lot of additional
technology for data extraction, metadata management, and architectures. Of course all
major vendors have their own methodology, and care has to be taken not to be locked
in. A rather new development is that of operational data stores, which are data
warehouses with limited update capabilities. Especially for the propagation of these
updates back to the originating databases, no stable solutions exist. Therefore great
care has to be taken when embarking on operational data stores.
Finally, as telecommunication services are becoming more and more data intensive,
the role of database technology will increase. Therefore, decisions with respect to
database technology become crucial elements to maintain control over the data
management around those services, and also to maintain a strong, flexible and
competitive position.
References
[1] EURESCOM P817, Deliverable 1, Volume 2, Annex 1 - Architectural and
Performance issues, September 1998
[2] EURESCOM P817, Deliverable 1, Volume 3, Annex 2 - Data Manipulation
and Management Issues, September 1998
[3] EURESCOM P817, Deliverable 1, Volume 4, Annex 3 - Advanced Database
technologies, September 1998
[4] EURESCOM P817, Deliverable 1, Volume 5, Annex 4 - Database
Technologies in Telecommunication Applications, September 1998