Download Distributed Database

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Entity–attribute–value model wikipedia , lookup

Serializability wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Open Database Connectivity wikipedia , lookup

Microsoft Jet Database Engine wikipedia , lookup

Relational model wikipedia , lookup

Functional Database Model wikipedia , lookup

Database wikipedia , lookup

Clusterpoint wikipedia , lookup

Database model wikipedia , lookup

Concurrency control wikipedia , lookup

Transcript
Outline
• Introduction
– What is a distributed DBMS
– Distributed DBMS Architecture
•
•
•
•
•
•
•
•
•
•
•
•
•
Background
Distributed Database Design
Database Integration
Semantic Data Control
Distributed Query Processing
Multidatabase query processing
Distributed Transaction Management
Data Replication
Parallel Database Systems
Distributed Object DBMS
Peer-to-Peer Data Management
Web Data Management
Current Issues
Ch.1/1
File Systems
program 1
data description 1
File 1
program 2
data description 2
File 2
program 3
data description 3
File 3
Ch.1/2
Database Management
Application
program 1
(with data
semantics)
Application
program 2
(with data
semantics)
DBMS
description
manipulation
control
database
Application
program 3
(with data
semantics)
Ch.1/3
Motivation
Database
Technology
Computer
Networks
integration
distribution
Distributed
Database
Systems
integration
integration ≠ centralization
Ch.1/4
Distributed Computing
• A number of autonomous processing elements
(not necessarily homogeneous) that are
interconnected by a computer network and that
cooperate in performing their assigned tasks.
• What is being distributed?
–
–
–
–
Processing logic
Function
Data
Control
Ch.1/5
What is a Distributed Database System?
A distributed database (DDB) is a collection of multiple,
logically interrelated databases distributed over a
computer network.
A distributed database management system (D–DBMS)
is the software that manages the DDB and provides an
access mechanism that makes this distribution
transparent to the users.
Distributed database system (DDBS) = DDB + D–DBMS
Ch.1/6
What is not a DDBS?
• A timesharing computer system
• A loosely (separate primary memory and
shared secondary memory) or tightly coupled
(shared memory) multiprocessor system
• A database system which resides at one of the
nodes of a network of computers - this is a
centralized database on a network node
Ch.1/7
Centralized DBMS on a Network
Site 1
Site 2
Site 5
Communication
Network
Site 4
Site 3
Ch.1/8
Distributed DBMS Environment
Site 1
Site 2
Site 5
Communication
Network
Site 4
Site 3
Ch.1/9
Implicit Assumptions
• Data stored at a number of sites  each site logically
consists of a single processor.
• Processors at different sites are interconnected by a
computer network  not a multiprocessor system
– Parallel database systems
• Distributed database is a database, not a collection of
files  data logically related as exhibited in the users’
access patterns
– Relational data model
• D-DBMS is a full-fledged DBMS
– Not remote file system, not a TP system
Ch.1/10
Data Delivery Alternatives
• We characterize the data delivery alternatives
along three orthogonal dimensions:
• Delivery modes
• Frequency
• Communication Methods
• Note: not all combinations make sense
Ch.1/11
Data delivery
• Delivery modes
– Pull-only {the transfer of data from servers to clients is initiated by a client
pull}
• Push-only {the transfer of data from servers to clients is initiated by a
server push in the absence of any specific request from clients.
periodic, irregular, or conditional}
– Hybrid (mix of pull and push)
• Frequency
• Periodic (A client request for IBM’s stock price every week is an
example of a periodic pull.)
• Conditional (An application that sends out stock prices only when
they change is an example of conditional push.)
– Ad-hoc or irregular
• Communication Methods {Unicast, One-to-many}
Ch.1/12
Ch.1/12
Distributed DBMS Promises
Transparent management of distributed,
fragmented, and replicated data
Improved reliability/availability through
distributed transactions
Improved performance
Easier and more economical system
expansion
Ch.x/13
Ch.1/13
Transparency
• Transparency is the separation of the higher
level semantics of a system from the lower
level implementation issues.
• Fundamental issue is to provide
data independence
in the distributed environment
– Network (distribution) transparency
– Replication transparency
– Fragmentation transparency
• horizontal fragmentation: selection
• vertical fragmentation: projection
• hybrid
Ch.x/14
Ch.1/14
Example
SELECT ENAME,SAL
FROM
EMP,ASG,PAY
WHERE DUR > 12
AND
EMP.ENO = ASG.ENO
AND
PAY.TITLE =
EMP.TITLE
Ch.1/15
Transparent Access
Tokyo
SELECT
FROM
WHERE
AND
AND
Paris
ENAME,SAL
Boston
EMP,ASG,PAY
DUR > 12
EMP.ENO = ASG.ENO
PAY.TITLE = EMP.TITLE
Communication
Network
Paris projects
Paris employees
Paris assignments
Boston employees
Boston projects
Boston employees
Boston assignments
Montreal
New
York
Boston projects
New York employees
New York projects
New York assignments
Montreal projects
Paris projects
New York projects
with budget > 200000
Montreal employees
Montreal assignments
Ch.1/16
Distributed Database - User View
Distributed Database
Ch.1/17
Distributed DBMS - Reality
User
Query
DBMS
Software
DBMS
Software
DBMS
Software
User
Application
DBMS
Software
Communication
Subsystem
User
Query
User
Application
DBMS
Software
User
Query
Ch.1/18
Types of Transparency
• Data independence {It refers to the immunity of
user applications to changes in the definition
and organization of data, and vice versa.
• Logical data independence and physical data
independence}
• Network transparency (or distribution
transparency)
– Location transparency
– Fragmentation transparency
• Replication transparency
• Fragmentation transparency
Ch.1/19
Who Should Provide Transparency?
• Nevertheless, the level of transparency is inevitably a compromise
between ease of use and the difficulty and overhead cost of
providing high levels of transparency.
• Gray argues that full transparency makes the management of
distributed data very difficult and claims that “applications coded
with transparent access to geographically distributed databases
have: poor manageability, poor modularity, and poor message
performance”.
• He proposes a remote procedure call mechanism between the
requestor
• users and the server DBMSs whereby the users would direct their
queries to a specific DBMS.
• Application level {code of application, little transperancy}
• Operating system {device drivers within the operating system}
• DBMS
Ch.1/20
Ch.1/20
Reliability Through Transactions
• Replicated components and data should make
distributed DBMS more reliable. {eliminate
single points of failure}
• Distributed transactions provide
• Concurrency transparency {sequence of
database operations executed as an atomic
action. consistent db transformed to another
consistent db state}
– Failure atomicity {update salary by 10%}
Ch.1/21
Potentially Improved Performance
• Proximity of data to its points of use {data
localization}
– Requires some support for fragmentation and
replication
1. Since each site handles only a portion of
the database, contention for CPU and I/O
services is not as severe as for centralized
databases.
2. Localization reduces remote access delays
Ch.1/22
• Parallelism Requirements
• read only queries
Have as much of the data required by each
application at the site where the application
executes
– Full replication
• How about updates?
Ch.1/23
System Expansion
• Issue is database scaling
• Emergence of microprocessor and
workstation technologies
– Demise of Grosh's law
– Client-server model of computing
Ch.1/24
Complications Introduced by
Distribution
• data may be replicated, the distributed
database system is responsible for
(1) choosing one of the stored copies of the
requested data for access in case of
retrievals, and
(2) making sure that the effect of an update is
reflected on each and every copy of that data
item.
• if some sites fail or communication fail,
DBMS will ensure update for fail site as soon as
Ch.1/25
Ch.1/25
Distributed DBMS Issues
• Distributed Database Design {chapter 3}
– How to distribute the database {portioned and
replicated}
– Replicated (partial dupliacated or fully duplicated)
& non-replicated database distribution
– Fragmentation
– {research area to minimize cost of storing,
processing transactions and communication is NP
hard. Proposed solution are based on heuristics}
Ch.1/26
Distributed DBMS Issues
• Query Processing {chapter 6-8}
– Convert user transactions to data manipulation
instructions
– Optimization problem
• min{cost = data transmission + local processing}
– General formulation is NP-hard
• Concurrency Control {chapter 11}
– Synchronization of concurrent accesses
– The condition that requires all the values of
Ch.1/27
• Distributed deadlock management
• The deadlock problem in DDBSs is similar in nature to
that encountered in operating systems. The
competition among users for access to a set of
resources (data, in this case) can result in a deadlock if
the synchronization mechanism is based on locking.
• The well-known alternatives of prevention, avoidance,
and detection/recovery also apply to DDBSs.
• Reliability and availability
– How to make the system resilient to failures
– Atomicity and durability
Ch.1/28
Ch.1/28
Relationship Between Issues
Directory
The same
information (i.e., fragment structure
Management
and
placement) is used by the query
processor toamong
determine
the query evaluation
There
is a strong
the concurrency
control
The
replication
of relationship
Finally, the
needproblem,
for replication
protocols
strategy. arise if
the deadlock
fragments
when
data
distribution
involves
replicas.
management
problem,
and reliability
issues.
This is to be expected,
they
are
Query As indicated above,
Distribution
there is a strong Reliability
since together
distributed
affects
Processing
Design
relationship
between
replication management
protocols andproblem. The
they
usually called
the transaction
theare
concurrency
The design
of deal
concurrency
control
techniques,
since
both
concurrency
control
strategies
On the
other
hand,
the
accessaffects many areas. It affects
distributed
databases
with
the
consistency
of
data,
but
from
control
algorithm
that
is
employed
will
determine
whether or not a
that
might
be
and usage patterns thatdirectory
are determined
by
the
management, because the
different
perspectives.
separate
deadlock
employed.
query
processor
are
used
as
inputs
Concurrency
definition of fragments and to
their placement determine the
management
facility and
is required.
If a locking-based algorithm is
Control
the
data distribution
fragmentation
contents of the directory
used, deadlocks
will
algorithms. Similarly,
directory
placement
(or directories) as well as the strategies that may be
whereas
theythe
willprocessing
not if timestamping
is the chosen
andoccur,
contents
influence
of
employed
to manage them.
Deadlock
queries. alternative.
Management
Ch.1/29
Architecture
• Defines the structure of the system
– components identified
– functions of each component defined
– interrelationships and interactions between
components defined
Ch.1/30
ANSI/SPARC Architecture
Users
External
Schema
External
view
External
view
Conceptual
Schema
Conceptual
view
Internal
Schema
Internal view
External
view
Ch.1/31
Differences between Three Levels of
ANSI-SPARC Architecture
© Pearson Education Limited 1995, 2005
32
Ch.1/32
Data Independence
• Logical Data Independence
– Refers to immunity of external schemas to
changes in conceptual schema.
– Conceptual schema changes (e.g.
addition/removal of entities).
– Should not require changes to external
schema or rewrites of application programs.
© Pearson Education Limited 1995, 2005
33
Ch.1/33
Data Independence
• Physical Data Independence
– Refers to immunity of conceptual schema to
changes in the internal schema.
– Internal schema changes (e.g. using different
file organizations, storage structures/devices).
– Should not require change to conceptual or
external schemas.
© Pearson Education Limited 1995, 2005
34
Ch.1/34
Data Independence and the ANSISPARC Three-Level Architecture
© Pearson Education Limited 1995, 2005
35
Ch.1/35
Generic DBMS Architecture
The interface layer manages the
interface to the applications.
View management consists of
translating the user query from
decomposes the query into a tree of
external data to
algebra
operations
and controls
tries to find
the
The
control
layer
the
conceptual data.
“optimal”
query by adding
semantic integrity
ordering of the
operations.
predicates
andThe result
is storedThe
in an
access
plan. The
query
processing
(oroutput
authorization
predicates.
thismaps the query
compilation) of
layer
layer is a query
expressed
into an
optimizedin lowerlevel code
(algebra
operations).
sequence
of lower-level
Finally,
theoperations.
consistency
layer
The
execution
layer directs
the
manages
concurrency
control
execution
of thelayer
access
plans,
The
data access
manages
and loggingtransaction
for update
theincluding
data structures that
requests. This
layer restart)
allows and
management
(commit,
implement the
files, indices,
transaction, system,ofand
media
algebra
etc.synchronization
It also manages the
buffers
recovery after failure.
by caching operations
the most frequently
accessed data.
Ch.1/36
DBMS Implementation Alternatives
Ch.1/37
Autonomy
Degree to which member databases can operate independently
Autonomy is a function of a number of factors such as whether the
component systems (i.e., individual DBMSs) exchange information,
whether they can independently execute transactions, and whether
one is allowed to modify them.
1. The local operations of the individual DBMSs are not affected
by their participation in the distributed system.
2. The manner in which the individual DBMSs process queries and
optimize them should not be affected by the execution of global
queries that access multiple databases.
3. System consistency or operation should not be compromised
when individual DBMSs join or leave the distributed system.
38
Ch.1/38
Dimensions of the Problem
• Distribution
– Whether the components of the system are located on the same
machine or not
• Heterogeneity
– Various levels (hardware, communications, operating system)
– DBMS important one
• data model, query language, transaction management algorithms
• Autonomy
– Not well understood and most troublesome
– Various versions
• Design autonomy: Ability of a component DBMS to decide on issues related to
its own design.
• Communication autonomy: Ability of a component DBMS to decide whether
and how to communicate with other DBMSs.
• Execution autonomy: Ability of a component DBMS to execute local operations
in any manner it wants to.
Ch.1/39
In Figure 1.10, we have identified three alternative
architectures that are the focus of this book and that we
discus in more detail in the next three subsections: (A0,
D1, H0) that corresponds to client/server distributed
DBMSs, (A0, D2, H0) that is a peer-to-peer distributed
DBMS and (A2, D2, H1) which represents a (peer-topeer)
distributed, heterogeneous multidatabase system. Note
that we discuss the heterogeneity issues within the context
of one system architecture, although the issue
arises in other models as well.
Ch.1/40
Ch.1/40
Client/Server Architecture
Ch.1/41
Advantages of Client-Server
Architectures
• More efficient division of labor
• Horizontal and vertical scaling of resources
• Better price/performance on client machines
• Ability to use familiar tools on client machines
• Client access to remote data (via standards)
• Full DBMS functionality provided to client
workstations
• Overall better system price/performance
Ch.1/42
Database Server
Ch.1/43
Distributed Database Servers
Ch.1/44
Datalogical Distributed DBMS
Finally,
user
applications
and
user
Architecture
Data independence
is supported
access
to the database
is supported
ES1
ES2
...
ESn
GCS
LCS1
LCS2
...
LCSn
LIS1
LIS2
...
LISn
since the
model
is an
To handle
data
fragmentation
and
by
We first note that the physical data
extension
ofthe
ANSI/SPARC,
which asof
replication,
logical
organization
external
schemas
(ESs),
defined
organization on each machine may
provides
such
independence
data
being
above
the
be,global
and conceptual
naturally.
Location
at each
site needs
to be described.
schema.
probably is, different. This means
and
replication
transparencies
Therefore,
there
needs to be are
a third
that there needs to be an individual
supported by layer
the definition
in the of the
internal schema
local and
architecture,
theglobal
local conceptual
definition at each site, which we call
conceptual
schemas
the
schema
(LCS).
In the and
architectural
the local internal schema (LIS).
mapping in
between.
Network
model
we have
The enterprise
transparency,
on theconceptual
chosen,
then, the global
view of the data is described by the
other
hand,isisthe
supported
schema
union of by
thethe
local
global conceptual schema (GCS),
definition of the
global conceptual
conceptual
which is global
schema.
schemas.
because it describes the logical
structure of the data at all the sites.
Ch.1/45
Peer-to-Peer Component
Architecture
Runtime
Support
Processor
Local Recovery
Manager
Local Query
Processor
Global
Execution
Monitor
Global Query
Optimizer
Semantic Data
Controller
User Interface
Handler
The semantic data controller uses the
PROCESSOR
DATA
PROCESSOR
ItUSER
is important
to note, atintegrity
this point,
that
our use
the
constraints
and of
authorizations
The second majorterms “user
processor”
System
that
are defined
as part ofLocal
the global
Global
Local
External
Log
component
of aprocessor”
distributed
Conceptual does
Conceptual
Internal
andSchema
“data
not imply
a functional
division
conceptual
schema
to
check
if the user
GD/D
Schema
Schema
Schema
DBMS isThe
the local
data processor
similar
to
client/server
query
optimizer,
query can
be processed.
User
The userwhich
interface
handler
is
andThese
divisions
are merely
organizational and Database
actually
acts as
the
access
path
requests systems.
responsible for interpreting user
consists of three elements:
there is no suggestion that
selector,
commands as
they should
placed
on
different
machines.
is recovery
responsible
forbe
choosing
the
best
USER
The local
manager
is
they come in, and formatting the
access
to sure
access
any data
item
responsible
for path5
making
that
result
data
as it is
sent to the user.
The
run-time
support
processor
physically accesses
the an
the
localquery
The
global
optimizer
and
decomposer
determines
System
database
according
to
the
physical
commands
in
the
database
remains
consistent
The
detailed
components
of
a
distributed
DBMS
execution
strategy
to
minimize
a
cost
function,
and
translates
responsesThe distributed execution monitor coordinates the distributed the
schedule
generated
by
the
querythe
optimizer.
Thelocal
run-time
even
when
failures
occur
are
shown.
One
global
queries
into
local
ones
using
global
and
conceptual
execution of the user request. The execution monitor is
also
support
processor
is as
thewell
interface
toglobal
thewith
operating
system
component
handles
the
interaction
users,
and
schemas
as
the
directory.
called the distributed transaction manager. In executing
and contains
the database
buffer
(or cache)
manager,
another
deals
with
the
storage.
The
queries in a distributed fashion, the execution monitors
whichfirst
is responsible
for maintaining
the
main
memory
major
component,
which
we
call
the
user
at various sites may, and usually do, communicate
with one
buffers
and managing
the
dataelements:
accesses.
processor,
consists
of
four
another.
Ch.1/46
Multidatabase systems (MDBS)
• Multidatabase systems (MDBS) represent the
case where individual DBMSs (whether
distributed or not) are fully autonomous and
have no concept of cooperation; they may not
even “know” of each other’s existence or how
to talk to each other.
47
Ch.1/47
Datalogical Multi-DBMS
Architecture
LES11
…
GES1
GES2
LES1n
GCS
...
GESn
LESn1
…
LCS1
LCS2
…
LCSn
LIS1
LIS2
…
LISn
LESnm
Ch.1/48
MDBS Components & Execution
Global
User
Request
Local
User
Request
Local
User
Request
Multi-DBMS
Layer
Global
Subrequest
DBMS1
Global
Subrequest
DBMS2
Global
Subrequest
DBMS3
Ch.1/49
Mediator/Wrapper Architecture
Ch.1/50