Distributed Databases
Objectives

Define key terms in the distributed database area:

Distributed vs. Decentralized Database
Homogeneous vs. Heterogeneous Distributed Database
Location transparency vs. local autonomy
Asynchronous vs. Synchronous distributed databases
Horizontal vs. Vertical partitioning
Full refresh vs. differential refresh
Push replication vs. Pull replication
Local transaction vs. Global Transaction
Objectives





Describe salient characteristics of distributed database
environments
Explain advantages and risks of distributed databases
Explain strategies and options for distributed database
design
Discuss synchronous and asynchronous data replication
and partitioning
Discuss optimized query processing in distributed
databases
Distributed vs. Decentralized Database
Both are stored on computers in multiple locations

Distributed Database


Geographical distribution of a SINGLE
database
Decentralized Database


A collection of independent databases on non-networked computers
Users at various sites cannot share data
Distributed Database


Requires multiple DBMSs running at
remote sites
There are different types of distributed
database environments, distinguished by:

The degree to which these DBMSs cooperate
Whether a master site coordinates requests
involving data from multiple sites
Reasons for Distributed Database

Distribution and Autonomy of Business Units

Departments/Facilities are geographically distributed
Each has the authority to create and control own data
Business mergers create this environment

Data sharing

Consolidate data across local databases on demand.

Data communication costs and reliability

Economical and reliable to locate data where needed.
High cost for remote transactions / large data volumes
Dependence on data communications can be risky
Reasons for Distributed Database

Multiple application vendor environment



Each unit may have different vendor applications
A distributed DBMS can provide functionality that
cuts across separate applications
Database recovery

Replicating data on separate computers may ensure
that a damaged database can be quickly recovered
Homogeneous vs. Heterogeneous
Distributed Database

Homogeneous Distributed Database 


The same DBMS is used at each node
Difficult for most organizations to force a
homogeneous environment
Heterogeneous Distributed Database


Potentially different DBMS are used at each
node
Much more difficult to manage
Typical Homogeneous Environment



Data distributed across all the nodes.
Same DBMS at each node.
A central DBMS coordinates database access
and update across the nodes


No exclusively local data
All access is through one, global schema.

The global schema is the union of all the local
schemas.
Figure 13-2 – Homogeneous Database (callouts: everyone is a GLOBAL user; identical DBMSs)
Typical Heterogeneous Environment




Data distributed across all the nodes.
Different DBMSs may be used at each node.
Local access is done using the local DBMS
and schema.
Remote access is done using the global
schema.
Figure 13-3 – Typical Heterogeneous Environment (callouts: local user accesses his own data; non-identical DBMSs)
Major Objectives of Distributed Database

Allow users to share data yet be able to operate
independently when network link fails.
Location Transparency



User does not have to know the location of the data
Data requests automatically forwarded to appropriate
sites
Local Autonomy


Local site can operate with its database when network
connections fail
Each site controls its own data, security, logging,
recovery
Trade-Offs in Distributed Database
When do you update data across the database?

Synchronous Distributed Database





All copies of the same data are always identical
Updates apply immediately to all copies throughout network
Good for data integrity
High overhead → slow response times
Asynchronous Distributed Database




Some data inconsistency is tolerated
Data update propagation is delayed
Lower data integrity
Less overhead → faster response time
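
The trade-off above can be sketched with in-memory dictionaries standing in for the copies at each site. This is only an illustration: the class names `SyncStore` and `AsyncStore` and the `propagate` method are hypothetical and do not come from any particular DBMS.

```python
class SyncStore:
    """Synchronous: every write is applied to all copies before returning."""

    def __init__(self, n_sites):
        self.copies = [{} for _ in range(n_sites)]

    def write(self, key, value):
        # Overhead grows with the number of sites that must be updated.
        for copy in self.copies:
            copy[key] = value

    def read(self, site, key):
        return self.copies[site].get(key)


class AsyncStore(SyncStore):
    """Asynchronous: a write updates one copy; the others catch up later."""

    def __init__(self, n_sites):
        super().__init__(n_sites)
        self.pending = []  # updates that have not been propagated yet

    def write(self, key, value, site=0):
        # Fast: only the originating site's copy is touched.
        self.copies[site][key] = value
        self.pending.append((site, key, value))

    def propagate(self):
        """Apply the delayed updates to every other copy."""
        for origin, key, value in self.pending:
            for i, copy in enumerate(self.copies):
                if i != origin:
                    copy[key] = value
        self.pending.clear()
```

Between a write and the next `propagate()` call, a read at another site returns stale data, which is the inconsistency the asynchronous approach tolerates.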
Advantages of Distributed Database
1. Increased reliability and availability
 Even when a component fails the database may continue to
function albeit at a reduced level
2. Allows local control over data.
 Local control promotes data integrity and administration
3. Modular growth
 Easy to add a connection to a new location
 Less chance of disrupting existing users during expansion
4. Lower communication costs.
5. Faster response for certain queries.
 Query local data
 Parallel queries
Disadvantages of Distributed Database




Software cost and complexity.
Processing overhead.
Data integrity exposure.
Slower response for certain queries.

If data are not distributed properly, according to
their usage, or if queries are not formulated
correctly, queries can be extremely slow
Options for Distributing a Database




Data Replication
Horizontal Partitioning
Vertical Partitioning
Combinations of the above
Data Replication

Advantages





Reliability – if one node fails, you can find data at
another node
Fast response at sites that have a full copy
May avoid complicated distributed transaction
integrity routines (if replicated data is refreshed at
scheduled intervals.)
Decouples nodes: transactions proceed even if
some nodes are down.
Reduced network traffic at prime time, if updates
can be delayed to non-primetime hours
Data Replication

Disadvantages 


Storage requirements
Complexity and cost of updating.
Integrity exposure of getting incorrect data if
replicated data is not updated simultaneously.
Data Replication

Best for non-volatile/static, non-collaborative
data




Catalogs
Telephone directories
Train Schedules
Not good for on-line applications


Airline reservations
ATM transactions
Types of Data Replication

Push Replication


Updating site sends changes to other sites
Pull Replication

Receiving sites control when update
messages will be processed
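
A minimal sketch of the difference, assuming each site keeps an ordered change log. The names `Site`, `push_update` and `pull` are hypothetical, chosen only for this illustration.

```python
class Site:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.log = []  # ordered (key, value) changes made at this site

    def update(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


def push_update(source, targets, key, value):
    """Push: the updating site sends the change to the other sites itself."""
    source.update(key, value)
    for target in targets:
        target.data[key] = value


def pull(target, source, since=0):
    """Pull: the receiving site decides when to fetch changes.

    Returns the new log position so the receiver can resume from it later.
    """
    for key, value in source.log[since:]:
        target.data[key] = value
    return len(source.log)
```

With pull, a receiving site can defer the call to `pull` until a quiet period; with push, the timing is set by the updating site.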
Types of Push Replication

Snapshot Replication



Changes periodically sent to master site
Master collects updates in log
Near Real-Time Replication


Broadcast update orders without requiring
confirmation
Update messages stored in message queue until
processed by receiving site
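
The near real-time behavior described above (broadcast without confirmation, messages queued until the receiver processes them) might look like the following sketch. `Receiver`, `broadcast` and `process_pending` are hypothetical names, and an in-memory `deque` stands in for a real message queue.

```python
from collections import deque


class Receiver:
    def __init__(self):
        self.data = {}
        self.inbox = deque()  # update messages wait here until processed

    def process_pending(self):
        """Apply queued updates in arrival order; return how many were applied."""
        applied = 0
        while self.inbox:
            key, value = self.inbox.popleft()
            self.data[key] = value
            applied += 1
        return applied


def broadcast(receivers, key, value):
    """Enqueue the update at every receiver and return immediately.

    No confirmation is awaited, so a slow site never blocks the sender.
    """
    for receiver in receivers:
        receiver.inbox.append((key, value))
```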
Issues in Data Replication Use





Data timeliness – high tolerance for out-of-date
data may be required
DBMS capabilities – if DBMS cannot support
multi-node queries, replication may be necessary
Performance implications – refreshing may cause
performance problems for busy nodes
Network heterogeneity – complicates replication
Network communication capabilities – complete
refreshes place heavy demand on
telecommunications
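
To see why complete refreshes are costly, compare a full refresh with a differential refresh in the sketch below. The function names are hypothetical, and the differential set is computed by comparing two dictionaries purely for illustration; a real system would more likely derive it from a change log.

```python
def full_refresh(master):
    """Full refresh: ship a complete copy of the master table."""
    return dict(master)


def differential_refresh(master, replica):
    """Differential refresh: ship only what differs from the replica's copy.

    Returns (changed, deleted): rows to insert or overwrite, and keys to remove.
    """
    changed = {
        key: value
        for key, value in master.items()
        if key not in replica or replica[key] != value
    }
    deleted = [key for key in replica if key not in master]
    return changed, deleted


def apply_differential(replica, changed, deleted):
    """Bring the replica up to date using only the shipped differences."""
    replica.update(changed)
    for key in deleted:
        del replica[key]
```

The full refresh always transmits every row, while the differential refresh transmits only the rows that changed, which is usually far less traffic for mostly static data.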
Horizontal Partitioning


Different rows of a table at different sites
Advantages 




Data stored close to where it is used → efficiency
Local access optimization → better performance
Only relevant data is available → security
Unions across partitions → ease of query
Disadvantages


Accessing data across partitions → inconsistent
access speed
If no data replication → backup vulnerability
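
A small sketch of horizontal partitioning, assuming rows are assigned to sites by the value of one column. The sample table and the names `partition_rows` and `union_all` are made up for illustration.

```python
CUSTOMERS = [
    {"id": 1, "name": "Ada", "region": "east"},
    {"id": 2, "name": "Bo", "region": "west"},
    {"id": 3, "name": "Cy", "region": "east"},
]


def partition_rows(rows, column):
    """Place each row at the site named by its value in `column`."""
    sites = {}
    for row in rows:
        sites.setdefault(row[column], []).append(row)
    return sites


def union_all(sites, key="id"):
    """Rebuild the whole table as the union of every site's rows."""
    combined = [row for rows in sites.values() for row in rows]
    return sorted(combined, key=lambda row: row[key])
```

Each site holds complete rows, so the original table is recovered with a plain union; no matching of rows across sites is needed.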
Vertical Partitioning


Different columns of a table at different sites
Advantages and disadvantages are the same as
for horizontal partitioning except that
combining data across partitions is more
difficult because it requires joins (instead of
unions)
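
The join requirement can be sketched as follows, assuming every column fragment carries the primary key so that rows can be matched again. The names `partition_columns` and `join_on_key` and the sample column groups are hypothetical.

```python
def partition_columns(rows, pk, column_groups):
    """Split each row into one fragment per site; every fragment keeps the key."""
    sites = {site: [] for site in column_groups}
    for row in rows:
        for site, columns in column_groups.items():
            fragment = {pk: row[pk]}
            fragment.update({col: row[col] for col in columns})
            sites[site].append(fragment)
    return sites


def join_on_key(sites, pk):
    """Rebuild full rows by joining the fragments on the primary key."""
    merged = {}
    for fragments in sites.values():
        for fragment in fragments:
            merged.setdefault(fragment[pk], {}).update(fragment)
    return [merged[key] for key in sorted(merged)]


# Example usage: engineering keeps drawings, accounting keeps prices.
parts = [
    {"part_no": 1, "name": "bolt", "price": 0.10, "drawing": "D-1"},
    {"part_no": 2, "name": "nut", "price": 0.05, "drawing": "D-2"},
]
groups = {"engineering": ["name", "drawing"], "accounting": ["price"]}
fragments_by_site = partition_columns(parts, "part_no", groups)
rebuilt = join_on_key(fragments_by_site, "part_no")
```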
Factors in Choice of Distributed Strategy
No approach to data distribution is ALWAYS best
 Choice depends on






Funding, autonomy, security.
Site data referencing patterns.
Growth and expansion needs.
Technological capabilities.
Costs of managing complex technologies.
Need for reliable service.
Distributed DBMS


Distributed database requires distributed DBMS
Functions of a distributed DBMS:







Locate data with a distributed data dictionary
Determine location from which to retrieve data and process
query components
DBMS translation between nodes with different local DBMSs
(handle heterogeneous DBMS translation using middleware)
Data consistency (via multiphase commit protocols)
Global primary key control
Scalability
Security, concurrency, query optimization, failure recovery
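
The first two functions (locating data and choosing where to retrieve it) can be illustrated with a toy catalog. `DataDictionary`, `register` and `locate` are hypothetical names, and the "prefer a local copy" rule is an assumption made for this sketch, not a rule stated in the slides.

```python
class DataDictionary:
    """Toy distributed catalog: maps each table to the sites that hold it."""

    def __init__(self):
        self.locations = {}

    def register(self, table, site):
        self.locations.setdefault(table, []).append(site)

    def locate(self, table, local_site=None):
        """Pick a site to read `table` from."""
        sites = self.locations.get(table)
        if not sites:
            raise LookupError(f"no site holds table {table!r}")
        # Assumption: use the local copy when one exists, else the first site.
        if local_site in sites:
            return local_site
        return sites[0]
```

A user or application names only the table; the lookup decides which site serves the request, which is what location transparency asks for.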
Distributed DBMS Data Reference


Local Transaction - references local data.
Global Transaction - references non-local data.
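
As a rough illustration of the distinction, the hypothetical helper below classifies a transaction from the tables it references and the site where it originates. The mapping of tables to sites is assumed to be known.

```python
def classify_transaction(tables, origin_site, table_sites):
    """Return "local" if every referenced table is stored at the origin site.

    `table_sites` maps a table name to the set of sites that hold it.
    Any table that is not available locally makes the transaction "global".
    """
    for table in tables:
        if origin_site not in table_sites.get(table, set()):
            return "global"
    return "local"


# Example usage with a made-up placement of tables.
placement = {"orders": {"paris"}, "parts": {"paris", "tokyo"}, "staff": {"tokyo"}}
kind_a = classify_transaction(["orders", "parts"], "paris", placement)
kind_b = classify_transaction(["orders", "staff"], "paris", placement)
```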
Distributed DBMS Architecture
Distributed DBMS Transparency Objectives

Location Transparency

User/application does not need to know where data resides

Replication Transparency

User/application does not need to know about duplication
Failure Transparency

Either all of the actions of a transaction are committed or else
none of them is committed.
If a transaction fails at one site, it does not commit at other sites.
A system should detect a failure (broken communication link,
erroneous data, disk head crash), reconfigure the system, and recover.

Each site has a transaction manager

Logs transactions and before and after images
Requires special commit protocol
Failure Transparency: Two-Phase Commit

Commit Protocol: Ensures that a global
transaction is either successfully completed at
each site or else aborted.

Two-Phase Commit

Prepare Phase: Check if operation ok at all
participating sites

Commit Phase: The commit is issued only if all
participating sites agree
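
The two phases can be sketched as below. This is a simplified model under stated assumptions: the `Participant` class and `two_phase_commit` function are hypothetical, and timeouts, logging, and coordinator failure, which a real protocol must handle, are left out.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        """Phase 1: vote on whether the local part of the transaction is OK."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if the global transaction committed at every site."""
    # Prepare phase: collect a vote from every participating site.
    votes = [p.prepare() for p in participants]
    # Commit phase: commit only if all sites agreed; otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

A single "no" vote in the prepare phase is enough to abort the transaction at every site, which gives the all-or-nothing outcome that failure transparency requires.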
Distributed DBMS Transparency Objectives

Concurrency Transparency


Allow multiple users to run transactions
concurrently, with each transaction appearing as if
it is the only activity in the system
Timestamping


Ensure that even if two events occur simultaneously
at different sites, each will have a unique timestamp.
Alternative to locks in distributed databases
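
One common way to make timestamps unique across sites is to pair a local counter with a site identifier. The slides do not say which scheme is intended, so the `SiteClock` class below is an assumption offered only as a sketch.

```python
class SiteClock:
    """Issues timestamps of the form (counter, site_id)."""

    def __init__(self, site_id):
        self.site_id = site_id
        self.counter = 0

    def next_timestamp(self):
        # The site id breaks ties when two sites reach the same counter value.
        self.counter += 1
        return (self.counter, self.site_id)

    def observe(self, timestamp):
        """Move the local counter forward after seeing another site's timestamp."""
        self.counter = max(self.counter, timestamp[0])
```

Because Python compares tuples element by element, two timestamps issued "simultaneously" at different sites, such as `(1, 1)` and `(1, 2)`, are still distinct and can be placed in a single order.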
Distributed DBMS Vendors








Oracle
Microsoft
Informix
Sybase
IBM
Computer Associates
Ingres
Others……