Download distributedDatabases

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Distributed Databases
CENG 553 Database Management Systems
1
What is a Distributed Database?
• Database whose relations reside on different
sites
• Database some of whose relations are
replicated at different sites
• Database whose relations are split between
different sites
CENG 553 Database Management Systems
2
Two Types of Applications that Access
Distributed Databases
• The application accesses data at the level of SQL
statements
– Example: company has nationwide network of
warehouses, each with its own database; a transaction
can access all databases using their schemas
• The application accesses data at a database using
only stored procedures provided by that database.
– Example: purchase transaction involving a merchant
and a credit card company, each providing stored
subroutines for its subtransactions
CENG 553 Database Management Systems
3
Database and Query Design Issues
• How should a distributed database be designed?
• At what site should each item be stored?
• Which items should be replicated and at which
sites?
• How should queries that access multiple databases
be processed?
• How do issues of query optimization affect query
design?
CENG 553 Database Management Systems
4
Application Designer’s View of a
Distributed Database
• Designer might see the individual schemas
of each local database -- called a
multidatabase -- in which case distribution
is visible
– Can be homogeneous (all databases from one
vendor) or heterogeneous (databases from
different vendors)
• Designer might see a single global schema
that integrates all local schemas (is a view)
in which case distribution is hidden
CENG 553 Database Management Systems
5
Views of Distributed Data
(a) Multidatabase with local schemas
(b) Integrated distributed database with global schema
CENG 553 Database Management Systems
6
Multidatabases
• Application must explicitly connect to each site
• Application accesses data at a site using SQL
statements based on that site’s schema
• Application may have to do reformatting in order
to integrate data from different sites
• Application must manage replication
– Know where replicas are stored and decide which
replica to access
CENG 553 Database Management Systems
7
Global Schemas
• Middleware provides integration of local schemas
into a global schema
– Application need not connect to each site
– Application accesses data using global schema
• Need not know where data is stored – location transparency
– Global joins are supported
– Middleware performs necessary data reformatting
– Middleware manages replication – replication
transparency
CENG 553 Database Management Systems
8
Objectives : Distributed Architecture
• Location Transparency
– User does not have to know the location of the data.
– Data requests automatically forwarded to appropriate sites
• Local Autonomy
– Local site can operate with its database when network
connections fail
– Each site controls its own data, security, logging, recovery
Distributed Data Storage
• Fragmentation:
– Relation is partitioned into several fragments stored in distinct
sites
• Replication:
– System maintains multiple copies of data, stored in different
sites, for faster retrieval and fault tolerance.
• Replication and fragmentation can be combined:
– Relation is partitioned into several fragments: System
maintains several identical replicas of each such fragment.
Data Fragmentation
• Division of relation r into fragments r1, r2, …, rn which
contain sufficient information to reconstruct relation r.
• Horizontal fragmentation: each tuple of r is assigned to
one or more fragments.
• Vertical fragmentation: the schema for relation r is split
into several smaller schemas.
TID
Storing Data

t1
t2
t3
t4
Fragmentation
– Horizontal: Usually disjoint.
– Vertical: Lossless-join; tids.

Replication
R1
– Gives increased availability.
– Faster query evaluation.
– Synchronous vs. Asynchronous.
 Vary in how current copies are.
SITE A
SITE B
R1
CENG 553 Database Management Systems
R3
R2
12
12
Horizontal Fragmentation
• Each fragment, Ti , of table T contains a
subset of the rows and each row is in exactly
one fragment:
Ti = Ci (T)
T =  Ti
– Horizontal fragmentation is lossless
T1
T2
T
T3
T4
CENG 553 Database Management Systems
13
Horizontal Fragmentation
• Example: An Internet grocer has a relation
describing inventory at each warehouse
Inventory(StockNum, Amount, Price, Location)
• It fragments the relation by location and stores each
fragment locally: rows with Location = ‘Chicago’
are stored in the Chicago warehouse in a fragment
Inventory_ch(StockNum, Amount, Price, Location)
• Alternatively, it can use the schema
Inventory_ch(StockNum, Amount, Price)
CENG 553 Database Management Systems
14
Vertical Fragmentation
• Each fragment, Ti, of T contains a subset of the
columns, each column is in at least one fragment,
and each fragment includes the key:
Ti = attr_listi (T)
T = T1
T2 …..
Tn
– Vertical fragmentation is lossless
• Example: The Internet grocer has a relation
Employee(SSnum, Name, Salary, Title, Location)
– It fragments the relation to put some information at
headquarters and some elsewhere:
Emp1(SSnum, Name, Salary) – at headquarters
Emp2(SSnum, Name, Title, Location) – elsewhere
CENG 553 Database Management Systems
15
Data Replication
• A relation or fragment of a relation is
replicated if it is stored redundantly in two
or more sites.
• Full replication of a relation is the case
where the relation is stored at all sites.
• Fully redundant databases are those in
which every site contains a copy of the
entire database.
Data Replication (Cont.)
• Advantages of Replication:
– Availability: failure of site containing relation r
does not result in unavailability of r if replicas
exist.
– Parallelism: queries on r may be processed by
several nodes in parallel.
– Reduced data transfer: relation r is available
locally at each site containing a replica of r.
Data Replication (Cont.)
• Disadvantages of Replication
– Increased cost of updates: each replica of relation r
must be updated.
– Increased complexity of concurrency control:
concurrent updates to distinct replicas may lead to
inconsistent data unless special concurrency control
mechanisms are implemented.
• One solution: choose one copy as primary copy and apply
concurrency control operations on primary copy.
Replication Example
• Internet grocer might have relation
Customer(CustNum, Address, Location)
– Queries are executed
• At headquarters to produce monthly mailings
• At a warehouse to obtain information about deliveries
– Updates are executed
• At headquarters when new customer registers and
when information about a customer changes
CENG 553 Database Management Systems
19
Example (con’t)
• Intuitively it seems appropriate to either or both:
– Store complete relation at headquarters
– Horizontally fragment a replica of the relation and store
a fragment at the corresponding warehouse site
• Each row is replicated: one copy at headquarters,
one copy at a warehouse
• The relation can be both distributed and replicated
CENG 553 Database Management Systems
20
Example (con’t): Performance Analysis
• We consider three alternatives:
– Store the entire relation at the headquarters site
and nothing at the warehouses (no replication)
– Store the fragments at the warehouses and
nothing at the headquarters (no replication)
– Store entire relation at headquarters and a
fragment at each warehouse (replication)
CENG 553 Database Management Systems
21
Example (con’t):
Performance Analysis - Assumptions
• To evaluate the alternatives, we estimate the amount
of information that must be sent between sites.
• Assumptions:
– The Customer relation has 100,000 rows
– The headquarters mailing application sends each customer
1 mailing a month
– 500 deliveries are made each day; a single row is read for
each delivery
– 100 new customers/day
– Changes to customer information occur infrequently
CENG 553 Database Management Systems
22
Example: The Evaluation
• Entire relation at headquarters, nothing at warehouses
– 500 tuples per day from headquarters to warehouses for
deliveries
• Fragments at warehouses, nothing at headquarters
– 100,000 tuples per month from warehouses to headquarters
for mailings (3,300 tuples per day, amortized)
– 100 tuples per day from headquarters to warehouses for new
customer registration
• Entire relation at headquarters, fragments at warehouses
– 100 tuples per day from headquarters to warehouses for new
customer registration
CENG 553 Database Management Systems
23
Example: Conclusion
• Replication (case 3) seems best, if we count
the number of transmissions.
• Let us look at another measure:
– If we replicate, the time to register a new
customer might suffer because of the remote
update
• But this update can be done by a separate
transaction after the registration transaction commits
(asynchronous update)
CENG 553 Database Management Systems
24
Query Planning
• Systems that support a global schema contain a
global query optimizer, which analyzes each
global query and translates it into an appropriate
sequence of steps to be executed at each site
• In a multidatabase system, the query designer
must manually decompose each global query into
a sequence of SQL statements to be executed at
each site
– Thus a query designer must be her own query optimizer
CENG 553 Database Management Systems
25
Distributed Query Processing

Cost-based approach: consider all plans, pick cheapest;
similar to centralized optimization.
– Difference 1: Communication costs must be considered.
– Difference 2: Local site autonomy must be respected.
– Difference 3: New distributed join methods.

Query site constructs global plan, with suggested local
plans describing processing at each site.
– If a site can improve suggested local plan, free to do so.
CENG 553 Database Management Systems
26
26
Planning Global Joins
• Suppose an application at site A wants to join
tables at sites B and C. Two straightforward
approaches
– Transmit both tables to site A and do the join there
• The application explicitly tests the join condition
• This approach must be used in multidatabase systems
– Transmit the smaller of the tables, e.g. the table at site
B, to site C; execute the join there; transmit the result to
site A
• This approach might be used in a homogenous distributed
database system
CENG 553 Database Management Systems
27
Global Join Example
• Site B
Student(Id, Major)
• Site C
Transcript(StudId, CrsCode)
• Application at Site A wants to compute join
with join condition
Student.Id = Transcript.StudId
CENG 553 Database Management Systems
28
Assumptions
• Lengths of attributes
– Id and StudId: 9 bytes
– Major: 3 bytes
– CrsCode: 6 bytes
• Student: 15,000 tuples, each of length 12 bytes
• Transcript: 20,000 tuples, each of length 15 bytes
– 5000 students are registered for at least 1 course (10,000
students are not registered – summer session)
– Each student is registered for 4 courses on the average
CENG 553 Database Management Systems
29
Comparison of Alternatives
• Send both tables to site A, do join there:
– have to send 15,000*12 + 20,000*15 = 480,000 bytes
• Send the smaller table, Student, from site B to site C,
compute the join there. Then send result to Site A:
– have to send 15,000*12 + 20,000*18 = 540,000 bytes
• Alternative 1 is better
CENG 553 Database Management Systems
30
Another Alternative: Semijoin
• Step1:
At site C: Compute P = StudId(Transcript)
Send P to site B
– P contains Ids of students registered for at least 1 course
– Student tuples having Ids in P contribute to join
• Step 2:
At site B: Compute Q = Student
Send Q, to site A
Id = StudId P
– Q contains tuples of Student corresponding to students registered for at
least 1 course (i.e., 5,000 students out of 15,000)
– Q is a semijoin – the set of all Student tuples that will participate in the
join
• Step 3:
Send Transcript to site A
At site A: Compute Transcript
Id = StudId Q
CENG 553 Database Management Systems
31
Comparing Semijoin with Previous
Alternatives
•
•
•
•
In step 1: 45,000 = 5,000*9 bytes sent
In step 2: 60,000 = 5,000*12 bytes sent
In step 3: 300,000 = 20,000*15 bytes sent
In total: 405,000 = 45,000 + 60,000 + 300,000
bytes sent
• Semijoin is the best of the three alternatives
CENG 553 Database Management Systems
32
Definition of Semijoin
• The semijoin of two relations, T1 and T2, is
defined as:
T1
join_cond
T2 = attributes(T1)( T1
join_cond
T2)
– In other words, the semijoin consists of the
tuples in T1 that participate in the join with T2
CENG 553 Database Management Systems
33
Using the Semijoin
• To compute T1
join_cond T2 using a
semijoin, first compute T1 join_cond T2
then join it with T2:
attributes(T1)( T1
join_cond
T2)
CENG 553 Database Management Systems
join_cond
T2
34
Queries that Involve Joins and Selections
• Suppose the Internet grocer relation Employee is
vertically fragmented as
Emp1(SSnum, Name, Salary)
Emp2(SSnum, Title, Location)
at Site B
at Site C
• A query at site A wants the names of all employees
with Title = ‘manager’ and Salary > ‘20000’
• Solution 1: First do join then selection:
Name (Title=‘manager’ AND Salary>’20000’ (Emp1
Emp2))
– Semijoin not helpful here: all tuples of each table must be
brought together to form join (due to vertical fragmentation)
CENG 553 Database Management Systems
35
Queries that Involve Joins and Selections
• Solution 2: Do selections before the join:
Name((Salary>’20000’(Emp1))
(Title=‘manager’(Emp2)))
• At site B, select all tuples from Emp1 satisfying
Salary > ‘20000’; call the result R1
• At site C, select all tuples from Emp2 satisfying
Title = ‘manager’; call the result R2
• At some site to be determined by minimizing
communication costs, compute Name(R1
R2);
Send result to site A
– In a multidatabase, join must be performed at Site A, but
communication costs are reduced because only “selected”
data needs to be sent
CENG 553 Database Management Systems
36
Exercise
•
Consider the following global and fragmentation schemas in a
distributed database:
Global schema:
DOCTOR(DNO, DNAME, DEPT)
PATIENT(PNO, PNAME, DEPT, TREAT, DNO)
CARE(PNO, DRUG, QUAN)
Fragmentation schema:
D1 =  DEPT = ‘SURGERY’ (DOCTOR)
D2 =  DEPT = ‘PEDIATRICS’ (DOCTOR)
D3 =  DEPT  ‘SURGERY’  DEPT  ‘PEDIATRICS’ (DOCTOR)
P1 =  DEPT = ‘SURGERY’  TREAT = ‘INTENSIVE’ (PATIENT)
P2 =  DEPT = ‘SURGERY’  TREAT  ‘INTENSIVE’ (PATIENT)
P3 =  DEPT  ‘SURGERY’ (PATIENT)
C1 = CARE
P1
C2 = CARE
P2
C3 = CARE
P3
CENG 553 Database Management Systems
37
Exercise (con’t)
•
Translate the following global queries into
fragment queries and simplify them.
1. List names of patients who use aspirin in
their care.
2. List names of doctors who have prescribed
aspirin to patients undergoing intensive
treatment.
CENG 553 Database Management Systems
38
Choices to be Made by a Distributed
Database Application Designer
• Place tables at different sites
• Fragment tables in different ways and place
fragments at different sites
• Replicate tables or data within tables and place
replicas at different sites
• In multidatabase systems, do manual “query
optimization”: choose an optimal sequence of
SQL statements to be executed at each site
CENG 553 Database Management Systems
39