PARALLEL DATABASE TECHNOLOGY
Kien A. Hua
School of Computer Science
University of Central Florida
Orlando, FL 32816-2362
Kien A. Hua
1
Topics

Queries enter, and results leave, through a stack of layers; the numbers give the topics covered:

(1) HARDWARE - Parallel Architectures
(2) STORAGE MANAGER - Data Placement Strategies
(3) EXECUTOR / PROCESS STRUCTURE - Parallel Algorithms
(4) QUERY OPTIMIZATION - Parallelizing Query Optimization Techniques
(5) COMMERCIAL SYSTEMS
Kien A. Hua
3
Relational Data Model

A relation (table):

EMP#   NAME         ADDR       <- attributes (columns)
0005   John Smith   Orlando    <- a tuple (row)

A database is a collection of tables. Each table is organized into rows and columns. The persistent objects of an application are captured in these tables.
Kien A. Hua
4
Relational Operator: SCAN

SELECT NAME
FROM   EMPLOYEE
WHERE  ADDR = 'Orlando';

EMPLOYEE:
EMP#   NAME         ADDR
0002   Jane Doe     Orlando
0005   John Smith   Orlando

SELECT (ADDR = 'Orlando') keeps the two qualifying tuples; PROJECT (NAME) then yields:

NAME
Jane Doe
John Smith

The whole selection/projection pipeline is performed by the SCAN operator.
Kien A. Hua
5
Relational Operator: JOIN

SELECT
FROM   EMPLOYEE, PROJECT
WHERE  EMP# = ENUM

EMPLOYEE:                          PROJECT:
EMP#   NAME       ADDR             ENUM   PROJECT    DEPT
0002   Jane Doe   Orlando          0002   Database   Research
                                   0002   GUI        Research

Matching on EMP# = ENUM gives EMP_PROJ:
EMP#   NAME       ADDR      PROJECT    DEPT
0002   Jane Doe   Orlando   Database   Research
0002   Jane Doe   Orlando   GUI        Research
Kien A. Hua
6
Hash-Based Join

HASH(EMP#) = EMP# mod 4        HASH(ENUM) = ENUM mod 4

EMPLOYEE (EMP#, NAME, ADDR) is hashed on EMP# into buckets E0..E3, and PROJECT (ENUM, PROJ, DEPT) is hashed on ENUM into buckets P0..P3 (e.g., the tuples with keys 0004 and 0008 all fall into E0 and P0). Only the matching bucket pairs (E0, P0) .. (E3, P3) are joined, producing EMP_PROJ_0 .. EMP_PROJ_3.

Examples: 0 mod 4 = 0, 1 mod 4 = 1, 2 mod 4 = 2, 3 mod 4 = 3,
          4 mod 4 = 0, 5 mod 4 = 1, 6 mod 4 = 2, 7 mod 4 = 3
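A minimal Python sketch of this idea, assuming small in-memory lists of tuples stand in for the relations; the relation contents and the choice of 4 buckets are illustrative only.

# Hash-based join sketch: partition both relations on the join key with
# HASH(key) = key mod 4, then join only the matching bucket pairs.
def hash_join(employee, project, n_buckets=4):
    # employee: list of (emp_no, name, addr); project: list of (enum, proj, dept)
    e_buckets = [[] for _ in range(n_buckets)]
    p_buckets = [[] for _ in range(n_buckets)]
    for t in employee:
        e_buckets[t[0] % n_buckets].append(t)        # E0..E3
    for t in project:
        p_buckets[t[0] % n_buckets].append(t)        # P0..P3
    result = []                                      # EMP_PROJ_0..EMP_PROJ_3 combined
    for ei, pi in zip(e_buckets, p_buckets):         # only matching pairs are joined
        for emp_no, name, addr in ei:
            for enum, proj, dept in pi:
                if emp_no == enum:
                    result.append((emp_no, name, addr, proj, dept))
    return result

employee = [(2, "Jane Doe", "Orlando"), (5, "John Smith", "Orlando")]
project  = [(2, "Database", "Research"), (2, "GUI", "Research")]
print(hash_join(employee, project))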
Kien A. Hua
7
Bucket Sizes and I/O Costs

Case 1: bucket B does not fit in the memory in its entirety, so its join partner, bucket A, must be loaded several times while B's tuples are streamed through one tuple at a time.

Case 2: the buckets are split into pieces A(1)/B(1), A(2)/B(2), A(3)/B(3). Each B(i) fits in the memory, so the matching A(i) needs to be loaded only once.
Kien A. Hua
8
Speedup and Scaleup

The ideal parallel system demonstrates two key properties:

1. Linear Speedup:
   Speedup = (small system elapsed time) / (big system elapsed time)
   Linear speedup: twice as much hardware can perform the task in half the elapsed time (i.e., speedup = number of processors).

2. Linear Scaleup:
   Scaleup = (small system elapsed time on small problem) / (big system elapsed time on big problem)
   Linear scaleup: twice as much hardware can perform twice as large a task in the same elapsed time (i.e., scaleup = 1).
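As a quick numeric illustration (the timings below are hypothetical), this small Python sketch computes both metrics:

# Hypothetical timings for a 1-processor (small) and a 4-processor (big) system.
small_elapsed = 100.0            # seconds on the small system
big_elapsed = 25.0               # seconds on the big system, same task
print("speedup =", small_elapsed / big_elapsed)    # 4.0 = number of processors -> linear

small_on_small = 100.0           # small system, small problem
big_on_big = 100.0               # 4x system, 4x problem
print("scaleup =", small_on_small / big_on_big)    # 1.0 -> linear scaleup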
Kien A. Hua
Barriers to Parallelism
Startup:
The time needed to start a parallel operation
(thread creation/connection overhead) may
dominate the actual computation time.
Interference:
When accessing shared resources, each new
process slows down the others (hot spot
problem).
Skew:
The response time of a set of parallel
processes is the time of the slowest one.
9
Kien A. Hua
10
The Challenge

The ideal database machine has:
1. a single infinitely fast processor, and
2. an infinitely large memory with infinite bandwidth.
=> Unfortunately, technology is not delivering such machines!

The challenge is:
1. to build an infinitely fast processor out of infinitely many processors of finite speed, and
2. to build an infinitely large memory out of infinitely many storage units of finite speed.
Kien A. Hua
Performance of Hardware Components

Processor:
- Density increases by 25% per year.
- Speed doubles in three years.
Memory:
- Density increases by 60% per year.
- Cycle time decreases by 1/3 in ten years.
Disk:
- Density increases by 25% per year.
- Cycle time decreases by 1/3 in ten years.

The Database Problem: the I/O bottleneck will worsen.
11
Kien A. Hua
12
Hardware Architectures

(M: memory module, P: processor, plus disk drives)

(a) Shared Nothing (SN): each processor has its own memory and its own disks; the processors communicate only through the network.
(b) Shared Disk (SD): each processor has its own memory, but all disks are reachable by every processor through the network.
(c) Shared Everything (SE): all processors share the memory modules and the disks through the network.

Shared Nothing is more scalable for very large database systems.
Kien A. Hua
14
Shared-Everything Systems

[Figure: SE configurations - groups of processors connected to shared memory modules over a communication network, and CPUs with private caches in front of a shared memory. A write by one CPU triggers cross interrogation of the other caches.]

Shared Disk Architecture

[Figure: processors with private memories share the disks over a communication network. An update to a 4K page triggers cross interrogation, even for a small change to the page.]

Processing units interfere with each other even when they work on different records of the same page.
Kien A. Hua
15
Hybrid Architecture

[Figure: Cluster 1 .. Cluster N, each a bus-based SE multiprocessor with its own memory and disks, are interconnected through a communication network.]

SE clusters are interconnected through a communication network to form an SN structure at the inter-cluster level. This approach minimizes the communication overhead associated with the SN structure, and yet each cluster is kept small, within the limitations of the local memory and I/O bandwidth.

Examples of this architecture include Sequent computers, the NCR 5100M, and the Bull PowerCluster. Some of the DBMSs designed for this structure are the Teradata Database System for the NCR WorldMark 5100 computer, Sybase MPP, and Informix OnLine Extended Parallel Server.
Kien A. Hua
16
Parallelism in Relational Data Model

Pipeline Parallelism: If one operator sends its output to another, the two operators can execute in parallel.

INSERT INTO C
SELECT *
FROM   A, B
WHERE  A.x = B.y;

[Figure: SCAN A and SCAN B feed a JOIN, whose output feeds the INSERT into C - the operators form a pipeline.]

Partitioned Parallelism: By taking the large relational operators and partitioning their inputs and outputs, it is possible to turn one big job into many concurrent, independent little ones.

[Figure: A and B are partitioned into A0..A2 and B0..B1; parallel SCAN, JOIN, and INSERT processes produce the result partitions C0..C2.]

Merge & Split Operators

- The merge operator combines several parallel data streams (input streams) into a single sequential stream (output stream).
- The split operator is used to partition or replicate a stream of tuples.
- With split and merge operators, a web of simple sequential dataflow nodes can be connected to form a parallel execution plan (see the sketch below).
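A minimal Python sketch of split and merge, assuming plain Python lists stand in for the data streams; the hash-based split function and the tiny relations A and B are illustrative.

# Split: partition (or replicate) a stream of tuples into several output streams.
# Merge: combine several parallel streams into a single sequential stream.
def split(stream, n_outputs, key=lambda t: t[0]):
    outputs = [[] for _ in range(n_outputs)]
    for t in stream:
        outputs[hash(key(t)) % n_outputs].append(t)   # partitioning split
    return outputs

def merge(streams):
    merged = []
    for s in streams:
        merged.extend(s)                              # order among streams is arbitrary
    return merged

# A web of sequential nodes: split A and B on the join key, run one sequential
# "join node" per partition pair, then merge the result streams.
A = [(1, "a1"), (2, "a2"), (3, "a3")]
B = [(1, "b1"), (3, "b3")]
A_parts, B_parts = split(A, 2), split(B, 2)
C_parts = [[(ka, va, vb) for ka, va in ap for kb, vb in bp if ka == kb]
           for ap, bp in zip(A_parts, B_parts)]
print(merge(C_parts))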
Kien A. Hua
18
Data Partitioning Strategies

Data partitioning is the key to partitioned execution:

Round-Robin: maps the i-th tuple to disk i mod n.
Hash Partitioning: maps each tuple to a disk location based on a hash function.
Range Partitioning: maps contiguous attribute ranges of a relation to various disks.

[Figure: in range partitioning, P0..P3 hold the key ranges A-F, G-L, M-R, and S-Z; in hashed partitioning, a hash function routes each tuple to one of P0..P3; in round-robin, tuples are dealt to P0..P3 in turn.]

A sketch of the three strategies follows.
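A minimal Python sketch of the three placement rules; the 4-disk configuration and the range boundaries are illustrative assumptions, not part of the original slide.

# Each function returns the disk (0..n-1) to which a tuple is sent.
def round_robin(i, n_disks):
    return i % n_disks                      # the i-th tuple goes to disk i mod n

def hash_partition(tuple_key, n_disks):
    return hash(tuple_key) % n_disks        # placement decided by a hash function

def range_partition(tuple_key, boundaries):
    # boundaries such as ["F", "L", "R"] map contiguous key ranges to disks:
    # A-F -> 0, G-L -> 1, M-R -> 2, S-Z -> 3
    for disk, upper in enumerate(boundaries):
        if tuple_key <= upper:
            return disk
    return len(boundaries)

print(round_robin(7, 4))                            # 3
print(hash_partition("Smith", 4))                   # some disk in 0..3
print(range_partition("Smith", ["F", "L", "R"]))    # 3 (the S-Z range)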
Kien A. Hua
19
Comparing Data Partitioning Strategies

Round-Robin Partitioning:
  Advantage: simple.
  Disadvantage: It does not support associative search.

Hash Partitioning:
  Advantage: Associative access to the tuples with a specific attribute value can be directed to a single disk.
  Disadvantage: It tends to randomize data rather than cluster it.

Range Partitioning:
  Advantage: It is good for associative search and for clustering data.
  Disadvantage: It risks execution skew, in which all the execution occurs in one partition.
Kien A. Hua
20
Horizontal Data Partitioning

STUDENT is range partitioned on GPA (the partitioning attribute) across four processing nodes:
P0: 0 < GPA < .99    P1: 1 < GPA < 1.99    P2: 2 < GPA < 2.99    P3: 3 < GPA < 4

STUDENT:
SSN         NAME         GPA   MAJOR
012345678   Jane Doe     3.8   Computer Science
876543210   John Smith   2.9   English
...

Query 1: Retrieve the names of students who have a GPA better than 2.0.
=> Only P2 and P3 need to participate.

Query 2: Retrieve the names of students who major in Anthropology.
=> The whole file must be searched.
Multidimensional Data Partitioning

[Figure: the Age x Salary space (Age 20-70, Salary 20K-90K) is divided into a grid of cells, and the tuples in each cell are assigned to one of nine processing nodes (0-8). The footprint of a 1-attribute range query is a row or column of cells; the footprint of a 2-attribute range query is a rectangle of cells.]

Advantages:
- Degree of parallelism is maximized (i.e., as many processing nodes as possible are used).
- Search space is minimized (i.e., only relevant data blocks are searched).
Kien A. Hua
22
Query Types

Query Shape: The shape of the data subspace accessed by a range query.
Square Query: The query shape is a square.
Row Query: The query shape is a rectangle containing a number of rows.
Column Query: The query shape is a rectangle containing a number of columns.
Kien A. Hua
24
Optimality
A data allocation strategy is usage optimal
with respect to a query type if the execution
of these queries can always use all the PNs
available in the system.
A data allocation strategy is balance
optimal with respect to a query type if the
execution of these queries always results in a
balance workload for all the PNs involved.
A data allocation strategy is optimal with
respect to a query type if it is usage optimal
and balance optimal with respect to this
query type.
Kien A. Hua
25
Coordinate Modulo Declustering (CMD)

[Figure: an 8 x 8 grid in which cell (row, column) is assigned to processing node (row + column) mod 8; row 0 reads 0 1 2 3 4 5 6 7, row 1 reads 1 2 3 4 5 6 7 0, and so on, so every row and every column contains each node exactly once.]

Advantages: Optimal for row and column queries.
Disadvantages: Poor for square queries.

A minimal sketch of the mapping follows.
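A small Python sketch of the CMD assignment, under the reading of the grid above that cell (row, column) goes to node (row + column) mod N:

# Coordinate Modulo Declustering on an N x N grid: cell (row, col) -> (row + col) mod N.
def cmd(row, col, n_nodes):
    return (row + col) % n_nodes

N = 8
for r in range(N):
    print([cmd(r, c, N) for c in range(N)])
# Every row and every column contains each node exactly once, which is why CMD
# is optimal for row and column queries; a k x k square footprint, however,
# touches at most 2k-1 distinct nodes, which is why it is poor for square queries.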
Kien A. Hua
26
Hilbert Curve Allocation Method (HCAM)

[Figure: an 8 x 8 grid in which the cells are visited along a Hilbert space-filling curve and assigned to processing nodes 0-7 in round-robin order along the curve.]

Advantages: Good for square range queries.
Disadvantages: Poor for row and column queries.
Kien A. Hua
27
General Multidimensional Data Allocation (GMDA)

[Figure: a 9 x 9 grid (rows 0-8) allocated to N = 9 processing nodes. Row 0 reads 0 3 6 1 4 7 2 5 8; each subsequent row is derived from the previous one by a circular left shift, with designated "check rows" shifted one extra position.]

N is the number of processing nodes.
Regular rows: circular left shift of ⌊√N⌋ positions.
Check rows: circular left shift of ⌊√N⌋ + 1 positions.

Advantages: optimal for row, column, and small square range queries (|Q| < ⌊√N⌋²).
Kien A. Hua
28
Handling the 3-Dimensional Case

A cube with N³ grid blocks can be seen as N 2-dimensional planes stacked up in the third dimension.

[Figure: a cube with axes X, Y, Z. The 2D GMDA mapping is first applied to the top plane; this also initializes the first row of each of the N vertical planes.]

Algorithm 3D GMDA:
1. Compute the mapping for the first rows of all the 2-dimensional planes by considering these rows collectively as forming a plane along the third dimension. Apply the 2D GMDA algorithm to this plane, except that the shifting distance is set to ⌊∛N⌋².
2. Apply the 2D GMDA algorithm to each of the 2-dimensional planes using the shifting distance ⌊∛N⌋ and the first row already computed in Step 1.
Kien A. Hua
29
Handling Higher Dimensions: Mapping Function

A grid block (X1, X2, ..., Xd) is assigned to PN GeMDA(X1, X2, ..., Xd), where

GeMDA(X1, ..., Xd) = [ Σ_{i=2..d} ⌊(Xi · GCDi) / N⌋ + Σ_{i=1..d} (Xi · Shf_dist_i) ] mod N,

N = number of PNs,
Shf_dist = ⌊N^(1/d)⌋, and
GCDi = gcd(Shf_dist_i, N).
Kien A. Hua
Optimality Comparison

                       Optimal with respect to
Allocation scheme   row queries   column queries   small square queries
HCAM                No            No               No
CMD                 Yes           Yes              No
GeMDA               Yes           Yes              Yes
30
Conventional Parallel Hash-Based Join: Grace Algorithm

On a shared-nothing system, PN1..PN4 hold the partitions R1..R4 and S1..S4 of the two relations:

1. HASH: each PN hashes its local tuples of R and S into hash buckets.
2. DATA TRANSMISSION: tuples are transmitted to the PN that owns their hash bucket.
3. BUCKET TUNING: buckets are merged so that each merged bucket fits in the memory space.
4. JOIN: the matching bucket pairs are joined at each PN.
Kien A. Hua
32
The Effect of Imbalanced Workloads

[Plot: bucket size (tuples, log scale) vs. bucket ID (0-8192) for bucket skew Zb = 0, 0.5, and 1 - the higher the skew, the more the tuples concentrate in a few large buckets.]

[Plot: total cost (seconds) vs. bucket skew for GRACE_worst and GRACE_best, with 64 processors, I/O = 64 x 4 MBytes, and communication = 64 x 4 MBytes - the cost grows with the skew.]
Kien A. Hua
33
Partition Tuning: Largest Processing Time (LPT) First Strategy

[Figure: hash buckets B1 .. B8 of different sizes (in tuples) are assigned to processors P1 and P2; buckets assigned to the same processor are combined, and the assignment keeps the processors' total loads balanced.]

A minimal sketch of the LPT-first assignment follows.
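A small Python sketch of the LPT-first rule, assuming the bucket sizes are already known after hashing; the sizes below are illustrative.

import heapq

def lpt_assign(bucket_sizes, n_nodes):
    # Largest Processing Time first: assign each bucket (largest first) to the
    # processing node with the smallest load so far; buckets that land on the
    # same node are effectively "combined".
    loads = [(0, pn, []) for pn in range(n_nodes)]   # (load, node id, buckets)
    heapq.heapify(loads)
    for b, size in sorted(bucket_sizes.items(), key=lambda kv: -kv[1]):
        load, pn, buckets = heapq.heappop(loads)
        buckets.append(b)
        heapq.heappush(loads, (load + size, pn, buckets))
    return sorted(loads, key=lambda x: x[1])

sizes = {"B1": 900, "B2": 700, "B3": 400, "B4": 300,
         "B5": 250, "B6": 200, "B7": 150, "B8": 100}
for load, pn, buckets in lpt_assign(sizes, 2):
    print(f"P{pn + 1}: buckets {buckets}, total {load} tuples")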
Kien A. Hua
34
Naive Load Balancing Parallel Hash Join (NBJ)

[Figure: each PN hashes its local partitions (R1/S1 .. R4/S4), the tuples are transmitted to their hash buckets, partition tuning and bucket tuning are applied, and the matching bucket pairs are joined. With a skewed initial partitioning, PN4 has less hashing work than the other PNs.]

[Figure: a variant in which each PN first hashes its local tuples into local buckets; the local buckets are then collected at their destined PNs, chosen by "bin packing", before bucket tuning and the joins. The workload is balanced among the PNs throughout the computation.]

What if the partitioning is skewed initially?

[Figure: each PN keeps a subbucket for each hash value and distributes its tuples with a given hash value evenly among the four PNs' subbuckets; the subbuckets are then collected at their destined PNs, chosen by "bin packing", before bucket tuning and the joins.]

- The workload is balanced among the PNs throughout the computation.
- It is better to do partition tuning before bucket tuning.
Kien A. Hua
37
Simulation Results

[Plot: total cost (seconds) vs. bucket skew for NBJ, GRACE, TIJ, and ABJ.]
[Plot: total cost (seconds) vs. initial partition skew for ABJ, GRACE, NBJ, and TIJ.]
[Plot: total cost (seconds) vs. communication bandwidth per PN (MBytes/sec) for NBJ, GRACE, TIJ, and ABJ.]

ABJ continues to suffer from the initial skew after the hash phase, resulting in performance worse than even that of GRACE under a severe initial skew condition.
Sampling-Based Load Balancing (SLB) Join Algorithm

- Sampling Phase: As the sampling tuples are loaded into memory, each PN declusters its tuples into a number of in-memory buckets by hashing on the join attribute.
- Partition Tuning Phase: The coordinator determines the optimal bucket allocation strategy (BAS) using the statistics on the partially formed buckets.
- Split Phase:
  - The tuples already in memory are allocated to their respective host PNs, according to BAS, to form the join buckets.
  - Each PN loads the remaining tuples and redistributes them to the join buckets in accordance with BAS.
- Join Phase: Each PN performs the local joins of the respective matching buckets.

[Figure: partially formed hash buckets in memory, filled by hashing tuples read from the database.]
Kien A. Hua
39
nCUBE/2 Results: SLB vs. ABJ vs. GRACE

[Plot: elapsed time (seconds) vs. partition skew (data skew = 0.8) for GRACE, ABJ, and SLB.]

The performance of SLB approaches that of GRACE under very mild skew conditions, and it avoids the disastrous performance that GRACE suffers under severe skew conditions.
Kien A. Hua
40
Pipelining Hash-Join Algorithms

Two-Phase Approach: the hash table is built from the first operand; the tuples of the second operand are then used to probe it and produce the output stream.
- Advantage: requires only one hash table.
- Disadvantage: pipelining along the outer relation must be suspended during the build phase (i.e., while building the hash table).

One-Phase Approach: As a tuple comes in, it is first inserted into its own hash table, and then used to probe the part of the hash table of the other operand that has already been constructed (see the sketch below).
- Advantage: pipelining along both operands is possible.
- Disadvantage: requires larger memory space.
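A minimal Python sketch of the one-phase approach, assuming tuples arrive interleaved on the two input streams; the stream format and sample data are illustrative.

from collections import defaultdict

def symmetric_hash_join(stream):
    # stream yields ("first" | "second", key, payload) tuples as they arrive.
    tables = {"first": defaultdict(list), "second": defaultdict(list)}
    for side, key, payload in stream:
        # 1. insert the arriving tuple into its own hash table ...
        tables[side][key].append(payload)
        # 2. ... then probe the part of the other table built so far.
        other = "second" if side == "first" else "first"
        for match in tables[other][key]:
            yield (key, payload, match) if side == "first" else (key, match, payload)

arrivals = [("first", 1, "a1"), ("second", 1, "b1"),
            ("second", 2, "b2"), ("first", 2, "a2"), ("first", 1, "a3")]
print(list(symmetric_hash_join(arrivals)))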
Kien A. Hua
41
Aggregate Functions

An SQL aggregate function is a function that operates on groups of tuples.

Example:
SELECT   department, COUNT(*)
FROM     Employee
WHERE    age > 50
GROUP BY department

The number of result tuples depends on the selectivity of the GROUP BY attributes (i.e., department).
Kien A. Hua
44
Centralized Merging

[Figure: PN1, PN2, and PN3 each compute a local count per department (1-8) for their partitions 1-3; the coordinator then merges the three local count tables into the global result - departments 1-8 with counts 13, 13, 15, 17, 11, 20, 13, 14.]
Kien A. Hua
45
Distributed Merging

[Figure: each PN first computes a local count for every department in its partition (a counter is created for each department at every PN, so this strategy uses more memory space). The local counts are then hash-partitioned on department MOD 3 and merged in parallel: PN1 ends up with the global counts for departments 3 and 6, PN2 for departments 1, 4, and 7, and PN3 for departments 2, 5, and 8.]

This scheme incurs less data transmission than the repartitioning technique shown next.
Kien A. Hua
46
Repartitioning

[Figure: each PN repartitions its raw tuples on department MOD 3 (there is no need to write the new partitions to disks); the PNs then aggregate their new partitions 1', 2', and 3' in parallel, so PN1 produces the counts for departments 3 and 6, PN2 for departments 1, 4, and 7, and PN3 for departments 2, 5, and 8.]

This scheme incurs more data transmission.
Kien A. Hua
43
Performance Characteristics

Centralized Merging Algorithm:
  Advantage: It works well when the number of tuples is small.
  Disadvantage: The merging phase is sequential.

Distributed Merging Algorithm:
  Advantage: The merging step is not a bottleneck.
  Disadvantage: Since a group value is accumulated on potentially all the PNs, the overall memory requirement can be large.

Repartitioning Algorithm:
  Advantage: It reduces the memory requirement, as each group value is stored in one place only.
  Disadvantage: It incurs more network traffic.
Kien A. Hua
42
Conventional Aggregation Algorithms

Centralized Merging (CM) Algorithm:
  Phase 1: Each PN does aggregation on its local tuples.
  Phase 2: The local aggregate values are merged at a predetermined central coordinator.

Distributed Merging (DM) Algorithm:
  Phase 1: Each PN does aggregation on its local tuples.
  Phase 2: The local aggregate values are then hash-partitioned (based on the GROUP BY attribute) and the PNs merge these local aggregate values in parallel (see the sketch after this list).

Repartitioning (Rep) Algorithm:
  Phase 1: The relation is repartitioned using the GROUP BY attributes.
  Phase 2: The PNs do aggregation on their local partitions in parallel.

Performance Comparison:
  CM and DM work well when the number of result tuples is small.
  Rep works better when the number of groups is large.
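A minimal Python sketch of the DM algorithm for the COUNT(*) GROUP BY department example, with three simulated PNs; the tuple data are illustrative.

from collections import Counter

def distributed_merging(partitions, n_nodes):
    # Phase 1: each PN aggregates its local tuples (here: counts per department).
    local = [Counter(dept for dept in part) for part in partitions]
    # Phase 2: local aggregate values are hash-partitioned on the GROUP BY
    # attribute (department mod n_nodes) and merged in parallel.
    merged = [Counter() for _ in range(n_nodes)]
    for counts in local:
        for dept, cnt in counts.items():
            merged[dept % n_nodes][dept] += cnt
    return merged

partitions = [[1, 2, 1, 3], [3, 2, 2, 5], [5, 1, 6, 6]]   # department of each tuple
for pn, counts in enumerate(distributed_merging(partitions, 3), start=1):
    print(f"PN{pn} holds groups {dict(counts)}")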
Kien A. Hua
47
Adaptive Aggregation Algorithms

Sampling-Based (Samp) Approach:
- The CM algorithm is first applied to a small page-oriented random sample of the relation.
- If the number of groups obtained from the sample is small, then the DM strategy is used; otherwise the Rep algorithm is used.

Adaptive DM (A-DM) Algorithm:
- This algorithm starts with the DM strategy under the common-case assumption that the number of groups is small.
- However, if the algorithm detects that the number of groups is large (i.e., a memory-full condition is detected), it switches to the Rep strategy.

Adaptive Repartitioning (A-Rep) Algorithm:
- This algorithm starts with the Rep strategy.
- It switches to DM if the number of groups is not large enough (i.e., the number of groups is too small given the number of tuples seen).

Performance Comparison:
- In general, A-DM performs the best.
- However, A-Rep should be used if the number of groups is suspected to be very large.
Kien A. Hua
48
Implementation Techniques for A-DM

Global Switch:
- When the first PN detects a memory-full condition, it informs all the PNs to switch to the Rep strategy.
- Each PN first partitions and sends the so-far accumulated local results to the PNs they hash to. Then it proceeds to read and repartition the remaining tuples.
- Once the repartitioning phase is complete, the PNs aggregate their local partitions in parallel (as in the Rep algorithm).

Local Switch:
- A PN, upon detecting memory full, stops processing its local tuples. It first partitions and sends the so-far accumulated local results to the PNs they hash to. Then it proceeds to read and repartition the remaining tuples.
- During Phase 1, one set of PNs may be executing the DM algorithm while the others are executing the Rep algorithm. When the latter receive an aggregate value from another PN, they accumulate it into the corresponding local aggregate value.
- Once all PNs have completed their Phase 1, the local aggregate values are merged as in the DM algorithm.
Kien A. Hua
49
A-DM: Global Switch

[Figure: Step 1 - the PNs run the DM strategy, accumulating local counts per department, until one PN fills its memory. Step 2 - all PNs switch to the Rep scheme: each PN partitions and sends its so-far accumulated local results to the PNs they hash to (department MOD 3) and repartitions its remaining tuples; there is no need to write the new partitions 1', 2', 3' to disks. Phase 2 - each PN aggregates its partition in parallel.]

PN2 must also switch to the Rep scheme in Step 2 (i.e., a global switch).
Kien A. Hua
50
A-DM: Local Switch

[Figure: Step 1 - the PNs run the DM strategy until PN1 fills its memory. Step 2 - only PN1 switches to the Rep scheme: it partitions and sends its so-far accumulated local results to the PNs they hash to (department MOD 3) and repartitions its remaining tuples, while PN2 and PN3 continue with DM. Phase 2 - the local aggregate values are hash-partitioned (MOD 3) and merged as in the DM algorithm.]

Only PN1 switches to the Rep scheme.
Kien A. Hua
51
SQL (Structured Query Language)

EMPLOYEE (ENAME, ENUM, BDATE, ADDR, SALARY)
WORKSON  (ENO, PNO, HOURS)
PROJECT  (PNAME, PNUM, DNUM, PLOCATION)

An SQL query:

SELECT ENAME
FROM   EMPLOYEE, WORKSON, PROJECT
WHERE  PNAME = 'database' AND
       PNUM = PNO AND
       ENO = ENUM AND
       BDATE > '1965'

SQL is nonprocedural. The compiler must generate the execution plan:
1. Transform the query from SQL into relational algebra.
2. Restructure (optimize) the algebra to improve performance.
Kien A. Hua
52
Relational Algebra

Relation T1:                         Relation T2:
ENAME      SALARY     ENUM           ENUM   ADDRESS       BDATE
Andrew     $98,000    005            001    Orlando       1964
Casey      $150,000   003            003    New York      1966
James      $120,000   007            005    Los Angeles   1968
Kathleen   $115,000   001            007    London        1958

Select: selects rows.
σ[SALARY >= 120,000](T1) = { (Casey, 150000, 003), (James, 120000, 007) }

Project: selects columns.
π[ENAME, SALARY](T1) = { (Andrew, 98000), (Casey, 150000), (James, 120000), (Kathleen, 115000) }

Cartesian Product: selects all possible combinations.
T1 × T2 = { (Andrew, 98000, 005, 001, Orlando, 1964),
            (Andrew, 98000, 005, 003, New York, 1966),
            ...,
            (Kathleen, 115000, 001, 007, London, 1958) }

Join: selects some combinations.
T1 ⋈ T2 = { (Andrew, 98000, 005, Los Angeles, 1968),
            (Casey, 150000, 003, New York, 1966),
            (James, 120000, 007, London, 1958),
            (Kathleen, 115000, 001, Orlando, 1964) }
Kien A. Hua
53
Transforming SQL into Algebra

An SQL query:

SELECT ENAME
FROM   EMPLOYEE, WORKSON, PROJECT
WHERE  PNAME = 'database' AND
       PNUM = PNO AND
       ENO = ENUM AND
       BDATE > '1965'

Canonical query tree:
- FROM clause: the Cartesian product PROJECT × EMPLOYEE × WORKSON.
- WHERE clause: σ[PNAME = 'database' AND PNUM = PNO AND ENO = ENUM AND BDATE > '1965'].
- SELECT clause: π[ENAME].

This query tree (procedure) will compute the correct result. However, the performance will be very poor. => It needs optimization!
Kien A. Hua
54
Optimization Strategies

GOAL: Reduce the sizes of the intermediate results as quickly as possible.

STRATEGY:
1. Move SELECTs and PROJECTs as far down the query tree as possible.
2. Among SELECTs, reorder the tree to perform the one with the lowest selectivity factor first.
3. Among JOINs, reorder the tree to perform the one with the lowest join selectivity first.
Kien A. Hua
55
Example: Apply SELECTs First

Canonical query tree: π[ENAME] over σ[PNAME = 'database' AND PNUM = PNO AND ENO = ENUM AND BDATE > '1965'] over the Cartesian product PROJECT × EMPLOYEE × WORKSON.

After optimization: the selections are pushed down the tree - σ[PNAME = 'database'] is applied to PROJECT and σ[BDATE > '1965'] to EMPLOYEE first; the results are then combined with WORKSON through ENUM = ENO and PNUM = PNO, and π[ENAME] is applied last.
Kien A. Hua
56
Example: Replace "×" by "⋈"

Before optimization: the join predicates ENUM = ENO and PNUM = PNO are evaluated as selections over Cartesian products of σ[BDATE > '1965'](EMPLOYEE), WORKSON, and σ[PNAME = 'database'](PROJECT).

After optimization: each selection-over-product pair becomes a join - σ[BDATE > '1965'](EMPLOYEE) ⋈[ENUM = ENO] WORKSON, and the result ⋈[PNUM = PNO] σ[PNAME = 'database'](PROJECT), followed by π[ENAME].
Kien A. Hua
57
Example: Move PROJECTs Down

Before optimization: π[ENAME] is the only projection, applied at the top of the tree.

After optimization: projections are inserted below the joins - π[ENAME, ENUM](σ[BDATE > '1965'](EMPLOYEE)) and π[ENO, PNO](WORKSON) feed the join on ENUM = ENO; its result is projected onto (ENAME, PNO) and joined on PNUM = PNO with π[PNUM](σ[PNAME = 'database'](PROJECT)); π[ENAME] is applied last.
Kien A. Hua
Parallelizing Query Optimizer

Relations are fragmented and allocated to multiple processing nodes:
=> Besides choosing the ordering of relational operations, the parallelizing optimizer must select the best sites to process data.
=> The role of a parallelizing optimizer is to map a query on global relations into a sequence of local operations acting on local relation fragments.
58
Kien A. Hua
59
Parallelizing Query Optimization

An SQL query on global relations is first processed by the SEQUENTIAL OPTIMIZER (using the global schema) to produce an optimized sequential access plan; the PARALLELIZING OPTIMIZER (using the fragment schema) then turns it into an optimized parallel access plan.

The parallelizing optimizer:
- parallelizes the relational operators, and
- selects the best processing nodes for each parallelized relational operator.
Kien A. Hua
60
Example: Elimination of Useless JOINs

Fragments:
E1 = σ[ENO <= E3](E)            G1 = σ[ENO <= E3](G)
E2 = σ[E3 < ENO <= E6](E)       G2 = σ[ENO > E3](G)
E3 = σ[ENO > E6](E)

Query:
SELECT
FROM   E, G
WHERE  E.ENO = G.ENO

(1) Sequential query tree: E ⋈[ENO] G (an operator on global relations).
(2) Data localization: (E1 ∪ E2 ∪ E3) ⋈[ENO] (G1 ∪ G2).
(3) Distributing ⋈ over ∪ yields one fragment join per pair (Ei, Gj).
(4) Eliminating the useless joins leaves E1 ⋈ G1, E2 ⋈ G2, and E3 ⋈ G2.

The optimizer then finds the "best" ordering of these fragment operators.
Kien A. Hua
Parallelizing Query Optimization

1. Determines which fragments are involved and transforms the global operators into fragment operators.
2. Eliminates useless fragment operators.
3. Finds the "best" ordering of the fragment operators.
4. Selects the best processing node for each fragment operator and specifies the communication operations.
61
Kien A. Hua
62
Prototype at UCF

A prototype of a shared-nothing system was implemented on a 64-processor nCUBE/2 computer.

Our system was implemented to demonstrate:
- the GeMDA multidimensional data partitioning technique,
- a dynamic optimization scheme with load balancing capability, and
- a competition-based scheduling policy.
Kien A. Hua
63
System Architecture

[Figure: the PRESENTATION MANAGER accepts create/destroy-table commands and SQL queries and returns results; the QUERY TRANSLATOR uses the global schema; the QUERY EXECUTOR uses the fragment schema and the operator routines; the LOAD UTILITY and the STORAGE MANAGER sit above the database.]

The Query Executor interprets the execution tree and calls the corresponding operator routines to carry out the underlying operations. The Storage Manager provides a sequential scan interface to the query processing facilities.
Kien A. Hua
64
Software Components

Storage Manager: This component manages physical disk devices and schedules all I/O activities. It provides a sequential scan interface to the query processing facilities.

Catalog Manager: It acts as a central repository of all global and fragment schemas.

Load Utility: This program allows the users to populate a relation from an external file. It distributes the fragments of a relation across the processing nodes using GeMDA.

Query Translator: This component provides an interface for queries. It translates an SQL query into a query graph. It also caches the global schema information locally.

Query Executor: This component performs dynamic query optimization. It schedules the execution of the operators in the query graph.

Operator Routines: Each routine implements a primitive database operator. To execute an operator in a query graph, the Query Executor calls the appropriate operator routine to carry out the underlying operation.

Presentation Manager: This module provides an interactive interface for the user to create/destroy tables and query the database. The user can also use this interface to browse query results.
Kien A. Hua
65
Competition-Based Scheduling

[Figure: a pool of operator servers (each associated with a processing node) and a pool of coordinators; queries wait in a queue, and a dispatcher assigns each active query a coordinator, which competes for operator servers.]

Advantage: Fair.
Disadvantage: System utilization is not maximized.
Kien A. Hua
66
Planning-Based Scheduling

[Figure: a scheduler examines the queries within a scheduling window of the query queue and assigns coordinators and operator servers (JOIN, SELECT, PROJECT) from the server pool; each operator server is associated with a processing node.]

Advantage: Better system utilization.
Disadvantages: Less fair. The scheduler can become a bottleneck.
Kien A. Hua
67
Hardware Organization

[Figure: a host computer on a local area network connects to the IFPs, which are linked through an interconnection network to the ACPs and their disks.]
IFP: Interface Processor.  ACP: Access Processor.

The Catalog Manager, Query Manager, and Scheduler processes run on the IFPs. Operator processes run on the ACPs.
Kien A. Hua
68
Structure of Operator Processes

An operator process applies its operation to a stream of tuples (e.g., 8K-byte batches) and demultiplexes its output through a split table:

Hash Value   Destination Process
0            (Processor #3, Port #5)
1            (Processor #4, Port #6)
2            (Processor #5, Port #8)
3            (Processor #6, Port #2)

When the process detects the end of its input stream,
- it first closes the output streams, and
- then sends a control message to its coordinator process indicating that it has completed execution.
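A minimal Python sketch of such an operator process, with the (Processor, Port) destinations replaced by simple in-memory queues; the filtering operation and sample batches are illustrative.

from collections import deque

class OperatorProcess:
    def __init__(self, operation, split_table):
        self.operation = operation          # the operation applied to each batch
        self.split_table = split_table      # hash value -> destination queue

    def run(self, input_stream, coordinator):
        for batch in input_stream:          # e.g., 8K-byte batches of tuples
            for tup in self.operation(batch):
                # demultiplex the output through the split table
                self.split_table[hash(tup[0]) % len(self.split_table)].append(tup)
        # end of input: output streams are done; notify the coordinator
        coordinator.append(("done", id(self)))

destinations = {i: deque() for i in range(4)}   # stand-ins for (Processor, Port) pairs
coordinator = deque()
op = OperatorProcess(lambda batch: [t for t in batch if t[1] > 10], destinations)
op.run([[(1, 5), (2, 25)], [(3, 42)]], coordinator)
print({k: list(v) for k, v in destinations.items()}, list(coordinator))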
Kien A. Hua
69
Example: Operator and Process Structure

Query tree: C (on PN1, PN2) is produced by joining A (on PN1, PN2) with B (on PN1, PN2); a select and a scan feed the join, and the join executes on PN3 and PN4.

Process structure: on PN1 and PN2, scan and select processes read A1/A2 and B1/B2; the A tuples are used to build hash tables on PN3 and PN4, the B tuples are then used to probe those hash tables, and store processes write the result fragments C1 and C2 back on PN1 and PN2.

Layers of query processing at a node:
- COMPILED QUERY / OPERATOR METHODS: contains code for each operator in the database access language.
- ACCESS METHODS: maintains an active scan table that describes all the scans in progress; the storage manager provides the primitives for scanning a file via a sequential or index scan.
- STORAGE STRUCTURES: maps file names to file IDs, manages active files, searches for the page given a record.
- BUFFER MANAGEMENT: manages a buffer pool.
- PHYSICAL I/O: manages physical disk devices and performs page-level I/O operations.
Kien A. Hua
71
Transaction Processing

The consistency and reliability aspects of transactions are due to four properties:
1. Atomicity: A transaction is either performed in its entirety or not performed at all.
2. Consistency: A correct execution of the transaction must take the database from one consistent state to another.
3. Isolation: A transaction should not make its updates visible to other transactions until it is committed.
4. Durability: Once a transaction changes the database and the changes are committed, these changes must never be lost because of a subsequent failure.
Kien A. Hua
72
Transaction Manager

TRANSACTION MANAGER = LOCK MANAGER + LOG MANAGER

Lock Manager:
- Each local lock manager is responsible for the lock units local to that processing node.
- They provide concurrency control (the isolation property).

Log Manager:
- Each local log manager logs the local database operations.
- They provide recovery services.
Kien A. Hua
73
Two-Phase Locking Protocol

[Figure: number of locks held vs. transaction duration - a growing phase from Begin up to the lock point, then a shrinking phase until End.]

Any schedule generated by a concurrency control algorithm that obeys the 2PL protocol is serializable (i.e., the isolation property is guaranteed).

2PL is difficult to implement. The lock manager has to know:
1. that the transaction has obtained all its locks, and
2. that the transaction no longer needs to access the data item in question (so that the lock can be released).

Cascading aborts can occur (because transactions reveal updates before they commit).
Kien A. Hua
74
Strict Two-Phase Locking Protocol

[Figure: number of locks held vs. transaction duration - locks accumulate from Begin to End and are all released together at End.]

The lock manager releases all the locks together when the transaction terminates (commits or aborts).
Kien A. Hua
75
Handling Deadlocks

Detection and Resolution:
- Abort and restart a transaction if it has waited for a lock for a long time.
- Detect cycles in the wait-for graph and select a transaction (involved in a cycle) to abort.

Prevention: If Ti requires a lock held by Tj,
- if Ti is older => Ti can wait;
- if Ti is younger => Ti is aborted and restarted with the same timestamp.

There can never be a loop in the wait-for graph, so deadlock is not possible!
Kien A. Hua
76
Distributed Deadlock Detection [Chandy 83]

[Figure: transactions 0-8 are spread over PN 0, PN 1, and PN 2 and form a wait-for cycle. Transaction 0 is waiting for transaction 1 for a data item, so transaction 0 is blocked and sends the probe (0, 0, 1); the probe is updated as it passes along the cycle, e.g., (0, 1, 2), (0, 2, 3), (0, 4, 6), (0, 5, 7), and finally (0, 8, 0) goes around and comes back to transaction 0.]

When a transaction is blocked, it sends a special probe message to the blocking transaction. The message consists of three numbers: the transaction that just blocked, the transaction sending the message, and the transaction to whom it is being sent.

When the message arrives, the recipient checks to see if it itself is waiting for any transaction. If so, the message is updated, replacing the second field by its own TID and the third field by the TID of the transaction it is waiting for. The message is then sent to the blocking transaction.

If a message goes all the way around and comes back to the original sender, a deadlock is detected.
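A minimal Python sketch of the probe propagation, assuming a waits_for map from a transaction ID to the transaction it is blocked on (None if not blocked); the wait-for chain 0 -> 1 -> ... -> 8 -> 0 is a hypothetical cycle in the spirit of the figure.

def detect_deadlock(blocked_tid, waits_for):
    # The blocked transaction sends the probe (blocked, sender, receiver)
    # to the transaction it is waiting for.
    probe = (blocked_tid, blocked_tid, waits_for[blocked_tid])
    seen = set()
    while True:
        originator, sender, receiver = probe
        if receiver == originator:
            return True                      # probe came all the way around: deadlock
        if receiver in seen or waits_for.get(receiver) is None:
            return False                     # receiver is not waiting: no cycle through it
        seen.add(receiver)
        # The receiver updates the probe: second field = its own TID,
        # third field = the TID of the transaction it is waiting for.
        probe = (originator, receiver, waits_for[receiver])

waits_for = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 0}
print(detect_deadlock(0, waits_for))         # True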
Kien A. Hua
77
Two-Phase Commit Protocol

To ensure the atomicity property, a 2P commit protocol can be used to coordinate the commit process among subtransactions.

[Figure: the coordinator sends PREPARE to its agents; each agent votes READY or ABORT (voting); the coordinator then sends COMMIT or ABORT; the agents reply with ACK.]

Coordinator: originates the transaction.
Agent: executes a subtransaction on behalf of its coordinator.
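A minimal Python sketch of the coordinator's side of the protocol, with agents simulated as local objects; the vote outcomes are illustrative.

class Agent:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
    def prepare(self):
        return "READY" if self.will_commit else "ABORT"   # the agent's vote
    def decide(self, decision):
        return "ACK"                                       # apply COMMIT or ABORT locally

def two_phase_commit(agents):
    # Phase 1 (voting): send PREPARE, collect READY/ABORT votes.
    votes = [a.prepare() for a in agents]
    decision = "COMMIT" if all(v == "READY" for v in votes) else "ABORT"
    # Phase 2 (decision): send COMMIT or ABORT, wait for the ACKs.
    acks = [a.decide(decision) for a in agents]
    return decision, acks

print(two_phase_commit([Agent("A1"), Agent("A2"), Agent("A3", will_commit=False)]))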
Kien A. Hua
78
Recovery

An entry is made in the local log file at a processing node each time one of the following commands is issued by a transaction:
- begin transaction
- write (insert, delete, update)
- commit transaction
- abort transaction

Write-ahead log protocol:
- It is essential that log records be written before the corresponding write to the database. (If a log entry wasn't saved before the crash, the corresponding change was not applied to the database.)
- If there is no commit transaction entry in the log for a particular transaction, then that transaction was still active at the time of failure and must therefore be undone.
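A minimal Python sketch of the write-ahead rule, with the log and the database modeled as in-memory structures; undo at restart is driven by the absence of a commit record, as described above. The transaction IDs and values are illustrative.

log, database = [], {}

def write(tid, key, value):
    log.append(("write", tid, key, database.get(key), value))  # log record first ...
    database[key] = value                                       # ... then the database write

def commit(tid):
    log.append(("commit", tid))

def recover():
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Transactions with no commit record were still active at the failure: undo them.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] not in committed:
            _, tid, key, before, _after = rec
            if before is None:
                database.pop(key, None)
            else:
                database[key] = before

write("T1", "x", 10); commit("T1")
write("T2", "y", 20)            # T2 never commits
recover()
print(database)                 # {'x': 10}  -- T2's update has been undone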
Kien A. Hua
84
Commercial Product: Teradata DBC/1012

[Figure: host computers on a local area network connect through IFPs and a COP to the Ynet interconnection network, which links the AMPs and their disks.]
IFP: Interface Processor.  AMP: Access Module Processor.  COP: Communication Processor.

- It may have over 1,000 processors and many thousands of disks.
- Each relation is hash partitioned over a subset of the AMPs.
- Near-linear speedup and scaleup on queries have been demonstrated for systems containing over 100 processors.
Kien A. Hua
85
Teradata DBC/1012: Distribution of Data

The fallback copy ensures that the data remains available on other AMPs if an AMP should fail. In the following example, however, if AMPs 4 and 7 were to fail simultaneously, there would be a loss of data availability.

[Figure: rows 1-24 spread over DSU/AMPs 1-8. Each AMP holds three rows in its primary copy area (AMP 1: 1, 9, 17; AMP 2: 2, 10, 18; ...; AMP 8: 8, 16, 24) and, in its fallback copy area, three rows whose primaries live on other AMPs.]

Additional data protection can be achieved by "clustering" the AMPs in groups. In the following example, if both AMPs 4 and 7 were to fail, all data would still be available.

[Figure: AMPs 1-4 form cluster A and AMPs 5-8 form cluster B. The primary copy areas are as before, and the fallback copies stay within the cluster - cluster A fallback areas: 2, 3, 4 / 1, 11, 12 / 9, 10, 20 / 17, 18, 19; cluster B fallback areas: 6, 7, 8 / 5, 15, 16 / 13, 14, 24 / 21, 22, 23.]
Kien A. Hua
86
Commercial Product: Tandem NonStop SQL

[Figure: main processors with private memories and I/O processors are connected by the Dynabus; disk, tape, and terminal controllers are dual-ported between neighboring I/O processors.]

- Tandem systems run the applications on the same processors as the database servers.
- Relations may be range partitioned across multiple disks.
- It is primarily designed for OLTP. It scales linearly well beyond the largest reported mainframes on the TPC-A benchmarks.
- It is three times cheaper than a comparable mainframe system.
Kien A. Hua
92
Challenges Facing Parallel Database Technology

- Multimedia applications require very large network-I/O bandwidth.
- A parallel parallelizing query optimizer will be able to consider many more execution plans.
- Design parallel programming languages to take advantage of the parallelism inherent in the parallel database system.
- We need new data partitioning techniques and algorithms for the non-relational data objects allowed in SQL3, e.g., graph structures.