PARALLEL DATABASE TECHNOLOGY

Kien A. Hua
School of Computer Science
University of Central Florida
Orlando, FL 32816-2362

Topics

(1) Parallel Architectures (hardware)
(2) Data Placement Strategies (storage manager)
(3) Parallel Algorithms (executor)
(4) Parallelizing Query Optimization Techniques (process structure: queries in, results out)
(5) Commercial Systems

Relational Data Model

A relation (table) such as EMPLOYEE consists of attributes (columns) such as EMP#, NAME, and ADDR, and tuples (rows) such as (0005, John Smith, Orlando). A database is a collection of such tables; the persistent objects of an application are captured in these tables.

Relational Operator: SCAN

SELECT NAME
FROM   EMPLOYEE
WHERE  ADDR = 'Orlando';

A SCAN applies a SELECT and then a PROJECT to each tuple. Given the table

EMP#  NAME        ADDR
0002  Jane Doe    Orlando
0005  John Smith  Orlando

SELECT (ADDR = 'Orlando') retains the matching tuples, and PROJECT (NAME) keeps only the NAME column, yielding {Jane Doe, John Smith}.

Relational Operator: JOIN

SELECT *
FROM   EMPLOYEE, PROJECT
WHERE  EMP# = ENUM;

EMPLOYEE:                    PROJECT:
EMP#  NAME      ADDR         ENUM  PROJECT   DEPT
0002  Jane Doe  Orlando      0002  Database  Research
                             0002  GUI       Research

Matching on EMP# = ENUM produces EMP_PROJ:

EMP#  NAME      ADDR     PROJECT   DEPT
0002  Jane Doe  Orlando  Database  Research
0002  Jane Doe  Orlando  GUI       Research

Hash-Based Join

Both relations are partitioned on the join attribute with the same hash function, e.g. HASH(EMP#) = EMP# mod 4 and HASH(ENUM) = ENUM mod 4 (so 0 mod 4 = 0, 4 mod 4 = 0, 5 mod 4 = 1, and so on). EMPLOYEE is split into buckets E0..E3 and PROJECT into buckets P0..P3. A tuple in Ei can match only tuples in Pi, so the join of the two relations reduces to four smaller independent joins: EMP_PROJ_i = Ei JOIN Pi.
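As a concrete illustration of the bucket-forming step, here is a minimal Python sketch of a hash-based join; the relation layout and four-way split are illustrative assumptions, not taken from the slides.

    # Minimal sketch of a hash-based join: partition both relations on the
    # join attribute with the same hash function, then join matching buckets.
    N = 4  # number of buckets (or processing nodes)

    def partition(relation, key_index):
        """Split a relation into N buckets by hashing the join attribute."""
        buckets = [[] for _ in range(N)]
        for tup in relation:
            buckets[tup[key_index] % N].append(tup)
        return buckets

    def hash_join(employee, project):
        e_buckets = partition(employee, 0)   # EMP# is column 0
        p_buckets = partition(project, 0)    # ENUM is column 0
        result = []
        for e_bucket, p_bucket in zip(e_buckets, p_buckets):
            # A tuple in E_i can only match tuples in P_i.
            for e in e_bucket:
                for p in p_bucket:
                    if e[0] == p[0]:
                        result.append(e + p[1:])
        return result

    employee = [(2, "Jane Doe", "Orlando")]
    project = [(2, "Database", "Research"), (2, "GUI", "Research")]
    print(hash_join(employee, project))

On a shared-nothing machine, each of the N independent bucket joins can run on its own processing node.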
Bucket Sizes and I/O Costs

If bucket B does not fit in memory in its entirety, then its counterpart bucket A must be loaded several times, one portion at a time. If the buckets are small enough that each B(i) fits in memory, the matching A(i) needs to be loaded only once. Bucket size therefore determines I/O cost.

Speedup and Scaleup

The ideal parallel system demonstrates two key properties:

1. Linear Speedup:

   Speedup = (small system elapsed time) / (big system elapsed time)

   Linear speedup: twice as much hardware can perform the task in half the elapsed time (i.e., speedup = number of processors).

2. Linear Scaleup:

   Scaleup = (small system elapsed time on small problem) / (big system elapsed time on big problem)

   Linear scaleup: twice as much hardware can perform twice as large a task in the same elapsed time (i.e., scaleup = 1).

Barriers to Parallelism

* Startup: The time needed to start a parallel operation (thread creation/connection overhead) may dominate the actual computation time.
* Interference: When accessing shared resources, each new process slows down the others (hot spot problem).
* Skew: The response time of a set of parallel processes is the time of the slowest one.

The Challenge

The ideal database machine has:
1. a single infinitely fast processor, and
2. an infinitely large memory with infinite bandwidth.

Unfortunately, technology is not delivering such machines! The challenge is:
1. to build an infinitely fast processor out of infinitely many processors of finite speed, and
2. to build an infinitely large memory out of infinitely many storage units of finite speed.

Performance of Hardware Components

* Processor: density increases by 25% per year; speed doubles in three years.
* Memory: density increases by 60% per year; cycle time decreases by 1/3 in ten years.
* Disk: density increases by 25% per year; cycle time decreases by 1/3 in ten years.

The database problem: the I/O bottleneck will worsen.

Hardware Architectures

(a) Shared Nothing (SN): each processor has its own memory module and disk drives; processors communicate over a communication network.
(b) Shared Disk (SD): each processor has a private memory, but all disks are reachable by all processors through the network.
(c) Shared Everything (SE): all processors share all memory modules and all disks.

Shared Nothing is more scalable for very large database systems.

Shared-Everything Systems

All CPUs share the memory modules through the communication network, with a cache in front of each CPU. When one CPU updates a cached line, the caches must cross-interrogate one another to stay consistent, which limits scalability.

Shared Disk Architecture

Pages (e.g., 4K bytes) are the unit of sharing. A small change to a page triggers cross interrogation for the whole page: processing units interfere with each other even when they work on different records of the same page.

Hybrid Architecture

SE clusters are interconnected through a communication network to form an SN structure at the inter-cluster level. This approach minimizes the communication overhead associated with the SN structure, and yet each cluster is kept small, within the limitation of the local memory and I/O bandwidth. Examples of this architecture include Sequent computers, the NCR 5100M, and the Bull PowerCluster. Some of the DBMSs designed for this structure are the Teradata Database System for the NCR WorldMark 5100 computer, Sybase MPP, and Informix OnLine Extended Parallel Server.

Parallelism in Relational Data Model

* Pipeline Parallelism: If one operator sends its output to another, the two operators can execute in parallel. For example, in

  INSERT INTO C
  SELECT * FROM A, B WHERE A.x = B.y;

  the SCANs of A and B feed a JOIN, which feeds the INSERT, and all of them can run concurrently.

* Partitioned Parallelism: By taking the large relational operators and partitioning their inputs and outputs, it is possible to turn one big job into many concurrent independent little ones, e.g., parallel SCANs of partitions A0, A1, A2 and B0, B1 feeding JOINs on several nodes, inserting into C0, C1, C2.

Merge & Split Operators

* A merge operator combines several parallel data streams into a single sequential stream.
* A split operator is used to partition or replicate a stream of tuples.
* With split and merge operators, a web of simple sequential dataflow nodes can be connected to form a parallel execution plan.

Data Partitioning Strategies

Data partitioning is the key to partitioned execution (the three strategies are sketched below):

* Round-Robin: maps the ith tuple to disk i mod n.
* Hash Partitioning: maps each tuple to a disk location based on a hash function.
* Range Partitioning: maps contiguous attribute ranges of a relation to various disks (e.g., A-F to P0, G-L to P1, M-R to P2, S-Z to P3).
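A minimal Python sketch of the three declustering strategies; the disk count and the string-prefix range boundaries are illustrative assumptions.

    # Sketch of the three data partitioning strategies.
    N_DISKS = 4

    def round_robin(i, tup):
        """The i-th tuple goes to disk i mod n; simple, no associative search."""
        return i % N_DISKS

    def hash_partition(tup, key):
        """Hash on the partitioning attribute; supports exact-match search."""
        return hash(tup[key]) % N_DISKS

    RANGES = ["G", "M", "S"]  # A-F -> 0, G-L -> 1, M-R -> 2, S-Z -> 3

    def range_partition(tup, key):
        """Contiguous attribute ranges map to disks; good for clustering."""
        value = tup[key]
        for disk, upper in enumerate(RANGES):
            if value < upper:
                return disk
        return len(RANGES)

    student = {"name": "Jane Doe"}
    print(round_robin(7, student))           # 3
    print(hash_partition(student, "name"))   # some disk in 0..3
    print(range_partition(student, "name"))  # 1 ("J" falls in G-L)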
Comparing Data Partitioning Strategies

* Round-Robin Partitioning:
  - Advantage: simple.
  - Disadvantage: it does not support associative search.
* Hash Partitioning:
  - Advantage: associative access to the tuples with a specific attribute value can be directed to a single disk.
  - Disadvantage: it tends to randomize data rather than cluster it.
* Range Partitioning:
  - Advantage: it is good for associative search and for clustering data.
  - Disadvantage: it risks execution skew, in which all the execution occurs in one partition.

Horizontal Data Partitioning

Example: the STUDENT relation (SSN, NAME, GPA, MAJOR) is range partitioned on GPA (the partitioning attribute) across four nodes: P0 holds 0 < GPA < 0.99, P1 holds 1 < GPA < 1.99, P2 holds 2 < GPA < 2.99, and P3 holds 3 < GPA < 4.

Query 1: Retrieve the names of students who have a GPA better than 2.0.
  => Only P2 and P3 can participate.
Query 2: Retrieve the names of students who major in Anthropology.
  => The whole file must be searched.

Multidimensional Data Partitioning

The data space is divided into a grid along several attributes (e.g., Age and Salary), and each grid cell is assigned to a processing node. [The slide shows a 9x9 grid of cell-to-node assignments, with the footprint of a 1-attribute query covering a column of cells and the footprint of a 2-attribute query covering a rectangle of cells.]

Advantages:
* Degree of parallelism is maximized (i.e., using as many processing nodes as possible).
* Search space is minimized (i.e., searching only the relevant data blocks).

Query Types

* Query shape: the shape of the data subspace accessed by a range query.
* Square query: the query shape is a square.
* Row query: the query shape is a rectangle containing a number of rows.
* Column query: the query shape is a rectangle containing a number of columns.

Optimality

* A data allocation strategy is usage optimal with respect to a query type if the execution of these queries can always use all the PNs available in the system.
* A data allocation strategy is balance optimal with respect to a query type if the execution of these queries always results in a balanced workload for all the PNs involved.
* A data allocation strategy is optimal with respect to a query type if it is both usage optimal and balance optimal with respect to this query type.

Coordinate Modulo Declustering (CMD)

Each row of the grid is the sequence 0..N-1 circularly shifted left one more position than the row above (row 0: 0 1 2 3 4 5 6 7; row 1: 1 2 3 4 5 6 7 0; and so on).

* Advantage: optimal for row and column queries.
* Disadvantage: poor for square queries.

Hilbert Curve Allocation Method (HCAM)

Grid cells are numbered along a Hilbert space-filling curve and assigned to the processing nodes in round-robin order along the curve.

* Advantage: good for square range queries.
* Disadvantage: poor for row and column queries.

General Multidimensional Data Allocation (GMDA)

Rows of the grid are of two kinds (N is the number of processing nodes; a sketch of the mapping follows):
* Regular rows: circular left shift of floor(sqrt(N)) positions.
* Check rows: circular left shift of floor(sqrt(N)) + 1 positions.

[The slide shows the resulting 9x9 assignment for N = 9, whose row 0 is 0 3 6 1 4 7 2 5 8, with every third row marked as a check row.]

Advantage: optimal for row queries, column queries, and small square range queries (|Q| < floor(sqrt(N))^2).
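The following Python sketch reproduces the 9x9 example grid on the slide. The closed form used here (a shift of floor(sqrt(N)) per column, one extra shift after every sqrt(N)-th column, plus the row index) is my reading of the example grid, not a verbatim transcription of the published GMDA algorithm.

    # Sketch of a GMDA-style grid-cell-to-node mapping (2D case).
    from math import isqrt

    def gmda_2d(row, col, n_nodes):
        s = isqrt(n_nodes)                           # regular shift distance
        first_row = (col * s + col // s) % n_nodes   # check positions add 1 extra
        return (first_row + row) % n_nodes

    N = 9
    for r in range(N):
        print([gmda_2d(r, c, N) for c in range(N)])
    # Row 0 prints [0, 3, 6, 1, 4, 7, 2, 5, 8], matching the slide's example.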
Handling 3-Dimensional Case

A cube with N^3 grid blocks can be seen as N 2-dimensional planes stacked up in the third dimension. First, apply the 2D GMDA to the top plane; this also initializes the first row of each of the N vertical planes. [The slide shows the resulting assignment for N = 9: the top plane plus the nine vertical planes hanging from its rows.]

Algorithm 3D GMDA:
1. Compute the mapping for the first rows of all the 2-dimensional planes by considering these rows collectively as forming a plane along the third dimension. Apply the 2D GMDA algorithm to this plane, except that the shifting distance is set to floor(cbrt(N))^2.
2. Apply the 2D GMDA algorithm to each of the 2-dimensional planes, using the shifting distance floor(cbrt(N)) and the first rows already computed in Step 1.

Handling Higher Dimensions: Mapping Function

A grid block (X_1, X_2, ..., X_d) is assigned to PN GeMDA(X_1, X_2, ..., X_d), where

  GeMDA(X_1, ..., X_d) = [ sum_{i=2}^{d} floor(X_i / GCD_i) + sum_{i=1}^{d} (X_i * Shf_dist_i) ] mod N

with N = number of PNs, Shf_dist = floor(N^(1/d)), and GCD_i = gcd(Shf_dist_i, N).

Optimality Comparison

Allocation  Optimal w.r.t.  Optimal w.r.t.  Optimal w.r.t.
scheme      row queries     column queries  small square queries
HCAM        No              No              No
CMD         Yes             Yes             No
GeMDA       Yes             Yes             Yes

Conventional Parallel Hash-Based Join: Grace Algorithm

On a shared-nothing system, each PN hashes its local portions of R and S into hash buckets (hash phase); tuples are transmitted to the PN owning their hash bucket (data transmission); buckets are merged to fit the memory space (bucket tuning); and finally each PN joins the matching bucket pairs (join phase).

The Effect of Imbalanced Workloads

[The slide plots bucket-size distributions for bucket skew Zb = 0, 0.5, and 1, and the total cost of GRACE_best versus GRACE_worst as bucket skew grows, for 64 processors with 64 x 4 MBytes of I/O and communication bandwidth; the gap between the two widens as skew increases.]

Partition Tuning: Largest Processing Time (LPT) First Strategy

Hash buckets are combined into per-processor groups: buckets are considered in decreasing order of size, and each is assigned to the processor with the least load so far, so the combined partitions are as even as possible. [The slide shows buckets B1..B8 of varying sizes combined into groups for P1 and P2.]
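A minimal Python sketch of LPT-first partition tuning: assign hash buckets, largest first, to the least-loaded processing node. The bucket sizes in the example are made up.

    # LPT (Largest Processing Time first) bucket-to-node assignment.
    import heapq

    def lpt_assign(bucket_sizes, n_nodes):
        """Return node -> list of bucket ids, balancing total tuple counts."""
        heap = [(0, node) for node in range(n_nodes)]   # (load, node)
        assignment = {node: [] for node in range(n_nodes)}
        order = sorted(range(len(bucket_sizes)),
                       key=lambda b: bucket_sizes[b], reverse=True)
        for b in order:
            load, node = heapq.heappop(heap)            # least-loaded node
            assignment[node].append(b)
            heapq.heappush(heap, (load + bucket_sizes[b], node))
        return assignment

    # Example: skewed bucket sizes combined onto 2 nodes.
    print(lpt_assign([900, 300, 250, 200, 150, 100], 2))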
Naive Load Balancing Parallel Hash Join (NBJ)

Each PN hashes its local R_i and S_i, performs partition tuning, transmits tuples, performs bucket tuning, and then joins. Because the tuning is applied to the initial partitions, the hash phase itself stays imbalanced (in the slide's figure, PN4 has less hashing work than the others).

Adaptive Load Balancing Parallel Hash Join (the ABJ scheme in the plots below)

Each PN hashes its local tuples into local buckets; the local buckets are then collected at their destined PNs to form the join buckets based on "bin packing." Workload is balanced among the PNs throughout the computation. But what if the partitioning is skewed initially?

Tuple Interleaving Parallel Hash Join (the TIJ scheme in the plots below)

Each PN distributes its tuples with a certain hash value evenly among the per-PN subbuckets, so each PN has a subbucket for each hash value; the subbuckets are then collected at their destined PNs to form the buckets based on "bin packing."

* Workload is balanced among the PNs throughout the computation.
* It is better to do partition tuning before bucket tuning.

Simulation Results

[The slides plot total cost against bucket skew, initial partition skew, and communication bandwidth per PN for GRACE, NBJ, ABJ, and TIJ.] ABJ continues to suffer from the initial skew after the hash phase, resulting in performance worse than even that of GRACE under a severe initial-skew condition.

Sampling-Based Load Balancing (SLB) Join Algorithm

* Sampling Phase: As the sampling tuples are loaded into memory, each PN declusters its tuples into a number of in-memory buckets by hashing on the join attribute.
* Partition Tuning Phase: The coordinator determines the optimal bucket allocation strategy (BAS) using the statistics on the partially formed buckets.
* Split Phase:
  - The tuples already in memory are allocated to their respective host PNs, according to BAS, to form the join buckets.
  - Each PN loads the remaining tuples and redistributes them to the join buckets in accordance with BAS.
* Join Phase: Each PN performs the local joins of the respectively matching buckets.

nCUBE/2 Results: SLB vs. ABJ vs. GRACE

[The slide plots time against partition skew at data skew 0.8.] The performance of SLB approaches that of GRACE under very mild skew conditions, and it avoids the disastrous performance that GRACE suffers under severe skew conditions.

Pipelining Hash-Join Algorithms

* Two-Phase Approach: build a hash table on the first operand, then probe it with the tuples of the second operand.
  - Advantage: requires only one hash table.
  - Disadvantage: pipelining along the outer relation must be suspended during the build phase (i.e., while building the hash table).
* One-Phase Approach: As a tuple comes in, it is first inserted into its own operand's hash table, and then used to probe the part of the other operand's hash table that has already been constructed (see the sketch below).
  - Advantage: pipelining along both operands is possible.
  - Disadvantage: requires larger memory space.
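A minimal Python sketch of the one-phase (symmetric) pipelining hash join; the (key, payload) tuple shape is an illustrative assumption.

    # One-phase pipelining hash join: every arriving tuple is inserted
    # into its own operand's hash table, then probes the part of the other
    # operand's table built so far, so neither input pipeline ever stalls.
    from collections import defaultdict

    class SymmetricHashJoin:
        def __init__(self):
            self.tables = (defaultdict(list), defaultdict(list))

        def feed(self, side, tup):
            """Insert tup (arriving on operand side 0 or 1) and return the
            join results it produces against the other operand so far."""
            key = tup[0]
            self.tables[side][key].append(tup)           # insert first
            matches = self.tables[1 - side].get(key, []) # then probe
            return [(tup, m) if side == 0 else (m, tup) for m in matches]

    join = SymmetricHashJoin()
    print(join.feed(0, (2, "Jane Doe")))   # [] : nothing to probe yet
    print(join.feed(1, (2, "Database")))   # one match, no pipeline stall
    print(join.feed(1, (2, "GUI")))        # second match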
Aggregate Functions

An SQL aggregate function is a function that operates on groups of tuples. Example:

SELECT   department, COUNT(*)
FROM     Employee
WHERE    age > 50
GROUP BY department

The number of result tuples depends on the selectivity of the GROUP BY attribute (i.e., department).

Centralized Merging

Each PN counts its local partition, producing a local count for every department (in the slide, PN1, PN2, and PN3 each hold counts for departments 1 through 8); the coordinator then merges all the local counts into the final result.

Distributed Merging

Each PN again builds a local count for every department, so a counter is created for each department on potentially every PN; this strategy therefore uses more memory space. The local counts are then hashed on the GROUP BY attribute (e.g., department MOD 3), so that each PN merges the counts of its share of the departments in parallel. This scheme incurs less data transmission than the next technique.

Repartitioning

The raw tuples are first repartitioned on the GROUP BY attribute (e.g., department MOD 3) into new partitions 1', 2', 3' (there is no need to write the new partitions to disks), and each PN then aggregates its new partition locally. This scheme incurs more data transmission.

Performance Characteristics

* Centralized Merging Algorithm:
  - Advantage: it works well when the number of result tuples is small.
  - Disadvantage: the merging phase is sequential.
* Distributed Merging Algorithm:
  - Advantage: the merging step is not a bottleneck.
  - Disadvantage: since a group value is being accumulated on potentially all the PNs, the overall memory requirement can be large.
* Repartitioning Algorithm:
  - Advantage: it reduces the memory requirement, as each group value is stored in one place only.
  - Disadvantage: it incurs more network traffic.

Conventional Aggregation Algorithms

* Centralized Merging (CM) Algorithm:
  - Phase 1: Each PN does aggregation on its local tuples.
  - Phase 2: The local aggregate values are merged at a predetermined central coordinator.
* Distributed Merging (DM) Algorithm:
  - Phase 1: Each PN does aggregation on its local tuples.
  - Phase 2: The local aggregate values are then hash-partitioned (based on the GROUP BY attribute) and the PNs merge these local aggregate values in parallel.
* Repartitioning (Rep) Algorithm:
  - Phase 1: The relation is repartitioned using the GROUP BY attributes.
  - Phase 2: The PNs do aggregation on their local partitions in parallel.

Performance comparison: CM and DM work well when the number of result tuples is small; Rep works better when the number of groups is large.

Adaptive Aggregation Algorithms

* Sampling-Based (Samp) Approach:
  - The CM algorithm is first applied to a small page-oriented random sample of the relation.
  - If the number of groups obtained from the sample is small, the DM strategy is used; otherwise the Rep algorithm is used.
* Adaptive DM (A-DM) Algorithm (sketched below):
  - This algorithm starts with the DM strategy, under the common-case assumption that the number of groups is small.
  - However, if the algorithm detects that the number of groups is large (i.e., memory full is detected), it switches to the Rep strategy.
* Adaptive Repartitioning (A-Rep) Algorithm:
  - This algorithm starts with the Rep strategy.
  - It switches to DM if the number of groups is not large enough (i.e., the number of groups is too few given the number of tuples seen).

Performance comparison: In general, A-DM performs the best. However, A-Rep should be used if the number of groups is suspected to be very large.
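A minimal Python sketch of the A-DM idea on a single PN: aggregate locally as in DM until the group table exceeds a memory budget, then switch to Rep and forward raw tuples to the PN that owns their group. The memory budget, PN count, routing function, and send() primitive are illustrative assumptions.

    MEMORY_BUDGET = 1000   # max number of local groups (assumed)
    N_PNS = 3

    def owner(group):
        return hash(group) % N_PNS   # Rep-style hash partitioning

    def a_dm_phase1(tuples, send):
        """tuples: iterable of (group, value); send(pn, msg) ships data."""
        local, switched = {}, False
        for group, value in tuples:
            if not switched and (group in local or len(local) < MEMORY_BUDGET):
                local[group] = local.get(group, 0) + value     # DM mode
            else:
                if not switched:
                    switched = True
                    # Memory full: flush partial results to their owners.
                    for g, v in local.items():
                        send(owner(g), ("partial", g, v))
                    local.clear()
                send(owner(group), ("tuple", group, value))    # Rep mode
        if not switched:
            # Normal DM Phase 2: hash-partition the local aggregates.
            for g, v in local.items():
                send(owner(g), ("partial", g, v))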
Implementation Techniques for A-DM

* Global Switch:
  - When the first PN detects a memory-full condition, it informs all the PNs to switch to the Rep strategy.
  - Each PN first partitions and sends the local results accumulated so far to the PNs they hash to. Then it proceeds to read and repartition the remaining tuples.
  - Once the repartitioning phase is complete, the PNs aggregate their local partitions in parallel (as in the Rep algorithm).
* Local Switch:
  - A PN, upon detecting memory full, stops processing its local tuples. It first partitions and sends the local results accumulated so far to the PNs they hash to. Then it proceeds to read and repartition the remaining tuples.
  - During Phase 1, one set of PNs may be executing the DM algorithm while the others are executing the Rep algorithm. When the latter receive an aggregate value from another PN, they accumulate it into the corresponding local aggregate value.
  - Once all PNs have completed their Phase 1, the local aggregate values are merged as in the DM algorithm.

A-DM: Global Switch

[The slide walks through an example: in Step 1 the PNs accumulate counts as in DM; when one PN fills its memory, all PNs (PN2 as well, hence a global switch) repartition their partial counts and remaining tuples by department MOD 3; in Step 2 the new partitions are aggregated as in Rep, with no need to write them to disks.]

A-DM: Local Switch

[The slide repeats the example, but only PN1, the node that detected memory full, switches to the Rep scheme; the other PNs continue with DM.]

SQL (Structured Query Language)

Schema: EMPLOYEE(ENAME, ENUM, BDATE, ADDR, SALARY), WORKSON(ENO, PNO, HOURS), PROJECT(PNAME, PNUM, DNUM, PLOCATION).

An SQL query:

SELECT ENAME
FROM   EMPLOYEE, WORKSON, PROJECT
WHERE  PNAME = 'database' AND PNUM = PNO
  AND  ENO = ENUM AND BDATE > '1965'

SQL is nonprocedural. The compiler must generate the execution plan:
1. Transform the query from SQL into relational algebra.
2. Restructure (optimize) the algebra to improve performance.

Relational Algebra

Relation T1:                     Relation T2:
ENAME     SALARY    ENUM         ENUM  ADDRESS      BDATE
Andrew    $98,000   005          001   Orlando      1964
Casey     $150,000  003          003   New York     1966
James     $120,000  007          005   Los Angeles  1968
Kathleen  $115,000  001          007   London       1958

* Select: selects rows.
  SELECT_{SALARY >= 120,000}(T1) = { (Casey, 150000, 003), (James, 120000, 007) }
* Project: selects columns.
  PROJECT_{ENAME, SALARY}(T1) = { (Andrew, 98000), (Casey, 150000), (James, 120000), (Kathleen, 115000) }
* Cartesian Product: selects all possible combinations.
  T1 x T2 = { (Andrew, 98000, 005, 001, Orlando, 1964), (Andrew, 98000, 005, 003, New York, 1966), ..., (Kathleen, 115000, 001, 007, London, 1958) }
* Join: selects some combinations.
  T1 JOIN_{ENUM} T2 = { (Andrew, 98000, 005, Los Angeles, 1968), (Casey, 150000, 003, New York, 1966), (James, 120000, 007, London, 1958), (Kathleen, 115000, 001, Orlando, 1964) }
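For reference, here are the four operators above in standard notation (a sketch; the slides state them in words and by example):

    % Standard definitions of the four relational algebra operators.
    \sigma_{P}(R) = \{ t \in R \mid P(t) \}                        % Select: rows satisfying P
    \pi_{A_1,\dots,A_k}(R) = \{ t[A_1,\dots,A_k] \mid t \in R \}   % Project: keep listed columns
    R \times S = \{ (r, s) \mid r \in R,\ s \in S \}               % Cartesian product
    R \bowtie_{\theta} S = \sigma_{\theta}(R \times S)             % Join: combinations satisfying \theta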
Transforming SQL into Algebra

The SQL query above maps to a canonical query tree: the FROM clause contributes the Cartesian products (EMPLOYEE x WORKSON x PROJECT), the WHERE clause contributes a SELECT with the conjunction PNAME = 'database' AND PNUM = PNO AND ENO = ENUM AND BDATE > '1965', and the SELECT clause contributes a final PROJECT on ENAME. This query tree (procedure) will compute the correct result; however, the performance will be very poor. => It needs optimization!

Optimization Strategies

GOAL: Reduce the sizes of the intermediate results as quickly as possible.

STRATEGY:
1. Move SELECTs and PROJECTs as far down the query tree as possible.
2. Among SELECTs, reorder the tree to perform the one with the lowest selectivity factor first.
3. Among JOINs, reorder the tree to perform the one with the lowest join selectivity first.

Example: Apply SELECTs First

The conjunctive SELECT at the top of the canonical tree is split and pushed down: BDATE > '1965' is applied directly to EMPLOYEE and PNAME = 'database' directly to PROJECT, while ENUM = ENO and PNUM = PNO remain above the Cartesian products.

Example: Replace "x" by "JOIN"

Each SELECT sitting on top of a Cartesian product is merged with it into a JOIN: ENUM = ENO over (EMPLOYEE x WORKSON) becomes EMPLOYEE JOIN_{ENUM=ENO} WORKSON, and likewise PNUM = PNO with PROJECT.

Example: Move PROJECTs Down

Intermediate PROJECTs are inserted so that only the needed columns flow up the tree: (ENAME, ENUM) from the selected EMPLOYEE tuples, (ENO, PNO) from WORKSON, (ENAME, PNO) out of the lower join, and PNUM from the selected PROJECT tuples.

Parallelizing Query Optimizer

Relations are fragmented and allocated to multiple processing nodes:
=> Besides the choice of ordering relational operations, the parallelizing optimizer must select the best sites to process data.
=> The role of a parallelizing optimizer is to map a query on global relations into a sequence of local operations acting on local relation fragments.

Parallelizing Query Optimization

An SQL query on global relations is first fed, together with the global schema, to a SEQUENTIAL OPTIMIZER, which produces an optimized sequential access plan. A PARALLELIZING OPTIMIZER then uses the fragment schema to produce an optimized parallel access plan. The parallelizing optimizer:
* parallelizes the relational operators, and
* selects the best processing nodes for each parallelized relational operator.

Example: Elimination of Useless JOINs

Fragments:
  E1 = SELECT_{ENO <= 'E3'}(E)          G1 = SELECT_{ENO <= 'E3'}(G)
  E2 = SELECT_{'E3' < ENO <= 'E6'}(E)   G2 = SELECT_{ENO > 'E3'}(G)
  E3 = SELECT_{ENO > 'E6'}(E)

Query:
  SELECT *
  FROM   E, G
  WHERE  E.ENO = G.ENO

(1) The sequential query tree joins the global relations E and G on ENO.
(2) Data localization replaces E by E1 U E2 U E3 and G by G1 U G2.
(3) Distributing the JOIN over the UNIONs yields the six fragment joins Ei JOIN Gj.
(4) Useless JOINs are eliminated: only the fragment pairs whose ENO ranges overlap remain, i.e., E1 JOIN G1, E2 JOIN G2, and E3 JOIN G2. The optimizer then finds the "best" ordering of these fragment operators.

Parallelizing Query Optimization

1. Determine which fragments are involved and transform the global operators into fragment operators.
2. Eliminate useless fragment operators (see the sketch below).
3. Find the "best" ordering of the fragment operators.
4. Select the best processing node for each fragment operator and specify the communication operations.
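A minimal Python sketch of step 2, useless fragment-JOIN elimination: a fragment pair can be dropped when the selection predicates on the join attribute cannot overlap. Fragments are modeled here as half-open ranges over ENO values, an illustrative encoding rather than the optimizer's actual data structure.

    LO, HI = "", "\uffff"   # stand-ins for -infinity / +infinity

    E_FRAGS = {"E1": (LO, "E3"), "E2": ("E3", "E6"), "E3": ("E6", HI)}
    G_FRAGS = {"G1": (LO, "E3"), "G2": ("E3", HI)}

    def overlaps(a, b):
        """Intersection test on (low, high] ranges."""
        return a[0] < b[1] and b[0] < a[1]

    useful = [(e, g) for e, er in E_FRAGS.items()
                     for g, gr in G_FRAGS.items()
                     if overlaps(er, gr)]
    print(useful)   # [('E1', 'G1'), ('E2', 'G2'), ('E3', 'G2')]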
Prototype at UCF

A prototype of a shared-nothing system was implemented on a 64-processor nCUBE/2 computer. Our system was implemented to demonstrate:
* the GeMDA multidimensional data partitioning technique,
* a dynamic optimization scheme with load-balancing capability, and
* a competition-based scheduling policy.

System Architecture

Create/destroy-table requests, SQL queries, and results pass through the PRESENTATION MANAGER; the QUERY TRANSLATOR uses the global schema; the QUERY EXECUTOR uses the fragment schema, interprets the execution tree, and calls the corresponding operator routines to carry out the underlying operations; the LOAD UTILITY populates the database; and the STORAGE MANAGER provides a sequential scan interface to the query processing facilities.

Software Components

* Storage Manager: manages physical disk devices and schedules all I/O activities. It provides a sequential scan interface to the query processing facilities.
* Catalog Manager: acts as a central repository of all global and fragment schemas.
* Load Utility: allows the users to populate a relation from an external file. It distributes the fragments of a relation across the processing nodes using GeMDA.
* Query Translator: provides an interface for queries. It translates an SQL query into a query graph. It also caches the global schema information locally.
* Query Executor: performs dynamic query optimization. It schedules the execution of the operators in the query graph.
* Operator Routines: each routine implements a primitive database operator. To execute an operator in a query graph, the Query Executor calls the appropriate operator routine to carry out the underlying operation.
* Presentation Manager: provides an interactive interface for the user to create/destroy tables and query the database. The user can also use this interface to browse query results.

Competition-Based Scheduling

Each operator server is associated with a processing node. A dispatcher takes queries from the waiting queue and assigns each active query a coordinator from the coordinator pool; the coordinators compete for the operator servers in the server pool.
* Advantage: fair.
* Disadvantage: system utilization is not maximized.

Planning-Based Scheduling

A scheduler looks at a scheduling window of the query queue and plans the assignment of the operator servers (e.g., JOIN, SELECT, and PROJECT servers) to coordinators. Each operator server is associated with a processing node.
* Advantage: better system utilization.
* Disadvantages: less fair, and the scheduler can become a bottleneck.

Hardware Organization

A host computer on a local area network talks to Interface Processors (IFPs), which are connected through an interconnection network to Access Processors (ACPs). Catalog Manager, Query Manager, and Scheduler processes run on the IFPs; operator processes run on the ACPs.

Structure of Operator Processes

An operator process applies its operation to a stream of tuples (e.g., 8K-byte batches) and demultiplexes its output through a split table, which maps each hash value to a destination process:

Hash Value  Destination Process
0           (Processor #3, Port #5)
1           (Processor #4, Port #6)
2           (Processor #5, Port #8)
3           (Processor #6, Port #2)

When the process detects the end of its input stream:
* it first closes the output streams, and
* then sends a control message to its coordinator process indicating that it has completed execution.
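A minimal Python sketch of the output stage of such an operator process; the split-table entries mirror the example above, and send() is a stand-in for the real message-passing primitive.

    SPLIT_TABLE = {
        0: ("processor#3", "port#5"),
        1: ("processor#4", "port#6"),
        2: ("processor#5", "port#8"),
        3: ("processor#6", "port#2"),
    }

    def send(dest, batch):
        print(f"shipping {len(batch)} tuples to {dest}")

    def split_output(tuples, key=lambda t: t[0]):
        """Route each output tuple to the destination its hash value maps to."""
        batches = {dest: [] for dest in SPLIT_TABLE.values()}
        for t in tuples:
            dest = SPLIT_TABLE[hash(key(t)) % len(SPLIT_TABLE)]
            batches[dest].append(t)
        for dest, batch in batches.items():
            if batch:
                send(dest, batch)   # in the real system: 8K-byte batches

    split_output([(7, "a"), (8, "b"), (11, "c")])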
Example: Operator and Process Structure

Query tree: SCAN B feeds a SELECT, which is joined with SCAN A, and the result is stored into C. Relations A, B, and C are partitioned over PN1 and PN2, while the JOIN executes on PN3 and PN4. PN3 and PN4 first build hash tables from the A tuples; SCAN/SELECT processes on PN1 and PN2 read B1 and B2 and use the selected B tuples to probe those hash tables; STORE processes write the results to C1 and C2.

Process Structure

* OPERATOR METHODS / COMPILED QUERY: contains code for each operator in the database access language.
* ACCESS METHODS: maintains an active scan table that describes all the scans in progress; the storage manager provides the primitives for scanning a file via a sequential or index scan.
* STORAGE STRUCTURES: maps file names to file IDs, manages active files, and searches for the page holding a given record (record ID -> record).
* BUFFER MANAGEMENT: manages a buffer pool.
* PHYSICAL I/O: manages physical disk devices and performs page-level I/O operations.

Transaction Processing

The consistency and reliability aspects of transactions are due to four properties:
1. Atomicity: a transaction is either performed in its entirety or not performed at all.
2. Consistency: a correct execution of the transaction must take the database from one consistent state to another.
3. Isolation: a transaction should not make its updates visible to other transactions until it is committed.
4. Durability: once a transaction changes the database and the changes are committed, these changes must never be lost because of subsequent failure.

Transaction Manager

A TRANSACTION MANAGER comprises a LOCK MANAGER and a LOG MANAGER.
* Lock Manager (isolation property):
  - Each local lock manager is responsible for the lock units local to that processing node.
  - Together they provide concurrency control.
* Log Manager:
  - Each local log manager logs the local database operations.
  - Together they provide recovery services.

Two-Phase Locking (2PL) Protocol

The number of locks held by a transaction grows during the growing phase, peaks at the lock point, and only decreases during the shrinking phase. Any schedule generated by a concurrency control algorithm that obeys the 2PL protocol is serializable (i.e., the isolation property is guaranteed). 2PL is known to be difficult to implement: the lock manager has to know
1. when the transaction has obtained all its locks, and
2. when the transaction no longer needs to access the data item in question (so that the lock can be released).
Cascading aborts can occur, because transactions reveal updates before they commit.

Strict Two-Phase Locking Protocol

The lock manager releases all the locks together when the transaction terminates (commits or aborts).

Handling Deadlocks

* Detection and Resolution:
  - Abort and restart a transaction if it has waited for a lock for a long time.
  - Detect cycles in the wait-for graph and select a transaction (involved in a cycle) to abort.
* Prevention (sketched below): if Ti requires a lock held by Tj,
  - if Ti is older, Ti may wait;
  - if Ti is younger, Ti is aborted and restarted with the same timestamp.
  There can never be a loop in the wait-for graph, so deadlock is not possible!
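A minimal Python sketch of the timestamp-based prevention rule above (the classic wait-die scheme): an older requester waits, a younger one is aborted and later restarted with its original timestamp, so it eventually becomes the oldest and cannot starve.

    class Aborted(Exception):
        pass

    def request_lock(requester_ts, holder_ts):
        """Timestamps double as ages: smaller timestamp = older transaction."""
        if requester_ts < holder_ts:
            return "wait"   # Ti is older than Tj: safe to wait
        raise Aborted("younger transaction aborts, restarts with same timestamp")

    print(request_lock(5, 9))    # older requester -> 'wait'
    try:
        request_lock(9, 5)       # younger requester -> aborted
    except Aborted as e:
        print(e)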
Distributed Deadlock Detection [Chandy 83]

[The slide shows transactions 0 through 8 spread over PN0, PN1, and PN2; transaction 0 is blocked waiting on transaction 1 for a data item, and a probe such as (0,0,1), (0,1,2), (0,2,3), ..., (0,8,0) travels around the cycle and comes back to transaction 0.]

When a transaction is blocked, it sends a special probe message to the blocking transaction. The message consists of three numbers: the transaction that just blocked, the transaction sending the message, and the transaction to whom it is being sent. When the message arrives, the recipient checks to see if it is itself waiting for any transaction. If so, the message is updated, replacing the second field with its own TID and the third with the TID of the transaction it is waiting for; the message is then sent on to that blocking transaction. If a message goes all the way around and comes back to the original sender, a deadlock is detected.
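A minimal Python sketch of this probe scheme: probes are (blocked, sender, receiver) triples forwarded along the waits-for edges, and a probe returning to the transaction that started it reveals a cycle. The waits_for map is a toy stand-in for the distributed wait-for information.

    waits_for = {0: 1, 1: 2, 2: 3, 3: 0}   # T0 -> T1 -> T2 -> T3 -> T0

    def detect_deadlock(blocked):
        probe = (blocked, blocked, waits_for[blocked])   # initial probe
        while True:
            initiator, sender, receiver = probe
            if receiver == initiator:
                return True       # probe came all the way around: deadlock
            if receiver not in waits_for:
                return False      # recipient is not waiting on anyone
            # Recipient updates the probe and forwards it to its blocker.
            probe = (initiator, receiver, waits_for[receiver])

    print(detect_deadlock(0))     # True: the cycle above is a deadlock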
Two-Phase Commit Protocol

To ensure the atomicity property, a two-phase commit protocol can be used to coordinate the commit process among subtransactions. The coordinator, which originates the transaction, sends PREPARE to its agents (each executing a subtransaction on its behalf); the agents vote READY or ABORT; the coordinator then sends COMMIT or ABORT; and the agents acknowledge with ACK.

Recovery

An entry is made in the local log file at a processing node each time one of the following commands is issued by a transaction:
* begin transaction
* write (insert, delete, update)
* commit transaction
* abort transaction

Write-ahead log protocol:
* It is essential that log records be written before the corresponding write to the database. (Note: if a log entry wasn't saved before the crash, the corresponding change was not applied to the database!)
* If there is no commit-transaction entry in the log for a particular transaction, then that transaction was still active at the time of failure and must therefore be undone.

Commercial Product: Teradata DBC/1012

A host computer on a local area network connects to Interface Processors (IFPs) and a Communication Processor (COP), which talk over the Ynet interconnect to Access Module Processors (AMPs). A system may have over 1,000 processors and many thousands of disks. Each relation is hash partitioned over a subset of the AMPs. Near-linear speedup and scaleup on queries have been demonstrated for systems containing over 100 processors.

Teradata DBC/1012: Distribution of Data

Each AMP stores a primary copy area and a fallback copy area; the fallback copy ensures that the data remains available on other AMPs if an AMP should fail. In the first example, DSU/AMPs 1 through 8 hold primary rows 1-24 with their fallback copies scattered across the other AMPs; if AMPs 4 and 7 were to fail simultaneously, however, there would be a loss of data availability. Additional data protection can be achieved by "clustering" the AMPs in groups: with AMPs 1-4 in cluster A and AMPs 5-8 in cluster B, fallback copies are kept within the cluster, and if both AMPs 4 and 7 were to fail, all data would still be available.

Commercial Product: Tandem NonStop SQL

Each main processor has its own memory and Dynabus control, and I/O processors connect to disk, terminal, and tape controllers. Tandem systems run the applications on the same processors as the database servers. Relations may be range partitioned across multiple disks. It is primarily designed for OLTP. It scales linearly well beyond the largest reported mainframes on the TPC-A benchmarks, and it is three times cheaper than a comparable mainframe system.

Challenges Facing Parallel Database Technology

* Multimedia applications require very large network-I/O bandwidth.
* A parallel parallelizing query optimizer would be able to consider many more execution plans.
* Parallel programming languages should be designed to take advantage of the parallelism inherent in the parallel database system.
* We need new data partitioning techniques and algorithms for the non-relational data objects allowed in SQL3, e.g., graph structures.