Database Systems on Modern Processors
Kenneth Ross, Columbia University (also visiting IBM T. J. Watson Research Center)
EDGE Workshop, May 2006

Talk Outline
– Intro: what is a DBMS, and what are its performance goals?
– Some history.
– How DBMS researchers are dealing with changes in architecture.
– The future.

Database Management Systems
Store and manage large data collections. Features:
– Concurrency control
– Persistence, recovery
– Integrity (constraints)
– Declarative queries
– Transactions (all or nothing)
– Distribution
– …

Database Management Systems
Focus on relational DBMSs: IBM DB2, Oracle, Sybase, Microsoft SQL Server, MySQL, Postgres, …
– Multi-billion dollar market.
– DBMS servers are a major use of modern computers.
– DBMS architects and HW architects care about the performance of DB workloads.

Performance Scalability
Transaction processing:
– Keep up with very high transaction rates: millions of transactions/minute.
– Single-row updates with locking.
– Benchmark: TPC-C (www.tpc.org).
Query processing:
– Data warehouses with terabytes of data.
– Complex queries over many tables.
– Large data volumes: throughput.
– Benchmark: TPC-H.

Query Processing 101
Each query has many potential execution plans:
– Plans are logically equivalent, but differ in speed.
– The query optimizer explores a space of plans and chooses the plan with the lowest cost (i.e., time).
– Cost model: I/O, CPU, …
– Costs are estimated from the operators used, input sizes, the expected selectivity of predicates, and other DB statistics.
– Each operator has a cost function.
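To make the cost comparison concrete, here is a minimal C sketch of an optimizer choosing the cheaper of two logically equivalent plans. The plan names, cost numbers, and I/O-to-CPU weighting below are hypothetical illustrations, not any real system's cost model:

    #include <stdio.h>

    typedef struct {
        const char *name;
        double io_cost;    /* estimated page reads */
        double cpu_cost;   /* estimated tuples touched */
    } Plan;

    /* Fold resource estimates into one comparable number. A real
     * optimizer derives these terms from operator cost functions,
     * input sizes, and predicate selectivities. */
    static double total_cost(const Plan *p) {
        const double io_weight = 4.0;  /* assumed: 1 page read ~ 4 CPU units */
        return io_weight * p->io_cost + p->cpu_cost;
    }

    int main(void) {
        /* Two logically equivalent plans for the same query. */
        Plan plans[] = {
            { "sequential scan + filter",     10000.0, 1000000.0 },
            { "index scan, selectivity 0.01",   500.0,   20000.0 },
        };
        int n = sizeof plans / sizeof plans[0], best = 0;
        for (int i = 1; i < n; i++)
            if (total_cost(&plans[i]) < total_cost(&plans[best]))
                best = i;
        printf("chosen plan: %s (estimated cost %.0f)\n",
               plans[best].name, total_cost(&plans[best]));
        return 0;
    }

A production optimizer hard-codes none of these numbers; it computes them from catalog statistics, which is why bad statistics lead to bad plans.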
A Bit of History
Database machines (1980s):
– Build specialized processors for databases: associative memory, SIMD/MIMD, etc.
– A major failure: by the time the hardware had been built, commodity hardware had caught up, as had software advances (e.g., better algorithms).
– Lesson: use commodity hardware! (The scope of "commodity" has changed between 1980 and 2006.)

Early-Mid 1990s
Move to shared-nothing parallel systems:
– Predictable performance, with minimal interference between processors.
– Database workloads typically partition well: embarrassing parallelism.
– Good scale-up and speed-up: just add more resources.
– I/O is the bottleneck.

Late 1990s
Disk speed/capacity, memory capacity, and processor speed all increase, but memory latency does not.
Recognition that these improvements were changing the performance profile of a DBMS:
– I/O is no longer the bottleneck.
– L2 data cache misses and L1 instruction cache misses are significant [Ailamaki et al. 99].
– 2 to 5 times the CPI of SPECInt workloads.

What to do about it?
Build a new DBMS from scratch?
– Major investments in existing infrastructure.
– Performance is only one of many goals.
– Research prototypes, e.g., Monet [CWI].
Try to adapt existing systems:
– Constrains the options, but improvements are attainable without re-implementing everything.
– Be configurable to the available hardware.
– Focus on the "inner loops" of the critical paths.

Dealing with Modern Architectures
Data & instruction cache performance:
– Better utilization of the space in each cache line.
– Schedule work to efficiently utilize the data in each cache line.
– Chunk work into units processed by L1-cache-sized code segments.
– Prefetching.
Branches:
– Take mispredictions into account when estimating query cost.
– Avoid branches to get better CPI.
SIMD:
– Speedup due to parallelism and the absence of branch misprediction.
Also: TLB, GPU, SMT, CAM, etc.

Better Cache-line Utilization I
CSB+-Tree [Rao & Ross 2000]:
– A B+-Tree node interleaves keys and child pointers (P0 K1 P1 K2 P2 K3 P3): 4 children per node.
– A CSB+-Tree node stores its 8 children contiguously, so a single pointer P reaches all of them, leaving room for 7 keys (K1–K7) in the same space.
– A "full" CSB+-Tree allocates space for all possible children in advance.
[Figure: elapsed time (s) vs. proportion of searches for the B+-Tree, CSB+-Tree, and full CSB+-Tree, with 4 million leaf entries.]
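The layout change is easy to see in C. Below is an illustrative sketch of the two node formats from the figure (4 children per B+-Tree node vs. 8 contiguous children per CSB+-Tree node), assuming 4-byte keys and the 4-byte pointers of a 32-bit machine; the type and field names are ours, not the paper's:

    #include <stdint.h>

    /* Conventional B+-Tree node: keys and child pointers interleave,
     * so roughly half the node's cache-line budget goes to pointers. */
    typedef struct BNode {
        int32_t       keys[3];     /* K1 K2 K3 */
        struct BNode *child[4];    /* P0 P1 P2 P3 */
    } BNode;

    /* CSB+-Tree node: all children are allocated contiguously as a
     * group, so one pointer to the first child replaces the per-child
     * pointers, leaving room for 7 keys in the same space. */
    typedef struct CSBNode {
        int32_t         keys[7];      /* K1 .. K7 */
        struct CSBNode *first_child;  /* P: children are first_child[0..7] */
    } CSBNode;

    /* Descend one level: child i is reached by arithmetic on the group's
     * base address rather than by loading a stored pointer. */
    static CSBNode *csb_child(const CSBNode *n, int i) {
        return n->first_child + i;
    }

More keys per cache line means fewer cache lines touched per search; the trade-off is that splitting a node means copying an entire contiguous child group, which is why the figure plots performance against the proportion of searches in the workload.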
Better Cache-line Utilization II
Columnwise storage, e.g., Monet [CWI]:
– Only process the columns you need.
PAX [Ailamaki et al. 2001]:
– Each column is stored columnwise within its disk page.
– Good cache behavior for queries; good I/O behavior for row access.
Fractured Mirrors [Ramamurthy et al. 2002]:
– Best of both worlds, with two copies of the data.
CSM [Shao et al. 2004]:
– The in-memory copy keeps just part of the disk page.

Consecutive Lookups to an Index [Zhou & Ross 2003]
[Figure sequence: a stream of probes (keys 2, 11, 1, 20, 9, 19, 3, 5, yielding RIDs r1–r8) descends an index with nodes A–F while the L2 cache holds at most 3 nodes. Successive probes follow different root-to-leaf paths, evicting nodes that later probes need: cache thrashing!]

Buffering Accesses to the Index
[Figure: each subtree below the root keeps a buffer of pending probes; probes are appended to the buffer of the matching subtree and drained in batches, so a node's cache lines are reused by many probes before eviction.]

Impact of Buffering
– Increases temporal and spatial locality.
– Overhead: an extra copy of the data into the buffers.
– Key idea: buffer in units of L2 cache size or less, spanning several tree levels if the fan-out is small.
[Figure: elapsed time (s) on a Pentium 4 with 5M keys and 5M probes: buffered vs. conventional lookups on the B+-Tree, CSB+-Tree, and R-Tree.]

Conventional Demand-Driven Pipelined Execution Model [Zhou & Ross 2004]
– Operators support the open-next-close iterator interface.
– An operator returns control immediately after generating one tuple, so each GetNext() call bounces between the parent and child operators.
– The combined parent and child instruction footprints thrash the L1 I-cache: poor instruction locality and poor branch prediction accuracy.
– Solution: buffering! A child delivers a batch of tuples before returning control, so the execution pattern ABABABA… becomes ABBBBBAAAAAB…

Validation on Postgres

    SELECT SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
           AVG(l_quantity) AS avg_qty,
           COUNT(*) AS count_order
    FROM lineitem
    WHERE l_shipdate <= date '1998-11-01'

(from TPC-H)
– The combined instruction footprint of the Agg (sum, avg, count) and the lineitem TableScan is 23K > 16K of L1 I-cache.
– The buffered plan inserts a Buffer operator between the TableScan and the Agg: L1 instruction cache misses drop by 80%, and elapsed time improves by 12%.
[Figure: elapsed time (s) for the original vs. buffered query plan, broken into L1 instruction cache miss penalty, L2 cache miss penalty, branch misprediction penalty, and other cost.]

Steps: Cache-Resident Transaction Processing [Harizopoulos et al. 2004]
Group similar transactions:
– Reuse L1-cache-resident instructions across many transactions.
– Fast context switch.
– 40% speedup on TPC-C workloads.

Prefetching
Rely on hardware prefetching:
– Use predictable (e.g., sequential) access patterns.
Explicit prefetching:
– Child/sibling cache lines in tree indexes [Chen et al. 2001, 2002].
– Staged access to data [Harizopoulos & Ailamaki 2003].
Work-ahead set [Zhou et al. 2005] [EDGE poster]:
– Use another SMT thread to manage memory access.
– Load rather than prefetch (a prefetch may be dropped).

Branch Mispredictions
Conjunctive selections [Ross 2002]:

    for (i = 0; i < number_of_records; i++) {
        if (all conditions are true, i.e., equal to 1, for record i) {
            answer[j++] = i;
        }
    }

Code Alternatives (I)

    if (f1(R[i]) && f2(R[i])) { … }

versus

    if (f1(R[i]) & f2(R[i])) { … }

The && version does less work if f1 is selective, but it has two conditional branches where the & version has one.

Code Alternatives (II)
A version with no conditional branches:

    for (i = 0; i < number_of_records; i++) {
        answer[j] = i;
        j += (f1(R[i]) & f2(R[i]));
    }

[Figure: measured performance of the alternatives on a 4-way conjunction with simple f1, f2, …]

Query Optimization
A large space of plans, e.g.,

    for (i = 0; i < number_of_records; i++) {
        if ((f1(R[i]) & f2(R[i])) && f3(R[i])) {
            answer[j] = i;
            j += f4(R[i]);
        }
    }

The optimizer's cost function must include a term for the branch misprediction penalty.

Branches and CPI
Branches create control dependencies that limit instruction-level parallelism on superscalar CPUs:
– Avoid dependencies to get better CPI.
Compression/decompression [Zukowski et al. 2006]:
– Branch-free decompression algorithm.
– Decompress at 2-6 GB/sec to get good data bandwidth without hundreds of disks.

SIMD on Modern Processors
E.g., SSE on the Pentium. Comparisons generate masks:
– Convert if-then-else control dependencies into data dependencies.
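As a concrete (if simplified) illustration, here is a branch-free SIMD COUNT under a BETWEEN predicate using SSE2 intrinsics. This is our sketch in the spirit of [Zhou & Ross 2002], not the paper's kernel; it assumes 32-bit signed keys, that n is a multiple of 4, and that the bounds lie strictly inside the int32 range:

    #include <emmintrin.h>   /* SSE2 */
    #include <stdint.h>

    /* Count the records with a <= x[i] <= b, with no conditional branch
     * in the loop body. Assumes a > INT32_MIN, b < INT32_MAX, and
     * n a multiple of 4. */
    int simd_count_between(const int32_t *x, int n, int32_t a, int32_t b) {
        __m128i lo  = _mm_set1_epi32(a - 1);   /* x > a-1  <=>  x >= a */
        __m128i hi  = _mm_set1_epi32(b + 1);   /* x < b+1  <=>  x <= b */
        __m128i acc = _mm_setzero_si128();
        for (int i = 0; i < n; i += 4) {
            __m128i v = _mm_loadu_si128((const __m128i *)(x + i));
            /* Each comparison yields a per-lane mask of all ones (-1)
             * or all zeros: a data dependency instead of a branch. */
            __m128i m = _mm_and_si128(_mm_cmpgt_epi32(v, lo),
                                      _mm_cmplt_epi32(v, hi));
            acc = _mm_sub_epi32(acc, m);       /* subtracting -1 adds 1 */
        }
        int32_t lanes[4];                      /* horizontal sum */
        _mm_storeu_si128((__m128i *)lanes, acc);
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }

The same masks extend to the other aggregates: AND the mask with the values before adding for SUM, or blend with AND/ANDNOT/OR for MAX and MIN.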
Benefits of SIMD for DBs [Zhou & Ross 2002]

    SELECT Agg(R.y) FROM R WHERE R.x BETWEEN a AND b

Table with 1 million records; predicate selectivity 0.2.
[Figure: elapsed time (ms) for SUM, COUNT, MAX, and MIN, with and without SIMD, broken into branch misprediction penalty and other cost.]

Additional Work
– Efficient SMT parallelism for single operators [Zhou et al. 2005, Garcia et al. 2006].
– TLB optimization: more passes over the data to avoid TLB thrashing (e.g., data partitioning) [Manegold et al. 2004].
– Use of co-processors (e.g., the GPU): fast sorting [Govindaraju et al. 2006], spatial query primitives [Bandi et al. 2004].
– Use of other commodity hardware: content-addressable memory (CAM) [Bandi et al. 2005].

Future Challenges I
Multi-core, with more threads per core. DBs are already parallel (e.g., Oracle will run on Niagara), but:
– Cache interference (or underutilization) on each chip.
– Flipside: no cache coherence problem with a shared L2 cache. Move away from complete shared-nothing?
– Bandwidth in and out of the chip is now a bottleneck: need to model on-chip and off-chip communication costs.
– New trade-offs: speculative prefetching becomes a bad idea.

Future Challenges II
Making use of multiple HW components:
– CPU + GPU.
– PPE + SPEs on Cell.
– Various kinds of accelerators (commodity!).
Best use of the whole system:
– Configurability, tunability.
– Programmability.
– Cost models.

Conclusions
– DBMS software is changing: evolution and revolution in response to new hardware.
– More changes lie ahead for multi-core systems; the existing parallel designs are a head start.
– What's the best HW mix for databases? E.g., is a GPU a server component?