Database Systems on Modern Processors
Kenneth Ross, Columbia University (also visiting IBM T. J. Watson Research Center)
EDGE Workshop, May 2006

Talk Outline
– Intro: what is a DBMS, and what are its performance goals?
– Some history.
– How DBMS researchers are dealing with changes in architecture.
– The future.

Database Management Systems
Store and manage large data collections. Features:
– Concurrency control
– Persistence, recovery
– Integrity (constraints)
– Declarative queries
– Transactions (all or nothing)
– Distribution
– …

Database Management Systems
Focus on relational DBMSs: IBM DB2, Oracle, Sybase, Microsoft SQL Server, MySQL, Postgres, …
– Multi-billion dollar market.
– DBMS servers are a major use of modern computers.
– DBMS architects and HW architects care about the performance of DB workloads.

Performance Scalability
Transaction processing:
– Keep up with very high transaction rates: millions of transactions/minute.
– Single-row updates with locking.
– Benchmark: TPC-C (www.tpc.org).
Query processing:
– Data warehouses with terabytes of data.
– Complex queries over many tables.
– Large data volumes: throughput.
– Benchmark: TPC-H.

Query Processing 101
Each query has many potential execution plans:
– Plans are logically equivalent, but differ in speed.
– The query optimizer explores a space of plans and chooses the plan with the lowest cost (i.e., time).
– Cost model: I/O, CPU, …
– Costs are estimated from the operators used, input sizes, the expected selectivity of predicates, and other DB statistics.
– Each operator has a cost function.
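To make the cost comparison concrete, here is a minimal C sketch of an optimizer choosing the cheaper of two logically equivalent plans. The plan names, cost numbers, and I/O-to-CPU weighting below are hypothetical illustrations, not any real system's cost model:

    #include <stdio.h>

    typedef struct {
        const char *name;
        double io_cost;    /* estimated page reads */
        double cpu_cost;   /* estimated tuples touched */
    } Plan;

    /* Fold resource estimates into one comparable number. A real
     * optimizer derives these terms from operator cost functions,
     * input sizes, and predicate selectivities. */
    static double total_cost(const Plan *p) {
        const double io_weight = 4.0;  /* assumed: 1 page read ~ 4 CPU units */
        return io_weight * p->io_cost + p->cpu_cost;
    }

    int main(void) {
        /* Two logically equivalent plans for the same query. */
        Plan plans[] = {
            { "sequential scan + filter",     10000.0, 1000000.0 },
            { "index scan, selectivity 0.01",   500.0,   20000.0 },
        };
        int n = sizeof plans / sizeof plans[0], best = 0;
        for (int i = 1; i < n; i++)
            if (total_cost(&plans[i]) < total_cost(&plans[best]))
                best = i;
        printf("chosen plan: %s (estimated cost %.0f)\n",
               plans[best].name, total_cost(&plans[best]));
        return 0;
    }

A production optimizer hard-codes none of these numbers; it computes them from catalog statistics, which is why bad statistics lead to bad plans.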
A Bit of History
Database machines (1980s):
– Build specialized processors for databases: associative memory, SIMD/MIMD, etc.
– A major failure: by the time the hardware had been built, commodity hardware had caught up, as had software advances (e.g., better algorithms).
– Lesson: use commodity hardware! (The scope of "commodity" has changed between 1980 and 2006.)

Early-Mid 1990s
Move to shared-nothing parallel systems:
– Predictable performance, with minimal interference between processors.
– Database workloads typically partition well: embarrassing parallelism.
– Good scale-up and speed-up: just add more resources.
– I/O is the bottleneck.

Late 1990s
Disk speed/capacity, memory capacity, and processor speed all increase, but memory latency does not.
Recognition that these improvements were changing the performance profile of a DBMS:
– I/O is no longer the bottleneck.
– L2 data cache misses and L1 instruction cache misses are significant [Ailamaki et al. 99].
– 2 to 5 times the CPI of SPECInt workloads.

What to do about it?
Build a new DBMS from scratch?
– Major investments in existing infrastructure.
– Performance is only one of many goals.
– Research prototypes, e.g., Monet [CWI].
Try to adapt existing systems:
– Constrains the options, but improvements are attainable without re-implementing everything.
– Be configurable to the available hardware.
– Focus on the "inner loops" of the critical paths.

Dealing with Modern Architectures
Data & instruction cache performance:
– Better utilization of the space in each cache line.
– Schedule work to efficiently utilize the data in each cache line.
– Chunk work into units processed by L1-cache-sized code segments.
– Prefetching.
Branches:
– Take mispredictions into account when estimating query cost.
– Avoid branches to get better CPI.
SIMD:
– Speedup due to parallelism and the absence of branch misprediction.
Also: TLB, GPU, SMT, CAM, etc.

Better Cache-line Utilization I
CSB+-Tree [Rao & Ross 2000]:
– A B+-Tree node interleaves keys and child pointers (P0 K1 P1 K2 P2 K3 P3): 4 children per node.
– A CSB+-Tree node stores its 8 children contiguously, so a single pointer P reaches all of them, leaving room for 7 keys (K1–K7) in the same space.
– A "full" CSB+-Tree allocates space for all possible children in advance.
[Figure: elapsed time (s) vs. proportion of searches for the B+-Tree, CSB+-Tree, and full CSB+-Tree, with 4 million leaf entries.]
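The layout change is easy to see in C. Below is an illustrative sketch of the two node formats from the figure (4 children per B+-Tree node vs. 8 contiguous children per CSB+-Tree node), assuming 4-byte keys and the 4-byte pointers of a 32-bit machine; the type and field names are ours, not the paper's:

    #include <stdint.h>

    /* Conventional B+-Tree node: keys and child pointers interleave,
     * so roughly half the node's cache-line budget goes to pointers. */
    typedef struct BNode {
        int32_t       keys[3];     /* K1 K2 K3 */
        struct BNode *child[4];    /* P0 P1 P2 P3 */
    } BNode;

    /* CSB+-Tree node: all children are allocated contiguously as a
     * group, so one pointer to the first child replaces the per-child
     * pointers, leaving room for 7 keys in the same space. */
    typedef struct CSBNode {
        int32_t         keys[7];      /* K1 .. K7 */
        struct CSBNode *first_child;  /* P: children are first_child[0..7] */
    } CSBNode;

    /* Descend one level: child i is reached by arithmetic on the group's
     * base address rather than by loading a stored pointer. */
    static CSBNode *csb_child(const CSBNode *n, int i) {
        return n->first_child + i;
    }

More keys per cache line means fewer cache lines touched per search; the trade-off is that splitting a node means copying an entire contiguous child group, which is why the figure plots performance against the proportion of searches in the workload.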
Better Cache-line Utilization II
Columnwise storage, e.g., Monet [CWI]:
– Only process the columns you need.
PAX [Ailamaki et al. 2001]:
– Each column is stored columnwise within its disk page.
– Good cache behavior for queries; good I/O behavior for row access.
Fractured Mirrors [Ramamurthy et al. 2002]:
– Best of both worlds, with two copies of the data.
CSM [Shao et al. 2004]:
– The in-memory copy keeps just part of the disk page.

Consecutive Lookups to an Index [Zhou & Ross 2003]
[Figure sequence: a stream of probes (keys 2, 11, 1, 20, 9, 19, 3, 5, yielding RIDs r1–r8) descends an index with nodes A–F while the L2 cache holds at most 3 nodes. Successive probes follow different root-to-leaf paths, evicting nodes that later probes need: cache thrashing!]

Buffering Accesses to the Index
[Figure: each subtree below the root keeps a buffer of pending probes; probes are appended to the buffer of the matching subtree and drained in batches, so a node's cache lines are reused by many probes before eviction.]

Impact of Buffering
– Increases temporal and spatial locality.
– Overhead: an extra copy of the data into the buffers.
– Key idea: buffer in units of L2 cache size or less, spanning several tree levels if the fan-out is small.
[Figure: elapsed time (s) on a Pentium 4 with 5M keys and 5M probes: buffered vs. conventional lookups on the B+-Tree, CSB+-Tree, and R-Tree.]

Conventional Demand-Driven Pipelined Execution Model [Zhou & Ross 2004]
– Operators support the open-next-close iterator interface.
– An operator returns control immediately after generating one tuple, so each GetNext() call bounces between the parent and child operators.
– The combined parent and child instruction footprints thrash the L1 I-cache: poor instruction locality and poor branch prediction accuracy.
– Solution: buffering! A child delivers a batch of tuples before returning control, so the execution pattern ABABABA… becomes ABBBBBAAAAAB…

Validation on Postgres

    SELECT SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
           AVG(l_quantity) AS avg_qty,
           COUNT(*) AS count_order
    FROM lineitem
    WHERE l_shipdate <= date '1998-11-01'

(from TPC-H)
– The combined instruction footprint of the Agg (sum, avg, count) and the lineitem TableScan is 23K > 16K of L1 I-cache.
– The buffered plan inserts a Buffer operator between the TableScan and the Agg: L1 instruction cache misses drop by 80%, and elapsed time improves by 12%.
[Figure: elapsed time (s) for the original vs. buffered query plan, broken into L1 instruction cache miss penalty, L2 cache miss penalty, branch misprediction penalty, and other cost.]

Steps: Cache-Resident Transaction Processing [Harizopoulos et al. 2004]
Group similar transactions:
– Reuse L1-cache-resident instructions across many transactions.
– Fast context switch.
– 40% speedup on TPC-C workloads.

Prefetching
Rely on hardware prefetching:
– Use predictable (e.g., sequential) access patterns.
Explicit prefetching:
– Child/sibling cache lines in tree indexes [Chen et al. 2001, 2002].
– Staged access to data [Harizopoulos & Ailamaki 2003].
Work-ahead set [Zhou et al. 2005] [EDGE poster]:
– Use another SMT thread to manage memory access.
– Load rather than prefetch (a prefetch may be dropped).

Branch Mispredictions
Conjunctive selections [Ross 2002]:

    for (i = 0; i < number_of_records; i++) {
        if (all conditions are true, i.e., equal to 1, for record i) {
            answer[j++] = i;
        }
    }

Code Alternatives (I)

    if (f1(R[i]) && f2(R[i])) { … }

versus

    if (f1(R[i]) & f2(R[i])) { … }

The && version does less work if f1 is selective, but it has two conditional branches where the & version has one.

Code Alternatives (II)
A version with no conditional branches:

    for (i = 0; i < number_of_records; i++) {
        answer[j] = i;
        j += (f1(R[i]) & f2(R[i]));
    }

[Figure: measured performance of the alternatives on a 4-way conjunction with simple f1, f2, …]

Query Optimization
A large space of plans, e.g.,

    for (i = 0; i < number_of_records; i++) {
        if ((f1(R[i]) & f2(R[i])) && f3(R[i])) {
            answer[j] = i;
            j += f4(R[i]);
        }
    }

The optimizer's cost function must include a term for the branch misprediction penalty.

Branches and CPI
Branches create control dependencies that limit instruction-level parallelism on superscalar CPUs:
– Avoid dependencies to get better CPI.
Compression/decompression [Zukowski et al. 2006]:
– Branch-free decompression algorithm.
– Decompress at 2-6 GB/sec to get good data bandwidth without hundreds of disks.

SIMD on Modern Processors
E.g., SSE on the Pentium. Comparisons generate masks:
– Convert if-then-else control dependencies into data dependencies.
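As a concrete (if simplified) illustration, here is a branch-free SIMD COUNT under a BETWEEN predicate using SSE2 intrinsics. This is our sketch in the spirit of [Zhou & Ross 2002], not the paper's kernel; it assumes 32-bit signed keys, that n is a multiple of 4, and that the bounds lie strictly inside the int32 range:

    #include <emmintrin.h>   /* SSE2 */
    #include <stdint.h>

    /* Count the records with a <= x[i] <= b, with no conditional branch
     * in the loop body. Assumes a > INT32_MIN, b < INT32_MAX, and
     * n a multiple of 4. */
    int simd_count_between(const int32_t *x, int n, int32_t a, int32_t b) {
        __m128i lo  = _mm_set1_epi32(a - 1);   /* x > a-1  <=>  x >= a */
        __m128i hi  = _mm_set1_epi32(b + 1);   /* x < b+1  <=>  x <= b */
        __m128i acc = _mm_setzero_si128();
        for (int i = 0; i < n; i += 4) {
            __m128i v = _mm_loadu_si128((const __m128i *)(x + i));
            /* Each comparison yields a per-lane mask of all ones (-1)
             * or all zeros: a data dependency instead of a branch. */
            __m128i m = _mm_and_si128(_mm_cmpgt_epi32(v, lo),
                                      _mm_cmplt_epi32(v, hi));
            acc = _mm_sub_epi32(acc, m);       /* subtracting -1 adds 1 */
        }
        int32_t lanes[4];                      /* horizontal sum */
        _mm_storeu_si128((__m128i *)lanes, acc);
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }

The same masks extend to the other aggregates: AND the mask with the values before adding for SUM, or blend with AND/ANDNOT/OR for MAX and MIN.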
Benefits of SIMD for DBs [Zhou & Ross 2002]

    SELECT Agg(R.y) FROM R WHERE R.x BETWEEN a AND b

Table with 1 million records; predicate selectivity 0.2.
[Figure: elapsed time (ms) for SUM, COUNT, MAX, and MIN, with and without SIMD, broken into branch misprediction penalty and other cost.]

Additional Work
– Efficient SMT parallelism for single operators [Zhou et al. 2005, Garcia et al. 2006].
– TLB optimization: more passes over the data to avoid TLB thrashing (e.g., data partitioning) [Manegold et al. 2004].
– Use of co-processors (e.g., the GPU): fast sorting [Govindaraju et al. 2006], spatial query primitives [Bandi et al. 2004].
– Use of other commodity hardware: content-addressable memory (CAM) [Bandi et al. 2005].

Future Challenges I
Multi-core, with more threads per core. DBs are already parallel (e.g., Oracle will run on Niagara), but:
– Cache interference (or underutilization) on each chip.
– Flipside: no cache coherence problem with a shared L2 cache. Move away from complete shared-nothing?
– Bandwidth in and out of the chip is now a bottleneck: need to model on-chip and off-chip communication costs.
– New trade-offs: speculative prefetching becomes a bad idea.

Future Challenges II
Making use of multiple HW components:
– CPU + GPU.
– PPE + SPEs on Cell.
– Various kinds of accelerators (commodity!).
Best use of the whole system:
– Configurability, tunability.
– Programmability.
– Cost models.

Conclusions
– DBMS software is changing: evolution and revolution in response to new hardware.
– More changes lie ahead for multi-core systems; the existing parallel designs are a head start.
– What's the best HW mix for databases? E.g., is a GPU a server component?