Design and Evaluation of Architectures for Commercial Applications
Part I: benchmarks
Luiz André Barroso
Western Research Laboratory
UPC, February 1999

Why should architects learn about commercial applications?
- Because they are very different from typical benchmarks
- Because they are demanding on many interesting architectural features
- Because they are driving the sales of mid-range and high-end systems

Shortcomings of popular benchmarks
- SPEC: uniprocessor-oriented; small cache footprints; exacerbates the impact of CPU core issues
- SPLASH: small cache footprints; extremely optimized sharing
- STREAMS: no real sharing/communication; mainly bandwidth-oriented

SPLASH vs. Online Transaction Processing (OLTP)
A typical SPLASH app has, compared to an OLTP app:
- more than 3x the issue rate
- ~26x fewer cycles spent in memory barriers
- 1/4 of the TLB miss ratio
- less than 1/2 the fraction of cache-to-cache transfers
- ~22x smaller instruction cache miss ratio
- ~1/2 the L2 cache miss ratio

But the real reason we care? $$$!
- Server market total: over $50 billion
- Numeric/scientific computing: under $2 billion
- The remaining $48 billion: OLTP, DSS, Internet/Web
- The trend is for numerical/scientific computing to remain a niche

Relevance of the server vs. PC market
- High profit margins
- Performance is a differentiating factor
- If you sell the server you will probably sell: the client, the storage, the networking infrastructure, the middleware, the service...
Need for speed in the commercial market
- Applications pushing the envelope: enterprise resource planning (ERP), electronic commerce, data mining/warehousing, ADSL servers
- Specialized solutions: Intel splitting the Pentium line into 3 tiers; Oracle's "raw iron" initiative; Network Appliance's machines

Seminar disclaimer
- Hardware-centric approach: the target is building better machines, not better software; focus on fundamental behavior, not on software "features"
- Stick to the general-purpose paradigm
- Emphasis on CPU + memory system issues
- Lots of things missing: object-relational and object-oriented databases; public-domain/academic database engines; many others

Overview
Day I: Introduction and workloads
- Background on commercial applications
- Software structure of a commercial RDBMS
- Standard benchmarks: TPC-B, TPC-C, TPC-D, TPC-W
- Cost and pricing trends
- Scaling down TPC benchmarks

Overview (2)
Day II: Evaluation methods/tools
- Introduction
- Software instrumentation (ATOM)
- Hardware measurement & profiling: IPROBE, DCPI, ProfileMe
- Tracing & trace-driven simulation
- User-level simulators
- Complete machine simulators (SimOS)

Overview (3)
Day III: Architecture studies
- Memory system characterization
- Out-of-order processors
- Simultaneous multithreading
- Final remarks

Background on commercial applications
Database applications:
- Online Transaction Processing (OLTP): massive number of short queries; read/update indexed tables; canonical example: a banking system
- Decision Support Systems (DSS): smaller number of complex queries; mostly read-only over large (non-indexed) tables; canonical example: business analysis

Background (2)
Web/Internet applications:
- Web server: many requests for small/medium files
- Proxy: many short-lived connection requests; content caching and coherence
- Web search index: DSS with a Web front-end
- E-commerce site: OLTP with a Web
front-end

Background (3)
Common characteristics:
- Large amounts of data manipulation
- Interactive response times required
- Highly multithreaded by design: suitable for large multiprocessors
- Significant I/O requirements
- Extensive/complex interactions with the operating system
- Require robustness and resiliency to failures

Database performance bottlenecks
- I/O-bound until recently (Thakkar, ISCA '90)
- Many improvements since then: multithreading of the DB engine; I/O prefetching; VLM (very large memory) database caching; more efficient OS interactions; RAIDs; non-volatile DRAM (NVDRAM)
- Today's bottlenecks: the memory system and processor architecture

Structure of a database workload
- Clients: simple logic checks
- Application server (optional): formulates and issues the DB query
- Database server: executes the query

Who is who in the database market?
- DB engine: Oracle is dominant; other players: Microsoft, Sybase, Informix
- Database applications: SAP is dominant; other players: Oracle Apps, PeopleSoft, Baan
- Hardware players: Sun, IBM, HP and Compaq

Who is who in the database market?
(2)
- Historically, mainly mainframes running proprietary OSes
- Today: Unix 40%, NT 8%, proprietary 52%
- In two years: Unix 46%, NT 19%, proprietary 35%

Overview of an RDBMS: Oracle8
- Similar in structure to most commercial engines
- Runs on uniprocessors, SMP multiprocessors, and NUMA multiprocessors
- For clusters or message-passing multiprocessors: Oracle Parallel Server (OPS)

The Oracle RDBMS
Physical structure:
- Control files: basic info on the database, its structure and status
- Data files: tables (the actual database data); indexes (sorted lists of pointers to data); rollback segments (keep data for recovery upon a failed transaction)
- Log files: compressed storage of DB updates

Index files
- Critical in speeding up access to data by avoiding expensive scans
- The more selective the index, the faster the access
- Drawbacks: very selective indexes may occupy lots of storage; updates to indexed data are more expensive

Files or raw disk devices
- Most DB engines can access disks directly as raw devices; the idea is to bypass the file system
- Manageability/flexibility somewhat compromised
- Performance boost not large (~10-15%)
- Most customer installations use file systems

Transactions & rollback segments
- A single transaction can access/update many items
- Atomicity is required: a transaction either happens or it doesn't
- Example: bank transfer
    Transaction A (accounts X, Y; value M) {
      read account balance(X)
      subtract M from balance(X)
      add M to balance(Y)
      commit
    }
- On failure: the old value of balance(X) is kept in a rollback segment; rollback restores the old values and releases all locks

Transactions & log files
- A transaction is only committed after its side effects are in stable storage
- Writing all modified DB blocks would be too expensive: random disk writes are costly; a whole DB block has to be written back; no coalescing of updates
- Alternative: write only a log of modifications: sequential I/O writes (enables
NVDRAM optimizations); batching of multiple commits
- A background process periodically writes dirty data blocks out

Transactions & log files (2)
- When a block is written to disk, its log file entries are deleted
- If the system crashes, in-memory dirty blocks are lost
- The recovery procedure goes through the log files and applies all updates to the database

Transactions & concurrency control
- Many transactions are in flight at any given time
- Locking of data items is required
- Lock granularity: table, block, or row
- Efficient row-level locking is needed for high transaction throughput

Row-level locking
- Each new transaction is assigned a unique ID
- A transaction table keeps track of all active transactions
- Lock: write the ID in the directory entry for the row
- Unlock: remove the ID from the transaction table, releasing all of that transaction's locks simultaneously
[Diagram: data blocks whose directory entries hold transaction IDs (e.g., 120, 230, 233) pointing into a transaction table of active transactions (233, 234, 235)]

Transaction read consistency
- A transaction that reads a full table should see a consistent snapshot
- For performance, reads shouldn't lock the table
- Problem: intervening writes
- Solution: leverage the rollback mechanism; an intervening write saves the old value in a rollback segment

Oracle: software structure
Server processes:
- Servers: actual execution of transactions
- DB writer: flushes dirty blocks to disk
- Log writer: writes redo logs to disk at commit time
- Process and system monitors: misc.
activity monitoring and recovery
- Processes communicate through the SGA and IPC

Oracle: software structure (2)
System Global Area (SGA):
- SGA: a shared memory segment mapped by all processes
- Block buffer area: a cache of database blocks; the larger portion of physical memory
- Metadata area (redo buffers, data dictionary, shared pool, fixed region): where most communication takes place: synchronization structures, shared procedures, directory information
[Diagram: SGA layout, with the metadata area followed by the block buffer area in order of increasing virtual address]

Oracle: software structure (3)
- Hiding I/O latency: many server processes per processor; a large block buffer area
- Process dynamics: a server reads/updates the database (allocating entries in the redo buffer pool); at commit time the server signals the Log writer and sleeps; the Log writer wakes up, coalesces multiple commits, and issues the log file write; after the log is written, the Log writer signals the suspended servers

Oracle: NUMA issues
- The single SGA region complicates NUMA localization
- The single Log writer process becomes a bottleneck
- Oracle8 is incorporating NUMA-friendly optimizations
- Current large NUMA systems use OPS even on a single address space

Oracle Parallel Server (OPS)
- Runs on clusters of SMPs/NUMAs
- Layered on top of the RDBMS engine
- Data is shared through disk
- Performance is very dependent on how well the data can be partitioned
- Not supported by most application vendors

Running Oracle: other issues
- Most memory is allocated to the block buffer area
- Need to eliminate OS double buffering
- Best performance is attained by limiting process migration
- In large SMPs, dedicating one processor to I/O may be advantageous

TPC Database Benchmarks
- Transaction Processing Performance Council (TPC), established about 10 years ago
- Mission: define representative benchmark standards for vendors (hardware/software) to compare their products
- Focus on both performance and price/performance
- Strict rules about how the benchmark is run
- The only widely used benchmarks

TPC pricing rules
- Must include all hardware (server, I/O, networking, switches, clients), all software (OS, any middleware, database engine), and a 5-year maintenance contract
- Can include usual discounts
- Audited components must be products

TPC history of benchmarks
- TPC-A: the first OLTP benchmark; based on Jim Gray's Debit-Credit benchmark
- TPC-B: a simpler version of TPC-A; meant as a stress test of the server only
- TPC-C: the current TPC OLTP benchmark; much more complex than TPC-A/B
- TPC-D: the current TPC DSS benchmark
- TPC-W: a new Web-based e-commerce benchmark

The TPC-B benchmark
- Models a bank with many branches (tables: Branch, Teller, Account, History)
- 1 transaction type: account update
    Begin transaction
    Update account balance
    Write entry in history table
    Update teller balance
    Update branch balance
    Commit
- Metrics: tpsB (transactions/second) and $/tpsB
- Scale requirement: 1 tpsB needs 100,000 accounts

TPC-B: other requirements
The system must be ACID:
- (A)tomicity: transactions either commit or leave the system as if they were never issued
- (C)onsistency: transactions take the system from one consistent state to another
- (I)solation: concurrent transactions execute as if in some serial order
- (D)urability: the results of committed transactions are resilient to faults

The TPC-C benchmark
- The current TPC OLTP benchmark; moderately complex OLTP
- Models a wholesale supplier managing orders
- Workload consists of five transaction types
- Users and database scale linearly with throughput
- Specification was approved July 23, 1992

TPC-C: schema
[Schema diagram: table cardinalities as a function of the number of warehouses W: Warehouse (W), District (W*10), Customer (W*30K), History (W*30K+), Order (W*30K+), New-Order (W*5K), Order-Line (W*300K+), Stock (W*100K), Item (100K, fixed); one-to-many relationships and secondary indexes]

TPC-C: transactions
- New-order: enter a new order from a customer
- Payment: update customer
balance to reflect a payment
- Delivery: deliver orders (done as a batch transaction)
- Order-status: retrieve the status of a customer's most recent order
- Stock-level: monitor warehouse inventory

TPC-C: transaction flow
1. Select a transaction from the menu (New-Order 45%, Payment 43%, Order-Status 4%, Delivery 4%, Stock-Level 4%); measure menu response time
2. Input screen; keying time; measure transaction response time
3. Output screen; think time; go back to 1

TPC-C: other requirements
- Transparency: tables can be split horizontally and vertically, provided it is hidden from the application
- Skew: 1% of new-order txns go to a random remote warehouse; 15% of payment txns go to a random remote warehouse
- Metrics: performance in new-order transactions/minute (tpmC); cost/performance in $/tpmC

TPC-C: scale
- Maximum of 12 tpmC per warehouse
- Consequently, a quad-Xeon system today (~20,000 tpmC) needs over 1668 warehouses and over 1 TB of disk storage!
- That's a VERY expensive benchmark to run!
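The warehouse count above follows directly from the 12 tpmC cap. A minimal sketch of the arithmetic; the per-warehouse storage figure is an assumption back-solved from the slide's ~1 TB total, not part of the TPC-C specification:

```python
import math

TPMC_PER_WAREHOUSE = 12     # TPC-C caps throughput at 12 tpmC per warehouse
target_tpmc = 20_000        # quad-Xeon result cited above
gb_per_warehouse = 0.6      # assumed storage per warehouse (data + indexes)

# Minimum warehouses so the per-warehouse cap does not limit throughput
warehouses = math.ceil(target_tpmc / TPMC_PER_WAREHOUSE)
storage_tb = warehouses * gb_per_warehouse / 1024

print(warehouses, round(storage_tb, 2))   # 1667 0.98
```

Straight division gives a minimum of 1667 warehouses (the slide quotes "over 1668"); either way, roughly a terabyte of disk just to be allowed to report the number.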
TPC-C: side effects of the skew rules
- A very small fraction of transactions go to remote warehouses
- The transparency rules allow data partitioning
- Consequence: clusters of powerful machines show exceptional numbers
- Compaq holds the current TPC-C record of over 100 KtpmC with an 8-node Memory Channel cluster
- The skew rules are expected to change in the future

The TPC-D benchmark
- The current DSS benchmark from TPC
- A moderately complex decision-support workload
- Models a worldwide reseller of parts
- Queries ask real-world business questions
- 17 ad hoc DSS queries (Q1 to Q17) and 2 update queries

TPC-D: schema
[Schema diagram: Customer (SF*150K), Nation (25), Region (5), Order (SF*1500K), Supplier (SF*10K), Part (SF*200K), LineItem (SF*6000K), PartSupp (SF*800K)]

TPC-D: scale
- Unlike TPC-C, scale is not tied to performance
- Size is determined by a Scale Factor (SF) = {1, 10, 30, 100, 300, 1000, 3000, 10000}
- SF=1 means a 1 GB database
- The majority of current results are in the 100 GB and 300 GB range
- Indices and temporary tables can significantly increase the total disk capacity (3-5x is typical)

TPC-D example query
Forecasting Revenue Query (Q6): this query quantifies the amount of revenue increase that would have resulted from eliminating company-wide discounts in a given percentage range in a given year.
Asking this type of "what if" query can be used to look for ways to increase revenues. It considers all line items shipped in a year. Query definition:

    SELECT SUM(L_EXTENDEDPRICE * L_DISCOUNT) AS REVENUE
    FROM LINEITEM
    WHERE L_SHIPDATE >= DATE '[DATE]'
      AND L_SHIPDATE < DATE '[DATE]' + INTERVAL '1' YEAR
      AND L_DISCOUNT BETWEEN [DISCOUNT] - 0.01 AND [DISCOUNT] + 0.01
      AND L_QUANTITY < [QUANTITY]

TPC-D execution rules
- Power Test: queries submitted in a single stream (i.e., no concurrency); each Query Set is a permutation of the 17 read-only queries; sequence: cache flush, then Query Set 0 as an optional untimed warm-up, then the timed sequence UF1, Query Set 0, UF2
- Throughput Test: multiple concurrent query streams (Query Set 1, Query Set 2, ..., Query Set N) plus a single update stream running the update pair UF1, UF2 once per query stream; the whole sequence is timed

TPC-D: metrics
Power metric (QppD), a geometric mean:

    QppD@Size = (3600 * SF) / (prod_{i=1..17} QI(i,0) * prod_{j=1..2} UI(j,0))^(1/19)

where QI(i,0) is the timing interval for query i in stream 0, UI(j,0) is the timing interval for update j in stream 0, and SF is the scale factor.

Throughput metric (QthD), an arithmetic mean:

    QthD@Size = (S * 17 * 3600 / TS) * SF

where S is the number of query streams and TS is the elapsed time of the test (in seconds). Both metrics represent "queries per gigabyte hour".

TPC-D: metrics (2)
Composite Query-per-Hour Rating (QphD)
The power and throughput metrics are combined to get the composite queries per hour:
    QphD@Size = sqrt(QppD@Size * QthD@Size)

Reported metrics are:
- Power: QppD@Size
- Throughput: QthD@Size
- Price/Performance: $/QphD@Size

TPC-D: other issues
- Queries are complex and long-running
- It is crucial that the DB engine parallelizes queries for acceptable performance
- The quality of the query parallelizer is the most important factor
- Large improvements are still observed from generation to generation of software

The TPC-W benchmark
- Just introduced
- Represents a business that markets and sells over the Internet
- Includes security/authentication
- Uses dynamically generated pages (e.g., CGI scripts)
- Metric: Web Interactions Per Second (WIPS)
- Transactions: browse, shopping-cart, buy, user-registration, and search

A look at current audited TPC-C systems
- Leader in price/performance: Compaq ProLiant 7000-6/450, MS SQL Server 7.0, NT: 4x 450 MHz Xeons, 2 MB cache, 4 GB DRAM, 1.4 TB disk; 22,479 tpmC, $18.84/tpmC
- Leader in non-cluster performance: Sun Enterprise 6500, Sybase 11.9, Solaris 7: 24x 336 MHz UltraSPARC IIs, 4 MB cache, 24 GB DRAM, 4 TB disk; 53,050 tpmC, $76.00/tpmC

Audited TPC-C systems: price breakdown
Server sub-component prices:

                 Compaq ProLiant    Sun E6500
    $/CPU        $4,816.00          $15,375.00
    $/MB DRAM    $3.92              $9.16
    $/GB Disk    $145.33            $382.03

[Chart: server price breakdown into Base, CPU, Memory, and Disk for the Compaq ProLiant and the Sun E6500]

Using TPC benchmarks for architecture studies
- Brute-force approach: use a full audit-sized system. Who can afford it? How can you run it on top of a simulator? How can you explore a wide design space?
- Solution: scaling down the size

Careful Scaling of Workloads
- Identify the architectural issue under study
- Apply appropriate scaling to simplify monitoring and enable simulation studies
- Most scaling experiments are done on real machines; simulation-only is not a viable option!
- Validation through sanity checks and comparison with audit-sized runs

Scaling OLTP
- Forget about TPC compliance
- Determine a lower bound on DB size: monitor contention for smaller tables/indexes; the DB size will change with the number of processors
- I/O bandwidth requirements vary with the fraction of the DB resident in memory; a completely in-memory run has no special I/O requirements
- Favor more small disks over a few large ones
- Place all redo logs on a separate disk
- Reduce OS double-buffering
- Limit the number of transactions executed

Scaling OLTP (2)
- Achieve representative cache behavior: relevant data structures >> size of hardware caches (the metadata area size is key); maintain the same number of processes per CPU as the larger run
- Simplify setup by running clients on the server machine; this requires lighter-weight versions of the clients
- Ensure efficient execution: excessive migration, idle time, and OS or application spinning distort metrics

Scaling DSS
- Determine a lower-bound DB size: sufficient work in the parallel section
- Ensure representative cache behavior: DB >> hardware caches; maintain the same number of processes per CPU as the large run
- Reduce execution time through sampling
- The major difficulty is ensuring representative query plans
- DSS results are more volatile due to improvements in query optimizers

Tuning, tuning, tuning
- Ensure the scaled workload is running efficiently
- Requires a large number of monitoring runs on the actual hardware platform
- Resembles a "black art" on Oracle
- The self-tuning features in Microsoft SQL 7.0 are promising; the ability for user overrides is desirable, but missing

Does Scaling Work?

TPC-C: scaled vs. full size
Breakdown profile of CPU cycles (platform: 8-processor AlphaServer 8400):

    Component      TPC-C, scaled   TPC-C, full-size
    1-issue        8%              11%
    2-issue        8%              8%
    bcache hit     30%             20%
    scache hit     17%             22%
    bcache miss    24%             27%
    tlb            3%              1%
    repl trap      5%              2%
    br/pc mispr.   2%              3%
    mb             3%              6%

Using simpler OLTP benchmarks
Breakdown profile of CPU cycles:

    Component      TPC-B, scaled   TPC-C, full-size
    1-issue        7%              11%
    2-issue        6%              8%
    bcache hit     16%             20%
    scache hit     16%             22%
    bcache miss    37%             27%
    tlb            2%              1%
    repl trap      5%              2%
    br/pc mispr.   2%              3%
    mb             9%              6%

Although "obsolete", TPC-B can be used in architectural studies.

Benchmarks wrap-up
- Commercial applications are complex, but need to be considered during design evaluation
- TPC benchmarks cover a wide range of commercial application areas
- Scaled-down TPC benchmarks can be used for architecture studies
- The architect needs a deep understanding of the workload
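As a closing worked example, the TPC-D power, throughput, and composite metrics defined earlier can be computed directly. All inputs below (SF, the QI and UI timing intervals, S, and TS) are hypothetical values chosen only for illustration, not audited results:

```python
import math

SF = 100             # scale factor: a 100 GB database (hypothetical)
QI = [120.0] * 17    # timing intervals (s) of the 17 queries, stream 0 (made up)
UI = [300.0, 300.0]  # timing intervals (s) of the two update functions (made up)

# Power metric: 3600*SF over the geometric mean of the 19 timed intervals
QppD = 3600.0 * SF / math.prod(QI + UI) ** (1.0 / 19.0)

# Throughput metric: S concurrent streams of 17 queries over elapsed time TS (s)
S, TS = 5, 36_000.0
QthD = (S * 17 * 3600.0 / TS) * SF

# Composite rating: geometric mean of power and throughput
QphD = math.sqrt(QppD * QthD)

print(round(QppD), round(QthD), round(QphD))
```

Because QppD is a geometric mean over timing intervals, a single pathologically slow query drags the power rating down sharply, which is exactly why the quality of the query parallelizer dominates TPC-D results.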