The Set Query Benchmark

Patrick E. O’Neil
Department of Mathematics and Computer Science, University of Massachusetts at Boston

Hak Chan Lee
School of Computing, Soong Sil Univ.
[email protected]

Table of Contents
• Introduction to the benchmark
• An application of the benchmark
• How to run the Set Query benchmark

Introduction to the Benchmark
• Strategic value data applications
  – Marketing Information Systems
  – Decision Support Systems
  – Management Reporting
  – Direct Marketing
• Set Query benchmark
  – Aids decision makers who require performance data relevant to strategic value data applications
• Databases measured
  – IBM’s DB2
  – Computer Corporation of America’s MODEL 204
• Performance metrics
  – I/O, CPU time, elapsed time

Features of the Set Query Benchmark
• Four key characteristics of Set Query:
  – Portability
    • The benchmark queries are specified in SQL, which is available on most systems.
  – Functional Coverage
    • Prospective users should be able to derive a rating for the particular subset of queries they expect to use most in the system they are planning.
  – Selectivity Coverage
    • Selectivity applies to a clause of a query and indicates the proportion of rows of the database to be selected.
    • A clause that selects a single row has very high selectivity.
  – Scalability
    • The database has a single table, known as the BENCH table, which contains an integer multiple of 1 million rows of 200 bytes each.
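To make the scalability point concrete, here is a minimal, illustrative Python sketch of a BENCH-style row generator. It assumes the BENCH column-naming convention (a sequential KSEQ plus indexed columns K2 through K500K, where each column KN draws uniformly from 1..N), omits the character filler columns, and uses Python's stock `random` module rather than the specific pseudo-random generator the benchmark prescribes.

```python
import random

# Indexed columns of the BENCH table: KSEQ is sequential, and each
# other column KN holds a value drawn uniformly from 1..N, so the
# column name encodes its cardinality (number of distinct values).
CARDINALITIES = {
    "K500K": 500_000, "K250K": 250_000, "K100K": 100_000,
    "K40K": 40_000, "K10K": 10_000, "K1K": 1_000,
    "K100": 100, "K25": 25, "K10": 10, "K5": 5, "K4": 4, "K2": 2,
}

def bench_rows(n_rows, seed=1):
    """Yield n_rows BENCH-style rows as dicts (filler strings omitted)."""
    rng = random.Random(seed)
    for kseq in range(1, n_rows + 1):
        row = {"KSEQ": kseq}
        for name, card in CARDINALITIES.items():
            row[name] = rng.randint(1, card)  # uniform in 1..N
        yield row

rows = list(bench_rows(10))
assert rows[0]["KSEQ"] == 1 and rows[-1]["KSEQ"] == 10
assert all(1 <= r["K4"] <= 4 for r in rows)
```

Scaling the table to a larger integer multiple of one million rows is then just a larger `n_rows`; per the benchmark's rule, the three highest-cardinality columns would additionally be reinterpreted in a consistent way.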
Definition of the BENCH Table
• BENCH table
  – 13 indexed columns (length 4 each)
  – Character columns
    • s1 (length 8)
    • s2 through s8 (length 20 each)
    • Never used in retrieval queries
  – Row length: (13*4) + (1*8) + (7*20) = 200 bytes
  – Each indexed column is named for its cardinality (number of distinct values)
  – Where a BENCH table of more than 1 million rows is created, the three highest-cardinality columns are either renamed or reinterpreted in a consistent way

First 10 Rows of the BENCH Database
(sample-rows table not reproduced here)

Achieving Functional Coverage
• Five companies with state-of-the-art strategic information applications were contacted
• Three general strategic value data applications:
  – Document search
  – Direct marketing
  – Decision support / management reporting

Document Search
• The “documents” can represent any information
• (i) A COUNT of records with a single exact-match condition, known as query Q1:
  Q1: SELECT COUNT(*) FROM BENCH WHERE KN = 2;
      For each KN ∈ {KSEQ, K100K, …, K4, K2}
• (ii) A COUNT of records from a conjunction of two exact-match conditions, query Q2A:
  Q2A: SELECT COUNT(*) FROM BENCH WHERE K2 = 2 AND KN = 3;
       For each KN ∈ {KSEQ, K100K, …, K4}
  or Q2B:
  Q2B: SELECT COUNT(*) FROM BENCH WHERE K2 = 2 AND NOT KN = 3;
       For each KN ∈ {KSEQ, K100K, …, K4}

Document Search (cont.)
• (iii) A retrieval of data (not counts) given constraints of three conditions, including range conditions (Q4A), or constraints of five conditions (Q4B):
  Q4: SELECT KSEQ, K500K FROM BENCH
      WHERE <constraint with 3 or 5 conditions>;

Direct Marketing
• Goal
  – To identify a list of households most likely to purchase a given product or service
• Approach to selecting such a list
  – Preliminary sizing and exploration of possible selection criteria
    • R.L. Polk
    • Reinforced Q2A and Q2B
  – Retrieving the data from the records for a mailing or other communication
    • Saks Fifth Avenue
    • Reinforced Q4A and Q4B

Direct Marketing (cont.)
  Q3A: SELECT SUM(K1K) FROM BENCH
       WHERE KSEQ BETWEEN 400000 AND 500000 AND KN = 3;
       For each KN ∈ {K100K, …, K4}
  Q3B: SELECT SUM(K1K) FROM BENCH
       WHERE (KSEQ BETWEEN 400000 AND 410000
           OR KSEQ BETWEEN 420000 AND 430000
           OR KSEQ BETWEEN 440000 AND 450000
           OR KSEQ BETWEEN 460000 AND 470000
           OR KSEQ BETWEEN 480000 AND 500000)
         AND KN = 3;
       For each KN ∈ {K100K, …, K4}

Decision Support and Management Reporting
  Q5: SELECT KN1, KN2, COUNT(*) FROM BENCH GROUP BY KN1, KN2;
      For each (KN1, KN2) ∈ {(K2, K100), (K10, K25)}
  Q6A: SELECT COUNT(*) FROM BENCH B1, BENCH B2
       WHERE B1.KN = 49 AND B1.K250K = B2.K500K;
       For each KN ∈ {K100K, …, K100}
  Q6B: SELECT B1.KSEQ, B2.KSEQ FROM BENCH B1, BENCH B2
       WHERE B1.KN = 99 AND B1.K250K = B2.K500K AND B2.K25 = 19;
       For each KN ∈ {K100K, …, K100}

Running the Benchmark
• Architectural Complexity
• A Single Unifying Criterion
• Confidence in the Results
• Output Formatting

Architectural Complexity
• Measuring these two products exposes an enormous number of architectural features peculiar to each.
• A perfect apples-to-apples comparison between two products on every feature is generally impossible.
• MODEL 204
  – Uses standard (random) I/O
  – Existence Bit Map: can perform much more efficient negation queries
• DB2
  – Uses random I/O and prefetch I/O

A Single Unifying Criterion
• Sum up the benchmark results with a single figure:
  – Dollar price per query per second ($PRICE/QPS)

Confidence in the Results
• Table 2.2.1, containing Q1 measurements, displays a maximum elapsed time of 39.69 seconds in one case and 0.31 seconds in the other.
• How can we have confidence that we have not made a measurement or tuning mistake when confronted with variation of this kind?
  – The answer lies in understanding the two system architectures.

Output Formatting
• The query results are directed to a user file which represents the answers in printable form.
  – About half the reports return very few records (only counts)
  – About half return a few thousand records

An Application of the Benchmark
• Hardware/software environment
  – DB2
    • Version 2.2 with 1200 4-KByte memory buffers
  – MODEL 204
    • Version 2.1 with 800 6-KByte buffers
  – 3380 disk drives
  – Standalone 4381-3, a 4.5 MIPS dual processor
  – OS: MVS XA 2.2 operating system

Statistics Gathered
• RUNSTATS
  – Updates the catalog tables, such as SYSCOLUMNS and SYSINDEXES
  – Examines the data
  – Accumulates statistics

Statistics Gathered (cont.)
• The rows of the BENCH database were loaded into the DB2 and MODEL 204 databases in identical fashion
• Indices in both cases were loaded with leaf nodes 95% full

Statistics Gathered (cont.)
• MODEL 204
  – Larger pages (6 KBytes)
  – B-tree indices have complex indirect inversion structures
  – B-tree entries need not contain lists of RIDs of rows with duplicate values, but may point to a list or “bitmap” of such RIDs
  – More efficient

Statistics Gathered (cont.)
• In MODEL 204, the native “User Language” was used to formulate these queries, rather than SQL

Statistics Gathered (cont.)
• Q1: Count, Single Exact Match
  Q1: SELECT COUNT(*) FROM BENCH WHERE KN = 2;
      For each KN ∈ {KSEQ, K100K, …, K4, K2}

Statistics Gathered (cont.)
• DB2 times come from DB2PM Class 1
  – Class 1 statistics: include time spent by the SPUFI application and cross-region communication
  – Class 2 statistics: include only time spent within DB2
• Total number of pages read (N)
  – R: random reads
  – P: prefetch reads
  – N = R + 32 * P
• The DB2 slowdown is not caused by the I/O

Statistics Gathered (cont.)
• Q2: Count, ANDing Two Clauses
  Q2A: SELECT COUNT(*) FROM BENCH WHERE K2 = 2 AND KN = 3;
       For each KN ∈ {KSEQ, K100K, …, K4}
  Q2B: SELECT COUNT(*) FROM BENCH WHERE K2 = 2 AND NOT KN = 3;
       For each KN ∈ {KSEQ, K100K, …, K4}

Statistics Gathered (cont.)
• Q3: Sum, Range and Match Clauses
  Q3A: SELECT SUM(K1K) FROM BENCH
       WHERE KSEQ BETWEEN 400000 AND 500000 AND KN = 3;
       For each KN ∈ {K100K, …, K4}

Statistics Gathered (cont.)
  Q3B: SELECT SUM(K1K) FROM BENCH
       WHERE (KSEQ BETWEEN 400000 AND 410000
           OR KSEQ BETWEEN 420000 AND 430000
           OR KSEQ BETWEEN 440000 AND 450000
           OR KSEQ BETWEEN 460000 AND 470000
           OR KSEQ BETWEEN 480000 AND 500000)
         AND KN = 3;
       For each KN ∈ {K100K, …, K4}

Statistics Gathered (cont.)
• Q4: Multiple Condition Selection
  Q4: SELECT KSEQ, K500K FROM BENCH
      WHERE <constraint with 3 or 5 conditions>;
• Condition sequence:
  (1) K2 = 1;                        (2) K100 > 80;
  (3) K100K BETWEEN 2000 AND 3000;   (4) K5 = 3;
  (5) K25 IN (11, 19);               (6) K4 = 3;
  (7) K100 < 41;                     (8) K1K BETWEEN 850 AND 950;
  (9) K10 = 7;                       (10) K25 IN (3, 4);

Statistics Gathered (cont.)
• Q5: Multiple Column GROUP BY
  Q5: SELECT KN1, KN2, COUNT(*) FROM BENCH GROUP BY KN1, KN2;
      For each (KN1, KN2) ∈ {(K2, K100), (K10, K25)}

Statistics Gathered (cont.)
• Q6: Join Condition
  Q6A: SELECT COUNT(*) FROM BENCH B1, BENCH B2
       WHERE B1.KN = 49 AND B1.K250K = B2.K500K;
       For each KN ∈ {K100K, …, K100}
  Q6B: SELECT B1.KSEQ, B2.KSEQ FROM BENCH B1, BENCH B2
       WHERE B1.KN = 99 AND B1.K250K = B2.K500K AND B2.K25 = 19;
       For each KN ∈ {K100K, …, K100}

Single Measure Result
• The ultimate measure of the Set Query benchmark for any platform measured is dollar price per query per second ($/QPS).

How to Run the Set Query Benchmark
• Generating the Data
• Running the Queries
• Interpreting the Results
• Configuration and Reporting Requirements – a checklist
• Concluding Remarks

Generating the Data
• The pseudo-random number generator suggested by Ward Cheney and David Kincaid is used

Generating the Data (cont.)
• Loading the Data
  – Indexes on all 13 indexed columns
  – Indices:
    • B-tree type (if available)
    • Loaded 95% full
  – The Set Query benchmark permits the data to be loaded on as many disk devices as desired

Running the Queries
• To support tuning decisions regarding memory purchase, the Set Query benchmark offers two approaches:
  – The BENCH table can be scaled to any integer multiple of one million rows.
  – The benchmark measurements are normalized by essentially flushing the buffers between successive queries, so that the number of I/Os is at its maximum.

Interpreting the Results
• A common demand of decision makers who require benchmark results is a single number.
• Calculating the $PRICE/QPS rating
  – The calculation of the $PRICE/QPS rating for the Set Query benchmark is quite straightforward in concept:
    • 69 queries in all
    • T: elapsed time to run all of them in series
      – 69/T queries per second (QPS)
    • P: a well-defined dollar price for hardware and software
    • $PRICE/QPS = (P * T) / 69

Interpreting the Results (cont.)
• Price calculation starts by summing the CPU and I/O resources
• To calculate T:
    F = SUMELA / SUMCPU
    T = F * TOTCPU

Configuration and Reporting Requirements – a Checklist
• The data should be generated in the precise manner specified.
• The BENCH table should be loaded on the appropriate number of stable devices, and resource utilization for the load, together with any preparatory utilities run to improve performance, should be reported.
• The exact hardware/software configuration used for the test should be reported.
• The access strategy for each query reported should be explained in the report, if this capability is offered by the database system.
• The benchmark should be run from either an interactive or an embedded program, and the page buffer and sort buffer (if separate) space in MBytes should be reported.
• All resource use should be reported for each of the queries.
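The rating arithmetic above can be written out as a short sketch. The figures in the example are hypothetical, not measurements from the study; only the formulas ($PRICE/QPS = P*T/69, F = SUMELA/SUMCPU, T = F*TOTCPU) come from the benchmark.

```python
def price_per_qps(price_dollars, elapsed_seconds, n_queries=69):
    """$PRICE/QPS rating: dollar price P divided by the query rate.
    QPS = n_queries / T, so the rating equals P * T / n_queries."""
    return price_dollars / (n_queries / elapsed_seconds)

def estimated_elapsed(sum_ela, sum_cpu, tot_cpu):
    """T = F * TOTCPU, where F = SUMELA / SUMCPU is the measured
    ratio of elapsed time to CPU time."""
    return (sum_ela / sum_cpu) * tot_cpu

# Hypothetical system: $5,000 total price, 69 queries in 138 s
# gives 0.5 queries per second, hence a $10,000 $PRICE/QPS rating.
print(price_per_qps(5000, 138))  # 10000.0
```

A lower rating is better: it means each sustained query per second costs fewer dollars of hardware and software.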
Concluding Remarks
• Variations in resource use translate into large dollar price differences.
• Extensions to the Set Query benchmark have been suggested by a number of observers, and a multi-user Set Query benchmark is under study.
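As a closing note on the DB2 I/O accounting quoted earlier, the page-count bookkeeping N = R + 32 * P reduces to a trivial helper; the read counts below are made up for illustration, not taken from the study's tables.

```python
def total_pages_read(random_reads, prefetch_reads, pages_per_prefetch=32):
    """Total pages N brought in: one page per random read plus a
    block of pages (32 for DB2 in this study) per prefetch I/O."""
    return random_reads + pages_per_prefetch * prefetch_reads

# Made-up counts: 120 random reads and 40 prefetch I/Os.
print(total_pages_read(120, 40))  # 120 + 32*40 = 1400
```

Comparing N across the two systems is what let the authors conclude that DB2's slowdown on Q1 was not caused by I/O volume.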