The Set Query Benchmark
Patrick E. O’Neil
Department of Mathematics and Computer Science
University of Massachusetts at Boston
Hak Chan Lee
School of Computing, Soong Sil Univ.
[email protected]
Table of Contents
• Introduction to the benchmark
• An application of the benchmark
• How to run the set query benchmark
Introduction to the benchmark
• Strategic value data applications
– Marketing Information System
– Decision Support System
– Management Reporting
– Direct Marketing
• Set Query Benchmark
– To aid decision makers who require performance data
relevant to strategic value data applications.
• Database
– IBM’s DB2
– Computer Corporation of America’s MODEL 204
• Performance
– I/O, CPU time, elapsed time
Features of The Set Query
Benchmark
• Four key characteristics of Set Query
– Portability
• The queries for the benchmark are specified in SQL, which is
available on most systems.
– Functional Coverage
• Prospective users should be able to derive a rating for the particular
subset of queries they expect to use most in the system they are planning
– Selectivity Coverage
• Applies to a clause of a query, and indicates the proportion of
rows of the database to be selected.
• A single row has very high selectivity
– Scalability
• The database has a single table, known as the BENCH table,
which contains an integer multiple of 1 million rows of 200 bytes
each.
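The selectivity coverage idea above can be sketched with a small helper. This is illustrative only and not part of the benchmark; it assumes the BENCH columns hold values uniformly distributed over 1..N, where N is the column's cardinality.

```python
def expected_selectivity(cardinality):
    """Fraction of BENCH rows an exact-match clause KN = v is expected
    to select, assuming values are uniform over 1..cardinality."""
    return 1.0 / cardinality

# On a 1,000,000-row BENCH table:
rows = 1_000_000
# K2 = 2 selects about half the rows (low selectivity):
low = expected_selectivity(2) * rows
# A match on the sequential key selects a single row (very high selectivity):
high = expected_selectivity(1_000_000) * rows
```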
Definition of the BENCH Table
• BENCH table
– 13 indexed columns (length 4 each)
– Character columns
• s1 (length 8)
• s2 through s8 (length 20 each)
• Never used in retrieval queries
– (13*4) + (1*8) + (7*20) = 200 (bytes)
– Cardinality (number of distinct values)
– Where a BENCH table of more than 1 million rows
is created, the three highest-cardinality columns
are either renamed or reinterpreted in a consistent
way
First 10 rows of the BENCH
database
Achieving Functional Coverage
• Five companies with state-of-the-art strategic
information applications were contacted
• Three general strategic value data applications
– Document search
– Direct marketing
– Decision support / management reporting
Document search
• The “documents” can represent any information
• (i) A COUNT of records with a single exact match condition, known as
query Q1:
Q1:
SELECT COUNT (*) FROM BENCH
WHERE KN = 2;
For each KN ∈ {KSEQ, K100K, …, K4, K2}
• (ii) A COUNT of records from a conjunction of two exact match
conditions: query Q2A:
Q2A:
SELECT COUNT (*) FROM BENCH
WHERE K2 = 2 AND KN = 3;
For each KN ∈ {KSEQ, K100K, …, K4}
Or
Q2B:
SELECT COUNT (*) FROM BENCH
WHERE K2 = 2 AND NOT KN = 3;
For each KN ∈ {KSEQ, K100K, …, K4}
Document search (cont.)
• (iii) A retrieval of data (not counts) given constraints of three
conditions, including range conditions (Q4A), or constraints of
five conditions (Q4B).
Q4:
SELECT KSEQ, K500K FROM BENCH
WHERE constraint with (3 or 5) conditions;
Direct Marketing
• Goal
– To identify a list of households which are most
likely to purchase a given product or service
• Approach to selecting such a list
– Preliminary sizing and exploration of possible
criteria for selection
• R.L. Polk
• reinforced Q2A and Q2B
– Retrieving the data from the records for a mailing
or other communication
• Saks Fifth Avenue
• reinforced Q4A and Q4B
Direct Marketing (cont.)
Q3A:
SELECT SUM (K1K) FROM BENCH
WHERE KSEQ BETWEEN 400000 AND
500000 AND KN = 3;
For each KN ∈ {K100K, …, K4}
Q3B:
SELECT SUM (K1K) FROM BENCH
WHERE (KSEQ BETWEEN 400000 AND 410000
OR
KSEQ BETWEEN 420000 AND 430000
OR
KSEQ BETWEEN 440000 AND 450000
OR
KSEQ BETWEEN 460000 AND 470000
OR
KSEQ BETWEEN 480000 AND 500000)
AND KN = 3;
For each KN ∈ {K100K, …, K4}
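As a sanity check on Q3B's predicate, a short helper (illustrative, not part of the benchmark) counts how many KSEQ values the five BETWEEN ranges cover:

```python
def rows_covered(ranges):
    """Number of integer KSEQ values covered by the given inclusive
    ranges (SQL BETWEEN is inclusive on both ends)."""
    return sum(hi - lo + 1 for lo, hi in ranges)

# The five KSEQ ranges from Q3B:
Q3B_RANGES = [(400000, 410000), (420000, 430000), (440000, 450000),
              (460000, 470000), (480000, 500000)]
```

With these ranges, the predicate spans 60,005 KSEQ values, so on a 1-million-row table Q3B scans roughly 6% of the key range before applying the KN = 3 match.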
Decision Support and
Management Reporting
Q5:
SELECT KN1, KN2, COUNT (*) FROM BENCH
GROUP BY KN1, KN2;
For each
(KN1, KN2) ∈ {(K2, K100), (K10, K25), (K10, K25)}
Q6A:
SELECT COUNT (*) FROM BENCH B1, BENCH B2
WHERE B1.KN = 49 AND B1.K250K = B2.K500K;
For each KN in the benchmark's specified set of columns
Q6B:
SELECT B1.KSEQ, B2.KSEQ FROM BENCH B1, BENCH B2
WHERE B1.KN = 99
AND B1.K250K = B2.K500K
AND B2.K25 = 19;
For each KN in the benchmark's specified set of columns
Running the Benchmark
• Architectural Complexity
• A Single Unifying Criterion
• Confidence in the Result
• Output Formatting
Architectural Complexity
• Measuring these two products exposes an enormous
number of architectural features peculiar to each.
• A perfect apples-to-apples comparison between two
products on every feature is generally impossible.
• MODEL 204
– Uses standard (random) I/O
– Existence Bit Map
• Enables much more efficient negation queries.
• DB2
– Uses Random I/O and Prefetch I/O
A Single Unifying Criterion
• Sum up our benchmark results with a single figure
– Dollar Price per Query Per Second ($PRICE/QPS)
Confidence in the Result
• Table 2.2.1, containing Q1 measurements, displays a
maximum elapsed time of 39.69 seconds in one case
and 0.31 seconds in the other
• How can we have confidence that we have not made
a measurement or tuning mistake when confronted
with variation of this kind?
– The answer lies in understanding the two system
architectures.
Output Formatting
• The query results are directed to a user file which
represents the answers in printable form.
– About half the reports return very few records (only counts)
– About half return a few thousand records.
An Application of the Benchmark
• Hardware/Software environment
– DB2
• Version 2.2 with 1200 4-KByte memory buffers
– MODEL 204
• Version 2.1 with 800 6-KByte buffers
– 3380 disk drive
– Standalone 4381-3, a 4.5 MIPS dual processor.
– OS
• MVS XA 2.2 Operating System
Statistics Gathered
• RUNSTATS
– To update the catalog tables, such as
SYSCOLUMNS and SYSINDEXES
– Examine the data
– Accumulate statistics
Statistics Gathered (cont.)
• The rows of the BENCH database were
loaded in the DB2 and MODEL 204 database
in identical fashion.
• Indices in both cases were loaded with leaf
nodes 95% full
Statistics Gathered (cont.)
• MODEL 204
– Larger pages (6 KBytes)
– B-tree indices have complex indirect inversion structures
– B-tree entries need not contain lists of RIDs of rows with
duplicate values, but may instead point to a list or “bitmap” of
such RIDs, which is more efficient
Statistics Gathered (cont.)
• In MODEL 204
– The native “User Language” was used to
formulate these queries, rather than SQL.
Statistics Gathered (cont.)
• Q1: Count, Single Exact Match
SELECT COUNT (*) FROM BENCH
WHERE KN = 2;
For each KN ∈ {KSEQ, K100K, …, K4, K2}
Statistics Gathered (cont.)
• DB2 times come from DB2PM Class 1
• Class 1 statistics
– Include time spent by SPUFI application and
cross-region communication
• Class 2 statistics
– Include only time spent within DB2
• Total number of pages read (N)
– R : random reads
– P : prefetch reads (32 pages each)
– N = R + 32 * P
• The DB2 slowdown is not caused by the I/O
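The page-count accounting above can be sketched as a one-line helper. This is illustrative only; the 32-pages-per-prefetch factor is the DB2 figure given on the slides.

```python
def total_pages_read(random_reads, prefetch_reads):
    """Total pages read: N = R + 32 * P.

    R single-page random reads, plus P prefetch I/Os that each
    fetch 32 pages in DB2.
    """
    return random_reads + 32 * prefetch_reads
```

For example, 100 random reads plus 50 prefetch reads account for 1,700 pages, which is why comparing raw I/O counts across the two systems requires this normalization.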
Statistics Gathered (cont.)
• Q2: Count ANDing two Clauses
Q2A:
SELECT COUNT (*) FROM BENCH
WHERE K2 = 2 AND KN = 3;
For each KN ∈ {KSEQ, K100K, …, K4}
Statistics Gathered (cont.)
Q2B:
SELECT COUNT (*) FROM BENCH
WHERE K2 = 2 AND NOT KN = 3;
For each KN ∈ {KSEQ, K100K, …, K4}
Statistics Gathered (cont.)
• Q3: Sum, Range and Match Clause
Q3A:
SELECT SUM (K1K) FROM BENCH
WHERE KSEQ BETWEEN 400000 AND 500000 AND KN = 3;
For each KN ∈ {K100K, …, K4}
Statistics Gathered (cont.)
Q3B:
SELECT SUM (K1K) FROM BENCH
WHERE (KSEQ BETWEEN 400000 AND 410000
OR
KSEQ BETWEEN 420000 AND 430000
OR
KSEQ BETWEEN 440000 AND 450000
OR
KSEQ BETWEEN 460000 AND 470000
OR
KSEQ BETWEEN 480000 AND 500000)
AND KN = 3;
For each KN ∈ {K100K, …, K4}
Statistics Gathered (cont.)
• Q4: Multiple Condition Selection
Q4:
SELECT KSEQ, K500K FROM BENCH
WHERE constraint with (3 or 5) conditions;
• Condition Sequence
– (1) K2 = 1;
– (2) K100 > 80;
– (3) K100K BETWEEN 2000 AND 3000;
– (4) K5 = 3;
– (5) K25 IN (11, 19);
– (6) K4 = 3;
– (7) K100 < 41;
– (8) K1K BETWEEN 850 AND 950;
– (9) K10 = 7;
– (10) K25 IN (3, 4);
Statistics Gathered (cont.)
• Q5: Multiple Column GROUP BY
Q5:
SELECT KN1, KN2, COUNT (*) FROM BENCH
GROUP BY KN1, KN2;
For each
(KN1, KN2) ∈ {(K2, K100), (K10, K25), (K10, K25)}
Statistics Gathered (cont.)
• Q6: Join Condition
Q6A:
SELECT COUNT (*) FROM BENCH B1, BENCH B2
WHERE B1.KN = 49 AND B1.K250K = B2.K500K;
For each KN in the benchmark's specified set of columns
Statistics Gathered (cont.)
Q6B:
SELECT B1.KSEQ, B2.KSEQ FROM BENCH B1, BENCH B2
WHERE B1.KN = 99
AND B1.K250K = B2.K500K
AND B2.K25 = 19;
For each KN in the benchmark's specified set of columns
Single Measure Result
• The ultimate measure of the Set Query
benchmark for any platform measured is dollar
price per query per second ($PRICE/QPS).
How to Run the Set Query
Benchmark
• Generating the Data
• Running the Queries
• Interpreting the Result
• Configuration and Reporting Requirements – a Checklist
• Concluding Remarks
Generating the Data
• A pseudo-random number generator
suggested by Ward Cheney and David
Kincaid is used
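As a rough illustration, data generation might be sketched as below. This is a simplified stand-in, not the benchmark's official generator: it uses Python's `random` module rather than the Cheney-Kincaid generator the slides specify, and the column set (including K40K and K10K, which these slides do not list) and the uniform 1..N value ranges are assumptions based on the 13-column, 200-byte BENCH definition.

```python
import random

# Assumed cardinalities: each KN column draws uniformly from 1..N,
# where N is the cardinality embedded in the column name.
CARDINALITIES = {
    "K500K": 500_000, "K250K": 250_000, "K100K": 100_000,
    "K40K": 40_000, "K10K": 10_000, "K1K": 1_000,
    "K100": 100, "K25": 25, "K10": 10, "K5": 5, "K4": 4, "K2": 2,
}

def gen_bench_rows(n_rows, seed=1):
    """Yield BENCH-like rows: KSEQ is sequential, the 12 other indexed
    columns are pseudo-random, and S1..S8 are fixed character filler
    (8 bytes for S1, 20 bytes each for S2..S8)."""
    rng = random.Random(seed)
    for kseq in range(1, n_rows + 1):
        row = {"KSEQ": kseq}
        for col, card in CARDINALITIES.items():
            row[col] = rng.randint(1, card)
        row["S1"] = "A" * 8
        for i in range(2, 9):
            row[f"S{i}"] = "B" * 20
        yield row
```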
Generating the Data (cont.)
• Loading the Data
– Indexes for 13 columns
– Indices
• B-tree type (if available)
• Loaded 95% full
– The Set Query benchmark permits the data to be
loaded on as many disk devices as desired.
Running the Queries
• To support tuning decisions regarding
memory purchase, the Set Query
benchmark offers two approaches
– It is possible to scale the BENCH table to
any integer multiple of one million rows.
– The benchmark measurements are
normalized by essentially flushing the
buffers between successive queries, so
that the number of I/Os is maximum
Interpreting the result
• A common demand of decision makers who
require benchmark results is a single number.
• Calculating the $/QPS Rating
– The calculation of the $PRICE/QPS rating for the
Set Query benchmark is quite straightforward in
concept
• 69 queries
• T : Elapsed time to run all of them in series
– 69/T queries per second
• P : well-defined dollar price for hardware and software
• $PRICE/QPS : (P*T)/69
Interpreting the result (cont.)
• Price calculation starts by summing the
CPU and I/O resources
• To calculate T
– F = SUMELA / SUMCPU
– T = F * TOTCPU
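The calculation above can be sketched as follows. The variable names follow the slides (SUMELA, SUMCPU, TOTCPU); the functions themselves are illustrative, not part of the benchmark specification.

```python
def projected_total_elapsed(sum_elapsed, sum_cpu, total_cpu):
    """T = F * TOTCPU, where F = SUMELA / SUMCPU is the ratio of
    elapsed to CPU time over the measured queries; scaling the total
    CPU time by F projects the total elapsed time T."""
    f = sum_elapsed / sum_cpu
    return f * total_cpu

def price_per_qps(price, total_elapsed, n_queries=69):
    """$PRICE/QPS = P / (n/T) = (P * T) / n, with P the system's
    dollar price and n the 69 benchmark queries."""
    return price * total_elapsed / n_queries
```

For example, a $690,000 system that needs 138 seconds for the 69 queries runs 0.5 queries per second, giving a $1,380,000 $PRICE/QPS rating.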
Configuration and Reporting
Requirements – a Checklist
• The data should be generated in the precise manner specified
• The BENCH table should be loaded on the appropriate number of
stable devices, and resource utilization for the load, together with any
preparatory utilities to improve performance, should be reported
• The exact hardware/software configuration used for the test should be
reported
• The access strategy for each query reported should be explained in the
report, if this capability is offered by the database system
• The benchmark should be run from either an interactive or an
embedded program, and the Page Buffer and Sort Buffer (if separate)
space in MBytes should be reported
• All resource use should be reported for each of the queries
Concluding Remarks
• Variations in resource use translate into
large dollar price differences
• Extensions to the Set Query benchmark
have been suggested by a number of
observers, and a multi-user Set Query
benchmark is under study.