Solution Sketches
Qualifying Exam Part 2
COSC 6340 (Data Management)
May 10, 2002
Your Name:
Your SSN:
Your Mailing Address:
Problem 1 [15]: Implementing Joins
Problem 2 [10]: B+-trees
Problem 3 [18]: Query Optimization
Problem 4 [9]: Association Rules
Problem 5 [21]: Multidimensional Database Technology
:
Grade:
The exam is “open books/notes” and you have 105 minutes to complete the exam.
1) Implementing Joins [15]
Assume we have to implement a natural join between two relations R1(A, B, C) and
R2(A, D). A is the primary key of R2, and R1.A is a foreign key referencing R2
(R1[A] ⊆ R2[A]). R1 and R2 are stored as unordered files (blocks are assumed to be 100% full);
R1 contains 200000 tuples that are stored in 500 blocks, and R2 has 50000 tuples that are
stored in 1000 blocks (that is, every R2.A value occurs on average four times as a value
of R1.A). Moreover, assume that only a small buffer of 8 blocks is available. Based
on these assumptions answer the following questions (indicate whether you assume in your
computations that the output relation is written back to disk or not):
a. How many tuples will the output relation R = R1 ⋈ R2 contain? [1]
200000
b. What is the cost for implementing the natural join using the block-nested loop join? [2]
M + (M*N)/(B-2)= 500 + 500*1000/6=…
c. Now assume that either a hashed index on A of R1 or a hashed index on A of R2 is
available (assume that there are no overflow pages). Compute the cost of the index
nested loops join using the index on R1.A, and the cost of the index nested loops join
using the index on R2.A (two computations). [4]
Using the index on R2.A (R1 is the outer relation): 500 + 200000*(1+1) = …
Using the index on R1.A (R2 is the outer relation): 1000 + 50000*(1+4) = …
d. Is it possible to apply the hash-join in this case (explain your answer!)? [2]
No; the hash join requires roughly B > sqrt(500) ≈ 22.4, but only 8 buffer blocks are available.
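As a sanity check of parts b)-d) (not part of the original solution sketch), the small Python script below plugs the given statistics into the cost formulas used above; it also shows the ceiling variant of the block-nested loop formula next to the approximation from b). All costs assume the output relation is not written back to disk.

from math import ceil, sqrt

# Given statistics from the problem statement
M, N = 500, 1000                 # blocks of R1 (outer) and R2 (inner)
tuples_R1, tuples_R2 = 200_000, 50_000
B = 8                            # available buffer blocks

# b) Block-nested loop join, R1 as the outer relation.
bnl_approx = M + M * N / (B - 2)           # 500 + 83,333.3 ~ 83,833
bnl_exact  = M + ceil(M / (B - 2)) * N     # 500 + 84*1000  = 84,500

# c) Index nested loops join (hashed indexes, no overflow pages).
inl_index_on_R2 = M + tuples_R1 * (1 + 1)  # scan R1, 1 match per probe: 400,500
inl_index_on_R1 = N + tuples_R2 * (1 + 4)  # scan R2, 4 matches per probe: 251,000

# d) A (Grace) hash join needs roughly B > sqrt(blocks of the smaller relation).
hash_join_applicable = B > sqrt(min(M, N))  # 8 > 22.4 is False

print(bnl_approx, bnl_exact, inl_index_on_R2, inl_index_on_R1, hash_join_applicable)

With these numbers the block-nested loop join is the cheapest of the computed alternatives, which is consistent with the answer to e) below.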
e. Which of the 4 proposed implementations is the best? [1]
b
f. Now assume that the response time of the methods you evaluated is not satisfactory.
What else could be done to speed things up? [5]
Change the database design: create a new relation R = R2 OUTERJOIN R1 (precomputing the join).
Cost reduces to … (answer omitted)
2) B+ Tree Index Structures [10]
Assume a relation R(A, B, C, D) is given; R is stored as an unordered file and contains
1000000 (1 million) tuples. Attributes A, B, C, D need 4 bytes of storage each, and blocks
have a size of 4096 bytes. Moreover, the following query is given:
Q1: SELECT A, C, D
FROM R
WHERE B < 12222 AND B > 1300
Q1 returns 30000 answers.
a) What is the cost of implementing Q1 assuming no index structures are available? [2]
Scan R; the cost is 3907 block accesses (16 bytes per tuple, 4096/16 = 256 tuples per block, ⌈1000000/256⌉ = 3907).
b) Now assume a B+-tree index using attribute B as the search key is created. What
parameters p (maximum number of pointers in an intermediate node) and m (number of
entries in a leaf node) would be chosen for the B+-tree, and what will be the height of
the B+-tree (assuming that each node of the tree is filled 60%)? You can assume that
B+-tree node pointers require 4 bytes of storage and index pointers also require 4 bytes
of storage. Give reasons for your answers! [4]
p = m = 512 (each node holds at most 4096/8 = 512 pointer/key or entry pairs);
p*m*(0.6^2) ≈ 94000 < 1000000 and p*p*m*(0.6^3) ≈ 29000000 > 1000000; therefore the tree has 3 levels.
c) Based on your answers to b), compute the cost for Q1 assuming the B+-tree index is
used. Explain your computations! [4].
2 (access B+-tree intermediate nodes) + 98 (retrieve the blocks of leaf nodes that
satisfy the selection range in the query: ⌈30000/(512*0.6)⌉ = 98) + 30000 (accesses to blocks of relation R) =
30100
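The parameters and costs in a)-c) can be recomputed with the following Python sketch (added as a check, not part of the original solution); it simply mirrors the formulas above under the stated assumptions (4096-byte blocks, 4-byte attributes and pointers, 60% fill).

from math import ceil

BLOCK = 4096
tuples = 1_000_000
tuple_size = 4 * 4                           # A, B, C, D at 4 bytes each

# a) Full scan of R (unordered file, blocks 100% full).
tuples_per_block = BLOCK // tuple_size       # 256
scan_cost = ceil(tuples / tuples_per_block)  # 3907 block accesses

# b) B+-tree on B: keys, node pointers and index pointers are 4 bytes each.
p = BLOCK // (4 + 4)                         # 512 pointers per intermediate node
m = BLOCK // (4 + 4)                         # 512 (key, pointer) entries per leaf
fill = 0.6
cap_2_levels = (p * fill) * (m * fill)       # ~94,000     < 1,000,000
cap_3_levels = (p * fill) ** 2 * (m * fill)  # ~29,000,000 > 1,000,000  -> height 3

# c) Q1 via the index: 2 non-leaf levels + qualifying leaf blocks
#    + one block access per qualifying tuple (unordered file).
answers = 30_000
leaf_blocks = ceil(answers / (m * fill))     # ceil(30000 / 307.2) = 98
q1_cost = 2 + leaf_blocks + answers          # 30,100

print(scan_cost, cap_2_levels, cap_3_levels, leaf_blocks, q1_cost)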
3) Query Optimization [18]
Assume 4 relations R1(A,B) R2(A,C) R3(A,D) and R4(A,E) and the following SQL
query is given:
SELECT B, C, D
FROM R1, R2, R3, R4
WHERE R1.A=R2.A and R2.A=R3.A and R3.A=R4.A.
a) Give a query execution plan (Chaudhuri calls these physical operator trees) that
implements the above query (it does not need to be the most efficient one) [3]:
b) The System R SQL optimizer only considers linear plans for query execution.
Give an example of a plan for the above query that would not be considered by
the SQL optimizer. [2]
Same as a): the bushy plan ((R1 NJOIN R2) NJOIN (R3 NJOIN R4)) is not linear and
would therefore not be considered.
c) Using the above query as an example, explain the principle of optimality
(explained on page 2 of the Chaudhuri paper). How does it help in reducing the
number of plans that need to be evaluated by the query optimizer? [5]
Principle of optimality: for a query that involves k subexpressions, an optimal
overall plan is composed of optimal plans for its subexpressions; it therefore
suffices to find an optimal plan for each subexpression and to extend those
optimal subplans (one subexpression at a time).
Example (see page 2 of paper)
The PoO reduces the number of plans to be considered from O(n!) to O(n*2^(n-1)).
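To make the principle concrete, the toy Python enumerator below (not taken from the Chaudhuri paper) applies it to the four relations of this query: for every subset of relations only the cheapest plan is kept, and plans for larger subsets are assembled exclusively from those retained optimal subplans. The cost function and base-table sizes are placeholders chosen only for illustration.

from itertools import combinations

relations = ["R1", "R2", "R3", "R4"]
size = {"R1": 500, "R2": 1000, "R3": 200, "R4": 800}   # hypothetical sizes

def toy_cost(left, right):
    # Placeholder cost of joining two subplans: sum of the base-table sizes.
    return sum(size[r] for r in left) + sum(size[r] for r in right)

# best[subset] = (cost, plan); start with the single relations.
best = {frozenset([r]): (0, r) for r in relations}

for k in range(2, len(relations) + 1):
    for subset in combinations(relations, k):
        s = frozenset(subset)
        for l in range(1, k):
            for left in combinations(subset, l):
                lset = frozenset(left)
                rset = s - lset
                # Principle of optimality: only the optimal plans kept for the
                # two sub-subsets are combined; suboptimal subplans were discarded.
                lcost, lplan = best[lset]
                rcost, rplan = best[rset]
                cost = lcost + rcost + toy_cost(lset, rset)
                if s not in best or cost < best[s][0]:
                    best[s] = (cost, f"({lplan} NJOIN {rplan})")

print(best[frozenset(relations)])   # cheapest plan found for the whole query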
d) Another critical problem in query optimization is the propagation of statistical
information. Explain what this problem is by using the query plan you generated
for sub problem a) as an example. [4]
In order to evaluate the cost of the plan ((R1 NJOIN R2) NJOIN (R3 NJOIN R4)) it is
necessary to predict the size and statistical properties of the two intermediate
relations that are obtained by joining R1 and R2, and by joining R3 and R4. The
size and statistical properties of those intermediate relations have to be predicted
by propagating the statistics of R1, R2, R3, and R4; this information is then used
to determine the cost of the third join operation.
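For illustration (this is the standard textbook estimate, not something specific to the Chaudhuri paper), the propagated size of an intermediate join result is typically derived from the base cardinalities and the number of distinct join-key values; the catalog numbers below are hypothetical.

def estimate_join_size(card_r, card_s, distinct_r, distinct_s):
    # Textbook estimate for |R JOIN S| on a common attribute A:
    # |R| * |S| / max(V(A, R), V(A, S))
    return card_r * card_s // max(distinct_r, distinct_s)

# Hypothetical catalog statistics for the base relations of the query.
card       = {"R1": 10_000, "R2": 2_000, "R3": 5_000, "R4": 1_000}
distinct_A = {"R1":  2_000, "R2": 2_000, "R3": 1_000, "R4": 1_000}

# Sizes of R1 NJOIN R2 and R3 NJOIN R4 must be predicted before the
# topmost join can be costed.
size_r1_r2 = estimate_join_size(card["R1"], card["R2"], distinct_A["R1"], distinct_A["R2"])
size_r3_r4 = estimate_join_size(card["R3"], card["R4"], distinct_A["R3"], distinct_A["R4"])
print(size_r1_r2, size_r3_r4)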
e) Chaudhuri states that “a desirable optimizer is one where … the enumeration
algorithm is efficient”. What does this requirement mean? [4]
The enumeration algorithm searches the implementation plan space for a query
for a “good” or optimal solution; to be efficient it needs to have the following
characteristics:
a. it constructs the plans to be considered quickly
b. it employs heuristics that give preference to considering “good” plans
prior to “bad” plans
c. It is complete in the sense that all “promising” plans are considered
4) Association Rule Mining [9]
Compare the approach to association rule mining that relies on support, confidence, and
APRIORI style algorithms with the approach that is advocated by the Cohen/Datar Paper.
What are the main differences between the two approaches? What are the similarities of
the two approaches? Do you think the two approaches are complementary?
Limit your answer to 8-12 sentences!
Solution Sketch
1) Both are interested in mining association rules
2) APRIORI centers on high support, high confidence rules; COHEN/DATAR
centers on high confidence, high correlation rules
3) Both need to find efficient ways to compute item sets that satisfy minimum
confidence requirements
4) Phase 3 in COHEN/DATAR (scanning the database) is the same as in
APRIORI
5) APRIORI first computes the candidate item sets that satisfy the
minimum support requirement and then uses the generated support information
to generate rules; in general, it uses lack of support to prune potential
candidate associations (see the sketch after this list)
6) COHEN/DATAR cannot use support for efficiency; it relies on an approximate
algorithm (occasionally false positives and negatives occur) that creates signatures
for items relying on a complicated hashing scheme. These signatures are then used
to find candidate items with high correlation.
7) The approaches are complementary.
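The support-based pruning mentioned in point 5) can be sketched in a few lines of Python (a generic APRIORI candidate-generation fragment, not taken from either paper): candidate k-item sets are generated only from frequent (k-1)-item sets, and any candidate with an infrequent subset is discarded before the database is scanned.

from itertools import combinations

def apriori_gen(frequent_prev):
    # Generate candidate k-itemsets from the frequent (k-1)-itemsets and
    # prune every candidate that has an infrequent (k-1)-subset.
    frequent_prev = set(frequent_prev)
    candidates = set()
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            if len(union) == len(a) + 1:                          # join step
                if all(frozenset(sub) in frequent_prev
                       for sub in combinations(union, len(a))):   # prune step
                    candidates.add(union)
    return candidates

# Toy example: frequent 2-itemsets -> candidate 3-itemsets.
f2 = [frozenset(x) for x in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]]
print(apriori_gen(f2))   # {frozenset({'a', 'b', 'c'})}; {'b','c','d'} is pruned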
5) Multi-dimensional Database Technology [21]
a) Multi-dimensional database technology has come a long way since its inception.
What are the key features of this technology (what can be done with it)? [4] Limit
your answer to 4-6 sentences.
Solution sketch:
MDT:
a. Allows analyzing measures (of interest) along dimensions using data cubes;
dimensions can be changed dynamically, and the definition of new cubes and the
visualization of cubes are supported.
b. Generalizes spreadsheets to multiple dimensions.
c. Supports interactive KDD for decision support.
d. Supports data warehousing, OLAP, and data mining.
b) The Sarawagi paper suggests associating cells with path exception values
(PathExp). How can this value help data analysts in analyzing OLAP data? How
could PathExp values be computed? [6]
PathExp measures the degree of surprise along each drill-down path
from a cell. This information can alert the analyst to exceptions or
other unusual events that occur at a lower level of granularity, but
not at the current level of granularity. PathExp is computed as the
maximum SelfExp over all cells that can be reached by drill-down
operations along the current path. The paper computes SelfExp as
the difference between the actual value and the expected value based
on the data given at the current cube (divided by the standard
deviation for normalization purposes).
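A minimal sketch of how such values could be computed (assumed structure only; the paper's actual model fits coefficients over all group-bys of the cube, which is omitted here): SelfExp is the normalized deviation of a cell from its expected value, and PathExp of a cell along one drill-down path is the maximum SelfExp over the cells reached on that path.

def self_exp(actual, expected, stddev):
    # Normalized surprise of a single cell.
    return abs(actual - expected) / stddev if stddev else 0.0

def path_exp(children):
    # PathExp for one drill-down path: the maximum SelfExp over all cells
    # reached by drilling down along that path.
    # children is a list of (actual, expected, stddev) triples.
    return max((self_exp(a, e, s) for a, e, s in children), default=0.0)

# Hypothetical drill-down of a cell into four lower-level cells.
cells = [(120, 100, 15), (95, 100, 15), (40, 100, 15), (101, 100, 15)]
print(path_exp(cells))   # 4.0 -> the third child cell is highly surprising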
c) What are, in your opinion, the scientific contributions of the Ross paper? [3]
Limit your answer to 3 sentences!
 It helps classify multi-feature cubes based on their degree
of incrementality.
 It provides syntactically sufficient conditions on multi-feature
cube queries to determine whether they are distributive or algebraic.
 It presents an algorithm that incrementally computes coarser-
granularity output of a distributive multi-feature cube using …
d) Assume you are in charge of an organization that financially supports research
that centers on improving and enhancing multi-dimensional database technology.
Propose and describe 2 different research projects you would like to see funded
by your organization. [8]
No answer given!