Solution Sketches
Qualifying Exam Part 2
COSC 6340 (Data Management)
May 10, 2002

Your Name:
Your SSN:
Your Mailing Address:

Problem 1 [15]: Implementing Joins
Problem 2 [10]: B+-trees
Problem 3 [18]: Query Optimization
Problem 4 [9]: Association Rules
Problem 5 [21]: Multidimensional Database Technology
Grade:

The exam is "open books/notes" and you have 105 minutes to complete the exam.

1) Implementing Joins [15]

Assume we have to implement a natural join between two relations R1(A, B, C) and R2(A, D). A is the primary key of R2, and R1.A is a foreign key referencing R2.A (R1[A] ⊆ R2[A]). R1 and R2 are stored as unordered files (blocks are assumed to be 100% full); R1 contains 200000 tuples that are stored in 500 blocks, and R2 contains 50000 tuples that are stored in 1000 blocks (that is, every R2.A value occurs on average four times as a value of R1.A). Moreover, assume that only a small buffer of 8 blocks is available. Based on these assumptions, answer the following questions (indicate whether your computations assume that the output relation is written back to disk or not):

a. How many tuples will the output relation R = R1 ⋈ R2 contain? [1]
200000 (every R1 tuple joins with exactly one R2 tuple, since R1.A references the key of R2).

b. What is the cost of implementing the natural join using the block-nested loop join? [2]
M + (M*N)/(B-2) = 500 + 500*1000/6 = …

c. Now assume that either a hashed index on A of R1 or a hashed index on A of R2 is available (assume that there are no overflow pages). Compute the cost of the index nested loops join using the index on R1.A and using the index on R2.A (2 computations). [4]
Using the index on R2.A (R1 is the outer relation): 500 + 200000*(1+1) = …
Using the index on R1.A (R2 is the outer relation; each probe retrieves 4 matching R1 tuples): 1000 + 50000*(1+4) = …

d. Is it possible to apply the hash join in this case (explain your answer!)? [2]
No; the hash join needs roughly sqrt(500) ≈ 23 buffer blocks for the smaller relation, and only 8 are available (8 < sqrt(500)).

e. Which of the 4 proposed implementations is the best? [1]
b

f. Now assume that the response time of the methods you evaluated is not satisfactory. What else could be done to speed things up? [5]
Change the database design: create a new relation R = R2 OUTERJOIN R1. The cost reduces to … (answer omitted).
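As a sanity check on the estimates in b)-e), the arithmetic can be redone with the block and tuple counts from the problem statement. The sketch below is not part of the original solution; it assumes the standard textbook cost formulas (with a ceiling on the number of outer-relation chunks in the block-nested loop join) and ignores the cost of writing the result back to disk.

```python
import math

M_R1, N_R2 = 500, 1000          # blocks of R1 and R2
T_R1, T_R2 = 200_000, 50_000    # tuples of R1 and R2
B = 8                           # available buffer blocks

# b) block-nested loop join: read R1 once; for every (B-2)-block chunk of R1, scan R2
bnl = M_R1 + math.ceil(M_R1 / (B - 2)) * N_R2        # 500 + 84*1000 = 84500

# c) index nested loops join
# hashed index on R2.A: scan R1, then 1 index access + 1 data block per R1 tuple
inl_index_on_R2 = M_R1 + T_R1 * (1 + 1)              # 500 + 400000 = 400500
# hashed index on R1.A: scan R2, then 1 index access + 4 matching R1 tuples per R2 tuple
inl_index_on_R1 = N_R2 + T_R2 * (1 + 4)              # 1000 + 250000 = 251000

# d) a partitioned hash join needs roughly sqrt(blocks of the smaller relation) buffers
hash_join_applicable = B >= math.sqrt(min(M_R1, N_R2))   # 8 >= 22.4 -> False

# e) the block-nested loop join has the lowest estimated cost
print(bnl, inl_index_on_R2, inl_index_on_R1, hash_join_applicable)
```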
2) B+-Tree Index Structures [10]

Assume a relation R(A, B, C, D) is given; R is stored as an unordered file and contains 1000000 (1 million) tuples. Attributes A, B, C, D need 4 bytes of storage each, and blocks have a size of 4096 bytes. Moreover, the following query is given:

Q1: Select A, C, D from R where B < 12222 and B > 1300 (returns 30000 answers)

a) What is the cost of implementing Q1 assuming no index structures are available? [2]
Scan R; the cost is 3907 block accesses (a tuple occupies 16 bytes, so 256 tuples fit into a 4096-byte block, and 1000000/256 rounds up to 3907 blocks).

b) Now assume a B+-tree index using attribute B as the search key is created. What parameters p (maximum number of pointers in an intermediate node) and m (number of entries in a leaf node) would be chosen for the B+-tree, and what will be the height of the B+-tree (assuming that each node of the tree is filled 60%)? You can assume that B+-tree node pointers require 4 bytes of storage and index pointers also require 4 bytes of storage. Give reasons for your answers! [4]
p = m = 512, since a 4096-byte node holds about 512 (4-byte key, 4-byte pointer) pairs. With 60% filled nodes, p*m*(0.6^2) ≈ 94000 < 1000000 entries, whereas p*p*m*(0.6^3) ≈ 29000000 > 1000000; therefore the tree has 3 levels.

c) Based on your answers to b), compute the cost for Q1 assuming the B+-tree index is used. Explain your computations! [4]
2 (accesses to B+-tree intermediate nodes) + 98 (blocks of leaf nodes that satisfy the selection range in the query; 30000 qualifying entries / (512*0.6 ≈ 307) entries per leaf ≈ 98) + 30000 (accesses to blocks of relation R, one per qualifying tuple) = 30100
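The sizing argument in b) and the cost in c) can be checked with a few lines of arithmetic. This is a sketch rather than part of the original solution; it assumes p = m = 512, a uniform 60% fill factor, and one block access per qualifying R tuple, as in the answers above.

```python
import math

TUPLES  = 1_000_000
BLOCK   = 4096
KEY     = 4      # bytes per search-key value (attribute B)
POINTER = 4      # bytes per node/index pointer
FILL    = 0.6    # assumed fill factor of every node

# b) a node holds about BLOCK / (KEY + POINTER) (key, pointer) pairs
p = m = BLOCK // (KEY + POINTER)             # 512

# smallest number of levels whose 60%-filled capacity covers all 1M index entries
levels, capacity = 1, m * FILL               # one level = a single leaf node
while capacity < TUPLES:
    levels += 1
    capacity *= p * FILL
print(p, levels)                             # 512, 3 -> three levels, as in b)

# c) cost of Q1 using the index
matches     = 30_000
leaf_blocks = math.ceil(matches / (m * FILL))        # about 98 leaf blocks
cost = (levels - 1) + leaf_blocks + matches          # 2 + 98 + 30000 = 30100
print(cost)
```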
3) Query Optimization [18]

Assume 4 relations R1(A,B), R2(A,C), R3(A,D), and R4(A,E), and the following SQL query:

SELECT B, C, D FROM R1, R2, R3, R4 WHERE R1.A=R2.A and R2.A=R3.A and R3.A=R4.A

a) Give a query execution plan (Chaudhuri calls those physical operator trees) that implements the above query (it does not need to be the most efficient one). [3]
For example, the bushy plan (R1 NJOIN R2) NJOIN (R3 NJOIN R4), drawn as an operator tree.

b) The System R SQL optimizer only considers linear plans for query execution. Give an example of a plan for the above query that would not be considered by the SQL optimizer. [2]
Same as a): that plan is bushy (both inputs of the top join are intermediate results), so an optimizer restricted to linear plans would not consider it.

c) Using the above query as an example, explain the principle of optimality (explained on page 2 of the Chaudhuri paper). How does it help in reducing the number of plans that need to be evaluated by the query optimizer? [5]
Principle of optimality: the optimal plan for a query that involves k subexpressions can be composed of optimal plans for its subexpressions; it therefore suffices to find the optimal plan for each subexpression once and to extend these plans step by step (one subexpression at a time). Example: see page 2 of the paper. The principle of optimality reduces the number of plans to be considered from O(n!) to O(n*2^(n-1)).

d) Another critical problem in query optimization is the propagation of statistical information. Explain what this problem is by using the query plan you generated for sub-problem a) as an example. [4]
In order to evaluate the cost of the plan (R1 NJOIN R2) NJOIN (R3 NJOIN R4), it is necessary to predict the size and statistical properties of the two intermediate relations obtained by joining R1 with R2 and by joining R3 with R4. The size and statistical properties of those intermediate relations have to be predicted by propagating the statistics of R1, R2, R3, and R4; this information is then used to determine the cost of the third join operation.

e) Chaudhuri states that "a desirable optimizer is one where ... the enumeration algorithm is efficient". What does this requirement mean? [4]
The enumeration algorithm searches the space of implementation plans for a query for a "good" or optimal solution; to be efficient it needs to have the following characteristics:
a. it constructs the plans to be considered quickly
b. it employs heuristics that give preference to considering "good" plans before "bad" plans
c. it is complete, in the sense that all "promising" plans are considered

4) Association Rule Mining [9]

Compare the approach to association rule mining that relies on support, confidence, and APRIORI-style algorithms with the approach that is advocated by the Cohen/Datar paper. What are the main differences between the two approaches? What are the similarities of the two approaches? Do you think the two approaches are complementary? Limit your answer to 8-12 sentences!

Solution Sketch
1) Both are interested in mining association rules.
2) APRIORI centers on high-support, high-confidence rules; COHEN/DATAR centers on high-confidence, high-correlation rules.
3) Both need to find efficient ways to compute item sets that satisfy the minimum confidence requirements.
4) Phase 3 in COHEN/DATAR (scanning the database) is the same as in APRIORI.
5) APRIORI first computes the candidate item sets that satisfy the minimum support requirement and then uses the generated support information to generate rules; in general, it uses lack of support to prune potential candidate associations.
6) COHEN/DATAR cannot use support for efficiency; it relies on an approximate algorithm (occasionally false positives and false negatives occur) that creates signatures for items using a complicated hashing scheme. These signatures are then used to find candidate items with high correlation.
7) The two approaches are complementary.

5) Multi-dimensional Database Technology [21]

a) Multi-dimensional database technology has come a long way since its inception. What are the key features of this technology (what can be done with it)? [4] Limit your answer to 4-6 sentences.
Solution sketch: MDT
a. allows analyzing measures (of interest) along dimensions using data cubes; dimensions can be changed dynamically, and the definition of new cubes and the visualization of cubes are supported
b. generalizes spreadsheets to multiple dimensions
c. supports interactive KDD for decision support
d. supports data warehousing, OLAP, and data mining

b) The Sarawagi paper suggests associating cells with path exception values (PathExp). How can this value help data analysts in analyzing OLAP data? How could PathExp values be computed? [6]
PathExp measures the degree of surprise along each drill-down path from a cell. This information can alert the analyst to exceptions or other unusual events that occur at a lower level of granularity, but not at the current level of granularity. PathExp is computed as the maximum SelfExp over all cells that can be reached by drill-down operations along the current path. The paper computes SelfExp as the difference between the actual value and the value expected from the data in the current cube (divided by the standard deviation for normalization purposes); a small illustrative computation is sketched at the end of this document.

c) What are, in your opinion, the scientific contributions of the Ross paper? [3] Limit your answer to 3 sentences!
It classifies multi-feature cubes based on their degree of incrementality. It provides syntactically sufficient conditions on multi-feature cube queries to determine whether they are distributive or algebraic. It presents an algorithm that incrementally computes coarser-granularity output of a distributive multi-feature cube using …

d) Assume you are in charge of an organization that financially supports research that centers on improving and enhancing multi-dimensional database technology. Propose and describe 2 different research projects you would like to see funded by your organization. [8]
No answer given!
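To make the PathExp/SelfExp computation mentioned in 5b) concrete, here is a minimal sketch. It is not part of the original solution: the child-cell values are invented for illustration, and the expected values and standard deviation are simply assumed to be given, whereas the Sarawagi paper derives them from a model fitted to the cube.

```python
# Illustrative sketch of SelfExp and PathExp (see 5b); all numbers below are hypothetical.

def self_exp(actual, expected, std):
    # degree of surprise of one cell: normalized deviation from its expected value
    return abs(actual - expected) / std

def path_exp(children):
    # surprise of a drill-down path: maximum SelfExp over all cells reachable along it
    return max(self_exp(a, e, s) for (a, e, s) in children)

# one drill-down path of a cell, e.g. along a "month" dimension:
# (actual value, expected value, standard deviation) for each child cell
children = [(105, 100, 10), (98, 100, 10), (160, 100, 10), (92, 100, 10)]

print(path_exp(children))   # 6.0 -> the third child cell is a strong exception
```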