
Dr. Christoph F. Eick
Graded Homework3 COSC 6340 Spring 2005
Solution Sketches
Due: Th., May 5, 9a (electronic submission) --- submit hardcopy during the Lab3 Demo!
Last updated: April 25, 12:30a
Remark: Points associated with particular problems are subject to change.

10) Implementing Joins [14] Graded
Assume we have to implement a natural join between two relations R1(A, B, C) and R2(A, D). A is the primary key for R2, and R1.A is a foreign key for R1 (R1[A] ⊆ R2[A]). R1 and R2 are stored as unordered files (blocks are assumed to be 100% full); R1 contains 200000 tuples that are stored in 500 blocks, and R2 has 50000 tuples that are stored in 1000 blocks (that is, every R2.A value occurs on average four times as a value of R1.A). Moreover, assume that only a small buffer of 8 blocks is available. Based on these assumptions, answer the following questions (indicate whether or not you assume in your computations that the output relation is written back to disk):

a. How many tuples will the output relation R = R1 ⋈ R2 contain? [1]
200000

b. What is the cost of implementing the natural join using the block-nested loop join? [2]
M + (M*N)/(B-2) = 500 + 500*1000/6 ≈ 83833

c. Now assume that either a hashed index on A of R1 or a hashed index on A of R2 is available (assume that there are no overflow pages). Compute the cost of the index nested loops join using the index for R1.A and of the index nested loops join using the index for R2.A (2 computations) [4]
Index on R2.A (R1 is the outer relation): 500 + 200000 (index block accesses) + 200000 (accesses of the inner relation) = 400500
Index on R1.A (R2 is the outer relation): 1000 + 50000 (index block accesses) + 4*50000 (accesses of the inner relation) = 251000

d. Is it possible to apply the hash-join in this case (explain your answer!)? [2]
No, 8 < sqrt(500) ≈ 22.4; the buffer is too small to partition the smaller relation.

e. Which of the 4 proposed implementations is the best? [1]
b
Defining an index on

f. Now assume that the response time of the methods you evaluated is not satisfactory. What else could be done to speed up the join?
[5] Change the database design: create a new relation R = R2 OUTERJOIN R1. Cost reduces to …(answer omitted)

11) Physical Database Design Ungraded
Assume two relations R1(A, B, C) and R2(A, D, E) are given; R1 and R2 are both stored as unordered files, R1 contains 1000000 (1 million) tuples, and R2 contains 500000 (half a million) tuples. Attributes A, B, C, D, and E need 4 bytes of storage each, and blocks have a size of 4096 bytes. A is the primary key of both R1 and R2, but only very few A-values occur in both R1 and R2. Moreover, we assume that static hashing is used to implement index structures and that index pointers require 4 bytes of storage; furthermore, you can assume that index blocks are 80% full and that there are no overflow pages. Finally, the database system only supports the block nested loops join (only 3 blocks of buffer are available) and the index nested loops join. What index structures would you create to speed up the following 2 queries?

Q1: Select B, E from R1, R2 where R1.A=R2.A
returns 100 answers
Q2: Select B from R1, R2 where R1.A=R2.A and D=12;
returns 2 answers (assume there are 20000 tuples in R2 with D=12)

Describe which index structure you would create (justify your design!), and compute the cost of executing Q1 and Q2 for your chosen design. Also give the query evaluation plan you assume the database system should use to implement query Q2.

Solution Sketch (not a complete solution):
Define an index on R1.A, making the relation with the smaller number of tuples (R2 in our case) the outer relation; using the index nested loops join, the cost for Q1 is:
Sequential_Scan-cost(R2) + 500000 (block accesses for the index) + 100 (block accesses to the data of R1 to retrieve the B value; only very few of the values of R2.A occur in R1.A) = …

For Q2, a hashing index on attribute D does not help, because 20000 is much larger than the number of blocks of R2.
We reuse the index on R1.A introduced earlier: we sequentially scan through the tuples of R2 and check the D=12 condition for each tuple; if it holds, we take its R2.A value, go to the R1.A index and check whether the value exists in R1 (this occurs 20000 times), and go to the data if the value was found (this occurs only 2 times). This implementation has the following cost:
Sequential_Scan-cost(R2) + 20000 + 2 = …
Many other “good” and “bad” solutions for the problem exist…
Query plan not given…

12) XML [6] Graded
Currently, many organizations are developing domain specific XML DTDs¹. What is the reason for this development? How can XML DTDs help these organizations? Limit your answer to 4-6 sentences!

The answer below gives more details than what I requested in your answers, but it is a good idea to read it!

XML DTD: a set of rules that allows the specification of a new set of elements, attributes and entities. XML provides an application-independent way of sharing data. With a DTD, independent groups of people can agree to use a common DTD for interchanging data.

Domain Specific XML DTDs: Every organization could have organization-specific DTDs that would allow documents to be exchanged only within the organization; if the same documents had to be shared outside the organization, the information would first have to be transformed into a form others can understand, which is similar to the concept of data islands in databases. Presently, organizations exchange data either by implementing specialized protocols such as Electronic Data Interchange (EDI) or by implementing ad hoc solutions. One step could be to have domain specific XML DTDs that would allow different organizations belonging to the same domain (such as engineering, finance, or science) to share information, which is similar to the concept of having an integrated database. The reason organizations prefer standardized DTDs is that they enable seamless data exchange between heterogeneous sources.
Also, up to 80% of a company’s information is stored in unstructured textual documents. Hence, document warehousing and text mining are emerging disciplines for capturing and exploiting the flood of textual information for decision making. However, acquiring interesting and actionable knowledge from textual databases is still a major challenge for the data mining community. Domain specific XML DTDs thus play a vital role, as they allow creating semantic markup that provides explicit knowledge about text archives to facilitate search and browsing or to enable information integration with related data sources. A lot of work has been done in this area; some popular commercial products are:

• Microsoft’s BizTalk Server
BizTalk Server, an integration server, allows developing, deploying, and managing integrated business processes and XML-based Web services. It provides integration between messaging and orchestration, enhanced security, and support for industry standards.

• Electronic business XML (ebXML)
The goal of the ebXML initiative is to develop an XML-based framework for global electronic business. The traditional technology of business-to-business eCommerce, EDI, has a very high barrier of entry in terms of cost and complexity. The guiding vision of ebXML is 'to create a single global electronic marketplace where enterprises of any size and in any geographical location can meet and conduct business with each other through the exchange of XML based messages.' ebXML enables anyone, anywhere, to do business with anyone else over the Internet.

¹ More recently, XML DTD are replaced by XML-Schema that is a more powerful data model,

13) Functional Dependencies, Multi-valued Dependencies and Keys [32] Graded
Assume we have a relation R(A,B,C,D,E) with the following dependencies:
(1) ABC → DE
(2) D → ABCE
(3) E →→ B
Answer the following questions giving reasons for your answers:

a) Does D →→ BC hold for R?
[ANSWER] Yes.
From D → ABCE and the decomposition rule we have D → BC, and from the replication rule we have D →→ BC.

b) Does the decomposition of R into R1(B,E) and R2(A,C,D,E) have the lossless join property --- can R be reconstructed with a natural join of R1 and R2?
[ANSWER] Yes. The lossless join property holds if one of the conditions R1 ∩ R2 → R1 or R1 ∩ R2 → R2 is satisfied. In this problem R1 ∩ R2 = E. We are able to prove that E → EB. We know E →→ B. From the MVD E →→ B, the FD D → B and the coalescence rule we have E → B. So, from E → B, E → E (reflexivity) and the union rule we infer E → BE.

c) Does E → B hold for R (either show that this dependency can be inferred from the given 3 dependencies, or give a counter example of a relation that satisfies (1), (2), (3) but violates E → B)?
[ANSWER] Yes. We know E →→ B. From the MVD E →→ B, the FD D → B and the coalescence rule we have E → B.

d) Does E →→ DB always hold for R (either show that this dependency can be inferred from the given 3 dependencies, or give a counter example of a relation that satisfies (1), (2), (3) but violates E →→ DB)?
[ANSWER] No. Consider the following counter example:
a1, b1, c1, d1, e1
a2, b1, c2, d2, e1
These 2 tuples satisfy the FDs and the MVD in (1), (2), (3). But they do not satisfy the MVD E →→ DB, because satisfying this MVD requires 2 other tuples,
a2, b1, c2, d1, e1
a1, b1, c1, d2, e1
and the above relation does not include these two extra tuples.

e)

f) Is R in BCNF?² This is a difficult, kind of open-ended problem; limit the time you spend on this sub-problem to at most 3 hours!
[ANSWER] We know that E → D doesn’t hold, because if it held, from E → D and E → B we would have E → DB (and, by the replication rule, E →→ DB), which contradicts the answer to part (d) (the previous sub-problem). So E is not any kind of key (super-key or candidate key). One of the functional dependencies inferred from the FDs and the MVD in (1), (2), (3) is E → B (proof in part c), and because E is not a key, this FD violates the condition for BCNF.
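The arguments in Problem 13 can be spot-checked mechanically. The following is a minimal Python sketch, assuming (1) and (2) are read as FDs, (3) as the MVD E →→ B, and part d as the MVD E →→ DB. The helper names `closure`, `satisfies_fd` and `satisfies_mvd` are illustrative and not part of the homework. Note that `closure` applies FDs only, so E → B (which follows from the MVD by coalescence) has to be added by hand.

```python
from itertools import product

SCHEMA = "ABCDE"
FDS = [("ABC", "DE"), ("D", "ABCE")]  # dependencies (1) and (2), read as FDs

def closure(attrs, fds):
    """Attribute closure of attrs under the given FDs (MVDs are ignored)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def satisfies_fd(rows, lhs, rhs):
    """True if every pair of rows agreeing on lhs also agrees on rhs."""
    return all(
        any(t[a] != u[a] for a in lhs) or all(t[a] == u[a] for a in rhs)
        for t, u in product(rows, repeat=2)
    )

def satisfies_mvd(rows, lhs, rhs, schema=SCHEMA):
    """True if for all t, u agreeing on lhs the row that takes lhs and rhs
    from t and all remaining attributes from u is also in the relation."""
    rest = [a for a in schema if a not in lhs and a not in rhs]
    present = {tuple(r[a] for a in schema) for r in rows}
    for t, u in product(rows, repeat=2):
        if all(t[a] == u[a] for a in lhs):
            need = {a: u[a] for a in rest}
            need.update({a: t[a] for a in lhs + rhs})
            if tuple(need[a] for a in schema) not in present:
                return False
    return True

# The two-tuple counter example from part d.
ROWS = [
    dict(zip(SCHEMA, ("a1", "b1", "c1", "d1", "e1"))),
    dict(zip(SCHEMA, ("a2", "b1", "c2", "d2", "e1"))),
]

print(sorted(closure("D", FDS)))                    # D determines all attributes
print(sorted(closure("E", FDS + [("E", "B")])))     # E reaches only B and E
print(satisfies_mvd(ROWS, "E", "B"))                # MVD (3) holds
print(satisfies_mvd(ROWS, "E", "DB"))               # the MVD of part d fails
print(satisfies_fd(ROWS, "E", "D"))                 # E → D fails as well
```

Under this reading, the closure of E never reaches all of ABCDE even once E → B is added, which matches the argument in part f that E is not a key.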
² Warning: The presence of the MVD might imply other functional dependencies (see textbook page 637) that are “bad”.
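As a closing check on Problem 10, the cost figures in parts b to d can be re-derived with a few lines of Python. This is only a sketch under the assumptions stated in the problem (hashed index, one index block access per probe, one data block access per matching tuple, output not written back to disk); the function and variable names are illustrative.

```python
import math

# Figures from Problem 10.
R1_BLOCKS, R1_TUPLES = 500, 200_000
R2_BLOCKS, R2_TUPLES = 1000, 50_000
BUFFER = 8

def block_nested_loop_cost(outer_blocks, inner_blocks, buffer_blocks):
    """M + M*N/(B-2), the formula used in part b (no rounding per chunk)."""
    return outer_blocks + outer_blocks * inner_blocks / (buffer_blocks - 2)

def index_nested_loop_cost(outer_blocks, outer_tuples, matches_per_probe):
    """Scan the outer relation; per outer tuple, one index block access
    plus one data block access for each matching inner tuple."""
    return outer_blocks + outer_tuples * (1 + matches_per_probe)

# Part b: R1 as the outer relation.
bnl = block_nested_loop_cost(R1_BLOCKS, R2_BLOCKS, BUFFER)

# Part c: index on R2.A (R1 outer, exactly one match per probe) and
# index on R1.A (R2 outer, four matches per probe on average).
inl_r2_index = index_nested_loop_cost(R1_BLOCKS, R1_TUPLES, 1)
inl_r1_index = index_nested_loop_cost(
    R2_BLOCKS, R2_TUPLES, R1_TUPLES // R2_TUPLES
)

# Part d: rule of thumb B > sqrt(blocks of the smaller relation).
hash_join_ok = BUFFER > math.sqrt(min(R1_BLOCKS, R2_BLOCKS))

print(int(bnl))        # 83833
print(inl_r2_index)    # 400500
print(inl_r1_index)    # 251000
print(hash_join_ok)    # False
```

With these numbers the block-nested loop join is the cheapest of the evaluated methods, in line with the answer to part e.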
