Dr. Christoph F. Eick
Graded Homework3 COSC 6340
Spring 2005
Solution Sketches
Due: Th., May 5, 9a (electronic submission) --- submit hardcopy during the Lab3 Demo!
Last updated: April 25, 12:30a
Remark: Points associated with particular problems are subject to change.
10) Implementing Joins [14] Graded
Assume we have to implement a natural join between two relations R1(A, B, C) and
R2(A, D). A is the primary key for R2 and R1.A is a foreign key in R1 referencing R2
(R1[A] ⊆ R2[A]). R1 and R2 are stored as unordered files (blocks are assumed to be 100% full);
R1 contains 200000 tuples that are stored in 500 blocks, and R2 has 50000 tuples that are
stored in 1000 blocks (that is, every R2.A value occurs on average four times as a value
of R1.A). Moreover, assume that only a small buffer of 8 blocks is available. Based
on these assumptions, answer the following questions (indicate whether you assume in your
computations that the output relation is written back to disk or not):
a. How many tuples will the output relation R = R1 ⋈ R2 contain? [1]
200000
b. What is the cost for implementing the natural join using the block-nested loop join? [2]
M + (M*N)/(B-2) = 500 + 500*1000/6 ≈ 83833 (M = 500 blocks of R1 as the outer relation, N = 1000 blocks of R2, B = 8 buffer blocks)
c. Now assume that either a hashed index on A of R1 or a hashed index on A of R2 is
available (assume that there are no overflow pages). Compute the cost of the index
nested loops join using the index on R1.A and of the index nested loops join using the
index on R2.A (2 computations). [4]
Index on R2.A (R1 is the outer relation): 500 + 200000 (index block accesses) + 200000 (accesses of the inner relation) = 400500
Index on R1.A (R2 is the outer relation): 1000 + 50000 (index block accesses) + 4*50000 (accesses of the inner relation) = 251000
d. Is it possible to apply the hash-join in this case (explain your answer!)? [2]
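No. The hash join needs more than sqrt(500) ≈ 22.4 buffer blocks (the square root of the number of blocks of the smaller relation), but only 8 are available: 8 < sqrt(500).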
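The cost estimates for parts (b) through (d) can be recomputed with a short Python sketch. The function and variable names are illustrative only, the formulas are the approximations used in the answers above, and the hash-join test assumes the usual rule of thumb that the buffer must hold more than sqrt(blocks of the smaller relation) pages.

```python
import math

# Figures taken from the problem statement.
R1_BLOCKS, R1_TUPLES = 500, 200_000
R2_BLOCKS, R2_TUPLES = 1000, 50_000
BUFFER = 8


def block_nested_loops(outer_blocks, inner_blocks, buffer_blocks):
    """Approximate cost M + M*N/(B-2), as used in the answer to (b)."""
    return outer_blocks + outer_blocks * inner_blocks / (buffer_blocks - 2)


def index_nested_loops(outer_blocks, outer_tuples, matches_per_probe):
    """Scan the outer relation; for every outer tuple pay one index block
    access plus one data block access per matching inner tuple."""
    index_accesses = outer_tuples
    data_accesses = outer_tuples * matches_per_probe
    return outer_blocks + index_accesses + data_accesses


bnl = block_nested_loops(R1_BLOCKS, R2_BLOCKS, BUFFER)
# Index on R2.A: R1 is the outer relation, each probe finds exactly 1 tuple.
inl_index_on_r2 = index_nested_loops(R1_BLOCKS, R1_TUPLES, 1)
# Index on R1.A: R2 is the outer relation, each probe finds 4 tuples on average.
inl_index_on_r1 = index_nested_loops(R2_BLOCKS, R2_TUPLES, R1_TUPLES // R2_TUPLES)
# Assumed feasibility test for the hash join: B > sqrt(smaller relation).
hash_join_possible = BUFFER > math.sqrt(min(R1_BLOCKS, R2_BLOCKS))

print(int(bnl), inl_index_on_r2, inl_index_on_r1, hash_join_possible)
```

Under these assumptions the block-nested loop join comes out cheapest by a wide margin, because the index nested loops variants pay at least one block access per outer tuple.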
e. Which of the 4 proposed implementations is the best? [1]
b
Defining an index on
f. Now assume that the response time of the methods you evaluated is not satisfactory.
What else could be done to speed up the join? [5]
Change the database design: create a new relation R = R2 OUTERJOIN R1.
Cost reduces to …(answer omitted)
11) Physical Database Design Ungraded
Assume two relations R1(A, B, C) and R2(A, D, E) are given; R1 and R2 are both stored
as unordered files, and R1 contains 1000000 (1 million) tuples and R2 contains 500000
(half a million) tuples. Attributes A, B, C, D, and E need 4 bytes of storage each, and
blocks have a size of 4096 bytes. A is the primary key of both R1 and R2, but only very
few A-values occur in both R1 and R2. Moreover, we assume that static hashing is used
to implement index structures, and that index pointers require 4 bytes of storage;
furthermore, you can assume that index blocks are 80% full and do not contain
any overflow pages. Moreover, the database system only supports the block nested loops
join (only 3 blocks of buffer are available) and the index nested loops join. What index
structures would you create to speed up the following 2 queries?
Q1: Select B, E
    from R1, R2
    where R1.A=R2.A
    (returns 100 answers)
Q2: Select B
    from R1, R2
    where R1.A=R2.A
    and D=12;
    (returns 2 answers; assume there are 20000 tuples in R2 with D=12)
Describe which index structure you would create (justify your design!), and compute the
cost for executing Q1 and Q2 for your chosen design. Also give the query evaluation plan
you assume the database system should use to implement query Q2.
Solution Sketch (not a complete solution): Define an index on R1.A, making the relation
with the smaller number of tuples (R2 in our case) the outer relation; using the
index nested loops join, the cost for Q1 is: Sequential_Scan-cost(R2) + 500000 (block
accesses for the index) + 100 (block accesses to the data of R1 to retrieve the B value;
only very few of the values of R2.A occur in R1.A)=…
For Q2, a hash index on attribute D does not help because 20000 is
much larger than the number of blocks of R2. We reuse the index on R1.A
introduced earlier: we sequentially scan through the tuples of R2 and check the
D=12 condition for each tuple; for each qualifying tuple we take its R2.A value, go to
the R1.A index, and check whether the value exists in R1 (this occurs 20000 times), and
go to the data if the value was found (this occurs only 2 times). This implementation
has the following cost:
Sequential_Scan-cost(R2) + 20000 + 2=…
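As a rough illustration of how the two cost expressions could be evaluated, the sketch below estimates Sequential_Scan-cost(R2) from the tuple and block sizes given in the problem. It assumes (this is not stated in the problem) that the data blocks of R2 are completely full and that tuples do not span blocks; with a different fill factor the scan cost changes accordingly. All names are illustrative.

```python
import math

BLOCK_BYTES = 4096
ATTR_BYTES = 4
R2_TUPLES = 500_000
R2_ATTRS = 3  # R2(A, D, E)

# Assumption: data blocks are 100% full and tuples never span two blocks.
tuples_per_block = BLOCK_BYTES // (R2_ATTRS * ATTR_BYTES)
r2_scan = math.ceil(R2_TUPLES / tuples_per_block)

# Q1: scan R2, probe the R1.A index once per R2 tuple, fetch 100 matches.
q1_cost = r2_scan + 500_000 + 100
# Q2: scan R2, probe the index only for the 20000 tuples with D=12,
# fetch the 2 matching R1 tuples.
q2_cost = r2_scan + 20_000 + 2

print(tuples_per_block, r2_scan, q1_cost, q2_cost)
```

Whatever the exact scan cost, the comparison shows why filtering on D=12 before probing the index matters: the number of index probes, not the scan, dominates the cost of Q1.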
Many other “good” and “bad” solutions for the problem exist…
Query plan not given…
12) XML [6] Graded
Currently, many organizations are developing domain specific XML DTDs (see footnote 1). What is the
reason for this development? How can XML DTDs help these organizations? Limit your
answer to 4-6 sentences!
The answer below gives more detail than I requested in your answers, but it is
a good idea to read it!
XML DTD: An XML DTD is a set of rules that allows the specification of a new set of elements,
attributes, and entities. XML provides an application independent way of sharing data. With a DTD,
independent groups of people can agree to use a common DTD for interchanging data.
Domain Specific XML DTDs:
Every organization could have organization specific DTDs that would allow documents
to be exchanged only within the organization; if the same documents had to be shared outside the
organization, the information would first have to be transformed into a form others
can understand, which is similar to the concept of data islands in databases.
Presently, organizations exchange data either by implementing specialized protocols such
as Electronic Data Interchange (EDI) or by implementing ad hoc solutions.
One step forward is to have domain specific XML DTDs that allow different
organizations belonging to the same domain (engineering, financial, scientific, etc.)
to share information, which is similar to the concept of having an integrated
database. The reason organizations prefer standardized DTDs is that they enable
seamless data exchange between heterogeneous sources.
Also, up to 80% of a company’s information is stored in unstructured textual documents.
Hence, document warehousing and text mining are emerging disciplines for capturing and
exploiting the flood of textual information for decision making. However, acquiring
interesting and actionable knowledge from textual databases is still a major challenge for
the data mining community. Domain specific XML DTDs thus play a vital role, as they
allow creating semantic markup that provides explicit knowledge about text archives
to facilitate search and browsing or to enable information integration with related data
sources.
A lot of work has been done in this area; some popular commercial products and initiatives are:
• Microsoft’s BizTalk Server
BizTalk Server, an integration server, allows organizations to develop, deploy, and manage
integrated business processes and XML-based Web services. It provides
integration between messaging and orchestration, enhanced security, and
support for industry standards.

• Electronic business XML (ebXML)
The goal of the ebXML initiative is to develop an XML-based framework for
global electronic business. The traditional technology of business-to-business
eCommerce, EDI, has a very high barrier of entry, in terms of cost and
complexity. The guiding vision of ebXML is 'to create a single global electronic
marketplace where enterprises of any size and in any geographical location can
meet and conduct business with each other through the exchange of XML based
messages'. ebXML enables anyone, anywhere, to do business with anyone else
over the Internet.
Footnote 1: More recently, XML DTDs are being replaced by XML Schema, which is a more powerful data model.
13) Functional Dependencies, Multi-valued Dependencies and Keys [32] Graded
Assume we have a relation R(A,B,C,D,E) with the following dependencies:
(1) ABC → DE
(2) D → ABCE
(3) E →→ B (a multi-valued dependency)
Answer the following questions giving reasons for your answers:
a) Does D →→ BC hold for R?
[ANSWER]
Yes. From D → ABCE and the decomposition rule we have D → BC, and from the
replication rule we have D →→ BC.
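The functional part of this argument can be checked mechanically with the standard attribute-closure algorithm. The sketch below is only an illustration (the helper name `closure` is made up) and uses only the two functional dependencies (1) and (2); the multi-valued dependency (3) cannot be handled by attribute closure.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under a list of (lhs, rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire an FD when its left side is covered and it adds something.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# Only the functional dependencies (1) and (2) of the problem.
FDS = [("ABC", "DE"), ("D", "ABCE")]

d_closure = closure("D", FDS)
e_closure = closure("E", FDS)

print(sorted(d_closure))
print(sorted(e_closure))
```

Because B and C are in the closure of D, the functional dependency from D to BC follows. The closure of E under (1) and (2) alone contains only E, so any functional dependency with E on the left side has to come from the multi-valued dependency (3) through the coalescence rule.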
b) Does the decomposition of R into R1(B,E) and R2(A,C,D,E) have the lossless join
property --- can R be reconstructed with a natural join of R1 and R2?
[ANSWER]
Yes. The lossless join property holds if one of the conditions R1 ∩ R2 → R1 or R1 ∩ R2 → R2
is satisfied.
In this problem R1 ∩ R2 = E. We are able to prove that E → EB.
We know E →→ B. From the MVD E →→ B, the FD D → B (from (2) by decomposition), and the
coalescence rule we have E → B.
So, from E → B, E → E (reflexivity), and the union rule we infer E → BE.
c) Does E → B hold for R (either show that this dependency can be inferred from the
given 3 dependencies, or give a counter example of a relation that satisfies (1), (2),
(3) but violates E → B)?
[ANSWER]
Yes. We know E →→ B. From the MVD E →→ B, the FD D → B, and the coalescence rule we
have E → B.
d) Does E →→ DB always hold for R (either show that this dependency can be inferred
from the given 3 dependencies, or give a counter example of a relation that satisfies
(1), (2), (3) but violates E →→ DB)?
[ANSWER]
No. Consider the following counter example.
a1, b1, c1, d1, e1
a2, b1, c2, d2, e1
These 2 tuples satisfy the FDs and the MVD in (1), (2), (3). But they don’t satisfy the MVD E →→ DB,
because satisfying this MVD would require 2 other tuples, a2, b1, c2, d1, e1 and a1, b1, c1,
d2, e1, and the above relation doesn’t include these two extra tuples.
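The counter example can be verified with a small brute-force checker for functional and multi-valued dependencies on a relation instance. This is a minimal sketch with made-up helper names; it applies the definitions directly and is only practical for tiny relations.

```python
from itertools import product

ATTRS = "ABCDE"


def project(t, attrs):
    """Values of tuple t on the given attributes."""
    return tuple(t[ATTRS.index(a)] for a in attrs)


def satisfies_fd(rel, lhs, rhs):
    """X -> Y: tuples that agree on X must agree on Y."""
    return all(project(t, rhs) == project(u, rhs)
               for t, u in product(rel, repeat=2)
               if project(t, lhs) == project(u, lhs))


def satisfies_mvd(rel, lhs, rhs):
    """X ->> Y: for tuples t, u agreeing on X, the tuple that takes X and Y
    from t and all remaining attributes from u must be in the relation."""
    rest = [a for a in ATTRS if a not in lhs and a not in rhs]
    for t, u in product(rel, repeat=2):
        if project(t, lhs) != project(u, lhs):
            continue
        mixed = tuple(u[i] if ATTRS[i] in rest else t[i]
                      for i in range(len(ATTRS)))
        if mixed not in rel:
            return False
    return True


# The two tuples of the counter example.
R = {("a1", "b1", "c1", "d1", "e1"),
     ("a2", "b1", "c2", "d2", "e1")}

print(satisfies_fd(R, "ABC", "DE"), satisfies_fd(R, "D", "ABCE"),
      satisfies_mvd(R, "E", "B"), satisfies_mvd(R, "E", "DB"))
```

The first three checks succeed and the last one fails, because the mixed tuple a2, b1, c2, d1, e1 is missing from the relation.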
e) f) Is R in BCNF? (see footnote 2) This is a difficult, kind of open-ended problem; limit the time you
spend on this sub-problem to at most 3 hours!
[ANSWER]
We know that E → D doesn’t hold, because if it held, E → D and E → B would give
E → DB by the union rule and hence E →→ DB by the replication rule, which contradicts
the answer to part (d) (previous sub-problem).
So E is not a key of any kind (neither a super-key nor a candidate key).
One of the functional dependencies inferred from the FDs and the MVD in (1), (2), (3) is E → B
(proof in part c), and because E is not a key, this FD violates the condition for BCNF.
Footnote 2: Warning: The presence of the MVD might imply other functional dependencies (see textbook
page 637) that are “bad”.