Exam1
COSC 6340 (Data Management)
March 22, 2001
Your Name:
Your SSN:
I agree that my grades may be posted using the last 4 digits of my SSN
………………….(signature, if you would like us to post your grades)
Problem 1 [17]:
Problem 2 [8]:
Problem 3 [7]:
Problem 4 [12]:
Problem 5 [23]:
Problem 6 [5]:
Grade:
The exam is “open book” and you have 75 minutes to complete it.
1) Relational Database Design [17]
Consider the relation R(A,B,C,D,E) with the following functional
dependencies ( -> denotes a functional dependency):
(1) A -> B
(2) C -> B
(3) B -> E
(4) E -> D
a) Assume we decompose R into R1(A,B, C) and R2(B,D,E). Does this decomposition
have the lossless join property --- is it possible to reconstruct R from R1 and R2 using a
natural join? Give reasons for your answer! [5]
Yes. Apply the lossless decomposition test on page 435 of the textbook:
R1 ∩ R2 = {B}
For R2, the FDs are B -> E and E -> D. By the transitivity rule, we get B -> D.
Because B -> E and B -> D, B -> BDE (union), i.e. B is a candidate key of
relation R2. So R1 ∩ R2 -> R2, and the decomposition is lossless.
You can also use attribute closure to explain the answer.
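The attribute-closure argument can be sketched in Python as follows. This is a minimal illustration, not part of the original solution: the helper name `closure` and the encoding of FDs as pairs of attribute strings are assumptions made for the example.

```python
def closure(attrs, fds):
    """Return the closure of the attribute set `attrs` under the FDs `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire an FD when its whole left side is already in the closure.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# FDs (1)-(4) of Problem 1, written as (left side, right side).
FDS = [("A", "B"), ("C", "B"), ("B", "E"), ("E", "D")]
R1, R2 = set("ABC"), set("BDE")

common = R1 & R2                       # attributes shared by R1 and R2
common_closure = closure(common, FDS)  # everything the shared attributes determine
# Binary lossless-join test: the shared attributes must determine R1 or R2.
lossless = R1 <= common_closure or R2 <= common_closure

print(sorted(common), sorted(common_closure), lossless)
```

Here the closure of {B} is {B, D, E}, which covers all of R2, so the test succeeds.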
b) What is (are) the candidate key(s) of R? [2]
AC
c) Is R in BCNF? If not, which functional dependencies are bad (violate BCNF)? [2]
No. All FDs are bad.
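The answers to b) and c) can be checked mechanically. The sketch below is only an illustration under assumed names (`closure`, `candidate_keys`); it enumerates attribute subsets by brute force, which is fine for five attributes but would not scale to large schemas.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return the closure of the attribute set `attrs` under the FDs `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attrs, fds):
    """Return all minimal attribute sets whose closure is the whole relation."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            cand = set(combo)
            if any(key <= cand for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(cand, fds) == attrs:
                keys.append(cand)
    return keys


ATTRS = set("ABCDE")
FDS = [("A", "B"), ("C", "B"), ("B", "E"), ("E", "D")]

keys = candidate_keys(ATTRS, FDS)
# A non-trivial FD violates BCNF when its left side is not a superkey.
violations = [(lhs, rhs) for lhs, rhs in FDS if closure(lhs, FDS) != ATTRS]

print([sorted(k) for k in keys])
print(violations)
```

The only key found is AC, and none of the four left sides (A, C, B, E) is a superkey, so every FD is reported as a violation.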
d) Transform the relational schema into a relational schema that is in BCNF and does
not have any lost dependencies; if this is not possible decompose R into a schema that is
in BCNF and has the fewest number of lost functional dependencies. [8]
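One possible solution:
R1 (ABCDE) is split on E -> D into R2 (ABCE) and R3 (DE)
R2 (ABCE) is split on B -> E into R4 (ABC) and R5 (BE)
R4 (ABC) is split on A -> B into R6 (AC) and R7 (AB)
Resulting schema: R3 (DE), R5 (BE), R6 (AC), R7 (AB)
Lost FD: C -> B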
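Which FDs survive the decomposition can be checked with the standard dependency-preservation test. The sketch below assumes the final schema consists of the leaves of the decomposition, (DE), (BE), (AC), and (AB); the function names are hypothetical.

```python
def closure(attrs, fds):
    """Return the closure of the attribute set `attrs` under the FDs `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def preserved(lhs, rhs, schemas, fds):
    """Check whether lhs -> rhs follows from the FDs projected onto `schemas`."""
    z = set(lhs)
    changed = True
    while changed:
        changed = False
        for schema in schemas:
            # Attributes derivable while staying inside a single relation.
            gained = closure(z & schema, fds) & schema
            if not gained <= z:
                z |= gained
                changed = True
    return set(rhs) <= z


FDS = [("A", "B"), ("C", "B"), ("B", "E"), ("E", "D")]
# Assumed final schema: the leaves of the decomposition in part d).
SCHEMAS = [set("DE"), set("BE"), set("AC"), set("AB")]

lost = [(lhs, rhs) for lhs, rhs in FDS if not preserved(lhs, rhs, SCHEMAS, FDS)]
print(lost)
```

Only C -> B is reported as lost, because no relation in the schema contains both B and C.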
2) Multi-valued Dependencies [8]
Assume the following relation R(A,B,C,D) is given and the following multivalued
dependencies hold: A ->-> B and A ->-> C
Assume the relation R contains the following two tuples:
R(A B C D)
(1 2 3 4)…
(1 5 6 7)…
What other tuples must R contain so that A ->-> B and A ->-> C hold for R (said
differently, give an example of a relation that contains the two tuples and
does not violate the two multi-valued dependencies)?
Applying the formula on page 446 of the textbook, the tuples that must be
included because of the two multi-valued dependencies are:
(1 2 6 7)
(1 5 3 4)
(1 2 6 4)
(1 5 3 7)
(1 2 3 7) second round
(1 5 6 4) second round
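The two rounds above can be reproduced by applying the MVD definition repeatedly until no new tuple appears. This is a small sketch with assumed helper names (`apply_mvd`, `chase`); tuples are plain Python tuples in the column order A, B, C, D.

```python
COLS = "ABCD"


def apply_mvd(rel, lhs, rhs):
    """Return the tuples that the MVD lhs ->-> rhs forces, given `rel`."""
    x = [COLS.index(c) for c in lhs]
    y = [COLS.index(c) for c in rhs]
    forced = set()
    for t1 in rel:
        for t2 in rel:
            if all(t1[i] == t2[i] for i in x):
                # Take lhs and rhs values from t1 and all other values from t2.
                forced.add(tuple(
                    t1[i] if i in x or i in y else t2[i]
                    for i in range(len(COLS))
                ))
    return forced


def chase(rel, mvds):
    """Add forced tuples until the relation satisfies every MVD."""
    rel = set(rel)
    while True:
        bigger = set(rel)
        for lhs, rhs in mvds:
            bigger |= apply_mvd(bigger, lhs, rhs)
        if bigger == rel:
            return rel
        rel = bigger


start = {(1, 2, 3, 4), (1, 5, 6, 7)}
result = chase(start, [("A", "B"), ("A", "C")])
added = result - start

for t in sorted(added):
    print(t)
```

The result holds 8 tuples, every combination of B in {2, 5}, C in {3, 6}, and D in {4, 7} with A = 1, so the six tuples listed above are exactly the ones added.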
3) Query Optimization [7]
a) What are the goals and objectives of query optimization? [4]
See textbook!!
b) Why are statistics gathered from the database important for query optimization?
[3]
To predict the cost of operations and the size of intermediate relations; a better
prediction model results in more accurate evaluations of query plans.
4) B+-Trees [12]
a) Compare B+-trees with static hashing. What are the main advantages of B+-trees
compared with static (bucket) hashing techniques? What are the disadvantages?
[4]
Advantages of B+-tree: Sorted data structure; Self-organizing; Efficient for
range search
Disadvantage: May require 1 or 2 more I/Os for equality search than hashing
b) Assume that the following B+-tree with p=5 and k=3 is given. Furthermore,
assume that the keys 1, 21, 22, 23, 39 are deleted in the indicated order. Show
what the tree looks like after each deletion. [8]
[21]
  [2 5]
    [1 2] [3 4 5] [20 21]
  [23 40 50]
    [22 23] [39 40] [44 50] [60 63]
One possible solution:
Delete 1:
[21]
  [3 5]
    [2 3] [4 5] [20 21]
  [23 40 50]
    [22 23] [39 40] [44 50] [60 63]
Delete 21:
[23]
  [3 20]
    [2 3] [4 5 20] [22 23]
  [40 50]
    [39 40] [44 50] [60 63]
Delete 22:
[23]
  [3 5]
    [2 3] [4 5] [20 23]
  [40 50]
    [39 40] [44 50] [60 63]
Delete 23:
[3 20 40 50]
  [2 3] [4 5 20] [39 40] [44 50] [60 63]
Delete 39:
[3 5 40 50]
  [2 3] [4 5] [20 40] [44 50] [60 63]
5) Physical Database Design [23]
Assume two relations R1(A, B, C) and R2(A, D, E); R1 and R2 are each stored as an
unordered file and each contains 1000000 (1 million) tuples. Attributes A, B, C, D, and E
need 4 bytes of storage each, and blocks have a size of 4096 bytes. A is the primary key
of both R1 and R2, and R1[A]=R2[A]. Moreover, we assume that static hashing is used to
implement index structures and that index pointers require 4 bytes of storage;
furthermore, you can assume that index blocks are 80% full and that there are no
overflow pages. The database system supports only the block nested loops
join (only 3 blocks of buffer are available) and the index nested loops join. What index
structures would you create to speed up the following 3 queries?
Q1: Select B, E
from R1, R2
where R1.A=R2.A
and D=12;
returns 4 answers
Q2: Select B
from R1, R2
where R1.A=R2.A
and C=12;
returns 100000 answers
Q3: Select sum(A)
from R1;
returns one answer
Describe which index structure you would create (justify your design!), and compute the
cost for executing Q1, Q2, and Q3 for your chosen design (Hint: look for unusual
solutions!). Also give the query evaluation plan you assume the database system would
use to implement query Q1.
Q1:
Because Q1 returns 4 answers, R1[A] = R2[A], and A is the primary key of both
relations, there are exactly 4 tuples in R2 that satisfy D = 12. Create a hash index
on R2.D and use it to find the 4 tuples in R2 satisfying D = 12 without writing out
the result. Also create a hash index on R1.A and use it to find the four matching
tuples with R1.A = R2.A on the fly, and write out B, E.
Cost:
To find the four tuples in R2 with D = 12:
1 (index block of R2.D) + 4 (data blocks) = 5
For each tuple with D = 12, to find the tuple in R1 satisfying R1.A = R2.A:
1 (index block of R1.A) + 1 (data block) = 2
Total cost (without considering writing out the final result):
5 + 4 * 2 = 13 I/Os
Q2:
# of file blocks of each relation = 1000000 * (4 * 3) / 4096 ≈ 3000
Because Q2 returns 100000 answers, an index on R1.C will not help. Scan R1 to
retrieve the tuples with C = 12 (3000 block accesses). Meanwhile, use a hash index on
R2.A to do an index nested loops join on the fly.
Cost:
3000 + 100000 * (1 + 1) = 203000 I/Os
Q3:
Because only the values of R1.A are needed, and we have already created the
index on R1.A for Q1, an index-only scan is enough to get the result.
Cost: (equals the number of index blocks)
1000000 * (4 + 4) / (4096 * 80%) ≈ 2442 I/Os
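The three cost estimates can be recomputed with a few lines of Python. This is only a sketch of the arithmetic above: the constant and variable names are chosen for the example, and the 3000-block file size is the rounded figure used in the solution (the exact value is printed alongside it).

```python
import math

TUPLES = 1_000_000
ATTR_BYTES = 4       # every attribute needs 4 bytes
POINTER_BYTES = 4    # every index pointer needs 4 bytes
BLOCK_BYTES = 4096
INDEX_FILL = 0.8     # index blocks are assumed to be 80% full

# Q1: one index block on R2.D plus 4 data blocks, then for each of the
# 4 matching tuples one index block on R1.A plus one data block.
q1_outer = 1 + 4
q1_probe = 1 + 1
q1_cost = q1_outer + 4 * q1_probe

# Q2: full scan of R1, then one index probe on R2.A per matching tuple.
exact_file_blocks = math.ceil(TUPLES * 3 * ATTR_BYTES / BLOCK_BYTES)
file_blocks = 3000   # rounded figure used in the solution
q2_cost = file_blocks + 100_000 * (1 + 1)

# Q3: read every block of the index on R1.A (key plus pointer per entry).
q3_cost = math.ceil(
    TUPLES * (ATTR_BYTES + POINTER_BYTES) / (BLOCK_BYTES * INDEX_FILL)
)

print(q1_cost, exact_file_blocks, q2_cost, q3_cost)
```

With the exact file size of 2930 blocks instead of the rounded 3000, the Q2 estimate changes only marginally; the 200000 index probes dominate the cost.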
Evaluation Plan for Q1: (reference page 372 of the textbook)
π B, E (on-the-fly)
  |X| A=A (index nested loop join)
    σ D=12 (use hash index; do not write result to temp)
      R2
    R1
6) Data Warehousing, OLAP, and KDD [5]
Explain the increased popularity of Data Warehousing, OLAP, and data mining
techniques in the commercial area!
Reasons:
• deal with the data explosion problem (e.g. from scanners, earth satellites, …);
automated tools are necessary to make sense of the data because of
limited human resources
• provide high-level summaries of data that facilitate data analysis, data
mining, and data visualization
• support intelligent decision making through aggregated summaries of
low-level (production) data
• provide information for management
• facilitate corporate report generation