CS 316/416 Database Systems Final.
Handout: 9:00am, May 8, 2017
Handin: 11:59pm, May 11, 2017, via one of:
i) email to instructors ([email protected])
ii) hardcopy dropped off at instructor’s office (Malone 233)
NAME: __________________________________________
JHED ID: __________________________________________
Students should work through this exam independently; no collaboration is permitted on this
exam. The exam is open book: students are free to use course notes and textbooks as needed.
Please acknowledge the following statement with your initials: I herewith state that I understand
and will adhere to the Johns Hopkins Computer Science Department Code of Academic Integrity.
---------------------------------------------------------------
Signature
Maximum number of points possible: 125.
CS316 students should answer 4 of the 5 parts, and may choose which part they want
to skip. CS416 students should answer all 5 parts, A-E.
Questions vary in difficulty and have their contributions to your score as indicated.
Do not get stuck on one question. Good luck!
Exam questions start on the next page.
PART A: Analytics and Multi-query optimization (25 points)
A.1) Using a GROUP BY CUBE clause in an SQL query, how would you find the debt of all
individuals in Baltimore and San Francisco, with ages between 20 and 40, and with income
greater than $50K? (7 points)
A.2) Consider the following queries with tables R(a,b), S(c,d), T(e,f):
Q1: select T.f, sum(R.a) from R, S, T where R.b = S.c and S.d = T.e group by T.f;
Q2: select sum(R.a) from R, S where R.b = S.c;
Q3: select sum(R.a) from R, S, T where R.b = S.c and S.d > T.f;
Write out:
i) the query signatures that can be used to detect commonality during multiquery optimization.
ii) the AND-OR graph for the above queries, with Σv to indicate aggregate operators, with v as group-by variables.
(9 points)
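To make the three queries concrete, the sketch below runs them as written on a tiny in-memory SQLite database. The sample rows are hypothetical and chosen only for illustration; they are not part of the exam, and the sketch does not address the signature or AND-OR graph parts of the question.

```python
import sqlite3

# The three queries from A.2, copied as written.
QUERIES = {
    "Q1": "select T.f, sum(R.a) from R, S, T "
          "where R.b = S.c and S.d = T.e group by T.f",
    "Q2": "select sum(R.a) from R, S where R.b = S.c",
    "Q3": "select sum(R.a) from R, S, T where R.b = S.c and S.d > T.f",
}


def run_queries():
    """Run Q1-Q3 on hypothetical sample data and return their result rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table R(a integer, b integer);
        create table S(c integer, d integer);
        create table T(e integer, f integer);
        -- Hypothetical rows, for illustration only.
        insert into R values (1, 10), (2, 20), (3, 10);
        insert into S values (10, 100), (20, 200);
        insert into T values (100, 5), (200, 7);
    """)
    results = {name: conn.execute(sql).fetchall()
               for name, sql in QUERIES.items()}
    conn.close()
    return results


if __name__ == "__main__":
    for name, rows in run_queries().items():
        print(name, rows)
```

All three queries share the join of R and S on R.b = S.c, which is the kind of overlap the question asks you to expose.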
A.3) Consider the select-aggregate queries below, with a histogram of tuple cardinalities on the
attribute “b” given as the following interval-cardinality pairs:
[([0, 5), 10), ([5,10), 10), ([10,15), 10)]
Assuming that addition operations have a cost of 2 units and comparison operations have a cost of
1 unit, what is the cost of running the two queries independently, compared to their shared
execution? (9 points)
Q1: select sum(a) from R where R.b < 5
Q2: select sum(a) from R where R.b < 10
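As a reading aid for the histogram notation, the sketch below estimates how many tuples satisfy a predicate of the form R.b < bound. It assumes values are spread uniformly within each bucket, which is a common convention but is not stated in the question; the function name is hypothetical. It deliberately stops short of the cost comparison the question asks for.

```python
# Interval-cardinality pairs from A.3: ([low, high), tuple count).
HISTOGRAM = [((0, 5), 10), ((5, 10), 10), ((10, 15), 10)]


def tuples_below(histogram, bound):
    """Estimate the number of tuples with b < bound.

    Assumption (not from the exam): values are uniformly distributed
    inside each half-open bucket [low, high).
    """
    total = 0.0
    for (low, high), count in histogram:
        if bound >= high:
            total += count  # whole bucket qualifies
        elif bound > low:
            # partial bucket: take the proportional share
            total += count * (bound - low) / (high - low)
    return total


if __name__ == "__main__":
    print(tuples_below(HISTOGRAM, 5))   # predicate of Q1
    print(tuples_below(HISTOGRAM, 10))  # predicate of Q2
```

Because both predicates fall on bucket boundaries, the uniformity assumption does not affect the estimates for Q1 and Q2.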
PART B: System architecture and physical database design (25 points)
B.1) Give two different ways in which database views can be used to speed up query processing.
(4 points)
B.2) As an alternative to the standard "shared nothing" approach to using large-scale, non-shared
memory parallel computer system architectures, at least one vendor takes a "shared disk"
approach. In the context of such a machine, this means that all processors are permitted to fetch
and buffer disk pages from any of the processors that have disks attached, and as a result, that any
query can run on any processor (or set of processors).
Identify the potential performance advantages and disadvantages of the shared disk approach vs.
the shared nothing approach. Do this comparison with respect to the following issues: (9 points)
i) concurrency control using page-level locking
ii) parallelizing the hash join algorithm
iii) recovery using write-ahead logging
B.3) Database management systems use a “what-if” architecture for their query optimizers, which
incorporates feedback from running queries to update internal statistics. Give two reasons why
“what-if” approaches might choose to run a query as part of query optimization. (4 points)
B.4) The following picture illustrates an ST-Holes histogram, with cardinalities (not densities)
specified for each bucket. In the configuration below, which buckets would the ST-Holes
algorithm change via merges? Your answer should show the buckets after these changes have
been applied. (8 points)
[Figure: ST-Holes histogram (bucket layout not shown). Bucket cardinalities: B1: 100, B2: 100, B3: 105, B4: 4000, B5: 200, B6: 1000, B7: 900, B8: 10, B9: 500, B10: 5]
PART C: Concurrency control (25 points)
C.1) Consider the following schedule of transactions T1 and T2, operating on database objects x,
y, z:
S = b w1(x) r2(x) w2(y) c2 w1(z) c1 e
Events b and e are the beginning and end of the schedule respectively. The following questions
ask where certain events may occur on this schedule. Assume that two-phase locking is used.
Express your answers by giving the two S events that must precede and follow the event in the
question. For example, transaction T1 must lock x between b and w1(x).
i. Where can T1 unlock x?
ii. Where can T1 lock z?
iii. If undo logging is used, where can T2 write to the log (on disk) the undo information for its write to y?
iv. If redo logging is used, where can T2 write to the log (on disk) the redo information for its write to y?
v. Schedule S cannot occur if strict two-phase locking is used. Assuming we can move the T2 actions in S, where can w2(y) occur in the new schedule?
(15 points)
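The schedule notation used in C.1 can be read mechanically. The sketch below parses S into operations and lists the conflicting pairs between different transactions as precedence edges. The helper names are hypothetical, and the sketch only illustrates the notation; it does not place lock, unlock, or log-write events.

```python
import re

SCHEDULE = "b w1(x) r2(x) w2(y) c2 w1(z) c1 e"


def parse_schedule(text):
    """Return (op, txn, obj) triples; obj is None for commit events."""
    ops = []
    for token in text.split():
        match = re.fullmatch(r"([rwc])(\d+)(?:\((\w+)\))?", token)
        if match:  # the begin/end markers b and e do not match
            ops.append((match.group(1), int(match.group(2)), match.group(3)))
    return ops


def precedence_edges(ops):
    """Edges (Ti, Tj): Ti has an earlier operation conflicting with Tj's.

    Two operations conflict when they belong to different transactions,
    touch the same object, and at least one of them is a write.
    """
    edges = set()
    for i, (op1, t1, obj1) in enumerate(ops):
        for op2, t2, obj2 in ops[i + 1:]:
            if obj1 is not None and obj1 == obj2 and t1 != t2 \
                    and "w" in (op1, op2):
                edges.add((t1, t2))
    return edges


if __name__ == "__main__":
    print(precedence_edges(parse_schedule(SCHEDULE)))
```

For S, the only conflicting pair is w1(x) followed by r2(x), so the single edge runs from T1 to T2.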
C.2) Explain when a schedule can be view serializable but not conflict-serializable. Give a small
example schedule that has this property. (4 points)
C.3) Give one advantage of Strict 2PL over Rigorous 2PL, and one advantage of Rigorous 2PL
over Strict 2PL. (3 points)
C.4) Compare and contrast optimistic concurrency control with multi-version concurrency control.
Give one advantage of each approach over the other. (3 points)
PART D: Recovery (25 points)
D.1) If you had a choice of implementing only one of FORCE/STEAL and NO FORCE/NO
STEAL buffer managers, rather than the NO FORCE/STEAL policy described in class, which
would you pick, and why? (5 points)
D.2) Assume that the buffer manager implements the steal and force policies, i.e., the buffer
manager allows pages to be written to disk before transactions that have written to the page have
committed (steal), and the buffer manager flushes all updates by a transaction to disk before the
transaction commits (force). How would this change the structure of a log record? Why? (5 points)
D.3) When does the WAL (Write-Ahead Logging) protocol force the log to disk? State the precise
point, such as “after XXX happens”, or “before XXX happens”. (Hint: There is more than one
occasion.) (5 points)
D.4) In extreme scenarios, recovery systems must address the issue of repeated failures during the
recovery process; that is, the database system may suffer hardware (or external) failures in the
middle of recovering with the UNDO and REDO logs. How does recovery behave when this issue
arises, and is this scenario a problem? (5 points)
D.5) What are the advantages and disadvantages in using disk-based checkpoints for recovery
purposes compared to using a replicated distributed database? (5 points)
PART E: Parallel and distributed databases (25 points)
E.1) Define the terms scale-up and speed-up in the context of a parallel database system. (4 points)
E.2) Briefly describe three different ways of horizontally partitioning a relation across several
processors in a parallel database system that uses the shared-nothing architecture. (4 points)
E.3) Consider both local-area and wide-area distributed join algorithms:
i. In the wide-area setting, when would you use query shipping rather than data shipping? How
does this compare to the scenarios in which you would use query shipping in a local-area network? (4 points)
ii. How can semi-joins reduce the amount of data transferred between sites in a distributed join? (2 points)
iii. What are the limitations of using a DHT (Distributed Hash Table) to evaluate a distributed join? (2 points)
E.4) What is the difficulty in using an asynchronous, update-everywhere replication protocol? Can
you give an example where such a protocol is applicable? (4 points)
E.5) The CAP theorem has been discussed as being inappropriate for cluster and datacenter
environments. Give three reasons to support this claim, covering both failure models and quality
of service arguments. (5 points)
SECTION                                 QUESTION              SCORE   SECTION TOTAL
Part A                                  A.1 (Max: 7 points)           (Max: 25 points)
Analytics and multiquery optimization   A.2 (Max: 9 points)
                                        A.3 (Max: 9 points)
Part B                                  B.1 (Max: 4 points)           (Max: 25 points)
Physical database design                B.2 (Max: 9 points)
                                        B.3 (Max: 4 points)
                                        B.4 (Max: 8 points)
Part C                                  C.1 (Max: 15 points)          (Max: 25 points)
Concurrency control                     C.2 (Max: 4 points)
                                        C.3 (Max: 3 points)
                                        C.4 (Max: 3 points)
Part D                                  D.1 (Max: 5 points)           (Max: 25 points)
Recovery                                D.2 (Max: 5 points)
                                        D.3 (Max: 5 points)
                                        D.4 (Max: 5 points)
                                        D.5 (Max: 5 points)
Part E                                  E.1 (Max: 4 points)           (Max: 25 points)
Parallel and distributed databases      E.2 (Max: 4 points)
                                        E.3 (Max: 8 points)
                                        E.4 (Max: 4 points)
                                        E.5 (Max: 5 points)
Total (Max: 125 points)