CS 316/416 Database Systems Final

Handout: 9:00am, May 8, 2017
Handin: 11:59pm, May 11, 2017, via one of:
  i) email to instructors ([email protected])
  ii) hardcopy dropped off at instructor’s office (Malone 233)

NAME: __________________________________________
JHED ID: __________________________________________

Students should work through this exam independently; there is to be no collaboration on this exam. The exam is open book: students are free to use course notes and textbooks as needed.

Please acknowledge the following statement with your initials: I herewith state that I understand and will adhere to the Johns Hopkins Computer Science Department Code of Academic Integrity.

---------------------------------------------------------------
Signature

Maximum number of points possible: 125.

CS316 students should answer 4 out of the 5 sections, and may choose which section they want to skip. CS416 students should answer all 5 sections, A-E.

Questions vary in difficulty and have their contributions to your score as indicated. Do not get stuck on one question. Good luck!

Exam questions start on the next page.

PART A: Analytics and Multi-query optimization (25 points)

A.1) Using a GROUP BY CUBE clause in an SQL query, how would you find the debt of all individuals in Baltimore and San Francisco, with ages between 20 and 40, and with income greater than $50K? (7 points)

A.2) Consider the following queries with tables R(a,b), S(c,d), T(e,f):

    Q1: select T.f, sum(R.a) from R, S, T where R.b = S.c and S.d = T.e group by T.f;
    Q2: select sum(R.a) from R, S where R.b = S.c;
    Q3: select sum(R.a) from R, S, T where R.b = S.c and S.d > T.f;

Write out:
  i) the query signatures that can be used to detect commonality during multiquery optimization.
  ii) the AND-OR graph for the above queries, with Σv to indicate aggregate operators, with v as group-by variables.
(9 points)

A.3) Consider the select-aggregate queries below, with a histogram of tuple cardinalities on the attribute “b” given as the following interval-cardinality pairs:

    [([0, 5), 10), ([5, 10), 10), ([10, 15), 10)]

Assuming that each addition operation has a cost of 2 units and each comparison operation has a cost of 1 unit, what is the cost of running the two queries independently, compared to their shared execution? (9 points)

    Q1: select sum(a) from R where R.b < 5
    Q2: select sum(a) from R where R.b < 10

PART B: System architecture and physical database design (25 points)

B.1) Give two different ways in which database views can be used to speed up query processing. (4 points)

B.2) As an alternative to the standard "shared nothing" approach to using large-scale, non-shared-memory parallel computer system architectures, at least one vendor takes a "shared disk" approach. In the context of such a machine, this means that all processors are permitted to fetch and buffer disk pages from any of the processors that have disks attached, and as a result, that any query can run on any processor (or set of processors). Identify the potential performance advantages and disadvantages of the shared disk approach vs. the shared nothing approach. Do this comparison with respect to the following issues: (9 points)
  i) concurrency control using page-level locking
  ii) parallelizing the hash join algorithm
  iii) recovery using write-ahead logging

B.3) Database management systems use a “what-if” architecture for their query optimizers that incorporates feedback from running queries to update internal statistics. Give two reasons why “what-if” approaches might choose to run a query as part of query optimization. (4 points)

B.4) The following picture illustrates an ST-Holes histogram, with cardinalities (not densities) specified for each bucket. In the configuration below, which buckets would the ST-Holes algorithm change via merges?
Your answer should show the buckets after these changes have been applied. (8 points)

[Figure: ST-Holes histogram. Bucket cardinalities as labeled in the figure: B2: 100, B6: 1000, B7: 900, B8: 10, B4: 4000, B3: 105, B5: 200, B9: 500, B10: 5, B1: 100]

PART C: Concurrency control (25 points)

C.1) Consider the following schedule of transactions T1 and T2, operating on database objects x, y, z:

    S = b w1(x) r2(x) w2(y) c2 w1(z) c1 e

Events b and e are the beginning and end of the schedule respectively. The following questions ask where certain events may occur on this schedule. Assume that two-phase locking is used. Express your answers by giving the two S events that must precede and follow the event in the question. For example, transaction T1 must lock x between b and w1(x).
  i. Where can T1 unlock x?
  ii. Where can T1 lock z?
  iii. If undo logging is used, where can T2 write to the log (on disk) the undo information for its write to y?
  iv. If redo logging is used, where can T2 write to the log (on disk) the redo information for its write to y?
  v. Schedule S cannot occur if strict two-phase locking is used. Assuming we can move the T2 actions in S, where can w2(y) occur in the new schedule?
(15 points)

C.2) Explain when a schedule can be view-serializable but not conflict-serializable. Give a small example schedule that has this property. (4 points)

C.3) Give one advantage of Strict 2PL over Rigorous 2PL, and one advantage of Rigorous 2PL over Strict 2PL. (3 points)

C.4) Compare and contrast optimistic concurrency control with multi-version concurrency control. Give one advantage of each approach over the other. (3 points)

PART D: Recovery (25 points)

D.1) If you had a choice of implementing only one of FORCE/STEAL and NO FORCE/NO STEAL buffer managers, rather than the NO FORCE/STEAL policy described in class, which would you pick, and why?
(5 points)

D.2) Assume that the buffer manager implements the steal and force policies, i.e., the buffer manager allows pages to be written to disk before transactions that have written to the page have committed (steal), and the buffer manager flushes all updates by a transaction to disk before the transaction commits (force). How would this change the structure of a log record? Why? (5 points)

D.3) When does the WAL (Write-Ahead Logging) protocol force the log to disk? State the precise point, such as “after XXX happens”, or “before XXX happens”. (Hint: There is more than one occasion.) (5 points)

D.4) In extreme scenarios, recovery systems must address the issue of repeated failures during the recovery process; that is, the database system may suffer from hardware (or external) failures in the middle of recovering with the UNDO and REDO logs. How does recovery behave when this issue arises, and is this scenario a problem? (5 points)

D.5) What are the advantages and disadvantages of using disk-based checkpoints for recovery purposes compared to using a replicated distributed database? (5 points)

PART E: Parallel and distributed databases (25 points)

E.1) Define the terms scale-up and speed-up in the context of a parallel database system. (4 points)

E.2) Briefly describe three different ways of horizontally partitioning a relation across several processors in a parallel database system that uses the shared-nothing architecture. (4 points)

E.3) Consider both local-area and wide-area distributed join algorithms:
  i. In the wide-area setting, when would you use query shipping rather than data shipping? How does this compare to scenarios in which to use query shipping in a local-area network? (4 points)
  ii. How can semi-joins reduce the amount of data transferred between sites in a distributed join? (2 points)
  iii. What are the limitations of using a DHT (Distributed Hash Table) to evaluate a distributed join?
(2 points)

E.4) What is the difficulty in using an asynchronous, update-everywhere replication protocol? Can you give an example where such a protocol is applicable? (4 points)

E.5) The CAP theorem has been discussed as being inappropriate for cluster and datacenter environments. Give three reasons to support this claim, covering both failure models and quality of service arguments. (5 points)

SECTION                                   QUESTION                SCORE    SECTION TOTAL

Part A                                    A.1 (Max: 7 points)              (Max: 25 points)
Analytics and multiquery optimization     A.2 (Max: 9 points)
                                          A.3 (Max: 9 points)

Part B                                    B.1 (Max: 4 points)              (Max: 25 points)
Physical database design                  B.2 (Max: 9 points)
                                          B.3 (Max: 4 points)
                                          B.4 (Max: 8 points)

Part C                                    C.1 (Max: 15 points)             (Max: 25 points)
Concurrency control                       C.2 (Max: 4 points)
                                          C.3 (Max: 3 points)
                                          C.4 (Max: 3 points)

Part D                                    D.1 (Max: 5 points)              (Max: 25 points)
Recovery                                  D.2 (Max: 5 points)
                                          D.3 (Max: 5 points)
                                          D.4 (Max: 5 points)
                                          D.5 (Max: 5 points)

Part E                                    E.1 (Max: 4 points)              (Max: 25 points)
Parallel and distributed databases        E.2 (Max: 4 points)
                                          E.3 (Max: 8 points)
                                          E.4 (Max: 4 points)
                                          E.5 (Max: 5 points)

Total                                     (Max: 125 points)