Jørgen Løland

Materialized View Creation and Transformation of Schemas in Highly Available Database Systems

Thesis for the degree philosophiae doctor

Trondheim, October 2007

Norwegian University of Science and Technology
Faculty of Information Technology, Mathematics and Electrical Engineering
Department of Computer and Information Science

© Jørgen Løland

ISBN 978-82-471-4381-0 (printed version)
ISBN 978-82-471-4395-7 (electronic version)
ISSN 1503-8181

Doctoral theses at NTNU, 2007:199

Printed by NTNU-trykk

To Ingvild and Ottar.

Preface

This thesis is submitted to the Norwegian University of Science and Technology in partial fulfillment of the degree PhD. The work has been carried out at the Database System Group, Department of Computer and Information Science (IDI). The study was funded by the Faculty of Information Technology, Mathematics and Electrical Engineering through the "forskerskolen" program.

Acknowledgements

First, I would like to thank my advisor, Professor Svein-Olaf Hvasshovd, for his guidance and ideas, and for providing valuable comments on drafts of the thesis and papers. I would also like to thank my co-advisors, Dr. Ing. Øystein Torbjørnsen and Professor Svein Erik Bratsberg, for constructive feedback and interesting discussions regarding the research.

During the years I have been working on this thesis, I have received help from many people. In particular, I would like to thank Heine Kolltveit and Jeanine Lilleng for many interesting discussions. In addition, Professor Kjetil Nørvåg has been a seemingly infinite source of information when it comes to academic publishing. I would also like to thank the members of the Database System Group in general for providing a good environment for PhD students.
I sincerely thank Rune Havnung Bakken, Jon Olav Hauglid and Associate Professor Roger Midtstraum for proofreading and commenting on drafts of the thesis. Your feedback has been invaluable. I would also like to thank my parents and sister for their inspiration and encouragement. Finally, I express my deepest thanks to my wife Ingvild for her constant love and support.

Abstract

Relational database systems are used in thousands of applications every day, including online web shops, electronic medical records and mobile telephone tracking. Many of these applications have high availability requirements, allowing the database system to be offline for only a few minutes each year.

In existing DBMSs, user transactions get blocked during creation of materialized views (MVs) and non-trivial schema transformations. Blocking user transactions is not an option in database systems requiring high availability. A non-blocking method to perform these operations is therefore needed.

Our research has focused on how the MV creation and schema transformation operations can be performed in database systems with high availability requirements. We have examined existing solutions to MV creation and schema transformations, and identified requirements. Most important among these requirements were that the method should not have blocking effects, and should degrade the performance of concurrent transactions to the smallest possible extent.

The main contribution of this thesis is a method for creation of derived tables (DTs) using relational operators. Furthermore, we show how these DTs can be used to create MVs and to perform schema transformations. The method is non-blocking, and may be executed as a low-priority background process to minimize performance degradation.

The MV creation and schema transformation methods have been implemented in a prototype DBMS. By performing thorough empirical validation experiments on this prototype, we show that the method works correctly.
Furthermore, through extensive performance experiments, we show that the method incurs little response time and throughput degradation under moderate workloads. Thus, the method provides a way to create MVs and to transform the database schema that can be used in highly available database systems.

Contents

Part I Background and Context

1 Introduction
  1.1 Motivation
    1.1.1 The Derived Table Creation Problem
  1.2 Research Questions
  1.3 Research Methodology
  1.4 Organization of this thesis

2 Derived Table Creation Basics
  2.1 Database Systems - An Introduction
  2.2 Concurrency Control
  2.3 Recovery
  2.4 Record Identification Policy

3 A Survey of Technologies Related to Non-Blocking Derived Table Creation
  3.1 Ronström's Schema Transformations
    3.1.1 Simple Schema Changes
    3.1.2 Complex Schema Changes
    3.1.3 Cost Analysis of Ronström's Method
  3.2 Fuzzy Table Copying
  3.3 Materialized View Maintenance
    3.3.1 Snapshots
    3.3.2 Materialized Views
  3.4 Schema Transformations and DT Creation in Existing DBMSs
  3.5 Summary

Part II Derived Table Creation

4 The Derived Table Creation Framework
  4.1 Overview of the Framework
  4.2 Step 1: Preparation
  4.3 Step 2: Initial Population
  4.4 Step 3: Log Propagation
  4.5 Step 4: Synchronization
  4.6 Considerations for Schema Transformations
    4.6.1 A lock forwarding improvement for schema transformations
  4.7 Summary

5 Common DT Creation Problems
  5.1 Missing Record and State Identification
  5.2 Missing Record Pre-States
  5.3 Lock Forwarding During Transformations
  5.4 Inconsistent Source Records
    5.4.1 Repairing Inconsistencies
  5.5 Summary

6 DT Creation using Relational Operators
  6.1 Difference and Intersection
    6.1.1 Preparation
    6.1.2 Initial Population
    6.1.3 Log Propagation
    6.1.4 Synchronization
  6.2 Horizontal Merge with Duplicate Inclusion
    6.2.1 Preparation
    6.2.2 Initial Population
    6.2.3 Log Propagation
    6.2.4 Synchronization
  6.3 Horizontal Merge with Duplicate Removal
    6.3.1 Preparation Step
    6.3.2 Initial Population Step
    6.3.3 Log Propagation Step
    6.3.4 Synchronization Step
  6.4 Horizontal Split Transformation
    6.4.1 Preparation
    6.4.2 Initial Population
    6.4.3 Log Propagation
    6.4.4 Synchronization
  6.5 Vertical Merge
    6.5.1 Preparation
    6.5.2 Initial Population
    6.5.3 Log Propagation
    6.5.4 Synchronization
  6.6 Vertical Split over a Candidate Key
    6.6.1 Preparation
    6.6.2 Initial Population
    6.6.3 Log Propagation
    6.6.4 Synchronization
  6.7 Vertical Split over a Functional Dependency
    6.7.1 Preparation
    6.7.2 Initial Population
    6.7.3 Log Propagation
    6.7.4 Synchronization
    6.7.5 How to Handle Inconsistent Data - An Extension to Vertical Split
  6.8 Summary

Part III Implementation and Evaluation

7 Implementation Alternatives
  7.1 Alternative 1 - Simulation
  7.2 Alternative 2 - Open Source DBMS
  7.3 Alternative 3 - Prototype
  7.4 Implementation Alternative Discussion

8 Design of the Non-blocking DBMS
  8.1 The Non-blocking DBMS Server
    8.1.1 Database Communication Module
    8.1.2 SQL Parser Module
    8.1.3 Relational Manager Module
    8.1.4 Scheduler Module
    8.1.5 Recovery Manager Module
    8.1.6 Data Manager Module
    8.1.7 Effects of the Simplifications
  8.2 Client and Administrator Programs
  8.3 Summary

9 Prototype Testing
  9.1 Test Environment
  9.2 Empirical Validation of the Non-Blocking DT Creation Methods
  9.3 Performance Testing
    9.3.1 Log Propagation - Difference and Intersection
    9.3.2 Log Propagation - Vertical Merge
    9.3.3 Low Performance Degradation or Short Execution Time?
    9.3.4 Other Steps of DT Creation
    9.3.5 Performance Experiment Summary
  9.4 Discussion

10 Discussion
  10.1 Contributions
    10.1.1 A General DT Creation Framework
    10.1.2 DT Creation for Many Relational Operators
    10.1.3 Support for both Schema Transformations and Materialized Views
    10.1.4 Solutions to Common DT Creation Problems
    10.1.5 Implemented and Empirically Validated
    10.1.6 Low Degree of Performance Degradation
    10.1.7 Based on Existing DBMS Functionality
    10.1.8 Other Considerations - Total Amount of Data
  10.2 Answering the Research Question
    10.2.1 Summary

11 Conclusion and Future Work
  11.1 Research Contributions
  11.2 Future Work
  11.3 Publications

Part IV Appendix

A Non-blocking Database: SQL Syntax
B Performance Graphs

Glossary
Bibliography

List of Figures

2.1 Database System
2.2 Compensation Log Records provide valid State Identifiers
3.1 Ronström's Horizontal Merge Method
3.2 Ronström's Horizontal Split Method
3.3 Examples of Vertical Merge Schema Change
3.4 Chain of Triggers in Ronström's Vertical Merge Method
3.5 Ronström's Vertical Split Method
3.6 Ronström's Vertical Split Transformation
3.7 Ronström's Vertical Split Method and Inconsistent Data
3.8 Example MV Consistency Problem
4.1 The four steps of DT creation
5.1 Solving the Record and State Identification Problems
5.2 Solving the Missing Record Pre-State Problem
5.3 Example Simple Lock Forwarding (SLF)
5.4 Lock Compatibility Matrix
5.5 Example Many-to-One Lock Forwarding (M1LF)
5.6 Example Many-to-Many Lock Forwarding (MMLF)
5.7 Inconsistent Source Records
6.1 Difference and Intersection DT Creation
6.2 Horizontal Merge DT Creation
6.3 Horizontal Merge - Duplicate Inclusion
6.4 Horizontal Merge - Duplicate Inclusion with type attribute
6.5 Horizontal Merge - Duplicate Removal
6.6 Horizontal Split DT Creation
6.7 Example vertical merge DT creation
6.8 Synchronization of a Vertical Merge Schema Transformation
6.9 Vertical split over a Candidate Key
6.10 Vertical split over a non-candidate key
7.1 Possible Modular Design of Prototype
8.1 Modular Design Overview of the Non-blocking DBMS
8.2 UML Class Diagram of the Non-blocking Database System
8.3 Sequence Diagram - Relational Manager Processing a Query
8.4 Organization of the log
8.5 Organization of data records in a table
8.6 Screen shot of the Client program in action
9.1 Response time and throughput for difference and intersection
9.2 Response time distribution for difference and intersection
9.3 Response time - difference log propagation
9.4 Throughput - difference log propagation
9.5 Response time and throughput - vertical merge DT creation
9.6 Comparison of vertical merge and difference/intersection
9.7 Time vs Degradation
9.8 Response Time Summary
11.1 Example of Schema Transformation performed in two steps
11.2 Example interface for dynamic priorities for DT creation
B.1 Response time and throughput - horizontal merge DT creation
B.2 Response time - horizontal split DT creation
B.3 Throughput - horizontal split DT creation
B.4 Response time - vertical merge DT creation
B.5 Response time - vertical merge DT creation, varying table size
B.6 Throughput - vertical merge DT creation
B.7 Response time - vertical split DT creation
B.8 Throughput - vertical split DT creation

List of Tables

3.1 The three dimensions of Ronström's schema transformations
3.2 Legend for Tables 3.3 and 3.4
3.3 Cost Incurred by Ronström's Vertical Merge Schema Transformation Method
3.4 Added Cost by Ronström's Vertical Split Schema Transformation Method
5.1 DT Creation Problems and Solutions
6.1 DT Creation Operators
6.2 Problems and solutions for DT Creation methods
7.1 Evaluation - Open Source DBMSs
7.2 Evaluation of implementation alternatives
9.1 Hardware and Software Environment for experiments
9.2 Transaction Mix 1
9.3 Transaction Mix 2
9.4 Transaction Mix 3
9.5 Table Sizes used in the experiments
9.6 Response Time Distribution Summary
9.7 Response Time Initial Population and Log Propagation
9.8 Effects of varying priorities

Part I Background and Context

Chapter 1 Introduction

The topic of this thesis is schema transformations and materialized view creation in relational database systems with high availability requirements.
The main focus is on how creation of derived tables can be used to perform both operations while incurring minimal performance degradation for concurrent transactions. In this chapter, the motivation for the topic is presented, and the research questions and methodology are discussed.

1.1 Motivation

Relational database systems have had tremendous success since Ted Codd introduced the relation concept in the famous paper "A Relational Model of Data for Large Shared Data Banks" in 1970 (Codd, 1970). Today, this type of database system is so dominant that "database system" is close to synonymous with "relational database system". Relational database systems[1], or simply database systems, are used in virtually all kinds of applications, spanning from simple personal database systems to huge and complex business database systems.

Personal database systems, including CD or book archives and contact information for friends and family, typically contain few tuples (on the order of hundreds). The consequences of unavailability[2] for such database systems are, in general, not critical; it would still be possible to play music even if the CD archive was unavailable. Because people normally consider sporadic downtime of such systems acceptable, these systems have low requirements for availability. At the other end of the scale, database systems used in business applications may be very large, often on the order of millions or even billions of tuples.

[1] Throughout this thesis, the term database is used to denote a collection of related data. A Database Management System (DBMS) is a program used to manage these data, while database system denotes a collection of data managed by a DBMS (Elmasri and Navathe, 2004).
[2] In this thesis, database systems are considered available when they can be fully accessed by their intended users.
Business database systems are involved in everything from stock exchanges, web shops, banking and airline ticket booking to patient histories at hospitals[3], Enterprise Resource Planning (ERP) and Home Location Registers (HLR) used to keep track of mobile telephones in a network. While database system availability is not critical to all business applications, it certainly is to many. Database systems are, e.g., required for the exchange of stocks at NASDAQ and for customers to shop at Amazon.com. Even more critically, the HLR database system is required for any mobile phone to work in a network. These systems should not be unavailable for long periods of time.

Database Operations

Users interact with database systems by performing transactions, which may consist of one or more database operations (Elmasri and Navathe, 2004). Examples of basic database operations include inserting a new patient into a hospital database and querying a patient's medical record for possibly dangerous allergies before surgery.

In modern database systems, e.g. Microsoft SQL Server 2005 (Microsoft TechNet, 2006) and IBM DB2 version 9 (IBM Information Center, 2006), the basic operations are designed to achieve high degrees of concurrency (Garcia-Molina et al., 2002). While the operations performed by one user may block other users from accessing the very same data items at the same time, other data items are still accessible. A database operation is said to be blocking if it keeps other transactions from executing their update (and possibly read) operations, effectively making the involved data unavailable. Short-term blocking of small parts of the database may not be problematic. However, blocking a huge amount of data for a long period of time seriously reduces availability. This is obviously unwanted in highly available database systems.

In the following section, two database operations, database schema transformations and creation of materialized views, are described.
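To make the notion of blocking concrete before continuing, the following toy lock table sketches how ordinary record-level locking leaves unrelated data available. All names here are invented for illustration; a real DBMS lock manager is far more elaborate, with shared/exclusive modes, wait queues and deadlock handling.

```python
# Hypothetical record-level lock table (illustrative only).
# A write lock blocks conflicting access to that record alone,
# leaving all other records available to concurrent transactions.
class LockManager:
    def __init__(self):
        self.write_locks = {}  # record id -> transaction id

    def acquire(self, txn, record):
        holder = self.write_locks.get(record)
        if holder is not None and holder != txn:
            return False  # conflicting lock: the operation is blocked
        self.write_locks[record] = txn
        return True

    def release_all(self, txn):
        self.write_locks = {r: t for r, t in self.write_locks.items()
                            if t != txn}

lm = LockManager()
assert lm.acquire("T1", "patient:42")      # T1 updates record 42
assert not lm.acquire("T2", "patient:42")  # T2 is blocked on the same record
assert lm.acquire("T2", "patient:7")       # other records remain available
lm.release_all("T1")
assert lm.acquire("T2", "patient:42")      # after T1 finishes, T2 may proceed
```

The point of the sketch is the contrast: ordinary operations block only the records they touch, whereas the table-level operations discussed next would, performed naively, hold entire tables for the duration.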
Neither of these can be performed without blocking the involved tables for a long period of time in existing DBMSs (Løland, 2003).

[3] Patient history databases may, e.g., describe previous treatment, allergies, x-ray images etc.

1.1.1 The Derived Table Creation Problem

"Due to planned technical maintenance and upgrades, the online bank will be unavailable from Saturday 8 p.m. to Sunday 10 a.m. Our contact center will not be able to help with account information as internal systems are affected as well. We are sorry for the inconvenience this may cause for our customers."
Norwegian Online Bank, October 2006

Schema Transformations

Database schemas[4] are typically designed to model the relevant parts and aspects of the world at design time. The schema may be excellent for the intended usage at the time it is designed, but many applications change over time. New kinds of products appear, departments which had one head of department suddenly have a board, or new laws that affect the company are introduced by the government. These are examples of changes that may require a transformation of the database schema.

In addition to changing needs as a source of transformations, designers may also have been unsuccessful in designing a good schema in the first place. After being used for some time, it may turn out that a schema does not work as well as it was intended to. Often, the reason for this is that the design is a compromise between many factors, some of which include readability of the E/R diagram, removal of anomalies and optimization of runtime efficiency. It may very well turn out that the schema is too inefficient or that the designers simply forgot or misinterpreted something.

In a study of seven applications, Marche (Marche, 1993) reports significant changes to relational database schemas over time. Six of the studied schemas had more than 50% of their attributes changed. The evolution continued after the development period had ended.
A similar study of a health management system came to the same conclusion (Sjøberg, 1993). As should be clear, a database schema may sometimes have to be changed after the database has been populated with data. In this thesis, we refer to such changes as "schema transformations". An important shortcoming of all but the least complex schema transformations is that they must be performed in a blocking way in today's DBMSs (Lorentz and Gregoire, 2003b; Microsoft TechNet, 2006; IBM Information Center, 2006). This will be elaborated on in Section 3.4.

[4] The description, or model, of a database (Elmasri and Navathe, 2000).

Materialized Views

A database view is a table derived from other tables, and is defined by a database query called the view query (Elmasri and Navathe, 2004). Views may be either virtual or materialized. Virtual views do not physically store any records, but can still be queried like normal tables. This is done by using the view queries to rewrite the user queries (Elmasri and Navathe, 2004). Depending on the complexity of the view query, querying a virtual view may be much more costly than querying a normal table. To remedy this, most modern DBMSs support Materialized Views (MVs)[5] (Løland, 2003). Unlike a virtual view, the result of the view query is stored physically in an MV (Elmasri and Navathe, 2004).

MVs have many uses in addition to speeding up queries (Alur et al., 2002). They can be used to store historical information, e.g. sales reports for each quarter of a year. They are also frequently used in Data Warehouses. Because of the great performance advantages of MVs and their widespread use, much research has been conducted on how to keep MVs consistent with the source tables (Løland and Hvasshovd, 2006c). However, in current DBMSs, the MVs still have to be created in a way that blocks all updates to the source tables while the MV is created (Lorentz and Gregoire, 2003b; IBM Information Center, 2006; Microsoft TechNet, 2006).
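The distinction between virtual and materialized views can be sketched in a few lines of SQL. The sketch below uses SQLite (which is not among the DBMSs discussed in this thesis and has no built-in MV support, so the materialized copy is simulated with CREATE TABLE ... AS SELECT); the table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (employee TEXT, amount INTEGER)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("alice", 100), ("bob", 200), ("alice", 50)])

# A virtual view stores no records; the view query is re-evaluated
# against the source table on every read.
cur.execute("""CREATE VIEW sales_per_employee AS
               SELECT employee, SUM(amount) AS total
               FROM sales GROUP BY employee""")

# A materialized view physically stores the query result.
cur.execute("""CREATE TABLE sales_per_employee_mv AS
               SELECT employee, SUM(amount) AS total
               FROM sales GROUP BY employee""")

# After a new sale, the virtual view reflects the change immediately,
# while the materialized copy is stale until it is maintained.
cur.execute("INSERT INTO sales VALUES ('bob', 25)")
print(cur.execute("SELECT total FROM sales_per_employee "
                  "WHERE employee='bob'").fetchone())     # (225,)
print(cur.execute("SELECT total FROM sales_per_employee_mv "
                  "WHERE employee='bob'").fetchone())      # (200,)
```

The stale materialized copy is precisely why MV maintenance, cited above, is such an active research area; the separate question addressed in this thesis is how the initial CREATE of the materialized copy can avoid blocking updates to the source table.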
Using Derived Tables for Schema Transformations and Materialized View Creation

The blocking MV creation and schema transformation methods described in the previous sections may take minutes or more for tables with large amounts of data. If either of these operations is required, the database administrator is forced to choose between unavailability while performing the operation, or not performing it at all. Both choices may, however, be unacceptable. This is especially the case when the database system has high availability requirements.

A derived table (DT) is, as the name suggests, a database table containing records derived from other tables[6] (Elmasri and Navathe, 2004). A table "Sales Report" that stores a one-year history of all sales by all employees is an example of a DT. Hence, a materialized view is obviously one type of DT. A less intuitive application of DTs is to redirect operations from source tables to derived tables and thereby perform a schema transformation. A method to create DTs is therefore likely to be usable for both operations.

[5] Materialized Views are called Indexed Views by Microsoft (Microsoft TechNet, 2006) and Materialized Query Tables by IBM (IBM Information Center, 2007).
[6] Throughout this thesis, the tables that records are derived from will be called source tables.

Both schema transformations and Materialized Views are defined by a query (Microsoft TechNet, 2006; Lorentz and Gregoire, 2003a; Løland, 2003), and we therefore focus on DT creation using relational operators[7], also called relational algebra operations. Relational operators can be categorized in two groups: non-aggregate and aggregate operators (Elmasri and Navathe, 2004). The non-aggregate operators are cartesian product, various joins, projection, union, selection, difference, intersection and division. Aggregate operators are mathematical functions that apply to collections of records.
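As a concrete illustration, two of the non-aggregate operators, difference and intersection, can populate a derived table directly from its source tables. A minimal SQLite sketch (table names invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees_oslo (name TEXT)")
cur.execute("CREATE TABLE employees_trondheim (name TEXT)")
cur.executemany("INSERT INTO employees_oslo VALUES (?)",
                [("anne",), ("bjorn",), ("carl",)])
cur.executemany("INSERT INTO employees_trondheim VALUES (?)",
                [("bjorn",), ("dina",)])

# Difference: records in the first source table but not the second.
cur.execute("""CREATE TABLE only_oslo AS
               SELECT name FROM employees_oslo
               EXCEPT SELECT name FROM employees_trondheim""")

# Intersection: records present in both source tables.
cur.execute("""CREATE TABLE both_cities AS
               SELECT name FROM employees_oslo
               INTERSECT SELECT name FROM employees_trondheim""")

print(cur.execute("SELECT name FROM only_oslo "
                  "ORDER BY name").fetchall())  # [('anne',), ('carl',)]
print(cur.execute("SELECT name FROM both_cities").fetchall())  # [('bjorn',)]
```

A CREATE TABLE ... AS SELECT of this kind reads the source tables in a single operation; performed naively on large tables it holds them against concurrent updates, which is the availability problem this thesis addresses.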
Both non-aggregate and aggregate operators can be used to define schema transformations and MVs. However, aggregate operators are typically not used without non-aggregate operators (Alur et al., 2002), and we therefore consider non-aggregate operators the best starting point for DT creation.

The main topic of this thesis is to develop a method that solves the unavailability problem of creating derived tables, using common relational operators. Due to time constraints, we will focus on six operators: full outer equijoin (one-to-many and many-to-many relationships), projection, union, selection, difference and intersection[8]. Full outer equijoin is chosen because it can later be reduced to any inner/left/right join simply by removing records from the result, and because equality is the most commonly used comparison type in joins (Elmasri and Navathe, 2004). Furthermore, in terms of derived table creation, cartesian product is simply a full outer join in which no attribute comparison is performed. The final non-aggregate operator, division, can be expressed in terms of the other operators, and is therefore considered less important.

The suggested method must solve any problem associated with utilizing the DTs as materialized views and in schema transformations. To gain insight into the field, the work must include a thorough study of existing solutions to the described and closely related problems. Existing DBMS functionality should be used to the greatest possible extent to ease the integration of the method into existing DBMSs. Since the goal is to develop a method that incurs little performance degradation to concurrent transactions, the performance implications of the method need to be tested.

[7] Relational operators are the building blocks used in queries (Elmasri and Navathe, 2004).
[8] Due to naming conventions in the literature (Ronström, 1998), we use the names vertical merge and split, horizontal merge and split, difference and intersection, respectively, when these relational operators are used in DT creation.

1.2 Research Questions

Based on the discussion in the previous section, the main research question of the thesis is:

How can we create derived tables and use these for schema transformation and materialized view creation purposes while incurring minimal performance degradation to transactions operating concurrently on the involved source tables?

We realize that this is a research question with many aspects. To be able to answer it, the research question is therefore refined into four key challenges:

Q1: Current situation
What is the current status of related research designed to address the main research question or part of it?

Q2: System Requirements
What DBMS functionality is required for non-blocking DT creation to work?

Q3: Approach and solutions
How can derived tables be created with minimal performance degradation, and be used for schema transformation and MV creation purposes?
• How can we create derived tables using the chosen six relational operators?
• What is required for the DTs to be used a) as materialized views? b) for schema transformations?
• To what extent can the solution be based on standard DBMS functionality and thereby be easily integrable in existing DBMSs?

Q4: Performance
Is the performance of the solution satisfactory?
• How much does the proposed solution degrade performance for user transactions operating concurrently?
• With the inevitable performance degradation in mind: under which circumstances is the proposed solution better than a) other solutions? b) performing the schema transformation or MV creation in the traditional, blocking way?

1.3 Research Methodology

Denning et al.
divide computer science research into three paradigms: theory, abstraction and design (Denning et al., 1989).

Theory is rooted in mathematics and aims at developing validated theories. The paradigm consists of four steps:

1. characterize objects of study
2. hypothesize possible relationships among them, i.e., form a theorem
3. determine whether the relationships are true, i.e., proof
4. interpret results

Abstraction is rooted in the experimental scientific method. In this method, a phenomenon is investigated by collecting and analyzing experimental results. It consists of four steps:

1. form a hypothesis
2. construct a model and make a prediction
3. design an experiment and collect data
4. analyze results

Design is rooted in engineering, and aims at constructing a system that solves a problem. It consists of four steps:

1. state requirements
2. state specifications
3. design and implement the system
4. test the system

The research presented in this thesis fits naturally into the Design paradigm. The research aims at solving the problem that creation of derived tables is a blocking operation. For our suggested solution to be useful, the method must fit into common DBMS design. Hence, the first step in solving the research question is to understand commonly used DBMS technology that is somehow related to the research question. This will enable us to state requirements.

The next step is to state specifications for a method that can be used to create derived tables in a non-blocking way. The method should be designed to fit into existing DBMSs to the greatest possible extent, and to degrade performance as little as possible. To verify validity, and to test the actual performance degradation, a DBMS and the suggested method are then designed and implemented. The implementation is then subjected to thorough performance testing.

1.4 Organization of this thesis

The thesis is divided into three parts with different focus.
The focus in Part I is on the background for the research. This includes the research question, required functionality and a survey of related work. Part II presents our solution to the derived table creation problem, and shows how the DTs can be used to transform the schema and to create materialized views. In Part III, we discuss the results of experiments on a prototype DBMS. This part also includes a discussion of the research contributions, and suggests further work.

Part I - Background and Context contains an introduction to derived table creation. The focus in this part is on research from the literature and existing systems that are relevant to our solution of the research question and suggestions for further work.

Chapter 1 contains this introduction. The chapter states the motivation for the research, and the research methodology that is used in the work.

Chapter 2 introduces the DBMS fundamentals required to perform non-blocking derived table creations.

Chapter 3 is a survey of existing solutions to the non-blocking DT creation problem and related problems.

Part II - DT Creation Framework presents our solution for non-blocking creation of derived tables, and how these derived tables can be used for schema transformations and materialized view creation.

Chapter 4 introduces our framework for non-blocking derived table creation.

Chapter 5 identifies problems that are encountered when derived tables are created as described in Chapter 4. The chapter also shows how these problems can be solved.

Chapter 6 describes in detail how the DT creation framework presented in Chapter 4 can be used for non-blocking creation of derived tables using the six relational operators that have been chosen. The chapter also describes what needs to be done to use these DTs for schema transformations or as materialized views.

Part III - Implementation and Testing presents the design of our prototype DBMS.
The prototype is capable of performing non-blocking DT creation as described in Part II. Results from performance testing of this prototype are also presented.

Chapter 7 evaluates three alternatives for implementation of the DT creation method.

Chapter 8 describes the design of a prototype DBMS capable of performing the DT creation method developed in Part II.

Chapter 9 describes the experiment types and discusses the results of performing the required experiments on the prototype.

Chapter 10 contains a discussion of the results of the research.

Chapter 11 presents an overall conclusion and the contributions of the thesis.

Chapter 2
Derived Table Creation Basics

This chapter describes basic Database Management System concepts that are used by or are otherwise relevant to our non-blocking DT creation method. A thorough description of DBMSs is out of the scope of this thesis. For further details, the reader is referred to one of the many text books on the subject, e.g. "Database Systems: The Complete Book" (Garcia-Molina et al., 2002) or "Fundamentals of Database Systems" (Elmasri and Navathe, 2004).

2.1 Database Systems - An Introduction

A database (DB) is a collection of data items1, each having a value (Bernstein et al., 1987). All access to the database goes through the Database Management System (DBMS). As illustrated in Figure 2.1, a database managed by a DBMS is called a database system (Elmasri and Navathe, 2004).

Database access is performed by executing special transaction programs, which have a defined start and end, set by start and either commit or abort operation requests. A commit ensures that all operations of the transaction are executed and safely stored, while an abort call removes all effects of the transaction (Elmasri and Navathe, 2004). In its most common use, transactions have four properties, known as the "ACID" properties (Gray, 1981; Haerder and Reuter, 1983):

Atomicity - The transaction must execute successfully, or must appear not to have executed at all.
This is also referred to as the "all or nothing" property of transactions. Thus, the DBMS must be able to undo all operations performed by a transaction that is aborted.

Figure 2.1: Conceptual Model of a Database System.

Consistency - A transaction must always transform the database from one consistent state2 to another.

Isolation - It should appear to each transaction that other transactions either appeared before or after it, but not both (Gray and Reuter, 1993).

Durability - The results of a transaction are permanent once the transaction has committed.

Broadly speaking, the ACID properties are enforced mainly by concurrency control, which is used to achieve isolation, and recovery, which is used for atomicity and durability. Consistency means that transactions must preserve constraints. The following two sections give a brief introduction to common concurrency control and recovery mechanisms.

1 The data items are called tuples or rows in the relational data model. Internally in database systems, they are called records (Garcia-Molina et al., 2002). To avoid confusion, the term "record" will be used throughout this thesis.

2 A consistent state is a state where database constraints are not broken (Garcia-Molina et al., 2002).

2.2 Concurrency Control

In a database system where only one transaction is active at any time, concurrency control is not needed. In this scenario, the operations from each transaction are executed in serial (i.e. sequential) order, and two transactions can never interfere with each other. The isolation property is therefore implicitly guaranteed. However, this scenario is seldom used since the database system is normally only able to use small parts of the available resources at any time (Bernstein et al., 1987).
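To see why concurrent transactions cannot simply be interleaved arbitrarily, consider the classic lost-update anomaly. The following sketch is our own illustration, not an example from the thesis; it contrasts a serial execution with an unscheduled interleaving of the same two transactions:

```python
# Sketch (not from the thesis): the "lost update" anomaly that
# concurrency control must prevent. Two transactions read the same
# item, each adds an amount, and one write overwrites the other.
db = {"x": 100}

# Serial execution: T1 then T2 -> both updates survive.
t1_read = db["x"]          # T1 reads 100
db["x"] = t1_read + 10     # T1 writes 110
t2_read = db["x"]          # T2 reads 110
db["x"] = t2_read + 20     # T2 writes 130
serial_result = db["x"]

# Interleaved execution without a scheduler: T2 reads before T1
# writes, so T2's write clobbers T1's update.
db = {"x": 100}
t1_read = db["x"]          # T1 reads 100
t2_read = db["x"]          # T2 also reads 100
db["x"] = t1_read + 10     # T1 writes 110
db["x"] = t2_read + 20     # T2 writes 120 -- T1's update is lost
interleaved_result = db["x"]

print(serial_result, interleaved_result)  # prints: 130 120
```

The interleaved history is not equivalent to any serial order of T1 and T2, which is exactly the kind of history a scheduler must rule out.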
When concurrent transactions are allowed, the operations of the various transactions must be executed as if the execution was serial (Garcia-Molina et al., 2002). A sequence, or history, of operations that gives the same result as a serial execution is called serializable. It is the responsibility of the "Scheduler", or "Concurrency Controller", to enforce serializable histories, which in turn is a guarantee for isolation (Bernstein et al., 1987).

Schedulers

Schedulers can be either optimistic or pessimistic. With the optimistic strategy, transactions perform operations right away without first checking for conflicts. When the transaction requests a commit, however, its history is checked. If the transaction has been involved in any non-serializable operations, the transaction is forced to abort. Timestamp ordering, serialization graph testing and locking can all be used for optimistic scheduling (Bernstein et al., 1987).

The most common form of scheduling is pessimistic, however. With this strategy, transactions are not allowed to perform operations that will form non-serializable histories in the first place. Thus, the scheduler has to check every operation to see if it conflicts with any operation executed by another currently active transaction. When a conflict is found, the scheduler may decide to either delay or reject the operation (Bernstein et al., 1987).

The pessimistic Two Phase Locking (2PL) strategy has become the de facto scheduling standard in commercial DBMSs. It is, e.g., used in Oracle Database 10g (Cyran and Lane, 2003). In 2PL, a lock must be set on a data item before a transaction is allowed to operate on it. Two lock types, shared and exclusive, are typically used. The idea is that multiple transactions should be allowed to concurrently read the same record, while only one transaction should at any time be allowed to write to a record.
Thus, read operations are allowed if the transaction has a shared or exclusive lock on the record, while write operations are allowed only if the transaction has an exclusive lock on it.

As the name indicates, 2PL works in two phases: locks are acquired during the first phase, and released during the second phase. This implies that a transaction is not allowed to set new locks once it has started releasing locks. Unless the transaction pre-declares all operations it will execute, the scheduler does not know if a transaction is done operating on a particular object, or if it will need more locks in the future. Locks are therefore typically not released until the transaction terminates. This is known as Strict 2PL (Garcia-Molina et al., 2002). The derived table creation method described in this document assumes that 2PL is used, although it may be tailored to suit other scheduling strategies as well.

Two-phase commit is a commonly used protocol for commit handling in distributed DBMSs, used to ensure that a transaction either commits on all nodes or aborts on all nodes. The protocol works in two phases: in the prepare phase, the transaction coordinator asks all transaction participants if they are ready to commit. If they all agree to commit, the coordinator completes the transaction by sending a commit message to all participants (Gray, 1978). This is called the commit phase.

2.3 Recovery

In a database system, failure may occur on three levels. Transaction failure happens when a transaction either chooses to, or is forced to, abort. System failure happens when the contents of volatile storage are lost or corrupted. A power failure is a typical reason for such failures. Media failure happens when the contents of non-volatile storage are either lost or corrupted, e.g. because of a disk crash. In what follows, "memory" and "disk" will be used instead of volatile and non-volatile storage, respectively.
Physical Logging

Recovery managers are constructed to correct the three types of failure. The idea behind almost all recovery managers is that information on how to recover the database to the correct state must be safely stored at all times. This information is typically stored in a log, which can be either physical, logical or physiological (Haerder and Reuter, 1983; Bernstein et al., 1987).

Physical logging, or value logging, writes the before and after value of a changed object to the log (Gray, 1978). The physical unit that is logged is typically a disk block or a record. Assuming that records are smaller than disk blocks, the former produces drastically higher log volumes than the latter (Haerder and Reuter, 1983). Since the log records contain before and after values of the changed object, logged operations are idempotent, which means that redoing the operation multiple times yields the same result as redoing it once.

Figure 2.2: Two records in the same disk block are updated by different transactions, T1 and T2. After T1 aborts, there is no valid state identifier for the block.

Logical Logging

Logical logging, or operation logging, logs the operation that is performed instead of the before and after value (Haerder and Reuter, 1983). This strategy produces much smaller log volumes than the physical methods (Bernstein et al., 1987). A Log Sequence Number (LSN) is assigned to each log record, and data items are tagged with the LSN of the latest operation that has changed them. This is done to ensure that changes are applied only once to each record, since logically logged operations are not idempotent in general. LSNs may be assigned at block (block state identifier, BSI) or record (record state identifier, RSI) level (Bernstein et al., 1987).
The former requires slightly less disk space, whereas the latter is better suited in replicated database systems based on log redo, since this allows for different physical organization at the different nodes (Bratsberg et al., 1997a).

Two common methods to increase the degree of concurrency are fine-granularity locking and semantically rich locks. Fine-granularity locks are normal locks that are set on small data items, i.e. records (Weikum, 1986; Mohan et al., 1992). Semantically rich locks allow multiple transactions to lock the same data item provided that the operations are commutative (Korth, 1983). Operations that commute may be performed in any order, e.g., "increase" and "decrease". When these methods are combined with logical logging, Compensating Log Records (Crus, 1984) are required (Mohan et al., 1992). The reason for this is that there is no correct LSN that can be used as BSI after certain undo operations, as illustrated in Figure 2.2; after the abort of transaction T1, the LSN of the block cannot be set to what it was before the update because that would not reflect the change performed by transaction T2. Neither can the LSN of the abort log record be used, since this invalidates the one-to-one correspondence between updates and log records (Gray and Reuter, 1993).

Thus, Compensating Log Records (CLRs) (Crus, 1984) are written to the log when an undo operation is performed due to any of the failure scenarios presented in Section 2.3. The CLR describes the undo action that takes place (Gray, 1978). It also keeps the LSN of the log record that was undone. E.g., if the insert of record X is undone, a CLR describing the deletion of X is written to the log. LSNs are assigned to CLRs, thus the state identifier of a record or disk block will increase even when undo operations are performed.
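The interplay between LSN-tagged data items and compensating log records can be sketched as follows. This is an illustrative toy implementation of our own (the log format and function names are not ARIES or the thesis's method): redo applies a logged operation only when the record's state identifier shows it has not yet been applied, and undo writes a CLR with a fresh LSN so that state identifiers keep increasing even during rollback.

```python
# Toy sketch (our own, illustrative): LSN-guarded redo of a
# non-idempotent logical operation, plus CLR-based undo.
log = []      # list of log records: {"lsn", "rid", "op", "arg", ...}
records = {}  # rid -> {"value", "lsn"}  (lsn acts as state identifier)

def append(rec):
    """Append a log record and assign it the next ascending LSN."""
    rec["lsn"] = len(log) + 1
    log.append(rec)
    return rec["lsn"]

def redo(rec):
    """Apply a logged 'add' only if the record does not reflect it yet."""
    r = records.setdefault(rec["rid"], {"value": 0, "lsn": 0})
    if r["lsn"] >= rec["lsn"]:
        return  # already applied; reapplying would corrupt the value
    if rec["op"] == "add":
        r["value"] += rec["arg"]
    r["lsn"] = rec["lsn"]  # tag the record with the new state identifier

def undo(rec):
    """Write a CLR describing the inverse action, then apply it."""
    clr_lsn = append({"rid": rec["rid"], "op": "add",
                      "arg": -rec["arg"], "clr_for": rec["lsn"]})
    redo({"rid": rec["rid"], "op": "add", "arg": -rec["arg"],
          "lsn": clr_lsn})

append({"rid": "r1", "op": "add", "arg": 10})   # T1 adds 10 to r1
redo(log[0])
redo(log[0])                                    # no-op thanks to the LSN check
assert records["r1"]["value"] == 10
undo(log[0])                                    # T1 aborts: CLR gets LSN 2
assert records["r1"] == {"value": 0, "lsn": 2}  # identifier increased
```

Note how after the abort the record carries LSN 2 (the CLR's LSN), not the pre-update LSN, mirroring the argument around Figure 2.2.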
Logical logging is considered better than physical logging because of the reduced log volume and because the state identifiers reduce recovery work. However, it has one major flaw: the logged operations are not action consistent3, since they are not atomic. One insert may, e.g., require that large parts of a B-tree are restructured. This can be solved by using a two-level recovery scheme where the low-level system provides action consistent operations to the high-level logical logging scheme (Gray and Reuter, 1993). Shadowing is an example of such a low-level scheme. With this method, blocks are copied before a change is applied. The method is complex and requires locks to be set on blocks, since this is the granularity of the copies (Gray and Reuter, 1993).

Physiological Logging

Physiological logging (Mohan et al., 1992), also called physical-to-a-page, logical-within-a-page, is a compromise between physical and logical logging. It uses logical logging to describe operations on the physical objects, i.e. blocks. In this strategy, non-atomic operations are executed by mini-transactions. The mini-transactions consist of atomic operations, each of which is physiologically logged. Thus, the log records are small, while the problems of logical logging are avoided (Gray and Reuter, 1993).

(No-)Steal and (No-)Force Cache Managers

The log may provide information to undo or redo an operation. Which kind of information is required for recovery to work depends heavily on the strategy of another DBMS component, the cache manager, which is responsible for copying data items between memory and disk. Two parameters are of particular relevance to the recovery manager. The first determines whether or not updates from uncommitted transactions may be written to disk.

3 This means that one logical operation may involve multiple physical operations. Hence, a database system may crash when only parts of a logically logged operation have been performed (Gray and Reuter, 1993).
If uncommitted updates are allowed on disk, the cache manager uses a steal strategy; otherwise a no-steal strategy is used (Gray, 1978). Since memory is almost always a limited resource, stealing gives the cache manager valuable freedom in choosing which data items should be moved out. The problem is that if a failure occurs, data items on disk may have been changed by transactions that have not committed. The atomicity property requires that the effects of these transactions be removed. Hence, if stealing is allowed, undo information for uncommitted writes must be forced to the log before the updated records are written to disk.

The second parameter determines whether data items updated by a transaction must be forced to disk (force) or not (no-force) before the transaction is committed (Gray, 1978). If force is used, the disk must be accessed during critical parts of transaction execution. This may lead to inefficient cache management. If no-force is used, redo information for all committed changes must be written to the log, and the log must then be forced to disk. This is known as Force Log at Commit (Mohan et al., 1992).

It is common to use a steal/no-force cache manager since this provides the maximum freedom and highest performance. This is also the case for a well-known recovery strategy, ARIES (Mohan et al., 1992), which is briefly described in the next section.

The ARIES Recovery Strategy

ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) (Mohan et al., 1992) is a recovery method for the steal/no-force cache manager strategy described above. ARIES uses 2PL for concurrency control. It is in common use in many commercial DBMSs, e.g. SQL Server 2005 (Microsoft TechNet, 2006) and IBM DB2 version 9 (IBM Information Center, 2006). The principles are also used in the derived table creation method, which is the topic of this thesis.
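The relationship between the two cache-manager parameters and the log information recovery needs can be summarized in a small sketch (our own illustration, not from the thesis): stealing forces undo logging, no-force forces redo logging, and the steal/no-force combination used by ARIES therefore needs both.

```python
# Sketch (illustrative): which log information the recovery manager
# needs for a given steal/force combination of the cache manager.
def required_log_info(steal: bool, force: bool) -> set:
    needed = set()
    if steal:
        # Uncommitted changes may reach disk, so they must be undoable.
        needed.add("undo")
    if not force:
        # Committed changes may exist only in memory, so they must be
        # redoable from the log (Force Log at Commit).
        needed.add("redo")
    return needed

# Common high-performance case (used by ARIES): steal/no-force.
assert required_log_info(steal=True,  force=False) == {"undo", "redo"}
assert required_log_info(steal=False, force=True)  == set()
assert required_log_info(steal=True,  force=True)  == {"undo"}
assert required_log_info(steal=False, force=False) == {"redo"}
```

The no-steal/force combination needs neither undo nor redo information but severely restricts the cache manager, which is why steal/no-force is the common choice despite requiring both kinds of log records.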
ARIES uses the Write-Ahead Logging (WAL) protocol (Gray, 1978), which requires that a log record describing a change to a data item is written to disk before the change itself is written. One sequential log, containing both undo and redo log records, is used. A unique, ascending Log Sequence Number (LSN) is assigned to each record in this log. The LSN is also used to tag blocks, so that a disk block is known to reflect a logged change if and only if the LSN of the disk block is equal to or greater than that of the log record. Log records are initially added to the volatile part of the log file, and are forced to disk either when a commit request is processed or when the cache manager writes changed data items to disk.

The ARIES protocol can be used with both logical and physiological logging. It supports fine-granularity locks and semantically rich locks (Mohan et al., 1992).

2.4 Record Identification Policy

The DBMS needs a way to uniquely identify records so that transactional operations and recovery work are applied to the correct record. Each record is therefore assigned a Record Identifier (RID). The mapping from RID to the physical record is called the access path. There are four identification techniques (Gray and Reuter, 1993): Relative Byte Address, Tuple Identifier, Database Key, and Primary Key.

Physical Identification Policies

Relative Byte Addresses (RBA) consist of a block address and an offset, i.e. the byte number within that block. RBA is fast since it points directly to the correct physical address. The physical location is not very stable, however. E.g., an update may increase a record's size, which may change the offset or block. An address as unstable as this is not well suited as a RID (Gray and Reuter, 1993).

Tuple Identifiers (TID) consist of a block address and a logical ID within the block. Each block has an index used to map the ID to the correct offset.
Hence, a record may be relocated within a block without changing the RID. A pointer to the new address is used if a record is relocated to another block. When a pointer is followed, however, the access path to the record becomes more costly. Hence, relocated records should eventually receive a new TID reflecting the actual location. This reorganization must be executed online, i.e. in parallel with normal processing, and represents an overhead. This seems to be the most common record identification technique; it is used by, e.g., IBM DB2 v9 (IBM Information Center, 2006), SQL Server 2005 (Microsoft TechNet, 2006) and Oracle 10g (Cyran and Lane, 2003).

Logical Identification Policies

Database Keys are unique, ascending integers assigned to records by the DBMS. A translation table maps database keys to the physical location of the records. The database key works as an index into the array-like translation table, and therefore requires only one block access. This mapping ensures that a record can be relocated to any block without having to change its RID. The extra lookup incurs an access path overhead, however.

Since all records in a relational database system are required to have a unique primary key4, the primary keys may serve as RIDs as well. Addressing is indirect, but in contrast to the previous method, primary keys cannot be used as an index into a translation table since primary keys do not have monotonically increasing values. Thus, a B-tree is used to map the primary key to the physical location of the record. The access path is approximately as costly as for database keys, but has a number of advantages. These include that access to records is often done through primary keys, so the mapping from primary key to record must be done either way. The uniqueness of primary keys must also be guaranteed by the DBMS. This is efficient to do when the primary key is used as RID (Gray and Reuter, 1993).
This technique is used, e.g., in Oracle 10g if the table is index-organized (Cyran and Lane, 2003).

When creating derived tables, the physical location of records residing in DTs is not the same as in the source tables. The blocks are obviously different. The location within the blocks may also be different since a relational operator is applied. Hence, the DT creation method described in this thesis assumes that a logical record identification scheme is used, i.e. either Database Keys or Primary Keys. Using physical identification policies, i.e. RBA or TID, is also possible, but requires an additional address mapping table.

4 Either supplied by the user or generated by the system (Gray and Reuter, 1993).

Chapter 3
A Survey of Technologies Related to Non-Blocking Derived Table Creation

This chapter describes the state of the art in non-blocking creation of derived tables (DTs). The aim of the survey is to evaluate the functionality and cost of existing methods used for this purpose. Some of the ideas presented here will later be used in our non-blocking DT creation method. This will be explicitly commented on.

Three related areas of research are discussed. First, a schema transformation method that can be used for some of the relational operators is described. To the author's knowledge, this is the only research on non-blocking transformations in relational database systems in the literature. Next, we describe fuzzy copying, which is a method for non-blocking creation of DTs, but without the ability to apply relational operators. Third, maintenance techniques for Materialized Views (MVs) are discussed. The motivation for this is that an MV is a type of DT, and some of the research in MV maintenance is therefore applicable to our suggested DT creation method. Finally, methods for schema transformations and DT creation available in existing DBMSs are described.
3.1 Ronström's Schema Transformations

Ronström (Ronström, 2000) presents a non-blocking method that uses both a reorganizer and triggers within users' transactions to perform schema transformations, called schema changes by Ronström (Ronström, 1998). It is argued that there are three dimensions to schema transformations. These are soft vs. hard schema changes, simple vs. complex schema changes, and simple vs. complex conversion functions (Ronström, 1998). A summary can be found in Table 3.1.

The soft vs. hard schema change dimension determines whether or not new transactions are allowed to use the new schema before all transactions on the old schema have terminated. Thus, with soft schema changes, transactions that were started before the new schema was created continue processing on the old schema while new transactions start using the transformed one (Ronström, 1998). With hard schema changes, new transactions are not allowed to start processing on the affected parts of the transformed schema until all transactions on the old schema have terminated. Soft schema changes are desirable since, with this strategy, new transactions are not blocked while the old transactions finish processing. In some cases, however, soft schema changes cannot be used. This happens whenever the new schema does not contain enough information to trigger updates back to the old schema, i.e. when a mapping function from the transformed attributes to the old attributes does not exist (Ronström, 1998).

The second dimension divides transformations into simple and complex schema changes. Simple schema changes are short lived, and typically involve changes to the schema description only (Ronström, 1998). Complex schema changes involve many records and take considerable time. With this method, complex schema changes should not be executed as one single blocking transaction due to their long execution time (Ronström, 2000).
Instead, complex schema changes are organized using SAGAs1 (Garcia-Molina and Salem, 1987).

The third dimension to schema changes is that of simple vs. complex conversion functions. In simple conversions, all the information needed to apply an operation in the transformed schema is found in the operated-on record in the old schema. Complex conversions, on the other hand, need information from other records before the operation can be applied. This information may be found in other tables (Ronström, 2000). Complex conversions can only be performed by complex schema changes (Ronström, 2000).

The following sections describe how transformations are performed in Ronström's method. The description divides transformations into simple and complex changes, i.e. along the second dimension. Even though not all of the complex schema changes actually create DTs, all changes that can be performed by the method are presented for readability. A thorough cost analysis of schema transformation operators is presented in Section 3.1.3.

1 SAGA is a method to organize long running transactions (Garcia-Molina and Salem, 1987).

Schema change, soft: New transactions are allowed to start accessing the new tables while the old transactions are accessing the old tables.
Schema change, hard: Transactions that try to access the new tables are blocked until all transactions accessing the old tables have completed.
Schema change, simple: Short lived, typically only changes the schema description, executed as one transaction.
Schema change, complex: Long lived, involves many records, executed using triggers and SAGA transactions.
Conversion function, simple: A record can be added to the new schema by reading only the operated-on record in the old schema.
Conversion function, complex: Adding a record to the new schema may require information from multiple records in the old schema. Always executed as a complex schema change.
Table 3.1: The three dimensions of Ronström's schema transformations.

3.1.1 Simple Schema Changes

Simple schema changes only change the schema description of the database. The changes are organized in a way similar to the two-phase commit protocol, described in Section 2.2. First, the system coordinator sends the new schema description to all nodes. If the transformation is hard, each node will wait for ongoing transactions to finish processing. The nodes then lock the involved parts of the schema, update the schema description and log the change before acknowledging the change request. When all nodes have acknowledged the change, the coordinator sends commit to all nodes, including a new schema version number. New transactions will from now on use the transformed schema.

Examples of simple schema changes include adding and dropping a table, adding an attribute with a default value, and dropping an attribute or index. None of these operations involve creation of derived tables.

3.1.2 Complex Schema Changes

Schema changes involving many records are considered complex in Ronström's method. This includes adding attributes, with values derived from other attributes, to a table. It also includes adding a secondary index. Additionally, both horizontal and vertical merge and split of tables can be performed (Ronström, 1998). In terms of relational operators, horizontal merge corresponds to the union operator, while vertical merge corresponds to the left outer join operator. The split methods are inverses of the merge methods.

All complex schema changes go through three phases. The schema is first changed by adding the necessary tables, attributes, triggers, indices and constraints (Ronström, 2000). Second, the involved tables are operated on by reading and performing necessary operations one record at a time. The required operations depend on the transformation being performed.
Involved tables are left unlocked for the entire transformation, whereas the records are locked temporarily. To ensure that the transformation does not lock records for long periods of time, only one record is operated on per transaction. All these transactions, each operating on one record, are organized by using SAGAs. While transactions read records in the source tables and perform the operations necessary for the transformation, triggers ensure that insert, delete and update operations in the old schema are performed in the new schema as well (Ronström, 1998).

The third phase is started once the SAGA-organized transactions have completed. If the schema change is soft, new user transactions start using the new schema immediately while active transactions are allowed to finish execution on the old schema. Since both schemas are in use, triggers have to forward operations both from the old to the new schema and vice versa. If the schema change is hard, transactions are not allowed to use the new schema until all transactions on the old schema have completed. When all transactions that were using the old schema have terminated, obsolete triggers, tables, attributes, indices and constraints are removed (Ronström, 2000).

In what follows, all complex schema changes that can be performed by Ronström's method are described in detail. Ordered by increasing complexity, these are horizontal merge and split, and vertical merge and split transformations.

Horizontal Merge Schema Change

In Ronström's schema transformation framework, horizontal merge corresponds to the UNION relational operator without duplicate removal. The transformation is performed by creating a new table in which records from both source tables are inserted. Hence, this is a derived table. As illustrated in Figure 3.1, records from both source tables may have identical primary keys. This is a problem because two records with the same primary key are not allowed to coexist in the new table.
This may be solved by using a non-derived primary key, e.g. an additional attribute with an autoincremented number, in the new table. Alternatively, the primary key of the new table may be a combination of the primary key from the old schema and an attribute identifying which table the record belonged to (Ronström, 2000).

Figure 3.1: Horizontal Merge. Two tables, "Vinyl" and "CD", are merged into one new table, "Record". The primary key of both tables is <artist, album>. The new table includes an attribute "FromTable" so that identical primary key values from the two source tables may coexist.

The method starts by creating the new table. Foreign keys are then added both to the old tables and to the new table. Since duplicates are not removed, there is a one-to-one relationship between records in the old and the new schema. Thus, triggers in both schemas only have to operate on one record in the other schema. Update and delete operations in one of the source tables trigger the same operation on the record referred to in the new table. For a soft transformation, updates and deletes in the new table have similar triggers. Insert operations in a source table simply trigger an equal insert into the new table. Inserts into the new table should trigger inserts into one of the old tables as well. If the new table contains an attribute that identifies which old table the record belongs to, this is straightforward. If this is not the case, e.g.
because a non-derived key is used in the new table, the transformation cannot be performed softly.

In the second step, records from the old tables are read and inserted into the new table one record at a time. When all records have been copied, new transactions are given access to the new schema. The old tables are deleted once all old transactions have finished processing.

Figure 3.2: Horizontal Split. One table, "Employee", is split into two new tables, "HighSalary" and "LowSalary", based on salary.

Horizontal Split Schema Change

Horizontal split is the inverse of horizontal merge; it splits one table into two or more tables by copying records to the new tables depending on a condition. An example transformation is that of splitting "Employee" into "High Salary Employee" and "Low Salary Employee" based on conditions like "salary >= $40.000" and "salary < $40.000". This transformation is illustrated in Figure 3.2. Only horizontal split transformations where every source table record matches the condition of one and only one new table are described by Ronström (Ronström, 2000). Because of this, records in the old table refer to exactly one record in the new schema, thus simplifying the transformation.

The new tables are first added to the schema; this method thus creates derived tables. Foreign keys, initially set to null, are then added to both the old and the new tables. Once a record has been copied to the new schema, the foreign keys in both schemas are updated to point to the record in the other schema. The transformation can easily be made soft by adding triggers to both the old and the new tables.
Because of the one-to-one relationship between records in the old and new schema, these triggers are straightforward: deletes and updates perform the same operation on the record referred to by the foreign key. Insert operations into the old table trigger an insert into the new table with a matching condition. Ronström does not discuss how to handle an insert into a new table if the old table already contains a record with the same primary key. In all other cases, inserts into the new table simply result in an insert into the old table as well. When the triggers have been added, the transformation is executed as described in the general method by copying the data in the old table one record at a time.

Vertical Merge Schema Changes

The vertical merge schema change uses the left outer join relational operator. Since records in the right table without a join match in the left table are not included, this transformation is not lossless. The method requires that the tables have the same primary key, or that one table has a foreign key to the other table (Ronström, 2000). This is illustrated in Figures 3.3(a) and 3.3(b), respectively. These requirements imply that the method can perform neither a join of many-to-many relations nor a full outer join. Since the join is performed by adding attributes to one of the existing tables, a derived table (DT) is not created. Hence, the method cannot be used for other purposes than schema transformations.

The transformation starts by adding the attributes belonging to the joined record to the left table of the join. This table is called the left table in the old schema and the merged table in the new schema. The left table is represented by "Person" in both Figures 3.3(a) and 3.3(b). A foreign key to the right table of the join, called the originating table, is also added if it does not already exist.
The originating table is represented by "Salary" and "PAddress" in Figures 3.3(a) and 3.3(b), respectively. In addition, an attribute indicating whether the record has already been transformed is added. During the transformation, transactions that operate on the old schema do not see the attributes added to the left table.

Triggers are then added to the originating table. Update and delete operations trigger updates of all records referring to it in the merged table. The trigger on deletes also removes the foreign key of the referring records. Insert operations trigger updates of all records in the merged table matching the join attributes. All these triggers also set the has-been-transformed attribute to true (Ronström, 2000).

Since old transactions are free to operate on the left table, a number of triggers must be added there as well. The trigger on insert operations reads the matching record in the originating table so that the values of the added attributes can be updated accordingly. Update operations that change the foreign key trigger a read of the new join match and an update of the added attributes to keep them consistent. In addition, all modifying operations² have to update the foreign key reference in the originating table.

² Inserts, updates and deletes are modifying operations.

Figure 3.3: Examples of Vertical Merge Schema Change. (a) Vertical merge with the same primary key. "Firstname, Surname" is the primary key in all tables. The merged table is created by adding the salary attribute to "Person". (b) Vertical merge over a functional dependency. Person.ZipCode is a foreign key to PAddress.ZipCode. The merged table is created by adding city to "Person".

Once all necessary attributes and triggers are in place, the records in the left table are processed one at a time. A transaction reads a record and uses the foreign key to find the record referred to in the originating table. The attribute values of that record are then written to the added attributes in the merged table, and the has-been-transformed attribute is set (Ronström, 2000). When all records in the merged table have been processed, it contains a left outer join of the two tables.

A hard schema change is simply achieved by letting old transactions complete on the old schema. Triggers, attributes and foreign keys no longer in use are then dropped before new transactions are allowed to use the new schema (Ronström, 2000).

Performing a soft vertical merge transformation implies adding triggers to the merged table as well. These triggers make sure that write operations executed by transactions operating on the new schema are also visible to transactions using the old schema. Thus, updates on the added attributes of records in the merged table must trigger updates to the referred record in the originating table. Insert operations into the merged table trigger a scan of the originating table to see if it already contains a matching record. If so, only the foreign key of the inserted record is updated. Otherwise, a record is also inserted into the originating table.

A problem arises if a record is inserted into the merged table and the trigger scanning the originating table finds that an inconsistent record already exists. The same case is encountered if the added attributes of a merged table record are updated while multiple records refer to the same originating record.
Since two records with the same primary key cannot exist in the originating table at the same time, a simple insert of a new record is not possible. Furthermore, it would not be correct to just update the originating record, since the other records referring to it would disagree on its attribute values. This problem is not addressed in the method, but there are at least two possible solutions: the first possibility is to abort the transaction trying to insert or update the record in the merged table. This would be a serious restriction on operations in the merged table. The second is to update the record in the originating table, which in turn triggers updates on all records in the merged table that refer to it. This is illustrated in Example 3.1.1:

Example 3.1.1 (Triggered Updates During Soft Vertical Merge) Consider the vertical merge illustrated in Figure 3.4. During a soft schema change, an attribute added to the merged table, "City", is updated. This triggers an update to the originating table, "Postal Address", again triggering updates on the records in the merged table that refer to it.

Figure 3.4: Example 3.1.1 - An update ("update merged set city = London where...") to a record in the merged table triggers updates in both the originating and the merged table.

This second scenario would probably be preferred in most cases. Note, however, that the behavior of transactions in the old and the new schema would not be equal.

Vertical Split Schema Change

The vertical split transformation uses the projection relational operator. The transformation is the inverse of vertical merge.
Like vertical merge, vertical split uses triggers and transactions that operate on one record at a time to perform the transformation. The transformation starts by creating a table containing a subset of the attributes from the original table. If, e.g., a table "Person" is vertically split into "Person" and "PAddress", "PAddress" is the new table. This is illustrated in Figure 3.5. The table that is split is called the original table in the old schema and the split table in the transformed schema (Ronström, 2000). Note that the "new" table is the only derived table created in this transformation.

Figure 3.5: Vertical Split over a functional dependency. A table "Person" is split into "Person" and "PAddress". Only the "new" table, PAddress, is a derived table.

When the new table has been created, foreign keys are added to both the original and the new tables. Records in the original table use these to refer to the records in the new table and vice versa. As illustrated in Figure 3.6, all foreign keys are initially NULL (Ronström, 2000).

A number of triggers are needed on the original table to ensure that operations are executed on the new table as well (Ronström, 2000). Inserts into the original table trigger a scan of the new table. If a record with matching attribute values is found in the new table, the foreign key of the inserted record is set to point to it. In addition, the foreign key of the record in the new table is updated to also point to the newly inserted record. If no matching record is found in the new table, a record is inserted before updating the foreign keys.
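The insert trigger just described can be sketched as follows, using in-memory lists for the "Person"/"PAddress" example. All structures and names are illustrative, not Ronström’s implementation; foreign keys are modeled as list indices:

```python
# Sketch of the insert trigger on the original table during vertical split:
# scan the new table for a matching record; if found, cross-link the foreign
# keys, otherwise insert a new record first. Names are hypothetical.

person = []    # original table: name, zip, city, plus a foreign key into paddress
paddress = []  # new table: zip, city, plus back-references to person records

def insert_person(name, zip_code, city):
    person.append({"name": name, "zip": zip_code, "city": city, "fk": None})
    pid = len(person) - 1
    # Trigger: scan the new table for a record with matching attribute values.
    for aid, rec in enumerate(paddress):
        if rec["zip"] == zip_code and rec["city"] == city:
            person[pid]["fk"] = aid   # point the inserted record at the match
            rec["fks"].append(pid)    # and update the match's back-reference
            return
    # No matching record: insert one before updating the foreign keys.
    paddress.append({"zip": zip_code, "city": city, "fks": [pid]})
    person[pid]["fk"] = len(paddress) - 1

insert_person("Hanna Valiante", "7020", "Tr.heim")
insert_person("Markus Oaks", "7020", "Tr.heim")
insert_person("Erik Olsen", "5121", "Bergen")
```

After the three inserts, the two "7020, Tr.heim" records share a single PAddress record, mirroring the many-to-one references the delete and update triggers must check for.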
A delete operation triggers a delete of the record referred to in the new table if this is the only original record contributing to it. Hence, the existence of other referring records has to be checked before the record in the new table is deleted. A trigger is also needed on updates that affect records in the new table. If the updated record in the original table is the only record referring to it, the new record is simply updated. Otherwise, a record with the updated values is inserted into the new table before updating the foreign keys.

Assuming that the schema transformation is soft, triggers must also be added to the new table (Ronström, 1998; Ronström, 2000). Delete operations trigger the deletion of the split attribute values in all referring records. Update operations trigger updates to all referring records in the split table. Finally, insert operations update all records in the original table with matching join attribute values. Thus, the insert trigger has to scan all original records to find these join matches.

Figure 3.6: Illustration of the vertical split transformation. (a) The first step of the vertical split schema transformation of a table "Person" into "Person" and "PAddress". The new table has been created, and foreign keys, initially NULL, have been added to both the original and the new table. The City attribute is gray because it is only part of the original schema, not the transformed one. (b) The transformation process has started to copy records from the original table to the new table. Only the three topmost records have been read so far.

Inconsistencies. Assuming that inconsistencies never occur, the described vertical split method works well. Unfortunately, this assumption does not hold. Consider Figure 3.7, which illustrates a typical inconsistency: two records with zip code "7020" have the city value "Tr.heim", while a third record has a mistyped value "Tr.hemi". Inconsistencies like this are called anomalies (Garcia-Molina et al., 2002). Anomalies are likely to appear in vertical split transformations since the old schema does not guarantee consistency. E.g., nothing prevents the city attribute values "Tr.hemi" and "Tr.heim" from appearing at the same time, as illustrated in Figure 3.7. In many cases, vertical split changes over a functional dependency may be performed precisely to avoid such anomalies by decomposing a table into Boyce-Codd Normal Form (BCNF)³.

³ A schema in BCNF is guaranteed to not have anomalies (Garcia-Molina et al., 2002).

Figure 3.7: Vertical Split with an inconsistency. The figure illustrates the same scenario as Figure 3.6, but with an inconsistency between the records with zip code 7020.

This problem is not addressed in the method description, but there are at least a few possible solutions: one possibility is to update the record in the new table with the values from the latest read. This would in turn result in an update of all records in the original table that are referred to by this record.
In Figure 3.7, both Hanna Valiante and Markus Oaks would in this case get an incorrect but consistent city name. Another solution is to count the number of records that agree on each value, in this case two vs. one, and let the majority decide. With this strategy, the record in the original table may have to be updated. In the figure, the city value of Sofie Clark would be updated to Tr.heim. This strategy is likely to produce correct results in most scenarios, but there are no guarantees. It may, in some cases, result in incorrect attribute values for records that were correct in the first place.

A third possibility is to add a non-derived primary key, e.g. an autoincremented number, to the new table. The new schema could, e.g., look like this:

    Person (f.name, s.name, address, postalID)
    PostalAddress (postalID, zip, city)

This would, however, not remove anomalies since the implicit functional dependency between zip code and city is not resolved. Thus, in cases where the vertical split is used to decompose a table into 3NF or BCNF, this solution would not meet the intention of the operation.

The described problem is similar to that described for soft vertical merge transformations in the previous section, where triggers on inserts into the merged table find a record in the originating table that has the same key but differs on other attributes. Such inconsistencies are, however, much more likely to occur in vertical split transformations, since they may be introduced not only during the time interval when transactions operate on both schemas, but also at any point in time before the transformation started. Furthermore,
the problem cannot be solved by aborting the transaction that introduced the anomaly since it may already have committed.

Table 3.2: Legend for Tables 3.3 and 3.4.

    r_x / w_x       Read/write a record in table x
    r_xfk / w_xfk   Read/write the foreign key of a record in table x
    |x|             Cardinality of table x
    RF              Reference Factor - average number of references a record in one table has to records in another table
    vm              Vertical merge
    vs              Vertical split
    left            Left table (see vertical merge)
    ogt             Originating table (see vertical merge)
    m               Merged table (see vertical merge)
    ogl             Original table (see vertical split)
    split           Split table (see vertical split)
    new             New table (see vertical split)

3.1.3 Cost Analysis of Ronström’s Method

Since the triggers that keep the old and the new schemas consistent are executed within the transaction that triggered them, the cost added to normal operations is highly relevant: it increases the response time of each operation in addition to the overall workload of the database system. Since neither cost estimates nor test results have been published for the method, an analysis is provided in this section.

The reference factor (RF) is defined as the average number of references a record in one table has to records in another table. Thus,

    RF_vm = |left| / |originating|    (3.1)
    RF_vs = |split| / |new|           (3.2)

for vertical merge and split, respectively. RF_vm and RF_vs will be referred to as RF unless otherwise noted.

Tables 3.3 and 3.4 summarize the added trigger cost for normal operations on the involved tables during vertical merge and split schema transformations, respectively. The added trigger costs for the horizontal methods are always the same as the cost of the original operation, and are therefore not shown in a separate table.

Table 3.3: Added Cost Incurred by Vertical Merge Schema Transformation Methods. (1) Note that RF is often 0 for inserts into the originating table.

    Old schema, Left table:
      Insert:    r_ogt + w_m + w_ogtfk, assuming an index on the join attribute
      Update:    r_ogt + w_m + 2 × w_ogtfk if the join attribute is updated; 0 otherwise
      Delete:    0
    Old schema, Originating table:
      Insert(1): RF × w_m if there is an index on the join attribute in the left table; |left| × r_m + RF × w_m otherwise
      Update:    RF × w_m
      Delete:    RF × (r_mfk + w_m)
    New schema, Merged table:
      Insert:    r_ogt + w_ogt + (RF − 1) × w_m if a non-equal record exists; r_ogt + w_ogtfk if an equal record exists; w_ogt if no record exists
      Update:    r_ogtfk + w_ogtfk + w_ogt if non-derived primary key and more than one reference; r_ogtfk + w_ogt if non-derived primary key and one reference; r_ogtfk + w_ogt + (RF − 1) × w_m if the join attribute is the primary key; 0 if no attributes in the originating table are updated
      Delete:    0

A few examples are provided to clarify the incurred costs:

Example 3.1.2 (Cost of Consistent Insert During Vertical Split) A database contains a number of tables. One of these is "PersonInfo", containing information on all Norwegian citizens, including zip code and city. The table is then vertically split into "Person" and "PostalAddress". There are 4.6 million people and 4600 zip codes registered in the database. Thus,

    RF = 4.6 million / 4600 = 1000

During the transformation, a user inserts a person into the original table (i.e. in the old schema). This new person has a zip code and city that is consistent with all other persons in the "PersonInfo" table. The cost is:

    C_ex1 = C_normal + C_added = w_ogl + r_new + w_newfk

For readability, assume that a read and a write operation have the same cost in terms of I/O and CPU, and that a write of a foreign key has the same cost as other write operations. With these assumptions, the simple insert has to perform three times more operations than it would without the transformation. Furthermore, it has to read lock a record in the new table.

Table 3.4: Added Cost Incurred by Vertical Split Schema Transformation Methods. (1) Note that RF is often 0 for inserts into the new table.

    Old schema, Original table:
      Insert:    r_new + w_new + (RF − 1) × w_ogl if inconsistent; r_new + w_newfk if consistent; w_new if non-existing
      Update:    r_newfk + w_newfk + w_new if non-derived primary key and more than one reference; r_newfk + w_new if non-derived primary key and one reference; r_newfk + w_new + (RF − 1) × w_ogl if the join attribute is the primary key; 0 if no attributes in the new table are updated
      Delete:    r_newfk + w_new if only one referring record; r_newfk + w_newfk otherwise
    New schema, Split table:
      Insert:    r_new + w_oglfk if no join match in the new table; r_new + w_ogl + w_newfk if a join match exists in the new table
      Update:    r_new + w_ogl + 2 × w_newfk if the join attribute is updated; 0 otherwise
      Delete:    r_newfk + w_newfk
    New schema, New table:
      Insert(1): RF × w_ogl if there is an index on the join attribute in the original table; |original| × r_ogl + RF × w_ogl otherwise
      Update:    RF × w_ogl
      Delete:    RF × (r_oglfk + w_ogl)

Example 3.1.3 (Cost of Update During Vertical Split) During the transformation described in Example 3.1.2, another user updates the city of a person in the original table. Since this transformation performs a split over a functional dependency, the primary key of the PostalAddress table is the zip code. Thus, when the update triggers an update of the record in the new table, that update in turn triggers an update of all original records with the same zip code:

    C_ex2 = C_normal + C_added = w_ogl + r_newfk + w_new + (RF − 1) × w_ogl = r_newfk + w_new + 1000 × w_ogl

Hence, the update results in 1001 more operations and 1000 more locks than would be the case without the transformation.

Example 3.1.4 (Cost of Update in New Schema) The transformation described in Example 3.1.2 is made soft, and a third user therefore gets access to the new schema.
The user updates a postal address:

    C_ex3 = C_normal + C_added = w_new + RF × w_ogl = w_new + 1000 × w_ogl

The update results in 1000 more locks and operations than it would without the transformation.

As can be seen from Tables 3.3 and 3.4 and the examples, the cost of operations during a schema transformation varies enormously. In almost all cases, however, the cost is at least two to three times higher in terms of operations and locks than it would be without the transformation.

3.2 Fuzzy Table Copying

Fuzzy copying is a technique used to make a copy of a table in parallel with other operations, including updates. There are two variants: the first method is block oriented, and works like a fuzzy checkpoint (Hagmann, 1986) that is made sharp, i.e. consistent (Gray and Reuter, 1993). The second method is record oriented, and is better suited when the copy is installed in an environment that is not completely identical to the source environment. An example is copying a table from one node to another in a distributed system. The block size may differ between these nodes, and the physical addresses of the records in the copy would therefore differ from those in the source.

Both fuzzy copy methods work in two steps: in the first step, the source table is read without using any table or record locks, which results in an inconsistent copy. The method gets its "fuzzy" name because this initial copy is not consistent with any state the source table has had at any point in time. In the second step, the copy is made consistent by applying log records that have been generated by concurrent operations during the first step.

In the block oriented fuzzy copy (BoFC) method (Gray and Reuter, 1993; Bernstein et al., 1987), the source table is read one block at a time. Locks are ignored, but block latches⁴ (Gray, 1978) are used to ensure that the blocks are action consistent.
The log records that have been generated while reading the blocks are then redone to the fuzzy copy. Since the method copies blocks, the addresses of records in the copy are the same as in the source. Hence, both logical and physiological logging (see Section 2.3) can be used. Also, all four record identification schemes described in Section 2.4 will work.

The record oriented fuzzy copy (RoFC) method (Bratsberg et al., 1997a) copies records instead of blocks in the first step. As for BoFC, only latches are used during this process. The copied records may then be inserted anywhere; there are no restrictions on the physical address of the records at the new location. Once the inconsistent copy has been installed, the log records are applied to it. Since the physical addresses of records may differ at the new location, only logical logging can be used. Furthermore, state identifiers must be assigned to records instead of blocks, and the records must be identified by a non-physical record ID, as described in Section 2.4. Due to its location independent nature, this strategy is suitable for declustering in distributed database systems.

⁴ Latches, also called semaphores, are locks held for a very short time (Elmasri and Navathe, 2004). They are typically used to ensure that only one operation is applied to a disk block at a time (Gray and Reuter, 1993).

3.3 Materialized View Maintenance

Materialized views (MVs) are used in database systems, e.g., to speed up queries by precomputing and storing query results, and in data warehouses. The work on MVs started with Snapshots (Adiba and Lindsay, 1980). These were able to answer queries on historical data and speed up query processing. As the benefits of Snapshots were appreciated, the concept was extended to answer queries on current data to lower query cost.
This extension to Snapshots is called Materialized Views (MVs). During the last two decades, MVs have evolved to become a very beneficial addition to DBMSs. Benefits include less work when processing queries and less network communication in distributed queries. The purpose of this section is to address the problems with MVs and to show the solutions proposed in the literature. As is evident from the following, methods to keep MVs up to date have been researched extensively. The initial creation of MVs has, however, been neglected.

3.3.1 Snapshots

Database Snapshots mark the beginning of what is now known as Materialized Views (MVs). They are defined by a query and are populated by storing the query result in the Snapshot table. Once created, transactions may query them for historical data (Adiba and Lindsay, 1980).

Snapshots can later be refreshed to reflect a newer state. This can be done by deleting the content of the Snapshot and then reevaluating the query (Adiba and Lindsay, 1980). An alternative is to take advantage of recovery techniques such as differential files and the recovery log to compute delta values to the old Snapshot only (Kahler and Risnes, 1987). This generally requires less work than the first method.

An algorithm using the second strategy is presented by Lindsay et al. (Lindsay et al., 1986). The algorithm associates a timestamp value with every record in the source relation of the Snapshot. The timestamp is a monotonically increasing value with which it is easy to decide whether an update occurred before or after another update. When a Snapshot is updated, the update transaction uses the timestamps to find updates that took place after the previous Snapshot refresh. Only records with a higher timestamp value need to be updated in the Snapshot.
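The timestamp-based refresh can be sketched as follows. This is a minimal single-process illustration in the spirit of Lindsay et al.'s algorithm (it ignores deletions and concurrency); the names and structures are hypothetical:

```python
# Sketch of timestamp-based Snapshot refresh: every source record carries a
# monotonically increasing timestamp, and a refresh copies only the records
# stamped after the previous refresh. Names are illustrative.

clock = 0
source = {}  # zip -> (city, timestamp)

def write(zip_code, city):
    global clock
    clock += 1  # monotonically increasing timestamp
    source[zip_code] = (city, clock)

def refresh(snapshot):
    # Only records updated after the previous refresh need to be copied.
    for zip_code, (city, ts) in source.items():
        if ts > snapshot["as_of"]:
            snapshot["data"][zip_code] = city
    snapshot["as_of"] = clock

snap = {"data": {}, "as_of": 0}
write("0340", "Oslo")
write("7020", "Trondheim")
refresh(snap)
write("7020", "Trondhjem")  # changed after the first refresh
refresh(snap)               # copies only the 7020 record
```

The second refresh touches a single record rather than reevaluating the defining query, which is the delta-based saving the section describes.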
    PostalAddress        MV
    Zip     City         City
    7020    Trondheim    Trondheim
    7030    Trondheim    Oslo
    0340    Oslo

Figure 3.8: Illustration of Example 3.3.1: A Materialized View stores all city names in the PostalAddress table.

3.3.2 Materialized Views

In contrast to Snapshots, MVs are typically not allowed to be out of date when they are queried. They are divided into two main groups: immediate and deferred update. These two groups differ only in when they are refreshed; the former forwards updates within the user transaction that updated the source records, while the latter leaves the MV update work to a separate view update transaction.

An important part of MV maintenance is to keep the MVs consistent with the source tables. To do this, updates to the source tables have to be forwarded correctly to the view. The following example illustrates one of the problems of maintaining consistency:

Example 3.3.1 (An MV Consistency Problem) An MV is defined as the set (i.e. no duplicates) of all cities in the PostalAddress table. The current state of the source table and the MV is illustrated in Figure 3.8. Suppose a transaction deletes the record <0340, Oslo> from the source table. A correct execution in the MV is to delete the record <Oslo>. On the other hand, if the transaction deletes the record <7020, Trondheim>, a correct execution is to not delete <Trondheim> from the MV, since it is still derived by <7030, Trondheim>.

Multiple solutions have been proposed to address the consistency problem. Blakeley et al. (Blakeley et al., 1986) showed that counters can be used to keep track of the multiplicity of records in MVs. The counter is increased by insertion of duplicates and decreased by deletion of duplicates. If the counter reaches zero, the record is removed from the MV. The method could originally only handle select-project-join (SPJ) views, but has later been improved to handle aggregates, negations and unions (Gupta et al., 1992; Gupta et al., 1993).
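The counting method can be sketched for the city view of Example 3.3.1. This is an illustrative reduction to a single projected attribute, not Blakeley et al.'s full SPJ algorithm; the function names are hypothetical trigger stand-ins:

```python
# Sketch of counter-based view maintenance: the MV keeps a multiplicity
# counter per record; duplicates increase it, deletions decrease it, and a
# record is removed only when its counter reaches zero.

from collections import Counter

mv = Counter()  # city -> number of source records deriving it

def source_insert(zip_code, city):
    mv[city] += 1  # an inserted duplicate just increases the counter

def source_delete(zip_code, city):
    mv[city] -= 1
    if mv[city] == 0:
        del mv[city]  # no derivations left: remove the record from the MV

source_insert("7020", "Trondheim")
source_insert("7030", "Trondheim")
source_insert("0340", "Oslo")
source_delete("7020", "Trondheim")  # Trondheim still derived by 7030
source_delete("0340", "Oslo")       # Oslo has no derivations left
```

Note that the deletes are handled without consulting the source table at all: the counter alone resolves the ambiguity described in Example 3.3.1.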
Gupta et al. (Gupta et al., 1993) present another algorithm called “Delete and Rederive” (DRed). An overestimate of records that may be deleted from the MV is first computed. The records with alternative derivations are then removed from the delete set. Finally, new records that need to be inserted are computed. The authors recommend the DRed method for recursive MVs, and the counting method for non-recursive MVs.

In contrast to the methods described so far, Qian and Wiederhold (Qian and Wiederhold, 1991) and Griffin et al. (Griffin and Libkin, 1995; Griffin et al., 1997) use algebra as the basis for computing updates. They argue that it is easier to prove correctness of algebra than of algorithms, and to derive rules for other languages (Griffin et al., 1997). Qian and Wiederhold present algebraic propagation rules that take update operations on SPJ views as input and produce an insert set ∆R and a delete set ∇R (Qian and Wiederhold, 1991). The method has also been extended to handle bag semantics (Griffin and Libkin, 1995).

All methods described so far are immediate. Immediate maintenance has a serious drawback, however: extra work is incurred on every user transaction whose operations have to be propagated. In deferred MVs, the view is maintained by a view update transaction, not the user transaction. This does not incur extra update work on each user transaction, and deferred methods should therefore be used whenever possible (Colby et al., 1996; Kawaguchi et al., 1997). The update transaction is typically invoked periodically or by a query to the view.

The algorithms described for immediate update cannot be used in a deferred strategy without modification. The reason for this is that the immediate methods use data from the source tables to update the MVs.
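The three DRed phases can be sketched on a simple duplicate-eliminating view. This is our own toy example with our own names, not the algorithm as given by Gupta et al., which targets recursive views.

```python
# Sketch of the Delete-and-Rederive phases (after Gupta et al., 1993) for a
# simple duplicate-eliminating view: MV = set of cities in PostalAddress.
# A toy illustration only; names are ours.

def dred(view, source_after, deleted_rows):
    # Phase 1: overestimate deletions - everything derivable from deleted rows.
    overestimate = {city for (_zip, city) in deleted_rows}
    # Phase 2: rederive - keep records that still have an alternative derivation
    # in the remaining source rows.
    rederived = {city for (_zip, city) in source_after if city in overestimate}
    view -= (overestimate - rederived)
    # Phase 3: insert newly derivable records (none arise from pure deletions).
    view |= {city for (_zip, city) in source_after} - view
    return view

source = [("7020", "Trondheim"), ("7030", "Trondheim"), ("0340", "Oslo")]
deleted = [("7020", "Trondheim"), ("0340", "Oslo")]
remaining = [r for r in source if r not in deleted]

view = {"Trondheim", "Oslo"}
print(sorted(dred(view, remaining, deleted)))  # ['Trondheim']
```

Trondheim survives phase 2 because the remaining row <7030, Trondheim> rederives it, while Oslo has no alternative derivation and is deleted.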
When deferred methods are used, the state of the source tables may have changed before the maintenance starts. This is called the “state bug” (Colby et al., 1996).

Colby et al. (Colby et al., 1996) extend the algebra of Qian and Wiederhold (Qian and Wiederhold, 1991) to overcome this problem: both a log L and two view differential tables (∆MV and ∇MV for inserts and deletes, respectively) are used. The MV is in a state consistent with a previous state sp of the source tables. ∆MV and ∇MV contain the operations that need to be applied to the MV to make it consistent with a state closer to the present. The log L is used to maintain ∆MV and ∇MV. That is, the MV has a state sp which is before or equal in time to an intermediate state si that can be reached by applying the updates in ∆MV and ∇MV. The current state of the source tables sc can in turn be reached from si by applying the updates in L. By using the log, the authors compute the state si that the differential tables are in. This is the pre-state needed to use the immediate propagation methods without encountering the state bug. The algorithm uses two propagation transactions: one for updating the differential tables using the log, and one for updating the MVs using the differential tables. This imposes very little overhead on ordinary transactions, as the only extra work they have to do is to write a log record without any further computation.

Self-Maintainability

Operations that can be forwarded to derived tables without requiring more data than the table itself and the operation are called Autonomously Computable Updates (ACUs) (Blakeley et al., 1989). Self-maintainable (Self-M) Materialized Views are MVs where all operations are ACUs (Gupta et al., 1996). When operations are applied to an MV that is not Self-M, the source tables must be queried for the missing information.
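The two-transaction deferred scheme can be sketched as follows. This is a minimal, insert/delete-only illustration under set semantics, with names of our own choosing, not Colby et al.'s algorithm.

```python
# Sketch of deferred MV maintenance with differential tables
# (after Colby et al., 1996). User transactions only append to the log L;
# two propagation steps move updates: log -> differentials -> MV.
# Insert/delete only, set semantics; all names are ours.

log = []                            # L: operations not yet in the differentials
delta_mv, nabla_mv = set(), set()   # pending inserts / deletes for the MV
mv = set()

def user_op(op, record):
    log.append((op, record))        # the only extra work for a user transaction

def propagate_log():
    """First propagation transaction: log -> differential tables."""
    while log:
        op, record = log.pop(0)
        if op == "ins":
            nabla_mv.discard(record); delta_mv.add(record)
        else:
            delta_mv.discard(record); nabla_mv.add(record)

def refresh_mv():
    """Second propagation transaction: differential tables -> MV."""
    mv.difference_update(nabla_mv)
    mv.update(delta_mv)
    delta_mv.clear(); nabla_mv.clear()

user_op("ins", "Oslo"); user_op("ins", "Trondheim"); user_op("del", "Oslo")
propagate_log()
refresh_mv()                        # e.g. triggered by a query against the MV
print(sorted(mv))  # ['Trondheim']
```

Here the MV state corresponds to sp, the differential tables carry it to si, and the unprocessed tail of the log carries si to the current state sc.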
Self-M is therefore a highly desirable property in systems with fast response time requirements (Gupta et al., 1996) and when the MV is stored on a different node than the source tables. For Self-M MVs, only the log has to be shipped to the MV node. Only a very limited set of views are Self-M, and Quass et al. (Quass et al., 1996) therefore extend the concept to also include views where auxiliary information makes the view Self-M. The auxiliary information, typically a table, is stored together with the MV and is updated accordingly.

Our derived table creation method benefits from self-maintainability in the same way as MV maintenance does. Hence, in DT creation for all relational operators where the DTs themselves are not Self-M, an auxiliary table with the missing information will be added.

3.4 Schema Transformations and DT Creation in Existing DBMSs

Existing database systems, including IBM DB2 v9 (IBM Information Center, 2006), Microsoft SQL Server 2005 (Microsoft TechNet, 2006), MySQL 5.1 (MySQL AB, 2006) and Oracle 10g (Lorentz and Gregoire, 2003b), offer only simple schema transformation functionality (Løland, 2003). This includes adding one or more attributes to a table, removing attributes, renaming attributes and the like. Removal of an attribute can be performed by changing the table description only, thus leaving the physical records unchanged for an unspecified period of time.

Complex schema transformations and MV creations are performed by blocking operations. The source tables are locked with a shared lock while the content is read and the result of the query is inserted into the DTs (Løland, 2003). Throughout this thesis, we will call this the insert into select method due to the common SQL syntax of these operations.
For example, the SQL syntax for DT creation in MySQL is (MySQL AB, 2006):

    insert into <table-name> select <select-statement>

3.5 Summary

In this chapter, research related to non-blocking creation of derived tables was presented.

A method that can be used for vertical and horizontal merge and split schema transformations has been suggested by Ronström (Ronström, 2000). The solution involves creation of derived tables in the horizontal merge and split cases. In addition, one of the resulting tables in vertical split is a DT. It is therefore likely that these relational operators can be used for other DT creation purposes than schema transformations as well. One example is creation of materialized views, although this possibility is not discussed by Ronström. Although test results for the method have not been published, the cost analysis in Section 3.1 indicates that it is likely to degrade throughput and response time significantly for the duration of the transformation. The reason for this is that write operations in one schema (old or new) trigger a varying number of write operations in the other schema.

The DT creation method we suggest in Part II of this thesis extends the ideas from the record oriented fuzzy copy (RoFC) technique described in Section 3.2. Similar to RoFC, we make an inconsistent copy of the involved tables and use the log to make the copied data consistent. An important difference is, however, that we apply relational operators to the inconsistent copies. Because of this, the log records can not be applied to the copied data in any straightforward way.

The DT creation method developed in this thesis is also related to materialized view maintenance. In particular, we will use auxiliary tables to achieve Self-Maintainable derived tables. The data in these auxiliary tables will be needed whenever the DTs themselves do not contain all data required to apply the log.

The “insert into select” method for DT creation will not be used in our solution.
We will, however, compare our DT creation method to the “insert into select” method, and discuss under which circumstances our method is better than the existing solution and vice versa.

Part II: Derived Table Creation

Chapter 4: The Derived Table Creation Framework

In Chapter 1, we presented the overall research question of this thesis: How can we create derived tables and use these for schema transformation and materialized view creation purposes while incurring minimal performance degradation to transactions operating concurrently on the involved source tables?

With the research question in mind, this part of the thesis describes our suggested method for creating derived tables (DTs) without blocking other transactions. Once a DT has been created, it can be used as a materialized view (MV) or to transform the schema. The method aims at degrading the performance of concurrent transactions as little as possible. To which extent the method meets the performance aspects of the research question is discussed in Part III: Implementation and Testing.

In this chapter, we suggest a framework that can be used in the general case to create DTs in a non-blocking way. As such, this chapter presents an abstract solution to the first part of the research question stated above. In Chapter 5, we identify common problems encountered when the framework is used for DT creation using relational operators. General solutions to these problems are also suggested. Chapter 6 contains detailed descriptions of how the framework is used to create DTs using the six relational operators.

4.1 Overview of the Framework

The non-blocking DT creation framework presented in this chapter operates in four steps. As illustrated in Figure 4.1, these are: preparation, initial population, log propagation and synchronization.
[Figure 4.1: The four steps of DT creation — 1) preparation, 2) initial population, 3) log propagation and 4) synchronization — illustrated by a vertical merge of the old-schema tables Employee and PostalAddress into the new-schema table ModifiedEmp(Firstname, Surname, Position, Address, ZipCode, City, State).]

During the preparation step, necessary tables, indices etc. are added to the database schema. These should not be visible to users until DT creation reaches the final step, synchronization.

Once the required structures are in place, the initial population step writes a fuzzy mark to the log and then starts to read the involved source tables. The fuzzy mark is later used to find the point where reading started. The relational operator used to create the DT is then applied, and the result is inserted into the newly created DTs. Note that no table or record locks are used, and the derived records are therefore not necessarily consistent with the source table records (they are consistent only if no modifying operations have been performed on the source records during the copying).

Log propagation is then started. It works in iterations, and each iteration starts by writing a place-keeper mark, called a fuzzy mark, to the log. The write operations that have been executed on source table records between the last mark and the new one are then propagated, or forwarded, to the DTs by applying recovery techniques. These techniques must, however, be modified since relational operators have been applied to create the derived records. If a considerable amount of updates has been performed on the source records during an iteration, a new iteration is started. This is repeated until only a few operations distinguish the source tables from the DTs.
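The iterating log propagation step can be sketched as follows. This is a simplified, single-threaded illustration with our own names and threshold; operator-specific redo logic is abstracted away behind a callback.

```python
# Sketch of the framework's log propagation loop. Each iteration "writes" a
# new fuzzy mark and propagates the log records written since the previous
# mark; iteration stops once few enough records remain. Names and the
# threshold are ours; apply_to_dts() stands in for operator-specific redo.

SYNC_THRESHOLD = 2   # assumed: "few enough" remaining records to synchronize

def propagate(log, start_mark, apply_to_dts):
    """Iterate until the backlog since the last fuzzy mark is small."""
    mark = start_mark
    while True:
        new_mark = len(log)              # fuzzy mark: remember the log position
        for rec in log[mark:new_mark]:   # redo everything since the last mark
            apply_to_dts(rec)
        mark = new_mark
        if len(log) - mark <= SYNC_THRESHOLD:
            return mark                  # ready for the synchronization step

applied = []
log = ["op%d" % i for i in range(10)]
mark = propagate(log, 0, applied.append)
print(len(applied), len(log) - mark)  # 10 0
```

In a real system concurrent transactions keep appending to the log, so several iterations are typically needed before the backlog shrinks below the threshold; here the log is static and one iteration suffices.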
The fourth step, synchronization, latches the source tables while the remaining logged operations are applied to the DTs. Since log propagation was repeated until there were only a few operations left to apply to the DTs, these latches are held for a very short period of time. When all log records have been forwarded, the DTs are in the same state as the source tables and are ready to be used.

The four steps of non-blocking DT creation are described in more detail in the rest of this chapter. Materialized Views can use this framework without modification. Note, however, that even though DTs can also be used to perform schema transformations, the framework must be slightly modified to do so. The reason for this is that different transactions are allowed to concurrently operate in the two schema versions. These modifications to the framework are discussed in Section 4.6.

4.2 Step 1: Preparation

DT creation starts by adding the derived tables to the database schema. This is done by create table SQL statements. In addition to the wanted subset of attributes from the source tables, the DTs typically have to include a record state identifier, i.e. a Log Sequence Number (LSN) on each record (Hvasshovd, 1999), and a Record ID (RID) from each source record contributing to the derived records. In this thesis, we assume that the RID is based on logical addressing, but physical identification techniques can also be used if a RID mapping table is maintained. The record and state identification concepts were described in Section 2.3.

Depending on the relational operator used for DT creation, attributes other than RID and LSN may also be required. An example is vertical merge, i.e. full outer join, in which the join attributes are required to identify which source records should be merged in the DT. If any of these required attributes are not wanted in the DT, they must be removed after the DT creation has completed. This can be done by a simple schema transformation, which is available in modern DBMSs.
Constraints, both new and from the source tables, may be added to the new tables. This should, however, be done with great care, since constraint violations may force the DT creation to abort, as illustrated in Example 4.2.1:

Example 4.2.1 (Bad Constraint) Consider a one-to-many full outer join of the tables Employee and PostalAddress, as illustrated in Figure 4.1. A unique constraint has been defined for the ZipCode attribute in PostalAddress. If this unique constraint is added to the derived table “ModifiedEmp”, the transformation will have to abort if more than one person has the same zip code.

Any indices that are needed on the new tables to speed up the DT creation process should also be added during this step. In particular, all attributes that are used by DT creation to identify records should be indexed. Examples include RIDs copied from the source tables, and join attributes in the case of vertical merge. These indices decrease the time used to create the DTs significantly. The source record ID, i.e. the RID of the source record a DT record is derived from, is often used to identify derived records affected by a logged operation. Without an index on this attribute, log propagation of these operations would have to scan all records in all DTs to find the correct record(s). With the index, the record(s) can be identified in one single read operation. Which indices are required differs for each relational operator, and is therefore described in more detail in Chapter 6. Note that the indices created during the preparation step will be up to date at any time, including immediately after the DT has been created.

The DT creation for some of the relational operators requires information that is not stored in the DTs. Consider the following example:

Example 4.2.2 (Auxiliary Tables) Two DTs, “GoldCardHolder” and “PlatinumCardHolder”, are created by performing a horizontal split (the inverse of union) of the table “FrequentFlyerCustomer”.
Marcus, who has a Silver Frequent Flyer Card, does not qualify for any of these DTs. While the DTs are being created, however, Marcus buys a flight ticket to Hawaii. With this purchase, he qualifies for a gold card. His old customer information is now required by the DT creation process so that Marcus can be added to the “GoldCardHolder” DT. This information can not be found in either of the DTs.

In cases like the one in Example 4.2.2, auxiliary tables must also be added to the schema. The auxiliary tables store the information required by the DT creation method, and are similar to those used to make MVs self-maintainable (Quass et al., 1996). The detailed DT creation descriptions in Chapter 6 describe the auxiliary tables where these are required.

4.3 Step 2: Initial Population

The newly created DTs have to be populated with records from the source tables. This is done by a modified fuzzy copy technique (see Section 3.2), and the first step of populating the DTs is therefore to write a fuzzy mark in the log. This log record must include the transaction identifiers of all transactions that are currently active on the source tables. This is a subset of the active transaction table (Løland and Hvasshovd, 2006c). The transaction table will be used by the next step, log propagation, to identify the oldest log record that needs to be applied to the DTs.

The source tables are then read without setting locks. This results in an inconsistent read (Hvasshovd et al., 1991). The relational operator used for DT creation is then applied, and the results, called the initial images, are inserted into the DTs.

4.4 Step 3: Log Propagation

Log propagation is the process of redoing operations originally executed on source table records on the records in the DTs.
All operations are reflected sequentially in the log, and by redoing these, the derived records will eventually reflect the same state as the source records. The log propagation step, which works in iterations, starts when the initial images have been inserted into the DTs.

Each iteration starts by writing a new fuzzy mark to the log. This log record marks the end of the current log propagation iteration and the beginning of the next one. Log records of operations that may not be reflected in the DTs are then inspected and applied if necessary. In the first iteration, the oldest log record that may contain such an operation is the oldest log record of any transaction that was active when the first fuzzy mark was written. The reason for this is that the transactions that were active on the source tables may have logged a planned operation without having performed it yet at the time the initial read was started. This is a consequence of Write-Ahead Logging, as described in Section 2.3, which requires that write operations are logged before the record is updated. In later iterations, only log records after the previous fuzzy mark need to be propagated.

When the log propagator reads a new log record, affected records in the DTs are identified and changed if the LSNs indicate that the records represent an older state than that of the log record. The effects of applying the log records depend on the relational operator used for the DT creation in question, and are therefore described in more detail in Chapter 6.

The synchronization step should not be started if a significant portion of the log remains to be propagated. The reason for this is that synchronization involves latching the source tables while the last portion of the log is propagated. These latches effectively pause all transactions on the source tables. Each log propagation iteration therefore ends with an analysis of the remaining work.
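The LSN test that makes log application idempotent can be sketched as follows; a simplified single-source-record case with our own names.

```python
# Sketch of LSN-guarded log application: a logged operation is applied to a
# derived record only if the record's state is older than the log record's
# LSN. This keeps redo idempotent: reapplying an already-reflected operation
# is a no-op. Names are ours; one source record per DT record.

def apply_log_record(dt, log_rec):
    rid, lsn, value = log_rec["rid"], log_rec["lsn"], log_rec["value"]
    record = dt.get(rid)
    if record is None or record["lsn"] < lsn:   # DT state older than the log?
        dt[rid] = {"lsn": lsn, "value": value}  # redo the operation
        return True
    return False                                # already reflected: skip

dt = {"r01": {"lsn": 10, "value": "Hanna"}}
print(apply_log_record(dt, {"rid": "r01", "lsn": 12, "value": "Hannah"}))  # True
print(apply_log_record(dt, {"rid": "r01", "lsn": 11, "value": "Hanna"}))   # False
print(dt["r01"]["value"])  # Hannah
```

The second call is skipped because the derived record already reflects a newer state (LSN 12) than the log record (LSN 11), which is exactly what makes the inconsistent initial copy safe to repair.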
The analysis can, e.g., be based on the time used to complete the current iteration, a count of the remaining log records to be propagated, or an estimated remaining propagation time. Based on the analysis, either another log propagation iteration or the synchronization step is started.

A consequence of the described log propagation strategy is that this step will never finish iterating if more log records are produced than the propagator can process in the same time interval. We suggest four possible solutions for this case, none of which is optimal. One possibility is to abort the DT creation transaction. If so, the DT creation work performed is lost, but normal transaction processing will be able to continue as before. Alternatively, the DT creation process may be given a higher priority. The effect of this is that more log is propagated at the cost of lower performance for other transactions. A third possibility is to reduce the number of concurrent transactions by creating a transaction queue. Like the previous alternative, this increases response time and decreases throughput for other transactions. As a final alternative, we may stop log propagation and go directly to the synchronization step. Synchronization will in this case have to latch the source tables for a longer period of time. Depending on the remaining number of log records to propagate, this strategy can still be much quicker than the insert into select strategy used in modern DBMSs.

4.5 Step 4: Synchronization

When synchronization is initiated, the state of the DTs should be very close to the state of the source tables. This is because the source tables have to be latched during one final log propagation iteration that makes the DTs consistent with the source tables.

We suggest two ways to synchronize the DTs to the source tables and thereby complete the DT creation process. These are blocking synchronization and non-blocking synchronization.
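One way to implement the end-of-iteration analysis is to estimate the remaining propagation time from the observed propagation rate. The threshold and all names below are assumptions of ours, not part of the framework's specification.

```python
# Sketch of the end-of-iteration analysis: estimate how long propagating the
# remaining log would take, and start synchronization only when that estimate
# falls below an acceptable latch pause. Threshold and names are ours.

MAX_SYNC_PAUSE = 0.05   # seconds we are willing to latch the source tables

def next_step(records_propagated, iteration_seconds, records_remaining):
    """Return 'synchronize' or 'iterate' based on estimated remaining time."""
    rate = records_propagated / iteration_seconds     # records per second
    estimated_remaining = records_remaining / rate
    return "synchronize" if estimated_remaining <= MAX_SYNC_PAUSE else "iterate"

# 10,000 records propagated in 2 s; 100 remain -> ~0.02 s to finish:
print(next_step(10_000, 2.0, 100))     # synchronize
# 50,000 remain -> ~10 s; run another iteration instead:
print(next_step(10_000, 2.0, 50_000))  # iterate
```

Note that if new log records arrive faster than the propagation rate, `records_remaining` never shrinks and this check never chooses "synchronize", which is precisely the runaway case the four fallback solutions above address.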
The blocking method makes the DTs transaction consistent with the source tables, while the non-blocking method only enforces action consistency. Note that the choice of strategy affects the synchronization step only; the first three steps of DT creation are unaffected.

Blocking synchronization

Blocking synchronization blocks all new transactions that try to access any of the involved tables. Transactions that already have locks on the source tables are either allowed to complete or forced to abort. Table locks are then acquired on the source tables before a final log propagation iteration is performed. This log propagation makes the DTs transaction consistent with the source tables. Blocking synchronization is the least complex strategy, but it does not satisfy the non-blocking requirement for DT creation.

Non-blocking synchronization

The non-blocking strategy latches the source tables for the duration of one final log propagation iteration. Latching effectively pauses ongoing transactions that perform update work on the source tables, but the pause should be very brief since the state of the DTs is very close to that of the source tables. Note that read operations are not paused. Once the log propagation completes, the DTs are in the same state as the source tables.

The newly created DTs are now almost ready to be used as MVs. The only remaining task is to add the preferred MV maintenance strategy, e.g. one of those described in Section 3.3. From then on, the MV maintenance strategy is responsible for keeping the DTs consistent with the source tables. The latches are then released, allowing transactions to resume their update operations on the source tables.

The above distinction between the blocking and non-blocking strategies may seem artificial. However, the difference in the time interval in which updates are blocked from the source tables may be considerable.
In the former method, new transactions are blocked from the source tables until all transactions that have already accessed them have completed. The updates performed by these transactions also have to be redone by the log propagation, which adds further to the blocking time. The time required for a transaction to complete can not be easily controlled. In the latter strategy, transactions are only blocked during log propagation. As previously discussed, we can easily control the time needed by this log propagation by not starting the synchronization step until the states of the DTs and the source tables are very close.

4.6 Considerations for Schema Transformations

Materialized View creation is a straightforward application of DTs, and can therefore use the non-blocking DT creation framework as described in the previous sections. Using the framework to perform schema transformations, on the other hand, is more complex. The reason is that in contrast to the MV creation case, transactions will be active in both the source tables and the DTs at the same time during non-blocking synchronization. Consider the following example:

Example 4.6.1 (A Non-blocking Schema Transformation) A non-blocking schema transformation is being performed, in which the tables “Employee” and “PostalAddress” are vertically merged into “ModifiedEmp”. This is illustrated in Figure 4.1. The first three steps of the DT creation process have already been executed; only non-blocking synchronization remains. The synchronization step starts by latching “Employee” and “PostalAddress” before propagating the remaining log records to “ModifiedEmp”. At this point, the records in the source tables and the DT reflect the same state, and the schema transformation is nearly complete. New transactions are now given access to the “ModifiedEmp” table instead of the source tables. However, the transactions that were paused by the latches may have more work to do.
Thus, until these old transactions have completed, new transactions will be updating records in “ModifiedEmp” while the old transactions are updating records in “Employee” and “PostalAddress”.

Example 4.6.1 illustrates that we have to make sure that transactions operating on the same data objects in two different schema versions do not conflict. For example, a transaction in the new schema should not be allowed to change the position of Eric to “Software Engineer” if a currently active transaction in the old schema has already modified the same Eric record. Recall from Section 2.2 that the main responsibility of the scheduler is to provide isolation, and that this property is guaranteed if the histories are serializable. Thus, when synchronizing schema transformations, we must ensure serializability between operations performed on records stored in different tables.

Note that if the blocking synchronization strategy is used, serialization is not a problem. In this case, transactions that are active in the old schema complete their work before transactions are given access to the new schema. Thus, this synchronization strategy can be adopted without modification. We therefore focus only on non-blocking synchronization in the rest of this chapter.

For schema transformation purposes, non-blocking synchronization is divided into two strategies: non-blocking abort and non-blocking commit. Here, “abort” and “commit” refer to whether the transactions active in the old schema are forced to abort or are allowed to continue work after the source table latches have been removed. The reason for making this distinction is that in the former case, transactions in the old schema are not allowed to acquire new locks. This scenario is significantly easier to handle than the latter case, in which new locks may be acquired in both schema versions.
Both non-blocking strategies ensure serializable histories across the two schema versions. This is done by using a modified Strict Two-Phase Locking (2PL) strategy (see Section 2.2; Bernstein et al., 1987), in which locks are set on all versions of a data record.

Non-blocking Abort Synchronization

The non-blocking abort strategy latches the source tables for the duration of a log propagation iteration. As previously discussed, this pauses write operations on the source table records. Once the log propagation completes, the DTs are in the same state as the source tables. The locks acquired by transactions in the source tables are then forwarded to the respective records in the DTs. At this point, new transactions are given access to the unlocked parts of the DTs. The source table latches are now released, and all transactions in the old schema are forced to abort. Aborting all old transactions implies that they may not acquire new locks. (An alternative is to only abort the old transactions that try to acquire new locks. The non-blocking abort strategy works in both cases, since new locks are not acquired in the old schema in either case.)

Log propagation continues to iterate until all operations performed on the source tables have been forwarded to the DTs. In addition to redoing write operations, the propagator also ensures that source table locks forwarded to the DTs are released. Forwarded locks are released as soon as log propagation has processed the abort log record of the transaction owning the lock. The source tables can be removed from the schema once all old transactions have aborted.
Non-blocking Commit Synchronization

Non-blocking commit synchronization is similar to the previous strategy in many respects: the source tables are latched, a log propagation iteration is used to synchronize the DT states with the source table states, and source table locks are forwarded to the DTs. In contrast to the previous strategy, however, transactions on the source tables are allowed to continue forward processing after the latches have been removed. Ronström calls this a soft transformation (Ronström, 2000).

A consequence of allowing source table transactions to acquire new locks is that new conflicts may occur across different schema versions. This strategy therefore requires that when a lock is acquired, all other versions of that record are immediately locked as well. A thorough discussion of the implications can be found in Section 5.3, but simply put, a transaction that wants to access a record rdt in a DT has to set a lock both on rdt and on all records that rdt is derived from. Likewise, a transaction accessing a record in a source table has to lock both that record and all DT records derived from it. To avoid unnecessary deadlocks, locks are always acquired in the source tables first, and then in the DTs.

Log propagation has to be modified to forward operations not only from source to derived records, but from derived to source records as well. This is done so that source table transactions can see the updates performed by transactions in the new schema. Log propagation is also responsible for removing source table locks from the DTs and vice versa. This is done in a similar manner as described for the non-blocking abort strategy.

It is clear that the non-blocking abort strategy produces serializable histories: the scheduler uses Strict 2PL to produce serializable histories within each table. Before any transaction is given access to the records in the derived tables, all records that are locked in the source tables are also locked in the derived tables.
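The cross-version locking rule, locking all versions of a record with the source tables always first, can be sketched as follows. This is our own simplified, single-threaded model; a real lock manager would block and enqueue rather than return a boolean.

```python
# Sketch of cross-version locking during non-blocking commit synchronization:
# a transaction must lock every version of a record, and always acquires the
# source-table locks before the DT locks to avoid unnecessary deadlocks.
# Simplified single-threaded model; all names are ours.

locks = {}  # record id -> owning transaction

def lock_all_versions(txn, source_rids, dt_rids):
    """Try to lock all versions of a record: source tables first, then DTs."""
    wanted = list(source_rids) + list(dt_rids)        # fixed acquisition order
    if any(locks.get(r) not in (None, txn) for r in wanted):
        return False                                  # conflict: must wait
    for r in wanted:
        locks[r] = txn
    return True

# T1 updates a DT record derived from two source records (vertical merge):
print(lock_all_versions("T1", ["emp:r02", "addr:r12"], ["dt:d07"]))  # True
# T2 touches one of the same source records and must wait:
print(lock_all_versions("T2", ["addr:r12"], ["dt:d07"]))             # False
```

The example also shows why operators that merge several source records into one DT record make this strategy expensive: one DT lock pulled in two source-table locks.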
These forwarded source locks are not released in the new schema until the log propagator has applied all operations executed by the transaction owning the lock. Hence, transactions in the new schema can only access committed values. Although less intuitive, non-blocking commit synchronization also produces serializable histories: locks are acquired immediately in both schema versions. If a transaction is not able to lock all versions of a record, it has to wait for the lock. Furthermore, forwarded locks are not released until all operations by that transaction have been propagated. Hence, with this strategy as well, transactions in either schema can only access committed values.

Because of the added complexity and increased chance of locking conflicts associated with non-blocking commit, the non-blocking abort strategy may be a better choice for some schema transformation operators. This is especially true for operators where one DT record may be composed of multiple source records, since one single DT lock requires multiple source table locks. Thus, a decision must be made on whether aborting the transactions active in the source tables is worse than risking lock contention. This problem will be discussed in greater detail for each relational operator in Chapter 6.

4.6.1 A lock forwarding improvement for schema transformations

We have argued that for schema transformations, transactions cannot be given access to the new schema before all locks acquired in the old schema have been forwarded. Forwarding all source table locks to the DTs may require a considerable amount of work. This may be unacceptable, since it has to be done during the critical time interval of synchronization when transactions are not given access to any of the involved tables.
Two modifications are required to remove the lock forwarding phase from this critical time interval: first, the initial population step must be modified to store the source table locks present at the first fuzzy mark. Second, DT lock acquisition and release must be included as part of the log propagation. By doing so, locks are continuously forwarded, and are therefore already in place when the synchronization step is started.

4.7 Summary

In this chapter, we have described a framework that will be used to create DTs for the six relational operators in focus. The framework can be used both to perform schema transformations and to create materialized views. An important goal of the framework is to degrade the performance of concurrent transactions as little as possible. The framework is therefore based on copying the source tables in a non-blocking way. Since the source tables are not locked, the copied records, which are inserted into the DTs, may not be consistent with the records in the source tables. Logged operations originally applied to source records are then propagated to the DTs. When all logged source operations have been propagated, the DTs are consistent with the source tables and are ready to be used.

Chapter 5

Common DT Creation Problems

In this chapter, we identify five problems that are encountered during DT creation for multiple relational operators. We call these the missing record identification, missing state identification, missing record pre-state, lock forwarding during transformations, and inconsistent source records problems. In what follows, we discuss these problems and suggest solutions.

5.1 Missing Record and State Identification

For logging to be useful as a recovery mechanism, there must be a way to identify which record a logged operation applies to. Records therefore have a unique identifier, assumed in this thesis to be a Record Identifier (RID), that is stored in each log record. Record IDs are described in more detail in Section 2.1.
Since RIDs are unique, however, a record in a DT cannot have the same RID as the source record(s) it is composed of. Furthermore, even if the source RIDs could have been reused, the DT creations where one DT record may be composed of two source records would still be problematic. We call the problem of mapping record identification from source records to derived records the record identification problem. It is solved by letting the records in the DTs have their own RIDs, while at the same time storing the RID of each source record contributing to them. In vertical merge (full outer join) DT creation, e.g., the RIDs from both source tables would be stored in the DT.

Figure 5.1: The Record and State Identification Problems are solved by including the record IDs and LSNs from both contributing source records in each derived record.

Log Sequence Numbers (LSNs) on records are used as state identifiers to ensure idempotence during recovery and when making fuzzy copies. During recovery or fuzzy copying, each log record is compared to the record with a matching RID, and is applied only if the logged state represents a newer state than that of the record. The LSNs from the source records may be used in the same way for DT creation. Derived records may, however, be composed of more than one source record. In these cases, one LSN is not enough to identify the state of the derived record.
We call this the state identification problem. The problem is solved by including the LSNs of all contributing source records. Both the record and state identification problems are illustrated in Figure 5.1.

5.2 Missing Record Pre-States

The suggested DT creation method is based on applying operations in the log to derived records during log propagation. For the horizontal split, difference and intersection operators, however, some source records may not belong to any of the DTs. The missing record pre-state problem is encountered if any of the records not included in a DT are needed by the log propagator.

The problem can be solved by letting the log propagator acquire the missing information from the source tables. Since the source tables are in a different state than the DTs, however, this solution complicates the log propagation rules. Furthermore, it means that the method is no longer self-maintainable (Blakeley et al., 1989; Quass et al., 1996), as described in Chapter 2. The other solution is to add auxiliary tables that store information on missing records, making the DT self-maintainable. Auxiliary tables were originally suggested for this purpose in MV maintenance by Quass et al. (1996). Consider the following examples:

Figure 5.2: The missing record pre-state problem of Example 5.2.2 is solved for inserts into the first source table of a difference DT. (a) A DT, "Do Not Sell These", storing the difference between Vinyl and CD records, is created. The log propagator does not have the information needed to decide whether the new Eva Cassidy vinyl should belong to the DT or not. (b) By adding the derived state of the CD record table, the log propagator is able to determine that the new vinyl should be inserted into the DT. Note that this only solves one of the missing record pre-state problems for difference DT creation.

Example 5.2.1 (Missing Record Pre-State in Horizontal Split) Consider a DT creation using the horizontal split operator where one table, "Employee", is split into "LondonEmployee" and "ParisEmployee". John, who is the company's only salesman in New York, would not be represented in either of these derived tables. If John later moves to Paris (i.e. an update operation is encountered in the log), the previous state of John's derived record is needed by the log propagator before it can be inserted into the ParisEmployee table.

Example 5.2.2 (Missing Record Pre-State in Difference) A DT "Do Not Sell These" is created as the difference between the tables "Vinyl Records" and "CD Records". During log propagation, a new Eva Cassidy vinyl record is inserted. As can be seen in Figure 5.2(a), the log propagator does not know whether to insert this record into the derived table or not. The log propagator could scan "CD Records" for an equal record, but that table represents a different state.
Furthermore, the operation would not be self-maintainable. By adding the compare-to table containing the derived state of the CD records, the log propagator may scan that table instead. This reveals that the Eva Cassidy record should belong to the DT, as illustrated in Figure 5.2(b).

5.3 Lock Forwarding During Transformations

When either the non-blocking abort or commit strategy is used for synchronization of a schema transformation, old transactions are allowed to operate on the source table records while new transactions operate on the DT records at the same time. As described in Section 4.5, this means that locks have to be forwarded from the source records to the DT records. For non-blocking commit, locks also have to be forwarded from the DTs to the source tables, since the source table transactions are allowed to acquire new locks. Derived locks, i.e. locks that have been forwarded, are released once the log propagator has processed a log record describing the completion of the transaction that owns the lock. Lock forwarding ensures that concurrency control is enforced between the source and DT versions of the same record.

In this section, four lock forwarding cases are discussed. In the first case, simple lock forwarding (SLF), each DT record is derived from one and only one source record.

Figure 5.3: Simple Lock Forwarding (SLF) during Non-Blocking Commit. Source record locks require only one DT record lock to be acquired and vice versa.
Furthermore, a source record contributes to one and only one derived record. The second case, many-to-one lock forwarding (M1LF), applies when each source record contributes to one derived record only, but a record in the DT may be derived from multiple source records. Third, one-to-many lock forwarding (1MLF) is discussed. As the name suggests, it applies when one source record may contribute to many derived records, but each record in the DTs is derived from one source record only. The fourth case, many-to-many lock forwarding (MMLF), inherits the problems of both M1LF and 1MLF.

Simple Lock Forwarding (SLF)

Because of the one-to-one relationship between source and DT records in the simple lock forwarding case, lock forwarding is straightforward. When a source record is locked, the DT record with the same source RID is locked, and vice versa. The horizontal merge with duplicate inclusion, horizontal split into disjoint DTs, difference and intersection operators work like this.

Many-to-One Lock Forwarding (M1LF)

Horizontal merge with duplicate removal and vertical merge of one-to-one relationships are the only transformations presented that need many-to-one lock forwarding. As always, locks on source records ensure that conflicting source operations are not given access at the same time. Concurrency control requires these locks to be forwarded to the DTs before new transactions can be given access to the DTs.

         Src.S  Src.X  DT.S  DT.X
  Src.S    Y      Y     Y     N
  Src.X    Y      Y     N     N
  DT.S     Y      N     Y     N
  DT.X     N      N     N     N

Figure 5.4: Lock compatibility matrix for many-to-one lock forwarding. Locks transferred from the source tables do not conflict with each other, but conflict as normal with locks set on the target tables. The compatibility matrix can easily be extended to multigranularity locking.

If the normal shared (S) and exclusive (X) locks are used in the DTs, however, these non-conflicting source operations could nevertheless be conflicting.
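The modified compatibility rule of Figure 5.4 reduces to a simple predicate: locks forwarded from the source tables (Src.S, Src.X) never conflict with each other, while everything else follows ordinary shared/exclusive semantics. A minimal sketch, with illustrative mode names not taken from the thesis:

```python
def compatible(held, requested):
    """Return True if `requested` can be granted while `held` is set.

    Modes are "Src.S"/"Src.X" (forwarded from the source tables) and
    "DT.S"/"DT.X" (set natively on the derived table)."""
    forwarded = {"Src.S", "Src.X"}
    if held in forwarded and requested in forwarded:
        return True                 # forwarded source locks never conflict
    # otherwise, ordinary S/X semantics: only shared with shared is compatible
    return held.endswith(".S") and requested.endswith(".S")
```

Every entry of the matrix in Figure 5.4 follows from these two lines, e.g. Src.X is compatible with Src.S but not with DT.S.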
This happens if more than one source record contributing to the same derived record is locked. Since the scheduler guarantees that operations on the source records are serializable (Bernstein et al., 1987), there is no need for these locks to conflict. New lock types are therefore suggested. Figure 5.4 shows the lock compatibility matrix used to avoid conflicts between non-conflicting operations forwarded from the source tables. Conflicting operations on the target table are still blocked.

These new locks solve the concurrency issues with M1LF. Locks can now be forwarded from the source records to the DTs without causing conflicts for both non-blocking strategies. If non-blocking commit is used, locks set on DT records must also be set on all source records contributing to the record in question.

One-to-Many Lock Forwarding (1MLF)

The third case is that of one-to-many lock forwarding. It applies to vertical split of one-to-one relationships and to horizontal split, since the resulting DTs may be overlapping, i.e. non-disjoint. Since one DT record may be derived from one source record only, lock compatibility remains unchanged for the non-blocking abort strategy. Thus, the only difference between non-blocking abort in SLF and in 1MLF is that one source record lock may result in many DT record locks.

Non-blocking commit, however, is more complicated. The reason is that if a DT record is updated, the update must be applied not only to the source record it is derived from, but also to all DT records derived from that source record. Locks must also be set on all these records immediately. Otherwise, the target and source versions of the record would not be equal. By doing this, the behavior of transactions operating on the DTs during synchronization differs
from the behavior after synchronization is complete. If this is considered a problem, non-blocking abort must be used instead. This problem will be elaborated on in the detailed operator description sections.

Figure 5.5: Many-to-One Lock Forwarding (M1LF) during Non-Blocking Commit. During synchronization of horizontal merge with duplicate removal, a DT record lock results in a lock of all records it is derived from. Note that record and state identification information is not included.

Many-to-Many Lock Forwarding (MMLF)

Vertical merge and split of one-to-many relationships belong to the fourth category, many-to-many lock forwarding. These operators inherit the problems of both the 1MLF and M1LF cases. This means that the modified lock compatibility scheme must be used for both non-blocking strategies. In addition, operations performed on derived records may have to be forwarded both to multiple source records and to all DT records derived from these if the non-blocking commit strategy is used.

Figure 5.6: Many-to-Many Lock Forwarding (MMLF) during Non-Blocking Commit. A DT record lock on Markus results in two additional source and one additional DT record lock.

5.4 Inconsistent Source Records

As pointed out when describing the schema transformation method by Ronström in Section 3.1, inconsistencies between records in the source tables may be inevitable, and may cause problems for the DT creations where multiple source records contribute to the same derived record. In the detailed DT creation descriptions in Chapter 6, this problem is relevant to vertical split, and to vertical merge for schema transformations.

Figure 5.7 illustrates a typical inconsistency, where records with the same postal code have different city names:

Employee
  F.Name  S.Name    Address    PCode  City
  Hanna   Valiante  Moholt 3   7030   NULL
  Erik    Olsen     Torvbk 6   5121   Bergen
  Markus  Oaks      Mollenb.7  7020   Tr.heim
  Sofie   Clark     Berg 1     7020   Tr.heim
  Peter   Smith     Nordre 2   7020   Tr.hemi

Figure 5.7: Example of Inconsistent Source Records. Three employees have postal code 7020, but the city names are not equal.

The illustrated inconsistency is an anomaly since it breaks a functional dependency (Garcia-Molina et al., 2002). The functional dependency is only intended, however; the DBMS does not enforce it.

How to handle inconsistencies has been studied for different applications where data from multiple sources is integrated into one. These include merging of knowledge bases in the field of Artificial Intelligence, and merging database records from distributed database systems. The problem in both cases is that even though integrity constraints guarantee consistency for each independent information source, the combination of records from multiple sources may still be inconsistent. The solutions in the literature either focus on repairing the integrated records (i.e. making the records consistent) or on answering queries consistently based on the inconsistent records. Only the former is relevant in this thesis.

5.4.1 Repairing Inconsistencies

Before a repair can be performed, the records from all sources are integrated into one table. Conflicts may be removed already at this stage, depending on the selected integration operator. Integration may, e.g., be performed by the merge (Greco et al., 2001a), merge by majority (Lin and Mendelzon, 1999) or prioritized merge (Greco et al., 2001b) integration operators. As the names suggest, merge simply includes all record versions from all sources, while merge by majority tries to resolve conflicts by using the value agreed upon by the majority of sources. Prioritized merge orders the sources so that when conflicts are encountered, the version from the source with higher priority is chosen. Other integration operators also exist, see e.g. (Lin, 1996; Greco et al., 2003; Caroprese and Zumpano, 2006).

If an inconsistency cannot be resolved during integration, the different versions are all stored in the result table. To enable multiple records with the same primary key to exist at the same time, an attribute that refers to the originating source table is added to the primary key (Agarwal et al., 1995). If there are inconsistencies between integrated records, the repair operator is applied to the result table. It identifies the alternative sets of insertions and deletions that will make the table consistent (Greco et al., 2003), and then applies one of these insert-delete sets.
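The merge by majority operator described above can be sketched attribute by attribute: keep a value only if a strict majority of the sources agree on it, and otherwise report the attribute as unresolved. This is an illustrative simplification, not the algorithm of Lin and Mendelzon; all names are assumptions.

```python
from collections import Counter

def merge_by_majority(versions):
    """Merge conflicting record versions (dicts with the same keys).

    Returns (merged, unresolved), where `unresolved` lists the
    attributes for which no strict majority value exists."""
    merged, unresolved = {}, []
    for attr in versions[0]:
        counts = Counter(v[attr] for v in versions)
        value, freq = counts.most_common(1)[0]
        if freq > len(versions) / 2:      # strict majority among sources
            merged[attr] = value
        else:
            unresolved.append(attr)
    return merged, unresolved
```

With the three postal-code-7020 employees of Figure 5.7 as input, the two "Tr.heim" versions outvote the misspelled "Tr.hemi"; with only two disagreeing sources, the conflict stays unresolved.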
Preference rules, or Active Integrity Constraints (AICs), may be added so that in the case of a conflict, one solution is preferred to the others (Flesca et al., 2004). An example AIC presented by Flesca et al. (2004) states that if two conflicting records for an employee are found, the one with the lowest salary should be kept. Even if AICs are used, however, there may be many alternative insert/delete sets. To the author's knowledge, how to choose between these alternatives has not been discussed in the literature.

In vertical split DT creation, merge by majority is used as the integration operator, i.e. during initial population. This integration may be constrained to only resolve conflicts if there is a major majority, e.g. >75%. The derived records where the conflicts are not resolved are then tagged with an inconsistent mark. The repair algorithm will identify these records and present the different alternatives so that the DBA may decide which alternative is correct.

Problem                   Description                          Solution
Missing Record ID         Unable to identify which derived     Add the RID of all contributing
                          record a logged operation            source records to each DT record.
                          applies to.
Missing State ID          Unable to identify whether a         Add the LSN of all contributing
                          logged operation has already been    source records to each DT record.
                          applied to a derived record.
Missing Record            Information about a record that      Store the information in an
Pre-State                 is not stored in any of the DTs      auxiliary table.
                          is required.
Lock Forwarding during    Scheduling problem for records       Forward locks between schema
Transformations           existing in two schema versions.     versions and modify the lock
                                                               compatibilities.
Inconsistent Source       Anomalies are found in the           Resolve by major majority, or tag
Records                   source tables.                       the record as inconsistent and
                                                               ask the DBA.

Table 5.1: Summary of common DT creation problems and their solutions.
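The AIC preference rule cited from Flesca et al. (keep the conflicting employee record with the lowest salary) could be sketched as a repair that picks one version per primary key. The record layout and function name are assumptions for illustration only.

```python
def repair_with_preference(conflicting, key=lambda rec: rec["salary"]):
    """Apply a preference rule to conflicting record versions.

    Keeps exactly one record per primary key ("emp_id"), preferring
    the version with the lowest value of `key` (here: lowest salary),
    in the spirit of the AIC example by Flesca et al."""
    by_pk = {}
    for rec in conflicting:
        pk = rec["emp_id"]
        if pk not in by_pk or key(rec) < key(by_pk[pk]):
            by_pk[pk] = rec
    return list(by_pk.values())
```

Such a rule removes one source of non-determinism, but as noted above, several alternative insert/delete sets may still remain after the preference rules have been applied.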
5.5 Summary

In this chapter, we have identified five problems encountered when our framework is used for DT creation. Identifying these common problems, and showing how they can be solved in general, makes it easier to explain DT creation for each of the six relational operators. It also makes it easier to develop DT creation methods for other relational operators that share the problems. A summary of the problems and their solutions is shown in Table 5.1.

Chapter 6

DT Creation using Relational Operators

This chapter describes how the DT creation process is performed for each relational operator. All methods follow the general framework presented in Chapter 4. The methods are described in order of increasing complexity. As shown in Table 6.1, this is the same order as the lock forwarding categories for schema transformations described in the previous chapter.

The detailed DT creation descriptions start with the difference and intersection operators and horizontal merge with duplicate inclusion. Schema transformations using these operators belong to the Simple Lock Forwarding (SLF) category, and are the least complex operators. Horizontal merge with duplicate removal is more complex than the duplicate inclusion case since records in the DT may be derived from multiple source records. Hence, it belongs to the Many-to-One Lock Forwarding (M1LF) category. The next operator, horizontal split, is the inverse of union. Since the split is allowed to form overlapping result sets, one source record may be derived into multiple DTs. Horizontal split schema transformation belongs to the One-to-Many Lock Forwarding (1MLF) category. The two final operators are vertical merge and split. With these operators, one source record may contribute to multiple derived records. Furthermore, one derived record may be derived from multiple source records¹. The schema transformations using these operators require Many-to-Many Lock Forwarding (MMLF).
Vertical split DT creation is also complicated by possible inconsistencies between source records, and is therefore the most complex operator.

¹ An exception is vertical split over a candidate key. With this operator, records in the DTs are derived from exactly one source record, and it therefore belongs to the 1MLF category.

DT Creation                        Operator                  Lock forwarding category  Section
Difference, intersection           Difference, intersection  SLF                       6.1
Horizontal Merge, Dup Inclusion    Union                     SLF                       6.2
Horizontal Merge, Dup Removal      Union                     M1LF                      6.3
Horizontal Split                   Selection                 1MLF                      6.4
Vertical Merge                     Full Outer Join           MMLF                      6.5
Vertical Split, candidate key      Projection                1MLF                      6.6
Vertical Split, non-candidate key  Projection                MMLF                      6.7

Table 6.1: DT Creation Operators.

6.1 Difference and Intersection

Difference and intersection (diff/int) DT creations are so closely related that the same method is applied to both operations. The method takes two source tables, Sin and Scomp (compared-to), as input. Sin contains the records that belong to either the difference or the intersection set, based on the existence of equal records in Scomp. The output is a DT containing the difference (DTdiff) or intersection (DTint) set of the source tables. An example DT creation is shown in Figure 6.1. In the figure, DTaux is created to solve the missing record pre-state problem described in Chapter 5. Note that in many cases, Scomp is not removed even if the DT is used for a schema transformation.

6.1.1 Preparation

During preparation, the derived table is added to the database schema. It may contain any subset of the attributes from Sin that are wanted in the new DT. It is assumed that if a candidate key is not among these attributes, a generated primary key, e.g. an auto-incremented integer, is added to the DT.
Figure 6.1: Difference and intersection DT creation. Grey attributes are used internally by the DT creation process, and are not visible to normal transactions.

The generated primary key will not be considered when checking which of DTint or DTdiff a record from Sin belongs to. Duplicates may form as a result of only including a subset of the source attributes in the DTs. The implications of this are not discussed, since the method used for horizontal merge with duplicate removal, described later in Section 6.3, can be used for diff/int as well. In addition to the source attributes and the key, the DT must include the source RID and LSN to solve the record and state identification problems. These are shown as grey attributes in Figure 6.1.

The diff/int DT creations suffer from the missing record pre-state problem if only one derived table, storing either the difference or the intersection set, is created. The problem is twofold. First, a record t derived from Sin may at one point in time belong to the difference set and later to the intersection set, or vice versa. This may be caused by an update of the record itself, or by an insert, delete or update of a record in Scomp. The old state of t is needed in both cases. Thus, the first missing record pre-state problem is that the states of records from Sin that do not belong to the difference or intersection DT being created are also needed.
The problem is solved by adding an auxiliary table storing the Sin records that are not in the result set. Thus, both DTdiff and DTint are needed during both DT creations. Second, the states of records derived from Scomp are frequently needed to determine whether a record derived from Sin should belong to DTdiff or DTint. This happens every time a log record describes an update or insert of a record in Sin, as well as when records in Scomp are updated or deleted. In the case that an Scomp record r is updated, e.g., records in DTint that are equal to the old state of r may have to be moved to DTdiff, and records in DTdiff that are equal to the new state of r should be moved to DTint. Thus, the second missing record pre-state problem is that the derived states of records from Scomp are needed as well. It is solved by storing these records in an auxiliary table called DTaux.

Because of the missing record pre-state problems described above, three tables are created during the preparation step: DTdiff, DTint and DTaux. Both auxiliary tables must have the same attributes as the DT. Indices are created on the source RID attributes of all derived tables. If candidate keys from the source tables are included in the DTs, indices should also be created on one of these in all derived tables. With these indices, only one record has to be read when log propagation searches for equal records in any of the DTs². If a candidate key is not included, an index should be added on one or more attributes that differ the most between records. In the worst case, i.e. without an index on any derived attribute, initial population and log propagation must read all records in the derived tables when testing for equality. Unless the source tables contain few records, such DT creations are in danger of never completing.
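The preparation step above can be illustrated with SQLite standing in for the DBMS. All table, column and index names (dt_diff, dt_int, dt_aux, rid_src, lsn) are assumptions for illustration; the thesis only requires that each derived table carries the source RID and LSN, a source-RID index, and, where possible, an index supporting equality scans.

```python
import sqlite3

def prepare_diff_int_tables(conn):
    """Create DTdiff, DTint and DTaux with record/state id columns."""
    for table in ("dt_diff", "dt_int", "dt_aux"):
        conn.execute(f"""
            CREATE TABLE {table} (
                pk      INTEGER PRIMARY KEY AUTOINCREMENT,  -- generated key
                artist  TEXT,                               -- source attribute
                record  TEXT,                               -- source attribute
                rid_src TEXT,                               -- record identification
                lsn     INTEGER                             -- state identification
            )""")
        # source RID index: log propagation finds the derived version
        # of a source record with a single lookup
        conn.execute(f"CREATE INDEX idx_{table}_rid ON {table}(rid_src)")
        # index over the derived attributes to make equality scans cheap
        conn.execute(f"CREATE INDEX idx_{table}_eq ON {table}(artist, record)")

conn = sqlite3.connect(":memory:")
prepare_diff_int_tables(conn)
```

Without the last index, every equality test during initial population and log propagation degenerates into a full scan of the derived table, which is exactly the worst case warned about above.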
6.1.2 Initial Population

Once the derived and auxiliary tables have been created, the fuzzy mark is written to the log. Both source tables are then read fuzzily. Records from Scomp are inserted directly into DTaux, whereas records from Sin are first compared to the records in DTaux. If an equal record is found in DTaux, the Sin record is inserted into DTint. Otherwise, it is inserted into DTdiff. When this step is complete, the DTs are said to contain the initial image.

6.1.3 Log Propagation

Log propagation is organized in iterations. Each iteration starts by writing a fuzzy mark in the log L, and then retrieves all log records relevant to the source tables. The oldest log record that must be retrieved depends on whether or not this is the first log propagation iteration, as discussed in Section 4.4.

If the DT will be used to transform the schema, the synchronization strategy (step 4) must be decided on now. If either of the non-blocking synchronization strategies will be used, locks should be maintained continuously during log propagation so that the locks are in place when synchronization is started.

The log records are applied to the DTs in sequential order. Thus, the log L consists of a partially ordered (Bernstein et al., 1987), finite number of log records, ℓ1, ..., ℓm, that are applied to the DTs in the same order as the logged operations were applied to the source tables. Note that a partial order only guarantees the ordering of conflicting operations.

In the diff/int DT creations, a source record may contribute to only one derived record. Furthermore, since it is assumed that duplicates are not removed, each derived record is derived from only one source record.

² Although reading one record may involve reading many disk blocks if the index is partially stored on disk.
Log propagation has much in common with ARIES redo recovery processing (Mohan et al., 1992) due to this one-to-one relationship between source and derived records. A difference is, however, that records may move between DTint and DTdiff. Since the source candidate keys may not be included in the DTs, multiple records derived from Sin may be equal to each DTaux record. This is reflected in the log propagation rules described next.

Propagation of Sin log records

Consider a log record ℓ ∈ L describing an insert, update or delete of a record in Sin. Independent of the operation described in ℓ, the first step of propagation is to look up the record ID in ℓ in the source RID indices of DTint and DTdiff. The record found is called t.

If the logged operation is an insert of a record r into Sin, and a record t was found in the source RID lookup, the logged operation is already reflected and is therefore ignored. If a record t was not found in any of the DTs, DTaux is scanned for a record with equal attribute values. r is then inserted into either DTint or DTdiff, depending on whether or not an equal record was found in DTaux. As mentioned in the preparation section, the cost of scanning for equal records in DTaux varies greatly with the availability of good indices. If an index on a unique attribute exists, at most one record has to be read to determine which table r belongs to. If no indices are present at all, all records in DTaux may have to be read.

Let ℓupd describe the update of a record r ∈ Sin. If the record ID of r was not found in the source RID lookup in DTint and DTdiff, ℓupd is ignored. t is guaranteed to reflect all relevant operations that happened before ℓupd (and possibly some that happen later) (Løland and Hvasshovd, 2006c). Thus, not finding t in any of the DTs can only mean that a delete of r will be described by a later log record ℓ2 ∈ L, ℓupd ≺ ℓ2, where a ≺ b means that a happens before b.
If the record t was found, and ℓLSN > tLSN, the update described in ℓupd is applied. This update may require t to move between the DTs: if the updated version of t is equal to a record in DTaux, it should be in DTint; otherwise, it should be in DTdiff. Moving t to the other DT is done by deleting the old version of t from one table and inserting the updated version into the other.

Log propagation of delete operations is straightforward. If the record t was found in the source RID lookup, t is deleted.

Propagation of Scomp log records

In contrast to derived Sin records, derived Scomp records may only belong to one table: DTaux. The only reason for maintaining DTaux is to decide which of DTint or DTdiff an Sin record should belong to.

Consider a log record ℓins ∈ L, describing the insertion of a record r into Scomp. The log record is ignored if the RID of r is found in a lookup in the source RID index of DTaux; this means that ℓins is already reflected. Otherwise, r is inserted, and DTdiff is scanned to check if equal records e1, . . ., em are represented there. If found, e1, . . ., em are moved to DTint.

Let ℓupd ∈ L describe an update of a record r in Scomp. If the source RID of r is not found in DTaux, ℓupd is ignored. Otherwise, if the record t with the described source RID is found, and if ℓLSN > tLSN, t is updated. This update may require records to be moved between DTint and DTdiff. DTint and DTaux are first scanned for records equal to the old version of t. If the records e1, . . ., em in DTint are found, and no equal records were found in DTaux, e1, . . ., em are moved to DTdiff. DTdiff is then scanned for records equal to the updated version of t. All matching records are moved to DTint.

Propagation of a delete log record ℓdel ∈ L starts by identifying the derived version r of the record to delete in DTaux. This is done by a lookup in the source RID index.
r is then deleted. If DTaux does not contain other records that are equal to r, DTint is scanned. All records in DTint that are equal to r are then moved to DTdiff.

6.1.4 Synchronization

As argued in Section 4.5, synchronization should not be started until the states of the DTs are very close to the states of the source tables. Hence, in what follows, we assume that this is the case.

If the blocking complete strategy is used, new transactions are first blocked from accessing Sin and Scomp. Transactions already active on the source tables are then either allowed to commit or are forced to abort. When all these transactions have terminated, a final log propagation iteration is executed. This makes the derived tables transaction consistent with the source tables. The DTs are now ready to be used in a schema transformation or as MVs.

The non-blocking strategies differ between schema transformation and MV creation purposes. They are therefore described separately.

Synchronization for Schema Transformations

When performing non-blocking synchronization for schema transformations, transactions are allowed to perform updates in the source and derived tables at the same time. Concurrency control is needed to ensure that different versions of the same record are not updated inconsistently. As described in Section 5.3, this is done by setting locks in both tables. Each record in Sin or Scomp is derived into only one DT record, and each DT record is composed of only one source record. The diff/int schema transformations therefore belong to the simple lock forwarding (SLF) category described in Section 5.3.

Synchronization starts by latching Sin and Scomp for the duration of a log propagation iteration. Read operations are, however, not affected by this latch. Since the states of the source and derived tables do not differ much, this pause should be very brief. This log propagation makes the DTs action consistent with the source tables.
Remember that since locks have been continuously forwarded by log propagation, all locks on source records are also set on the derived records. If the non-blocking abort strategy is used, transactions active on Sin and Scomp are then forced to abort while new transactions are allowed to access DTdiff and/or DTint. The aborting transactions cannot acquire new locks. Log propagation continues to forward the undo operations performed by the aborting source table transactions. Locks forwarded from the source tables are released once the log propagator encounters the abort log record of the transaction holding them. When all source table transactions have terminated, Sin and Scomp may be removed from the schema.

With non-blocking commit, source table transactions are allowed to access new records. In addition to the lock and operation forwarding from source to derived tables performed in non-blocking abort, locks and operations must also be forwarded from the derived tables to the source tables. The reason for this is, of course, that transactions operating on the source tables may access records that have been modified in the derived tables. With SLF, an operation and lock on one DT record t results in an operation and lock on only the one record that t is derived from. Locks are always acquired immediately on both versions of the record, whereas locks are not released until the log propagator encounters the transaction's commit (or abort) log record.

Synchronization for MV Creations

Since transactions do not update records in the DTs when used to create MVs, operations will not be forwarded from DTdiff and DTint to Sin. Sin and Scomp are first latched during one final log propagation iteration. Read operations are still allowed, however. This log propagation makes the DTs action consistent with the source tables. An MV maintenance method is then added to DTdiff and/or DTint before the source table latches are removed. The MV maintenance strategy is now responsible for keeping the MV consistent with the source tables, and DT creation is complete.

6.2 Horizontal Merge with Duplicate Inclusion

The horizontal merge DT creation uses the union relational operator. It takes records from m source tables, S1, . . ., Sm, and inserts these into a derived table DThm. DThm may contain any subset of attributes from the source tables. An example horizontal merge between two source tables, "Vinyl records" and "CD records", is illustrated in Figure 6.2.

Figure 6.2: Horizontal Merge DT creation.

The DT may be defined to keep or remove duplicates. If duplicates are kept, all records in the source tables are represented in the DT. In this case, DThm is self-maintainable. When duplicates are removed, however, multiple source records may contribute to the same derived record, and DT creation using this operator then requires additional information, stored in an auxiliary table, to be self-maintainable (Quass et al., 1996). Horizontal merge with duplicate inclusion is described in this section, whereas duplicate removal is discussed in Section 6.3.

Figure 6.3 shows a slightly modified version of Figure 6.2 where the two source tables each contain one version of the Miles Davis album "Kind of Blue". As expected when duplicates are not removed, the resulting DT contains both. Notice that the different source record IDs (RIDSrc) enable us to identify which record is derived from which source record.

Figure 6.3: Horizontal Merge with Duplicate Inclusion. Notice the duplicate Miles Davis albums (record IDs r103 and r107 in the derived table).

6.2.1 Preparation

Since all records from S1, . . .
, Sm are represented exactly once in DThm, horizontal merge with duplicate inclusion only suffers from the missing record and state identification problems. By including the source RID and LSN in DThm, the derived table is made self-maintainable.

During preparation, the derived table DThm is first added to the schema. The table may include any subset of attributes from the source tables. Since it is common for DBMSs to require a primary key in each table, an auto-generated primary key may have to be added to DThm. This generated primary key is not shown in the figures of this section. DT creation only uses the source RID attribute for record identification, so an index is only required on this attribute.

6.2.2 Initial Population

Initial population starts by writing a fuzzy mark, containing the identifiers of all transactions active on S1, . . ., Sm, to the log. The source tables are then read without the use of locks, and each record is inserted into DThm. The resulting initial image in DThm is not consistent with the source tables at any point in time.

6.2.3 Log Propagation

All source records are represented in DThm when log propagation starts. Furthermore, each source record contributes to only one derived record, and each record in DThm is derived from only one source record. Log propagation can therefore be performed like normal ARIES crash recovery redo work (Mohan et al., 1992), which was discussed in Section 2.3. As for difference and intersection DT creation, a fuzzy mark is first written to the log. The relevant log records, i.e. operations on records in any of S1, . . ., Sm since the last fuzzy mark, are then retrieved and applied to DThm in sequential order.

Propagation of a log record ℓins ∈ L, describing the insert of a source record r into one of S1, . . ., Sm, starts by checking whether the record is already represented in DThm. This is done by a lookup on r's RID in the source RID index of DThm.
If a record is found, the log record is ignored; the derived table already reflects this state (Løland and Hvasshovd, 2006c). If the RID is not found, the derived version of r is inserted into DThm.

Let the log records ℓupd ∈ L and ℓdel ∈ L describe the update and deletion, respectively, of a record r in one of the source tables. Propagation of both operations starts with a lookup in the source RID index of DThm. If no record is found, the log record is ignored. If a record t is found, log propagation of ℓdel simply deletes t. Log propagation of ℓupd checks the LSN and updates t if ℓLSN > tLSN. If the derived table will be used for a schema transformation, locks are maintained in DThm as part of log propagation.

Figure 6.4: The Horizontal Merge shown in Figure 6.3, but with an added "Type" attribute in the "Records" table. With this information, the log propagator is able to insert the new "Oscar Peterson" record into the correct source table.

6.2.4 Synchronization

The synchronization step is performed in the same way as synchronization of diff/int DT creation and is therefore not repeated. There is, however, one potential problem with non-blocking commit during schema transformations.
Consider Example 6.2.1:

Example 6.2.1 (Lack of Information During Non-blocking Commit) A horizontal merge between two tables containing CD records and vinyl records was illustrated in Figure 6.3 on page 78. Notice that the derived table does not include any information that can be used to determine which of the source tables a derived record would belong to. During non-blocking commit synchronization, a transaction inserts a new vinyl record, "Oscar Peterson, The Trio", into the derived table "Records". Since the fact that the new record is a vinyl record cannot be expressed in the attributes of the DT, the log propagator has no way of knowing which source table it belongs to.

Example 6.2.1 illustrates an important problem: non-blocking commit synchronization can only be used for horizontal merge if the log propagator can determine which source table an inserted DThm record belongs to. When this is not the case, non-blocking abort must be used instead. Figure 6.4 illustrates how adding a "Type" attribute solves the problem of Example 6.2.1.

6.3 Horizontal Merge with Duplicate Removal

Horizontal merge DT creation is more complex when duplicates are removed, since multiple source records may then contribute to the same derived record. Although the method still only suffers from the missing record and state ID problems, these cannot be solved simply by adding the source RID and LSN as attributes in DThm. The proposed solution is to create an auxiliary table A in addition to DThm. The source RIDs and LSNs are then stored in A.

Figure 6.5(a) illustrates the same horizontal merge shown in Figure 6.3, but this time with duplicate removal. Thus, the duplicate Miles Davis album "Kind of Blue" is stored only once in the DT. As shown, the auxiliary table A contains three attributes: the RID of the record in the derived table, the RID in the source table, and the current LSN of the record.
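To make the role of A concrete, the following is a minimal Python sketch of inserting one source record under duplicate removal, assuming in-memory dicts for DThm and A (dt_hm maps derived RID to attribute values; a maps source RID to a (derived RID, LSN) pair). The function and variable names are illustrative assumptions, not part of the thesis prototype.

```python
# Illustrative sketch of the auxiliary table A for horizontal merge with
# duplicate removal. dt_hm: derived RID -> attribute values; a: source
# RID -> (derived RID, LSN). All names are assumptions for this sketch.

def insert_source_record(dt_hm, a, src_rid, values, lsn, next_rid):
    """Insert one source record, reusing an equal derived record if present."""
    for der_rid, der_values in dt_hm.items():   # scan for a duplicate
        if der_values == values:
            a[src_rid] = (der_rid, lsn)         # only bookkeeping in A
            return der_rid
    der_rid = next_rid()                        # no duplicate: new DT record
    dt_hm[der_rid] = values
    a[src_rid] = (der_rid, lsn)
    return der_rid
```

Inserting the two "Kind of Blue" records from the example with this sketch yields a single derived record and two A entries sharing its derived RID.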
Another example of horizontal merge with duplicate removal is illustrated in Figure 6.5(b). The source tables are equal to those in Figure 6.5(a), but this time DThm only contains the artist attribute. The result is that all three "Miles Davis" albums in the source tables are merged into one derived record. Regardless of the number of source records contributing to a record in DThm, A is able to store the record and state identification information required to perform the creation. Together, DThm and A are self-maintainable.

Figure 6.5: Horizontal Merge DT Creation with Duplicate Removal. (a) DTid stores record and state identification information on derived records. (b) Two records from "CD" and one from "Vinyl" contribute to the same derived Miles Davis record.

6.3.1 Preparation Step

As discussed, horizontal merge with duplicate removal suffers from the missing record and state identification problems. To solve these, two tables are required: the derived table, DThm, and an auxiliary table A. A includes an attribute for the record ID in the derived and source tables, in addition to the LSN. DThm may consist of any subset of attributes from S1, . . ., Sm. Derived records are identified by performing a lookup in A, and DThm does therefore not have to include the source RID. An index is created on the RID in DThm and on both the source RID and the derived RID in A.

Records are considered duplicates in DThm if they have equal attribute values. This check for equality must be performed very frequently by the DT creation process. Another index should therefore be added to the attribute in DThm that differs the most between the source records. If it is not clear to the DBA which attribute is most divergent between the records, statistics should be acquired from the DBMS.

6.3.2 Initial Population Step

As for DT creation using other relational operators, the initial population starts by writing a fuzzy mark to the log. This log record contains the identifiers of all transactions active on any of the source tables, S1, . . ., Sm. The source tables are then read fuzzily, and the resulting set of records, denoted SR, is inserted into DThm.

Insertion of a source record r ∈ SR into DThm starts by performing a lookup in DThm to identify whether a record teq with equal attribute values is already represented. An index on a divergent attribute, as described in the previous section, would speed up this search tremendously.
If no equal record is found, a record tnew, containing the wanted subset of attribute values from r, is inserted into DThm. A record a is then inserted into A; a consists of the RID of r, the RID of tnew and the LSN of r. If an equal record teq was found in DThm, the insertion into DThm is not performed. Instead, a is inserted into A, consisting of the RID of r, the RID of teq and the LSN of r.

6.3.3 Log Propagation Step

Since the source RIDs are only stored in the auxiliary table A, all log propagation rules must perform lookups in this table to identify derived records. When the log propagator has written a fuzzy mark to the log, all log records relevant to S1, . . ., Sm are retrieved. The log records are then applied in sequential order. If DThm will be used to perform a non-blocking schema transformation, locks should be maintained as part of log propagation, as discussed in Section 5.3.

Let the log record ℓins ∈ L describe the insertion of a record r into one of the source tables S1, . . ., Sm. Propagation starts by performing a lookup on the RID of r in A. If the RID is found, the log record is already reflected in DThm, and ℓins is therefore ignored. Even if the record is not represented in DThm, it may still be a duplicate of an existing record. Thus, DThm must be scanned to check if an existing record teq with equal attribute values exists.

Assuming that an equal record is not found in the scan of the derived table, a record tnew, derived from r, is inserted into DThm. A record anew, containing the RIDs of r and tnew, is then inserted into A. The LSN of this record is set to that of ℓins. If a duplicate record teq is found in DThm, however, only the insert of aeq into A is performed. This record stores the RID of teq instead of tnew, but is otherwise equal to anew.

Consider a log record ℓdel ∈ L, describing the deletion of a record r from any of S1, . . ., Sm.
A lookup is first performed on the source RID index of A. Assuming that a record adel with the RID of r is found, another lookup is performed on A's derived RID index. If adel is the only record in A with this derived RID, r is the only source record contributing to the derived record t ∈ DThm, where the RID of t is equal to the derived RID of adel. In this case, both t and adel are deleted. Otherwise, if adel is not the only record in A with this derived RID, only adel is removed.

New duplicates may form, and old duplicates may be split, as a result of update operations. As for insert and delete operations, log propagation of a log record ℓupd ∈ L, describing an update of a source record r, starts with a lookup in the source RID index of A. If a record a ∈ A with the source RID of r is not found, or if the LSN of a indicates that a newer state is already reflected, ℓupd is ignored. If ℓLSN > aLSN, however, the update should be applied. In this case, a lookup in the derived RID index of A is performed to identify any duplicates of the pre-update version of the record t ∈ DThm derived from r.

If a is the only record in A with this derived RID, t does not represent duplicates. Assume for now that t is only derived from r. DThm is now scanned to find out whether there is a record with attribute values equal to those of t after ℓupd has been applied to it. If there is not, t is updated as dictated by ℓupd. If the updated record is a duplicate of a record tdup, however, t is deleted, and the derived RID of a is updated to refer to the RID of tdup.

In the case that t is derived from more source records than r alone, ℓupd cannot be applied directly to t. DThm is first scanned to find out whether t has duplicates after ℓupd has been applied. If the updated record is a duplicate of tdup ∈ DThm, the derived RID of a is updated to refer to tdup.
If the updated version of t is not equal to any existing record in DThm, a new record tnew is inserted into DThm; tnew represents t after ℓupd has been applied to it. The derived RID of a is then set to the RID of tnew. In all four update cases described above, the LSN of a is updated to the LSN value of ℓupd.

6.3.4 Synchronization Step

The blocking complete synchronization, non-blocking abort synchronization for schema transformations, and non-blocking synchronization for MV creation strategies work as described for difference and intersection. These strategies are not described further. The non-blocking commit strategy for schema transformations is different, however, and is therefore described next. The reason for this difference is that multiple source records may contribute to the same derived record. Hence, horizontal merge with duplicate removal belongs to the Many-to-One Lock Forwarding (M1LF) category.

Non-blocking commit synchronization of schema transformations starts by latching S1, . . ., Sm while a log propagation iteration is performed. The latches do not affect read operations in the source tables. When the iteration is complete, DThm is action consistent with S1, . . ., Sm. Because locks have been maintained as part of log propagation, locks that are set on source records are also set on their counterparts in DThm. These locks must use the modified lock compatibility matrix in Figure 5.4 on page 65. With the modified lock compatibility, locks forwarded from source records do not conflict with each other.

New transactions are now allowed to operate on records in DThm, and the transactions that are active in S1, . . ., Sm may continue processing on the source tables. Since the old transactions are allowed to access new records, locks and operations must also be forwarded from the DT to the source tables.
Hence, for the rest of the synchronization step, a transaction in DThm must acquire locks both on the DThm record t it accesses and on all source records in S1, . . ., Sm that t is derived from. The log propagator continues to process the log to ensure that the operations executed in the source tables are also executed in the DT, and vice versa. Forwarded locks are not released until the log propagator processes the commit or abort log record of the transaction owning the lock. When all source transactions have terminated, S1, . . ., Sm and A may be removed from the schema.

Figure 6.6: Horizontal Split DT creation. Grey attributes are used internally by the DT creation process, and are not visible to normal transactions.

6.4 Horizontal Split Transformation

Horizontal split DT creation uses the selection relational operator. The transformation takes records from one source table, S, and distributes them into two or more derived tables DT1, . . ., DTm by using selection criteria. An example horizontal split of a table containing music on different media is illustrated in Figure 6.6. Other examples include splitting an employee table into "New York employees" and "Paris employees" based on location, or into "high salary employees" and "low salary employees" based on a salary condition like "salary > $40,000". The selection criteria may result in non-disjoint, i.e. overlapping, sets, and may not include all records from the source table.
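The routing of source records into derived tables by selection criteria can be sketched as follows, assuming each DT is represented as a (predicate, dict) pair and an auxiliary table A catches records matching no criterion. All names and the table representation are illustrative assumptions, not the thesis prototype.

```python
# Sketch of horizontal split routing. derived_tables is a list of
# (predicate, dict) pairs; aux plays the role of the auxiliary table A.
# Names are illustrative assumptions for this sketch.

def route_record(derived_tables, aux, rid, record):
    """Insert record into every DT whose selection criterion it satisfies;
    if it satisfies none, insert it into the auxiliary table A."""
    matched = False
    for predicate, table in derived_tables:
        if predicate(record):
            table[rid] = record      # overlapping criteria are allowed
            matched = True
    if not matched:
        aux[rid] = record            # negated criteria of all DTs
    return matched
```

Because the criteria may overlap, a single source record can end up in several derived tables, which is exactly why one source lock may later have to be forwarded to multiple DT records.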
6.4.1 Preparation

Horizontal split suffers from the missing record pre-state problem since the selection criteria may not include all records. As an example, consider the employee table that was split into New York and Paris offices. An employee in London would not match either of these, and is therefore not part of any of the resulting DTs. If, during DT creation, the employee is moved to the Paris office, the old state of the record is required before it can be updated and inserted into the table containing Paris employees. The reason for this is that update log records only contain the new values of the attributes that are changed.

The missing record pre-state problem is solved by adding an auxiliary table A, containing all records that do not qualify for any of the DTs. The selection criterion for this table is the negation of the combined selection criteria of all the DTs. Thus, all records are guaranteed to belong either to one or more of the derived tables DT1, . . ., DTm, or to A.

Horizontal split DT creation also suffers from the missing record and state identification problems. These problems are solved by including the source RID and LSN in all derived tables.

The preparation step consists of creating one table for each selection criterion result set and one for the auxiliary information. In this section, DT1, . . ., DTm, A will be called the derived tables. All tables must include the source RID and LSN, in addition to any subset of attributes from S. As for difference and intersection, it is assumed that a candidate key from S is among the included subset of attributes. Alternatively, the derived tables may include a generated key, e.g. an auto-incremented number, that is assigned to all records. Thus, duplicate removal is not considered here. If required, duplicate removal as described for horizontal merge in Section 6.3 may be used. The log propagation rules always use the source record ID to identify records.
Indices on other attributes are therefore not required.

6.4.2 Initial Population

Initial population starts by writing a fuzzy mark, containing the transaction identifiers of all transactions active on S, to the log. S is then read without setting locks, and each source record is inserted into one or more derived tables, depending on the selection criteria it satisfies. If the record does not match any selection criterion, it is inserted into A.

6.4.3 Log Propagation

After initial population, all source records are represented in at least one derived table. Also, the derived records have a source RID and LSN defining which record and state they represent. With this information, the derived tables are self-maintainable.

Each log propagation iteration starts by writing a fuzzy mark to the log L. All log records between the last fuzzy mark and the new one that are relevant to S are then retrieved. The log records are then processed sequentially using the propagation rules described below. If the DTs will be used to perform a non-blocking schema transformation, locks are maintained on derived records as part of the log propagation.

Consider a log record ℓins ∈ L, describing the insert of a record r into S. A lookup is first performed on the source RID indices of DT1, . . ., DTm and A. If r's record ID, rRID, is found in any of these tables, the operation is already reflected and is ignored. Otherwise, r is evaluated with respect to the selection criteria and inserted into all DTs it matches.

Propagation of an update log record ℓupd ∈ L, updating a record r in S, starts by identifying all records t1, . . ., tn ∈ DT1, . . ., DTm, A derived from r. This is done by performing a lookup of rRID in the source RID index of the derived tables. Since operations performed on a source record are always applied to all derived versions of it, all records in DT1, . . ., DTm, A derived from r have the same LSNs.
More formally,

∀tx ∀ty ((tx, ty ∈ {DT1, . . ., DTm, A} ∧ txRIDSrc = tyRIDSrc) ⇒ txLSN = tyLSN)

where txRIDSrc and tyRIDSrc are the source RIDs of the derived records tx and ty, respectively. This means that whether or not ℓupd has already been applied can be determined by inspecting the LSN of only one of the derived records.

If none of the attributes that are updated by ℓupd are used in the selection criteria, t1, . . ., tn are simply updated. If a selection criterion attribute is updated, however, two sets of derived tables are identified. The first is the set Ppre of derived tables where the pre-update versions of the records derived from r were stored. These are the same tables that t1, . . ., tn were found in. The second is the set Ppost of DTs where the updated versions of the records derived from r should be stored. The update is processed in two steps. First, for each derived record t ∈ {t1, . . ., tn} identified by the initial source RID lookup, t is deleted if t is stored in a table T and T ∉ Ppost. Otherwise, if t is stored in T and T ∈ Ppost, t is updated with the new attribute values of ℓupd. Second, when all records found in the initial lookup have been processed, the updated version of the derived record is inserted into all tables I where I ∉ Ppre and I ∈ Ppost.

When a delete log record ℓdel ∈ L is encountered by the log propagator, a lookup is performed on the source RID index of all derived tables, using the RID of the record described in ℓdel. The records that are found are simply deleted.

6.4.4 Synchronization

The blocking complete and non-blocking MV synchronization strategies work in the same way as described for difference and intersection. Hence, we only focus on non-blocking synchronization for schema transformations here.

Synchronization for Schema Transformations

Non-blocking abort starts by latching S during a log propagation iteration that makes DT1, . . .
, DTm , A action consistent with S. The latch does not affect read operations. With horizontal split, a record in DT1 , . . . , DTm , A may only be derived from one source record, while one source record may contribute to multiple DT records. Hence, this schema transformation belongs to the One-to-Many Lock Forwarding (1MLF) category. In 1MLF, one source lock may have to be forwarded to multiple DT records since a derived record may belong to multiple DTs. As always, the next steps of non-blocking abort are to release the latch and force transactions in the source table to abort. Locks forwarded from S are released in DT1 , . . . , DTm once the abort log record of the transaction holding the lock is encountered by the log propagator. When all source transactions have terminated, S and A may be removed from the schema.

Since the transactions on S may access new records, non-blocking commit synchronization requires the log propagator to forward operations performed on a record r in DT1 , . . . , DTm to the record s ∈ S it is derived from. However, the operation must also be propagated to all records s contributes to, as described in Section 5.3. If not, the other records t1 , . . . , tu derived from s would not be consistent with r. As discussed in Section 5.3, this transaction behavior differs from the behavior after synchronization has completed. If this is not acceptable, non-blocking abort should be used instead.

6.5 Vertical Merge

The vertical merge DT creation method creates a derived table DTvm by applying the full outer join (FOJ) operator on two source tables, Sl and Sr . Sl is the left, and Sr the right, table of the join. In contrast to the inner join and left and right outer join operators, FOJ is lossless in the sense that records with no join match are included in the result. In addition to being lossless, there are multiple reasons for focusing on full outer join.
First, the full outer join result can later be reduced to any of the inner/left/right joins by simply deleting all records that do not have the necessary join matches, whereas going the opposite direction is not possible. Second, full outer join is the only one of these operators that does not suffer from the missing record pre-state problem since all source records are represented at least once in the DT (Løland and Hvasshovd, 2006b).

An example vertical merge DT creation is shown in Figure 6.7. The figure will be used as an example throughout this section.

Employee:
F.Name  S.Name    Address    Zip   RecID  LSN
Hanna   Valiante  Moholt 3   7030  r01    10
Erik    Olsen     Torvbk 6   5121  r02    11
Markus  Oaks      Mollenb.7  7020  r03    12
Sofie   Clark     Berg 1     7020  r04    13

PostalAddress:
Zip   City     RecID  LSN
5121  Bergen   r11    14
7020  Tr.heim  r12    15
9010  Tromsø   r13    16

EmployeePost:
F.Name  S.Name    Address    PCode  City     RID_L  LSN_L  RID_R  LSN_R
Hanna   Valiante  Moholt 3   7030   NULL     r01    10     NULL   NULL
Erik    Olsen     Torvbk 6   5121   Bergen   r02    11     r11    14
Markus  Oaks      Mollenb.7  7020   Tr.heim  r03    12     r12    15
Sofie   Clark     Berg 1     7020   Tr.heim  r04    13     r12    15
NULL    NULL      NULL       9010   Tromsø   NULL   NULL   r13    16

Figure 6.7: Example vertical merge DT creation.

Vertical merge DT creation suffers from the missing record and state identification problems. As argued in Section 5.1, these problems can be solved by including the record IDs and LSNs from both source tables in DTvm . This method is used in this section. An alternative method has been presented by the authors (Løland and Hvasshovd, 2006c), however. It uses candidate keys to identify records and totally ignores LSNs in DTvm . Because the latter method does not solve the record and state identification problems, it has some flaws compared to the one used here. First, the log propagation rules are much less intuitive, and second, it cannot handle semantically rich locks (Løland and Hvasshovd, 2006b).
Thus, in contrast to the method presented here, it cannot handle delta updates (Korth, 1983). On the other hand, it requires slightly less storage space since the two source RID attributes and an additional LSN are not added to DTvm . In the following sections, the four steps of DT creation are explained in detail for the vertical merge operator.

6.5.1 Preparation

During preparation, the derived table DTvm is added to the database schema. This table may include a subset of attributes from the two source tables, but some attributes are mandatory. To solve the record and state identification problems, the source RIDs and LSNs from both Sl and Sr are needed. Since records that should be affected by a logged operation are identified by the RID provided by the log record, indices are added to each of the source RID attributes. In addition, an index is added to the join attribute(s) in DTvm since new join matches may have to be identified as a result of inserts and updates of source table records. Together, these indices provide direct access to all affected records for any operation that may be encountered.

6.5.2 Initial Population

As for the other relational operators, initial population starts by writing a fuzzy mark to the log, containing the identifiers of transactions that have accessed either of the two source tables Sl and Sr . The source tables are then read without using locks. Once read, the full outer join operator is applied, and the joined records are inserted into DTvm . At this point, the state of DTvm is called the initial image.

6.5.3 Log Propagation

No assumption is made on whether the join is over a one-to-one, one-to-many or a many-to-many relationship. An implication of this is that records from both source tables may be represented in multiple records in DTvm . In what follows, it is assumed that the vertical merge is defined over an equijoin.
The method can, however, be modified to use other comparison operators or, in the case of cartesian product, no comparison operator.

Insert rules

Consider a log record `ins ∈ L, describing the insertion of a record r into source table S, S ∈ {Sl , Sr }. The first step of propagating `ins is to perform a lookup of the RID of r, denoted rRID , in either the left or right source RID index of DTvm . The index to use depends on which source table r was originally inserted into. If one or more records in DTvm have rRID as the source RID, the logged operation is already reflected, and `ins is ignored.

If no records with the source RID value of rRID are found in DTvm , every DTvm record with a join attribute value matching that of r, denoted rJM , is identified. The set of records with a matching join attribute value is called JM (Join Match). If no join matches were found, i.e. JM = ∅, r is joined with the null-record and inserted into DTvm . The source RID and LSN are set to those of `ins . If one or more join matches were found, all records t ∈ JM are composed of two source records. One of these is from the same table as r, and is denoted t1 . The other part, t2 , is from the other source table. If two or more records in JM consist of the same t2 part, only one of these records is used. Thus, for each record t ∈ JM with a t2 -part that has not already been processed, t is updated with the attribute values of r iff t1 is the null-record. If t1 is not the null-record, the attribute values of t2 are read and joined with r. This new record is then inserted into DTvm with source RIDs and LSNs from both `ins and t2 .

Update rules

Propagation of an update log record `upd ∈ L, updating the record r from source table S, S ∈ {Sl , Sr }, starts by identifying all records in DTvm partially composed of r. This is done by a lookup of rRID in either the left or right source RID index of the derived table, depending on which table r belongs to.
The set of records that are found is called P. If P = ∅, or if the LSN of all records p ∈ P is greater than the LSN of `upd , the log record is ignored. As argued in Section 6.4.3, the LSNs of all p ∈ P are equal. The LSN is therefore checked only once. The logged update is applied to DTvm if P ≠ ∅ and the LSN of p is lower than that of `upd .

If the join attribute values of r are not updated, all records p ∈ P are simply updated with the new attribute values and LSN of `upd . This is similar to crash recovery redo work as described in Section 2.3, but applied to multiple instances of the same source record. If the join attribute of r is updated, however, log propagation becomes more complex. An additional set, N (new join matches), is first defined. N contains all records in DTvm that match the updated join attribute value of r. It is found by a lookup on the join attribute index of DTvm . Since we assume that the vertical merge DT creation is defined over an equijoin, P and N are disjoint, i.e. non-overlapping, sets.

Records in P are first processed. Each DTvm record p ∈ P is composed of two joined source records: r and one record p2 from the other source table. If p2 is represented in at least one other record in DTvm , p is deleted. This is checked by a lookup of p2 's source RID in the index of DTvm . If p2 is not represented in any other record in DTvm , however, p2 is joined with the null-record. This is done because all source records must be represented when using the full outer join operator.

If N = ∅, r is padded with null-values and inserted. If N ≠ ∅, each record n ∈ N is analyzed. Again, n is composed of two joined records: one record n1 from the same table as r, and one record n2 from the other table. If n is composed of n2 and the null-record, n is updated with the attribute values of r.
If n is the join of n2 and another record n1 , a new record is inserted into DTvm containing the join of r and n2 . In both cases, the source RID and LSN are set to reflect the log record.

Delete rules

The propagation of a delete log record `del ∈ L is fairly intuitive. First, the set D of all records in DTvm consisting of r is identified. For each record d ∈ D, d is deleted if it consists of r joined with the null-record or a record d2 that is represented in at least one other record in DTvm . If d is the only record in DTvm that contains d2 , d2 is joined with the null-record.

6.5.4 Synchronization

The synchronization step is started when the state of DTvm is very close to the states of Sl and Sr . The blocking complete and non-blocking MV synchronization strategies work as described for difference and intersection. These are not explained further.

Non-blocking Synchronization for Schema Transformations

As discussed in Section 5.3, vertical merge DT creation belongs to the many-to-many lock forwarding (MMLF) category. This means that the modified lock compatibility matrix, shown in Figure 5.4 on page 65, must be used so that locks forwarded from the source tables to DTvm do not conflict with each other. The source locks do, however, conflict with locks acquired on DTvm records. Note that if there are many more Sl records than Sr records, a few locks in Sr may result in a considerable number of locks in DTvm .

If the non-blocking abort strategy is used, transactions active on Sl and Sr are forced to abort, while new transactions are allowed to start processing on the unlocked records in DTvm . Log propagation continues to apply the undo operations performed by the aborting source table transactions. Source locks in DTvm are removed once the abort log record of the owner transaction has been processed. Sl and Sr may be removed when all source transactions have terminated.
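As an illustration of the propagation rules above, the delete rule can be sketched in a few lines. The in-memory record model and field names (rid_l, rid_r) below are illustrative assumptions, not the thesis's implementation; None plays the role of the null-record.

```python
# Sketch of the vertical-merge delete rule over a hypothetical in-memory DTvm.
# Each DT record stores the source RIDs of its left and right parts.

def propagate_delete(dt_vm, side, rid):
    """Apply a delete of source record `rid` from table `side` ('l' or 'r')."""
    key, other = ('rid_l', 'rid_r') if side == 'l' else ('rid_r', 'rid_l')
    # D: all DT records partially composed of the deleted source record.
    D = [t for t in dt_vm if t[key] == rid]
    for d in D:
        d2 = d[other]
        elsewhere = [t for t in dt_vm if t is not d and t[other] == d2]
        if d2 is None or elsewhere:
            # d joins r with the null-record, or d2 survives elsewhere: delete d.
            dt_vm.remove(d)
        else:
            # d was the only representative of d2: keep d2 joined with null.
            d[key] = None

# Example: delete left record r03 from the EmployeePost table of Figure 6.7.
dt = [{'rid_l': 'r01', 'rid_r': None},
      {'rid_l': 'r02', 'rid_r': 'r11'},
      {'rid_l': 'r03', 'rid_r': 'r12'},
      {'rid_l': 'r04', 'rid_r': 'r12'}]
propagate_delete(dt, 'l', 'r03')
# r12 is still represented via r04, so the r03 record is simply deleted.
```

Deleting r02 instead would null-pad its partner r11, since r11 is represented nowhere else.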
In the case of non-blocking commit, transactions active on the source tables are allowed to access new records. A consequence of this is that locks and operations from transactions in DTvm must be forwarded to Sl and Sr . This makes synchronization more complicated because the update of a record t in DTvm may have to be propagated not only to the source records tl and tr that t is derived from, but to all DTvm records derived from tl and tr . Consider the following example:

Example 6.5.1 (Updates during non-blocking commit) Figure 6.8 illustrates the vertical merge schema transformation of source tables Employee and Position. There are two employees with position "QA". During non-blocking commit synchronization, a transaction in DTvm updates the salary attribute of Hanna from "$33,000" to "$34,000". This update requires that the "QA" record in the Position source table is locked and updated accordingly. Furthermore, the Erik record in DTvm , which is also derived from this source record, has to be locked and updated with a new salary to maintain consistency.

Employee:
Name    Address    PosID  RID  LSN
Hanna   Moholt 3   005    r01  10
Erik    Torvbk 6   005    r02  11
Markus  Mollenb.7  050    r03  12
Sofie   Berg 1     052    r04  13

Position:
PosID  PosTitle  Salary  RID  LSN
001    sec.tary  $23'    r11  14
005    QA        $34'    r12  72
052    proj mgr  $48'    r13  16
050    sw arch   $33'    r18  68

DTvm:
Name    Address    PosID  PosTitle  Salary  RID_L  RID_R  LSN_L  LSN_R
Hanna   Moholt 3   005    QA        $34'    r01    r12    10     72
Erik    Torvbk 6   005    QA        $34'    r02    r12    11     72
Markus  Mollenb.7  050    sw arch   $33'    r03    r18    12     68
Sofie   Berg 1     052    proj mgr  $48'    r04    r13    13     16
NULL    NULL      NULL   sec.tary  $23'    NULL   r11    NULL   14

Figure 6.8: The updated salary of Hanna (see Example 6.5.1) requires a salary update of the "QA" position, resulting in an increased salary for all "QA" personnel.

It should be clear from Example 6.5.1 that the choice of non-blocking abort vs. commit is not transparent for transactions operating on DTvm .
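The forwarding chain of Example 6.5.1 can be sketched as a toy program; the dict layout and field names below are hypothetical stand-ins for the tables and their locks:

```python
# Sketch of non-blocking commit forwarding (hypothetical in-memory model):
# an update of one DTvm record is propagated to the source record it was
# derived from, and from there to every sibling DTvm record sharing it.

def forward_update(dt_vm, position, rid_r, new_values):
    position[rid_r].update(new_values)    # lock and update the source record
    for t in dt_vm:                       # ...and every DT record derived
        if t['rid_r'] == rid_r:           # from that source record
            t.update(new_values)

position = {'r12': {'PosTitle': 'QA', 'Salary': "$33'"}}
dt_vm = [{'Name': 'Hanna', 'rid_r': 'r12', 'Salary': "$33'"},
         {'Name': 'Erik',  'rid_r': 'r12', 'Salary': "$33'"}]
forward_update(dt_vm, position, 'r12', {'Salary': "$34'"})
# Hanna's update raises Erik's salary as well, as in Figure 6.8.
```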
With non-blocking commit, the behavior of transactions operating during synchronization differs from that of transactions after synchronization completes. Before choosing to use non-blocking commit, the consequences must be considered carefully. The increased number of locks required, the forwarding of numerous update operations between the tables and the non-transparent behavior of transactions operating on DTvm during and after synchronization may outweigh the fact that transactions on the source tables are not aborted.

6.6 Vertical Split over a Candidate Key

Vertical split is the inverse of the full outer join DT creation method described in the previous section, and uses the projection relational operator. It takes one source table S as input, and creates two derived tables, DTl (left result) and DTr (right result), each containing a subset of the source table attributes. Some attributes, called the split attributes, must be included in both DTs. These attributes can later be used to join DTl and DTr . In what follows, we assume that the only overlap between the attribute sets in DTl and DTr is the split attributes. If the split attributes form a candidate key in S, each source record will be derived into exactly one record in each of the DTs, and each record in the DTs will be derived from exactly one source record. The DT creation described in this section therefore belongs to the One-to-Many Lock Forwarding category. An example split is illustrated in Figure 6.9.

Employee:
EmpID  Name    Address    Salary  RID  LSN
01     Hanna   Moholt 3   $40'    r1   101
02     Erik    Torvbk 6   $32'    r2   102
03     Markus  Mollenb.7  $42'    r3   103
04     Sofie   Berg 1     $35'    r4   104

ModifiedEmp:
EmpID  Name    Address    RID  LSN
01     Hanna   Moholt 3   r1   101
02     Erik    Torvbk 6   r2   102
03     Markus  Mollenb.7  r3   103
04     Sofie   Berg 1     r4   104

Salary:
EmpID  Salary  RID  LSN
01     $40'    r1   101
02     $32'    r2   102
03     $42'    r3   103
04     $35'    r4   104

Figure 6.9: Vertical split over a Candidate Key.
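The split of Figure 6.9 can be sketched as follows; the record layout is a hypothetical in-memory stand-in for the tables, and each derived record inherits the source RID and LSN:

```python
# Sketch of initial population for a vertical split over a candidate key.
# Each source record yields exactly one record in DT_l and one in DT_r;
# both inherit the source RID and LSN (field names are illustrative).

def vertical_split(source, left_attrs, right_attrs):
    """Derive DT_l and DT_r from a fuzzy read of S (locks are not modeled)."""
    dt_l, dt_r = [], []
    for rec in source:
        meta = {'rid': rec['rid'], 'lsn': rec['lsn']}   # inherited identifiers
        dt_l.append({**{a: rec[a] for a in left_attrs}, **meta})
        dt_r.append({**{a: rec[a] for a in right_attrs}, **meta})
    return dt_l, dt_r

# One row of the Employee table of Figure 6.9, split over EmpID:
emp = [{'EmpID': '01', 'Name': 'Hanna', 'Address': 'Moholt 3',
        'Salary': "$40'", 'rid': 'r1', 'lsn': 101}]
mod_emp, salary = vertical_split(emp, ['EmpID', 'Name', 'Address'],
                                 ['EmpID', 'Salary'])
```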
If S is split over a functional dependency that is not a candidate key in S, multiple source records may have equal split attribute values and may therefore contribute to the same derived record in DTr . This type of split is typically executed to perform a normalization of the database schema. Vertical split DT creation over a functional dependency is described in Section 6.7.

6.6.1 Preparation

In vertical split, two derived tables, DTl and DTr , are first added to the schema. They typically include two different subsets of attributes from S, but must both include the candidate key used as the split attribute. Both DTl and DTr suffer from the missing record and state identification problems. Since the records in the DTs are derived from exactly one source record, these problems are solved by adding the source RID and LSN directly to the derived tables. Since log propagation will identify all derived records based on the source RID, indices are only required on this attribute in DTl and DTr .

6.6.2 Initial Population

Initial population starts by writing the fuzzy mark, containing the identifiers of all transactions active on S, to the log. S is then read fuzzily, and for each record found in S, one record is inserted into DTl and one into DTr .

6.6.3 Log Propagation

Log propagation is run in iterations. Each iteration writes a new fuzzy mark to the log L, and then retrieves log records relevant to S since the last fuzzy mark (or since the oldest operation of active transactions in the first iteration). These log records are then applied to DTl and DTr in sequential order. If the DTs will be used to perform a non-blocking schema transformation, locks are maintained as part of log propagation.

When a log record `ins ∈ L, describing the insertion of a record r into S, is encountered by log propagation, a lookup of the RID of r is first performed in the source RID index of one of the DTs.
If the RID is found, r is already reflected in the DTs. If not, the wanted subset of attributes from r is inserted into DTl and DTr . Note that it is not necessary to perform a RID lookup in both derived tables since r is either reflected in both or none of them.

A delete log record, describing the deletion of a record r from S, is propagated by performing a lookup in the source RID index of one of the DTs. The log record is ignored if the source RID is not found. Otherwise, the identified record is deleted, and the same process is then applied to the other DT.

Consider a log record `upd ∈ L, describing an update of a record r in S. Again, log propagation starts with a lookup in the source RID index of one of the DTs. `upd may affect attributes in only DTl or DTr . If so, the lookup is performed in this DT. Assuming that a derived record t with correct source RID is found, and that t has a lower LSN than `upd , the described update is applied. If `upd affects attributes in the other DT as well, the procedure is repeated for that table.

Most DBMSs do not allow primary key updates. Thus, the described rules work under the assumption that the DT primary key attributes are not updated. This assumption holds if the same attributes are used as primary keys in S and the DTs. The vertical split example in Figure 6.9 illustrates this. If another candidate key from S is used as primary key in DTl and DTr , however, the log propagator may encounter updates of these attributes. If this is the case, the described update rules must be modified to delete the pre-update derived records and then insert the updated ones, unless the DBMS allows primary key updates.

6.6.4 Synchronization

Vertical split over a candidate key belongs to the 1MLF category. The synchronization strategy works exactly as described for horizontal split in Section 6.4, and is therefore not repeated here.
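As a small illustration of the update rule in Section 6.6.3, the LSN comparison that makes log propagation idempotent can be sketched as follows (the field names and in-memory model are illustrative):

```python
# Sketch of the update rule for a candidate-key split: the LSN check makes
# replayed log records harmless (idempotent propagation).

def propagate_update(dt, rid_src, new_lsn, new_values):
    """Apply an update log record to one DT of a candidate-key split."""
    t = next((r for r in dt if r['rid'] == rid_src), None)  # source RID lookup
    if t is None or t['lsn'] >= new_lsn:
        return False        # record unknown, or update already reflected
    t.update(new_values)
    t['lsn'] = new_lsn
    return True

salary_dt = [{'EmpID': '01', 'Salary': "$40'", 'rid': 'r1', 'lsn': 101}]
propagate_update(salary_dt, 'r1', 110, {'Salary': "$42'"})   # applied
propagate_update(salary_dt, 'r1', 110, {'Salary': "$42'"})   # ignored (replay)
```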
6.7 Vertical Split over a Functional Dependency

This section describes vertical split DT creation when the split attributes are not a candidate key in S. This may, e.g., be done to perform a normalization (Elmasri and Navathe, 2004) of the database schema. An example split over a non-candidate key is illustrated in Figure 6.10. As can be seen, the source table "Employee" has two functional dependencies, and is split over "zip", which is not a candidate key:

firstname, surname → zip
zip → city

The legend is the same as was used in the previous chapter. Thus, the DT creation method splits a table S into two derived tables DTl and DTr , each containing a subset of attributes from the source table. Both tables must include the split attributes.

Employee:
F.Name  S.Name    Zip   City     RID  LSN
Hanna   Valiante  7030  NULL     r1   101
Erik    Olsen     5121  Bergen   r2   102
Markus  Oaks      7020  Tr.heim  r3   103
Sofie   Clark     7020  Tr.heim  r4   104
NULL    NULL      9010  Tromsø   r5   105

ModifiedEmp:
F.Name  S.Name    Zip   RIDSrc  LSN
Hanna   Valiante  7030  r1      101
Erik    Olsen     5121  r2      102
Markus  Oaks      7020  r3      103
Sofie   Clark     7020  r4      104

PostalAddress:
Zip   City     #
5121  Bergen   1
7020  Tr.heim  2
9010  Tromsø   1

Figure 6.10: Vertical split over a non-candidate key.

A consequence of splitting S over a non-candidate key is that multiple source records may have the same split attribute value, e.g. multiple employees with the same zip. These source records should be derived into only one record in DTr . Furthermore, a record in DTr should only be deleted if there are no more records in S with that split attribute value. To be able to decide if this is the case, a counter, similar to that of Gupta et al. (1993), is associated with each DTr record. When a DTr record is first inserted, it has a counter of 1. After that, the counter is increased every time an equal record is inserted, and decreased every time one is deleted.
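The counter maintenance just described can be sketched as follows, using a hypothetical dict keyed on the split attribute value:

```python
# Sketch of the Gupta-style counter on DT_r records: a record is inserted
# with count 1, and deleted only when its count reaches zero.

def dtr_insert(dt_r, split_val, attrs):
    rec = dt_r.get(split_val)
    if rec is None:
        dt_r[split_val] = {'attrs': attrs, 'count': 1}
    else:
        rec['count'] += 1

def dtr_delete(dt_r, split_val):
    rec = dt_r[split_val]
    if rec['count'] == 1:
        del dt_r[split_val]          # last contributing source record is gone
    else:
        rec['count'] -= 1

# Zip 7020 occurs twice in the Employee table of Figure 6.10,
# so its counter in PostalAddress ends at 2:
dt_r = {}
for zip_, city in [('5121', 'Bergen'), ('7020', 'Tr.heim'),
                   ('7020', 'Tr.heim'), ('9010', 'Tromsø')]:
    dtr_insert(dt_r, zip_, {'City': city})
```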
Before the method is described in detail, we show that S may contain inconsistencies that complicate DT creation. Consider the following example:

Example 6.7.1 Consider the table "Employee" below. This table is used as a source table to perform the DT creation illustrated in Figure 6.10.

Firstname  Surname   Zip   City
Hanna      Valiante  9010  Tromsø
Erik       Olsen     5121  Bergen
Markus     Oaks      7020  Trondheim
...        ...       ...   ...
Sofie      Clark     7020  Trnodheim

There are intentionally two functional dependencies in this table:

firstname, surname → zip
zip → city

Notice, however, that there is an inconsistency between employees Markus and Sofie since the zips are the same, whereas the city names differ. Nothing prevents such inconsistencies from occurring in this table, and the DT creation framework can not decide whether "Trondheim" or "Trnodheim" is correct. One of the main reasons for normalization is to avoid such inconsistencies from occurring in the first place.

If inconsistencies like the one in Example 6.7.1 exist in S, we are not able to perform a split transformation without taking measures. For readability, vertical split over a non-candidate key is first explained under the unrealistic assumption that inconsistencies never appear between records in S.3 This provides an easy-to-understand basis for the scenario where inconsistencies may occur. An extension that can handle inconsistencies is then explained in Section 6.7.5.

6.7.1 Preparation

As for vertical split over a candidate key, the DTs suffer from the record and state identification problems. For DTl , this problem can be solved by adding the source RID and LSN as attributes to the table. This can not easily be done with DTr , however. The reason for this is that each record in DTr may be derived from multiple source records.
A possible solution to the missing record and state identification problems of DTr would be to create an auxiliary table A, containing the record IDs from DTr , the source record IDs and the LSNs. This solution was used in the horizontal merge with duplicate removal DT creation described in Section 6.3. As will be clear from the following sections, however, all the required record and state identification information can be found in DTl . Hence, the auxiliary table is not needed.

3 Note that the simplified method can not handle semantically rich locks. Semantically rich locks (Korth, 1983) were described in Chapter 2.

During preparation, DTl is first added to the database schema. In addition to the wanted subset of attributes from S and the split attributes, the source RID and LSN are required. The LSNs will be used to achieve idempotence in both derived tables. The DT creation process will use the source RID attribute of DTl for all lookups. Hence, an index should be added to this attribute. DTr is then added with a subset of attributes from S. Only the split attributes are allowed in both DTs. Instead of the normal source RID attribute, a counter is added to DTr . Since the split is over a functional dependency, the split attributes form a candidate key in DTr , and these should therefore be defined as either primary key or unique. The DT creation process will use the split attributes for lookup, and an index should therefore be added to these.

If the DTs are used to perform a schema transformation, an alternative strategy is to only create the DTr table. Since all attributes needed in DTl are already present in S, S can be renamed to DTl during synchronization after removing unwanted attributes from it. The transformation would require less space, and updates that would not affect attributes in DTr could be ignored by the log propagator.
Unfortunately, the log propagator needs information on both the LSN and the split attribute value of each record in DTl . An auxiliary table A would therefore be needed to keep track of this information during propagation. Although A may potentially be much smaller than DTl , this section describes how the method works when DTl is created as a separate table. Only minor adjustments are needed for the alternative auxiliary method to work.

6.7.2 Initial Population

Initial population starts by writing a fuzzy mark in the log. The fuzzy mark contains the identifiers of all transactions active on S at this point in time. After performing a fuzzy read of S, the records are inserted into DTl and DTr . Insertion into DTl is straightforward; the wanted subset of attributes is inserted together with the source RID and LSN of the record in S. A lookup is then performed on the split attribute index of DTr . If a record with the same split attribute value already exists, the counter of that record is increased.4 If the split attribute value is not found in DTr , a new record is inserted. It consists of the wanted subset of attributes, and a counter value of one.

4 Recall that for now, it is assumed that all records with equal split attribute values are consistent.

6.7.3 Log Propagation

Log propagation is started once the initial images have been inserted into DTl and DTr . Each iteration starts by writing a fuzzy mark to the log L, and then retrieves all log records relevant to S. The log records retrieved are then applied in sequential order to the DTs. If the DTs will be used in a non-blocking schema transformation, locks are maintained as part of log propagation. In general, log propagation for records in DTl is more intuitive than for records in DTr . The reason for this is that each record in DTl is derived from exactly one source record.
This is not the case for the records in DTr , which may be derived from multiple source records. Log propagation of records in DTr must be treated with care. Since an arbitrary number of source records may contribute to the same derived record, the source RID can not be used for identification simply by adding it as an attribute. Instead, the split attribute value of the corresponding DTl record is used for identification. Since there is a one-to-one relationship between source records and records in DTl , the value to search for is found by reading the record tl ∈ DTl with the correct source RID. Furthermore, DTr does not provide correct state identifiers since multiple source records may contribute to each record. Thus, the LSN of tl will be used to determine if a log record is already reflected in DTr . By reading tl , both the record and state identification problems are solved.

The records in DTr may have incorrect state identifiers during DT creation. The reason for this is that there is only one LSN for each derived record tr ∈ DTr . If the source record that last updated tr is later deleted, tr will have a state ID (LSN) that belongs to a source record no longer contributing to it. Nevertheless, the LSN of tr reflects the last update or insert propagated to it. Since all source records contributing to tr are assumed to be consistent, and since the LSN of tr is not used to achieve idempotence, this is not a problem.

Consider a log record `ins ∈ L, describing the insert of a record r into S. The RID of r is first used to perform a lookup in the source RID index of DTl . If a record with this source RID is found, `ins is already reflected in the DTs and is therefore ignored. Otherwise, a record tl with the wanted subset of attribute values from r, including the source RID and LSN, is inserted into DTl . A lookup is then performed on the split attribute index of DTr .
Assuming that a record told ∈ DTr with the same split attribute values is found, the counter of told is increased. If `ins has a higher LSN than the record, the attribute values of told are updated as well. The LSN is then set to the higher LSN value of r and told . If the split attribute value of r is not found in DTr , a new record tr is inserted. It contains the wanted subset of attributes and the LSN from r, in addition to a counter value of one.

Log propagation of a delete log record `del ∈ L starts with the same source RID lookup in DTl . If the RID of the deleted source record is not found, the log record `del is already reflected in the DTs. If a record tl with the correct source RID is found, however, tl is deleted. A lookup is then performed on the split attribute index of DTr , using the split value of tl . If the record tr found in this lookup has a counter of one, the record is deleted. Otherwise, the counter is decreased by one.

Let `upd ∈ L be a log record describing the update of a record r in S. Propagation of `upd starts by performing a lookup in the source RID index of DTl . If no record with this source RID exists, `upd is ignored. Otherwise, if a record tl ∈ DTl is found, and if `upd represents a newer state, i.e. has a higher LSN than tl , the update is applied. `upd is now applied to the attributes in tl if any of these are updated. Even if no attributes in DTl are updated, the LSN of tl is set to that of `upd . Log propagation then continues in DTr if `upd describes updates of attributes there.

Assume for now that the split attribute values of tl are not updated. A lookup is then performed in the split attribute index of DTr , using the split values read from tl . The record found in DTr is called tr . If `upd represents a newer state than tr , i.e. the LSN is higher, `upd is applied, and the LSN is set to reflect the new state. If the split attribute value is updated, log propagation in DTr works by delete and insert.
The record told ∈ DTr is first read by a lookup in the split attribute index of DTr , using the pre-update split attribute value. The counter of told is decreased, and a new record tnew with updated attribute values is inserted, as described for insert log records.

6.7.4 Synchronization

The blocking complete, non-blocking synchronization for MV creation and non-blocking abort for schema transformation strategies work as described for vertical merge. Hence, only non-blocking commit for schema transformations is discussed here. Since each source record is split into two derived records, and each record in DTr may be derived from multiple source records, vertical split transformation over a non-candidate key requires many-to-many lock forwarding (MMLF). As previously discussed, non-blocking commit allows transactions in S to access new records. This implies that operations and locks set on a record t in DTl or DTr must be forwarded to all the records in S that t was derived from. To allow fast lookup of these records, a split attribute index should be added to S. Furthermore, if the operation on t changes the split attribute values, the operation must also be forwarded to the records in the other DT derived from the same source records.

As argued for vertical merge schema transformation in Section 6.5.4, non-blocking abort may be a better choice than non-blocking commit since it is much less prone to locking conflicts. The commit algorithm is also much more complex.

6.7.5 How to Handle Inconsistent Data - An Extension to Vertical Split

In this section, we extend the vertical split DT creation method just described to handle inconsistent source records. The extension is inspired by solutions to similar problems in merging of knowledge bases (Lin, 1996; Greco et al., 2003; Caroprese and Zumpano, 2006) and merging of records in distributed database systems (Flesca et al., 2004) described in Section 5.4.
A flag attribute is added to records in DT_r. The flag may either signal Consistent (C) or Unknown (U). A C flag is used when a derived record is known to be derived from consistent source records, and the U flag is used when it is known to be derived from inconsistent source records or has an unknown consistency state. During initial population, all records in DT_r that were consistent in the fuzzy read get a C-flag. All other records get a U-flag.

The log propagation rules must also be modified to maintain these flags. If the log propagator inserts a record t_new into DT_r, and a record t_old already has that split attribute, the flag of t_old is changed from C to U iff t_old ≠ t_new. The flag change from C to U is also performed for updates if the derived record in DT_r has a counter greater than one. A U-flag can only be changed to C if a logged update applies to all attributes that are not part of the split attributes, and the record has a counter of one.

A "Consistency Checker" (CC) is run regularly as a background thread. A record with a U flag, t_u ∈ DT_r, is chosen. The CC then writes a "Begin Consistency Check on t_u" mark to the log. All records in S contributing to t_u are then read without using locks [5]. If these are consistent in S, another mark stating that t_u is consistent is written to the log together with the correct image of t_u. The CC marks are later encountered by the log propagator. Assuming that no logged operations apply to t_u between the begin and end CC log marks, t_u is guaranteed to be consistent and is changed accordingly. Any modification that applies to t_u between these marks invalidates the result, however.

Note that all records in DT_r should have a C-flag before synchronization is started since a DBA will have to manually fix the problem if inconsistent records still exist. This may take considerable time.
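The flag maintenance rules above can be sketched as a small state machine on a derived record. The class, fields and method names below are illustrative assumptions; only the C/U transition rules themselves come from the text.

```java
import java.util.Arrays;

// Sketch of the C/U flag maintenance rules for a derived record in DT_r.
// C = known to be derived from consistent source records,
// U = possibly inconsistent or unknown consistency state.
public class FlaggedRecord {

    public enum Flag { C, U }

    public Object[] nonSplitAttributes;
    public int counter; // number of contributing source records
    public Flag flag;

    public FlaggedRecord(Object[] nonSplitAttributes, int counter, Flag flag) {
        this.nonSplitAttributes = nonSplitAttributes;
        this.counter = counter;
        this.flag = flag;
    }

    // Propagated insert: if the inserted record disagrees with the stored
    // non-split attributes, the record may be inconsistent, so C becomes U.
    public void onPropagatedInsert(Object[] insertedNonSplitAttributes) {
        counter++;
        if (flag == Flag.C && !Arrays.equals(nonSplitAttributes, insertedNonSplitAttributes)) {
            flag = Flag.U;
        }
    }

    // Propagated update: with more than one contributing source record the
    // flag is downgraded to U; a U-flag is only upgraded to C when the
    // update sets all non-split attributes and exactly one source record
    // contributes to this derived record.
    public void onPropagatedUpdate(boolean updatesAllNonSplitAttributes) {
        if (counter > 1) {
            flag = Flag.U;
        } else if (updatesAllNonSplitAttributes) {
            flag = Flag.C;
        }
    }
}
```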
If the source records contributing to t_u are not consistent, the "Consistency Remover" (CR) is started. It starts by collecting statistics on the source records contributing to t_u. This corresponds to identifying repairs that may remove inconsistencies (Greco et al., 2003). Based on these statistics, the CR may either remove the inconsistencies based on predefined rules, or may suggest solutions to the DBA.

The CR makes inconsistency removal decisions based on rules inspired by integration operators (Lin, 1996; Greco et al., 2003; Caroprese and Zumpano, 2006) and Active Integrity Constraints (AIC) (Flesca et al., 2004). A rule may, e.g., state that the attribute values agreed upon by a majority of source records should be used if more than 75% agree. Many rules may be defined, but if none of these evaluate to true, the DBA must decide what to do. Example 6.7.2 illustrates how the CR works during removal of inconsistencies.

Example 6.7.2 Consider the inconsistency in Example 6.7.1 on page 98 one more time. Assume that the table is split over the zip functional dependency, and that CR is trying to resolve the inconsistency between records with postal zip "7020". Based on a read of the employee table, CR has found the following statistics:

Total number of records with zip "7020": 306
Records agreeing on "Trondheim":  77% (235)
Records agreeing on "trondheim":  22% (68)
Records agreeing on "Trnodheim":   1% (3)

Only one CR rule has been defined, stating that in the case of a 75% majority or more, the majority value is used. Thus, CR can now update the 71 records with cities not equal to "Trondheim".

When the attribute values have been decided upon, either automatically or by the DBA, the CR is ready to remove the inconsistency. All records in S that do not agree on the decided values are now updated, one record at a time.

Footnote 5: Note that since the source table is read, this DT creation is not self-maintainable.
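The majority rule from Example 6.7.2 can be sketched as below. The class and method names are hypothetical; returning null stands in for handing the decision over to the DBA when no rule fires.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a Consistency Remover majority rule: if more than the given
// fraction of source records agree on an attribute value, that value is
// chosen; otherwise null is returned, signalling that the DBA must decide.
public class MajorityRule {

    public static String decide(String[] observedValues, double threshold) {
        // Count how many source records agree on each value.
        Map<String, Integer> counts = new HashMap<>();
        for (String v : observedValues) {
            counts.merge(v, 1, Integer::sum);
        }
        // Fire the rule if one value has a clear enough majority.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if ((double) e.getValue() / observedValues.length > threshold) {
                return e.getKey();
            }
        }
        return null; // no rule evaluated to true: ask the DBA
    }
}
```

With the statistics from Example 6.7.2 (235 of 306 records, about 77%, agreeing on "Trondheim") and a 0.75 threshold, the rule would select "Trondheim".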
The CR must acquire write locks on the involved records to do this, but only one record is locked and updated at a time. When all source records with incorrect attribute values have been updated, CC is again executed for t_u. If no transactions have introduced new inconsistencies during this process, CC will now inform the log propagator to set a C-flag.

6.8 Summary

In this chapter, we have described in detail how the DT creation framework can be used to create derived tables using the six relational operators. The solutions to the DT creation problems described in Chapter 5 have been used extensively in the design of the DT creation methods. Table 6.2 shows a summary of which problems were encountered by which DT creation operator, and how the problems were solved.

The DT creation operators have been presented in order of increasing complexity. This order is closely related to the lock forwarding categorization and therefore the "cardinality" of the operation [6]. It is clear that many records may have to be locked during non-blocking commit synchronization of schema transformations, especially in the MMLF methods. If too many records require locking, it may be better to use the non-blocking abort strategy. However, the number of locks depends heavily on parameters like which types of modifications are frequently performed [7], the number of records in each source table etc. Thus, there is no simple answer for when one method should be used instead of the other. Note that lock contention is not a problem for MV creation.

Footnote 6: How many derived records a source record may contribute to, and how many source records a derived record may be composed of.
Footnote 7: In vertical merge, e.g., modifications to records in S_l cause far fewer locks in the DT than modifications to records in S_r.

Although DT creation for all the relational operators uses the same framework described in Chapter 4, it is clear that the work required to apply log
records to the DTs varies from operator to operator. In DT creation using the difference operator, source record modifications require the log propagator to look up and modify records in two or even three DTs. For other operators, e.g. horizontal merge with duplicate inclusion, log propagation of each logged operation only requires lookup and modification of one record. These differences are expected to cause variations in the performance degradation incurred on transactions running concurrently with DT creation.

In the following chapters, we focus on implementation and testing of the DT creation methods. Our goal is to validate the methods and to indicate to which extent they degrade performance of other transactions.

Method                            | Missing Record and State ID                  | Missing Pre-State                                      | Lock Forwarding Category | Inconsistent Records
Difference, intersection          | Add source RID and LSN to all DTs            | Add two auxiliary tables                               | SLF                      | -
Horizontal Merge, Dup Inclusion   | Add source RID and LSN to DT                 | -                                                      | SLF                      | -
Horizontal Merge, Dup Removal     | Store source RID and LSN in auxiliary table  | -                                                      | M1LF                     | -
Horizontal Split                  | Add source RID and LSN to DT                 | Add auxiliary table for records not qualifying for DTs | 1MLF                     | -
Vertical Merge                    | Add source RID and LSN to DTs                | -                                                      | MMLF                     | -
Vertical Split, candidate key     | Add source RID and LSN to both DTs           | -                                                      | 1MLF                     | -
Vertical Split, non-candidate key | Add source RID and LSN to left DT            | -                                                      | MMLF                     | Run Consistency Check and Repair in parallel with log propagation

Table 6.2: Problems and solutions for the DT Creation methods.

Part III: Implementation and Evaluation

Chapter 7: Implementation Alternatives

In the previous part of this thesis, we described how the derived table creation framework can be used to perform schema transformations and create MVs using six relational operators. We now want to determine the quality of the described methods.
More precisely, we want to determine a) whether the methods work, and b) what the implied costs are in terms of performance degradation for concurrent transactions.

As discussed by Zelkowitz and Wallace (Zelkowitz and Wallace, 1998), experimentation is of great value in validation and evaluation of new techniques in software. In this thesis, two types of experiments are of particular importance: empirical validation and performance experiments.

In the empirical validation experiments, the methods are tested in a controlled environment. An implementation of the methods is viewed as a "black box", and the output of the black box is compared to what we consider to be the correct output. This type of experiment can not be used as a proof of correct execution [1], but rather as a clear indication of correctness. To confirm the results, empirical validation experiments can be "triangulated", i.e., performed in two or more different implementations (Walker et al., 2003). Due to time constraints, triangulation is not performed in this thesis.

In the performance experiments, we consider relative performance, as opposed to absolute performance, to be the interesting metric [2]. This experiment type is highly relevant since an important design goal has been to impact the performance of concurrent transactions as little as possible.

Footnote 1: Although it can be used to prove incorrect execution (Tichy, 1998).
Footnote 2: In this thesis, relative performance denotes the difference in performance while not running DT creation, compared to performance during DT creation. Absolute performance denotes the actual response time or throughput numbers that are acquired from processing capacity benchmarks.

Ideally, non-blocking DT creation should be implemented in a full-scale DBMS. This would have provided excellent conditions for both types of experiments.
However, implementing a DBMS comparable to IBM DB2 or Oracle would be an impossible task due to the very high complexity of such systems. This leaves us with three alternative approaches:

Simulator: Model a DBMS and the non-blocking DT creation functionality, and implement the model in a simulator.

Open Source DBMS: Add the described functionality to an existing open source DBMS, e.g. MySQL, PostgreSQL or Apache Derby.

Prototype: Implement a prototype DBMS from scratch. This prototype has to be significantly simplified compared to modern DBMSs in many aspects, especially those considered not to affect DT creation.

The alternative we decide to use should be usable for both types of experiments. Due to time constraints, we also consider the implementation cost an important factor. Hence, in the following sections, the alternatives are evaluated on three criteria: usability for empirical validation, usability for performance testing, and the cost (time and risk of failure) of development. An evaluation summary is presented in Section 7.4.

7.1 Alternative 1 - Simulation

Assuming that a DBMS and the DT creation strategies can be modelled precisely, simulations can be used to get good estimates of the incurred performance degradation in any simulated hardware environment (Highleyman, 1989). The model would require accurate waiting times for processing and I/O, and correct distributions for task arrivals for all queues. Implementing a model and performing simulations in a simulation program like Desmo-J (Desmo-J, 2006) requires little to moderate implementation work. While it can be used for performance experiments, it can not be used for empirical validation of the non-blocking DT creation methods.

7.2 Alternative 2 - Open Source DBMS

If DT creation functionality was added to an open source DBMS, the modified system could be used both for empirical validation and performance testing.
In contrast to simulations, in which any hardware environment can be simulated, experiments using this alternative will only be executed in one hardware environment.

Many open source Database Management Systems exist. From these, five well-known systems have been selected as potential DBMSs in which the non-blocking DT creation methods may be implemented. These are: Berkeley DB (and Berkeley DB Java Edition), Apache Derby, MySQL with InnoDB, PostgreSQL and Solid Database Engine.

As discussed in Part II, the suggested non-blocking DT creation methods have rigid requirements for the internal DBMS design. Most importantly, the methods require Compensating Log Records (CLR), logical redo logging and state identifiers on record granularity. Hence, in what follows, the five DBMS candidates are evaluated with emphasis on these three requirements.

Berkeley DB and Berkeley DB Java Edition

Berkeley DB and Berkeley DB Java Edition are C and Java implementations of the same design (Oracle Corporation, 2006b); unless otherwise noted, the name Berkeley DB will be used for both in this thesis. It is not a relational DBMS, but rather a storage engine with transaction and recovery support. Our DT creation methods operate on relations, and a mapping from relations to the physical structure would therefore be needed. This can be solved by using the product as a storage engine in MySQL. However, Berkeley DB uses physical-to-page redo logging and page state identifiers (Oracle Corporation, 2006a). It is therefore not considered a suitable candidate for the DT creation methods.

Apache Derby

Apache Derby (Apache Derby, 2007a) is a relational DBMS implemented in Java. It uses ARIES-like (Mohan et al., 1992) recovery with Write Ahead Logging and Compensating Log Records. However, redo log records are physical (Apache Derby, 2007b), and state identifiers are associated with blocks, not records (Apache Derby, 2007c).
This renders Apache Derby unsuited as an implementation candidate.

MySQL with InnoDB

MySQL (MySQL AB, 2007) is designed with a highly modular architecture, and is best described as a high level DBMS with a storage engine interface (MySQL AB, 2006). The high level functions include SQL parsing, query optimization etc., while the storage engines are responsible for concurrency control and recovery (MySQL AB, 2006). Many storage engines exist, e.g., Berkeley DB, InnoDB and SolidDB. MySQL with InnoDB is described below, whereas the Berkeley DB and SolidDB alternatives are treated as individual products.

MySQL with InnoDB is the recommended combination when transaction support is required (MySQL AB, 2006; Kruckenberg and Pipes, 2006). The InnoDB storage engine uses physiological logging (Zaitsev, 2006) and page level LSNs (Kruckenberg and Pipes, 2006). It is therefore not considered a good candidate for non-blocking DT creation.

Solid Database Engine

The Solid Database Engine can be used as a storage engine either in one of Solid Information Technology's embedded DBMSs (e.g. BoostEngine and Embedded Engine), or in MySQL (Solid Info. Tech., 2007). The Solid Database Engine uses normal Write Ahead Logging with physical redo logging (Solid Info. Tech., 2006b). Furthermore, source code inspection reveals that state identifiers are associated with blocks. Hence, we do not consider Solid Database Engine to be a good implementation candidate.

PostgreSQL

PostgreSQL, formerly known as Postgres and Postgres95, was originally created for research purposes during the mid-eighties (PostgreSQL Global Development Group, 2007). Until version 7.1, PostgreSQL used a force buffer strategy, and hence did not write redo operations to the log at all (PostgreSQL Global Development Group, 2001). In version 7.3 [3], undo operations, and therefore CLRs, were not logged (PostgreSQL Global Development Group, 2002).
It is also clear from source code inspection that the redo log is physical to page, and that state identifiers are associated with pages rather than records [4]. The lack of CLRs, physical-to-page redo log records and page state identifiers render PostgreSQL unsuited for the DT creation methods.

Footnote 3: This was the newest version when the implementation alternatives were evaluated.
Footnote 4: See access/xlog.h for details.

Open Source DBMS Discussion

It is clear that none of the five open source DBMSs evaluated in this section are good implementation candidates for the non-blocking DT creation methods. Since neither the log formats nor the state identifiers can be used, both the recovery and cache managers would have to be modified significantly.

DBMS         | Redo Log Format | Granularity of State Identifiers
Berkeley DB  | Physical        | Block
Derby        | Physical        | Block
MySQL/InnoDB | Physical        | Block
Solid DB     | Physical        | Block
PostgreSQL   | Physical        | Block

Table 7.1: Evaluation of Open Source DBMS alternatives.

We consider the implementation cost of making significant changes to unfamiliar code to be very high.

7.3 Alternative 3 - Prototype

A prototype DBMS that includes the non-blocking DT creation methods can be used for empirical validation and performance testing in the same way as an open source DBMS. The two strategies share the problem of fixed hardware, meaning that experiments will be performed in one hardware environment only. As described in the introduction, this is not considered a problem since we are interested in relative performance only.

It is not feasible to implement a new, fully functional DBMS from scratch due to the complexity of such systems. A prototype should therefore only include the parts that are most relevant to DT creation. The prototype is required to function in a manner similar to a traditional DBMS, and should therefore use a standard DBMS design to the largest possible extent.
Figure 7.1 shows a widely accepted DBMS design close to what is used in, e.g., MySQL Enterprise Server 5 (MySQL AB, 2006) and Microsoft SQL Server 2005 (Microsoft TechNet, 2006). The figure also bears close resemblance to the model described by Bernstein et al. (Bernstein et al., 1987). To get an idea of the implementation cost of using a prototype, we consider possible simplifications on a module by module basis in what follows.

[Figure 7.1: Possible Modular Design of Prototype. Modules operating on the logical data model: Connection Manager, SQL Parser, Relational Manager. Modules operating on the physical data model: Scheduler, Recovery Manager, Cache Manager, Data.]

Modules Operating on the Logical Data Model

In full-scale DBMSs, the Connection Manager is responsible for tasks like authentication, thread pooling and providing various connection interfaces. In a prototype, only one connection interface and a thread pooling strategy are required. Authentication, e.g., does not affect DT creation.

The SQL Parser of the prototype has to recognize the SQL statements used in the experiments. Hence, by first performing an analysis of the experiments, the prototype SQL Parser can be significantly simplified to understand only a very limited subset of the SQL language.

A Relational Manager is typically responsible for mapping between the logical data model seen by users, and the physical data model used internally in the DBMS. The module also performs query optimization, which is used to choose the most efficient access order when multiple tables are involved in one SQL statement. Query optimization is a highly sophisticated operation which involves statistical analysis of access paths. This can be totally ignored in the prototype since the DT creation methods do not rely on it. However, this simplification requires careful construction of all SQL statements that will be used in the experiments. In practice, this can be done by e.g. always stating tables in the most efficient order.
With query optimization removed from the module, the Relational Manager is reduced to performing the mapping between the logical and physical data models.

Modules Operating on the Physical Data Model

Schedulers are responsible for enforcing serializable execution of operations. As discussed in Section 2.2, two-phase locking (2PL) is the most commonly used scheduling strategy in modern DBMSs. 2PL is also fairly simple to implement, and should therefore be used in a prototype.

The primary responsibility of Recovery Managers is to correct transaction, system and media failures. In most DBMSs, including all the open source DBMSs evaluated in the previous section, this is done by maintaining a log. In the non-blocking DT creation methods, this log is also used extensively to forward updates to the derived tables. A prototype Recovery Manager implementation is required to maintain a log that can be used to fully recover the database. The ARIES recovery strategy (Mohan et al., 1992) is a good candidate since it is widely accepted and used in many DBMSs, e.g. in Apache Derby (Apache Derby, 2007b). To be usable by the DT creation methods, the module is also required to use Compensating Log Records and logical redo logging.

The final module, the Cache Manager, is responsible for physical data access. In most DBMSs, this includes reading disk blocks into main memory and writing modifications back to disk. A good strategy for choosing which blocks to replace when the memory is full, e.g. the Least Recently Used (LRU) algorithm, is also necessary. As argued in Section 2.3, it is common for Cache Managers to use a steal/no-force buffer strategy. With this strategy, the Cache Manager must cooperate closely with the Recovery Manager so that all operations are recoverable. In a prototype, the Cache Manager is required to cooperate with the Recovery Manager to achieve recoverable operations.
Furthermore, the DT creation methods require state identifiers to be associated with records, not blocks. As was clear from evaluating the open source alternative in the previous section, record state identifiers are not normally used in today's DBMSs.

7.4 Implementation Alternative Discussion

In this chapter, we have evaluated simulations, implementation in open source DBMSs and implementation in a prototype with respect to three important criteria. These criteria were: usability for empirical validation, usability for performance testing and implementation cost.

Criterion                       | Simulation | Open Source DBMS | Prototype
Usability; Empirical Validation | -          | High             | High
Usability; Performance Testing  | Medium     | High             | Medium
Implementation Cost             | Low        | High             | Medium
Risk; Unsuitable Design         | Low        | Medium           | Low

Table 7.2: Evaluation of implementation alternatives.

As is clear from Table 7.2, simulations can not be used to empirically validate the DT creation methods. For this experiment type, the output of a system is compared to what is considered correct output. The quality of the experiment result is therefore determined by the quality of the test sets. Hence, the open source DBMS and prototype alternatives are considered equally well suited for this purpose.

All three alternatives can be used for performance experiments. Using an existing open source DBMS would provide the most reliable performance results. In contrast to the alternatives, these DBMSs are all fully functional systems that include most aspects of common DBMS technology. Both the simulation and prototype alternatives rely on simplified models. However, we consider the latter alternative to provide the most accurate performance results. The reason for this is that it is easier to verify the correctness of the prototype design (Zelkowitz and Wallace, 1998), and because we do not have to make assumptions about processing and waiting times with this alternative.
When it comes to implementation cost, simulation is clearly the least costly alternative. Furthermore, if an open source DBMS with a design suitable for non-blocking DT creation had been found in Section 7.2, the open source alternative would be considered less costly than implementing a prototype. However, the evaluation in Section 7.2 showed that none of the open source DBMSs had a design that was suitable for DT creation. If any of these systems were to be used, both the Cache and Recovery Managers of that DBMS would require significant modifications to support logical logging and record state identifiers. In addition, the Scheduler module would have to be changed to handle modified lock compatibilities and forwarding of locks between source and derived tables. Hence, only the high level modules of the chosen open source DBMS would be usable without drastic changes. Making extensive changes to unfamiliar code is considered by the author to be both more costly and to have a higher risk of failure than implementing a prototype.

In contrast to simulations, the prototype alternative is good for both types of experiments. Furthermore, it has a lower implementation cost and risk than the open source alternative. Based on this evaluation, we consider a prototype to be the best alternative due to time considerations.

Chapter 8: Design of the Non-blocking DBMS

This chapter describes the design of a prototype Database Management System, called the "Non-blocking Database Management System" (NbDBMS), which will be used for the empirical validation and performance experiments described in Chapter 9. As illustrated in Figure 8.1, the prototype has a modular design inspired by what is used in many modern DBMSs, e.g. MySQL Enterprise Server 5 (MySQL AB, 2006) and Microsoft SQL Server 2005 (Microsoft TechNet, 2006).
In addition to providing normal DBMS functionality, NbDBMS is capable of performing the six non-blocking DT creations described in Chapter 6. Figure 8.2 shows a UML class diagram of the most important parts of the prototype. Note that each module in NbDBMS can be replaced by another implementation as long as the module interface remains unchanged.

NbDBMS is simplified significantly compared to modern DBMSs. E.g., only a limited subset of the SQL language is supported, and only a few relational operators are available for user transactions. Furthermore, NbDBMS stores records in main memory only, making buffer management unnecessary. In the following sections, each module is described with emphasis on its similarity to or difference from standard DBMS solutions. The effects of the simplifications are then discussed in Section 8.1.7.

[Figure 8.1: Modular Design Overview of the Non-blocking DBMS. Clients and an Admin connect through Java RMI to the Non-blocking Database, which consists of the Communication Manager, SQL Parser, Relational Manager, Scheduler, Recovery Manager and Data Manager, with the log and tables as stored data.]

[Figure 8.2: UML Class Diagram of the Non-blocking Database System.]

8.1 The Non-blocking DBMS Server

8.1.1 Database Communication Module

The Communication Module (CM) acts as the entry point for all access to the database system. When a client or administrator wants to connect to the database system, the CM sets up a network socket for that program. Java RMI is used for communication, but the sockets have been modified to not buffer replies, thus emphasizing response time. This is necessary because many client programs will be run from the same physical node to simulate high workloads during the experiments. However, replies to different clients at the same physical node should not be buffered and sent as one network package, which would be the default Java RMI behavior.
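The thesis does not detail the exact mechanism used to avoid reply buffering. One plausible way to achieve it in Java RMI is a custom socket factory that disables Nagle's algorithm (TCP_NODELAY) on every connection, so small replies are sent immediately instead of being coalesced into one network package. The class name below is an illustrative assumption.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.rmi.server.RMISocketFactory;

// Sketch: an RMI socket factory that sets TCP_NODELAY on each created
// socket, trading more (smaller) network packets for lower reply latency.
public class NoDelaySocketFactory extends RMISocketFactory {

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket(host, port);
        // Send each reply as soon as it is written instead of waiting to
        // coalesce it with other small writes.
        socket.setTcpNoDelay(true);
        return socket;
    }

    @Override
    public ServerSocket createServerSocket(int port) throws IOException {
        return new ServerSocket(port);
    }
}
```

Such a factory would be installed once, before exporting remote objects, via RMISocketFactory.setSocketFactory(new NoDelaySocketFactory()).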
Once a connection has been established, the CM simply forwards requests from the client or administrator to the appropriate DBMS module. Depending on the request, this is either the SQL Parser Module or the Relational Manager Module. When performance tests are executed, the module is also responsible for writing a performance report file. During response time tests, e.g., clients periodically report their observed response times, which are written to the report file for later analysis.

8.1.2 SQL Parser Module

All user operations and some administrator operations are requested through SQL statements. These statements must be interpreted by the SQL Parser Module (SPM) before they can be processed further. The experiments require a way to perform all basic operations, i.e. insert, delete, update and query. Thus, the SPM is designed to interpret a small subset of the SQL language, including one single syntax for each of these operators. Likewise, for queries, only one single way of writing projections [1], selection criteria, joins, unions etc. is supported. Consult Appendix A for further details about accepted SQL statements.

Footnote 1: Selection of a subset of attributes (Elmasri and Navathe, 2004).

The SPM works by first checking that the SQL statements have correct syntax. Statements with incorrect syntax are rejected, while accepted statements are tokenized. Tokenization is the process of splitting strings into meaningful blocks, called tokens (Lesk and Schmidt, 1990), which are used as parameters in method calls to the Relational Manager. Consider the following example tokenization:
Example 8.1.1 (Tokenization)

Select statement:
select firstname, surname from person where zip=7020;

Tokens:
statement_type:       {select}
table:                {person}
attributes:           {firstname,surname}
select_criterion_eq:  {{zip,7020}}
order_by:             {}

These tokens can then be used in a call to the Relational Manager procedure:

executeQuery(Long transID, String table, Array attributes, Array select_criterion, Array order_by)

Regular Expressions (regex) (Friedl, 2006) are used for both syntax checking and tokenization of SQL statements. Regex is powerful, but becomes complex if many different statement syntaxes are allowed. However, since only a limited set of the SQL language needs to be recognized in NbDBMS, this is not a significant problem in the current implementation. If more complex SQL statements are to be allowed in a future implementation, a lexical analyzer like Lex (Lesk and Schmidt, 1990) should be used instead.

8.1.3 Relational Manager Module

The Relational Manager Module (RMM) maps the logical data model seen by users to the physical data model used internally by NbDBMS. Hence, this is the lowest level module in which table or attribute names are meaningful. The module consists of three classes: RelationalManager, TableMapper and RelationalAlgorithmHelper.

RelationalManager is the main class of the module. It serves as the module's interface to higher level modules, and organizes the logical to physical data mapping. The TableMapper class is used by RelationalManager whenever information is needed about a database schema object, e.g. a table. If the executeQuery method call in Example 8.1.1 is processed, e.g., the RelationalManager has to ask the TableMapper for the internal IDs of the attributes "firstname" and "surname". Other responsibilities of TableMapper include table creation and removal, and providing descriptions of tables.
Table descriptions include attribute names, data types and constraints, and are used when derived tables are created. All information on tables and their attributes is stored in two reserved tables that are created at startup. Other than having reserved names, these behave as other tables. The table manipulation and information gathering performed by the TableMapper is therefore done by updating and querying the records in these. For fast lookup, the TableMapper also maintains a cache of vital schema information. To be able to guarantee that the cached information is valid, only one TableMapper object may be created. This TableMapper object is aware of all changes to the schema since schema manipulations are directed through it.

If the RelationalManager is processing a query that involves set operations, the RelationalAlgorithmHelper class is consulted. It contains static, i.e. stateless, implementations of some set operations, including various joins, union and sort. The join algorithms are implemented using hash join (Shapiro, 1986). Since the database only resides in main memory, this strategy is better than both GRACE join and sort-merge join (Shapiro, 1986).

Union with duplicate removal is implemented with a Hashtable, and assumes that there are no duplicates within either of the two subquery results. All records from the subquery with the fewest records are first copied to the result set, and a Hashtable is created on one of the attributes. The records from the other subquery are then compared to the hashed records and added to the result set if a record with identical attribute values is not found. Ideally, an attribute with a unique constraint should be used in the hash. If this is the case, each record from the second subquery must be compared to at most one record. Note that records from the second table are not added to the Hashtable.
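The union-with-duplicate-removal strategy just described can be sketched as follows. Records are modelled as Object arrays and hashAttr is the index of the attribute used for hashing (ideally one with a unique constraint); the class and method names are illustrative, not taken from the NbDBMS source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;

// Sketch of union with duplicate removal via a hash table on one attribute.
// Each input is assumed to be duplicate-free on its own.
public class UnionHelper {

    public static List<Object[]> unionDistinct(List<Object[]> left, List<Object[]> right, int hashAttr) {
        // Copy the smaller subquery result to the result set and hash it.
        List<Object[]> small = left.size() <= right.size() ? left : right;
        List<Object[]> large = (small == left) ? right : left;

        List<Object[]> result = new ArrayList<>(small);
        HashMap<Object, List<Object[]>> hash = new HashMap<>();
        for (Object[] rec : small) {
            hash.computeIfAbsent(rec[hashAttr], k -> new ArrayList<>()).add(rec);
        }

        // Add a record from the larger subquery only if no hashed record has
        // identical attribute values. Records from this side are never added
        // to the hash table.
        for (Object[] rec : large) {
            List<Object[]> candidates = hash.get(rec[hashAttr]);
            boolean duplicate = false;
            if (candidates != null) {
                for (Object[] cand : candidates) {
                    if (Arrays.equals(cand, rec)) { duplicate = true; break; }
                }
            }
            if (!duplicate) result.add(rec);
        }
        return result;
    }
}
```

If hashAttr has a unique constraint, each candidate list holds at most one record, so each record from the second subquery is compared to at most one hashed record, as noted above.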
The sort operation is implemented with Merge-Sort because this method is both fast (n log n) (Knuth, 1998) and easy to implement.

Consider the sequence diagram in Figure 8.3. This diagram illustrates how the module responds to the following query with a join:

select * from (person join post on person.zip=post.zip) where person.name=John;

Figure 8.3: Relational Manager Module processing the query select * from (person join post on person.zip=post.zip) where person.name=John;

As illustrated, the RMM first requests the attribute ID and type of the “name” attribute in “person” from TableMapper. If the TableMapper has not already cached this information, a read is requested from the reserved “columns” table. The TableMapper then caches this information for future requests. The RMM now knows the attribute ID and type (String) of the “name” attribute, and it uses this to read all person records with the name “John”. The same process is repeated for the “post” table, but without a selection criterion. Before the join is performed, the RMM needs to know which attribute ID should be used to join the records. Again, the TableMapper is consulted. The information is now found in the cache of TableMapper. Finally, the results of the subqueries and the join attribute IDs are sent to the RelationalAlgorithmHelper class, which executes the join.

As already described, the RelationalManager is the lowest level module in which the logical data model has any meaning. It is also the highest level module with knowledge of the physical data model. For this reason, the algorithms for most of the non-blocking DT creation methods are also implemented here. Consult Chapter 6 for details on these algorithms.

8.1.4 Scheduler Module

The Scheduler is responsible for ordering transactional operations on records so that the resulting order is serializable.
Note that since schema information is stored as records in two reserved tables, this implies that schema modifications are also performed in serializable order.

The Scheduler uses a strict Two Phase Locking (2PL) strategy. Thus, locks are acquired when a transaction requests an operation, but are not released until the transaction has terminated. As argued in Section 7.3, this strategy was chosen for multiple reasons: strict 2PL is commonly used in commercial systems, e.g. SQL Server 2005 (Microsoft TechNet, 2006) and DB2 v9 (IBM Information Center, 2006), and it is easy to understand and implement. As opposed to basic 2PL, strict 2PL also avoids cascading aborts (Garcia-Molina et al., 2002). The module supports both shared and exclusive locks on either record or table granularity.

If a transaction issues an operation that results in a locking conflict, the transaction is aborted immediately. This ensures correct, deadlock-free execution (Bernstein et al., 1987), but comes with a performance penalty: in many cases, the conflict could have been resolved simply by waiting for the lock to be released. If so, the transaction is aborted unnecessarily. On the other hand, deadlock detection can be ignored, thus simplifying the module.

Normal transactional requests are processed in three steps in the module. First, the TransactionManager class checks that the transaction is in the active state. If the TransactionManager confirms that the transaction is active, the lock type (shared or exclusive), the transaction ID and the object ID (either a table name and a record ID, or only a table name) are sent to the LockManager.

Figure 8.4: Organization of the log.

If another transaction has a conflicting lock on the object, the LockManager returns an error code. This results in the abortion of the transaction.
Otherwise, if no conflicting locks are found, the LockManager confirms the lock. The Scheduler then sends the operation request to the Recovery Manager Module. While all normal transactional operations require locks, the Scheduler also provides lockless operations to the DT creation methods. Furthermore, methods for lock forwarding from one table to another are implemented in the module.

8.1.5 Recovery Manager Module

The next layer of abstraction, the Recovery Manager Module, is responsible for making the database durable. It is designed for a data manager using the steal and no-force buffer strategies. To ensure durability for this type of data manager, the ARIES protocol is adopted. ARIES is used in many modern DBMSs, including the open source DBMS Apache Derby (Apache Derby, 2007b). The module maintains a logical log of all operations that modify data, and the Write Ahead Logging (WAL) and Force Log at Commit (FLaC) techniques are used to ensure recoverability. Furthermore, a Compensating Log Record (CLR) is written to the log if an operation is undone. Logical logging is sometimes called operation logging because each log record stores the operation that was performed rather than the data values themselves. The “partial action” and “action consistency” problems (Gray and Reuter, 1993) are not encountered in NbDBMS since records are stored in main memory only. This simplification is discussed in the next section. If records were stored on disk, as in most DBMSs, the logical logging strategy adopted here would have to be replaced by, e.g., a two level logging strategy with one logical log and one physiological log. This technique is used in the ClustRa DBMS (Hvasshovd et al., 1995).

As illustrated in Figure 8.4, log records are organized in linked lists. As in ARIES (Mohan et al., 1992), two sets of links are maintained: the first is a link between two succeeding log records, thus maintaining the sequential
order of the log. The second link is between two succeeding log records in the same transaction. The latter is only used to fetch log records when a transaction is aborted. These links are maintained as object references in main memory, but are changed to LSN references when written to disk. In addition to maintaining a sequential log of all executed operations, the Recovery Manager Module is responsible for performing recovery after a crash and for undoing an aborted transaction.

8.1.6 Data Manager Module

The Data Manager Module (DMM) is responsible for storage and retrieval of records, and for performing the actual updates of data records. The module is called Data Manager instead of Cache Manager because records are never written to disk. The records are stored in Java hashtables (Sun Microsystems, 2007), which reside in main memory only. In NbDBMS, these hashtables are called indices. When a table is created, an index is created on the primary key attribute. Indices can later be added to any other attribute. A table with a primary key index and an additional attribute index is illustrated in Figure 8.5.

Figure 8.5: Organization of data records in a table. The table has two indexes; the primary key index (created automatically) and one user specified index on attribute 1.

When a read operation is requested, the Data Manager chooses the index that is best suited to fetch the record. The choice is based on the best match between the selection criterion and available indices. When an update is requested, the record is fetched using the primary key index before being changed. If the operation modifies an indexed attribute, the record must be rehashed in that index.
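A minimal sketch of this index organization is given below, assuming records modelled as String arrays with the primary key in position 0. The class and method names are illustrative, not the actual NbDBMS Data Manager API.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

// Sketch of a table with a hashtable-based primary key index and one
// secondary attribute index; non-key attribute values may be shared by
// several records, so the secondary index maps a value to a record list.
public class IndexedTable {
    private final Hashtable<String, String[]> pkIndex = new Hashtable<>();
    private final Hashtable<String, List<String[]>> attrIndex = new Hashtable<>();
    private final int indexedAttr;

    public IndexedTable(int indexedAttr) { this.indexedAttr = indexedAttr; }

    public void insert(String[] record) {
        pkIndex.put(record[0], record);
        attrIndex.computeIfAbsent(record[indexedAttr], k -> new ArrayList<>())
                 .add(record);
    }

    public String[] readByPk(String pk) { return pkIndex.get(pk); }

    public List<String[]> readByAttr(String value) {
        return attrIndex.getOrDefault(value, new ArrayList<>());
    }

    // An update fetches the record via the primary key index; if the
    // indexed attribute changed, the record is rehashed in the
    // secondary index under its new value.
    public void update(String pk, int attr, String newValue) {
        String[] record = pkIndex.get(pk);
        String oldValue = record[attr];
        record[attr] = newValue;
        if (attr == indexedAttr && !oldValue.equals(newValue)) {
            attrIndex.get(oldValue).remove(record);
            attrIndex.computeIfAbsent(newValue, k -> new ArrayList<>())
                     .add(record);
        }
    }
}
```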
Although numerous DBMSs designed to achieve low response times are main-memory based (Garcia-Molina and Salem, 1992; Cha et al., 1995; Cha et al., 1997; Bratsberg et al., 1997b; Cha and Song, 2004; Solid Info. Tech., 2006a), DBMSs are more often disk based; all the open source DBMSs evaluated in Chapter 7, as well as IBM DB2 (IBM Information Center, 2006) and Microsoft SQL Server 2005 (Microsoft TechNet, 2006), are disk based. Compared to disk based DBMSs, keeping records in main memory only is probably the greatest simplification in the Non-blocking DBMS. By doing so, cache management, i.e. choosing which disk blocks should reside in main memory at any time, can be totally ignored. Furthermore, write operations that would change multiple disk blocks, e.g. by splitting a node in a B-tree, are now atomic. This enables us to use plain logical logging, as argued in the previous section.

8.1.7 Effects of the Simplifications

The previous sections have described the simplifications made in the prototype modules. In what follows, the implications of these are discussed.

As discussed in Section 8.1.2, NbDBMS only recognizes a very limited SQL language, which is defined in Appendix A. This would obviously be a huge drawback if the Non-blocking DBMS were to be used by real applications. However, in the current version, NbDBMS is only intended for empirical validation and performance testing. A predefined transactional workload will be used in these experiments, hence the system only needs to recognize a subset of the SQL language. This simplification is therefore considered not to affect the experiments.

The Scheduler described in Section 8.1.4 is designed to abort transactions that try to acquire conflicting locks. This is called Immediate Restart conflict resolution (Agrawal et al., 1987). Agrawal et al. compared this to the more common strategy of not aborting until a deadlock is detected. The comparison showed that the latter strategy, called blocking conflict resolution, enables higher throughput under most workload scenarios. Hence, we expect immediate restart to reduce the maximum throughput in NbDBMS.
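The essence of immediate restart conflict resolution can be sketched as follows: a lock request that conflicts with an existing lock fails at once instead of waiting, so no deadlock detection is needed. The class and method names are illustrative assumptions, not the actual NbDBMS Scheduler code.

```java
import java.util.HashSet;
import java.util.Hashtable;
import java.util.Set;

// Sketch of an immediate-restart lock manager with shared and
// exclusive locks; a false return value means the caller must abort.
public class ImmediateRestartLockManager {
    private static final class Entry {
        final Set<Long> holders = new HashSet<>();
        boolean exclusive;
    }
    private final Hashtable<String, Entry> locks = new Hashtable<>();

    /** Returns false on conflict; the transaction is then aborted at once. */
    public synchronized boolean tryLock(long txId, String objectId,
                                        boolean exclusive) {
        Entry e = locks.computeIfAbsent(objectId, k -> new Entry());
        boolean onlyHolder = e.holders.isEmpty()
            || (e.holders.size() == 1 && e.holders.contains(txId));
        if (exclusive) {
            if (!onlyHolder) return false;           // other readers or a writer
            e.exclusive = true;
        } else if (e.exclusive && !e.holders.contains(txId)) {
            return false;                            // exclusively held by another tx
        }
        e.holders.add(txId);
        return true;
    }

    /** Strict 2PL: locks are only released when the transaction terminates. */
    public synchronized void releaseAll(long txId) {
        locks.values().forEach(e -> {
            if (e.holders.remove(txId) && e.holders.isEmpty()) e.exclusive = false;
        });
    }
}
```

Because no request ever waits, a wait-for cycle can never form, which is why the deadlock detector can be omitted at the cost of some unnecessary aborts.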
In most circumstances, the non-blocking DT creation methods described in this thesis do not acquire additional locks. This means that the exact same number of locking conflicts should occur for transactions executed during normal processing as for those executed during DT creation. Since immediate restart affects transactions in both cases to the same extent, it is considered not to affect the relative performance between the normal and DT creation cases. Furthermore, the empirical validation experiments remain unaffected. As thoroughly discussed in Chapters 4 to 6, there is one exception in which DT creation does require additional locks. This is during non-blocking synchronization of schema transformations. Here, locks are forwarded between the old and new table versions. In all DT creation operators where one source record contributes to multiple derived records or vice versa, additional locking conflicts are expected. In these cases, immediate restart is expected to cause a higher number of aborts than the blocking strategy would have. NbDBMS is therefore expected to perform worse in this particular case.

The Recovery Manager maintains a pure logical log. As argued by Gray and Reuter, this alone does not provide enough information for crash recovery since many disk operations are not atomic (Gray and Reuter, 1993). The logical log includes all the information needed to create DTs and for performing crash recovery in NbDBMS, however. Thus, the only consequence of this design is a reduced log volume compared to disk based DBMSs. This reduction in log volume is equal for transaction processing in the normal and DT creation cases. It is therefore considered to affect the performance of NbDBMS to a negligible extent.
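The pure logical (operation) log can be sketched as follows: each log record stores the operation performed rather than data values, and undoing an operation appends a Compensating Log Record. Field and class names are illustrative only, not the thesis' Recovery Manager implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of operation logging with CLRs: undo work is itself logged,
// so a crash during abort never repeats completed undo steps.
public class LogicalLog {
    public static class LogRecord {
        public final long lsn, txId;
        public final String op, table, recordId;
        public final boolean clr;
        LogRecord(long lsn, long txId, String op,
                  String table, String recordId, boolean clr) {
            this.lsn = lsn; this.txId = txId; this.op = op;
            this.table = table; this.recordId = recordId; this.clr = clr;
        }
    }

    private final List<LogRecord> log = new ArrayList<>();
    private long nextLsn = 1;

    public LogRecord append(long txId, String op, String table, String recordId) {
        LogRecord r = new LogRecord(nextLsn++, txId, op, table, recordId, false);
        log.add(r);
        return r;
    }

    // A CLR records the inverse operation of the record being undone.
    public LogRecord undo(LogRecord r) {
        String inverse = r.op.equals("insert") ? "delete"
                       : r.op.equals("delete") ? "insert"
                       : "update";   // an update is undone by an update
        LogRecord clr = new LogRecord(nextLsn++, r.txId, inverse,
                                      r.table, r.recordId, true);
        log.add(clr);
        return clr;
    }

    public int size() { return log.size(); }
}
```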
Storing data in main memory only is likely to be the simplification with the greatest impact on the performance of NbDBMS. As discussed in Section 8.1.6, this greatly reduces the complexity of the Data Manager and enables the use of a pure logical log. The chosen strategy is not common, but is used in some DBMSs, including Solid BoostEngine (Solid Info. Tech., 2006a), ClustRa (Hvasshovd et al., 1995) and P*Time (Cha and Song, 2004).

As discussed by Highleyman (Highleyman, 1989), the performance of a DBMS is bound by a bottleneck resource. Example bottlenecks include CPU, disk and network. The “main memory only” simplification implies that the performance results of NbDBMS should be compared to DBMSs that are bound by resources other than cache management. We expect that the empirical validation experiments are unaffected by this design. When it comes to performance testing, the normal processing and DT creation cases are both affected by the design. We therefore consider the relative performance to be affected only to a small extent.

8.2 Client and Administrator Programs

Both the Database Client and Administrator programs have console user interfaces, and connect to the Non-blocking DBMS through Java Remote Method Invocation (RMI) sockets. As described in Section 8.1.1, the sockets have been modified to not queue replies, thus reducing response time. When a client program has connected to NbDBMS, it may perform operations on the database through a limited SQL language. The operations are issued through either the executeRead method, used by queries, or the executeUpdate method, used by inserts, deletes and updates. All operations are requested on behalf of a transaction. Transactions are started by calling the startTransaction method, and terminated by calling either the commit or abort method.
There are two types of clients: one interactive client that accepts SQL operations from a user, and one automated client that generates semi-randomized transactions to simulate workload. Figure 8.6 shows a screen shot of the interactive client.

Figure 8.6: Screen shot of the Client program in action.

The automated client type is used in the experiments, and is discussed further in Chapter 9. The admin program has access to other operations in NbDBMS, but is otherwise similar to the client programs. There are two types of admin programs. One of these is interactive, and is used for manual verification of the DT creation method. The other type is automated and is used in conjunction with the automated clients in the experiments.

8.3 Summary

In this chapter, we have described the design of a prototype DBMS, and discussed the effects of the simplifications we have made to it. The resulting prototype, called the Non-blocking DBMS, is capable of performing basic database operations, like queries and updates, in addition to our suggested DT creation method. Altogether, the prototype consists of approximately 13,000 lines of code. The prototype has been subject to both empirical validation and performance experiments. In the next chapter, we describe the experiments, and discuss the results and implications of these.

Chapter 9

Prototype Testing

In this chapter, we focus on two types of experiments that can be used to determine the quality of the DT creation methods. The first type, empirical validation, is performed to determine whether the DT creation methods work. While it is clear that empirical validation cannot be used to prove absolute correctness of a method (Tichy, 1998), it can provide a clear indication. This experiment type is therefore commonly used in the development of new software techniques (Tichy, 1998; Pfleeger, 1999; Zelkowitz and Wallace, 1998). The second type of experiment is performance experiments.
This type of experiment is highly relevant since an important design goal of the DT creation framework has been to incur only a low degree of performance degradation. The Non-blocking DBMS has been subject to extensive empirical validation and performance experiments. In what follows, we first describe the environment the experiments have been performed in. We then discuss the results and implications of the experiments.

9.1 Test Environment

All tests described in this chapter are performed on seven PCs, called nodes, each with two AMD 1400 MHz CPUs and 1 GB of main memory. All nodes run the Debian GNU/Linux operating system with a 2.6.8 smp kernel (a symmetric multiprocessing (smp) kernel is required to utilize both CPUs), and are connected with a 100 Mb Ethernet LAN. The Non-blocking DBMS described in Chapter 8 is installed on one of the nodes, whereas the other nodes are used for administrator (1 node) and client (5 nodes) programs. The reason for using five nodes with client programs is to generate as realistic workloads as possible with the available resources. In what follows, these nodes are called “server node”, “admin node” and “client nodes”, respectively. The prototype DBMS and all client and administrator programs have been implemented in Java 2 Standard Edition 5.0 (Sun Microsystems, 2006a).

Table 9.1: Hardware and Software Environment for experiments.

Type                       Value
Nodes                      7 (1 server, 1 admin, 5 clients)
CPU                        2 x AMD 1400 MHz per node
Memory                     1 GB per node
Operating System           Debian GNU/Linux, 2.6.8-686-smp kernel
Network                    100 Mb Ethernet
Java Virtual Machine       Java HotSpot Server VM, build 1.5.0_08-b03
Java Compiler              javac 1.5.0_08
Java VM Options, Server    -server -Xms800m -Xmx800m -Xincgc
Java VM Options, Admin     -server
Java VM Options, Client    -server
The Server Node

The NbDBMS server has been run with the following options in all experiments:

java -server -Xms800m -Xmx800m -Xincgc

The -server option selects the Java HotSpot Server Virtual Machine (VM), which is optimized for overall performance. The alternative, the Client VM, has faster startup times and a smaller footprint (i.e. requires less memory), and is better suited for applications with graphical user interfaces (GUIs) etc. (Sun Microsystems, 2006b). Both VMs have been tried for the NbDBMS server, and the Server VM has outperformed the Client VM with ∼15-20% higher maximum throughput. The -Xms800m and -Xmx800m options are used to set the starting and maximum heap sizes to 800 MB. The heap is the memory available to the Java VM, and 800 MB has proven to be slightly below the limit where the Java VM sporadically fails to start due to memory conflicts with other processes. By setting the starting heap size equal to the maximum heap size, the overhead of growing the heap is avoided (Shirazi, 2003). -Xincgc is used to select the incremental garbage collector. This algorithm frequently collects small clusters of garbage, as opposed to the default method of collecting much garbage less frequently (Shirazi, 2003). Note that the impact of using this garbage collection algorithm increases with the heap size. The reason for this is that the default garbage collection algorithm must collect more garbage in each iteration (Shirazi, 2003). In NbDBMS, this option results in significantly lower response time variance.

The Client and Administrator Nodes

Like the Non-blocking DBMS server, the administrator and client programs have also been run with the Server VM. The heap and garbage collection options used on the NbDBMS server made no observable difference for these programs, and are therefore not used in the experiments. Each client node runs one organizer thread that spawns transaction threads from a thread pool.
When spawned, a transaction thread executes one transaction, consisting of six basic operations, before returning to the thread pool. The organizer uses the Poisson distribution for transaction thread spawning, meaning that the number of requests per second varies, but has a defined mean. As argued by Highleyman, the Poisson distribution should be used when we want to simulate requests from an “infinite” number of clients (Highleyman, 1989). By infinite, we mean many more clients than are currently being processed by the database system. The transactions requested by client threads are randomized within boundaries defined for each DT creation operator. In all experiments, six transaction types are specified. These are called Transaction Mixes, and a spawned transaction thread executes one of the transactions specified in the appropriate mix. The transaction mixes are designed to reflect a varied workload so that all log propagation rules are involved in the DT creation process. Hence, the transaction mixes include inserts, updates, deletes and queries on all involved tables. The transaction mixes are shown in Tables 9.2 to 9.4. The reason for having three transaction mixes is that the DT creation methods have different requirements; some have one source table while others have two, and so on. The transactions are similar to those used in TPC-B benchmarks (Serlin, 1993), although the DT creation table setups are not equal to what is specified in TPC-B. More thorough benchmarks, like TPC-C, TPC-D (Ballinger, 1993) and AS3AP (Turbyfill et al., 1993), exist, but these test, to a much greater extent, DBMS functionality that is of no interest to DT creation. A good example is the query in TPC-D that joins seven tables with greatly varying sizes, meant to test the query optimizer, or AS3AP, which tests mixes of batch and interactive queries (Gray, 1993).
Transaction Mix 1

Trans  Nonsource             Source 1              Source 2            Scenario 1 (%)  Scenario 2 (%)
1      -                     6 updates             -                   20              5
2      6 updates             -                     -                   20              20
3      4 reads               1 read                1 read              40              60
4      -                     -                     6 updates           5               2.5
5      3 inserts 3 deletes   -                     -                   10              10
6      -                     2 inserts 2 deletes   1 insert 1 delete   5               2.5

Table 9.2: Transaction Mix 1, used in Difference and Intersection, Vertical Merge and Horizontal Merge DT creation. Transactions 1, 4 and 6 require log propagation processing. This corresponds to 30% of the operations in Scenario 1, and 10% in Scenario 2, which is more read intensive.

Transaction Mix 2

Trans  Nonsource             Source, “Left” part   Source, “Right” part  Scenario 1 (%)  Scenario 2 (%)
1      -                     6 updates             -                     20              5
2      6 updates             -                     -                     20              20
3      4 reads               2 reads               -                     40              60
4      -                     -                     6 updates             5               2.5
5      3 inserts 3 deletes   -                     -                     10              10
6      -                     3 inserts 3 deletes   -                     5               2.5

Table 9.3: Transaction Mix 2, used in Vertical Split DT creation. There is only one source table, but the attributes of this table are derived into either the “left” or “right” derived table. Scenario 1 requires log propagation of 30% of the operations, whereas Scenario 2 requires 10%.

Transaction Mix 3

Trans  Nonsource             Source, Other Attribute  Source, Selection Attribute  Scenario 1 (%)  Scenario 2 (%)
1      -                     6 updates                -                            20              5
2      6 updates             -                        -                            20              20
3      4 reads               2 reads                  -                            40              60
4      -                     -                        6 updates                    5               2.5
5      3 inserts 3 deletes   -                        -                            10              10
6      -                     3 inserts 3 deletes      -                            5               2.5

Table 9.4: Transaction Mix 3, used in Horizontal Split DT creation. There is only one source table, but a derived record may have to move between the derived tables if the attribute used in the selection criterion is updated.
Operation            # and size of records in each table
Difference /         Nonsource: 20,000 records, 100 bytes
Intersection         Source 1: 20,000 records, 80 bytes
                     Source 2: 5,000 records, 80 bytes
Horizontal Merge     Nonsource: 20,000 records, ∼100 bytes
                     Source 1: 20,000 records, ∼80 bytes
                     Source 2: 20,000 records, ∼80 bytes
Horizontal Split     Nonsource: 20,000 records, ∼100 bytes
                     Source 1: 40,000 records, ∼80 bytes
                     Source 2: N/A
Vertical Merge       Nonsource: 20,000 records, ∼100 bytes
                     Source 1: 20,000 records, ∼100 bytes
                     Source 2: 1,300 to 20,000 records, ∼50 bytes
Vertical Split       Nonsource: 20,000 records, ∼100 bytes
                     Source 1: 20,000 records, ∼150 bytes
                     Source 2: N/A

Table 9.5: Table sizes used in the performance test experiments. Note that the empirical validation experiments are performed with 5 times more records in all source tables.

In addition to the source tables, all experiments are performed with one additional table in the schema. This table is called “nonsource”, and is not involved in the DT creation. The idea of having this table is to be able to generate varying workloads without necessarily changing the log propagation work that needs to be done. We also consider it realistic to have a database schema with more tables than those involved in the DT creation process. Depending on the operator being used for DT creation, either one or two source tables are defined in the original database schema. Before an experiment is started, all tables are filled with records. The number of records in each table is shown in Table 9.5.

Since multiple nodes with multiple threads request operations concurrently, it will often be the case that a request arrives at the server while another request is being processed. The number of concurrent server threads may influence the performance. However, we rely on Java RMI to decide on the optimal number of concurrent threads on the server node.
Also, as previously described, the server sockets used to connect the clients and the server are modified to avoid buffering of replies to different transaction threads executed on the same node.

9.2 Empirical Validation of the Non-Blocking DT Creation Methods

Empirical validation experiments have been performed for all DT creation methods in the Non-blocking DBMS. The experiments were executed using the following steps:

1. Populate the source tables with initial data as defined in Table 9.5. Note that five times more records are used in these tests than are specified in the table.

2. Start a workload of semi-random insert, update and delete operations. The workload is described by the transaction mixes defined in Tables 9.2 to 9.4, but the read transaction types are ignored. Execute 200,000 transactions before stopping the workload.

3. Once the random workload has started, start the DT creation process. Let log propagation run until all transactions have completed executing.

4. When all transactions have completed, compare the content of the source tables to that of the derived tables. No attribute values should differ between these schema versions.

For continuity, the tests have been performed with similar source and derived tables as used in the examples in Chapter 6. Thus, the figures used there may be used for reference.

Difference and intersection

In the difference and intersection (diff/int) experiment, records from the “Vinyl records” source table were stored in the difference or intersection DTs based on the existence of equal records in the “CD records” source table. All tables had three attributes: artist firstname, artist surname and record title. To achieve a 20% overlap of records between the source tables, some operations wrote completely random values while others wrote predefined default values.
After the transactions had completed, the records in the DTs were ordered by artist and record name. The difference and intersection operators were then applied on the source tables, and the sorted results were stored in arrays. The records in these arrays were in turn compared to the derived tables. The contents of “CD records” were also compared to those of the auxiliary table. All records, including LSNs, were equal in the source and derived tables.

Horizontal Merge

To reduce the implementation work, the horizontal merge experiments have only been performed for the duplicate removal case. Duplicate removal was chosen over duplicate inclusion because the former is more complex, as argued in Section 6.3. The source tables in the horizontal merge experiment were equal to those used in diff/int. Approximately 20% of the records in “CD Records” were duplicates of “Vinyl Records”. There were no duplicates within one table. When all transactions had completed, the records in both source tables were sorted on name and record title and inserted into an array. The unique records in this array were then compared to the DT, and the record IDs were compared to the auxiliary table. No inconsistencies were found.

Horizontal Split

In the horizontal split experiment, a “Record” source table was split into “Vinyl records” and “CD records”. The attributes in the source table were firstname, surname, record title and type. Type was used to determine which DT a record should be derived to. Possible values of this attribute were “vinyl” (49%), “cd” (49%) or “none” (2%). The latter value was used to indicate that the record did not belong to either DT, and therefore had to
The comparison showed that all records were equal in the source and derived tables. Vertical Merge The vertical merge experiment was conducted by joining the “employee” and “postaladdress” source tables. The resulting DT was called “modified employee”. When all transactions had completed, the comparison of records in the source and derived tables was performed by using the full outer join operator on the source table records. The join result and records in the DT were then sorted on the primary key attribute, social security number (SSN), and stored in arrays. Since the SSN was unique, comparison was straightforward. No inconsistencies were found. Vertical Split Vertical split was performed over a functional dependency, in which the source table “employee” was split into “modified employee” and “postaladdress”. As argued in Section 6.7, this type of vertical split is more complex than the vertical split over a candidate key counterpart since source records may be inconsistent in the former case. The records in the source table were designed to split into four times as many employees as postal addresses. 99% of all write operations to the attributes derived to the “postal address” DT were default values. The remaining 1% were set to non-default values. This resulted in approximately 4% inconsistent records in the final state of the “postal address” DT. The consistency check program was executed in parallel with log propagation. In addition, the consistency check was performed on all records that were flagged as Unknown2 when the transactions had completed. One final log propagation was then executed to achieve correct flag states. The comparison of records was first performed between the source table and the “modified employee” tables. The records from both tables were sorted on the primary key, SSN, before the relevant subset of attributes were compared. No inconsistencies were found in this comparison. The “postal address” table was then checked. 
This was done by inserting the records in the source table into an array. Only the subset of attributes stored in “postal address” DT were stored, and the array was sorted on zip 2 Recall from Section 6.7.5 that derived records are flagged as either (C)onsistent or (U)nknown. 142 9.3. PERFORMANCE TESTING code. Equal zip codes were discarded after checking that the records were equal; if different attribute values were found, the record in the array was flagged as inconsistent. This content of the array was then compared to the content of the “postal address” DT. Some inconsistencies were found, but only on records marked with an Unknown flag in the DT. A cross check revealed that all of these were marked as inconsistent in the array. Also, no derived records with an Unknown flag were marked as consistent in the array. Empirical Validation Summary With the exception of records flagged as Unknown in the vertical split experiment, no inconsistencies have been found between source records and derived records. This indicates that the DT creation methods work correctly. We base this on what we consider to be extensive testing, in which all basic write operations (insert, delete and update), and both normal and abort transaction execution3 has been involved. 200,000 transactions with 6 operations each have been executed in each experiment. Thus, a total of 1,200,000 modifying operations have been made to 300,000 records. No matter how extensive the experiments are, however, empirical validation can never be used as a proof of correctness (Tichy, 1998). Thus, the experiment should ideally be repeated in another implementation to confirm the results (Walker et al., 2003). Due to time considerations, the experiments have only been performed on one implementation in this thesis. 9.3 Performance Testing The following sections discus the performance test results from the nonblocking DT creation experiments. 
The same operators as in the empirical validation experiments are considered. Hence, for horizontal merge, only the duplicate removal case is discussed. Similarly, for vertical split, only the functional dependency case is discussed.

There are two common measurements for database system performance. These are response time, i.e. the time from when a client requests an operation until it receives the response, and throughput, i.e. the number of transactions processed per unit of time (Highleyman, 1989). Results for both measurements are discussed.

The test results presented in this chapter are not a benchmark comparison between the Non-blocking DBMS and a fully functional DBMS. The reason is that the prototype lacks functionality vital to achieving good benchmark results, e.g. the aforementioned query optimizer. Furthermore, as is clear from Section 9.1, the hardware used in the experiments is far from capable of running high performance DBMS benchmarks. What the tests will be used for, however, is to show the relative performance of user transactions when executed alone compared to when executed concurrently with the various DT creation methods. In the following sections, "user transactions" will denote transactions sent from a client application. These are not involved in the DT creation.

The performance of all steps of DT creation is tested, but most emphasis will be put on performance during log propagation. The reason for this is that the other three steps have much shorter execution times and therefore impact performance to a lesser extent.

Thread Priorities

The experiments discussed in the performance test sections have been designed to degrade the performance of concurrent transactions as little as possible.
We achieve this by reducing the priority of the DT creation thread to the point where the log propagator is only capable of applying as many log records as are produced. Hence, with this priority, the number of log records to redo remains unchanged. Small increases in the priority of this thread should therefore result in long execution times with minimal performance degradation. Similarly, a large priority increase should result in shorter execution time at the cost of more performance degradation.

The priority of threads in Java 2 Standard Edition 5.0 can be set so that a high priority thread is scheduled before a low priority thread. However, despite setting the priority to the absolute minimum, DT creation tends to complete very quickly, with inevitably high performance penalties to concurrent transactions. The reason for this is that the Java VM uses Linux system threads to implement Java threads (Austin, 2000), and that these have time slices of 100 ms in the Linux 2.6 kernel (Aas, 2005). Hence, every time the DT creation thread is scheduled, it is allowed to run uninterrupted for 100 ms.

By only modifying thread priorities, we are not able to achieve acceptable transaction performance. The problem is that each requested operation is processed in approximately 1 ms. This means that if there are 50 threads used for transaction processing and one thread for DT creation, and all threads are scheduled once, the DT creation thread gets twice as much CPU time as all the other threads together. Thus, to reduce the priority further, Thread.yield() and Thread.sleep() calls are used on the DT creation thread. This forces the thread to stop processing, thus reducing the effective time slice. By using this technique we are able to fine-tune the priority, and thus find the lowest possible performance degradation for each DT creation method.
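The combination of a minimum thread priority and explicit yield/sleep throttling described above can be sketched as follows. This is a simplified illustration, not the prototype's actual code; the burst size, sleep duration and the stand-in log record counter are all hypothetical.

```java
/** Sketch of the throttled DT creation thread: minimum OS priority for
 *  coarse-grained scheduling, plus yield/sleep to give up the rest of the
 *  100 ms time slice after each small burst of applied log records. */
public class DtCreationThread extends Thread {
    private final int recordsPerBurst; // log records applied per burst
    private final long sleepMillis;    // pause that shortens the effective slice
    int remainingLogRecords;           // stand-in for the real log record queue

    public DtCreationThread(int recordsPerBurst, long sleepMillis, int pendingRecords) {
        this.recordsPerBurst = recordsPerBurst;
        this.sleepMillis = sleepMillis;
        this.remainingLogRecords = pendingRecords;
        setPriority(Thread.MIN_PRIORITY); // coarse-grained priority reduction
    }

    @Override
    public void run() {
        while (remainingLogRecords > 0) {
            // Apply a small burst of log records...
            for (int i = 0; i < recordsPerBurst && remainingLogRecords > 0; i++) {
                applyNextLogRecord();
            }
            // ...then voluntarily give up the CPU to transaction threads.
            Thread.yield();
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private void applyNextLogRecord() {
        remainingLogRecords--; // placeholder for the real redo work
    }
}
```

Tuning `recordsPerBurst` and `sleepMillis` corresponds to the priority fine-tuning described above: smaller bursts and longer sleeps lower the effective priority of DT creation.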
Determining the Maximum Capacity of NbDBMS

Most performance results in this chapter are presented on a 50% to 100% workload scale. This implies that the point of 100% workload, i.e., the maximum transaction capacity of NbDBMS, has to be determined. The maximum capacity differs slightly between the DT creation operators since the transaction mixes are not exactly equal, but the method described here is used to find all of them.

As advised by Highleyman (Highleyman, 1989), the steps used to determine the maximum capacity of a database system are to first define the maximum response time that is considered acceptable. Second, the fraction of transactions that are required to complete within the maximum response time must be defined. The capacity can then be determined by executing test runs and comparing the results with the requirements.

We define 10 ms as the maximum acceptable response time of an operation. Considering the fact that all records are in main memory, requests are only sent over a LAN, and the requested operations do not include complex queries, 10 ms should suffice. A transaction that observes a response time higher than 10 ms for any of its six operations is considered to have failed. It is also decided that 95% of all transactions must complete within an acceptable response time. This is often used as a requirement in telecom systems, e.g. as in ClustRa (Hvasshovd et al., 1995). Considering only transaction failure due to unacceptable response time, 5% transaction failure corresponds to too high response times in 0.85% of all operations, since all transactions consist of 6 operations:

    0.95 = (1 − x)^6
    x = 1 − (0.95)^(1/6) ≈ 0.0085                    (9.1)

Figure 9.1(a) shows the mean operation response times with a workload ranging from 100 to almost 500 transactions per second. It is clear from the graph that the mean response time is much lower than 10 ms in all cases.
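Equation 9.1 can be checked numerically. The sketch below simply solves 0.95 = (1 − x)^6 for x; the class and method names are ours, not the prototype's.

```java
public class CapacityThreshold {
    /** Per-operation failure rate x such that (1 - x)^opsPerTx equals the
     *  required per-transaction success rate (Equation 9.1). */
    static double maxOperationFailureRate(double txSuccessRate, int opsPerTx) {
        return 1.0 - Math.pow(txSuccessRate, 1.0 / opsPerTx);
    }

    public static void main(String[] args) {
        // 95% of transactions (6 operations each) must meet the 10 ms limit.
        double x = maxOperationFailureRate(0.95, 6);
        System.out.printf("max per-operation failure rate: %.4f%n", x); // ~0.0085
    }
}
```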
Figure 9.1(b) shows the upper quartiles for a 99% confidence interval using the results from the same test runs as in the left graph. This graph shows that an increasing number of transactions are not answered in time, especially as the throughput increases above 400. The rapid response time increase in both graphs is expected since the delay over a bottleneck resource is given by (Highleyman, 1989):

    Delay = T / (1 − L)                              (9.2)

Here, T is the average service time over the bottleneck resource, and L is the load. Hence, it is clear that the response time increases towards infinity as the workload approaches 100%. Because of this rapid response time increase, a higher maximum acceptable response time would not increase the maximum throughput much.

[Figure 9.1: Response time and throughput for difference and intersection, using Transaction Mix 1 scenario 1 (see Table 9.2). (a) Response time increases exponentially as the number of transactions per second (tps) increases. The base response time before the rapid increase is at 0.78 ms. (b) Response time 99% upper quartile, i.e., 0.5% of all response times are equal to or higher than the plot. (c) Theoretical and actual throughput when transactions are considered failed if the response time is higher than 10 ms.]
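Equation 9.2 can be illustrated for a few example loads, using the 0.78 ms base response time read from Figure 9.1(a). The class below is an illustration we added; it is not part of the prototype.

```java
public class BottleneckDelay {
    /** Expected delay over a bottleneck resource (Equation 9.2):
     *  serviceTime = T, load = L with 0 <= L < 1. */
    static double delay(double serviceTime, double load) {
        if (load >= 1.0) {
            return Double.POSITIVE_INFINITY; // saturated resource
        }
        return serviceTime / (1.0 - load);
    }

    public static void main(String[] args) {
        double t = 0.78; // base response time in ms (Figure 9.1(a))
        for (double load : new double[] {0.5, 0.8, 0.9, 0.99}) {
            // e.g. load 0.90 -> delay 7.80 ms
            System.out.printf("load %.2f -> delay %.2f ms%n", load, delay(t, load));
        }
    }
}
```

The doubling of delay from 50% to 75% load, and the explosion near 100%, is exactly the "towards infinity" behavior described above.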
Figure 9.1(c) illustrates the throughput for transactions that are processed within the 10 ms per operation response time requirement. The number of transactions processed within the time limit increases almost linearly until the response time becomes so high that more and more transactions fail. The highest throughput was observed at 440 transactions per second. If the workload exceeds 440, the actual throughput starts decreasing. This is, however, above the maximum capacity of NbDBMS with the current transaction set.

A Note on Locking Conflicts in the Performance Experiments

In the empirical validation experiments described in Section 9.2, the source tables were populated with 100,000 or 200,000 records, as shown in Table 9.5. The experiments were performed with approximately 60-70% of the workload used in the performance tests, which resulted in the abortion of 1-3% of the transactions due to locking conflicts.

To get statistically significant results in the performance experiments, a large number of test runs is required. A total of 2,000 test runs have been performed to get the data for the graphs in this chapter. In addition, many test runs have been performed to find the maximum capacity of NbDBMS, ideal thread priorities for the DT creation thread and so forth. However, running this many tests with the same number of records as in the empirical validation tests would take too much time. Hence, the number of records in the source tables has been reduced to 20,000 and 40,000, as shown in Table 9.5. This reduced the execution time of each iteration by more than one half. While this reduction in records makes it possible to run the required number of experiments, it causes another problem: with far fewer records, a very high number of transactions are forced to abort due to locking conflicts even at moderate workloads.
With this setup, the maximum capacity achievable without severely thrashing the throughput does not come close to utilizing the CPU capacity of the server node; the throughput thrashed completely at approximately 65% of the maximum capacity used in the experiments described in this section. Thus, this setup can be used for little more than testing the locking subsystem of NbDBMS. Since the DT creation methods do not acquire additional locks in all but a few cases, tests with such low workloads are not considered very useful.

To be able to perform a high number of test runs and at the same time test the impact of DT creation under high workloads, the transactions used in the performance experiments have been designed to operate on different records. We call this clustered operation. This means that no locking conflicts will occur. When it comes to conflicts, the effect of this design is the same as having many times more records. There should be no difference (except in execution time) in the performance results as long as all records fit in main memory in all cases. To verify this, five diff/int test runs with 200,000 small-sized records in each table and no clustering of the operated-on records have been compared to the 20,000 record case with clustered operations. Apart from garbage collection being more noticeable in the former case, the setups performed with similar response times; the 200,000 setup was less than 10% slower than the 20,000 setup. It is worth noticing that the total size of all records in both setups (∼50 MB in the 200,000 case) was much smaller than the maximum Java VM heap size (800 MB).

9.3.1 Log Propagation - Difference and Intersection

We start the performance evaluation with a thorough discussion of the difference and intersection (diff/int) experiments. As will be clear, the tendencies found here also apply to the experiments for the other DT creation methods. We will not repeat the arguments and discussion for these experiments.
Refer to Appendix B for plots of these experiments.

The diff/int tests use Transaction Mix 1, shown in Table 9.2, as workload. Experiments are conducted for two scenarios: in the first scenario, the source tables are frequently written to, i.e. 30% of all operations are updates, inserts or deletes on the source tables. The second scenario is much more read intensive, and the number of write operations on the source tables is reduced to 10%. Because write operations on the source tables are the only operations that must be propagated to the DTs, scenario 1 should produce three times more log records to propagate. Thus, scenario 1 is expected to incur much higher performance degradation on normal transactions.

Response Time Distribution

Consider Figures 9.2(a) and 9.2(b), showing the distribution of operation response times under 50% workload.

[Figure 9.2: Response time distribution for 50% and 80% workload for difference and intersection transactions using scenario 1 from Table 9.2. (a) Distribution of response time for 50% workload before DT creation is started. (b) Distribution of response time for 50% workload during log propagation for difference DT creation. (c) Distribution of response time for 80% workload before DT creation is started. (d) Distribution of response time for 80% workload during log propagation for difference DT creation.]

The left figure shows the response
times without DT creation, and the right shows response times during log propagation. The histograms show that the distributions are very similar, but that the latter has slightly more outliers to the right. Thus, most operations are processed equally quickly in the unloaded case and in the log propagation case, whereas a few operations observe much higher response times in the latter case. Note that for readability, the horizontal axis of the histograms stops at 4 ms, but the tendency of more outliers in the loaded cases is equally clear beyond this limit.

The fact that most response times are equal in the unloaded and loaded cases is further confirmed by comparing their median values: the unloaded histogram in Figure 9.2(a) has a median of 0.709 ms, while the median for the loaded case, shown in Figure 9.2(b), is only 0.7% higher at 0.714 ms. This effect is also seen in the 80% workload histograms in Figures 9.2(c) and 9.2(d), in which the medians are 0.750 ms versus 0.770 ms (2.7% higher).

It is interesting to compare these median values to the respective average response times. In the 50% workload case, the mean for the unloaded histogram is 0.772 ms whereas the mean of the log propagation histogram is 0.778 ms. In the 80% workload case, the means have increased to 0.868 ms and 1.102 ms. This corresponds to 0.8% and 27% higher means in the respective workloads. Hence, the means are highly affected by the response time outliers whereas the medians are affected to a much lesser extent. This is because there are relatively few outliers, but these have very high response times, thus affecting the means.

              Median                         Mean
  Workload    Unloaded    Loaded     %       Unloaded    Loaded     %
  50%         0.709 ms    0.714 ms   0.7%    0.772 ms    0.778 ms   0.8%
  80%         0.750 ms    0.770 ms   2.7%    0.868 ms    1.102 ms   27%

Table 9.6: Summary of the response time distribution in the histograms of Figure 9.2.
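The robustness of the median, and the sensitivity of the mean, to a few extreme outliers can be illustrated with a small, self-contained example. The response times below are invented for illustration; they are not measurements from the prototype.

```java
import java.util.Arrays;

public class MedianVsMean {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    static double median(double[] xs) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2]
                          : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        // 99 "fast" operations and a single 50 ms outlier.
        double[] responseTimes = new double[100];
        Arrays.fill(responseTimes, 0.75);
        responseTimes[99] = 50.0;
        System.out.println(median(responseTimes)); // still 0.75
        System.out.println(mean(responseTimes));   // pulled up to ~1.24
    }
}
```

One outlier in a hundred leaves the median untouched but shifts the mean by roughly 65%, mirroring the behavior of the log propagation histograms.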
The increased number of high response time observations in the log propagation cases compared to the unloaded ones is caused by the DT creation thread. Since threads are given access to the CPU in time slices, and since the DT creation thread has a low priority, most transactional requests arrive at the server while log propagation is inactive. These requests observe the same response times as in the unloaded case. However, when the DT creation thread is active, all transaction requests must be scheduled on only one CPU. These requests form much longer resource queues, and thus observe higher response times.

Comparing the histograms with 50% workload to those with 80% workload reveals that the response times increase with the workload. Furthermore, the effect of performing the DT creation increases significantly, as indicated by the much flatter histogram in the lower right figure. For the unloaded case, the higher response time is caused by a higher request rate, which in turn increases average queue lengths at the server (Highleyman, 1989). A higher workload also produces more log records per second. Hence, in the loaded case, the priority of the DT creation thread must be increased to be able to propagate these additional log records within the same time interval. Increasing the priority means providing the thread with more CPU time, which in turn increases the probability that the DT creation thread is active when a transactional request arrives at the server.

Most of the long response times observed in the unloaded cases are caused by garbage collection. To determine the impact of garbage collection, three alternatives have been tested. The default algorithm resulted in a very high standard deviation of the response times compared to the incremental algorithm used in all experiments in this chapter. Both these algorithms were described in Section 9.1. Experiments with no garbage collection have also been conducted.
This option resulted in fewer high response time observations, but did not remove them completely. The latter garbage collection alternative is not used in the experiments because the memory quickly becomes full. We assume that the few remaining high response time observations are caused by processes running on the server node that are not under our control.

Response Time

An important aspect of the performance evaluation is how the response time and throughput are affected by varying workloads. Consider Figure 9.3, which shows response time means and 90% quartiles for the unloaded and log propagation cases. The first plot, Figure 9.3(a), shows the response time in the unloaded case of Transaction Mix 1, Scenario 1. The plot shows that the lower quartiles are stable at around 0.54 to 0.55 ms whereas the upper quartiles increase rapidly with the workload. This is the same effect that was seen in the histograms in Figure 9.2; as the workload increases, the mean queue lengths at the NbDBMS server increase. This is not surprising, since Equation 9.2 determines that the response time should increase towards infinity as the load over the bottleneck resource approaches 100%.

Figure 9.3(b) shows the same experiment as in the left plot, but with the response times from both the unloaded and log propagation cases. It is evident from the plot that the response time penalty of performing log propagation increases rapidly as the workload exceeds 75-80%. E.g., the mean response times during log propagation are 20%, 84% and 200% higher
in the 80%, 90% and 100% workload cases, respectively.

[Figure 9.3: Response times for varying workloads before and during log propagation of difference and intersection DT creation using Transaction Mix 1. (a) Scenario 1 - response time mean and 90% quartiles for the unloaded case, i.e. before DT creation is started. (b) Scenario 1 - response time mean and 90% quartiles for the unloaded and log propagation cases. (c) Scenario 2 - response time mean and 90% quartiles for the unloaded and log propagation cases. (d) Scenarios 1 and 2 - response time means for the unloaded and log propagation cases.]

The lower plots, shown in Figure 9.3(c), show the response time means when scenario 2 of Transaction Mix 1 is used. It is clear that this scenario impacts the response time to a much lesser extent than scenario 1. The reason for this is simply that the priority of the DT creation thread can be kept lower, since fewer log records need to be propagated.

The impact of log propagation on response time applies to the upper quartiles in particular. This was also clear from the histograms in Figure 9.2, and indicates that an increasing number of operations have to wait in a queue for long periods of time. Recall from Section 9.3 that the capacity of the Non-blocking DBMS is defined as the workload, measured in transactions per second, at which less than 5% of all transactions observe operation response times higher than 10 ms. Further, the response times increase very rapidly as the workload increases up to and beyond this capacity. It is obvious that log propagation adds to the workload of the Non-blocking DBMS.
Since the transaction arrival rate is not reduced when log propagation starts, it is not surprising that the response time averages during log propagation increase quickly. As opposed to the upper quartile, the lower quartile is relatively stable at approximately 0.55 to 0.56 ms for all workloads.

Throughput

In addition to response time, throughput represents an important performance metric for database systems. Recall from Section 9.3 that all transactions with operation response times higher than 10 ms are considered failed. Consider Figure 9.4(a), showing the throughput of the unloaded and log propagation cases for scenario 1 of Transaction Mix 1. It is clear from the plot that very few transactions fail at low workloads. As the workload in the log propagation case increases beyond 70%, however, more and more operations are not processed in time. This is consistent with what the 99% upper quartile plot in Figure 9.1(b) and the response time plots in Figure 9.3 indicated: the upper quartile of the response time increases rapidly with the workload. Hence, at low workloads, even most of the "long" response times are lower than the acceptable 10 ms. At approximately 70%, the longest response times start to go beyond this. At even higher workloads, the ever increasing amount of too long response times effectively thrashes the throughput. Furthermore, the number of failed transactions increases very rapidly when 70% workload is reached. Again, considering the rapidly increasing response times in Figure 9.3, this rapid thrashing comes as no surprise.
[Figure 9.4: Throughput for varying workloads during log propagation of difference and intersection DT creation. (a) Scenario 1 - throughput for difference and intersection using Transaction Mix 1. (b) Scenario 2 - throughput for difference and intersection using Transaction Mix 1.]

Considering the unloaded case of the same plot, it is clear that the throughput is only slightly reduced from approximately 80% workload and higher. The reason why the reduction is kept relatively low is the way we have defined 100% workload. Recall from Section 9.3 that 100% workload is defined as the point where 95% of all transactions observe acceptable response times. Hence, by definition, all throughput plots should have a 5% reduction relative to the unloaded throughput at 100% workload.

Figure 9.4(b) shows the throughput of scenario 2 of the same transaction mix. As can be seen in Table 9.2, this scenario has more read operations than scenario 1. As previously discussed, this means that the DT creation thread needs less CPU time to propagate the log records generated per second. As shown in Figure 9.3(d), this results in less response time degradation for concurrent transactions. It should be clear from the above discussion why a lower response time degradation also results in a lower throughput degradation.
[Figure 9.5: Response times and throughput for varying workloads during vertical merge DT creation. (a) Response time mean and 90% quartiles for vertical merge DT creation, Transaction Mix 1, Scenario 1. (b) Response times for three variations of table sizes in source table 2, Transaction Mix 1, Scenario 1.]

9.3.2 Log Propagation - Vertical Merge

The vertical merge experiments are performed using scenarios 1 and 2 of Transaction Mix 1, shown in Table 9.2. This is the same mix as was used in the diff/int experiments. Consider Figure 9.5(a), showing the mean response time for scenario 1. The graph shows the same rapid response time tendency as the diff/int experiments did. The response time distributions also have similar shapes to those shown in the histograms in Figure 9.2. Hence, most requests are answered quickly, and the number of requests with very long response times increases with the workload. Although histograms are not shown here, this distribution is indicated by the mean response times in Figure 9.5(a) being much closer to the lower 90% quartile than the upper.

The left plot in Figure 9.5 shows the response time results for scenario 1 with 20,000 records in both source tables. We consider it likely that vertical merge will in many cases be performed on tables with an uneven number of records, however. For example, if the "employee" and "postal address" source tables are merged (this has been used as an example of vertical merge throughout the thesis; refer to Section 6.5 for illustrations), it is likely that at least some employees share zip codes.
[Figure 9.6: Comparison of response times and throughput for scenario 1 of difference and intersection and vertical merge. (a) Comparison of response times. (b) Comparison of throughput.]

Hence, Figure 9.5(b) illustrates how variations in the number of source table records affect response time degradation. The red line, called 1x, is the plot for 20,000 records in both source tables. This is the same plot as shown in the left figure. The blue 5x line shows test results with 4,000 records in source table 2, while the green 15x line is for 1,300 records. It is worth noticing how a reduction in the number of source records incurs higher degradation. The reason for this is that the records in source table 1 of this experiment always have a join match in source table 2. Hence, as the number of records in source table 2 decreases, the number of join matches for each of them increases. This means that a modification to a record in source table 2 must be propagated to an average of 1, 5 and 15 records in the three cases, respectively. This increases the work that must be done by log propagation, which in turn results in an increased priority for the DT creation thread. As previously discussed, a higher priority on the DT creation thread incurs higher performance degradation for concurrent transactions.

As is clear from the above discussion, the response time of vertical merge shows a similar behavior to that of diff/int. This does not mean that the performance results are equal, however.
In Figure 9.6, the results from the diff/int experiments are shown together with those from vertical merge with 20,000 records in both source tables. The plots clearly show that the former experiment degrades performance to a much greater extent than vertical merge. To understand why, we have to investigate the amount of work performed by the two log propagators. In the vertical merge case, each source record modification is applied to one derived record on average. Even though more than one record may be affected by modifications in source table 2, affected records are always found by exactly one lookup in one DT. This is not the case for the diff/int method, in which all source record modifications involve lookups of records in two or even three derived tables. Furthermore, source record modifications may require a derived record to move from the intersection DT to the difference DT or vice versa. Since the diff/int log propagator has to perform more work for each log record, and since the same number of log records are generated in the two experiments, the priority of the diff/int DT creation thread must be higher than that of vertical merge. Hence the higher diff/int degradation of both response time and throughput.

9.3.3 Low Performance Degradation or Short Execution Time?

All performance experiments described for log propagation have been performed with the lowest achievable performance degradation in mind. As described in Section 9.3, we define this as the degradation incurred when the log propagator is only capable of applying as many log records as are produced. However, with this DT creation thread priority, the states of the DTs do not get closer to those of the source tables. Thus, log propagation will never finish. If the priority of the DT creation thread is increased from this point, it gets more CPU time.
Hence, it is capable of reducing the number of log records that separate the states of the DTs and the source tables. At the same time, however, the performance degradation is increased.

Figure 9.7 illustrates the effect of changing the priority in diff/int DT creation, running 30 iterations with Transaction Mix 1 scenario 1 at 50% workload with 50,000 records in each source table. Starting at the leftmost side of the plot, where the DT creation is run at maximum priority, it is clear that a slight decrease in priority results in much less degradation of response time at the cost of little additional execution time. As the priority gets lower, however, less and less reduction in degradation is observed. It is up to the database administrator (DBA) to decide whether short execution time or low performance degradation is more important. Hence, we will not discuss this further.

[Figure 9.7: Total time for log propagation with varying DT creation thread priorities. Diff/int DT creation method, running Transaction Mix 1 scenario 1 with 50% workload.]

9.3.4 Other Steps of DT Creation

So far, only performance degradation during the log propagation step has been discussed. In this section, the impact of the other steps is discussed. Unless otherwise noted, difference and intersection DT creation experiments, running Transaction Mix 1 scenario 1, are discussed. However, the discussion applies to all the DT creation methods.

Preparation and Initial Population

During the preparation step, derived tables and indices are added to the schema. This only involves modifications to the database schema; no records or locks are involved. The performance implication of this step is negligible since it completes very fast.
An inspection of 100 performance report files from all DT creation methods (randomly picked from the 2,000 report files from the previous section) showed that the longest execution time of this step was 36 ms whereas the shortest was 17 ms.

The performance impact of initial population is a completely different matter, even though the priority of the DT creation thread can easily be lowered to the point where no performance degradation is observed at all. The problem with such low priorities is that the step takes a long time to complete. Since the log propagation step executed next has to apply the log records generated during initial population, the execution time of this step highly affects the total DT creation execution time. Initial population has no "minimal" priority similar to what we used for log propagation, i.e., the priority at which the same number of log records are propagated and produced within a time interval. Hence, we consider two alternatives to the very low priority described above. The first is to use the same priority as the log propagation step. To get an indication of the performance implications of this alternative, 30 test iterations have been performed with workloads of 50, 70 and 90% during diff/int.

Workload             50 %       70 %       90 %
Initial Population   0.758 ms   0.764 ms   0.865 ms
Log Propagation      0.750 ms   0.778 ms   0.872 ms
% Difference         1.1 %      -1.8 %     -0.8 %

Table 9.7: Average response times during the initial population and log propagation steps of vertical merge when both steps use the same priority.

Priority   Max response time degradation   Aborted transactions   DT creation time
1/5        0.5 %                           0.8 %                  ∞
2/5        0.8 %                           1.0 %                  22.4 s
3/5        3.8 %                           2.8 %                  14.9 s
4/5        7.1 %                           15.4 %                 10.4 s
Max        15.0 %                          29.1 %                 7.3 s
Blocking   100 %                           -                      3.6 s

Table 9.8: Effects on performance for different priorities of the DT creation thread during the initial population and log propagation steps.
Not surprisingly, the tests show that the two steps degrade performance to the same extent when the priorities are equal. The results are shown in Table 9.7.

The second alternative is to use a very high priority for the initial population step. Intuitively, this results in higher performance degradation for a shorter time interval. The extreme version of this is the insert into select method used in many existing DBMSs (Løland, 2003): it involves read-locking the source tables for the duration of the entire initial population step. Log propagation and synchronization are not needed in this case. Table 9.8 shows the results from the same 30 test iterations described in Section 9.3.3 and illustrated in Figure 9.7. The priority of DT creation has been varied, but initial population and log propagation have had the same priority in all cases. The table clearly illustrates how the performance degradation decreases as the DT creation time increases. The "blocking" line represents the insert into select method. As argued in Section 9.3.3, the choice of which priority is "best" is left to the DBA.

The performance degradation during the synchronization step is very small when the DTs are not used for schema transformations. 30 test runs have been performed on diff/int at 50 and 80% workload. Synchronization was started automatically when 10 or fewer log records remained to be redone. The experiments showed that the source table latches were held for 1-2.5 ms while these log records were applied to the DTs. The test report files indicate a slightly higher average of failed transactions, due to unacceptably high response times, in the second immediately following synchronization: 2.4% and 1.7% for the 50% and 80% workload cases, respectively. However, these results vary considerably between the report files due to the short time interval and hence few available response time samples.
The expected number of failed transactions for this step relies heavily on the number of log records remaining when the source table latches are set, and on the response time considered acceptable. Intuitively, if the acceptable response time is much greater than the latch time, few transactions will fail.

9.3.5 Performance Experiment Summary

In the previous sections, we have discussed the results from extensive testing to find the performance degradation incurred by the DT creation methods. Log propagation has been discussed in most detail since this step typically runs for much longer time intervals than the other steps. In the described experiments, ∼75% of the DT creation time was used by log propagation, ∼25% by initial population and ∼1% by preparation and synchronization combined.

The experiments have shown that the incurred performance degradation relies heavily on the workload. At low to medium (∼70%) workloads, DT creations running scenario 1 of the transaction mixes can be performed with almost no degradation for concurrent transactions. If the workload is increased beyond this point, the performance quickly becomes unbearable: response times increase rapidly, eventually resulting in throughput thrashing.

In addition to workload, DT creation thread priority also affects performance degradation to a large extent. While a low priority DT creation thread results in little performance degradation, it also incurs long execution time. Any increase in priority will decrease the time to complete but also increase the performance degradation.

Figure 9.8: Summary of average response time for all DT creation methods.

The type of work performed by the transactions running on the server also affects the degradation.
If the transactions perform few write operations on records in the source tables, the degradation is smaller than if many write operations are performed. Alternatively, the degradation may be held constant while the execution time is varied. Finally, the different DT creation methods incur different degradations. The reason for this is that they must perform different amounts of work to propagate each log record. For example, log propagation of an update operation incurs an update of only one record in a DT when horizontal split is used. In difference and intersection, however, the same update requires a lookup for equal records in one DT, and an update of one or even two records in the other DTs. As shown in Figure 9.8, difference and intersection incurs the most degradation while horizontal split incurs the least. Vertical merge, vertical split and horizontal merge lie between these.

9.4 Discussion

In this chapter, we have discussed empirical validation and performance experiments performed on the Non-blocking DBMS prototype. The empirical validation experiments gave predictable and correct output, which strongly indicates that the DT creation methods work correctly. The performance experiments showed that close to 100% of the total DT creation time was used by the log propagation and initial population steps. Under moderate workloads, DT creation can be performed with almost no performance degradation for concurrent transactions. However, this requires a very low priority on the DT creation process, which in turn increases the total execution time significantly. A consequence of this is that the DBA has to decide whether to perform the DT creation quickly with much degradation, slowly with little degradation, or something in between.

When to use the DT Creation Method

It is clear that the execution time decreases and the performance degradation increases as the priority of the DT creation process is increased.
At extremely high priorities, the DT creation method behaves almost like the insert into select method used in current DBMSs (Løland, 2003). When fast completion time is more important than low performance degradation, the existing insert into select method or Ronström's method should be used. Since performance experiments have not been published for Ronström's method, it is uncertain which of these would be preferred under which circumstances. In cases where it is advisable to trade longer execution time for lower performance degradation, our DT creation method should be used instead. Using our method provides flexibility since the priority may be increased or decreased as the DBA sees fit. Note, however, that the insert into select method allows combinations of relational operators, including aggregates. These combinations are not yet supported in our DT creation method.

We expect our method to outperform Ronström's method when it comes to performance degradation. The reason for this is that Ronström's method forwards source record modifications by using triggers executed within the original transaction (Ronström, 1998). A similar use of triggers is explicitly discouraged for MV maintenance (Colby et al., 1996; Kawaguchi et al., 1997). If disk space is a major issue, however, Ronström's method may still be preferred for vertical merge and split schema transformations. The reason for this is that Ronström performs vertical merge transformations by adding attributes to an existing table, and vertical split by adding only one of the new tables. In contrast, our method makes full copies of the source tables in both cases.

Chapter 10

Discussion

This chapter contains a discussion of the work presented in this thesis. We start by discussing our research contributions with respect to how they meet the requirements stated in Chapter 1, and how they compare to related work. We then briefly summarize the research question, and discuss to what extent it has been answered.
10.1 Contributions

The work presented in this thesis is based on the argument that database operations which are blocking or incur high degrees of performance degradation are not suited for database systems with high availability requirements. With this in mind, we decided to focus on a solution for two operations: database schema transformation and materialized view creation. The current solutions for both these operations degrade the performance of concurrent transactions significantly. In this section, the contributions of our DT creation methods are discussed. To summarize, the main contributions of the thesis are:

• A framework based on existing DBMS technology that can be used to create DTs without blocking effects and with little performance degradation to concurrent transactions.
• Methods to create DTs using six relational operators.
• Strategies for how to use the DTs for materialized view creation and schema transformation purposes.
• Solutions to common DT creation problems, which significantly simplify the design of other DT creation methods.
• Empirical validation of all presented DT creation methods.
• Thorough performance experiments on all DT creation methods.

In particular, we consider how the solution meets the requirements stated in Chapter 1, and how it compares to related work.

10.1.1 A General DT Creation Framework

The General DT Creation Framework presented in Chapter 4 is an abstract framework. It is based on the idea of running DT creation as a non-blocking, low priority background process to incur minimal performance degradation for concurrent transactions. Although we have focused on centralized DBMSs in this thesis, we are confident that the framework can be used to create DTs in distributed database systems as well. In particular, the framework should easily integrate into distributed DBMSs where recoverability is achieved by sending logical log records to other nodes.
In the ClustRa DBMS, for example, an ARIES-like recovery strategy is enforced by shipping logical log records between nodes (Hvasshovd et al., 1995). Furthermore, ClustRa uses logical record identifiers and logical record state identifiers (Hvasshovd et al., 1995; Bratsberg et al., 1997b). Hence, this solution meets all the technological requirements that the open source DBMSs evaluated in Chapter 7 did not meet. This application of the framework is purely theoretical, however.

As described in Chapter 6, the framework can be used when creating DTs using six relational operators: full outer join, projection, union (both duplicate inclusion and removal), selection, difference and intersection. However, the framework is expressive enough to be useful for DT creation using other relational operators as well. Jonasson uses the framework for DT creation involving aggregates (Jonasson, 2006). An auxiliary table is used to compute the aggregate values. The solution has not been implemented, however, and the performance implications are therefore uncertain.

10.1.2 DT Creation for Many Relational Operators

As discussed in Chapter 1, Materialized Views and schema transformations are defined by a query, and are therefore created using relational operators. Relational operators can be categorized in two groups: non-aggregate and aggregate operators. The non-aggregate operators are join, projection, union, selection, difference and intersection. Aggregate operators are mathematical functions that apply to collections of records (Elmasri and Navathe, 2004). In Chapter 1, we decided to use non-aggregate operators as a basis for DT creation, allowing us to use optimized algorithms already available in current DBMSs (Houston and Newton, 2002; IBM Information Center, 2006; Lorentz and Gregoire, 2002; Microsoft TechNet, 2006).
We chose to focus on non-aggregate operators because these are useful for both schema transformations and materialized views. An alternative would be to focus on aggregate operators. These are frequently used in materialized views, but are not often used in schema transformations. Furthermore, materialized views defined over aggregate functions often include non-aggregate operators as well (Alur et al., 2002).

As described in Chapter 6, our DT creation method can be used to create DTs using the full outer join, projection, union, selection, difference and intersection relational operators. This means that the method can be used to create a broad range of DTs. Only the first four operators can be used in Ronström's method (Ronström, 2000).

10.1.3 Support for both Schema Transformations and Materialized Views

In Chapter 1, we realized that both schema transformations and materialized view (MV) creation were blocking operations that could be seen as applications of derived tables. Hence, we decided to focus both on how DTs should be created to be usable in highly available database systems, and on how these DTs could be used for the two operations. In this thesis, we have shown how the DT creation method can be used for both operations. The solution for MV creation proved to be a straightforward application of DTs; they can be used for this purpose without modification. The solution for schema transformations is more complex, especially for operators where a record in one schema version may contribute to or be derived from multiple records in the other schema version. As argued in Chapter 6, this may cause high degrees of locking conflicts between concurrent transactions in the different schema versions. If this becomes a big problem, we know of no other alternative than to use the non-blocking abort synchronization strategy, thus resolving the problem by aborting transactions in the old schema version.
(This applies to all DT creation methods except horizontal merge with duplicate inclusion, horizontal split, and difference and intersection.)

In contrast to our method and the "insert into select" method, the method suggested by Ronström can only be used for schema transformations (Ronström, 2000).

10.1.4 Solutions to Common DT Creation Problems

In Chapter 5, we presented five problems that are frequently encountered in the DT creation methods: Missing Record Identification, Missing State Identification, Missing Record Pre-States, Lock Forwarding During Transformations and Inconsistent Source Records. We also described how these problems could be solved in general. The solutions to these five problems are contributions by themselves, as they significantly ease the design of new DT creation methods. For example, a method for DT creation using aggregate operators described by Jonasson uses the suggested solutions for the missing record and state identification problems and the missing pre-state problem (Jonasson, 2006).

10.1.5 Implemented and Empirically Validated

The DT creation methods for the six relational operators have been implemented in a prototype DBMS. In Chapter 9, we discussed thorough empirical validation experiments that were executed on all the methods in the prototype. All the experiment results showed correct execution, thus strongly indicating that the methods are correct. No matter how strong the indications are, empirical validation experiments can never be used as a proof of correctness with absolute certainty (Tichy, 1998). Even so, empirical validation is considered vital in the software engineering community (Tichy, 1998; Pfleeger, 1999; Zelkowitz and Wallace, 1998). If confirmation of the results is required, empirical validation should be executed on another implementation of the same methods (Walker et al., 2003). Due to time considerations, this has not been done.
10.1.6 Low Degree of Performance Degradation

The rationale for the research question was to develop DT creation methods that can be used in highly available database systems. Hence, a crucial goal was to incur as little performance degradation to concurrent transactions as possible. The DT creation methods have been implemented in a prototype DBMS, and the performance implications were thoroughly discussed in Chapter 9. The experiments showed that the degree of performance degradation depends heavily on four parameters: the workload, the transaction mix, the priority of the DT creation thread compared to other threads, and the relational operator used to create the DTs.

Workload

The incurred degradation is highly affected by the workload on the database server. The performance experiments with 2,000 test runs on the six DT creation methods showed that the response time increases rapidly as the workload increases. This comes as no surprise, since the delay over a bottleneck resource is given by (Highleyman, 1989):

    Delay = T / (1 - L)    (10.1)

Here, T is the average service time over the bottleneck resource, and L is the load. Hence, as the workload approaches 100%, the response time increases towards infinity. Also, since 10 ms was defined as the highest acceptable response time for transactional requests, we observe throughput thrashing when the response time gets too high. Based on these observations, we strongly suggest that DT creation is performed when the workload is moderate or lower. In the experiments, the rapid increase in response time started at approximately 75% workload. Different systems may observe this rapid increase in response time at other workloads, depending on which resource is the bottleneck.

Priority of the DT creation thread

The priority of the DT creation thread affects both the performance degradation and the execution time to a great extent.
As was clear from Section 9.3.3, a high priority results in quick execution with high degradation. On the other hand, a low priority results in low degradation over a longer time interval. We consider it the responsibility of the DBA to determine which priority setting to use.

Transaction Mix

The transaction mix, i.e., the type of work performed by the transactions running on the server, plays a significant role for the amount of performance degradation. The reason for this is that only write operations on records in the source tables must be propagated to the DTs. Hence, a transaction mix that is read intensive (or write intensive on records in non-source tables) produces fewer relevant log records, i.e., log records that must be propagated to the DTs, than a transaction mix that is write intensive on source table records. For equal workloads, the log propagator has less work to do per time unit if few relevant log records are produced than if many are produced. Hence, DT creation during the former transaction mix can either be processed quicker or with lower performance degradation than DT creation during the latter transaction mix.

Relational Operator

The final identified parameter that affects performance to a great extent is the relational operator used to create the DTs. The reason for the variations is that different amounts of work must be performed by the different operators when a logged operation is applied to the DTs. In Section 9.3, we showed that DT creation using difference and intersection (diff/int) incurs the most degradation, while horizontal split incurs the least. Hence, under equal workloads, horizontal split can either be performed quicker or with less performance degradation than diff/int.

A Comparison with Related Work

Basing the framework on a non-blocking, low priority background process is significantly different from the two alternative strategies for materialized view creation and schema transformations.
In the schema transformation method presented by Ronström, a background process is used for copying records from the old to the new schema version (Ronström, 2000). Triggers executed within normal transactions are then used to keep the copied records up to date with the original records. As discussed in Section 3.1.3, these triggers impose much degradation. A similar trigger strategy has previously been suggested for maintenance of MVs, where it is called Immediate Update (Gupta et al., 1993; Griffin et al., 1997), but is explicitly discouraged due to the high performance cost (Colby et al., 1996; Kawaguchi et al., 1997). Although performance experiments have not been published on Ronström's schema transformation method, we expect our DT creation method to incur less degradation. The reason for this is that our framework forwards modifications to the DTs using a low priority background process, as opposed to Ronström's method of using triggers executed within each user transaction (Ronström, 2000). On the other hand, Ronström's method is likely to complete in shorter time than our method.

An even more drastic solution is currently used for DT creation in existing DBMSs (Løland, 2003). In the insert into select method, the source tables are locked and read before the content is inserted into the DTs. This method is simple, but all write operations on records in the source tables are blocked during the process.

The insert into select method and Ronström's method are better than our DT creation method only in cases where fast completion time is much more important than low performance degradation. With our DT creation method, however, longer execution time can be traded for lower performance degradation. This can be done to a small or large extent to fit different scenarios. Hence, in all cases where performance degradation has a high priority, our method outperforms both.
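The contrast between the two propagation strategies can be illustrated with a deliberately simplified sketch. This is a toy model only, not code from either system: `TriggerPropagator` mimics trigger-based immediate update, where DT maintenance work is charged to the user transaction itself, while `LogPropagator` mimics our approach, where the transaction merely appends a log record and a low-priority background thread applies it later. All class and parameter names are invented for illustration.

```python
import queue
import threading
import time

class TriggerPropagator:
    """Immediate update: DT maintenance runs inside the user transaction."""
    def __init__(self, dt):
        self.dt = dt

    def on_write(self, key, value):
        self.dt[key] = value  # maintenance cost is paid by the transaction

class LogPropagator:
    """Deferred update: the transaction only appends a log record;
    a background thread applies it to the DT later."""
    def __init__(self, dt, delay=0.0):
        self.dt = dt
        self.log = queue.Queue()
        self.delay = delay  # crudely models a low thread priority
        self.worker = threading.Thread(target=self._apply, daemon=True)
        self.worker.start()

    def on_write(self, key, value):
        self.log.put((key, value))  # cheap: no DT work in the transaction

    def _apply(self):
        while True:
            key, value = self.log.get()
            time.sleep(self.delay)  # yield CPU to concurrent transactions
            self.dt[key] = value
            self.log.task_done()
```

Both strategies converge to the same DT state; the difference lies in who pays for the propagation work, which is exactly why the deferred variant degrades concurrent transactions less while taking longer to finish.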
10.1.7 Based on Existing DBMS Functionality

Already in the initial requirement specification, it was clear that the solution should be based on functionality found in existing DBMSs whenever possible. By using existing functionality, the method should be easy to integrate into existing systems. Hence, literature on DBMS internals and related work has been studied carefully. The most relevant parts of this study were presented in Chapters 2 and 3.

Our DT creation method uses standard DBMS functionality to a great extent. For example, the widely accepted ARIES protocol (Mohan et al., 1992) is used for recovery, Log Sequence Numbers (Elmasri and Navathe, 2004) are used to achieve idempotence, and algorithms for relational operators (Garcia-Molina et al., 2002) available in all modern relational DBMSs are used for initial population of the DTs. On the other hand, our solution also requires functionality that is thoroughly discussed in the literature but is not common in existing DBMSs. Most importantly, this includes logical redo logging, record state identifiers (Hvasshovd, 1999) and logical record identification (Gray and Reuter, 1993). These principles were described in Chapter 2, and are required because the records are physically reorganized when relational operators are applied. As was evident from Chapter 7, the use of nonstandard functionality makes integration into existing DBMSs more complex than if this was not the case. In Section 2.4, it was argued that the logical record identifiers can be replaced by physical identifiers if a mapping between the source and derived addresses is maintained. We have not found any solution that removes the logical redo log and record state identification requirements.

Hence, with few exceptions, the method is based on functionality common in current DBMSs. A description of existing functionality can be found in Chapter 2.
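How record state identifiers make redo idempotent can be sketched in a few lines. This is a minimal illustration of the principle, assuming an LSN-style ordering of log records; the dictionary layout and names are invented for the example and do not reflect the prototype's actual data structures.

```python
def redo(record, log):
    """Apply logged operations to a record copy.

    An operation is redone only if its LSN is newer than the state
    identifier stored with the record, so replaying the same log a
    second time changes nothing -- redo is idempotent.
    """
    for lsn, value in log:               # log entries ordered by LSN
        if lsn > record["state_lsn"]:    # already applied? then skip
            record["value"] = value
            record["state_lsn"] = lsn
    return record
```

For example, applying a log of two updates twice yields the same record state as applying it once, which is what allows a fuzzy, inconsistent copy to be brought up to date safely even if some operations were already reflected in it.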
10.1.8 Other Considerations - Total Amount of Data

The DT creation framework copies records from source tables to derived tables. Thus, storage space is required for two copies of all records in the source tables during DT creation. When the DTs are used as materialized views, the additional data will persist after DT creation has completed, and is therefore not considered a waste of storage space. When used for schema transformations, on the other hand, the source tables are removed once the transformation is complete. Hence, the additional storage space required may be considered wasted. Since the source tables may contain huge amounts of data, the added storage usage in schema transformations may be problematic.

This is also a problem in the two alternative solutions for schema transformations: the insert into select method (Løland, 2003; MySQL AB, 2006) and Ronström's schema transformations (Ronström, 2000). In the former method, the source tables are locked while the records are read, transformed by applying the relational operator, and inserted into the new tables. Thus, this method requires the same amount of storage space as our method. As thoroughly described in Section 3.1, Ronström's method requires less storage space during vertical merge and split transformations. The reason is that these transformations work "in-place", i.e., attributes are added to or removed from already existing records. Horizontal merge and split require the same amount of storage space as our method, on the other hand.

We have no solution to this problem other than increasing the storage capacity of the database server if required. This was also suggested by Ronström to solve the same problem (Ronström, 1998).
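The storage comparison above can be summarized in a small back-of-the-envelope model. This is purely illustrative: the function name and the one-half fraction for Ronström's vertical split (where only one of the two new tables is materialized) are assumptions made for the sketch, not measured figures.

```python
def extra_storage(method, transformation, source_bytes):
    """Rough temporary extra storage during a schema transformation.

    Our DT creation method (like insert into select) always builds a
    full second copy of the source data. Ronström's vertical merge
    works in-place, and his vertical split materializes only one of
    the new tables (the 1/2 fraction here is illustrative).
    """
    if method == "ronstrom":
        if transformation == "vertical_merge":
            return 0                   # attributes added to existing records
        if transformation == "vertical_split":
            return source_bytes // 2   # only one new table is created
    return source_bytes                # full copy of the source tables
```

For a 1 GB source table, the model gives 1 GB of temporary overhead for our method regardless of operator, versus none for Ronström's in-place vertical merge, matching the qualitative comparison above.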
10.2 Answering the Research Question

In this section, we discuss how and to what extent our research has answered the research question:

How can we create derived tables and use these for schema transformation and materialized view creation purposes while incurring minimal performance degradation to transactions operating concurrently on the involved source tables?

In Chapter 1, we decided to refine the research question into four key challenges. In the following sections, the results of the research are discussed with respect to these challenges and the research question.

Q1: Current Situation

What is the current status of related research designed to address the main research question or part of it?

The current status of related research was presented in a survey in Chapter 3. From this review, we have identified the main limitations of existing solutions. The limitations are mainly associated with unacceptable performance degradation.

Q2: System Requirements

What DBMS functionality is required for non-blocking DT creation to work?

Our DT creation method is inspired by Fuzzy Copying (Hagmann, 1986; Gray and Reuter, 1993; Bratsberg et al., 1997a), and is based on making an inconsistent copy of the involved tables. The copies are then made consistent by applying logged operations. The requirements of this strategy are described in Chapters 2 and 4. Most of these are related to the reorganized structure of records after applying relational operators.

Q3: Approach and Solutions

How can derived tables be created with minimal performance degradation, and be used for schema transformation and MV creation purposes?

• How can we create derived tables using the chosen six relational operators?
• What is required for the DTs to be used a) as materialized views? b) for schema transformations?
• To what extent can the solution be based on standard DBMS functionality and thereby be easily integrable into existing DBMSs?
Our solution to creating derived tables was presented in Chapters 4 and 6. The method enables DT creation using the six relational operators, and the DTs can be used for both MVs and schema transformations. Thus, we have answered the two former parts of the question. The method is based on standard, existing functionality whenever we have found it possible to do so. However, we also require some functionality that is not commonly used in current DBMSs. The implications of this were discussed in detail in Section 10.1.7.

Q4: Performance

Is the performance of the solution satisfactory?

• How much does the proposed solution degrade performance for user transactions operating concurrently?
• With the inevitable performance degradation in mind, under which circumstances is the proposed solution better than a) other solutions? b) performing the schema transformation or MV creation in the traditional, blocking way?

The performance implications of executing the DT creation method were thoroughly discussed in Chapter 9. We found that the method incurs little performance degradation when the workload is not too high. However, there are circumstances under which DT creation incurs high performance degradation. Hence, the database administrator has to consider three parameters before starting the operation: the workload on the database server, the DT creation priority and the operator used for DT creation. This was discussed in detail in Sections 9.3 and 10.1.6.

10.2.1 Summary

We have answered the main research question by developing the Non-blocking DT Creation method.
To summarize, the work included: deciding on a research approach based on the design paradigm (Denning et al., 1989), a thorough study of related work and usable functionality in existing DBMSs, development of a general DT creation framework, identification of and solutions to common DT creation problems, specialized methods for the six relational operators, and a prototype design and implementation used in experiments. The main research question and all the refined research questions have been answered.

Chapter 11

Conclusion and Future Work

This chapter summarizes the main contributions and suggests several directions for future research. Finally, publications resulting from the research are briefly described.

11.1 Research Contributions

The major research contributions of this thesis are:

An easily extendable framework for derived table creation. A framework that can be used in the general case to create derived tables (DTs) is presented. It is designed to degrade the performance of concurrent transactions as little as possible. In this thesis, the framework is used by six relational operators in a centralized database system setting. It is, however, extendable in multiple ways. Examples include adding aggregate operators or performing DT creation in a distributed database system setting.

Methods for creating derived tables using six relational operators. By using the general DT creation framework, we present non-blocking DT creation solutions for six relational operators: vertical merge and split (full outer join and its inverse), horizontal merge and split (union and its inverse), difference and intersection. Together, these methods represent a powerful basis for DT creation.

Means to use the derived tables for schema transformation and materialized view creation purposes. Schema transformations and materialized view (MV) creation are two database operations that must be performed in a blocking way in current DBMSs. By using the DT creation framework to perform these operations, we take advantage of the non-blocking and low performance degradation capabilities.

Design and implementation of a prototype capable of non-blocking derived table creation. The DT creation methods for each of the six relational operators have been implemented in a DBMS prototype. Extensive experiments on this prototype have been used to empirically validate the methods.

Thorough performance experiments for DT creation using all six relational operators. Thorough performance experiments have been performed on the six DT creation methods in the prototype. The experiments show that the performance degradation for concurrent transactions can be made very low. They also indicate under which circumstances DT creation should be avoided. This is primarily during high workload.

11.2 Future Work

The following topics are identified as possible directions for further research.

DT Creation in Distributed Database Systems

The primary focus in this thesis has been DT creation in centralized database systems. However, we believe that the general framework can be used for DT creation in distributed database systems as well. Especially distributed systems based on log shipping, i.e., systems where the transaction log is sent to another node instead of written to disk, seem to be a good starting point for this research.

DT Creation using Aggregate Operators

Some work has already been conducted on DT creation involving aggregate operators (Jonasson, 2006), but the research is far from complete. It would be very interesting to implement these methods in the Non-blocking Database. Experiments should then be executed to empirically validate the methods and to indicate how much performance degradation they incur.

Implementation in a Modern DBMS

Because the DT creation methods require functionality that was not found in any of the DBMSs investigated in Chapter 7, we chose to perform experiments
FUTURE WORK on a prototype DBMS. Ideally, DT creation should not require any nonstandard functionality. For example, we believe that block state identifiers can be allowed in the source tables as long as the derived tables have record state identifiers. Implementing the DT creation functionality in an existing DBMS could provide additional insight. For example, the simplifications made to the prototype would be removed, and the methods could be subject to full-scale benchmarks to get better indications on performance implications. Furthermore, a second implementation could be subject to more empirical validation experiments, and thereby to confirm the correctness of the methods (Walker et al., 2003). Hence, further research to find whether more standard DBMS functionality could be used, and an implementation in an existing DBMS would both be considered interesting to the author. Synchronizing Schema Transformations with Application Modifications When schema transformations have been discussed in this thesis, we have been concerned with incurring low performance degradation and performing the switch between schemas as fast as possible so that latches are only held for a millisecond or two. The transactions executed in the database system are, however, typically requested by applications. When the schema is modified, the applications should also be modified to reflect the new schema. It would be interesting to investigate how changes to an application can be synchronized with schema transformations. As an initial approach, we would start by creating views reflecting the old schema when a schema transformation is committed. By doing so, the application can be changed after the schema transformation has completed. Of course, this strategy requires that all old data is intact in the new schema. Combining Multiple Operators In this thesis, we have designed methods to create derived tables using any of six relational operators. 
What we have not considered, however, is that materialized views and schema transformations may in some cases require multiple operators to achieve the wanted result. A materialized view in a data warehouse may, e.g., be constructed by a join between four tables followed by an aggregate operator. For schema transformation purposes, the effect of multiple operators can often be achieved by performing the required transformations in serial; an example is illustrated in Figure 11.1. This serial execution can not be used for DT creation in general. Hence, it would be interesting to research if and how the DT creation method can be used when multiple operators are involved.

[Figure 11.1: Example of a schema transformation performed in two steps, splitting a Student table into Student, StudentCourse and Course tables.]

Dynamic Priorities for the DT Creation Process

The performance experiments showed that the priority setting of the DT creation process can be used to make the operation complete quickly but with high performance degradation, or complete over a longer time with less degradation. In the current implementation, the priority is set once and for all when DT creation is started. However, we see no reason why this priority should not be dynamic. Figure 11.2 shows an example of what a graphical user interface for dynamic priorities could look like.

[Figure 11.2: Example interface for dynamic priorities for DT creation.]

11.3 Publications

Some of the research presented in this thesis has been presented at several conferences. The papers, in chronological order, are:

1. Jørgen Løland and Svein-Olaf Hvasshovd (2006). Online, non-blocking relational schema changes. In Advances in Database Technology – EDBT 2006, volume 3896 of Lecture Notes in Computer Science, pages 405–422. Springer-Verlag.
This paper describes the first strategy used to perform schema transformations with the vertical merge and split operators. Records in the transformed schema are identified using their primary keys, whereas the method in this thesis identifies records by non-physical Record IDs. Compared to using non-physical Record IDs for identification, the primary key solution of this paper requires more complex transformation mechanisms. On the other hand, the method is not restricted to DBMSs that identify records in one particular way.

2. Jørgen Løland and Svein-Olaf Hvasshovd (2006). Non-blocking materialized view creation and transformation of schemas. In Advances in Databases and Information Systems - Proceedings of ADBIS 2006, volume 4152 of Lecture Notes in Computer Science, pages 96–107. Springer-Verlag.

In this paper, the general framework for derived table creation used in this thesis is presented. DT creation methods for all six relational operators in this thesis are described, and the idea of using DTs for either schema transformations or materialized views is introduced. As opposed to the schema transformations described in "Online, non-blocking relational schema changes", the methods in this paper use Record IDs for identification.

3. Jørgen Løland and Svein-Olaf Hvasshovd (2006). Non-blocking Creation of Derived Tables. In Proceedings of Norsk Informatikkonferanse 2006. Tapir Forlag.

The generalized DT creation problems formalized in Chapter 5 were first described in this paper. By using these generalized problems, the DT creation methods can be described in a more structured way.

Part IV

Appendix

Appendix A

Non-blocking Database: SQL Syntax

The SQL recognized by the Non-blocking Database prototype is by no means the complete SQL language. A small subset of the SQL standard has been selected for implementation, with the goal of providing enough flexibility in testing while being feasible to implement. In the language definitions below, <...> denotes a variable name, [...] means optional, and {...} is used for grouping alternatives separated by |. The following statements are recognized:

create table <tablename>(<colname> <type> [<constraint>], ...);

drop table <tablename>;

delete from <tablename> where <pk_col>=<value>;

insert into <tablename>(<col1>, <col2>,...,<colX>)
  values(<value1>, <value2>,...,<valueX>);

update <tablename> set <col1>=<value1>, <col2>=<value2>...
  where <pk_col>=<value_pk>;

select {<col1>, <col2>...|*} from <tablename>
  [where <col>=<value>] [order by <col>];

select {<col1>, <col2>...|*}
  from (<table1> join <table2> on <ja_col1>=<ja_col2>)
  [where <colX>=<value>] [order by <colY>];

select {<col1>, <col2>...|*} from <table1>
  union
  select {<col1>, <col2>...|*} from <table2>

select {<col1>, <col2>...|*} from <table1>
  {difference|intersection}
  select {<col1>, <col2>...|*} from <table2>

<colX> = the name of attribute X
<valueX> = the value of attribute X
<constraint> = primary key
<ja_colX> = the join attribute column name of table X
<type> = Integer|String|Boolean|autoincrement
<pk_col> = column name of the primary key attribute

Appendix B

Performance Graphs

[Figure B.1: Response times and throughput for varying workloads during horizontal merge DT creation, Transaction Mix 1, Scenario 1 and 2. (a) Response times, unloaded and during log propagation; (b) throughput, unloaded and during log propagation.]
[Figure B.2: Response times for varying workloads during horizontal split DT creation. (a) Response time average and 90% quartile, Scenario 1; the red lines indicate mean response time and confidence intervals without DT creation. (b) Response time average, Scenario 1 and 2.]

[Figure B.3: Throughput for varying workloads during horizontal split DT creation.]

[Figure B.4: Response times for varying workloads during vertical merge DT creation. (a) Response time mean and 90% quartiles, Scenario 1; (b) response time average, Transaction Mix 1, Scenario 1 and 2.]

[Figure B.5: Response times for varying workloads during vertical merge DT creation for three variations of record numbers in the two source tables. The 1x plots are the ones shown in Figure B.4(a). Transaction Mix 1, Scenario 1 and 2.]
[Figure B.6: Throughput for varying workloads during vertical merge DT creation. Transaction Mix 1, Scenario 1 and 2.]

[Figure B.7: Response times for varying workloads during vertical split DT creation. (a) Response time mean and 90% quartiles, Transaction Mix 1, Scenario 1; (b) response time average, Transaction Mix 1, Scenario 1 and 2.]

[Figure B.8: Throughput for varying workloads during vertical split DT creation. Transaction Mix 1, Scenario 1 and 2.]

Glossary

1MLF: Abbreviation for One-to-Many Lock Forwarding. Technique used during non-blocking commit and abort synchronization of schema transformations.

2PC: Common abbreviation for two-phase commit. Commonly used by schedulers in distributed database systems to ensure that transactions either commit or abort on all nodes.

2PL: Common abbreviation for two-phase locking. Transactions are not allowed to acquire new locks once they have released a lock.

Availability: A database system is available when it can be fully accessed by all users that are supposed to have access.

Consistency Checker: A background thread used to find inconsistencies between records during vertical split DT creation.

Database: A collection of related data.
Database Management System: The program used to manage a database.

Database Schema: The description, or model, of a database.

Database Snapshot: The first type of materialized view. In contrast to MVs, snapshots can not be continuously refreshed.

Database System: A database managed by a DBMS.

DBMS: Common abbreviation for Database Management System.

Derived Table: A table containing data gathered from one or more other tables.

DT: Abbreviation for Derived Table.

Fine-granularity locking: Locks that are set on small data items, i.e., records.

Fuzzy Copy: A technique used to make a copy of a table without blocking concurrent operations, including updates, to the same table. Can be based on copying records or blocks of records.

Fuzzy Mark: A special log record used as a place-keeper by DT creation.

High availability: Describes systems that are not allowed to be unavailable for more than a few minutes each year on average.

Horizontal Merge: Derived Table creation operator, corresponding to the union relational operator.

Horizontal Split: Derived Table creation operator, corresponding to the selection relational operator.

Idempotence: Idempotent operations can be redone any number of times and still yield the same result.

Initial Population Step: Second step of the DT creation framework. The derived tables are populated with records read from the source tables without using locks.

Latch: A lock held for a very short time. For example, used to ensure that only one thread writes to a disk block at a time. Also called semaphore.

Log Propagation Step: Third step of the DT creation framework. Log records describing operations on source table records are applied to the records in the derived tables.

Log Sequence Number: See State Identifier.

Logical Log: A transaction log containing the operations performed on the data objects.

LSN: Common abbreviation for Log Sequence Number; see State Identifier.

M1LF: Abbreviation for Many-to-One Lock Forwarding.
Technique used during non-blocking commit and abort synchronization of schema transformations.

Materialized View: A view where the result of the view query is physically stored.

MMLF: Abbreviation for Many-to-Many Lock Forwarding. Technique used during non-blocking commit and abort synchronization of schema transformations.

MV: Common abbreviation for Materialized View.

NbDBMS: Abbreviation for Non-blocking DBMS. The name of the prototype DBMS used for testing in this thesis.

Performance Degradation: The degree of reduced performance, measured in throughput or response time.

Physical Log: A transaction log containing before and after values of the changed objects.

Physiological Log: A compromise between physical and logical logging. Uses logical logging to describe operations on the physical objects, i.e., blocks.

Preparation Step: First step of the DT creation framework. Necessary tables, indices etc. are added to the database schema.

Record Identification Policy: The strategy a DBMS uses to uniquely identify records. There are four alternative strategies: Relative Byte Address, Tuple Identifier, Database Key and Primary Key. The DT creation framework presented in this thesis requires that either of the two latter strategies is used.

Record Identifier: A unique identifier assigned to all records in a database.

RID: Common abbreviation for Record Identifier.

Schema Transformation: A change to the database schema that happens after the schema has been put into use.

Self-maintainable: Highly desirable property for materialized views; used on MVs that can be maintained without querying the source tables. Throughout this thesis, also used on DT creations that can be performed without querying the source tables.

Semantically rich locks: Lock types that allow multiple transactions to lock the same data item. Requires that the operations are commutative, i.e., can be performed in any order.
An example is "add $1000 to account X", which commutes with "subtract $10 from account X".

SLF: Abbreviation for Simple Log Forwarding. Technique used during non-blocking commit and abort synchronization of schema transformations.

Source Table: A table from which records are derived in the derived table creation framework.

State Identifier: A value assigned to records or blocks (containing records) which identifies the latest operation that was applied to it. Used to achieve idempotence when logical logging is used.

Synchronization Step: Fourth step of the DT creation framework. The derived tables are made consistent with the source table records, and are either turned into materialized views or used to perform a schema transformation.

Transaction Log: A file (normally) in which database operations are written. Used by the DBMS to recover a database after a failure.

Vertical Merge: Derived Table creation operator, corresponding to the left outer join relational operator in Ronström's schema transformation method, and the full outer join operator in the DT creation method presented in this thesis.

Vertical Split: Derived Table creation operator, corresponding to the projection relational operator.

Bibliography

Aas, J. (2005). Understanding the Linux 2.6.8.1 CPU scheduler. http://josh.trancesoftware.com/linux/linux cpu scheduler.pdf.

Adiba, M. E. and Lindsay, B. G. (1980). Database snapshots. In Proceedings of the Sixth International Conference on Very Large Data Bases, 1980, Canada, pages 86–91. IEEE Computer Society.

Agarwal, S., Keller, A. M., Wiederhold, G., and Saraswat, K. (1995). Flexible relation: An approach for integrating data from multiple, possibly inconsistent databases. In ICDE '95: Proceedings of the Eleventh International Conference on Data Engineering, pages 495–504, Washington, DC, USA. IEEE Computer Society.

Agrawal, R., Carey, M. J., and Livny, M. (1987). Concurrency control performance modeling: alternatives and implications. ACM Trans.
Database Syst., 12(4):609–654.

Alur, N., Haas, P., Momiroska, D., Read, P., Summers, N., Totanes, V., and Zuzarte, C. (2002). DB2 UDB's High Function Business Intelligence in e-business. IBM Corp., 1st edition.

Apache Derby (2007a). Apache Derby homepage. http://db.apache.org/derby/.

Apache Derby (2007b). Derby engine papers: Derby logging and recovery. http://db.apache.org/derby/papers/recovery.html.

Apache Derby (2007c). Derby engine papers: Derby write ahead log format. http://db.apache.org/derby/papers/logformats.html.

Austin, C. (2000). Sun Developer Network: Java technology on the Linux platform. http://java.sun.com/developer/technicalArticles/Programming/linux/.

Ballinger, C. (1993). TPC-D: benchmarking for decision support. In Gray, J., editor, The Benchmark Handbook for Database and Transaction Systems. Morgan Kaufmann, 2nd edition.

Bernstein, P. A., Hadzilacos, V., and Goodman, N. (1987). Concurrency Control and Recovery in Database Systems. Addison-Wesley Publishing Company, 1st edition.

Blakeley, J. A., Coburn, N., and Larson, P.-A. (1989). Updating derived relations: detecting irrelevant and autonomously computable updates. ACM Transactions on Database Systems, 14(3):369–400.

Blakeley, J. A., Larson, P.-A., and Tompa, F. W. (1986). Efficiently updating materialized views. In Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, pages 61–71.

Bratsberg, S. E., Hvasshovd, S.-O., and Torbjørnsen, Ø. (1997a). Location and replication independent recovery in a highly available database. In 15th British Conference on Databases. Springer-Verlag LNCS.

Bratsberg, S. E., Hvasshovd, S.-O., and Torbjørnsen, Ø. (1997b). Parallel solutions in ClustRa. IEEE Data Eng. Bull., 20(2):13–20.

Caroprese, L. and Zumpano, E. (2006). A framework for merging, repairing and querying inconsistent databases.
In Advances in Databases and Information Systems - Proceedings of ADBIS 2006, volume 4152 of Lecture Notes in Computer Science, pages 383–398. Springer-Verlag.

Cha, S. K., Park, B. D., Lee, S. J., Song, S. H., Park, J. H., Lee, J. S., Park, S. Y., Hur, D. Y., and Kim, G. B. (1995). Object-oriented design of main-memory DBMS for real-time applications. In Proceedings of the 2nd International Workshop on Real-Time Computing Systems and Applications, page 109, Washington, DC, USA. IEEE Computer Society.

Cha, S. K., Park, J. H., and Park, B. D. (1997). Xmas: an extensible main-memory storage system. In Proceedings of the Sixth International Conference on Information and Knowledge Management, pages 356–362, New York, NY, USA. ACM Press.

Cha, S. K. and Song, C. (2004). P*TIME: Highly scalable OLTP DBMS for managing update-intensive stream workload. In (e)Proceedings of the 30th International Conference on VLDB, pages 1033–1044.

Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6):377–387.

Colby, L. S., Griffin, T., Libkin, L., Mumick, I. S., and Trickey, H. (1996). Algorithms for deferred view maintenance. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 469–480. ACM Press.

Crus, R. A. (1984). Data Recovery in IBM Database 2. IBM Systems Journal, 23(2):178.

Cyran, M. and Lane, P. (2003). Oracle Database Online Documentation 10g Release 1 (10.1): "Concepts", Part No. B10743-01.

Denning, P. J., Comer, D. E., Gries, D., Mulder, M. C., Tucker, A., Turner, A. J., and Young, P. R. (1989). Computing as a discipline. Communications of the ACM, 32(1):9–23.

Desmo-J (2006). Desmo-J: A framework for discrete-event modelling and simulation. http://desmoj.de/.

Elmasri, R. and Navathe, S. B. (2000). Fundamentals of Database Systems. Addison-Wesley Publishing Company, 3rd edition.

Elmasri, R. and Navathe, S. B. (2004). Fundamentals of Database Systems.
Addison-Wesley, 4th edition.

Flesca, S., Greco, S., and Zumpano, E. (2004). Active integrity constraints. In Proceedings of the 6th ACM SIGPLAN International Conference on Principles and Practice of Declarative Programming, pages 98–107, New York, NY, USA. ACM Press.

Friedl, J. E. (2006). Mastering Regular Expressions. O'Reilly & Associates, 3rd edition.

Garcia-Molina, H. and Salem, K. (1987). Sagas. In Proceedings of the 1987 ACM SIGMOD International Conference on Management of Data, pages 249–259. ACM Press.

Garcia-Molina, H. and Salem, K. (1992). Main memory database systems: An overview. IEEE Transactions on Knowledge and Data Engineering, 4(6):509–516.

Garcia-Molina, H., Ullman, J. D., and Widom, J. (2002). Database Systems: The Complete Book. Prentice Hall PTR, Upper Saddle River, NJ, USA.

Gray, J. (1978). Notes on data base operating systems. In Operating Systems, An Advanced Course, pages 393–481, London, UK. Springer-Verlag.

Gray, J. (1981). The transaction concept: Virtues and limitations. In Very Large Data Bases, 7th International Conference, September 9-11, 1981, Cannes, France, Proceedings, pages 144–154. IEEE Computer Society.

Gray, J., editor (1993). The Benchmark Handbook for Database and Transaction Systems. Morgan Kaufmann, 2nd edition.

Gray, J. and Reuter, A. (1993). Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers, Inc.

Greco, G., Greco, S., and Zumpano, E. (2001a). A logic programming approach to the integration, repairing and querying of inconsistent databases. In Proceedings of the 17th International Conference on Logic Programming, pages 348–364, London, UK. Springer-Verlag.

Greco, G., Greco, S., and Zumpano, E. (2003). A logical framework for querying and repairing inconsistent databases. IEEE Transactions on Knowledge and Data Engineering, 15(6):1389–1408.

Greco, S., Pontieri, L., and Zumpano, E. (2001b). Integrating and managing conflicting data.
In PSI '02: Revised Papers from the 4th International Andrei Ershov Memorial Conference on Perspectives of System Informatics, volume 4152 of Lecture Notes in Computer Science, pages 349–362, London, UK. Springer-Verlag.

Griffin, T. and Libkin, L. (1995). Incremental maintenance of views with duplicates. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 328–339. ACM Press.

Griffin, T., Libkin, L., and Trickey, H. (1997). An improved algorithm for the incremental recomputation of active relational expressions. TKDE, 9(3):508–511.

Gupta, A., Jagadish, H. V., and Mumick, I. S. (1996). Data integration using self-maintainable views. In Proceedings of the 5th International Conference on Extending Database Technology, pages 140–144. Springer-Verlag.

Gupta, A., Katiyar, D., and Mumick, I. S. (1992). Counting solutions to the view maintenance problem. In Workshop on Deductive Databases, JICSLP, pages 185–194.

Gupta, A., Mumick, I. S., and Subrahmanian, V. S. (1993). Maintaining views incrementally. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 157–166. ACM Press.

Haerder, T. and Reuter, A. (1983). Principles of transaction-oriented database recovery. ACM Comput. Surv., 15(4):287–317.

Hagmann, R. B. (1986). A crash recovery scheme for a memory-resident database system. IEEE Trans. Comput., 35(9):839–843.

Highleyman, W. H. (1989). Performance analysis of transaction processing systems. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

Houston, Leland, S. and Newton (2002). IBM Informix Guide to SQL: Syntax, version 9.3. IBM.

Hvasshovd, S.-O. (1999). Recovery in Parallel Database Systems. Verlag Vieweg, 2nd edition.

Hvasshovd, S.-O., Sæter, T., Torbjørnsen, Ø., Moe, P., and Risnes, O. (1991). A continuously available and highly scalable transaction server: Design experience from the HypRa project.
In Proceedings of the 4th International Workshop on High Performance Transaction Systems.

Hvasshovd, S.-O., Torbjørnsen, Ø., Bratsberg, S. E., and Holager, P. (1995). The ClustRa telecom database: High availability, high throughput, and real-time response. In Proceedings of the 21st VLDB Conference.

IBM Information Center (2006). DB2 Version 9 Information Center. http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp (checked December 5, 2006).

IBM Information Center (2007). IBM DB2 Universal Database glossary, version 8.2 (checked February 6, 2007).

Jonasson, Ø. A. (2006). Non-blocking creation and maintenance of materialized views. Master's thesis, Norwegian University of Science and Technology.

Kahler, B. and Risnes, O. (1987). Extended logging for database snapshot refresh. In Proceedings of the 13th International Conference on Very Large Data Bases.

Kawaguchi, A., Lieuwen, D. F., Mumick, I. S., Quass, D., and Ross, K. A. (1997). Concurrency control theory for deferred materialized views. In Proceedings of the 6th International Conference on Database Theory, ICDT 1997, volume 1186 of Lecture Notes in Computer Science, pages 306–320. Springer-Verlag.

Knuth, D. (1998). The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, 2nd edition.

Korth, H. F. (1983). Locking primitives in a database system. Journal of the ACM, 30(1):55–79.

Kruckenberg, M. and Pipes, J. (2006). Pro MySQL. Apress.

Lesk, M. E. and Schmidt, E. (1990). Lex - a lexical analyzer generator. In UNIX Vol. II: research system, pages 375–387, Philadelphia, PA, USA. W. B. Saunders Company.

Lin, J. (1996). Integration of weighted knowledge bases. Artificial Intelligence, 83(2):363–378.

Lin, J. and Mendelzon, A. O. (1999). Knowledge base merging by majority. In Pareschi, R. and Fronhoefer, B., editors, Dynamic Worlds: From the Frame Problem to Knowledge Management. Kluwer Academic Publishers.

Lindsay, B., Haas, L., Mohan, C., Pirahesh, H., and Wilms, P.
(1986). A snapshot differential refresh algorithm. In Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, pages 53–60, New York, NY, USA. ACM Press.

Lorentz, D. and Gregoire, J. (2002). Oracle9i SQL Reference Release 2 (9.2), Part No. A96540-01. Oracle.

Lorentz, D. and Gregoire, J. (2003a). Oracle Database SQL Reference 10g Release 1 (10.1), Part No. B10759-01. Oracle.

Lorentz, D. and Gregoire, J. (2003b). Oracle Database SQL Reference 10g Release 1 (10.1). Oracle.

Løland, J. (2003). Schema transformations in commercial databases. Report, Norwegian University of Science and Technology.

Løland, J. and Hvasshovd, S.-O. (2006a). Non-blocking creation of derived tables. In Norsk Informatikkonferanse 2006. Tapir Akademisk Forlag.

Løland, J. and Hvasshovd, S.-O. (2006b). Non-blocking materialized view creation and transformation of schemas. In Advances in Databases and Information Systems - Proceedings of ADBIS 2006, volume 4152 of Lecture Notes in Computer Science, pages 96–107. Springer-Verlag.

Løland, J. and Hvasshovd, S.-O. (2006c). Online, non-blocking relational schema changes. In Advances in Database Technology – EDBT 2006, volume 3896 of Lecture Notes in Computer Science, pages 405–422. Springer-Verlag.

Marche, S. (1993). Measuring the stability of data. European Journal of Information Systems, 2(1):37–47.

Microsoft TechNet (2006). Microsoft TechNet: SQL Server 2005. http://msdn2.microsoft.com/en-us/library/ms130214.aspx (checked March 22, 2007).

Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., and Schwarz, P. (1992). ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems, 17(1):94–162.

MySQL AB (2006). MySQL 5.1 reference manual. http://dev.mysql.com/doc/.

MySQL AB (2007). MySQL homepage. http://dev.mysql.com/.

Oracle Corporation (2006a). Berkeley DB reference guide, version 4.5.20.
http://www.oracle.com/technology/documentation/berkeleydb/db/index.html.

Oracle Corporation (2006b). White Paper: A comparison of Oracle Berkeley DB and relational database management systems. http://www.oracle.com/database/berkeley-db/.

Pfleeger, S. L. (1999). Albert Einstein and Empirical Software Engineering. Computer, 32(10):32–38.

PostgreSQL Global Development Group (2001). PostgreSQL online manuals: PostgreSQL 7.1 documentation. http://www.postgresql.org/docs/7.1/static/postgres.html.

PostgreSQL Global Development Group (2002). PostgreSQL online manuals: PostgreSQL 7.3 documentation. http://www.postgresql.org/docs/7.3/interactive/index.html.

PostgreSQL Global Development Group (2007). PostgreSQL history.

Qian, X. and Wiederhold, G. (1991). Incremental recomputation of active relational expressions. Knowledge and Data Engineering, 3(3):337–341.

Quass, D., Gupta, A., Mumick, I. S., and Widom, J. (1996). Making views self-maintainable for data warehousing. In Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems, 1996, USA, pages 158–169. IEEE Computer Society.

Ronström, M. (1998). Design and Modelling of a Parallel Data Server for Telecom Applications. PhD thesis, Linköping University.

Ronström, M. (2000). On-line schema update for a telecom database. In Proceedings of the 16th International Conference on Data Engineering, pages 329–338. IEEE Computer Society.

Serlin, O. (1993). The history of DebitCredit and the TPC. In Gray, J., editor, The Benchmark Handbook for Database and Transaction Systems. Morgan Kaufmann, 2nd edition.

Shapiro, L. D. (1986). Join processing in database systems with large main memories. ACM Trans. Database Syst., 11(3):239–264.

Shirazi, J. (2003). Java Performance Tuning. O'Reilly & Associates.

Sjøberg, D. (1993). Quantifying schema evolution. Information and Software Technology, 35(1):35–44.

Solid Info. Tech. (2006a).
Solid BoostEngine data sheet. http://www.solidtech.com/pdfs/SolidBoostEngine DS.pdf.

Solid Info. Tech. (2006b). Solid database engine administration guide.

Solid Info. Tech. (2007). Solid database homepage. http://www.solidtech.com/.

Sun Microsystems (2006a). Sun Developer Network: Java 2 Standard Edition 5.0. http://java.sun.com/j2se/1.5.0/.

Sun Microsystems (2006b). Sun Developer Network: Java SE HotSpot at a glance. http://java.sun.com/javase/technologies/hotspot/.

Sun Microsystems (2007). Java 2 Platform Standard Edition 5.0 API specification. http://java.sun.com/j2se/1.5.0/docs/api/.

Tichy, W. F. (1998). Should Computer Scientists Experiment More? IEEE Computer, 31(5):32–40.

Turbyfill, C., Orji, C. U., and Bitton, D. (1993). AS3AP - an ANSI SQL standard scaleable and portable benchmark for relational database systems. In Gray, J., editor, The Benchmark Handbook for Database and Transaction Systems. Morgan Kaufmann, 2nd edition.

Walker, R. J., Briand, L. C., Notkin, D., Seaman, C. B., and Tichy, W. F. (2003). Panel: empirical validation: what, why, when, and how. In Proceedings of the 25th International Conference on Software Engineering (ICSE'03), pages 721–722. IEEE Computer Society Press.

Weikum, G. (1986). A theoretical foundation of multi-level concurrency control. In PODS '86: Proceedings of the Fifth ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pages 31–43, New York, NY, USA. ACM Press.

Zaitsev, P. (2006). Presentation: InnoDB architecture and performance optimization. Open Source Database Conference 2006, http://www.opendbcon.net/.

Zelkowitz, M. V. and Wallace, D. R. (1998). Experimental Models for Validating Technology. Computer, 31(5):23–31.