CS157B - Fall 2004

Entity-Relationship Data Model
Entity Sets
ƒ A collection of similar entities
Attributes
ƒ Properties of the entities in an entity set
Relationships
ƒ Connections among two or more entity sets
Constraints
ƒ Keys : uniquely identify an entity within its entity set
ƒ Single-value constraints : requirement that the value in a certain context be unique
ƒ Referential integrity constraints : existence test
ƒ Domain constraints : range test
ƒ General constraints : data set constraints

E/R - Cont.
Constraints Are Part of the Schema
ƒ Should be part of the DB design
ƒ Must be carefully constructed
ƒ Can be troublesome if a major redesign is required
Subclasses
ƒ Special "isa" relationship
ƒ Similar to OO's definition
Weak Entity Sets
ƒ Some of its key may be composed of attributes that belong to another entity set
ƒ Need an additional qualifier

The Relational Data Model
Relation
ƒ Represented by a two-dimensional table
Attribute
ƒ Represented as columns in a table
Schema
ƒ The name of a relation and the set of attributes of this relation
ƒ A database schema consists of one or more relation schemas
Tuple
ƒ A row in the table
Domain
ƒ Each attribute of a relation has an elementary type, so each component of a tuple must be atomic

Relational - Cont
Converting Subclass Structures to Relations
ƒ E/R-Style Conversion
ƒ Object-Oriented Approach
ƒ Null Values
Functional Dependencies (FD)
ƒ A unique-value constraint in a relational schema design
ƒ Keys of Relations
–{A(1), ..., A(n)} is a key of relation R if no two distinct tuples agree on all of A(1), ..., A(n), and no proper subset of these attributes determines all other attributes of R (a key must be minimal)
–If there is more than one key, designate one of them as the primary key
ƒ Superkey : any set of attributes that contains a key

Relational - Cont
Design of Relational Database Schemas
ƒ Anomalies
–Redundancy : the same information repeated in many tuples
–Update Anomalies : updating one tuple does not update other tuples carrying the same information
–Deletion Anomalies : lose other information
as a side effect when a set of values becomes empty
ƒ Decomposing Relations
–Splitting the attributes of R to make the schemas of two new relations
ƒ Boyce-Codd Normal Form (BCNF)
–The left side of every nontrivial FD must be a superkey
–Recursively decompose into smaller relations until they all satisfy the BCNF rule
–Must be able to join the decomposed relations back into the original tuples (the decomposition must be lossless)

Relational - Cont
ƒ Third Normal Form (3NF)
–A relation R is in 3NF if: whenever A(1)...A(n) -> B is a nontrivial FD, either {A(1)...A(n)} is a superkey, or B is a member of some key
–May allow minimal redundancy in the end
ƒ Multivalued Dependencies (MVD)
–A statement that two sets of attributes in a relation have sets of values that appear in all possible combinations
–Example of MVD in p.118 figure 3.29
–No BCNF violation
ƒ Fourth Normal Form (4NF)
–A relation R is in 4NF if whenever A(1)...A(n) ->-> B(1)...B(m) is a nontrivial MVD, {A(1)...A(n)} is a superkey
–Can be decomposed with an algorithm similar to the BCNF decomposition algorithm

Other Data Models
Review of Object-Oriented Concepts
ƒ Type System
–Record structures
–Collection types
–Reference types
ƒ Classes and Objects
–Contain attributes and methods
ƒ Object Identity
–OID
ƒ Methods
–Functions within a class
ƒ Hierarchies
–Superclass and subclass relationship
Introduction to ODL
ƒ Object-Oriented Design
–Three main properties: attributes, relationships, and methods

Other Data Models - Cont
ƒ Class Declarations
–e.g. class <name> { <list of properties> }
ƒ Attributes in ODL
–Can be simple or complex types
ƒ Relationships in ODL
–e.g. in Movie: relationship Set<Star> stars;
ƒ Inverse Relationships
–e.g. in Star: relationship Set<Movie> starredIn;
ƒ Methods in ODL
–Signatures
ƒ Types in ODL
–Atomic types: integer, float, character, character string, boolean, enumerations

Other Data Models - Cont
–Structure types e.g.
Struct <name> { <field list> } - similar to a tuple of values
–Collection types
Set - distinct unordered elements
Bag - unordered elements, duplicates allowed
List - ordered elements
Array - e.g. Array<char,10>
Dictionary - e.g. Dictionary<Keytype, Rangetype>
Additional ODL Concepts
ƒ Multiway Relationships in ODL
–ODL supports only binary relationships
–Use several binary, many-one relationships instead
ƒ Subclasses in ODL
–A subclass inherits all the properties of its superclass

Other Data Models - Cont
ƒ Multiple Inheritance in ODL
–Need to resolve name conflicts from multiple superclasses
ƒ Keys in ODL
–Optional because of the existence of an OID
–Can be declared with one or more attributes
ODL Designs to Relational Designs
ƒ Many issues
–Possibly no key in ODL
–The relational model can't handle structure and collection types directly
–Converting an ODL type constructor can lead to a BCNF violation
ƒ Representing ODL Relationships
–One relation for each pair of inverse relationships

Other Data Models - Cont
Object-Relational Model
ƒ O/R Features
–Structured types for attributes
–Methods
–Identifiers for tuples (like OIDs)
–References
ƒ A compromise between pure OO and relational
Semistructured Data
ƒ "Schemaless" - the schema is attached to the data itself
ƒ Represented as a collection of nodes
ƒ An interior node describes the data with a label
ƒ A leaf node contains the data of any atomic type
ƒ Legacy-database problem

Relational Algebra
Algebra of Relational Operations
ƒ Bags rather than sets can be more efficient depending on the operation, such as a union of two relations that contain duplicates
ƒ Components
–Variables that stand for relations, and constants, which are finite relations
–Expressions of relational algebra are referred to as queries
ƒ Set Operations on Relations
–R u S, the union of R and S (distinct)
–R n S, the intersection of R and S (distinct)
–R - S, the difference of R and S, is the set of elements that are in R but not in S; different from S - R
–Attributes must be in the same order and attribute types must be the
same

Relational Algebra - Cont
ƒ Projection
–The projection operator produces from a relation R a new relation that has only some of R's columns
–e.g. views
ƒ Selection
–The selection operator produces from a relation R a new relation with a subset of R's tuples; the tuples in the resulting relation are those that satisfy some condition C involving the attributes of R
–e.g. SQL select statement
ƒ Cartesian Product
–A cross-product of two sets R and S
–e.g. R x S ; if R has 2 tuples and S has 3 tuples, the result has 6 tuples
ƒ Natural Joins
–Use matching attributes

Relational Algebra - Cont
ƒ Theta-Joins
–Join two relations with a condition denoted by 'C'
ƒ Combining Operations to Form Queries
–Multiple queries can be combined into a complex query
–e.g. AND, OR, (), ...
ƒ Renaming
–Control the names of the attributes used for relations that are constructed by applying relational-algebra operations
ƒ Dependent and Independent Operations
–Some relational expressions can be "rewritten" as different expressions
–e.g. R n S = R - (R - S)
–e.g. Glass half-empty or half-full?

Relational Algebra - Cont
Extended Operators of Relational Algebra
ƒ Duplicate Elimination
–This operator converts a bag to a set
–e.g. SQL keyword DISTINCT
ƒ Aggregation Operators
–SUM(), AVG(), MIN(), MAX(), COUNT()
ƒ Grouping
–This operator allows us to group a relation and/or aggregate some columns
ƒ Extended Projection
–This allows expressions as part of an attribute list
ƒ Sorting Operator
–This operator is anomalous, in that it is the only operator in our relational algebra whose result is a list of tuples, rather than a set
ƒ Outerjoin
–Takes care of dangling tuples; denoted with a special "null" symbol

SQL
Simple Queries in SQL
ƒ select A(...) from R where C
ƒ Projection & Selection in SQL
–select title, length from Movie where studioName = 'Disney' and year = 1990
ƒ String Comparison
–Bit strings (bit data) e.g. B'011' or X'7ff'
–Lexicographic order e.g. 'A' < 'B'; 'a' < 'b'; 'a' > 'B'?
Depends on the encoding scheme or standard (e.g. ASCII, UTF-8)
–LIKE keyword
"s LIKE p", where s is a string and p is a pattern
Special characters % (any sequence of characters) and _ (any single character)
ƒ Dates and Times
–DATE '2002-02-04'
–TIME '19:00:30.5'

SQL - Cont
ƒ Null Values
–3 cases: unknown, inapplicable, withheld
–NOT a constant
–Test an expression with IS NULL
ƒ Truth-Value "unknown"
–Pitfalls regarding nulls
–e.g. select * from Movie where length <= 120 or length > 120; does not return tuples whose length is NULL
ƒ Ordering the Output
–ORDER BY <list of attributes>
–Can be ASC or DESC; default is ASC
Queries with > 1 Relation
ƒ e.g. select name from Movie, MovieExec where title = 'Star Wars' and producerC# = cert#
ƒ Disambiguating Attributes
–select MovieStar.name, MovieExec.name from MovieStar, MovieExec where ...

SQL - Cont
ƒ UNION, INTERSECT, EXCEPT Keywords
–Same logic as the set operators u, n, and -
Subqueries
ƒ Can return a single constant or relations in the WHERE clause
ƒ Can have relations appear in the FROM clauses
ƒ Scalar Value
–An atomic value that can appear as one component of a tuple (e.g. constant, attribute)
ƒ e.g. select name from MovieExec where cert# = (select producerC# from Movie where title = 'Star Wars');
ƒ Conditions Involving Relations
–If R is a relation, then EXISTS R is a condition that is true if R is not empty

SQL - Cont
–s IN R is true if s is equal to one of the values in R; s NOT IN R is true if s is equal to no value in R
–s > ALL R is true if s is greater than every value in unary relation R
–s > ANY R is true if s is greater than at least one value in unary relation R
–EXISTS, ALL, ANY operators can be negated by putting NOT in front of the entire expression
ƒ Subqueries in FROM Clauses
–Can substitute a relation R in the FROM clause with a subquery
–e.g. select name from MovieExec, (select producerC# from Movie, StarsIn where title = movieTitle and year = movieYear and starname = 'Harrison Ford') Prod where cert# = Prod.producerC#

SQL - Cont
ƒ Cross Joins
–Known as Cartesian product or just product
–e.g.
Movie CROSS JOIN StarsIn;
ƒ Natural Joins
–The join condition is that all pairs of attributes from the two relations having a common name are equated, and no other conditions
–One of each pair of equated attributes is projected out
ƒ Outerjoins
–e.g. MovieStar NATURAL FULL OUTER JOIN MovieExec;
Full-Relation Operations
ƒ Eliminating Duplicates
–Use the DISTINCT keyword in SELECT
–Performance consideration

SQL - Cont
ƒ Duplicates in Union, Intersection, and Difference
–By default, these operations convert bags to sets
–Use the keyword ALL after UNION, INTERSECT, or EXCEPT to prevent the elimination of duplicates
ƒ Grouping and Aggregation in SQL
–Use the special GROUP BY clause
ƒ Aggregation Operators
–SUM, AVG, MIN, MAX are applied to a scalar-valued expression, typically a column name, in a SELECT clause
–COUNT(*) counts all the tuples in the relation constructed from the FROM clause and WHERE clause of the query
ƒ HAVING Clause
–Used in conjunction with GROUP BY to narrow the aggregated list

SQL - Cont
Database Modifications (DML)
ƒ Insertion
–insert into R(A1, ..., An) values (V1, ..., Vn);
–e.g. insert into Studio(name) values('S1');
–Can insert multiple tuples with a subquery
–e.g. insert into Studio(name) select distinct studioName from Movie where studioName not in (select name from Studio);
ƒ Deletion
–delete from R where <condition>;
–Can delete multiple tuples with 1 delete statement depending on <condition>
ƒ Updates
–update R set <new-value assignment> where <condition>;
–Can update multiple tuples with 1 update statement depending on <condition>

SQL - Cont
DDL in SQL
ƒ Data Types
–Character strings, fixed or variable length
–Bit strings, fixed or variable length
–Boolean: true, false, unknown
–Integer or int; shortint
–Floating-point numbers e.g.
decimal(n,d), where n is the total number of digits and d is the number of digits to the right of the decimal point; 1234.56 can be described as decimal(6,2)
–Dates and times can be represented by the data types DATE and TIME respectively
ƒ Table Declarations
–Use the keywords CREATE TABLE followed by the relation name and a list of attributes and their types
–e.g. create table MovieStar(name char(30), address varchar(256), gender char(1), birthdate DATE);

SQL - Cont
ƒ Modifying Relation Schemas
–drop table MovieStar;
–alter table MovieStar add phone char(16);
–alter table MovieStar drop birthdate;
ƒ Default Values
–Use the DEFAULT keyword to set default values for a column
–e.g. alter table MovieStar add phone char(16) default 'unlisted';
ƒ Indexes
–Allow faster access to data
–e.g. create index YearIndex on Movie(year);
–Can be on one or more attributes
–e.g. create index KeyIndex on Movie(title, year);
–Delete an index using the drop index statement
ƒ Selection of Indexes
–Selection vs. insert/update/delete performance

SQL - Cont
View Definitions
ƒ A view does not contain any physical data
–"virtual relation"
ƒ Declaring Views
–create view <view-name> as <view-definition>
ƒ Querying Views
–Use the normal select syntax with a view name in place of the table name
ƒ Renaming Attributes
–Can map an attribute name from the base table to a new name in a view definition
ƒ Modifying Views
–Updatable views are useful in special cases; selective inserts/updates/deletes

Constraints & Triggers
Keys & Foreign Keys
ƒ Primary Keys
–Each relation can have only one primary key
–Primary key attribute(s) cannot be NULL
–Two ways to specify the primary key
1) create table MovieStar( name char(30) primary key, address varchar(255), gender char(1), birthdate date);
2) create table MovieStar( name char(30), address varchar(255), gender char(1), birthdate date, primary key(name, birthdate));
ƒ Unique Keys
–Each relation can have more than one unique key
–Declared the same way as a primary key
–NULL is allowed

Constraints & Triggers - Cont
ƒ Enforcing Key Constraints
–During
insertion or update to the relation
ƒ Foreign Keys
–The "referenced" attribute(s) must be declared unique or the primary key for their relation, and must not have a NULL value
–create table Studio( name char(30) primary key, address varchar(255), presC# int references MovieExec(cert#) );
–create table Studio( name char(30) primary key, address varchar(255), presC# int, foreign key (presC#) references MovieExec(cert#) );

Constraints & Triggers - Cont
ƒ Maintaining Referential Integrity
–Reject Violating Modifications (Default)
Reject an insert or update of a Studio tuple whose presC# value is not NULL and is not the cert# component of any MovieExec tuple
Reject deleting a MovieExec tuple whose cert# component appears as the presC# component of one or more Studio tuples
Reject updating a MovieExec tuple's cert# value when the old cert# is the value of presC# of some movie studio in Studio
–Cascade Policy
When the MovieExec tuple for the president of a studio is deleted, the referencing tuple is deleted from Studio
When the cert# value for a MovieExec tuple changes from c1 to c2 and some Studio tuple had c1 as the value of its presC# component, that presC# component is updated to c2

Constraints & Triggers - Cont
–Set-Null Policy
Can handle the delete and update problems by setting the presC# to NULL, e.g.
create table Studio ( name char(30) primary key, address varchar(255), presC# int references MovieExec(cert#) on delete set null on update cascade );
–Deferring Checking of Constraints
Do a selective insert that defaults the presC# to null
Insert the tuple into MovieExec with the new cert#
Update the Studio tuple with the matching presC#
Use the keywords DEFERRABLE and DEFERRED to delay the checking until the whole transaction is committed
Reverse the DEFERRED case with the keyword IMMEDIATE

Constraints & Triggers - Cont
Constraints on Attributes and Tuples
ƒ Not-Null Constraints
–Use the NOT NULL keywords in a create table statement for any attribute
ƒ Attribute-Based Constraints
–Use the CHECK keyword in a create table statement
–Limit the values for an attribute
–e.g. gender char(1) check (gender in ('F','M'))
ƒ Tuple-Based Constraints
–Use the CHECK keyword in a create table statement
–Can be composed of a complex expression over multiple attributes
Constraint Modification
ƒ Naming Constraints
–In order to change a constraint, it must have a name
–Use the CONSTRAINT keyword

Constraints & Triggers - Cont
ƒ Altering Constraints on Tables
–Can use ALTER TABLE to add or drop a constraint
–Can use SET CONSTRAINTS to set it to deferred or immediate
Schema-Level Constraints and Triggers
ƒ Assertions (General Constraint)
–A boolean-valued SQL expression that must be true at all times
–create assertion <name> check (<condition>)
–e.g.
create assertion RichPres check (not exists (select * from Studio, MovieExec where presC# = cert# AND netWorth < 10000000));
ƒ Event-Condition-Action Rules (ECA Rules)
–Triggers are awakened by certain events
–The "action" is performed only if the condition C is true

Constraints & Triggers - Cont
ƒ Triggers in SQL
–create trigger NetWorthTrigger after update of netWorth ON MovieExec referencing old row as OTuple, new row as NTuple for each row when (OTuple.netWorth > NTuple.netWorth) update MovieExec set netWorth = OTuple.netWorth where cert# = NTuple.cert#;
–Default is "for each statement"
–Besides update, can use insert and delete
–The action can be "before" or "after" the event
–Use BEGIN...END for multiple statements
ƒ Instead-Of Triggers
–Not part of SQL-99
–Replace the event with newly defined operations
–Very powerful when used on a view

System Aspects of SQL
SQL Programming Environment
ƒ Compilation pipeline: Host language + Embedded SQL -> Preprocessor -> Host language + Function calls -> Host-language compiler (linked with the SQL library) -> Object-code program
ƒ Impedance Mismatch Problem
–Different data models between SQL statements and programming languages
ƒ SQL/Host Language Interface
–Use the EXEC SQL keywords in front of an SQL statement
–Use shared (host) variables in SQL statements

System Aspects of SQL - Cont
ƒ The DECLARE Section and Its Usage
–Shared variables are declared between two embedded SQL statements.
–e.g. EXEC SQL BEGIN DECLARE SECTION; char studioName[50], studioAddr[256]; char SQLSTATE[6]; EXEC SQL END DECLARE SECTION;
–A shared variable can be used within an SQL statement by placing a colon in front of it.
–e.g. EXEC SQL INSERT INTO Studio(name, address) VALUES (:studioName, :studioAddr);
ƒ Single-Row Select Statement
–e.g.
EXEC SQL SELECT netWorth INTO :presNetWorth FROM Studio, MovieExec WHERE presC# = cert# AND Studio.name = :studioName;

System Aspects of SQL - Cont
ƒ Cursors
–Allow programs to "fetch" multiple rows from a relation
–Steps for using a cursor:
EXEC SQL DECLARE <cursor> CURSOR FOR <query>
EXEC SQL OPEN <cursor>
EXEC SQL FETCH FROM <cursor> INTO <list-of-variables>
If SQLSTATE is "02000", then close the cursor; otherwise fetch the next row
EXEC SQL CLOSE <cursor>
–Row Modification with a Cursor
Use the WHERE CURRENT OF keywords
e.g. EXEC SQL DELETE FROM MovieExec WHERE CURRENT OF execCursor;
EXEC SQL UPDATE MovieExec SET netWorth = 2 * netWorth WHERE CURRENT OF execCursor;

System Aspects of SQL - Cont
–Concurrent Update of Tuples
Use the keywords INSENSITIVE CURSOR to ignore new changes that may affect the current cursor
Use the keywords FOR READ ONLY to signal that this cursor does not allow any modification
–Scrollable Cursors
Allow a set of movements within a cursor
ƒ Dynamic SQL
–Flexibility to enter SQL statements at run time
–Use EXEC SQL EXECUTE IMMEDIATE or ( EXEC SQL PREPARE ... and EXEC SQL EXECUTE ... )
–e.g. EXEC SQL BEGIN DECLARE SECTION; char *query; EXEC SQL END DECLARE SECTION; /* Allocate memory pointed to by query and fill in the SQL statement */ EXEC SQL EXECUTE IMMEDIATE :query;

System Aspects of SQL - Cont
Procedures Stored in the Schema
ƒ Persistent Stored Modules (PSM)
–Can build modules to handle complex computations which cannot be expressed in SQL
ƒ PSM Functions & Procedures
–CREATE PROCEDURE <name> (<params>) local declarations procedure body;
–Procedure parameters can be input-only, output-only, or both
–CREATE FUNCTION <name> (<params>) RETURNS <type> local declarations function body;
–Function parameters can only be input, as PSM forbids side-effects in functions

System Aspects of SQL - Cont
ƒ Statements in PSM
–Call statement: CALL <proc name> (<arg list>); e.g. EXEC SQL CALL Foo(:x, 3);
–RETURN <expression>;
–DECLARE <name> <type>;
–SET <variable> = <expression>;
–BEGIN ...
END
–IF <condition> THEN <statement list> ELSEIF <condition> THEN <statement list> ELSEIF ... ELSE <statement list> END IF;
–SELECT <attr> INTO <var> FROM <table> WHERE <condition>
–LOOP <statement list> END LOOP;

System Aspects of SQL - Cont
–FOR <loop name> AS <cursor name> CURSOR FOR <query> DO <statement list> END FOR;
–WHILE and REPEAT loops are also supported
ƒ Exception Handlers in PSM
–DECLARE <where to go> HANDLER FOR <condition list> <statement>
–<where to go> can be:
CONTINUE - execute the handler statement and then execute the statement after the one that caused the exception
EXIT - execute the handler statement and then control leaves the BEGIN...END block in which the handler is declared
UNDO - same as EXIT except that any changes to the DB or local variables made by the statements of the block are "undone"

System Aspects of SQL - Cont
SQL Environment
ƒ Schema
–A collection of tables, views, assertions, triggers, PSM modules, etc.
–CREATE SCHEMA <name> <declarations>
–Use SET SCHEMA to change the current schema
ƒ Catalog
–A collection of schemas
–CREATE CATALOG <catalog name>
–Use SET CATALOG to change the current catalog
ƒ Cluster
–A collection of catalogs
–Can be viewed as the set of all catalogs accessible to a user
ƒ Client/Server
–Client and server can be on different machines or on the same machine

System Aspects of SQL - Cont
ƒ Connection
–CONNECT TO <server name> AS <connection name> AUTHORIZATION <name and password>
–SET CONNECTION <name>;
–DISCONNECT <name>;
Call-Level Interface (CLI)
ƒ In C, each CLI program must include sqlcli.h, which contains all the function, structure, constant, and type definitions
ƒ 4 kinds of records: SQLHENV, SQLHDBC, SQLHSTMT, and SQLHDESC.
ƒ Use SQLAllocHandle(hType, hIn, hOut)
ƒ Processing Statements
–Use SQLPrepare(sh, st, sl); & SQLExecute(sh);
–or use SQLExecDirect(sh, st, sl);
–Use SQLFetch(sh) to fetch from a query result

System Aspects of SQL - Cont
–Use SQLBindCol(sh, colNo, colType, pVar, varSize, varInfo) for column binding
–Can use SQLGetData(...) in place of SQLBindCol(...) to extract data from a query
ƒ Passing Parameters to a Query
–e.g. SQLPrepare(myStmt, "INSERT INTO Studio(name, address) VALUES (?, ?)", SQL_NTS); SQLBindParameter(myStmt, 1, ..., studioName, ...); SQLBindParameter(myStmt, 2, ..., studioAddr, ...); SQLExecute(myStmt);
Transactions in SQL
ƒ Serializability
–Multiple selects followed by multiple updates to the same tuple; e.g. chooseSeat()
–Use locks to handle this problem

System Aspects of SQL - Cont
ƒ Atomicity
–A single user transaction may have multiple updates to different tables; e.g. a transfer from account A to account B
–Only "commit" after all the changes are made
ƒ Transaction
–A collection of one or more operations on the database that must be executed atomically
–Use START TRANSACTION to begin
–Use SQL COMMIT to commit
–Use SQL ROLLBACK to abort and undo the changes made since the start of the transaction
–Can set the transaction to READ ONLY
ƒ Dirty Reads (Uncommitted Reads)
–Reads of data that are "dirty", i.e. uncommitted
ƒ Isolation Levels
–Serializable, read-uncommitted, read-committed, and repeatable-read

System Aspects of SQL - Cont
Security & User Authorization in SQL
ƒ Privileges
–SQL defines nine types of privileges: SELECT, INSERT, DELETE, UPDATE, REFERENCES, USAGE, TRIGGER, EXECUTE, and UNDER
ƒ Authorization Checking
–First at connect time
–Second at statement time
–Additional checks with modules
ƒ Grant & Revoke
–GRANT <privilege list> ON <database element> TO <user list>
–Allows other users to perform certain actions
–REVOKE <privilege list> ON <database element> FROM <user list>
–Disallows a previously granted privilege

Data Storage
Megatron 2002 Database System
ƒ
Stores relations in ASCII text files
ƒ Stores the schema in ASCII files as well
ƒ Obvious problems:
–Tuple layout on disk is not flexible; any small change may shuffle the whole file
–Searching is expensive; must read the whole file
–Query processing is by brute force; nested loops examine all possibilities
–No memory buffering; every query requires direct access to disk
–No concurrency control
–No reliability; e.g. no crash recovery
The Memory Hierarchy
ƒ Cache Memory
–Fast access to and from the processor or I/O controller

Data Storage - Cont
ƒ Main Memory
–Random access memory (RAM)
–Both the OS and applications reside in RAM
ƒ Virtual Memory
–Allows each application to have its own private memory space, which is mapped to physical memory (RAM) or disk
–A page is a memory block moved between main memory and disk
ƒ Secondary Storage
–Much slower than main memory
–Two types of disk I/O:
A disk read moves a block from disk to main memory
A disk write moves a block from main memory to disk
–Most DBMSs manage disk blocks themselves, rather than relying on the OS file manager

Data Storage - Cont
ƒ Volatile and Nonvolatile Storage
–Main memory is typically volatile; when the power is off, the contents are gone
–Flash memory is nonvolatile, but it is very expensive and currently not used as main memory
–An alternative is to use a "RAM disk" combined with a battery backup to the power supply
Disks
ƒ Disk Components
–Head, platter (2 surfaces each), cylinder, tracks, sectors, gaps
ƒ Disk Controller
–Controls the movement of the disk head(s) to a specific track and performs reads and writes
–Transfers data to and from main memory

Data Storage - Cont
Effective Use of Secondary Storage
ƒ CS studies of algorithms often assume that the data are always in main memory; this is not a valid assumption for a DBMS
ƒ I/O Model of Computation
–Dominance of I/O cost
If a block needs to be moved between disk and main memory, the time taken to perform the read/write is much larger than the time
for manipulating that data in main memory; thus the I/O time is a good approximation of the total time
Similar to Big O notation in algorithm study
ƒ Sorting Data in Secondary Storage
–If we need to sort 1.64 billion bytes and a disk block is configured to hold 16384 bytes, then 100000 blocks are required to read each tuple once from disk
–Quicksort is one of the fastest algorithms, but it assumes all entries are in main memory

Data Storage - Cont
ƒ Two-Phase, Multiway Merge-Sort (TPMMS)
–Consists of 2 phases
Phase 1: Sort main-memory-sized pieces of the data, so every record is part of a sorted list that just fits in the available main memory; the results are a set of sorted sublists on disk which we merge in the next phase
Phase 2: Merge all the sorted sublists into a single sorted list
–Example: With 100 MB of main memory and 16384-byte blocks, sorting 1.64 billion bytes, we can fit 6400 blocks at a time in main memory; thus phase 1 produces 16 sorted sublists
If we merged only two sublists at a time, 4 merge passes would be needed, costing 8 disk I/O's (a read and a write per pass) for each block
The better approach is to read the first block from each of the sorted sublists into main-memory buffers, repeatedly move the smallest remaining element to an output buffer, and flush/reload buffers when necessary.
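The two phases above can be sketched in a few lines of Python. This is a toy model: lists stand in for disk files, the `memory_capacity` parameter stands in for the 6400-block main memory of the example, and `heapq.merge` performs the one-pass k-way merge of phase 2.

```python
import heapq

def tpmms_sort(records, memory_capacity):
    """Two-Phase, Multiway Merge-Sort sketch.

    Phase 1: sort memory-sized chunks; each sorted chunk stands in
    for a sorted sublist written back to disk.
    Phase 2: k-way merge of all sublists, always taking the smallest
    element among the sublist heads (heapq.merge does this in one pass).
    """
    # Phase 1: produce sorted sublists that each fit in "memory".
    sublists = []
    for start in range(0, len(records), memory_capacity):
        sublists.append(sorted(records[start:start + memory_capacity]))

    # Phase 2: single k-way merge of the sorted sublists.
    return list(heapq.merge(*sublists))

data = [5, 1, 9, 3, 8, 2, 7, 4, 6, 0]
print(tpmms_sort(data, memory_capacity=3))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With `memory_capacity=3` the ten records form four sorted sublists, merged in a single pass; a real implementation would additionally buffer one block per sublist plus one output block, as the notes describe.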
Data Storage - Cont
Accelerating Access to Secondary Storage
ƒ The TPMMS example in 11.4.4 assumed that data was stored on a single disk and the blocks were chosen randomly
ƒ There is certainly room for improvement with the following methods, each with advantages and disadvantages
–Cylinder-Based Organization
Can reduce disk block access time in phase one of TPMMS by more than 95%
Excellent for applications where only one process accesses the disk and block reads are grouped in logical sequences
Not useful when reading random blocks
–Multiple Disks
Can speed up both grouped and random accesses
Accesses to the same disk still cannot proceed in parallel
Can be expensive, since a single large disk is usually more cost-effective than multiple smaller disks of the same total capacity

Data Storage - Cont
–Mirroring
Reduces access time for read requests (either copy can be read)
Built-in fault tolerance for all applications
Must pay for 2+ disks for the capacity of only 1
–Elevator Algorithm
Reduces read/write access time when the blocks are random
The average delay for each request can be high in any high-traffic system
–Prefetching/Double Buffering
Greatly improves access when the needed blocks are known in advance or grouped together.
Requires extra main-memory buffers
No help when accesses are random
Disk Failures
ƒ Intermittent Failures
–An attempt to read or write a sector fails, but succeeds after a number of retries

Data Storage - Cont
ƒ Checksums
–A widely used method to detect media errors
–Use a collection of bits to compute a fixed value; when the recomputed value fails to match, a media error is the likely cause
–Detect errors but do not fix them
ƒ Stable Storage
–Similar to disk mirroring, except achieved at the software/application level
–Keep an extra "delta" copy of the data to prevent media errors and possible data corruption caused by power failure
Disk Crash Recovery
ƒ Redundant Arrays of Independent Disks - RAID
ƒ Redundancy Technique - Mirroring
–Known as RAID level 1
–When one of the disks fails, the other "mirror" disk becomes the main disk

Data Storage - Cont
ƒ Parity Blocks
–Known as RAID level 4
–Uses only 1 redundant disk no matter how many data disks it supports
–Utilizes the modulo-2 sum for parity checks
–Too many disks can cause the redundant disk to perform poorly, since each write to any of the n data disks causes the parity bits to change
ƒ RAID 5
–Improves the RAID 4 approach by spreading the redundant-disk workload across all n disks
ƒ Multiple Disk Crashes
–RAID 6: uses error-correcting codes such as the Hamming code
–Uses a combination of data and redundant disks; the code determines how many of each are required to survive concurrent failures of n disks

Representing Data Elements
Data Elements and Fields
ƒ Relational Database Elements
–Since a relation is a set of tuples, and tuples are similar to a record/structure in C or C++, we may imagine that each tuple will be stored on disk as a record.
ƒ Objects
–An object is like a tuple whose instance variables are its attributes.
ƒ Data Elements
–INTEGER
Usually 2 or 4 bytes long
–FLOAT
Usually 4 or 8 bytes long
–CHAR(n)
Fixed length denoted by n

Representing Data Elements Cont.
–VARCHAR(n)
Variable length, with n as the maximum
Two ways to represent a varchar:
Length plus content
Null-terminated string
–Dates and Times
A date is usually represented as char(n)
A time is represented as varchar(n) because of the support for fractions of a second
–Bits
Can pack 8 bits into a byte; use an 8-bit boundary, i.e. rounded up to the next whole byte.
–Enumerated Types
Using a byte to represent each value allows 256 different values

Representing Data Elements Cont.
Records
ƒ Building Fixed-Length Records
–Can concatenate the fields to form a record
–Be aware of 4- and 8-byte boundaries, depending on the hardware and OS; must organize data accordingly
ƒ Record Headers
–Also known as the record descriptor
–Information about the record, such as length, timestamp, record id, record type, etc.
ƒ Packing Fixed-Length Records into Blocks
–Use a block header followed by multiple records

Representing Data Elements Cont.
Representing Block and Record Addresses
ƒ Client-Server Systems
–The server's data lives in a database address space. The address space can refer to blocks and possibly to offsets within a block.
Physical Addresses
Byte strings that determine the location within the secondary storage system where the block or record can be found.
Information such as host name, cylinder number, track number, block number, and the offset of the record from the beginning of its block.
Logical Addresses
Can be viewed as a flat model where all the records are in logical sequence in memory.
Use a mapping table to map logical to physical addresses.

Representing Data Elements Cont.
ƒ Logical and Structured Addresses
–Why logical addressing?
Movement of data can be done by changing the logical-to-physical mapping table rather than moving the actual data itself.
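That advantage can be illustrated with a minimal sketch; the dictionary layout, record ids, and (block, offset) pairs below are invented for illustration, not a real DBMS format:

```python
# Minimal sketch of logical addressing: clients hold logical record ids;
# a mapping table translates each id to a (block, offset) physical address.
blocks = {0: ["alpha", "beta"], 1: []}   # toy "disk" blocks
map_table = {100: (0, 0), 101: (0, 1)}   # logical id -> (block, offset)

def read_record(logical_id):
    # One extra table lookup per access is the price of logical addressing.
    block, offset = map_table[logical_id]
    return blocks[block][offset]

# Relocate record 101 from block 0 to block 1: the data moves once,
# but only the mapping table entry changes, so every pointer that
# stored logical id 101 remains valid with no pointer rewriting.
blocks[1].append(blocks[0].pop(1))
map_table[101] = (1, 0)

print(read_record(101))  # still returns "beta" after the move
```

A real system keeps the map (or an offset table per block) in block headers on disk; the trade-off is exactly the one the notes name next: an extra level of indirection in exchange for cheap reorganization.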
–Structured addressing
Using a key value and the physical address of a block, we can easily locate a record
Can be viewed as a form of "hashing"
Fast lookup if each record is fixed-length
–Offset table
Keeping an offset table as part of a block header can handle variable-length records with fast lookup.
Allows easy movement of data, one of the main advantages of logical addressing.
Representing Data Elements Cont.
ƒ Pointer Swizzling
–It means translating embedded pointers from secondary storage (database addresses) to main memory (virtual addresses).
–A pointer usually consists of a bit indicating whether the pointer is currently a database address or a (swizzled) memory address, plus the actual database or memory pointer.
–Automatic swizzling means that when we load a block, all its pointers and addresses are put into the translation table if not already there.
–Another approach is to translate a pointer only when it is used.
–The programmer can control pointer swizzling by using look-ahead logic (e.g. prefetch).
Representing Data Elements Cont.
ƒ Pinned Records and Blocks
–A block in memory is pinned if it cannot be written back to disk safely.
–Can view this as a constraint or dependency from pointers in other blocks.
Variable-Length Data and Records
ƒ Variable-Length Fields Record
–Must keep the lengths and offsets of the variable-length fields.
–One simple but effective method is to put all the fixed-length fields at the beginning, followed by the variable-length fields.
ƒ Repeating Fields Record
–Use the same method as above, but the repeating fixed-length fields can be moved to another block.
Representing Data Elements Cont.
ƒ Variable-Format Records
–In certain situations, records may not have a fixed schema. The fields or their order are not completely determined by the relation.
–Use tagged fields to handle such cases. We store the attribute or field name, the type of the field, and the length of the field.
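The tagged-field idea above can be sketched as a simple serializer: each field carries its name, a one-byte type tag, and its length, so records with different schemas can share one file. The exact byte layout here (null-terminated name, "S"/"I" type codes, 2-byte length) is an assumption for illustration:

```python
def encode_record(fields):
    """Serialize a variable-format record as tagged fields:
    field name, type code, value length, then the value bytes."""
    out = bytearray()
    for name, type_code, value in fields:
        data = value.encode() if type_code == "S" else value.to_bytes(4, "big")
        out += name.encode() + b"\x00"       # null-terminated field name
        out += type_code.encode()            # one-byte type tag: S=string, I=integer
        out += len(data).to_bytes(2, "big")  # value length
        out += data
    return bytes(out)

def decode_record(buf):
    fields, i = [], 0
    while i < len(buf):
        end = buf.index(0, i)                # find the name's null terminator
        name = buf[i:end].decode(); i = end + 1
        type_code = chr(buf[i]); i += 1
        length = int.from_bytes(buf[i:i+2], "big"); i += 2
        data = buf[i:i+length]; i += length
        value = data.decode() if type_code == "S" else int.from_bytes(data, "big")
        fields.append((name, type_code, value))
    return fields

rec = [("title", "S", "Star Wars"), ("year", "I", 1977)]
assert decode_record(encode_record(rec)) == rec
```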
Very similar to the SQLDA definition and the bind-in and bind-out operations of SQL
ƒ Spanned Records
–When a record can't fit into a block and must be broken up across multiple blocks, it is called spanned.
–Spanned records require extra header information to keep track of their fragments.
Representing Data Elements Cont.
ƒ Binary Large Objects (BLOBs)
–Can hold large audio and image files
–Stored in a sequence of blocks, but can also be striped across multiple disks for faster retrieval
–Retrieval is usually done in small chunks
Record Modifications
ƒ Insertion
–If order is not important, just add the record to the end of the free space within a block.
–If order is important, we must slide the data to fit the new record. In this case, the offset-table implementation can help reduce the actual movement of data.
–If the block is full, either a) find space on a "nearby" block and keep a forwarding address, or b) use a chain of overflow blocks.
Representing Data Elements Cont.
ƒ Deletion
–When a record is deleted, the system must reclaim its space.
–If using an offset table, it should shuffle the free space to a central location
–An alternative is to keep track of a linked list of free space (e.g. the freelist).
–When records are spanned or flowed to a different block (nearby or overflow), we may need to "reorganize" all the blocks.
–To avoid dangling pointers, we can replace some of the records with tombstones (dummy records indicating a dead end).
ƒ Update
–If the record is fixed-length, there is no effect on the storage.
–Otherwise we have the same problems as insertion or deletion of variable-length fields.
Index Structures
Indexes on Sequential Files
ƒ Usually have a data file and an index file. The data file is a sorted file. The index file contains only keys and pointers related to the data file.
ƒ Sequential Files
–A file whose records are sorted by the keys defined by its index.
ƒ Dense Indexes
–Every key from the data file is represented in the index.
–An index entry is small compared to a record entry. Thus we may be able to keep the index file content in memory, rather than read it from the index file.
–Since the keys are sorted, we can use binary search to find the key K. Search time is about log n (base 2).
ƒ Sparse Indexes
–Hold only a subset of the dense index entries
–Use much less space but have slower search time.
Index Structures - Cont.
ƒ Multiple Levels of Index
–If an index itself covers many blocks, a binary search will still need many disk I/O's to get to the correct record. We can solve this problem by putting an index on the index.
–The outer level of the index is more sparse than the inner level.
ƒ Indexes With Duplicate Search Keys
–If the index is not unique, we can have multiple records with the same search key.
–An efficient approach is to have only one entry in the dense index for each search key K, then find the record within the sorted sublist.
ƒ Managing Indexes During IUDs
–Recall from Chapter 12 that different methods were discussed to handle IUDs (inserts, updates, deletes) of fixed- or variable-length records; similar logic applies to index files as well.
Index Structures - Cont.
ƒ Here is a list of actions on the sequential file and how they affect the index file:
Action                      Dense   Sparse
------------------------------------------
Create <> overflow block    none    none
Delete <> overflow block    none    none
Create <> sequential block  none    insert
Delete <> sequential block  none    delete
Insert record               insert  update(?)
Delete record               delete  update(?)
Slide record                update  update(?)
–Creating or deleting an empty overflow block has no effect because the index holds pointers only to the sequential (primary) blocks, not to overflow blocks
Index Structures - Cont.
Secondary Index
ƒ This is a dense index, usually with duplicates.
ƒ Does not require the underlying file to be sorted on the search key, as a primary index does.
ƒ The keys in the secondary index are sorted.
ƒ The pointers in one index block can go to many different data blocks.
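The dense and sparse index lookups described above can be sketched with a binary search over the index entries. The toy data file here (two records per block) and the field names are illustrative:

```python
import bisect

# Data file: blocks of (key, record) pairs sorted by key, two records per block.
blocks = [[(10, "a"), (20, "b")], [(30, "c"), (40, "d")], [(50, "e"), (60, "f")]]

# Dense index: every key appears, paired with its (block, slot) location.
dense = [(k, (b, s)) for b, blk in enumerate(blocks) for s, (k, _) in enumerate(blk)]

# Sparse index: only the first key of each block appears.
sparse = [(blk[0][0], b) for b, blk in enumerate(blocks)]

def dense_lookup(key):
    keys = [k for k, _ in dense]
    i = bisect.bisect_left(keys, key)       # binary search on the index
    if i < len(dense) and keys[i] == key:
        b, s = dense[i][1]
        return blocks[b][s][1]
    return None

def sparse_lookup(key):
    keys = [k for k, _ in sparse]
    b = bisect.bisect_right(keys, key) - 1  # last block whose first key <= key
    for k, v in blocks[b]:                  # then scan that one data block
        if k == key:
            return v
    return None

assert dense_lookup(30) == sparse_lookup(30) == "c"
assert dense_lookup(35) is None
```

Note the trade-off visible even in the sketch: the dense index answers "key absent" from the index alone, while the sparse index must fetch and scan a data block.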
ƒ A clustered file structure can help manage the many-one relationship. An example of this is the number of columns or indexes within a table.
ƒ Indirection in Secondary Indexes
–Use buckets in this index scheme to avoid duplicates at the higher level.
–e.g. SELECT title FROM Movie WHERE studioName = 'Disney' AND year = 1995;
–If studioName is the primary index and year is the secondary index, then the number of tuples that satisfy both conditions will be reduced significantly.
Index Structures - Cont.
ƒ Document Retrieval and Inverted Indexes
–The WWW has brought many new requirements for document retrieval and pattern matching. This results in newer and better search engines as time passes.
e.g. Search for all the documents that contain the words "database", "programming", and "implementation".
–A relational view of the document search:
A document may be thought of as a tuple in a relation.
Each attribute/word can be represented as a bit, set to true if the document has at least one match.
Use a secondary index on each of the attributes of Doc, but only keep entries whose search-key value is TRUE.
Instead of creating a separate index for each attribute, the indexes are combined into one, called an inverted index.
The inverted index points us to the bucket entries, where we can "join" (intersect) the lists of pointers to satisfy all three words.
Index Structures - Cont.
B-Trees
ƒ B+ tree
–It automatically maintains as many levels of index as is appropriate for the size of the file being indexed.
–It manages the space on the blocks it uses so that every block is between half used and completely full. No overflow blocks are needed.
ƒ Structure
–It is balanced, which means all paths from the root to a leaf have the same length.
–It contains root, interior, and leaf nodes.
–Each node has n search keys and n+1 pointers.
–The keys in leaf nodes are copies of keys from the data file. These keys are distributed among the leaves in sorted order, from left to right.
Index Structures - Cont.
–At a leaf, the last pointer points to the next leaf block to the right. Unused pointers are set to null.
–To deal with duplicate keys, a duplicate will not appear as a key in the interior nodes.
ƒ Lookup
–Follow the keys until you reach a leaf node, then determine whether the key exists or not.
ƒ Range Queries
–e.g. SELECT * FROM R WHERE R.k >= 10 AND R.k <= 25
–Same lookup steps, but continue to follow each leaf's rightmost pointer until the key condition is no longer true.
ƒ Insertion and Deletion in B-Trees
–May require splitting or merging the leaf and/or higher-level nodes. This is done recursively from the leaf back up to the root.
Index Structures - Cont.
ƒ Efficiency of B-Trees
–If each block has 255 pointers, then a 3-level B-tree can handle up to 16.6 million pointers to records.
–Depending on the implementation, one may or may not physically delete from a B-tree, for further performance enhancement.
Hash Tables
ƒ A hash table has hash keys and a hash function which determines the correct bucket a record belongs to.
ƒ The hash function is very important for a balanced distribution.
ƒ Each bucket can point to a block in secondary storage.
ƒ Use an overflow-bucket chain when the regular bucket is full.
Index Structures - Cont.
ƒ Hash Table Insertion and Deletion
–Insert puts a record into a normal or overflow block.
–Delete removes a record from a normal or overflow block. It may optionally consolidate the blocks of a chain into one fewer block.
ƒ Efficiency of Hash Table Indexes
–Design the bucket sizes to fit into one block. Then only one disk I/O is required for a lookup, and two disk I/O's for an insert or delete.
–Try not to search a chain of blocks in the hash table, which causes many disk I/O's for a single lookup.
ƒ Extensible Hash Tables
–Instead of the array consisting of the data blocks themselves, as in static hash tables, we keep an array of pointers to the blocks.
–Keep, on each block, the number of bits of the hash value used by that block
Index Structures - Cont.
–We may need to split a bucket (and double the array) and redistribute the keys when the bits in use reach the maximum.
–The main advantage of the extensible hash table is that when looking for a record, we never need to search more than one data block.
–It may require additional I/O when the bucket array no longer fits in main memory.
ƒ Linear Hash Tables
–The number of buckets n is always chosen so the average number of records per bucket is a fixed fraction, say 80%, of the number of records that fill one block.
–Since blocks cannot always be split, overflow blocks are permitted.
–If we exceed the n bucket numbers representable with i bits, we add 1 to i, meaning one extra bit is used at the front of the hash value, since there are 2^i bucket numbers in this case.
Multidimensional and Bitmap Indexes
Multiple Dimensional Applications
ƒ Geographic Information Systems (GIS)
–Objects are stored in a typically two-dimensional space.
–e.g. DBs are maps with houses, roads, bridges, and other physical objects.
–Examples of the types of queries are:
Partial match queries - Specify values for 1 or more dimensions and look for all points matching those values in those dimensions.
Range queries - Specify ranges for 1 or more of the dimensions and get the set of points within those ranges.
Nearest-neighbor queries - Ask for the closest point to a given point.
Where-am-I queries - Given a point, find the current location (e.g. x-y coordinates on a map).
Multidimensional and Bitmap Indexes - Cont.
ƒ Data Cubes
–See the data as existing in a high-dimensional (multidimensional) space.
–Each tuple is a point in the space. Queries group and aggregate these points for decision-support applications.
ƒ Multidimensional Queries in SQL
–A typical relational query:
SELECT day, store, COUNT(*) AS totalSales FROM sales WHERE item = 'shirt' AND color = 'pink' GROUP BY day, store;
ƒ Range Queries Using Conventional Indexes
–Assume attributes x and y each have their own B-tree index file.
Then find all x that fit into a given range, find all y that fit into a given range, and intersect these pointer sets.
Multidimensional and Bitmap Indexes - Cont.
ƒ Nearest-Neighbor Queries Using Conventional Indexes
–Using the x,y coordinates example: the closest x or the closest y alone can't determine the actual distance to the target point.
ƒ Two Categories of Multidimensional Index Structures
–Hash-table-like approaches.
–Tree-like approaches.
Hash-Like Structures for Multi-D Data
ƒ Grid Files
–Using a 2-dimensional example, we can divide the 2-D space using grid lines. This results in a set of grid partitions.
Multidimensional and Bitmap Indexes - Cont.
ƒ Lookup in a Grid File
–Each region can be viewed as a hash-table bucket, and each point in that region is placed in a block belonging to that bucket.
–To find a point, use the values of x and y to locate a region, then search the data blocks within that bucket.
ƒ Insertion Into Grid Files
–Same logic as in lookup to locate the correct bucket.
–Add overflow blocks as needed.
–Reorganize the structure by adding or moving the grid lines. This is similar to the dynamic hashing technique in section 13.4.
ƒ Performance of Grid Files
–A problem arises in the high-dimensional case, where the number of buckets grows exponentially with the dimension.
Multidimensional and Bitmap Indexes - Cont.
–The choice of grid lines is important so that the data is evenly distributed.
–Behavior on different types of queries:
Lookup of Specific Points - Just follow the path to the proper bucket and search.
Partial-Match Queries - Use only the grid line(s) needed to find the bucket(s) and search.
Range Queries - If a range falls between grid lines, include the border buckets in the search.
Nearest-Neighbor Queries - If the grid lines are not proportional, we may have to search buckets that are not adjacent to the one containing the point P.
ƒ Partitioned Hash Functions
–Use the set of attribute values to develop the hash function.
Multidimensional and Bitmap Indexes - Cont.
–If we take 2 hash values from x and 3 from y, there are 6 combinations, which can be represented with only 3 bits in the hash key.
ƒ Comparison of Grid Files and Partitioned Hashing
–Partitioned hash tables are useless for nearest-neighbor or range queries, but a well-chosen hash function will spread records evenly across the buckets. If we only need to support partial-match queries, a partitioned hash table is likely to outperform the grid file.
–The grid file is superior for most types of queries, but when the dimension is very large there can be many buckets that are empty or nearly empty, so we must keep a larger in-memory structure than necessary.
Multidimensional and Bitmap Indexes - Cont.
Tree-Like Structures for Multi-D Data
ƒ Multiple-Key Indexes
–Use the root/first level for attribute 1, the second level for attribute 2, etc.
–Performance
Partial-Match Queries - If the attributes in the top levels are specified, the access is quite efficient. Otherwise we must search every subindex, which can be time-consuming.
Range Queries - If the individual indexes support range queries on their attributes, the query will be fairly easy.
ƒ kd-Trees (k-dimensional search trees)
–A binary tree in which each interior node has an associated attribute A and a value V that splits the data points into two parts: those less than V and those greater than or equal to V. The levels of the tree rotate through the attributes of all dimensions.
Multidimensional and Bitmap Indexes - Cont.
–Interior nodes have an attribute, a dividing value, and left and right pointers.
–Leaf nodes are blocks with space for as many records as a block can hold. Splitting a block requires adding a new interior node.
–Performance
Partial-Match Queries - If a value is available for a given attribute, go left or right. Otherwise we must examine both the left and right children.
Range Queries - If the range straddles the splitting value at a node, we must explore both children. The farther the node is from the leaves, the more possibilities there are to examine.
Nearest-Neighbor Queries - Treat the problem as a range query and repeat with a larger range if necessary.
–We can optimize a kd-tree by using B-tree-like nodes as its interior nodes. Alternatively, we can pack several interior nodes into a single block, which reduces the number of disk I/O's.
Multidimensional and Bitmap Indexes - Cont.
ƒ Quad Trees
–Assume that N records fit into a block. Starting from the root, if a region has more than N records, divide it into 4 quadrants, splitting according to chosen values of the attribute pair. Recurse until all nodes satisfy the N-record requirement.
ƒ R-Trees (region trees)
–Use data regions (leaf nodes) to represent data block(s) and interior regions as links to the data regions.
–Keep the overlap among regions to a minimum.
Bitmap Indexes
ƒ A bitmap index for a field F is a collection of bit-vectors of length N, one for each possible value that may appear in the field F. For example, if there are 6 records, we use 6 bits per vector.
Multidimensional and Bitmap Indexes - Cont.
ƒ Motivation for Bitmap Indexes
–Questions: too much space? dependent on the number of records?
–A major advantage of bitmap indexes is that they allow us to answer partial-match queries very efficiently. Find the proper vectors for attributes A and B, then take the bitwise AND of these vectors.
–For range queries, we take the vectors in the range of A and bitwise OR them together. Second, we take the vectors in the range of B and bitwise OR them together. Lastly, we bitwise AND the two results, and we have identified exactly the records that satisfy the query.
ƒ Compressed Bitmaps
–Use an encoding algorithm to reduce the number of bits required in storage.
Multidimensional and Bitmap Indexes - Cont.
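The partial-match (bitwise AND) and range (OR then AND) queries described above can be sketched with bit-vectors stored as Python integers, one bit per record. The six sample records and their fields are illustrative:

```python
# Six records: (movie rating, year). One bit-vector per distinct value.
records = [("PG", 1995), ("R", 1995), ("PG", 1996),
           ("G", 1997), ("PG", 1997), ("R", 1996)]

def bitmap(field):
    """Build one bit-vector per distinct value: bit i is set iff record i has it."""
    vectors = {}
    for i, rec in enumerate(records):
        vectors[rec[field]] = vectors.get(rec[field], 0) | (1 << i)
    return vectors

rating, year = bitmap(0), bitmap(1)

# Partial match: rating = 'PG' AND year = 1997 -> bitwise AND of two vectors.
hits = rating["PG"] & year[1997]
assert [i for i in range(len(records)) if hits >> i & 1] == [4]

# Range: year in 1995..1996 AND rating = 'R' -> OR the year vectors, then AND.
in_range = (year[1995] | year[1996]) & rating["R"]
assert [i for i in range(len(records)) if in_range >> i & 1] == [1, 5]
```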
ƒ Managing Bitmap Indexes
–Finding Bit-Vectors
We can use B-trees or hash tables to locate the key-pointer pairs, which lead us to the bit-vector for the key value.
–Finding Records
Once a specific record number is found, we can use a secondary index on the data file whose search key is the record number.
–Handling Modifications to the Data File
Record numbers must remain fixed once assigned.
If a record is deleted, it must be replaced by a "tombstone".
Inserting a value that already has a corresponding bit-vector just turns on the new bit. If the value is new, we create a bit-vector and add it to the secondary index that is used to find the bit-vectors.
Updating a value from one bit-vector to another requires updating both bit-vectors.
Query Execution
Parts of the Query Processor
query
  |
  v
Query Compilation  <--  metadata
  |
  v  (query plan)
Query Execution  <-->  data
Query Execution - Cont.
ƒ Scanning Tables
–The most basic thing we can do in a physical query plan is to read the entire contents of a relation R.
–Two approaches to scan the tuples of a relation:
Table-scan - Read the blocks containing all the tuples of R.
Index-scan - An index on an attribute A of R can be used to get all the tuples of R.
ƒ Sorting While Scanning Tables
–An ORDER BY clause requires the tuples to be sorted. Additionally, various algorithms for relational-algebra operations require one or both arguments to be sorted relations.
–The physical-query-plan operator sort-scan takes a relation R and a set of attributes on which the sort is to be made, and produces R in that sorted order.
Query Execution - Cont.
–Several ways to implement sort-scan:
If there is an index on the attribute(s), an index-scan will produce R in the desired order.
If all the tuples fit into main memory, we can do a table-scan and apply one of the efficient main-memory sorting algorithms.
If R is too large to fit in main memory, we can apply the TPMMS (two-phase, multiway merge sort) algorithm, generating the result to memory rather than to disk.
ƒ Model of Computation for Physical Operators
–To estimate the cost of each operator, we use the number of disk I/O's as the measure of an operation's cost.
–When comparing algorithms for the same operation, we assume that the arguments of any operator are found on disk, but the result of the operator is left in main memory.
–In most cases, the result of an operator is not written to disk but is passed in memory from one operator to another.
Query Execution - Cont.
ƒ Parameters for Measuring Costs
–The parameters are usually called statistics. Incorrect statistics can fool the query optimizer into choosing the wrong plan, which may hurt the performance of the query.
–There are 3 parameter families: B, T, and V.
B(R) denotes the number of blocks needed to hold all the tuples of R. We usually assume that R is clustered (stored on nearby blocks).
T(R) denotes the number of tuples in R. Then T(R) / B(R) is the ratio of tuples per block.
V(R,a) denotes the number of distinct values that appear in column a of relation R.
ƒ Iterator
–A group of three functions that allows a consumer of the result of a physical operator to get the result one tuple at a time.
–The 3 functions are Open, GetNext, and Close.
Query Execution - Cont.
Open - Initializes any data structures needed to perform the operation and calls Open for any arguments of the operation. It does not actually get any tuples.
GetNext - Returns the next tuple in the result. If there are no more tuples, it returns NotFound.
Close - Ends the iteration after all required tuples have been consumed or obtained.
–Iterators are used to implement many of the physical operators, such as a table-scan.
One-Pass Algorithms for DB Operations
ƒ Three classes of algorithms for operators: sorting-based, hash-based, and index-based.
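The Open / GetNext / Close iterator interface above can be sketched as follows; the class names and the NotFound sentinel representation are illustrative choices, and the example pipelines a selection on top of a table-scan:

```python
NOT_FOUND = object()  # sentinel returned by GetNext when tuples are exhausted

class TableScan:
    """Iterator implementing Open / GetNext / Close over a relation R."""
    def __init__(self, relation):
        self.relation = relation
    def open(self):
        self.pos = 0                 # initialize state; fetch no tuples yet
    def get_next(self):
        if self.pos >= len(self.relation):
            return NOT_FOUND
        t = self.relation[self.pos]
        self.pos += 1
        return t
    def close(self):
        self.pos = None

class Select:
    """A selection operator consuming another iterator one tuple at a time."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self):
        self.child.open()            # Open also opens its arguments
    def get_next(self):
        while (t := self.child.get_next()) is not NOT_FOUND:
            if self.predicate(t):
                return t
        return NOT_FOUND
    def close(self):
        self.child.close()

op = Select(TableScan([1, 5, 8, 12]), lambda t: t > 4)
op.open()
out = []
while (t := op.get_next()) is not NOT_FOUND:
    out.append(t)
op.close()
assert out == [5, 8, 12]
```

Because each GetNext produces one tuple, the whole pipeline runs without ever materializing an intermediate relation.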
ƒ There are one-, two-, and three-or-more-pass algorithms, depending on the data size. An example of a two-pass algorithm is TPMMS.
ƒ There are three classifications of operators:
Query Execution - Cont.
–Tuple-at-a-time, unary operations - selection and projection do not require an entire relation at once; we can read a block at a time and produce our output.
–Full-relation, unary operations - these one-argument operations require seeing all or most of the tuples in memory at once. An example is the grouping operator.
–Full-relation, binary operations - all other operations, such as union, intersection, difference, joins, and products.
ƒ Tuple-at-a-Time Operations
–These can use a table-scan or index-scan. If the selection is on an attribute that has an index, only a subset of the tuples will be returned.
–Performance can be improved by disk clustering.
Query Execution - Cont.
ƒ Unary, Full-Relation Operations
–Duplicate Elimination
Read one tuple at a time. If it is a duplicate, skip it; otherwise add it to memory.
If the number of tuples is large, memory may not be able to hold all the unique entries. Use an efficient data structure such as a BST or hash table for finding duplicates.
–Grouping
For MIN(a) or MAX(a), just keep the minimum or maximum value while looping through all the tuples.
For COUNT, add one for each tuple of the group that is seen.
For SUM(a), add the value of attribute a to the total for its group.
For AVG(a), compute both COUNT and SUM(a) at the same time, then take the quotient of the sum and the count to get the average.
Query Execution - Cont.
ƒ One-Pass Algorithms for Binary Operations
–Use a B or S subscript next to the operators to denote the bag or set version of each operator.
–Bag Union
Read all B(R) + B(S) blocks and output all the tuples.
–Set Union
Read table S into M-1 buffers in main memory and build a search structure.
Read each block of table R into the Mth buffer, one at a time. For each tuple t of R, if t is in S, skip it; otherwise copy it to the output.
–Set Intersection
Same logic as set union, except that if a tuple t of R is found in S, copy it; otherwise skip it.
–Set Difference
Assuming R is the larger relation, read S into memory.
R-S: for each t in R, if t is in S, ignore it; otherwise copy it to the output.
S-R: for each t in R, if t is in S, delete t from S in memory; otherwise do nothing. The tuples remaining in S at the end are the result.
Query Execution - Cont.
–Bag Intersection
Read table S into M-1 buffers in main memory and build a search structure. If there are duplicates in S, keep only one copy, with a counter.
Read each block of table R into the Mth buffer, one at a time. For each tuple t of R, if t is in S and its counter is not 0, copy t to the output and decrement the counter; otherwise skip it.
–Bag Difference
S-R: Read the distinct tuples of S into memory, each with a count. For each t in R, if t is in S, decrement its count. At the end, output the tuples whose counts remain greater than 0.
R-S: Read the distinct tuples of S into memory, each with a count. For each t in R, if t is not in S, copy it to the output. Otherwise, if its count = 0, copy t to the output; if its count > 0, do not copy t but decrement the count.
–Product
Read S into M-1 buffers in memory. Read each block of R into the Mth buffer. For each tuple t in R, concatenate t with each tuple of S to form the output.
Query Execution - Cont.
–Natural Join
If we have R(x,y) and S(y,z), with y the attributes common to both relations, compute the natural join as follows:
Read all of S into M-1 buffers of memory, using y as the search key, typically with a balanced tree or hash table.
Read each block of R into the remaining buffer. For each tuple t of R, find the tuples of S that agree with t on all attributes of y, using the search structure. For each matching tuple of S, output the joined tuple.
Nested-Loop Joins
ƒ Tuple-Based Nested-Loop Join
–If we have R(x,y) and S(y,z), this algorithm computes the join as follows:
FOR each tuple s in S DO
  FOR each tuple r in R DO
    IF r and s join to make a tuple t THEN output t;
Query Execution - Cont.
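The tuple-based nested-loop pseudocode above, as a runnable sketch for R(x,y) and S(y,z); the sample relations are illustrative:

```python
def nested_loop_join(R, S):
    """Tuple-based nested-loop join of R(x, y) and S(y, z) on the common y."""
    out = []
    for (y_s, z) in S:          # outer loop over S
        for (x, y_r) in R:      # inner loop over R
            if y_r == y_s:      # r and s join: emit the combined tuple (x, y, z)
                out.append((x, y_r, z))
    return out

R = [(1, "a"), (2, "b"), (3, "a")]
S = [("a", 10), ("b", 20)]
assert nested_loop_join(R, S) == [(1, "a", 10), (3, "a", 10), (2, "b", 20)]
```

In terms of disk I/O this is the worst join algorithm, roughly T(R) * T(S) tuple accesses; the block-based variant that follows improves it by reading whole blocks and keeping M-1 buffers of the outer relation in memory.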
ƒ Block-Based Nested-Loop Join Algorithm
–Improvements over the tuple-based method:
Organize access to both argument relations by blocks.
Use M-1 buffers to store blocks of the relation in the outer loop.
Two-Pass Algorithms Based on Sorting
ƒ Same idea as TPMMS
–Read M blocks of R into main memory.
–Sort these M blocks using an efficient main-memory sorting algorithm.
–Write the sorted list into M blocks of disk.
–Use a second pass to merge the sorted sublists and execute the desired operator.
ƒ Duplicate Elimination
–In the second pass, copy each tuple to the output while skipping duplicates.
Query Execution - Cont.
ƒ Grouping and Aggregation
–In the first pass, sort each run of M blocks using the grouping attributes of L as the sort key.
–In the second pass, accumulate the needed aggregates for each sort-key value v.
–When all the tuples with sort key v have been seen, output a tuple consisting of the result for this group of L.
ƒ Union
–Bag union does not require two passes.
–Set union works very much like duplicate elimination, except for the following:
In pass one, sort both relations, R and S.
In pass two, read the sublists of both R and S into the M buffers and merge them without duplicates until either R or S is exhausted, then finish merging the remaining relation.
Query Execution - Cont.
ƒ Intersection and Difference
–Pass one remains the same in all cases. The output conditions are determined in pass two:
Set intersection - output t if it appears in both R and S.
Bag intersection - output t the minimum of the number of times it appears in R and in S.
Set difference - output t if and only if it appears in R but not in S.
Bag difference - output t the number of times it appears in R minus the number of times it appears in S.
ƒ Simple Sort-Based Join Algorithm
–Given R(X,Y) and S(Y,Z), sort R and S independently using TPMMS with sort key Y.
–Merge the sorted R and S, outputting joined tuples whenever R and S agree on the sort-key value y.
We need two large buffers for this merge.
Query Execution - Cont.
ƒ More Efficient Sort-Based Join
–We can improve the simple join by sorting R and S only into sublists.
–Bring the first block of each sublist into a buffer, assuming there are no more than M sublists in all.
–Repeat the same merging algorithm with M buffers rather than just two.
Two-Pass Algorithms Based on Hashing
ƒ Partitioning Relations by Hashing
–Initialize M-1 buckets using M-1 empty buffers. Read the relation one block at a time into the Mth buffer. For each tuple t in the block, copy t to the bucket chosen by h(t).
–When a bucket's buffer fills, write it to disk, and continue until all the tuples have been read.
Query Execution - Cont.
ƒ Duplicate Elimination
–Since copies of the same tuple hash to the same bucket, process one bucket at a time and output the distinct tuples.
–If M is not large enough to hold the distinct tuples of a bucket, an extra pass may be needed.
ƒ Grouping and Aggregation
–Change the hash function to depend on the grouping attributes of the list L.
–In the second pass, we only need one record per group as we process each bucket.
ƒ Union, Intersection, Difference
–We must use the same hash function for both R and S.
–Apply the one-pass algorithm to each pair of corresponding buckets, depending on the operator.
ƒ Hash-Join Algorithm
–Use the join attribute(s) in the hash function.
–Join R and S bucket by corresponding bucket.
Query Execution - Cont.
Index-Based Algorithms
ƒ Clustering and Nonclustering Indexes
–A relation is "clustered" if its tuples are packed together, minimizing the number of blocks used.
–A clustering index is one whose search key packs the matching tuples into the least number of blocks.
ƒ Index-Based Selection
–An index-scan can significantly reduce the number of tuples retrieved, and can thus remove the need for the two-pass algorithms.
ƒ Joining by Using an Index
–The bitmap index structure is an example of an unordered index join.
ƒ Joins Using a Sorted Index
–A zig-zag join is possible when we have sorted indexes on Y for both R and S; then tuples that cannot join need not be retrieved during the intermediate steps.
Query Execution - Cont.
Buffer Management
ƒ Accepts requests for main-memory access to disk blocks. Obviously not all of M is available at once during query processing.
ƒ BM Architecture
–Two approaches:
Control main memory directly; common in many relational DBMS's. e.g. reserve a fixed-size buffer pool, based on a configuration variable, at startup time.
Use virtual memory, letting the OS decide which buffers are actually in main memory and which are in swap space on disk.
ƒ BM Strategies
–The buffer-replacement strategies are similar to those used elsewhere, such as in the OS scheduler:
–Least-Recently Used (LRU)
–First-In-First-Out (FIFO)
Query Execution - Cont.
–The "Clock" Algorithm
Line up all the buffers in a circle. When we read into a buffer, set its flag to 1 (the default is 0).
Look for the next buffer with flag 0; if you pass a buffer with flag = 1, set it to 0 and continue the search.
–Pinned Blocks
Change the LRU, FIFO, or Clock policy to allow a block to be pinned (possibly with a timeout).
In the Clock algorithm, we can use the flag as a reference count and decrement it when the block is not being used.
ƒ Physical Operator Selection and the BM
–The query optimizer selects a set of physical operators to execute a given query.
–The selection may assume that a certain number of buffers M is available at that time, but this is not guaranteed by the BM.
Query Execution - Cont.
–Impact on algorithms when M shrinks:
In sort-based algorithms for certain operators, we can still function by reducing the size of a sublist. If there are too many sublists, however, merging becomes a problem.
Main-memory sorting of lists is not affected.
In hash-based algorithms, we can reduce the number of buckets, but this becomes a problem if each bucket grows so large that it doesn't fit in main memory.
Once the first pass has started, the number of buckets remains fixed throughout that pass.
Algorithms Using More Than Two Passes
ƒ Multipass Sort- and Hash-Based Algorithms
–A natural extension of the two-pass algorithms: instead of 2 passes, use k passes with M buffers.
–The performance of multipass algorithms still depends on the number of disk/buffer I/O's.
Query Execution - Cont.
Parallel Algorithms for Relational Operations
ƒ Models of Parallelism
–With p processors, M memory buffers, and d disks, here are the three most important classes of parallel machines:
Shared Memory - If each processor has its own cache, it is important for a processor to ensure the data it needs is actually in its cache; otherwise it must fetch from shared memory.
Shared Disk - Each processor has its own private memory, which is not shared with other processors. All the disks are shared, so the burden of competing requests falls on the disk controllers.
Shared Nothing - Each processor has its own memory and disk(s). It is very efficient unless one processor has to communicate with another, which is usually handled with message queues.
Query Execution - Cont.
ƒ Tuple-at-a-Time Operations in Parallel
–If the relation R is evenly divided among d disks with p processors, and d and p are approximately equal, we can achieve maximum speedup for the projection operator. A selection operator, on the other hand, may or may not benefit from this setup.
ƒ Parallel Algorithms for Full-Relation Operations
–We can further adapt the algorithms to process the independent components of any operator in parallel.
ƒ Performance of Parallel Algorithms
–The idea is to reduce the elapsed time to 1/p of the serial time, where p is the number of processors.
–Hashing is best suited to distributing the relation R across multiple disks using p processors.
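Partitioning by hashing, used above both for the two-pass hash-based algorithms (M-1 buckets) and for distributing R across p processors or disks, can be sketched as a simplified in-memory version; the sample relation is illustrative:

```python
def partition(relation, key, p):
    """Distribute tuples into p buckets by hashing the join/grouping attribute.
    Equal key values always land in the same bucket, so each bucket (or
    processor) can then be processed independently."""
    buckets = [[] for _ in range(p)]
    for t in relation:
        buckets[hash(t[key]) % p].append(t)
    return buckets

R = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
buckets = partition(R, 0, 3)

# Every copy of a given key value lives in exactly one bucket.
for value in ("a", "b", "c"):
    holders = [b for b in buckets if any(t[0] == value for t in b)]
    assert len(holders) == 1

# All tuples with key 'a' ended up together, in their original order.
a_bucket = next(b for b in buckets if any(t[0] == "a" for t in b))
assert [t for t in a_bucket if t[0] == "a"] == [("a", 1), ("a", 3)]
```

The same property is what makes hash-join correct: after partitioning R and S with the same hash function, only corresponding buckets can contain joining tuples.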
The Query Compiler
Parsing
ƒ Syntax Analysis and Parse Trees
–A parser takes language text such as SQL and converts it to a parse tree.
Atoms - basic lexical elements such as symbols, keywords, identifiers, etc. An atom has no children.
Syntactic categories - represented by angle brackets around a descriptive name. A category is defined by rules that are composed of other syntactic categories or atoms.
ƒ Grammar for a Simple Subset of SQL
–<Query> ::= <SFW>
–<Query> ::= ( <Query> )
–<SFW> ::= select <SelList> from <FromList> where <Condition>
–<SelList> ::= <Attribute> , <SelList>
–<SelList> ::= <Attribute>
The Query Compiler - Cont.
–<FromList> ::= <Relation> , <FromList>
–<FromList> ::= <Relation>
–<Condition> ::= <Condition> and <Condition>
–<Condition> ::= <Tuple> in <Query>
–<Condition> ::= <Attribute> = <Attribute>
–<Condition> ::= <Attribute> like <Pattern>
–Example:
–StarsIn(movieTitle, movieYear, starName)
–MovieStar(name, address, gender, birthdate)
–From the following query:
SELECT movieTitle
FROM StarsIn
WHERE starName IN (
    SELECT name
    FROM MovieStar
    WHERE birthdate LIKE '%1960' );
The Query Compiler - Cont.
ƒ The Preprocessor
–If a view is used in the from-list instead of a base table, the preprocessor substitutes the view's definition for it.
–Responsible for semantic checking:
Check for valid relation uses
Check and resolve attribute uses
Check data types
–Once the parse tree passes all the tests, it is given to the logical query-plan generator.
Algebraic Laws for Improving Query Plans
ƒ Commutative and Associative Laws
–Commutative - order is not important, e.g. x + y = y + x and x * y = y * x
–Associative - grouping is not important, e.g. (x + y) + z = x + (y + z) and (x * y) * z = x * (y * z)
The Query Compiler - Cont.
ƒ Laws Involving Selection
–Splitting laws for multi-condition cases
SEL c1 AND c2 (R) = SEL c1(SEL c2(R))
SEL c1 OR c2 (R) = (SEL c1(R)) ∪ (SEL c2(R))
–Laws that allow us to push selections through the binary operators: product, union, intersection, difference, and join.
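The splitting laws above can be checked mechanically by modeling relations as Python sets of tuples. The relation R and the predicates c1, c2 below are made-up examples, not anything from the notes.

```python
# Minimal sketch of the selection-splitting laws, with relations as
# sets of tuples; R, c1, and c2 are hypothetical.
R = {(1, 10), (2, 20), (3, 30), (4, 40)}

def select(pred, rel):
    """SEL pred (rel): keep exactly the tuples satisfying pred."""
    return {t for t in rel if pred(t)}

c1 = lambda t: t[0] > 1      # condition on the first attribute
c2 = lambda t: t[1] < 40     # condition on the second attribute

# AND splits into a cascade of selections:
assert select(lambda t: c1(t) and c2(t), R) == select(c1, select(c2, R))
# OR splits into a union of selections:
assert select(lambda t: c1(t) or c2(t), R) == select(c1, R) | select(c2, R)
```

The same set-based modeling works for the push-through laws that follow: each law is an equality between two ways of composing `select` with a binary set operator.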
For a union, the selection must be pushed to both arguments. For a difference, the selection must be pushed to the first argument and optionally to the second. For the other operators, it is only required that the selection be pushed to one argument.
Union law: SEL c(R ∪ S) = SEL c(R) ∪ SEL c(S)
Difference law: SEL c(R - S) = SEL c(R) - S, or SEL c(R - S) = SEL c(R) - SEL c(S)
The Query Compiler - Cont.
For product, intersection, and joins: if R has all the attributes mentioned in c, then we can write SEL c(R x S) = SEL c(R) x S; if c has only attributes of S, then SEL c(R x S) = R x SEL c(S); and if R and S both have all the attributes of c, then we can use the law SEL c(R ∩ S) = SEL c(R) ∩ SEL c(S).
ƒ Pushing Selections
–One main strategy is to push the selection operators down to the lowest possible level.
–This strategy reduces the size of the intermediate results passed upward in the plan tree.
ƒ Laws Involving Projection
–We may introduce a projection anywhere in an expression tree, as long as it eliminates only attributes that are never used by any of the operators above and are not in the result of the entire expression.
Coping With System Failures
Issues and Models for Resilient Operation
ƒ Failure Modes
–There are many different things that can go wrong while a database is being queried and modified. Here are several types of failures.
Erroneous Data Entry
These errors can be dealt with using SQL features such as constraints and triggers.
Media Failures
Use one of the RAID schemes.
Maintain an archive/backup of the database and its logs for recovery.
Use distributed redundant copies of the database running simultaneously.
Catastrophic Failure
Events such as explosions, fires, or vandalism at the location of the database. The archive and redundant-copies methods can be used for recovery.
Coping With System Failures Cont.
System Failure
The processes that query and modify the database are called transactions.
There are a few cases where transactions are not completed and their transaction states are lost. Such events include power failures, operating system crashes, and software crashes. Recovering from this type of failure requires all database transactions to be logged. When the DBMS is restarted, these logs are used to "undo" the uncommitted transactions.
ƒ Transactions (Ts)
–Relationship between the query processor (QP), transaction manager (TM), log manager (LM), buffer manager (BM), and recovery manager (RM):
QP --- TM --- LM
   \    |    /
    BM --- RM
        |
 Disk (Data & Log)
Coping With System Failures Cont.
ƒ Correct Execution of Transactions
–The basic elements of a T include relations, disk blocks or pages, and individual tuples.
–The database can be in a consistent or an inconsistent state. The Correctness Principle states that if a T starts in a consistent state and executes with no other Ts and no system errors, then when it ends the database is again in a consistent state.
–A T is atomic: all or nothing. If only part of a T is executed, the DB is likely to be inconsistent.
–Simultaneous Ts on the same database elements can result in an inconsistent DB unless proper control or locking techniques are used.
ƒ Primitive Operations of Transactions
–There are three layers of address space: disk blocks <-> main memory (M) <-> the transaction's local address space.
Coping With System Failures Cont.
Here are some sample primitives of a transaction, assuming that X fits in a single disk block.
INPUT(X) - copy the disk block containing DB element X to a memory buffer.
READ(X,t) - copy the DB element X to the transaction's local variable t. If X is not in memory, do INPUT(X) first.
WRITE(X,t) - copy the value of local variable t to the DB element X in a memory buffer. If X is not in memory, do INPUT(X) first.
OUTPUT(X) - copy the block containing X from its buffer to disk.
For example, suppose a T performs READ(X,t), WRITE(X,t), OUTPUT(X) in this sequence, and the system crashes before OUTPUT(X) completes.
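The four primitives above can be sketched with dictionaries standing in for the three address spaces. This is a toy model under obvious assumptions: "disk", "buffers", and "local" are plain dicts, and there is no buffer manager or crash simulation beyond the comment.

```python
# Minimal sketch of the INPUT/READ/WRITE/OUTPUT primitives; the
# element name "X" and its values are hypothetical.
disk = {"X": 5}      # durable storage
buffers = {}         # buffer-manager memory
local = {}           # transaction's local address space

def INPUT(x):
    buffers[x] = disk[x]            # disk block -> memory buffer

def READ(x, t):
    if x not in buffers:
        INPUT(x)
    local[t] = buffers[x]           # buffer -> local variable

def WRITE(x, t):
    if x not in buffers:
        INPUT(x)
    buffers[x] = local[t]           # local variable -> buffer only!

def OUTPUT(x):
    disk[x] = buffers[x]            # buffer -> disk

READ("X", "t")
local["t"] += 1
WRITE("X", "t")
# Until OUTPUT("X") runs, the new value lives only in the buffer:
# a crash at this point leaves disk["X"] at the old value 5.
OUTPUT("X")
```

The gap between WRITE and OUTPUT is precisely why the crash scenario in the text leaves the database inconsistent and why logging is needed.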
Then this DB relation is in an inconsistent state, and proper log files are required to undo or redo the operation(s).
Coping With System Failures Cont.
Undo Logging
ƒ A log is a sequence of log records. Each T can have multiple log entries. In most cases there are many concurrent Ts, so the log records of a single T are not necessarily contiguous in the log.
ƒ Log Records
–There are several forms of log record discussed in this chapter:
<START TID> - this record indicates that T TID has begun.
<COMMIT TID> - this indicates that T TID has completed successfully. Its changes should appear on disk after the commit, but the BM decides exactly when to write/flush memory to disk.
<ABORT TID> - T TID has failed. The LM ensures that none of its changes are written to disk.
<TID,X,v> - an update record, where v is the old value of X. This log record corresponds to a WRITE action, not an OUTPUT action.
Coping With System Failures Cont.
ƒ Undo-Logging Rules
–U1: If T modifies DB element X, then the log record of the form <TID,X,v> must be written to disk before the new value of X is written to disk.
–U2: If a T commits, then its COMMIT log record must be written to disk only after all the DB elements changed by the transaction have been written to disk.
–The steps of a T therefore reach disk in the following order:
a) log record(s) recording the changed DB elements
b) the changed DB elements themselves
c) the COMMIT log record
–The LM needs a flush-log command to tell the BM to write the log buffers to disk.
–If two Ts update the same element, a locking scheme must be in place to prevent an element from being written to disk prematurely.
Coping With System Failures Cont.
ƒ Recovery Using Undo Logging
–The first task of the RM is to separate the committed from the uncommitted transactions, e.g. if there is a <START TID> entry, we expect to find a matching <COMMIT TID> entry; otherwise the T is considered uncommitted.
–Once we know all the uncommitted TIDs, we scan the log entries in reverse order (typically ordered by time). There are a few different scenarios:
If the <COMMIT TID> record got to disk before the crash, everything is OK.
If the <COMMIT TID> record is not on disk but some changes are, we need to undo those changes and write an <ABORT TID> record to the log.
If none of the element changes made it to disk, we just abort the T without the undo step.
Coping With System Failures Cont.
ƒ Checkpointing
–We use checkpoints to eliminate the need to scan the entire log file.
–One technique: stop accepting new Ts; wait until all active Ts have a COMMIT or ABORT record on the log; flush the log to disk; write a log record <CKPT> and flush the log again; finally, resume accepting Ts.
ƒ Nonquiescent Checkpointing
–The simple checkpoint method requires all incoming Ts to be paused. In most cases this behavior is unacceptable in a production environment.
–Instead, keep accepting Ts while we write a log record <START CKPT (T1, T2, ..., Tk)> and flush the log, where T1, ..., Tk are all the active transactions. When all of T1, ..., Tk have committed or aborted, write a log record <END CKPT> and flush the log.
Coping With System Failures Cont.
Redo Logging
ƒ Differences Between Undo and Redo Logging
–Undo logging cancels the effect of incomplete Ts and ignores committed ones during recovery; redo logging ignores incomplete Ts and repeats the changes made by committed Ts.
–Undo logging requires us to write changed DB elements to disk before the COMMIT log record reaches disk; redo logging requires that the COMMIT record appear on disk before any changed values reach disk.
–Undo logging needs the old values of changed DB elements to recover using rules U1 and U2; redo logging needs the new values instead.
Coping With System Failures Cont.
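The undo-recovery scan above can be sketched as follows. Assumptions: the log is an in-memory list of tuples, `("UPDATE", tid, x, old_value)` plays the role of `<TID,X,v>`, and flushing is omitted.

```python
# Minimal sketch of recovery with an undo log: scan backwards and
# restore old values for uncommitted transactions.  The log contents
# below are hypothetical.
def undo_recover(log, db):
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
    aborted = set()
    for rec in reversed(log):                 # newest record first
        if rec[0] == "UPDATE":
            tid, x, old_value = rec[1], rec[2], rec[3]
            if tid not in committed:
                db[x] = old_value             # undo the change
                aborted.add(tid)
    # Write an <ABORT TID> record for each T we rolled back.
    log.extend(("ABORT", tid) for tid in aborted)

db = {"A": 99}                                # value left by the crash
log = [("START", "T1"), ("UPDATE", "T1", "A", 1)]  # no COMMIT record
undo_recover(log, db)
# T1 never committed, so A is restored to its old value 1.
```

Scanning newest-first matters when one uncommitted T updated the same element several times: the oldest record, processed last, wins, leaving the pre-transaction value in place.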
ƒ Redo-Logging Rule
–R1: Before modifying any DB element X on disk, all log records pertaining to this modification of X, including both the update record <TID,X,v> and the <COMMIT TID> record, must appear on disk.
–In the redo context, the steps of a T therefore reach disk in the following order:
a) log record(s) recording the changed DB elements
b) the COMMIT log record
c) the changed DB elements themselves
ƒ Recovery With Redo Logging
–1) Identify the committed Ts.
–2) Scan the log forward from the beginning. For each log record <TID,X,v> encountered:
a) If TID is not a committed T, do nothing.
b) If TID is committed, write value v for DB element X.
Coping With System Failures Cont.
–3) For each incomplete transaction TID, write an <ABORT TID> record to the log and flush the log.
ƒ Checkpointing a Redo Log
–1) Write a log record <START CKPT (T1,...,Tk)>, where T1,...,Tk are all the active (uncommitted) Ts, and flush the log.
–2) Write to disk all DB elements that were written to buffers but not yet to disk by Ts that had already committed when the START CKPT record was written to the log.
–3) Write an <END CKPT> record to the log and flush the log.
ƒ Recovery Using a Checkpointed Redo Log
–If the last checkpoint record is an END CKPT:
We know that every T that committed before the matching START CKPT has had its changes written to disk.
Coping With System Failures Cont.
Any T(i) in the T1,...,Tk list, or any T that started after the START CKPT, can still have changes that have not yet migrated to disk, even though the T has committed. For these we follow the recovery steps described earlier.
–If the last checkpoint record is a START CKPT:
We cannot be sure that Ts committed before the start of this checkpoint had their changes written to disk. Therefore we find the most recent previous END CKPT record and proceed as in the previous case.
Undo/Redo Logging
ƒ Undo/Redo Rules
–It has the same types of log records as the other kinds of logging, except that the update record has the format <TID,X,v,w>.
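The forward-scanning redo recovery above can be sketched the same way as the undo version, with the same assumptions (in-memory log of tuples, `("UPDATE", tid, x, new_value)` standing in for `<TID,X,v>`, where v is now the new value).

```python
# Minimal sketch of recovery with a redo log: scan forward and reapply
# the new values written by committed transactions.  The log contents
# below are hypothetical.
def redo_recover(log, db):
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
    incomplete = {rec[1] for rec in log if rec[0] == "START"} - committed
    for rec in log:                           # oldest record first
        if rec[0] == "UPDATE":
            tid, x, new_value = rec[1], rec[2], rec[3]
            if tid in committed:
                db[x] = new_value             # redo the change
    # Step 3: log <ABORT TID> for each incomplete transaction.
    log.extend(("ABORT", tid) for tid in incomplete)

db = {"A": 1}                                 # disk missed T1's update
log = [("START", "T1"), ("UPDATE", "T1", "A", 8), ("COMMIT", "T1"),
       ("START", "T2"), ("UPDATE", "T2", "A", 0)]   # T2 never committed
redo_recover(log, db)
# T1's committed write is repeated; T2's write is ignored.
```

Note the symmetry with undo recovery: undo scans backwards applying old values for uncommitted Ts, redo scans forwards applying new values for committed Ts.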
Value v is the old value of X and w is the new value.
–UR1: The <TID,X,v,w> record must appear on disk before DB element X is modified on disk.
Coping With System Failures Cont.
ƒ Recovery With Undo/Redo Logging
–Redo all the committed Ts in earliest-first order, and undo all the incomplete Ts in latest-first order.
–These two steps allow flexibility in the order in which the COMMIT log records are written relative to the changed values.
ƒ Checkpointing an Undo/Redo Log
–1) Write a <START CKPT(T1,...,Tk)> record to the log, where T1,...,Tk are all active Ts, and flush the log.
–2) Write to disk all buffers that are dirty. Unlike redo logging, we flush all dirty buffers, not just those written by committed Ts.
–3) Write an <END CKPT> record to the log, and flush the log.
–The only additional requirement is that a T must not write any values to disk until it is certain not to abort.
Coping With System Failures Cont.
Protecting Against Media Failures
ƒ The Archive
–A full dump/backup copies the entire DB.
–An incremental dump/backup copies only those DB elements changed since the previous full or incremental backup.
ƒ Nonquiescent Archiving
–A static backup requires the DB to be stopped, possibly for hours if the DB is large, so we need to be able to take an "online" backup while the DB is still processing Ts.
–As long as the log for the duration of the backup is preserved, the discrepancies (between the start of the backup and its completion) can be corrected from the log.
Coping With System Failures Cont.
–Steps for making an archive (backup image):
Write a log record <START DUMP>.
Perform a checkpoint appropriate for whichever logging method is being used.
Perform a full or incremental dump of the data disk(s), making sure that the copy of the data has reached a secure place, e.g. a remote site.
Make sure that enough of the log has been copied to the secure place that at least the prefix of the log up to and including the checkpoint above will survive a media failure of the DB.
Write a log record <END DUMP>.
–At the completion of the backup, it is safe to throw away the log prior to the beginning of the checkpoint preceding the one just performed. Note that the DB administrator may want to keep multiple backups, in which case old backup images and logs may be saved for a while.
Coping With System Failures Cont.
ƒ Recovery Using an Archive and Log
–1) Restore the database from the archive.
Find the most recent full backup and reconstruct the DB from it.
If there are any later incremental backups, apply them, earliest first.
–2) Modify the database using the log files.
Use the recovery method appropriate to the logging method being used.
Concurrency Control
Serial and Serializable Schedules
ƒ Schedules
–A schedule is a time-ordered sequence of the important actions taken by one or more Ts.
–Actions are viewed as the READ and WRITE operations to/from the BM, not the INPUT and OUTPUT actions to/from disk.
ƒ Serial Schedules
–A schedule is serial if all the actions of one T are grouped together, followed by all the actions of another T, and so on.
ƒ Serializable Schedules
–A schedule is serializable if its effect on the DB state is the same as that of some serial schedule, regardless of the initial state of the DB.
Concurrency Control - Cont.
ƒ Effect of Transaction Semantics
–Since the scheduler does not examine the semantics of a T, it might not know whether a given T will modify certain DB elements.
–A scheduler can be smarter by knowing merely whether a T intends to modify any DB elements.
ƒ Notation for Transactions and Schedules
–1) An action is an expression of the form r1(X) or w1(X), meaning that T1 reads or writes, respectively, the DB element X.
–2) Ti is a sequence of actions with subscript i.
–3) A schedule S of a set of Ts (denoted TT) is a sequence of actions such that for each Ti in TT, the actions of Ti appear in S in the same order that they appear in the definition of Ti itself.
S is an interleaving of the actions of all the Ts in TT.
Concurrency Control - Cont.
Conflict-Serializability (C-S)
ƒ Conflicts
–A conflict is a pair of consecutive actions in a schedule such that, if their order is swapped, the behavior of at least one of the Ts involved can change.
–If we have Ti and Tj (i != j), then:
1) ri(X); rj(Y) is never a conflict, even if X = Y.
2) ri(X); wj(Y) is not a conflict if X != Y.
3) wi(X); rj(Y) is not a conflict if X != Y.
4) wi(X); wj(Y) is not a conflict if X != Y.
–Situations where we may not swap the order of actions:
Two actions of the same T, e.g. ri(X); wi(Y).
Two writes of the same X by different Ts, e.g. wi(X); wj(X).
A read and a write of the same X by different Ts, e.g. ri(X); wj(X) or wi(X); rj(X).
Concurrency Control - Cont.
–In short, two actions of different Ts may be swapped unless they involve the same DB element and at least one of them is a write.
–The scheduler can make as many swaps as possible with the goal of turning the schedule into a serial schedule.
ƒ Precedence Graphs and a Test for C-S
–A graphical test for conflict-serializability: draw an arc from Ti to Tj whenever an action of Ti conflicts with a later action of Tj; the schedule is conflict-serializable exactly when the graph has no cycles.
Enforcing Serializability by Locks
ƒ Locks
–A simple locking scheme is to request a lock on a DB element before performing any operation on it.
–A lock table can be used to manage the locks on all the elements of a DB.
Concurrency Control - Cont.
–Consistency of Ts
A T can only read or write an element if it previously requested a lock on that element and hasn't yet released the lock.
If a T locks an element, it must later unlock that element.
–Legality of Schedules
Two Ts cannot lock the same element without one having first released the lock.
–Notation for locks
li(X): T Ti requests a lock on DB element X.
ui(X): T Ti releases/unlocks its lock on DB element X.
ƒ Locking Scheduler
–It uses a lock table to manage locks. In this section the table can be viewed as a relation Locks(element, TID), consisting of pairs (X, TID) such that TID currently holds a lock on DB element X.
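The precedence-graph test above can be sketched directly from the conflict rules: an arc Ti -> Tj exists when an action of Ti conflicts with a later action of Tj. The schedules below are hypothetical; actions are encoded as `(transaction, 'r' or 'w', element)` triples.

```python
# Minimal sketch of the precedence-graph test for conflict-
# serializability described above.
def precedence_edges(schedule):
    edges = set()
    for i, (t1, a1, x1) in enumerate(schedule):
        for t2, a2, x2 in schedule[i + 1:]:
            # Conflict: different Ts, same element, at least one write.
            if t1 != t2 and x1 == x2 and "w" in (a1, a2):
                edges.add((t1, t2))
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    def reachable(start, goal, seen=()):
        return any(v == goal or (v not in seen and
                   reachable(v, goal, seen + (v,)))
                   for v in graph.get(start, ()))
    return any(reachable(u, u) for u in graph)

# r2(A); r1(B); w2(A); w1(B): no cross-T conflicts, so serializable.
s1 = [(2, "r", "A"), (1, "r", "B"), (2, "w", "A"), (1, "w", "B")]
# r1(A); w2(A); w1(A): arcs T1 -> T2 and T2 -> T1 form a cycle.
s2 = [(1, "r", "A"), (2, "w", "A"), (1, "w", "A")]
```

A cycle means no serial order of the Ts is consistent with all the conflicts, which is exactly why the scheduler's swapping strategy cannot succeed on s2.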
Simple SQL insert and delete statements suffice to manage this table.
Concurrency Control - Cont.
ƒ Two-Phase Locking (2PL)
–In every T, all lock requests precede all unlock requests: the first phase obtains the locks, and the second phase releases them.
–2PL guarantees that a legal schedule of consistent Ts is C-S.
–It does not prevent deadlocks.
Different Lock Modes
ƒ Shared and Exclusive Locks
–A shared lock is known as a "read lock", since two read actions on the same DB element do not create a conflict.
–An exclusive lock is known as a "write lock", since any write to a DB element affects the behavior of other read and write requests.
Concurrency Control - Cont.
–Three kinds of requirements
Consistency of Ts: A read action r(X) must be preceded by sl(X) or xl(X), with no intervening u(X). A write action w(X) must be preceded by xl(X), with no intervening u(X).
Two-phase locking of Ts: Locking must precede unlocking. No action sl(X) or xl(X) can be preceded by an action u(Y), for any Y.
Legality of schedules: An element may be locked either exclusively by one T or by several Ts in shared mode, but not both. If xli(X) appears in a schedule, then there cannot be a following xlj(X) or slj(X), for some j other than i, without an intervening ui(X). If sli(X) appears in a schedule, then there cannot be a following xlj(X), for j != i, without an intervening ui(X).
Concurrency Control - Cont.
ƒ Upgrading Locks
–A T can request an S lock first and then request an X lock later, when it is ready to write. This allows more concurrency and less blocking time, thus improving performance.
–It can lead to deadlock.
ƒ Update Locks
–An update lock uli(X) gives a T only the privilege to read X, not to write X. Once a U lock is held, no other T can obtain ANY lock on X. Only an update lock can be upgraded to an X lock.
–U locks help to avoid a deadlock scenario.
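The shared/exclusive legality rules above can be sketched as a tiny lock table. This is an illustration only: transaction names are hypothetical, and a real scheduler would also queue blocked requests rather than just refusing them.

```python
# Minimal sketch of a lock table enforcing the S/X rules described
# above: many readers, or one writer, per element.
class LockTable:
    def __init__(self):
        self.shared = {}      # element -> set of TIDs holding S locks
        self.exclusive = {}   # element -> TID holding the X lock

    def sl(self, tid, x):
        """Grant a shared lock unless another T holds X exclusively."""
        holder = self.exclusive.get(x)
        if holder is not None and holder != tid:
            return False
        self.shared.setdefault(x, set()).add(tid)
        return True

    def xl(self, tid, x):
        """Grant an exclusive lock only if no other T holds any lock."""
        holder = self.exclusive.get(x)
        others = self.shared.get(x, set()) - {tid}
        if (holder is not None and holder != tid) or others:
            return False
        self.exclusive[x] = tid
        return True

    def u(self, tid, x):
        """Release all of tid's locks on x."""
        self.shared.get(x, set()).discard(tid)
        if self.exclusive.get(x) == tid:
            del self.exclusive[x]

locks = LockTable()
assert locks.sl("T1", "A") and locks.sl("T2", "A")   # S locks coexist
assert not locks.xl("T3", "A")                       # writer must wait
```

Note that `xl` lets a T holding only its own S lock upgrade to X, which is the (deadlock-prone) upgrade pattern the text describes.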
ƒ Increment Locks
–The action INC(A,c) can be defined as: READ(A,t); t := t + c; WRITE(A,t).
–Use the notation ili(X) for an increment lock.
Concurrency Control - Cont.
–A consistent T can only perform an increment action on X if it holds an I lock on X at the time. An I lock does not enable either read or write actions.
–In a legal schedule, any number of Ts can hold an I lock on X at the same time; however, if an I lock on X is held by some T, then no other T can hold an S or X lock on X at the same time.
An Architecture for a Locking Scheduler
ƒ Assumptions:
–Ts do not request locks. It is the job of the scheduler to insert lock actions into the stream of reads, writes, and other actions that access data.
–Ts do not release locks. The scheduler releases the locks when the TM tells it that the T will commit or abort.
Concurrency Control - Cont.
ƒ A Scheduler That Inserts Lock Actions
–The scheduler has two main parts. Here are the steps it takes:
Part I takes the stream of requests generated by the Ts and inserts the appropriate lock actions ahead of all DB-access operations such as read, write, increment, or update.
Part II takes the sequence of lock and DB-access actions and executes them. If a lock cannot be granted, that T is delayed (blocked). If the lock is granted, then:
a) If the action is a DB access, execute it.
b) If it is a lock action, examine the lock table and modify it if necessary.
When a T commits or aborts, the TM notifies Part I to release all the locks held by the T. When Part II is notified that a lock on element X is available, it determines the next T or Ts that can now be given a lock on X and executes them.
Concurrency Control - Cont.
Managing Hierarchies of DB Elements
ƒ Locks with Multiple Granularity
–In the previous discussions we used the term DB element X. Element X can be a relation, a block or page, or a tuple.
–A large X limits concurrency, while a small X can lead to excessive I/O and overhead for the lock manager.
ƒ Warning Locks
–Assumption: a lock scheme of S and X locks; the warning locks are the ordinary locks prefixed by I (meaning "intention to"), e.g. IS means intention to obtain an S lock.
–The rules of the warning protocol are:
1) To place an S or X lock on any element, begin at the root of the hierarchy.
2) If we are at the element that we want to lock, request an S or X lock on that element.
Concurrency Control - Cont.
3) If the desired element is further down the hierarchy, place a warning lock at the current node, proceed to the appropriate child, and repeat step 2 or 3 until we reach the desired node.
–Compatibility matrix for shared, exclusive, and intention locks (rows: lock held; columns: lock requested):
     IS  IX  S   X
IS   Y   Y   Y   N
IX   Y   Y   N   N
S    Y   N   Y   N
X    N   N   N   N
ƒ Phantoms and Handling Insertions Correctly
–In the "intention to" locking scheme, we must place an X lock on the whole relation when inserting new tuples, so that "phantom" tuples cannot appear in the middle of another T.
Concurrency Control - Cont.
Concurrency Control by Timestamps
ƒ Timestamps
–Each T is assigned a unique number by the scheduler to identify the ordering of the Ts. Two ways of generating timestamps:
a) Use the system clock, as long as we can guarantee that no more than one T is assigned a timestamp per tick of the clock.
b) Use an internal counter in ascending order. DB2, for example, uses a "log sequence number" (LSN).
–Timestamp data maintained for each DB element X:
RT(X) denotes the read time of X.
WT(X) denotes the write time of X.
C(X) denotes the commit bit of X, which is true if and only if the most recent T to write X has already committed. It is used to prevent the "dirty read" problem.
Concurrency Control - Cont.
ƒ Physically Unrealizable Behaviors
–Two kinds of problems arise when we try to serialize based on timestamps:
1) Read too late - Transaction T tries to read X, but a transaction U that started after T wrote X before T could read it. We must abort T when this scenario occurs.
2) Write too late - Transaction T tries to write X, but a transaction U that started after T read X before T could write it.
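The compatibility matrix above can be encoded as a lookup table and used to decide whether a new lock request is grantable alongside the locks already held. This is a sketch of the matrix only; the hierarchy traversal of the warning protocol itself is omitted.

```python
# Minimal sketch of the intention-lock compatibility matrix above;
# COMPAT[held][requested] says whether the request can coexist with
# a lock already held by another transaction.
COMPAT = {
    "IS": {"IS": True,  "IX": True,  "S": True,  "X": False},
    "IX": {"IS": True,  "IX": True,  "S": False, "X": False},
    "S":  {"IS": True,  "IX": False, "S": True,  "X": False},
    "X":  {"IS": False, "IX": False, "S": False, "X": False},
}

def grantable(requested, held_modes):
    """A request is granted only if compatible with every held lock."""
    return all(COMPAT[h][requested] for h in held_modes)

# Two Ts can both hold IX on a relation while X-locking different
# tuples below it:
assert grantable("IX", ["IX"])
# But a whole-relation S lock conflicts with an IX already held,
# since some tuple below may be exclusively locked:
assert not grantable("S", ["IX"])
```

The asymmetry of the matrix is the point of intention locks: they let conflicts between a whole-relation lock and a single-tuple lock be detected at the relation level without inspecting every tuple.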
Thus a conflict is caused, and T must be aborted.
ƒ Problems With Dirty Data
–Using the commit bit helps with these issues:
1) At t0, U starts. At t1, U writes X. At t2, T starts. At t3, T reads X. At t4, U aborts. (T has read a dirty value of X.)
2) At t0, T starts. At t1, U starts. At t2, U writes X. At t3, T writes X. At t4, T commits. At t5, U aborts.
–When a T writes X, the write is tentative until a commit or abort is issued.
Concurrency Control - Cont.
ƒ The Rules for Timestamp-Based Scheduling
–In response to a read or write request from a T, the scheduler can:
Grant the request.
Abort T and restart T with a new timestamp.
Delay T and later decide whether to abort T or grant the request.
ƒ Timestamps and Locking
–Timestamping is better when most Ts are read-only, or when it is rare that concurrent Ts try to read and write the same element. In high-conflict situations, locking performs better.
Locking frequently delays Ts as they wait for locks, and can lead to deadlocks.
If concurrent Ts frequently read and write elements in common, then rollbacks will be frequent under timestamping, leading to even more delay than a locking system.
Concurrency Control - Cont.
Concurrency Control by Validation
ƒ Validation allows Ts to access data without locks. The scheduler maintains a record of what the active Ts are doing, rather than read and write times as in the timestamp case. Before a T starts to write its elements, it goes through a "validation phase" where the sets of elements it read and wrote are compared with the write sets of the other active Ts. If a conflict exists, the T is rolled back.
ƒ Validation-Based Scheduler
–The scheduler is given a read set RS(T) and a write set WS(T) for each T.
–Ts are executed in three phases:
1) Read - the T reads from the DB all the elements in its read set, and computes in its local address space all the results it is going to write.
2) Validate - the scheduler compares RS(Ti) and WS(Ti) with those of the other active Ts. If validation fails, Ti is rolled back; otherwise it proceeds to the third phase.
Concurrency Control - Cont.
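The "read too late" and "write too late" checks can be sketched against the RT(X) and WT(X) values above. This is an assumption-labeled simplification: it shows only the abort-or-grant decision for a single element, ignoring the commit bit C(X) and the delay option.

```python
# Minimal sketch of the timestamp checks described above, for a single
# element X with read time rt and write time wt; the transaction
# timestamps are hypothetical.
def try_read(ts, rt, wt):
    """T with timestamp ts asks to read X. Return (ok, new_rt)."""
    if wt > ts:                 # a later T already wrote X: read too late
        return False, rt        # T must abort
    return True, max(rt, ts)

def try_write(ts, rt, wt):
    """T with timestamp ts asks to write X. Return (ok, new_wt)."""
    if rt > ts:                 # a later T already read X: write too late
        return False, wt        # T must abort
    return True, max(wt, ts)

# T (timestamp 5) tries to read X after U (timestamp 8) wrote it:
ok, _ = try_read(5, 0, 8)
assert not ok                   # T is aborted and restarted
```

Granting a request also advances RT(X) or WT(X), which is how one T's access constrains the later requests of others.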
3) Write - the T writes to the DB each element in WS(T).
–To support the decision whether to validate a T, the scheduler maintains three sets:
START - the set of Ts that have started but not yet completed validation. For each such T the scheduler maintains START(T), the start time of T.
VAL - the set of Ts that have been validated but not yet finished the writing of phase 3. For these the scheduler maintains both START(T) and VAL(T).
FIN - the set of Ts that have completed phase 3. For these Ts, the scheduler records START(T), VAL(T), and FIN(T).
ƒ Concurrency-Control Mechanism Comparisons
–Storage utilization:
Locks - Space in the lock table is proportional to the number of DB elements locked.
Timestamps - Can be stored in a table for all the active Ts.
Concurrency Control - Cont.
Validation - Space is needed for the timestamps and read/write sets of each currently active T, plus a few more Ts that finished after some currently active T began.
–Performance with competing Ts:
Locking delays Ts but avoids rollbacks, even when interaction is high.
Timestamps and validation do not delay Ts, but can cause them to roll back, which is a more serious form of delay.
If interference is low, then neither timestamps nor validation will cause many rollbacks, and they may be preferable to locking because they generally have lower overhead than a locking scheduler.
When a rollback is necessary, timestamps catch some problems earlier than validation, which always lets a T do all its internal work before the "validation phase" decides whether the T must roll back.
More About Transaction Management
Serializability and Recoverability
ƒ Cascading Rollback
–If T1 reads any uncommitted "dirty" data modified by T2, then when T2 aborts, T1 must abort as well.
–Both timestamp-based (with the commit bit) and validation-based schedulers can avoid cascading rollback.
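The core of the validation phase above is a set-intersection test, which can be sketched as follows. Assumptions: read/write sets are Python sets, and `others` stands for the write sets of Ts that validated while T was active (the bookkeeping via START, VAL, and FIN is omitted).

```python
# Minimal sketch of the validation test described above: a transaction
# passes validation only if its read and write sets do not intersect
# the write set of any overlapping, already-validated transaction.
def validate(rs, ws, others):
    """rs, ws: read/write sets of T; others: overlapping write sets."""
    for other_ws in others:
        if rs & other_ws or ws & other_ws:
            return False          # conflict: roll T back
    return True

# U validated and wrote {"B"} while T was reading {"A", "B"}:
assert not validate({"A", "B"}, {"A"}, [{"B"}])
# With disjoint sets, T validates and proceeds to its write phase:
assert validate({"A"}, {"A"}, [{"B"}])
```

The read-set check catches T having read a value that U rewrote; the write-set check prevents T's phase-3 writes from interleaving badly with U's.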
ƒ Recoverable Schedules
–A schedule is recoverable if each T commits only after every T from which it has read has committed.
–In order to use one of the logging methods, the log's commit records must reach disk in the order in which they are written.
ƒ Schedules That Avoid Cascading Rollback
–A schedule avoids cascading rollback (is ACR) if Ts read only committed values.
–Every ACR schedule is recoverable.
More About Transaction Management - Cont.
ƒ Managing Rollbacks Using Locking
–In a lock-based scheduler, a simple way to guarantee that there are no cascading rollbacks is Strict Locking:
A T must not release any X locks (or other "write" locks) until the T has either committed or aborted, and the commit or abort log record has been flushed to disk.
Every strict schedule is ACR.
Every strict schedule is serializable.
ƒ Group Commit
–Allows the release of locks before the commit record on the log is actually flushed to disk:
Do not release locks until the T finishes and the commit log record at least appears in a buffer.
Flush log blocks in the order in which they were created.
More About Transaction Management - Cont.
ƒ Logical Logging
–Only the changes to the blocks are described. Depending on the change, there are several levels of complexity:
A small number of bytes of the DB element are changed, e.g. an update of a fixed-length field. Only the changed bytes and their positions are recorded.
A small change that nonetheless affects the whole record or block, e.g. an update of a variable-length field.
A change that affects many bytes of a DB element, where further changes can prevent this change from ever being undone.
ƒ Recovery From Logical Logs
–Log Sequence Numbers (LSNs)
Each log record is given a number one greater than that of the previous log record. Thus a typical log record has the form <L,T,A,B> where:
More About Transaction Management - Cont.
L is the log sequence number, an integer.
T is the transaction involved.
A is the action performed by T, e.g. "insert of tuple t."
B is the block on which the action was performed.
For each action there is a compensating action that logically undoes it; for example, the compensating action for "insert tuple t" is "delete tuple t".
If a T aborts, then for each action performed on the DB by T, the compensating action is performed, and the fact that this action was performed is also recorded in the log.
Each block maintains, in its header, the LSN of the last action that affected that block.
–Suppose we need to use the logical log to recover after a crash. Here is an outline of the steps to take.
More About Transaction Management - Cont.
1. Reconstruct the state of the DB at the time of the crash, including blocks whose current values were in buffers and were therefore lost.
a) Find the most recent checkpoint on the log and determine from it the set of Ts that were active at that time.
b) For each log entry <L,T,A,B>, compare the LSN N on block B with the LSN L of this log record. If N < L, redo action A; that action was never performed on block B. Otherwise do nothing.
c) For each log entry that informs us that a T started, committed, or aborted, adjust the set of active Ts accordingly.
2. The set of Ts that remain active when we reach the end of the log must be aborted.
a) Scan the log again, this time from the end back to the previous checkpoint. When we see a record <L,T,A,B> for a T that must be aborted, perform the compensating action.
More About Transaction Management - Cont.
b) If we must abort a T that began prior to the most recent checkpoint, then continue back in the log until the start-records for all such Ts have been found.
c) Write abort-records in the log for each of the Ts we had to abort.
Resolving Deadlocks
ƒ There are two broad approaches to dealing with deadlock: we can detect deadlocks and fix them, or we can manage Ts in such a way that deadlocks are never able to form.
ƒ Deadlock Detection by Timeout
–When a deadlock occurs, it is very difficult to recover from it unless one or more Ts are rolled back.
–A timeout limits the amount of time that each T can be active before it is rolled back.
More About Transaction Management - Cont.
ƒ Waits-For Graph
–Used to indicate potential deadlocks.
–Each T that is waiting for or holding a lock is represented by a node of the graph. An arc is drawn from node A to node B if:
1) B holds a lock on X,
2) A is waiting for a lock on X, and
3) A cannot get a lock on X in its desired mode unless B first releases its lock on X.
–If there are no cycles in the waits-for graph, then each T can eventually complete. If there is a cycle, then no T in the cycle can ever make progress, so we have a deadlock.
–To avoid deadlock, one strategy is to roll back any T that makes a request that would cause a cycle in the waits-for graph.
More About Transaction Management - Cont.
ƒ Deadlock Prevention by Ordering Elements
–Order the DB elements in some arbitrary but fixed order (e.g. by block address).
–If every T is required to request locks on elements in this order, then there can be no deadlock due to Ts waiting for locks.
–A good idea, but not practical in general.
ƒ Detecting Deadlocks by Timestamps
–Note that this timestamp is not the same as the timestamp used in concurrency control, where a new timestamp is generated when a T is restarted; the timestamp used for deadlock detection never changes.
–If we have two transactions, T and U, one of them is older (has the earlier timestamp). There are two different policies that can be used to handle deadlocks.
More About Transaction Management - Cont.
–Suppose T is waiting for a lock that is held by U.
1) The Wait-Die Scheme:
a) If T is older than U, then T is allowed to wait for the lock(s) held by U.
b) If U is older than T, then T "dies"; it is rolled back.
2) The Wound-Wait Scheme:
a) If T is older than U, it "wounds" U.
U must roll back and relinquish to T the lock(s) that T needs from U. But if U has already finished and released its locks by the time the "wound" takes effect, then U does not need to be rolled back.
b) If U is older than T, then T waits for the lock(s) held by U.
ƒ Comparison of Deadlock-Management Methods
–In both the wait-die and wound-wait schemes, older Ts kill off newer ones, so there is no starvation: since a rolled-back T keeps its original timestamp, it eventually becomes the oldest and can no longer be chosen as the victim.
More About Transaction Management - Cont.
–Both wound-wait and wait-die are easier to implement than a system that maintains or periodically constructs the waits-for graph.
–Using the waits-for graph minimizes the number of times we must abort a T because of a deadlock. On the other hand, either wound-wait or wait-die will sometimes roll back a T when there was no deadlock.
Distributed Databases
ƒ Distribution of Data
–Horizontal Decomposition
Each local relation is called a fragment.
Visualize a single relation with its tuples separated by horizontal lines into the sets of tuples stored at each location.
–Vertical Decomposition
Each location has its own relation.
The full relation is reconstructed logically by a join or union, using a view.
More About Transaction Management - Cont.
ƒ Distributed Transactions
–A single transaction can cause inserts, updates, and deletes at multiple sites.
–Issues include the local scheduler and logger at each site, commit/abort decisions, and global vs. local locks.
ƒ Data Replication
–Can improve performance and reliability, as in a high-availability or fail-safe setup.
ƒ Distributed Query Optimization
–Distributed data adds an extra layer of complexity to query optimization.
Distributed Commit
ƒ Atomicity
–Many errors are possible during a distributed T, e.g. network problems or transactional errors that require a multi-site rollback.
More About Transaction Management - Cont.
ƒ Two-Phase Commit
–Makes a distributed T atomic by enforcing all-or-nothing commitment of its local components.
–Designate the node at which the T originated as the "coordinator node".
–Phase I
The coordinator places a log record <Prepare T> on its local log.
It sends the "Prepare T" message to each component's site. Each site determines whether to commit or abort its component of T and sends a reply back.
If a site wants to commit its component, it must enter a "precommitted" state by writing a local <Ready T> log record. Once in this state, the component cannot be aborted unless the coordinator instructs it to do so.
If a site wants to abort its component of T, then it logs the record <Don't commit T> and sends the "Don't commit T" message back to the coordinator.
More About Transaction Management - Cont.
–Phase II
If the coordinator received "Ready T" from all components of T, then it logs <Commit T> at its site and sends the "Commit T" message to all the sites involved in T.
If the coordinator received "Don't commit T" from one or more sites, then it logs <Abort T> at its site and sends the "Abort T" message to all the sites involved in T.
Each site that receives "Commit T" commits its component of T and writes <Commit T> in its log.
Each site that receives "Abort T" aborts its component of T and writes <Abort T> in its log.
ƒ Recovery of Distributed Transactions
–Use the same techniques described in Chapter 17, such as undo/redo logging, plus the added communication between sites.
–The difficult case is when the coordinator node has also failed; then all other sites must wait for the coordinator to recover before continuing.
More About Transaction Management - Cont.
Distributed Locking
ƒ Centralized Lock Systems
–A simple approach is to designate one site as the "lock site", maintaining a lock table for all logical elements, whether they are local or remote.
–The cost is 3 messages per lock (request, grant, and release).
–The lock site can become a bottleneck, and crash recovery at the lock site can be problematic.
ƒ Distributed Locking Algorithms
–Use messages to request and release locks at other sites or nodes.
–Once all the locks are obtained, the T continues as normal.
–The wait time for distributed locks is usually longer than for local locks.
–A lock coordinator can reduce the number of messages between the sites.
More About Transaction Management - Cont.
ƒ Locking Replicated Elements
–If replication is set up, then an exclusive lock must be distributed as well; e.g. all replica copies of DB element A must be locked concurrently.
ƒ Primary-Copy Locking
–Each logical element X has a primary copy.
–All lock requests must go to the site of the primary copy.
Long-Duration Transactions
ƒ Problems of Long Transactions
–A long-running T holds locks for a long period of time, essentially blocking other Ts from obtaining the same DB elements. Thus other, short Ts may be timed out and/or rolled back.
–There are three broad classes of applications that involve long Ts.
More About Transaction Management - Cont.
1) Conventional DBMS Applications - While most DBMS applications contain short Ts, some operations may require a few long Ts.
2) Design Systems - When a project has many components, such as the source files of a software system, many engineers may need to modify a common component, so a check-in/check-out system is commonly used to control the process. If the check-out operation is viewed as an exclusive lock, a component can stay locked for a long time.
3) Workflow Systems - These systems involve collections of processes, some automated and some manual. A process can hold an exclusive lock on all the data involved in a T, possibly blocking many other Ts for hours or days.
ƒ Sagas
–A saga is a collection of actions that form a long-duration T.
More About Transaction Management - Cont.
–A saga consists of
1. A collection of actions.
2. A graph whose nodes are either actions or the special Abort and Complete nodes, and whose arcs link pairs of nodes. No arcs leave the two special nodes, which we call terminal nodes.
3. An indication of the node at which the saga starts, called the start node.
–Concurrency control for sagas
Each action can be viewed as a short T.
A simple locking scheme can prevent two Ts from trying to write new values at the same time.
The overall T, A1, A2, A3, ..., is managed using "compensating transactions". Their job is to roll back the effect of a committed action in a way that does not depend on what has happened to the DB between the time the action was executed and the time the compensating T is executed.
More About Transaction Management - Cont.
ƒ Compensating Transactions
–In a saga, each action A has a compensating transaction, denoted A(-1). If we execute A followed by A(-1), the result is the same as if A had never executed.
–Formally: if D is any DB state, and B1B2...Bn is any sequence of actions and compensating Ts (whether from the saga in question or from any other saga or T that may legally execute on the database), then running B1B2...Bn and running AB1B2...BnA(-1) on DB state D result in the same DB state.
–If a saga execution leads to the Abort node, then we roll back the saga by executing the compensating Ts for each executed action, in the reverse order of those actions.
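The saga mechanism can be sketched in a few lines of Python. This is a minimal illustration, not a DBMS implementation; pairing each action with its compensating transaction as a (do, undo) tuple is an assumption of the sketch, and all names are hypothetical:

```python
# Sketch of saga execution: run actions in order; if any action fails
# (the saga reaches the Abort node), run the compensating transaction
# A(-1) for each executed action A, in the reverse order of execution.

def run_saga(actions, db):
    """actions: list of (do, undo) pairs, each a function db -> db.
    Returns the final DB state and whether the saga completed."""
    executed = []                     # undo functions for committed actions
    for do, undo in actions:
        try:
            db = do(db)               # action commits as a short T
            executed.append(undo)
        except Exception:
            # Abort node reached: compensate in reverse order
            for comp in reversed(executed):
                db = comp(db)
            return db, "aborted"
    return db, "complete"
```

With a failing action, the sketch leaves the DB state exactly as it was before the saga started, which is the point of the A followed by A(-1) property above.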