* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Review 2
Survey
Document related concepts
Microsoft Access wikipedia , lookup
Microsoft SQL Server wikipedia , lookup
Serializability wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Relational algebra wikipedia , lookup
Extensible Storage Engine wikipedia , lookup
Oracle Database wikipedia , lookup
Open Database Connectivity wikipedia , lookup
Functional Database Model wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
Concurrency control wikipedia , lookup
Clusterpoint wikipedia , lookup
Database model wikipedia , lookup
Transcript
Database: Review Database Hierarchical databases Multi-dimensional indexes Lossless join Normalization JDBC (DB application in Java) Relational algebra, Relational data model B+-tree Hashing Introduction 1 not covered ACS-3902 Yangjun Chen Sept. 2012 Database: Review What is a database? The main characters of a database The basic database design method The entity-relationship data model for application modeling Sept. 2012 Yangjun Chen ACS-3902 2 Database: Review The main characteristics of the database approach: single repository of data • sharable by multiple users • concurrency control and transaction concept • security and integrity constraints • self-describing - system catalogue contains meta data • program-data independence • some changes to the database are transparent to programs/users • multiple views of data - to support individual needs of programs/users Sept. 2012 Yangjun Chen ACS-3902 3 Database: Review Database schema, Schema evolution, Database state Concepts and Architecture Working process with a database system Database system architecture Data independence concept Sept. 2012 Yangjun Chen ACS-3902 4 Database: Review Database schema Course CName CNo CrHrs Dept Database 8803 3 CS C 2606 3 CS Relation schema Schema evolution Database state Student Name StNo Class Major Smith 17 1 CS Brown 8 2 Grades StNo Sid Grade 17 25 A CS 17 43 B Section SId CNo Semester Yr Instructor 32 8803 Spring 2000 Smith 25 43 Sept. 2012 8803 Winter 2000 Smith 2606 Spring 2000 Jones Yangjun Chen ACS-3902 5 Database: Review Working process with a database system: Definition •record structure •data elements •names •data types •constraints etc Sept. 2012 Construction •create database files •populate the database with records Yangjun Chen ACS-3902 Manipulation •querying •updating 6 Database: Review Database Management System (DBMS) •collection of software facilitating the definition, construction and manipulation of databases DBMS Request manager Users/ actors Sept. 2012 Storage manager, Query evaluation Yangjun Chen ACS-3902 Meta data Stored database 7 Database: Review Three-schema architecture External view Sept. 2012 External view A specific user or groups view of the database Conceptual schema Describes the whole database for all users Internal schema Physical storage structures and details Yangjun Chen ACS-3902 8 Database: Review Data modeling using ER-model Sept. 2012 Entity-relationship model - Entity types - strong entities - weak entities - Relationships among entities - Attributes - attribute classification - Constraints - cardinality constraints - participation constraints - is-a relationship ER-to-Relation-mapping Yangjun Chen ACS-3902 9 Database: Review ER-model: fname sex address employee 1 M hours M supervisor 1 supervisee 1 number of employees N N works on supervision M controls manages bdate 1 department works for N salary startdate ssn location 1 lname minit name name number dependents of project name number location N dependent name Sept. 2012 sex birthdate relationship Yangjun Chen ACS-3902 10 Database: Review Subtype is determined by the student_class attribute student A student must be a graduate or undergraduate The bubble and the d imply disjoint subtypes (o - overlap subtypes) Student_class d graduate undergraduate The arc implies graduate and undergraduate are subtypes of student • Participation of supertype may be mandatory or optional • Subtypes may be disjoint or overlapping • a predicate (on an attribute) determines the subtype: e.g. attribute Student_class Student_class = ‘graduate’; Student_class = ‘undergraduate’ Sept. 2012 Yangjun Chen ACS-3902 11 Database: Review Mapping to a relational database • 4 choices: 1. Create separate relations for the supertype and each of the subtypes. 2. Create relations for the subtypes only - each contains attributes from the supertype. 3. (disjoint subtypes) Create only one relation - includes all of the attributes for the supertype and all for the subtypes, and one discriminator attribute. 4. (overlapping subtypes) Create only one relation includes all of the attributes for the supertype and all for the subtypes, and one logical discriminator attribute per subtype. PK is always the same - determined from the supertype Sept. 2012 Yangjun Chen ACS-3902 12 Database: Review fname minit lname name Ssn bDates Address JobType EMPLOYEE TypingSpeed d EngType TGrade SECRETARY TECHNICIAN ENGINEER EMPLOYEE fname, minit, lname, ssn, bdate, address, JobType SECRETARY Essn, TypingSpeed Sept. 2012 TECHNICIAN ENGINEER Essn, TGrade Yangjun Chen ACS-3902 Essn, EngType 13 Database: Review Example for super- & sub-types: choice 2 VehicleId Price LicensePlate Vehicle TNoOfPassengers d NoOfAxles CAR MaxSpeed TRUCK Tonnage CAR VehicleId, LicensePlate, Price, MaxSpeed, NoOfPassenger TRUCK VehicleId, LicensePlate, Price, NoOfAxles, Tonnage Sept. 2012 Yangjun Chen ACS-3902 14 Database: Review fname minit lname name Ssn Example for super- & sub-types: choice 3 bDates Address JobType EMPLOYEE TypingSpeed d EngType TGrade SECRETARY TECHNICIAN ENGINEER EMPLOYEE fname, minit, lname, ssn, bdate, address, JobType, TypingSpeed, Tgrade, EngType Sept. 2012 Yangjun Chen ACS-3902 15 Database: Review Example for super- & sub-types: choice 4 Description PartNo Part manufactureDate o Supplier DrawingNo Manufacture_Part ListPrice BatchNo Purchased_Part Part PartNo, Desription, MFlag, Drawing, ManufactureDate, BatchNo, Pflag, Supplier, ListPrice Sept. 2012 Yangjun Chen ACS-3902 16 Database: Review external hashing static hashing & dynamic hashing hash function mathematical function that maps a key to a Hashing technique bucket address collisions collision resolution scheme - open addressing - chaining - multiple hashing linear hashing Sept. 2012 Yangjun Chen ACS-3902 17 Database: Review External hashing: the data are on the disk. Static hashing: using a hashing function to map keys to bucket addresses primary area can not be changed collision resolution schema: open addressing chaining multiple hashing Dynamic hashing: primary area can be changed linear hashing Sept. 2012 Yangjun Chen ACS-3902 18 Database: Review Linear hashing: 1. What is a phase? 2. How to split a bucket? 3. When to split a bucket? 4. What bucket will be chosen to split next? Sept. 2012 Yangjun Chen ACS-3902 19 Database: Review Linear hashing: initially hash file contains M buckets hi = key mod 2iM (i = 0, 1, 2, ...) insertion process can be divided into several phases phase 1: insertion using h0 = key mod M splitting using h1 = key mod 2M splitting rule: overflow of a bucket or if load factor > constant (e.g., 0.70) overflow will be put in the overflow area or redistributed through splitting a bucket splitting buckets from n = 0 to n = M- 1 (after each splitting n is increased by 1. Phase 1 finishes when n = M (in this case, the primary area becomes 2M buckets long) Sept. 2012 Yangjun Chen ACS-3902 20 Database: Review phase 2: insertion using h1 = key mod 2M splitting using h2 = key mod 4M splitting rule: overflow of a bucket or if load factor > constant (e.g., 0.70) overflow will be put in the overflow area or redistributed through splitting a bucket splitting buckets from n = 0 to n = 2M- 1 (after each splitting n is increased by 1. Phase 1 finishes when n = 2M (in this case, the primary area will contain 4M buckets.) phase 3: ... … h2 = …, h3 = …, ... Sept. 2012 Yangjun Chen ACS-3902 21 Database: Review tree - root, internal, leaf, subtree - parent, child, sibling Multi-level index balanced, unbalanced b+-tree - splits on overflow; merge on underflow - in practice it is usually 3 or 4 levels deep search, insert, delete algorithms Sept. 2012 Yangjun Chen ACS-3902 22 Database: Review Motivation • B+-tree provides a short access path. file of records page1 Index page2 page3 Inverted index Signature file B+-tree Hashing …… Sept. 2012 Yangjun Chen ACS-3902 23 Database: Review A B+-tree pinternal = 3, pleaf = 2. 5 3 1 3 5 7 6 7 8 8 9 12 Records in a file Sept. 2012 Yangjun Chen ACS-3902 24 Database: Review B+-tree insertion: leaf node splitting, internal node splitting Leaf splitting When a leaf splits, a new leaf is allocated •the original leaf is the left sibling, the new one is the right sibling •key and pointer pairs are redistributed: the left sibling will have smaller keys than the right sibling •a 'copy' of the key value which is the largest of the keys in the left sibling is promoted to the parent 22 33 33 12 22 33 44 48 55 12 22 31 33 44 48 55 insert 31 Sept. 2012 Yangjun Chen ACS-3902 25 Database: Review Internal node splitting If an internal node splits and it is not the root, • insert the key and pointer and then determine the middle key • a new 'right' sibling is allocated • everything to its left stays in the left sibling • everything to its right goes into the right sibling • the middle key value along with the pointer to the new right sibling is promoted to the parent (the middle key value 'moves' to the parent to become the discriminator between this left and right sibling) 26 55 55 22 33 22 33 Insert 26 Sept. 2012 Yangjun Chen ACS-3902 26 Database: Review Internal node splitting When a new root is formed, a key value and two pointers must be placed into it. 40 26 55 26 55 Insert 40 Sept. 2012 Yangjun Chen ACS-3902 27 Database: Review Deleting nodes from a B+-tree: 1. When deleting a key from a node A, check whether the number of the remaining keys (or pointers) is p/2. 2. If it is not the case, redistribute the keys in the left sibling B or in the right sibling C if it is possible. Otherwise, merge A and B or merge A and C. 3. When redistributing or merging, change the key values in the parent node so that the following condition is satisfied: • < P1, K1, P2, K2, …, Pq-1, Kq-1, Pq > • K1 < K2 < ... < Kq-1 (i.e. it is an ordered set) • for the key values, X, in the subtree pointed to by Pi • Ki-1 < X <= Ki for 1 < i < q • X <= K1 for i = 1 • Kq-1 < X for i = q Sept. 2012 Yangjun Chen ACS-3902 28 Database: Review A b+-tree p = 3, pleaf = 2. 5 3 1 3 5 7 6 7 8 8 9 12 Records Sept. 2012 Yangjun Chen ACS-3902 29 Database: Review Entry deletion - deletion sequence: 8, 12, 9, 7 5 3 1 3 5 7 6 7 9 9 12 Deleting 8 causes the node redistribute. Sept. 2012 Yangjun Chen ACS-3902 30 Database: Review Entry deletion - deletion sequence: 8, 12, 9, 7 5 3 1 3 5 7 6 7 9 12 is removed. Sept. 2012 Yangjun Chen ACS-3902 31 Database: Review Entry deletion - deletion sequence: 8, 12, 9, 7 5 3 1 3 5 6 6 7 9 is removed. Sept. 2012 Yangjun Chen ACS-3902 32 Database: Review Entry deletion - deletion sequence: 8, 12, 9, 7 5 3 1 3 5 6 6 Deleting 7 makes this pointer no use. Therefore, a merge at the level above the leaf level occurs. Sept. 2012 Yangjun Chen ACS-3902 33 Database: Review Entry deletion - deletion sequence: 8, 12, 9, 7 5 3 A 5 This point becomes useless. The corresponding node should also be removed. B 1 3 5 6 C For this merge, 5 will be taken as a key value in A since any key value in B is less than or equal to 5 but any key value in C is larger than 5. Sept. 2012 Yangjun Chen ACS-3902 34 Database: Review Entry deletion - deletion sequence: 8, 12, 9, 7 3 1 Sept. 2012 3 5 Yangjun Chen 5 6 ACS-3902 35 Database: Review Relational Data Model Data modeling using Relational model Relational algebra - relational schema (database schema) - relation schema, relations, database state - integrity constraints and updating Relational algebra - select, project, join, cartesian product - division - set operations: union, intersection, difference, Sept. 2012 Yangjun Chen ACS-3902 36 Database: Review Integrity Constraints • any database will have some number of constraints that must be applied to ensure correct data (valid states) 1. domain constraints • a domain is a restriction on the set of valid values • domain constraints specify that the value of each attribute A must be an atomic value from the domain dom(A). 2. key constraints • a superkey is any combination of attributes that uniquely identify a tuple: t1[superkey] t2[superkey]. - Example: <Name, SSN> (in Employee) • a key is superkey that has a minimal set of attributes - Example: <SSN> (in Employee) Sept. 2012 Yangjun Chen ACS-3902 37 Database: Review Integrity Constraints • If a relation schema has more than one key, each of them is called a candidate key. • one candidate key is chosen as the primary key (PK) • foreign key (FK) is defined as follows: i) Consider two relation schemas R1 and R2; ii ) The attributes in FK in R1 have the same domain(s) as the primary key attributes PK in R2; the attributes FK are said to reference or refer to the relation R2; iii) A value of FK in a tuple t1 of the current state r(R1) either occurs as a value of PK for some tuple t2 in the current state r(R2) or is null. In the former case, we have t1[FK] = t2[PK], and we say that the tuple t1 references or refers to the tuple t2. Example: FK Employee(SSN, …, Dno) Sept. 2012 Yangjun Chen Dept(Dno, … ) ACS-3902 38 Database: Review Integrity Constraints 3. entity integrity • no part of a PK can be null 4. referential integrity • domain of FK must be same as domain of PK • FK must be null or have a value that appears as a PK value 5. semantic integrity • other rules that the application domain requires: • state constraint: gross salary > net income • transition constraint: Widowed can only follow Married; salary of an employee cannot decrease Sept. 2012 Yangjun Chen ACS-3902 39 Database: Review Other SQL capabilities • Assertions can be used for some constraints • e.g. Create Assertion ... ... Executed and enforced by DBMS Constraint: The salary of an employee must not be greater than the salary of the manager of the department that the employee works for. CREATE ASSERTION salary_constraint CHECK (NOT EXISTS (SELECT * FROM employee e, employee m, department d where e.salary > m.salary and e.dno=d.dnumber and d.mgrssn=m.ssn)); Sept. 2012 Yangjun Chen ACS-3902 40 Database: Review Relational algebra Retrieve for each female employee a list of the names of her dependents: FEMALE_EMPS SEX = ‘F’ (EMPLOYEE) EMPNAMES FNAME,LNAME, SSN(FEMALE_EMPS) ACTUAL_DEPENDENTS EMPNAMES DEPENDENT SSN = ESSN RESULT FNAME, LNAME, DEPENDENT_NAME(ACTUAL_DEPENDENTS ) Sept. 2012 Yangjun Chen ACS-3902 41 Database: Review DDL - creating schemas - modifying schemas DML SQL - select-from-where clause - group by, having, order by - update - view Sept. 2012 Yangjun Chen ACS-3902 42 Database: Review DDL - Examples: • Create schema: Create schema COMPANY authorization JSMITH; • Create table: Create table EMPLOYEE (FNAME VARCHAR(15) NOT NULL, MINIT CHAR, LNAME VARCHAR(15) NOT NULL, SSN CHAR(9) NOT NULL, BDATE DATE, ADDRESS VARCHAR(30), SEX CHAR, SALARY DECIMAL(10, 2), SUPERSSN CHAR(9), DNO INT NOT NULL, PRIMARY KEY(SSN), FOREIGN KEY(SUPERSSN) REFERENCES EMPLOYEE(SSN), FOREIGN KEY(DNO) REFERENCES DEPARTMENT(DNUMBER)); Sept. 2012 Yangjun Chen ACS-3902 43 Database: Review DDL - Examples: • drop schema DROP SCHEMA CAMPANY CASCADE; DROP SCHEMA CAMPANY RESTRICT; • drop table DROP TABLE DEPENDENT CASCADE; DROP TABLE DEPENDENT RESTRICT; • alter table ALTER TABLE COMPANY.EMPLOYEE ADD JOB VARCHAR(12); ALTER TABLE COMPANY.EMPLOYEE DROP ADDRESS CASCADE; Sept. 2012 Yangjun Chen ACS-3902 44 Database: Review DDL - Examples: • Specifying constraints: Create table EMPLOYEE (…, DNO INT NOT NULL DEFAULT 1, CONSTRAINT EMPPK PRIMARY KEY(SSN), CONSTRAINT EMPSUPERFK FOREIGN KEY(SUPERSSN) REFERENCES EMPLOYEE(SSN) ON DELETE SET NULL ON UPDATE CASCADE, CONSTRAINT EMPDEPTFK FOREIGN KEY(DNO) REFERENCES DEPARTMENT(DNUMBER) ON DELETE SET DEFAULT ON UPDATE CASCADE); • Create domain: CREATE DOMAIN SSN_TYPE AS CHAR(9); Sept. 2012 Yangjun Chen ACS-3902 45 Database: Review set null or cascade: strategies to maintain data consistency Employee ssn ... ... supervisor 234589710 123456789 ... ... 234589710 null delete Employee ssn ... ... supervisor 234589710 123456789 ... ... 234589710 Sept. 2012 null Yangjun Chen ACS-3902 not reasonable delete cascade delete 46 Database: Review set null or cascade: strategies to maintain data consistency Employee ssn ... ... supervisor 234589710 123456789 ... ... 234589710 null delete Employee ssn ... ... supervisor null 123456789 ... ... 234589710 Sept. 2012 reasonable set null null Yangjun Chen ACS-3902 delete 47 Database: Review set default: strategy to maintain data consistency Employee ssn ... ... DNO 4 123456789 ... ... 234589710 …… Department DNUMBER ... ... change this value to the default value 1. …… …… 1 ... ... …… 4 Sept. 2012 Yangjun Chen ACS-3902 delete 48 Database: Review DML - select-from-where clause Retrieve a list of employees and the projects they are working on, ordered by department, within each department, ordered alphabetically by last name, first name: SELECT DNAME, LNAME, FNAME, PNAME FROM DEPARTMENT, EMPLOYEE, WORKS_ON, PROJECT WHERE DNUMBER = DNO AND SSN = ESSN AND PNO = PNUMBER ORDER BY DNAME, LNAME, FNAME order by – clause group by – clause having – clause aggregation functions: max, min, average, count, sum Sept. 2012 Yangjun Chen ACS-3902 49 Database: Review DML - select-from-where clause • Insert • Update • Delete INSERT INTO employee ( fname, lname, ssn, dno ) VALUES ( "Joe", "Smith", 909, 1); Note that Access changes the above to read: INSERT INTO employee ( fname, lname, ssn, dno ) SELECT "Joe", "Smith", 909, 1; UPDATE employee SET salary = 100000 WHERE ssn=909; DELETE FROM employee WHERE ssn=909; Sept. 2012 Yangjun Chen ACS-3902 50 Database: Review View definition • Use a Create View command • essentially a select specifying the data that makes up the view • Create View Enames as select lname, fname from employee CREATE VIEW AS SELECT FROM Sept. 2012 Enames (lname, fname) LNAME, FNAME EMPLOYEE Yangjun Chen ACS-3902 51 Database: Review CREATE VIEW AS SELECT FROM WHERE GROUP BY DEPT_INFO (DEPT_NAME, NO_OF_EMPS, TOTAL_SAL) DNAME, COUNT(*), SUM(SALARY) DEPARTMENT, EMPLOYEE DNUMBER = DNO DNAME; Sept. 2012 Yangjun Chen ACS-3902 52 Database: Review JDBC (Database application in Java) Sept. 2012 Yangjun Chen ACS-3902 53 Database: Review To develop a database application, JDBC or ODBC should be used. JDBC – JAVA Database Connectivity ODBC – Open Database Connectivity Client Server JDBC-ODBC Bridge Database ODBC Driver Database Client Sept. 2012 Yangjun Chen ACS-3902 54 Database: Review Connection to a database: 1. Loading driver class Class.forName(“sun.jdbc.odbc.JdbcOdbcDriver”); 2. Connection to a database String url = “jdbc:odbc:<databaseName>”; Connction con = DriverManager.getConnection(url, <userName>, <password>) Sept. 2012 Yangjun Chen ACS-3902 55 Database: Review 3. Sending SQL statements Statement stmt = con.createStatement(); ResultSet rs = stmt.executeQuery(“SELECT * FROM Information WHERE Balance >= 5000”); 4. Getting results while (rs.next()) { … } Sept. 2012 Yangjun Chen ACS-3902 56 Database: Review import java.sql.*; public class DataSourceDemo1 { public static void main(String[] args) { Connection con = null; try {//load driver class Class.forName{“sun.jdbs.odbs.JdbsOdbcDriver”); //data source String url = “jdbs:odbc:Customers”; //get connection con = DriverManager.getConnection(url, “sa”, “ “) password Sept. 2012 Yangjun Chen ACS-3902 57 Database: Review //create SQL statement Statement stmt = con.createStatement(); //execute query Result rs = stmt.executeQuery(“SELECT * FROM Information WHERE Balance >= 5000”); String firstName, lastName; Date birthDate; float balance; int accountLevel; Sept. 2012 Yangjun Chen ACS-3902 58 Database: Review while(rs.next()) { firstName = rs.getString(“FirstName”); lastName = rs.getString(“lastName”); balance = rs.getFloat(“Balance”); System.out.println(firstName + “ “ + lastName + “, balance = “ + balance); } } catch(Exception e) {e.printStackTrace();} finally {try{con.close();} catch(Exception e){ } } } } Sept. 2012 Yangjun Chen ACS-3902 59 Database: Review Programming in an dynamical environment: Disadvantage of DataSourceDemo1: If the JDBC-ODBC driver, database, user names, or password are changed, the program has to be modifid. Solution: file name: datasource.config Configuration file: config.driver=sun.jdbc.odbc.JdbcOdbcDriver config.protocol=jdbc <property> = <property value> config.subprotocol=odbc config.dsname=Customers config – datasource name config.username=sa config.password= Sept. 2012 Yangjun Chen ACS-3902 60 Database: Review import java.sql.*; import java.io.*; import java.util.Properties; public class DatabaseAccess { private String configDir; //directory for configuration file private String dsDriver = null; private String dsProtocol = null; private String dsSubprotocol = null; private String dsName = null; private String dsUsername = null; private String dsPassword = null; Sept. 2012 Yangjun Chen ACS-3902 61 Database: Review public DatabaseAccess(String configDir) { this.configDir = configDir; } public DatabaseAccess() { this(“.”); } //source: data source name //configFile: source configuration file public Connection getConnection(String source, String configFile) throws SQLException, Exception { Connection con; try {Properties prop = loadConfig(ConfigDir, ConfigFile); Sept. 2012 Yangjun Chen ACS-3902 62 Database: Review if (prop != null) {dsDriver = prop.getProperty(source + “.driver”); dsProtocol = prop.getPropert(source + “.protocol”); dsSubprotocol = prop.getPropert(source + “.subprotocol”); if (dsName == null) dsName = prop.getProperty(source + “.dsName”); if (dsUsername == null) dsUsername = prop.getProperty(source + “.username”); if (dsPassword == null) dsPassword = prop.getProperty(source + “.password”); Sept. 2012 Yangjun Chen ACS-3902 63 Database: Review //load driver class Class.forName(dsDriver); //connect to data source String url = dsProtocol + “:” + dsSubprotocol + “:” + dsName; con = DriverManager.getConnection(url, dsUsername, dsPassword) } else throw new Exception(“*Cannot find property file” + configFile); return con; } catch (ClassNotFoundException e) { throw new Exception(“* Cannot find driver class “ + dsDriver + “!”); } } Sept. 2012 Yangjun Chen ACS-3902 64 Database: Review //dir: directory of configuration file //filename: file name public Properties loadConfig(String dir, String filename) throws Exception { File inFile = null; Properties prop = null; try { inFile = new File(dir, filename); if (inFile.exists() { prop = new Properties(); prop.load(new FileInputStream(inFile)); } else throw new Exception(“* Error in finding “ + inFile.toString()); } finally {return prop;} } } Sept. 2012 Yangjun Chen ACS-3902 65 Database: Review Using class DatabaseAccess, DataSourceDemo1 should be modified a little bit: DatabaseAccess db = new databaseAccess(); con = db.getConnection(“config”, “datasource.config”); Sept. 2012 Yangjun Chen ACS-3902 66 Database: Review function dependencies - data redundancy, update anomalies - what is a function dependency? - inference rules, minimal set of FDs Normalization normal forms - first normal form - second normal form - third normal form - Boyce Codd normal form Sept. 2012 Yangjun Chen ACS-3902 67 Database: Review Data redundancy and update anomalies: EmployeeDepartment ename ssn bdate address dnumber dname This is similar to Employee, but we have included dname. Sept. 2012 Yangjun Chen ACS-3902 68 Database: Review EmployeeProject ssn pnumber hours ename plocation This is similar to Works_on, but we have included ename and plocation Sept. 2012 Yangjun Chen ACS-3902 69 Database: Review In the two prior cases with EmployeeDepartment and EmployeeProject, we have redundant information in the database … • if two employees work in the same department, then that department name is replicated • if more than one employee works on a project then the project location is replicated • if an employee works on more than one project his/her name is replicated Redundant data leads to • additional space requirements • update anomalies Sept. 2012 Yangjun Chen ACS-3902 70 Database: Review Suppose EmployeeDepartment is the only relation where department name is recorded insert anomalies • adding a new department is complicated unless there is also an employee for that department deletion anomalies • if we delete all employees for some department, what should happen to the department information? modification anomalies • if we change the name of a department, then we must change it in all tuples referring to that department Sept. 2012 Yangjun Chen ACS-3902 71 Database: Review Functional dependencies: Suppose we have a relation R comprising attributes X,Y, … We say a functional dependency exists between the attributes X and Y, X Y if, whenever a tuple exists with the value x for X, it will always have the same value y for Y. X LHS Sept. 2012 Y RHS Yangjun Chen ACS-3902 72 Database: Review Student course_no student_no student_name gender Student_no Student_name gender Given a specific student number, there is only one value for student name and only one value for gender found with it. Sept. 2012 Yangjun Chen ACS-3902 73 Database: Review Inference Rules for Function Dependencies • From a set of FDs, we can derive some other FDs Example: F = {ssn {Ename, Bdate, Address, dnumber}, dnumber {dname, dmgrssn}} inference ssn {dname, dmgrssn}, ssn dnumber, dnumber dname. • F+ (closure of F): The set of all FDs that can be deduced from F (with F together) is called the closure of F. Sept. 2012 Yangjun Chen ACS-3902 74 Database: Review Inference Rules for Function Dependencies • Inference rules: - IR1 (reflexive rule): If X Y, then X Y. (X X.) - IR2 (augmentation rule): {X Y} |= ZX ZY. - IR3 (transitive rule): {X Y, Y Z} |= X Z. - IR4 (decomposition, or projective, rule): {X ZY} |= X Y, X Z. - IR5 (union, or additive, rule): {X Y, Y Z} |= X ZY. - IR6 (pseudotransitive rule): {X Y, WY Z} |= WX Z. Sept. 2012 Yangjun Chen ACS-3902 75 Database: Review Equivalence of Sets of FDs E and F are equivalent if E+ = F+. Minimal sets of FDs • every dependency has a single attribute on the RHS • the attributes on the LHS of a dependency are minimal • we cannot remove any dependency from F and still have a set of dependencies that is equivalent to F. ssn pnumber hours ename plocation {ssn, pnumber} hours, ssn ename, pnumber plocation. Sept. 2012 Yangjun Chen ACS-3902 76 Database: Review Normal Forms • A series of normal forms are known that have, successively, better update characteristics. • We’ll consider 1NF, 2NF, 3NF, and BCNF. • A technique used to improve a relation is decomposition, where one relation is replaced by two or more relations. When we do so, we want to eliminate update anomalies without losing any information. Sept. 2012 Yangjun Chen ACS-3902 77 Database: Review 1NF - First Normal Form The domain of an attribute must only contain atomic values. • This disallows repeating values, sets of values, relations within relations, nested relations, … • In the example database we have a department located in possibly several locations: department 5 is located in Bellaire, Sugarland, and Houston. • If we had the relation Department dnumber dname dmgrssn dlocations 5 Research 333445555 Bellaire, Sugarland, Houston then it would not be 1NF because there are multiple values to be kept in dlocations. Sept. 2012 Yangjun Chen ACS-3902 78 Database: Review 1NF - First Normal Form If we have a non-1NF relation we can decompose it, or modify it appropriately, to generate 1NF relations. There are 3 options: • option 1: split off the problem attribute into a new relation (create a DepartmentLocation relation). Department DepartmentLocation dnumber dname dmgrssn dnumber dlocation 5 Research 333445555 5 Bellaire 5 5 Generally considered the best solution Sept. 2012 Yangjun Chen ACS-3902 Sugarland Houston 79 Database: Review 2NF - Second Normal Form • full functional dependency X Y is a full functional dependency if removal of any attribute A from X means that the dependency does not hold any more. EmployeeProject ssn pnumber hours ename plocation {ssn, pnumber} hours is a full dependency (neither ssn hours , nor pnumber hours). Sept. 2012 Yangjun Chen ACS-3902 80 Database: Review 2NF - Second Normal Form • partial functional dependency X Y is a partial functional dependency if removal of some attribute A from X does not affect the dependency. EmployeeProject ssn pnumber hours ename plocation {ssn, pnumber} ename is a partial dependency because ssn ename holds.) Sept. 2012 Yangjun Chen ACS-3902 81 Database: Review 2NF - Second Normal Form A relation schema is in 2NF if (1) it is in 1NF and (2) every non-key attribute must be fully functionally dependent on the primary key. If we had the relation EmployeeProject ssn pnumber hours ename plocation then this relation would not be 2NF because of two separate violations of the 2NF definition: Sept. 2012 Yangjun Chen ACS-3902 82 Database: Review 2NF - Second Normal Form •We correct this by decomposing the relation into three relations - splitting off the offending attributes - splitting off partial dependencies on the key. EmployeeProject ssn pnumber ssn pnumber hours ename plocation hours 2NF ssn ename pnumber plocation Sept. 2012 Yangjun Chen ACS-3902 83 Database: Review 3NF - Third Normal Form • Transitive dependency A functional dependency X Y in a relation schema R is a transitive dependency if there is a set of attributes Z that is not a subset of any key of R, and both X Z and Z Y hold. EmployeeDept ename ssn bdate address dnumber dname ssn dnumber and dnumber dname Sept. 2012 Yangjun Chen ACS-3902 84 Database: Review 3NF - Third Normal Form A relation schema is in 3NF if (1) it is in 2NF and (2) each non-key attribute must not be fully functionally dependent on another non-key attribute (there must be no transitive dependency of a non-key attribute on the PK) • If we had the relation ename ssn bdate address dnumber dname then this relation would not be 3NF because • dname is functionally dependent on dnumber and neither is • a key attribute Sept. 2012 Yangjun Chen ACS-3902 85 Database: Review 3NF - Third Normal Form • We correct this by decomposing - splitting off the transitive dependencies EmployeeDept ename ssn ename ssn bdate bdate address dnumber dname address dnumber 3NF dnumber Sept. 2012 Yangjun Chen ACS-3902 dname 86 Database: Review Boyce Codd Normal Form, BCNF • Consider a different definition of 3NF, which is equivalent to the previous one. A relation schema R is in 3NF if, whenever a function dependency X A holds in R, either (a) X is a superkey of R, or (b) A is a prime attribute of R. A superkey of a relation schema R = {A1, A2, ..., An} is a set of attributes S R with the propertity that no tuples t1 and t2 in any legal state r of R will have t1[S] = t2[S]. An attribute is called a prime attribute if it is a member of any key. Sept. 2012 Yangjun Chen ACS-3902 87 Database: Review Boyce Codd Normal Form, BCNF • If we remove (b) from the previous definition for 3NF, we have the definition for BCNF. • A relation schema is in BCNF if every determinant is a superkey key. Stronger than 3NF: - no partial dependencies - no transitive dependencies where a non-key attribute is dependent on another non-key attribute - no non-key attributes appear in the LHS of a functional dependency. Sept. 2012 Yangjun Chen ACS-3902 88 Database: Review Boyce Codd Normal Form, BCNF Consider: Instructor teaches one course only. student_no course_no instr_no In 3NF! Student takes a course and has one instructor. {student_no, course_no} instr_no instr_no course_no Sept. 2012 Yangjun Chen ACS-3902 89 Database: Review Boyce Codd Normal Form, BCNF Some sample data: student_no course_no instr_no Sept. 2012 121 121 222 1803 1903 1803 99 77 66 222 1903 77 Yangjun Chen ACS-3902 Instructor 99 teaches 1803 Instructor 77 teaches 1903 Instructor 66 teaches 1803 90 Database: Review Boyce Codd Normal Form, BCNF Some sample data: student_no course_no instr_no Sept. 2012 121 121 222 1803 1903 1803 99 77 66 222 1903 77 Yangjun Chen ACS-3902 Instructor 99 teaches 1803 Instructor 77 teaches 1903 Instructor 66 teaches 1803 91 Database: Review Boyce Codd Normal Form, BCNF student_no course_no instr_no 121 121 222 1803 1903 1803 99 77 66 222 1903 77 Instructor 99 teaches 1803 Instructor 77 teaches 1903 Instructor 66 teaches 1803 Deletion anomaly: If we delete all rows for course 1803 we’ll lose the information that instructors 99 teaches student 121 and 66 teaches student 222. Insertion anomaly: How do we add the fact that instructor 55 teaches course 2906? Sept. 2012 Yangjun Chen ACS-3902 92 Database: Review Boyce Codd Normal Form, BCNF Which decomposition preserves all the information? S# C# C# I# 121 121 222 222 1803 1903 1803 1903 1803 1903 1803 99 77 66 student_no course_no ? course_no instr_no student_no course_no Joining these two tables leads to spurious tuples - result includes 121 1803 66 222 1803 99 student_no instr_no student_no instr_no course_no instr_no Sept. 2012 Yangjun Chen ACS-3902 93 Database: Review student_no course_no course_no instr_no student_no course_no instr_no 121 1803 99 S# C# C# I# 121 1903 77 121 1803 1803 99 222 1803 66 121 1903 1903 77 222 1903 77 222 1803 1803 66 222 1903 Sept. 2012 Yangjun Chen ACS-3902 94 Database: Review Boyce Codd Normal Form, BCNF Which decomposition preserves all the information? S# C# S# I# 121 121 222 222 1803 1903 1803 1903 121 121 222 222 99 77 66 77 Joining these two tables leads to spurious tuples - result includes 121 1803 77 121 1903 99 222 1803 77 222 1903 66 Sept. 2012 Yangjun Chen student_no course_no course_no instr_no student_no course_no ? student_no instr_no student_no instr_no course_no instr_no ACS-3902 95 Database: Review student_no course_no student_no instr_no student_no course_no instr_no 121 1803 99 S# C# S# I# 121 1903 77 121 1803 121 99 222 1803 66 121 1903 121 77 222 1903 77 222 1803 222 66 222 1903 222 77 Sept. 2012 Yangjun Chen ACS-3902 96 Database: Review Boyce Codd Normal Form, BCNF This decomposition preserves all the information. S# I# C# I# 121 121 222 222 99 77 66 77 1803 1903 1803 99 77 66 Only FD is instr_no student_no instr_no course_no instr_no course_no but the join preserves {student_no, course_no} Sept. 2012 Yangjun Chen ACS-3902 instr_no 97 Database: Review student_no Instr_no course_no instr_no student_no course_no instr_no 121 1803 99 S# I# C# I# 121 1903 77 121 99 1803 99 222 1803 66 121 77 1903 77 222 1903 77 222 66 1803 66 222 77 Sept. 2012 Yangjun Chen ACS-3902 98 Database: Review Definition of lossless join property - relation decomposition - lossless join property Lossless join Testing algorithm - matrix construction - matrix initialization - matrix modification Sept. 2012 Yangjun Chen ACS-3902 99 Database: Review • Basic definition of Lossless-join A decomposition D = {R1, R2,..., Rm} of R has the lossless join property with respect to the set of dependencies F on R if, for every relation r of R that satisfies F, the following holds, (R1(r), ..., Rm(r)) = r, where is the natural join of all the relations in D. The word loss in lossless refers to loss of information, not to loss of tuples. Sept. 2012 Yangjun Chen ACS-3902 100 Database: Review Emp_PROJ SSN PNUM hours ENAME PNAME PLOCATION F = {SSN ENAME, PNUM {PNAME, PLOCATION}, {SSN, PNUM} hours} R1 SSN ENAME R3 SSN PNUM Sept. 2012 R2 PNUM PNAME PLOCATION hours Yangjun Chen Lossless join ACS-3902 101 Database: Review • decomposion-1 A1 SSN A2 ENAME A3 PNUM A4 PNAME A5 PLOCATION A6 hours R1 b11 b12 b13 b14 b15 b16 R2 b21 b22 b23 b24 b25 b26 R3 b31 b32 b33 b34 b35 b36 R1 a1 a2 b13 b14 b15 b16 R2 b21 b22 a3 a4 a5 b26 R3 a1 b32 a3 b34 b35 a6 Sept. 2012 Yangjun Chen ACS-3902 102 Database: Review SSN ENAME SSN ENAME R1 a1 a2 b13 b14 b15 b16 R2 b21 b22 a3 a4 a5 b26 R3 a1 a2 a3 b34 b35 a6 PNUM {PNAME, PLOCATION} PNUM PNAME PLOCATION R1 a1 a2 b13 b14 b15 b16 R2 b21 b22 a3 a4 a5 b26 R3 a1 a2 a3 a4 a5 a6 Sept. 2012 Yangjun Chen ACS-3902 103 Database: Review • Example: decomposition-2 Emp_PROJ SSN PNUM hours ENAME PNAME PLOCATION F = {SSN ENAME, PNUM {PNAME, PLOCATION}, {SSN, PNUM} hours} R1 ENAME PLOCATION Not lossless join R2 SSN PNUM Sept. 2012 hours PNAME PLOCATION Yangjun Chen ACS-3902 104 Database: Review • decomposition-2 A1 SSN A2 ENAME A3 PNUM A4 PNAME A5 PLOCATION A6 hours R1 b11 b12 b13 b14 b15 b16 R2 b21 b22 b23 b24 b25 b26 R1 b11 a2 b13 b14 a5 b16 R2 a1 b22 a3 a4 a5 a6 SSN ENAME PNUM {PNAME, PLOCATION} {SSN, PNUM} hours The matrix can not be changed! Sept. 2012 Yangjun Chen ACS-3902 105 Database: Review Multi-Dimensional Indexes • Multiple-key indexes • kd-trees • Quad trees • R-trees • Bit map • Inverted files Sept. 2012 Yangjun Chen ACS-3902 106 Database: Review Multiple-key indexes (Indexes over more than one attributes) Employee ename ssn age salary dnumber Aaron, Ed Abbott, Diane Adams, John Adams, Robin Sept. 2012 Yangjun Chen ACS-3902 107 Database: Review Multiple-key indexes (Indexes over more than one attributes) Index on age Sept. 2012 Index on salary Yangjun Chen ACS-3902 108 Database: Review Multiple-key indexes 60 400 260 25 30 45 50 60 70 85 60 350 75 100 120 275 260 110 140 Sept. 2012 Yangjun Chen ACS-3902 109 Database: Review kd-Trees (A generalization of binary trees) A kd-tree is a binary tree in which interior nodes have an associated attribute a and a value v that splits the data points into two parts: those with a-value less than v and those with a-value equal or larger than v. Sept. 2012 Yangjun Chen ACS-3902 110 Database: Review kd-Trees salary 150 age 60 70, 110 85, 140 salary 80 50, 100 50, 120 age 38 25, 60 Sept. 2012 age 47 salary 300 30, 260 50, 275 60, 260 25, 400 45, 350 45, 60 50, 75 Yangjun Chen ACS-3902 111 Database: Review Insert a new entry into a kd-tree: insert(35, 500): salary 150 age 60 70, 110 85, 140 salary 80 50, 100 50, 120 age 38 25, 60 Sept. 2012 age 47 30, 260 salary 300 50, 275 60, 260 25, 400 45, 350 45, 60 50, 75 Yangjun Chen ACS-3902 112 Database: Review Insert a new entry into a kd-tree: salary 150 insert(35, 500): age 60 70, 110 85, 140 salary 80 50, 100 50, 120 age 38 25, 60 Sept. 2012 age 47 30, 260 age 35 25, 400 45, 60 50, 75 Yangjun Chen 50, 275 60, 260 salary 300 ACS-3902 35, 500 45, 350 113 Database: Review Quad-trees In a Quad-tree, each node corresponds to a square region in two dimensions, or to a k-dimensional cube in k dimensions. • If the number of data entries in a square is not larger than what will fit in a block, then we can think of this square as a leaf node. • If there are too many data entries to fit in one block, then we treat the square as an interior node, whose children correspond to its four quadrants. Sept. 2012 Yangjun Chen ACS-3902 114 Database: Review Quad-trees name age … salary … … 25 … 400 … 400k salary 0 Sept. 2012 100 age Yangjun Chen ACS-3902 115 Database: Review 400k Quad-trees SW 25, 60 46, 60 50, 75 50, 100 SW – south-west SE – south-east Sept. 2012 SE NE 75, 100 85, 140 100 0 50, 200 NW 50, 275 60, 260 50, 120 70, 110 25, 300 30, 260 25, 400 45, 350 NW – north-west NE – north-east Yangjun Chen ACS-3902 116 Database: Review R-trees An R-tree is an extension of B-trees for multidimensional data. • An R-tree corresponds to a whole area (a rectangle for two-dimensional data.) • In an R-tree, any interior node corresponds to some interior regions, or just regions, which are usually a rectangle • Each region x in an interior node n is associated with a link to a child of n, which corresponds to all the subregions within x. Sept. 2012 Yangjun Chen ACS-3902 117 Database: Review Suppose that the local cellular phone company adds a POP (point of presence, or base station) at the position shown below. 100 POP school road1 house2 house1 pipeline road2 Sept. 2012 0 Yangjun Chen ACS-3902 100 118 Database: Review R-trees ((0, 0), (60, 50)) road1 Sept. 2012 road2 house1 ((20, 20), (100, 80)) school house2 pipeline pop Yangjun Chen ACS-3902 119 Database: Review Insert a new region r into an R-tree. 100 POP school road1 house2 house1 pipeline road2 house3 100 0 Sept. 2012 ((70, 5), (980, 15)) Yangjun Chen ACS-3902 120 Database: Review Insert a new region r into an R-tree. 1. Search the R-tree, starting at the root. 2. If the encountered node is internal, find a subregion into which r fits. • If there is more than one such region, pick one and go to its corresponding child. • If there is no subregion that contains r, choose any subregion such that it needs to be expanded as little as possible to contain r. ((70, 5), (980, 15)) ((0, 0), (60, 50)) road1 Sept. 2012 road2 house1 Yangjun Chen ((20, 20), (100, 80)) school house2 pipeline pop ACS-3902 121 Database: Review Two choices: • If we expand the lower subregion, corresponding to the first leaf, then we add 1000 square units to the region. • If we extend the other subregion by lowering its bottom by 5 units, then we add 1200 square units. ((0, 0), (80, 50)) road1 road2 house1 house3 Sept. 2012 Yangjun Chen ((20, 20), (100, 80)) school house2 pipeline pop ACS-3902 122 Database: Review Insert a new region r into an R-tree. 3. If the encountered node v is a leaf, insert r into it. If there is no room for r, split the leaf into two and distribute all subregions in them as evenly as possible. Calculate the ‘parent’ regions for the new leaf nodes and insert them into v’s parent. If there is the room at v’s parent, we are done. Otherwise, we recursively split nodes going up the tree. Suppose that each leaf has room for 6 regions. road1 Sept. 2012 ((0, 0), (100, 100)) road2 Add POP (point of presence, or base station) house1 school house2 pipeline Yangjun Chen ACS-3902 123 Database: Review Bit map 1. Image that the records of a file are numbered 1, …, n. 2. A bitmap for a data field F is a collection of bit-vector of length n, one for each possible value that may appear in the field F. 3. The vector for a specific value v has 1 in position i if the ith record has v in the field F, and it has 0 there if not. Sept. 2012 Yangjun Chen ACS-3902 124 Database: Review Example Employee ename ssn age salary dnumber Aaron, Ed Abbott, Diane 30 30 60 60 Adams, John Adams, Robin Brian, Robin Brian, Mary Widom, Jones 40 50 55 55 60 75 75 78 80 100 Bit maps for age: 30: 1100000 40: 0010000 50: 0001000 Sept. 2012 Bit maps for salary: 55: 0000110 60: 0000001 Yangjun Chen 60: 1100000 75: 0011000 78: 0000100 ACS-3902 80: 0000010 100: 0000001 125 Database: Review Query evaluation Select ename From Employee Where age = 55 and salary = 80 In order to evaluate this query, we intersect the vectors for age = 55 and salary = 80. 0000110 0000010 vector for age = 55 vector for salary = 80 0000010 This indicates the 6th tuple is the answer. Sept. 2012 Yangjun Chen ACS-3902 126 Database: Review Range query evaluation Select ename From Employee Where 30 < age < 55 and 60 < salary < 78 We first find the bit-vectors for the age values in (30, 50); there are only two: 0010000 and 0001000 for 40 and 50, respectively. Take their bitwise OR: 0010000 0001000 = 0011000. Next find the bit-vectors for the salary values in (60, 78) and take their bitwise OR: 1100000 0011000 = 1111000. 0011000 1111000 0011000 The 3rd and 4th tuples are the answer. Sept. 2012 Yangjun Chen ACS-3902 127 Database: Review Compression of bitmaps Run-length encoding: Run in a bit vector: a sequence of i 0’s followed by a 1. 000000010001 This bit vector contains two runs. Run compression: a run r is represented as another bit string r’ composed of two parts. part 1: i expressed as a binary number, denoted as b1(i). part 2: Assume that b1(i) is j bits long. Then, part 2 is a sequence of (j – 1) 1’s followed by a 0, denoted as b2(i). r’ = b2(i)b1(i). Sept. 2012 Yangjun Chen ACS-3902 128 Database: Review Compression of bitmaps Run-length encoding: Run in a bit vector s: a sequence of i 0’s followed by a 1. 000000010001 This bit vector contains two runs. r’ = b2(i)b1(i). r1 = 00000001 r1’ = 110111 b11 = 7 = 111, b12 = 110 r2 = 0001 r2’ = 1011 b11 = 3 = 11, b12 = 10 Sept. 2012 Yangjun Chen ACS-3902 129 Database: Review 000000010001 r1’ r2’ = 1101111011 Starting at the beginning, find the first 0 at the 3rd bit, so j = 3. The next 3 bits are 111, so we determine that the first integer is 7. In the same way, we can decode1011. Decoding a compressed sequence s’: 1. Scan s’ from the beginning to find the first 0. 2. Let the first 0 appears at position j. Check the next j bits. The corresponding value is a run. 3. Remove all these bits from s’. Go to (1). Sept. 2012 Yangjun Chen ACS-3902 130 Database: Review Inverted files An inverted file - A list of pairs of the form: <key word, pointer> … the cat is fat cat … was raining cats and dogs … dog … Fido the Dogs … a bucket of pointers Sept. 2012 Yangjun Chen ACS-3902 131 Database: Review Inverted files When we use “buckets” of pointers to occurrences of each word, we may extend the idea to include in the bucket array some information about each occurrence. cat type position title header anchor text 5 10 3 57 dog … … the cat is fat … … … Sept. 2012 Yangjun Chen ACS-3902 … was raining cats and dogs … … Fido the Dogs … 132 Database: Review Hierarchical database schema - hierarchical schema - record type, PCR type Hierarchical databases - virtual PCR: virtual child, virtual parent Database languages - HDDL - HDML Sept. 2012 Yangjun Chen ACS-3902 133 Database: Review ERD for Chapter 6 database example dependent n n 1 1 m Works on employee n project 1 n 1 1 n 1 Dept_locations Sept. 2012 n Yangjun Chen 1 ACS-3902 department 134 Database: Review • Virtual Parent-child Relationships - Hierarchical schema using VPCR - for a Company database D Department Dname Dnum E Employee Ename Minit … ... P L Dlocation Location Project Pname … ... Esupervisee SPTR S Y Demployee EPTR M Dmanager StartDate MPTR Sept. 2012 T Dependent DEPname Minit ... W Pworker Hours WPTR Yangjun Chen ACS-3902 135