CSE 5330/7330 Database Introduction
Fall 2009
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
POBox 750122
Dallas, Texas 75275-0122

CSE 5330/7330 Fall 2009

Database Introduction
  NOTE: These slides provide an overview of the basic database concepts. During the semester we will return to them to provide an overview and summary of each section covered.

Database Outline
  Introduction
  File Organization & Indexing
  Data Models
  Relational Model
  SQL/Query Processing
  Transactions

Database History
  A Short Database History by John Vaughn
    http://math.hws.edu/vaughn/cpsc/343/2003/history.html
  A Brief History of Database Systems
    http://www.comphist.org/computing_history/new_page_9.htm

DB History Snapshots
  CFMS/DBOMP (Late 60s)
  EF Codd Paper (1970)
  DBTG Report (1974)
  IMS/IDMS (Early 1970s)
  System R (1970s)
  Transactions – Jim Gray (1970s)
  ER Model (1976)
  OODB (1985)
  XML/Internet (1990s)

What is a Database?
  Collection of Related Data
    – Data
    – Hardware
    – Software (DBMS)
    – Users

File vs. Database
  Single user vs. multiple users
  Simple relationships vs. complex relationships
  Integrity support
  Concurrency control
  Recovery
  Query language
  Security
  Different Views of data

Some DB Terms
  Data/Information/Knowledge
  DataBase Management System (DBMS)
  Data Dictionary/Directory/Metadata
  Data Model
  Data Definition Language (DDL)
  Data Manipulation Language (DML)
  DataBase Administrator (DBA)
  Data Administrator (DA)
  Database designer
  Information Resource Manager (IRM)
  Chief Information Officer (CIO)
  …

DBMS Components
  DDL Compiler
  DML Compiler
  Precompiler (embedded language support)
  Access methods
  Concurrency Control
  Recovery
  Security
  Data Dictionary (Metadata)
  Utility Services
  …

Views of Data
  Levels of Abstraction
    – External view
    – Conceptual schema
    – Physical (Internal schema)
  Data independence

Data Model
  Way to “picture” and access data independent of how it is actually stored.
    – Data Description
    – Data Relationships
    – Operations
    – Integrity/Consistency constraints
  Examples:
    – Entity-Relationship (ER/ERA)
    – Relational
    – Object Oriented
    – Object/Relational
    – Older – Network/Hierarchical

Relational Model*
  Based on tables, as:
    acct #   Name    Balance
    12345    Sally   1000.21
    34567    Sue     285.48
    …        …       …
  Rows (tuples)
  Columns (attributes)
  Today used in most DBMS's.
  * Most of the following slides were obtained from the home page for A First Course in Database Systems by Jeffrey D. Ullman and Jennifer Widom, Prentice Hall, 2002, http://www-db.stanford.edu/~ullman/fcdb.html

Data Relationships
  One-to-one (1:1) Ex: Name to SSN
  One-to-many (1:M) Ex: Name to Phone
  Many-to-many (M:N) Ex: Part to Supplier
  What data structures can be used to store these relationships?
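
One possible answer to the question on the Data Relationships slide, as a minimal in-memory sketch rather than what a DBMS actually does. Only the three relationship kinds and the Name/SSN, Name/Phone, Part/Supplier pairings come from the slide; the sample values and the helper names suppliers_of and parts_from are made up for illustration.

```python
# Hypothetical sample data; only the three relationship kinds come from the slide.

# One-to-one (Name to SSN): one dictionary per direction, so each side stays unique.
ssn_by_name = {"Sally": "111-11-1111", "Sue": "222-22-2222"}
name_by_ssn = {ssn: name for name, ssn in ssn_by_name.items()}

# One-to-many (Name to Phone): one key maps to a list of values.
phones_by_name = {"Sally": ["555-1234", "555-1235"], "Sue": ["555-2221"]}

# Many-to-many (Part to Supplier): a set of (part, supplier) pairs,
# which can be searched from either side.
supplies = {("bolt", "Acme"), ("bolt", "Zenith"), ("nut", "Acme")}


def suppliers_of(part):
    """All suppliers connected to the given part."""
    return sorted(s for p, s in supplies if p == part)


def parts_from(supplier):
    """All parts connected to the given supplier."""
    return sorted(p for p, s in supplies if s == supplier)
```

The set of pairs used for the many-to-many case has the same shape as the table that later slides use for a relationship set.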

Typical Data Structures used in DBMSs
  Sequential Files
  Hash Table
    – Extendible Hash Table
  B-Tree (Multiway Search Tree)
    – B+-Tree
  Combinations of these
  Indices

Placement of Data on Disk
  Record/Cylinder/Block/Sector
  Blocking Factor
  Allocation
    – Contiguous
    – Linked
    – Extents
    – Indexed
  Clustering
  Partitioning
  RAID

Data Structure Pointers
  Logical
    – Key
    – Relative Block
  Physical
    – Memory – Physical Address (offset)
    – Disk – Physical Address
      » Device/Cylinder/Track/Sector/Block/Offset

Disk vs. Memory Data Structures
  Objective
    – Disk – minimize I/O
    – Memory – minimize memory accesses or CPU time
  Tree
    – Disk – large nodes, shallow
    – Memory – small nodes, deep

Access
  Sequential – Retrieve records in order (logical/physical)
  Random – Retrieve record based on key
  Direct – Retrieve record based on physical address
  Relative – Retrieve record based on relative position in file
  Binary Search – Randomly retrieve record doing binary search

Organization
  Sequential – Records stored in logical order of key. Access: Sequential, relative, binary search
  Heap – Records added to the end or wherever there is space. Access: Direct
  Btree – Multiway balanced search tree. Access: Sequential, random
  Hash – Store and access record based on address determined when key is hashed. Access: Random

Indexing
  Speed up processing of data by providing alternative access path.
  Both index and primary storage of data provide access method.
  Ex: Employee Data: Hash on ID, BTree index on last name, BTree index on job type

Index Types
  Number of entries
    – Dense – One index entry for each record in file
    – Sparse – One index entry for many records
  Key
    – Primary – Same key as main file
    – Secondary – Different key from original file
  Organizations: Hash, BTree, Sequential, BST

Index Search Times
    Organization      Worst     Expected
    Sequential        O(n)      O(n/2)
    Hash              O(n)      O(1.??)
    Tree (Balanced)   O(n)      O(lg n)
    B+-Tree           O(lg n)   O(lg n)

Hashing
  Bucket in one block (or cluster thereof)
  Hash value may be precise address or relative block (bucket) number
  Collisions handled by linked lists
  Dynamic Hashing – Allows hash table size to grow

Multiple Key Indexing
  Key composed of many subkeys
  Access based on all or subset of these
  Some indexing structures specifically targeted to n-dimensional accessing

Data Model Evolution
  [Timeline from the 60’s through the 70's, 80's, and 90’s to now: Hierarchical and Network; Relational (choice for most new applications); Object Bases and Knowledge Bases]

Entity/Relationship Model
  Diagrams to represent designs.
  Entity like object, = “thing.”
  Entity set like class = set of “similar” entities/objects.
  Attribute = property of entities in an entity set.
  In diagrams:
    – entity set: rectangle
    – attribute: oval
  [Figure: entity set Students with attributes name, ID, phone, height]

Relationships
  Connect two or more entity sets.
  Represented by diamonds.
  [Figure: entity sets Students and Courses connected by relationship Taking]

Relationship Set
  Think of the “value” of a relationship set as a table.
  One column for each of the connected entity sets.
  One row for each list of entities, one from each set, that are connected by the relationship.
    Students   Courses
    Sally      CS180
    Sally      CS111
    Joe        CS180
    …          …

  [Figure: relationship Enrolls connecting entity sets Students, Courses, and TAs]
    Students   Courses   TAs
    Ann        CS180     Jan
    Sue        CS180     Pat
    Bob        CS180     Jan
    …          …         …

Beers-Bars-Drinkers Example
  [Figure: entity sets Bars (name, addr, license), Beers (name, manf), and Drinkers (name, addr); relationships Serves (Bars, Beers), Frequents (Drinkers, Bars), and Likes (Drinkers, Beers)]

Multiplicity of Relationships
  Many-many
  Many-one
  One-one

Representation of Many-One
  E/R: arrow pointing to “one.”
    – Rounded arrow = “exactly one.”

Example: Drinkers Have Favorite Beers
  [Figure: the Beers-Bars-Drinkers diagram with an added relationship Favorite between Drinkers and Beers]

One-One Relationships
  Put arrows in both directions.
  [Figure: entity sets Manfs and Beers connected by relationship Bestseller]
  Design Issue: Is the rounded arrow justified?
  Design Issue: Here, manufacturer is an E.S. In earlier diagrams it is an attribute. Which is right?

Attributes on Relationships
  [Figure: Bars and Beers connected by Sells, with attribute price attached to Sells]
  Shorthand for 3-way relationship:
  [Figure: Sells connecting Bars, Beers, and Prices, with attribute price attached to Prices]

Roles
  Sometimes an E.S. participates more than once in a relationship.
  Label edges with roles to distinguish.
  [Figure: relationship Married connected twice to Drinkers, with edges labeled husband and wife]
    Husband   Wife
    d1        d2
    d3        d4
    …         …

  [Figure: relationship Buddies connected twice to Drinkers, with edges labeled 1 and 2]
    Buddy1   Buddy2
    d1       d2
    d1       d3
    d2       d1
    d2       d4
    …        …
  Notice Buddies is symmetric, Married not.
    – No way to say “symmetric” in E/R.
  Design Question: Should we replace husband and wife by one relationship spouse?

Multiple Inheritance
  Theoretically, an E.S. could be a subclass of several other entity sets.
  [Figure: Beers (name, manf) and Wines (name, manf), each with an isa link to Grape Beers]

Keys
  A key is a set of attributes whose values can belong to at most one entity.
  In E/R model, every E.S. must have a key.
    – It could have more than one key, but one set of attributes is the “designated” key.
  In E/R diagrams, you should underline all attributes of the designated key.
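
The key definition above ("values can belong to at most one entity") can be tested against a concrete set of entities. The sketch below is only an illustration: the function name violates_key and the sample entities are hypothetical, and a check like this can only refute a proposed key on the data at hand. Whether a set of attributes really is a key is a design decision about all possible entities.

```python
def violates_key(entities, key_attrs):
    """Return True if two entities agree on every attribute in key_attrs."""
    seen = set()
    for entity in entities:
        values = tuple(entity[a] for a in key_attrs)
        if values in seen:
            return True  # a second entity carries the same key values
        seen.add(values)
    return False


# Hypothetical entity set, using the attribute names of the Beers examples.
beers = [
    {"name": "WinterBrew", "manf": "Pete's"},
    {"name": "BudLite", "manf": "A.B."},
    {"name": "Bud", "manf": "A.B."},
]
```

On this sample, name is consistent with being a key, while manf is ruled out because two beers share the manufacturer A.B.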

Example
  Suppose name is key for Beers.
  [Figure: Beers (name, manf) with an isa link to Ales (color)]
  Beer name is also key for ales.
    – In general, key at root is key for all.

Example: A Multiattribute Key
  [Figure: entity set Courses with attributes dept, number, hours, room; dept and number together are the designated key]
  Possibly, the combination of hours + room also forms a key, but we have not designated it as such.

Relational Model
  Table = relation.
  Column headers = attributes.
  Row = tuple
    Beers
    name         manf
    WinterBrew   Pete’s
    BudLite      A.B.
    …            …
  Relation schema = name(attributes) + other structure info., e.g., keys, other constraints.
  Example: Beers(name, manf)
    – Order of attributes is arbitrary, but in practice we need to assume the order given in the relation schema.
  Relation instance is current set of rows for a relation schema.
  Database schema = collection of relation schemas.

Relation Instance
    Name    Address       Telephone
    Bob     123 Main St   555-1234
    Bob     128 Main St   555-1235
    Pat     123 Main St   555-1235
    Harry   456 Main St   555-2221
    Sally   456 Main St   555-2221
    Sally   456 Main St   555-2223
    Pat     12 State St   555-1235

Why Relations?
  Very simple model.
  Often a good match for the way we think about our data.
  Abstract model that underlies SQL, the most important language in DBMS’s today.

Relational Design
  Simplest approach (not always best): convert each E.S. to a relation and each relationship to a relation.
  Entity Set → Relation
  E.S. attributes become relational attributes.
  [Figure: entity set Beers with attributes name, manf]
  Becomes: Beers(name, manf)

Keys in Relations
  An attribute or set of attributes K is a key for a relation R if we expect that in no instance of R will two different tuples agree on all the attributes of K.
  Indicate a key by underlining the key attributes.
  Example: If name is a key for Beers: Beers(name, manf), with name underlined.

E/R Relationships → Relations
  Relation has attribute for key attributes of each E.S. that participates in the relationship.
  Add any attributes that belong to the relationship itself.
  Renaming attributes OK.
    – Essential if multiple roles for an E.S.

  [Figure: Drinkers (name, addr) and Beers (name, manf) connected by Likes and Favorite; Buddies (roles 1, 2) and Married (roles husband, wife) each connected twice to Drinkers]
    Likes(drinker, beer)
    Favorite(drinker, beer)
    Married(husband, wife)
    Buddies(name1, name2)
  For one-one relation Married, we can choose either husband or wife as key.

Combining Relations
  Sometimes it makes sense to combine relations.
  Common case: Relation for an E.S. E plus the relation for some many-one relationship from E to another E.S.
  Example: Combine Drinker(name, addr) with Favorite(drinker, beer) to get Drinker1(name, addr, favBeer).
  Danger in pushing this idea too far: redundancy.
  e.g., combining Drinker with Likes causes the drinker's address to be repeated, viz.:
    name    addr        beer
    Sally   123 Maple   Bud
    Sally   123 Maple   Miller
  Notice the difference: Favorite is many-one; Likes is many-many.

Keys of Relations
  K is a key for relation R if:
    1. K → all attributes of R. (Uniqueness)
    2. For no proper subset of K is (1) true. (Minimality)
  If K at least satisfies (1), then K is a superkey.
  Conventions: Pick one key; underline key attributes in the relation schema.

Example
  Drinkers(name, addr, beersLiked, manf, favoriteBeer)
  {name, beersLiked} functionally determines (FD’s) all attributes, as seen.
    – Shows {name, beersLiked} is a superkey.
  name → beersLiked is false, so name not a superkey.
  beersLiked → name also false, so beersLiked not a superkey.
  Thus, {name, beersLiked} is a key.
  No other keys in this example.
    – Neither name nor beersLiked is on the right of any observed FD, so they must be part of any superkey.
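
The superkey and key reasoning in this example can be checked mechanically with the standard attribute-closure computation. This is a sketch under assumptions: the function names closure, is_superkey, and is_key are illustrative, and the three FDs are the ones the deck lists for Drinkers on its Example of Problems slide (name → addr, name → favoriteBeer, beersLiked → manf).

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs; fds is a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply an FD when its whole left side is already determined.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_superkey(attrs, all_attrs, fds):
    """Condition (1): attrs determines every attribute of the relation."""
    return closure(attrs, fds) == set(all_attrs)


def is_key(attrs, all_attrs, fds):
    """Conditions (1) and (2): a superkey with no smaller superkey inside it."""
    attrs = set(attrs)
    # Dropping one attribute at a time is enough, because every superset
    # of a superkey is itself a superkey.
    return is_superkey(attrs, all_attrs, fds) and not any(
        is_superkey(attrs - {a}, all_attrs, fds) for a in attrs
    )


DRINKERS = {"name", "addr", "beersLiked", "manf", "favoriteBeer"}
FDS = [
    ({"name"}, {"addr"}),
    ({"name"}, {"favoriteBeer"}),
    ({"beersLiked"}, {"manf"}),
]
```

With these inputs, is_key({"name", "beersLiked"}, DRINKERS, FDS) agrees with the slide's conclusion, and neither attribute alone passes is_superkey.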

Example 2
  [Figure: relation with attributes Lastname, Firstname, Student ID, Major; Lastname and Firstname marked as a key (2 attributes), Student ID marked as a key, and a superkey marked]
  Note: There are alternate keys
  Keys are {Lastname, Firstname} and {StudentID}

Normalization
  Process of simplifying relational design:
    – Avoid redundancy
  Functional Dependencies (FD)
    – Identify relationships between data values
    – SSN → Name
      » In any tuple, the value for SSN determines a unique value for Name.
      » If the same SSN exists in two tuples, you’ll have the same Name duplicated.
  FDs are used by algorithms to determine best relations to be used given a set of attributes.

Example of Problems
  Drinkers(name, addr, beersLiked, manf, favoriteBeer)
    name      addr         beersLiked   manf     favoriteBeer
    Janeway   Voyager      Bud          A.B.     WickedAle
    Janeway   ???          WickedAle    Pete's   ???
    Spock     Enterprise   Bud          ???      Bud
  FD’s:
    1. name → addr
    2. name → favoriteBeer
    3. beersLiked → manf
  ???’s are redundant, since we can figure them out from the FD’s.
  Update anomalies: If Janeway gets transferred to the Intrepid, will we change addr in each of her tuples?
  Deletion anomalies: If nobody likes Bud, we lose track of Bud’s manufacturer.

“Core” Relational Algebra
  A small set of operators that allow us to manipulate relations in limited but useful ways.
  The operators are:
    1. Union, intersection, and difference: the usual set operators.
       – But the relation schemas must be the same.
    2. Selection: Picking certain rows from a relation.
    3. Projection: Picking certain columns.
    4. Products and joins: Composing relations in useful ways.
    5. Renaming of relations and their attributes.

Selection
  R1 = σ_C(R2)
  where C is a condition involving the attributes of relation R2.
  Example
  Relation Sells:
    bar     beer     price
    Joe's   Bud      2.50
    Joe's   Miller   2.75
    Sue's   Bud      2.50
    Sue's   Coors    3.00
  JoeMenu = σ_bar=Joe's(Sells)
    bar     beer     price
    Joe's   Bud      2.50
    Joe's   Miller   2.75

Projection
  R1 = π_L(R2)
  where L is a list of attributes from the schema of R2.
  Example
  π_beer,price(Sells)
    beer     price
    Bud      2.50
    Miller   2.75
    Coors    3.00
  Notice elimination of duplicate tuples.

Product
  R = R1 × R2
  pairs each tuple t1 of R1 with each tuple t2 of R2 and puts in R a tuple t1t2.
  [Figure: R1 with attributes A, B, C, D and R2 with attributes D, E, F; the product has attributes A, B, C, D, D', E, F]

Join
  Sells:
    bar     beer     price
    Joe's   Bud      2.50
    Joe's   Miller   2.75
    Sue's   Bud      2.50
    Sue's   Coors    3.00
  Bars:
    name    addr
    Joe's   Maple St.
    Sue's   River Rd.
  BarInfo = Sells ⋈_Sells.bar=Bars.name Bars
    bar     beer     price   name    addr
    Joe's   Bud      2.50    Joe's   Maple St.
    Joe's   Miller   2.75    Joe's   Maple St.
    Sue's   Bud      2.50    Sue's   River Rd.
    Sue's   Coors    3.00    Sue's   River Rd.

SQL
  SEQUEL in System R
  Structured English QUEry Language
  DDL and DML
  Standard Relational query language

SQL Operations
  SELECT … FROM … WHERE …
  UPDATE … SET … WHERE …
  INSERT INTO … VALUES (…)
  DELETE … WHERE …

SQL
  Employee(Name, Dept)
  Department(Dept, Manager)
    SELECT Manager
    FROM Employee, Department
    WHERE Employee.Name = 'Clark Kent'
      AND Employee.Dept = Department.Dept

Host Languages
  C, C++, Fortran, Lisp, COBOL
  [Figure: application program with local variables (memory) makes calls to the DB; the DBMS manages storage]
  Different DBMSs support different host language interfaces
  Precompiler
  ODBC/JDBC

Embedded SQL
  Add to a conventional programming language (C in our examples) certain statements that represent SQL operations.
  Each embedded SQL statement introduced with EXEC SQL.
  Preprocessor converts C + SQL to pure C.
    – SQL statements become procedure calls.

Example
  Find the price for a given beer at a given bar.
  Sells(bar, beer, price)
    EXEC SQL BEGIN DECLARE SECTION;
        char theBar[21], theBeer[21];
        float thePrice;
    EXEC SQL END DECLARE SECTION;
    ...
    /* assign to theBar and theBeer */
    ...
    EXEC SQL SELECT price
        INTO :thePrice
        FROM Sells
        WHERE beer = :theBeer AND bar = :theBar;
    ...

Call-Level Interfaces
  A more modern approach to the host-language/SQL connection is a call-level interface, in which the C (or other language) program creates SQL statements as character strings and passes them to functions that are part of a library.
  Similar to what really happens in embedded SQL implementations.
  Two major approaches: SQL/CLI (standard of ODBC = open database connectivity) and JDBC (Java database connectivity).

JDBC
  Start with a Connection object, obtained from the DBMS (see text).
  Method createStatement() takes no argument and returns an object of class Statement; method prepareStatement() takes an SQL statement as argument and returns a PreparedStatement.
  Example
    Statement stat1 = myCon.createStatement();
    PreparedStatement stat2 = myCon.prepareStatement(
        "SELECT beer, price " +
        "FROM Sells " +
        "WHERE bar = 'Joe''s Bar'"
    );
  myCon is a connection, stat1 is an “empty” statement object, and stat2 is a (prepared) statement object that has an SQL statement associated.

Executing Statements
  JDBC distinguishes queries from updates
  Methods executeQuery() and executeUpdate() are used to execute these two kinds of SQL statements.
  When a query is executed, it returns an object of class ResultSet.
  Example
    stat1.executeUpdate(
        "INSERT INTO Sells " +
        "VALUES('Brass Rail', 'Bud', 3.00)"
    );
    ResultSet Menu = stat2.executeQuery();

Getting the Tuples of a ResultSet
  Method next() applies to a ResultSet and moves a “cursor” to the next tuple in that set.
    – Apply next() once to get to the first tuple.
    – next() returns false if there are no more tuples.
  While a given tuple is the current of the cursor, you can get its ith component by applying to a ResultSet a method of the form getX(i), where X is the name for the type of that component.
  Example
    while(Menu.next()) {
        theBeer = Menu.getString(1);
        thePrice = Menu.getFloat(2);
        ...
    }

Transactions
  = units of work that must be:
    1. Atomic = either all work is done, or none of it.
    2. Consistent = relationships among values maintained.
    3. Isolated = appear to have been executed when no other DB operations were being performed.
       – Often called serializable behavior.
    4. Durable = effects are permanent even if system crashes.

Commit/Abort Decision
  Each transaction ends with either:
    1. Commit = the work of the transaction is installed in the database; previously its changes may be invisible to other transactions.
    2. Abort = no changes by the transaction appear in the database; it is as if the transaction never occurred.
       – ROLLBACK is the term used in SQL and the Oracle system

Example
  Sells(bar, beer, price)
  Joe's Bar sells Bud for $2.50 and Miller for $3.00.
  Sally is querying the database for the highest and lowest price Joe charges:
    (1) SELECT MAX(price) FROM Sells
        WHERE bar = 'Joe''s Bar';
    (2) SELECT MIN(price) FROM Sells
        WHERE bar = 'Joe''s Bar';
  At the same time, Joe has decided to replace Miller and Bud by Heineken at $3.50:
    (3) DELETE FROM Sells
        WHERE bar = 'Joe''s Bar' AND
              (beer = 'Miller' OR beer = 'Bud');
    (4) INSERT INTO Sells
        VALUES('Joe''s Bar', 'Heineken', 3.50);

Example: Problem With Rollback
  Suppose Joe executes statement 4 (insert Heineken), but then, during the transaction, thinks better of it and issues a ROLLBACK statement.
  If Sally is allowed to execute her statement 1 (find max) just before the rollback, she gets the answer $3.50, even though Joe doesn't sell any beer for $3.50.
  Fix by making statement 4 a transaction, or part of a transaction, so its effects cannot be seen by Sally unless there is a COMMIT action.

Deadlock
  Deadlock requires all of the following (AND):
    1. Wait and hold: hold some locks while you wait for others
    2. Circular chain of waiters
       [Figure: wait-for graph over T1, T2, T3, T4]
    3. No pre-emption
  We can avoid deadlock by doing at least ONE of:
    1. Get all your locks at once
    2. Apply an ordering to acquiring locks
    3. Allow preemption (for example, use timeout on waits)

Serializability of schedules
    T1             T2
    Read (A)       Read (A)
    A:= A-50       temp:= A * 0.1
    Write (A)      A:= A + temp
    Read (B)       Write (A)
    B:= B+50       Read (B)
    Write (B)      B:= B - temp
                   Write (B)
  [Figure: T1 and T2 accessing A and B on disk; initially A = 100, B = 200]
  Schedule is serializable if effect is the same as a serial schedule
    T1 –> T2: A=    B=
    T2 –> T1: A=    B=

  [Figure: wait-for graph over transactions T1 through T6 and data items A, B, C, D]
  If no progress is possible, then there is a cycle

Cascading Abort
    T1                  T2
    LOCK A
    Read A
    change A
    Write A
    UNLOCK A
                        LOCK A
                        Read A
                        change A
                        Write A
                        UNLOCK A
    LOCK B
    Read B
    Discover problem
    ABORT

Two-Phase Locking (2PL)
  Phase I: All requesting of locks
  precedes
  Phase II: Any releasing of locks
  Theorem: Any schedule for 2-phase locked transactions is serializable
  [Figure: number of locks held plotted against time]
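
The two-phase rule is simple enough to check for a single transaction. The sketch below is an illustration only: the function name is_two_phase and the (action, item) encoding are assumptions, and it does not model lock modes or what other transactions are doing. The first sample lists the lock operations of T1 as they appear on the Cascading Abort slide; the second is a hypothetical rewrite that acquires B before releasing A.

```python
def is_two_phase(ops):
    """ops: ordered (action, item) pairs for one transaction.

    Returns True when every LOCK comes before the first UNLOCK.
    """
    released = False
    for action, _item in ops:
        if action == "UNLOCK":
            released = True  # Phase II has started
        elif action == "LOCK" and released:
            return False  # a lock requested after a release breaks the rule
    return True


# Lock operations of T1 on the Cascading Abort slide.
t1_cascading = [("LOCK", "A"), ("UNLOCK", "A"), ("LOCK", "B")]

# Hypothetical two-phase variant: take B before giving up A.
t1_two_phase = [("LOCK", "A"), ("LOCK", "B"), ("UNLOCK", "A"), ("UNLOCK", "B")]
```

T1 from the Cascading Abort slide fails the check because it locks B after unlocking A, so the theorem's guarantee does not cover that schedule.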