* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Chapter 1: Introduction
Survey
Document related concepts
Transcript
CS 505 Intermediate Topics in Database Systems Prof. Tingjian Ge with thanks to Prof. Stan Zdonik, Brown University Prof. Sam Madden, MIT Prof. Avi Silberschatz, Yale University What is a Database System? Database: A very large collection of related data Models a real world enterprise: Entities (e.g., teams, games / students, courses) Relationships (e.g., The Wildcats are playing in the NCAA) DBMS: A software system that can be used to store, manage and retrieve data from databases Database System: DBMS+data (+ applications) 1.2 Why Study Databases?? Shift from computation to information Always true for corporate computing More and more true in the scientific world and of course, the Web DBMS encompasses much of CS in a practical discipline Operating systems Languages Distributed systems Performance Theory AI Jobs! High-paying jobs!!! 1.3 Why Databases?? Why not store everything in flat files: i.e., use the file system of the OS, cheap/simple… Name, Course, Grade John Smith, CS115, B Mike Stonebraker, CS405G, A This is how things were in the “Bad Old Days” Jim Gray, CS505, A John Smith, CS315, B+ ………………… Yes, but not scalable… 1.4 Problem 1 Data redundancy and inconsistency Multiple file formats, duplication of information in different files Name, Course, Email, Grade John Smith, [email protected], CS112, B Jim Gray, CS560, [email protected], A John Smith, CS560, [email protected], B+ Name, Email, Course, Grade Mike Stonebraker, [email protected], CS234, A J. Smith, [email protected], CS560, B+ Why is this a problem? Wasted space Potential inconsistencies (e.g., multiple formats, John Smith vs Smith J.) 1.5 Problem 2 Data retrieval: Find the students who took CS505 Find the students with GPA > 3.5 For every query we need to write a program! We need the retrieval to be: Easy to write Execute efficiently 1.6 Problem 3 Data Integrity No support for sharing: Prevent simultaneous modifications No coping mechanisms for system crashes No means of Preventing Data Entry Errors (checks must be hard-coded in the programs) Security problems Database systems offer solutions to all the above problems 1.7 Problem 4 Long-lived data Evolution What happens if I need to change my mind about how the data is stored? Access patterns change Tuning Don’t want to have to re-write all my applications. Solution: Data independence! 1.8 Data Organization Data Models: a framework for describing data objects data relationships data semantics data constraints Presents primitives for Representing Data Data Definition Language (DDL) Manipulating Data Data Manipulation Language (DML) Entity-Relationship model We will concentrate on Relational model Other models: object-oriented model semi-structured data models, XML 1.9 Database Schema Similar to types and variables in programming languages Schema – the structure of the database e.g., the database consists of information about a set of customers and accounts and the relationship between them) Expressed in some data model Occurs at multiple levels Logical schema: database design at the logical level Physical schema: database design at the physical level 1.10 Data Organization Two levels of data modeling Logical level: describes data stored in database, and the relationships among the data. type customer = record name : string; street : string; city : integer; end; Physical level: describes how a record (e.g., customer) is stored. Also, View level: application programs hide details of data types. Views can also hide information (e.g., salary) for security purposes. 1.11 View of Data A logical architecture for a database system Data independence! 1.12 Entity-Relationship Model Example of schema in the entity-relationship model 1.13 Entity Relationship Model (Cont.) E-R model of real world Entities (objects) E.g. customers, accounts, bank branch Relationships between entities E.g. Account A-101 is held by customer Johnson Relationship set depositor associates customers with accounts Widely used for database design Database design in E-R model usually converted to design in the relational model (coming up next) which is used for storage and processing 1.14 Relational Model Attributes Example of tabular data in the relational model Customer-id customername 192-83-7465 Johnson 019-28-3746 Smith 192-83-7465 Johnson 321-12-3123 Jones 019-28-3746 Smith Key customerstreet customercity accountnumber Alma Palo Alto A-101 North Rye A-215 Alma Palo Alto A-201 Main Harrison A-217 North Rye A-201 Schema = Customer (Customer-id, customer-name, customer-table, account-number) This whole table is an instance of the Customer schema. 1.15 Data Organization Data Storage (Ch 11) Where can data be stored? Main memory Secondary memory (hard disks) Optical store Tertiary store (tapes) Move data? Determined by buffer manager Mapping data to files? Determined by file manager 1.16 Data retrieval Queries Query = Declarative data retrieval describes what data, not how to retrieve it Ex. Give me the students with GPA > 3.5 vs Scan the student file and retrieve the records with gpa>3.5 Why? 1. Easier to write 2. Efficient to execute (why?) 1.17 Data retrieval Query Query Processor Plan Query Optimizer Query Evaluator Data Query Optimizer “compiler” for queries (aka “DML Compiler”) Plan ~ Assembly Language Program Optimizer Does Better With Declarative Queries: 1. Algorithmic Query (e.g., in C) 1 Plan to choose from 2. Declarative Query (e.g., in SQL) n Plans to choose from 1.18 SQL SQL: widely used (declarative) non-procedural language E.g. find the name of the customer with customer-id 192-83-7465 select customer.customer-name from customer where customer.customer-id = ‘192-83-7465’ E.g. find the balances of all accounts held by the customer with customer-id 192-83-7465 select account.balance from depositor, account where depositor.customer-id = ‘192-83-7465’ and depositor.account-number = account.account-number Procedural languages: C++, Java, relational algebra 1.19 Data Integrity Transaction processing (Ch 15, 16) Why Concurrent Access to Data must be Managed? John and Jane withdraw $50 and $100 from a common account… John: 1. get balance 2. if balance > $50 3. balance = balance - $50 4. update balance Jane: 1. get balance 2. if balance > $100 3. balance = balance - $100 4. update balance Initial balance $300. Final balance=? It depends… 1.20 Data Integrity Recovery (Ch 17) Transfer $50 from account A ($100) to account B ($200) 1. get balance for A 2. If balanceA > $50 3. balanceA = balanceA – 50 4.Update balanceA in database 5. Get balance for B System crashes…. 6. balanceB = balanceB + 50 7. Update balanceB in database Recovery management 1.21 Outline CS 405G: application-oriented How to develop database applications: User, application developer, DBA CS 505 (This course!) : system-oriented Learn the internals of a relational DBMS (developer for Oracle, better application developer, or system researcher) 1.22