Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CSCI485 – File & Database Management Systems Bahram Zartoshty Office: SAL 346 Phone: TBA Office Hours: TTH 1:15-2:50pm Note: Parts of this lecture were developed by Professor Ghandeharizadeh 1 Logistics Required text book: Database System Concepts, Silberschatz, Korth & Sudarshan, Fifth edition. Pre-req for the course: CS201: Data Structures Knowledge of an object-oriented programming language such as C++, Java, C# Teaching Assistant Shahin Shayande Office: (Microsoft Lab)SAL 200C Office Hours: TBA 3 Grading Midterm 1: 35% Midterm 2: 35% Project & Assignments: 30% 4 What to do immediately? Register with the web site http://dblab.usc.edu/csci485 5 Database Management System (DBMS) Database: An integrated collection of data, usually stored on secondary storage, typically describing the activities of one or more related organizations. Database management system (DBMS): A collection of software/programs designed to assist in maintaining and utilizing large collections of data. DBMS contains information about a particular enterprise Used almost on a daily basis for either individual or business use. Relational database vendors were one of the fastest growing sectors during the .COM boom! 6 BEFORE DBMS In the early days, database applications were built on top of file systems User 1 User 2 Application programs Application programs 7 Data Data AFTER DBMS User 1 Application programs User 2 Application programs DBMS 8 Data managed by DBMS WHY A DBMS? 1. 2. 3. 4. 5. 6. 7. 8. Reduced application development time Data independence: Application programs not dependent on data representation and storage details Data sharing: data is better utilized (discovered and reused), redundancy of data is minimized Data integrity and consistency: one may enforce consistency constraints on data, e.g., number of seats sold ≤ number of seats on the plane × 1.1 Centralized control: DBA tunes the database to balance user's needs Security: mechanisms to prevent unauthorized access. These mechanisms are based on content instead of fileoriented approach. Concurrency control: avoids undesirable race conditions that arise with simultaneous access/updates to data Crash recovery: ensures the integrity of data in the presence of failures 9 DATABASE MANAGEMENT SYSTEMS ARCHITECTURE User 1 DBMS User n DB Physical data Conceptual schema 10 Data Models A collection of tools for describing Data Data relationships Data semantics Data constraints Relational model Entity-Relationship data model (mainly for database design) Object-based data models (Object-oriented and Objectrelational) Semistructured data model (XML) Other older models: Network model Hierarchical model 11 Challenges Conceptual Logical Physical Abstraction, Inheritance, Encapsulation Reduction to tables with minimal: data duplication, potential for data loss and update anomalies Effective use of a DBMS, management of mismatch between tables and OO constructs, Index structures, CC & Crash recovery, Optimization techniques Conceptual Data Models Entity-Relationship (ER) data model Entities, Attributes, Relationships SS# name address Emp Conceptual Data Models Entity-Relationship (ER) data model Entities, Attributes, Relationships Co-Pay SS# name address Emp Enrolled in Health Plan name Conceptual Data Models Entity-Relationship (ER) data model Entities, Attributes, Relationships Recursive relationships SS# name address Emp Married to Conceptual Data Models Entity-Relationship (ER) data model Entities, Attributes, Relationships Recursive relationships SS# name address Emp Works for Conceptual Data Models Entity-Relationship (ER) data model Entities, Attributes, Relationships Recursive relationships SS# name address Emp Works for date Conceptual Data Models Entity-Relationship (ER) data model Entities, Attributes, Relationships Recursive relationships Inheritance student sid name Generalization Undergrad ISA Specialization graduate Conceptual Data Models Abstraction, Inheritance, Encapsulation Exercise these concepts using in-class examples and homework assignments A library database contains a listing of authors who have written books on various subjects (one author per book). It also contains information about libraries that carry books on various subjects. Conceptual Data Models Abstraction, Inheritance, Encapsulation Exercise these concepts using in-class examples and homework assignments A library database contains a listing of authors who have written books on various subjects (one author per book). It also contains information about libraries that carry books on various subjects. Entity sets: authors, subjects, books, libraries Relationship sets: wrote, carry, indexed Conceptual Data Models Abstraction, Inheritance, Encapsulation Exercise these concepts using in-class examples and homework assignments A library database contains a listing of authors who have written books on various subjects (one author per book). It also contains information about libraries that carry books on various subjects. title Subject matter isbn SS# authors wrote books libraries carry name address index subject Data Models SS# name Emp address Works for Logical Physical Relational Data Model Prevalent in today’s market place. Why? Performance! Everything is a table! Logical data design is the process of reducing an ER diagram to a collection of tables. Logical Data Design Trivial reduction: An entity set = a table A relationship set = a table Pitfalls: Duplication of data Unintentional loss of data Data ambiguity that impacts software design, resulting in update anomalies Data Duplication SS# Emp name Works for address SS# Name Address SS# 396 Shahram Seattle 396 400 400 Asoke Chicago 200 400 120 400 200 Joe New York MGR SS# Data Duplication SS# Emp name Works for address SS# Name Address SS# 396 Shahram Seattle 396 400 400 Asoke Chicago 200 400 120 400 200 Joe MGR SS# New York The SS# column is duplicated! Data Duplication: Solution Merge the two tables into one: SS# Emp name Works for address SS# Name Address MGR SS# 396 Shahram Seattle 400 400 Asoke Chicago NULL 200 Joe New York 400 Data Loss Ford maintains warehouses containing different automobile parts Part# Description Location 123 Piston Tijuana 203 Cylinder Michigan 877 Bumper Michigan 389 Seats Arizona Records are inserted and deleted based on availability of a part at a warehouse Data Loss (Cont…) When a warehouse becomes empty, it is lost from the database: Part# Description Location 123 Piston Tijuana 389 Seats Arizona Solution: utilize two different tables Part# Description WHID WHID Location 123 Piston 12 12 Tijuana 389 Seats 45 45 Arizona Data Ambiguity Represent faculty of a department as: Faculty Department Location Ghandeharizadeh Comp Sci SAL Zartoshty Comp Sci SAL Bohem Comp Sci SAL A change of address for a faculty might be for the entire department. This cannot be differentiated with this table design! Data Ambiguity Utilize two tables: Faculty Department Department Location Ghandeharizadeh Zartoshty Jenkins Bohem Comp Sci Comp Sci Bio Medical Comp Sci Comp Sci SAL Sex Ed BOVARD Bio Medical HEDCO Data Ambiguity (Cont…) Employees of a bi-lingual company having different skills. Employee Skill Language Asoke Teach Hindi Asoke Cook French Asoke Null German Asoke Program English Update anomalies! Data Ambiguity: Solution Utilize two tables: Employee Employee Language Asoke Hindi Asoke French Asoke German Asoke English Skill Asoke Teach Asoke Cook Asoke Program Logical Data Design A quest to flatten objects with minimal data duplication, loss of data, and update anomalies! William Kent, “A Simple Guide to Five Normal Forms in Relational Database Theory”, Communications of the ACM 26(2), Feb 1983, 120-125. Data Models SS# name Emp address Works for Logical Data Design SS# Name Address MGR SS# 396 Shahram Seattle 400 400 Asoke Chicago Null Physical Physical Implementation Reconstruct main memory objects for manipulation and presentation: Specify class definitions Typically correspond to entity-sets Populate an instance of a class by issuing SQL queries to a DBMS Update instances in memory Flush dirty instances back to DBMS Potential use of transactions Type Mismatch A column of a row must be a primitive such as an integer, real, etc. It may NOT be an array of integers or object pointers A property (attribute) of a class might be of a multi-valued type, e.g., an array, a vector, etc. Changes in software may impact the design of tables. (Management of type mismatch by the system designer.) Implementation Set operators in the DBMS Does set A contain set B? Does value v1 appear in set A? Aggregates in the DBMS Compute average employee salary Count the number of employees Find the oldest employee Challenges Conceptual Logical Physical Abstraction, Inheritance, Encapsulation Reduction to tables with minimal: data duplication, potential for data loss and update anomalies Effective use of a DBMS, management of mismatch between tables and OO constructs, Index structures, CC & Crash recovery, Optimization techniques Entity-Relationship Model Example of schema in the entity-relationship model 40 Entity Relationship Model (Cont.) E-R model of real world Entities (objects) E.g. customers, accounts, bank branch Relationships between entities E.g. Account A-101 is held by customer Johnson Relationship set depositor associates customers with accounts Widely used for database design Database design in E-R model usually converted to design in the relational model (coming up next) which is used for storage and processing 41 Relational Model Example of tabular data in the relational model 42 Attributes A Sample Relational Database 43 DATA INDEPENDENCE 1. 2. Physical data independence: modify the physical scheme (data structures, e.g., Btree or hash index) without causing application programs to be rewritten. These modifications are necessary to enhance performance and new software releases. Most relational vendors support this kind of data independence. Logical data independence: Modify the conceptual scheme (e.g., add a new attribute to a table, rename an attribute) without causing application programs to be rewritten. This kind of data independence is harder to achieve. 44 DATABASE LANGUAGES There are several languages associated with a database: Data Definition Language (DDL): The database scheme is specified by a set of definitions that are expressed by a special language named DDL. The result of compiling DDL statements is a set of tables stored in a file called data dictionary. This file contains meta-data (data about the data stored in the database). Data Manipulation Language (DML): a language that enables users to access or manipulate data (retrieve, insert, replace, delete) as organized by a certain data model. We will look at a commercial DML named SQL. In general, there are two types of DML: 1. 2. • • Procedural: Describes what data is needed and how to get it: e.g., relational algebra Non-procedural: Describes what data is needed without specifying how to get it: e.g., tuple relational calculus 45 SQL SQL: widely used non-procedural language E.g. find the name of the customer with customer-id 192-83-7465 select customer.customer-name from customer where customer.customer-id = ‘192-83-7465’ E.g. find the balances of all accounts held by the customer with customer-id 192-83-7465 select account.balance from depositor, account where depositor.customer-id = ‘192-83-7465’ and depositor.account-number = account.account-number Application programs generally access databases through one of Language extensions to allow embedded SQL Application program interface (e.g. ODBC/JDBC) which allow SQL queries to be sent to a database 46 Storage Management Storage manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system. The storage manager is responsible to the following tasks: Interaction with the file manager Efficient storing, retrieving and updating of data Issues: Storage access File organization Indexing and hashing 47 Query Processing 1. Parsing and translation 2. Optimization 3. Evaluation 48 Query Processing (Cont.) Alternative ways of evaluating a given query Equivalent expressions Different algorithms for each operation Cost difference between a good and a bad way of evaluating a query can be enormous Need to estimate the cost of operations Depends critically on statistical information about relations which the database must maintain Need to estimate statistics for intermediate results to compute cost of complex expressions 49 Transaction Management A transaction is a collection of operations that performs a single logical function in a database application Transaction-management component ensures that the database remains in a consistent (correct) state despite system failures (e.g., power failures and operating system crashes) and transaction failures. Concurrency-control manager controls the interaction among the concurrent transactions, to ensure the consistency of the database. 50 SYSTEM USERS There are several kind of users associated with a system: Database administrator: defines schemas, storage structures and access method definitions, physical organization, authorization, integrity constraints. Application programmers: they write a program and make it available to the endusers Sophisticated users: they use a query language (SQL) to access the database interactively Naive (end) users: they invoke the application programs 51 Overall System Structure 52