Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CS6482 Topics on Data Engineering Qing Li (E-mail: [email protected]) Dept of Computer Science City University of Hong Kong 2009 Qing Li Course Overview Course Format: Tutorial classes and exercises which provide students with supervised problem-solving exercises Class on Wednesday in Y4701: 6:30 - 7:20pm (tutorials only) Regular lectures, each lecturing session is about two-hour Classes on Wednesday in Y4701: 7:30 - 9:20pm (lectures only) 2009 Qing Li Suggested Assessment Continuous assessment -- 70% : Term project -- 35% Midterm quiz -- 25% tutorial exercises -- 10% Final examination -- 30% 2009 Qing Li Course Materials Reference books R. Elmasri and S. Navathe, Fundamental of Database Systems, 5th Edition (or later), Addison-Wesley. M.T. Ozsu and P. Valduriez, Principles of Distributed Database Systems, 2nd Edition, Prentice-Hall. M. Stonebraker and J.M. Hellerstein, Readings in Database Systems, 3rd Edition (or later), Morgan Kaufmann. Literature selected papers from research journals, surveys, conf. proceedngs, and collection of readings 2009 Qing Li DB Systems: an Overview Motivations Information about a particular enterprise File-processing Systems permanent records stored in various files application programs written to extract & add records Disadvantages data redundancy & inconsistency difficulty in accessing data data isolation & different data formats concurrent access anomalies security problem integrity problem 2009 Qing Li DB Systems: an Overview What is a Database (DB)? A non-redundant, persistent collection of logically related records/files that are structured to support various processing and retrieval needs Database Management System (DBMS) A set of software programs for creating, storing, updating, and accessing the data of a DB. Software interface DB DBMS 2009 Qing Li DB Systems: an Overview Difference between systems DBMS & other programming the ability to manage persistent data primary goal of DBMS: to provide an environment that is convenient, efficient, and robust to use in retrieving & storing data Other DBMS capabilities data modeling high-level languages to define, access and manipulate data transaction managent & concurrency control access control resiliency (recovery) 2009 Qing Li DB Systems: an Overview Data Abstraction Abstract view of the data simplify interaction with the system hide details of how data is stored and manipulated Levels of abstraction (“ANSI/SPARC 3 level architecture) physical/internal level: data structures; how data are actually stored conceptual level: schema, what data are actually stored view/external level: partial schema 2009 Qing Li Data Abstracion: 3-level architecture 2009 Qing Li Data Models What is a data model? A data model is a collection of conceptual tools for describing data, data relationships, operations, data semantics and consistency constraints the “core” of a database Catagories of data models Object-based logical models (conceptual & view levels) the Entity-Relationship (ER) model -- mid 70’s the Object-Oriented data models -- late 80’s the Semantic Data Models -- early/mid 80’s Record-bsaed logical models (conceptual & view levels) the Relational model -- early 70’s the Network and Hierarchical models -- 60’s 2009 Qing Li Data Models Catagories of data models (cont’d) Physical data models (internal level) Unifying model Frame memory model (these will NOT be studied in this course.) Basic Concepts and Terminologies instance - the collection of data (information) stored in the DB at a particular moment (ie, a snapshot) scheme/schema - the overall structure (design) of the DB -- relatively static 2009 Qing Li Data Models Basic Concepts and Terminologies (cont’d) Data Independence - the ability to modify a schema definition in one level without affecting a schema in the next higher level - there are two kinds (a result of the 3-level architecture): physical data independence -- the ability to modify the physical schema without altering the conceptual schema and thus, without causing the application programs to be rewritten logical data independence -- the ability to modify the conceptual schema without causing the application programs to be rewritten 2009 Qing Li Data Models Basic Concepts and Terminologies (cont’d) Data Definition Language (DDL) - a language for defining DB schema - DDL statements compile to a data dictionary which is a file containing metadata (data about data), eg, descriptions about the tables Data Manipulation Language (DML) - a language that enables users to access and manipulate data as organised by appropriate data model - an important subset for retrieving data is called Query Language - two types of DML: procedural (specify “what” & “how”) vs. declarative (just specify “what”) 2009 Qing Li Data Models Basic Concepts and Terminologies (cont’d) Database Administrator (DBA) - DBA is the person who has central control over the DB - Main functions of DBA: schema definition storage structure and access method definition schema and physical organization modification granting of authorization for data access integrity constraint specification 2009 Qing Li Data Models Basic Concepts and Terminologies (cont’d) Database Users - Application Programmers embedded DML in a host language fourth-generation languages (4GL) - Interactive Users: query language - Specialized Users: non-traditional applications -Naive Users: running application programs 2009 Qing Li “Reference” DB System Architecture Naïve user Interactive user Appl. Prog’er Application interfaces Application programs DML compiler Application programs object code DBA (SQL) query Query processor DB schema DDL compiler Database manager DBMS File manager Data files 2009 Qing Li disk storage Data dict. DB DB Concepts and Architecture 2009 Qing Li “Reference” System Architecture File Manager allocation of space operations on files DB Manager interface between stored data and application programs/queries translate conceptual level commands into physical level ones responsible for access control concurrency control backup & recovery integrity 2009 Qing Li “Reference” System Architecture Query Processor translate high-level queries into low-level instructions query optimization DML (Pre)compiler translates DML statements embedded in application program into procedure calls DDL (Pre)compiler converts DDL statements to data dictionary items (eg, table descriptions) 2009 Qing Li DB Concepts and Architecture DB System Environment (cont’d) DB System Utilities loading back up file re-organization report generation data dictionary … NEXT: Classification of DBMSs! 2009 Qing Li Classification of DBMSs Criteria: Data/Database Model Number of Users single-user (eg, PC databases) multi-user (concurrency control) Number of sites centralized (logically, physically) decentralized (logically, physically) homogeneity vs. heterogeneity Other Criterion: cost general-purpose vs. specialized DBMSs, ... 2009 Qing Li Classification of DBMSs Classification based on Data Model Hierarchical (late 60’s) Network (late 60’s) Relational (70’s) Entity-Relationship (ER) Semantic (80’s) Functional Object-Oriented (late 80’s/early 90’s) “Intelligent” logic-based/deductive expert/knowledge-based hypermedia, ... 2009 Qing Li The Entity-Relationship Model Preliminaries Proposed by P. Chen in 1976 One of the earliest “semantic” database model Mainly a design tool for record-based (ie, hierarchical, network, relational) databases Modeling Constructs Entity -- a distinguishable object with an independent existence Example: John Chan, CityU, HK Bank, … Entity Set -- a set of entities of the same type Example: Student, Employee, Customers, ... 2009 Qing Li The Entity-Relationship Model Modeling Constructs (cont’d) Attribute (Property) -- a piece of information describing an entity Example: Name, ID, Address, DoB are attributes of a student entity Each attribute can take a value from a domain Example: Name Character String, ID Integer, ... Formally, an attribute A is a function which maps from an entity set E into a domain D: A: E D 2009 Qing Li The Entity-Relationship Model Modeling Constructs (cont’d) Relationship -- an association among several entities Example: Patrick and Eva are friends Patrick is taking cs3450 Relationship Set -- a set of relationships of the same type Example: taking John cs3450 mary cs2578 may ee4532 Formally, a relationship R is a subset of: { (e1, e2, …, ek) | e1 E1, e2 E2, …, ek Ek) } 2009 Qing Li The Entity-Relationship Model Modeling Constructs (cont’d) Relationship vs. Attribute an attribute A: E D is a “simplified” form of a relationship: If we allow D to be an Entity Set, then A becomes a relationship a relationship can carry attributes properties of the relationship Example: Patrick takes cs2450 with a grade of B+ Supplier S supplies item T with a price of P 2009 Qing Li The Entity-Relationship Model Modeling Constructs (cont’d) Entity Set vs. Attribute What constitutes an attribute, and what constitutes an entity set? Example: Employee and Phone 1) employee entity set with attribute phone# 2) empPhn relationship set with entity sets employee and phone# No simple answer, depending on - what we want to model - meaning of attributes 2009 Qing Li The Entity-Relationship Model Integrity Constraints Mapping Cardinalities One - to - One (1:1) a 1 b 2 c 3 One - to - Many (1:M) / Many - to - One (N:1) a b Many - to - Many (M:N) ?? 2009 Qing Li c 1 2 The Entity-Relationship Model 2009 Qing Li The Entity-Relationship Model Integrity Constraints (cont’d) Keys: to distinguish individual entities or relationships Insertion/Deletion Constraints: => “strong” vs. “weak” entities ER Diagram rectangle: Entity Set diamond: Relationship Set ellipse: Attribute others (such as double rectangle for “weak entity set”, double ellipses for “multi-valued attribute, underlined attribute for key,…) 2009 Qing Li SUMMARY OF ER-DIAGRAM NOTATION Symbol Meaning ENTITY TYPE WEAK ENTITY TYPE RELATIONSHIP TYPE IDENTIFYING RELATIONSHIP TYPE ATTRIBUTE KEY ATTRIBUTE MULTIVALUED ATTRIBUTE COMPOSITE ATTRIBUTE DERIVED ATTRIBUTE E1 E1 R R 2009 Qing Li E2 R N (min,max) TOTAL PARTICIPATION OF E2 IN R E2 CARDINALITY RATIO 1:N FOR E1:E2 IN R E STRUCTURAL CONSTRAINT (min, max) ON PARTICIPATION OF E IN R The Entity-Relationship Model Integrity Constraints (cont’d) Keys: to distinguish individual entities or relationships superkey -- a set of one or more attributes which, taken together, identify uniquely an entity in an entity set Example: {student ID, Name} identify a student candidate key -- minimal set of attributes which allow to identify uniquely an entity in an entity set a superkey for which no proper subset is a superkey Example: student ID identify a student, but Name is not a candidate key (WHY?) primary key -- a candidate key chose by the DB designer to identify an entity in an entity set 2009 Qing Li The Entity-Relationship Model ER Diagram Rectangles: Ellipses: Diamonds: Lines: Entity Sets Attributes Relationship Sets Attributes to Entity/Relationship Sets or, Entity Sets to Relationship Sets m m 1 2009 Qing Li R R R n 1 1 The Entity-Relationship Model Weak Entity Set an entity set that does NOT have enough attributes to form a primary/candidate key Acct. no account trans. no balance log Role Indicators Emp. name transaction Multi-value attri. Phone# manager employee 2009 Qing Li date worker Works-for amount ER DIAGRAM FOR A BANK DATABASE 2009 Qing Li ER DIAGRAM WITH ROLE NAMES AND MINI-MAX CONSTRAINTS 2009 Qing Li The Entity-Relationship Model Transformation of ER diagram to Record-based schema Standard transformation algorithms are available Mapping from ER to relational and network schemas are straightforward Mapping from ER to hierarchical schema is relatively harder Eg., for the Many - to - Many (M:N) relationships ER Data Abstractions Aggregation (limited form) Association (Yes) Classification (Yes) Recursion (Yes) 2009 Qing Li The Entity-Relationship Model Summary The ER Model is the 1st “semantic” model centered around relationships, not attributes It combines successfully the best features of the network and relational models simple and easy to understand 2009 The original model falls short of supporting more complex applications Recent “Trend” on ER: building ER database systems / interfaces applications of ER approaches extending the original ER to capture more “semantics” => Extended ER (EER) Models Qing Li