Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Database Systems 236363 Introduction Bureaucratic Info (1/2) • Lecturer: Prof. Roy Friedman – Office: Taub 605 – E-mail: [email protected] • TA in Charge: Omer Katz – Office: Taub 603 – E-mail: [email protected] • TA: Ella Bolshinski – Office: Taub 315 – E-mail: [email protected] Bureaucratic Info (2/2) • Web site: http://webcourse.cs.technion.ac.il/236363/ • Home assignments: 3 dry, 1 wet • Grade structure: – 80% final exam, 20% home assignments (takef) – 8% for the wet assignment, 4% each dry one – Home assignments are mandatory – no final grade without a pass grade in each home assignment • Those who repeat the course must submit all home assignments Topics • • • • • Introduction Entity Relationship Diagrams (ERD) Relational Algebra Relational Calculus Database Design Theory – Functional Dependencies – Schemas Decomposition and Normal Forms • XML and the Query Language Xpath • Advanced topics or datalog – time permitting • In the recitation: The SQL Query Language What’s New This Semester • The lecture slides are in English – Based on the slides of Eldar Fischer and Johann Makowsky • Some change in material – We will not study Xquery – We might not study Datalog • Wet home assignment might be in Java – To be determined next week Let the Fun Begin… • Database – A (persistent) collection of data – Often with some logical structure – Examples include, e.g., bank accounts, students listed to courses and their grades, geographical data used by a map/navigation service, customers of an online web-site, dentist’s patients • A Query Language – Enables querying and manipulating the database – Examples include, e.g., Structured Query Language (SQL), Datalog, Cassandra Query Language (CQL) • Database Management System (DBMS) – The system that manages the database and supports the execution of queries on the database Why Do We Need a DBMS? • After all, we have operating systems and file systems… • DBMS provides a data oriented abstraction for manipulating the data • Enables direct manipulation of the data without worrying about storage and execution issues – Frees the programmer from worrying about many low level details such as: • Serializing and de-serializing the data to the storage • Organizing the data in the storage and masking storage latencies and inefficiencies • How to enable concurrent access to the database? • What to do when the database is larger than physical memory? • Split/cached database operation in combined mobile/cloud Data Model • The data model defines the framework for how data is represented – For example, in the relational model, data is represented as tables (relations) • The query language enables extracting data from the database according to the given data model – For example, SQL can express queries on tables and the results are tables as well DBMS Functionality • Storage management • Query processor – Optimizes query processing for efficient information retrieval and query execution • Concurrency control and recovery • Data integrity – E.g., that an ID number is a unique 9 digit number • Security – Access control, authentication and encryption Data Representation Independence • To obtain independence between the data model and the physical storage, we separate between three levels – User view, logical level, and physical level View Data is organized here according to the data model (relational) Independence of logical layout beyond this level Logical Physical Independence of physical layout beyond this level Upper Database Layer • In the end-user layer, each user is provided with a (potentially partial) view that may be different from the actual data layout Operations on a Database • Database structure definition (following an analysis of the application’s needs): Includes the logical structures for representing the data and their relations – Data Definition Language (DDL) • Query execution to retrieve data from the database • Data manipulation: adding, deleting, and updating – Data Manipulation Language (DML) • Administrative operations: – Defining views, indexes, etc. Database Administrator (DBA) • Responsible for – Planning the logical database layout and adapting it to the physical layer – Security and access control – Recovery management (after failures) – Performance fine tuning Data Models (1/2) • Relational data model – Data is represented using tables – Correspondence between tables is obtained by using same values in columns with the same name – The main focus of this course. • Entity Relation – A tool for analyzing the requirements of a database and designing its schema – This model is an abstract one and has no actual direct implementation Data Models (2/2) • Object Oriented data model – A model in which the data is represented as objects similarly to what is done in OOP • ERD can be mapped to OO • Semi-structured data model – Data is represented as a graph (independent of the physical layout) • Extensible Markup Language (XML) – A specific instance of the semi-structured model in which the graph is a tree The Relational Data Model • We wish to represent a collection of objects of a given type, each characterized by a fixed set of properties – For example, student’s name, date of birth, department • How would we do this in C or Java? • In the relational data model, we maintain these objects in a table – Each row is used to store one object – Each column represents one of the properties Student Name Date of Birth Department John Doe 01/01/1990 Computer Science Jane Roe 07/07/1992 Electrical Engineering … … … • The table is a logical structure – Might be physically stored in a completely different manner – Each row must be different than each other row in at least one attribute • The table represents a set rather than a multi-set Relational Model: Terminology Schema (title of table) An attribute (column name) Student Name Date of Birth Department John Doe 01/01/1990 Computer Science Jane Roe 07/07/1992 Electrical Engineering … … … A relation (the entire table) A record (an entire row) A Formal Definition • For a given set of attributes A1,…,An and a set of corresponding domains D1,…,Dn (each domain is a set of values) • Denote R(A1,…,An) the relational schema that contains the attributes A1,…,An • A relation r over R is a subset of the domain product r ⊆ D1xD2x…xDn The Formal Definition - Visualized • Instead of viewing each record as a line in a table, it is viewed as a point in the space of all possible value assignments. The relation is a finite subset of “all possible records”. Name John Doe, 01/01/1990 Jane Roe, 07/07/1992 Birth Date Keys • A superkey of a relation r is a subset of attributes of r’s schema such that specific values of these attributes identify a single record in r. In other words, there are no two records in r whose values in all attributes of the superkey are the same. • A relation may have several superkeys. A superkey is called minimal if none of its subsets is a superkey. Such a key is also called a candidate key. • One of the superkeys can be selected as the primary key. The primary key is used to identify a row in the implementation of the database. Yet, in PostgreSQL, when no key is defined, a table can include multiple records with the same values in all attributes Simple Databases • In the simplest case, all objects of interest are of the same “kind”, meaning that they all have the same attributes list – they are only distinguished by their specific attributes values • For example, the list of songs on my computer – each such object is characterized by the name of the song, the format (mp3/wma/…), playing time, and size • In these cases, all objects are organized in a single table More Involved Databases • Suppose we wish to design a database for the faculty’s administrative assistants. Here, we can identify at least two types of entities: – Students: student name, id, address – Courses: course name, catalogue number, lecturer • If a student is registered to a given course, we should be able to know about it and be able to retrieve the student’s final grade – Hence, each student’s participation in a given course should be recorded somewhere • The question is how to organize the database for this? Possible Organization • A simple option: Have a single large table – for each student’s registration, we will hold the student’s name, id, address, course name, catalogue number, lecturer, and final grade • Drawbacks: – Redundancy: Why should the student’s address be stored in each course she takes? – Inadequacy: How can we maintain the details of a student that does not take any course? – Difficult to update: If a student changes his address, we will need to update all records of all courses he is registered to. This is both expensive and a source of inconsistency Another Option • We can maintain one table for students, one for courses, and one for registration • The registration table schema can include a primary key for the students (e.g., student id) and a primary key for the courses (e.g., catalogue number) as well as the final grade • Now each student’s data is independent of each course’s data and vice versa Another Option Student Name Student ID Address Jane Roe 12345678 Technion City Course Name Catalogue Number Lecturer Database Systems 236363 Rony Superstar Catalogue Number Student ID Final Grade 236363 12345678 95 Students … Courses … … Registrations Life is Full of Difficult Choices • What if we wish to retrieve the names of all lecturers who taught a given student? • If in the registrations table we only maintain the course’ catalogue number, this query will require a long time to compute • Further, if the lecturers change over time, the reply will not be accurate – How do we fix this? • Should we add a lecturer attribute to the registrations table? • Should we define a new “lecturers” table and corresponding relations? • The database organizational design choices are not trivial – this is the subject of much of this course