Download Chapter 1: Introduction

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Big data wikipedia , lookup

Open Database Connectivity wikipedia , lookup

Microsoft Jet Database Engine wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Database wikipedia , lookup

Clusterpoint wikipedia , lookup

Relational model wikipedia , lookup

Database model wikipedia , lookup

Transcript
CS 505
Intermediate Topics in Database Systems
Prof. Tingjian Ge
with thanks to Prof. Stan Zdonik, Brown University
Prof. Sam Madden, MIT
Prof. Avi Silberschatz, Yale University
What is a Database System?
 Database:
A very large collection of related data
 Models a real world enterprise:
 Entities (e.g., teams, games / students, courses)
 Relationships (e.g., The Wildcats are playing in the NCAA)
 DBMS: A software system that can be used to store,
manage and retrieve data from databases
 Database System: DBMS+data (+ applications)
1.2
Why Study Databases??
 Shift from computation to information
 Always true for corporate computing
 More and more true in the scientific world
 and of course, the Web
 DBMS encompasses much of CS in a practical discipline
Operating systems
Languages
Distributed systems
Performance
Theory
AI
 Jobs!
High-paying jobs!!!
1.3
Why Databases??
 Why not store everything in flat files:
i.e., use the file system of the OS, cheap/simple…
Name, Course, Grade
John Smith, CS115, B
Mike Stonebraker, CS405G, A
This is how things were
in the “Bad Old Days”
Jim Gray, CS505, A
John Smith, CS315, B+
…………………
Yes, but not scalable…
1.4
Problem 1
 Data redundancy and inconsistency
 Multiple file formats,
 duplication of information in different files
Name, Course, Email, Grade
John Smith, [email protected], CS112, B
Jim Gray, CS560, [email protected], A
John Smith, CS560, [email protected], B+
Name, Email, Course, Grade
Mike Stonebraker, [email protected], CS234, A
J. Smith, [email protected], CS560, B+
Why is this a problem?
 Wasted space
 Potential inconsistencies
(e.g., multiple formats, John Smith vs Smith J.)
1.5
Problem 2
 Data retrieval:
 Find the students who took CS505
 Find the students with GPA > 3.5
For every query we need to write a program!
 We need the retrieval to be:
 Easy to write
 Execute efficiently
1.6
Problem 3
 Data Integrity
 No support for sharing:
 Prevent simultaneous modifications
 No coping mechanisms for system crashes
 No means of Preventing Data Entry Errors (checks must be hard-coded
in the programs)
 Security problems
 Database systems offer solutions to all the above problems
1.7
Problem 4
 Long-lived data  Evolution
 What happens if I need to change my mind about how
the data is stored?
 Access patterns change
 Tuning
 Don’t want to have to re-write all my applications.
 Solution: Data independence!
1.8
Data Organization
 Data Models: a framework for describing




data objects
data relationships
data semantics
data constraints
 Presents primitives for
 Representing Data  Data Definition Language (DDL)
 Manipulating Data  Data Manipulation Language (DML)
 Entity-Relationship model
 We will concentrate on Relational model
 Other models:
 object-oriented model
 semi-structured data models, XML
1.9
Database Schema
 Similar to types and variables in programming languages
 Schema – the structure of the database
 e.g., the database consists of information about a set of
customers and accounts and the relationship between them)
 Expressed in some data model
 Occurs at multiple levels
 Logical schema: database design at the logical level
 Physical schema: database design at the physical level
1.10
Data Organization
 Two levels of data modeling
 Logical level: describes data stored in database, and the
relationships among the data.
type customer = record
name : string;
street : string;
city : integer;
end;
 Physical level: describes how a record (e.g., customer) is
stored.
 Also, View level: application programs hide details of data types.
Views can also hide information (e.g., salary) for security
purposes.
1.11
View of Data
A logical architecture for a database system
Data independence!
1.12
Entity-Relationship Model
Example of schema in the entity-relationship model
1.13
Entity Relationship Model (Cont.)
 E-R model of real world
 Entities (objects)
 E.g. customers, accounts, bank branch
 Relationships between entities
 E.g. Account A-101 is held by customer Johnson
 Relationship set depositor associates customers with accounts
 Widely used for database design
 Database design in E-R model usually converted to design in the
relational model (coming up next) which is used for storage and
processing
1.14
Relational Model
Attributes
 Example of tabular data in the relational model
Customer-id
customername
192-83-7465
Johnson
019-28-3746
Smith
192-83-7465
Johnson
321-12-3123
Jones
019-28-3746
Smith
Key
customerstreet
customercity
accountnumber
Alma
Palo Alto
A-101
North
Rye
A-215
Alma
Palo Alto
A-201
Main
Harrison
A-217
North
Rye
A-201
Schema = Customer (Customer-id, customer-name,
customer-table, account-number)
This whole table is an instance of the Customer schema.
1.15
Data Organization
 Data Storage (Ch 11)
Where can data be stored?
 Main memory
 Secondary memory (hard disks)
 Optical store
 Tertiary store (tapes)
 Move data? Determined by buffer manager
 Mapping data to files? Determined by file manager
1.16
Data retrieval
 Queries
Query = Declarative data retrieval
describes what data, not how to retrieve it
Ex. Give me the students with GPA > 3.5
vs
Scan the student file and retrieve the records with gpa>3.5
 Why?
1. Easier to write
2. Efficient to execute (why?)
1.17
Data retrieval
Query
Query Processor
Plan
Query Optimizer
Query Evaluator
Data
 Query Optimizer
“compiler” for queries (aka “DML Compiler”)
Plan ~ Assembly Language Program
Optimizer Does Better With Declarative Queries:
1. Algorithmic Query (e.g., in C) 1 Plan to choose from
2. Declarative Query (e.g., in SQL) n Plans to choose from
1.18
SQL
 SQL: widely used (declarative) non-procedural language
 E.g. find the name of the customer with customer-id 192-83-7465
select customer.customer-name
from customer
where customer.customer-id = ‘192-83-7465’
 E.g. find the balances of all accounts held by the customer with
customer-id 192-83-7465
select account.balance
from depositor, account
where depositor.customer-id = ‘192-83-7465’ and
depositor.account-number = account.account-number
 Procedural languages: C++, Java, relational algebra
1.19
Data Integrity
Transaction processing (Ch 15, 16)
 Why Concurrent Access to Data must be Managed?
John and Jane withdraw $50 and $100 from a common
account…
John:
1. get balance
2. if balance > $50
3. balance = balance - $50
4. update balance
Jane:
1. get balance
2. if balance > $100
3. balance = balance - $100
4. update balance
Initial balance $300. Final balance=?
It depends…
1.20
Data Integrity
Recovery (Ch 17)
Transfer $50 from account A ($100) to account B ($200)
1. get balance for A
2. If balanceA > $50
3. balanceA = balanceA – 50
4.Update balanceA in database
5. Get balance for B
System crashes….
6. balanceB = balanceB + 50
7. Update balanceB in database
Recovery management
1.21
Outline
 CS 405G: application-oriented
 How to develop database applications: User, application developer,
DBA
 CS 505 (This course!) : system-oriented
 Learn the internals of a relational DBMS (developer for Oracle,
better application developer, or system researcher)
1.22