CENG 351
Introduction to Data Management
and File Structures
Nihan Kesim Çiçekli
Department of Computer Engineering
METU
CENG 351
• Instructor: Nihan Kesim Çiçekli
• Office: A308
• Email: [email protected]
• Lecture Hours:
Section 2: Wed. 10:40, 11:40 (BMB3); Thu. (BMB2)
• Course Web page: http://cow.ceng.metu.edu.tr
• Teaching Assistants:
– Emre Işıklıgil, Office: A402, [email protected]
– Burçak Otlu Sarıtaş, Office: B211, [email protected]
– Abdullah Doğan, Office: A206, [email protected]
References
• R. Ramakrishnan, J. Gehrke, Database Management Systems, 3rd ed., McGraw Hill, 2003 (textbook).
• R. Elmasri, S.B. Navathe, Fundamentals of Database Systems, 4th ed., Addison-Wesley, 2004.
• B. J. Salzberg, File Structures: An Analytic Approach, Prentice Hall, 1988.
• Michael J. Folk, B. Zoellick, File Structures, 2nd ed., Addison-Wesley Longman Ltd., 1991.
Course Outline
1. Introduction to relational database systems
2. Relational Model and E/R Modeling
3. Relational Algebra, Relational Calculus
4. Structured Query Language (SQL)
5. Secondary Storage Media
6. Sequential File Processing
7. External Sorting of Large Files
8. Indexing: Multilevel Indexing and B+ trees
9. Hashing (static, linear, extendible hashing)
10. SQL Query Evaluation and optimization issues
Grading
• Attendance 0% (50% attendance is mandatory for the final exam)
• Assignments 30% (written assignments 2.5% each; 2 programming assignments 10% each)
• Midterm Exam 1 20% (tentative date: Nov. 14, 2014)
• Midterm Exam 2 20% (tentative date: Dec. 19, 2014)
• Final 30%
• A student can take the final exam if and only if (i) she/he attends at least 1/2 of the attendance checks, and (ii) the weighted average of her/his assignments is at least 30 points. Otherwise, the student is not allowed to take the final exam and hence will get "NA".
Grading Policies
• Policy on missed midterm:
– you must inform the instructor BEFORE the exam
• Lateness policy:
– Every student has a total of 7 late days for the programming assignments. This credit can be spent on a single assignment or distributed across several.
• All assignments and programs are to be your own work. No group projects or assignments are allowed.
Motivation
• Most computers are used for data processing. A big growth area in the "information age".
• This course covers data processing from a computer science perspective:
– Storage of data
– Organization of data
– Access to data
– Processing of data
Data Structures vs File Structures
• Both involve:
– Representation of data
– Operations for accessing data
• Difference:
– Data structures: deal with data in main memory
– File structures: deal with data in secondary storage
Where do File Structures fit in Computer Science?
Layers, from top to bottom:
Application
DBMS
File system
Operating System
Hardware
Computer Architecture
Main Memory (RAM): data is manipulated here
- Semiconductors
- Fast, expensive, volatile, small
(data transfer between the two)
Secondary Storage: data is stored here
- Disks
- Slow, cheap, stable, large
Advantages
• Main memory is fast
• Secondary storage is big (because it is cheap)
• Secondary storage is stable (non-volatile), i.e. data is not lost during power failures
Disadvantages
• Main memory is small. Many databases are too large to fit in main memory (MM).
• Main memory is volatile, i.e. data is lost during power failures.
• Secondary storage is slow (10,000 times slower than MM)
How fast is main memory?
• Typical time for getting info from:
Main memory: ~120 nanosec = 120 x 10^-9 sec
Magnetic disks: ~30 millisec = 30 x 10^-3 sec
• An analogy keeping the same time proportion as above:
Looking at the index of a book: 20 sec
versus
Going to the library: 58 days
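The proportion behind this analogy can be checked with a few lines of C. This is only a sketch: it takes 120 x 10^-9 sec for main memory and 30 x 10^-3 sec for disk, and the function names slowdown_ratio and scaled_days are made up for illustration.

```c
#include <assert.h>

/* Ratio of disk access time to main-memory access time (both in seconds). */
double slowdown_ratio(double memory_sec, double disk_sec)
{
    assert(memory_sec > 0.0);
    return disk_sec / memory_sec;
}

/* Stretch a human-scale "fast" action by the same ratio and
   return the resulting duration in days. */
double scaled_days(double fast_action_sec, double ratio)
{
    const double seconds_per_day = 86400.0;
    return fast_action_sec * ratio / seconds_per_day;
}
```

Calling slowdown_ratio(120e-9, 30e-3) gives a factor of about 250,000, and scaled_days(20.0, ratio) then stretches the 20-second index lookup to roughly 58 days.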
Normal Arrangement
• Secondary storage (SS) provides reliable, long-term storage for large volumes of data
• At any given time, we are usually interested in only a small portion of the data
• This data is loaded temporarily into main memory, where it can be rapidly manipulated and processed.
• As our interests shift, data is transferred automatically between MM and SS, so the data we are focused on is always in MM.
Goal of File Structures
• Minimize the number of trips to the disk needed to get the desired information
• Group related information so that we are likely to get everything we need with only one trip to the disk.
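The effect of grouping can be sketched with a very simplified cost model. It assumes every block read costs one trip to the disk and that records are packed a fixed number per block. The function name block_reads and the numbers in the note below are illustrative, not from the course.

```c
#include <assert.h>

/* Number of block reads (disk trips) needed to scan n_records
   when each block holds records_per_block records.
   Uses ceiling division so a partly filled last block still counts. */
long block_reads(long n_records, long records_per_block)
{
    assert(n_records >= 0 && records_per_block > 0);
    return (n_records + records_per_block - 1) / records_per_block;
}
```

Under this model, scanning 1000 records one record per trip takes 1000 trips, while packing 50 related records per block reduces that to 20.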
Physical Files and Logical Files
• Physical file: a collection of bytes stored on a disk.
• Logical file: a "channel" that connects the program to a physical file.
• The program (application) sends bytes to (or receives bytes from) a file through the logical file. The program knows nothing about where the bytes go (or come from).
• The operating system is responsible for associating a logical file in a program with a physical file on disk. Writing to or reading from a file in a program is done through the operating system.
Files
• The physical file has a name, for instance myfile.txt
• The logical file has a logical name (a variable) inside the program.
– In C:
FILE *outfile;
– In C++:
fstream outfile;
Basic File Processing Operations
• Opening
• Closing
• Reading
• Writing
• Seeking
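A minimal sketch of the five operations using the standard C library. The helper name write_then_read_from and its parameters are invented for illustration. It writes a string to a physical file, reopens the file, seeks to a byte offset, and reads the rest into a buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes `text` to the physical file `path`, then reopens the file,
   seeks to byte `offset`, and reads up to cap-1 bytes into `buf`
   (always NUL-terminated). Returns the number of bytes read, or -1 on error. */
long write_then_read_from(const char *path, const char *text,
                          long offset, char *buf, size_t cap)
{
    assert(path && text && buf && cap > 0);

    FILE *outfile = fopen(path, "wb");                 /* opening */
    if (outfile == NULL) return -1;
    size_t len = strlen(text);
    size_t written = fwrite(text, 1, len, outfile);    /* writing */
    if (fclose(outfile) != 0 || written != len)        /* closing */
        return -1;

    FILE *infile = fopen(path, "rb");
    if (infile == NULL) return -1;
    if (fseek(infile, offset, SEEK_SET) != 0) {        /* seeking */
        fclose(infile);
        return -1;
    }
    size_t n = fread(buf, 1, cap - 1, infile);         /* reading */
    buf[n] = '\0';
    fclose(infile);
    return (long)n;
}
```

Here outfile and infile are the logical files; the string passed as path names the physical file. The program never learns where on disk the bytes end up.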
File Systems
• Data is not scattered hither and thither on disk.
• Instead, it is organized into files.
• Files are organized into records.
• Records are organized into fields.
Example
• A student file may be a collection of student records, one record for each student
• Each student record may have several fields, such as
– Name
– Address
– Student number
– Gender
– Age
– GPA
• Typically, each record in a file has the same fields.
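One possible way to lay out such a file in C is as a sequence of fixed-length records, sketched below. The field sizes and the helper names put_student and get_student are assumptions made for illustration. Because every record has the same size, record i starts at byte i * sizeof(struct student), so one seek reaches it directly.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One fixed-length student record; the field sizes are illustrative. */
struct student {
    char   name[32];
    char   address[64];
    long   number;
    char   gender;
    int    age;
    double gpa;
};

/* Store `rec` in slot `index` of an open binary file. Returns 0 on success. */
int put_student(FILE *fp, long index, const struct student *rec)
{
    assert(fp != NULL && rec != NULL && index >= 0);
    if (fseek(fp, index * (long)sizeof *rec, SEEK_SET) != 0) return -1;
    return fwrite(rec, sizeof *rec, 1, fp) == 1 ? 0 : -1;
}

/* Load the record in slot `index` into `rec`. Returns 0 on success,
   -1 if the seek fails or the slot lies past the end of the file. */
int get_student(FILE *fp, long index, struct student *rec)
{
    assert(fp != NULL && rec != NULL && index >= 0);
    if (fseek(fp, index * (long)sizeof *rec, SEEK_SET) != 0) return -1;
    return fread(rec, sizeof *rec, 1, fp) == 1 ? 0 : -1;
}
```

Writing raw structs keeps the sketch short, but the resulting file depends on the compiler's padding and the machine's byte order, so it is not portable between systems.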
Properties of Files
1) Persistence: Data written into a file persists after the program stops, so the data can be used later.
2) Shareability: Data stored in files can be shared by many programs and users simultaneously.
3) Size: Data files can be very large. Typically, they cannot fit into main memory.
Introduction to Database Systems
Ref. Ramakrishnan & Gehrke Chapter 1
Basic Definitions
• Data
• Database
• Database Management System (DBMS)
• Database System
Basic Definitions
• Data: Known facts that can be recorded and have an implicit meaning.
• Database: A collection of related data.
• Database Management System (DBMS): A software package/system to facilitate the creation and maintenance of a computerized database.
• Database System: The DBMS software together with the data itself. Sometimes, the applications are also included.
Files vs. DBMS
• Application must stage large datasets between main memory and secondary storage (e.g., buffering, page-oriented access, etc.)
• Special code for different queries
• Must protect data from inconsistency due to multiple concurrent users
• Crash recovery
• Security and access control
Typical DBMS Functionality
• Define a database: in terms of data types, structures and constraints
• Construct or load the database on a secondary storage medium
• Manipulate the database: querying, generating reports, insertions, deletions and modifications to its content
• Support concurrent processing and sharing by a set of users and programs, while keeping all data valid and consistent
Data Models
• A data model is a collection of concepts for describing data.
• A schema is a description of a particular collection of data, using the given data model.
• The relational model of data is the most widely used model today.
– Main concept: relation, basically a table with rows and columns.
– Every relation has a schema, which describes the columns, or fields.
Example: University Database
• Conceptual schema:
– Students(sid: string, name: string, login: string, age: integer, gpa: real)
– Courses(cid: string, cname: string, credits: integer)
– Enrolled(sid: string, cid: string, grade: string)
• Physical schema:
– Relations stored as unordered files.
– Index on first column of Students.
• External Schema (View):
– Course_info(cid: string, enrollment: integer)
Instance of Students Relation
Students(sid: string, name: string, login: string, age: integer, gpa: real)

sid    name   login       age  gpa
53666  Jones  jones@cs    18   3.4
53688  Smith  smith@ee    18   3.2
53650  Smith  smith@math  19   3.8
Levels of Abstraction
• Many external schemata, single conceptual (logical) schema and physical schema.
– External schemata describe how users see the data.
– Conceptual schema defines logical structure
– Physical schema describes the files and indexes used.
Figure: External Schema 1, External Schema 2 and External Schema 3 sit on top of the Conceptual Schema, which sits on top of the Physical Schema.
* Schemas are defined using DDL; data is modified/queried using DML.
Data Independence
Applications are insulated from how data is structured and stored.
• Logical data independence: The ability to change the logical (conceptual) schema without changing the external schema (user view).
• Physical data independence: The ability to change the physical schema without changing the logical schema.
* One of the most important benefits of using a DBMS!
Structure of a DBMS
• A typical DBMS has a layered architecture.
• This is one of several possible architectures; each system has its own variations.
Layers, from top to bottom:
Query Optimization and Execution
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management
DB
(Callout in the figure: these layers must consider concurrency control and recovery.)