* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download CENG 351 Introduction to Data Management and File Structures
Survey
Document related concepts
Transcript
CENG 351 Introduction to Data Management and File Structures Nihan Kesim Çiçekli Department of Computer Engineering METU CENG 351 1 CENG 351 • • • • Instructor: Nihan Kesim Çiçekli Office: A308 Email: [email protected] Lecture Hours: Section 1: Mon. 13:40, 14:40 (BMB4); Thu. 10:40 (BMB4) Section 2: Wed. 9:40, 10:40 (BMB5); Thu. 11:40 (BMB1) • Course Web page: http://cow.ceng.metu.edu.tr • Teaching Assistants: – Emre Işıklıgil Office: A402 [email protected] – Alev Mutlu Office: A302 [email protected] – Abdullah Doğan Office: A206 [email protected] CENG 351 2 References 1. Raghu Ramakrishnan, Database Management Systems (3rd. ed.), McGraw Hill, 2003. 2. R. Elmasri, S.B. Navathe, Fundamentals of Database Systems, 4th edition, Addison-Wesley, 2004. CENG 351 3 Course Outline 1. Introduction to relational database systems 2. Relational Model and E/R Modeling 3. Relational Algebra, Relational Calculus 4. Structural Query language (SQL) 5. Secondary Storage Media 6. Fundamental File Structure Concepts 7. Sequential File Processing 8. External Sorting of Large Files 9. Indexing: Multilevel Indexing and B+ trees 10. Hashing (static, linear, extendible hashing) 11. SQL Query Evaluation and optimization issues CENG 351 4 Grading • • • • • Assignments Attendence/Quiz Midterm Exam 1 Midterm Exam 2 Final 25% 5% 20% 20% 30% Tentative Exam Dates: Midterm Exam 1: Nov. 15, 2012 Midterm Exam 2: Dec. 20, 2012 CENG 351 5 Grading Policies • Policy on missed midterm: – no make-up exam • Lateness policy: – Every student has a total of 5 days for late submission for assignments. One can spend this credit for any of the assignments or distribute it for all. If total of late submissions exceeds the limit, a penalty of 10*day*day is applied. • All assignments and programs are to be your own work. No group projects or assignments are allowed. CENG 351 6 Introduction to File management CENG 351 7 Motivation Most computers are used for data processing. A big growth area in the “information age” This course covers data processing from a computer science perspective: – – – – Storage of data Organization of data Access to data Processing of data CENG 351 8 Data Structures vs File Structures • Both involve: – Representation of Data + – Operations for accessing data • Difference: – Data structures: deal with data in main memory – File structures: deal with data in secondary storage CENG 351 9 Where do File Structures fit in Computer Science? Application DBMS File system Operating System Hardware CENG 351 10 Computer Architecture data is manipulated here Main Memory (RAM) - Semiconductors - Fast, expensive, volatile, small data transfer data is stored here - disks, tape Secondary Storage CENG 351 - Slow,cheap, stable, large 11 Advantages • • • Main memory is fast Secondary storage is big (because it is cheap) Secondary storage is stable (non-volatile) i.e. data is not lost during power failures Disadvantages • • • Main memory is small. Many databases are too large to fit in main memory (MM). Main memory is volatile, i.e. data is lost during power failures. Secondary storage is slow (10,000 times slower than MM) CENG 351 12 How fast is main memory? • Typical time for getting info from: Main memory: ~12 nanosec = 120 x 10-9 sec Magnetic disks: ~30 milisec = 30 x 10-3 sec • An analogy keeping same time proportion as above: Looking at the index of a book : 20 sec versus Going to the library: 58 days CENG 351 13 Normal Arrangement • • • • Secondary storage (SS) provides reliable, longterm storage for large volumes of data At any given time, we are usually interested in only a small portion of the data This data is loaded temporarily into main memory, where it can be rapidly manipulated and processed. As our interests shift, data is transferred automatically between MM and SS, so the data we are focused on is always in MM. CENG 351 14 Goal of the file structures • Minimize the number of trips to the disk in order to get desired information • Grouping related information so that we are likely to get everything we need with only one trip to the disk. CENG 351 15 Physical Files and Logical Files • physical file: a collection of bytes stored on a disk or tape • logical file: a "channel" (like a telephone line) that connects the program to a physical file • The program (application) sends (or receives) bytes to (from) a file through the logical file. The program knows nothing about where the bytes go (came from). • The operating system is responsible for associating a logical file in a program to a physical file in disk or tape. Writing to or reading from a file in a program is done through the operating system. CENG 351 16 Files • The physical file has a name, for instance myfile.txt • The logical file has a logical name (a varibale) inside the program. – In C : FILE * outfile; – In C++: fstream outfile; CENG 351 17 Basic File Processing Operations • • • • • Opening Closing Reading Writing Seeking CENG 351 18 File Systems • Data is not scattered hither and thither on disk. • Instead, it is organized into files. • Files are organized into records. • Records are organized into fields. CENG 351 19 Example • A student file may be a collection of student records, one record for each student • Each student record may have several fields, such as – – – – – – Name Address Student number Gender Age GPA • Typically, each record in a file has the same fields. CENG 351 20 Properties of Files 1) Persistance: Data written into a file persists after the program stops, so the data can be used later. 2) Sharability: Data stored in files can be shared by many programs and users simultaneously. 3) Size: Data files can be very large. Typically, they cannot fit into main memory. CENG 351 21 Introduction to Database Systems Ref. Ramakrishnan & Gehrke Chapter 1 22 Basic Definitions • • • • Data Database Database Management System (DBMS) Database System 23 Basic Definitions • Data: Known facts that can be recorded and have an implicit meaning. • Database: A collection of related data. • Database Management System (DBMS): A software package/ system to facilitate the creation and maintenance of a computerized database. • Database System: The DBMS software together with the data itself. Sometimes, the applications are also included. 24 Files vs. DBMS • Application must stage large datasets between main memory and secondary storage (e.g., buffering, page-oriented access, etc.) • Special code for different queries • Must protect data from inconsistency due to multiple concurrent users • Crash recovery • Security and access control 25 Typical DBMS Functionality • Define a database : in terms of data types, structures and constraints • Construct or load the database on a secondary storage medium • Manipulating the database : querying, generating reports, insertions, deletions and modifications to its content • Concurrent Processing and Sharing by a set of users and programs – yet, keeping all data valid and consistent 26 Data Models A data model is a collection of concepts for describing data. A schema is a description of a particular collection of data, using the given data model. The relational model of data is the most widely used model today. – Main concept: relation, basically a table with rows and columns. – Every relation has a schema, which describes the columns, or fields. 27 Example: University Database Conceptual schema: – – – Students(sid: string, name: string, login: string, age: integer, gpa:real) Courses(cid: string, cname:string, credits:integer) Enrolled(sid:string, cid:string, grade:string) Physical schema: – – Relations stored as unordered files. Index on first column of Students. External Schema (View): – Course_info(cid:string,enrollment:integer) 28 Instance of Students Relation Students( sid: string, name: string, login: string, age: integer, gpa: real ) sid 53666 53688 53650 name Jones Smith Smith login jones@cs smith@ee smith@math age 18 18 19 gpa 3.4 3.2 3.8 29 Levels of Abstraction Many external schemata, single conceptual(logical) schema and physical schema. – External schemata describe how users see the data. – Conceptual schema defines logical structure – Physical schema describes the files and indexes used. External Schema 1 External Schema 2 External Schema 3 Conceptual Schema Physical Schema * Schemas are defined using DDL; data is modified/queried using DML. 30 Data Independence Applications insulated from how data is structured and stored. Logical data independence: Protection from changes in logical structure of data. Physical data independence: Protection from changes in physical structure of data. * One of the most important benefits of using a DBMS! 31 These layers must consider concurrency control and recovery Structure of a DBMS A typical DBMS has a layered architecture. This is one of several possible architectures; each system has its own variations. Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB 32