Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
C HAPT E R 3 Chapter 9 DataBase & DBMS J. Glenn Glenn Brookshear Brookshear J. 蔡文能 1 Copyright © 2009 Pearson Education, Inc. Slide 8-1 Chapter 9: Database Systems • • • • • 9.5 Traditional File Structures 9.1 Database Fundamentals 9.2 The Relational Model 9.3 Object-Oriented Databases 9.4 Maintaining Database Integrity • 9.6 Data Mining • 9.7 Social Impact of Database Technology 2 Copyright © 2009 Pearson Education, Inc. Slide 8-2 Data Structure vs. File Structure • Data Structure : How to arrange data in memory • File Structure : How to arrange data in Disk and/or any other secondary storage • DataBase and DataBase Management System – Users do NOT have to care about how to store data in a file. DBMS will handle the detail. – Users can use SQL (Structured Query Language) to access the DataBase in an interactive command and/or through a program (embeded SQL) RAM Disk 3 Copyright © 2009 Pearson Education, Inc. Slide 8-3 File Types • Sequential file: one accessed in a serial manner from beginning to end. E.g. audio, video, text, programs. • Text file: sequential file in which each logical record is a single character. ASCII: 1 byte/char Unicode: 2 bytes/char 4 Copyright © 2009 Pearson Education, Inc. Slide 8-4 Sequential Files • Sequential file: A file whose contents can only be read in order – Reader must be able to detect end-of-file (EOF) – Data can be stored in logical records, sorted by a key field • Greatly increases the speed of batch updates 5 Copyright © 2009 Pearson Education, Inc. Slide 8-5 Text Files • Simple file structure. • Extendable to more complex file structures using markup languages (XHTML, HTML). • XHTML, HTML control the display of the file on a monitor. • XML is a standard for markup languages. 6 Copyright © 2009 Pearson Education, Inc. Slide 8-6 Converting data from two’s complement notation into ASCII for storage in a text file 7 Copyright © 2009 Pearson Education, Inc. Slide 8-7 Figure 9.15 A procedure for merging two sequential files 8 Copyright © 2009 Pearson Education, Inc. Slide 8-8 Figure 9.16 Applying the merge algorithm (Letters are used to represent entire records. The particular letter indicates the value of the record’s key field.) 9 Copyright © 2009 Pearson Education, Inc. Slide 8-9 XML • Example <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note> http://www.w3schools.com/xml/xml_whatis.asp 10 Copyright © 2009 Pearson Education, Inc. Slide 8-10 Indexed File key pointer Brown P1 Johnson P2 Jones P3 Smith P4 Watson P5 Index file Data item for Brown Data file Index: A list of key values and the location of their associated records 11 Copyright © 2009 Pearson Education, Inc. Slide 8-11 An inverted file 12 Copyright © 2009 Pearson Education, Inc. Slide 8-12 A file with a partial index 13 Copyright © 2009 Pearson Education, Inc. Slide 8-13 Figure 9.17 Opening an indexed file 14 Copyright © 2009 Pearson Education, Inc. Slide 8-14 Hashing • In hashing the index file is replaced by a hash function. – The storage space is divided into buckets. – Each record has a key field. Each record is stored in the bucket corresponding to the hash of its key. – A hash function computes a bucket number for each key value. • Advantage: no index table needed. • Disadvantages: i) hash function needs careful design; ii) unpredictable performance 15 Copyright © 2009 Pearson Education, Inc. Slide 8-15 Terminology • Bucket: section of the data storage area. • Key: identifier for a block of information. • Hash function: takes as input a key and outputs a bucket number. • Collision: two keys yield the same bucket number. 16 Copyright © 2009 Pearson Education, Inc. Slide 8-16 Hash Functions • Hash Function Requirements – Easily and quickly computed. – Values evenly spread over the bucket numbers. – What can go wrong: bucket number computed from 1st and 3rd characters of a name: Brown, Brook, Broom, Broadhead, Biot, Bloom, … Examples of Hash Functions Mid square: compute (key x key) and set bucket number = middle digits. Extraction: select digits from certain positions within the key. Divide key by number of buckets and use the remainder. 17 Copyright © 2009 Pearson Education, Inc. Slide 8-17 Figure 9.18 Hashing the key field value 25X3Z to one of 41 buckets 18 Copyright © 2009 Pearson Education, Inc. Slide 8-18 Figure 9.19 The rudiments of a hashing system, in which each bucket holds those records that hash to that bucket number 19 Copyright © 2009 Pearson Education, Inc. Slide 8-19 Collisions in Hashing • Collision: The case of two keys hashing to the same bucket – Clustering problem: Poorly designed hashing function can have uneven distribution of keys into buckets – Collision also becomes a problem when there aren’t enough buckets (probability greatly increases as load factor (% of buckets filled) approaches 75%) – Solution: somewhere between 50% and 75% load factor, increase number of buckets and rehash all data 20 Copyright © 2009 Pearson Education, Inc. Slide 8-20 Handling bucket overflow 21 Copyright © 2009 Pearson Education, Inc. Slide 8-21 A large file partitioned into buckets to be accessed by hashing 22 Copyright © 2009 Pearson Education, Inc. Slide 8-22 The role of an operating system when accessing a file System calls ? 23 Copyright © 2009 Pearson Education, Inc. Slide 8-23 Maintaining a file’s order by means of a File Allocation Table (FAT) 24 Copyright © 2009 Pearson Education, Inc. Slide 8-24 Information Required on a Hard Drive to Load an OS • Startup BIOS (POST , Load MBR) • Master Boot Record (MBR) – Master Boot Program – Partition Table (16 bytes * 4) • OS Boot Record (Boot Sector) – Loads the first program file of the OS • Boot Loader Program – Begins process of loading OS into memory 25 Copyright © 2009 Pearson Education, Inc. Slide 8-25 How Data Is Logically Stored on a Floppy Disk • All floppy and hard disk drives are divided into tracks and sectors • Tracks are concentric circles on a disk • Sector – Always 512 bytes – Physical organization of a disk – BIOS manages disk as sectors • Cluster (file allocation unit) – Group of sectors – Logical organization of a disk – OS views disk as a list of clusters 26 Copyright © 2009 Pearson Education, Inc. Slide 8-26 The Boot Record • Track 0, sector 1 of a floppy disk • Contains basic information about how the disk is organized • Includes bootstrap program, which can be used to boot from the disk • Uniform layout and content of boot record allows any version of DOS or Windows to read any DOS or Windows disk 27 Copyright © 2009 Pearson Education, Inc. Slide 8-27 The File Allocation Table (FAT) • Lists the location of files on disk in a one-column table • Floppy disk FAT is 12 bits wide, called FAT12 • Each entry describes how a cluster on the disk is used • A bad cluster on the disk will be marked in the FAT 28 Copyright © 2009 Pearson Education, Inc. Slide 8-28 The Root Directory • Lists all the files assigned to this table • Contains a fixed number of entries • Some items included are: – Filename and extension – Time and date of creation or last update – File attributes – First cluster number 29 Copyright © 2009 Pearson Education, Inc. Slide 8-29 How a Hard Drive is Logically Organized to Hold Data • Low-level format – Creates tracks and sectors, done at factory • Partition the hard drive (FDISK.EXE) – Creates partition table at the beginning of drive • High-level format – Done by OS for each logical drive • Master Boot Record (MBR) is the first 512 bytes of a hard drive – Master boot program (446 bytes) calls boot program to load OS – Partition table • Description, Location, Size 30 Copyright © 2009 Pearson Education, Inc. Slide 8-30 Partitions and Logical Drives 31 Copyright © 2009 Pearson Education, Inc. Slide 8-31 FAT16 • Supported by DOS and all versions of Windows • Uses 16 bits for each cluster entry • As the size of the logical drive increases, FAT16 cluster size increases dramatically 32 Copyright © 2009 Pearson Education, Inc. Slide 8-32 FAT32 • Became available with Windows 95 OSR2 • Used 32 bits per FAT entry, although only 28 bits were used to hold cluster numbers • More efficient than FAT16 in terms of cluster size 33 Copyright © 2009 Pearson Education, Inc. Slide 8-33 NTFS • Supported by Windows NT/2000/XP • Provides greater security • Used a database called the master file table (MFT) to locate files and directories • Supports large hard drives 34 Copyright © 2009 Pearson Education, Inc. Slide 8-34 Comparing FAT16, FAT32, NTFS 35 Copyright © 2009 Pearson Education, Inc. Slide 8-35 Figure 9.1: A file versus a database organization (1/2) 36 Copyright © 2009 Pearson Education, Inc. Slide 8-36 Figure 9.1: A file versus a database organization (2/2) 37 Copyright © 2009 Pearson Education, Inc. Slide 8-37 Figure 9.2: The conceptual layers of a database 38 Copyright © 2009 Pearson Education, Inc. Slide 8-38 Schemas • Schema: A description of the structure of an entire database, used by database software to maintain the database • Subschema: A description of only that portion of the database pertinent to a particular user’s needs, used to prevent sensitive data from being accessed by unauthorized personnel 39 Copyright © 2009 Pearson Education, Inc. Slide 8-39 Database Management Systems • Database Management System (DBMS): A software layer that manipulates a database in response to requests from applications • Distributed Database: A database stored on multiple machines – DBMS will mask this organizational detail from its users • Data independence: The ability to change the organization of a database without changing the application software that uses it 40 Copyright © 2009 Pearson Education, Inc. Slide 8-40 Database Models • Database model: A conceptual view of a database – Relational database model – Object-oriented database model 41 Copyright © 2009 Pearson Education, Inc. Slide 8-41 Relational Database Model • Relation: A rectangular table – Attribute: A column in the table – Tuple: A row in the table • Relational Design – Avoid multiple concepts within one relation • Can lead to redundant data • Deleting a tuple could also delete necessary but unrelated information 42 Copyright © 2009 Pearson Education, Inc. Slide 8-42 Improving a Relational Design • Decomposition: Dividing the columns of a relation into two or more relations, duplicating those columns necessary to maintain relationships – Lossless or nonloss decomposition: A “correct” decomposition that does not lose any information 43 Copyright © 2009 Pearson Education, Inc. Slide 8-43 Figure 9.3: A relation containing employee information 44 Copyright © 2009 Pearson Education, Inc. Slide 8-44 Figure 9.4: A relation containing redundancy 45 Copyright © 2009 Pearson Education, Inc. Slide 8-45 Figure 9.5 An employee database consisting of three relations 46 Copyright © 2009 Pearson Education, Inc. Slide 8-46 Figure 9.6 Finding the departments in which employee 23Y34 has worked 47 Copyright © 2009 Pearson Education, Inc. Slide 8-47 Figure 9.7: A relation and a proposed decomposition 48 Copyright © 2009 Pearson Education, Inc. Slide 8-48 Relational Operations • Select: Choose rows • Project: Choose columns • Join: Assemble information from two or more relations 49 Copyright © 2009 Pearson Education, Inc. Slide 8-49 Figure 9.8: The SELECT operation 50 Copyright © 2009 Pearson Education, Inc. Slide 8-50 Figure 9.9: The PROJECT operation 51 Copyright © 2009 Pearson Education, Inc. Slide 8-51 Figure 9.10: The JOIN operation 52 Copyright © 2009 Pearson Education, Inc. Slide 8-52 Figure 9.11: Another example of the JOIN operation 53 Copyright © 2009 Pearson Education, Inc. Slide 8-53 Figure 9.12 An application of the JOIN operation 54 Copyright © 2009 Pearson Education, Inc. Slide 8-54 Figure 9.13: The associations between objects in an object-oriented database 60 Copyright © 2009 Pearson Education, Inc. Slide 8-60 Maintaining Database Integrity (1/2) • Transaction: A sequence of operations that must all happen together – Example: transferring money between bank accounts • Transaction log: A non-volatile record of each transaction’s activities, built before the transaction is allowed to execute – Commit point: The point at which a transaction has been recorded in the log – Roll-back: The process of undoing a transaction 61 Copyright © 2009 Pearson Education, Inc. Slide 8-61 Maintaining database integrity (2/2) • Simultaneous access problems – Incorrect summary problem – Lost update problem • Locking = preventing others from accessing data being used by a transaction – Shared lock: used when reading data – Exclusive lock: used when altering data 62 Copyright © 2009 Pearson Education, Inc. Slide 8-62 表格正規化簡介 (1/2) • 第一階正規化 (1NF): – 關聯式資料庫(relational database)要求各個資料表皆須符合第一正規化 (first normal form, 簡稱1NF),如下: – 第一正規化:每一個欄位只准有一個值。 – 關聯式資料庫要求每個資料表都要有一個主鍵(primary key),用來識別每 一個tuple。 – 主鍵可以是資料表中的某一個欄位,也可以由幾個欄位組成。 – 關聯式資料庫對主鍵欄位另外有個要求,即實體完整性(entity integrity): 組成主鍵的任何欄位值,都不可以是Null。 • 第二階正規化 (2NF): – 資料表要滿足第一正規化,而且所有欄位都完全功能相依於主鍵。 • 功能相依(functionally dependence): – 由attribute X的值可以決定一個唯一的attribute Y的值,簡寫成X→Y。 • 完全功能相依(full functional dependence): – 如果attribute Y功能相依於attribute X,但是並不功能相依於attribute X的任 何子集,則稱attribute Y完全功能相依於attribute X。 63 Copyright © 2009 Pearson Education, Inc. Slide 8-63 表格正規化簡介 (2/2) X→Y,Y→Z的關係稱做遞移功能相依(transitive functional dependence),即Z遞移功能相依於X。 • 第三階正規化(third normal form,簡稱3NF) – 資料表要滿足第二正規化,而且所有欄位都不可遞移功能相依於主鍵。 • 第三階正規化解決的問題 – 解決了新增資料的問題 – 解決了刪除資料的問題 – 解決了修改資料的問題 64 Copyright © 2009 Pearson Education, Inc. Slide 8-64 Data Mining (資料探勘) • Data Mining: The area of computer science that deals with discovering patterns in collections of data • Data warehouse: A static data collection to be mined – Data cube: Data presented from many perspectives to enable mining • Data Mining Strategies – – – – – – Class description Class discrimination Cluster analysis Association analysis Outlier analysis Sequential pattern analysis 65 Copyright © 2009 Pearson Education, Inc. Slide 8-65 Data mining(資料探勘)主要功能與技術 功能 技術 適用領域 關聯性 (Association) 案例庫推理/集合理論/統計 菜籃分析 時間序列 (Sequence) 類神經網路/統計 利率預測 分類 (Classification) 基因演算/類神經網路/統計/ 模糊邏輯案例推理/決策樹 客戶評鑑分類 公式 (Modeling) 基因規劃/基因演算/迴歸 銷售預測 群組 (Clustering) 類神經網路/模糊邏輯/ 基因演算/統計 市場區隔 66 Copyright © 2009 Pearson Education, Inc. Slide 8-66 Social Impact of Database Technology • Problems – Massive amounts of personal data are being collected • Often without knowledge or meaningful consent of affected people – Data merging produces new, more invasive information – Errors are widely disseminated and hard to correct • Remedies – Existing legal remedies often difficult to apply – Negative publicity may be more effective 67 Copyright © 2009 Pearson Education, Inc. Slide 8-67 Thank You! DataBase Management System 謝謝捧場 [email protected] 蔡文能 68 Copyright © 2009 Pearson Education, Inc. Slide 8-68