Chapter 4 study outline

I. What is a physical data model?
   A. Three levels of database design and their goals
      1) Conceptual: Generate a faithful representation of real-world data. Defines entities, attributes, and relationships.
      2) Logical: Generate the most compact and robust representation in the computer, translated from the conceptual model. Defines relations, atomic data types, and primary and foreign keys.
      3) Physical: Implement the logical data model on the computer hardware and OS in the most efficient manner. Defines file structures and indices.
         “Efficient”: queries and data retrieval run fastest, at a low I/O cost.
   B. Database “tuning”
      a) Changing physical data model parameters to make queries efficient
   C. Physical data model concepts
      1) Primary versus secondary memory
      2) File structures
      3) Database index or indices
   D. Problems with spatial data
      1) Total ordering is required for file structures and indices
         a) One-dimensional data can be totally ordered; multi-dimensional data has no natural total order
      2) Solutions
         a) Use space-filling curves to generate an artificial order
         b) Use spatial indices such as grid files or R-trees

II. Computer storage hardware
   A. Types of memory
      1) Primary (Random Access Memory, RAM): Access is fast, but capacity is small, and the contents disappear when the power goes off. The CPU accesses RAM directly to compute on the data held there.
      2) Secondary (“hard disk”): Access is several orders of magnitude slower than RAM, but the disk stores a large volume of records and retains its contents without power.
      A database's dataset usually does not fit in main memory, so the data must be stored on secondary storage. The difference in access time between primary and secondary memory causes performance problems.
      The goal of physical database design is to reduce the amount of data transferred between disk and RAM (I/O). Spatial databases must also consider CPU time, since spatial operations are computationally heavy.
   B. The geometry of disk drives
      1) Components
         a) Platter: a metallic disk, similar in shape to a CD, coated with magnetic media. A collection of platters is mounted on a spindle and rotates.
         b) Track: a concentric ring on a platter
         c) Sector: an arc within a track, holding a few kilobytes
         d) Cylinder: the collection of tracks with a common radius across the platters
         e) Disk block (page): the smallest unit of transfer between the disk and main memory; an integer multiple of the sector size
         f) Disk head: reads and writes data on the sectors. Each platter has a disk head, and the heads move together as one head assembly.
      2) Access time: transferring data from the disk to main memory takes three steps
         a) Seek: the time the disk head takes to reach the target track
         b) Latency: the time the page takes to rotate underneath the disk head
         c) Transfer: the time the disk head takes to read or write the data in the page
      3) Relative size of each component
         Seek time > latency time >> transfer time
      4) Some rules of thumb
         a) A larger sector size gives faster transfer but wastes storage space inside the sector when the data is small.
         b) Storing the most frequently transferred data on the middle tracks reduces seek time.
         c) Storing larger data that needs multiple sectors in a single cylinder reduces seek time.
   C. Buffer manager
      1) Buffer: a portion of main memory where frequently accessed data is kept
      2) Buffer manager: the software module in the DBMS that manages the allocation of buffer space
         - Which pages stay in the buffer
         - Which page is removed when main memory runs out of space
      3) Rules for transferring records in and out of the buffer, based on the access pattern of the queries
         a) Hot set: a set of pages that is accessed over and over
         b) Least recently used (LRU): the page that has gone unused the longest is removed first when RAM gets full
   D. Mapping the software view to hardware
      1) Field: a data type describing an attribute of an entity; represents a column of a table
      2) Record: a collection of fields of the same or different types; represents a row of a table
      3) File: a collection of records, possibly spanning multiple sectors that are linked by pointers
      4) File system: a collection of files organized into directories
      5) Pointer: holds the address of a sector

III. File structures
   A. What is a file structure?
      The organization of records in a file for efficient implementation of common file operations. Efficiency is measured by the sum of the I/O cost (number of transfers between RAM and disk) and the CPU cost (number of CPU instructions for computation).
   B. Common file operations
      1) Find: return the record matching the key value
      2) Findnext: return the next record after a find operation, if the records are ordered
      3) Insert: add a new record without changing the file structure. The location of the added record depends on the file structure.
      4) Delete: delete a record
      5) Nearest neighbor (in data space and geo-space): find the nearest point to a query point
   C. Common file structures
      1) Heap: records are unordered.
         Advantage: insert. Disadvantage: find and findnext.
      2) Ordered: records are ordered on the key fields.
         Advantage: findnext. Find uses binary search, which is efficient; insert and delete also locate the record by binary search, but may have to move records to keep the file in order.
      3) Hash: records are divided into buckets (sectors) by a hash function, which computes a bucket address from a given key value. Within each bucket, the records are not ordered.
         Advantage: insert, delete, and find. Disadvantage: findnext and nearest neighbor.
      4) Clustering
         In general, clustering reduces I/O cost by grouping records that are commonly accessed together by a query into common sectors. For spatial databases, objects that are neighbors in space or are jointly requested by queries are stored physically nearby in secondary memory.
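As a rough illustration of why clustering lowers I/O cost, the sketch below counts how many disk pages a small window query touches under two storage orders. The 4x4 grid of point objects, the page size of 4 records, and the names pages_touched, column_order, and clustered are assumptions made up for this example; they do not come from the chapter.

```python
from itertools import product

PAGE_SIZE = 4  # records per disk page (illustrative value)

def pages_touched(stored_order, wanted, page_size=PAGE_SIZE):
    """Count the distinct pages that hold at least one wanted record."""
    wanted = set(wanted)
    return len({i // page_size
                for i, rec in enumerate(stored_order)
                if rec in wanted})

# 16 point objects on a 4x4 grid, identified by (x, y).
points = list(product(range(4), range(4)))

# Storage order 1: column by column, so each page holds one column.
column_order = points

# Storage order 2: grouped by 2x2 block, so spatial neighbors share a page.
clustered = sorted(points, key=lambda p: (p[0] // 2, p[1] // 2))

# Window query asking for the four objects in the lower-left 2x2 block.
window = [(x, y) for x in (0, 1) for y in (0, 1)]

print(pages_touched(column_order, window))  # 2
print(pages_touched(clustered, window))     # 1
```

With the block-grouped order, the four objects requested together sit on one page, so the query needs one page read instead of two.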
         There are three types of clustering for spatial databases.
         a) Internal clustering: the complete record set for a single object is stored in one disk page
         b) Local clustering: a set of objects that are accessed together is stored in one page
         c) Global clustering: a set of spatially adjacent objects is stored on several consecutive pages
   D. Clustering spatial data using space-filling curves
      1) Difficulty: mapping 2-D space onto 1-D structures
      2) Space-filling curve: a continuous curve that passes through every cell of a grid (with a side length that is any power of 2) in an order generated by an algorithm, making it possible to order 2-D data in a 1-D file structure. Examples: Z-curve and Hilbert curve (no curve is perfect).
         Valuable properties
         a) One-to-one correspondence: no two points are mapped to the same point on the curve
         b) Distance-preserving: elements close to each other in space are also mapped to neighboring points on the curve
      3) Z-curve
         a) Basic algorithm
            1. Write the x and y coordinates of each grid cell in binary.
            2. Interleave the bits of the two coordinates into one string.
            3. Convert the bit string back to decimal, which gives the order in which the points are traced.
            Example, taking the x bit first: x = 2 (binary 10) and y = 3 (binary 11) interleave to 1101, which is 13.
         b) Use with vector data
      4) Hilbert curve
         a) Disadvantage: more complex to generate; computing the entry and exit points of the curve is especially difficult
         b) Advantage: eliminates the large leaps (diagonal lines) across the space seen in the Z-curve

IV. Database index
   A. What is an index?
      An index speeds up searching a data file. It is stored in an auxiliary file that holds key values and the addresses of the pages containing them. Index files also take up storage space. The key values in an index file are ordered.
   B. Primary versus secondary indices
      1) Primary index
         Records in the data file are ordered by the key field, so the index file needs only the first key value of each page of the data file.
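A minimal sketch of a primary index lookup, assuming a tiny data file of four pages with three sorted keys each. The page contents and the names pages, index_keys, and find are illustrative assumptions, not part of the chapter.

```python
from bisect import bisect_right

# Data file: pages of records, ordered by key across the whole file
# (illustrative values).
pages = [[3, 8, 12], [15, 21, 27], [30, 42, 57], [61, 70, 88]]

# Primary index: only the first key of each page; position = page number.
index_keys = [page[0] for page in pages]

def find(key):
    """Return (page_number, slot) for key, or None if the key is absent."""
    # Last page whose first key is <= key.
    page_no = bisect_right(index_keys, key) - 1
    if page_no < 0:
        return None               # key is smaller than every stored key
    page = pages[page_no]         # stands in for a single page read
    if key in page:
        return page_no, page.index(key)
    return None

print(index_keys)  # [3, 15, 30, 61]
print(find(21))    # (1, 1)
print(find(29))    # None
```

Here the index holds 4 entries for 12 records, and each lookup reads one data page after the in-memory search of the index.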
         This reduces the size of the index file.
      2) Secondary index
         Records in the data file are not ordered by the key field, though the key values in the index file are ordered. The index file stores every key value in the data file together with its page address.
   C. Structuring index files
      1) Tree: a hierarchical network of nodes (root, parents versus children, and leaves, which have no further nodes below)
      2) How trees help in searching
         Every search starts at the root node; the pointers and key values at each node guide the search down to the leaf that tells where the data is stored.
         B-trees: a common index file structure for one-dimensional ordered records
         “Balanced” trees: every path from the root node to a leaf node has the same length
         Typical node: P1 K1 ... Pn-1 Kn-1 Pn
         For leaf nodes, Pi points to either a file record with search key value Ki or a bucket of pointers to records with that search key value.
         For a non-leaf node with m pointers, Pi (i = 2, ..., m-1) points to a subtree containing search key values greater than or equal to Ki-1 and less than Ki. Pm points to a subtree containing search key values greater than or equal to Km-1. P1 points to a subtree containing search key values less than K1.
         Fan-out: the number of pointers in a node.
         More info: http://www.cs.sfu.ca/CC/354/zaiane/material/notes/Chapter11/node10.html
      5) Grid files
         Divide space into grid cells that point to the corresponding disk pages.
         a) Fixed grids
            The cells are of equal size.
            One-to-one correspondence between grid cells and disk sectors.
         b) Grid files
            Improvements over fixed grids: non-uniform grid cells, and multiple cells can point to one sector.
            Two components, with two corresponding data access steps:
            1) Scales: define the cell boundaries along each coordinate direction (x and y for 2 dimensions) and identify the index of the grid directory entry to access. Located in main memory.
            2) Grid directory: contains the pointers to the buckets. Ideally held in main memory, but it can be large, in which case it resides on disk.
            Search procedure (the two-disk-access principle):
            1. Search the scales to locate the grid directory entry (1 I/O if the grid directory does not fit in main memory).
            2. Retrieve the bucket pointer from the selected grid directory entry and access the bucket (1 I/O).
            Advantage: search time is good. Disadvantage: requires a large amount of memory for the grid directory and scales.
      6) The R-tree family: a hierarchical collection of rectangles that organizes spatial objects. Like a B-tree, the tree is balanced.
         a) Basic components
            MBR (minimum bounding rectangle) for location, plus an object ID or node ID
         b) Tree structure
            A leaf node entry has an MBR (lower-left and top-right corners) and an object ID; the MBR contains the data object.
            A non-leaf node entry has an MBR and a node ID; the MBR contains the child MBRs.
         c) Basic R-tree
            - Each object appears in only one leaf node.
            - MBRs are allowed to overlap each other, which causes problems.
            Problems with overlap and coverage
            a) Coverage: the total area covered by all the MBRs at a node; an indirect measure of the empty space covered by the tree. The larger the empty space, the less efficient the search.
            b) Overlap: a search may have to descend more than one subtree at a node, which lengthens search time.
            Coverage and overlap should both be minimized.
         d) R+ tree
            - Eliminates overlap and minimizes coverage at any given level.
            - Objects may appear in multiple leaf nodes.
            For a point query, the search follows a single path from the root to a leaf, but the duplicated objects require extra storage space.
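The overlap problem in the basic R-tree can be sketched with two child MBRs of one non-leaf node. This is a minimal sketch under stated assumptions: the MBR class, the helper names bounding, overlap_area, and subtrees_to_search, and the rectangle coordinates are all hypothetical and are not taken from the chapter or from any R-tree library.

```python
from typing import NamedTuple

class MBR(NamedTuple):
    """Minimum bounding rectangle: lower-left and top-right corners."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains_point(self, x, y):
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax

def bounding(rects):
    """MBR of a non-leaf entry: the smallest rectangle enclosing its children."""
    return MBR(min(r.xmin for r in rects), min(r.ymin for r in rects),
               max(r.xmax for r in rects), max(r.ymax for r in rects))

def overlap_area(a, b):
    """Area shared by two MBRs (0 if they do not overlap)."""
    width = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    height = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    return width * height if width > 0 and height > 0 else 0

def subtrees_to_search(children, x, y):
    """Child MBRs a point query must descend into."""
    return [c for c in children if c.contains_point(x, y)]

# Two overlapping child MBRs of one node (illustrative coordinates).
left = MBR(0, 0, 6, 4)
right = MBR(4, 2, 10, 8)

print(bounding([left, right]))                         # MBR(xmin=0, ymin=0, xmax=10, ymax=8)
print(overlap_area(left, right))                       # 4
print(len(subtrees_to_search([left, right], 5, 3)))    # 2
print(len(subtrees_to_search([left, right], 1, 1)))    # 1
```

A query point inside the overlap region forces the search into both subtrees, while a point outside it follows one path. Removing that overlap is what the R+ tree trades extra storage for.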
