Chapter 4 study outline
I. What is a physical data model?
A. Three levels of database design and their goals
1) Conceptual: Generate faithful representation of data in real world. Define entity,
attribute, relationship.
2) Logical: Generate most compact and robust representation in the computer translated
from conceptual model. Define relation, atomic data type, primary or foreign key.
3) Physical: Implementation of the logical data model on computer hardware and the OS in the
most efficient manner. Define file structure, index.
“Efficient”: queries and data retrieval run as fast as possible, with low I/O cost.
B. Database “tuning”
a) Changing physical data parameters to make queries efficient
C. Physical data model concepts
1) Primary versus secondary memory
2) File structures
3) Database index or indices
D. Problems with spatial data
1) Total ordering required for file structures & indices
a) One-dimensional data can be totally ordered; multidimensional data has no natural total order that preserves spatial proximity
2) Solutions
a) Use space-filling curves to generate artificial order
b) Use spatial indices such as grid files or R-tree
II. Computer storage hardware
A. Types of memories
1) Primary (Random Access Memory: RAM)
Access is fast, but capacity is small, and contents disappear when power goes off.
The CPU accesses RAM directly and computes on the data held there.
2) Secondary or “hard disk”
Access time is several orders of magnitude slower than RAM, but it stores a large
volume of records and retains its contents without power.
A database usually does not fit into main memory, so the data must be stored on
secondary storage. The difference in access time between primary and secondary memory causes
performance problems. The goal of physical database design is to reduce the amount of data
transferred between disk and RAM (I/O). In addition, a spatial database must also
consider CPU time, since spatial operations are computationally heavy.
B. The geometry of disk drives
1) Components
a) Platter: a metallic disk, similar to a CD, coated with magnetic media. A collection of platters is
mounted on a spindle and rotates.
b) Track: a concentric ring on a platter
c) Sector: an arc within a track, holding a few kilobytes
d) Cylinder: the collection of tracks with a common radius across the platters
e) Disk block (page): the smallest unit of transfer between the disk and main memory.
An integer multiple of the sector size.
f) Disk head: reads and writes data on the sectors. Each platter has a disk head; together they form
the head assembly. Disk heads move together.
2) Access time to the disk: the process of transferring data from the disk to main
memory is divided into three parts
a) Seek: the time the disk head takes to reach the particular track.
b) Latency: the time the page takes to rotate underneath the disk head.
c) Transfer: the time the disk head takes to read or write the data in the page.
3) Relative size of each component
Seek time > latency time >> transfer time
4) Some rules of thumb
a) A larger sector size provides faster transfer, but wastes storage space inside the sector for
small data.
b) Storing the most frequently transferred data on the middle tracks reduces seek time.
c) Storing larger data that requires multiple sectors in a single cylinder reduces seek
time.
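As a rough illustration of how seek, latency and transfer add up, the sketch below estimates the time to read one page. All numbers (seek time, rotation speed, transfer rate, page size) are hypothetical values chosen for illustration, and average latency is taken as half a rotation, which is a common simplification rather than something stated in these notes.

```python
def access_time_ms(seek_ms, rpm, transfer_mb_per_s, page_kb):
    """Estimate the time (ms) to read one page: seek + latency + transfer."""
    rotation_ms = 60_000.0 / rpm        # duration of one full rotation
    latency_ms = rotation_ms / 2.0      # assumption: wait half a rotation on average
    transfer_ms = (page_kb / 1024.0) / transfer_mb_per_s * 1000.0
    return seek_ms + latency_ms + transfer_ms, latency_ms, transfer_ms


# Hypothetical drive: 9 ms average seek, 7200 rpm, 100 MB/s, 4 KB pages.
SEEK_MS = 9.0
total_ms, latency_ms, transfer_ms = access_time_ms(
    seek_ms=SEEK_MS, rpm=7200, transfer_mb_per_s=100.0, page_kb=4
)
print(f"seek={SEEK_MS:.2f} ms  latency={latency_ms:.2f} ms  "
      f"transfer={transfer_ms:.3f} ms  total={total_ms:.2f} ms")
```

With these assumed values the transfer term is tiny compared with seek and latency, which is why the rules of thumb concentrate on reducing head movement rather than on transfer speed.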
C. Buffer manager
1) Buffer: a portion of main memory where frequently accessed data is kept
2) Buffer manager: software module in the DBMS that manages the allocation of buffer space
Which sectors stay in the buffer
Which sector is removed when main memory runs out of space
3) Rules for transferring records in and out of the buffer -> based on access patterns in queries
a) Hot set: a set of pages that is accessed over and over
b) Least recently used -> the page that was used least recently is removed first when RAM gets full.
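The least-recently-used rule can be sketched as a toy buffer pool. This is a minimal illustration, not how any particular DBMS implements its buffer manager: the class name LRUBuffer, the read_page_from_disk callback and the page contents are all hypothetical.

```python
from collections import OrderedDict


class LRUBuffer:
    """Toy buffer pool holding at most `capacity` pages; evicts the least recently used."""

    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page_from_disk = read_page_from_disk  # hypothetical disk-read callback
        self.pages = OrderedDict()   # page_id -> contents, least recently used first
        self.disk_reads = 0          # number of I/Os caused by buffer misses

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)       # hit: mark as most recently used
            return self.pages[page_id]
        self.disk_reads += 1                      # miss: one disk I/O
        data = self.read_page_from_disk(page_id)
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)        # evict the least recently used page
        self.pages[page_id] = data
        return data


buf = LRUBuffer(capacity=2, read_page_from_disk=lambda pid: f"page-{pid}")
for pid in [1, 2, 1, 3, 1, 2]:
    buf.get(pid)
print(buf.disk_reads, list(buf.pages))
```

In this access sequence page 1 behaves like a hot set: it is requested repeatedly, so it is never the least recently used page and stays in the buffer.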
D. Mapping software view to hardware
1) Field: data type describing an attribute of an entity, representing a column of a table
2) Record: a collection of fields of the same or different types, representing a row of a table
3) File: a collection of records, possibly spanning multiple sectors that are linked by
pointers
4) File system: a collection of files organized into directories
5) Pointer: contains the address of a sector
III. File structures
A. What is a file structure?
The organization of records in a file for efficient implementation of common file operations.
Efficiency is measured by the sum of I/O cost (number of page transfers between RAM and
disk) and CPU cost (number of CPU instructions needed for the computation).
B. Common file operations
1) Find: return the record matching the key value
2) Findnext: return the next record after a find operation, if the records are ordered
3) Insert: add a new record without changing the file structure. The location of the added record
depends on the file structure.
4) Delete: delete a record
5) Nearest neighbor (in data space and geo-space): find the nearest point to a query point
C. Common file structures
1) Heap: records are unordered
Advantage is in the insert operation. Disadvantage is in the find and findnext operations.
2) Ordered: records are ordered based on key fields
Advantage is in findnext. Find, insert and delete use binary search to locate the record, which is efficient.
3) Hash: records are divided into buckets (sectors) using a hash function. Within each
bucket, the records are not ordered. The hash function takes a key value and returns
the address of the bucket.
Advantage is in insert, delete and find. Disadvantage is in findnext and nearest
neighbor.
4) Clustering
In general, clustering reduces I/O cost by grouping records commonly accessed together by
a query (storing them in common sectors). For a spatial database, objects that are neighbors in
space or jointly requested by queries are stored physically nearby in secondary
memory. There are three types of clustering for spatial databases.
a) Internal clustering: the complete set of records for a single object is stored in one
disk page
b) Local clustering: a set of objects that are accessed together is stored in one page
c) Global clustering: a set of spatially adjacent objects is stored on several
consecutive pages.
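To make the trade-offs among the heap, ordered and hash file structures concrete, the sketch below implements find for each over a small in-memory list. It is only an illustration: the records are made up, a Python list stands in for a file, and the modulo hash function and bucket count are assumptions.

```python
from bisect import bisect_left

records = [(17, "q"), (3, "a"), (42, "z"), (8, "c"), (25, "m")]  # (key, payload), made up


def heap_find(heap, key):
    """Heap file: unordered, so find scans the records one by one."""
    for k, payload in heap:
        if k == key:
            return payload
    return None


# Ordered file: sorted on the key, so find can use binary search.
ordered = sorted(records)
ordered_keys = [k for k, _ in ordered]


def ordered_find(key):
    """Return (position, payload) for the key, or None if it is absent."""
    i = bisect_left(ordered_keys, key)
    if i < len(ordered_keys) and ordered_keys[i] == key:
        return i, ordered[i][1]
    return None


def ordered_findnext(i):
    """Findnext is simply the following record in key order."""
    return ordered[i + 1] if i + 1 < len(ordered) else None


# Hash file: a hash function maps the key to a bucket; a bucket is unordered inside.
NUM_BUCKETS = 4  # assumed bucket count
buckets = [[] for _ in range(NUM_BUCKETS)]
for k, payload in records:
    buckets[k % NUM_BUCKETS].append((k, payload))


def hash_find(key):
    """Only the one bucket selected by the hash function is scanned."""
    return heap_find(buckets[key % NUM_BUCKETS], key)
```

Note that the hash file has no counterpart to ordered_findnext: neighboring keys land in unrelated buckets, which is the disadvantage listed for findnext and nearest neighbor.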
D. Clustering spatial data using space-filling curves
1) Difficulty: mapping 2D space to 1D structures
2) -> Space-filling curve: a continuous curve that passes through every cell of a grid (whose side
length is a power of 2). The curve is generated by an algorithm and makes it possible to order 2-D
data in a 1-D way, e.g. Z-curve and Hilbert curve (there is no perfect
curve)
Valuable properties
a) One-to-one correspondence
No two points are mapped to the same point on the curve
b) Distance-preserving
Elements close to each other in space are, as far as possible, also mapped to nearby points on the curve.
3) Z-curve
a) Basic algorithm
1. Write the x and y coordinates of each grid cell in binary.
2. Interleave the bits of the two coordinates into one bit string
3. Convert the bit string back to decimal -> this gives the order in which the points are traced
b) Use with vector data
4) Hilbert curve
a) More complex to generate (disadvantage)
In particular, computing the entry and exit points of the curve is difficult
b) Eliminates the large leaps (diagonal lines) in space seen in the Z-curve (advantage)
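The two curves can be compared with a short sketch. The Z-value follows the three-step bit-interleaving algorithm; which coordinate contributes the leading bit is not specified in these notes, so putting the x bit first is an assumption. The Hilbert function uses a widely known iterative coordinate-to-index conversion, shown here as one possible way to compute the order, not necessarily the method the course has in mind.

```python
def z_value(x, y, bits):
    """Interleave the bits of x and y (x bit first, an assumed convention)."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def hilbert_value(x, y, n):
    """Position of cell (x, y) along a Hilbert curve over an n-by-n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip so the sub-curve lines up
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def max_step(order):
    """Largest Manhattan distance between consecutive cells of a traversal."""
    return max(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(order, order[1:]))


n, bits = 4, 2
cells = [(x, y) for x in range(n) for y in range(n)]
z_order = sorted(cells, key=lambda c: z_value(c[0], c[1], bits))
h_order = sorted(cells, key=lambda c: hilbert_value(c[0], c[1], n))
print("Z-curve max step:", max_step(z_order))
print("Hilbert max step:", max_step(h_order))
```

On this 4 x 4 grid every Hilbert step moves to an adjacent cell, while the Z-curve contains steps longer than one cell; those are the large leaps mentioned for the Z-curve.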
IV. Database index
A. What is an index?
An index speeds up searching a data file. The index is stored in an auxiliary file
that holds key values and the addresses of the pages containing them. Index files also take up storage space! The key
values in an index file are ordered.
B. Primary versus secondary indices
1) Primary index
Records in the data file are ordered by the key field, so the index file needs only the
first key value of each page of the data file. This reduces the size of the index file.
2) Secondary index
Records in the data file are not ordered by the key field, though the key values in the
index file are ordered. The index file stores every key value in the data file together with its page
address.
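The difference in index size can be seen in a small sketch. The data file here is a hypothetical list of three pages, ordered on a numeric key; the name field is not ordered, so an index on it has to be a secondary index with one entry per record. All record values are made up.

```python
from bisect import bisect_left, bisect_right

# Data file ordered on the numeric key; each page holds three (key, name) records.
pages = [
    [(2, "oak"), (5, "elm"), (7, "yew")],
    [(11, "ash"), (13, "fir"), (17, "bay")],
    [(19, "gum"), (23, "box"), (29, "tea")],
]

# Primary index: only the first key of each page is stored.
primary_index = [page[0][0] for page in pages]


def primary_find(key):
    """Return the name stored under `key`, or None."""
    p = bisect_right(primary_index, key) - 1   # last page whose first key <= key
    if p < 0:
        return None
    return next((name for k, name in pages[p] if k == key), None)


# Secondary index on name: one (name, page number) entry per record, sorted by name.
secondary_index = sorted(
    (name, p) for p, page in enumerate(pages) for _, name in page
)
secondary_names = [name for name, _ in secondary_index]


def secondary_find(name):
    """Return the key of the record with this name, or None."""
    i = bisect_left(secondary_names, name)
    if i == len(secondary_names) or secondary_names[i] != name:
        return None
    page_no = secondary_index[i][1]
    return next(k for k, n in pages[page_no] if n == name)


print(len(primary_index), len(secondary_index))
```

Both lookups read a single data page, but the primary index holds one entry per page while the secondary index holds one entry per record.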
C. Structuring index files
1) Tree: hierarchical network
consisting of nodes: root, parents vs. children, and leaves (no further nodes below)
2) How trees help in searching
Every search starts at the root node; the pointers and key values at each node guide the move down to
the leaf, which tells where the data is stored
B-trees: a common index file structure for one-dimensional ordered records
“Balanced” trees: every path from the root node to a leaf node is of the same length
Typical node
P1 | K1 | ... | Pn-1 | Kn-1 | Pn
For leaf nodes, pointer Pi points to either a file record with search key value Ki, or a bucket of
pointers to records with that search key value.
For non-leaf nodes with m pointers (m <= n), pointer Pi (i = 2, ..., m-1) points to a subtree containing search key
values from Ki-1 up to, but not including, Ki. Pointer Pm points to a subtree containing search key values
greater than or equal to Km-1. Pointer P1 points to a subtree containing search key values less than K1.
Fan-out: the number of pointers in the node.
more info ->http://www.cs.sfu.ca/CC/354/zaiane/material/notes/Chapter11/node10.html
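The way pointers and keys guide a search from the root down to a leaf can be sketched as follows. This is a simplified, read-only illustration with a hand-built two-level tree; the Node class, the key values and the record strings are hypothetical, and insertion and node splitting are left out.

```python
from bisect import bisect_right


class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys            # K1 < K2 < ... in sorted order
        self.children = children    # non-leaf: len(keys) + 1 child pointers
        self.records = records      # leaf: one record (or bucket) per key

    @property
    def is_leaf(self):
        return self.children is None


def search(node, key):
    """Follow a single root-to-leaf path; return the record for `key` or None."""
    while not node.is_leaf:
        # bisect_right counts the keys <= key, which is the index of the child
        # whose subtree covers this key (child 0 holds keys less than K1).
        node = node.children[bisect_right(node.keys, key)]
    for k, rec in zip(node.keys, node.records):
        if k == key:
            return rec
    return None


leaf_a = Node([3, 5], records=["rec3", "rec5"])
leaf_b = Node([8, 11], records=["rec8", "rec11"])
leaf_c = Node([14, 20], records=["rec14", "rec20"])
root = Node([8, 14], children=[leaf_a, leaf_b, leaf_c])   # fan-out of 3

print(search(root, 11), search(root, 4))
```

Every search touches one node per level, so with a large fan-out even a big file needs only a few page reads per lookup.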
5) Grid files
Divide space into grid cells that point to the corresponding disk pages
a) Fixed grids
All cells are of equal size
One-to-one correspondence between grid cells and disk sectors
b) Grid files
Improvements over fixed grids: grid cells are non-uniform, and multiple cells can point to
one sector. Two components, with two corresponding data access steps
1) Scales:
Define the cell boundaries along each of the x and y directions (in 2 dimensions) and identify
the index of the grid directory entry to access. Located in main memory
2) Grid directories
Contain pointers to buckets. Ideally located in main memory, but can be large, so they often reside on
disk.
Procedure for search (two-disk-access principle): 1. Search the scales to find the grid directory entry
(1 I/O if the grid directory does not fit in main memory). 2. Retrieve the pointer from the selected
grid directory entry and access the bucket (1 I/O)
Advantage: search time is good
Disadvantage: requires large memory for the grid directory and scales
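The search procedure can be sketched with in-memory stand-ins for the three parts of a grid file. The boundaries, bucket names and points below are made up, and Python dictionaries stand in for the on-disk grid directory and buckets.

```python
from bisect import bisect_right

# Scales: cell boundaries along each axis (kept in main memory).
x_scale = [10.0, 25.0]      # splits x into 3 intervals of unequal width
y_scale = [50.0]            # splits y into 2 intervals

# Grid directory: (column, row) -> bucket id. Several cells may share one bucket.
directory = {
    (0, 0): "B1", (1, 0): "B1", (2, 0): "B2",
    (0, 1): "B3", (1, 1): "B4", (2, 1): "B4",
}

# Buckets (disk pages) holding the actual points.
buckets = {
    "B1": [(3.0, 12.0), (18.0, 40.0)],
    "B2": [(30.0, 5.0)],
    "B3": [(7.0, 80.0)],
    "B4": [(12.0, 60.0), (40.0, 95.0)],
}


def find(point):
    """Exact-match search: scales -> grid directory entry -> bucket."""
    x, y = point
    cell = (bisect_right(x_scale, x), bisect_right(y_scale, y))  # in-memory lookup
    bucket_id = directory[cell]      # at most 1 I/O if the directory is on disk
    bucket = buckets[bucket_id]      # 1 I/O to fetch the bucket
    return bucket_id, point in bucket


print(find((18.0, 40.0)), find((1.0, 1.0)))
```

Cells (0, 0) and (1, 0) share bucket B1, which is the improvement over a fixed grid: sparsely populated cells do not each need a page of their own.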
6) The R-tree family:
A hierarchical collection of rectangles used to organize spatial objects. Like the
B-tree, the tree is balanced.
a) Basic components
MBR (minimum bounding rectangle) for location, and an object ID or node ID
b) Tree structure
A leaf node entry has an MBR (lower left corner and top right corner) and an object ID
-> the MBR contains the data object
A non-leaf node entry has an MBR and a node ID
-> the MBR for a non-leaf node contains the child MBRs
c) Basic R-tree
- Each object appears in only one of the leaf nodes.
- MBRs are allowed to overlap each other -> this causes problems
Problems with overlap & coverage
a) Coverage: the total area covered by all the MBRs of the nodes at a given level. An indirect measure of the empty
space covered by the tree. -> The larger the empty space, the less efficient the search.
b) Overlap: a search may have to descend more than one subtree at a node. -> longer search time.
Coverage and overlap should be minimized
d) R+ tree
- Eliminates overlap & minimizes coverage at any given level
- Objects may appear in multiple leaf nodes.
-> For a point query, the search follows a single path from the root to a leaf.
-> Requires extra memory space
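The effect of overlap on a basic R-tree search can be sketched with a hand-built two-level tree. This is a minimal illustration with hypothetical rectangles and object IDs; insertion and node splitting are omitted, and nodes are plain dictionaries rather than disk pages.

```python
def intersects(a, b):
    """True if rectangles a and b, each (xmin, ymin, xmax, ymax), share any point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def mbr(rects):
    """Minimum bounding rectangle of a list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def make_leaf(objects):
    """objects: list of (object MBR, object ID) entries."""
    return {"leaf": True, "entries": objects, "mbr": mbr([m for m, _ in objects])}


def make_node(children):
    """Non-leaf node: each entry is (child MBR, child node)."""
    return {"leaf": False, "entries": [(c["mbr"], c) for c in children],
            "mbr": mbr([c["mbr"] for c in children])}


def search(node, query, visited):
    """Return IDs of objects whose MBR intersects `query`; record the visited nodes."""
    visited.append(node)
    hits = []
    for entry_mbr, item in node["entries"]:
        if intersects(entry_mbr, query):
            if node["leaf"]:
                hits.append(item)
            else:
                hits.extend(search(item, query, visited))
    return hits


leaf1 = make_leaf([((0, 0, 2, 2), "A"), ((3, 3, 6, 6), "B")])    # MBR (0, 0, 6, 6)
leaf2 = make_leaf([((5, 5, 8, 8), "C"), ((9, 1, 10, 2), "D")])   # MBR (5, 1, 10, 8)
root = make_node([leaf1, leaf2])

visited_overlap, visited_clear = [], []
hits_overlap = search(root, (5.5, 5.5, 5.5, 5.5), visited_overlap)  # point in the overlap
hits_clear = search(root, (1, 1, 1, 1), visited_clear)              # point in leaf1 only
print(hits_overlap, len(visited_overlap), hits_clear, len(visited_clear))
```

The first query point lies where the two leaf MBRs overlap, so both subtrees must be visited; the second lies in only one MBR and follows a single path.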