ICS 224: Database Management Systems
Spring 2011
Professor Sharad Mehrotra
Information and Computer Science Department
University of California, Irvine
Course General Info
• URL: http://www.ics.uci.edu/~cs224/
– All course info will be posted online
• Lecture times: Tue-Thurs 5 – 6.30
• Instructor: Sharad Mehrotra, BH 2082
• Office Hours: on request
ICS214A
Notes 01
Prerequisites
• Basic Data Management Concepts:
– DB design, relational model, SQL, database programming
▪ CS 122 or equivalent
– Database system implementation
▪ Indexing, query optimization, query processing, storage management, etc.
▪ ICS 222 or equivalent
• Basic Computer Science Concepts:
– Depth-first search, directed/undirected graphs, “big O”
notation, computational complexity, NP completeness
…
Course Requirements
• Class Participation: 50%
– Attendance, presentations, comments,
interaction, enthusiasm, etc.
• Class Projects: 50%
– Implementation Oriented:
▪ Take an idea/topic, identify a project, get it okayed by the instructor, develop a demonstration
– Survey of an area
▪ In-depth survey in the style of computing survey articles. Provide your own perspective on a subarea.
– MUST commit to project at end of 2nd week.
Class Structure
• Each week we will
– Pick a topic
– identify 1 paper per student/group of 2 students
– 2 papers as lead papers for presentation (one for
each class), others presented as short
presentations
• Each week
– Start with overview
– Lead paper presentation
– Short presentation of other papers (main idea)
– Discussions
This course …
Most important ideas in data management (instructor’s
pick)
But with the eye towards an end application …
Sentient spaces
Sentient Spaces …
• Spaces in which sensors are used to capture the dynamic evolving state
which is then analyzed for implementing adaptations.
• Numerous examples …
– intelligent transportation systems
– reconnaissance
– surveillance systems
– smart buildings
– smart grid ...
Example: Smart Video Surveillance
[Figure: smart video surveillance pipeline. Labels: video collection (CS Building at UC Irvine), surveillance video database, semantic extraction, event database, query, query analysis.]
Implications of Sentient
Space focus ..
• Class focuses on topics which you might
need to know if you wanted to explore
application in sentient space …
• Projects should target something about
sentient spaces …
– E.g., data cleaning of sentient data, data
model to represent sentient spaces, …
Data Models (2 weeks)
– Representing time - TSQL2
– Representing space
– Querying streaming data – CQL, ASQL
– Semi-structured data –OEM, Lore
New Ideas in Storage &
Indexing (2 weeks)
• New storage models
– Key-Value store
– Bigtable
– Column Stores
• New database system architecture
– Data outsourcing
– Multitenant databases
• New Indexing techniques
– Correlation maps
Data Quality (2 weeks)
• Data quality issues
– Inaccuracy, incompleteness, ambiguity,
errors, …
• Two aspects:
– Techniques to improve quality
▪ Exploiting contextual knowledge, issues of efficiency
– Techniques to tolerate poor quality of data in applications.
New Computing Architecture
(2 weeks)
• MapReduce framework
• Hive
• Pig Latin
• Join processing
• HadoopDB
• Hyrax?
Data Privacy (2 weeks)
• Use cases
– Data publishing, queries, sharing, data
outsourcing.
• Diverse criteria
– Differential privacy, Anonymity, l-diversity,
..
• Mechanisms to implement
A walk down the history of data models …
Two papers (MUST READ)
• Inclusion of New Types in Relational Database Systems, Stonebraker
• The POSTGRES Next-Generation Database Management System, Stonebraker
The Paleolithic Period …
• There were no general purpose tools for
managing large volumes of data…
– OS provided resource management
– Data was stored in files
– Applications performed data management functionalities
▪ Fault-tolerance
▪ Concurrency control
▪ Reliability
▪ Optimizations
▪ …
– Such functionalities had to be re-implemented for each application
The Neolithic Period…
• Early file systems evolve into general-purpose data
management tools.
• DBMS Goals:
– Efficiency and scalability (faster than files)
– Management of large heterogeneous types of structured
data
– High reliability
– Information sharing (multiple users)
• DBMS Users:
– E-commerce companies, banks, airlines, transportation
companies, corporate databases, government agencies, …
– Anyone you can think of!
The Dark Ages ….
• Network & hierarchical data models
– Resulted in data spaghetti
– Applications needed to chase pointers
– There was little data abstraction or separation of concerns
▪ little difference between physical data representation and logical data representation
– optimization was entirely left to application writers
– There were no clean data management languages
▪ Unless you are a Cobol fan!
The Relational Era..
• Relational model proposed by Codd
– Everything is a relation
– Query consists of algebraic composition of a few powerful
operators
– Equivalent to a first-order relational calculus
• Primary features
– Simple clean data representation
▪ solid mathematical basis
– data abstraction
▪ Users did not need to be concerned about how data is stored physically
– simple declarative query language
▪ Users specify what to compute, not how to do it.
– optimization by the system
Data Wars (1)
• Codasyl versus relational debates began…
– Heated arguments during early SIGMODS
– Codasyl: relational model is too simple,
applications built using it will never scale in
performance.
– Relational: network/hierarchical models have
no formal basis, are too complex, and
unmanageable as application complexity
increases.
• Relational model found many supporters
– Especially at universities
– Its simplicity was enticing
Data Wars (2)
• Many projects started off trying to implement a relational DBMS
– System R @ IBM Almaden
– Ingres @ Berkeley
– These early systems led to the technologies that drive modern data management
• Early prototypes became products
– DB2 & Ingres
• Principal designers from both the System R and Ingres teams left to start companies
– Oracle, Sybase
• Early relational companies went door to door converting industry to the relational model
– Industry got hooked on the simplicity of writing complex applications in the relational model
– Boeing among the first converts
Pointers Strike Back…
[Figure: application data structures reach an RDBMS through copy and translation into a relational representation, and reach an ODBMS through transparent data transfer.]
• Complex objects in emerging DBMS applications cannot be
effectively represented as records in relational model.
• Representing information in RDBMSs requires complex and
inefficient conversion into and from the relational model to the
application programming language
• ODBMSs provide a direct representation of objects to DBMSs
overcoming the impedance mismatch problem
Object Model
• Object:
– observable entity in the world being modeled
– similar in concept to an entity in the E/R model
• An object consists of:
– attributes: properties built from primitive types
– relationships: properties whose type is a reference to
some other object or a collection of references
– methods: functions that may be applied to the object.
Object Oriented Databases
• Evolved as persistent Object Oriented
Programming Languages:
• Start with an OO language (e.g., C++,
Java, SMALLTALK) which has a rich
type system
• Add persistence to the objects in
programming language where
persistent objects stored in databases
Persistent Programming Languages
• Single programming language for application and data management
i ← i + 2
a[j] ← a[j+1] + 3
Employee → Spouse → benefit_level ← benefit_level + 1
• Update to persistent variable results in automatic update to database.
• Persistent data could be types such as sets and lists and arrays.
• Application can follow pointers (OID) to navigate through data.
Persistence
• Objects created may have different lifetimes:
– transient: allocated memory managed by the programming language run-time system.
▪ E.g., local variables in procedures have a lifetime of a procedure execution
▪ global variables have a lifetime of a program execution
– persistent: allocated memory and storage managed by the ODBMS runtime system.
• Classes are declared to be persistence-capable or transient.
• Different languages have different mechanisms to make objects
persistent:
– creation time: Object declared persistent at creation time (e.g.,
in C++ binding) (class must be persistent-capable)
– persistence by reachability: object is persistent if it can be
reached from a persistent object (e.g., in Java binding) (class must
be persistent-capable).
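Persistence by reachability can be sketched roughly as a graph traversal from the persistent roots. This is only an illustration under assumptions: the `reachable` helper and the `Dept` and `Emp` classes are hypothetical, and a real ODBMS binding would also check that each class is persistence-capable.

```python
def reachable(roots):
    """Return every object reachable from the persistent roots via attribute references."""
    seen = {}
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen[id(obj)] = obj
        for value in vars(obj).values():
            # Collections are expanded; plain values (str, int, None) are skipped below.
            children = value if isinstance(value, (list, tuple, set)) else [value]
            stack.extend(c for c in children if hasattr(c, "__dict__"))
    return list(seen.values())

class Dept:
    def __init__(self, name):
        self.name = name
        self.employees = []

class Emp:
    def __init__(self, name, spouse=None):
        self.name = name
        self.spouse = spouse

ann = Emp("Ann")
bob = Emp("Bob", spouse=ann)
temp = Emp("Temp")            # never linked from a root, so it stays transient
sales = Dept("Sales")         # assumed to be the persistent root
sales.employees.append(bob)

persistent_names = sorted(o.name for o in reachable([sales]))
```

Here `ann` becomes persistent only because `bob` refers to her, while `temp` is left out.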
Persistent Object-Oriented
Programming Languages
• Persistent objects are stored in the database and accessed from
the programming language.
• Single programming language for applications as well as data
management.
– Avoid having to translate data to and from application programming language and DBMS
▪ efficient implementation
▪ less code
– Programmer does not need to write explicit code to move data to and from the database
▪ persistent objects look exactly the same to the programmer as transient objects.
▪ System automatically moves objects between memory and storage device (pointer swizzling).
Approaches To Persistent Programming
• Persistent Virtual Memory
– disk representation and memory representation of data are identical.
– No cost to translate data from one representation to another -- efficient!
– DB size limited to address space
32-bit processor ⇒ 2^32 byte addressability (4 GBytes)
– Differentiating persistent objects and non-persistent objects is difficult.
– Difficult to optimize disk layout and locality of access.
– Example system using approach: ObjectStore.
Approaches To Persistent Programming Languages
• Store persistent objects in files
– Objects brought to memory on demand.
– Implementation of OID complex since pointers do not suffice in general.
▪ If object is in memory, a pointer can be used for OID
▪ If object is on disk, a disk address is still not good as OID since storage can be reorganized. A separate mechanism is needed.
▪ Pointer swizzling for efficiency.
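The "separate mechanism" for OIDs is often an indirection table. The sketch below is a toy illustration, not how any particular ODBMS does it: `ObjectStore` here is a hypothetical class name (unrelated to the product), and the in-memory cache only approximates swizzling, which would go further and overwrite the stored OID with a direct pointer.

```python
class ObjectStore:
    """Toy store: OIDs are logical, so objects can move on 'disk' without breaking references."""
    def __init__(self):
        self.oid_to_addr = {}   # logical OID -> current disk address
        self.disk = {}          # disk address -> stored record
        self.cache = {}         # OID -> in-memory object
        self.disk_reads = 0

    def put(self, oid, addr, record):
        self.oid_to_addr[oid] = addr
        self.disk[addr] = record

    def relocate(self, oid, new_addr):
        # Storage reorganization: only the OID table changes, references stay valid.
        old = self.oid_to_addr[oid]
        self.disk[new_addr] = self.disk.pop(old)
        self.oid_to_addr[oid] = new_addr

    def deref(self, oid):
        # First access faults the object in; later accesses reuse the in-memory object.
        if oid not in self.cache:
            self.disk_reads += 1
            self.cache[oid] = dict(self.disk[self.oid_to_addr[oid]])
        return self.cache[oid]

store = ObjectStore()
store.put(oid=1, addr=100, record={"name": "Joe", "mgr": 2})
store.put(oid=2, addr=200, record={"name": "Sue", "mgr": None})
store.relocate(2, 900)               # Joe's reference to OID 2 is unaffected
joe = store.deref(1)
mgr = store.deref(joe["mgr"])
mgr_again = store.deref(joe["mgr"])  # served from memory, no second disk read
```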
Challenges In Building Persistent Languages
• Efficient caching of objects in client address space.
– Cache coherence.
• In OODB data migrates to clients unlike relational client
server systems where query migrates to server.
• Given a large number of clients each with the cache of
objects ensuring consistency of object across multiple
clients is a challenge.
Disadvantages of ODBMS
Approach
• Low protection
– since persistent objects are manipulated from applications directly, there are more chances that errors in applications can violate data integrity.
• Non-declarative interface:
– difficult to optimize queries
– difficult to express queries
• But …..
– Most ODBMSs offer a declarative query language OQL to overcome
the problem.
– OQL is very similar to SQL and can be optimized effectively.
– OQL can be invoked from inside ODBMS programming language.
– Objects can be manipulated both within OQL and programming
language without explicitly transferring values between the two
languages.
– OQL embedding maintains simplicity of ODBMS programming
language interface and yet provides declarative access.
The Return of the Relations … POSTGRES
• Relational model evolved into ORDBMSs that include “best of” object-oriented concepts
• Among the first ORDBMS prototypes, built @ Berkeley: POSTGRES → commercialized as Illustra → bought by Informix (IUS)
• Has had major impact on major commercial DBMS which have all
migrated to ORDBMS model.
• SQL3, supported by modern databases, adopted many of the concepts developed in Postgres
POSTGRES — Combinations
• Introduced object orientation into relational DBMSs.
• Fundamental Concepts.
– Each record has an OID.
– Access to data through:
▪ query language POSTQUEL.
▪ navigation through OIDs.
– Classes:
– Inheritance:
– Types: rich set of types available for columns.
– Functions: can be called within POSTQUEL.
Classes And Inheritance
• Class analogous to relation
• User can create new class
create Emp (name = c12, salary = float, age = int)
• Classes can inherit from others
create Salesman (quota = float) inherits Emp
• Multiple inheritance permitted. If a new class causes ambiguity, it is not created.
• Classes:
– real: base classes or relations
– derived: views
– version: maintained differentially compared to parent class
Types In POSTGRES
• Standard base types
– float, int, character strings, etc.
– Abstract data type (ADT) facility to create new base types, e.g.:
create type point (x = int, y = int)
create type polygon
• ADTs can be used in class definitions.
create Dept (dname = c10,
mgr = c12,
floorspace = polygon,
mailstop = point
)
Functions In POSTGRES
• Three types:
(1) C functions
(2) Operators
(3) POSTQUEL functions
• C-functions
– any C function over base types or composite types
retrieve (Dept.name)
where area (Dept.floorspace) > 500
retrieve (Emp.name)
where overpaid (Emp)
(overpaid: function over a class, or method)
Operators
• Arbitrary C functions are not optimized by query optimizers.
– Special functions (operators) can utilize indexes for their evaluation.
• Operator: function with 1 or 2 operands, e.g., AGT (Area Greater Than)
retrieve (Dept.name)
where Dept.floorspace AGT "(0,0), (1,1), (0,2)"
• Index (e.g., B-tree) defined properly can be used to speed up evaluation of operators such as AGT.
Other Features Of POSTGRES
• Allowed creation of new indices by user.
• To an extent pioneered the approach of extensible database
technology which is prevalent with vendors today
• Supported transitive closure in query.
retrieve* into answer (parent.older)
from a in answer
where parent.younger = "John"
or parent.younger = a.older
• Supported rules
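The transitive closure query above repeats until no new tuples appear. A rough Python equivalent, assuming `parent` is a list of hypothetical (older, younger) pairs and using made-up names, might look like this:

```python
def ancestors(parent, person):
    """Iterate to a fixpoint, in the spirit of retrieve*: stop when no new tuples are added."""
    answer = set()
    while True:
        new = {older for (older, younger) in parent
               if younger == person or younger in answer}
        if new <= answer:          # nothing new this round, so the closure is complete
            return answer
        answer |= new

parent = [("Mary", "John"), ("Ann", "Mary"), ("Tom", "Ann"), ("Zed", "Kim")]
result = ancestors(parent, "John")
```

Each pass corresponds to one more level of ancestry; `Zed` is never added because `Kim` is unrelated to John.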
POSTQUEL Functions
• Any collection of commands in POSTQUEL.
– query = POSTQUEL function.
define function high-pay
returns Emp as
retrieve (Emp.all)
where Emp.salary > 50k
• POSTQUEL function with parameters.
define function Sal-lookup (c12)
returns float as
retrieve (Emp.salary)
where Emp.name = $1
• Usage of POSTQUEL function
retrieve (Emp.name)
where Emp.salary = Sal-lookup ("Joe")
Composite Types In POSTGRES
• POSTQUEL:
– Composite types accessed via path expressions, using nested dot notation.
retrieve (Emp.mgr.age)
where Emp.name = "joe"
• Prevents having to specify a join.
Composite Types In POSTGRES
• Attributes can have a class name as a type resulting in
complex objects with structure.
create Emp (name = c12,
salary = float [12],
age = int,
mgr = Emp,
coworker = Emp
)
(mgr and coworker refer to 0 or more references of the Emp class.)
• A set type can hold elements of any class.
add to Emp (hobbies = set)
Types In POSTGRES
• Array type (constructor)
create Emp (name = c12,
salary = float [12],
age = int
)
(salary holds one value for each month.)
• POSTQUEL query using the array:
retrieve (Emp.name)
where Emp.salary[4] = 1000
Database Technology Matrix
Query Support | Database Types: Simple | Database Types: Complex
YES           | RDBMSs                 | ORDBMSs
NO            | File System            | OODBMSs
XML & RDF - the new revolution
• Just when relational model had driven out
object-oriented database technology, WWW
led to the proliferation of semi-structured
data.
• 2 approaches to supporting XML/RDF
– Extend relational technology to support XML/RDF
– Native XML databases
Summary of Evolution
of Data Model
• The Dark Ages: network & hierarchical models
• Victory of simplicity and beauty over data spaghetti: The
Relational DBMS:
• The pointers strike back -- Object-Orientation, OODBMSs
• The return of the relations -- ORDBMS -- took the best of
the OO concepts and incorporated them in the relational
model.
• The current and near future -- support for XML & RDF
• The final frontier -- anyone’s guess!
Key Data Management
Technologies (quick
review)…
Key Database Technologies
• File Management
– provides a file abstraction as a collection of records stored on disk
• Index Management and Access Methods
– implements techniques for associative access to data
• Query Optimization and Processing
– given a query and data storage structures, determines an efficient
strategy to evaluate the query.
• Transaction management
– ensures consistency of the database in presence of concurrent
transactions and various types of failures
• Catalog Management
– maintains database schema information
• Authorization and Integrity Management
– tests for integrity constraints and user authorization
Database Management
System Architecture
[Figure: DBMS architecture. Labels: application, queries, schema changes; query processor (compilers, optimizer, evaluator); metadata and data dictionary; storage manager (buffer manager, transaction manager, file system); database and indices.]
Storage Media and their
Properties
• Main Memory
– costs $100/Mbyte -- reduces every year
– ‘volatile’ -- does not survive system failures
– random I/O very fast
– data can be processed by CPU directly
– capacity limited to orders of magnitude lower than what database needs.
• Magnetic Disk
– costs $0.50/Mbyte -- reduces each year
– Non-volatile (except when disk crashes)
– random I/O not as fast
– CPU cannot directly process data. Needs to be transferred to main memory
• Tape
– Cheaper but slower than disks. Sequential I/O devices. Handy for backups, sometimes for archival.
Databases and Storage
Devices
• Due to capacity, cost, and volatility factors, databases are traditionally stored on disks.
• Data brought to main memory for processing from disks
• There are many ways to interface memory with disk resident data
• E.g., virtual memory:
– VM size limited to max address generated by CPU
– Existing VM does not support durability
• File system provides a more powerful mapping between memory and disk storage
• A bunch of tricks are used to ensure that high latency of secondary storage does not impact application response time and system throughput
– access disks asynchronously with active applications
– prefetch data before application needs it
– intelligent caching techniques
Functional Abstraction of a
Simplistic DBMS
[Figure: layered view of a simplistic DBMS, top to bottom, with the interface between layers in parentheses.]
Transactions (beginT, SQL, SQL, endT)
(SQL statements)
Query processor and optimizer (access plan)
(read/write records, scan relations)
Record-oriented file system
(get page containing tuples)
Buffer manager
Basic file system
(read/write file pages)
Hardware
Basic File System
• Provides the abstraction of a file where a file is an array of
fixed size blocks
• Hides the disk geometry -- cylinders, tracks, sectors, slots and
other functional components like arms, head, etc. such that the
programs do not need to deal with these complexities
• Operations supported:
– create a file
– delete a file
– open a file
– close a file
– extend a file
– read (set of) file blocks into buffers in memory
– write (set of) file blocks
Basic File System Design
Issues
• File allocation: how to allocate blocks on disk to a file.
– Contiguous allocation: file stored in contiguous disk blocks. Blocks for storing file found using one of the best-fit, worst-fit, or first-fit policies.
▪ +ve: provides fast sequential scan of file
▪ -ve: fragmentation, difficult to enlarge files
– Linked allocation: file is a linked list of disk blocks
▪ +ve: prevents fragmentation, easy to enlarge files
▪ -ve: slow for both sequential and random access
– Index allocation: file implemented using fixed size blocks pointed to by an index (e.g., B-tree). Popularized by Unix
▪ +ve: good random access, easy enlargement, no fragmentation.
▪ -ve: poor sequential access performance
– Extent based allocation: file is a collection of clusters of consecutive disk blocks (extents) where collection maintained using linked lists or index
▪ Most popular approach with vendors.
• Free space management: information about which blocks are free
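The three contiguous-allocation policies differ only in which free hole they pick. The sketch below is a simplified illustration; the `allocate` function and the (start, length) representation of free extents are assumptions made for the example.

```python
def allocate(free_extents, size, policy="first"):
    """free_extents: list of (start, length) holes. Returns the start block, or None if nothing fits."""
    candidates = [(i, s, n) for i, (s, n) in enumerate(free_extents) if n >= size]
    if not candidates:
        return None
    if policy == "first":
        i, start, length = candidates[0]                          # first hole that fits
    elif policy == "best":
        i, start, length = min(candidates, key=lambda c: c[2])    # smallest hole that fits
    elif policy == "worst":
        i, start, length = max(candidates, key=lambda c: c[2])    # largest hole
    else:
        raise ValueError(policy)
    if length == size:
        del free_extents[i]                                       # hole is used up entirely
    else:
        free_extents[i] = (start + size, length - size)           # shrink the hole
    return start

holes = [(0, 10), (20, 4), (40, 6)]
first = allocate(list(holes), 4, "first")
best = allocate(list(holes), 4, "best")
worst = allocate(list(holes), 4, "worst")
too_big = allocate(list(holes), 11, "first")
```

Best-fit picks the 4-block hole exactly, while first-fit and worst-fit both carve the request out of the 10-block hole, leaving a fragment behind.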
Buffer Management
• Makes file pages addressable in memory and coordinates writing
of pages to disk with other components to guarantee
transactional properties
• Acts as a mediator between basic file system and record-oriented file system
• Buffer frames maintained in main memory. When a request for
file page access comes, check if page in buffer. Else get a free
frame and load file page into buffer
• Operations Supported:
– bufferfix
– bufferunfix
– get block
– flush
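A minimal sketch of these operations follows, assuming an LRU replacement policy and a dictionary standing in for the basic file system. The method names `fix`, `unfix`, and `flush` loosely mirror the operations listed above; the internals are illustrative only.

```python
from collections import OrderedDict

class BufferManager:
    """Toy buffer pool: fix pins a page in a frame, unfix unpins it, LRU picks victims."""
    def __init__(self, disk, num_frames):
        self.disk = disk                    # page_id -> list of values (stand-in for file pages)
        self.num_frames = num_frames
        self.frames = OrderedDict()         # page_id -> frame dict, kept in LRU order

    def fix(self, page_id):
        if page_id not in self.frames:
            if len(self.frames) >= self.num_frames:
                self._evict()
            self.frames[page_id] = {"data": list(self.disk[page_id]), "pins": 0, "dirty": False}
        self.frames.move_to_end(page_id)    # mark as most recently used
        frame = self.frames[page_id]
        frame["pins"] += 1
        return frame                        # caller gets the frame itself, not a copy

    def unfix(self, page_id, dirty=False):
        frame = self.frames[page_id]
        frame["pins"] -= 1
        frame["dirty"] = frame["dirty"] or dirty

    def flush(self, page_id):
        frame = self.frames[page_id]
        if frame["dirty"]:
            self.disk[page_id] = list(frame["data"])
            frame["dirty"] = False

    def _evict(self):
        for victim, frame in self.frames.items():   # least recently used first
            if frame["pins"] == 0:                  # pinned pages cannot be replaced
                self.flush(victim)
                del self.frames[victim]
                return
        raise RuntimeError("all frames are pinned")

disk = {1: [0, 0], 2: [0, 0], 3: [0, 0]}
bm = BufferManager(disk, num_frames=2)
f1 = bm.fix(1)
f1["data"][0] = 42
bm.unfix(1, dirty=True)
bm.fix(2)
bm.unfix(2)
bm.fix(3)            # pool is full: page 1 is the LRU victim and is written back
bm.unfix(3)
```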
Database Buffer Management Design Issues
• DBMS buffer manager returns pointer to frame containing data instead of returning copy of requested page to caller.
– Efficiency: prevents unnecessary copying of data
– Allows sharing of data at finer granularity than a page
▪ 2 transactions T1 and T2.
▪ T1 and T2 update records r1 and r2 on same page
▪ if buffer manager allowed applications to copy data to their address space and rewrite updated versions, updates might be lost
• Database buffer manager participates in protocols to implement transactions (WAL, FL@C, pinning buffer slots)
• Novel page replacement strategies:
– Traditional LRU strategy used in OS works well only under the assumption of locality of reference, which may not hold for DBMSs
– Since DBMS query languages are declarative, system has much more information about reference patterns which it can exploit to improve caching performance of buffer manager
Record-Oriented File System
• Provides the abstraction of a file as a collection of records.
• Records can be:
– fixed size or variable length
– short, long, or very long
– attributes can be fixed length or variable length
– simple or complex (e.g., containing set valued attributes)
• Operations supported:
– create, delete, open, close, alter, drop
– read, insert, update, delete record
– scan all records in a file
• Issues Involved:
– mapping records to pages
– file organization: organization of records in a file.
▪ Where to insert new records
▪ what mechanism can be used to retrieve records
Index Management and
Associative Access
• Associative access: accessing records based on their attribute
values.
• Index Files
– an index file declared over a (set of) attribute of the data file
provides associative access to records in the data file.
– Index file contains pointers to disk blocks where the record
corresponding to the value appear.
• Types of an Index: (let indexing attribute be A)
– primary: A is a key and data file stored sorted on A
– clustered: A is not a key but data file stored sorted on A
– secondary (key): A is a key but data file not sorted on A
– secondary (non-key): A is not a key and data file is not sorted on A.
Organization of Index File
• B-tree Index: index file is organized as a B-tree
– Advantages:
▪ Supports range searches efficiently.
▪ E.g., retrieve all employees with salary between 100K and 200K
▪ Guaranteed good storage utilization
– Disadvantages:
▪ searching for a given record could take around 3-4 disk I/Os
• Hash Index: index file maintained as a hash file.
– Advantages:
▪ Looking for a specific record very efficient -- 1 disk I/O
– Disadvantages:
▪ cannot support range searches
• Multidimensional Access Methods
– modern databases are beginning to support novel data structures like R-trees, grid files, inverted lists to better serve emerging application requirements
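The trade-off between ordered and hashed indexes can be seen in a small in-memory sketch. Here a sorted list searched with `bisect` stands in for the leaf level of a B-tree and a `dict` stands in for a hash file; the employee data is made up for illustration.

```python
import bisect

employees = [("Ann", 90_000), ("Bob", 120_000), ("Cy", 150_000), ("Di", 250_000)]

# Ordered index on salary: a sorted list stands in for the leaf level of a B-tree.
by_salary = sorted((sal, name) for name, sal in employees)
keys = [sal for sal, _ in by_salary]

def range_search(lo, hi):
    """Find the first and last qualifying positions, then read the run between them."""
    left = bisect.bisect_left(keys, lo)
    right = bisect.bisect_right(keys, hi)
    return [name for _, name in by_salary[left:right]]

# Hash index on name: one probe for an exact match, but no useful key order.
by_name = {name: sal for name, sal in employees}

in_range = range_search(100_000, 200_000)
bob_salary = by_name["Bob"]
```

A range search on the hash index would have to examine every entry, which is why the slide lists it as a disadvantage.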
Multidimensional Indexing
Motivation
• Many applications of databases are geographical = 2-d data.
Others involve large number of dimensions
• Examples:
– location of restaurants in a city.
– Map data: zones, county lines, rivers, lakes, etc. (Data has spatial
extent)
– Sales information described by store, day, item, color, size, etc.
Sale = point in multidimensional space.
– Student described by age, zipcode, marital status.
• Queries:
– Range Query: “find all McDonald's restaurants within a given region”.
– Nearest Neighbor Query: find the nearest McDonald's to my house
– partial match queries
Approach: Utilize Single
Dimensional Index
• Index on attributes independently
• Project query range to each attribute to determine pointers.
• Intersect pointers
• Go to the database and retrieve objects in the intersection.
• May result in very high I/O cost
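The steps above can be sketched as follows. The data and the `ids_in_range` helper are hypothetical; a scan stands in for each one-dimensional index lookup so that the example stays self-contained.

```python
restaurants = {1: (2, 3), 2: (5, 9), 3: (6, 4), 4: (9, 5)}   # record id -> (x, y)

def ids_in_range(dim, lo, hi):
    """Stand-in for a one-dimensional index lookup on coordinate `dim`."""
    return {rid for rid, point in restaurants.items() if lo <= point[dim] <= hi}

def range_query(x_range, y_range):
    x_ids = ids_in_range(0, *x_range)      # may be much larger than the final answer
    y_ids = ids_in_range(1, *y_range)
    hits = x_ids & y_ids                   # intersect pointers before touching the data file
    return sorted(hits), len(x_ids) + len(y_ids)

answer, pointers_examined = range_query((4, 10), (3, 6))
```

Six pointers are examined to produce two results; with skewed data the gap between pointers examined and final results can be far larger, which is the I/O cost the slide warns about.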
R-tree Data Structure
• Extension of B-tree to multidimensional space.
• Paginated, balanced, guaranteed storage utilization.
• Can support both point data and data with spatial extent
• Groups objects into possibly overlapping clusters (rectangles in our case)
• Search for range query proceeds along all paths that overlap with the query.
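A range search over such a structure might be sketched as below. The `Node` class and the hand-built two-level tree are assumptions for illustration; insertion and node splitting are omitted.

```python
def overlaps(a, b):
    """Rectangles are (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

class Node:
    def __init__(self, mbr, children=None, entries=None):
        self.mbr = mbr                  # minimum bounding rectangle of everything below
        self.children = children or []  # interior node: child nodes
        self.entries = entries or []    # leaf node: (rect, object id) pairs

def search(node, query, visited):
    visited.append(node.mbr)
    if not node.children:
        return [oid for rect, oid in node.entries if overlaps(rect, query)]
    results = []
    for child in node.children:
        if overlaps(child.mbr, query):  # follow every path whose rectangle overlaps the query
            results.extend(search(child, query, visited))
    return results

left = Node((0, 0, 5, 5), entries=[((1, 1, 2, 2), "a"), ((4, 4, 5, 5), "b")])
right = Node((4, 3, 9, 9), entries=[((4, 3, 6, 5), "c"), ((8, 8, 9, 9), "d")])
root = Node((0, 0, 9, 9), children=[left, right])

visited = []
found = search(root, (4, 4, 5, 5), visited)
```

Because the two leaf rectangles overlap, the query has to descend into both, which is why minimizing overlap matters for search cost.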
Split Node
• Given a node, split it into two nodes which are each at least half full
• Multiple objectives:
– minimize overlap
– minimize covered area
• R-tree minimizes covered area
• What is an optimal criterion?
[Figure panels: minimize overlap; minimize covered area]
Minimizing Covered Area
• Group objects into 2 parts such that the
covered area is minimized
• NP Hard!!
• Hence use heuristics
• Two heuristics explored
– quadratic and linear
Other Multidimensional Data
Structures
• Many generalizations of R-tree
– different splitting criteria
– different shapes of clusters (e.g., d-dimensional spheres)
– adding redundancy to reduce search cost:
▪ store objects in multiple rectangles instead of a single rectangle to reduce cost of retrieval. But now insert has to store objects in many clusters. This strategy also increases overlap, causing search performance to deteriorate.
• Space Partitioning Data Structures
– unlike R-trees, which group objects into possibly overlapping clusters, these methods attempt to partition space into non-overlapping regions.
– E.g., KD tree, quad tree, grid files, KD-Btree, HB-tree, hybrid tree.
• Space filling curves
– superimpose an ordering on multidimensional space that preserves proximity in multidimensional space. (Z-ordering, Hilbert ordering)
– Use a B-tree as an index on that ordering
KD-tree
• A main memory data structure based on
binary search trees
– can be adapted to block model of storage
(KD-Btree)
• Levels rotate among the dimensions,
partitioning the space based on a value
for that dimension
• KD-tree is not necessarily balanced.
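A minimal main-memory KD-tree, sketched here with hypothetical `insert`, `range_search`, and `height` helpers, shows both the rotating split dimension and the lack of balance. Inserting points in sorted order is chosen deliberately to produce a degenerate tree.

```python
class KDNode:
    def __init__(self, point):
        self.point, self.left, self.right = point, None, None

def insert(node, point, depth=0):
    if node is None:
        return KDNode(point)
    axis = depth % len(point)           # levels rotate among the dimensions
    if point[axis] < node.point[axis]:
        node.left = insert(node.left, point, depth + 1)
    else:
        node.right = insert(node.right, point, depth + 1)
    return node

def range_search(node, lo, hi, depth=0, out=None):
    out = [] if out is None else out
    if node is None:
        return out
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    axis = depth % len(lo)
    if lo[axis] < node.point[axis]:     # query box extends into the left half-space
        range_search(node.left, lo, hi, depth + 1, out)
    if hi[axis] >= node.point[axis]:    # query box extends into the right half-space
        range_search(node.right, lo, hi, depth + 1, out)
    return out

def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))

root = None
for p in [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]:   # sorted input gives a degenerate tree
    root = insert(root, p)

hits = sorted(range_search(root, (2, 2), (4, 4)))
tree_height = height(root)
```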
KD-Tree Example
[Figure: KD-tree example over 2-D points, shown as a partition of the plane and as the corresponding tree. Split labels: x=3, x=5, x=7, x=8 and y=2, y=5, y=6.]
Adapting KD Tree to Block
Model
• Similar to B-tree, tree nodes split many ways instead of two
ways
– Risk:
▪ insertion becomes quite complex and expensive.
▪ No storage utilization guarantee, since when a higher level node splits, the split has to be propagated all the way to leaf level, resulting in many empty blocks.
• Pack many interior nodes (forming a subtree) into a block.
– Risk:
▪ it may not be feasible to group nodes at lower level into a block productively.
▪ Many interesting papers on how to optimally pack nodes into blocks recently published.
Quad Tree
• Nodes split along all dimensions
simultaneously
• Division fixed: by quadrants
• As with KD-tree we cannot make
quadtree levels uniform
Quad Tree Example
[Figure: quad tree example with quadrants labeled NW, NE, SW, SE. Axis labels: x=3, x=5, x=7, x=8.]
Grid Files
• Space Partitioning strategy but
different from a tree.
• Select dividers along each
dimension. Partition space into cells
• Unlike KD-tree dividers cut all the
way.
• Each cell corresponds to 1 disk
page.
• Many cells can point to the same
page.
• Cell directory potentially
exponential in the number of
dimensions
Space Filling Curve
• Assumption
– finite precision in representing each coordinate.
[Figure: grid with each axis labeled 00, 01, 10, 11, showing point A, block B, and region C made of blocks C1 and C2.]
Z(A) = shuffle(x_A, y_A) = shuffle(00, 11) = 0101 = 5
Z(B) = 11 = 3 (common prefix to all its blocks)
Z(C1) = 0010 = 2
Z(C2) = 1000 = 8
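The shuffle operation is bit interleaving. The sketch below assumes the x bit precedes the y bit at each position, which is the order that reproduces Z(A) = 0101 from the slide; the `shuffle` signature with an explicit `bits` argument is an assumption for the example.

```python
def shuffle(x, y, bits):
    """Interleave the bits of x and y, most significant first, x bit before y bit."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

z_a = shuffle(0b00, 0b11, bits=2)          # point A from the slide
z_a_binary = format(z_a, "04b")

# Sorting points by Z-value yields the one-dimensional order a B-tree can index.
points = [(3, 3), (0, 0), (1, 0), (0, 1)]
z_order = sorted(points, key=lambda p: shuffle(p[0], p[1], bits=2))
```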
Deriving Z-Values for a
Region
• Obtain a quad-tree decomposition of an object by recursively
dividing it into blocks until blocks are homogeneous.
[Figure: quad-tree decomposition of an object on a grid with quadrants labeled 00, 01, 10, 11.]
Object's representation is 0001, 0011, 01
Generalized Search Trees
• Motivation:
– disparate applications require different data structures and access
methods.
– Requires separate code for each data structure to be integrated with the database code
▪ too much effort.
▪ Vendors will not spend time and energy unless application very important or data structure has general applicability.
• Generalized search trees abstract the notion of data structure into a template.
– Basic observation: most data structures are similar and a lot of bookkeeping and implementation details are the same.
– Different data structures can be seen as refinements of basic GiST structure. Refinements are specified by registering a set of functions per data structure with the GiST.
GiST supports extensibility both in
terms of data types and queries
• GiST is like a “template” - it defines its
interface in terms of ADT rather than physical
elements (like nodes, pointers etc.)
• The access method (AM) implementer can customize GiST by defining his or her own ADT class, i.e., you just define the ADT class and you have your access method implemented!
• No concern about search/insertion/deletion, structural modifications like node splits etc.
Query Processing in DBMSs
[Figure: query processing pipeline. A query (Select … From … Where …) goes through parsing and translation into an internal relational algebra based representation; the optimizer, using statistics about data, produces an optimized execution plan; the evaluation engine runs the plan against data and index and returns query results (e.g., Sally 4000, Dick 9000).]
Query Optimization
• Goals: to find the cheapest evaluation strategy for a query
• Stages of Optimization:
– Algebraic manipulations: heuristics used to convert query tree into an equivalent but more efficient representation.
▪ perform selections and projections as early as possible.
▪ combine selections with cartesian products to make a join
▪ combine sequence of unary operations (selections and projections).
▪ look for common subexpressions in an expression.
– Cost based analysis: given optimized representation produced after algebraic manipulation:
▪ generate all possible query plans and estimate their costs based on the statistical information and costs of each unary and binary operation.
▪ Best possible query plan chosen as an execution strategy.
▪ Number of plans considered even after heuristics are applied is exponential in the number of operators in query tree. It is important to choose a good plan since cost of generating plan is amortized over multiple query executions.
Cost of Query Execution
• Access to disk: cost of reading, writing, searching data blocks.
(i/o cost)
• Storage Costs: cost of storing intermediate files generated
during query execution. (i/o cost)
• Computation cost: cost of in memory execution of operations.
(cpu cost)
• Communication cost: cost of shipping the query and results
from site to site or terminal where query originated.
(communication cost)
• Total cost = I/O cost + w1* CPU cost + w2 *Communication
cost
• Traditionally I/O cost considered most important
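The weighted cost formula can be applied to candidate plans as in the sketch below. The weights `w1` and `w2` and the per-plan numbers are made-up values for illustration, not figures from the lecture.

```python
def total_cost(io_cost, cpu_cost, comm_cost, w1=0.5, w2=2.0):
    """Total cost = I/O cost + w1 * CPU cost + w2 * communication cost."""
    return io_cost + w1 * cpu_cost + w2 * comm_cost

# Hypothetical candidate plans: (I/O cost, CPU cost, communication cost).
plans = {
    "A": (1000, 400, 0),     # more I/O, no data shipped between sites
    "B": (600, 800, 50),     # less I/O, more CPU and some communication
}

costs = {name: total_cost(*parts) for name, parts in plans.items()}
best_plan = min(costs, key=costs.get)
```

With these weights plan B wins (1100 versus 1200); raising `w2` enough would flip the choice, which is why the weights have to reflect the actual system.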
Transaction Management
Applications in databases are modeled as transactions, which provide ACID guarantees.
• Atomicity: either all the effects of a transaction appear in
database or none of the effects of a transaction appears in
database.
• Consistency: each transaction maps a database from
consistent state to another consistent state
• Isolation: the partial effects of a transaction are hidden from other concurrently executing transactions
• Durability: if a transaction completes its effects are permanent
and survive failures.
Transaction Model
• Transactions provide a simple, powerful, and a natural
programming model for writing database applications.
• Transaction concept supports:
– simple failure semantics: either all the effects of transaction appear
in database or none do -- all or nothing
– isolated view of the world: protection from partial effects of other
concurrent applications.
• Transactions allow applications to share data without having to explicitly deal with either fault-tolerance or synchronization
• Transactions are the enabling technology for large distributed
applications.
Isolation
• Isolation is implemented by using the 2 phase locking protocol
• 2 Phase Locking Protocol:
– Each transaction acquires a lock on a data item before accessing data
– Locks are released when a transaction commits
Example (time runs downward):
User 1 reads account = 1500
User 2 reads account = 1500
User 1 sets account value = 500 (withdraws 1000 dollars)
User 2 sets account value = 700 (withdraws 800 dollars)
This execution will be prevented by 2 phase locking since user 1's transaction will not release the lock on account until user 1's transaction terminates
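The account example can be replayed with a toy lock manager. This sketch models only exclusive locks held until commit and represents waiting by a `False` return value; the `LockManager` class is an assumption for illustration, not a full two-phase locking implementation.

```python
class LockManager:
    def __init__(self):
        self.owner = {}                       # data item -> transaction holding the lock

    def acquire(self, txn, item):
        holder = self.owner.setdefault(item, txn)
        return holder == txn                  # False means txn must wait

    def release_all(self, txn):               # called only at commit or abort
        for item in [i for i, t in self.owner.items() if t == txn]:
            del self.owner[item]

db = {"account": 1500}
locks = LockManager()

t1_got_lock = locks.acquire("T1", "account")
t1_balance = db["account"]                    # user 1 reads 1500
t2_got_lock = locks.acquire("T2", "account")  # user 2 is blocked, so it cannot read 1500
db["account"] = t1_balance - 1000             # user 1 writes 500
locks.release_all("T1")                       # T1 commits and releases its locks

t2_got_lock_later = locks.acquire("T2", "account")
t2_balance = db["account"]                    # user 2 now reads 500, not the stale 1500
t2_can_withdraw = t2_balance >= 800
locks.release_all("T2")
```

Because user 2 cannot read until user 1 commits, the lost update that would have left the balance at 700 never happens.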
Atomicity
• Atomicity is implemented by using a logging strategy.
• A transaction, before updating a data item, writes an undo log record, using which its effects can be undone.
• If a transaction aborts, the undo log records are used to reconstruct the database state before transaction execution.
[Figure: normal processing: old state → DO → new state plus undo log record. Transaction rollback (due to user requested abort, system failure, or consistency violation): new state plus undo log record → UNDO → old state.]
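Undo logging can be sketched with an in-memory dictionary as the database. The `UndoLog` class is hypothetical, and the sketch ignores the requirement that a real system force the log record to stable storage before the data page.

```python
class UndoLog:
    def __init__(self, db):
        self.db = db
        self.records = []                         # (txn, key, old value), in write order

    def write(self, txn, key, value):
        # Log the old state first (None marks a key that did not exist), then DO the update.
        self.records.append((txn, key, self.db.get(key)))
        self.db[key] = value

    def rollback(self, txn):
        for t, key, old in reversed(self.records):          # UNDO newest first
            if t == txn:
                if old is None:
                    self.db.pop(key, None)
                else:
                    self.db[key] = old
        self.records = [r for r in self.records if r[0] != txn]

db = {"account": 1500}
log = UndoLog(db)
log.write("T1", "account", 500)
log.write("T1", "account", 200)
mid_state = dict(db)
log.rollback("T1")               # transaction aborts: old state is reconstructed
```

Undoing in reverse order matters here: applying the two records oldest first would leave the account at 500 rather than 1500.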
Durability
• Durability is implemented using a logging strategy
• A transaction, before updating a data item, writes a redo log record using which its effects can be redone
• If system fails before a committed transaction's effects appear in the database, its effects are redone using redo log records on recovery.
[Figure: normal processing: old state → DO → new state plus redo log record. Redo of committed transaction: old state plus redo log record → REDO → new state.]
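Recovery with redo records can be sketched as a replay of the log over whatever state survived on disk. The `recover` function and the log layout of (transaction, key, new value) tuples are assumptions for illustration.

```python
def recover(disk_state, redo_log, committed):
    """Replay redo records of committed transactions over the state found on disk."""
    state = dict(disk_state)
    for txn, key, new_value in redo_log:      # log order is the original update order
        if txn in committed:
            state[key] = new_value            # REDO: reinstall the new state
    return state

redo_log = [("T1", "account", 500), ("T2", "savings", 50), ("T1", "savings", 900)]
disk_state = {"account": 1500, "savings": 0}  # crash hit before any update reached disk
recovered = recover(disk_state, redo_log, committed={"T1"})
```

T2 never committed, so its update is skipped, while both of T1's updates reappear after recovery.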