Chapter 5 study outline
1. What is query processing and optimization (QPO)?
A. Queries versus QPO
1) Queries: high-level, declarative language such as SQL.
2) QPO: component of DBMS
a) Make an execution plan by mapping the query into a sequence of operations supported
by the physical data model (index or file structures)
b) Goal: process the query accurately at minimal computational cost
c) The overhead time of planning should be less than the actual execution time.
B. Why learn about QPO?
1) Identify performance bottlenecks
a) Physical data model?
b) QPO module?
2) Help QPO to speed up queries
a) Rewrite queries
3) Design physical data model
a) Change file structure, e.g. ordered, hash, etc.
b) Add indices
C. Why is QPO special in spatial databases?
1) Algorithms for spatial operations are CPU intensive as well as I/O intensive,
whereas operations in traditional databases are dominated by I/O cost.
2) Spatial operations and data types are diverse, and there is no consensus on the building blocks
for spatial DB query processing. For some queries, such as nearest neighbor, the available
processing strategies are limited
2. Key concepts in QPO
A. Building blocks: a set of basic operations from which complex queries are composed, such as
1) Select (point queries, range queries)
2) Join
3) Sorting
B. Strategies for building blocks
1) Each building block has several processing options, e.g. the select operation has several
strategies. Which strategy should be used depends on parameters of the physical model
(table size, available indices).
C. Query optimization
1) QPO attempts to choose the most efficient strategy among the processing options, given the
database parameters
D. QPO challenges
1) Building blocks
a) Aspatial - relational: SQL translates into relational algebra. Building blocks of RA
are select, project, and join.
b) Spatial: Difficult
1) Large number of spatial objects and operations
2) No consensus on building blocks
2) Processing strategies
a) Aspatial: Large number of strategies = complex choice
b) Spatial: Number of processing strategies may be limited
3) Choice among processing strategies
a) General strategies
1) Fixed priority: a specific strategy is predetermined based on the physical parameters
of the database.
2) Cost models: estimate the cost of each procedure from the physical parameters
of the database
b) Problem with spatial data
1) Cost models are complicated (both I/O and CPU cost) and not yet mature
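The difference between the two general strategies can be sketched in a few lines. This is only an illustration: the cost formulas, the function names and the parameters (n_records, blocking_factor, has_index) are assumptions made for the example, and the model counts block reads only, ignoring the CPU cost that matters for spatial data.

```python
import math

def fixed_priority(has_index):
    # Fixed priority: the rule is decided in advance, e.g. "use an index if one exists".
    return "index" if has_index else "scan"

def estimate_costs(n_records, blocking_factor, has_index):
    # Hypothetical cost model: number of block reads per strategy.
    costs = {"scan": math.ceil(n_records / blocking_factor)}
    if has_index:
        # Assumed index height, with fan-out equal to the blocking factor.
        costs["index"] = max(1, math.ceil(math.log(n_records, blocking_factor)))
    return costs

def cost_based(n_records, blocking_factor, has_index):
    # Cost model: estimate every available option, then pick the cheapest.
    costs = estimate_costs(n_records, blocking_factor, has_index)
    return min(costs, key=costs.get)
```

With a million records and 100 records per block, the scan is estimated at 10,000 block reads and the index at a handful, so cost_based picks the index; without an index only the scan is available.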
3. Spatial query processing and optimization
A. Major classes of spatial operations
1) Update
Standard database operations such as create, delete, modify, etc.
2) Selection
a) Point queries (PQ)
Given a point, find all spatial objects that contain that point.
b) Range queries (RQ)
Given a polygon, find all spatial objects which intersect that polygon
3) Spatial joins
Join two tables based on a spatial predicate (relationship)
Predicates: intersect, contain, is contained by, meet, distance, etc.
Example: "Find all forest stands and river flood plains that overlap"
SELECT FS.name, FP.name
FROM Forest_stand FS, Flood_plain FP
WHERE overlap(FS.Geometry, FP.Geometry) = 1
4) Spatial aggregate
Variation of nearest neighbor operation
B. Filter-refine paradigm
1) Filter step
a) Find a candidate set (a superset of the real answer set) using approximations
- Reduces CPU cost.
- Geometry can be approximated with an orthogonal MBR (minimum bounding rectangle)
- Spatial predicates such as touch, inside, buffer, etc. can be approximated by
overlap.
- Guarantees that no record belonging to the final answer is eliminated in this step.
The Envelope operation in OGIS returns the MBR of a line string or polygon
2) Refine step
a) Find the exact answer from the candidate set using exact geometry (CPU intensive)
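A minimal sketch of the two steps for a point query follows. It assumes polygons are given as lists of (x, y) vertices, uses ray casting as the exact test, and treats points exactly on a boundary loosely. All function names are illustrative.

```python
def mbr(polygon):
    # Orthogonal minimum bounding rectangle as (xmin, ymin, xmax, ymax).
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))

def mbr_contains(box, point):
    xmin, ymin, xmax, ymax = box
    x, y = point
    return xmin <= x <= xmax and ymin <= y <= ymax

def point_in_polygon(point, polygon):
    # Exact test by ray casting: count edges crossed by a ray going right.
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def filter_refine(point, polygons):
    # Filter: cheap MBR test yields a superset of the answer.
    candidates = [name for name, poly in polygons.items()
                  if mbr_contains(mbr(poly), point)]
    # Refine: expensive exact test, run on the candidates only.
    answer = [name for name in candidates
              if point_in_polygon(point, polygons[name])]
    return candidates, answer

polygons = {
    "triangle": [(0, 0), (4, 0), (0, 4)],
    "square": [(5, 5), (7, 5), (7, 7), (5, 7)],
}
```

For the query point (3, 3) the triangle survives the filter step because the point is inside its MBR, but the refine step rejects it; for (1, 1) it passes both steps.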
C. Choice of building blocks (depends on the vendor); representative building blocks
Each building block has a few strategies.
1) Point query: return a spatial object from a table that contains the query point, i.e. point at
a location and ask what is here (or what this is)
2) Range query: return all spatial objects from a table that lie within a spatial region. A range query
can be addressed as two point queries (find the maximum and minimum values)
3) Spatial join: return pairs of spatial objects from two tables that satisfy a spatial predicate
4) Nearest neighbor: return the one spatial object from a set of objects that is closest to a query
point
D. Choice of processing strategies
1) Complexity analysis
Make a rough assessment of the time required to complete an algorithm. Measured execution
time is not a definitive basis for complexity analysis,
since it depends on the programming language, hardware configuration, operating
system, network conditions, and other factors. Instead, classify algorithms by the number
of operations as a function of input size. The worst-case scenario uses the upper bound on the
number of operations.
2) Complexity class
Provide simulated run-times for each class as a function of input size n
Polynomial time
Constant: 1 -> good
Logarithmic: log n
Linear: n
Quadratic: n^2
Exponential: 2^n -> bad (not polynomial time)
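To see how quickly the classes diverge, the operation counts can be tabulated for a given input size. The helper below is only an illustration; its name is made up and the base-2 logarithm is an arbitrary choice.

```python
import math

def operation_counts(n):
    # Number of operations for each complexity class at input size n.
    return {
        "constant": 1,
        "logarithmic": math.log2(n),
        "linear": n,
        "quadratic": n ** 2,
        "exponential": 2 ** n,
    }

for n in (8, 16, 32):
    print(n, operation_counts(n))
```

Doubling n from 16 to 32 adds one operation in the logarithmic class, doubles the linear count, quadruples the quadratic count, and multiplies the exponential count by 2^16.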
2) Point queries
a) Unsorted with no index
Scan the whole file and test individual record -> cost = O (n)
b) Space-filling curve such as a Z-curve, with a B-tree
Use binary search as in a traditional DB -> cost = O(log_B n), where B is the blocking factor
(the number of records per sector); the larger B is, the lower the cost.
c) Spatial index such as R-tree, R+-tree
Apply the find() operation to the R-tree file structure, which searches by MBR -> cost = O(log n)
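Strategies a) and b) can be compared on point data. The sketch assumes records are (x, y, name) tuples with non-negative integer coordinates, and it uses a sorted in-memory list with binary search as a stand-in for the B-tree. The function names are illustrative.

```python
from bisect import bisect_left

def z_value(x, y, bits=16):
    # Z-curve (Morton) code: interleave the bits of x and y.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def scan_lookup(records, x, y):
    # a) No index: test every record, O(n).
    return [name for px, py, name in records if (px, py) == (x, y)]

def build_z_index(records):
    # b) Order the records along the Z-curve once.
    return sorted((z_value(px, py), name) for px, py, name in records)

def z_lookup(index, x, y):
    # Binary search on the one-dimensional Z-value, O(log n).
    key = z_value(x, y)
    i = bisect_left(index, (key,))
    hits = []
    while i < len(index) and index[i][0] == key:
        hits.append(index[i][1])
        i += 1
    return hits

records = [(2, 3, "well"), (5, 1, "tower"), (0, 0, "origin")]
index = build_z_index(records)
```

Both lookups return the same records; the difference is only in how many entries are inspected.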
3) Range queries
a) Unsorted with no index
The same as point query
b) Space-filling curves
Find the range of index values satisfying the range query and locate its lower boundary using
binary search -> cost = O(log_B n) + query size / clustering efficiency
Clustering efficiency: how well spatial adjacency is preserved by the space-filling curve.
c) Spatial index
The same as point query
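A range query over a Z-ordered file can be sketched as follows, again with a sorted list standing in for the B-tree and with integer point coordinates assumed. Because the Z-value grows with each coordinate, every point inside the query rectangle has a Z-value between those of the rectangle's lower-left and upper-right corners. Entries in that interval that lie outside the rectangle are the price of imperfect clustering.

```python
from bisect import bisect_left

def z_value(x, y, bits=16):
    # Z-curve (Morton) code: interleave the bits of x and y.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def z_range_query(index, xmin, ymin, xmax, ymax):
    lo = z_value(xmin, ymin)   # smallest Z-value inside the rectangle
    hi = z_value(xmax, ymax)   # largest Z-value inside the rectangle
    start = bisect_left(index, (lo,))   # binary search for the lower boundary
    hits, scanned = [], 0
    for z, x, y, name in index[start:]:
        if z > hi:
            break
        scanned += 1
        if xmin <= x <= xmax and ymin <= y <= ymax:
            hits.append(name)
    return hits, scanned

points = [(1, 1, "a"), (2, 2, "b"), (3, 0, "c"), (6, 6, "d")]
index = sorted((z_value(x, y), x, y, name) for x, y, name in points)
hits, scanned = z_range_query(index, 1, 1, 2, 2)
```

Here point "c" falls inside the Z-value interval but outside the rectangle, so three entries are scanned to produce two hits.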
4) Spatial selection
a) Digression: Measuring the cost of spatial selection
1) Rank: Rank the predicates by taking into account both the selectivity and the per-tuple cost of
the spatial function. Execute the predicates in ascending order of rank.
2) Selectivity: the ratio of the cardinality (number of tuples) of the query output to the cardinality
of the input. The closer to zero, the better
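The outline does not give a formula for rank. One common definition, assumed here, is rank = (selectivity - 1) / cost per tuple, where selectivity is the fraction of input tuples that pass the predicate. The predicate names and numbers below are made up for illustration.

```python
def rank(selectivity, cost_per_tuple):
    # Assumed definition: more negative means "filters a lot for little cost".
    return (selectivity - 1.0) / cost_per_tuple

def order_predicates(predicates):
    # Execute predicates in ascending order of rank.
    return sorted(predicates, key=lambda p: rank(p["selectivity"], p["cost"]))

predicates = [
    {"name": "overlap(geometry, window)", "selectivity": 0.1, "cost": 100.0},
    {"name": "area > 100", "selectivity": 0.5, "cost": 1.0},
]
ordered = [p["name"] for p in order_predicates(predicates)]
```

The cheap aspatial predicate is run first even though it is less selective, so the expensive spatial function is evaluated on only half of the tuples.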
b) Spatial join: test pairs from two tables for the predicate
1) Nested loop -> no index available on either table
Test all possible pairs -> cost = O(n^2)
2) Space partitioning
Test only pairs of objects located in a common spatial region, with regions partitioned based on
MBRs.
3) Tree matching -> spatial indices such as R-trees are available on both tables
Test the MBRs of objects from the two tables against the predicate, starting from the top of the
trees. If the MBRs from both tables are in leaf nodes and satisfy the spatial predicate,
output those pairs
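The first two join strategies can be sketched on MBRs alone, which corresponds to the filter step of the join. Boxes are (xmin, ymin, xmax, ymax) tuples; the uniform grid and its cell size are assumptions chosen for the example, and tree matching is not shown.

```python
from collections import defaultdict
from itertools import product

def overlaps(a, b):
    # Two MBRs overlap if they overlap on both axes.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def nested_loop_join(left, right):
    # Test all possible pairs.
    return {(i, j) for (i, a), (j, b) in product(left.items(), right.items())
            if overlaps(a, b)}

def cells(box, size):
    # Grid cells (hypothetical uniform partition) touched by a box.
    xmin, ymin, xmax, ymax = box
    for cx in range(int(xmin // size), int(xmax // size) + 1):
        for cy in range(int(ymin // size), int(ymax // size) + 1):
            yield (cx, cy)

def grid_join(left, right, size):
    # Space partitioning: only pairs sharing a grid cell are tested.
    grid = defaultdict(list)
    for j, b in right.items():
        for c in cells(b, size):
            grid[c].append(j)
    result = set()
    for i, a in left.items():
        for c in cells(a, size):
            for j in grid.get(c, ()):
                if overlaps(a, right[j]):
                    result.add((i, j))
    return result

forest = {"fs1": (0, 0, 4, 4), "fs2": (10, 10, 12, 12)}
flood = {"fp1": (3, 3, 6, 6), "fp2": (20, 20, 22, 22)}
```

Both functions return the same pairs; the grid version skips pairs such as (fs2, fp2) without ever comparing them, because they share no cell.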
c) Nearest neighbor
1) Two phase
Order the spatial data with a space-filling curve. Locate the query point on the curve and fetch
the closest point along the curve. Compute the distance between them, then run a range
query using a circle centered at the query point with that distance as its radius.
Among the points found, return the one nearest to the query point.
2) Single phase
Recursive algorithm that uses an R-tree and actual distance measurements between the
query point and MBRs. Traversal can be breadth first or depth first. For
breadth-first traversal, at each level choose the MBR closest
to the query point, then move on to its child nodes and repeat the
procedure.
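The two-phase strategy can be sketched on point data. The assumptions are: a non-empty list of points with non-negative integer coordinates, a query point on the same grid, a sorted list in place of the B-tree, and the circle replaced by its bounding square for the range scan. Names such as two_phase_nn are illustrative.

```python
import math
from bisect import bisect_left

BITS = 16                      # assumed coordinate width
MAX_COORD = (1 << BITS) - 1

def z_value(x, y):
    # Z-curve (Morton) code: interleave the bits of x and y.
    z = 0
    for i in range(BITS):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def clamp(v):
    return max(0, min(MAX_COORD, v))

def two_phase_nn(points, query):
    index = sorted((z_value(x, y), (x, y)) for x, y in points)
    # Phase 1: take the neighbours of the query along the curve and
    # use the distance to the closer one as the search radius.
    pos = bisect_left(index, (z_value(*query),))
    neighbours = index[max(0, pos - 1):pos + 1]
    radius = min(math.dist(query, p) for _, p in neighbours)
    # Phase 2: range query over the square that encloses the circle.
    qx, qy = query
    lo = z_value(clamp(math.floor(qx - radius)), clamp(math.floor(qy - radius)))
    hi = z_value(clamp(math.ceil(qx + radius)), clamp(math.ceil(qy + radius)))
    best_dist, best_point = math.inf, None
    for z, p in index[bisect_left(index, (lo,)):]:
        if z > hi:
            break
        d = math.dist(query, p)
        if d < best_dist:
            best_dist, best_point = d, p
    return best_point
```

For the points (0, 7) and (4, 3) with query (4, 4), the curve neighbour found in phase 1 is (0, 7) at distance 5, and the range query in phase 2 then finds the true nearest neighbour (4, 3).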
Exercises
1. There are twelve octagons in a space. How many operations are needed to process a range
query? Assume no index, and that the cost of processing each polygon equals its number of edges.
a) Directly
Since no index exists, scan all records. More precisely, compare each edge of each polygon with
each edge of the query range. Each polygon requires 8 operations. 12 polygons -> 8 x 12 =
96 operations
b) Using the filter-refine strategy.
Approximating each polygon by its MBR reduces the number of edges per polygon
from 8 to 4, but a scan of all records is still required. The filter step takes 4 x 12 (= 48)
operations. 4 candidates are left after the filter step. Since overlap(MBR(polygon), query
range) = 1 is only a necessary condition (not a sufficient one) for intersect(polygon, query
range) = 1, the candidates still need an exact comparison. So 4 x 8 (= 32)
operations are required in the refine step. The total is 48 + 32 = 80 operations.
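The arithmetic of both answers can be checked with a few lines. The variable names are illustrative; the numbers are those of the exercise.

```python
n_polygons = 12
edges_per_polygon = 8
edges_per_mbr = 4
candidates_after_filter = 4

# a) Direct: exact test on every polygon.
direct = n_polygons * edges_per_polygon

# b) Filter-refine: MBR test on every polygon, exact test on candidates only.
filter_cost = n_polygons * edges_per_mbr
refine_cost = candidates_after_filter * edges_per_polygon
filter_refine = filter_cost + refine_cost

print(direct, filter_cost, refine_cost, filter_refine)
```

Filter-refine wins here because only a third of the polygons survive the filter step; if all 12 survived, it would cost 48 + 96 operations, more than the direct approach.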
4. Choose an appropriate strategy (tree matching, or nested loop with index) for a spatial join in
the following cases.
a) Overlay join: neither data set has an R-tree index.
Nested loop: test all possible pairs for overlay.
b) Overlay join: one data set has an R-tree index.
Nested loop with index: put the indexed relation in the inner loop to avoid scanning all the
records of the inner relation in each iteration. This reduces the cost to O(n log n).
c) Overlay join: both data sets have R-tree indices.
Tree matching
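The three answers amount to a fixed-priority rule on index availability, which can be written down directly. The function name and the returned labels are illustrative only.

```python
def choose_join_strategy(left_has_rtree, right_has_rtree):
    # Returns (strategy, relation to place in the inner loop, if any).
    if left_has_rtree and right_has_rtree:
        return ("tree matching", None)
    if left_has_rtree:
        return ("nested loop with index", "left")
    if right_has_rtree:
        return ("nested loop with index", "right")
    return ("nested loop", None)
```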