Chapter 5 study outline

1. What is query processing and optimization (QPO)?
  A. Queries versus QPO
    1) Queries: written in a high-level, declarative language such as SQL.
    2) QPO: a component of the DBMS
      a) Builds an execution plan by mapping the query into a sequence of operations supported by the physical data model (index or file structures)
      b) Goal: process the query accurately at minimal computational cost
      c) The overhead of planning should be less than the actual execution time.
  B. Why learn about QPO?
    1) Identify performance bottlenecks
      a) Physical data model?
      b) QPO module?
    2) Help QPO to speed up queries
      a) Rewrite queries
    3) Design the physical data model
      a) Change the file structure, e.g. ordered, hash, etc.
      b) Add indices
  C. Why is QPO special in spatial databases?
    1) Algorithms for spatial operations are CPU intensive as well as I/O intensive, whereas operations in traditional databases are dominated by I/O cost.
    2) Spatial operations and data types are diverse, and there is no consensus on the building blocks for spatial query processing. For some queries, such as nearest neighbor, the available strategies are limited.
2. Key concepts in QPO
  A. Building blocks: a set of basic operations from which complicated queries are composed, such as
    1) Select (point queries, range queries)
    2) Join
    3) Sorting
  B. Strategies for building blocks
    1) Each building block has several processing options; e.g. the select operation has several strategies. Which strategy should be used depends on parameters of the physical model (table size, available indices).
  C. Query optimization
    1) QPO attempts to choose the most efficient strategy among the processing options, given the database parameters.
  D. QPO challenges
    1) Building blocks
      a) Aspatial (relational): SQL translates into relational algebra (RA). The building blocks of RA are select, project, and join.
      b) Spatial: difficult
        1) Large number of spatial objects and operations
        2) No consensus on building blocks
    2) Processing strategies
      a) Aspatial: large number of strategies, so the choice is complex
      b) Spatial: the number of processing strategies may be limited
    3) Choice among processing strategies
      a) General strategies
        1) Fixed priority: a specific strategy is predetermined based on the physical parameters of the database.
        2) Cost models: estimate the cost of each procedure from the physical parameters of the database.
      b) Problem with spatial data
        1) Cost models are immature and complicated (both I/O and CPU cost)
3. Spatial query processing and optimization
  A. Major classes of spatial operations
    1) Update: standard database operations such as create, delete, modify, etc.
    2) Selection
      a) Point queries (PQ): given a point, find all spatial objects that contain that point.
      b) Range queries (RQ): given a polygon, find all spatial objects that intersect that polygon.
    3) Spatial joins: join two tables based on a spatial predicate (relationship).
      Predicates: intersect, contain, is contained by, meet, distance, etc.
      Example: "Find all forest stands and river flood plains which overlap"
        SELECT FS.name, FP.name
        FROM Forest-stand FS, Flood-plain FP
        WHERE overlap(FS.Geometry, FP.Geometry) = 1
    4) Spatial aggregate: a variation of the nearest neighbor operation
  B. Filter-refine paradigm
    1) Filter step
      a) Find a candidate set (a superset of the real answer set) using approximations
        - Reduces CPU cost.
        - Geometry can be approximated by an orthogonal minimum bounding rectangle (MBR).
        - Spatial predicates such as touch, inside, buffer, etc. can be approximated by overlap of the MBRs.
        - Guarantees that no record belonging to the final answer is eliminated in this step.
        - The Envelope operation in OGIS returns the MBR of a line string or polygon.
    2) Refine step
      a) Find the exact answer from the candidate set using exact geometry (CPU intensive)
  C. Choice of building blocks (depends on the vendor); representative building blocks are listed here. Each building block has a few strategies.
    1) Point query: return the spatial object(s) in a table that contain the query point, i.e. point at one location and ask what is here (or what this is).
    2) Range query: return all spatial objects in a table that lie within a spatial region. A range query can be addressed as two point queries (find the maximum and minimum values).
    3) Spatial join: return pairs of spatial objects from two tables that satisfy a spatial predicate.
    4) Nearest neighbor: return the one spatial object from a set of objects that is closest to a query point.
  D. Choice of processing strategies
    1) Complexity analysis
      a) Purpose: make a rough assessment of the time an algorithm needs. Measured run time is not a definitive answer, since it depends on programming language, hardware configuration, operating system, network conditions, and other factors. Instead, classify algorithms by the number of operations as a function of input size n. The worst-case scenario uses the upper bound on the number of operations.
      b) Complexity classes: describe the growth of run time as a function of input size, from good to bad
        Polynomial time
          Constant: 1 -> good
          Logarithmic: log n
          Linear: n
          Quadratic: n^2
        Exponential: 2^n -> bad
    2) Point queries
      a) Unsorted with no index
        Scan the whole file and test each record -> cost = O(n)
      b) Space-filling curve such as the Z-curve, with a B-tree
        Use binary search as in a traditional DB -> cost = O(log_B n), where B is the blocking factor (the number of records per sector); the larger B, the lower the cost.
      c) Spatial index such as R-tree or R+-tree
        Apply the find() operation to the R-tree file structure, using MBRs -> cost = O(log n)
    3) Range queries
      a) Unsorted with no index
        The same as for point queries
      b) Space-filling curves
        Find the range of index values satisfying the range query and locate the lower boundary using binary search -> cost = O(log_B n) + query size / clustering efficiency
        Clustering efficiency: how well spatial adjacency is preserved by the space-filling curve.
      c) Spatial index
        The same as for point queries
    4) Spatial selection
      a) Digression: measuring the cost of spatial selection
        1) Rank: rank the predicates by taking into account both the selectivity and the per-tuple cost of the spatial function. Execute the predicates in ascending order of rank.
        2) Selectivity: the ratio of the cardinality (number of elements) of the query output to the cardinality of the input. The closer to zero, the better.
      b) Spatial join: test pairs from two tables against the predicate
        1) Nested loop -> no index available on either table
          Test all possible pairs: O(n^2) for two tables of n objects each
        2) Space partitioning
          Test only pairs of objects located in common spatial regions, partitioned based on MBRs.
        3) Tree matching -> spatial indices such as R-trees are available on both tables
          Starting from the top of the trees, test the MBRs of objects from the two tables against the predicate. If the MBRs from both tables are in leaf nodes and satisfy the spatial predicate, output those pairs.
      c) Nearest neighbor
        1) Two phase
          Order the spatial data with a space-filling curve. Fetch the query point and its closest point along the curve. Compute the distance between them, then run a range query using a circle centered at the query point with that distance as radius, and return the closest of the points found.
        2) Single phase
          Recursive algorithm using an R-tree and actual distance measurements between the query point and MBRs. The tree can be traversed breadth first or depth first. For breadth-first traversal, at each level choose the MBR closest to the query point, then move on to its child nodes and repeat.

Exercises
1. There are twelve octagons in a space. How many operations are needed to process a range query? Assume no index, and that the cost of processing each polygon equals its number of edges.
  a) Directly
    Since no index exists, scan all records. More precisely, compare each edge of each polygon with each edge of the query range. Each polygon requires 8 operations.
    12 polygons -> 8 x 12 = 96 operations
  b) Using the filter-refine strategy
    Replacing each polygon by its MBR reduces the number of edges per polygon from 8 to 4, but a full scan of the disk is still required. The filter step therefore takes 4 x 12 = 48 operations. 4 candidates are left after the filter step. Because overlap(MBR(polygon), query range) = 1 is only a necessary condition, not a sufficient one, for intersect(polygon, query range) = 1, the candidates still need an exact comparison. The refine step therefore takes 4 x 8 = 32 operations, for a total of 48 + 32 = 80 operations.
4. Choose the appropriate strategy (tree matching, or nested loop with index) for a spatial join in the following cases.
  a) Overlay join: neither data set has an R-tree index.
    Nested loop: test all possible pairs for overlay.
  b) Overlay join: one data set has an R-tree index.
    Nested loop with index: put the indexed relation in the inner loop to avoid scanning every record of the inner relation in each iteration. This reduces the cost to O(n log n).
  c) Overlay join: both data sets have R-tree indices.
    Tree matching
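
The operation counts in Exercise 1 and the filter step of the filter-refine paradigm can be sketched in a few lines of Python. This is only an illustration under the exercise's simplified cost model (one operation per edge, an MBR counted as 4 edges); the function names `mbr`, `mbr_overlap`, `filter_step`, `direct_cost`, and `filter_refine_cost` are hypothetical and do not come from any particular spatial database.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbr(polygon: List[Point]) -> Rect:
    """Axis-aligned minimum bounding rectangle of a polygon's vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_overlap(a: Rect, b: Rect) -> bool:
    """True if two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def filter_step(polygons: List[List[Point]], query: Rect) -> List[int]:
    """Indices of polygons whose MBR overlaps the query rectangle.

    The result is a candidate set: a superset of the true answer, so each
    candidate still needs an exact geometry test in the refine step.
    """
    return [i for i, p in enumerate(polygons) if mbr_overlap(mbr(p), query)]


def direct_cost(n_objects: int, edges_per_object: int) -> int:
    """Cost of testing every object with its exact geometry (assumed model)."""
    return n_objects * edges_per_object


def filter_refine_cost(n_objects: int, edges_per_object: int,
                       n_candidates: int, mbr_edges: int = 4) -> int:
    """Filter every object by its MBR, then refine only the candidates."""
    filter_cost = n_objects * mbr_edges
    refine_cost = n_candidates * edges_per_object
    return filter_cost + refine_cost


print(direct_cost(12, 8))            # 96
print(filter_refine_cost(12, 8, 4))  # 80
```

Under this cost model, filter-refine is cheaper than the direct scan only while 48 + 8c < 96, that is, while fewer than 6 of the 12 polygons survive the filter step as candidates.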