Distributed Query Processing
Donald Kossmann
University of Heidelberg
[email protected]
Agenda
• Query Processing 101
– centralized query processing
– distributed query processing
• Middleware
– SQL and XML data integration
• The Role of Web Services
Problem Statement
• Input: Query
How many times has the moon circled around the
earth in the last twenty years?
• Output: Answer
240!
• Objectives:
– response time, throughput, first answers, little IO, ...
• Centralized vs. Distributed Query Processing
– same problem
– but, different parameters and objectives
Query Processing 101
• Input: Declarative Query
– SQL, OQL, XQuery, ...
• Step 1: Translate Query into Algebra
– Tree of operators
• Step 2: Optimize Query (physical and logical)
– Tree of operators
– (Compilation)
• Step 3: Interpretation
– Query result
Algebra
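SELECT A.d
FROM A, B
WHERE A.a = B.b
AND A.c = 35
[Figure: operator tree for the query, top to bottom: projection on A.d; selection A.a = B.b, A.c = 35; cross product X; inputs A and B]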
– relational algebra for SQL very well understood
– algebra for OQL fairly well understood
– algebra for XQuery (work in progress)
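The translation of the example query into a tree of operators can be sketched as a small data structure. This is only an illustration: the class names (`Table`, `Cross`, `Select`, `Project`) and the `render` helper are assumptions made for this sketch, not part of any particular system.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Table:
    name: str


@dataclass(frozen=True)
class Cross:
    left: object
    right: object


@dataclass(frozen=True)
class Select:
    predicates: Tuple[str, ...]
    child: object


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: object


def render(node, depth=0):
    """Return the operator tree as indented lines, root first."""
    pad = "  " * depth
    if isinstance(node, Table):
        return [pad + node.name]
    if isinstance(node, Cross):
        return [pad + "X"] + render(node.left, depth + 1) + render(node.right, depth + 1)
    if isinstance(node, Select):
        return [pad + "select " + ", ".join(node.predicates)] + render(node.child, depth + 1)
    if isinstance(node, Project):
        return [pad + "project " + ", ".join(node.columns)] + render(node.child, depth + 1)
    raise TypeError(f"unknown operator: {node!r}")


# SELECT A.d FROM A, B WHERE A.a = B.b AND A.c = 35
plan = Project(
    ("A.d",),
    Select(("A.a = B.b", "A.c = 35"), Cross(Table("A"), Table("B"))),
)
print("\n".join(render(plan)))
```

Keeping the tree immutable makes it easy for a later optimization step to build rewritten copies instead of modifying the plan in place.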
Query Optimization
[Figure: the logical plan (A.d; A.a = B.b, A.c = 35; X; A, B) next to an optimized physical plan (A.d; hashjoin; inputs index A.c and B.b over B)]
– „no brainers“ (e.g., push down cheap predicates)
– enumerate alternative plans, apply cost model
– use search heuristics to find cheapest plan
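A toy version of "enumerate alternative plans, apply cost model" might look as follows. All statistics and the cost formulas are invented for illustration, and the sketch enumerates exhaustively; a real optimizer would use search heuristics over a much larger plan space.

```python
# Hypothetical statistics; none of these numbers come from the slides.
ROWS_A = 10_000      # rows in A
ROWS_B = 1_000       # rows in B
ROWS_A_C_35 = 100    # rows of A that satisfy A.c = 35


def cost_cross_then_filter():
    # The cross product produces |A| * |B| tuples before any predicate runs.
    return ROWS_A * ROWS_B


def cost_scan_filter_then_hashjoin():
    # Read all of A, build a hash table on the rows that pass the filter,
    # then probe it once per row of B.
    return ROWS_A + ROWS_A_C_35 + ROWS_B


def cost_index_then_hashjoin():
    # An index on A.c reads only the matching rows of A.
    return ROWS_A_C_35 + ROWS_A_C_35 + ROWS_B


PLANS = {
    "cross product, then filter": cost_cross_then_filter,
    "scan A with pushed-down filter, hash join": cost_scan_filter_then_hashjoin,
    "index on A.c, hash join": cost_index_then_hashjoin,
}


def cheapest_plan(plans):
    """Return the name of the plan with the lowest estimated cost."""
    return min(plans, key=lambda name: plans[name]())


for name, cost in PLANS.items():
    print(f"{cost():>10} {name}")
print("chosen:", cheapest_plan(PLANS))
```

Even with made-up numbers, the gap between the first plan and the other two shows why pushing down a cheap predicate counts as a "no brainer".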
Query Execution
[Figure: execution of the optimized plan. index A.c yields (John, 35, CS) and (Mary, 35, EE); B.b over B yields (CS) and (AS) from (Edinburgh, CS, 5.0) and (Edinburgh, AS, 6.0); hashjoin yields (John, 35, CS); A.d yields John]
– library of operators (hash join, merge join, ...)
– pipelining (iterator model)
– lazy evaluation
– exploit indexes and clustering in database
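The iterator model and lazy evaluation can be sketched with Python generators, using the tuples from the example. The column names `city` and `score`, and the mapping of tuple positions to `d`, `c`, `a` and `b`, are assumptions inferred from the figure; the operator functions are illustrative, not a real engine's library.

```python
from collections import defaultdict

# Example data; attribute names are assumed for this sketch.
A = [
    {"d": "John", "c": 35, "a": "CS"},
    {"d": "Mary", "c": 35, "a": "EE"},
]
B = [
    {"city": "Edinburgh", "b": "CS", "score": 5.0},
    {"city": "Edinburgh", "b": "AS", "score": 6.0},
]

# A simple index on A.c: value of c -> rows with that value.
index_on_c = defaultdict(list)
for row in A:
    index_on_c[row["c"]].append(row)


def index_scan(index, key):
    """Yield only the rows stored under `key`."""
    yield from index.get(key, [])


def scan(table):
    """Yield every row of a table."""
    yield from table


def project(rows, columns):
    """Keep only the named columns of each incoming row."""
    for row in rows:
        yield {col: row[col] for col in columns}


def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on one input, then stream the other input past it."""
    table = defaultdict(list)
    for row in build:
        table[row[build_key]].append(row)
    for row in probe:
        for match in table.get(row[probe_key], []):
            yield {**match, **row}


# Plan: A.d <- hashjoin <- (index A.c, B.b <- B). Nothing runs yet.
plan = project(
    hash_join(
        build=project(scan(B), ["b"]),
        probe=index_scan(index_on_c, 35),
        build_key="b",
        probe_key="a",
    ),
    ["d"],
)

result = list(plan)  # pulling from the root drives the whole pipeline
print(result)
```

Each operator pulls rows from its children only when its own consumer asks for the next row, which is what makes the evaluation lazy.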
Summary: Centralized Queries
• Basic SQL (SPJG, nesting) well understood
• Very good extensibility
– nearest neighbor search, spatial joins, time series, UDF,
roll-up, cube, ...
• Current problems
– statistics, cost model for optimization
– physical database design expensive
• Trends
– interactiveness during execution
– approximate answers
– more and more functionality, powerful models (XML)
Distributed Query Processing 101
• Idea:
This is just an extension of centralized query
processing. (System R* et al. in the early 80s)
• What is different?
– extend physical algebra: send&receive operators
– resource vectors, network interconnect matrix
– caching and replication
– optimize for response time
– less predictability in cost model (adaptive algos)
– heterogeneity in data formats and data models
Distributed Query Plan
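[Figure: distributed plan. A.d over hashjoin; each hashjoin input is a receive operator fed by a send operator; one send is fed by index A.c, the other by B.b over B]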
Cost
Total Cost = Sum of Cost of Ops
Cost = 40
[Figure: the distributed query plan annotated with per-operator costs 1, 8, 1, 6, 1, 6, 2, 5, 10]
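The total can be checked by adding up the nine per-operator costs shown in the figure. Which cost belongs to which operator is not recoverable from the transcript, so the sketch below keeps them as a plain list.

```python
# Per-operator costs as listed in the figure (operator assignment unknown).
op_costs = [1, 8, 1, 6, 1, 6, 2, 5, 10]

# Total cost is simply the sum over all operators in the plan.
total_cost = sum(op_costs)
print(total_cost)  # 40
```

This metric measures total resource consumption and ignores that operators at different sites can run at the same time.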
Response Time
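Total Cost = 40
first tuple = 25
last tuple = 33
independent, pipelined parallelism
[Figure: the distributed query plan annotated with (first tuple, last tuple) times per operator: (25, 33), (24, 32), (0, 7), (0, 24), (0, 6), (0, 18), (0, 12), (0, 5), (0, 10); one operator is called out with first tuple = 0, last tuple = 10]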
Adaptive Algorithms
• Deal with unpredictable events at run time
– delays in arrival of data, burstiness of network
– autonomy of nodes, change in policies
• Example: double pipelined hash joins
– build hash table for both input streams
– read inputs in separate threads
– good for bursty arrival of data
• Re-optimization at run time
– monitor execution of query
– adjust estimates of cost model
– re-optimize if delta is too large
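The double pipelined hash join mentioned above can be sketched as follows. As a simplifying assumption, the two input threads are modeled by a single merged stream of `(side, row)` pairs in arrival order; the function name, the side labels and the sample rows are all made up for this sketch.

```python
from collections import defaultdict


def double_pipelined_hash_join(arrivals, left_key, right_key):
    """Join two inputs whose rows arrive interleaved.

    `arrivals` yields (side, row) pairs with side "L" or "R".
    A hash table is kept for BOTH inputs, so a result is produced as soon
    as the second half of a matching pair arrives, whichever side it is on.
    """
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    keys = {"L": left_key, "R": right_key}
    for side, row in arrivals:
        other = "R" if side == "L" else "L"
        key = row[keys[side]]
        tables[side][key].append(row)          # insert into own table
        for match in tables[other].get(key, []):  # probe the other table
            yield (row, match) if side == "L" else (match, row)


# Hypothetical bursty arrival order: one right row, two left rows, one right row.
arrivals = [
    ("R", {"b": "CS"}),
    ("L", {"a": "CS", "d": "John"}),
    ("L", {"a": "EE", "d": "Mary"}),
    ("R", {"b": "EE"}),
]
results = list(double_pipelined_hash_join(arrivals, left_key="a", right_key="b"))
print(results)
```

Neither input has to be read completely before results appear, so a stall on one input does not block progress on rows that already have a partner.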
Heterogeneity
• Use Wrappers to „hide“ heterogeneity
• Wrappers take care of data format, packaging
• Wrappers map from local to global schema
• Wrappers carry out caching
– connections, cursors, data, ...
• Wrappers map queries into local dialect
• Wrappers participate in query planning!!!
– define the subset of queries that can be handled
– give cost information, statistics
– „capability-based rewrite“ (HKWY, VLDB 1997)
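How a wrapper might take part in query planning can be sketched as below. This is a deliberately simplified illustration and not the capability-based rewrite algorithm of the cited paper; the `Wrapper` class, its fields and the operator names are assumptions made for this sketch.

```python
class Wrapper:
    """Hypothetical wrapper that advertises capabilities and cost information."""

    def __init__(self, name, supported_ops, rows, cost_per_row):
        self.name = name
        self.supported_ops = set(supported_ops)  # subset of queries it can handle
        self.rows = rows                          # statistic: number of rows
        self.cost_per_row = cost_per_row          # cost information for the planner

    def estimate_cost(self):
        return self.rows * self.cost_per_row

    def split(self, wanted_ops):
        """Split operators into those pushed to the source and those
        that the middleware must execute itself."""
        pushed = [op for op in wanted_ops if op in self.supported_ops]
        remaining = [op for op in wanted_ops if op not in self.supported_ops]
        return pushed, remaining


# Two made-up sources: a full SQL database and a flat file that can only be scanned.
sql_source = Wrapper("DB1", {"scan", "select", "project"}, rows=1000, cost_per_row=1)
file_source = Wrapper("DB2", {"scan"}, rows=1000, cost_per_row=5)

wanted = ["scan", "select", "project"]
for source in (sql_source, file_source):
    pushed, remaining = source.split(wanted)
    print(source.name, "pushed:", pushed, "in middleware:", remaining,
          "estimated cost:", source.estimate_cost())
```

The planner only needs the advertised capabilities and cost figures, so the details of each source stay hidden behind the wrapper.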
Data Cleaning
• Are two objects the same?
• Is „D. A. Kossman“ the same as „Kossmann“?
• Is the object that was at Position x 10 min. ago
the same as the object at Position y now?
• Approaches (combination of)
– statistical
– domain knowledge
– human inspection
• Very Expensive
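The statistical approach to the name question above can be sketched with string similarity from the Python standard library. The last-name heuristic and the threshold of 0.9 are arbitrary assumptions for illustration; a real data cleaning pipeline would combine this with domain knowledge and human review.

```python
from difflib import SequenceMatcher


def last_name(name):
    """Crude normalization: drop dots, take the last token, lowercase it."""
    return name.replace(".", " ").split()[-1].lower()


def similarity(a, b):
    """Similarity of the two last names, between 0.0 and 1.0."""
    return SequenceMatcher(None, last_name(a), last_name(b)).ratio()


def probably_same(a, b, threshold=0.9):
    # The threshold is a made-up cut-off, not a recommended value.
    return similarity(a, b) >= threshold


print(probably_same("D. A. Kossman", "Kossmann"))
print(probably_same("D. A. Kossman", "Codd"))
```

A pure string measure like this cannot answer the moving-object question on the same slide, which needs domain knowledge about positions and time.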
Summary
• „Theory“ very well understood
– extend traditional (centralized) query processing
– add some bells and whistles
– heterogeneity needs manual work and wrappers
• Problems in Practice
– cost model, statistics
– architectures are not fit for adaptivity, heterogeneity
– optimizers do not scale for 10,000s of sites
– autonomy of sites, systems not built for asynchronous communication
– data cleaning
Middleware
• Two kinds of middleware
– data warehouses
– virtual integration
• Data Warehouses
– good: query response times
– good: materializes results of data cleaning
– bad: high resource requirements in middleware
– bad: staleness of data
• Virtual Integration
– the opposite
– caching possible to improve response times
Virtual Integration
[Figure: a Query enters the Middleware (query decomposition, result composition), which sends sub queries through two wrappers to DB1 and DB2]
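Query decomposition and result composition in virtual integration can be sketched as follows. The two in-memory "databases", their local column names and the global schema (`name`, `department`) are invented for this sketch, and composition is a plain union.

```python
class Wrapper:
    """Hypothetical wrapper: maps a source's local schema to the global schema."""

    def __init__(self, rows, column_map):
        self.rows = rows
        self.column_map = column_map  # local column name -> global column name

    def sub_query(self, column, value):
        """Answer an equality predicate phrased in the global schema."""
        for row in self.rows:
            mapped = {self.column_map[k]: v for k, v in row.items()}
            if mapped.get(column) == value:
                yield mapped


class Middleware:
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, column, value):
        result = []
        for wrapper in self.wrappers:
            # Decomposition: every source gets its own sub query.
            # Composition: the partial results are concatenated.
            result.extend(wrapper.sub_query(column, value))
        return result


# Two made-up sources with different local schemas.
db1 = Wrapper([{"nm": "John", "dept": "CS"}],
              {"nm": "name", "dept": "department"})
db2 = Wrapper([{"name": "Mary", "department": "CS"},
               {"name": "Ann", "department": "EE"}],
              {"name": "name", "department": "department"})

middleware = Middleware([db1, db2])
answer = middleware.query("department", "CS")
print(answer)
```

The caller sees one schema and one query interface and does not need to know which source a row came from.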
IBM Data Joiner
[Figure: an SQL Query enters Data Joiner, which sends sub queries through two wrappers to SQL DB1 and SQL DB2]
Adding XML
[Figure: a Query enters an XML Publishing layer on top of Middleware (SQL), which sends sub queries through two wrappers to DB1 and DB2]
XML Data Integration
[Figure: an XML Query enters Middleware (XML), which sends XML queries through two wrappers to DB1 and DB2]
XML Data Integration
• Example: BEA Liquid Data
• Advantage
– Availability of XML wrappers for all major
databases
• Problems
– XML - SQL mapping is very difficult
– XML is not always the right language
(e.g., decision support style queries)
Summary
• Middleware „looks“ like a homogeneous,
centralized database
– location transparency
– data model transparency
• Middleware provides global schema
– data sources map local schemas to global schema
• Various kinds of middleware (SQL, OQL, XML)
• „Stacks“ of middleware possible
• Data Cleaning requires special attention
A Note on Web Services
• Idea: Encapsulate Data Source
– provide WSDL interface to access data
– works very well if query pattern is known
• Problem: Exploit Capability of Source
– WSDL limits capabilities of data source; good optimization requires a „white box“
– example: access by id, access by name, full scan; should all combinations be listed in WSDL?
• Solution: WSDL for Query Planning
– Details ???