Download Slides - UCLA Computer Science

Load Shedding
CS240B notes
1
Load Shedding in a DSMS
• DSMS: online response on boundless and bursty data streams. How?
• By using approximations and synopses, and even
• shedding load when arrival rates become impossible
• Approximations and synopses are often used with normal load
• Shedding is used for bursty streams and overload situations.
2
QoS and Load Shedding
• When the input stream rate exceeds system capacity, a stream manager can shed load (tuples)
• Load shedding affects queries and their answers: drop the tasks and the tuples that will cause the least loss
• Introducing load shedding in a data stream manager is a challenging problem
• Random load shedding or semantic load shedding
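The difference between random and semantic shedding can be sketched as follows. This is a minimal illustration, not the mechanism of any particular DSMS; the function names and the `utility` callback are assumptions introduced here.

```python
import random

def random_shed(tuples, drop_probability, rng=None):
    """Random shedding: keep each tuple independently with
    probability 1 - drop_probability, ignoring its content."""
    rng = rng or random.Random()
    return [t for t in tuples if rng.random() >= drop_probability]

def semantic_shed(tuples, utility, keep_count):
    """Semantic shedding: keep the keep_count tuples with the highest
    utility (a hypothetical scoring function), preserving arrival order."""
    ranked = sorted(range(len(tuples)),
                    key=lambda i: utility(tuples[i]), reverse=True)
    keep = set(ranked[:keep_count])
    return [t for i, t in enumerate(tuples) if i in keep]
```

Random shedding needs no knowledge of the data, while semantic shedding needs a utility function that says which tuples matter most to the application.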
3
Problems to Address
• When to shed load
  – Overload should be detected quickly
• Where to shed load
  – Avoid wasted work
  – Upstream drop vs. downstream drop
• How much to shed
  – The magnitude of the drop
• Which tuples to shed
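A back-of-the-envelope sketch of "how much" and "where", under simplifying assumptions made here for illustration: load is arrival rate times per-tuple cost, the plan is a linear chain of operators, and every operator has selectivity 1. The function names are hypothetical.

```python
def required_drop_fraction(arrival_rate, cost_per_tuple, capacity):
    """Fraction of tuples to drop so that load (rate * cost) fits capacity."""
    load = arrival_rate * cost_per_tuple
    if load <= capacity:
        return 0.0          # no overload, nothing to shed
    return 1.0 - capacity / load

def work_saved_per_dropped_tuple(operator_costs, drop_position):
    """Cost avoided when a tuple is dropped just before the operator at
    index drop_position in a linear chain (selectivity 1 assumed)."""
    return sum(operator_costs[drop_position:])
```

With operator costs `[1.0, 2.0, 4.0]`, dropping upstream (position 0) saves 7.0 units per tuple, while dropping before the last operator saves only 4.0; the work already spent on the first two operators is wasted.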
4
Loss-tolerance QoS function
Loss function is not linear:
5
Value-based QoS
• Shows which values of the output tuple space are most important.
• Example: in a medical application that monitors patient heartbeats, extreme values are certainly more interesting than normal ones, so they are given a correspondingly higher utility.
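A value-based utility function for the heartbeat example might look like the sketch below. The normal range (60 to 100 bpm), the 40 bpm span, and the function names are illustrative assumptions, not values from the notes.

```python
def heartbeat_utility(bpm, normal_low=60.0, normal_high=100.0, span=40.0):
    """Utility in [0, 1]: 0 inside the (assumed) normal range, rising
    linearly to 1 as the reading moves `span` bpm outside it."""
    if normal_low <= bpm <= normal_high:
        return 0.0
    distance = normal_low - bpm if bpm < normal_low else bpm - normal_high
    return min(1.0, distance / span)

def shed_low_utility(readings, min_utility):
    """Drop readings whose utility falls below min_utility."""
    return [r for r in readings if heartbeat_utility(r) >= min_utility]
```

For example, `shed_low_utility([72, 130, 45, 88, 180], 0.25)` keeps only the three readings outside the normal range.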
6
Load Shedding in Aurora
• QoS for each application as a function relating output to its utility
  – Delay based, drop based, value based
• Techniques for introducing load shedding operators in a plan such that QoS is disrupted the least
  – Determining when, where and how much load to shed
7
Which Query to drop First?
Models and algorithms proposed include:
• Greedy algorithms or the Fractional Knapsack Problem
• Other OR techniques
• Must deal with nonlinearities
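The fractional-knapsack view can be sketched as a greedy plan: shed first from the queries that lose the least utility per unit of load removed. This sketch assumes utility loss is linear in the load shed, which is exactly the simplification that the nonlinearity caveat warns about; the tuple layout and function name are assumptions.

```python
def greedy_shedding_plan(queries, excess_load):
    """queries: list of (name, load, utility_loss_if_fully_dropped),
    with load > 0. Returns {name: fraction_of_that_query's_load_shed}."""
    plan = {}
    remaining = excess_load
    # Cheapest loss per unit of load first (fractional knapsack greedy).
    for name, load, loss in sorted(queries, key=lambda q: q[2] / q[1]):
        if remaining <= 0:
            break
        shed = min(load, remaining)
        plan[name] = shed / load
        remaining -= shed
    return plan
```

With queries `[("q1", 4.0, 8.0), ("q2", 2.0, 1.0), ("q3", 4.0, 4.0)]` and an excess load of 4.0, the plan drops q2 entirely, half of q3, and leaves q1 untouched.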
8
Load Shedding in STREAM
Formulate load shedding as an optimization
problem for multiple sliding window aggregate
queries
– Minimize inaccuracy in answers subject to
output rate matching or exceeding arrival
rate
Consider placement of load shedding
operators in query plan
– Each operator sheds load uniformly with probability p_i
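To see why uniform shedding still supports SUM-style aggregates, here is a minimal sketch of a single shedder. It is written in terms of the probability `p` of keeping a tuple (that is, one minus the drop probability), and the kept sum is scaled by 1/p to obtain an unbiased estimate. The function name is an assumption, and this is not the STREAM implementation.

```python
import random

def sampled_sum(values, p, rng=None):
    """Keep each value with probability p (0 < p <= 1), then scale the
    sum of the kept values by 1/p to estimate the full SUM."""
    rng = rng or random.Random()
    kept = [v for v in values if rng.random() < p]
    return sum(kept) / p
```

A COUNT can be estimated the same way by summing 1 for each kept tuple. The smaller `p` is, the larger the variance of the estimate, which is the inaccuracy the optimization tries to minimize.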
9
Window-Oriented Load Shedding
• Input stream divided into windows of size w
• Use fewer slides per window to compute aggregates; tumbling windows are the extreme case.
• Window-based sampling
• Reservoir sampling for incoming tuples
• Expiring tuples pose a more difficult problem.
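For reference, classic reservoir sampling (Algorithm R) over incoming tuples can be sketched as below. It keeps a uniform sample of size k from everything seen so far and does not handle expiring tuples, which is the harder case noted above.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable
    of unknown length, using O(k) memory."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(n)        # uniform integer in [0, n)
            if j < k:
                reservoir[j] = item     # replace with probability k/n
    return reservoir
```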
10
Load Shedding by Sampling for
Continuous Aggregate Queries on Data Streams:
• Only random samples are available for computing aggregate queries because of:
  – Limitations of remote sensors or transmission lines
  – Load shedding policies implemented when overloads occur
• When overloads occur (e.g., due to a burst of arrivals) we can
  1. drop queries altogether, or
  2. sample the input, which is much preferable
• Key objective: achieve answer accuracy with sparse samples for complex aggregates on windows
• Can we improve answer accuracy with minimal overhead?
11
Load Shedding
To cope with bursty arrivals of high-volume
data
DSMS has to shed load while minimizing the
degradation of the Quality of Service
(QoS)
The goal then becomes determining:
when, where and how much load to shed
An intelligent scheme can improve the
quality of our mining results under bursty
arrivals
12
A first Architecture
• Basic idea [BDM04]:
  – Optimize sampling rates of load shedders for accurate answers.
  – Find an error bound for each aggregate query.
  – Determine sampling rates that minimize query inaccuracy within the limits imposed by resource constraints.
• This approach works for SUM and COUNT
• Generalization to other functions?
[Figure: query network. Data streams S1, ..., Sn feed load shedders, query operators, and aggregates (∑).]
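One standard way to turn a sample size into an error bound is Hoeffding's inequality, sketched below for the mean of bounded values. It is shown only as an illustration of the "error bound per aggregate query" step; the function names and the choice of bound are assumptions here, not a transcription of the method in [BDM04].

```python
import math

def hoeffding_error(n, value_range, delta):
    """Half-width t such that the mean of n sampled values (each within
    an interval of width value_range) deviates from the true mean by
    more than t with probability at most delta."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def samples_needed(t, value_range, delta):
    """Smallest n for which hoeffding_error(n, ...) is at most t."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * t ** 2))
```

Given a window of N tuples, a required sample size n translates into a sampling rate of roughly n/N, which is the quantity the resource constraints limit.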
13
Query Network: arbitrary placement of aggregates
and shedder after any aggregate
[Figure: query network with data streams S1 and Sn, load shedders L1, L2, L4, L5, and aggregate operators Q1 to Q5.]
14
Generalized Load Shedding in Stream Mill
1. A general framework that achieves optimal load shedding
policies, while accommodating:
   – Different requirements for different users, different query
     sensitivities, and different penalties.
2. Applicability to a wide spectrum of aggregate functions:
   – We have formally characterized these using a new notion, called
     reciprocal-error queries.
3. Proposing an extensible architecture that allows UDAs to
benefit from the system-provided load shedding functions.
4. Significant improvements (in absolute error, false positives, and
false negatives) compared to the common uniform approach.
5. We propose an efficient (linear-time) algorithm to handle
severe overloads without losing optimality.
15
Goals to Achieve
Light-weight overhead handling
React to overload immediately
Minimizing QoS degradation
Delivering subset results
Only omitting tuples from the correct answer
Never produce incorrect answers
16
Prediction & Improvements
• A larger class of queries was considered in [LZ08]
  – SUM, COUNT, AVG, quantiles.
• Temporal correlation between answers can be used to improve the answer
  – Example: sensor data
• The current answer can be adjusted by the past answers so that:
  – Low sampling rate → current answer less accurate → more dependent on history.
  – High sampling rate → current answer more accurate → less dependent on history.
• A Bayesian quality enhancement module can achieve this objective automatically and reduce the uncertainty of the approximate answers.
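The trade-off between the current answer and its history can be illustrated with the textbook Gaussian update, in which each source is weighted by its precision (inverse variance). This is an assumed stand-in for illustration only; the notes do not give the exact model used in [LZ08], and all names below are hypothetical.

```python
def combine_with_history(observed, observed_var, history_mean, history_var):
    """Precision-weighted combination of the sampled answer with a
    prediction from history. Both variances must be positive.
    Returns (improved_answer, variance_of_improved_answer)."""
    w_obs = 1.0 / observed_var     # low sampling rate -> large variance -> small weight
    w_hist = 1.0 / history_var
    mean = (w_obs * observed + w_hist * history_mean) / (w_obs + w_hist)
    var = 1.0 / (w_obs + w_hist)   # never larger than either input variance
    return mean, var
```

With equal variances the result is the midpoint; as the sampling rate drops and `observed_var` grows, the improved answer moves toward the history.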
17
Improved Model Using History
• The observed answer Ã is computed from random samples of the complete stream with sampling rate P.
• A Bayesian method to obtain the improved answer by combining:
  – the observed answer Ã
  – the error model
  – history of the answer
[Figure: data streams S1, ..., Sn pass through load shedders and the query network (query operators and aggregates ∑) to produce the observed answer Ã; the quality enhancement module combines Ã with the history to output the improved answer.]
18
Summary
• An error model that
  – works for order statistics and data mining functions as well as traditional aggregates
  – is computationally very efficient
• A Bayesian quality enhancement method for approximate aggregates in the presence of sampling.
• No correction is applied when concept changes are suspected; a two-sample test is used to detect suspected changes.
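The notes do not say which two-sample test is used. As one possible illustration, the sketch below computes Welch's t statistic between recent answers and historical answers and flags a suspected change when it exceeds a threshold. The threshold of 3.0 and the function names are assumptions.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples (each needs at least two
    values, and the samples must not both have zero variance)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    se = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / se

def change_suspected(recent, history, threshold=3.0):
    """True when recent answers differ from history by more than
    `threshold` standard errors (an illustrative cutoff)."""
    return abs(welch_t(recent, history)) > threshold
```

When `change_suspected` returns True, the history is no longer a reliable prior, so the observed answer would be reported without correction.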
19
References: Sampling and load shedding
[Tatbul03] Nesime Tatbul, Ugur Cetintemel, Stanley B. Zdonik, Mitch Cherniack, Michael Stonebraker: Load Shedding in a Data Stream Manager. VLDB 2003: 309-320.
[BDM04] Brian Babcock, Mayur Datar, Rajeev Motwani: Load Shedding for Aggregation Queries over Data Streams. ICDE 2004: 350-361.
[Tatbul07] Nesime Tatbul, Ugur Cetintemel, Stanley B. Zdonik: Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing. VLDB 2007: 159-170.
[LZ08] Yan-Nei Law, Carlo Zaniolo: Improving the Accuracy of Continuous Aggregates and Mining Queries on Data Streams under Load Shedding. International Journal of Business Intelligence and Data Mining, 2008.
[ICDE 2010] Barzan Mozafari, Carlo Zaniolo: Optimal Load Shedding with Aggregates and Mining Queries. Proceedings of the 26th International Conference on Data Engineering (ICDE 2010), Long Beach, California, USA, March 1-6, 2010.
20