TRAC: Toward Recency And Consistency Reporting in a Database with Distributed Data Sources
Jiansheng Huang
Jeffrey F. Naughton
Miron Livny
Motivating Scenario
In a distributed monitoring system, autonomous nodes report in at unpredictable intervals, so the state captured at the central site is always out of date and inconsistent.
Specific Scenario
- A cluster of machines.
- A job can be submitted to any node.
- The submit node may schedule a job to be run on another machine.
- The state is captured in a centralized database.
An Example
[Figure: job j is submitted to m1, which schedules it to run on m2. At m1: “Job j received here”. At m2: “Job j running here”. At the central site, the state of m1 moves from “No info about job j” to “Job j received here” to “Job j received here & scheduled to run on m2”, and the state of m2 moves from “No info about job j” to “Job j is running here”.]
Enforcing Consistency
- Option 1: Do everything in distributed transactions.
  - Won’t scale to large systems.
  - At odds with the autonomous nature of nodes.
- Option 2: Only present the latest consistent snapshot.
  - Can give the user very out-of-date information.
Problem Addressed
- Question: how can we help users cope with inconsistencies in collected data while retaining the scalability and autonomy of the system?
- Our answer: instead of enforcing consistency, allow inconsistency and help users interpret what they see.
- Issue: how to do so efficiently and without swamping the user with too much irrelevant information?
Reporting Recency
A user asks: “Has the machine on which m1 scheduled job j started running it?”
State of m1: Job j received here & scheduled to run on m2
State of m2: No info about job j
… 9998 more ……
Information system at central site
Answer without recency reporting: NOT YET
Naïve way to report recency: m1 last reported in at …, m2 last reported in at …, m3 last reported in at …, ……, m10000 last reported in at …
Our idea is to only report: m1 last reported in at 09/12/2006 15:20, m2 last reported in at 09/11/2006 09:30, and nothing else.
Roadmap
- Background
- Definitions and Techniques
- Prototype and Evaluation
- Conclusion and Future Work
Terminology
Data source: an abstraction for a node being monitored.
[Figure: a data source S streaming facts into an RDBMS]
Recency timestamp: the most recent time a data source reported in.
Goals
- Completeness: all “relevant” data sources are in the report.
- Precision: reduce the number of “irrelevant” data sources included in the report using efficient techniques.
Schema Model
- lastReport table (source id, recency ts)
- Other relevant relations (c1, c2, …, source id)
- Assumption: updates from a data source can only change tuples that carry its own source id in the data source field.
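The schema model can be tried out with a small in-memory database. The following is a minimal sketch, not the prototype's code: it uses Python's sqlite3 instead of PostgreSQL, the column names `source_id` and `recency_ts` and the helper `report_in` are hypothetical, and timestamps are written in ISO format so that they sort as text.

```python
import sqlite3

# Hypothetical column names; the slide only lists "source id" and "recency ts".
SCHEMA = """
CREATE TABLE lastReport (source_id TEXT PRIMARY KEY, recency_ts TEXT NOT NULL);
CREATE TABLE Activity (mach_id TEXT, value TEXT, event_time TEXT);
"""

def report_in(conn, source_id, ts, rows):
    """Apply one report from `source_id`, then record its recency timestamp.

    Every inserted row is stamped with the reporting source's own id, which
    mirrors the assumption that a source only changes its own tuples.
    `rows` is a list of (value, event_time) pairs.
    """
    with conn:  # one local transaction per report
        conn.executemany(
            "INSERT INTO Activity (mach_id, value, event_time) VALUES (?, ?, ?)",
            [(source_id, value, event_time) for value, event_time in rows],
        )
        conn.execute(
            "INSERT OR REPLACE INTO lastReport (source_id, recency_ts) VALUES (?, ?)",
            (source_id, ts),
        )

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
report_in(conn, "m1", "2006-03-11 20:37:46", [("idle", "2006-03-11 20:37:46")])
report_in(conn, "m1", "2006-03-12 08:00:00", [])  # a report with no new facts
```

After the second call, lastReport holds a single row for m1 with the later timestamp, even though that report carried no new facts.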
Definitions
Definition 1. If Q references a single relation R, we say that a data source s is relevant for Q if there exists a potential tuple t for R from s such that t satisfies Q’s predicates.
Definition 2. For a query Q referencing relations R1, R2, …, Rn, we say a data source s is relevant for Q if there exist j and a potential tuple tj for Rj from s, and for any k ≠ j there exists a tuple tk for Rk, such that these tuples together satisfy Q’s predicates. In this case we say that s is relevant for Q via Rj.
Theorem 1. No single update from an irrelevant data source can change the result of a query.
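The single-relation definition can be checked by brute force over a small finite domain. This is only an illustrative sketch: the function `relevant_sources`, the two-attribute tuple shape, and the value domain are assumptions made here, and a real system would reason about the predicates rather than enumerate tuples.

```python
def relevant_sources(predicate, sources, other_values):
    """Brute-force reading of the single-relation definition.

    A source s is relevant if some potential tuple (s, v), with v drawn
    from `other_values`, satisfies `predicate`.
    """
    return {s for s in sources if any(predicate(s, v) for v in other_values)}

# Predicate of a selection query: mach_id IN ('m1', 'm2') AND value = 'idle'
def q1(mach_id, value):
    return mach_id in ("m1", "m2") and value == "idle"

print(sorted(relevant_sources(q1, ["m1", "m2", "m3"], ["idle", "busy"])))
# prints ['m1', 'm2']
```

Note that m2 counts as relevant whatever it currently reports: the definition asks about potential tuples, not stored ones.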
Example 1
Suppose we keep track of machine activities in a table called Activity. The attributes of Activity are mach_id, activity value and the time when an activity value becomes valid. We treat the machine ID as the data source column.
Activity(mach_id, value, event_time)
Table 1: An example instance for Activity
Mach_id   Value   Event_time
M1        Idle    03/11/2006 20:37:46
M2        Busy    02/10/2006 18:22:01
M3        Idle    03/12/2006 10:23:05
SELECT mach_id FROM Activity
WHERE mach_id IN ('m1', 'm2') AND value = 'idle';
The query result is {‘m1’}. The set of relevant data sources for the query is {‘m1’, ‘m2’}.
Example 2
Consider a P2P system where we use Routing to capture neighboring relationships. Mach_id is treated as the data source column. Activity is the same as in Example 1.
Routing(mach_id, neighbor, event_time)
Table 2: A sample instance for Routing
Mach_id   Neighbor   Event_time
M1        M3         03/12/2006 23:20
M2        M3         02/10/2006 03:34
Table 3: An example instance for Activity
Mach_id   Value   Event_time
M1        Idle    03/11/2006 20:37
M2        Busy    02/10/2006 18:22
M3        Idle    03/12/2006 10:23
SELECT A.mach_id FROM Routing R, Activity A
WHERE R.mach_id = 'm1' AND
R.neighbor = A.mach_id AND
A.value = 'idle';
The query result is {‘m3’}. The set of data sources relevant via R is {‘m1’}, and the set of data sources relevant via A is {‘m3’}.
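The join example can be reproduced with an in-memory database. The sketch below uses Python's sqlite3 rather than PostgreSQL, lower-cases the machine ids and values so that they match the query literals, and derives the two relevance sets by hand from the multi-relation definition; these hand-written checks are an assumption of this sketch, not the recency query the system generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Routing  (mach_id TEXT, neighbor TEXT, event_time TEXT);
CREATE TABLE Activity (mach_id TEXT, value TEXT, event_time TEXT);
INSERT INTO Routing  VALUES ('m1', 'm3', '2006-03-12 23:20'),
                            ('m2', 'm3', '2006-02-10 03:34');
INSERT INTO Activity VALUES ('m1', 'idle', '2006-03-11 20:37'),
                            ('m2', 'busy', '2006-02-10 18:22'),
                            ('m3', 'idle', '2006-03-12 10:23');
""")

result = conn.execute("""
    SELECT A.mach_id FROM Routing R, Activity A
    WHERE R.mach_id = 'm1' AND R.neighbor = A.mach_id AND A.value = 'idle'
""").fetchall()

# Relevant via Routing: a potential Routing tuple from s needs mach_id = 'm1',
# so s = 'm1', and it may name any neighbor; it only needs some existing
# idle Activity tuple to join with.
has_idle = conn.execute(
    "SELECT 1 FROM Activity WHERE value = 'idle' LIMIT 1").fetchone()
via_r = {"m1"} if has_idle else set()

# Relevant via Activity: a potential Activity tuple from s may be idle, so s
# only needs to appear as a neighbor of 'm1' in an existing Routing tuple.
via_a = {row[0] for row in conn.execute(
    "SELECT DISTINCT neighbor FROM Routing WHERE mach_id = 'm1'")}

print(result, via_r, via_a)  # prints [('m3',)] {'m1'} {'m3'}
```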
The Focused Method
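[Figure: the system analyzes the user query into query parts, generates a recency query from them, and evaluates the recency query against lastReport to produce the recency report; the user query itself is evaluated to produce the query result.]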
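For a selection on the data source column, the generate and evaluate steps might look like the sketch below. It is a simplification under stated assumptions: the analysis step is skipped (the caller passes the source ids found in the predicate), the functions `recency_query` and `run_with_recency` are hypothetical names, sqlite3 stands in for PostgreSQL, and the recency timestamps are made up.

```python
import sqlite3

def recency_query(source_ids):
    """Build (sql, params) selecting the recency rows of the relevant sources.

    source_ids=None means the user query does not restrict the source
    column, so every data source is treated as relevant.
    """
    if source_ids is None:
        return "SELECT source_id, recency_ts FROM lastReport ORDER BY source_id", []
    marks = ", ".join("?" for _ in source_ids)
    sql = ("SELECT source_id, recency_ts FROM lastReport "
           f"WHERE source_id IN ({marks}) ORDER BY source_id")
    return sql, list(source_ids)

def run_with_recency(conn, user_sql, user_params, source_ids):
    """Evaluate the user query and its recency query side by side."""
    result = conn.execute(user_sql, user_params).fetchall()
    sql, params = recency_query(source_ids)
    return result, conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lastReport (source_id TEXT PRIMARY KEY, recency_ts TEXT);
CREATE TABLE Activity (mach_id TEXT, value TEXT, event_time TEXT);
INSERT INTO lastReport VALUES ('m1', '2006-03-11 20:37:46'),
                              ('m2', '2006-02-10 18:22:01'),
                              ('m3', '2006-03-12 10:23:05');
INSERT INTO Activity VALUES ('m1', 'idle', '2006-03-11 20:37:46'),
                            ('m2', 'busy', '2006-02-10 18:22:01'),
                            ('m3', 'idle', '2006-03-12 10:23:05');
""")
result, report = run_with_recency(
    conn,
    "SELECT mach_id FROM Activity WHERE mach_id IN (?, ?) AND value = ?",
    ["m1", "m2", "idle"],
    ["m1", "m2"],
)
```

Here `result` contains only m1, while `report` lists the recency timestamps of both m1 and m2 and leaves m3 out.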
Recency Reporting Prototype
- PL/pgSQL table function recencyReport: accepts a user query, evaluates it and reports recency information.
- Usage spec: SELECT * FROM recencyReport($$SQL TEXT$$);
- Recency information includes:
  - A temporary table name for exceptional relevant data sources and their recency timestamps
  - Another temporary table name for the other relevant data sources and their recency timestamps
  - The least recent data source and its recency timestamp
  - The most recent data source and its recency timestamp
  - Bound of inconsistency
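The summary part of the recency information can be sketched in a few lines. The function `summarize` and its dictionary input are hypothetical, and the bound of inconsistency is taken here to be the gap between the most and least recent timestamps, which is an assumption about how the prototype defines it.

```python
from datetime import datetime

def summarize(recency):
    """Summarize recency timestamps given as {source_id: datetime}."""
    oldest = min(recency, key=recency.get)
    newest = max(recency, key=recency.get)
    return {
        "least_recent": (oldest, recency[oldest]),
        "most_recent": (newest, recency[newest]),
        # Assumed definition: width of the recency range.
        "bound": recency[newest] - recency[oldest],
    }

summary = summarize({
    "m1": datetime(2006, 9, 12, 15, 20),
    "m2": datetime(2006, 9, 11, 9, 30),
})
print(summary["bound"])  # prints 1 day, 5:50:00
```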
Goals of Experiments
- Our approach raises many questions:
  - Is it expensive to analyze user queries and generate focused recency queries?
  - Is it inefficient to evaluate the recency query in addition to each user query?
  - Does the focused recency query really succeed in reducing the number of irrelevant data sources in the recency report?
- Our experiments are an attempt to begin to answer these questions empirically.
Methods Evaluated
- Naïve method: the recency query simply returns all data sources as being relevant.
- Focused method: with automatic generation of the recency query.
- A variant of the Focused method: with hardcoding of the recency query.
Evaluation Metrics
- False positive rate: the number of irrelevant data sources reported, relative to the number of relevant data sources.
- Response time overhead: the extra time spent on the additional recency reporting.
Schema and Data
- lastReport, Activity, Routing as in earlier examples.
- Synthetic data: total row count of Activity fixed at 10,000,000. Vary the number of data sources and the number of rows per data source (data ratio) inversely.
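The inverse relationship between the number of data sources and the data ratio can be made concrete with a short sketch. The helper `configurations` and the particular source counts are hypothetical; the slides do not list the exact values used in the experiments.

```python
TOTAL_ROWS = 10_000_000  # fixed row count of Activity, as stated on the slide

def configurations(source_counts):
    """Yield (number of data sources, rows per data source) with a fixed product."""
    for n in source_counts:
        if TOTAL_ROWS % n:
            raise ValueError(f"{n} does not divide {TOTAL_ROWS}")
        yield n, TOTAL_ROWS // n

# Hypothetical source counts chosen for illustration only.
print(list(configurations([10, 1000, 100000])))
# prints [(10, 1000000), (1000, 10000), (100000, 100)]
```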
Test Queries
- Selection queries and two-way join queries are measured.
- Q3: joins Routing and Activity with a very selective predicate on Routing.
- Q4: similar to Q3, but with a non-selective predicate on Routing.
Overhead Comparisons: Q3
Figure 1: Q3’s performance overhead for recency and consistency reporting w.r.t. data ratio and # of data sources ((data ratio) × (# of data sources) = 10,000,000).
High Overhead Region, Q3
Figure 2: Response times for Q3 with and without recency report w.r.t. data ratio and # of data sources ((data ratio) × (# of data sources) = 10,000,000). The Focused method with auto generation of recency query is used here.
Overhead Comparisons, Q4
Figure 3: Q4’s performance overhead for recency and consistency reporting w.r.t. data ratio and # of data sources ((data ratio) × (# of data sources) = 10,000,000).
Performance Evaluation Summary
- The overhead for analyzing a user query and generating a recency query is insignificant.
- The overhead for evaluating a recency query is insignificant, and the focused method costs no more than the naïve method, unless all of the following hold:
  - The data ratio is very low, and
  - the query has a join, and
  - the query is not selective on data sources.
False Positive Rates
- Naïve method: depends on the exact number of data sources; assuming 100,000 for illustration:
  - Q3: fpr = (100000 - 6)/6 ≈ 16665.7
  - Q4: fpr = 6/(100000 - 6) ≈ 0.00006
- Focused methods: all 0, because the precise sets of relevant data sources are found.
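The arithmetic behind these rates can be replayed with sets. This sketch assumes, as the slide does, 100,000 data sources of which six are named in the query predicates; the machine names follow the Tao naming used in the test queries, and `false_positive_rate` is a hypothetical helper.

```python
def false_positive_rate(reported, relevant):
    """Irrelevant sources reported, divided by the number of relevant sources."""
    return len(reported - relevant) / len(relevant)

all_sources = {f"Tao{i}" for i in range(1, 100001)}
named = {"Tao1", "Tao10", "Tao100", "Tao1000", "Tao10000", "Tao100000"}

# The naive method reports every source, whatever the query.
naive_q3 = false_positive_rate(all_sources, named)                # 99994 / 6
naive_q4 = false_positive_rate(all_sources, all_sources - named)  # 6 / 99994
# A method that reports exactly the relevant set has no false positives.
focused_q3 = false_positive_rate(named, named)
```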
Conclusion
- In a large-scale asynchronous system, reporting recency, rather than enforcing it, is a viable solution.
- Defining “relevance” is non-trivial. Our solution: a data source is relevant if a single update from it can change the result of a query.
- Evaluation on our prototype showed that our methods incur insignificant overhead in most cases and are more precise than the naïve method.
Future Work
- Key constraints
- Maintenance cost
- Other definitions of relevance
  - A data source is “relevant” if N updates from it may change the query result.
  - A data source is “relevant” if a sequence of updates from it may change the query result.
Backup Slides
Another Example [1]
1. http://www.cs.wisc.edu/condor/map
Example 3
Table 1: Another sample instance for Activity, same Routing table instance as in Example 2
Mach_id   Value   Event_time
M1        Busy    03/11/2006 20:37:46
M2        Busy    02/10/2006 18:22:01
M3        Busy    03/12/2006 10:23:05
SELECT A.mach_id FROM Routing R, Activity A
WHERE R.mach_id = 'm1' AND A.value = 'idle' AND
R.neighbor = A.mach_id;
The query result is empty. The set of data sources relevant via R is Ø, and the set of data sources relevant via A is {‘m3’}. Therefore m1 is not relevant, but two updates from m1 will change the query result: 1) m1 is updated to ‘idle’ in Activity, and 2) m1 is added as a neighbor of m1 itself in Routing.
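The effect of the two updates can be replayed on an in-memory database. This is a sketch under assumptions: sqlite3 replaces PostgreSQL, the event_time column is omitted, Activity is modeled as one row per machine, the ids and values are lower-cased to match the query literals, and the Routing rows are the ones from Example 2.

```python
import sqlite3

QUERY = """
    SELECT A.mach_id FROM Routing R, Activity A
    WHERE R.mach_id = 'm1' AND A.value = 'idle' AND R.neighbor = A.mach_id
"""
UPDATE_1 = "UPDATE Activity SET value = 'idle' WHERE mach_id = 'm1'"
UPDATE_2 = "INSERT INTO Routing VALUES ('m1', 'm1')"

def fresh_db():
    """Create the Example 3 instance: every machine is busy."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Routing  (mach_id TEXT, neighbor TEXT);
        CREATE TABLE Activity (mach_id TEXT PRIMARY KEY, value TEXT);
        INSERT INTO Routing  VALUES ('m1', 'm3'), ('m2', 'm3');
        INSERT INTO Activity VALUES ('m1', 'busy'), ('m2', 'busy'), ('m3', 'busy');
    """)
    return conn

def result_after(updates):
    """Apply the given updates to a fresh instance and run the query."""
    conn = fresh_db()
    for sql in updates:
        conn.execute(sql)
    return [row[0] for row in conn.execute(QUERY)]

outcomes = {
    "none": result_after([]),
    "update 1 only": result_after([UPDATE_1]),
    "update 2 only": result_after([UPDATE_2]),
    "both": result_after([UPDATE_1, UPDATE_2]),
}
print(outcomes)
# prints {'none': [], 'update 1 only': [], 'update 2 only': [], 'both': ['m1']}
```

Either update alone leaves the result empty, which is consistent with m1 not being relevant under the single-update definition; only the pair changes the result.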
Reporting recency
- Recency information of relevant data sources is stored in a session-duration temporary table.
- Even the number of relevant sources may be large, so we provide additional summary information: the minimum recency timestamp, the maximum recency timestamp, and the range of recency.
- Since machine failures are common in Condor, exceptional data sources are reported using z-score outlier detection.
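One way to flag exceptional data sources with z-scores is sketched below. The threshold of 3, the two-sided test, the use of the population standard deviation, and the representation of recency timestamps as seconds are all assumptions of this sketch; the slides do not say how the prototype parameterizes the detection.

```python
from statistics import mean, pstdev

def exceptional_sources(recency_seconds, threshold=3.0):
    """Return ids whose recency timestamp is a z-score outlier.

    recency_seconds maps source id -> last report time in seconds.
    """
    values = list(recency_seconds.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # all sources reported at the same time
        return []
    return sorted(s for s, v in recency_seconds.items()
                  if abs(v - mu) / sigma > threshold)

# Twenty machines reported recently; one has been silent for a long time.
recency = {f"m{i}": 1000.0 for i in range(1, 21)}
recency["dead"] = 0.0
print(exceptional_sources(recency))  # prints ['dead']
```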
Experiment Setup
- System: Tao Linux 1.0, 2.4 GHz Intel, 512MB memory
- Database: PostgreSQL 8.0.0
- Shared buffer pool: 8MB
- Working memory size: 1MB
- Each query is run 11 times, and the average time of the last 10 runs is used to minimize fluctuation.
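The timing protocol can be sketched as a small harness. The function `average_warm_time` is a hypothetical helper, not the authors' measurement script; it times any callable, discards the first run, and averages the rest.

```python
import time

def average_warm_time(run, repeats=11):
    """Call `run` `repeats` times and average all timings except the first."""
    if repeats < 2:
        raise ValueError("need at least two runs")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    warm = timings[1:]  # drop the first run, keep the last repeats - 1
    return sum(warm) / len(warm)

calls = []
elapsed = average_warm_time(lambda: calls.append(1))
```

In practice `run` would execute one test query against the database; the lambda here only counts invocations.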
Test Queries
Q1: SELECT COUNT(*) FROM Activity A
    WHERE A.mach_id IN ('Tao1', 'Tao10', 'Tao100', 'Tao1000', 'Tao10000', 'Tao100000')
    AND A.value = 'idle';
Q2: SELECT COUNT(*) FROM Activity A
    WHERE A.mach_id NOT IN ('Tao1', 'Tao10', 'Tao100', 'Tao1000', 'Tao10000', 'Tao100000')
    AND A.value = 'idle';
Q3: SELECT COUNT(*) FROM Routing R, Activity A
    WHERE R.mach_id IN ('Tao1', 'Tao10', 'Tao100', 'Tao1000', 'Tao10000', 'Tao100000')
    AND R.neighbor = A.mach_id AND A.value = 'idle';
Q4: SELECT COUNT(*) FROM Routing R, Activity A
    WHERE R.mach_id NOT IN ('Tao1', 'Tao10', 'Tao100', 'Tao1000', 'Tao10000', 'Tao100000')
    AND R.neighbor = A.mach_id AND A.value = 'idle';
Roadmap
- Background
- Definitions and Techniques
- Prototype and Evaluation
- Conclusion and Future Work
- Related Work
Enforcing currency and consistency
- R. Alonso et al. Quasi-copies: Efficient data sharing for information retrieval systems. In EDBT, pages 443-468, 1988.
- H. Garcia-Molina et al. Read-only transactions in a distributed database. ACM Trans. Database Syst., 7(2):209-234, 1982.
- R. Lenz. Adaptive distributed data management with weak consistent replicated data. In SAC, pages 178-185, 1996.
- A. Segev and W. Fang. Currency-based updates to distributed materialized views. In ICDE, pages 512-520, 1990.
- A. Labrinidis et al. Balancing performance and data freshness in web database servers. In VLDB, pages 393-404, 2003.
- L. Bright et al. Using latency-recency profiles for data delivery on the web. In VLDB, pages 550-561, 2002.
- H. Guo et al. Relaxed currency and consistency: How to say “good enough” in SQL. In SIGMOD Conference, pages 815-826, 2004.
A common theme here is to enforce recency constraints through a combination of choosing the correct version of an object to query (i.e., the cached or the primary copy) or refreshing “stale” objects by synchronously “pulling” new data in response to a query.
Data Lineage
- Y. Cui et al. Lineage tracing for general data warehouse transformations. In VLDB, pages 471-480, 2001.
Lineage tracing identifies the set of source data items that produced a view item. Our work is different in that even if a data source doesn’t contribute any lineage data items (possibly due to latency in reporting in or some error), it may still be “relevant”.
Distributed Query Processing
- R. Munz et al. Application of sub-predicate tests in database systems. In A. L. Furtado and H. L. Morgan, editors, VLDB, pages 426-435, 1979.
The problem statement relies on data items being placed in a distributed environment in such a way that they satisfy various predicates; the interaction of the data placement predicates and query predicates is then used to identify where data satisfying a simple query might be located.
Partitioning Pruning
- D. J. DeWitt et al. Gamma - a high performance dataflow database machine. In VLDB, pages 228-237, 1986.
- IBM. DB2 UDB for z/OS Version 8 Performance Topics, 2005.
- Oracle Corporation. Oracle Database Concepts, 10g Release 1, 2003.
These systems choose partitions by matching certain types of selection predicates with the “partitioning predicates”.