TRAC: Toward Recency And Consistency Reporting in a Database with Distributed Data Sources
Jiansheng Huang, Jeffrey F. Naughton, Miron Livny

Motivating Scenario
In a distributed monitoring system, autonomous nodes report in at unpredictable intervals, so the state captured at the central site is always out of date and inconsistent.

Specific Scenario
A cluster of machines.
A job can be submitted to any node.
The submit node may schedule a job to be run on another machine.
The state is captured in a centralized database.

An Example
[Figure: job j is submitted to m1 ("Job j received here"), and m1 schedules it on m2 ("Job j running here"). At the central site, the recorded state of m1 moves from "No info about job j" to "Job j received here & scheduled to run on m2", and the recorded state of m2 moves from "No info about job j" to "Job j is running here".]

Enforcing Consistency
Option 1: Do everything in distributed transactions. This won't scale to large systems and is at odds with the autonomous nature of nodes.
Option 2: Only present the latest consistent snapshot. This can give the user very out-of-date information.

Problem Addressed
Question: How can we help users cope with inconsistencies in collected data while retaining the scalability and autonomy of the system?
Our answer: Instead of enforcing consistency, allow inconsistency and help users interpret what they see.
Issue: How to do so efficiently and without swamping the user with too much irrelevant information?

Reporting Recency
A user asks: "Has the machine that m1 scheduled job j to run on started running it?"
Information system at central site:
State of m1: Job j received here & scheduled to run on m2
State of m2: No info about job j
… 9998 more …
Answer without recency reporting: NOT YET
Naïve way to report recency: m1 last reported in at …, m2 last reported in at …, m3 last reported in at …, …, m10000 last reported in at …
Our idea is to report only: m1 last reported in at 09/12/2006 15:20, m2 last reported in at 09/11/2006 09:30, and nothing else.

Roadmap
Background
Definitions and Techniques
Prototype and Evaluation
Conclusion and Future Work

Terminology
Data source: an abstraction for a node being monitored. [Figure: data source S streaming facts to the RDBMS.]
Recency timestamp: the most recent time a data source reported in.

Goals
Completeness: all "relevant" data sources are in the report.
Precision: reduce the number of "irrelevant" data sources included in the report, using efficient techniques.

Schema Model
lastReport table: (source id, recency ts)
Other relevant relations: (c1, c2, …, source id)
Assumption: updates from a data source can only change tuples that carry its own source id in the data source field.

Definitions
Definition 1. If Q references a single relation R, we say a data source s is relevant for Q if there exists a potential tuple t for R from s such that t satisfies Q's predicates.
Definition 2. For a query Q referencing relations R1, R2, …, Rn, we say a data source s is relevant if there exist a j and a potential tuple tj for Rj from s, and for any k ≠ j there exists a tuple tk for Rk, such that these tuples together satisfy Q's predicates. In this case we say that s is relevant for Q via Rj.
Theorem 1. No single update from an irrelevant data source can change the result of a query.

Example 1
Suppose we keep track of machine activities in a table called Activity. The attributes of Activity are mach_id, the activity value, and the time when an activity value becomes valid. We treat the machine ID as the data source column.

Activity(mach_id, value, event_time)

Table 1: An example instance for Activity
Mach_id   Value   Event_time
M1        Idle    03/11/2006 20:37:46
M2        Busy    02/10/2006 18:22:01
M3        Idle    03/12/2006 10:23:05

    SELECT mach_id FROM Activity
    WHERE mach_id IN ('m1', 'm2') AND value = 'idle';

The query result is {'m1'}. The set of relevant data sources for the query is {'m1', 'm2'}.
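
One way to see why Example 1 reports both m1 and m2 is to replay it in a small script. The sketch below uses Python's built-in sqlite3 module rather than the PostgreSQL setup described later in the deck. The lastReport column names (source_id, recency_ts), the recency timestamps, and the hand-written recency query are illustrative assumptions, not the query that the authors' system generates.

```python
import sqlite3

# In-memory database standing in for the central site (assumption: the
# original system runs on PostgreSQL; sqlite3 is used only for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Activity (mach_id TEXT, value TEXT, event_time TEXT);
    CREATE TABLE lastReport (source_id TEXT PRIMARY KEY, recency_ts TEXT);
    -- Table 1 from Example 1, with timestamps kept as plain strings.
    INSERT INTO Activity VALUES
        ('m1', 'idle', '03/11/2006 20:37:46'),
        ('m2', 'busy', '02/10/2006 18:22:01'),
        ('m3', 'idle', '03/12/2006 10:23:05');
    -- Hypothetical recency timestamps; the slides do not give these.
    INSERT INTO lastReport VALUES
        ('m1', '03/11/2006 20:40:00'),
        ('m2', '02/10/2006 18:25:00'),
        ('m3', '03/12/2006 10:30:00');
""")

# The user query from Example 1.
user_query = (
    "SELECT mach_id FROM Activity "
    "WHERE mach_id IN ('m1', 'm2') AND value = 'idle'"
)

# Hand-written recency query (assumption). A potential tuple from source s has
# mach_id = s and may carry any value, so "value = 'idle'" can always be
# satisfied; only the predicate on the data source column restricts relevance.
recency_query = (
    "SELECT source_id, recency_ts FROM lastReport "
    "WHERE source_id IN ('m1', 'm2') ORDER BY source_id"
)

query_result = [row[0] for row in conn.execute(user_query)]
recency_report = conn.execute(recency_query).fetchall()
relevant = [source for source, _ in recency_report]

print("query result:", query_result)
print("relevant data sources:", relevant)
```

Here m2 is reported even though it contributes nothing to the current result, because a single update from m2 (switching to idle) would change the answer.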

Example 2
Consider a P2P system where we use Routing to capture neighboring relationships. Mach_id is treated as the data source column. Activity is the same as in Example 1.

Routing(mach_id, neighbor, event_time)

Table 2: A sample instance for Routing
Mach_id   Neighbor   Event_time
M1        M3         03/12/2006 23:20
M2        M3         02/10/2006 03:34

Table 3: An example instance for Activity
Mach_id   Value   Event_time
M1        Idle    03/11/2006 20:37
M2        Busy    02/10/2006 18:22
M3        Idle    03/12/2006 10:23

    SELECT A.mach_id FROM Routing R, Activity A
    WHERE R.mach_id = 'm1' AND R.neighbor = A.mach_id AND A.value = 'idle';

The query result is {'m3'}. The set of data sources relevant via R is {'m1'}; the set of data sources relevant via A is {'m3'}.

The Focused Method
[Figure: the system analyzes the user query into query parts and generates a recency query over lastReport. It evaluates the user query to produce the query result and evaluates the recency query to produce the recency report.]

Recency Reporting Prototype
PL/pgSQL table function recencyReport: accepts a user query, evaluates it, and reports recency information.
Usage spec: SELECT * FROM recencyReport($$SQL TEXT$$);
Recency information includes:
A temporary table name for exceptional relevant data sources and their recency timestamps
Another temporary table name for the other relevant data sources and their recency timestamps
The least recent data source and its recency timestamp
The most recent data source and its recency timestamp
Bound of inconsistency

Goals of Experiments
Our approach raises many questions:
Is it expensive to analyze user queries and generate focused recency queries?
Is it inefficient to evaluate the recency query in addition to each user query?
Does the focused recency query really succeed in reducing the number of irrelevant data sources in the recency report?
Our experiments are an attempt to begin to answer these questions empirically.
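
The summary fields listed for the recency report prototype (least recent source, most recent source, bound of inconsistency, exceptional sources) can be illustrated with a short Python sketch. It is not the PL/pgSQL function from the prototype. The function name summarize_recency, the z-score threshold of 2.0, most of the sample timestamps, and the reading of "bound of inconsistency" as the gap between the most and least recent timestamps are all assumptions.

```python
from datetime import datetime
from statistics import mean, pstdev


def summarize_recency(last_report, z_threshold=2.0):
    """Summarize a {source_id: datetime} map of recency timestamps.

    Hypothetical helper: returns the least and most recent sources, the gap
    between them, and the sources whose timestamps are unusually old.
    """
    least = min(last_report, key=last_report.get)
    most = max(last_report, key=last_report.get)
    bound = last_report[most] - last_report[least]

    # Work in seconds relative to the oldest timestamp.
    base = last_report[least]
    secs = {s: (ts - base).total_seconds() for s, ts in last_report.items()}
    mu = mean(secs.values())
    sigma = pstdev(secs.values())

    # A source is flagged as exceptional when its timestamp lies more than
    # z_threshold standard deviations below the mean (assumed rule).
    exceptional = sorted(
        s for s, v in secs.items() if sigma > 0 and (mu - v) / sigma > z_threshold
    )
    return {
        "least_recent": (least, last_report[least]),
        "most_recent": (most, last_report[most]),
        "bound": bound,
        "exceptional": exceptional,
    }


# Sample data: the m1 and m6 timestamps reuse the two values shown on the
# "Reporting Recency" slide; the rest are made up for illustration.
last_report = {
    "m1": datetime(2006, 9, 12, 15, 20),
    "m2": datetime(2006, 9, 12, 15, 19),
    "m3": datetime(2006, 9, 12, 15, 18),
    "m4": datetime(2006, 9, 12, 15, 17),
    "m5": datetime(2006, 9, 12, 15, 16),
    "m6": datetime(2006, 9, 11, 9, 30),
}

summary = summarize_recency(last_report)
print(summary)
```

With these values, m6 is both the least recent source and the only one flagged as exceptional, since it lags the others by more than a day.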

Methods Evaluated
Naïve method: the recency query simply returns all data sources as being relevant.
Focused method: with automatic generation of the recency query.
A variant of the Focused method: with hardcoding of the recency query.

Evaluation Metrics
False positive rate: the ratio of the number of irrelevant data sources reported to the number of relevant data sources.
Response time overhead: the additional time spent on recency reporting.

Schema and Data
lastReport, Activity, and Routing as in the earlier examples.
Synthetic data: the total row count of Activity is fixed at 10,000,000. The number of data sources and the number of rows per data source (the data ratio) are varied inversely.

Test Queries
Selection queries and two-way join queries are measured.
Q3: joins Routing and Activity with a very selective predicate on Routing.
Q4: similar to Q3, but with a non-selective predicate on Routing.

Overhead Comparisons: Q3
Figure 1: Q3's performance overhead for recency and consistency reporting w.r.t. data ratio and number of data sources ((data ratio) × (# of data sources) = 10,000,000).

High Overhead Region, Q3
Figure 2: Response times for Q3 with and without the recency report w.r.t. data ratio and number of data sources ((data ratio) × (# of data sources) = 10,000,000). The Focused method with automatic generation of the recency query is used here.

Overhead Comparisons, Q4
Figure 3: Q4's performance overhead for recency and consistency reporting w.r.t. data ratio and number of data sources ((data ratio) × (# of data sources) = 10,000,000).

Performance Evaluation Summary
The overhead for analyzing a user query and generating a recency query is insignificant.
The overhead for evaluating a recency query is insignificant, and the focused method costs no more than the naïve method, unless all of the following hold:
the data ratio is very low,
the query has a join, and
the query is not selective on data sources.
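
The false positive rate metric used in this evaluation can be written as a small function. The Python sketch below reads the metric as a plain ratio of irrelevant reported sources to relevant sources. The function name and the toy source sets are hypothetical and are not taken from the experiments.

```python
def false_positive_rate(reported, relevant):
    """Ratio of irrelevant reported data sources to relevant data sources."""
    reported, relevant = set(reported), set(relevant)
    if not relevant:
        raise ValueError("rate is undefined when no data source is relevant")
    irrelevant_reported = reported - relevant
    return len(irrelevant_reported) / len(relevant)


# Toy setup (hypothetical): ten data sources, two of which are relevant.
all_sources = {f"m{i}" for i in range(1, 11)}
relevant = {"m1", "m2"}

# The naive method reports every data source; a focused method that finds
# the exact relevant set reports nothing else.
naive_fpr = false_positive_rate(all_sources, relevant)
focused_fpr = false_positive_rate(relevant, relevant)

print(naive_fpr, focused_fpr)
```

In this toy case the naive method reports eight irrelevant sources against two relevant ones, while a report containing exactly the relevant set has a rate of zero.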

False Positive Rates
Naïve method: the rate depends on the exact number of data sources. Assuming 100,000 for illustration:
Q3: fpr = (100000 - 6)/6 ≈ 16666
Q4: fpr = 6/(100000 - 6) ≈ 0.00006
Focused methods: all 0, because the precise sets of relevant data sources are found.

Conclusion
Large-scale asynchronous system: reporting recency, rather than enforcing it, is a viable solution.
Defining "relevance" is non-trivial. Our solution: a data source is relevant if a single update from it can change the result of a query.
Evaluation on our prototype showed that our methods incur insignificant overhead in most cases and are more precise than the naïve method.

Future Work
Key constraints
Maintenance cost
Other definitions of relevance:
A data source is "relevant" if N updates from it may change the query result
A data source is "relevant" if a sequence of updates from it may change the query result

Backup Slides

Another Example [1]
[1] http://www.cs.wisc.edu/condor/map

Example 3
Table 1: Another sample instance for Activity (same Routing table instance as in Example 2)
Mach_id   Value   Event_time
M1        Busy    03/11/2006 20:37:46
M2        Busy    02/10/2006 18:22:01
M3        Busy    03/12/2006 10:23:05

    SELECT A.mach_id FROM Routing R, Activity A
    WHERE R.mach_id = 'm1' AND A.value = 'idle' AND R.neighbor = A.mach_id;

The query result is empty. The set of data sources relevant via R is Ø; the set of data sources relevant via A is {'m3'}. Therefore m1 is not relevant, but two updates from m1 together will change the query result: 1) m1 is updated to 'idle' in Activity, and 2) m1 is added as a neighbor of m1 itself in Routing.
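
The "relevant via" sets in Examples 2 and 3 can be checked with hand-written recency queries, in the spirit of the hardcoded variant of the Focused method. The sketch below uses Python's sqlite3 module in place of PostgreSQL. The function name relevant_sources, the lastReport column names and timestamps, the two recency queries, and the Routing rows (m1 and m2 each listing m3 as a neighbor, as read from the Example 2 table) are assumptions made for illustration and are not taken from the authors' implementation.

```python
import sqlite3

# Routing instance as read from Example 2 (assumed reading of the table).
ROUTING = [("m1", "m3", "03/12/2006 23:20"), ("m2", "m3", "02/10/2006 03:34")]


def relevant_sources(activity_rows):
    """Return (query result, relevant via R, relevant via A) for the join query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Routing (mach_id TEXT, neighbor TEXT, event_time TEXT)")
    conn.execute("CREATE TABLE Activity (mach_id TEXT, value TEXT, event_time TEXT)")
    conn.execute("CREATE TABLE lastReport (source_id TEXT, recency_ts TEXT)")
    conn.executemany("INSERT INTO Routing VALUES (?, ?, ?)", ROUTING)
    conn.executemany("INSERT INTO Activity VALUES (?, ?, ?)", activity_rows)
    # Hypothetical recency timestamps; the slides do not give these.
    conn.executemany(
        "INSERT INTO lastReport VALUES (?, ?)",
        [(m, "09/12/2006 15:20") for m in ("m1", "m2", "m3")],
    )

    result = conn.execute(
        "SELECT A.mach_id FROM Routing R, Activity A "
        "WHERE R.mach_id = 'm1' AND R.neighbor = A.mach_id AND A.value = 'idle' "
        "ORDER BY A.mach_id"
    ).fetchall()

    # Via R: a potential Routing tuple from s must have mach_id = s = 'm1'.
    # Its neighbor is unconstrained, so any existing idle Activity tuple can
    # complete the join.
    via_r = conn.execute(
        "SELECT source_id FROM lastReport WHERE source_id = 'm1' "
        "AND EXISTS (SELECT 1 FROM Activity WHERE value = 'idle')"
    ).fetchall()

    # Via A: a potential Activity tuple from s can have value 'idle'. It needs
    # an existing Routing tuple from 'm1' whose neighbor is s.
    via_a = conn.execute(
        "SELECT source_id FROM lastReport WHERE source_id IN "
        "(SELECT neighbor FROM Routing WHERE mach_id = 'm1') ORDER BY source_id"
    ).fetchall()

    conn.close()
    return [r[0] for r in result], [r[0] for r in via_r], [r[0] for r in via_a]


example2 = relevant_sources([
    ("m1", "idle", "03/11/2006 20:37"),
    ("m2", "busy", "02/10/2006 18:22"),
    ("m3", "idle", "03/12/2006 10:23"),
])
example3 = relevant_sources([
    ("m1", "busy", "03/11/2006 20:37:46"),
    ("m2", "busy", "02/10/2006 18:22:01"),
    ("m3", "busy", "03/12/2006 10:23:05"),
])

print("Example 2:", example2)
print("Example 3:", example3)
```

With every machine busy, no existing Activity tuple can complete a join with a new Routing tuple from m1, which is why the set relevant via R is empty in Example 3.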

Reporting recency
Recency information for relevant data sources is stored in a session-duration temporary table.
The number of relevant sources may itself be large, so we provide additional summary information: the minimum recency timestamp, the maximum recency timestamp, and the range of recency.
Also, since machine failures are common in Condor, exceptional data sources are reported using z-score outlier detection.

Experiment Setup
System: Tao Linux 1.0, 2.4 GHz Intel, 512 MB memory
Database: PostgreSQL 8.0.0
Shared buffer pool: 8 MB
Working memory size: 1 MB
Each query is run 11 times, and the average time of the last 10 runs is used to minimize fluctuation.

Test Queries
Q1:
    SELECT COUNT(*) FROM Activity A
    WHERE A.mach_id IN ('Tao1', 'Tao10', 'Tao100', 'Tao1000', 'Tao10000', 'Tao100000')
    AND A.value = 'idle';
Q2:
    SELECT COUNT(*) FROM Activity A
    WHERE A.mach_id NOT IN ('Tao1', 'Tao10', 'Tao100', 'Tao1000', 'Tao10000', 'Tao100000')
    AND A.value = 'idle';
Q3:
    SELECT COUNT(*) FROM Routing R, Activity A
    WHERE R.mach_id IN ('Tao1', 'Tao10', 'Tao100', 'Tao1000', 'Tao10000', 'Tao100000')
    AND R.neighbor = A.mach_id AND A.value = 'idle';
Q4:
    SELECT COUNT(*) FROM Routing R, Activity A
    WHERE R.mach_id NOT IN ('Tao1', 'Tao10', 'Tao100', 'Tao1000', 'Tao10000', 'Tao100000')
    AND R.neighbor = A.mach_id AND A.value = 'idle';

Roadmap
Background
Definitions and Techniques
Prototype and Evaluation
Conclusion and Future Work
Related Work

Enforcing currency and consistency
R. Alonso et al., Quasi-copies: Efficient data sharing for information retrieval systems. In EDBT, pages 443-468, 1988.
H. Garcia-Molina et al., Read-only transactions in a distributed database. ACM Trans. Database Syst., 7(2):209-234, 1982.
R. Lenz. Adaptive distributed data management with weak consistent replicated data. In SAC, pages 178-185, 1996.
A. Segev and W. Fang. Currency-based updates to distributed materialized views. In ICDE, pages 512-520, 1990.
A. Labrinidis et al., Balancing performance and data freshness in web database servers. In VLDB, pages 393-404, 2003.
L. Bright et al., Using latency-recency profiles for data delivery on the web. In VLDB, pages 550-561, 2002.
H. Guo et al., Relaxed currency and consistency: How to say "good enough" in SQL. In SIGMOD Conference, pages 815-826, 2004.
A common theme here is to enforce recency constraints through a combination of choosing the correct version of an object to query (i.e., the cached or the primary copy) or refreshing "stale" objects by synchronously "pulling" new data in response to a query.

Data Lineage
Y. Cui et al., Lineage tracing for general data warehouse transformations. In VLDB, pages 471-480, 2001.
Lineage tracing identifies the set of source data items that produced a view item. Our work is different in that even if a data source doesn't contribute any lineage data items (possibly due to latency in reporting in or some error), it may still be "relevant".

Distributed Query Processing
R. Munz et al., Application of sub-predicate tests in database systems. In A. L. Furtado and H. L. Morgan, editors, VLDB, pages 426-435, 1979.
The problem statement relies on data items being placed in a distributed environment in such a way that they satisfy various predicates. The interaction of the data placement predicates and the query predicates is then used to identify where data satisfying a simple query might be located.

Partition Pruning
D. J. DeWitt et al., Gamma - a high performance dataflow database machine. In VLDB, pages 228-237, 1986.
IBM. DB2 UDB for z/OS Version 8 Performance Topics, 2005.
Oracle Corporation. Oracle Database Concepts, 10g Release 1, 2003.
These systems choose partitions by matching certain types of selection predicates with the "partitioning predicates".