* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Continuous Queries and Publish/Subscribe
Microsoft Access wikipedia , lookup
Relational algebra wikipedia , lookup
Open Database Connectivity wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
Microsoft SQL Server wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Extensible Storage Engine wikipedia , lookup
Versant Object Database wikipedia , lookup
Clusterpoint wikipedia , lookup
Continuous Queries and Publish/Subscribe Aaron Hirshfield Salinee (NOI) Jencharat Ened Ketri Esther Ryvkina Miriam Speert Presentation Outline 1:40 1:50 2:25 3:10 3:20 - 1:50 2:25 3:10 3:20 4:00 Introduction to Pub/Sub EXPRESSION/EVALUATE TAPESTRY Break Efficient Filtering of XML 4:00 - 4:30 Discussion References Aguilera, M., Strom, R., Sturman, D., Astley, M., and Chandra, T. “Matching Events in a Content-based Subscription System”. 18th ACM Symposium on Principles of Distributed Computing, 1999: 53-61 Fabret, F., Jacobsen, H.A., Llirbat, F., Pereira, J. “Filtering Algorithms and Implementation for Very Fast Publish/Subscribe System”. ACM SIGMOD 2001: 115-126. Yalamanchi, A., Srinivasan, J., Gawlick, D. “Managing Expressions as Data in Relational Database Systems”. Proceedings of the 2003 CIDR Conference. References (Continued) Douglas B. Terry, David Goldberg, David Nichols and Brian M. Oki. Continuous Queries over Append-Only Databases. SIGMOD Conference 1992: 321-330. What is Publish/Subscribe? Connecting information providers to information consumers Delivering published events to subscribers who expressed interests in such events. Group-Based Publishers categorize events into fixed groups such as subjects, channels, or topics. Subscribers choose certain a group(s) and then obtain every event in the specified group(s). Content-Based No pre-defined groups - subscribers can filter criteria along multiple dimensions. No administrative overhead for publishers to maintain groups. Issues Large number of subscribers Large number/high rates of events Minimize delay Simple and expressive subscription Changing subscription How does it relate to us? Subset of stream database with only select queries Events in form of streams Subscribers’ interests in form of continuous queries or possibly as another stream Managing Expressions as Data in Relational Database Systems Aravind Yalamanchi Jagannathan Srinivasan Dieter Gawlick First, a note on efficiency… Save: time space computation computation resources Motivation Sophisticated Publish/Subscribe systems do not currently exist Authors propose approach to Pub/Sub that allows greater expressiveness Idea: I want a list of movies meeting the following criteria: G Rating Cartoon Shorter than 3 hours Could execute a query, or… Different Approach … we could store the query (as data)! Arrival of new data search DB for queries that are satisfied Notify users who issued matching queries Definition: Expressions Boolean conditions characterizing user’s interests Syntax-equivalent to SQL-WHERE clause Stored in a column of database table Treated as data More on Expressions can reference variables and functions Example: Tivo/Digital Guide ratingReview(movie) = 4 and Day = 6 and Length(movie) < 200; finds movies that received 4 stars, less than 200 minutes, and are being shown on day 6 (Saturday) Expression Data Type Conditional expression stored as text Associated with Expression Set (set of expressions in db) Metadata (variables and functions common to set) Cid Zipcode Service Plan … Interest 1 02454 Basic Cable … Genre = ‘Comedy’ and Rating = ‘PG’ and Time < 20 Expression Metadata Expressions share a common list of variables Metadata made up of these variables, their data types and functions Expressions can use this Metadata in their predicates Expression Metadata Cont’d Metadata determines evaluation context of expressions in a column Insertion or update to expression: expression checked against metadata to ensure its validity Expression Metadata Elementary Attributes Genre VARCHAR(30) Rating VARCHAR(3) Day Number Time Number Year Number User-Defined Function Length ReviewRating Variable names and their data type List of user-defined Function Definition: EVALUATE An operation comparable to (eval) in LISP/Scheme Equivalent SQL Query metadata in FROM clause expression in WHERE clause SELECT 1 FROM (SELECT :Genre as Genre, :Rating as Rating FROM <any_table_with_one_row>) WHERE Genre = ‘Comedy’ and Rating = ‘PG’ EVALUATE Operator Evaluates an expression into 0 or 1. Takes 2 arguments text representation of expression Data Item (String or AnyData type) SELECT * FROM consumer WHERE EVALUATE( consumer.interest, ‘Genre => “COMEDY”, Rating => “PG”’ ) = 1 SELECT * FROM consumer WHERE EVALUATE( consumer.interest, AnyData.convertObject( Movie(‘Comedy’,’PG’))= 1 Use of EVALUTE in SQL Use of aggregates to allow complex expressions. using GROUP BY or ORDER BY on the customer’s other attributes SELECT * FROM consumer WHERE EVALUATE(consumer.interest, <movie details> ) = 1 Order BY consumer.age Use of EVALUTE (continued) Can be combined with other predicates to allow “Mutual Filtering” (Publishers can filter data, not only Subscribers) notifying only customers with age >= 17 about the R-rated movies Use of EVALUTE (continued) Perform JOINS on multiple relations order the movies in the newMovie table by the number of subscribers interested in the type of the movie (using count(*) in SQL) SELECT DISTINCT (newMovie.id), count(*) as demand FROM consumer, newMovie WHERE EVALUATE(consumer.interest, <movie details from newMovie table> ) = 1 GROUP BY newMove.id ORDER BY demand DESC Use of EVALUTE (continued) Maintain complex (many-to-many) relationship between multiple tables. joining customer and movie tables to create a listing of all the actual movies from movie table that each customer is interested in. Expression Indexing As in previously seen systems, Oracle developers exploit commonalities Large set of expressions many will tend to share common parts of their predicates Index defined on group of expressions in column to process more “efficiently” with EVALUATE Expression Indexing Cont’d Exploitation of commonalities leads to more “efficient” processing of expressions Processing costs shared across multiple predicates Example: Given 2 predicates: one wants movie rated PG and one wants movie rated R If one predicate is satisfied, the other is not Exploitation of Commonalities Logical grouping of predicates by their LHS ( length(movie) LHS < 2; group on length(movie)) OP RHS Example: want movies shorter than 3 hours and movies shorter than 2 hours, movies shorter than 2 hours satisfy both predicates Predicate Table Predicates are grouped by LHS Contains 1 row for each expression Row ID Rating Op RHS Length Op RHS Stars Op RHS Sparse Predicates 1 = G > 180 > 3 Director = Crowe 2 3 = R > 120 > 2 Yr > 1980 > 90 > 3 Lang = French = Index Processing Data’s attribute compared with the RHS and OP columns Example: SELECT exp_id from predicate_table WHERE G2_OP = ‘=‘ and G2_RHS = :rhs_val OR G2_OP = ‘>’ and G2_RHS < :rhs_val OR … Index on Predicate Table We can have an index on the Predicate Table. Processing: 1. Indexed predicate group 2. Stored predicate group 3. Sparse predicate group Some expressions are filtered out after each step - Less to check with slower predicate group. Predicate HandlingComparisons Oracle Table built for expression sets (based on LHS operators and common attributes) Telegraph For each attribute of each source in a query, grouped filter index is maintained & index used to compute overlapping range queries TriggerMan Predicates grouped into Equivalence Classes based on expression signatures Performance Characterization Oracle says: preliminary experiments promising results for the EFI Scheme Performed best when “fine-tuned” for given expression set-Tunable Characteristics of an index: list of common predicates list of common ops for these preds # of indexed predicates Future Work Native support needed for: Expression data type EVALUATE operator Why Does It Matter? (What to take away) Expression/EVALUATE allows us to list all interested subscribers for a data by just one query. Makes it easier to scale for large number of subscribers. Using Relational Database Management system for Pub/Sub. Discussion Sneak Preview What if we don’t have all information? Must conform to metadata, so now what? Have we seen Pub/Sub before? What about myYahoo portals? So, if RDMS can be extended to handle Pub/Sub, is this the limit? Continuous Queries over Append-Only Databases Using the Tapestry System Talk Topics Continuous Queries TQL / Tapestry Query Transformations Results Conclusions Continuous Queries What does ‘continuous’ mean? (Continuous Semantics) The results of a continuous query is the set of data that would be returned if the query were executed at every instant in time. Why an ‘append-only’ database? Users are being presented with answers, and we don’t want a data item to be an answer at one point, but not an answer later in time. Periodic Query Example Show all e-mail messages that have received a reply. Problems with Periodic Queries Non-deterministic results If two people make this query, will they obtain the same results? Duplicates can be returned Every time the query is run, the same results might be returned. Inefficiency of the system The query is run over all data, not new data, resulting in slow processing time do to unnecessary work being done by the system. The Tapestry System A system that keeps a database of all mail and news messages received. Designed to filter mail and newsgroup messages. Allows users to query over this database. Since Tapestry doesn't use triggers (time is used), it can be implemented in any commercial database that supports SQL applications. TQL TQL – Tapestry Query Language Works much like SQL, with the addition of a timestamp. Every query uses DISTINCT. No aggregates at this time. Does not support outer joins. TQL monotone incremental SQL TQL Advantages Allows ad hoc queries. Can be used in a conventional relational database. Time-oriented queries are supported. Flexible scheduling of query running order. Continuous Query Example Show all e-mail messages that have received a reply, and that have not been returned by this query before. Method: Subtract old answers from new answers and return this new set of data. This is done because some past answers might be included in the ‘new’ data. Incremental Query Example Show all e-mail messages that have received a reply, and that have not been returned by this query before. (Same query as before) Method: Run the query on the new data that has arrived, not the entire database or by comparing to past answers. Solutions Incremental Queries can eliminate duplicates. Timestamp Query Transformations To run continuous queries and solve the problem of non-determinism, TQL was developed. Query Transformation Q(t) QM(t) QI (t, τ) QM(t) – monotone query QI (t, τ) – incremental query Monotone query: QM(t1) QM(t2) when t1 < t2 Minimum bounding monotone query Incremental query: QM(t)-QM(τ) QI(t, τ) QM(t) Should return enough, but not too much Monotone queries QM(t) = Us<=t Q(s) Monotone query result = = regular query + continuous semantics Not all queries are monotone, but for an append-only database many queries are Monotone queries Monotone (for append-only) Boolean predicates over column values of a table. Queries involving joins. Possibly non-monotone Functions that read current time. Sub-queries prefaced by ‘NOT EXISTS.’ Queries with time SELECT * FROM tbl WHERE tbl.field op GetDate() tbl.field < GetDate() -- monotone (<=) “Select all messages that have been replied to” tbl.field > GetDate() -- non-monotone (>=) “Select all messages that haven’t expired” =, <> -- likely to be non-monotone Queries with NOT EXIST SELECT * FROM msgs m WHERE NOT EXISTS ( SELECT * FROM msgs m1 WHERE m1.inReplyTo = m.msgId) Same as with time “Select all messages that haven’t been replied to.” Modification of this query that makes more sense in continuous semantics: “Select all messages that haven’t been replied to for 2 weeks.” Query Transformation Q(t) QM(t) QI (t, τ) Standard form of a query: SELECT … FROM table1, table2, … WHERE (E1 AND E2 AND … Ek) OR ( … ) OR … OR ( … ) If Q is in the form (P ۷ R), then QM is in form (PM ۷ RM) => Can consider each of the ‘AND’ expressions separately. Authorized User: should add example?? (extra slide) QQM Rewrite Rules: handling time Assume (for now) no NOT EXISTS For each table mentioned in a query, add a clause: table.ts < t Terms that don’t contain time don’t affect monotonicity. Group all terms with time and simplify. QQM Rewrite Rules: handling time C1 < t AND … MAX (C1,…, AND Cn < t Dm C n) < t AND AND C<t AND AND… AND t< P t<D MIN (D1,…, AND P AND P C<D AND t< AND AND P t < D1 C<t Dm ) QQM Rewrite Rules: handling NOT EXISTS … NOT EXISTS (SELECT * FROM m1, … mk WHERE R(t)) Not always possible to convert to a monotone query without changing the meaning If R(t) not monotone -- usually impossible “Select messages that have no reply” --> “Select messages that are at least N weeks old and have no reply.” QM I Q QI (t, τ) = QM(t) but inefficient. Rewrite Rules AND (NOT QM(τ)) -- simple Just check the records that arrived in (t, τ) interval – seems reasonable SELECT * FROM m WHERE P SELECT * FROM m WHERE P can rewrite as AND τ m.ts < t In Practice Experiments to evaluate performance were conducted. Goal: to prove effectiveness of implementation of continuous queries as incremental queries Description and results -- to follow Authorized User: for time -- talk about warm-up run, not including I/Os, since data in buffer cache, impossibility to flash cache Experiments Database 380,000 news articles (messages) 1 table -- “msgs” with schema: {msgId, from, subject, date, newsgroup, inReplyTo, ts} Idea compare times to execute a non-incremental query, vs. incremental version of the same query for different values of τ τ: 100%, 10%, 1% Time measured average of 10-20 executions Experiments: Queries Q1: “Select messages from the comp.databases newsgroup” (simple selection predicate) Q2: “Select messages from all of the comp.sys newsgroups” (selection predicate with LIKE) Q3: “Select messages that have a reply sent to comp.databases” (simple join) Q4: “Select messages that are 4 weeks old and have not been replied to” (selection with NOT EXIST) Q5: “Select first messages in conversation chains greater than 2 in length” (more sophisticated join) Expected: Results QI (100%) comparable to Q QI (1%) QI (10%) QI (100%) In practice: Experiments: Results QI (100%) comparable to Q Holds for for Q2, Q3, Q4, Q5 For Q1: incremental uses index on ts, while regular uses index on newsgroup QI (1%) QI (10%) QI (100%) Holds for Q1, Q2, Q4 Q3, Q5: joins. Limitation of Sybase -- can’t utilize multiple indices for joins. Experiments: More Results Q3 – original version Q3 – incremental version SELECT m.msgId FROM msgs m, msg m1 SELECT m.msgId FROM msgs m, msgs m1 WHERE m1.inReplyTo = m.msgId WHERE m1.inReplyTo = m.msgId AND m1.newsgroup = “comp.databases” AND m1.newsgroup = “comp.databese” AND (m.ts > OR m1.ts > ) Sybase can’t deal efficiently with ORed attributes => uses index on newsgroup, not on ts => no improvement with smaller τ. Revised queries Q3 and Q5 perform much better. General technique for rewriting queries with joins – breaking 1 query into a few that are optimized easily. For future discussion Optimization doesn’t happen automatically! Is it a serious limitation? Can you think of any ways to improve this aspect of the system? Other Experiments Cost of incremental queries as a function of τ Q1: 0% to 100% Cost proportional to τ Cost of incremental queries as a function of database size Q3 (revised) Execution cost doesn’t grow with DB size (stays basically constant). Other Experiments Previous experiments Simple index built on values of single fields. New experiments Composite index made up of the timestamp value appended to the field value: Optimizer doesn’t have to choose between using index on ts and the newsgroup. Better results for Q1 and Q3 (Q3 was significantly better). Conclusion Continuous Queries TQL / Tapestry Monotone / Incremental Queries Future Work Make a system that allows for updates and deletions (use the techniques of this paper in a historical database). Supporting aggregate queries. Example – Show all e-mails that have received at least 5 reply’s. Discussion Scalability of Tapestry: Size of the database -- Yes Number of queries -- ??? Continuous queries and streams: Similarities ? Differences ? Discussion Append-only databases: Is it a vital limitation in practice? This approach was invented in 1992. Why didn’t it find its way to wide use? Next……. After the break, we have the paper “Efficient Filtering of XML Documents for Selective Dissemination of Information” by Ened. Following this paper, we conclude with a discussion about all the papers presented today. Efficient Filtering of XML Documents for SDI Mehmet Altmel, Michael J. Franklin Presentation by Ened Ketri Relation to the previous papers Feature --------Type Pub Sub Tapestry Maybe SQL Mod CQ DB TQL Incremental Newsgroups Append Only Index Filter Yes No Oracle Yes EVALUATE Expressions as data Conventional No RDBMS Yes XFilter Yes XPath Profiles XML docs Yes Yes Overview Introduction, streams connection Alternatives, Motivation Execution model XML use Xpath Xfilter Enhancements Empirical Data References XML http://www.w3.org/TR/REC-xml XPath http://www.w3.org/TR/path XML Query Model http://www.w3.org/TR/query-datamodel/ SDI : Selective Dissemination of Information Process of filtering and delivering documents based on user interests Data Collection (input streams) Filtering (predicates) Profiling (continuous queries) Delivery (output streams) Alternatives IR Matching Boolean Similarity Problems Domain Precision/Efficiency Scale Motivation Big improvements in bandwidth Overwhelming Information New tools available (XML related) Millions of users Huge volumes of data (Web) Example Applications Tickers (stocks, sports, weather) Traffic information, GPS Electronic newspapers Entertainment Delivery Streams and SDI SDBMS model Queries stored (profiles) Data comes in (XML documents from web) Predicates (filtering) Previously seen relevant systems PSoup (Query indexing) NiagaraCQ (XML oriented) Execution Model Notes on architecture Profile creation Users System Huge set of profiles (millions) Efficient/Quick search capability XML documents as Data XML Why Popularity (interoperability/compatibility) Implementation (well-formedness) Standard Hierarchical structure Precision (matching) Efficiency XML DTD (Document Type Definition) Example <?xml version="1.0"?> <!DOCTYPE greeting SYSTEM "hello.dtd"> <greeting>Hello, world!</greeting> </xml> Tagging Language Attributes/Elements XPath XPath XML Query Language W3C standard XML doc Tree of nodes Path expressions Matching Document application XPath Paths Absolute Relative Operators (/) (//) (*) ([..]) Example /movies/*//actors[age<30]/name Looking back… NiagaraCQ IR based Document data (input) oriented DTD limited selectivity Triggers Scalability Matching Execution model XFilter XFilter: Issues Problems Order checking Wildcards Filter evaluation (nested queries) Solution XPath FSM (Finite State Machine) Query Indexing FSM States Path nodes State Lists Candidate List, Wait List FSM Parsing SAX Event Handlers Nested path expressions Enhancements Improve Performance Scalability Solutions List Balancing Prefiltering List Balancing Why Poor selectivity from first elements Skewed Candidate Lists Solution Chose element to be placed in CL based on selectivity and list length PreFiltering Why One level step analysis Unnecessary Work Solution Preliminary parsing of documents Only check queries whose elements appear in the document Workload parameters Varying number of profiles Varying Query/Document depth Varying Wildcard Probability Conclusions Summary A Pub/Sub system, that basically filters XML documents based on user profiles Internet-scale SDI suitable for huge number of users Profiles are treated as standing queries Future work / Issues Partial results Handling joins/multiple streams List Balancing Discussion Questions What if we don’t have all information? Must conform to metadata, so now what? Have we seen Pub/Sub before? What about myYahoo portals? So, if RDMS can be extended to handle Pub/Sub, is this the limit? Discussion Questions Scalability of Tapestry: Size of the database -- Yes Number of queries -- ??? Continuous queries and streams: Similarities ? Differences ? Discussion Questions Append-only databases: Is it a vital limitation in practice? This approach was invented in 1992. Why didn’t it find its way to wide use? Discussion Questions Optimization doesn’t happen automatically (remember – queries with joins) Is it a serious limitation? Can you think of any ways to improve this aspect of the system? Discussion Questions What would be the optimal granularity in data being served? Is profile/CQ content updating an important feature and if so how could it be implemented? How would a structure/content description language based (XML) standard model of data sources affect Pub/Sub, CQ, SDBMS?