Continuous Data Stream Processing
MAKE Lab
Date: 2006/03/07
Post-Excellence Project, Subproject 6

Music Virtual Channel
[Figure: system architecture. Labels: Peer search engine, Clustering engine, Profile database, Internet, Cluster coordinator, Interface, Channel monitor, Profile monitor, Cluster monitor, V.C. player, Favorite channels 1 … N, Music channel simulator, XML Filtering engine, MusicXML metadata, Music database, Music collections]

Research Directions
Streaming Data Management
  Filtering
    Temporal Query Processing
      Sequence Query Matching
      Episode Query Matching
    Spatial Query Processing
      Range Search
      KNN Search
    Aggregate Query Processing
      Top-K Search
  Mining
    Frequent Tree Pattern Mining
    Closed Tree Pattern Mining
    Frequent Itemset Mining (sliding window)
    Frequent Itemset Mining (landmark model)

Sequence Query Matching
Given a set of sequence queries (SQs), how can we continuously monitor the event stream and, as soon as a segment arrives, report it if it approximately matches some query within that query's error bound?
Event Stream: <a,b,c,d><c,e><a,b,c><b,d><a,d><e,f><a,e><a,b,c><e,f><a,b,c><e><b,c,e><d,f> …
Sequence Query: <a,b,c><b,d><a,c,d><e,f><a,e>, ε=1

Episode Query Matching
Knowledge Discovery from Telecommunication Network Alarm Databases [ICDE96]
  If an alarm of type A occurs, then an alarm of type B occurs within 30 seconds with probability 0.8.
  If alarms of types A and B occur within 5 seconds, then an alarm of type C occurs within 60 seconds with probability 0.7.
  If an alarm of type A precedes an alarm of type B, and C precedes D, all within 15 seconds, then E will follow within 4 minutes with probability 0.6.
[Figure: episode diagrams over alarms A, B, C, D with 5-second and 15-second windows]

Top-K Query
Suppose there are two continuous queries. Then, another continuous query is registered.
Which two web documents are the most popular across the first and second servers?
Which two web documents are the most popular across the third and fourth servers?
Which two web documents are the most popular across the second and third servers?
[Figure labels: Queries, Coordinator, Server 1, Server 2, Server 3, Server 4]

Main Difficulties
Heavy Communication Cost
  The server only updates its current data when necessary
Multiple Continuous Queries
  Most papers focus on one-time top-k queries or a single continuous top-k query
  Information sharing is necessary

Spatial Query Processing
Continuous queries for moving objects in high-dimensional space
  Range search
  KNN search
[Figure labels: Search engine, V.C. player, user profile, channel, recommended channel, selected channel, Vote Mechanism]

Problem Definition
Given a set of objects with their positions in an N-dimensional (N > 20) region
The set of objects is highly dynamic: each object can move in an unrestricted fashion, i.e., we do not assume any pattern of motion
Continuously monitor the results of each query point
  Range Query
  KNN Query

Main Difficulties
Heavy Communication Cost
  Object updates occur only when the results for some queries might change
  • Safe Region [SIGMOD05]
Incremental Update
  Efficiently maintain the effective results
Multiple Continuous Queries
  Decide the quarantine area for each query
Mixed Types of Queries
  Support both the range query and the KNN query
[Figure labels: Q1, Q2]

Range Query
Query Q: (x,y), r
Cell C
  A: max < r
  B: min ≦ r ≦ max
  C: min > r
  max: the maximum dis(query,cell)
  min: the minimum dis(query,cell)

Range Query (Cont.)
Moving Query MQ
How to maintain the result for an MQ?

Range Query (Cont.)
Server: Q1, Q2, Q3; flag = 0/1
When to update?
Client:
  Q1  Q2  Q3  Action
  A   A   A   No update and no recalculation
  A   A   B   Update and recalculate for some queries
  A   A   C   No update and no recalculation
We only need to consider those objects marked with B

Range Query (Cont.)
For a range query Q
  Result list: O3, O5, O7
  Covered cells
    A: C3, C4, C5
    B: C2, C7, C9
For a cell C
  Affected queries
    A: Q2, Q4, Q7
    B: Q3, Q6, Q9
[Figure labels: C2, Query Motion]

KNN Query
Query Q: (x,y), 3
Object Update
[Figure: object-update cases labeled "update the order" (twice) and "re-computation"]

KNN Query (Cont.)
Query Q: (x,y), 3
Query Q’: (x’,y’), r
r = d’max

KNN Query (Cont.)
Query Q: (x,y), 3
Query Q’: (x’,y’), r
r = dmax + dquery

KNN Query (Cont.)
Query Q: (x,y), 3
Query Q’: (x’,y’), r
r = dmax + dcell

Tree Pattern Mining
As the trees stream in, find the subtrees that occur more than θ·N times, where N is the number of trees received so far and 0≦θ≦1
[Figure labels: T1, T2, T3, STMer, Frequent Tree Patterns]

Closed Tree Pattern Mining
Mining closed frequent subtrees over data streams
A subtree is closed if none of its proper supertrees has the same support as it does
[Figure: example trees over node labels A, B, C, D; frequent subtrees annotated with supports 2 and 3; closed subtrees marked]
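
Sketch: range-query cell classification
The A/B/C cell labels on the Range Query slides (A: max < r, B: min ≦ r ≦ max, C: min > r) can be illustrated with a short Python sketch. It is only one possible reading: it assumes axis-aligned rectangular cells and Euclidean distance, neither of which the slides state, and the names min_max_dist and classify_cell are hypothetical.

```python
import math


def min_max_dist(q, cell_lo, cell_hi):
    """Return (min, max) Euclidean distance from point q to an axis-aligned cell.

    Assumption: the cell is the box [cell_lo, cell_hi] in every dimension.
    """
    min_sq = 0.0
    max_sq = 0.0
    for x, lo, hi in zip(q, cell_lo, cell_hi):
        # Closest coordinate inside the cell along this dimension.
        nearest = min(max(x, lo), hi)
        min_sq += (x - nearest) ** 2
        # Farthest coordinate is always one of the two cell bounds.
        farthest_gap = max(abs(x - lo), abs(x - hi))
        max_sq += farthest_gap ** 2
    return math.sqrt(min_sq), math.sqrt(max_sq)


def classify_cell(q, r, cell_lo, cell_hi):
    """Label a cell A, B, or C relative to a range query centered at q with radius r."""
    dmin, dmax = min_max_dist(q, cell_lo, cell_hi)
    if dmax < r:
        return "A"  # whole cell lies inside the query range
    if dmin > r:
        return "C"  # whole cell lies outside the query range
    return "B"      # the range boundary crosses the cell


if __name__ == "__main__":
    q, r = (0.0, 0.0), 5.0
    for lo, hi in [((1, 1), (2, 2)), ((3, 3), (4, 4)), ((6, 6), (7, 7))]:
        print(lo, hi, classify_cell(q, r, lo, hi))
```

Under these assumptions, objects in A cells are always results and objects in C cells never are, so only objects in B cells need an exact distance check. This appears to match the remark that only objects marked with B need to be considered.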