Mining of Frequent Patterns from Sensor Data
Presented by: Ivy Tong Suk Man
Supervisor: Dr. B. C. M. Kao
20 August, 2003

Outline of the Presentation
- Motivation
- Problem Definition
- Algorithm
  • Apriori with data transformation
  • Interval-List Apriori
- Experimental Results
- Conclusion

Motivation
Continuous items reflect values from an entity that changes continuously in the external environment. An update is a change of state of the real entity, e.g. temperature reading data:
- Initial temperature: 25°C at t = 0s
- Sequence of updates <timestamp, new_temp>: <1s, 27°C>, <5s, 28°C>, <10s, 26°C>, <14s, ...>
- So the temperature is 25°C from t = 0s to 1s, 27°C from t = 1s to 5s, and 28°C from t = 5s to 10s.
What is the average temperature from t = 0s to 10s?
- Answer: (25×1 + 27×4 + 28×5) / 10 = 27.3°C

Motivation
Time is a component in some applications, e.g. stock price quotes and network traffic data. "Sensors" are used to monitor some conditions, for example:
- prices of stocks, by getting quotations from a finance website;
- weather, by measuring temperature, humidity, air pressure, wind, etc.
We want to find correlations among the readings of a set of sensors.
Goal: to mine association rules from sensor data.

Challenges
How different is this from mining association rules from market basket data?
- Time component: when searching for association rules in market basket data, the time field is usually ignored, as there is no temporal correlation between the transactions.
- Streaming data: data arrives continuously, possibly infinitely, and in large volume.

Notations
- We have a set of sensors R = {r_1, r_2, ..., r_m}.
- Each sensor r_i has a set of numerical states V_i. We assume binary states for all sensors: V_i = {0, 1} for all i s.t. r_i ∈ R.
- Dataset D is a sequence of updates of sensor states of the form <ts, r_i, v_i>, where r_i ∈ R and v_i ∈ V_i:
  • ts: timestamp of the update
  • r_i: the sensor to be updated
  • v_i: the new value of the state of r_i
- For sensors with binary states, an update can be written as <ts, r_i>, since the new state can be inferred by toggling the old state.

Example
R = {A, B, C, D, E, F}; initial states: all off.
D: <1,A> <2,B> <4,D> <5,A> <6,E> <7,F> <8,E> <10,A> <11,F> <13,C>
[Figure: on/off timelines of sensors A to F; e.g. A is ON during [1,5) and from t = 10 on, B from t = 2 on, E during [6,8), F during [7,11).]

More Notations
An association rule is a rule, satisfying certain support and confidence restrictions, of the form X → Y, where X ⊆ R, Y ⊆ R, X ∩ Y = ∅, X ≠ ∅ and Y ≠ ∅.
- An association rule X → Y has confidence c if, during c% of the time when the sensors in X are ON (with state = 1), the sensors in Y are also ON.
- An association rule X → Y has support s if, during s% of the total length of history, the sensors in X and Y are all ON.

More Notations
- TLS(X) denotes the Total LifeSpan of X: the total length of time during which all the sensors in X are ON.
- T is the total length of history.
- Sup(X) = TLS(X) / T
- Conf(X → Y) = Sup(X ∪ Y) / Sup(X)
Example (T = 15s): TLS(A) = 9 and TLS(AB) = 8, so
- Sup(A) = 9/15 = 60%
- Sup(AB) = 8/15 = 53%
- Conf(A → B) = 8/9 = 89%
[Figure: on/off timelines of A and B.]

Algorithm A - Transform & Apriori
Transform the sequence of updates into the form of market basket data. At each point of update:
- take a snapshot of the states of all sensors;
- output all sensors with state = ON as a transaction;
- attach Weight(transaction) = Lifespan(this update) = timestamp(next update) - timestamp(this update).
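To make the transformation concrete, here is a minimal Python sketch of Algorithm A's preprocessing step (an illustration under the slides' assumptions of binary sensors and toggle updates, not the author's actual implementation):

```python
# Algorithm A, data transformation (illustrative sketch): turn a sequence of
# binary-sensor toggle updates <ts, sensor> into weighted market-basket
# transactions, one per update, weighted by the update's lifespan.

def transform(updates, end_of_history):
    """updates: list of (timestamp, sensor) toggles sorted by timestamp.
    Returns a list of (frozenset of ON sensors, weight) transactions."""
    state = set()                 # sensors currently ON; all OFF initially
    transactions = []
    for i, (ts, sensor) in enumerate(updates):
        state ^= {sensor}         # toggle the updated sensor
        # lifespan = time until the next update (or the end of history)
        next_ts = updates[i + 1][0] if i + 1 < len(updates) else end_of_history
        transactions.append((frozenset(state), next_ts - ts))
    return transactions

# The running example from the slides, with end of history T = 15s:
D = [(1, 'A'), (2, 'B'), (4, 'D'), (5, 'A'), (6, 'E'),
     (7, 'F'), (8, 'E'), (10, 'A'), (11, 'F'), (13, 'C')]
for items, weight in transform(D, 15):
    print(sorted(items), weight)  # ['A'] 1, then ['A', 'B'] 2, ...
```

Running this reproduces the transformed database D' shown in the example that follows.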
Algorithm A - Example
Initial states: all off; end of history T = 15s.
D: <1,A>, <2,B>, <4,D>, <5,A>, <6,E>, <7,F>, <8,E>, <10,A>, <11,F>, <13,C>
The transformation proceeds update by update: each update produces one transaction, whose weight is fixed once the next update arrives (e.g. TID 1 gets weight 2 - 1 = 1, TID 2 gets weight 4 - 2 = 2, and the last transaction gets weight 15 - 13 = 2).
Transformed database D':

TID   Items     Weight
 1    A         1
 2    A,B       2
 3    A,B,D     1
 4    B,D       1
 5    B,D,E     1
 6    B,D,E,F   1
 7    B,D,F     2
 8    A,B,D,F   1
 9    A,B,D     2
10    A,B,C,D   2

Algorithm A
Apply Apriori on the transformed dataset D', with support computed over weights:
sup(X) = ( Σ_{T ∈ D', X ⊆ T} Weight(T) ) / ( Σ_{T ∈ D'} Weight(T) )
Drawback: a lot of redundancy, since adjacent transactions may be very similar, differing only in the one sensor whose state was updated.

Algorithm B - Interval-List Apriori
Uses an "interval-list" format <X, interval_1, interval_2, interval_3, ...>, where each interval_i is an interval in which all sensors in X are ON.
TLS(X) = Σ_i (interval_i.h - interval_i.l)
Example: <A, [1,5), [10,15)>, so TLS(A) = (5-1) + (15-10) = 9.
[Figure: timeline of A, ON during [1,5) and [10,15).]

Algorithm B - Step 1
- For each r_i ∈ R, build a list of intervals in which r_i is ON by scanning the sequence of updates.
- Calculate the TLS of each r_i.
- If TLS(r_i) ≥ min_sup, put r_i into L_1.

Algorithm B - Step 1 Example
Initial states: all off.
D: <1,A>, <2,B>, <4,D>, <5,A>, <6,E>, <7,F>, <8,E>, <10,A>, <11,F>, <13,C>
Scanning the updates one by one: an update to a sensor that is OFF opens an interval (e.g. <1,A> turns <A, empty> into <A, [1,?)>), and an update to a sensor that is ON closes its open interval (e.g. <5,A> turns <A, [1,?)> into <A, [1,5)>). At the end of history (T = 15s), all still-open intervals are closed, giving:
<A, [1,5), [10,15)>
<B, [2,15)>
<C, [13,15)>
<D, [4,15)>
<E, [6,8)>
<F, [7,11)>
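A minimal Python sketch of this one-pass construction (again an illustration under the same assumptions, not the thesis code), including the TLS computation and the L_1 filter of Step 1:

```python
# Algorithm B, Step 1 (illustrative sketch): one scan over the update
# sequence builds, for each sensor, its list of [on, off) intervals; sensors
# whose total lifespan (TLS) reaches the minimum support survive into L1.

def build_interval_lists(updates, end_of_history):
    intervals = {}    # sensor -> list of (start, end) pairs, end-exclusive
    open_since = {}   # sensor -> timestamp at which it was switched ON
    for ts, sensor in updates:
        if sensor in open_since:      # ON -> OFF: close the open interval
            intervals.setdefault(sensor, []).append((open_since.pop(sensor), ts))
        else:                         # OFF -> ON: open a new interval
            open_since[sensor] = ts
    for sensor, start in open_since.items():   # still ON at end of history
        intervals.setdefault(sensor, []).append((start, end_of_history))
    return intervals

def tls(interval_list):
    return sum(end - start for start, end in interval_list)

D = [(1, 'A'), (2, 'B'), (4, 'D'), (5, 'A'), (6, 'E'),
     (7, 'F'), (8, 'E'), (10, 'A'), (11, 'F'), (13, 'C')]
lists = build_interval_lists(D, 15)
min_sup_count = 3                    # absolute minimum lifespan, as below
L1 = {s: iv for s, iv in lists.items() if tls(iv) >= min_sup_count}
print(lists['A'], tls(lists['A']))   # [(1, 5), (10, 15)] 9
print(sorted(L1))                    # ['A', 'B', 'D', 'F']
```

With a minimum support count of 3, this matches the example lattice below: C and E (lifespan 2 each) drop out of L_1.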
Algorithm B - Step 2
Find all larger frequent sensor-sets, similarly to Apriori, using the frequent itemset property: any subset of a frequent sensor-set must be frequent.
Method: generate candidates of size i+1 from frequent sensor-sets of size i.
- Approach used: join two size-i frequent sensor-sets into a size-(i+1) sensor-set if they agree on i-1 sensors.
- Candidates that have subsets which are not large may also be pruned.
- Count the support by merging (intersecting) the interval lists of the two size-i frequent sensor-sets.
- If sup ≥ min_sup, put the candidate into L_{i+1}.
- Repeat the process until the candidate set is empty.

Algorithm B - Example (T = 15)
<A, [1,5), [10,15)> joined with <B, [2,15)> gives <AB, [2,5), [10,15)>.
[Figure: timelines of A and B, with the periods in which both are ON highlighted.]

Algorithm B - Example (min support count: 3)
Size 1:
  A: [1,5), [10,15)  LS:9     B: [2,15)  LS:13     C: [13,15)  LS:2
  D: [4,15)  LS:11            E: [6,8)   LS:2      F: [7,11)   LS:4
Size 2:
  AB: [2,5), [10,15)  LS:8    AD: [4,5), [10,15)  LS:6    AF: [10,11)  LS:1
  BD: [4,15)  LS:11           BF: [7,11)  LS:4
Size 3:
  ABD: [4,5), [10,15)  LS:6
(C, E and AF fall below the minimum support count of 3.)

Algorithm B - Candidate Generation
- When generating a candidate sensor-set C of size i from two size-(i-1) sensor-sets L_A and L_B (subsets of C), we also construct the interval list of C by intersecting the interval lists of L_A and L_B.
- Joining the two interval lists (of lengths m and n) is a key step in our algorithm; a simple linear scan requires O(m + n) time.
- There are i different size-(i-1) subsets of C. Which two should we pick?

Algorithm B - Candidate Generation
- Method 1: choose the two lists with the fewest number of intervals. This requires storing the number of intervals for each sensor-set.
- Method 2: choose the two lists with the smallest count (TLS). Intuitively, a shorter lifespan implies fewer intervals. This is easier to implement, since the lifespan is already available from checking whether the sensor-set is frequent.
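The interval-list join itself can be done with a single linear merge of the two sorted lists. The following Python sketch (an assumed implementation consistent with the O(m + n) linear scan described above, not code from the thesis) intersects two interval lists:

```python
# Interval-list intersection (illustrative sketch): merge two sorted lists of
# half-open [lo, hi) intervals in O(m + n) time, keeping the overlaps. The
# result is the interval list of the joined sensor-set.

def intersect(xs, ys):
    """xs, ys: sorted, non-overlapping (lo, hi) interval lists."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        lo = max(xs[i][0], ys[j][0])
        hi = min(xs[i][1], ys[j][1])
        if lo < hi:                  # the current pair of intervals overlaps
            out.append((lo, hi))
        if xs[i][1] < ys[j][1]:      # advance the list whose interval ends first
            i += 1
        else:
            j += 1
    return out

# The slides' join example: A = [1,5), [10,15) and B = [2,15)
print(intersect([(1, 5), (10, 15)], [(2, 15)]))  # [(2, 5), (10, 15)], TLS = 8
```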
Experiments - Data Generation
- Simulate data generated by a set of n binary sensors, making use of a standard market basket dataset.
- With n sensors, each of which can be either ON or OFF, there are 2^n possible combinations of sensor states.
- Assign a probability to each of the combinations.

Experiments - Data Generation
How do we assign the probabilities?
- Let N be the number of occurrences of the transaction in the market basket dataset that contains exactly the sensors that are ON.
  • E.g. consider R = {A, B, C, D, E, F} and suppose we want to assign a probability to the sensor state AC (only A and C are ON). N is the number of transactions that contain exactly A and C and nothing else.
- Assign probability = N / |D|, where |D| is the size of the market basket dataset.
- Note: the market basket dataset needs to be sufficiently large, so that transactions that occur very infrequently are not given ZERO probability.

Experiments - Data Generation
Generating the sensor data: choose the initial state (at t = 0s) either
- randomly,
- according to the probabilities assigned, or
- by picking the combination with the highest assigned probability.
This gives the first set of sensor states.

Experiments - Data Generation
What is the next set of sensor states?
- For simplicity, in our model only one sensor can be updated at a time: for any two adjacent updates, the sensor states at the two time instants differ by exactly one sensor.
- Changing only one sensor state yields n possible combinations, one for each of the n sensor states that could be toggled.
- We normalize the probabilities of these n combinations by their sum and pick the next set of sensor states according to the normalized probabilities.
- Inter-arrival times of updates follow an exponential distribution.

Experiments - Setup
Market basket dataset:
- 8,000,000 transactions; 100 items
- number of maximal potentially large itemsets: 2,000
- average transaction length: 10
- average length of maximal large itemsets: 4
- maximum length of maximal large itemsets: 11
- minimum support: 0.05%
Algorithms:
- Apriori in cached mode (cache-apriori)
- IL-Apriori:
  • (a) random join (IL-apriori)
  • (b) join by smallest lifespan (IL-apriori-S)
  • (c) join by fewest number of intervals (IL-apriori-C)

Experiments - Results
[Figure: mining time (sec) vs. support threshold, 0-5%, for cache-apriori, IL-apriori, IL-apriori-C, IL-apriori-S.]
At larger supports, all IL-Apriori variants outperform cache-apriori.

Experiments - Results
[Figure: mining time (sec) vs. support threshold, 0.01-0.07%.]
At lower supports there are more candidates, so joining interval lists becomes expensive for IL-apriori. With more long frequent sensor-sets, Apriori has to match the candidates by searching through the whole database, while IL-apriori-C and IL-apriori-S save a lot of the time spent joining the lists.

Experiments - Results
[Figure: peak memory usage (MB) vs. support threshold, 0-5%.]
Peak memory usage: cache-apriori stores the whole database, while IL-apriori stores a lot of interval lists when the number of candidates grows large.

Experiments - Results (min_sup = 0.02%)
[Figure: mining time (sec) per i-th pass, passes 1-12.]
Apriori is faster in the first 3 passes; the running time for IL-apriori drops sharply afterwards. Apriori has to scan over the whole database in every pass, whereas IL-apriori (C/S) only needs to join relatively short interval lists in the later passes.

Experiments - Results (min_sup = 0.02%)
[Figure: memory usage (MB) per i-th pass, alongside the number of frequent sensor-sets per pass.]
The memory requirement for IL-apriori is a lot higher when there are more frequent sensor-set interval lists to join.

Experiments - Results (min_sup = 0.05%)
[Figure: mining time (sec) vs. number of transactions, 1,000K-8,000K.]
The runtime of all algorithms increases linearly with the total number of transactions.

Experiments - Results (min_sup = 0.05%)
[Figure: memory usage (MB) vs. number of transactions, 1,000K-8,000K.]
The memory required by all algorithms increases with the number of transactions; the rate of increase for IL-apriori is faster.

Conclusions
- An interval-list method for mining sensor data has been described.
- The two interval-list joining strategies are quite effective in reducing running time.
- The memory requirement is quite high.
Future Work
- Other methods for joining interval lists: trade off time against space.
- Extending to the streaming case: consider approaches other than the Lossy Counting algorithm (G. S. Manku and R. Motwani, VLDB '02).

Q&A