Mining of Frequent Patterns from Sensor Data
Presented by: Ivy Tong Suk Man
Supervisor: Dr. B. C. M. Kao
20 August, 2003

Outline of the Presentation
- Motivation
- Problem Definition
- Algorithm
  • Apriori with data transformation
  • Interval-List Apriori
- Experimental Results
- Conclusion

Motivation
Continuous items reflect values from an entity that changes continuously in the external environment. An update is a change of state of the real entity, e.g. temperature reading data:
- Initial temperature: 25°C at t = 0s
- Sequence of updates <timestamp, new_temp>: <1s, 27°C>, <5s, 28°C>, <10s, 26°C>, <14s, ...>
- So the temperature is 25°C from t = 0s to 1s, 27°C from t = 1s to 5s, and 28°C from t = 5s to 10s.
What is the average temperature from t = 0s to 10s?
- Answer: (25×1 + 27×4 + 28×5) / 10 = 27.3°C

Motivation
Time is a component in some applications, e.g. stock price quotes and network traffic data. "Sensors" are used to monitor some conditions, for example:
- prices of stocks, by getting quotations from a finance website;
- weather, by measuring temperature, humidity, air pressure, wind, etc.
We want to find correlations among the readings of a set of sensors.
Goal: to mine association rules from sensor data.

Challenges
How different is this from mining association rules from market basket data?
- Time component: when searching for association rules in market basket data, the time field is usually ignored, as there is no temporal correlation between the transactions.
- Streaming data: data arrives continuously, possibly infinitely, and in large volume.

Notations
- We have a set of sensors R = {r_1, r_2, ..., r_m}.
- Each sensor r_i has a set of numerical states V_i. We assume binary states for all sensors: V_i = {0, 1} for all i s.t. r_i ∈ R.
- Dataset D is a sequence of updates of sensor states of the form <ts, r_i, v_i>, where r_i ∈ R and v_i ∈ V_i:
  • ts: timestamp of the update
  • r_i: the sensor to be updated
  • v_i: the new value of the state of r_i
- For sensors with binary states, an update can be written as <ts, r_i>, since the new state can be inferred by toggling the old state.

Example
R = {A, B, C, D, E, F}; initial states: all off.
D: <1,A> <2,B> <4,D> <5,A> <6,E> <7,F> <8,E> <10,A> <11,F> <13,C>
[Figure: on/off timelines of sensors A to F; e.g. A is ON during [1,5) and from t = 10 on, B from t = 2 on, E during [6,8), F during [7,11).]

More Notations
An association rule is a rule, satisfying certain support and confidence restrictions, of the form X → Y, where X ⊆ R, Y ⊆ R, X ∩ Y = ∅, X ≠ ∅ and Y ≠ ∅.
- An association rule X → Y has confidence c if, during c% of the time when the sensors in X are ON (with state = 1), the sensors in Y are also ON.
- An association rule X → Y has support s if, during s% of the total length of history, the sensors in X and Y are all ON.

More Notations
- TLS(X) denotes the Total LifeSpan of X: the total length of time during which all the sensors in X are ON.
- T is the total length of history.
- Sup(X) = TLS(X) / T
- Conf(X → Y) = Sup(X ∪ Y) / Sup(X)
Example (T = 15s): TLS(A) = 9 and TLS(AB) = 8, so
- Sup(A) = 9/15 = 60%
- Sup(AB) = 8/15 = 53%
- Conf(A → B) = 8/9 = 89%
[Figure: on/off timelines of A and B.]

Algorithm A - Transform & Apriori
Transform the sequence of updates into the form of market basket data. At each point of update:
- take a snapshot of the states of all sensors;
- output all sensors with state = ON as a transaction;
- attach Weight(transaction) = Lifespan(this update) = timestamp(next update) - timestamp(this update).
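To make the transformation concrete, here is a minimal Python sketch of Algorithm A's preprocessing step (an illustration under the slides' assumptions of binary sensors and toggle updates, not the author's actual implementation):

```python
# Algorithm A, data transformation (illustrative sketch): turn a sequence of
# binary-sensor toggle updates <ts, sensor> into weighted market-basket
# transactions, one per update, weighted by the update's lifespan.

def transform(updates, end_of_history):
    """updates: list of (timestamp, sensor) toggles sorted by timestamp.
    Returns a list of (frozenset of ON sensors, weight) transactions."""
    state = set()                 # sensors currently ON; all OFF initially
    transactions = []
    for i, (ts, sensor) in enumerate(updates):
        state ^= {sensor}         # toggle the updated sensor
        # lifespan = time until the next update (or the end of history)
        next_ts = updates[i + 1][0] if i + 1 < len(updates) else end_of_history
        transactions.append((frozenset(state), next_ts - ts))
    return transactions

# The running example from the slides, with end of history T = 15s:
D = [(1, 'A'), (2, 'B'), (4, 'D'), (5, 'A'), (6, 'E'),
     (7, 'F'), (8, 'E'), (10, 'A'), (11, 'F'), (13, 'C')]
for items, weight in transform(D, 15):
    print(sorted(items), weight)  # ['A'] 1, then ['A', 'B'] 2, ...
```

Running this reproduces the transformed database D' shown in the example that follows.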
Algorithm A - Example
Initial states: all off; end of history T = 15s.
D: <1,A>, <2,B>, <4,D>, <5,A>, <6,E>, <7,F>, <8,E>, <10,A>, <11,F>, <13,C>
The transformation proceeds update by update: each update produces one transaction, whose weight is fixed once the next update arrives (e.g. TID 1 gets weight 2 - 1 = 1, TID 2 gets weight 4 - 2 = 2, and the last transaction gets weight 15 - 13 = 2).
Transformed database D':

TID   Items     Weight
 1    A         1
 2    A,B       2
 3    A,B,D     1
 4    B,D       1
 5    B,D,E     1
 6    B,D,E,F   1
 7    B,D,F     2
 8    A,B,D,F   1
 9    A,B,D     2
10    A,B,C,D   2

Algorithm A
Apply Apriori on the transformed dataset D', with support computed over weights:
sup(X) = ( Σ_{T ∈ D', X ⊆ T} Weight(T) ) / ( Σ_{T ∈ D'} Weight(T) )
Drawback: a lot of redundancy, since adjacent transactions may be very similar, differing only in the one sensor whose state was updated.

Algorithm B - Interval-List Apriori
Uses an "interval-list" format <X, interval_1, interval_2, interval_3, ...>, where each interval_i is an interval in which all sensors in X are ON.
TLS(X) = Σ_i (interval_i.h - interval_i.l)
Example: <A, [1,5), [10,15)>, so TLS(A) = (5-1) + (15-10) = 9.
[Figure: timeline of A, ON during [1,5) and [10,15).]

Algorithm B - Step 1
- For each r_i ∈ R, build a list of intervals in which r_i is ON by scanning the sequence of updates.
- Calculate the TLS of each r_i.
- If TLS(r_i) ≥ min_sup, put r_i into L_1.

Algorithm B - Step 1 Example
Initial states: all off.
D: <1,A>, <2,B>, <4,D>, <5,A>, <6,E>, <7,F>, <8,E>, <10,A>, <11,F>, <13,C>
Scanning the updates one by one: an update to a sensor that is OFF opens an interval (e.g. <1,A> turns <A, empty> into <A, [1,?)>), and an update to a sensor that is ON closes its open interval (e.g. <5,A> turns <A, [1,?)> into <A, [1,5)>). At the end of history (T = 15s), all still-open intervals are closed, giving:
<A, [1,5), [10,15)>
<B, [2,15)>
<C, [13,15)>
<D, [4,15)>
<E, [6,8)>
<F, [7,11)>
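A minimal Python sketch of this one-pass construction (again an illustration under the same assumptions, not the thesis code), including the TLS computation and the L_1 filter of Step 1:

```python
# Algorithm B, Step 1 (illustrative sketch): one scan over the update
# sequence builds, for each sensor, its list of [on, off) intervals; sensors
# whose total lifespan (TLS) reaches the minimum support survive into L1.

def build_interval_lists(updates, end_of_history):
    intervals = {}    # sensor -> list of (start, end) pairs, end-exclusive
    open_since = {}   # sensor -> timestamp at which it was switched ON
    for ts, sensor in updates:
        if sensor in open_since:      # ON -> OFF: close the open interval
            intervals.setdefault(sensor, []).append((open_since.pop(sensor), ts))
        else:                         # OFF -> ON: open a new interval
            open_since[sensor] = ts
    for sensor, start in open_since.items():   # still ON at end of history
        intervals.setdefault(sensor, []).append((start, end_of_history))
    return intervals

def tls(interval_list):
    return sum(end - start for start, end in interval_list)

D = [(1, 'A'), (2, 'B'), (4, 'D'), (5, 'A'), (6, 'E'),
     (7, 'F'), (8, 'E'), (10, 'A'), (11, 'F'), (13, 'C')]
lists = build_interval_lists(D, 15)
min_sup_count = 3                    # absolute minimum lifespan, as below
L1 = {s: iv for s, iv in lists.items() if tls(iv) >= min_sup_count}
print(lists['A'], tls(lists['A']))   # [(1, 5), (10, 15)] 9
print(sorted(L1))                    # ['A', 'B', 'D', 'F']
```

With a minimum support count of 3, this matches the example lattice below: C and E (lifespan 2 each) drop out of L_1.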
Algorithm B - Step 2
Find all larger frequent sensor-sets, similarly to Apriori, using the frequent itemset property: any subset of a frequent sensor-set must be frequent.
Method: generate candidates of size i+1 from frequent sensor-sets of size i.
- Approach used: join two size-i frequent sensor-sets into a size-(i+1) sensor-set if they agree on i-1 sensors.
- Candidates that have subsets which are not large may also be pruned.
- Count the support by merging (intersecting) the interval lists of the two size-i frequent sensor-sets.
- If sup ≥ min_sup, put the candidate into L_{i+1}.
- Repeat the process until the candidate set is empty.

Algorithm B - Example (T = 15)
<A, [1,5), [10,15)> joined with <B, [2,15)> gives <AB, [2,5), [10,15)>.
[Figure: timelines of A and B, with the periods in which both are ON highlighted.]

Algorithm B - Example (min support count: 3)
Size 1:
  A: [1,5), [10,15)  LS:9     B: [2,15)  LS:13     C: [13,15)  LS:2
  D: [4,15)  LS:11            E: [6,8)   LS:2      F: [7,11)   LS:4
Size 2:
  AB: [2,5), [10,15)  LS:8    AD: [4,5), [10,15)  LS:6    AF: [10,11)  LS:1
  BD: [4,15)  LS:11           BF: [7,11)  LS:4
Size 3:
  ABD: [4,5), [10,15)  LS:6
(C, E and AF fall below the minimum support count of 3.)

Algorithm B - Candidate Generation
- When generating a candidate sensor-set C of size i from two size-(i-1) sensor-sets L_A and L_B (subsets of C), we also construct the interval list of C by intersecting the interval lists of L_A and L_B.
- Joining the two interval lists (of lengths m and n) is a key step in our algorithm; a simple linear scan requires O(m + n) time.
- There are i different size-(i-1) subsets of C. Which two should we pick?

Algorithm B - Candidate Generation
- Method 1: choose the two lists with the fewest number of intervals. This requires storing the number of intervals for each sensor-set.
- Method 2: choose the two lists with the smallest count (TLS). Intuitively, a shorter lifespan implies fewer intervals. This is easier to implement, since the lifespan is already available from checking whether the sensor-set is frequent.
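The interval-list join itself can be done with a single linear merge of the two sorted lists. The following Python sketch (an assumed implementation consistent with the O(m + n) linear scan described above, not code from the thesis) intersects two interval lists:

```python
# Interval-list intersection (illustrative sketch): merge two sorted lists of
# half-open [lo, hi) intervals in O(m + n) time, keeping the overlaps. The
# result is the interval list of the joined sensor-set.

def intersect(xs, ys):
    """xs, ys: sorted, non-overlapping (lo, hi) interval lists."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        lo = max(xs[i][0], ys[j][0])
        hi = min(xs[i][1], ys[j][1])
        if lo < hi:                  # the current pair of intervals overlaps
            out.append((lo, hi))
        if xs[i][1] < ys[j][1]:      # advance the list whose interval ends first
            i += 1
        else:
            j += 1
    return out

# The slides' join example: A = [1,5), [10,15) and B = [2,15)
print(intersect([(1, 5), (10, 15)], [(2, 15)]))  # [(2, 5), (10, 15)], TLS = 8
```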
Experiments - Data Generation
- Simulate data generated by a set of n binary sensors, making use of a standard market basket dataset.
- With n sensors, each of which can be either ON or OFF, there are 2^n possible combinations of sensor states.
- Assign a probability to each of the combinations.

Experiments - Data Generation
How do we assign the probabilities?
- Let N be the number of occurrences of the transaction in the market basket dataset that contains exactly the sensors that are ON.
  • E.g. consider R = {A, B, C, D, E, F} and suppose we want to assign a probability to the sensor state AC (only A and C are ON). N is the number of transactions that contain exactly A and C and nothing else.
- Assign probability = N / |D|, where |D| is the size of the market basket dataset.
- Note: the market basket dataset needs to be sufficiently large, so that transactions that occur very infrequently are not given ZERO probability.

Experiments - Data Generation
Generating the sensor data: choose the initial state (at t = 0s) either
- randomly,
- according to the probabilities assigned, or
- by picking the combination with the highest assigned probability.
This gives the first set of sensor states.

Experiments - Data Generation
What is the next set of sensor states?
- For simplicity, in our model only one sensor can be updated at a time: for any two adjacent updates, the sensor states at the two time instants differ by exactly one sensor.
- Changing only one sensor state yields n possible combinations, one for each of the n sensor states that could be toggled.
- We normalize the probabilities of these n combinations by their sum and pick the next set of sensor states according to the normalized probabilities.
- Inter-arrival times of updates follow an exponential distribution.

Experiments - Setup
Market basket dataset:
- 8,000,000 transactions; 100 items
- number of maximal potentially large itemsets: 2,000
- average transaction length: 10
- average length of maximal large itemsets: 4
- maximum length of maximal large itemsets: 11
- minimum support: 0.05%
Algorithms:
- Apriori in cached mode (cache-apriori)
- IL-Apriori:
  • (a) random join (IL-apriori)
  • (b) join by smallest lifespan (IL-apriori-S)
  • (c) join by fewest number of intervals (IL-apriori-C)

Experiments - Results
[Figure: mining time (sec) vs. support threshold, 0-5%, for cache-apriori, IL-apriori, IL-apriori-C, IL-apriori-S.]
At larger supports, all IL-Apriori variants outperform cache-apriori.

Experiments - Results
[Figure: mining time (sec) vs. support threshold, 0.01-0.07%.]
At lower supports there are more candidates, so joining interval lists becomes expensive for IL-apriori. With more long frequent sensor-sets, Apriori has to match the candidates by searching through the whole database, while IL-apriori-C and IL-apriori-S save a lot of the time spent joining the lists.

Experiments - Results
[Figure: peak memory usage (MB) vs. support threshold, 0-5%.]
Peak memory usage: cache-apriori stores the whole database, while IL-apriori stores a lot of interval lists when the number of candidates grows large.

Experiments - Results (min_sup = 0.02%)
[Figure: mining time (sec) per i-th pass, passes 1-12.]
Apriori is faster in the first 3 passes; the running time for IL-apriori drops sharply afterwards. Apriori has to scan over the whole database in every pass, whereas IL-apriori (C/S) only needs to join relatively short interval lists in the later passes.

Experiments - Results (min_sup = 0.02%)
[Figure: memory usage (MB) per i-th pass, alongside the number of frequent sensor-sets per pass.]
The memory requirement for IL-apriori is a lot higher when there are more frequent sensor-set interval lists to join.

Experiments - Results (min_sup = 0.05%)
[Figure: mining time (sec) vs. number of transactions, 1,000K-8,000K.]
The runtime of all algorithms increases linearly with the total number of transactions.

Experiments - Results (min_sup = 0.05%)
[Figure: memory usage (MB) vs. number of transactions, 1,000K-8,000K.]
The memory required by all algorithms increases with the number of transactions; the rate of increase for IL-apriori is faster.

Conclusions
- An interval-list method for mining sensor data has been described.
- The two interval-list joining strategies are quite effective in reducing running time.
- The memory requirement is quite high.
Future Work
- Other methods for joining interval lists: trade off time against space.
- Extending to the streaming case: consider approaches other than the Lossy Counting algorithm (G. S. Manku and R. Motwani, VLDB '02).

Q&A