A Framework for Mining Sequential Patterns from Spatio-Temporal Event Data Sets
Yan Huang, Liqin Zhang, Pusheng Zhang, IEEE Transactions on Knowledge and Data Engineering, 20 (4), 2008.
Group Webpage: http://www-users.cs.umn.edu/~anuj/8715/index.html (G10)
Group Members: Anuj Karpatne, Vijay Borra

Outline
• Motivation
• Basic Concepts
• Problem Statement
• Challenges
• Key Concepts
• Approach
• Validation
• Novelty
• Contributions
• Assumptions and Suggestions

Motivation
• Earth Science:
  – Global Warming → Bug LifeSpan → Forest Fire
  – Peatland Deforestation → Reduced Soil Moisture
• Epidemiology:
  – Transmission of West Nile Disease:
    • Bird → Mosquito → Human Being
• Climatology:
  – El Nino → Increase in Forest Fires in Indonesia
  – La Nina → Decrease in Forest Fires in Indonesia

Basic Concepts
• Event Instance: <ID, Location, Time, Event Type>
  – ID: Nominal
  – Location: 2D-Tuple
  – Time: t = 0 to T
  – Event Type: Categorical
• Examples:
  – <11, {5,5}, t0, Car Accident>
  – <23, {5,2}, t2, Traffic Jam>
  – <75, (75.50E, 23.30N), May 2009, Deforestation>
  – <83, (75.30E, 23.10N), June 2010, Forest Fire>
• Event Type Sequence: Event Type 1 → Event Type 2 → Event Type 3

Problem Statement
Input:
• Set of Event Types, F = { f1, f2, …, fk }
• Event Database, D = { e1, e2, …, en }, where ei = <IDi, locationi, timei, Event Typei ∈ F>
  For e.g. <a1, {x1,y1}, t1, A>, <b3, {x2,y2}, t2, B>
• User-defined Neighborhood Relation: N(e)
• User-defined Threshold
Output:
• Event type sequences S = {s1, s2, …, sl}, where si = {fi(1) → fi(2) → … → fi(m)}
  For e.g. <A → B>, <C → D → B>, <B → C>
Objective:
• Minimize computational cost
Constraints:
• The algorithm is correct and complete

Challenges
• Developing a scoring mechanism to assess the significance of a given sequential pattern
• Establishing the interpretability of the scoring mechanism using spatial statistics
• Developing an algorithmic design for mining significant patterns
• Dealing with the memory requirements of the algorithm in the presence of a large database of events

Key Concepts
(Source: Fig. 2 of Huang et al.)
(a) A sample spatio-temporal data set.
(b) Densities of events of B in the neighborhoods of events of A, represented by shades of different intensities.

DensityRatio
• f.ε = set of events with event type f
• Example (Source: Fig. 2 of Huang et al.): densityRatio(A → B) = (0 + 0.4 + 0.5 + 0.3) / (4 × 0.16) = 1.875
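The DensityRatio arithmetic from the Fig. 2 example can be sketched in a few lines. This is a minimal illustration, not the authors' code: it reads the slide's numbers as the mean density of B over the neighborhoods of the four A events (0, 0.4, 0.5, 0.3) divided by the overall density of B (0.16). The function name `density_ratio` and its argument names are assumptions introduced here.

```python
def density_ratio(neighborhood_densities, overall_density):
    """Mean density of B over the neighborhoods of A events,
    divided by the overall density of B.

    Assumption: this is the reading of DensityRatio suggested by the
    slide's arithmetic; names and signature are hypothetical.
    """
    if not neighborhood_densities or overall_density <= 0:
        return 0.0  # nothing to compare against
    mean_local = sum(neighborhood_densities) / len(neighborhood_densities)
    return mean_local / overall_density


# Values taken from the slide's example for A -> B.
slide_ratio = density_ratio([0, 0.4, 0.5, 0.3], 0.16)
print(round(slide_ratio, 3))  # 1.875
```

A ratio above 1 suggests that B events are denser after A events than in the data set as a whole, which is what makes A → B a candidate pattern.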
Sequence Index
• Event Type Sequence: S, with S[1] = A, S[2] = B, S[3] = C, S[4] = A
• S[1:3] = {A → B → C}; S[1:4] = {A → B → C → A}
[Figure: events a1–a6, b1–b5, c1–c7 and a1–a6 arranged in columns A.e, B.e, C.e, A.e. The legend marks events that belong to an event sequence, events that do not belong to an event sequence, and events that belong to a Tail Event Set.]

Properties:
• Antimonotone Property
• Weak Antimonotone Property

STS-Miner
Sequential Pattern Tree Node S[k]:
• Event Type
• Tail Event Set
• DensityRatio(S[1:k-1] → S[k])

Algorithm:
  Start with empty sequence
  Do Depth-first expand (S[1:k])
    For each event type S[k+1]
      Generate candidate pattern S[1:k+1]
      Compute new DensityRatio and Tail Event Set using follow join
      If (DensityRatio > threshold)
        Expand pattern tree by adding node S[k+1]
        Depth-first expand (S[1:k+1])
      else
        Mark S[1:k+1] as terminal node
    end
  end
(Source: Fig. 3 of Huang et al.)

Algorithm Trace
(Source: Table 1 of Huang et al.)
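The depth-first expansion can be sketched in Python as follows. This is a hedged illustration under several assumptions that are not taken from the slides: the neighborhood N(e) is a hypothetical box (within `r` on each spatial axis and within `dt` after e in time), boundary effects are ignored when computing densities, and the sequence index of a longer pattern is tracked as the minimum DensityRatio along it. The names `Event`, `follows`, `follow_join` and `sts_miner` are introduced here for illustration only.

```python
from collections import namedtuple

Event = namedtuple("Event", "id x y t type")


def follows(e, o, r, dt):
    """Hypothetical N(e): o is within r on each axis and within dt after e."""
    return abs(o.x - e.x) <= r and abs(o.y - e.y) <= r and 0 < o.t - e.t <= dt


def follow_join(tail, candidates, r, dt, total_volume):
    """Return (DensityRatio, new tail event set) for tail -> candidates."""
    if not tail or not candidates:
        return 0.0, []
    nbr_volume = (2 * r) ** 2 * dt          # volume of the assumed box
    overall = len(candidates) / total_volume
    new_tail, local_sum = {}, 0.0
    for e in tail:
        hits = [o for o in candidates if follows(e, o, r, dt)]
        local_sum += len(hits) / nbr_volume
        for o in hits:
            new_tail[o.id] = o              # de-duplicate by event id
    return (local_sum / len(tail)) / overall, list(new_tail.values())


def sts_miner(events, r, dt, total_volume, threshold):
    """Depth-first expansion; returns {pattern tuple: assumed sequence index}."""
    by_type = {}
    for e in events:
        by_type.setdefault(e.type, []).append(e)
    patterns = {}

    def expand(seq, tail, index):
        for f, cands in sorted(by_type.items()):
            ratio, new_tail = follow_join(tail, cands, r, dt, total_volume)
            if ratio > threshold:           # otherwise the candidate is terminal
                new_index = min(index, ratio)
                patterns[seq + (f,)] = new_index
                expand(seq + (f,), new_tail, new_index)

    for f, evs in sorted(by_type.items()):
        expand((f,), evs, float("inf"))
    return patterns


demo_events = [
    Event("a1", 0.0, 0.0, 0, "A"), Event("a2", 5.0, 5.0, 0, "A"),
    Event("b1", 0.2, 0.1, 1, "B"), Event("b2", 5.1, 5.2, 1, "B"),
    Event("c1", 0.3, 0.3, 2, "C"), Event("c2", 9.0, 9.0, 9, "C"),
]
demo_patterns = sts_miner(demo_events, r=1, dt=2, total_volume=1000, threshold=1)
for pattern, index in sorted(demo_patterns.items()):
    print(" -> ".join(pattern), index)
```

Because the assumed neighborhood only looks forward in time, each expansion moves the tail event set strictly later, so the recursion terminates on any finite event set.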
Slicing-STS-Miner
• Used when in-memory operations can't be performed
• Uses the uni-directional property of time to develop a temporal slicing-based algorithm
• Considers one temporal slice at a time, in a piecemeal fashion
• Three phases of the algorithm:
  – Phase 1: Hashing
  – Phase 2: Mining and Merging
  – Phase 3: Pruning

Hashing: Divide the time dimension into overlapping slices
(Source: Fig. 5 of Huang et al.)

Slicing-STS-Miner
• Mining and merging:
  – Process slices in time-increasing order
  – Keep updating the pattern tree
  – Challenges faced in piecemeal processing:
    • Duplicate sequences in overlapping areas of two consecutive slices
    • Sequences broken by boundaries of consecutive slices
  – Crossing Tail Event Set: e ∈ Crossing Tail Event Set if
    • e ∈ Tail Event Set, or
    • e is located in the overlapping region of two consecutive slices
    • e’ → e, where e’ is in slice i, e is in slice i+1, and e’, e are not in the overlapping area
  – Maintain the Crossing Tail Event Set as a queue, CrossingQ
  – Expand nodes in CrossingQ first, until it is empty, before moving to a new slice
  – Modified Depth-First-Expand
• Pruning:
  – Post-processing step, as it can only be applied once all the slices have been processed.
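The hashing phase, dividing the time dimension into overlapping slices, might look like the sketch below. The parameters `slice_len` and `overlap`, the function names, and the `(id, time)` event representation are all assumptions made for illustration; the slides do not specify how the slice width or overlap is chosen.

```python
def overlapping_slices(t_min, t_max, slice_len, overlap):
    """Cover [t_min, t_max] with slices of length slice_len that share
    `overlap` time units with their successor (hypothetical parameters)."""
    if not 0 <= overlap < slice_len:
        raise ValueError("overlap must satisfy 0 <= overlap < slice_len")
    slices = []
    start = t_min
    while True:
        end = min(start + slice_len, t_max)
        slices.append((start, end))
        if end >= t_max:
            break
        start = end - overlap   # step back so consecutive slices overlap
    return slices


def hash_events(events, slices):
    """Assign each (id, time) event to every slice whose range contains it.
    Events in an overlapping region land in two consecutive slices."""
    buckets = [[] for _ in slices]
    for event in events:
        _, t = event
        for i, (start, end) in enumerate(slices):
            if start <= t <= end:
                buckets[i].append(event)
    return buckets


demo_slices = overlapping_slices(0, 10, slice_len=4, overlap=1)
demo_buckets = hash_events([("e1", 3.5), ("e2", 5), ("e3", 9)], demo_slices)
print(demo_slices)   # [(0, 4), (3, 7), (6, 10)]
print(demo_buckets)  # [[('e1', 3.5)], [('e1', 3.5), ('e2', 5)], [('e3', 9)]]
```

Event e1 falls in the overlap of the first two slices and is therefore stored twice, which is the source of the duplicate sequences that the mining and merging phase has to handle.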
Cost Analysis
Notations:
– Fi : Average size of fi.e; 0< <1; 0< <1
– pn : Number of maximal sequential patterns
– ps : Mean length of maximal sequential patterns
– costloadD : Cost of loading data into main memory
– nSTS / nSlicing-STS : Number of times STS / Slicing-STS loads the entire data into main memory
• In-Memory Processing:
  – costSTS = O(Fi × pn × 2^ps × )
  – costSlicing-STS = O(Fi × p´n × 2^p´s × ), where p´n > pn and p´s > ps
• For Large Data Sets (in-memory cost plus data loading cost):
  – costSTS = costSTS + nSTS × costloadD
  – costSlicing-STS = costSlicing-STS + nSlicing-STS × costloadD

Results
• Results on Synthetic Data:
  – Effect of Sequence Index Thresholds
  – Effect of the Average Number of Event Sequences for Each Pattern
  – Effect of the Average Pattern Size
  – Effect of the Number of Patterns
  – Effect of Slicing Size
• Real World Applications:
  – NPP
  – Temperature
  – Precipitation
  – Solar Radiation
  – Evaporation

Contributions
• Introduction of 2 novel interest measures:
  – Density ratio (for sequences of size at most 2)
  – Sequence index (otherwise)
• Proposed algorithmic designs:
  – STS-Miner: A depth-first-expand based mining method exploiting the weak antimonotone property of the sequence index.
  – Slicing-STS-Miner: Utilizes temporal slicing to partition the data set into overlapping slices when the number of events is too large to be processed in memory.

Novelty
Related Work
• Sequential pattern mining in market-basket data analysis:
  – Events are discrete and considered as transactions in time
  – Example data sets: Web log click streams, DNA sequences and medical treatments
  – Limitation: ‘Transactionization’ is not suited for spatio-temporal data, as space and time are continuous.
• Mining trajectory patterns in spatio-temporal data:
  – Trajectory data of different moving objects reveal insights into the underlying travelling patterns of the objects.
  – Limitation:
    • The same object has to be tracked at different time instances to obtain trajectory data.
    • Trajectory analysis can only be applied if the trajectories have been provided a priori.

Assumptions
• Events are categorical and instantaneous
• Events occur as totally ordered sequences (chains)
• Neighborhood Definition:
  – Contiguous
  – Discrete
• DensityRatio = 1 implies conditional independence
• Statistical interpretability of DensityRatio is assumed

Suggestions
• Continuous and interval-based events
• Graphical models to address partial ordering of event types
• Improvements in neighborhood definition:
  – Real-valued, based on spatio-temporal closeness
  – Incorporating cyclicity (non-contiguous nature) in neighborhood functions using transformations such as basis functions or kernel functions
  – Dynamically expanding or contracting neighborhoods
  – Incorporating prior knowledge of influences between event type pairs
• Monte Carlo simulations for interpretability of DensityRatio = 1
• Clearer space and time complexity analysis