A Framework for Mining Sequential Patterns from
Spatio-Temporal Event Data Sets
Yan Huang, Liqin Zhang, Pusheng Zhang, IEEE Transactions on
Knowledge and Data Engineering, 20 (4), 2008.
Group Webpage:
http://www-users.cs.umn.edu/~anuj/8715/index.html
(G10) Group Members:
Anuj Karpatne
Vijay Borra
Outline
• Motivation
• Basic Concepts
• Problem Statement
• Challenges
• Key Concepts
• Approach
• Validation
• Novelty
• Contributions
• Assumptions and Suggestions
Motivation
• Earth Science:
– Global Warming → Bug LifeSpan → Forest Fire
– Peatland Deforestation → Reduced Soil Moisture
• Epidemiology:
– Transmission of West Nile Disease:
• Bird → Mosquito → Human Being
• Climatology:
– El Nino → Increase in Forest Fires in Indonesia
– La Nina → Decrease in Forest Fires in Indonesia
– El Nino → La Nina
– Global Warming → Forest Fire
Basic Concepts
• Event Instance:
<ID, Location, Time, Event Type>
– ID: Nominal
– Location: 2D-Tuple
– Time: t = 0 to T
– Event Type: Categorical
• Examples:
– <11, {5,5}, t0, Car Accident>
– <23, {5,2}, t2, Traffic Jam>
– <75, (75.50E, 23.30N), May 2009, Deforestation>
– <83, (75.30E, 23.10N), June 2010, Forest Fire>
• Event type Sequence:
Event Type 1 → Event Type 2 → Event Type 3
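An event instance as defined above maps naturally onto a small record type. The following is a minimal Python sketch, not code from the paper: the class name `Event`, its field names, and the helper `events_of_type` are assumptions made for illustration, and the symbolic times t0 and t2 are encoded as the numbers 0.0 and 2.0.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """One event instance <ID, Location, Time, Event Type> (field names are hypothetical)."""
    event_id: str                   # nominal identifier
    location: Tuple[float, float]   # 2D tuple
    time: float                     # time stamp in [0, T]
    event_type: str                 # categorical label


# The first two examples from the slide; t0 and t2 are encoded as 0.0 and 2.0.
events = [
    Event("11", (5.0, 5.0), 0.0, "Car Accident"),
    Event("23", (5.0, 2.0), 2.0, "Traffic Jam"),
]


def events_of_type(db: List[Event], event_type: str) -> List[Event]:
    """Return the set f.e: all events in db whose type is event_type."""
    return [e for e in db if e.event_type == event_type]
```

Freezing the dataclass makes events hashable, so they can be stored in sets such as a tail event set.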
Problem Statement
Input:
• Set of Event Types, F = { f1, f2, …, fk }
• Event Database, D = { e1, e2, …, en }
where ei = <IDi, locationi, timei, Event Typei ∈ F>
E.g.
<a1, {x1,y1}, t1, A>
<b3, {x2,y2}, t2, B>
• User-defined Neighborhood Relation: N(e)
• User-defined Threshold
Output:
• Event type sequences S = {s1, s2, …, sl}
where si = {fi(1) → fi(2) → … → fi(m)}
E.g.
<A → B>
<C → D → B>
<B → C>
Objective:
• Minimize computational cost
Constraints:
• The algorithm is correct and complete
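The neighborhood relation N(e) is left to the user. One possible choice, sketched here in Python purely as an assumption, is a space-time cylinder: another event lies in N(e) if it is within a spatial radius of e and occurs strictly after e within a time window. The names `in_neighborhood`, `radius`, and `window`, and their default values, are hypothetical and not taken from the paper.

```python
import math
from typing import NamedTuple, Tuple


class Event(NamedTuple):
    event_id: str
    location: Tuple[float, float]
    time: float
    event_type: str


def in_neighborhood(e: Event, other: Event,
                    radius: float = 3.5, window: float = 2.0) -> bool:
    """True if `other` lies in the (hypothetical) cylinder-shaped N(e)."""
    dx = other.location[0] - e.location[0]
    dy = other.location[1] - e.location[1]
    close_in_space = math.hypot(dx, dy) <= radius
    # Time is uni-directional: only events strictly after e can follow it.
    follows_in_time = 0 < other.time - e.time <= window
    return close_in_space and follows_in_time


accident = Event("11", (5.0, 5.0), 0.0, "Car Accident")
jam = Event("23", (5.0, 2.0), 2.0, "Traffic Jam")
print(in_neighborhood(accident, jam))   # the jam follows the accident
print(in_neighborhood(jam, accident))   # the reverse does not hold
```

Because the time condition is one-sided, the relation is asymmetric, which is what lets an ordered pattern such as Car Accident → Traffic Jam be distinguished from its reverse.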
Challenges
• Developing a scoring mechanism to assess the significance of a given sequential pattern
• Establishing the interpretability of the scoring mechanism using spatial statistics
• Developing an algorithmic design for mining significant patterns
• Dealing with the memory requirements of the algorithm in the presence of a large database of events
Key Concepts
(Source: Fig. 2 of Huang et al.)
(a) A sample spatio-temporal data set.
(b) Densities of events of B in events of A's neighborhoods, represented by shades of different intensities.
DensityRatio
f.ε = set of events with event type f
$$\mathrm{densityRatio}(A \rightarrow B) = \frac{0 + 0.4 + 0.5 + 0.3}{4 \times 0.16} = 1.875$$
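The arithmetic of the Fig. 2 example can be checked with a few lines of Python. This sketch assumes the numbers are read as follows: 0, 0.4, 0.5 and 0.3 are the densities of B events in the neighborhoods of the four A events, and 0.16 is the overall density of B events. The function name `density_ratio` is hypothetical.

```python
from typing import Sequence


def density_ratio(neighborhood_densities: Sequence[float],
                  overall_density: float) -> float:
    """Mean density of B in the neighborhoods of A events, relative to B's overall density."""
    mean_density = sum(neighborhood_densities) / len(neighborhood_densities)
    return mean_density / overall_density


# Values from the slide's worked example (Fig. 2 of Huang et al.).
ratio = density_ratio([0.0, 0.4, 0.5, 0.3], 0.16)
print(round(ratio, 3))
```

Under this reading, a ratio above 1 says that B events are denser around A events than in the data set as a whole.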
Sequence Index
Event Type Sequence: S
S[1] = A, S[2] = B, S[3] = C, S[4] = A
(Figure: the event sets A.e = {a1, …, a6}, B.e = {b1, …, b5}, C.e = {c1, …, c7} and A.e = {a1, …, a6} drawn as four columns, one per position of S. Each event is marked as belonging to an event sequence, not belonging to an event sequence, or belonging to a Tail Event Set.)
S[1:3] = {A → B → C}
S[1:4] = {A → B → C → A}
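How a sequence index might be derived from density ratios can be sketched as follows. This rests on an assumption that is not spelled out in the slides: the sequence index of S[1:k] is taken to be the smallest density ratio met while growing the pattern one event type at a time, that is, the minimum over the steps S[1:j-1] → S[j] for j = 2..k. The function name and the ratios 1.4 and 2.1 are hypothetical.

```python
from typing import List, Sequence


def sequence_index(step_ratios: Sequence[float]) -> float:
    """Assumed definition: the minimum density ratio over all growth steps of the pattern."""
    if not step_ratios:
        raise ValueError("a pattern needs at least two event types")
    return min(step_ratios)


# Hypothetical density ratios for A → B, (A → B) → C, and (A → B → C) → A.
ratios = [1.875, 1.4, 2.1]

# Sequence index of each prefix S[1:2], S[1:3], S[1:4].
prefix_indices: List[float] = [
    sequence_index(ratios[:j]) for j in range(1, len(ratios) + 1)
]
print(prefix_indices)
```

Under this assumption the index of a pattern can never exceed the index of its prefix, so a prefix that falls below the threshold does not need to be extended further.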
Properties:
• Antimonotone Property
• Weak Antimonotone Property
STS-Miner
Sequential Pattern Tree
Node S[k]:
• Event Type
• Tail Event Set
• DensityRatio(S[1:k-1] → S[k])
Algorithm:
Start with empty sequence
Do Depth-first expand (S[1:k])
    For each event type S[k+1]
        Generate candidate pattern S[1:k+1]
        Compute new DensityRatio and Tail Event Set using follow join
        If (DensityRatio > threshold)
            Expand pattern tree by adding node S[k+1]
            Depth-first expand (S[1:k+1])
        else
            Mark S[1:k+1] as terminal node
        end
    end
Source: Fig. 3 of Huang et al.
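The depth-first expansion above can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation. It assumes a cylinder-shaped neighborhood with hypothetical sizes `RADIUS` and `WINDOW`, measures density as event count per unit of space-time volume, and takes the total volume of the study region as a parameter. All function names are invented for the sketch.

```python
import math
from typing import Dict, List, NamedTuple, Tuple


class Event(NamedTuple):
    event_id: str
    location: Tuple[float, float]
    time: float
    event_type: str


RADIUS, WINDOW = 1.5, 2.0                     # hypothetical neighborhood size
NBR_VOLUME = math.pi * RADIUS ** 2 * WINDOW   # volume of the space-time cylinder


def follows(e: Event, other: Event) -> bool:
    """True if `other` is spatially close to e and occurs shortly after it."""
    near = math.dist(e.location, other.location) <= RADIUS
    return near and 0 < other.time - e.time <= WINDOW


def follow_join(tail: List[Event], candidates: List[Event]):
    """Return (mean density of candidates in tail neighborhoods, new tail event set)."""
    if not tail:
        return 0.0, []
    counts = [sum(follows(e, c) for c in candidates) for e in tail]
    new_tail = [c for c in candidates if any(follows(e, c) for e in tail)]
    return sum(counts) / len(tail) / NBR_VOLUME, new_tail


def sts_miner(db: List[Event], total_volume: float, threshold: float):
    """Depth-first growth of patterns whose every step has density ratio > threshold."""
    by_type: Dict[str, List[Event]] = {}
    for e in db:
        by_type.setdefault(e.event_type, []).append(e)
    patterns = []

    def expand(seq: List[str], tail: List[Event]) -> None:
        for f, f_events in by_type.items():
            mean_density, new_tail = follow_join(tail, f_events)
            ratio = mean_density / (len(f_events) / total_volume)
            if new_tail and ratio > threshold:
                patterns.append((seq + [f], ratio))
                expand(seq + [f], new_tail)
            # Otherwise the candidate is a terminal node and is not expanded.

    for f, f_events in by_type.items():
        expand([f], f_events)   # the tail set of a single type is all its events
    return patterns


db = [
    Event("a1", (0.0, 0.0), 0.0, "A"), Event("a2", (10.0, 10.0), 0.0, "A"),
    Event("b1", (0.0, 1.0), 1.0, "B"), Event("b2", (10.0, 11.0), 1.0, "B"),
    Event("c1", (50.0, 50.0), 5.0, "C"),
]
patterns = sts_miner(db, total_volume=60 * 60 * 6, threshold=1.0)
print([p for p, _ in patterns])
```

The recursion stops on its own because every event in a new tail set is strictly later than some event in the previous one, so tail sets eventually become empty.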
Algorithm Trace
Source: Table 1 of Huang et al.
Slicing-STS-Miner
• Used when in-memory operations can't be performed
• Uses the uni-directional property of time to develop a temporal slicing-based algorithm
• Considers one temporal slice at a time in a piecemeal fashion
• Three phases of the algorithm:
– Phase 1: Hashing
– Phase 2: Mining and Merging
– Phase 3: Pruning
Hashing: Divide the time dimension into overlapping slices
Source: Fig. 5 of Huang et al.
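The hashing phase can be illustrated with a short Python sketch that assigns time stamps to overlapping slices. It assumes fixed-length slices, non-negative times, and a constant overlap between consecutive slices. The function name `hash_into_slices` and its parameters are hypothetical, and the paper's actual slicing scheme may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def hash_into_slices(times: Iterable[float], slice_length: float,
                     overlap: float) -> Dict[int, List[float]]:
    """Assign each time stamp to every slice [k*step, k*step + slice_length) containing it."""
    if not 0 <= overlap < slice_length:
        raise ValueError("overlap must be non-negative and smaller than slice_length")
    step = slice_length - overlap           # distance between slice start times
    slices: Dict[int, List[float]] = defaultdict(list)
    for t in times:
        last = int(t // step)               # latest slice that starts at or before t
        first = max(0, int((t - slice_length) // step) + 1)
        for k in range(first, last + 1):
            slices[k].append(t)
    return dict(slices)


# Slices of length 10 with an overlap of 2: [0, 10), [8, 18), [16, 26), ...
print(hash_into_slices([1, 9, 17], slice_length=10, overlap=2))
```

Events that fall in an overlap region are stored in both neighboring slices, which is where duplicate sequences across consecutive slices can come from.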
Slicing-STS-Miner
• Mining and merging:
– Process slices in time-increasing order
– Keep updating pattern tree
– Challenges faced in piecemeal processing:
• Duplicate sequences in overlapping areas of two consecutive slices
• Sequences broken by boundaries of consecutive slices
– Crossing Tail Event Set: e ∈ Crossing Tail Event Set if
• e ∈ Tail Event Set, or
• e is located in the overlapping region of two consecutive slices
• e' → e, where e' is in slice i, e is in slice i+1, and e', e are not in the overlapping area
– Handling of the Crossing Tail Event Set:
• Maintain Crossing Tail Event Set as a queue CrossingQ
• Expand nodes in CrossingQ first till it is empty before moving to a new slice
– Modified Depth-First-Expand
• Pruning:
– Post-processing step, since it can only be applied once all the slices have been processed.
Cost Analysis
Notations:
– Fi : Average size of fi.e; 0< <1; 0< <1
– pn : Number of maximal sequential patterns
– ps : Mean length of maximal sequential patterns
– costloadD : Cost of loading data into main memory
– nSTS / nSlicing-STS : Number of times STS / Slicing-STS loads the entire data into main memory
• In-Memory Processing:
– costSTS = O(Fi X pn X 2 ps X )
– costSlicing-STS = O(Fi X p´n X 2 p´s X ), where, p´n > pn, and p´s > ps
• For Large Data sets:
– costSTS = costSTS + nSTS X costloadD
– costSlicing-STS = costSlicing-STS + nSlicing-STS X costloadD
Results
• Results on Synthetic Data:
– Effect of Sequence Index Thresholds
– Effect of the Average Number of Event Sequences for Each Pattern
– Effect of the Average Pattern Size
– Effect of the Number of Patterns
– Effect of Slicing Size
• Real World Applications:
– NPP
– Temperature
– Precipitation
– Solar Radiation
– Evaporation
Contributions
• Introduction of two novel interest measures:
– Density ratio (for sequences of size at most 2)
– Sequence index (otherwise)
• Proposed algorithmic designs:
– STS-Miner: A depth-first expand based mining method exploiting the weak
antimonotone property of Sequence index.
– Slicing-STS-Miner: Utilizes temporal slicing to partition the dataset into
overlapping slices when the number of events is too large to be processed
in memory.
Novelty
Related Work
• Sequential pattern mining in market-basket data analysis:
– Events are discrete and considered as transactions in time
– Example datasets: web log click streams, DNA sequences and medical treatments
– Limitation: 'Transactionization' is not suited to spatio-temporal data, as space and time are continuous.
• Mining trajectory patterns in spatio-temporal data:
– Trajectory data of different moving objects reveal insights into the underlying travelling patterns of the objects.
– Limitation:
• The same object has to be tracked at different time instances to obtain trajectory data.
• Trajectory analysis can only be applied if the trajectories have been provided a priori.
Assumptions
• Events are categorical and instantaneous
• Events occur as totally ordered sequences (chains)
• Neighborhood Definition
– Contiguous
– Discrete
• DensityRatio = 1 implies conditional independence
• Statistical interpretability of DensityRatio is assumed
Suggestions
• Continuous and Interval-based events
• Graphical models to address partial ordering of event types
• Improvements in neighborhood definition
– Real-valued based on spatio-temporal closeness
– Incorporating cyclicity (non-contiguous nature) in neighborhood functions
using transformations such as basis functions or kernel functions
– Dynamically expanding or contracting neighborhoods
– Incorporating prior knowledge of influences between event type pairs
• Monte Carlo simulations for interpretability of DensityRatio = 1
• Clearer space and time complexity analysis