Temporal Relationship Among Clusters for Data Streams
Margaret H. Dunham, Michael Hahsler, Doug Raiford
Students: Yu Meng, Donya Quick, Jie Huang, Charlie Isaksson, Mallik Kotamarti
CSE Department
Southern Methodist University
Dallas, Texas 75275
This material is based upon work supported by the National Science Foundation under Grant No IIS-0948893.
10/26/09, Wilfrid Laurier University
Objectives/Outline
Traditional clustering of data streams ignores one of the most salient features of streams: ordering
 Introduction
 Background
 TRAC-DS
 TRAC-DS Applications
 Conclusions/Future Work
Objectives/Outline
 Introduction
 Stream Data
 Motivation
 Background
 TRAC-DS
 TRAC-DS Applications
 Conclusions/Future Work
Stream Data
A growing number of applications generate streams
of data.
 Computer network monitoring data
 Call detail records in telecommunications
 Highway transportation traffic data
 Online web purchase log records
 Sensor network data
 Stock exchange, transactions in retail chains, ATM
operations in banks, credit card transactions.
Clustering techniques play a key role in modeling
and analyzing this data.
Stream Data Format
 Events arriving in a stream
 At any time, t, we can view the state of the problem as represented by a vector of n numeric values:
Vt = <S1t, S2t, ..., Snt>

(rows ordered by time)
        S1     S2     …     Sn
V1      S11    S21    …     Sn1
V2      S12    S22    …     Sn2
…       …      …      …     …
Vq      S1q    S2q    …     Snq
Data Stream Modeling
 Single pass: each record is examined at most once
 Bounded storage: limited memory for storing the synopsis
 Real-time: per-record processing time must be low
 Summarization (synopsis) of data
 Use the data, NOT a SAMPLE
 Temporal and spatial
 Dynamic
 Continuous (infinite stream)
 Learn
 Forget
 Sublinear growth rate - Clustering
Traditional Clustering
TRAC-DS
Motivation
 Temporal Ordering is a major feature of
stream data.
 Many stream applications depend on this
ordering
 Prediction of future values
 Anomaly (rare event) detection
 Concept drift
Objectives/Outline
 Introduction
 Background
 Clustering Stream Data
 Extensible Markov Model - EMM
 TRAC-DS
 TRAC-DS Applications
 Conclusions/Future Work
Stream Clustering Requirements
 Dynamic updating of the clusters
 Identify outliers
 Barbara [2]:
 compactness
 fast
 incremental processing
Stream Clustering Algorithms
 LOCALSEARCH [4]
 Partitions stream into segments
 Clusters each segment individually by solving the k-medians problem
 Iteratively reclusters the resulting centers
 CluStream [1]
 Micro-clusters represented by summary statistics.
 Micro-clusters are handled online
 Micro-clusters merged offline
 MONIC [13]
 Evolution of clusters over time
 Cluster transitions over time
MM
A first-order Markov chain is a finite or countably infinite sequence of events {E1, E2, …} over discrete time points, where Pij = P(Ej | Ei), and at any time the future behavior of the process depends solely on the current state.
A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:
 S = {N1, N2, …, Nm}, and
 A = {Lij | i ∈ {1, 2, …, m}, j ∈ {1, 2, …, m}}, and each arc Lij = <Ni, Nj> is labeled with a transition probability Pij = P(Nj | Ni).
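As a rough illustration of this definition, the Python sketch below stores a Markov model as a nested dictionary of arc labels, checks that each state's outgoing probabilities sum to one, and computes the probability of a path under the first-order assumption. The state names and probability values are invented for the example and are not taken from the talk.

```python
import math

# Hypothetical transition probabilities P[i][j] = P(Nj | Ni); values are illustrative only.
P = {
    "N1": {"N2": 0.5, "N3": 0.5},
    "N2": {"N1": 1.0},
    "N3": {"N3": 1.0},
}

def is_valid_model(P):
    """Return True if every state's outgoing probabilities sum to 1."""
    return all(math.isclose(sum(row.values()), 1.0) for row in P.values())

def path_probability(P, path):
    """Probability of following `path`, given that the process starts in path[0].

    Under the first-order Markov property each step depends only on the
    current state, so the path probability is a product of arc labels.
    """
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        # A missing arc is treated as probability 0.
        prob *= P.get(current, {}).get(nxt, 0.0)
    return prob

print(is_valid_model(P))
print(path_probability(P, ["N1", "N2", "N1", "N3"]))
```

A nested dictionary keeps the model sparse: only arcs that actually exist are stored, which matters once the number of states grows.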
Extensible Markov Model (EMM)
 Time Varying Discrete First Order Markov Model
 Nodes are clusters of real world states.
 Learning continues during application phase.
 Learning:
 Transition probabilities between
states (clusters)
 State labels (cluster summary)
 States are modified as clusters are
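EMM for TRAC-DS Modeling
[Figure: example EMM with states N1, N2, and N3. Arcs are labeled with transition probabilities such as 2/3, 1/2, 1/3, and 1/1. The input vectors shown are <18,10,3,3,1,0,0>, <17,10,2,3,1,0,0>, <16,9,2,3,1,0,0>, <14,8,2,3,1,0,0>, <14,8,2,3,0,0,0>, and <18,10,3,3,1,1,0>.]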
Objectives/Outline
 Introduction
 Background
 TRAC-DS
 Definition
 Relationship to Traditional Clustering
 Operations
 TRAC-DS Applications
 Conclusions/Future Work
TRAC-DS NOTE
 TRAC-DS is not:
 Another stream clustering
algorithm
 TRAC-DS is:
 A new way of looking at clustering
 Built on top of an existing clustering
algorithm
 TRAC-DS may be used with any
stream clustering algorithm
TRAC-DS Overview
Data Stream Clustering
 At each point in time a data stream clustering ζ is
a partitioning of D', the data seen thus far.
 Instead of the whole partitions C1, C2,..., Ck only
synopses Cc1,Cc2,...,Cck are available and k is
allowed to change over time.
 The summaries Cci with i =1, 2,...,k typically
contain information about the size, distribution
and location of the data points in Ci.
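One possible shape for such a summary, sketched in Python under the assumption that a cluster is summarized by its point count and per-dimension sums (in the spirit of the summary statistics used for micro-clusters). The class name and fields are hypothetical and are not the authors' data structure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ClusterSynopsis:
    """Hypothetical summary Cc_i of a cluster C_i; the raw points are never stored."""
    dim: int
    n: int = 0                                              # size of the cluster
    linear_sum: List[float] = field(default_factory=list)   # per-dimension sum of points
    square_sum: List[float] = field(default_factory=list)   # per-dimension sum of squares

    def __post_init__(self):
        if not self.linear_sum:
            self.linear_sum = [0.0] * self.dim
        if not self.square_sum:
            self.square_sum = [0.0] * self.dim

    def add(self, point):
        """Absorb one data point in O(dim) time."""
        self.n += 1
        for d, value in enumerate(point):
            self.linear_sum[d] += value
            self.square_sum[d] += value * value

    def centroid(self):
        """Location: the mean of the absorbed points (requires n > 0)."""
        return [s / self.n for s in self.linear_sum]

    def variance(self):
        """Distribution: per-dimension population variance (requires n > 0)."""
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.linear_sum, self.square_sum)]

synopsis = ClusterSynopsis(dim=2)
synopsis.add([1.0, 2.0])
synopsis.add([3.0, 4.0])
print(synopsis.n, synopsis.centroid(), synopsis.variance())
```

Because the three statistics are additive, two synopses can be merged by adding their fields, which is what makes this kind of summary convenient for bounded-memory stream processing.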
TRAC-DS Definition
Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays ζ with an EMM M in such a way that the following are satisfied:
(1) There is a one-to-one correspondence between the clusters in ζ and the states S in M.
(2) A transition aij in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with i, j = 1, 2, ..., k.
(3) The EMM M is created online together with the data stream clustering.
Clustering Operations
A clustering operation is a function q : ζ × x → ζ that the data stream clustering algorithm uses to update the clustering ζ given some additional information x, which is either a new data point or other information (e.g., the number of the cluster to be deleted in order to simplify the clustering).
TRAC-DS Operations
 A TRAC-DS operation is a function r : M × sc × y
→ M × sc that updates the temporal relationship
among clusters represented by the EMM M with
states S given a current state sc ∈ S and
additional information y and returns an updated
EMM and possibly a new current state.
 In order to be able to dynamically update the EMM
M we need to store a transition count matrix C.
The count cij in C contains the number of times
we observed a new point being assigned by the
clustering algorithm to cluster i followed by a point
being assigned to cluster j.
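The bookkeeping described here can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class and method names are hypothetical, the count matrix C is stored sparsely as a dictionary of rows, and the cluster assignments in the usage example are made up.

```python
class TransitionCounts:
    """Hypothetical container for the transition count matrix C and current state sc."""

    def __init__(self):
        self.counts = {}     # counts[i][j] = number of observed transitions i -> j
        self.current = None  # current state sc; None before the first point arrives

    def new_cluster(self, state):
        """Add a state with no transitions yet (no-op if it already exists)."""
        self.counts.setdefault(state, {})

    def assign_point(self, state):
        """Record that the clustering algorithm assigned the newest point to `state`."""
        self.new_cluster(state)
        if self.current is not None:
            row = self.counts[self.current]
            row[state] = row.get(state, 0) + 1
        self.current = state

    def probability(self, i, j):
        """Estimate a_ij as c_ij divided by the sum of row i (0.0 for an empty row)."""
        row = self.counts.get(i, {})
        total = sum(row.values())
        return row.get(j, 0) / total if total else 0.0

model = TransitionCounts()
for cluster_id in [1, 2, 1, 3, 1, 2]:   # made-up sequence of cluster assignments
    model.assign_point(cluster_id)
print(model.counts, model.probability(1, 2))
```

Storing counts rather than probabilities means each new point costs a single increment, and probabilities are only normalized when they are needed.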
Stream Clustering Operations *
 q_{assign point}(ζ, x): Assigns the new data point x to an existing cluster.
 q_{new cluster}(ζ, x): Creates a new cluster.
 q_{remove cluster}(ζ, x): Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary Cci is removed from ζ and k is decremented by one.
 q_{merge clusters}(ζ, x): Merges two clusters.
 q_{fade clusters}(ζ, x): Fades the cluster structure.
 q_{split clusters}(ζ, x): Splits a cluster.
* Inspired by MONIC [13]
TRAC-DS Operations
 r_{assign point}(M, sc, y): Assigns the new data point to the state representing an existing cluster.
 r_{new cluster}(M, sc, y): Creates a state for a new cluster.
 r_{remove cluster}(M, sc, y): Removes a state.
 r_{merge clusters}(M, sc, y): Merges two states.
 r_{fade clusters}(M, sc, y): Fades the transition probabilities using an exponential decay f(t) = 2^{-λt}.
 r_{split clusters}(M, sc, y): Splits states. Y clustering
operations.
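A minimal Python sketch of three of these operations acting on a sparse transition count matrix. The function names are hypothetical, and the merge rule (adding the counts of the two merged states, so that transitions between them become a self-loop) is an assumption made for illustration; the slides do not spell out the exact update.

```python
def fade(counts, lam, t=1.0):
    """Scale every count by the decay factor f(t) = 2 ** (-lam * t)."""
    factor = 2.0 ** (-lam * t)
    return {i: {j: c * factor for j, c in row.items()} for i, row in counts.items()}

def remove_state(counts, state):
    """Drop a state together with all of its incoming and outgoing transitions."""
    return {i: {j: c for j, c in row.items() if j != state}
            for i, row in counts.items() if i != state}

def merge_states(counts, a, b, merged):
    """Replace states a and b by `merged`, summing their transition counts (assumed rule)."""
    def rename(s):
        return merged if s in (a, b) else s
    result = {}
    for i, row in counts.items():
        target = result.setdefault(rename(i), {})
        for j, c in row.items():
            target[rename(j)] = target.get(rename(j), 0) + c
    return result

# Made-up counts for three states.
counts = {1: {2: 4.0}, 2: {1: 2.0, 3: 2.0}, 3: {1: 1.0}}
print(fade(counts, lam=1.0, t=2.0))
print(remove_state(counts, 3))
print(merge_states(counts, 1, 2, "1+2"))
```

Each function returns a new matrix instead of mutating its input, which keeps the sketch easy to test; an online implementation would more likely update the counts in place.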
TRAC-DS Example
TRAC-DS Advantages
 Dynamic
 Flexible
 Use any clustering algorithm
 Supports any clustering operations
 Scalable
 Merges clustering & Markov modeling
Objectives/Outline
 Introduction
 Background:
 TRAC-DS
 TRAC-DS Applications
 Anomaly Detection
 Bioinformatics
 Conclusions/Future Work
What is an Anomaly in Stream Data?
 Rare - Anomalous – Surprising
 Out of the ordinary
 Not outlier detection
 No knowledge of data distribution
 Data is not static
 Must take temporal and spatial values into account
 May be interested in sequence of events
 Ex: Snow in upstate New York is not an anomaly
 Snow in upstate New York in June is rare
 Rare events may change over time
TRAC-DS Approach to Detect Anomalies
 By learning what is normal, the model can
predict what is not
 Normal is based on likelihood of occurrence
 Use TRAC-DS to build clusters and behavior
between clusters
 We view a rare event as:
 Unusual event
 Transition between event states that does not occur frequently
 Continue learning
Determining Rare
 Occurrence Frequency (OF_i) of an EMM state S_i is the normalized count of the state:
$$OF_i = \frac{n_i}{\sum_i n_i}$$
 Normalized Transition Probability (NTP_{m,n}), from one state, S_m, to another, S_n, is a normalized transition count:
$$NTP_{m,n} = \frac{C_{m,n}}{\sum_i n_i}$$
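Both measures are simple ratios, as the following Python sketch shows. The counts are invented for the example, and the threshold of 0.25 used to flag rare states and transitions is a hypothetical value, not one given in the talk.

```python
def occurrence_frequency(state_counts):
    """OF_i: count of state i divided by the total count over all states."""
    total = sum(state_counts.values())
    return {state: n / total for state, n in state_counts.items()}

def normalized_transition_probability(transition_counts, state_counts):
    """NTP_{m,n}: transition count C_{m,n} divided by the total state count."""
    total = sum(state_counts.values())
    return {arc: c / total for arc, c in transition_counts.items()}

def rare_items(scores, threshold):
    """Keys whose score falls below a user-chosen (hypothetical) threshold."""
    return sorted(key for key, value in scores.items() if value < threshold)

# Made-up counts from a stream of 10 points over two states.
state_counts = {"S1": 8, "S2": 2}
transition_counts = {("S1", "S1"): 5, ("S1", "S2"): 2, ("S2", "S1"): 2}

of = occurrence_frequency(state_counts)
ntp = normalized_transition_probability(transition_counts, state_counts)
print(rare_items(of, 0.25))
print(rare_items(ntp, 0.25))
```

Note that NTP is normalized by the total state count rather than by the outgoing count of S_m, so a transition out of a rarely visited state scores low even when it is the only transition that state has.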
Datasets/Anomalies
 MnDot – Minnesota Department of Transportation
 Automobile Accident
 Ouse and Serwent – River flow data from England
 Flood
 Drought
 KDD Cup 1999 & 2000
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
 Intrusion
 Cisco VoIP – VoIP traffic data obtained at Cisco
 Unusual Phone Call
EMM Sublinear Growth
Serwent Data
TRAC-DS River Prediction
[Figure: water level (m) plotted against the input time series, comparing the RLF prediction, the EMM prediction, and the observed values.]
TRAC-DS Rare Event Detection
[Figure: Minnesota DOT traffic data for weekdays and the weekend, with a detected unusual weekend traffic pattern.]
TRAC-DS Intrusion Detection
 DARPA 1999/2000
 Synthetic dataset
 MIT Lincoln Lab
 The DARPA 1999 dataset, which is free of attacks for two weeks (1st week and 3rd week), is used as training data.
 The DARPA 2000 dataset, which contains DDoS attacks, is used as test data.
TRAC-DS Intrusion Detection
Table 8. EMM detection and false positive rates (DARPA 1999 and 2000).
Threshold    Detection Rate    False Positive Rate
0.9          6%                94%
0.8          20%               80%
0.7          50%               50%
0.6          100%              0%
TRAC-DS & Bioinformatics
 Analysis of DNA/RNA sequences
 Applications:
 Classification
 Differentiation
 16S rRNA
 1542 nt rRNA
 Highly conserved across species
 miRNA
 Short (20-25 nt) sequence of noncoding RNA
 Known since 1993 but significance not widely appreciated until 2001
 Impact / prevent translation of mRNA
10/17/06
First – Convert Sequence to NSV
acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga
Moving Window    A    C    G    T
Pos 0-8          2    3    3    1
Pos 1-9          1    3    3    2
…
Pos 34-42        2    4    2    1
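The moving-window conversion can be reproduced with a short Python sketch. The function name is hypothetical; the window width of 9 and the A, C, G, T ordering follow the slide.

```python
def window_counts(sequence, width=9, alphabet="acgt"):
    """Slide a window of `width` bases over the sequence, one position at a time,
    and return the nucleotide counts (A, C, G, T) for each window."""
    seq = sequence.lower()
    return [
        tuple(seq[start:start + width].count(base) for base in alphabet)
        for start in range(len(seq) - width + 1)
    ]

sequence = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
vectors = window_counts(sequence)
print(len(vectors))   # number of windows
print(vectors[0])     # Pos 0-8
print(vectors[1])     # Pos 1-9
print(vectors[-1])    # Pos 34-42
```

Each resulting vector can then be fed to the clustering step as one data point of the stream, in window order.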
Next – Apply TRAC-DS
TRAC-DS Prediction with miRNA
 Positive Data Model
 Cutoff Probability = 0.3
 False Positive Rate = 0%
 True Positive Rate = 66%
 Test results could be improved by meta classifiers combining multiple positive and negative classifiers together.
Profile EMMs
• Examples of three different Profile EMMs constructed for 16S data from three different bacteria families
Profile EMMs for Organism Classification
16S Classification Accuracy
 Classification accuracy using different scoring
metrics on 16S rRNA data from NCBI.
 We learned 31 classification models (at the
phylogenetic class level) from 98 organisms and
tested with 23 randomly chosen organisms.
 The Profile EMM approach was able to achieve
classification accuracy of more than 90% after tuning the
resolution settings.
TRAC-DS and Bioinformatics
 Efficient
 Alignment free sequence analysis
 Clustering reduces size of model
 Flexible
 Any sequence
 Applicability to Metagenomics
 Scoring based on similarity between EMMs
or EMM and input sequence
 Applications
 Classification
 Differentiation
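To make the idea of scoring an input sequence against an EMM concrete, here is a small Python sketch. It is only one plausible reading: the scoring metric (average log transition probability with a probability floor for unseen transitions) is an assumption, the function names are hypothetical, and single letters stand in for the cluster identifiers that a real Profile EMM would use as states.

```python
import math

def build_model(states):
    """Count transitions between consecutive states of a training sequence."""
    counts = {}
    for current, nxt in zip(states, states[1:]):
        row = counts.setdefault(current, {})
        row[nxt] = row.get(nxt, 0) + 1
    return counts

def score(counts, states, floor=1e-6):
    """Average log transition probability of `states` under the model.

    Unseen transitions get the small probability `floor` (an assumed smoothing
    choice) so that the logarithm stays finite.
    """
    total, steps = 0.0, 0
    for current, nxt in zip(states, states[1:]):
        row = counts.get(current, {})
        n = sum(row.values())
        p = row.get(nxt, 0) / n if n else 0.0
        total += math.log(max(p, floor))
        steps += 1
    return total / steps if steps else float("-inf")

def classify(models, states):
    """Return the name of the model that scores the sequence highest."""
    return max(models, key=lambda name: score(models[name], states))

# Toy training sequences; the letters are placeholders for cluster ids.
models = {
    "family A": build_model(list("ababab")),
    "family B": build_model(list("aabbaabb")),
}
print(classify(models, list("abab")))
```

No alignment is involved: the input is scored purely by how well its state-to-state transitions match each model.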
Objectives/Outline
 Introduction
 Background
 TRAC-DS
 TRAC-DS Applications
 Conclusions/Future Work
TRAC-DS Ongoing/Future
 Create online tool suite
 Improve TRAC algorithms:
 Aging
 Delete state
 Merge states
 Split states
 Apply to Image Recognition
 Bioinformatics
 Build Profile EMM database of NCBI 16S Bacteria
Data
 Perform classification using Metagenomic Data
collected from Yellowstone National Park
Bibliography
1) C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for clustering evolving data streams,” Proceedings of the International Conference on Very Large Data Bases (VLDB), pp 81-92, 2003.
2) D. Barbara, “Requirements for clustering data streams,” SIGKDD Explorations, Vol 3, No 2, pp 23-27, 2002.
3) Margaret H. Dunham, Donya Quick, Yuhang Wang, Monnie McGee, and Jim Waddle, “Visualization of DNA/RNA Structure using Temporal CGRs,” Proceedings of the IEEE 6th Symposium on Bioinformatics & Bioengineering (BIBE06), October 16-18, 2006, Washington D.C., pp 171-178.
4) S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams: Theory and practice,” IEEE Transactions on Knowledge and Data Engineering, Vol 15, No 3, pp 515-528, 2003.
5) Michael Hahsler and Margaret H. Dunham, “TRACDS: Temporal Relationship Among Clusters for Data Streams,” October 2009, submitted to SIAM International Conference on Data Mining.
6) Jie Huang, Yu Meng, and Margaret H. Dunham, “Extensible Markov Model,” Proceedings IEEE ICDM Conference, November 2004, pp 371-374.
7) Charlie Isaksson, Yu Meng, and Margaret H. Dunham, “Risk Leveling of Network Traffic Anomalies,” International Journal of Computer Science and Network Security, Vol 6, No 6, June 2006, pp 258-265.
8) Charlie Isaksson and Margaret H. Dunham, “A Comparative Study of Outlier Detection,” July 2009, Proceedings of the IEEE MLDM Conference, pp 440-453.
9) Mallik Kotamarti, Douglas W. Raiford, M. L. Raymer, and Margaret H. Dunham, “A Data Mining Approach to Predicting Phylum for Microbial Organisms Using Genome-Wide Sequence Data,” Proceedings of the IEEE Ninth International Conference on Bioinformatics and Bioengineering, pp 161-167, June 22-24, 2009.
10) Yu Meng and Margaret H. Dunham, “Efficient Mining of Emerging Events in a Dynamic Spatiotemporal,” Proceedings of the IEEE PAKDD Conference, April 2006, Singapore. (Also in Lecture Notes in Computer Science, Vol 3918, 2006, Springer Berlin/Heidelberg, pp 750-754.)
11) Yu Meng and Margaret H. Dunham, “Mining Developing Trends of Dynamic Spatiotemporal Data Streams,” Journal of Computers, Vol 1, No 3, June 2006, pp 43-50.
12) MIT Lincoln Laboratory, DARPA Intrusion Detection Evaluation, http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/index.html, 2008.
13) M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult, “MONIC: Modeling and monitoring cluster transitions,” Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, pp 706-711, 2006.