Statistical Mining in Data Streams
Ankur Jain
Dissertation Defense
Computer Science, UC Santa Barbara
Committee
Edward Y. Chang (chair)
Divyakant Agrawal
Yuan-Fang Wang
Roadmap
- The Data Stream Model
  - Introduction and research issues
  - Related work
- Data Stream Mining
  - Stream data clustering
  - Bayesian reasoning for sensor stream processing
- Contribution Summary
- Future work
Data Streams
"A data stream is an unbounded and continuous sequence of tuples."
- Tuples arrive online and may be multi-dimensional
- A tuple seen once cannot be easily retrieved later
- No control over the tuple arrival order
Applications
- Sensor network processing: "Find the mean temperature of the lagoon in the last 3 hours"
- Network monitoring: anomalies, intrusions (DoS, PROBE, U2R?)
- Text processing: email, blogs, click-stream clustering
- Video surveillance
- Stock ticker monitoring
- Process control & manufacturing
- Traffic monitoring & analysis
- Transaction log processing
Traditional DBMS does not work!
Data Stream Projects
- STREAM (Stanford)
  - A general-purpose Data Stream Management System (DSMS)
- Telegraph (Berkeley)
  - Adaptive query processing
  - TinyDB: a general-purpose sensor database
- Aurora Project (Brown/MIT)
  - Distributed stream processing
  - Introduces new operators (map, drop, etc.)
- The Cougar Project (Cornell)
  - Sensors form a distributed database system
  - Cross-layer optimizations (data management layer and the routing layer)
- MAIDS (UIUC)
  - Mining Alarming Incidents in Data Streams
  - Streaminer: data stream mining
Data Stream Processing – Key Ingredients
- Adaptivity
  - Incorporate evolutionary changes in the stream
- Approximation
  - Exact results are hard to compute fast with limited memory
A Data Stream Management System (DSMS)
[Architecture diagram] Streaming data sources/sensors feed the central stream processing system through a data-acquisition layer; feedback controls include sensor calibration, sampling rate, sliding-window size, and query precision. The central system maintains a stream synopsis and comprises modules for query processing, resource management, adaptive stream mining, data filtering, and data sampling. Users submit queries and receive streaming query results.
Thesis Outline
"Develop fast, online, statistical methods for mining data streams."
- Adaptive non-linear clustering in multi-dimensional streams
- Bayesian reasoning for sensor stream processing
- Filtering methods for resource conservation
- Change detection in data streams
- Video sensor data stream processing
Roadmap
- The Data Stream Model
  - Introduction and research issues
  - Related work
- Data Stream Mining
  - Stream data clustering
  - Bayesian reasoning for sensor stream processing
- Contribution Summary
- Future work
Clustering in High-Dimensional Streams
"Given a continuous sequence of points, group them into some number of clusters, such that the members of a cluster are geometrically close to each other."
Example Application – Network Monitoring
[Diagram] High-dimensional connection tuples stream in from the Internet, and each must be classified on the fly: DoS, Probe, or Normal?
Stream Clustering – New Challenges
- One-pass restriction and limited memory constraint
  - Fading cluster technique proposed by Aggarwal et al.
- Non-linear separation boundaries
  - We propose using the kernel trick to deal with the non-linearity issue
- Data dimensionality
  - We propose an effective incremental dimension-reduction technique
The 2-Tier Framework
[Diagram: adaptive non-linear clustering] The latest point x received from the stream lives in the d-dimensional input space. Tier 1 (stream segmentation) groups incoming points into segments; Tier 2 (LDS projection & update) maps x to its image x̃ in a q-dimensional low-dimensional space (LDS), q < d, where the fading clusters C1, …, C9 reside. Both tiers of the clustering module use the kernel trick.
The Fading Cluster Methodology
Each cluster Ci has a recency value Ri such that Ri = f(t − tlast), where
- t: current time
- tlast: last time Ci was updated
- f(t) = e^(−λt), λ: fading factor
- A cluster is erased from memory (faded) when Ri ≤ h, where h is a user parameter
- λ controls the influence of historical data
- The total number of clusters is bounded
(A minimal sketch of this bookkeeping follows.)
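As a concrete illustration (not from the slides), here is a minimal Python sketch of the recency bookkeeping; the class name and the λ and h values are illustrative, not the dissertation's:

```python
import math

class FadingCluster:
    """Minimal sketch of fading-cluster recency; names and values are illustrative."""
    def __init__(self, center, t):
        self.center = center
        self.t_last = t          # last time this cluster absorbed a point

    def recency(self, t, lam):
        # R_i = f(t - t_last) with f(t) = e^(-lambda * t)
        return math.exp(-lam * (t - self.t_last))

def prune_faded(clusters, t, lam=0.1, h=0.01):
    # Erase clusters whose recency has decayed to the threshold h or below.
    return [c for c in clusters if c.recency(t, lam) > h]
```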
Non-linearity in Data
[Diagram] A feature-space mapping Φ transforms the input space into a feature space where the groups become separable.
- Traditional clustering techniques (e.g., k-means) do not perform well in the input space
- Spectral clustering methods are likely to perform better
Non-linearity in Network Intrusion Data
[Diagram] "ipsweep" attack data: non-linear in the input space, but a geometrically well-behaved trend in the feature space. Use the kernel trick?
The Kernel Trick
- Actual projection into the higher dimension is computationally expensive
- The kernel trick does the non-linear projection implicitly!
Given two input-space vectors x, y:
k(x, y) = ⟨Φ(x), Φ(y)⟩
where k is the kernel function. The Gaussian kernel function k(x, y) = exp(−||x − y||²) was used in the previous example!
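A one-liner makes the Gaussian kernel concrete (an illustrative sketch, not from the slides):

```python
import numpy as np

# Gaussian kernel from the slide: k(x, y) = exp(-||x - y||^2).
def gaussian_kernel(x, y):
    d = x - y
    return np.exp(-np.dot(d, d))

print(gaussian_kernel(np.array([1.0, 2.0]), np.array([1.5, 1.0])))  # ~0.287
```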
Kernel Trick – Working Example
Consider the feature map (not required explicitly!):
Φ: x = (x1, x2) → Φ(x) = (x1², x2², √2·x1x2)
Then for two input vectors x and z:
⟨Φ(x), Φ(z)⟩ = ⟨(x1², x2², √2·x1x2), (z1², z2², √2·z1z2)⟩
             = x1²z1² + x2²z2² + 2·x1x2z1z2
             = (x1z1 + x2z2)²
             = ⟨x, z⟩²
So k(x, z) = ⟨x, z⟩².
The kernel trick allows us to perform operations in the high-dimensional feature space using a kernel function, but without explicitly representing Φ.
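A quick numeric sanity check of this identity (a sketch, not part of the original slides; the test vectors are arbitrary):

```python
import numpy as np

def phi(v):
    # Explicit feature map from the slide: (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)
    x1, x2 = v
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k(x, z):
    # Polynomial kernel k(x, z) = <x, z>^2
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
assert np.isclose(np.dot(phi(x), phi(z)), k(x, z))  # both sides agree
```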
Stream Clustering – New Challenges
- One-pass restriction and limited memory constraint
  - We use the fading cluster technique proposed by Aggarwal et al.
- Non-linear separation boundaries
  - We propose using kernel methods to deal with the non-linearity issue
- Data dimensionality
  - We propose an effective incremental dimension-reduction technique
Dimensionality Reduction
- A PCA-like kernel method is desirable
- Explicit representation – EVD preferred
- KPCA is computationally prohibitive: O(n³)
- The principal components evolve with time – frequent EVD updates may be necessary
- We propose to perform EVD on grouped data instead of point data
  - Requires a novel kernel method
The 2-Tier Framework
[Diagram: adaptive non-linear clustering] The latest point x received from the stream lives in the d-dimensional input space. Tier 1 (stream segmentation) groups incoming points into segments; Tier 2 (LDS projection & update) maps x to its image x̃ in a q-dimensional low-dimensional space (LDS), q < d, where the fading clusters C1, …, C9 reside. Both tiers of the clustering module use the kernel trick.
The 2-Tier Framework …
- Tier 1 captures the temporal locality in a segment
  - A segment is a group of contiguous points in the stream that are geometrically packed closely in the feature space
- Tier 2 adaptively selects segments to project data into the LDS
  - Selected segments are called representative segments
  - Implicit data in the feature space is projected explicitly into the LDS such that feature-space distances are preserved
The 2-Tier Framework … (processing flow)
1. Obtain a point x from the stream.
2. Tier 1: if (Φ(x) is novel w.r.t. the current segment S and s > smin) or s = smax, where s is the current segment size, hand S to Tier 2: if S is a representative segment, add S to memory and update the LDS; otherwise clear the contents of S. If the test fails, simply add x to S.
3. Obtain x̃, the image of x in the LDS.
4. If x̃ is close to an active cluster, assign x to its nearest cluster; otherwise create a new cluster with x.
5. Update the cluster centers and recency values, and delete faded clusters.
(A simplified runnable sketch of this loop follows.)
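The per-point loop can be sketched as below. This is a hypothetical, simplified rendering: the novelty test and the LDS projection are crude stand-ins (a distance-to-mean threshold and the identity map) for the kernel-based machinery of Tiers 1 and 2, and all parameter values are illustrative:

```python
import math
import numpy as np

def process_point(x, segment, clusters, t, s_min=5, s_max=50,
                  radius=1.5, lam=0.1, h=0.01):
    segment.append(x)
    mu = np.mean(segment, axis=0)
    novel = np.linalg.norm(x - mu) > radius           # stand-in novelty test
    if (novel and len(segment) > s_min) or len(segment) == s_max:
        # Tier 2 would test representativeness and update the LDS here.
        segment.clear()
    x_tilde = x                                       # stand-in LDS projection
    for c in clusters:                                # assign to a close active cluster
        if np.linalg.norm(x_tilde - c["center"]) <= radius:
            c["center"] = 0.5 * (c["center"] + x_tilde)
            c["t_last"] = t
            break
    else:
        clusters.append({"center": x_tilde, "t_last": t})  # or spawn a new one
    # Fade: drop clusters whose recency e^(-lam*(t - t_last)) is at most h.
    clusters[:] = [c for c in clusters
                   if math.exp(-lam * (t - c["t_last"])) > h]
```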
Network Intrusion Stream
- Simulated data from MIT Lincoln Labs
- 34 continuous attributes (features)
- 10.5K records
- 22 types of intrusion attacks + 1 normal class
Network Intrusion Stream
[Chart] Clustering accuracy at LDS dimensionality u = 10.
Efficiency – EVD Computations
[Charts] Image data: 5K records, 576 features, 10 digits. Newswire data: 3.8K records, 16.5K features, 10 news topics.
In Retrospect…
- We proposed an effective stream clustering framework
- We use the kernel trick to delineate non-linear boundaries efficiently
- We use a stream segmentation approach to continuously project data into a low-dimensional space
Roadmap
- The Data Stream Model
  - Introduction and research issues
  - Related work
- Contributions Towards Stream Mining
  - Stream data clustering
  - Bayesian reasoning for sensor stream processing
- Contribution Summary
- Future work
Bayesian Reasoning for Sensor Data Processing
- Users submit queries with precision constraints: "Find the temperature with 80% confidence"
- Resource conservation is of prime concern to prolong system life
  - Data acquisition
  - Data communication
- Use probabilistic models at the central site for approximate predictions, preventing actual acquisitions
Dependencies in Sensor Attributes
[Diagram] Query: "Get Temperature". Rather than acquiring the expensive attribute directly ("Get Voltage" / "Acquire Voltage!"), infer Temperature from Voltage through the dependency model (a Bayesian network with an edge Voltage → Temperature) and report it.

Attribute    | Acquisition Cost
Temperature  | 50 J
Voltage      | 5 J
Using Correlation Models [Deshpande et al., VLDB'04]
- Correlation models ignore conditional dependency
- Intel Lab (real sensor network data); attributes: Voltage (V), Temperature (T), Humidity (H)
- "Voltage" is correlated with "Temperature"
- Yet given Humidity in [35, 40), "Voltage" is conditionally independent of "Temperature"!
BN vs. Correlations
Correlation model [Deshpande et al.]:
- Maintains all dependencies
- Search space for finding the best alternative sensor attribute is large
- Joint probability is represented in O(n²) cells
Bayesian network:
- Maintains vital dependencies only
- Lower search complexity: O(n)
- Storage: O(nd), where d is the average node degree
- Intuitive dependency structure
[Diagrams: learned structures on the NDBC Buoy and Intel Lab datasets]
Bayesian Networks (BN)
Qualitative part – a directed acyclic graph (DAG):
- Nodes: sensor attributes
- Edges: attribute influence relationships
Quantitative part – conditional probability tables (CPTs):
- Each node X has its own CPT, P(X | parents(X))
Together, the BN represents the joint probability in factored form, e.g., P(T,H,V,L) = P(T) P(H|T) P(V|H) P(L|T).
The "influence relationship" is quantified by the conditional entropy function H:
H(Xi) = −Σ_{l=1..k} P(Xi = xil) log P(Xi = xil)
We learn the BN by minimizing H(Xi | parents(Xi)), as computed below.
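To make the learning criterion concrete, here is an illustrative computation of entropy and conditional entropy; the joint distribution is made-up data, not from the dissertation:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def conditional_entropy(joint):
    # H(X|Y) = H(X,Y) - H(Y); rows of `joint` index X, columns index Y.
    return entropy(joint.ravel()) - entropy(joint.sum(axis=0))

joint_vt = np.array([[0.30, 0.05],    # hypothetical P(Voltage, Temperature)
                     [0.10, 0.55]])
print(conditional_entropy(joint_vt))  # H(Voltage | Temperature) in bits
```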
System Architecture
[Diagram] The sensor network feeds a Bayesian inference engine that stores the BN (nodes X1, …, X6 with their CPTs) along with cost and storage statistics. Users submit group queries with per-attribute confidence constraints, e.g. {(Temperature, 80%)}, {(Air Pressure, 90%), (Wind Speed, 90%)}, {(Temperature, 95%), (Wind Speed, 85%)}, {(Wind Speed, 75%)}. The query processor turns each group query Q into an acquisition plan; acquired values flow back through the inference engine to answer the query.
Finding the Candidate Attributes
- For any attribute in the group-query Q, analyze candidate attributes in its Markov blanket recursively
- Selection criteria: information gain (conditional entropy) and acquisition cost
- Select candidates in a greedy fashion (sketched below):
  - Meet the precision constraints
  - Maximize resource conservation
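A toy rendering of that greedy control flow; the gain and cost numbers are made up, and only the ranking-by-gain-per-cost loop mirrors the criterion above, not the dissertation's exact algorithm:

```python
def greedy_select(gain, cost, target_gain):
    # Pick attributes by information gain per unit acquisition cost until
    # the (illustrative) precision target is met.
    selected, total = [], 0.0
    remaining = set(gain)
    while total < target_gain and remaining:
        best = max(remaining, key=lambda a: gain[a] / cost[a])
        selected.append(best)
        total += gain[best]
        remaining.remove(best)
    return selected

gain = {"Voltage": 0.8, "Humidity": 0.5, "WindSpeed": 0.2}    # bits of entropy reduced
cost = {"Voltage": 5.0, "Humidity": 20.0, "WindSpeed": 15.0}  # joules per acquisition
print(greedy_select(gain, cost, target_gain=1.0))             # ['Voltage', 'Humidity']
```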
Experiments – Resource Conservation
NDBC dataset, 7 attributes.
[Charts] Effect of using the MB property with min = 0.90; effect of using group-queries (|Q| is the group-query size).
Results – Selectivity
[Chart] Selectivity across attributes: Wave Period (WP), Wind Speed (SP), Air Pressure (AP), Wind Direction (DR), Water Temperature (WT), Wave Height (WH), Air Temperature (AT).
In Retrospect…
- Bayesian networks can encode the sensor dependencies effectively
- Our method provides significant resource conservation for group-queries
Contribution Summary
- "Adaptive stream resource management using Kalman filters." [SIGMOD'04]
- "Adaptive sampling for sensor networks." [DMSN'04]
- "Adaptive non-linear clustering for data streams." [CIKM'06]
- "Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention." [CVPR'06]
- "Filtering the data streams." [in submission]
- "Efficient diagnostic and aggregate queries on sensor networks." [in submission]
- "OCODDS: An on-line change-over detection framework for tracking evolutionary changes in data streams." [in submission]
Future Work
- Develop non-linear techniques for capturing temporal correlations in data streams
- The Bayesian framework can be extended to address "what-if" queries with counterfactual evidence
- The clustering framework can be extended for developing stream visualization systems
- Incremental EVD techniques can improve the performance further
Thank You !
BACKUP SLIDES!
Back to Stream Clustering
- We propose a 2-tier stream clustering framework
  - Tier 1: a kernel method that continuously divides the stream into segments
  - Tier 2: a kernel method that uses the segments to project data into a low-dimensional space (LDS)
- The fading clusters reside in the LDS
Clustering – LDS Projection
[Figure]
Clustering – LDS Update
[Figure]
Network Intrusion Stream
[Charts] Clustering accuracy and cluster strengths at LDS dimensionality u = 10.
Effect of dimensionality
[Chart]
Query Plan Generation
- Given a group query, the query plan computes the "candidate attributes" that will actually be acquired to successfully address the query
- We exploit the Markov blanket (MB) property to select candidate attributes
Given a BN G, the Markov blanket MB(Xi) of a node Xi comprises the node's immediate parents and children, and
P(Xi | G) = P(Xi | MB(Xi)) = P(Xi, MB(Xi)) / P(MB(Xi))
Exploiting the MB Property
"Given a node Xi and a set of arbitrary nodes Y in a BN such that MB(Xi) ⊆ Y ∪ {Xi}, the conditional entropy of Xi given Y is at least as high as that given its Markov blanket: H(Xi | Y) ≥ H(Xi | MB(Xi))."
Proof: Split MB(Xi) into MB1 = MB(Xi) ∩ Y and MB2 = MB(Xi) − MB1, and let Z = Y − MB(Xi). Then:
H(Xi | Y) = H(Xi | Z, MB1)          [Y = Z ∪ MB1]
          ≥ H(Xi | Z, MB1, MB2)     [additional information cannot increase entropy]
          = H(Xi | Z, MB(Xi))       [MB(Xi) = MB1 ∪ MB2]
          = H(Xi | MB(Xi))          [Markov-blanket definition]
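A numeric spot-check of the key step (conditioning on more variables cannot increase entropy), using an arbitrary made-up 2×2×2 joint distribution rather than anything from the dissertation:

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))     # axes: X, Y, Z
joint /= joint.sum()

h_x_given_y = H(joint.sum(axis=2).ravel()) - H(joint.sum(axis=(0, 2)))  # H(X,Y)-H(Y)
h_x_given_yz = H(joint.ravel()) - H(joint.sum(axis=0).ravel())          # H(X,Y,Z)-H(Y,Z)
assert h_x_given_yz <= h_x_given_y + 1e-12   # H(X|Y,Z) <= H(X|Y)
```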
Bayesian Reasoning – More Results…
[Charts] Effect of using the MB property with min = 0.90; query-answer quality loss on a 50-node synthetic-data BN.
Bayesian Reasoning for Group Queries
More accurate in addressing group queries:
- X = {X1, X2, X3, …, Xn}: sensor attributes
- Q = {(Xi, δi) | Xi ∈ X ∧ 0 < δi ≤ 1 ∧ 1 ≤ i ≤ n} s.t. δi < max_l P(Xi = xil)
- δi: confidence parameters
- P(Xi = xil): the probability with which Xi assumes the value xil
- Bayesian reasoning is helpful in detecting abnormalities
Bayesian Reasoning – Candidate attribute selection algorithm
[Figure: algorithm listing]