Streaming Models and Algorithms for Communication and Information Networks
Brian Thompson (joint work with James Abello)
Outline
• Introduction and Motivation
• A Streaming Model
• Our Approach
• Algorithms
• Experimental Results
• Conclusions and Future Work
Problem Description
• Data: A network (G; T)
  • G = (V, E) is a graph
  • T is a set of time-stamped events corresponding to nodes or edges in G
• Goals:
  • Identify recent correlated activity
  • Measure influence between entities
• Challenges:
  • Scalability: networks may be very large, and space is limited
  • Efficiency: high data rate, time-sensitive information
  • Variability: entities have different temporal dynamics
Related Work
• Time-evolving graph model: a sequence of “snapshots”
[Figure: graph snapshots at t=1, t=2, t=3, t=4]
• Time series analysis
[Figure: IP Traffic (MB Per Hour)]
Related Work
• Cascade model: starting from a set of seed nodes, information (product, news, virus) propagates through the network
Data Model
• G is a graph
[Figure: example graph on the nodes Alice, Bob, Cheng, Devika, Elina]
• T is a set of time-stamped events corresponding to nodes or edges in G

Source   Recipient   Content                 Timestamp
Alice    (public)    “Fire at 2nd & Main!”   Tuesday, 9:25am
Bob      Cheng       (private message)       Tuesday, 9:27am
Cheng    (public)    “RT @Alice Fire ...”    Tuesday, 9:28am
Data Model
(Node-centric)
[Figure: node-centric view of the graph on Alice, Bob, Cheng, Devika, Elina]
Data Model
(Edge-centric)
[Figure: edge-centric view of the graph on Alice, Bob, Cheng, Devika, Elina]
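
The two views of the data model can be sketched in Python. This is only an illustration: it assumes the node-centric view groups events by their source node and the edge-centric view groups them by (source, recipient) pair, which the slides do not spell out. The `Event` record and the numeric timestamps (minutes after Tuesday 9:00am) are hypothetical and mirror the example events.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    source: str
    recipient: Optional[str]  # None stands for a public post (assumption)
    timestamp: float          # minutes after Tuesday 9:00am (assumption)


events = [
    Event("Alice", None, 25.0),    # public post
    Event("Bob", "Cheng", 27.0),   # private message
    Event("Cheng", None, 28.0),    # public repost
]

# Node-centric view: one timestamp stream per source node.
by_node = defaultdict(list)
# Edge-centric view: one timestamp stream per directed (source, recipient) edge.
by_edge = defaultdict(list)

for ev in events:
    by_node[ev.source].append(ev.timestamp)
    if ev.recipient is not None:
        by_edge[(ev.source, ev.recipient)].append(ev.timestamp)
```

Either dictionary maps an entity to a sorted list of timestamps, which is the only input the renewal-process machinery needs.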
Renewal Theory
• A renewal process Φ is a continuous-time Markov process where state transitions occur with holding times sampled independently from a positive distribution μ.
• Let $S_1, S_2, \ldots$ be samples from μ, and consider a sequence of events corresponding to those holding times.
[Figure: timeline of $T_\Phi$ starting at 0, with events $t_1, \ldots, t_5$ and the inter-arrival time $S_3$ marked]
• We call the $S_i$ inter-arrival times, and refer to the sequence $T_\Phi = \{ t_i = \sum_{j=1}^{i} S_j \}$ as the discrete-event sequence for Φ.
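
As a small illustration, the discrete-event sequence is just the running sum of the sampled holding times. The sketch below assumes exponential holding times purely as an example of a positive distribution; the helper name `discrete_event_sequence` is hypothetical.

```python
import itertools
import random


def discrete_event_sequence(sample_holding_time, n_events):
    """Return [t_1, ..., t_n], the running sums of n i.i.d. holding times."""
    holding_times = [sample_holding_time() for _ in range(n_events)]
    return list(itertools.accumulate(holding_times))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Exponential holding times are only one example of a positive distribution.
times = discrete_event_sequence(lambda: rng.expovariate(1.0), 5)
```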
Renewal Theory
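• The age of a renewal process Φ at time t is the amount of time elapsed since the last event:
  $$\mathrm{Age}_\Phi(t) = \begin{cases} t - \max\{ t_i : t_i < t \} & \text{if } t \ge t_1 \\ \infty & \text{otherwise} \end{cases}$$
[Figure: timeline of $T_\Phi$ starting at 0, with events $t_1, \ldots, t_5$ and $\mathrm{Age}_\Phi(t)$ marked for a time t after the last event]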
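
A minimal sketch of the age function, assuming the event times are kept in a sorted list. Following the strict inequality in the definition, an event that falls exactly at time t is not counted as "the last event"; the function name `age` is an illustrative choice.

```python
import bisect
import math


def age(event_times, t):
    """Time elapsed at t since the last event strictly before t.

    event_times must be sorted in increasing order. Returns infinity
    when no event precedes t.
    """
    k = bisect.bisect_left(event_times, t)  # number of events with t_i < t
    if k == 0:
        return math.inf
    return t - event_times[k - 1]
```

With a sorted list the lookup is a binary search; in a streaming setting it is enough to remember only the most recent timestamp.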
The REWARDS Model
REneWal theory Approach for Real-time Data Streams
• We model a stream of communication data from a node or across an edge as a renewal process
[Figure: inter-arrival time distribution with xmin and xmax marked, above a discrete-event sequence t1, ..., t5]
The REWARDS Model
REneWal theory Approach for Real-time Data Streams
• Given a stream of time-stamped events, we estimate the parameters of the renewal process for each node or edge based on the inter-arrival times
[Figure: inter-arrival time distribution with xmin and xmax marked, above a discrete-event sequence t1, ..., t5]
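
The slides do not say which parameters are estimated or how. As a rough sketch, the class below keeps a constant-size running summary of one stream's inter-arrival times, using the sample minimum, maximum, and mean as stand-ins for the distribution's parameters. The class name `StreamStats` and the choice of summary statistics are assumptions.

```python
class StreamStats:
    """Running summary of one node's or edge's inter-arrival times (IATs)."""

    def __init__(self):
        self.last_time = None
        self.count = 0            # number of IATs seen so far
        self.total = 0.0          # sum of IATs
        self.xmin = float("inf")  # smallest IAT seen
        self.xmax = 0.0           # largest IAT seen

    def update(self, timestamp):
        """Fold one new event into the summary in O(1) time."""
        if self.last_time is not None:
            iat = timestamp - self.last_time
            self.count += 1
            self.total += iat
            self.xmin = min(self.xmin, iat)
            self.xmax = max(self.xmax, iat)
        self.last_time = timestamp

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


stats = StreamStats()
for t in [0.0, 2.0, 3.0, 7.0]:  # hypothetical timestamps
    stats.update(t)
```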
Recency
• Goal: highlight recent activity
• Key idea: more recent = more relevant
[Figure: activity timelines for User: alice1337 and User: bob_iz_kewl, from 8:00 am through 10:00 am and 12:00 pm to now]
• Challenge: The most frequent communicators will always seem “recent”, overshadowing others’ behavior. We call this time-scale bias.
Recency
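• We can overcome time-scale bias by using the REWARDS Model
• We first derive the limit distribution $F_{\Phi}^{\mathrm{Age}*}$ of the Age function:
  $$F_{\Phi}^{\mathrm{Age}*}(\tau) = \lim_{t \to \infty} \Pr[\mathrm{Age}_\Phi(t) \le \tau]$$
• We define the recency of Φ at time t to be:
  $$\mathrm{Rec}_\Phi(t) = 1 - F_{\Phi}^{\mathrm{Age}*}(\mathrm{Age}_\Phi(t))$$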
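
The slides do not show the derived limit distribution. A standard renewal-theory result, assumed here to be the one intended, is that for holding times with CDF F and finite mean m the limiting age has CDF (1/m) ∫_0^τ (1 − F(s)) ds. Plugging in the empirical distribution of observed inter-arrival times s_1, ..., s_n gives Σ min(s_i, τ) / Σ s_i, which the sketch below uses; the function names and sample values are hypothetical.

```python
def limit_age_cdf(iats, tau):
    """Empirical limit CDF of the age: sum(min(s, tau)) / sum(s)."""
    return sum(min(s, tau) for s in iats) / sum(iats)


def recency(iats, current_age):
    """Rec = 1 - F*(Age): 1 right after an event, 0 once the age
    reaches the largest observed inter-arrival time."""
    return 1.0 - limit_age_cdf(iats, current_age)


fast = [1.0, 1.0, 2.0]      # hypothetical IATs, in minutes
slow = [60.0, 60.0, 120.0]  # same shape, 60 times slower
```

Here `recency(fast, 1.0)` and `recency(slow, 60.0)` coincide: a one-minute silence on the fast stream is scored the same as a one-hour silence on the slow one, which is the normalization that counters time-scale bias.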
Recency
• $\mathrm{Rec}_\Phi$ is a decreasing function on every interval $(t_i, t_{i+1})$. It also satisfies the uniformity property: for any renewal process Φ, the limit distribution of $\mathrm{Rec}_\Phi$ is Uniform(0,1).
[Figure: Recency of Edge <3,22> in Bluetooth Dataset]
• Recency effectively normalizes the age of a process relative to its own temporal dynamics, making our approach robust to differences in time scale between networks or between entities within the same network.
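
The uniformity property can be checked numerically in a simple special case. The sketch below assumes exponential holding times, for which the limit age CDF is 1 − exp(−rate·τ) and the recency therefore reduces to exp(−rate·age). It samples the recency at a fixed late time over many independent runs; the seed, horizon, and sample size are arbitrary choices.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible


def age_at(t, rate):
    """Age at time t of a freshly simulated process with Exp(rate) holding times."""
    last, now = None, 0.0
    while True:
        now += rng.expovariate(rate)
        if now >= t:
            break
        last = now
    return math.inf if last is None else t - last


rate = 1.0
# For exponential holding times the limit age CDF is 1 - exp(-rate * tau),
# so the recency reduces to exp(-rate * age).
recencies = [math.exp(-rate * age_at(50.0, rate)) for _ in range(2000)]


def fraction_below(x):
    """Share of sampled recency values that are at most x."""
    return sum(r <= x for r in recencies) / len(recencies)
```

If the recency is close to Uniform(0,1), then `fraction_below(q)` should be close to `q` for any q between 0 and 1.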
Delay
• Goal: measure influence of entity A on entity B
• Key idea: study pairwise (A,B)-gaps
[Figure: activity timelines for User: alice1337 and User: bob_iz_kewl, from 8:00 am through 10:00 am and 12:00 pm to now]
• Challenge: More frequent communicators will tend to always have shorter “gaps”. This is another example of time-scale bias.
Delay
• Given renewal processes Φ and Ψ, we say the ordered pair of events $(\phi_i, \psi_j)$ is adjacent if $t(\phi_i) < t(\psi_j)$ and $\nexists\, t \in T_\Phi \cup T_\Psi : t(\phi_i) < t < t(\psi_j)$. We refer to the elapsed time $t(\psi_j) - t(\phi_i)$ as the pairwise gap. We denote by $\mathrm{Gap}_{\Phi,\Psi}(t)$ the most recent such gap at time t.
• If Φ and Ψ are independent processes, then we can derive the limit distribution $F_{\Phi,\Psi}^{\mathrm{Gap}*}$ of pairwise gaps between consecutive (Φ, Ψ) event pairs.
• We define the (Φ, Ψ)-delay at time t to be:
  $$\mathrm{Del}_{\Phi,\Psi}(t) = 1 - F_{\Phi,\Psi}^{\mathrm{Gap}*}(\mathrm{Gap}_{\Phi,\Psi}(t))$$
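
Extracting pairwise gaps from two event streams can be sketched as a single pass over the merged timeline. The sketch assumes all timestamps are distinct and reads "adjacent" as "no other event of either process strictly in between"; the function names and sample timestamps are hypothetical.

```python
def pairwise_gaps(phi_times, psi_times):
    """Return (time, gap) pairs for adjacent (phi, psi) events.

    Assumes all timestamps are distinct. A gap is recorded whenever a
    phi event is immediately followed by a psi event in the merged
    timeline; `time` is the timestamp of the psi event.
    """
    merged = sorted([(t, "phi") for t in phi_times] +
                    [(t, "psi") for t in psi_times])
    gaps = []
    for (t_prev, src_prev), (t_next, src_next) in zip(merged, merged[1:]):
        if src_prev == "phi" and src_next == "psi":
            gaps.append((t_next, t_next - t_prev))
    return gaps


def latest_gap(gaps, t):
    """Most recent gap completed at or before time t, or None if none yet."""
    done = [g for (when, g) in gaps if when <= t]
    return done[-1] if done else None


gaps = pairwise_gaps([1.0, 4.0, 9.0], [2.0, 3.0, 10.5])
```

In this example only the pairs (1.0, 2.0) and (9.0, 10.5) are adjacent; the psi event at 3.0 follows another psi event, and the phi event at 4.0 is followed by another phi event.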
Delay
• $\mathrm{Del}_{\Phi,\Psi}$ is a constant function on every interval $(t_i, t_{i+1})$, and also satisfies the uniformity property: for any pair of independent renewal processes Φ and Ψ, the limit distribution of $\mathrm{Del}_{\Phi,\Psi}$ is Uniform(0,1).
• By comparing an observed gap to the theoretical joint distribution of inter-arrival times for Φ and Ψ, delay effectively normalizes the gap relative to the temporal dynamics of Φ and Ψ individually.
• As with the recency function, this makes our approach robust to differences in time scale between networks or between entities within the same network.
Divergence
• Based on the Kolmogorov-Smirnov statistic, which compares the empirical distribution function (EDF) $F_n(x)$ to a hypothesized CDF $F(x)$:
  $$KS(F_n \,\|\, F) = \sup_x |F_n(x) - F(x)|$$
[Figure: EDF $F_n(x)$ plotted against CDF $F(x)$, both axes from 0 to 1; KS = 0.32]
• Recency divergence compares recency values for a set of nodes or edges to the CDF for Uniform(0,1)
• Delay divergence compares delay values for a set of edges, or for all (A,B)-gaps, to the CDF for Uniform(0,1)
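
A plain Kolmogorov-Smirnov distance to Uniform(0,1) can be computed as below. This is a sketch of the underlying statistic only: the divergence values reported on later slides exceed 1, so the divergence actually used must involve some rescaling that is not specified here. The function name and sample values are hypothetical.

```python
def ks_divergence_from_uniform(values):
    """sup_x |F_n(x) - x| for values in [0, 1], where F_n is their EDF."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")
    # The supremum is attained at, or just before, one of the sample points.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


spread_out = [0.125, 0.375, 0.625, 0.875]  # hypothetical recency values
bunched_up = [0.90, 0.92, 0.95, 0.99]      # many very recent edges
```

Evenly spread values give a small distance (0.125 here), while values bunched near 1, meaning many edges that are recent relative to their own time scales, give a large one (0.9 here).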
Streaming Node-Centric Algorithm
• Goal: Flag times at which a node exhibits anomalous
activity (indicated by an unusually high concentration
of recent outgoing communication)
• Approach: Since the recency function is decreasing between consecutive communications, measure the recency divergence at a node only at times at which new activity occurs
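
One possible reading of this approach is sketched below. It makes several assumptions not stated in the slides: recency is computed from the empirical inter-arrival times of each outgoing edge, the edge that just fired is scored just after its event (age 0), edges with no history yet are skipped, timestamps on an edge are distinct, and divergence is the unscaled KS distance to Uniform(0,1). The class name `NodeMonitor` is hypothetical.

```python
from collections import defaultdict


class NodeMonitor:
    """Tracks outgoing edges of every node and scores a node on each event."""

    def __init__(self):
        self.last = {}                 # edge -> time of its latest event
        self.iats = defaultdict(list)  # edge -> observed inter-arrival times
        self.out = defaultdict(set)    # node -> its outgoing edges

    def _recency(self, edge, now):
        iats = self.iats[edge]
        if not iats:
            return None  # not enough history to normalize this edge yet
        age = now - self.last[edge]
        return 1.0 - sum(min(s, age) for s in iats) / sum(iats)

    def observe(self, src, dst, now):
        """Record an event and return src's recency divergence (or None)."""
        edge = (src, dst)
        if edge in self.last:
            self.iats[edge].append(now - self.last[edge])
        self.last[edge] = now
        self.out[src].add(edge)
        recs = [self._recency(e, now) for e in self.out[src]]
        xs = sorted(r for r in recs if r is not None)
        if not xs:
            return None
        n = len(xs)
        # Unscaled KS distance between the recency values and Uniform(0,1).
        return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
```

Each call to `observe` touches only the outgoing edges of the node that just communicated, so nothing is recomputed for idle nodes.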
The MCD Algorithm
Maximal Component Divergence Algorithm
• Goal: Identify subgraphs with correlated behavior
• Recency divergence to find recent anomalous activity
• Delay divergence to identify spheres of influence
Challenge: How do we overcome the combinatorial explosion?
The MCD Algorithm
Maximal Component Divergence Algorithm
1. Calculate edge weights using recency or delay function
2. Gradually decrease the threshold, updating
components and divergence values as necessary
3. Output: Disjoint components with max divergence
[Figure: example graph on V1, ..., V5 with edge weights 0.9, 0.75, 0.7, 0.5, 0.3, 0.1]

θ      Component            Div(C)
0.9    {V1,V2}              2.908
0.75   {V1,V2,V3}           2.723
0.7    {V1,V2,V3}           6.132
0.5    {V4,V5}              1.143
0.3    {V1,V2,V3,V4,V5}     2.380
0.1    {V1,V2,V3,V4,V5}     1.882

[Figure: component tree over V1, ..., V5 labeled with divergence values 2.9, 2.7, 6.1, 1.1, 2.4]
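
The threshold sweep in steps 1 and 2 can be sketched with a union-find structure: edges are added in order of decreasing weight, and the component touched by each edge is re-scored. Several things here are assumptions rather than the authors' method: the edge endpoints in the example (only the weights are recoverable from the slide), the placeholder divergence (unscaled KS distance of a component's edge weights to Uniform(0,1), so the values will not match the Div(C) column), and the omission of the final selection of disjoint maximum-divergence components. Recomputing the divergence from scratch at each step is done for clarity and does not achieve the stated running time.

```python
def ks_from_uniform(values):
    """Placeholder divergence: unscaled KS distance to Uniform(0,1)."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def threshold_sweep(weighted_edges, divergence=ks_from_uniform):
    """Lower the threshold edge by edge; return (theta, component, Div) rows."""
    parent = {}
    members = {}  # root -> set of vertices in its component
    weights = {}  # root -> weights of the edges inside its component

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    trace = []
    for u, v, w in sorted(weighted_edges, key=lambda e: -e[2]):
        for x in (u, v):
            if x not in parent:
                parent[x], members[x], weights[x] = x, {x}, []
        ru, rv = find(u), find(v)
        if ru != rv:
            # Merge the smaller component into the larger one.
            if len(members[ru]) < len(members[rv]):
                ru, rv = rv, ru
            parent[rv] = ru
            members[ru] |= members.pop(rv)
            weights[ru] += weights.pop(rv)
        weights[ru].append(w)
        trace.append((w, frozenset(members[ru]), divergence(weights[ru])))
    return trace


# Edge endpoints are assumed; only the weights come from the example.
edges = [("V1", "V2", 0.9), ("V2", "V3", 0.75), ("V1", "V3", 0.7),
         ("V4", "V5", 0.5), ("V3", "V4", 0.3), ("V1", "V5", 0.1)]
trace = threshold_sweep(edges)
```

With these assumed endpoints, the sequence of thresholds and components in `trace` follows the θ and Component columns of the example table.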
Sample Output
MCD     θ      #V(C)   E-frac   %E(C)   %E(G)
14.57   0.07   54      53/212   0.25    0.08
12.84   0.08   32      31/88    0.35    0.08
3.70    0.10   6       5/7      0.71    0.10
2.97    0.18   5       4/4      1.00    0.14
1.91    0.05   7       6/41     0.15    0.04
Robustness to Time Scale
• Simulation: R-MAT model, 128 vertices, average degree 16
• Inter-arrival times (IATs) for edge activity sampled from Bounded Pareto distributions, rate parameter between 10 minutes and 1 week
• Every 5 days, a randomly selected node has anomalous activity at 10x its normal rate
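
Sampling inter-arrival times from a Bounded Pareto distribution can be done by inverting its CDF, as sketched below. The shape value and the use of 10 minutes and 1 week as the distribution's bounds are assumptions; the slide may instead mean that each edge's own scale is drawn from that range.

```python
import random


def sample_bounded_pareto(rng, low, high, alpha):
    """Draw from a Bounded Pareto(alpha) on [low, high] by inverting its CDF
    F(x) = (1 - (low / x) ** alpha) / (1 - (low / high) ** alpha)."""
    u = rng.random()
    return low / (1.0 - u * (1.0 - (low / high) ** alpha)) ** (1.0 / alpha)


rng = random.Random(42)   # fixed seed so the sketch is reproducible
TEN_MINUTES = 10.0        # minutes
ONE_WEEK = 7 * 24 * 60.0  # minutes
# alpha = 1.5 is an arbitrary illustrative shape, not a value from the slides.
iats = [sample_bounded_pareto(rng, TEN_MINUTES, ONE_WEEK, 1.5)
        for _ in range(1000)]
```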
Robustness to Time Scale
• Conclusion: While it takes longer for anomalous
activity to be recognized at nodes with lower rates,
the magnitude of the peak seems to be independent
of activity rate but highly correlated with degree
Accuracy and Precision
• Simulation: star network, 100 trials with only normal activity and 100 trials including a period of anomalous activity
• ROC curves show accuracy and precision for several methods for distinguishing between the two scenarios
• Conclusion: Especially when variability is introduced, our approach outperforms the WtdDeg and Z-Score metrics
Detection Latency
• Data: Enron corpus, 1k nodes, 2k edges, 4k timestamps
• Compare our approach with the GraphScope algorithm
• Conclusion: The two algorithms seem to identify similar
times of anomalous activity, but our approach based on
the REWARDS model has shorter response time
Anomaly Detection in IP Traffic
• Data: LBNL network trace, > 9 million timestamps during
one hour on December 15, 2004
• Compare our approach with total network volume and
with “scanning activity” labeled by LBNL analysts
Anomaly Detection in IP Traffic
• Three of the four times of highest recency divergence $\mathrm{Div}_{\mathrm{Rec}}$ correspond to labeled scanning activity
• The peak in scanning activity at 12:07pm is primarily due
to an increase in DNS and NBNS lookups
• The peak at 12:26pm was not flagged by the analysts
since the sequence of IP addresses was not monotonic
Complexity Analysis
• Dataset: Twitter messages, Nov. 2008 – Oct. 2009 (263k nodes, 308k edges, 1.1 million timestamps)
• Updates: O(1) per communication
• MCD Algorithm: O(m log m), where m = number of edges; can be approximated in effectively O(m) time
[Figure: Runtime for MCD Algorithm; runtime in milliseconds (0 to 2000) against number of live edges (0 to 60,000)]
Future Work
• Incorporate duration of communication and other node or edge attributes into our model
• Make use of geographical and textual content
• Use gap divergence to infer links, and compare to the approach of Gomez-Rodriguez et al.
• Develop a streaming algorithm to identify emerging trends
Acknowledgements
• Part of this work was conducted at Lawrence Livermore National Laboratory, under the guidance of Tina Eliassi-Rad.
• This project is partially supported by a DHS Career Development Grant, under the auspices of CCICADA, a DHS Center of Excellence.