Streaming Models and Algorithms for Communication and Information Networks
Brian Thompson (joint work with James Abello)

Outline
• Introduction and Motivation
• A Streaming Model
• Our Approach
• Algorithms
• Experimental Results
• Conclusions and Future Work

Outline: Introduction and Motivation

Problem Description
Data: A network (G; T)
• G = (V, E) is a graph
• T is a set of time-stamped events corresponding to nodes or edges in G
Goals:
• Identify recent correlated activity
• Measure influence between entities
Challenges:
• Scalability: networks may be very large, limited space
• Efficiency: high data rate, time-sensitive information
• Variability: entities have different temporal dynamics

Related Work
• Time-evolving graph model: sequence of “snapshots”
[Figure: graph snapshots at t=1, t=2, t=3, t=4]
• Time series analysis
[Figure: IP Traffic (MB Per Hour)]

Related Work
• Cascade model: set of seed nodes, information (product, news, virus) propagates through network

Outline: A Streaming Model

Data Model
G is a graph
[Figure: graph on nodes Alice, Bob, Cheng, Devika, Elina]
T is a set of time-stamped events corresponding to nodes or edges in G

Source    Recipient    Content                  Timestamp
Alice     (public)     “Fire at 2nd & Main!”    Tuesday, 9:25am
Bob       Cheng        (private message)        Tuesday, 9:27am
Cheng     (public)     “RT @Alice Fire ...”     Tuesday, 9:28am

Data Model (Node-centric)
[Figure: node-centric view of the graph]

Data Model (Edge-centric)
[Figure: edge-centric view of the graph on nodes Alice, Bob, Cheng, Devika, Elina]

Renewal Theory
A renewal process $\Phi$ is a continuous-time Markov process where state transitions occur with holding times sampled independently from a positive distribution $\mu$.
Let $S_1, S_2, \ldots$ be samples from $\mu$, and consider a sequence of events corresponding to those holding times.
[Figure: timeline $T_\Phi$ starting at 0 with events $t_1, \ldots, t_5$; the inter-arrival time $S_3$ is marked]
We call the $S_i$ inter-arrival times, and refer to the sequence $T_\Phi = \{ t_i = \sum_{j=1}^{i} S_j \}$ as the discrete-event sequence for $\Phi$.

Renewal Theory
The age of a renewal process $\Phi$ at time $t$ is the amount of time elapsed since the last event:
$$Age_\Phi(t) = \begin{cases} t - \max\{ t_i : t_i < t \} & \text{if } t \ge t_1 \\ \infty & \text{otherwise} \end{cases}$$
[Figure: $Age_\Phi(t)$ plotted above the timeline $T_\Phi$ with events $t_1, \ldots, t_5$ and the current time $t$]

The REWARDS Model
REneWal theory Approach for Real-time Data Streams
We model a stream of communication data from a node or across an edge as a renewal process.
Given a stream of time-stamped events, we estimate the parameters of the renewal process for each node or edge based on the inter-arrival times.
[Figure: inter-arrival time distribution supported on $[x_{min}, x_{max}]$, above the discrete-event sequence $t_1, \ldots, t_5$]

Outline: Our Approach

Recency
Goal: highlight recent activity
Key idea: more recent = more relevant
[Figure: activity timeline with ticks at 8:00 am, 10:00 am, 12:00 pm, and now]
Users shown: alice1337, bob_iz_kewl
Challenge: The most frequent communicators will always seem “recent”, overshadowing others’ behavior. We call this time-scale bias.

Recency
We can overcome time-scale bias by using the REWARDS model.
We first derive the limit distribution $F^{Age*}_{\Phi}$ of the $Age$ function:
$$F^{Age*}_{\Phi}(\tau) = \lim_{t \to \infty} \Pr\left[ Age_\Phi(t) \le \tau \right]$$
We define the recency of $\Phi$ at time $t$ to be:
$$Rec_\Phi(t) = 1 - F^{Age*}_{\Phi}\left( Age_\Phi(t) \right)$$

Recency
$Rec_\Phi$ is a decreasing function on every interval between consecutive events $t_i$ and $t_{i+1}$. It also satisfies the uniformity property: for any renewal process $\Phi$, the limit distribution of $Rec_\Phi$ is Uniform(0,1).
[Figure: Recency of Edge <3,22> in Bluetooth Dataset]
Recency effectively normalizes the age of a process relative to its own temporal dynamics, making our approach robust to differences in time scale between networks or between entities within the same network.

Delay
Goal: measure influence of entity A on entity B
Key idea: study pairwise (A,B)-gaps
[Figure: activity timelines from 8:00 am to now for users alice1337 and bob_iz_kewl]
Challenge: More frequent communicators will tend to have shorter “gaps”. This is another example of time-scale bias.

Delay
Given renewal processes $\Phi$ and $\Psi$, we say the ordered pair of events $(\phi_i, \psi_j)$ is adjacent if $t(\phi_i) < t(\psi_j)$ and $\nexists\, t \in T_\Phi \cup T_\Psi : t(\phi_i) < t < t(\psi_j)$.
We refer to the elapsed time $t(\psi_j) - t(\phi_i)$ as the pairwise gap. We denote by $Gap_{\Phi,\Psi}(t)$ the most recent such gap at time $t$.
If $\Phi$ and $\Psi$ are independent processes, then we can derive the limit distribution $F^{Gap*}_{\Phi,\Psi}$ of pairwise gaps between consecutive $(\Phi, \Psi)$ event pairs.
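The adjacency and pairwise-gap definitions above can be illustrated with a minimal Python sketch. It reads the definition as "a Φ event followed by a Ψ event with no other event of either process between them", assumes all timestamps are distinct, and assumes that the gap at time t counts only pairs whose Ψ event occurred at or before t. The function names `pairwise_gaps` and `latest_gap` are hypothetical and do not come from the slides.

```python
from typing import List, Optional, Sequence, Tuple


def pairwise_gaps(
    phi_times: Sequence[float], psi_times: Sequence[float]
) -> List[Tuple[float, float, float]]:
    """Return (t_phi, t_psi, gap) for every adjacent (phi, psi) event pair.

    Assumption: timestamps are distinct, so two events are adjacent exactly
    when they are neighbours in the merged, time-sorted event stream.
    """
    merged = sorted(
        [(t, "phi") for t in phi_times] + [(t, "psi") for t in psi_times]
    )
    gaps = []
    for (t_prev, src_prev), (t_next, src_next) in zip(merged, merged[1:]):
        # A phi event immediately followed by a psi event forms an adjacent pair.
        if src_prev == "phi" and src_next == "psi" and t_prev < t_next:
            gaps.append((t_prev, t_next, t_next - t_prev))
    return gaps


def latest_gap(
    phi_times: Sequence[float], psi_times: Sequence[float], t: float
) -> Optional[float]:
    """Most recent pairwise gap completed at or before time t (None if none yet)."""
    done = [g for (_, t_psi, g) in pairwise_gaps(phi_times, psi_times) if t_psi <= t]
    return done[-1] if done else None


if __name__ == "__main__":
    phi = [1.0, 4.0, 9.0]
    psi = [2.5, 3.0, 10.0]
    print(pairwise_gaps(phi, psi))    # [(1.0, 2.5, 1.5), (9.0, 10.0, 1.0)]
    print(latest_gap(phi, psi, 5.0))  # 1.5
```

In the example, the Ψ event at 3.0 does not form a pair because the event immediately before it (at 2.5) also belongs to Ψ. A streaming implementation would keep only the last event seen per pair of processes rather than re-sorting; the batch form is used here only for clarity.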
We define the $(\Phi, \Psi)$-delay at time $t$ to be:
$$Del_{\Phi,\Psi}(t) = 1 - F^{Gap*}_{\Phi,\Psi}\left( Gap_{\Phi,\Psi}(t) \right)$$

Delay
$Del_{\Phi,\Psi}$ is a constant function on every interval between consecutive events $t_i$ and $t_{i+1}$, and also satisfies the uniformity property: for any pair of independent renewal processes $\Phi$ and $\Psi$, the limit distribution of $Del_{\Phi,\Psi}$ is Uniform(0,1).
By comparing an observed gap to the theoretical joint distribution of inter-arrival times for $\Phi$ and $\Psi$, delay effectively normalizes the gap relative to the temporal dynamics of $\Phi$ and $\Psi$ individually.
As with the recency function, this makes our approach robust to differences in time scale between networks or between entities within the same network.

Outline: Algorithms

Divergence
Based on the Kolmogorov-Smirnov statistic, which compares the empirical distribution function (EDF) $F_n(x)$ to a hypothetical CDF $F(x)$:
$$KS(F_n \,\|\, F) = \sup_x \left| F_n(x) - F(x) \right|$$
[Figure: $F_n(x)$ and $F(x)$ plotted on [0, 1]; in the example shown, KS = 0.32]
• Recency divergence compares recency values for a set of nodes or edges to the CDF for Uniform(0,1)
• Delay divergence compares delay values for a set of edges, or for all (A,B)-gaps, to the CDF for Uniform(0,1)

Streaming Node-Centric Algorithm
• Goal: Flag times at which a node exhibits anomalous activity (indicated by an unusually high concentration of recent outgoing communication)
• Approach: Since the recency function is decreasing between consecutive communications, measure the recency divergence at a node only at times at which new activity occurs

The MCD Algorithm
Maximal Component Divergence Algorithm
• Goal: Identify subgraphs with
correlated behavior
• Recency divergence to find recent anomalous activity
• Delay divergence to identify spheres of influence
Challenge: How do we overcome the combinatorial explosion?

The MCD Algorithm
Maximal Component Divergence Algorithm
1. Calculate edge weights using recency or delay function
2. Gradually decrease the threshold, updating components and divergence values as necessary
3. Output: Disjoint components with max divergence
[Figure: example graph on V1 to V5 with edge weights 0.9, 0.75, 0.7, 0.5, 0.3, 0.1]

θ       Component             Div(C)
0.9     {V1,V2}               2.908
0.75    {V1,V2,V3}            2.723
0.7     {V1,V2,V3}            6.132
0.5     {V4,V5}               1.143
0.3     {V1,V2,V3,V4,V5}      2.380
0.1     {V1,V2,V3,V4,V5}      1.882

[Figure: components over V1 to V5 annotated with divergence values 2.9, 2.7, 6.1, 1.1, 2.4]

Sample Output

MCD      θ       #V(C)    E-frac    %E(C)    %E(G)
14.57    0.07    54       53/212    0.25     0.08
12.84    0.08    32       31/88     0.35     0.08
3.70     0.10    6        5/7       0.71     0.10
2.97     0.18    5        4/4       1.00     0.14
1.91     0.05    7        6/41      0.15     0.04

Outline: Experimental Results

Robustness to Time Scale
• Simulation: R-MAT model, 128 vertices, average degree 16
• Inter-arrival times (IATs) for edge activity sampled from Bounded Pareto distributions, rate parameter between 10 minutes
and 1 week
• Every 5 days, a randomly selected node has anomalous activity at 10x its normal rate

Robustness to Time Scale
[Figure: simulation results for robustness to time scale]

Robustness to Time Scale
• Conclusion: While it takes longer for anomalous activity to be recognized at nodes with lower rates, the magnitude of the peak seems to be independent of activity rate but highly correlated with degree

Accuracy and Precision
• Simulation: star network, 100 trials with only normal activity and 100 trials including a period of anomalous activity
• ROC curves show accuracy and precision for several methods for distinguishing between the two scenarios
• Conclusion: Especially when variability is introduced, our approach outperforms the WtdDeg and Z-Score metrics

Detection Latency
• Data: Enron corpus, 1k nodes, 2k edges, 4k timestamps
• Compare our approach with the GraphScope algorithm
• Conclusion: The two algorithms seem to identify similar times of anomalous activity, but our approach based on the REWARDS model has a shorter response time

Anomaly Detection in IP Traffic
• Data: LBNL network trace, > 9 million timestamps during one hour on December 15, 2004
• Compare our approach with total network volume and with “scanning activity” labeled by LBNL analysts

Anomaly Detection in IP Traffic
• Three of the four times of highest recency divergence $Div_{Rec}$ correspond to labeled scanning activity
• The peak in scanning activity at 12:07pm is primarily due to an increase in DNS and NBNS lookups
• The peak at 12:26pm was not flagged by the analysts since the sequence of IP addresses was not monotonic

Complexity Analysis
Dataset: Twitter messages, Nov. 2008 to Oct. 2009 (263k nodes, 308k edges, 1.1 million timestamps)
• Updates: O(1) per communication
• MCD Algorithm: O(m log m), where m = # of edges; can be approximated in effectively O(m) time
[Figure: Runtime for MCD Algorithm; x-axis: number of live edges (0 to 60,000); y-axis: runtime in milliseconds (0 to 2000)]

Outline: Conclusions and Future Work

Future Work
• Incorporate duration of communication and other node or edge attributes into our model
• Make use of geographical and textual content
• Use gap divergence to infer links, compare to approach of Gomez-Rodriguez et al.
• Develop streaming algorithm to identify emerging trends

Acknowledgements
Part of this work was conducted at Lawrence Livermore National Laboratory, under the guidance of Tina Eliassi-Rad.
This project is partially supported by a DHS Career Development Grant, under the auspices of CCICADA, a DHS Center of Excellence.
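
Supplementary sketch: the recency and divergence definitions from the Our Approach and Algorithms parts of the talk can be made concrete in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation. It assumes the inter-arrival distribution is estimated by the empirical distribution of observed inter-arrival times, and it uses the standard renewal-theory result that the limiting age distribution is $F^{Age*}(\tau) = \frac{1}{E[S]} \int_0^\tau (1 - F_S(x))\,dx$, which for an empirical sample reduces to $\sum_i \min(s_i, \tau) / \sum_i s_i$. The divergence here is the plain Kolmogorov-Smirnov statistic against Uniform(0,1); any scaling used for the Div(C) values in the slides is not reproduced. All function names are hypothetical.

```python
import bisect
from typing import Sequence


def limit_age_cdf(inter_arrivals: Sequence[float], tau: float) -> float:
    """Limiting age CDF at tau, using the empirical inter-arrival distribution.

    Equals (1/E[S]) * integral_0^tau (1 - F_S(x)) dx for the empirical F_S.
    """
    total = sum(inter_arrivals)
    return sum(min(s, tau) for s in inter_arrivals) / total


def recency(
    event_times: Sequence[float], inter_arrivals: Sequence[float], t: float
) -> float:
    """Rec(t) = 1 - F_age*(Age(t)); event_times must be sorted ascending.

    If no event occurred strictly before t, the age is infinite and Rec is 0.
    """
    n_before = bisect.bisect_left(event_times, t)  # events with t_i < t
    if n_before == 0:
        return 0.0
    age = t - event_times[n_before - 1]
    return 1.0 - limit_age_cdf(inter_arrivals, age)


def ks_vs_uniform(values: Sequence[float]) -> float:
    """KS statistic between the EDF of values and the Uniform(0,1) CDF."""
    xs = sorted(values)
    n = len(xs)
    # The supremum is attained just before or at one of the sample points.
    return max(max((k + 1) / n - x, x - k / n) for k, x in enumerate(xs))


if __name__ == "__main__":
    events = [1.0, 4.0]   # observed event times for one edge
    iats = [1.0, 3.0]     # its observed inter-arrival times
    print(recency(events, iats, 4.5))  # 0.75: an event occurred very recently
    print(recency(events, iats, 3.0))  # 0.25: 2.0 time units since the last event
    # Recency values crowded near 1 (or, as here, away from uniform) give a
    # large divergence from Uniform(0,1).
    print(ks_vs_uniform([0.1, 0.2, 0.3, 0.4]))
```

Because recency is computed against each process's own inter-arrival distribution, the same elapsed time yields a different recency for a frequent communicator than for an infrequent one, which is the normalization the uniformity property relies on.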