Statistical Mining in Data Streams
Ankur Jain, Dissertation Defense
Computer Science, UC Santa Barbara
Committee: Edward Y. Chang (chair), Divyakant Agrawal, Yuan-Fang Wang

Roadmap
- The Data Stream Model: introduction and research issues; related work
- Data Stream Mining: stream data clustering; Bayesian reasoning for sensor stream processing
- Contribution summary
- Future work

Data Streams
"A data stream is an unbounded and continuous sequence of tuples."
- Tuples arrive online and can be multi-dimensional
- A tuple seen once cannot be easily retrieved later
- There is no control over the tuple arrival order

Applications
- Network monitoring: detecting anomalies and intrusions (DoS, PROBE, U2R) in connection streams
- Text processing: email, blogs, click streams
- Sensor networks: e.g., "Find the mean temperature of the lagoon in the last 3 hours"
- Video surveillance
- Stock ticker monitoring
- Process control and manufacturing
- Traffic monitoring and analysis
- Transaction log processing
A traditional DBMS does not work for such workloads!

Data Stream Projects
- STREAM (Stanford): a general-purpose Data Stream Management System (DSMS)
- Telegraph (Berkeley): adaptive query processing; TinyDB, a general-purpose sensor database
- Aurora (Brown/MIT): distributed stream processing; introduces new operators (map, drop, etc.)
- Cougar (Cornell): sensors form a distributed database system; cross-layer optimizations between the data management layer and the routing layer
- MAIDS (UIUC): Mining Alarming Incidents in Data Streams; Streaminer, a data stream mining system

Data Stream Processing - Key Ingredients
- Adaptivity: incorporate evolutionary changes in the stream
- Approximation: exact results are hard to compute fast with limited memory

A Data Stream Management System (DSMS)
[Figure: DSMS architecture. User queries with precision constraints enter the central stream processing system, which maintains stream synopses and couples query processing with resource management, adaptive stream mining, data filtering, and data sampling. Control feedback (sampling rate, sliding-window size, sensor calibration) flows back to the data acquisition layer over the streaming data sources/sensors, and streaming query results are returned to the user.]

Thesis Outline
"Develop fast, online, statistical methods for mining data streams."
- Adaptive non-linear clustering in multidimensional streams
- Bayesian reasoning for sensor stream processing
- Filtering methods for resource conservation
- Change detection in data streams
- Video sensor data stream processing

Clustering in High-Dimensional Streams
"Given a continuous sequence of points, group them into some number of clusters, such that the members of a cluster are geometrically close to each other."

Example Application - Network Monitoring
[Figure: high-dimensional connection tuples stream in from the Internet and must be labeled on the fly: DoS, Probe, or Normal?]

Stream Clustering - New Challenges
- One-pass restriction and limited memory: we use the fading cluster technique proposed by Aggarwal et al.
- Non-linear separation boundaries: we propose using the kernel trick
- Data dimensionality: we propose an effective incremental dimension-reduction technique

The 2-Tier Framework
[Figure: a d-dimensional point x arrives from the stream; Tier 1 performs stream segmentation; Tier 2 projects x to x~ in a low-dimensional space (LDS, q dimensions, q < d) and updates the projection; fading clusters C1-C9 reside in the LDS. The 2-tier clustering module uses the kernel trick.]

The Fading Cluster Methodology
- Each cluster Ci has a recency value Ri = f(t - tlast), where t is the current time and tlast is the last time Ci was updated
- f(t) = e^(-λt) is the fading factor
- A cluster is erased from memory (faded) when Ri ≤ h, where h is a user parameter that controls the influence of historical data
- The total number of clusters is therefore bounded
(A small sketch of this bookkeeping follows.)
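To make the fading mechanics concrete, here is a minimal Python sketch of the recency bookkeeping described above, assuming an exponential decay rate lambda_ and a fading threshold h as the user parameters; the class and function names are illustrative, not the dissertation's actual implementation.

```python
import math
import time

class FadingCluster:
    """Illustrative fading cluster; lambda_ and h are assumed user parameters."""
    def __init__(self, center, t=None, lambda_=0.01):
        self.center = center                            # cluster center in the LDS
        self.t_last = time.time() if t is None else t   # last time the cluster was updated
        self.lambda_ = lambda_                          # decay rate of the fading factor

    def recency(self, t):
        # R_i = f(t - t_last) with f(t) = exp(-lambda * t)
        return math.exp(-self.lambda_ * (t - self.t_last))

    def touch(self, t):
        # Called whenever a point is assigned to this cluster.
        self.t_last = t

def prune_faded(clusters, t, h=0.05):
    # A cluster fades once its recency drops to h or below; h controls
    # how much influence historical data retains.
    return [c for c in clusters if c.recency(t) > h]
```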
Non-linearity in Data
[Figure: the same data in the input space and in the feature space after the mapping φ.]
- Traditional clustering techniques (e.g., k-means) do not perform well on non-linearly separable data
- Spectral clustering methods are likely to perform better

Non-linearity in Network Intrusion Data
[Figure: "ipsweep" attack data in the input space and in the feature space, where it follows a geometrically well-behaved trend.]
- Use the kernel trick?

The Kernel Trick
- Actual projection into a higher-dimensional space is computationally expensive
- The kernel trick performs the non-linear projection implicitly!
- Given two input-space vectors x and y: k(x, y) = <φ(x), φ(y)>, where k is the kernel function
- The Gaussian kernel k(x, y) = exp(-γ||x - y||^2) was used in the previous example!

Kernel Trick - Working Example
Consider the mapping φ: x = (x1, x2) -> φ(x) = (x1^2, x2^2, √2 x1x2), which is never required explicitly:
  <φ(x), φ(z)> = <(x1^2, x2^2, √2 x1x2), (z1^2, z2^2, √2 z1z2)>
               = x1^2 z1^2 + x2^2 z2^2 + 2 x1x2 z1z2
               = (x1z1 + x2z2)^2
               = <x, z>^2
So k(x, z) = <x, z>^2. The kernel trick allows us to perform operations in a high-dimensional feature space using a kernel function, without explicitly representing φ. (A numerical check follows.)
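The identity above is easy to verify numerically. A short sketch (NumPy assumed; the test vectors are arbitrary) comparing the explicit feature-space inner product against the implicit kernel evaluation:

```python
import numpy as np

def phi(x):
    # Explicit feature map: (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2).
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k(x, z):
    # Polynomial kernel: k(x, z) = <x, z>^2.
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(np.dot(phi(x), phi(z)))   # 121.0 -- inner product in feature space
print(k(x, z))                  # 121.0 -- same value, no explicit projection
```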
Dimensionality Reduction
- A PCA-like kernel method with an explicit representation is desirable; eigenvalue decomposition (EVD) is preferred
- KPCA is computationally prohibitive: O(n^3)
- The principal components evolve with time, so frequent EVD updates may be necessary
- We propose to perform EVD on grouped data instead of point data
- This requires a novel kernel method

The 2-Tier Framework (continued)
- Tier 1 captures the temporal locality in a segment
  - A segment is a group of contiguous points in the stream that are geometrically packed closely in the feature space
- Tier 2 adaptively selects segments to project data into the LDS
  - The selected segments are called representative segments
  - Implicit data in the feature space is projected explicitly into the LDS such that the feature-space distances are preserved

The 2-Tier Framework (control flow; a schematic sketch follows the list)
- Obtain a point x from the stream
- Tier 1: if φ(x) is novel with respect to the current segment S and s > smin, or if s = smax (s being the segment size):
  - Tier 2: if S is a representative segment, add S to memory and update the LDS; otherwise, clear the contents of S
- Add x to S
- Obtain the projection x~ in the LDS
- If x~ is close to an active cluster, assign x to its nearest cluster; otherwise, create a new cluster with x
- Update cluster centers and recency values; delete faded clusters
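The per-point loop can be sketched as below. This is a schematic stand-in, not the dissertation's algorithm: the kernel novelty test is replaced by a crude Euclidean distance from the segment mean, the representative-segment check and LDS update are elided to comments, and the raw point stands in for its LDS image x~. It reuses FadingCluster and prune_faded from the earlier sketch.

```python
import numpy as np

def is_novel(x, segment, tol=1.0):
    # Stand-in for the kernel-based novelty test of phi(x) w.r.t. segment S.
    return len(segment) > 0 and np.linalg.norm(x - np.mean(segment, axis=0)) > tol

def process_point(x, segment, clusters, t, s_min=5, s_max=50, radius=1.0):
    # Tier 1: grow the current segment until the novelty test fires
    # (after at least s_min points) or the segment reaches size s_max.
    if (is_novel(x, segment) and len(segment) > s_min) or len(segment) == s_max:
        # Tier 2 would decide here whether S is a representative segment
        # and, if so, add it to memory and refresh the LDS projection.
        segment.clear()
    segment.append(x)

    # Cluster maintenance on the (notionally projected) point.
    near = [c for c in clusters if np.linalg.norm(x - c.center) < radius]
    if near:
        near[0].touch(t)                       # assign to an active cluster
    else:
        clusters.append(FadingCluster(x, t))   # open a new cluster
    clusters[:] = prune_faded(clusters, t)     # delete faded clusters
```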
Network Intrusion Stream
- Simulated data from MIT Lincoln Labs
- 34 continuous attributes (features)
- 10.5K records
- 22 types of intrusion attacks + 1 normal class
[Figure: clustering accuracy at LDS dimensionality u = 10.]

Efficiency - EVD Computations
[Figure: EVD computation counts on two datasets. Image data: 5K records, 576 features, 10 digits. Newswire data: 3.8K records, 16.5K features, 10 news topics.]

In Retrospect...
- We proposed an effective stream clustering framework
- We use the kernel trick to delineate non-linear boundaries efficiently
- We use a stream segmentation approach to continuously project data into a low-dimensional space

Bayesian Reasoning for Sensor Data Processing
- Users submit queries with precision constraints, e.g., "Find the temperature with 80% confidence"
- Resource conservation is of prime concern to prolong system life
- Use probabilistic models at the central acquisition site for approximate predictions, preventing actual acquisitions and data communication

Dependencies in Sensor Attributes
  Attribute    | Acquisition cost
  Temperature  | 50 J
  Voltage      | 5 J
- Dependency model: Bayesian networks
[Figure: rather than acquiring temperature directly ("Get Temperature"), acquire the much cheaper voltage ("Acquire Voltage!") and report temperature through the voltage-temperature dependency.]

Using Correlation Models [Deshpande et al., VLDB'04]
- Correlation models ignore conditional dependency
- Intel Lab (real sensor network data); attributes: Voltage (V), Temperature (T), Humidity (H)
- "Voltage" is correlated with "temperature"
- Yet for humidity in [35, 40), "voltage" is conditionally independent of "temperature" given "humidity"!

BN vs. Correlations
- Correlation model [Deshpande et al.]: maintains all dependencies; the search space for the best alternative sensor attribute is large; the joint probability is represented in O(n^2) cells
- Bayesian Network: maintains only the vital dependencies; lower search complexity, O(n); storage O(nd), where d is the average node degree; intuitive dependency structure
[Figures: learned structures on the NDBC buoy dataset and the Intel Lab dataset.]

Bayesian Networks (BN)
- Qualitative part: a directed acyclic graph (DAG); nodes are sensor attributes, edges are attribute influence relationships
- Quantitative part: conditional probability tables (CPTs); each node X has its own CPT, P(X | parents(X))
- Together, they represent the joint probability in factored form: P(T,H,V,L) = P(T) P(H|T) P(V|H) P(L|T)
- The "influence relationship" is measured by the entropy function H(Xi) = -Σ_{l=1..k} P(Xi = xil) log P(Xi = xil)
- We learn the BN by minimizing the conditional entropy H(Xi | parents(Xi)) (a toy computation follows)
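As a concrete illustration of that scoring criterion, the sketch below computes the empirical conditional entropy H(X | parents(X)) from discrete samples; structure learning would then prefer parent sets that minimize it. The toy data and index-based encoding are assumptions for demonstration only.

```python
from collections import Counter
import math

def conditional_entropy(samples, x_idx, parent_idx):
    """Empirical H(X | parents) from joint samples; each sample is a
    tuple of discrete attribute values, indexed by position."""
    joint = Counter((s[x_idx],) + tuple(s[i] for i in parent_idx) for s in samples)
    parents = Counter(tuple(s[i] for i in parent_idx) for s in samples)
    n = len(samples)
    h = 0.0
    for key, count in joint.items():
        p_joint = count / n
        p_parent = parents[key[1:]] / n
        h -= p_joint * math.log2(p_joint / p_parent)   # -sum p(x,pa) log p(x|pa)
    return h

# Toy samples over (Temperature, Humidity, Voltage): voltage tracks humidity,
# so H(Voltage | Humidity) comes out as 0 bits.
data = [(20, 35, 2.7), (21, 35, 2.7), (25, 38, 2.6), (26, 38, 2.6)]
print(conditional_entropy(data, x_idx=2, parent_idx=[1]))   # 0.0
```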
System Architecture
[Figure: group queries such as {(Temperature, 80%)}, {(Air Pressure, 90%), (Wind Speed, 90%)}, {(Temperature, 95%), (Wind Speed, 85%)}, and {(Wind Speed, 75%)} arrive at the query processor; the Bayesian inference engine, backed by the stored BN with its CPTs and per-attribute acquisition costs, performs group-query plan generation; the acquisition plan is executed against the sensor network and the acquired values flow back.]

Finding the Candidate Attributes
- For any attribute in the group-query Q, analyze candidate attributes in its Markov blanket recursively
- Selection criteria: information gain (conditional entropy) and acquisition cost
- Select candidates in a greedy fashion: meet the precision constraints while maximizing resource conservation

Experiments - Resource Conservation
- NDBC dataset, 7 attributes
[Figures: effect of using the MB property with δmin = 0.90; effect of using group-queries, with |Q| the group-query size.]

Results - Selectivity
[Figure: acquisition selectivity per attribute: Wave Period (WP), Wind Speed (SP), Air Pressure (AP), Wind Direction (DR), Water Temperature (WT), Wave Height (WH), Air Temperature (AT).]

In Retrospect...
- Bayesian networks can encode the sensor dependencies effectively
- Our method provides significant resource conservation for group-queries

Contribution Summary
- "Adaptive stream resource management using Kalman filters." [SIGMOD'04]
- "Adaptive sampling for sensor networks." [DMSN'04]
- "Adaptive non-linear clustering for data streams." [CIKM'06]
- "Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention." [CVPR'06]
- "Filtering the data streams." [in submission]
- "Efficient diagnostic and aggregate queries on sensor networks." [in submission]
- "OCODDS: An on-line change-over detection framework for tracking evolutionary changes in data streams." [in submission]

Future Work
- Develop non-linear techniques for capturing temporal correlations in data streams
- The Bayesian framework can be extended to address "what-if" queries with counterfactual evidence
- The clustering framework can be extended for developing stream visualization systems
- Incremental EVD techniques can improve the performance further

Thank You!

BACKUP SLIDES

Back to Stream Clustering
- We propose a 2-tier stream clustering framework
- Tier 1: a kernel method that continuously divides the stream into segments
- Tier 2: a kernel method that uses the segments to project data into a low-dimensional space (LDS)
- The fading clusters reside in the LDS

Clustering - LDS Projection
[Figure only.]

Clustering - LDS Update
[Figure only.]

Network Intrusion Stream
[Figures: clustering accuracy and cluster strengths at LDS dimensionality u = 10.]

Effect of Dimensionality
[Figure only.]

Query Plan Generation
- Given a group query, the query plan computes the "candidate attributes" that will actually be acquired to successfully answer the query
- We exploit the Markov blanket (MB) property to select candidate attributes
- Given a BN G, the Markov blanket of a node Xi comprises its immediate parents and children, and
    P(Xi | G) = P(Xi | MB(Xi)) = P(Xi, MB(Xi)) / P(MB(Xi))

Exploiting the MB Property
Claim: "Given a node Xi and an arbitrary set of nodes Y in a BN with Xi ∉ Y, the conditional entropy of Xi given Y is at least as high as that given its Markov blanket: H(Xi | Y) ≥ H(Xi | MB(Xi))."
Proof: Separate MB(Xi) into MB1 = MB(Xi) ∩ Y and MB2 = MB(Xi) - MB1, and let Z = Y - MB(Xi). Then:
  H(Xi | Y) = H(Xi | Z, MB1)        (Y = Z ∪ MB1)
            ≥ H(Xi | Z, MB1, MB2)   (additional information cannot increase entropy)
            = H(Xi | Z, MB(Xi))     (MB(Xi) = MB1 ∪ MB2)
            = H(Xi | MB(Xi))        (Markov blanket definition)

Bayesian Reasoning - More Results...
[Figures: effect of using the MB property with δmin = 0.90; query-answer quality loss on a 50-node synthetic-data BN.]

Bayesian Reasoning for Group Queries
- More accurate in addressing group queries:
    Q = {(Xi, δi) | Xi ∈ X, 0 < δi ≤ 1, 1 ≤ i ≤ n} s.t. δi < max_l P(Xi = xil)
  where X = {X1, X2, ..., Xn} are the sensor attributes, the δi are confidence parameters, and P(Xi = xil) is the probability with which Xi assumes the value xil
- Bayesian reasoning is helpful in detecting abnormalities

Bayesian Reasoning - Candidate Attribute Selection Algorithm
[Figure: pseudocode of the greedy candidate-attribute selection; a toy rendering follows.]
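A hedged, toy rendering of the greedy selection idea: rank blanket attributes by information gain per unit acquisition cost and keep acquiring until the confidence target is met. The numbers and the stand-in confidence function are illustrative assumptions, not the dissertation's inference engine.

```python
def select_candidates(blanket, gain, cost, confidence, delta):
    """blanket: candidate attributes from the Markov blanket walk;
    gain/cost: per-attribute dicts; confidence(acquired): stand-in
    posterior confidence for the queried attribute."""
    acquired = []
    # Greedy criterion: entropy reduction per joule of acquisition cost.
    for attr in sorted(blanket, key=lambda a: gain[a] / cost[a], reverse=True):
        if confidence(acquired) >= delta:
            break   # precision constraint already met: stop acquiring
        acquired.append(attr)
    return acquired

# Toy run: answering {(Temperature, 0.8)} through cheaper blanket attributes.
gain = {"Voltage": 0.9, "Humidity": 0.6}    # entropy reduction (illustrative)
cost = {"Voltage": 5.0, "Humidity": 20.0}   # joules per acquisition
conf = lambda acq: 0.6 + 0.15 * len(acq)    # stand-in posterior confidence
print(select_candidates(["Voltage", "Humidity"], gain, cost, conf, delta=0.8))
# -> ['Voltage', 'Humidity']: acquisitions continue until confidence >= 0.8
```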