Tackling Network Management Problems with Machine Learning Techniques
Preliminary Oral Exam
Yu Jin
Advisor: Professor Zhi-Li Zhang

Motivation
Effective solutions are required for various problems in large networks:
Traffic classification
Anomaly detection
Host/traffic profiling
Trouble-shooting
Manual rule-based solutions are either costly or unavailable
Scalability is a major issue

Our Solution
Abstract these problems into machine learning problems
Select advanced machine learning algorithms to solve them
Address the scalability issue
Deliver a system and evaluate it in a real operational environment

Outline
Work in the past – Traffic classification
Design and implementation of a modular machine learning architecture for flow-level traffic classification
Analysis of traffic interaction patterns using traffic activity graphs (TAGs)
Visualizing spatial traffic class distribution using colored TAGs
Traffic classification using colored TAGs
On-going work
Customer ticket prediction and trouble-shooting in DSL networks
Summary and time table

A Light-Weight Modular Machine Learning Approach to Large-Scale Network Traffic Classification

Motivation
Traffic classification is required for:
Security monitoring
Traffic policing and prioritization
Predicting application trends
Identifying new applications
An interesting research topic that cuts across multiple research areas: machine learning, traffic profiling, social network analysis, system optimization
Goal: design an operational system for a large ISP network, and learn valuable lessons for solving other practical problems

Challenges
Scalability: training and operating on 10Gbps links
Accuracy: similar performance as the rule-based classifier
Stability: remain accurate without human intervention
Versatility: reconfigurable
[Figure: new flow arrival rate (per minute) on a 1Gbps link, Monday through Sunday, for TCP and UDP; the rate ranges from roughly 50K to 300K flows per minute]

Current solution – Rule-based classifier
Matches layer-4/layer-7 packet headers against manually defined rules
Expensive in operation (flow/packet sampling is required)
Expensive in deployment (special hardware is required)
Inapplicable if the packet is encrypted, e.g., end-to-end IPSec traffic
However, the rule-based classifier can provide a good set of "labeled" flow data

A machine learning solution
[Figure: raw traffic data is collected from multiple networks; the training data is labeled by the rule-based classifier and then used for training]

A modular machine learning architecture
A modular architecture enables parallelization in both training and operation
First-level modularization: pre-partition the data by flow features (e.g., flow size, IP protocol)
Better accuracy
Higher scalability from parallelization
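The slides do not include code; purely as an illustration of the first-level modularization just described, here is a short Python sketch with a hypothetical partitioning rule and hypothetical classifier objects, showing how flows could be routed to per-partition classifiers and classified in parallel.

    # Minimal sketch (assumed names): route each flow to a partition-specific
    # classifier chosen by IP protocol and flow size, then classify in parallel.
    from concurrent.futures import ThreadPoolExecutor

    def partition_key(flow):
        # Hypothetical partitioning rule: IP protocol first, then small vs. large flows.
        size_bucket = "small" if flow["num_packets"] <= 3 else "large"
        return (flow["protocol"], size_bucket)

    def classify_flows(flows, classifiers):
        # classifiers: dict mapping a partition key to a trained per-partition classifier
        partitions = {}
        for flow in flows:
            partitions.setdefault(partition_key(flow), []).append(flow)
        with ThreadPoolExecutor() as pool:
            futures = {key: pool.submit(classifiers[key].predict, batch)
                       for key, batch in partitions.items()}
        return {key: fut.result() for key, fut in futures.items()}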
[Figure: flows are partitioned, and classification and prediction run independently on each of the m partitions]

Second-level modularization
From a single k-class classifier to k 1-vs-all binary classifiers
Accelerates training and operation
Low memory consumption
Parallelization
No significant performance loss
[Figure: within partition j, flow data is fed to k binary classifiers, one per application class; the prediction is the class with the maximum posterior P(Cj | x)]

Training a binary classifier
Sampling is necessary due to the huge amount of training data (90 million TCP flows plus 85 million UDP flows)
Weighted threshold sampling produces a more balanced and representative training set:
    if count(Cj) ≤ θ { keep all the flows in Cj; } else { sample with rate θ/count(Cj); }
[Figure: application class distribution (Business, Chat, DNS, FileSharing, FTP, Games, Mail, Multimedia, NetNews, SecurityThreat, VoIP, Web) before and after sampling; the raw distribution is highly skewed, with the largest class accounting for 63.276% of flows, while the sampled training set is roughly balanced across classes]

Selection of binary classifier
Any classifier can be used as a component in our traffic classification architecture
Boosting decision stumps (1-level decision trees):
Fast
Accurate
Simple
Implicit L-1 regularization
TCP flow error rates for different binary classifiers show that the "non-linear" Boosting Trees (BTree) classifier has the best performance, and Boosting Stumps (BStump) is slightly better than L1-Maxent

Logistic calibration
The IID assumption is violated by the weighted threshold sampling method
The score output from AdaBoost makes direct combination of binary classification results infeasible
We need to calibrate the binary classifiers:
Address the difference in traffic distributions
Convert scores fc(x) to posterior probabilities P(C|x)
[Figure: reliability diagram for TCP Web]

Training architecture for a binary classifier
One binary classifier for each application class
One calibrator is trained on the classification results for a small independent flow sample (simple random sample)
[Figure: the thresholded-sampled training data trains the binary classifier; the independent sample and its reliability diagram train the calibrator, together forming the application classifier]

Performance evaluation – Accuracy
Our classifier can reproduce the classification results of the rule-based classifier with high accuracy
Direct training of a multi-class classifier provides little gain in accuracy according to tests on small samples
[Figure: flow error rates before and after calibration for BStump and Naive Bayes on TCP and UDP; calibration substantially reduces the error in all four cases (BStump reaches roughly 3% flow error on TCP and 0.3% on UDP), and BStump consistently outperforms Naive Bayes]

Scalability
Scalability in training:
Accuracy increases with the size of the training data and the number of iterations
Training takes less than 1.5 hours with 700K training samples and 640 iterations
Scalability in operation:
Using basic optimizations; recall that on 2 x 1Gbps links the new flow arrival rate is 450K/min
We achieve 800K/min with a single thread
Close to 7M/min with 10 threads, which scales up to 10Gbps links
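The logistic calibration and 1-vs-all combination described a few slides above can be made concrete with a small sketch. The code below is a generic Platt-style calibration, fitting P(C|x) = sigmoid(a·fc(x) + b) on an independent sample and then taking the class with the maximum calibrated posterior; it is an assumed stand-in, not the thesis implementation.

    # Rough sketch (not the authors' code): logistic calibration of per-class
    # AdaBoost scores f_c(x), followed by a 1-vs-all max-posterior decision.
    import numpy as np
    from scipy.optimize import minimize

    def fit_calibrator(scores, labels):
        # Fit P(C|x) = sigmoid(a * f_c(x) + b) on an independent calibration sample.
        def nll(params):
            a, b = params
            p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
            p = np.clip(p, 1e-9, 1 - 1e-9)
            return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))
        a, b = minimize(nll, x0=[1.0, 0.0]).x
        return lambda s: 1.0 / (1.0 + np.exp(-(a * s + b)))

    def predict_class(flow_scores, calibrators, classes):
        # flow_scores[c] is the raw boosting score for class c on this flow.
        posteriors = {c: calibrators[c](flow_scores[c]) for c in classes}
        return max(posteriors, key=posteriors.get)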
Evaluation on stability
Temporal stability:
After two months, the flow error rates are 3±0.5% for TCP traffic and 0.4±0.04% for UDP traffic
After one year, the TCP flow error rate is 5.48±0.54% and the UDP flow error rate is 1.22±0.2%
Spatial stability:
Train at one geolocation and test at another

Importance of the port number
Using our architecture, we obtain an error rate of 4.1% for TCP and 0.72% for UDP with only port features (3.13% for TCP and 0.35% for UDP after adding other flow features)
We use a port graph to visualize and understand the machine-learned port rules:
TCP Multimedia uses port 554 and ports 5000-5180
UDP Games uses port 88 (Xbox)
UDP Chat uses ports 6661-6670

Summary
We have designed a modular machine learning architecture for large-scale ISP/enterprise network traffic classification
The system scales up to 10Gbps links and remains accurate for one year on multiple sites without re-training
We have conducted the evaluation on a large operational ISP network
What if the port number and other flow-level statistics are unavailable, for example when classifying end-to-end IPSec traffic?

IPSec traffic
Limited traffic statistics are available for IPSec traffic, since the IPSec header encapsulates an encrypted inner packet:
No port number, protocol, payload, etc.
Only the number of packets, number of bytes, average packet inter-arrival time, and average packet size in both directions
Only 80% accuracy is achieved with the proposed machine learning architecture
Our solution: traffic classification based on traffic activity graphs

Visualizing and Inferring Network Applications using Traffic Activity Graphs (TAGs)

Traffic Activity Graphs
Nodes represent the hosts in the network
Edges represent the interaction between these hosts
Defined on a set of flows collected in a time period T
Help us study the communication patterns of different applications
[Figure: example TAGs (G_HTTP, G_Email, G_Gnutella) between UMN hosts and the Internet]

Application traffic activity graphs (TAGs) and their evolution
[Figure: HTTP, Email, AOL IM and DNS TAGs as they grow from 1K to 3K flows]

Properties of TAGs
We observe differences in basic statistics, such as graph density, average in/out degree, etc.
All TAGs contain a giant connected component (GCC), which accounts for more than 85% of all the edges
[Figure: GCC size as a fraction of all edges for each application TAG, ranging roughly from 85% to 100%]

Understanding the interaction patterns by decomposing TAGs
Block structures in the adjacency matrices indicate dense subgraphs in TAGs
[Figure: a small 0/1 adjacency matrix with two dense diagonal blocks, and the adjacency matrices of the HTTP, Email, AOL IM, BitTorrent and DNS TAGs]

TAG decomposition using tri-nonnegative matrix factorization
Extracting dense subgraphs can be formulated as a co-clustering problem, i.e., cluster hosts into inside host groups and outside host groups, then extract pairs of groups with more edges connecting them (higher density)
This co-clustering problem can be solved by the tri-nonnegative matrix factorization (tNMF) algorithm, which minimizes ||A − R H C||_F^2 subject to R, H, C ≥ 0, where:
A is the adjacency matrix associated with the TAG
R (m-by-k) is the row-group (inside host group) membership indicator matrix
C (r-by-n) is the column-group (outside host group) membership indicator matrix
H (k-by-r) is the group density matrix, whose entries are proportional to the subgraph densities
Hence R H C is a low-rank approximation of A, with rank at most min(k, r)
We identify dense subgraphs based on the large entries in H
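The tNMF optimization itself is only referenced in the backup slides; purely as an illustration of the factorization A ≈ R H C, the following NumPy sketch runs standard multiplicative updates for the Frobenius objective above, without any additional constraints the actual algorithm may impose.

    # Illustrative sketch of tri-nonnegative matrix factorization A ~= R H C
    # via multiplicative updates (no orthogonality constraints; k, r assumed).
    import numpy as np

    def tnmf(A, k, r, n_iter=200, eps=1e-9):
        m, n = A.shape
        rng = np.random.default_rng(0)
        R = rng.random((m, k)); H = rng.random((k, r)); C = rng.random((r, n))
        for _ in range(n_iter):
            HC = H @ C
            R *= (A @ HC.T) / (R @ HC @ HC.T + eps)
            H *= (R.T @ A @ C.T) / (R.T @ R @ H @ C @ C.T + eps)
            RH = R @ H
            C *= (RH.T @ A) / (RH.T @ RH @ C + eps)
        return R, H, C

    # A large entry H[i, j] marks a dense subgraph: the inside hosts with large
    # R[:, i] and the outside hosts with large C[j, :] form a dense bipartite block.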
Subgraph prototypes
Recall that inside (UMN) hosts are (likely) service requesters and outside hosts are service providers
Based on the number of inside/outside hosts in each subgraph, we propose three prototypes:
In-star: one inside client accesses multiple outside servers
Out-star: multiple inside clients access one outside server
Bi-mesh: multiple inside clients interact with many outside servers

Characterizing TAGs with subgraph prototypes
Different application TAGs (HTTP, Email, AOL IM, BitTorrent, DNS) contain different types of subgraphs
We can distinguish and characterize applications based on their subgraph components
But what do these subgraphs mean?

Interpreting HTTP bi-mesh structures
Most star structures are due to popular servers or active clients
We can explain more than 80% of the HTTP bi-meshes identified in one day:
Server-correlation driven:
Server farms: Lycos, Yahoo, Google
Correlated service providers – CDNs (LLNW, Akamai, SAVVIS, Level3), advertising providers (DoubleClick, etc.), redirection
User-interest driven:
News: WashingtonPost, New York Times, Cnet
Media: ImageShack, casalemedia, tl4s2
Online shopping: Ebay, Costco, Walmart
Social networks: Facebook, MySpace

How are the dense subgraphs connected?
[Figure: (A) randomly connected stars, (B) a tree in which hosts play a client/server dual role, (C) a pool within one AS, (D) a correlated pool across two ASes] (SIGMETRICS/Performance 2009)

Summary
We introduce the notion of traffic activity graphs (TAGs)
Different applications show different interaction patterns
We propose a tNMF-based graph decomposition method to help understand the formation of application TAGs
Can we classify different application classes based on TAGs?

Traffic Classification using Collective Traffic Statistics

Colored TAGs
Different applications are displayed as edges with different colors in TAGs
[Figure: the original TAG, and the same TAG with Web and FileSharing traffic removed]

Characterizing colored TAGs
Clustering effects: edges with the same color tend to cluster together
Attractive (A) / repulsive (R) effects between application classes
These collective traffic statistics summarize the spatial distribution of application classes in TAGs

Methodology
A two-step approach:
Input: the unclassified TAG and per-edge traffic statistics
Bootstrapping: initial edge classification based only on traffic statistics, producing an initially labeled TAG
Graph calibration: edge color calibration using only the colored neighborhood information in the initially labeled traffic graph
Output: the prediction on the traffic graph

Training the two-step model
The two-step model can be integrated easily into the existing traffic classification system
The bootstrapping step uses the traffic statistics associated with each edge to conduct an initial classification of the edges; the available traffic statistics depend on the specific application
The graph calibration step uses collective traffic statistics in the TAG to reinforce/correct the initial labels
Collective traffic statistics are encoded as histograms
[Figure: for an edge e_ij, the neighborhood color histograms h_i and h_j at its two endpoints, with bins over application classes such as Web, FileSharing, FTP and Mail]

Evaluation on network-level traffic classification
Packet header information (e.g., port number, TCP flags) is unavailable for bootstrapping, similar to the situation when classifying end-to-end IPSec traffic
How accurately can we classify such traffic when the flow-level classification system achieves only 80% accuracy due to the lack of traffic features?
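The two-step model is described above only at the block-diagram level; the toy Python sketch below (with a hypothetical class set, edge representation, and classifier objects) shows one way bootstrapping and a single round of neighborhood-histogram calibration could be wired together. It is an illustration, not the authors' implementation.

    # Toy sketch of the two-step approach: bootstrap edges from per-edge traffic
    # statistics, then recalibrate each edge using the label histograms of the
    # edges sharing its endpoints (collective traffic statistics).
    from collections import Counter, defaultdict

    CLASSES = ["Web", "FileSharing", "Mail", "Chat"]   # hypothetical class set

    def neighborhood_histogram(node, labels, edges_at):
        counts = Counter(labels[e] for e in edges_at[node])
        total = sum(counts.values()) or 1
        return [counts[c] / total for c in CLASSES]

    def two_step(edges, features, bootstrap_clf, calib_clf):
        # edges: list of (src, dst); features[e]: per-edge traffic statistics
        labels = {e: bootstrap_clf.predict(features[e]) for e in edges}     # step 1
        edges_at = defaultdict(list)
        for e in edges:
            edges_at[e[0]].append(e)
            edges_at[e[1]].append(e)
        calibrated = {}
        for e in edges:                                                     # step 2
            h_i = neighborhood_histogram(e[0], labels, edges_at)
            h_j = neighborhood_histogram(e[1], labels, edges_at)
            calibrated[e] = calib_clf.predict(h_i + h_j + list(features[e]))
        return calibrated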
Evaluation on accuracy
Our graph-based calibration reduces the error rate by 50%!
The classifier remains stable across time and geolocation
[Figure: edge accuracy of the bootstrap step vs. after graph calibration, measured at 2 days, 1 week, 1 month, 1 year, and after 1 month at a second site; calibration is consistently higher, with accuracies in the range of roughly 74%–92%]

Evaluation on per-class classification
The two-step method improves the accuracy for all traffic classes
The repulsive rules enable us to improve on the traffic classes with very poor initial labeling

Evaluation on real-time performance
We can implement the two-step approach as a real-time system with little additional cost

Evaluation on flow-level traffic classification
Here we have access to all packet header information
How much do the collective traffic statistics improve the overall accuracy of our system?

Evaluation on accuracy
We achieve a 15% reduction in errors within a month
The F1 scores are improved for most application classes

Summary
We introduced the concept of colored TAGs
We proposed a two-step model that utilizes the spatial distribution of application classes in TAGs (collective traffic statistics) to help improve classification accuracy
The collective traffic statistics help reduce 50% of the errors for classification at the network layer
15% of the errors can be reduced for flow-level traffic classification using graph calibration

Trouble-shooting in Large DSL Networks (work in progress)

Motivation
The current solution for trouble-shooting in DSL networks is reactive and inefficient
It can potentially lead to customer churn

Challenges
Millions of users
A large number of devices on each DSL line, which cannot be controlled remotely
Many possible locations where a line problem can occur

Methodology
Trouble Locator:
Measure the line condition between the DSL server and the cable modem for each customer
Use machine learning techniques to learn the correlation between different line problems and our line measurements
Ticket Predictor:
Maintain periodic measurements for each customer (every Saturday night)
Learn the correlation between the measurement history and potential line problems (a rough sketch of such a predictor follows after the time table)

Overview of the proactive solution
Proactively resolve line problems before the customer complains

Planned trial
Evaluate our method in an operational DSL network

Time table
Design of the DSL network trouble-shooting system (Feb. 2010 to Apr. 2010)
Implementation of the system and offline evaluation (May 2010 to Jul. 2010)
Trial in an operational DSL network (Aug. 2010 to Oct. 2010)
Thesis (Nov. 2010 to Jan. 2011)
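No concrete model is specified for the Ticket Predictor yet; purely as a hypothetical sketch of the stated idea (learning the correlation between a customer's measurement history and subsequent trouble tickets), the code below uses invented feature names and a generic scikit-learn classifier.

    # Hypothetical sketch of a ticket predictor: weekly line measurements per
    # customer form a feature vector; the label is whether a trouble ticket was
    # opened afterwards. Feature names and model choice are assumptions.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    def build_features(history):
        # history: list of weekly measurement dicts for one customer,
        # e.g. {"sync_rate": ..., "snr_margin": ..., "retrain_count": ...}
        keys = sorted(history[0])
        recent = np.array([[week[k] for k in keys] for week in history[-4:]])
        return np.concatenate([recent.mean(axis=0), recent.min(axis=0),
                               recent[-1] - recent[0]])   # level, worst case, trend

    def train_ticket_predictor(customer_histories, ticket_labels):
        X = np.array([build_features(h) for h in customer_histories])
        y = np.array(ticket_labels)   # 1 if a ticket followed, else 0
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
        model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
        print("held-out accuracy:", model.score(X_te, y_te))
        return model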
Publications before 2010
Aiyou Chen, Yu Jin, Jin Cao, Li (Erran) Li, "Tracking Long Duration Flows in Network Traffic", to appear in Proc. of the 29th IEEE International Conference on Computer Communications (INFOCOM 2010), mini-conference (acceptance ratio 24.3%).
Yu Jin, Esam Sharafuddin, Zhi-Li Zhang, "Unveiling Core Network-Wide Communication Patterns through Application Traffic Activity Graph Decomposition", in Proc. of the 2009 ACM International Conference on Measurement and Modeling of Computer Systems (ACM SIGMETRICS 2009) (acceptance ratio 14.9%).
Jin Cao, Yu Jin, Aiyou Chen, Tian Bu, Zhi-Li Zhang, "Identifying High Cardinality Internet Hosts", in Proc. of the 28th Conference on Computer Communications (IEEE INFOCOM 2009) (acceptance ratio 19.6%).
Yu Jin, Esam Sharafuddin, Zhi-Li Zhang, "Identifying Dynamic IP Address Blocks Serendipitously through Background Scanning Traffic", in Proc. of the 3rd International Conference on emerging Networking EXperiments and Technologies (CoNEXT 2007), New York, NY, December 10, 2007 (acceptance ratio 19.5%).
Yu Jin, Zhi-Li Zhang, Kuai Xu, Feng Cao, Sambit Sahu, "Identifying and Tracking Suspicious Activities through IP Gray Space Analysis", in Proc. of the 3rd Workshop on Mining Network Data (MineNet'07), San Diego, CA, June 12, 2007 (in conjunction with ACM SIGMETRICS'07).
Yu Jin, Gyorgy Simon, Kuai Xu, Zhi-Li Zhang, Vipin Kumar, "Gray's Anatomy: Dissecting Scanning Activities Using IP Gray Space Analysis", in Proc. of the Second Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML07), Boston, MA, April 10, 2007 (in conjunction with USENIX NSDI'07).

Thanks! Questions?

Backup slides

Characteristics of application TAGs
These statistics show differences between various application TAGs
They do not explain the formation of the TAGs

tNMF algorithm related
Iterative optimization algorithm
Derivation of the group density matrix

Backup for traffic classification
Default flow features
Comparison of different algorithms for multi-class classification
Training time for different machine learning algorithms
Selection of the flow-size threshold for partitioning

Boosting decision stumps
[Figure: a sequence of decision stumps t = 1..T, each testing a single feature ("tcpflag contains S", "dstport_low = 443", ..., "byte >= 64.5") and contributing a score S+ if the test holds or S- otherwise, e.g., S+ = -2.523 / S- = -1.066 for the first stump, S+ = 2.139 / S- = -0.226 for the second, and S+ = -0.202 / S- = 0.446 for the last]
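For completeness, here is a tiny sketch of how boosted decision stumps such as those in the backup figure score a flow. The three tests and score values come from the figure; the surrounding wrapper is illustrative only, and in the actual system these raw scores are further calibrated into posteriors P(C|x).

    # Sketch: a boosted-stumps score is the sum of per-stump scores
    # (S_plus if the test holds, S_minus otherwise).
    stumps = [
        (lambda f: "S" in f["tcpflag"],      -2.523, -1.066),   # t = 1
        (lambda f: f["dstport_low"] == 443,   2.139, -0.226),   # t = 2
        (lambda f: f["byte"] >= 64.5,        -0.202,  0.446),   # t = T
    ]

    def boosted_score(flow):
        return sum(s_plus if test(flow) else s_minus
                   for test, s_plus, s_minus in stumps)

    # Example: score one (made-up) flow record.
    flow = {"tcpflag": "SA", "dstport_low": 443, "byte": 40}
    print(boosted_score(flow))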