CISA
Continually Improving Stream Analysis
Nancy McMillan, Doug Mooney, Dave Burgoon
March 14, 2003

Agenda
Background and Overview
Architecture
Algorithms
Results

5/25/2017 2

MURALS: Multiple Use Real-time Analytics for Large Scale Data
Major information technology initiative
• Objective: Develop intellectual property addressing the challenges created by:
  – Data generation/collection at previously unimaginable rates
  – Growing expectation that real time decision-making is feasible and necessary for competitive advantage
  – Dramatic increase in the data to information ratio
  – Compelling need for balance between result precision and timeliness
Sponsored development of two technologies
• InfoRes: Addresses IT issues associated with real-time querying of very large relational databases
• CISA: Addresses IT issues associated with real-time analysis of high volume (varying arrival speed) stream data

Background: Our problem space
Many data sources supplying stream data
Stream data can be summarized by a set of features/summary statistics over some time window
Each data source needs to be continually classified or characterized
Classification/characterization of a single data source may depend on data from other data sources
Examples:
• Computers connecting to a firewall
• Sensor networks

Internet Security Example
Who is trying to inappropriately access a company’s network?
There are 19 firewalls recording connections in a log file
• Date/Time
• Source and Destination IP addresses
• Protocol
• Action (Accept/Drop/Decrypt/..)
• Service
• Rule
Inbound and outbound connections and warnings over a six day period in July 2002 were logged
• but connections from site to site VPNs are not
• only externally initiated connections are being analyzed
• more data (6 days in September) were provided later

The Problem: The faster data arrives, the more processing power required for real-time analysis.
Every data arrival initiates some tasks (store data, recalculate features, update decisions, etc.), which each require computational time
• Systems designed for gushing data waste resources when data trickles.
• Systems designed for slower data flow fail when data arrives too fast.
More sophisticated analysis techniques (better features, decision algorithms, etc.) require more computational time, but can provide better answers
• Analytics designed for gushing data don’t provide the best answer possible when data trickles.
• Analytics designed for slower data flow don’t provide timely answers when data arrives too fast.
To what data arrival rate should the system be designed?

The CISA Answer: A precision-speed trade-off
When the data arrives more slowly than the system design rate, the best possible answer is provided
• All data is considered.
• Best analysis techniques are used.
As the data flows faster than the system design rate, the accuracy and/or precision of the solution degrades smoothly.
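
The trade-off above can be illustrated with a small sketch. It is not the CISA implementation: the task names, priorities, costs, and the `run_tasks` helper are all hypothetical, and "budget" stands in for whatever compute time is available before the next data arrival. With a generous budget every task runs (best answer); with a tight budget only the most important tasks run (approximate answer).

```python
import heapq

def run_tasks(tasks, budget):
    """Run analysis tasks in priority order until the budget is spent.

    tasks  -- iterable of (priority, name, cost); a lower priority number
              means the task matters more to the current answer
    budget -- compute units available before the next data arrival
    Returns (completed, deferred) lists of task names.
    """
    heap = list(tasks)
    heapq.heapify(heap)
    completed, spent = [], 0
    # Stop at the first task that no longer fits: everything behind it
    # is less important, so the answer stays approximate but timely.
    while heap and spent + heap[0][2] <= budget:
        _, name, cost = heapq.heappop(heap)
        spent += cost
        completed.append(name)
    deferred = [name for _, name, _ in sorted(heap)]
    return completed, deferred

# Hypothetical tasks triggered by one data arrival (priority, name, cost).
TASKS = [
    (0, "store data", 1),
    (1, "quick features", 2),
    (2, "classify on quick features", 1),
    (3, "slow features", 5),
    (4, "reclassify on all features", 3),
]

slow_arrivals = run_tasks(TASKS, budget=20)  # data trickles: everything runs
fast_arrivals = run_tasks(TASKS, budget=4)   # data gushes: partial answer
```

Stopping at the first task that does not fit, rather than skipping ahead to cheaper low-priority work, is one possible design choice; it keeps the completed set a strict prefix of the priority order.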
System achieves precision-speed trade-off through:
• Architecture
  – Answer not based on all current data
  – Requires feedback from algorithm so most important data is considered
• Algorithms
  – Partial/approximate solutions provided

Architecture and Algorithm Overview
How CISA achieves precision-speed trade-off
Architecture
• Assign analysis tasks to asynchronously operating objects
  – storage, characterization, decision making, and visualization
• Prioritize analysis tasks associated with each new piece of data
  – Data likely to impact analysis is analyzed sooner
Algorithm
• Use incremental algorithms where possible
  – Update previous answer with new data rather than re-analyze all data
• Stop or modify iterative or multi-step algorithms before completion when new data arrivals need to enter algorithm
  – Partial/approximate solutions provided

Agenda
Background and Overview
Architecture
Algorithms
Results

CISA Architectural Components Diagram
[Diagram. Components: Source 1, Source 2, ...; Source Data Objects; Data Management Object; Database; Algorithm Objects; Visualization/Monitor; PRIORITIZE stages. Legend: Raw Data; Summary Statistic/Feature; Algorithm; Direct Connection.]

Internet Security Example Architecture Diagram
[Diagram. Java objects communicating via JMS. Components: Firewall 1, Firewall 2, ...; Data Publishers; Source Data Objects (Listener/Publisher) for Source 1, Source 2, ...; Database Topic; Feature/State Topics and Feature Topics per source; Database Requesters; Data Management Object (Access database); Decision Maker Algorithm Object (Listener/Publisher); Decision Made Topic; Decision Update Topic; Visualization/State Reporter Listener; SAS Analytics; PRIORITIZE stages. Legend: Log Data Message; Feature calculation Message; State Update Message; Direct Connection.]

Advantages / Issues
Related to rapid prototyping decisions
JMS
Advantages
• Asynchronous
• Prioritized Lists
• Open Source / Off-the-shelf
• Platform Independent
Issues
• Slow
  – system resources, “thrashing”, db, (network speeds)
• JMS Implementations vary slightly
Access
Advantages
• Easy communication with Java
• Easily and quickly developed
  – data storage and
  – feature calculation
Issues
• Slow
• Not available on many platforms

Agenda
Background and Overview
Architecture
Algorithms
Results

Candidate CISA Algorithms
A very broad group of statistical methods…
Feature characteristics
• Relies on more than one feature
• Some of the individual features take time to compute or measure
• Meaningful nested "subalgorithms" can be built on increasing sets of features
Data source characteristics
• The algorithm can efficiently update its current solution when feature values for only a small group of source objects change
• There is a natural method for prioritizing objects

Construction Methodologies
General
Feature Priority
• Order features (statically)
• Create series of nested models that use an increasing number of features
• Develop a function to assign priorities based on feature order and current object classification
Data Source Priority
• Order data sources (dynamically)
• Assign priorities based on uncertainty of classification or cost of misclassification
• Incremental algorithms are usually essential
Combinations of Both

Construction Methodologies
Examples
Feature Priority: Decompose an algorithm into subalgorithms that use subsets of features. Prioritize feature computation.
• Example: Decision tree using X1, X2, …, Xn
• Prioritize order of Xi computation based on tree structure
• Use pruned trees to classify: {X1}, {X1, X2}, {X1, X2, X3}, …, {X1, X2, …, Xn}
Data Source Priority:
• Example: Cluster analysis (all features needed)
• Objects with incomplete feature sets get higher priority
• Objects with more uncertain classifications get higher priority

Feature Priority Construction
Decision tree example
[Figure: decision tree with split nodes X1<0.00134771, X2<0.16844, X3<0.248832, X4<0.148293, X5<34.5, X6<0.722813 and leaves labeled G or B.]

Agenda
Background and Overview
Architecture
Algorithms
Results

Internet Security Example
Who is trying to inappropriately access the company’s network?
There are 19 firewalls recording connections in a log file
• Date/Time
• Source and Destination IP addresses
• Protocol
• Action (Accept/Drop/Decrypt/..)
• Service
• Rule
Inbound and outbound connections and warnings over a six day period in July 2002 were logged
• but connections from site to site VPNs are not
• only externally initiated connections are being analyzed
• more data (6 days in September) were provided later

External Network Connectors
Summary statistics/features
Quickly calculated features
• % Drop
• % Accept
• Hits/Sec
• # Hits
More time consuming features
• # Different Services
• Different Services/Hit
• # Different IPs
• Different IPs/Hit

Dates: 7/21/02 - 7/27/02
[Figure: annotated groups of external connectors]
• Slow Port and IP Scans (N=3): High Services, High Number of IPs, High Number of Hits, Low Hits/Sec, Large Drop %
• Suspicious (N=4636): Large Drop %, Medium IP/Hit, Low everything else
• Fast IP Address Scans (N=10): Low Services, High Number of Hits, High IP/Hit, High Number of Hits/Sec, Large Drop %, Mostly Foreign, Represent 40% of External Connections
• Normal (N=7828): High Accept %
• Suspicious-Too Early to Tell (N=8055): Few Hits, Large Drop %
• Port Scans (N=36): Large Drop %, High IP/Hit, High Services

External Network Connectors
Classifications

Class                      Sources   Connections   Percentage
Port Scans                      36       218,658       14.40%
Mostly Foreign IP Sweeps        10       602,438       39.68%
Port and IP Sweeps               3         9,165        0.60%
Normal                       7,828       205,990       13.57%
Suspicious                   4,636       455,687       30.02%
Few Connections              8,055        26,163        1.72%

70%-80% of IPs stay in same group from day to day.
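
The summary statistics listed for external network connectors could be computed per source IP along the following lines. This is a sketch only: the record fields (`action`, `service`, `dest_ip`), the `summarize` helper, and the fixed `window_seconds` parameter are assumptions, since the slides do not give the exact feature definitions or the log schema.

```python
def summarize(hits, window_seconds):
    """Compute per-source features over one time window.

    hits           -- list of dicts with hypothetical keys
                      'action', 'service', 'dest_ip'
    window_seconds -- length of the time window being summarized
    """
    n = len(hits)
    if n == 0 or window_seconds <= 0:
        raise ValueError("need at least one hit and a positive window")
    drops = sum(1 for h in hits if h["action"] == "Drop")
    accepts = sum(1 for h in hits if h["action"] == "Accept")
    # The distinct counts need a pass that remembers every value seen,
    # which is why they fall in the "more time consuming" group.
    services = {h["service"] for h in hits}
    dest_ips = {h["dest_ip"] for h in hits}
    return {
        "hits": n,
        "pct_drop": 100.0 * drops / n,
        "pct_accept": 100.0 * accepts / n,
        "hits_per_sec": n / window_seconds,
        "distinct_services": len(services),
        "services_per_hit": len(services) / n,
        "distinct_ips": len(dest_ips),
        "ips_per_hit": len(dest_ips) / n,
    }

# Made-up example records for a single source.
example_hits = [
    {"action": "Drop", "service": 80, "dest_ip": "10.0.0.1"},
    {"action": "Drop", "service": 80, "dest_ip": "10.0.0.2"},
    {"action": "Drop", "service": 22, "dest_ip": "10.0.0.3"},
    {"action": "Accept", "service": 80, "dest_ip": "10.0.0.1"},
]
features = summarize(example_hits, window_seconds=2)
```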
External Network Connectors
Rule-based, feature priority classification algorithm

Level   Classification                                                                              Features Added
0       Too Early to Tell
1       Normal, Too Early to Tell, Suspicious                                                       Drop %
2       Normal, IP Scan Only, Port Scan Only, Both IP and Port Scan, Unknown, Too Early to Tell     Ratio Measures
3       Normal, IP Scan Only, Port Scan Only, Both IP and Port Scan, Unknown, Too Early to Tell     Distinct Services
4       Normal, IP Scan Only, Port Scan Only, Both IP and Port Scan, Unknown, Too Early to Tell     Distinct IP Addresses

[Arrow label: Priority]

Precision-Speed Trade-off
Expected results
[Figure: percentage (0 to 100%) versus connections per second. Series: Correctly classified same level algorithm; Correctly classified different level algorithm; Consistently classified; Inconsistently classified.]

Precision-Speed Trade-off
Observed results
[Figure: percentage (0.0% to 100.0%) versus connections per second. Series: Correctly classified same level algorithm; Correctly classified different level algorithm; Consistently classified; Inconsistently classified.]

External Network Connectors
Dynamic, data source priority algorithm
Traditional cluster analysis (e.g., K-means) is time consuming on large datasets
Incremental clustering algorithm required for reasonable performance
Our approach:
• After first cluster analysis, use centroid locations to seed the next analysis
• Used the SAS procedure FASTCLUS for proof-of-concept purposes

Dates: 8/11/02 - 8/17/02
Outlier
Outlier: n=1 (0.32% of connections)
• Extremely high services
• China

Dates: 8/11/02 - 8/17/02
[Figure: cluster plot. Legend: Cluster 0, Cluster 1, Cluster 2, Cluster 3, Cluster 4, Cluster 5.]
Cluster 0: n = 5207 (10.11% of connections)
• High Accept %
• Mix Max Hits
• Mix IP/Hit
Cluster 1: n = 2561 (17.16% of connections)
• High Drop %
• Medium IP/Hit
Cluster 2: n = 7 (50.35% of connections)
• High Drop %
• High Num Hits
• High Num IPs
• High Max Hits/Sec
Cluster 3: n = 180 (17.81% of connections)
• High Services and/or Max Hits/Sec
• Mixed
Cluster 4: n = 4 (1.42% of connections)
• High Drop %
• High Services
• 94.5% of connections from Korea
• 1 of 4 IPs from Korea
• Average 23 sec between hits
Cluster 5: n = 5104 (2.82% of connections)
• High IP/Hit
• High Drop %

External Network Connector Classifications
Dashboard report
[Figure: dashboard. Labels: Drop %, Service/Hit, IPS/Hit, Max Hit/Sec, IPs Scanned, Services Scanned, % of Sources, % Connections.]

External Network Connector Classifications
Outlier report
[Figure: outlier report over 40 minutes. Src: 211.96.31.129, Country: CHINA, Org ID: SCH-CHENGDU-HUITEC. Panels plotted by cluster; labels: Drop %, Service/Hit, IPS/Hit, Max Hit/Sec, IPs Scanned, Services Scanned.]
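
The seeding idea from the dynamic, data source priority slide (use the centroid locations from one cluster analysis to seed the next) can be sketched with a plain Lloyd-style k-means. This is an illustration under assumptions, not the FASTCLUS procedure used in the proof of concept: the `kmeans` helper, the toy points, and the pass counting are all hypothetical.

```python
import math

def kmeans(points, seeds, max_iter=10):
    """Lloyd-style k-means started from the given seed centroids.

    Returns (centroids, labels, passes), where passes is the number of
    centroid updates performed before the assignments stopped changing.
    """
    centroids = [tuple(s) for s in seeds]
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assign every point to its nearest current centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            return centroids, labels, iteration - 1
        labels = new_labels
        # Move each centroid to the mean of its members (skip empty clusters).
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels, max_iter

# First analysis: toy feature vectors and arbitrary starting seeds.
day1 = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
c1, labels1, passes1 = kmeans(day1, seeds=[(0.0, 0.0), (10.0, 10.0)])

# Next analysis: new data arrives; reuse the previous centroids as seeds.
day2 = day1 + [(1.0, 0.0), (9.0, 10.0)]
c2, labels2, passes2 = kmeans(day2, seeds=c1)
```

Because the previous centroids are usually close to the new solution, the seeded run tends to need few passes, which is the property the slides rely on for reasonable performance.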