CISA
Continually Improving Stream Analysis
Nancy McMillan
Doug Mooney
Dave Burgoon
March 14, 2003
Agenda
• Background and Overview
• Architecture
• Algorithms
• Results
5/25/2017
MURALS:
Multiple Use Real-time Analytics for Large Scale Data

Major information technology initiative
• Objective: Develop intellectual property addressing the challenges created by:
– Data generation/collection at previously unimaginable rates
– Growing expectation that real-time decision-making is feasible and necessary for competitive advantage
– Dramatic increase in the data to information ratio
– Compelling need for balance between result precision and timeliness

Sponsored development of two technologies
• InfoRes: Addresses IT issues associated with real-time querying of very large relational databases
• CISA: Addresses IT issues associated with real-time analysis of high volume (varying arrival speed) stream data
Background:
Our problem space

Many data sources supplying stream data

Stream data can be summarized by a set of features/summary statistics over some time window

Each data source needs to be continually classified or characterized

Classification/characterization of a single data source may depend on data from other data sources

Examples:
• Computers connecting to a firewall
• Sensor networks
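
The idea of summarizing a stream over a time window can be sketched as follows. This is only an illustration: the class name `WindowedFeatures`, the window length, and the three summary statistics are assumptions made for the example, not part of CISA.

```python
from collections import deque


class WindowedFeatures:
    """Keeps events from the last `window_seconds` and summarizes them."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        """Record a new event and drop events that have left the window."""
        self.events.append((timestamp, value))
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def summary(self):
        """Return count, mean value, and arrival rate over the window."""
        count = len(self.events)
        if count == 0:
            return {"count": 0, "mean": None, "rate": 0.0}
        total = sum(value for _, value in self.events)
        return {
            "count": count,
            "mean": total / count,
            "rate": count / self.window_seconds,
        }
```

One such object per data source (for example, kept in a dictionary keyed by source) would give each source its own continually updated feature set.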
Internet Security Example
Who is trying to inappropriately access a company’s network?


There are 19 firewalls recording connections in a log file
• Date/Time
• Source and Destination IP addresses
• Protocol
• Action (Accept/Drop/Decrypt/..)
• Service
• Rule
Inbound and outbound connections and warnings over a six day period
in July 2002 were logged
• but connections from site to site VPNs are not
• only externally initiated connections are being analyzed
• more data (6 days in September) were provided later
The Problem: The faster data arrives, the more processing power is required for real-time analysis.

Every data arrival initiates some tasks (store data, recalculate features, update decisions, etc.), each of which requires computational time
• Systems designed for gushing data waste resources when data trickles.
• Systems designed for slower data flow fail when data arrives too fast.

More sophisticated analysis techniques (better features, decision algorithms, etc.) require more computational time, but can provide better answers
• Analytics designed for gushing data don't provide the best answer possible when data trickles.
• Analytics designed for slower data flow don't provide timely answers when data arrives too fast.

To what data arrival rate should the system be designed?
The CISA Answer:
A precision-speed trade-off



When the data arrives more slowly than the system design rate, the
best possible answer is provided
• All data is considered.
• Best analysis techniques are used.
As the data flows faster than the system design rate the accuracy
and/or precision of the solution degrades smoothly.
System achieves precision-speed trade-off through:
• Architecture
– Answer not based on all current data
– Requires feedback from algorithm so most important data is considered
• Algorithms
– Partial/approximate solutions provided
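
A minimal sketch of how prioritization can produce a smooth precision-speed trade-off: tasks are processed in priority order until a processing budget is spent, and whatever does not fit waits. The class `PrioritizedTaskQueue`, the abstract "work unit" budget, and the (cost, callable) task shape are assumptions for illustration only.

```python
import heapq
import itertools


class PrioritizedTaskQueue:
    """Holds pending analysis tasks; the most important tasks come out first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, priority, task):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), task))

    def __len__(self):
        return len(self._heap)

    def run(self, budget):
        """Run tasks in priority order until `budget` work units are spent.

        Each task is a (cost, callable) pair. Processing stops at the first
        task that does not fit in the remaining budget; it and all lower
        priority tasks stay queued for the next call.
        """
        results = []
        while self._heap:
            cost, func = self._heap[0][2]
            if cost > budget:
                break
            heapq.heappop(self._heap)
            budget -= cost
            results.append(func())
        return results
```

When data trickles, the budget covers every queued task and all data is considered. When data gushes, only the highest-priority tasks run, so the answer degrades gradually instead of the system falling behind.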
Architecture and Algorithm Overview
How CISA achieves precision-speed trade-off

Architecture
• Assign analysis tasks to asynchronously operating objects
– storage, characterization, decision-making, and visualization
• Prioritize analysis tasks associated with each new piece of data
– Data likely to impact analysis is analyzed sooner

Algorithm
• Use incremental algorithms where possible
– Update previous answer with new data rather than re-analyze all data
• Stop or modify iterative or multi-step algorithms before completion when new data arrivals need to enter algorithm
– Partial/approximate solutions provided
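
As a small illustration of an incremental algorithm, the sketch below updates a mean and variance with each new value instead of re-scanning all data. It uses Welford's method, chosen here only as a familiar example; the slides do not say which incremental statistics CISA uses.

```python
class RunningStats:
    """Incrementally updated mean and variance (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new value into the previous answer in constant time."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; None when fewer than two values have been seen."""
        if self.n < 2:
            return None
        return self._m2 / (self.n - 1)
```

Each update costs the same regardless of how much data has already arrived, which is what makes this style of computation suitable for streams.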
CISA Architectural Components
[Diagram. Components: Source Data Objects (Source 1, Source 2, ...); Data Management Object (Database, PRIORITIZE); Algorithm Objects (Algorithm, PRIORITIZE); Visualization/Monitor. Connection types in the legend: Raw Data, Summary Statistic/Feature, Algorithm, Direct Connection.]
Internet Security Example Architecture
Diagram
[Diagram. Inputs: Firewall 1, Firewall 2, ... Components: Source Data Objects (a Firewall Data Publisher per source); Data Management Object (Database Requester, Listener/Publisher, PRIORITIZE); Algorithm Object (Decision Maker, Listener/Publisher, PRIORITIZE); Visualization/State Reporter (Listener). Topics: Database Topic, Source 1 and Source 2 Feature Topics, Source 1 and Source 2 Feature/State Topics, Decision Made Topic, Decision Update Topic. Legend: Log Data Message, Feature calculation Message, State Update Message, Direct Connection. Technology labels: Java, JMS object communication, Access database, SAS Analytics.]
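
The topic-based message flow in the diagram (log data, then feature calculation, then state update) can be sketched with a tiny in-process publish/subscribe bus. This is not JMS and not the prototype's code: `TopicBus`, the topic names, the record keys, and the 50% drop threshold are all hypothetical, and delivery here is synchronous for brevity, whereas the CISA objects operate asynchronously.

```python
from collections import defaultdict


class TopicBus:
    """Tiny in-process stand-in for topic-based messaging (not JMS)."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, topic, listener):
        self._listeners[topic].append(listener)

    def publish(self, topic, message):
        for listener in list(self._listeners[topic]):
            listener(message)


def wire_pipeline(bus, decisions):
    """Connect data management, decision making, and state reporting."""
    hits = defaultdict(int)
    drops = defaultdict(int)

    def on_log_data(record):
        # Data management: store the record and recalculate a feature.
        src = record["src"]
        hits[src] += 1
        if record["action"] == "Drop":
            drops[src] += 1
        drop_pct = 100.0 * drops[src] / hits[src]
        bus.publish("feature", {"src": src, "drop_pct": drop_pct})

    def on_feature(feature):
        # Decision maker: hypothetical 50% threshold, for illustration only.
        state = "Suspicious" if feature["drop_pct"] > 50.0 else "Normal"
        bus.publish("decision", {"src": feature["src"], "state": state})

    def on_decision(decision):
        # State reporter: keep the latest state for each source.
        decisions[decision["src"]] = decision["state"]

    bus.subscribe("log_data", on_log_data)
    bus.subscribe("feature", on_feature)
    bus.subscribe("decision", on_decision)
```

Publishing a record such as `{"src": "192.0.2.1", "action": "Drop"}` on the `"log_data"` topic triggers the whole chain and leaves the latest state for that source in `decisions`.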
Advantages / Issues
Related to rapid prototyping decisions
JMS
Advantages
• Asynchronous
• Prioritized Lists
• Open Source / Off-the-shelf
• Platform Independent
Issues
• Slow – system resources, "thrashing", db, (network speeds)
• JMS Implementations vary slightly

Access
Advantages
• Easy communication with Java
• Easily and quickly developed
– data storage and
– feature calculation
Issues
• Slow
• Not available on many platforms
Candidate CISA Algorithms
A very broad group of statistical methods…

Feature characteristics
• Relies on more than one feature
• Some of the individual features take time to compute or measure
• Meaningful nested "subalgorithms" can be built on increasing sets of features

Data source characteristics
• The algorithm can efficiently update its current solution when feature values for only a small group of source objects change
• There is a natural method for prioritizing objects
Construction Methodologies
General

Feature Priority
• Order features (statically)
• Create series of nested models that use an increasing number of features
• Develop a function to assign priorities based on feature order and current object
classification


Data Source Priority
• Order data sources (dynamically)
• Assign priorities based on uncertainty of classification or cost of misclassification
• Incremental algorithms are usually essential
Combinations of Both
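
One way to turn "uncertainty of classification or cost of misclassification" into a dynamic ordering of data sources is sketched below. Using Shannon entropy as the uncertainty measure and multiplying it by the cost are assumptions made for this example; the slides do not specify the priority function.

```python
import math


def classification_entropy(probabilities):
    """Shannon entropy (in bits) of a source's class probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)


def source_priority(probabilities, misclassification_cost=1.0):
    """Higher when the classification is uncertain or mistakes are costly."""
    return misclassification_cost * classification_entropy(probabilities)


def rank_sources(sources):
    """Order source ids from highest to lowest priority.

    `sources` maps a source id to a pair:
    (class probabilities, misclassification cost).
    """
    return sorted(
        sources,
        key=lambda sid: source_priority(*sources[sid]),
        reverse=True,
    )
```

Because the probabilities change as new data arrives, re-ranking after each update gives the dynamic ordering that data source priority calls for.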
Construction Methodologies
Examples

Feature Priority: Decompose an algorithm into subalgorithms
that use subsets of features. Prioritize feature computation.
• Example: Decision tree using X1,X2,… , Xn
• Prioritize order of Xi computation based on tree structure
• Use pruned trees to classify:
{X1}, {X1,X2}, {X1, X2, X3}, …, {X1, X2, …, Xn}

Data Source Priority:
• Example: Cluster analysis—All features needed
• Objects with incomplete feature sets get higher priority
• Objects with more uncertain classifications get higher priority
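
The feature priority idea, classifying with whichever nested model matches the features computed so far, might look like the sketch below. The function name, the two-feature ordering, and the thresholds in `NESTED_RULES` are hypothetical; only the G/B labels echo the decision tree example.

```python
def classify_with_available(features, nested_rules, feature_order):
    """Classify using the deepest nested model whose features are all present.

    `feature_order` is the static ordering X1, X2, ..., Xn.
    `nested_rules[k]` classifies from the first k features in that order;
    `nested_rules[0]` takes no features and returns a default answer.
    Returns (level used, classification).
    """
    level = 0
    for name in feature_order:
        if name not in features:
            break
        level += 1
    level = min(level, len(nested_rules) - 1)
    args = [features[name] for name in feature_order[:level]]
    return level, nested_rules[level](*args)


# Hypothetical nested models standing in for pruned trees on {X1}, {X1, X2}.
FEATURE_ORDER = ["x1", "x2"]
NESTED_RULES = [
    lambda: "too early to tell",
    lambda x1: "B" if x1 >= 0.5 else "G",
    lambda x1, x2: "B" if x1 >= 0.5 and x2 >= 0.2 else "G",
]
```

A source with only `x1` computed is classified by the level 1 model right away, and its answer is refined once `x2` becomes available.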
Feature Priority Construction
Decision tree example
[Figure: decision tree. Split nodes: X1<0.00134771, X2<0.16844, X3<0.248832, X4<0.148293, X5<34.5, X6<0.722813. Leaves are labeled G or B.]
External Network Connectors
Summary statistics/features

Quickly calculated features
• % Drop
• % Accept
• Hits/Sec
• # Hits

More time consuming features
• # Different Services
• Different Services/Hit
• # Different IPs
• Different IPs/Hit
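
A rough sketch of computing these summary statistics for one external source follows. The record keys (`time`, `action`, `dest_ip`, `service`) are hypothetical, and reading "Different IPs" as distinct destination addresses is an assumption.

```python
def connector_features(records):
    """Summarize one external source's firewall log records.

    Each record is a dict with hypothetical keys: "time" (seconds),
    "action", "dest_ip", and "service". Returns None for no records.
    """
    hits = len(records)
    if hits == 0:
        return None
    drops = sum(1 for r in records if r["action"] == "Drop")
    accepts = sum(1 for r in records if r["action"] == "Accept")
    times = [r["time"] for r in records]
    duration = max(times) - min(times)
    services = {r["service"] for r in records}
    dest_ips = {r["dest_ip"] for r in records}
    return {
        # Quickly calculated: simple counters.
        "hits": hits,
        "pct_drop": 100.0 * drops / hits,
        "pct_accept": 100.0 * accepts / hits,
        "hits_per_sec": hits / duration if duration > 0 else None,
        # More time consuming: sets of distinct values must be tracked.
        "n_services": len(services),
        "services_per_hit": len(services) / hits,
        "n_ips": len(dest_ips),
        "ips_per_hit": len(dest_ips) / hits,
    }
```

The counters can be updated in constant time per connection, while the distinct-value features require keeping and searching a growing set for each source, which is one reason they are the slower group.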
Dates: 7/21/02 - 7/27/02
• Slow Port and IP Scans (N=3): High Services, High Number of IPs, High Number of Hits, Low Hits/Sec, Large Drop %
• Suspicious (N=4636): Large Drop %, Medium IP/Hit, Low everything else
• Fast IP Address Scans (N=10): Low Services, High Number of Hits, High IP/Hit, High Number of Hits/Sec, Large Drop %, Mostly Foreign, Represent 40% of External Connections
• Normal (N=7828): High Accept %
• Suspicious-Too Early to Tell (N=8055): High IP/Hit, Few Hits, Large Drop %
• Port Scans (N=36): High Services, Large Drop %
External Network Connectors
Classifications
Class                      Sources   Connections   Percentage
Port Scans                      36       218,658       14.40%
Mostly Foreign IP Sweeps        10       602,438       39.68%
Port and IP Sweeps               3         9,165        0.60%
Normal                       7,828       205,990       13.57%
Suspicious                   4,636       455,687       30.02%
Few Connections              8,055        26,163        1.72%
70%-80% of IPs stay in same group from day to day.
External Network Connectors
Rule-based, feature priority classification algorithm
Level   Features Added          Classification
0       none                    Too Early to Tell
1       Drop %                  Normal, Too Early to Tell, Suspicious
2       Ratio Measures          Normal, IP Scan Only, Port Scan Only, Both IP and Port Scan, Unknown, Too Early to Tell
3       Distinct Services       Normal, IP Scan Only, Port Scan Only, Both IP and Port Scan, Unknown, Too Early to Tell
4       Distinct IP Addresses   Normal, IP Scan Only, Port Scan Only, Both IP and Port Scan, Unknown, Too Early to Tell
Features are listed in priority order.
Precision-Speed Trade-off
Expected results
[Chart: % (0 to 100) versus connections per second. Series: Correctly classified same level algorithm; Correctly classified different level algorithm; Consistently classified; Inconsistently classified.]
Precision-Speed Trade-off
Observed results
[Chart: % (0.0% to 100.0%) versus connections per second. Series: Correctly classified same level algorithm; Correctly classified different level algorithm; Consistently classified; Inconsistently classified.]
External Network Connectors
Dynamic, data source priority algorithm
Traditional cluster analysis (e.g., K-means) is time consuming on large datasets

Incremental clustering algorithm required for reasonable performance

Our approach:
• After first cluster analysis, use centroid locations to seed the next analysis
• Used the SAS procedure FASTCLUS for proof-of-concept purposes
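
The seeding approach can be illustrated with a plain k-means loop that starts from the previous centroids. This is a generic sketch, not the FASTCLUS procedure used in the proof of concept; the function names and the `max_iter` cap are assumptions.

```python
def assign(points, centroids):
    """Index of the nearest centroid (squared Euclidean) for each point."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centroids):
    """Mean of each cluster; an empty cluster keeps its previous centroid."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(tuple(old))
            continue
        dim = len(old)
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        )
    return new_centroids


def seeded_kmeans(points, seed_centroids, max_iter=10):
    """K-means started from `seed_centroids` (e.g. the previous solution).

    Stops when the centroids stop moving or after `max_iter` passes, so a
    small `max_iter` yields a partial solution. Returns (centroids, labels).
    """
    centroids = [tuple(c) for c in seed_centroids]
    for _ in range(max_iter):
        moved = update(points, assign(points, centroids), centroids)
        if moved == centroids:
            break
        centroids = moved
    return centroids, assign(points, centroids)
```

When only a few sources have changed since the last run, the seeded centroids are already close to the new solution, so few passes are needed. Capping `max_iter` also allows the analysis to stop early with an approximate answer when new data is waiting.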
Dates: 8/11/02 - 8/17/02
Outlier: n=1 (0.32% of connections)
• Extremely high services
• China
Dates: 8/11/02 - 8/17/02
Cluster 0: n = 5207 (10.11% of connections)
• High Accept %
• Mix Max Hits
• Mix IP/Hit
Cluster 1: n = 2561 (17.16% of connections)
• High Drop %
• Medium IP/Hit
Cluster 2: n = 7 (50.35% of connections)
• High Drop %
• High Num Hits
• High Num IPs
• High Max Hits/Sec
Cluster 3: n = 180 (17.81% of connections)
• High Services and/or Max Hits/Sec
• Mixed
Cluster 4: n = 4 (1.42% of connections)
• High Drop %
• High Services
• 94.5% of connections from Korea
• 1 of 4 IPs from Korea
• Average 23 sec between hits
Cluster 5: n = 5104 (2.82% of connections)
• High IP/Hit
• High Drop %
External Network Connector Classifications
Dashboard report
[Figure: dashboard. Features shown: Drop %, Service/Hit, IPS/Hit, Max Hit/Sec, IPs Scanned, Services Scanned. Summary measures: % of Sources, % Connections.]
External Network Connector Classifications
Outlier report
[Figure: outlier report covering 40 minutes. Src: 211.96.31.129; Country: CHINA; Org ID: SCH-CHENGDU-HUITEC. Plot elements: a cluster legend and the features Drop %, Service/Hit, IPS/Hit, Max Hit/Sec, IPs Scanned, Services Scanned.]