Download A Labeled Data Set For Flow-based Intrusion Detection

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Deep packet inspection wikipedia , lookup

Network tap wikipedia , lookup

Cracking of wireless networks wikipedia , lookup

IEEE 1355 wikipedia , lookup

UniPro protocol stack wikipedia , lookup

Transcript
A Labeled Data Set
For Flow-based Intrusion Detection
Anna Sperotto, Ramin Sadre,
Frank van Vliet, Aiko Pras
Design and Analysis of Communication Systems
University of Twente, The Netherlands
NMRG Workshop on Netflow/IPFIX Usage in Network Management
Maastricht - July 30, 2010
Contents
•
Operational experience in trace collections
•
•
•
Experimental Setup
Data processing and labeling
The labeled data set
Introduction
trace 1
trace 2
•
•
•
system 1
30%
attacks!
system 2
85%
attacks!
Systems are evaluated on proprietary traces
No shared ground truth
Results cannot be directly compared!
Data set requirements
We want the data set to be:
•
•
•
•
realistic data
complete and correct in labeling
achievable in an acceptable labeling time
sufficient trace size
The requirements will determine the collection setup
Measurement scale
•
•
•
NETWORK
realistic
not complete
it does not
scale
Measurement scale
•
•
•
NETWORK
realistic
not complete
it does not
scale
•
•
SUBNETWORK
realistic
not complete
Measurement scale
•
•
•
NETWORK
realistic
not complete
it does not
scale
•
•
SUBNETWORK
realistic
not complete
•
•
SINGLE HOST
realistic
enhanced logging
(honeypot)
Setup
HONEYPOT
ssh, http, ftp
ssh session transcript
XEN SERVER
tcpdump
•
•
•
•
daily used services with enhanced logging
direct connection to the Internet
attack exposure
complete tcpdump of the traffic (offline flow creation)
Data set creation
TRAFFIC
DUMP
LOGS
FLOWS
EVENTS
TYPESCRIPTS
ALERT
GENERATION/
CORRELATION
CLUSTERING
& CAUSALITY
LABELLED
DATASET
Preprocessing
• packets  flows
F = (Isrc , Idst , Psrc , Pdst , P ckts, Octs, Tstart , Tend , F lags, P rot)
• logs  log events
L = (T, Isrc , Psrc , Idst , Pdst , Descr, Auto, Succ, Corr)
Data set creation
TRAFFIC
DUMP
LOGS
FLOWS
EVENTS
TYPESCRIPTS
ALERT
GENERATION/
CORRELATION
CLUSTERING
& CAUSALITY
• The correlation process will results in alerts
A = (T, Descr, Auto, Succ, Serv, T ype)
LABELLED
DATASET
Correlation procedure
F1
HP
F2
LOGS
CORRELATE
ALERT A
(F1, A)
(F2, A)
Cluster and Causality
Alert
Cluster alert
cluster
relation
causality relation
Flow
Alert
Alert
Alert
Alert
Alert
Flow
Flow
Flow
Flow
Flow
•
Hierarchic view of the alerts to enrich the data set with
extra information on the traffic
•
Group simple alerts into cluster alerts
high level view of malicious activities
•
Implementation
Packets to flows
AUTOMATIC
Logs to log events
SEMI-AUTOMATIC
MANUAL
Alert correlation
SEMI-AUTOMATIC
Cluster and
causality
MANUAL
• softflowd
• shell scripts
• discriminate between manual/
automated attacks
• correlation procedure
• extensible for other attacks
• analysis of typescripts
The Dataset
dump file
flows
alerts
•
24 GB
14M
7.6M
Flow breakdown
13942629
SSH
18038
ICMP
FTP
13
9798
HTTP
14151511
TCP
UDP
1.0E+00
7383
IRC
583
1.0E+02
191339
AUTH/IDENT
18970
OTHERS
1.0E+04
number of flows
1.0E+06
1.0E+08
1.0E+00
1.0E+02
1.0E+04
number of flows
1.0E+06
1.0E+08
The Dataset
dump file
flows
alerts
•
24 GB
14M
7.6M
Alert breakdown
8756
SSH IN
FTP
6
5317
HTTP
35
SSH OUT
95664
AUTH/IDENT IN
AUTH/IDENT OUT
10
SSH IN
7591869
SSH OUT
6
IRC OUT
ICMP IN
1.0E+00
1.0E+02
1.0E+04
number of alerts
4
HTTP
3692
16382
1.0E+06
1.0E+08
0
10
20
number of alerts
30
40
The Dataset
•
•
We labeled: 98,5% flows and 99,99% alerts
Mainly malicious traffic:
•
•
•
ssh brute force attacks
automated http connections
Small percentage of side-effect traffic
•
•
auth/ident on port 113
IRC traffic
Conclusions
•
We presented the first labeled data set for flow-based
intrusion detection
•
•
•
•
http://traces.simpleweb.org/
Semi-automated correlation process
manual intervention is still needed
Data set mainly constituted of malicious traffic
• need to extend to benign traffic
Conclusions
•
Reactions:
•
•
•
Since publication (October 2009) ~ 7 requests
We do not monitor the downloads at the webpage
In contact with Philipp Winter (Hagenberg
University, AU): MSc Project “Inductive Intrusion
Detection in Flow-Based Network Data using One-Class
Support Vector Machines ”
Implementation
ALERT_TYPES
id
description
SERVICES
id
description
ALERTS
id
automated
succeeded
description
timestamp
type
service
ALERTS_CLUSTERING
parent
child
ALERTS_CAUSALITY
parent
child
NETFLOWS
id
src_ip
dst_ip
packets
octets
start_time
start_msec
end_time
end_msec
src_port
dst_port
tcp_flag
prot
NETFLOW_ALERTS
flowid
alertid
Correlation procedure
Algorithm 1 Correlation procedure
1: procedure ProcessFlowsForService (s : service)
2: for all Incoming flows F1 for the service s do
3:
Retrieve matching response Flow F2 such as
4:
F2 .Isrc = F1 .Idst ∧ F2 .Idst = F1 .Isrc ∧ F2 .Psrc = F1 .Pdst ∧ F2 .Pdst = F1 .Psrc
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
∧
F1 .Tstart ≤ F2 .Tstart ≤ F1 .Tstart + δ
with smallest F2 .Tstart − F1 .Tstart ;
Retrieve a matching log event L such as
L.Isrc = F1 .Isrc ∧ L.Idst = F1 .Idst ∧ L.Psrc = F1 .Pdst ∧ L.Pdst = F1 .Psrc ∧
F1 .Tstart ≤ L.T ≤ F1 .Tend ∧ not L.Corr
with smallest L.T − F1 .Tstart ;
if L exists then
Create alert A = (L.T, L.Descr, L.Auto, L.Succ, s, CONN).
Correlate F1 to A ;
if F2 exists then
Correlate F2 to A ; L.Corr ← true ;
end if
end if
end for