A Labeled Data Set For Flow-based Intrusion Detection

Anna Sperotto, Ramin Sadre, Frank van Vliet, Aiko Pras
Design and Analysis of Communication Systems
University of Twente, The Netherlands

NMRG Workshop on Netflow/IPFIX Usage in Network Management
Maastricht - July 30, 2010

Contents
• Operational experience in trace collections
  • Experimental setup
  • Data processing and labeling
  • The labeled data set

Introduction
[Figure: trace 1 evaluated by system 1 (30% attacks!), trace 2 evaluated by system 2 (85% attacks!)]
• Systems are evaluated on proprietary traces
• No shared ground truth
• Results cannot be directly compared!

Data set requirements
We want the data set to be:
• realistic data
• complete and correct in labeling
• achievable in an acceptable labeling time
• sufficient trace size
The requirements will determine the collection setup.

Measurement scale
NETWORK
• realistic
• not complete
• it does not scale
SUBNETWORK
• realistic
• not complete
SINGLE HOST
• realistic
• enhanced logging (honeypot)

Setup
[Figure: HONEYPOT (ssh, http, ftp; ssh session transcript) and XEN SERVER (tcpdump)]
• daily used services with enhanced logging
• direct connection to the Internet
• attack exposure
• complete tcpdump of the traffic (offline flow creation)

Data set creation
[Figure: processing pipeline with the stages TRAFFIC DUMP, LOGS, TYPESCRIPTS, FLOWS, EVENTS, ALERT GENERATION/CORRELATION, CLUSTERING & CAUSALITY, LABELLED DATASET]

Preprocessing
• packets to flows
  F = (I_src, I_dst, P_src, P_dst, Pckts, Octs, T_start, T_end, Flags, Prot)
• logs to log events
  L = (T, I_src, P_src, I_dst, P_dst, Descr, Auto, Succ, Corr)

Data set creation
• The correlation process results in alerts
  A = (T, Descr, Auto, Succ, Serv, Type)

Correlation procedure
[Figure: flows F1 and F2 of the honeypot (HP) are correlated with the LOGS into an alert A, giving the labeled pairs (F1, A) and (F2, A)]
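The "packets to flows" step of the Preprocessing slide can be illustrated with a small sketch. In the deck this step is performed by softflowd on the tcpdump trace; the code below is not softflowd's logic, only a simplified stand-in that groups packets by their 5-tuple and fills the fields of F. The `Packet` type, the `packets_to_flows` function, and the `idle_timeout` parameter are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Packet:  # hypothetical minimal packet record
    ts: float
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    size: int
    tcp_flags: int
    prot: int


@dataclass
class Flow:  # mirrors F = (I_src, I_dst, P_src, P_dst, Pckts, Octs, T_start, T_end, Flags, Prot)
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    pckts: int
    octs: int
    t_start: float
    t_end: float
    flags: int
    prot: int


def packets_to_flows(packets: Iterable[Packet], idle_timeout: float = 60.0) -> List[Flow]:
    """Aggregate packets into unidirectional flows (idle_timeout is an assumed value)."""
    active: Dict[Tuple, Flow] = {}
    finished: List[Flow] = []
    for p in sorted(packets, key=lambda pkt: pkt.ts):
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.prot)
        flow = active.get(key)
        # Close the flow if the 5-tuple has been silent for too long.
        if flow is not None and p.ts - flow.t_end > idle_timeout:
            finished.append(flow)
            flow = None
        if flow is None:
            flow = Flow(p.src_ip, p.dst_ip, p.src_port, p.dst_port,
                        0, 0, p.ts, p.ts, 0, p.prot)
            active[key] = flow
        flow.pckts += 1
        flow.octs += p.size
        flow.t_end = p.ts
        flow.flags |= p.tcp_flags  # cumulative OR of the TCP flags seen
    finished.extend(active.values())
    return sorted(finished, key=lambda f: f.t_start)
```

Each direction of a connection yields its own flow, which is why the correlation procedure later has to pair an incoming flow with its response flow.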
Cluster and Causality
[Figure: hierarchy of flows, alerts and cluster alerts, connected by cluster relations and causality relations]
• Hierarchic view of the alerts to enrich the data set with extra information on the traffic
• Group simple alerts into cluster alerts: high level view of malicious activities

Implementation
• Packets to flows (AUTOMATIC)
  • softflowd
• Logs to log events (SEMI-AUTOMATIC / MANUAL)
  • shell scripts
  • discriminate between manual/automated attacks
• Alert correlation (SEMI-AUTOMATIC)
  • correlation procedure
  • extensible for other attacks
• Cluster and causality (MANUAL)
  • analysis of typescripts

The Dataset
  dump file   24 GB
  flows       14M
  alerts      7.6M

Flow breakdown (number of flows, plotted on a logarithmic axis)
  By service:
    SSH          13942629
    FTP                13
    HTTP             9798
    IRC              7383
    AUTH/IDENT     191339
    OTHERS          18970
  By protocol:
    TCP          14151511
    UDP               583
    ICMP            18038

Alert breakdown
[Figure: number of alerts per alert type (SSH IN, SSH OUT, FTP, HTTP, AUTH/IDENT IN, AUTH/IDENT OUT, IRC OUT, ICMP IN); one panel on a logarithmic axis from 1.0E+00 to 1.0E+08, one panel on a linear axis from 0 to 40]

The Dataset
• We labeled 98.5% of the flows and 99.99% of the alerts
• Mainly malicious traffic:
  • ssh brute force attacks
  • automated http connections
• Small percentage of side-effect traffic:
  • auth/ident on port 113
  • IRC traffic

Conclusions
• We presented the first labeled data set for flow-based intrusion detection
  • http://traces.simpleweb.org/
• Semi-automated correlation process
  • manual intervention is still needed
• Data set mainly constituted of malicious traffic
  • need to extend to benign traffic

Conclusions
• Reactions:
  • Since publication (October 2009): ~7 requests
  • We do not monitor the downloads at the webpage
  • In contact with Philipp Winter (Hagenberg University, Austria): MSc Project "Inductive Intrusion Detection in Flow-Based Network Data using One-Class Support Vector Machines"
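The cluster and causality relations from the "Cluster and Causality" slide can be sketched as plain parent/child pairs. This is only an illustration of how the hierarchic view lets one ask for every flow behind a high-level cluster alert. All ids are made up, and the helper names `index_pairs` and `flows_of_alert` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Made-up ids, for illustration only.
clustering = [(100, 1), (100, 2), (100, 3)]   # (cluster alert, child alert)
causality = [(100, 200)]                      # (causing alert, caused alert)
flow_alerts = [(11, 1), (12, 1), (13, 2), (14, 3), (21, 200)]  # (flow id, alert id)


def index_pairs(pairs: Iterable[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Turn (key, value) pairs into a key -> [values] lookup."""
    index: Dict[int, List[int]] = defaultdict(list)
    for key, value in pairs:
        index[key].append(value)
    return index


def flows_of_alert(alert_id: int,
                   cluster_children: Dict[int, List[int]],
                   alert_flows: Dict[int, List[int]]) -> Set[int]:
    """Collect the flows of an alert and, recursively, of its cluster children.

    Assumes the cluster relation is a tree (no cycles).
    """
    flows = set(alert_flows.get(alert_id, []))
    for child in cluster_children.get(alert_id, []):
        flows |= flows_of_alert(child, cluster_children, alert_flows)
    return flows


cluster_children = index_pairs(clustering)
caused_by = index_pairs(causality)
alert_flows = index_pairs((alert, flow) for flow, alert in flow_alerts)

# All flows that belong to the cluster alert 100:
print(sorted(flows_of_alert(100, cluster_children, alert_flows)))
# Alerts that alert 100 is recorded as causing:
print(caused_by[100])
```

Keeping clustering and causality as two separate relations means a cluster alert can both group its member alerts and point at the follow-up activity it led to.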
Implementation
Tables and their fields:
• ALERT_TYPES: id, description
• SERVICES: id, description
• ALERTS: id, automated, succeeded, description, timestamp, type, service
• ALERTS_CLUSTERING: parent, child
• ALERTS_CAUSALITY: parent, child
• NETFLOWS: id, src_ip, dst_ip, packets, octets, start_time, start_msec, end_time, end_msec, src_port, dst_port, tcp_flag, prot
• NETFLOW_ALERTS: flowid, alertid

Correlation procedure
Algorithm 1 Correlation procedure
procedure ProcessFlowsForService(s : service)
    for all incoming flows F1 for the service s do
        Retrieve matching response flow F2 such that
            F2.I_src = F1.I_dst ∧ F2.I_dst = F1.I_src ∧ F2.P_src = F1.P_dst ∧ F2.P_dst = F1.P_src
            ∧ F1.T_start ≤ F2.T_start ≤ F1.T_start + δ
            with smallest F2.T_start − F1.T_start;
        Retrieve a matching log event L such that
            L.I_src = F1.I_src ∧ L.I_dst = F1.I_dst ∧ L.P_src = F1.P_dst ∧ L.P_dst = F1.P_src
            ∧ F1.T_start ≤ L.T ≤ F1.T_end ∧ not L.Corr
            with smallest L.T − F1.T_start;
        if L exists then
            Create alert A = (L.T, L.Descr, L.Auto, L.Succ, s, CONN);
            Correlate F1 to A;
            if F2 exists then
                Correlate F2 to A;
                L.Corr ← true;
            end if
        end if
    end for
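A minimal Python sketch of Algorithm 1, written as a transcription of the conditions as printed on the slide rather than as the authors' implementation. The record types, the in-memory lists, the function name, and the default value of `delta` (the slide's δ) are assumptions. The port comparison for log events is kept exactly as printed (L.P_src against F1.P_dst); how ports are stored in the real log events is not stated in the deck.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Flow:  # only the fields of F that the algorithm uses
    id: int
    i_src: str
    i_dst: str
    p_src: int
    p_dst: int
    t_start: float
    t_end: float


@dataclass
class LogEvent:  # L = (T, I_src, P_src, I_dst, P_dst, Descr, Auto, Succ, Corr)
    t: float
    i_src: str
    p_src: int
    i_dst: str
    p_dst: int
    descr: str
    auto: bool
    succ: bool
    corr: bool = False


@dataclass
class Alert:  # A = (T, Descr, Auto, Succ, Serv, Type)
    t: float
    descr: str
    auto: bool
    succ: bool
    serv: str
    type: str


def process_flows_for_service(service: str, incoming: List[Flow], outgoing: List[Flow],
                              logs: List[LogEvent], delta: float = 1.0) -> List[Tuple[int, Alert]]:
    """Return (flow id, alert) labels; delta=1.0 is an assumed value for δ."""
    labels: List[Tuple[int, Alert]] = []
    for f1 in incoming:
        # Response flow: reversed endpoints, starting within delta of F1.
        responses = [f for f in outgoing
                     if f.i_src == f1.i_dst and f.i_dst == f1.i_src
                     and f.p_src == f1.p_dst and f.p_dst == f1.p_src
                     and f1.t_start <= f.t_start <= f1.t_start + delta]
        f2 = min(responses, key=lambda f: f.t_start - f1.t_start, default=None)
        # Log event: conditions transcribed as printed on the slide.
        events = [ev for ev in logs
                  if ev.i_src == f1.i_src and ev.i_dst == f1.i_dst
                  and ev.p_src == f1.p_dst and ev.p_dst == f1.p_src
                  and f1.t_start <= ev.t <= f1.t_end and not ev.corr]
        log = min(events, key=lambda ev: ev.t - f1.t_start, default=None)
        if log is not None:
            alert = Alert(log.t, log.descr, log.auto, log.succ, service, "CONN")
            labels.append((f1.id, alert))
            if f2 is not None:
                labels.append((f2.id, alert))
                log.corr = True  # placed inside the inner branch, following the printed order
    return labels
```

The returned pairs correspond to rows of the NETFLOW_ALERTS table (flowid, alertid) once the alerts are stored and given ids.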