Data Mining Approach for Network Intrusion Detection
  Zhen Zhang
  Advisor: Dr. Chung-E Wang
  04/24/2002
  Department of Computer Science
  California State University, Sacramento

Outline
  Background
    – Intrusion Detection: promises and challenges
    – Data Mining in IDS: how it can help
  Motivation
  Approaches, tasks, problems, and my contributions
  Results
  Conclusion and future work

Intrusion Detection - Building a Secure Network
  Primary assumptions
    – System activities are observable
    – Normal and intrusive activities have distinct evidence
  Main techniques
    – Misuse detection: patterns of well-known attacks
    – Anomaly detection: deviation from normal usage

Data Mining in IDS
  Shortfalls with current IDS (mostly misuse detection)
    – Variants: intrusions change easily and frequently.
    – False positives: alarms raised on legitimate traffic make real intrusions difficult to pick out.
    – False negatives: attacks for which there are no known signatures go undetected.
    – Data overload: the amount of data grows rapidly.

What is Data Mining
  Data mining: take data and pull from it patterns or deviations.
  Many different types of algorithms: decision tree, link analysis, clustering, association, rule abduction, deviation analysis, and sequence analysis.
  Software and tools
    – MS SQL Server 2000
    – Ripper and many others

How can Data Mining help
  Variants
    – Anomaly detection is not greatly concerned with variants in exploit code.
  False positives
    – Identify recurring sequences of alarms in order to help identify valid network activity.
  False negatives
    – Attacks for which signatures have not been developed might be detected.
  Data overload
    – Data mining plays a vital role in reducing large volumes of data to relevant patterns.
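
To make the point about variants concrete, here is a minimal Python sketch that is not taken from the project. It contrasts misuse detection (an exact match against known patterns) with anomaly detection (deviation from normal usage). The signature set, the per-window connection count, the sample numbers, and the 3-standard-deviation threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

# Hypothetical signature set: (destination port, connection state) pairs
# that a misuse detector would treat as known attacks.
KNOWN_SIGNATURES = {(23, "S0"), (111, "REJ")}


def misuse_detect(conn):
    """Misuse detection: flag only exact matches against known patterns."""
    return (conn["dst_port"], conn["state"]) in KNOWN_SIGNATURES


def fit_baseline(normal_counts):
    """Summarize normal usage as the mean and standard deviation of a count."""
    return mean(normal_counts), stdev(normal_counts)


def anomaly_detect(count, baseline, k=3.0):
    """Anomaly detection: flag counts more than k std devs from the mean."""
    mu, sigma = baseline
    return abs(count - mu) > k * sigma


# Connections per time window seen during normal operation (made-up values).
baseline = fit_baseline([4, 6, 5, 7, 5, 6, 4, 5])

# A variant of a known attack: same behaviour, different destination port.
variant = {"dst_port": 2323, "state": "S0", "count_in_window": 120}

# The signature lookup misses the variant; the deviation test does not.
print("misuse detector flags variant: ", misuse_detect(variant))
print("anomaly detector flags variant:",
      anomaly_detect(variant["count_in_window"], baseline))
```

The trade-off is the one listed under the shortfalls: the deviation test needs a representative baseline, and a threshold set too low would raise false positives on legitimate bursts.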
Summary of my work
  Identify objective
    – Distinguish network attacks from normal traffic
    – New area, several research projects, no commercial products
    – Focus on the principle and basic implementation of concepts
  Data collection
  Data pre-processing on tcpdump dataset
  Apply data mining on processed data
  Investigate results
  Software packages used: Visual Basic, Microsoft SQL Server 2000 with Analysis Server, Tcpdump

Data Collection
  Tcpdump data (http://iris.cs.uml.edu:8080/)
    – Tcpdump was executed on the gateway to capture the traffic between the LAN and external networks, plus broadcast packets within the LAN
    – Headers only, no user data
    – Filters were used to keep only TCP and UDP packets
    – Baseline and 4 simulated attacks

Tcpdump data format: TCP packet
    – Time stamp
    – Source IP address
    – Source port
    – Destination IP address
    – Destination port
    – Flags (SYN, FIN, PUSH, RST, or .)
    – Data sequence number of this packet
    – Data sequence number of the data expected in return
    – Number of bytes of receive buffer space available
    – Indication of whether or not the data is urgent

Tcpdump data format: UDP packet
    – Time stamp
    – Source IP address
    – Source port
    – Destination IP address
    – Destination port
    – Length of the packet

Example data
  [Figure: example tcpdump data]

Data Pre-processing - 80% ~ 90% work
  Packet-level information to connection level
    – Group by same source/destination IP/port
    – Use flags and ACKs to determine the status of the connection
        » SF, REJ, S0, S1, S2, S3, S4, RSTOSn, RSTRSn, SS, SH, SHR, OOS1, OOS2
    – Record start time, duration, protocol
    – Calculate bytes in, bytes out, resent rate
    – UDP is connectionless, so simply treat each packet as a connection

First round of processing
  [Figure: intrinsic features]

Establish more information
  Same Destination Temporal and Statistical Attributes (last 2 seconds)

  Feature                   Description
  Count_per_dest            # of connections to this destination IP
  REJ_count_per_dest        # of connections that get the flag “REJ”
  S01_count_per_dest        # of connections that send a SYN packet but never get the ACK packet (S0), or receive an ACK on a SYN that they never sent (S1)
  Diff_Services_per_dest    # of unique services
  Diff_Service_Rate         Diff_Services / Count

Establish more information
  Same Service Temporal and Statistical Attributes (last 2 seconds)

  Feature                   Description
  Count_per_service         # of connections to this type of service
  REJ_count_per_service     # of connections that get the flag “REJ” (SYN met by RST)
  S01_count_per_service     # of connections that send a SYN packet but never get the ACK packet (S0), or receive an ACK on a SYN that they never sent (S1)
  Diff_Hosts_per_service    # of unique destination hosts
  Diff_Hosts_Rate           Diff_Hosts / Count

Second round of processing
  [Figure: same destination temporal and statistical attributes]

Final round of processing
  Final, but important
    – Reduce the amount of data
    – Remove noise or trivial information
    – Reorganize the data, adding new features if necessary
  Challenges
    – Hard to tell which data to reduce or remove
    – Requires tremendous domain knowledge
    – Needs experiments and adjustments

Data Mining
  Decision Tree Algorithm
  Microsoft SQL Server 2000 Analysis Server
  Steps:
    – Use 80% of the baseline (normal) dataset as training data
    – Use the remaining 20% as validation data and compute misclassification
    – Use 20% of each of the four intrusion datasets as prediction data and compute misclassification

Dependency Network
  [Figure: dependency network]

Decision Tree
  [Figure: decision tree]

Apply Data Mining Model to Validate/Predict
  [Figure: applying the data mining model]

Results
  % misclassification (by final state)

  Dataset       Misclassified / Total    Rate
  Normal        149/1510                 9.86%
  Intrusion1    443/2324                 19.06%
  Intrusion2    376/1968                 19.10%
  Intrusion3    386/2011                 19.19%
  Intrusion4    437/2298                 19.01%

Conclusion and future improvement
  Accuracy
    – Preliminary experiments using DM on the tcpdump data showed promising results
    – Depends on sufficient training data and the right feature set
  Performance
    – 6 hours on one dataset (628775 records)
  Size of time window
    – 2 seconds or larger?
  Automated process
    – Call MSSQL DM and DTS procedures within VB
    – Real-time monitoring and alarms

References
    – Rebecca Gurley Bace, Intrusion Detection, Macmillan Technical Publishing, 2000
    – Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001
    – Claude Seidman, Data Mining with Microsoft SQL Server 2000, Microsoft Press, 2001
    – http://www.cs.columbia.edu/~sal/hpapers/USENIX/usenix.html
    – http://iris.cs.uml.edu:8080/network.html
    – http://www-nrg.ee.lbl.gov/ Network Research Group (NRG) of the Information and Computing Sciences Division (ICSD) at Lawrence Berkeley National Laboratory (LBNL) in Berkeley, California

Thank You!
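
Backup: the same-destination attributes over the last 2 seconds can be sketched in Python as follows. This is an illustration under stated assumptions, not the Visual Basic / SQL Server code used in the project. The `Conn` record and its field names are hypothetical, and the sketch assumes that each connection counts itself in its own window, which also keeps the denominator of Diff_Service_Rate above zero.

```python
from collections import namedtuple

# Hypothetical connection-level record; the field names are illustrative.
Conn = namedtuple("Conn", "start dst_ip service state")

WINDOW = 2.0  # seconds, the time window named on the slides


def same_dest_features(conns, window=WINDOW):
    """Compute same-destination attributes for each connection.

    Each connection is compared with the connections to the same
    destination IP that started within the preceding `window` seconds
    (assumption: the current connection is included in its own window).
    """
    conns = sorted(conns, key=lambda c: c.start)
    rows = []
    lo = 0  # index of the oldest connection still inside the window
    for i, cur in enumerate(conns):
        while cur.start - conns[lo].start > window:
            lo += 1
        recent = [c for c in conns[lo:i + 1] if c.dst_ip == cur.dst_ip]
        count = len(recent)
        services = {c.service for c in recent}
        rows.append({
            "Count_per_dest": count,
            "REJ_count_per_dest": sum(c.state == "REJ" for c in recent),
            "S01_count_per_dest": sum(c.state in ("S0", "S1") for c in recent),
            "Diff_Services_per_dest": len(services),
            "Diff_Service_Rate": len(services) / count,
        })
    return rows


# Made-up sample: three connections to one host inside 2 seconds,
# one to another host, and a late connection outside the window.
sample = [
    Conn(0.0, "10.0.0.5", "http", "SF"),
    Conn(0.5, "10.0.0.5", "http", "S0"),
    Conn(1.0, "10.0.0.5", "ftp", "REJ"),
    Conn(1.2, "10.0.0.9", "http", "SF"),
    Conn(3.5, "10.0.0.5", "smtp", "S0"),
]

for conn, row in zip(sample, same_dest_features(sample)):
    print(conn.start, conn.dst_ip, row)
```

The same-service attributes follow the same pattern with the roles swapped: filter on `service` instead of `dst_ip` and count unique destination hosts instead of unique services. A larger window only requires changing `WINDOW`, which makes the open question about window size easy to test.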