
Learning to Detect and Classify Malicious Executables in the Wild
J. Zico Kolter, Marcus A. Maloof
Presented by: Ashwani Rao
Dept. of Computer & Information Sciences, University of Delaware
CISC 879 - Machine Learning for Solving Systems Problems

Introduction
• Machine learning and data mining are used to identify malicious code
• What is malicious code?
• Why not antivirus suites?
• Training set: 1971 benign and 1651 malicious executables
• Features extracted: n-grams of byte code; executables are also classified by the function of their payload
• Learning algorithms: naïve Bayes, SVM, decision trees, and boosting

Goals of the Research Paper
• Show how established methods can be used to detect and classify malicious executables
• Present empirical results from an extensive study of inductive methods for detection and classification
• Show that the methods achieve high detection rates on new, unseen executables

Related Work
• Lo et al., 1995; Kephart et al., 1995; Tesauro et al., 1996; Schultz et al., 2001
• Lo et al., 1995: analysis of several programs
• Schultz et al., 2001: used data mining for detection, based on:
  • Binary profiling
  • String sequences (naïve Bayes)
  • Hex dumps (RIPPER learning) (six naïve Bayes classifiers)

Data Collection and Classification Methods
• 1971 benign and 1651 malicious executables in Windows PE format
• N-grams: each four-byte sequence is combined into a single term. For example, for the byte sequence ff 00 ab 3e 12 b3, the corresponding n-grams are ff00ab3e, 00ab3e12, and ab3e12b3.
• Each n-gram is treated as an attribute
• The most relevant attributes (n-grams) are selected using information gain, also called average mutual information
• The 500 most relevant n-grams were collected

Classification methods
• Instance-based learner: stores a collection of training examples
• Naïve Bayes: probabilistic model based on the prior probability of each class, P(Ci), and the conditional probability of each attribute value given the class, P(vj | Ci)
• Support vector machines: a vector of weights w and a threshold b. A kernel function maps the training data into a higher-dimensional space so that the problem becomes linearly separable.
• Decision trees: internal nodes correspond to attributes and leaf nodes correspond to class labels
• Boosted classifiers: boosting is a method for combining multiple classifiers. It produces a set of weighted models by iteratively learning a model from a weighted data set, evaluating it, and reweighting the data set based on the model's performance.

Detecting malicious code using n-grams
• Used ten-fold cross-validation
• Pilot study: determined the size of the n-grams and the number of relevant n-grams. Used n-grams with n = 4 and chose the best number of n-grams using information gain; 500 relevant n-grams produced the best result.
• Experiment with small collection: a small collection of executables with a total of 68,744,909 n-grams
• Experiment with large collection: 255 million distinct n-grams of size 4
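
The n-gram extraction and information-gain selection described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function names (byte_ngrams, information_gain, top_ngrams) and the tiny hex samples in the usage note are hypothetical, and each n-gram is assumed to be a boolean attribute (present or absent in an executable).

```python
import math
from collections import Counter


def byte_ngrams(data, n=4):
    """Return the distinct overlapping n-byte sequences of `data` as hex strings."""
    return {data[i:i + n].hex() for i in range(len(data) - n + 1)}


def information_gain(feature_present, labels):
    """Average mutual information between a boolean attribute and the class.

    feature_present: list of bools, one per executable.
    labels: list of class labels, aligned with feature_present.
    """
    total = len(labels)
    joint = Counter(zip(feature_present, labels))
    value_counts = Counter(feature_present)
    class_counts = Counter(labels)
    gain = 0.0
    for (value, label), count in joint.items():
        p_joint = count / total
        p_value = value_counts[value] / total
        p_class = class_counts[label] / total
        gain += p_joint * math.log2(p_joint / (p_value * p_class))
    return gain


def top_ngrams(samples, k=500, n=4):
    """Rank every n-gram seen in `samples` by information gain and keep the top k.

    samples: list of (bytes, label) pairs. Hypothetical helper for illustration;
    it scores every n-gram in memory, so it only suits small collections.
    """
    per_sample = [(byte_ngrams(data, n), label) for data, label in samples]
    labels = [label for _, label in per_sample]
    vocabulary = set().union(*(grams for grams, _ in per_sample))
    scores = {
        gram: information_gain([gram in grams for grams, _ in per_sample], labels)
        for gram in vocabulary
    }
    # Sort by descending gain; break ties alphabetically for a stable result.
    return sorted(scores, key=lambda gram: (-scores[gram], gram))[:k]
```

For the byte sequence on the slide, byte_ngrams(bytes.fromhex("ff00ab3e12b3")) yields ff00ab3e, 00ab3e12, and ab3e12b3. Calling top_ngrams(samples, k=500) on a list of (bytes, label) pairs mirrors the selection of the 500 most relevant n-grams, under the assumptions stated above.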

Results of Small Collection
• ROC curve for detecting malicious executables in the small collection

Results of Large Collection
• ROC curve for the large collection

Classifying executables by payload function
• Goal: determine the extent to which classification methods can tell whether a given malicious executable opened a backdoor, mass-mailed, or was an executable virus
• Identify and enumerate the functions of payloads
• Many executables fell into multiple categories
• Experimental design similar to the previous one, but for each function the data set is built from malicious executables only
• Used ten-fold cross-validation

Experimental Results
• ROC curve for mass-mailing capabilities
• ROC curve for backdoor entries

Evaluating Real-World Online Performance
• Applied the method to 291 real-world malicious executables discovered after the original data were gathered
• Classifiers were built from the original data, covering both benign and malicious code
• The boosted decision tree detected 98% of the new malicious code
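
The ROC curves referred to in the results slides plot the true-positive rate against the false-positive rate as the decision threshold on a classifier's score is varied. The sketch below shows one common way to compute such a curve and its area in Python; the function names roc_points and area_under_curve are hypothetical, and the scores in the usage note are made up for illustration, not results from the study.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) points of an ROC curve.

    scores: classifier outputs, where a higher score means "more likely malicious".
    labels: bools aligned with scores, True for malicious.
    Assumes at least one malicious and one benign example are present.
    """
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as a list of (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

As a usage example, area_under_curve(roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False])) evaluates to 0.75, while a classifier that ranks every malicious example above every benign one reaches an area of 1.0.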

Conclusion and Future Work
• Machine learning and data mining are useful and appropriate tools for detecting malware
• Boosted classifiers and support vector machines performed exceptionally well
• Boosting reduces bias and variance, and it outperformed the other classifiers in the study
• The approach is scalable
• 20-25% of the executables were obfuscated using compression and encryption
• For the payload-function experiments: remove obfuscation and rerun the experiments with a larger data set
• Study the similarity of malicious code and how such executables change over time; clustering can provide good insight into this
• This approach, combined with searching for known signatures and executing and analyzing code in a virtual machine, will provide better computer security

Q&A
