Learning to Detect and Identify Malicious
Executables in the Wild
J. Zico Kolter
Marcus A. Maloof
Presented by: Ashwani Rao
Dept of Computer & Information Sciences
University of Delaware
CISC 879 - Machine Learning for Solving Systems Problems
Introduction
•
Machine learning and data mining to identify
malicious code
•
What is malicious code?
•
Why not antivirus suites?
•
Training set: 1971 benign and 1651 malicious
executables
•
Features extracted: byte-code n-grams; executables
are also classified by the function of their payload
•
Learning algorithms: naïve Bayes, support vector
machines (SVMs), decision trees, and boosting
Goals of the Research Paper
•
How can established methods be used to detect and
classify malicious executables?
•
Present empirical results from an extensive study of
inductive methods for detection and classification
•
Show that the methods achieve high detection rates
on new, previously unseen executables.
Related Work
•
Lo et al., 1995; Kephart et al., 1995; Tesauro et
al., 1996; Schultz et al., 2001
•
Lo et al., 1995: analysis of several programs
•
Schultz et al., 2001: used data mining to detect
malicious executables, with three kinds of features:
•
Binary profiling (RIPPER rule learner)
•
String sequences (naïve Bayes)
•
Hex dumps (six naïve Bayesian classifiers)
Data Collection and
Classification methods
•
1971 benign and 1651 malicious executables in
Windows PE format
•
N-grams: combine each four-byte sequence into a
single term. For example, for the byte sequence
ff 00 ab 3e 12 b3, the corresponding n-grams are
ff00ab3e, 00ab3e12, and ab3e12b3.
•
Each distinct n-gram is treated as an attribute.
•
The most relevant attributes (n-grams) are selected
using information gain, also called average mutual
information. The 500 most relevant n-grams were
kept.
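The n-gram extraction and information-gain ranking described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each n-gram is a Boolean attribute (present or absent in a file), and the function names `byte_ngrams`, `information_gain`, and `top_ngrams` are hypothetical.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the distinct overlapping n-byte sequences in `data` as hex strings."""
    return {data[i:i + n].hex() for i in range(len(data) - n + 1)}


def entropy(counts) -> float:
    """Shannon entropy (in bits) of a collection of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(present, labels) -> float:
    """H(class) - H(class | attribute) for one Boolean attribute (assumed encoding)."""
    h_class = entropy(Counter(labels).values())
    h_conditional = 0.0
    for value in (True, False):
        subset = [c for p, c in zip(present, labels) if p == value]
        if subset:
            h_conditional += len(subset) / len(labels) * entropy(Counter(subset).values())
    return h_class - h_conditional


def top_ngrams(files, labels, k: int = 500, n: int = 4):
    """Rank every n-gram seen in `files` by information gain and keep the best k."""
    grams = [byte_ngrams(f, n) for f in files]
    vocabulary = set().union(*grams)
    scores = {g: information_gain([g in s for s in grams], labels) for g in vocabulary}
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]
```

Calling `byte_ngrams(bytes.fromhex("ff00ab3e12b3"))` yields the three n-grams listed in the example above. Holding every n-gram of a large collection in memory, as this sketch does, would not scale to hundreds of millions of n-grams; it is meant only to show the computation.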
Classification methods
•
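Instance-based learner: stores a collection of
training examples and classifies a new example by
comparing it with the stored ones.
•
Naïve Bayes: probabilistic model based on the prior
probability of each class, P(Ci), and the conditional
probability of each attribute value given the class,
P(Vj | Ci).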
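As a rough sketch of how P(Ci) and P(Vj | Ci) combine into a prediction, the class below implements naïve Bayes over Boolean attributes. The class name `BooleanNaiveBayes` is hypothetical, and the add-one (Laplace) smoothing is an assumption made here to avoid zero probabilities; the slides do not say how the estimates were smoothed.

```python
import math


class BooleanNaiveBayes:
    """Naive Bayes for Boolean attributes (illustrative sketch)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_attrs = len(X[0])
        self.log_prior = {}
        self.p_true = {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            # P(Ci): fraction of training examples in class c.
            self.log_prior[c] = math.log(len(rows) / len(X))
            # P(Vj = true | Ci) with add-one smoothing (an assumption of this sketch).
            self.p_true[c] = [
                (sum(1 for r in rows if r[j]) + 1) / (len(rows) + 2)
                for j in range(n_attrs)
            ]
        return self

    def log_posteriors(self, x):
        """Unnormalized log P(Ci) + sum_j log P(Vj | Ci) for each class."""
        scores = {}
        for c in self.classes:
            score = self.log_prior[c]
            for value, p in zip(x, self.p_true[c]):
                score += math.log(p if value else 1.0 - p)
            scores[c] = score
        return scores

    def predict(self, x):
        scores = self.log_posteriors(x)
        return max(scores, key=scores.get)
```

Working in log space keeps the product of many small probabilities from underflowing when there are hundreds of attributes.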
Classification methods
•
Support vector machines: a linear classifier defined
by a weight vector w and a threshold b. A kernel
function maps the training data into a
higher-dimensional space so that the problem
becomes linearly separable.
•
Decision trees: internal nodes correspond to
attributes and leaf nodes correspond to class
labels.
•
Boosted classifiers: a method for combining
multiple classifiers. Boosting produces a set of
weighted models by iteratively learning a model
from a weighted data set, evaluating it, and
reweighting the data set based on the model's
performance.
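The learn, evaluate, reweight loop of boosting can be sketched as below. This is an AdaBoost-style illustration under stated assumptions: labels are +1 and -1, attributes are Boolean, and the base learner is a one-attribute decision stump. The slides do not specify the boosting variant or base-learner settings, and all function names are hypothetical.

```python
import math


def stump_predict(x, attr, polarity):
    """A decision stump: predict `polarity` if the attribute is set, else the opposite."""
    return polarity if x[attr] else -polarity


def train_stump(X, y, weights):
    """Return (weighted_error, attr, polarity) for the stump with the lowest weighted error."""
    best = None
    for attr in range(len(X[0])):
        for polarity in (1, -1):
            error = sum(
                w for w, x, label in zip(weights, X, y)
                if stump_predict(x, attr, polarity) != label
            )
            if best is None or error < best[0]:
                best = (error, attr, polarity)
    return best


def adaboost(X, y, rounds=10):
    """Learn a list of (model_weight, attr, polarity) stumps."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, attr, polarity = train_stump(X, y, weights)
        if error >= 0.5:          # no stump beats chance on the weighted data
            break
        error = max(error, 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, attr, polarity))
        # Reweight: misclassified examples gain weight, correct ones lose it.
        weights = [
            w * math.exp(-alpha * label * stump_predict(x, attr, polarity))
            for w, x, label in zip(weights, X, y)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def boosted_predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(x, attr, polarity) for alpha, attr, polarity in ensemble)
    return 1 if score >= 0 else -1
```

On a toy problem where the label is the majority of three Boolean attributes, no single stump is perfect, but three boosting rounds combine into a weighted vote that classifies every training example correctly.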
Detecting malicious code
using n-grams
•
Used ten-fold cross-validation
•
Pilot study: determined the size of the n-grams and
the number of relevant n-grams to keep. Used
n-grams with n = 4 and ranked them by information
gain; the 500 most relevant n-grams produced the
best results.
•
Experiment with small collection: a small collection
of executables with a total of 68,744,909 n-grams
•
Experiment with large collection: 255 million
distinct n-grams of size 4.
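For readers unfamiliar with ten-fold cross-validation, the sketch below shows one simple way to produce the ten train/test partitions. The helper name `k_fold_indices` and the fixed shuffle seed are assumptions of this example; it does not stratify by class, which a real experiment may want to do.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Each executable lands in the test partition exactly once, so every prediction used for evaluation comes from a model that never saw that executable during training.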
Results of Small Collection
•
ROC curve for detecting malicious executables in
small collection
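An ROC curve plots the true-positive rate against the false-positive rate as the decision threshold on a classifier's score is varied. The sketch below shows one way to compute the curve's points and the area under it. It assumes that higher scores mean "more likely malicious" and that both classes are present; the function names are hypothetical.

```python
def roc_points(scores, is_malicious):
    """Return (false_positive_rate, true_positive_rate) points, threshold swept high to low."""
    positives = sum(1 for m in is_malicious if m)
    negatives = len(is_malicious) - positives
    ranked = sorted(zip(scores, is_malicious), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

A classifier that ranks every malicious executable above every benign one gives an area of 1.0, while random ranking gives about 0.5.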
Result of Bigger Collection
•
ROC Curve for bigger collection
Classifying executables by
Payload function
•
Extent to which classification methods could
determine whether a given malicious executable
opened a backdoor, mass mailed or was an
executable virus.
•
Identify and enumerate the functions of payloads
•
Many executables fell into more than one category
•
Experimental design similar to the previous one, but
for each function the data set is built from
malicious executables only.
•
Used ten-fold cross-validation
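Because an executable can belong to several payload categories, one way to set up the experiment is a separate yes/no data set per function, built from malicious executables only. The sketch below illustrates that construction. The sample names, tag strings, and the helper `per_function_datasets` are hypothetical and are not taken from the study's data.

```python
def per_function_datasets(malicious, functions):
    """Build one binary data set per payload function.

    `malicious` maps an executable name to the set of payload functions it
    was labeled with. Benign executables are not included at all.
    """
    datasets = {}
    for function in functions:
        datasets[function] = [
            (name, function in tags) for name, tags in sorted(malicious.items())
        ]
    return datasets


# Hypothetical labels for three malicious executables.
samples = {
    "sample_a.exe": {"backdoor", "mass-mailer"},
    "sample_b.exe": {"virus"},
    "sample_c.exe": {"mass-mailer"},
}
datasets = per_function_datasets(samples, ["backdoor", "mass-mailer", "virus"])
```

Here `sample_a.exe` is a positive example in both the backdoor and the mass-mailer data sets, which is how an executable with several payload functions is handled without forcing it into a single class.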
Experimental Results
•
ROC curve for mass mailing capabilities
Experimental Results
•
ROC Curve for backdoor entries
Evaluating Real World
Online Performance
•
Applied the methods to 291 real-world malicious
executables discovered after the original data were
gathered
•
Classifiers were built from the original data, using
both benign and malicious code
•
The boosted decision tree detected 98% of the new
malicious code.
Conclusion and Future Work
•
Machine learning and data mining are useful and
appropriate tools for detecting malware
•
Boosted classifiers and support vector machines
performed exceptionally well
•
Boosting reduces bias and variance and
outperformed the other classifiers in the study
•
This approach is scalable
•
20-25% of the code was obfuscated using
compression and encryption
•
For the payload-function experiments: remove the
obfuscation and rerun the experiments with a larger
set
Conclusion and Future Work
•
Study the similarity of malicious executables and
how they change over time. Clustering can
provide good insight into this.
•
Combining this approach with searches for known
signatures and with executing and analyzing code in
a virtual machine will provide better computer
security
Q&A ?