Intelligent Detection of Malicious Script Code
CS194, 2007-08
Benson Luk, Eyal Reuveni, Kamron Farrokh
Advisor: Adnan Darwiche

Introduction
• Three-quarter project
• Sponsored by Symantec
• Main focuses:
  • Web programming
  • Database development
  • Data mining
  • Artificial intelligence

Overview
• Current security software catches known malicious attacks based on a list of signatures
• The problem: new attacks are created every day
  • Developers must write new signatures for each attack
  • Until those signatures are made, users are vulnerable to these attacks

Overview (cont.)
• Our objective is to build a system that can effectively detect malicious activity without relying on signature lists
• The goal of our research is to see whether, and how, artificial intelligence can discern malicious code from non-malicious code

Data Gathering
• Gather data using a web crawler (probably a modified crawler based on the Heritrix software)
• The crawler scours a list of known "safe" websites
• If necessary, it will also branch out into websites linked from those sites for additional data
• While this is performed, we will gather key information on the scripts (function calls, parameter values, return values, etc.)
• This will be done in Internet Explorer

Data Storage
• When data is gathered, it will need to be stored for the analysis that takes place later
• We need to develop a database that can efficiently store the script activity of tens of thousands (possibly millions) of websites

Data Analysis
• Using information from the database, deduce normal behavior
• Find a robust algorithm for generating a heuristic for acceptable behavior
• The goal is to later weigh this heuristic against scripts to determine abnormal (and thus potentially malicious) behavior

Challenges
• Gathering
  • How do we grab relevant information from scripts?
  • How deep do we search? Good websites may inadvertently link to malicious ones, and the link graph is effectively unbounded
• Storage
  • In what form should the data be stored?
  • We need an efficient way to store the data without oversimplifying it; for example, a simple laundry list of function calls does not take call sequence into account
• Analysis
  • What analysis algorithm can handle all of this data?
  • How can we ensure that the normality heuristic it generates minimizes false positives and maximizes true positives?

Milestones
• Phase I: Setup
  • Set up equipment for research; ensure the whitelist is clean
• Phase II: Crawler
  • Modify the crawler to grab and output the necessary data so it can later be stored, and begin crawling for sample information
• Phase III: Database
  • Research and develop an effective structure for storing the data, and link it to the web crawler
• Phase IV: Analysis
  • Research and develop an effective algorithm for learning from massive amounts of data
• Phase V: Verification
  • Using the web crawler, visit a large volume of websites to verify that the heuristic generated in Phase IV is accurate
• Certain milestones may need to be revisited depending on the results of each phase
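The storage challenge noted above, keeping call sequence rather than a flat laundry list of calls, could be handled by recording each call with its position in the trace. A minimal sketch using SQLite; the table layout, column names, and sample URLs are illustrative assumptions, not part of the project plan:

```python
import sqlite3

# Sketch of a trace store that preserves call order.
# Schema and names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE script_calls (
        url      TEXT NOT NULL,     -- page the script ran on
        seq      INTEGER NOT NULL,  -- position of the call within the trace
        function TEXT NOT NULL,     -- e.g. 'document.write'
        params   TEXT,              -- serialized parameter values
        retval   TEXT,              -- serialized return value
        PRIMARY KEY (url, seq)      -- keeps each trace ordered
    )
""")

# A hypothetical trace captured while crawling one page.
trace = [
    ("document.write", "'<div>...</div>'", "undefined"),
    ("eval", "'payload'", "undefined"),
]
conn.executemany(
    "INSERT INTO script_calls VALUES ('http://example.com', ?, ?, ?, ?)",
    [(i, fn, args, ret) for i, (fn, args, ret) in enumerate(trace)],
)

# Reading back ORDER BY seq reconstructs the original call sequence.
rows = conn.execute(
    "SELECT function FROM script_calls "
    "WHERE url = 'http://example.com' ORDER BY seq"
).fetchall()
print([r[0] for r in rows])  # ['document.write', 'eval']
```

The `seq` column is the point: unlike an unordered list of calls, the stored trace can be replayed in order for sequence-aware analysis.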
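The analysis goal, a normality heuristic that respects call sequence and can flag abnormal behavior, might be prototyped as an n-gram model over function-call traces. A sketch under our own assumptions: the bigram choice, the scoring rule, and the toy function names are illustrative, not the project's chosen algorithm:

```python
from collections import Counter

def bigrams(trace):
    """Adjacent function-call pairs, so call order matters."""
    return list(zip(trace, trace[1:]))

def train(traces):
    """Count bigrams observed across known-good script traces."""
    model = Counter()
    for t in traces:
        model.update(bigrams(t))
    return model

def anomaly_score(model, trace):
    """Fraction of a trace's bigrams never seen in normal traffic.
    Higher scores suggest behavior worth flagging for review."""
    bg = bigrams(trace)
    if not bg:
        return 0.0
    unseen = sum(1 for b in bg if model[b] == 0)
    return unseen / len(bg)

# Toy example: traces from whitelisted pages vs. an unusual eval chain.
benign = [
    ["getElementById", "setAttribute", "appendChild"],
    ["getElementById", "appendChild"],
]
model = train(benign)
print(anomaly_score(model, ["getElementById", "appendChild"]))       # 0.0
print(anomaly_score(model, ["unescape", "eval", "document.write"]))  # 1.0
```

Thresholding such a score is one possible way to "weigh the heuristic against scripts"; tuning that threshold is exactly the false-positive/true-positive trade-off raised in the Challenges section.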