Lexical Feature Based Phishing URL Detection Using Online Learning
  Reporter: Jing Chiu
  Advisor: Yuh-Jye Lee
  Email: [email protected]
  2011/3/17, Data Mining and Machine Learning Lab.

Paper Information
  Authors
    Aaron Blum (University of Alabama, Birmingham)
    Brad Wardman (University of Alabama, Birmingham)
    Thamar Solorio (University of Alabama, Birmingham)
  Source
    3rd ACM Artificial Intelligence Security Workshop, 2010

Outline
  Introduction
  Related Work
  Approach
  Data
  Evaluation
  Conclusion

Introduction
  Phishing
    A cybercrime carried out through spammed emails and fraudulent websites
    Entices victims to provide sensitive information
    The information is used to steal identities or gain access to money
  Characteristics
    Highly dynamic environment
    Models need to be updated frequently
  New ideas
    Combine online learning with a content-inspection-based approach
    Train the model on largely lexical features only (without host-based features)
    Provide results showing that URL-inspection-based detection performs as well as content-inspection-based detection

Related Work
  Content-based Phishing URL Detection
    Uses the similarity between content files to detect phishing websites
  Purely URL-based Malicious URL Detection
    Uses host information and URL lexical features with online learning algorithms
  PhishNet
    Extends the usability of blacklists
  Domain Blacklisting
    Expands blacklists using DNS zone file data and WHOIS information

Approach
  Feature Extraction
    Delimiters: "/", "?", ".", "=" and "_"
    Bigram combination
    Lexical feature groups
  Learning algorithm
    Confidence-Weighted (CW) algorithm
      Updates the model with per-feature weights that depend on how often each feature occurs

Approach (cont.)
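As a rough illustration of the Feature Extraction items on the Approach slide, the Python sketch below splits a URL on the listed delimiters and forms bigrams from adjacent tokens. The function names, the feature-name prefixes, and the reading of "bigram combination" as pairs of adjacent tokens are assumptions made for illustration; the slides do not specify the exact encoding or how the lexical feature groups are built.

```python
import re

# Delimiters listed on the Approach slide: "/", "?", ".", "=" and "_"
DELIMITERS = "/?.=_"
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]")


def tokenize(url):
    """Split a URL on the delimiters and drop empty tokens."""
    return [tok for tok in _SPLIT_RE.split(url) if tok]


def bigrams(tokens):
    """Pair each token with its successor (assumed meaning of 'bigram combination')."""
    return [a + " " + b for a, b in zip(tokens, tokens[1:])]


def extract_features(url):
    """Return a sparse binary feature dict of single tokens and token bigrams."""
    tokens = tokenize(url)
    feats = {"tok:" + t: 1 for t in tokens}          # hypothetical "tok:" prefix
    feats.update({"bi:" + b: 1 for b in bigrams(tokens)})  # hypothetical "bi:" prefix
    return feats


if __name__ == "__main__":
    # Made-up URL, used only to show the shape of the output
    for name in sorted(extract_features("http://example.com/login?user_id=1")):
        print(name)
```

A sparse dictionary is used here because lexical URL features are typically very high-dimensional and mostly absent for any single URL, which suits online learners that update only the weights of features present in the current example.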
  MD5 Matching
    Uses files' MD5 checksums to check file similarity
    Easy to evade (by varying the content)
    Examples
  Deep MD5 Matching
    Download all the associated content files
    Compare the similarity between two websites' content files using the Kulczynski 2 coefficient

Data
  Data Source
    UAB Phishing Data Mine
      Collected over two and a half years
      Benign URLs may look "phishy" (e.g.)
      9,506 unique domains
      25,203 URLs (6,114 malicious)
    Cyveillance
      18,990 unique domains
      34,234 URLs (all malicious)
    All feeds are fully de-duplicated
  Datasets
    UAB Feeds
    Cyveillance full
    Cyveillance abridged
    Mixed

Data (cont.)
  [Figure: Percentage of total URLs vs. Individual Domains]

Evaluation
  Experiment setting
    Training and testing were conducted on daily batches
    Training was initially conducted on UAB data
    The model is updated by a daily URL blacklist/whitelist feed
    False positive and false negative error rates were computed for every prediction

Evaluation (cont.)
  [Three slides of result figures]

Conclusion
  Lexical-feature-based learning with the CW algorithm provides robust performance
  High-quality, diverse training data can achieve an accuracy higher than 97%
  For the proposed system
    Training data can be collected from any blacklist
    Easy to implement, with robust performance

Thanks for your attention
  Q&A?

Lexical Feature Group

URLs including the recipient's email

Data in UAB Phishing Data Mine
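Returning to the Deep MD5 Matching slide, the comparison step can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each website is represented by the set of MD5 digests of its downloaded content files, and the Kulczynski 2 coefficient is computed as the mean of the two overlap ratios, 0.5 * (|A ∩ B| / |A| + |A ∩ B| / |B|). The function names, the handling of empty sets, and the sample file contents are hypothetical; the slides do not give a decision threshold, so none is included.

```python
import hashlib


def md5_hex(content: bytes) -> str:
    """MD5 digest of one downloaded content file, as a hex string."""
    return hashlib.md5(content).hexdigest()


def site_signature(files):
    """Represent a website by the set of MD5 digests of its content files."""
    return {md5_hex(content) for content in files}


def kulczynski2(sig_a, sig_b):
    """Kulczynski 2 coefficient between two sets of file digests.

    Returns 0.0 when either set is empty (an assumption; the coefficient
    is undefined in that case).
    """
    if not sig_a or not sig_b:
        return 0.0
    shared = len(sig_a & sig_b)
    return 0.5 * (shared / len(sig_a) + shared / len(sig_b))


if __name__ == "__main__":
    # Made-up file contents standing in for downloaded page resources
    site_a = site_signature([b"logo", b"style", b"form"])
    site_b = site_signature([b"logo", b"style", b"other", b"extra"])
    print(kulczynski2(site_a, site_b))
```

Unlike matching a single MD5 of the main page, this score degrades gradually when only some files are changed, which is why varying one file is less effective at evading it.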