Lexical Feature Based Phishing URL Detection Using Online Learning
Reporter: Jing Chiu
Advisor: Yuh-Jye Lee
Email: [email protected]
2011/3/17
Data Mining and Machine Learning Lab.

Paper Information
- Authors:
  - Aaron Blum (University of Alabama, Birmingham)
  - Brad Wardman (University of Alabama, Birmingham)
  - Thamar Solorio (University of Alabama, Birmingham)
- Source:
  - 3rd ACM Workshop on Artificial Intelligence and Security (AISec), 2010

Outline
- Introduction
- Related Work
- Approach
- Data
- Evaluation
- Conclusion

Introduction
- Phishing
  - A cybercrime carried out through spammed emails and fraudulent websites
  - Entices victims into providing sensitive information
  - The information is used to steal identities or gain access to money
- Characteristics
  - Highly dynamic environment
    - The model needs to be updated frequently
- New ideas
  - Combine online learning with a content-inspection-based approach
  - Train the model largely on lexical features only (without host-based features)
  - Provide results showing that URL-inspection-based detection performs as well as content-inspection-based detection

Related Work
- Content-based phishing URL detection
  - Uses the similarity between content files to detect phishing websites
- Purely URL-based malicious URL detection
  - Uses host information and URL lexical features with online learning algorithms
- PhishNet
  - Extends the usability of blacklists
- Domain blacklisting
  - Expands the blacklist using DNS zone file data and WHOIS information

Approach
- Feature extraction
  - Delimiters: "/", "?", ".", "=" and "_"
  - Bigram combination
  - Lexical feature groups
- Learning algorithm
  - Confidence-Weighted (CW) algorithm
    - Updates the model with a different weight for each feature, depending on how often the feature has occurred

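The feature extraction step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that bigrams are formed over adjacent tokens, and the function names and the `uni:`/`bi:` feature prefixes are hypothetical.

```python
import re

# Delimiters listed on the slide: "/", "?", ".", "=" and "_"
DELIMITERS = re.compile(r"[/?.=_]")


def tokenize_url(url: str) -> list[str]:
    """Split a URL on the slide's delimiters and drop empty tokens."""
    return [tok for tok in DELIMITERS.split(url) if tok]


def bigrams(tokens: list[str]) -> list[str]:
    """Pair each token with its successor (adjacent-token bigrams are an assumption)."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def lexical_features(url: str) -> set[str]:
    """Binary bag of unigram and bigram features for one URL."""
    tokens = tokenize_url(url)
    return {f"uni:{t}" for t in tokens} | {f"bi:{b}" for b in bigrams(tokens)}


# Hypothetical URL used only for illustration.
features = lexical_features("paypal.example.com/login_update?id=42")
```

Each URL becomes a sparse set of string features, which suits an online learner that only touches the weights of features present in the current example.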
Approach (cont.)
- MD5 Matching
  - Uses the files' MD5 checksums to check file similarity
  - Easy to evade (by varying the content)
  - Examples
- Deep MD5 Matching
  - Downloads all the associated content files
  - Compares the similarity between two websites' content files using the Kulczynski 2 coefficient

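A small sketch may help contrast the two matching schemes. It assumes each website is represented by the set of MD5 checksums of its downloaded files, and uses the standard Kulczynski 2 definition, the mean of the two overlap ratios |A∩B|/|A| and |A∩B|/|B|. The function names and sample byte strings are hypothetical.

```python
import hashlib


def md5_hex(content: bytes) -> str:
    """MD5 checksum of one downloaded file."""
    return hashlib.md5(content).hexdigest()


def kulczynski2(a: set[str], b: set[str]) -> float:
    """Kulczynski 2 coefficient between two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return 0.5 * (shared / len(a) + shared / len(b))


def site_similarity(files_a: list[bytes], files_b: list[bytes]) -> float:
    """Deep-MD5-style similarity: compare the checksum sets of two sites."""
    return kulczynski2({md5_hex(f) for f in files_a}, {md5_hex(f) for f in files_b})


# Illustrative data: the candidate changed its main page but reuses two other files.
known_phish = [b"<html>login</html>", b"logo-bytes", b"style-bytes"]
candidate = [b"<html>login v2</html>", b"logo-bytes", b"style-bytes"]

main_page_match = md5_hex(known_phish[0]) == md5_hex(candidate[0])
score = site_similarity(known_phish, candidate)
```

Here the main pages differ, so a plain MD5 comparison of the main page fails, while the set-based score still reflects the two shared files out of three on each side.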
Data
- Data Source
  - UAB Phishing Data Mine
    - Collected over two and a half years
    - Benign URLs may look "phishy" (e.g.)
    - 9,506 unique domains
    - 25,203 URLs (6,114 malicious)
  - Cyveillance
    - 18,990 unique domains
    - 34,234 URLs (all malicious)
  - All feeds are fully de-duplicated
- Datasets
  - UAB Feeds
  - Cyveillance full
  - Cyveillance abridged
  - Mixed

Data (cont.)
  [Figure: percentage of total URLs vs. individual domains]

Evaluation
- Experiment setting
  - Training and testing were conducted on daily batches
  - Training was initially conducted on UAB data
  - The model is updated from a daily URL blacklist/whitelist feed
  - False positive and false negative error rates were computed after every prediction

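The experiment setting above can be sketched as a daily predict-then-update loop. This is only an illustration under several assumptions: the learner is a diagonal AROW-style update swapped in as a simpler relative of the CW algorithm (it is not the exact CW update), each day's batch is scored before the model is updated with it, and the class name, function name and toy feature sets are hypothetical.

```python
from collections import defaultdict


class DiagonalArow:
    """AROW-style online learner with one variance per feature (stand-in for CW)."""

    def __init__(self, r: float = 1.0):
        self.r = r                               # regularization constant (assumed value)
        self.mean = defaultdict(float)           # weight per feature
        self.var = defaultdict(lambda: 1.0)      # uncertainty per feature

    def score(self, feats: set[str]) -> float:
        return sum(self.mean[f] for f in feats)

    def predict(self, feats: set[str]) -> int:
        return 1 if self.score(feats) > 0 else -1    # +1 = phishing, -1 = benign

    def update(self, feats: set[str], label: int) -> None:
        loss = max(0.0, 1.0 - label * self.score(feats))
        if loss == 0.0:
            return
        beta = 1.0 / (sum(self.var[f] for f in feats) + self.r)
        alpha = loss * beta
        for f in feats:
            v = self.var[f]
            self.mean[f] += alpha * label * v    # uncertain features move more
            self.var[f] = v - beta * v * v       # confidence grows with each update


def run_daily(model: DiagonalArow, days: list) -> tuple[float, float]:
    """Score each day's batch, count errors, then update the model with that batch."""
    fp = fn = pos = neg = 0
    for batch in days:
        for feats, label in batch:
            pred = model.predict(feats)
            if label == 1:
                pos += 1
                if pred == -1:
                    fn += 1
            else:
                neg += 1
                if pred == 1:
                    fp += 1
        for feats, label in batch:
            model.update(feats, label)
    return fp / max(neg, 1), fn / max(pos, 1)


# Toy two-day feed with made-up feature sets.
days = [
    [({"uni:paypal", "uni:login"}, 1), ({"uni:news", "uni:sports"}, -1)],
    [({"uni:paypal", "uni:verify"}, 1), ({"uni:news", "uni:weather"}, -1)],
]
fp_rate, fn_rate = run_daily(DiagonalArow(), days)
```

In this toy run the first day's phishing URL is missed because the model starts empty, while the second day's URLs are classified correctly from the shared tokens learned on day one.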
Evaluation (cont.)
  [Figures: evaluation results (three slides)]

Conclusion
- Learning on lexical features with the CW algorithm provides robust performance
- High-quality, diverse training data can achieve an accuracy higher than 97%
- For the proposed system
  - Training data can be collected from any blacklist
  - Easy to implement, with robust performance

Thanks for your attention

Q&A?
Lexical Feature Group
URLs including the recipient’s email
Data in UAB Phishing Data Mine