Improving Spam Detection
Based on Structural Similarity
By Luiz H. Gomes, Fernando D. O. Castro, Rodrigo B. Almeida,
Luis M. A. Bettencourt, Virgílio A. F. Almeida, Jussara M. Almeida
Presented at Steps to Reducing Unwanted Traffic on the Internet
Workshop, 2005
Presented by Jared Bott
Outline
Overview
Concepts
Detecting Spam
Experimental Results
Analysis of Paper
Overview
New algorithm to detect spam messages
Uses email information that is harder to
change
Works in conjunction with another spam
classifier
E.g., SpamAssassin
Fewer false positives than the methods it
was compared against
Spam Detection Problem
Spam detection algorithms use some part
of emails to determine if a message is
spam
Spammers change messages so that they do
not meet detection criteria for spam
Message text, usernames, domains, subjects,
etc. are all very easy to change
Key Idea
The lists that spammers and legitimate
users send messages to and from can be
used as the identifiers of classes of email
traffic.
The lists of addresses spammers send to are
unlikely to be similar to those of legitimate
users.
Lists don’t change that often
Using Lists
A user is not just an email address. It can
be a domain, etc.
Represent email user as a vector in multidimensional conceptual space created
with all possible contacts
Each sender and each recipient has their own
vector
Model relationship between senders and
recipients
Constructing Vectors
If there is at least one email sent from
sender si to recipient rn, then the value in
si’s vector’s nth dimension is 1.
Otherwise, that value is 0.
If there is at least one email received by
recipient ri from sender sn, the value in ri’s
vector’s nth dimension is 1. Otherwise it is
0.
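The construction rule above can be sketched in a few lines of Python. This is only an illustration: the function name build_vectors, the integer user indices, and the (sender, recipient) log format are assumptions, not taken from the paper.

```python
from typing import Iterable, List, Tuple


def build_vectors(
    log: Iterable[Tuple[int, int]], n_users: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Build binary sender and recipient vectors from (sender, recipient) pairs.

    Assumption: users are indexed 0..n_users-1 in one shared contact space.
    """
    send = [[0] * n_users for _ in range(n_users)]
    recv = [[0] * n_users for _ in range(n_users)]
    for s, r in log:
        send[s][r] = 1  # at least one email from s to r
        recv[r][s] = 1  # at least one email received by r from s
    return send, recv


# Hypothetical log: user 0 mails users 1 and 2, user 1 mails users 0 and 2.
# The repeated (0, 1) entry shows that sending twice still yields a 1.
log = [(0, 1), (0, 2), (1, 0), (1, 2), (0, 1)]
send, recv = build_vectors(log, 3)
```

With this log, user 2 never sends anything, so its sender vector stays all zeros while its recipient vector marks both other users.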
Example Vectors
User 1: S[0,1,1], R[0,1,0]
User 2: S[1,0,1], R[1,0,0]
User 3: S[0,0,0], R[1,1,0]
Similarity Between Senders
Similarity between senders si and sk is the
cosine of the angle between their vectors
cos(si, sk)
0 means no shared contacts
1 means identical contact lists
Among legitimate senders, a 1 means that the senders
operate in the same social group.
Among spammers, a 1 means that the senders use
the same list or are the same person.
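A minimal sketch of the cosine measure for binary contact vectors follows. Returning 0 when one vector is all zeros is an assumption made here to avoid division by zero; the slides do not say how that case is handled.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two contact vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a user with no contacts is similar to nobody
    return dot / (norm_a * norm_b)


# Two senders that share one of their two contacts.
partial = cosine([0, 1, 1], [1, 0, 1])
# Two senders with identical contact lists.
identical = cosine([1, 1, 0], [1, 1, 0])
# Two senders with no contact in common.
disjoint = cosine([1, 0, 0], [0, 1, 0])
```

The three calls give roughly 0.5, 1, and exactly 0, matching the endpoints described above.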
Grouping Users Into Clusters
Group users with similar vectors
Users with similar vectors are likely to have
related roles, e.g. spammer or legitimate user
Each cluster is represented by a vector
This vector is the sum of all its component
users’ vectors
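As a small illustration, the cluster vector is just the element-wise sum of its members' vectors. The helper name cluster_vector is hypothetical, and how users are assigned to clusters in the first place is not covered here.

```python
from typing import List, Sequence


def cluster_vector(member_vectors: Sequence[Sequence[int]]) -> List[int]:
    """Element-wise sum of the vectors of all users in one cluster."""
    return [sum(column) for column in zip(*member_vectors)]


# Two hypothetical members that share their third contact.
sc = cluster_vector([[0, 1, 1], [1, 0, 1]])
```

Here sc is [1, 1, 2]: the shared contact is counted twice, so contacts common to many members weigh more in the cluster vector.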
Similarity Between a User and a Cluster
Similarity is derived from the user-to-user similarity
equation
If sender si is a member of cluster sck, then the
similarity is cos(sck – si, si).
If sender si is not a member of cluster sck, then the
similarity is cos(sck, si).
Similarity between a user and a cluster will
change over time
When computing similarity and reclassifying a member, first
remove the user’s vector from the cluster’s vector
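The two cases can be sketched as one function. The names user_cluster_similarity and is_member are illustrative, and the zero-vector guard in cosine is an assumption rather than something stated in the slides.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def user_cluster_similarity(
    user_vec: Sequence[float], cluster_vec: Sequence[float], is_member: bool
) -> float:
    """cos(sck - si, si) for members, cos(sck, si) for non-members."""
    if is_member:
        # Remove the user's own contribution so it is not compared with itself.
        cluster_vec = [c - u for c, u in zip(cluster_vec, user_vec)]
    return cosine(cluster_vec, user_vec)


# Cluster [1, 1, 2] is the sum of members [0, 1, 1] and [1, 0, 1].
as_member = user_cluster_similarity([0, 1, 1], [1, 1, 2], is_member=True)
as_outsider = user_cluster_similarity([0, 1, 1], [1, 1, 2], is_member=False)
```

Subtracting the member's own vector lowers the score from about 0.87 to about 0.5 in this example, which shows why a user should not be compared against its own contribution.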
Detecting Spam
Two probabilities to compute
Ps(m) – Probability of an email m being sent by
a spammer
Pr(m) – Probability of an email m being
addressed to users that receive spam
Detecting Spam
When an email arrives, classify it using some
other (auxiliary) method
Find the cluster (sc) the email’s sender belongs
to
If many users in the cluster send messages that the
auxiliary method classifies as spam, the probability that
any user in that cluster sends spam is high
Update sc’s spam probability
Ps(m) ← sc’s spam probability
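One way to picture the update step is shown below. The slides do not say how a cluster's spam probability is computed; this sketch assumes it is the running fraction of the cluster's messages that the auxiliary classifier labeled as spam. ClusterStats and its fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    """Running counts for one sender cluster (assumed bookkeeping)."""

    spam: int = 0
    total: int = 0

    def update(self, aux_says_spam: bool) -> float:
        """Record one message and return the cluster's new spam probability."""
        self.total += 1
        if aux_says_spam:
            self.spam += 1
        # Assumption: probability = fraction of messages labeled spam so far.
        return self.spam / self.total


sc = ClusterStats()
first = sc.update(True)    # auxiliary method says spam
second = sc.update(False)  # auxiliary method says legitimate
ps_m = sc.update(True)     # Ps(m) for the third message
```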
Detecting Spam
For all recipients of the email, find the
cluster (rc) each one belongs to
Update the spam probability for each
cluster
Pr(m) ← Pr(m) + spam probability of each
rc
Pr(m) ← Pr(m)/number of recipients
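The recipient side is an average over the recipients' clusters, which can be sketched as follows. The function name and the choice to reject an empty recipient list are assumptions made for this illustration.

```python
from typing import Sequence


def recipient_probability(cluster_probs: Sequence[float]) -> float:
    """Average the spam probabilities of the clusters of all recipients.

    cluster_probs holds one value per recipient: the spam probability of
    the recipient cluster (rc) that recipient belongs to.
    """
    if not cluster_probs:
        raise ValueError("an email needs at least one recipient")
    pr = 0.0
    for p in cluster_probs:
        pr += p  # Pr(m) <- Pr(m) + spam probability of each rc
    return pr / len(cluster_probs)  # Pr(m) <- Pr(m) / number of recipients


# Hypothetical email with three recipients in three different clusters.
pr_m = recipient_probability([0.9, 0.5, 0.1])
```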
Detecting Spam
Compute a spam rank for the email based
upon Pr(m) and Ps(m)
If the spam rank is above some threshold
(ω), label it as spam
If the spam rank is below 1 - ω, label it as
legitimate
Otherwise label the email as the auxiliary
method’s classification
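The three-way decision can be sketched as below. The slides do not give the formula for the spam rank, so this example uses the plain average of Ps(m) and Pr(m) as a stand-in; the default ω of 0.85 is also only an illustrative value.

```python
def classify(ps: float, pr: float, aux_label: str, omega: float = 0.85) -> str:
    """Label an email from its sender and recipient probabilities.

    Assumption: the spam rank is the mean of Ps(m) and Pr(m); the actual
    combination used in the paper may differ.
    """
    rank = (ps + pr) / 2
    if rank > omega:
        return "spam"
    if rank < 1 - omega:
        return "legitimate"
    # In between the two thresholds, defer to the auxiliary classifier.
    return aux_label


clear_spam = classify(0.9, 0.9, aux_label="legitimate")
clear_ham = classify(0.05, 0.1, aux_label="spam")
undecided = classify(0.5, 0.5, aux_label="spam")
```

Note that the structural evidence overrides the auxiliary label only at the extremes; a larger ω widens the middle band in which the auxiliary classification is kept.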
Experimental Results
Tested on a log of eight days of email from
a large Brazilian university
Tested on a 2.8 GHz Pentium 4 with 512
MB RAM
Able to classify 20 messages per second
Faster than the average peak message arrival
rate
Results
Measure                   Non-Spam   Spam      Aggregate
# of emails               191,417    173,584   365,001
Size of emails            11.3 GB    1.2 GB    12.5 GB
# of distinct senders     12,338     19,567    27,734
# of distinct recipients  22,762     27,926    38,875
Results
Manually checked false positives to see if they
were spam or not
Auxiliary algorithm had more false positives
Algorithm                % of Misclassifications
Original classification  60.33%
Their approach           39.67%
Strengths
Fewer false positives than SpamAssassin
Low-cost
Works with message information that
doesn’t change that much
Weaknesses
Needs an additional message classifier,
e.g. SpamAssassin
Algorithm requires manual tuning
Improvements
Time correlation of similar addresses
Collaborative filtering based upon user
feedback