Bayesian Filtering
CMPE 209, SJSU, Spring 2007
Instructor: Richard Sinn
Team “Glyph”
Debbie Bridygham
Pravesvuth Uparanukraw
Ronald Ko
Rihui Luo
Thuong Luu
Table of Contents
I. Introduction
II. Bayes’s Theorem
Naïve Bayes Classifier
III. Bayesian Spam Filter Process
IV. Advantages
V. Disadvantages
VI. Bayesian Poisoning
VII. Conclusion
VIII. References
I. Introduction
Spam has long been a problem for email users. It costs businesses both time and money. Studies show
that 50% of all current email is spam, and because spam keeps rising, this share is predicted to grow
in a short time. Spam forces end users to spend time sorting through their mail to work out which
messages are legitimate and to delete the junk. Many email programs today include some form of junk
mail filtering. The simplest form keeps a list of words or email addresses and searches each message
for them to identify spam. While this method is quick and easy, it is static: spammers can easily adapt
and find ways to circumvent it. Because spammers can change their tactics so easily, we need a form of
email filtering that adapts as well. Bayesian filtering is one such technique, and one of the most tried
and common ways of filtering out spam. Because of its effectiveness, many third-party email filtering
programs rely on it as the ‘brains’ behind their filtering. In this paper, we explore the logic and inner
workings of the Bayesian filtering method: how it works and what this type of filtering achieves.
II. Bayes’s Theorem
Bayesian filtering is based on Bayes’ theorem, also known as Bayes’ law. The theorem is named after
Thomas Bayes, a minister with an interest in mathematics. It was published in an essay in the
Philosophical Transactions of the Royal Society of London in 1763.
Bayes’ theorem is not a ‘magical’ or special technique, as it may sound. In fact, it rests on simple
math and statistics. The following formula is Bayes’ theorem in its well-known short form. The idea is
that the formula determines the probability of one event given the probabilities of other, related
events.
$$p(E_i \mid F) = \frac{p(F \mid E_i)\, p(E_i)}{\sum_j p(F \mid E_j)\, p(E_j)}$$
To make this formula easier to understand in the context of spam, it can be rewritten as follows:
$$\Pr(\text{spam} \mid \text{words}) = \frac{\Pr(\text{words} \mid \text{spam})\, \Pr(\text{spam})}{\Pr(\text{words})}$$
“Bayes’s theorem, in the context of spam, says that the probability that an email is spam, given that it
has certain words in it, is equal to the probability of finding those certain words in spam email, times the
probability that any email is spam, divided by the probability of finding those words in any email.” (1)
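The quoted statement can be checked numerically. The following is a minimal sketch, not taken from the paper: the function name `spam_posterior` and the two word likelihoods (0.20 and 0.01) are hypothetical, the prior of 0.50 reuses the 50% spam share mentioned in the introduction, and Pr(words) is expanded with the law of total probability.

```python
def spam_posterior(p_words_given_spam: float,
                   p_spam: float,
                   p_words_given_ham: float) -> float:
    """Return Pr(spam | words) using Bayes' theorem."""
    p_ham = 1.0 - p_spam
    # Pr(words) over all email, spam or not (law of total probability).
    p_words = p_words_given_spam * p_spam + p_words_given_ham * p_ham
    if p_words == 0:
        raise ValueError("the words never occur, so the posterior is undefined")
    return p_words_given_spam * p_spam / p_words


# Hypothetical inputs: the words occur in 20% of spam and 1% of legitimate mail.
posterior = spam_posterior(p_words_given_spam=0.20, p_spam=0.50, p_words_given_ham=0.01)
print(round(posterior, 4))
```

With these assumed inputs the posterior is 0.1 / 0.105, so words that are twenty times more common in spam push the estimate to roughly 95%.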
Naïve Bayes Classifier
A naïve Bayes classifier is a probabilistic classifier: a simple Bayesian network used for
classification (2). The term naïve refers to the independence assumption it makes: each word is treated
as occurring independently of the other words in the message.
Any Bayes classifier needs to be trained in order to work well, but a naïve Bayes classifier requires
only a small amount of training to perform classification. This is an advantage for email systems,
because users are likely to expect the spam filter to work without much effort on their part.
The following example uses Bayes’ theorem to classify a document as spam or not spam.
Consider a document $D$ that contains the words $w_i$. The document is classified as $S$ if it is spam
and as $\neg S$ if it is not. Under the independence assumption, we can write:
$$p(D \mid S) = \prod_i p(w_i \mid S)$$
and
$$p(D \mid \neg S) = \prod_i p(w_i \mid \neg S)$$
From above, we can write:
$$p(S \mid D) = \frac{p(S)}{p(D)} \prod_i p(w_i \mid S)$$
and
$$p(\neg S \mid D) = \frac{p(\neg S)}{p(D)} \prod_i p(w_i \mid \neg S)$$
Dividing one by the other gives:
$$\frac{p(S \mid D)}{p(\neg S \mid D)} = \frac{p(S)}{p(\neg S)} \prod_i \frac{p(w_i \mid S)}{p(w_i \mid \neg S)}$$
Since $p(S \mid D) + p(\neg S \mid D) = 1$, the quotient $\frac{p(S \mid D)}{p(\neg S \mid D)}$, called
the likelihood ratio, is enough to decide between the two classes. Taking the logarithm yields:
$$\ln \frac{p(S \mid D)}{p(\neg S \mid D)} = \ln \frac{p(S)}{p(\neg S)} + \sum_i \ln \frac{p(w_i \mid S)}{p(w_i \mid \neg S)}$$
Where:
$p(w_i \mid S)$ is the probability that the word $w_i$ appears in a spam email.
$p(w_i \mid \neg S)$ is the probability that the word $w_i$ appears in a non-spam email.
$p(S \mid D)$ is the probability that the email $D$ is spam.
$p(\neg S \mid D)$ is the probability that the email $D$ is not spam.
Thus the likelihood ratio can be calculated from the right side of the equation using the word
probabilities stored in a database. If $\ln \frac{p(S \mid D)}{p(\neg S \mid D)} < 0$, then
$p(S \mid D) < p(\neg S \mid D)$ and we can assume the email is not spam; if it is greater than 0, the
email is classified as spam (3).
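The log form of the ratio lends itself to a short sketch. The code below is illustrative only: the function name `log_likelihood_ratio`, the prior of 0.5, and both word-probability tables are hypothetical stand-ins for the database mentioned above. It assumes every stored probability is strictly greater than zero, and it skips words that are missing from either table.

```python
import math


def log_likelihood_ratio(words, p_word_spam, p_word_ham, p_spam=0.5):
    """Return ln(p(S|D) / p(not S|D)) under the naive independence assumption."""
    # Prior term: ln(p(S) / p(not S)).
    score = math.log(p_spam / (1.0 - p_spam))
    for word in words:
        # Only words seen in both tables contribute (an assumption of this sketch).
        if word in p_word_spam and word in p_word_ham:
            score += math.log(p_word_spam[word] / p_word_ham[word])
    return score


# Hypothetical per-word probabilities p(w|S) and p(w|not S).
p_word_spam = {"mortgage": 0.13, "free": 0.20, "meeting": 0.01}
p_word_ham = {"mortgage": 0.02, "free": 0.05, "meeting": 0.10}

spam_like = log_likelihood_ratio(["free", "mortgage"], p_word_spam, p_word_ham)
ham_like = log_likelihood_ratio(["meeting"], p_word_spam, p_word_ham)
print(spam_like > 0, ham_like < 0)
```

Working with sums of logarithms rather than products of many small probabilities also avoids floating-point underflow on long messages.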
III. Bayesian Spam Filter Process
Before mail can be filtered with the Bayesian technique, the user needs to build a database of tokens
(words) collected from a sample of spam mail and of valid mail (referred to as ‘ham’).
A probability value is then assigned to each token, based on how often that token occurs in spam as
opposed to legitimate mail (ham). This is done by analyzing the user's outgoing mail and known spam:
all the tokens in both sets are analyzed to generate the probability that a message containing a
particular token is spam.
For instance, if the word "mortgage" occurs in 400 of 3,000 spam mails and in 5 of 300 legitimate
emails, then its spam probability would be 0.8889 = [400/3000] / [5/300 + 400/3000].
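The "mortgage" arithmetic can be reproduced with a few lines of code. This is a minimal sketch; the function name `token_spam_probability` and its parameter names are hypothetical, while the counts are the ones given in the text.

```python
def token_spam_probability(spam_hits: int, spam_total: int,
                           ham_hits: int, ham_total: int) -> float:
    """Spam probability of a token from its frequency in spam and in ham."""
    spam_freq = spam_hits / spam_total   # share of spam mails containing the token
    ham_freq = ham_hits / ham_total      # share of legitimate mails containing it
    if spam_freq + ham_freq == 0:
        raise ValueError("token never seen in either sample")
    return spam_freq / (spam_freq + ham_freq)


# "mortgage": 400 of 3,000 spam mails and 5 of 300 legitimate mails.
p_mortgage = token_spam_probability(400, 3000, 5, 300)
print(round(p_mortgage, 4))  # 0.8889
```

Using frequencies rather than raw counts keeps the result meaningful even though the spam sample here is ten times larger than the ham sample.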
It should be noted that the analysis of ham mail should be tailored to a particular organization. For
example, a financial institution might use the word "mortgage" many times and would get a lot of false
positives if using a general anti-spam rule set. On the other hand, the Bayesian filter, if tailored to that
financial institution through an initial training period, records the institution's valid outgoing mails and
has a much better spam detection rate and a lower false positive rate.
Besides ham mail, the Bayesian filter also needs to be trained on spam data. This spam data should
include a large sample of known spam and be constantly updated. This is to ensure that the Bayesian
filter is aware of the latest spam tricks which results in a high spam detection rate.
Once the ham and spam databases have been created, the word probabilities can be calculated and the
filter is ready for use.
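The training step described above might look like the following sketch. Everything named here is an assumption made for illustration: the tokenizer (lowercase runs of letters and digits), the function names `tokenize_message` and `train`, and the four sample messages. Each token is counted at most once per message, matching the "occurs in N of M mails" wording used earlier.

```python
import re
from collections import Counter


def tokenize_message(text: str) -> set:
    """Split a message into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def train(spam_messages, ham_messages) -> dict:
    """Build a token -> spam probability table from labelled samples."""
    if not spam_messages or not ham_messages:
        raise ValueError("both a spam sample and a ham sample are required")
    spam_counts, ham_counts = Counter(), Counter()
    for message in spam_messages:
        spam_counts.update(tokenize_message(message))
    for message in ham_messages:
        ham_counts.update(tokenize_message(message))

    token_probs = {}
    for token in set(spam_counts) | set(ham_counts):
        spam_freq = spam_counts[token] / len(spam_messages)
        ham_freq = ham_counts[token] / len(ham_messages)
        token_probs[token] = spam_freq / (spam_freq + ham_freq)
    return token_probs


# Hypothetical training samples.
spam_sample = ["Cheap mortgage rates, apply now", "Free mortgage offer"]
ham_sample = ["Meeting notes attached", "Your mortgage statement is attached"]
token_probs = train(spam_sample, ham_sample)
print(token_probs["mortgage"], token_probs["free"], token_probs["meeting"])
```

Tokens seen only in spam come out at exactly 1.0 and tokens seen only in ham at 0.0; a practical filter would likely soften these extremes, which this sketch does not attempt.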
When a new mail arrives, it is broken down into tokens and the relevant tokens that are significant in
identifying whether the mail is spam or not are singled out. From these tokens, the Bayesian filter
calculates the probability of the new message being spam or not. If the probability is greater than a
predefined threshold, say 0.9, the message is classified as spam.
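The classification step can be sketched as follows. The paper does not say how the per-token probabilities are combined, so this example assumes the common naive Bayes combination p = (p1 p2 ... pn) / (p1 p2 ... pn + (1 - p1)(1 - p2) ... (1 - pn)). The function names, the limit of 15 significant tokens, the clamping bounds, and the probability table are all hypothetical; only the 0.9 threshold comes from the text. It requires Python 3.8 or later for `math.prod`.

```python
import re
from math import prod


def clamp(p: float, low: float = 0.01, high: float = 0.99) -> float:
    """Keep probabilities away from 0 and 1 so neither product collapses to zero."""
    return min(max(p, low), high)


def classify(message: str, token_probs: dict,
             threshold: float = 0.9, top_n: int = 15):
    """Return (spam score, is_spam) for a message."""
    tokens = set(re.findall(r"[a-z0-9]+", message.lower()))
    known = [clamp(token_probs[t]) for t in tokens if t in token_probs]
    # "Significant" tokens are taken to be those farthest from the neutral 0.5.
    significant = sorted(known, key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    if not significant:
        return 0.5, False  # nothing known about this message
    spam_side = prod(significant)
    ham_side = prod(1.0 - p for p in significant)
    score = spam_side / (spam_side + ham_side)
    return score, score > threshold


# Hypothetical token database.
token_db = {"mortgage": 0.8889, "free": 0.95, "meeting": 0.05, "attached": 0.10}
spam_score, spam_flag = classify("Free mortgage offer", token_db)
ham_score, ham_flag = classify("Meeting notes attached", token_db)
print(spam_flag, ham_flag)
```

Raising the threshold trades missed spam for fewer false positives, which is usually the safer direction for a mail filter.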
IV. Advantages
1. The Bayesian method takes the whole message (including the header) into account. It recognizes
keywords that identify spam as well as keywords that identify valid mail, and from the most
significant of these it computes a probability that the message is spam. Bayesian filtering is an
intelligent approach because it examines all aspects of a message, as opposed to keyword checking,
which classifies a mail as spam on the basis of a single word.
2. A Bayesian filter is self-adapting. Token probabilities are constantly updated. The filter
learns and evolves from new spam and from new valid incoming and outgoing mail. Learning from
outgoing mail greatly reduces false positives.
3. The Bayesian technique is user sensitive. It learns the email habits of the organization and
therefore detects spam more accurately, with a significantly lower false positive rate.
4. The Bayesian method is multi-lingual and international. A Bayesian filter can be used for any
language. It can also be configured to take into account language variations or the diverse
usage of certain words in different regions.
5. Difficult to fool, as opposed to a keyword filter. A spammer who wants to trick the Bayesian
filter can either use fewer words that usually indicate spam or more words that generally
indicate valid mail. Doing the latter is very difficult because the spammer would have to know the
email profile of each recipient.
V. Disadvantages
1. Significantly increased system resource usage to decide whether a message is spam. What used
to take one pass now takes as many as 10-15 passes.
2. Cannot identify cloaked spam, which is generally the most vile spam, such as "v*i(a)g-r-a" or
bogus HTML tags, or more sophisticated cloaking.
3. Still depends on clearly visible and obvious keywords/tokens.
4. No standard method of determining why a particular message was caught by the filter, which
makes it very difficult to tune the filter intelligently for optimal spam recognition.
5. Blind "training" and retraining of the Bayesian filter can produce unpredictable results and
negatively impact the filter's ability to correctly identify future spam.
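The cloaking problem in item 2 can be illustrated with a simple tokenizer. This is a sketch under stated assumptions: the function name `split_tokens`, the tokenizer rule (lowercase runs of letters and digits), and the one-word set of known spam tokens are hypothetical and are not how any particular filter works.

```python
import re


def split_tokens(text: str) -> list:
    """Split text into lowercase alphanumeric tokens, in order of appearance."""
    return re.findall(r"[a-z0-9]+", text.lower())


known_spam_tokens = {"viagra"}  # hypothetical database entry

plain_tokens = split_tokens("buy viagra now")
cloaked_tokens = split_tokens("buy v*i(a)g-r-a now")

plain_hits = [t for t in plain_tokens if t in known_spam_tokens]
cloaked_hits = [t for t in cloaked_tokens if t in known_spam_tokens]

print(plain_hits)      # the plain spelling matches the database
print(cloaked_tokens)  # the cloaked spelling breaks into single letters
print(cloaked_hits)    # so nothing matches
```

Because the cloaked word never reaches the filter as a single token, its learned spam probability is never consulted, which is the dependence on visible keywords noted in item 3.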
VI. Bayesian Poisoning
Bayesian poisoning is a technique used by spammers to degrade the effectiveness of Bayesian spam
filters. In Bayesian spam filtering, the appearance of innocent words is as important as the
appearance of spam words. Spam words in a spam email raise its spam probability and improve accuracy,
but innocent words in the same email lower its spam probability and reduce accuracy.
Therefore, if spammers learn which words are innocent, they can inject them into spam emails to
reduce the likelihood of being identified as spam.
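The effect of injected innocent words can be shown with a small calculation. This sketch assumes the same naive Bayes combination of per-token probabilities, spam product divided by the sum of the spam product and the ham product, which the paper does not spell out. The function name `combined_spam_score` and the token probabilities are hypothetical, and the 0.9 threshold is the example value from the filter process section. It requires Python 3.8 or later for `math.prod`.

```python
from math import prod


def combined_spam_score(probabilities) -> float:
    """Combine per-token spam probabilities into one message score."""
    spam_side = prod(probabilities)
    ham_side = prod(1.0 - p for p in probabilities)
    return spam_side / (spam_side + ham_side)


# Hypothetical token probabilities, all strictly between 0 and 1.
spam_word_probs = [0.95, 0.8889]       # e.g. "free", "mortgage"
innocent_word_probs = [0.05, 0.10]     # e.g. "meeting", "attached"

original_score = combined_spam_score(spam_word_probs)
poisoned_score = combined_spam_score(spam_word_probs + innocent_word_probs)

print(original_score > 0.9)   # the unmodified spam is caught
print(poisoned_score > 0.9)   # the poisoned copy falls below the threshold
```

The attack only works if the injected words really are innocent for that recipient, which is why a filter trained on each user's own mail is harder to poison.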
There are two types of attack by spammers, Type I and Type II. A Type I attack attempts to get the
spam delivered to the user, and a Type II attack attempts to turn previously innocent words into spam
words in the Bayesian database. In other words, a Type I attack tries to bypass the filter’s checking
mechanism, while a Type II attack attempts to confuse the filter.
There are also two types of Bayesian poisoning, passive and active. In passive poisoning, the spammer
blindly sends messages without getting any feedback about the outcome. In active poisoning, the
spammer adds random words to the message and uses web bugs to track whether the message is actually
delivered.
The outcome of Bayesian poisoning depends on the actual filter software, but in general active
poisoning is more effective than its passive counterpart. Adding random words in a passive attack is
inefficient, while adding innocent words in an active attack is quite effective.
There are many strategies for defending against Bayesian poisoning, and the most important measure is
to prevent active attacks. This can be achieved by blocking the spammer from receiving any feedback.
VII. Conclusion
Bayesian filtering is a great improvement over static keyword-based filtering techniques. Based on
Bayes’s theorem, and unlike its static counterpart, a Bayesian filter can adapt itself to spammers’
latest tactics, which makes it very difficult for spammers to get around it. It can also be trained to
suit an individual’s unique situation, which makes it extremely unlikely that spammers can come up with
a universal spam scheme. The result is a very low false positive rate.
Spammers have developed various methods to defeat Bayesian filters. These so-called Bayesian
poisoning techniques use both passive and active methods to try to get around the filter’s
mechanism. Although they partially work in some situations, they are largely unsuccessful.
With such a high spam rate in everybody’s email inbox, a Bayesian filter provides a great way to
protect your inbox from being flooded by spam. The filter can adapt to changes and to your individual
situation. With a little initial help from the user, it can produce a very low false positive rate. The
Bayesian filter has proved to be a very effective spam-fighting tool.
VIII. References
1. Answers.com. Bayes' theorem. [Online] [Cited: April 9, 2007.] http://www.answers.com/topic/bayestheorem.
2. A Bayesian Approach to Filtering Junk E-Mail. Sahami, Mehran, et al. Madison, Wisconsin : AAAI
Technical Report WS-98-05, 1998. AAAI-98 Workshop on Learning for Text Categorization.
3. Wikipedia. Naive Bayes classifier. Wikipedia. [Online] March 23, 2007. [Cited: April 8, 2007.]
http://en.wikipedia.org/wiki/Naive_bayes_classifier.
4. Wikipedia. Bayesian spam filtering. Wikipedia. [Online] February 14, 2007. [Cited: April 8, 2007.]
http://en.wikipedia.org/wiki/Bayesian_spam_filtering.
5. Wei, Kai. A Naive Bayes Spam Filter. 2003.
6. GFI. Why Bayesian filtering is the most effective anti-spam technology. [Online] 2007.
http://www.gfi.com/whitepapers/why-bayesian-filtering.pdf.