Database Techniques
for fighting SPAM
Telvis Calhoun
CSc 8710 – Advanced Databases
Dr. Yingshu Li
Everybody knows about SPAM




Spam is unsolicited bulk email sent for
profit and general mayhem.
BOTNETs = distributed networks of hijacked hosts (IPs).
Botnet IPs are hard to track.
70 billion emails are sent per day; 70% of them are spam.
How Does Anti-SPAM Use DBs?


Spam databases collect network layer and
application layer data.
IP Blacklisting



Detect a malicious host during SMTP dialog.
Detection is difficult because IP addresses change under DHCP, botnets are large, and good IPs may be used to forward spam.
Content Analysis



Detect malicious mail content.
Requires that MTA complete the SMTP connection.
Arms race between content filter designers and
spammers.
Summary of DB Techniques





Grey Space Analysis
Trinity: Peer-to-Peer Database
Behavioral Blacklisting
Progressive Email Scanning
Content filtering using Bayesian
Analysis
Grey Space Analysis




Characterize IP Space: Active vs. Grey Space
IP Flow Database
Detect malicious IPs by extracting dominant
scanning ports (DSPs)
Find DSPs using relative uncertainty algorithm
Mining Technique: Relative
Uncertainty




Determines entropy of IP ports in flows database.
RU := entropy of the dstPrt distribution ÷ maximum entropy.
p[i] := number of flows with port[i] ÷ total flows.
RU close to 1 indicates a nearly even distribution; RU near 0 indicates a skewed distribution dominated by a few ports.
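The relative uncertainty calculation can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `relative_uncertainty` is hypothetical, and the maximum entropy is assumed here to be the log of the number of distinct ports observed, which the slide does not specify.

```python
import math
from collections import Counter

def relative_uncertainty(ports):
    """Entropy of the observed port distribution divided by the maximum entropy.

    `ports` holds one destination port per flow.
    Assumption: maximum entropy = log2(number of distinct observed ports).
    """
    total = len(ports)
    counts = Counter(ports)
    if total == 0 or len(counts) < 2:
        return 0.0  # a single observed value carries no uncertainty
    # p = flows with a given port / total flows
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_entropy = math.log2(len(counts))
    return entropy / max_entropy

even = [22, 25, 80, 443]            # every port seen equally often
skewed = [25] * 97 + [22, 80, 443]  # port 25 dominates
print(relative_uncertainty(even), relative_uncertainty(skewed))
```

A low value for the skewed sample is what flags port 25 as a candidate dominant scanning port.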
Grey Space Algorithm





Isolate flows toward grey space.
Find dominant scanning ports (DSPs).
Find outside hosts with DSP flows toward grey and active hosts.
Find the inside host footprint for outside hosts.
Classify adversary as hitter or scanner.
Focused Hitters vs Bad
Scanners


Focused hitters tend to send tens or
hundreds of flows to each grey host.
Bad scanners send one or a few flows to
each grey host
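The hitter/scanner distinction above can be sketched as a threshold on flows per grey host. The function name `classify_outside_host` and the cutoff of 10 flows per grey host are assumptions chosen for illustration; the slides only say "tens or hundreds" versus "one or a few".

```python
from collections import Counter

def classify_outside_host(grey_flow_dsts, hitter_threshold=10):
    """Classify one outside host from its flows toward grey space.

    `grey_flow_dsts` lists the grey destination IP of each flow.
    Assumption: an average of `hitter_threshold` or more flows per
    grey host marks a focused hitter; fewer marks a bad scanner.
    """
    if not grey_flow_dsts:
        return "unknown"
    per_host = Counter(grey_flow_dsts)
    avg_flows = len(grey_flow_dsts) / len(per_host)
    return "focused hitter" if avg_flows >= hitter_threshold else "bad scanner"

# One flow to each of 50 grey hosts: scanning behaviour.
scan = ["10.0.0.%d" % i for i in range(50)]
# Many flows to two grey hosts: focused hitting.
hit = ["10.0.0.1"] * 40 + ["10.0.0.2"] * 60
print(classify_outside_host(scan), classify_outside_host(hit))
```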
Trinity: Distributed IP
Reputation Database



Botnets send a large
amount of data in a
short amount of time.
Trinity uses distributed
in-memory hash table
containing IP
reputation entries.
Each peer has 10 to
50 megabytes of data
(833K – 4.17M entries,
about 12 bytes per entry)
Chord Distributed Hash Table

Distribute data over a large P2P network


Stores key/value pairs



Quickly find any given item
The key value controls which node(s) stores the
value
Each node is responsible for some section of the
space
Basic operations


Store(key; val)
val = Retrieve(key)
Chord (cont)

Each node chooses an n-bit ID
Each lookup key is also an n-bit ID
i.e., the hash of the real lookup key
Node IDs and keys occupy the same space!
IDs are arranged in a ring
Each node is responsible for storing keys "near" its ID



Replication is usually between the current and previous node
Items can be replicated at multiple successors
No single host contains a large fraction of a particular space,
to guard against DDoS.
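The ring placement and the Store/Retrieve operations can be sketched as below. This is a simplified illustration that assumes every peer knows the full membership list; real Chord routes lookups through finger tables, which are omitted. The class name `Ring`, the 16-bit ID space, and the peer names are hypothetical.

```python
import hashlib
from bisect import bisect_left

M_BITS = 16  # assumption: small ID space, for illustration only

def chord_id(name):
    """Hash a node name or lookup key onto the n-bit ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** M_BITS)

class Ring:
    def __init__(self, node_names):
        # Node IDs and key IDs share the same space.
        self.nodes = sorted((chord_id(n), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.nodes]
        self.data = {n: {} for n in node_names}

    def successor(self, key_id):
        """First node at or after key_id, wrapping around the ring."""
        idx = bisect_left(self.ids, key_id) % len(self.ids)
        return self.nodes[idx][1]

    def store(self, key, val):
        self.data[self.successor(chord_id(key))][key] = val

    def retrieve(self, key):
        return self.data[self.successor(chord_id(key))].get(key)

ring = Ring(["peer-a", "peer-b", "peer-c"])
ring.store("192.0.2.7", 5)  # e.g. a reputation entry keyed by source IP
print(ring.retrieve("192.0.2.7"))
```

Because the key's hash alone decides the owner, any peer can compute where an entry lives without a central index.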
Database Updates




Compute the number of interval quarters since the last update. Shift and
update counters accordingly.
Determine the site responsible for the entry and send the update over UDP. Once received
by the owner site, the entry is forwarded to k peers using TCP.
Updates are commutative, so order doesn't matter. Consistency is not
required.
Even if a host goes down, the database can be rebuilt in an hour.
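The shift-and-update step can be sketched with a small sliding window of counters. Everything concrete here is an assumption made for illustration: the class name `ReputationEntry`, four counters per entry, and a quarter length of 150 seconds are not given in the slides.

```python
QUARTER_SECONDS = 150  # assumption: length of one interval quarter
NUM_QUARTERS = 4       # assumption: counters kept per entry

class ReputationEntry:
    def __init__(self, now):
        self.counters = [0] * NUM_QUARTERS  # counters[0] is the current quarter
        self.last_quarter = int(now // QUARTER_SECONDS)

    def update(self, now, count=1):
        """Shift counters by the quarters elapsed, then add the new count."""
        quarter = int(now // QUARTER_SECONDS)
        elapsed = max(0, quarter - self.last_quarter)
        if elapsed:
            shift = min(elapsed, NUM_QUARTERS)
            # Old quarters slide toward the end; expired ones drop off.
            self.counters = [0] * shift + self.counters[:NUM_QUARTERS - shift]
            self.last_quarter = quarter
        self.counters[0] += count

    def total(self):
        return sum(self.counters)

entry = ReputationEntry(now=0)
entry.update(now=10)
entry.update(now=20, count=2)
entry.update(now=160)  # one quarter later
print(entry.counters, entry.total())
```

Within a quarter the update is a plain addition, so two updates applied in either order leave the same counters.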
Security



Secure communications for neighbors
Limit updates for nodes that have sent
more than 100 emails in 10 minutes.
Falsified source IPs can cause false
positives.
Clustering Technique for
Behavioral Blacklisting




Identify spammers that
attack many domains.
The sending pattern is the
distribution and frequency
of target domains
Form clusters of
sending patterns
Use clusters to identify
new attacks
Spectral Clustering


Divide Phase – produces a tree whose
leaves are elements of the set.
Merge Phase – Start with each leaf in its
own cluster and merge going up the tree.
Vector Generation

Database contains: M(i,j,k)



Total times that IP ‘i' sent email to domain
‘j’ in time slot ‘k’.
Find total flows for IP/Domain across
entire time axis (M’).
Generate feature vector from M’

IP := <#flows to domain 1, # flows to
domain 2, … #flows to domain j>
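Collapsing M(i,j,k) over the time axis into per-IP feature vectors can be sketched as below. The dictionary layout and the function name `feature_vectors` are assumptions for illustration; the slides do not describe how M is stored.

```python
from collections import defaultdict

def feature_vectors(m, domains):
    """Sum M(i, j, k) over all time slots k to get M', one row per IP.

    `m` maps (ip, domain, time_slot) -> number of emails (hypothetical layout).
    Returns ip -> [flows to domains[0], flows to domains[1], ...].
    """
    index = {domain: pos for pos, domain in enumerate(domains)}
    vectors = defaultdict(lambda: [0] * len(domains))
    for (ip, domain, _slot), count in m.items():
        vectors[ip][index[domain]] += count
    return dict(vectors)

m = {
    ("192.0.2.1", "a.example", 0): 2,
    ("192.0.2.1", "a.example", 1): 3,
    ("192.0.2.1", "b.example", 0): 1,
    ("192.0.2.2", "b.example", 2): 4,
}
print(feature_vectors(m, ["a.example", "b.example"]))
```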
Clustering


Clusters contain IP
addresses that send
mail to similar sets of
domains.
Define the traffic pattern
for each cluster by
averaging the rows
(vector contents) of all
IPs in the cluster.
IPxIP matrix of related spam senders
Classification



Input IP vector ‘r’ :=1 x d vector
Use similarity algorithm to find closest cluster
Spam score is the maximum similarity of r
with any cluster.
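Cluster averaging and scoring can be sketched as follows. Cosine similarity is used here as an assumed stand-in, since the slides only say "similarity algorithm"; the function names are hypothetical.

```python
import math

def cluster_pattern(vectors):
    """Average the rows (per-domain counts) of all IPs in one cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; assumed here, not confirmed by the slides."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def spam_score(r, patterns):
    """Maximum similarity of the 1 x d vector r with any cluster pattern."""
    return max((cosine(r, p) for p in patterns), default=0.0)

patterns = [
    cluster_pattern([[4, 0, 0], [6, 0, 0]]),  # cluster hitting domain 1 only
    cluster_pattern([[0, 1, 1], [0, 3, 3]]),  # cluster hitting domains 2 and 3
]
print(spam_score([2, 0, 0], patterns))
```

A new IP whose vector lines up with a known cluster pattern scores close to 1 even if that IP has never been seen before.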
Progressive Email Scanner



Maintains Feature Instance (FI)
database
FI is any feature that can discriminate
HAM from SPAM.
Dynamic Features - Use any feature
that IDs mail (contents,
network, etc.)

Paper only uses URL links as FIs
PEC Architecture

FI States





Grey (Ambiguous FI)
Black (Spam FI)
White (HAM FI)
Blacklist Module – Extracts and
hashes FIs
Scoreboard Module – Tracks FI
occurrences and timestamp (age)
Competitive Aging and Scoring
System (CASS)

Transition between states governed by




Score – number of occurrences of FI
Age – time since last score update.
Score (R) exceeds score threshold (S)
causes Grey to Black transition.
Age (A) exceeds age threshold (M)
triggers Grey to White transition.
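The Grey-to-Black and Grey-to-White transitions can be sketched as a small state machine. The names `FeatureInstance`, `observe`, and `expire` are hypothetical, the threshold values are arbitrary, and purging is left out.

```python
from dataclasses import dataclass

@dataclass
class FeatureInstance:
    last_update: float    # time of the last score update
    score: int = 0        # R: number of occurrences seen
    state: str = "grey"   # grey, black, or white

def expire(fi, now, age_threshold):
    """Grey -> White when age A exceeds the age threshold M."""
    if fi.state == "grey" and now - fi.last_update > age_threshold:
        fi.state = "white"
    return fi.state

def observe(fi, now, score_threshold, age_threshold):
    """Record one occurrence; Grey -> Black when score R exceeds S."""
    expire(fi, now, age_threshold)
    if fi.state != "grey":
        return fi.state
    fi.score += 1
    fi.last_update = now
    if fi.score > score_threshold:
        fi.state = "black"
    return fi.state

url = FeatureInstance(last_update=0.0)
for t in (1, 2, 3, 4):
    state = observe(url, t, score_threshold=3, age_threshold=60)
print(state)
```

The two thresholds compete: a feature that keeps recurring before it ages out turns black, while one that goes quiet turns white.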

Purge
Bayesian Content Filtering


Determine the probability that a
message is spam based on contents
Use Bayesian combination of spam
probabilities
Bayesian Training



Requires training
corpus of
HAM/SPAM
Find interesting
tokens.
Create HAM/SPAM
token tables
Classification
Sample Message:
Hi,
Just a reminder: don’t forget your
allergy prescription when you visit New
York City today.
Mom




Spam Probability Table
Tokenize new message
Calculate spam probability for each token
Derive overall spam probability using Bayes
formula. Sample Message = 0.0
Non-spam tokens outweigh spam tokens to prevent
false positives
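The classification steps can be sketched in the style of Graham's "A Plan for Spam": look up a spam probability per token, keep the most interesting ones, and combine them. The token table below is invented for illustration, and the tokenizer is a simplification; the 0.4 default for unknown tokens and the limit of 15 tokens follow Graham's essay.

```python
import math
import re

def combine(probs):
    """Combine per-token spam probabilities, assuming independent tokens."""
    prod_spam = math.prod(probs)
    prod_ham = math.prod(1 - p for p in probs)
    return prod_spam / (prod_spam + prod_ham)

def classify(message, token_probs, unknown=0.4, top_n=15):
    """Overall spam probability of a message from its most interesting tokens."""
    tokens = set(re.findall(r"[A-Za-z0-9$'-]+", message.lower()))
    probs = [token_probs.get(t, unknown) for t in tokens]
    # "Interesting" means far from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return combine(probs[:top_n])

# Hypothetical token table; real values come from the training corpus.
table = {"prescription": 0.99, "reminder": 0.01, "mom": 0.01, "allergy": 0.2}
message = "Just a reminder: don't forget your allergy prescription. Mom"
print(classify(message, table))
```

One strongly spammy token ("prescription") is outweighed by the strongly hammy ones, which is the behaviour the slide describes.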
Real World Applications
Messaging Security Architecture
TrustedSource.org
Summary

A variety of database techniques are used in AntiSpam Technology
IP Blacklisting
Content filtering
Databases can contain:
Network traffic: IP Addresses, Domain, Ports
Message Content: Words, URLs, HTML Text
Challenges:



Scalability – Must handle many connections or messages
Minimize False Positive Rates – Avoid classifying a HAM
message as SPAM.
Finding useful SPAM features using machine learning
techniques.
References







Brodsky, et al., A Distributed Content Independent Method for
Spam Detection, HotBots 2007
Jin, et al., Identifying and Tracking Suspicious Activities
through IP Gray Space Analysis, MineNet 2007
Liu, et al., High-Speed Detection of Unsolicited Bulk Emails,
ANCS 2007
Ramachandran, A., Filtering Spam with Behavioral
Blacklisting, CCS 2007
Cheng, et al., A Divide-and-Merge Methodology for Clustering,
ACM Transactions on Database Systems, 2006
Graham, P., A Plan for Spam,
www.paulgraham.com/spam.html, 2002
Secure Computing Corporation, http://trustedsource.org, 2008