Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Towards the Effective SpatioTemporal Mining of Spam Blacklists
Andrew G. West and Insup Lee
CEAS `11 – September 1, 2011
Big Idea / Outline
BIG IDEA: Identify IP addresses that have temporally
correlated spam behavior; harness this info. predictively
• Related work; motivations
• Blacklists as ground truth
• Data collection
• Measurement study
• Temporal association mining
• Technique
• Parameterization
• Negative results; discussion
2
Usage Example
Blacklist history (time)
IPx
IPy
20 min.
20 min.
tnow
What to do “now”?
• Assume IPy will be blacklisted
• Start blocking; decrease listing latency
3
Motivations
Recent research leveraging
group behaviors [1—5]:
• Overcome “cold-start”
• Grouping functions:
subnets, rDNS hosts, AS, etc.
History
AS
IP
AS-REP
REP
ALG
BLOCK
BLK-REP
Mail
Botnets a driving force
• Non-contiguous in IP space
• “Campaigns” should give
rise to temporal correlations
• Can we calculate grouping
function; use for reputation?
IP
IP-REP
Spatial
Functions
SPAM or HAM
Time
Plot into
Classify (SVM) 3-D Space
4
Related Work
“How to determine botnet membership?”
• Parsing P2P communication graphs
– Issues: Unproven, reqs. expansive view (BotGrep [6])
– Blacklists have inherent global view
• Similarity algs. over email bodies/URLs
– Issues: Privacy, complexity (Botnet Judo [7])
– Mining uses only IP addresses in computation
• Law enforcement infiltrations
– Data only useful in ex post facto fashion
5
BLACKLIST MEASUREMENT STUDY
6
Blacklists
• Why blacklists?
– Global compilation; aggregate; low false-positives
– We have tons of data
• Spamhaus blacklists [8]
1. PBL (Policy Block List) – Dynamic IP ranges
2. SBL (Spamhaus Block List) – Static ranges
belonging to spam gangs
3. XBL (Exploits Block List) – IPs spamming due to
malware, Trojans (i.e., botnet nodes)
7
Blacklist Ops
listing
de-listing
listed
IPx
re-listing
not- listed
listing duration (d)
listed
Blacklist history (time)
8
Blacklist Size
• Why?: Desirable to show that blacklists are
a reasonable proxy for the spam problem
• #1: Spike typical of
holiday seasons
#1
#2
• #2: Shutdowns of
Spamit.com affiliate
and Rustock
• Small spikes: Evidence
of campaigns
9
Listing Duration (d)
• Why?: Re-listings (basis for patterns/
correlations) limited by de-listing speed
• Almost universal
d=7.5 days
• Speaks to static TTL
delisting policy
• Must only correlate
listings, not
overlapping
durations
10
DHCP Issues
• Why?: Dynamic IPs may not be able to
accumulate enough history for mining, or
produce stale predictions XBL
PBL
• A large percentage (80%+)
of IPs are dynamic
• More important, is how
dynamic they are [9]
• This fact supports narrow
learning windows
11%
“possibly
dynamic”
10%
79%
“known
dynamic”
≈18.4% of all IP
space is on the PBL
11
Relisting Quantity
• Why?: Central issue: do some IPs have
histories extensive enough to be mined?
#1
#2
• #1: 50% of IPs have
only 1 listing. Discard.
Trim problem space.
• #2: 20% of IPs have
5+ listings, yet these
account for 66% of all
listings (non-trivial).
12
Relisting Rates
• Why?: Dynamism supports tight learning,
thus we want all re-listings well clustered
temporally.
• Media re-listing
time is 18 days
• Far from a uniform
distribution
• Also speaks to
infection lifetimes
13
TEMPORAL ASSOCIATION MINING
14
Association Rules
• Developed for “market basket” data
– “Beer and diapers” example
– Apriori and FP-Growth algs.
• Example rule
– {DIAPERS} →{BEER}
– Interest measures [10]:
ID
BEER MILK DIAPERS
1
Y
N
Y
2
N
Y
N
3
Y
Y
Y
4
Y
N
Y
5
N
N
Y
– lift(DIAPERS → BEER) = (3/5) / (4/5) * (4/5) = 0.94
– Ratio of actual support, to expected rand. support
15
Correlations
• Previous: discrete, unordered, and transactional data
• Spam data defies these
– Continuously distributed
– Bi-directionally ordered
• Define “correlation
radius” (r) to make
binary associations
• Symmetric but nonassociative
• Radius enables
probabilistic lift and
support equivalents
16
Best Pairs
For every IP address, produce a finite
“best pairs” list for persistent storage,
where ordering determined by “lift”
17
Implementation
• 232 × 232 = Scalability issues
• Prune search space with “minimum support”
– M=3 produces a
54.3 trillion
entry matrix
– But 98% sparse
• Multi-threaded
runtime= 3 days;
we used EC2
18
Free Variables
• Correlation radius (r)
– Try to capture campaigns with minimal noise
– r = 2 hours (4 hour diameter)
• Training window length (length(h’))
– Narrow: Infection lifetimes [11], DHCP addresses
– Broad: Need for re-listings, bot-to-campaign map
– length(h’) = 3 months
• Minimum support (m)
– Derived based on scalability needs (m=3)
19
RESULTS AND DISCUSSION
1. “Best pairs” significance
2. Botnet membership capture
3. Blacklist prediction
20
Rule Significance
• Intuition: Lift matrix should have values
higher than random chance would suggest
• #1: Flip expected;
About 0.6% of all
pairs correlate
more than random
#1
#2
• #2: Even at lift=120,
36% chance the
correlation is rand.
AGGREGATE
21
Botnet Membership
• Intuition: Given a set of botnet IPs, shared member/
pair lifts should exceed member/non-member pairs
• Actual dumps:
Kraken + Cutwail
• 70-80% of IPs are
XBL listed, 40% at
min. support
• 6.0% of shared
have non-zero lift,
compared to 2.8%
22
Blacklist Prediction (1)
23
Blacklist Prediction (2)
• Prediction criteria
– No ballot stuffing; can’t re-guess
• Experiment with different thresholds
• Same story: Outperforming random, but too
minimal to be of any consequence
24
Discussion (1)
• Scalability issues × minor performance
increments don’t warrant production
• Focus on acute areas of improvement:
1. DHCP research
– 90%+ of IPs at min. support are dynamic, how?
– Need reliable IP classification; churn rates
2. Refining windows/correlations
– Non-binary correlations. Gaussian weights.
– Time-decay of events in training windows
25
Discussion (2)
3. Appropriateness of blacklist data
– Desirable conciseness (500 million listings = 12GB)
– Blacklists inherently latent. Their aggregate, opaque,
and binary triggers may blur campaign-level data.
– Install on an email server? Collect other metadata
•
Takeaway; Utility in negative result
–
–
–
–
Measurement study builds on prior research
Our model serves as foundation for future efforts
Lessons learned about botnet dynamics
Identified poorly understood dynamism areas
26
References
[1] A. Ramachandran and N. Feamster. Understanding the network-level
behavior of spammers. In SIGCOMM, 2006.
[2] F. Li and M.-H. Hsieh. An empirical study of clustering behavior of
spammers and group-based anti-spam strategies. In CEAS, 2006.
[3] S. Hao, et al. Detecting spammers with SNARE: Spatio- temporal
network-level automated reputation engine. In USENIX Security, 2009.
[4] Z. Qian, et al. On network-level clusters for spam detection. In NDSS, 2010
[5] A. G. West, et al. Spam mitigation using spatio-temporal reputations from
blacklist history. In ACSAC, 2010.
[6] S. Nagaraja, et al. BotGrep: Finding P2P bots with structured graph
analysis. In USENIX Security, 2010.
[7] A. Pitsillidis, et al. Botnet judo: Fighting spam with itself. In NDSS, 2010.
[8] Spamhaus Project. http://www.spamhaus.org/
[9] Y. Xie, et al. How dynamic are IP addresses? In SIGCOMM, 2007.
[10] L. Geng and H. J. Hamilton. Interestingness measures for data mining: A
survey. ACM Comp. Surveys, 38(9), 2006.
[11] J. E. Dunn. Botnet PCs stay infected for years. Tech World, 2009.
27
Backup Slides (1)
28
Backup Slides (2)
Above: Lift distributions as a
consequence of altering
minimum support.
Above: Lift distributions as a
consequence of altering correlation
radius and minimum support
29
Backup Slides (3)
30