Inferring Undesirable Behavior
from P2P Traffic Analysis
Ruben Torres*, Mohammad Hajjat*, Sanjay Rao*,
Marco Mellia†, Maurizio Munafo†
* Internet Systems Lab, Department of ECE, Purdue University, USA
† Department of Electronics and Telecommunications, Politecnico di Torino, Italy
SIGMETRICS'09
Rapid Evolution of P2P Networks

- Peer-to-peer (P2P) systems are huge and complex, with millions of participants.
  - Over 60% of network traffic is due to P2P systems.
- Used for many different applications.
  - File sharing: BitTorrent, eMule.
  - VoIP: Skype.
  - Video streaming: PPLive.
- Matured to the point that there are commercial offerings.

Undesirable Behavior in P2P Networks



- Most of the research is on P2P systems design and characterization.
- Shift attention to the impact P2P systems may have on the Internet.
- Our focus is on identifying undesirable behavior.
  - Patterns not expected, not intended or unwanted by developers, users or network operators.
- Potential for undesirable behavior due to:
  - Millions of users.
  - Completely distributed.
  - Software bugs.
  - Malicious clients.
  - Security vulnerabilities.

Our Contributions

- One of the first works to show that undesirable behavior exists, is prevalent and significant.
  - Evidence of DDoS attacks exploiting P2P clients.
  - Significant waste of ISP resources.
  - Impact on application/user performance.
- Expose problems in the context of a traffic trace of a large ISP.
  - More than 5 million customers.
- One of the first systematic approaches to uncover undesirable behavior in P2P systems.

Talk Outline



- Dataset
- Methodology
- Results

Setup
[Figure: network setup diagram. Labels: public, private, Home NAT, Full Cone NAT, ISP, Internet, TCP, UDP.]
- Traces obtained from a large European ISP.
  - ISP provides ADSL (20/1 Mbps) or Fiber (10/10 Mbps).
- Extensive usage of NATs in the ISP.
  - Peering point (most clients in the ISP have private IP addresses).
  - Home NAT.

Setup
[Figure: the same network diagram with the PoP shown. Labels: Full Cone NAT, PoP, ISP, Internet.]
- Packet traces collected from a PoP within the ISP network.
  - There are more than 2000 customers in the PoP.

eMule Traffic is Predominant in the PoP

- eMule is a popular P2P file sharing application.
- Over an entire 3 month period:
  - 60-70% of inbound traffic to the PoP is due to eMule.
  - 95% of outbound traffic is due to eMule.
- eMule consists of two networks:
  - Kad: decentralized DHT-based network.
    - UDP-based and mainly used for file search.
  - ED2K: centralized tracker-based network.
    - TCP-based and used for both search and data exchange.

Systems Analyzed
1. Generic eMule, which we refer to as Kad.
2. Version of eMule customized to the ISP, which we refer to as KadU.
  - Modified version of Kad developed by users in the ISP.
  - Avoids performance problems caused by the NAT at the edge of the network.
  - Difference: KadU clients only contact other clients within the ISP.
- These two systems are analyzed separately because they have different characteristics.
  - e.g. performance of KadU clients is much better.

High-Level Statistics of Dataset Analyzed
- 25 hours dataset.
- 478 KadU clients inside the PoP contact 229,000 KadU clients inside the ISP.
- 136 Kad clients inside the PoP contact more than 300,000 Kad clients in the Internet.
- 815,000 ED2K TCP connections.
- More than 8 million Kad/KadU UDP flows.

Traffic Classification and Samples Generation
[Figure: processing pipeline. Packet trace -> per-flow classification using Tstat -> per-host aggregation of flows -> metrics -> aggregate over 5-minute period -> samples.]
- Tstat is a passive sniffer with Deep Packet Inspection (DPI) capabilities.

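The sample-generation step can be sketched in a few lines of Python. This is a minimal illustration, not Tstat's actual output format: the `Flow` record, its field names, and the `build_samples` helper are assumptions made for this example. It groups already-classified flows by host and 5-minute window, so that each (host, window) group corresponds to one sample.

```python
from collections import defaultdict
from dataclasses import dataclass

WINDOW_SECONDS = 5 * 60  # the slides aggregate over 5-minute periods


@dataclass(frozen=True)
class Flow:
    """Hypothetical flow record; field names are assumptions, not Tstat's."""
    start_time: float   # seconds since the start of the trace
    host: str           # IP address of the monitored client inside the PoP
    app: str            # label from per-flow classification, e.g. "Kad"
    bytes_sent: int


def build_samples(flows):
    """Group flows by (host, 5-minute window index)."""
    samples = defaultdict(list)
    for flow in flows:
        window = int(flow.start_time // WINDOW_SECONDS)
        samples[(flow.host, window)].append(flow)
    return dict(samples)


# Made-up flows for illustration only.
flows = [
    Flow(10.0, "10.0.0.1", "Kad", 1200),
    Flow(250.0, "10.0.0.1", "Kad", 300),
    Flow(310.0, "10.0.0.1", "Kad", 800),   # falls in the next window
    Flow(20.0, "10.0.0.2", "KadU", 500),
]
samples = build_samples(flows)
```

Metrics would then be computed once per group, which keeps each host's behavior comparable across windows.
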
Metrics

- More than 50 metrics obtained from flow records.
  - Consider both TCP and UDP flows.
  - Consider if the flow initiator is inside or outside the PoP.
- Examples:
  - Flow: average flow duration.
  - Data Transfer: bps sent, bps received.
  - Destinations: number of distinct destination IP addresses.
  - Failures: failure ratio [TCP only].
- Choice of metric:
  - Intuitively important.
  - Used in the past in the context of P2P systems.
  - Can capture specific behaviors of interest to us.

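As a rough illustration of how the example metrics could be derived from the flows of one sample, consider the sketch below. The `FlowRecord` fields and the `compute_metrics` helper are hypothetical, and treating bps as total bits divided by the 5-minute window length is an assumption of this example, not a definition taken from the slides.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 5 * 60  # length of one sample, per the slides


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical flow record for one host; field names are assumptions."""
    dst_ip: str
    protocol: str        # "TCP" or "UDP"
    duration: float      # seconds
    bytes_sent: int
    bytes_received: int
    failed: bool = False  # only meaningful for TCP connections


def compute_metrics(flows):
    """Compute the example metrics for one sample (a non-empty list of flows)."""
    tcp = [f for f in flows if f.protocol == "TCP"]
    return {
        "avg_flow_duration": sum(f.duration for f in flows) / len(flows),
        "bps_sent": 8 * sum(f.bytes_sent for f in flows) / WINDOW_SECONDS,
        "bps_received": 8 * sum(f.bytes_received for f in flows) / WINDOW_SECONDS,
        "distinct_destinations": len({f.dst_ip for f in flows}),
        # Failure ratio is defined over TCP flows only; None if there are none.
        "tcp_failure_ratio": (sum(f.failed for f in tcp) / len(tcp)) if tcp else None,
    }


# Made-up sample for illustration only.
sample = [
    FlowRecord("1.1.1.1", "TCP", 10.0, 3000, 6000, failed=False),
    FlowRecord("1.1.1.1", "TCP", 2.0, 0, 0, failed=True),
    FlowRecord("2.2.2.2", "UDP", 6.0, 750, 1500),
]
metrics = compute_metrics(sample)
```
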
Challenges

- Very little knowledge of what kinds of undesirable behavior may exist.
- It is hard to clearly distinguish between normal and unwanted behavior.
  - P2P traffic patterns are very heterogeneous across users.
- Techniques relying on detecting abrupt changes may not work since undesirable behavior can:
  - Be exhibited by the majority of the samples.
  - Last throughout the observation period.
    - e.g. due to an implementation bug in the P2P system.

Our Approach

- We use clustering techniques and manual inspection to determine undesirable behavior.
- Clustering:
  - Tens of thousands of samples and more than 50 metrics.
  - Clustering reduces the number of samples to study to the granularity of clusters.
- Domain knowledge and manual inspection:
  - Select regions of interest.
  - Interpret the results.

Clustering - DBScan

- DBScan is a density-based clustering technique.
  - Dense regions of points are considered a cluster.
  - Low density regions are considered noise.
- Parameter tuning and sensitivity discussed in the paper.
[Figure: Number of Samples vs. Average Packet Size [bytes], showing Cluster1, Cluster2, Cluster3 and Noise.]

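To make the dense-region and noise ideas concrete, here is a small pure-Python sketch of density-based clustering on a single metric. It follows the textbook DBSCAN procedure and is not the authors' implementation; the function name `dbscan_1d`, the packet-size values, and the parameter values `eps=10` and `min_pts=3` are all chosen for illustration only.

```python
def dbscan_1d(values, eps, min_pts):
    """Label each value with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(values)
    # Neighbourhood of i: every point within eps of it (including itself).
    neighbours = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbours]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster from this unlabelled core point.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:      # only core points keep expanding
                        frontier.append(q)
        cluster += 1
    return labels


# Made-up average packet sizes (bytes): two dense groups and one outlier.
packet_sizes = [60, 62, 64, 66, 500, 505, 510, 1400]
labels = dbscan_1d(packet_sizes, eps=10, min_pts=3)
```

With these inputs the first four values form one cluster, the next three form a second, and the isolated value 1400 is left as noise.
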
Selecting Regions of Interest - Metrics with more than One Cluster
- Metrics with more than one cluster and noise.
  - A cluster and/or noise are selected as interesting.
[Figure: Number of Samples vs. Average Packet Size [bytes], showing Cluster1, Cluster2, Cluster3 and Noise. One cluster is annotated "clients only send control messages".]

Selecting Regions of Interest - Metrics with One Cluster
- Metrics with one cluster and noise.
  - Noise is typically selected as interesting.
[Figure: Number of Samples vs. Bits per Second Sent (x10^5). Cluster1: normal clients. Noise: very active clients.]

Correlating Interesting Samples

- Once samples in interesting regions are identified, infer undesirable behavior.
- Find the hosts that generate the interesting samples.
  - If a few hosts, anomalous behavior is a property of the hosts.
  - If many hosts, behavior is general to the application.
- Find correlation across metrics.
  - Rely on domain knowledge to identify this.
  - Ongoing work exploring use of techniques like association rule mining.

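The few-hosts versus many-hosts distinction can be sketched as follows. The helpers `host_shares` and `hosts_covering` are hypothetical names, and the coverage threshold is an arbitrary illustrative choice; the slides do not state how the authors draw this line.

```python
from collections import Counter


def host_shares(interesting_hosts):
    """Fraction of interesting samples generated by each host, largest first.

    `interesting_hosts` has one entry per interesting sample: the host that
    generated it.
    """
    counts = Counter(interesting_hosts)
    total = len(interesting_hosts)
    return [(host, n / total) for host, n in counts.most_common()]


def hosts_covering(shares, coverage=0.9):
    """Smallest number of top hosts whose samples reach `coverage` of the total."""
    running = 0.0
    for k, (_, share) in enumerate(shares, start=1):
        running += share
        if running >= coverage:
            return k
    return len(shares)


# Made-up data: ten interesting samples, eight of them from host "a".
hosts = ["a"] * 8 + ["b", "c"]
shares = host_shares(hosts)
few = hosts_covering(shares, coverage=0.75)
```

A small result relative to the number of hosts suggests a host-specific anomaly, while a result close to the total host count suggests behavior general to the application.
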
Talk Outline



- Dataset
- Methodology
- Results
  - Generic Observations
  - Key Findings

Preliminary Results

- For Kad:
  - Most metrics have one cluster and noise.
  - 8 metrics have two clusters and noise.
  - 2 metrics have three clusters and noise.
- Similar results for KadU.
- Sensitivity study.
  - Night period and day period.
  - One week trace.
  - Obtained very similar results.

Samples Distribution in the Interesting Region

- A few hosts have abnormal behavior.
- Abnormality spread across many hosts (circled in the figure).
[Figure: axes are Fraction of Hosts Generating Samples and Fraction of Samples in the Interesting Region. Labelled metric: number of destination ports in range 0-1024 that receive a Kad flow.]

Talk Outline



- Dataset
- Methodology
- Results
  - Generic Observations
  - Key Findings

DDoS Attacks Exploiting Kad

- Considered UDP flows classified as Kad with destination port in range 0-1024.
  - > 50% of these flows are sent to port 53 (DNS).
  - > 90% of these flows are unanswered.
- The top destinations were reported to be under attack.
- Unanswered UDP flows are those in which the flow destination never replies.
[Figure: Fraction of Unanswered UDP Flows vs. Port Number. Marked ports: Port 53; Port 4672 (default Kad port).]

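The per-port view in the figure can be approximated with a short sketch like the one below. The input format (a destination port paired with the number of packets the destination sent back) and the helper name `unanswered_fraction_by_port` are assumptions of this example; the flow data is made up.

```python
from collections import defaultdict


def unanswered_fraction_by_port(flows, max_port=1024):
    """Per destination port, fraction of UDP flows that got no reply.

    `flows` is an iterable of (dst_port, packets_from_destination) pairs;
    a flow counts as unanswered when the destination sent zero packets.
    """
    total = defaultdict(int)
    unanswered = defaultdict(int)
    for dst_port, reply_packets in flows:
        if not 0 <= dst_port <= max_port:
            continue  # keep only the 0-1024 range considered in the slides
        total[dst_port] += 1
        if reply_packets == 0:
            unanswered[dst_port] += 1
    return {port: unanswered[port] / total[port] for port in total}


# Made-up flows: (destination port, packets received back).
flows = [(53, 0), (53, 0), (53, 0), (53, 2), (80, 1), (4672, 0)]
fractions = unanswered_fraction_by_port(flows)
```

Port 4672 is dropped here only because it lies outside the 0-1024 range; raising `max_port` would include it.
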
DDoS Attack Exploiting P2P Systems

- Redirection attacks.
  - Malicious clients inject fake membership information about a victim into the system.
  - Innocent clients send normal protocol messages to the victim.
- There has been some awareness of the problem in the research community: Bellovin [2001], Ross [2006].
  - They have shown the theoretical feasibility of the attack.
- But our work is one of the first to show that these attacks are prevalent in the wild.

Unnecessary P2P Traffic in KadU and Kad
[Figure: metric is the fraction of unanswered flows from total incoming UDP flows. Axis labels: Fraction of Hosts Generating Samples; Fraction of Samples in the Interesting Region. Regions: Cluster1; Cluster2 (most incoming UDP flows are unanswered); Noise.]

Unnecessary P2P Traffic in KadU and Kad

- Large amount of wasted traffic:
  - 28% of all UDP flows incoming to the PoP are unanswered. 65% due to Kad and KadU.
  - 30% of all TCP connections incoming to the PoP fail. 50% due to KadU.
- Due to two reasons:
  - Stale membership information.
  - Nodes behind NAT.
- Staleness can be extremely long lived (e.g. tens of hours).

Malicious P2P Trackers in the ED2K Network



- Metric: average number of TCP connections per destination IP.
- 94% of interesting samples generated by two hosts.
- Many short lived connections to two trackers reported as malicious.
  - Never responded to requests and closed the connections.
  - Likely deployed by copyright agencies (e.g. RIAA, IFPI).
  - Similar findings by Banerjee [2008] and Siganos [2009].
[Figure: distribution of Average Connections per Destination. Cluster1. Noise: clients contact the same destination more than once in 5 minutes.]

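A minimal sketch of this metric for one 5-minute sample follows. The helper name `avg_connections_per_destination` and the addresses are made up for illustration; a value above 1 means the client contacted at least one destination more than once in the window, which is how the noise region is described in the figure.

```python
def avg_connections_per_destination(dst_ips):
    """Average number of TCP connections per distinct destination IP.

    `dst_ips` lists the destination of every TCP connection a host opened
    within one 5-minute sample.
    """
    if not dst_ips:
        return 0.0
    return len(dst_ips) / len(set(dst_ips))


# A client that contacts each peer once.
normal = avg_connections_per_destination(["1.1.1.1", "2.2.2.2", "3.3.3.3"])
# A client that keeps reconnecting to one unresponsive tracker (made-up data).
retrying = avg_connections_per_destination(["9.9.9.9"] * 9 + ["1.1.1.1"])
```
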
Generalizing to Other Systems

- Findings in BitTorrent:
  - A very significant amount of unnecessary P2P traffic is present, as in KadU.
- Findings in Direct Connect:
  - Possible DDoS attack exploiting DC++. Many TCP connections sent to port 80 of real web servers.
- More findings in the paper.
- Ongoing work studying traces from other networks.

Summary


- One of the first works to systematically study P2P traffic to identify undesirable behavior.
- Shown various types of undesirable behavior of P2P systems in the wild:
  - DDoS attack on external servers exploiting the system.
  - Wasted resources.
  - Impact on the performance of the P2P system (e.g. malicious trackers).
- Shown the potential of a systematic approach to uncover this behavior.
- Our initial analysis suggests that results hold over a range of other P2P systems.

Questions?
Backup Slides
Encrypted Traffic in eMule
11.05.2008 12:29 eMule 0.49a released
1.08.2008 20:25 eMule 0.49b released
[Figure annotation: "Our trace collection".]

Why DBScan?





- Does not rely on assumptions about the shape of the clusters.
- It has a notion of a noise region.
- You don't need to know how many clusters you want ahead of time.
- But, in principle, any technique can be used.
  - Just need a coarse way to cluster samples.

DBScan - Parameter Sensitivity

- We adjust the parameters to match our intuition of where the clusters should be if we manually look at each metric.
  - Try to keep the noise region small but not too small (at most 6% of the samples in our study).
- We have an automated way to get clusters.
- More details in the paper.

Clustering for single metric instead of multiple metrics
- Interpreting clusters over multiple metrics may be harder.
  - Typical metric distribution is very skewed.
  - Metric distributions have different supports.
- Single-metric clustering still helps.
  - Automatic way to get thresholds for the interesting region.
  - First cut observations.
- But this is a first step.
  - Ongoing work on multi-metric analysis.

Do you think you found all the behavior, or is there more?
- We expect there is more (so there is more work to do).
  - But we expect we have caught the first order issues.
  - This is the first attempt in this direction.
  - We don't have an exhaustive list of undesirable behavior.
- There may be other behavior we could find when the application or network setup changes.
  - For example, the buddy problem, which has more to do with the architecture of Kad.

How can you automate this and generalize to different networks?
- First step: pointing to the importance of the problem.
- Now that this is established, we could look at better ways to detect:
  - Changes over time.
  - Changes across networks.
  - For a class of P2P systems, use the same list of undesirable behavior.