IIT BOMBAY NETWORK MEASUREMENTS
MONITORING THE PERFORMANCE OF BACKHAUL CAMPUS NETWORK
Guided by: Prof. Purushottam Kulkarni
Submitted by: Manveer Singh Chawla
OVERVIEW
• Motivation
• Problem statement
• Related Work
• IIT Bombay Network Background
• Our Solution
  • Architecture
  • Implementation
• Experimental Evaluation
  • Network measurement data
  • Proxy log analysis
• Future Work
• Thesis Contribution
MOTIVATION
• Consider the following scenarios
  • User writes a mail and clicks send, but sending fails
  • User is talking with a friend on gtalk and it disconnects
  • User is browsing the web but the browsing speed is very slow
• What will a novice user do?
  • No structured approach:
    • Starts fiddling around with network settings
    • Reboots machine
• Result?
  • Wastes a lot of time
  • May not even find the cause
MOTIVATION CNTD.

• Multiple points of failure
  • User's machine
    • Incorrect network settings
    • Failure of ethernet card/cable
  • LAN
    • Switch
    • Router
    • DNS
    • Proxy
  • WAN
    • Web Server
    • Network Congestion
• No user control over LAN / WAN failures
PROBLEM DEFINITION
1. Build a measurement tool which monitors the status of elements in the
   network backbone, such that in case of network failure, it is able to
   detect and diagnose the cause of failure. These elements include the
   subnet routers, switches, DNS servers and network proxy.
2. A measurement study of the network proxy to study the response time
   variation, traffic pattern and object size variation across the day.
RELATED WORK

• Jigsaw
  • Merges traces to passively measure queuing delays and throughput
  • We summarize a trace to determine the status of nodes
• WiFiProfiler
  • Fault diagnosis in a wireless setting for the user machine
  • Performs distributed analysis
  • Ours is centralized processing of a wired network
• Network measurement tools
  • Pathchar: bandwidth, queue size, packet drop rate
  • Traceroute: RTT, topology
IIT BOMBAY NETWORK MAP
SERVICES
• Proxy: netmon
  • Web caching
  • Authentication
  • Content filtering
• Firewall
  • NATing
  • Packet filtering
  • Internal and External
• DNS
  • DNS server for campus
  • DNS servers in a few subnets
• Monitoring
  • Traffic statistics
WORKING OF PROXY
MEASUREMENT CHALLENGES

• Permission from Computer Centre
• Large volume of data
• Unaware and amateur users
• Specific h/w required
• What to measure in such a large network
• Use existing infrastructure
• Old h/w: unpredictable failures
• WAN: firewall makes failures difficult to diagnose
OUR SOLUTION
ARCHITECTURE
SERVER NODE
• Check ethernet cable
  • ifconfig utility on interface
• Check the subnet switch
  • traceroute utility to send packet
• Check the subnet router
  • traceroute utility to send packet
• Check the CC router status
  • traceroute utility to send packet
• Check the DNS server status
  • dig utility to query status
• Check the status of netmon
  • wget utility to check the status
• Perform external web download
  • wget utility to download URL
• Responses labelled in the figure: ICMP PORT_UNREACHABLE, query reply from server, Bad request on HTTP GET request
• Send logs to diagnostic-node after collection
CLIENT NODE
• Subnet switch status
  • Any packet received
• Upload/Download rate
  • Calculate the bytes received and sent
  • Using inter-packet time difference for a connection
• Subnet router reachability
  • Packet from any other subnet is received
• DNS server reachability
  • Packet from DNS server is received
• netmon reachability
  • Packet from netmon.iitb.ac.in is received
• Send logs to diagnostic-node after collection
DIAGNOSTIC NODE
• Input from Client Node
  • Status of subnet switch
  • Reachability of subnet router, DNS server and netmon
• Input from Server Node
  • Status of subnet switch, subnet router, CC router, DNS server, netmon
  • Web download performance
• Output of Diagnostic Node
  • Outage lengths
  • Frequency of different failures
  • Status of different machines
  • DNS service time distribution
DIAGNOSTIC NODE CNTD.
Determining the status of proxy (netmon)
• Failure seen: is it seen by all?
  • Yes: machine down with failure
  • No: machine overloaded
DIAGNOSTIC NODE CNTD.
Determining the status of DNS servers
• Is it not reachable for all?
  • Yes: machine down
  • No: send back-to-back queries
    • Internal query answered: problem in hierarchy
    • Other cases: machine overloaded
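The two decision flows above can be sketched as small functions. This is only an illustration, not the tool's actual code: the function names, the dictionary input, and the reading of the "internal" branch (internal query answered while the external one is not) are assumptions made for the example.

```python
def diagnose_dns(unreachable_by_node, internal_answered=False, external_answered=False):
    """Classify a DNS server from per-node observations.

    unreachable_by_node maps a measuring node's name to True when that
    node could not reach the server.
    """
    observations = list(unreachable_by_node.values())
    if not observations or not any(observations):
        return "up"
    if all(observations):
        return "machine down"
    # Only some nodes see the failure: follow up with back-to-back queries.
    # Assumed reading of the branch label: internal answered, external not.
    if internal_answered and not external_answered:
        return "problem in hierarchy"
    return "machine overloaded"


def diagnose_proxy(failure_seen_by_node):
    """Classify the proxy: a failure seen by every node means it is down."""
    observations = list(failure_seen_by_node.values())
    if not observations or not any(observations):
        return "up"
    return "machine down" if all(observations) else "machine overloaded"


if __name__ == "__main__":
    print(diagnose_dns({"h8": True, "cse": False}, internal_answered=True))
    print(diagnose_proxy({"h8": True, "cse": True}))
```

Keeping the rule as a pure function of the collected observations makes it easy to replay it over stored logs in any of the diagnostic modes.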
DIAGNOSTIC NODE CNTD.

• Offline mode
  • Statistics for specified time period
• Online mode
  • Statistics for last 10 minutes
• Remote query mode
  • Query status of node at specified time
EXPERIMENTAL EVALUATION
SETUP
• Server node at 8 locations around the campus
• Client node at 3 locations around the campus
• Collected data from 26th March to 15th June
  • No data for 25th May to 2nd June
• Measurements for following nodes:

IP Address      Name
192.0.50.1      h8router-interface1
10.107.1.250    ccrouter-interface1
10.12.250.1     h8router-interface2
192.0.20.2      ccrouter-interface2
10.12.250.2     h12switch
192.0.40.2      ccrouter-interface3
10.2.250.1      h3router-interface1
10.105.250.1    cserouter-interface1
192.0.50.2      ccrouter-interface4
10.129.1.1      kresit-dns
10.200.1.11     iitbombay-dns
10.129.250.1    ccrouter-interface5
10.105.1.7      cse-dns
10.129.1.250    ccrouter-interface6
DNS SERVICE TIME DISTRIBUTION
DNS SERVICE TIME DISTRIBUTION: OBSERVATIONS

DNS             Max (ms)   Min (ms)   Avg (ms)   Median (ms)   90th percentile (ms)
cse-dns         4974       0          244        2             559
iitbombay-dns   4995       0          230        10            583
kresit-dns      4992       0          332        8             1012

• Median response time is very low for all servers
• Average is significantly greater than median
  • Heavy tailed
• kresit-dns has a much higher average and 90th percentile
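The columns of the table above can be computed from raw service times as in the following sketch. The function name and the sample values are hypothetical, and the nearest-rank rule for the 90th percentile is an assumption, since the slides do not say which percentile definition was used.

```python
import statistics


def service_time_summary(times_ms):
    """Return max, min, average, median and 90th percentile of service times."""
    ordered = sorted(times_ms)
    n = len(ordered)
    # Nearest-rank percentile: smallest value with at least 90 % of samples
    # at or below it. Integer arithmetic avoids floating-point rounding.
    rank = (90 * n + 99) // 100
    return {
        "max": ordered[-1],
        "min": ordered[0],
        "avg": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": ordered[rank - 1],
    }


if __name__ == "__main__":
    # Hypothetical heavy-tailed sample: most queries are fast, a few are slow.
    sample = [1, 2, 2, 3, 5, 8, 10, 20, 500, 4000]
    print(service_time_summary(sample))
```

On a heavy-tailed sample like this one, a few slow queries pull the average far above the median, which is the pattern reported in the observations.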
OUTAGE DISTRIBUTIONS
• Most of the outages are short.
• Median is <= 2 minutes and 90th percentile is <= 10 minutes for
almost all nodes.
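Outage lengths of this kind can be derived from a time-ordered series of status samples. The sketch below is illustrative only: the (minute, is_up) sample format and the function name are assumptions, and an outage still open at the end of the series is simply not counted.

```python
def outage_lengths(samples):
    """Return outage lengths from time-ordered (minute, is_up) samples.

    An outage starts at the first 'down' sample after an 'up' sample and
    ends at the next 'up' sample.
    """
    lengths = []
    outage_start = None
    for minute, is_up in samples:
        if not is_up and outage_start is None:
            outage_start = minute
        elif is_up and outage_start is not None:
            lengths.append(minute - outage_start)
            outage_start = None
    return lengths


if __name__ == "__main__":
    samples = [(0, True), (2, False), (4, False), (6, True), (8, False), (10, True)]
    print(outage_lengths(samples))
```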
PERCENTAGE DOWNTIME ACROSS DAYS
• On most days, downtime is < 2 % for most of the nodes.
• There is no clear pattern across days.
COMBINED DOWNTIME

Element           At least one down (%)   All not working (%)
Hostel 8 Router   2.144                   1.980
Hostel 3 Router   0.686                   0.686
CC Router         0.657                   0.565
DNS Servers       0.414                   0.406

• netmon ~ 0.24 %
• Percentage of time at least one interface is not working is close to the percentage of time all are not working
  • Either the machine goes down
  • Or the measurements are not taking place at the same time
    • Time to check the status of a machine is variable
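The two columns of the table above can be computed per element as in this sketch. The input layout (one tuple of per-interface booleans per measurement round) and the function name are assumptions made for illustration.

```python
def combined_downtime(status_rows):
    """Return (% rounds with at least one interface down, % rounds with all down).

    status_rows holds one tuple per measurement round, with one boolean per
    interface of the element (True means the interface was working).
    """
    total = len(status_rows)
    at_least_one_down = sum(1 for row in status_rows if not all(row))
    all_down = sum(1 for row in status_rows if not any(row))
    return 100.0 * at_least_one_down / total, 100.0 * all_down / total


if __name__ == "__main__":
    rounds = [(True, True), (False, True), (False, False), (True, True)]
    print(combined_downtime(rounds))
```

When the two percentages are nearly equal, as in the table, interfaces of the same element mostly fail together.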
RESULTS SUMMARY

• Router failure > DNS failure > netmon failure
• Median node outage <= 2 min
• Small number of outages each day
  • No pattern across days
• Average DNS service time ~ 300 ms
• netmon failure is less than generally perceived
  • Dependence on other services: LDAP, DNS
• A lot of machinery in the network is old
PROXY LOG ANALYSIS
MOTIVATION
• Per-day logs are huge, over 6 GB
  • Storing logs to perform long historical analysis is a problem
  • Over 2 TB for a year
• What is the traffic distribution?
• What is the object size distribution?
• What is the response time distribution?
• Is there some trend across days?
• What strategy can be used to select logs for long-term historical analysis?
PROXY LOG ANALYSIS

• Log file has the following format:
  Month Date Time Proxy_Server squid_process_id
  epoch_timestamp process_time_ms source_ip
  tcp_status/http_status_code object_size request_type
  URL user_id hierarchy_code/server_ip
  object_type/object_sub_type
• Stored in a MySQL database
• Processed logs for a week, from May 14, 2009 to May 20, 2009
  • Size of the log file ~ 6 GB
  • Number of requests in a day ~ 22 million
  • Bytes downloaded ~ 401.6 GB
TRAFFIC DISTRIBUTION ON OBJECT TYPE: REQUESTS
• Percentage distribution remains the same across days
• Multimedia traffic is the least, ~ 0.2 %
• Text traffic is the maximum, ~ 40 %
TRAFFIC DISTRIBUTION ON OBJECT TYPE: DOWNLOADED BYTES
• Percentage distribution remains the same across days
• Multimedia traffic is the maximum, ~ 38 %
TRAFFIC DISTRIBUTION ON LOCATION: REQUESTS
• Percentage distribution remains the same across weekdays
• Increase in hostel traffic on weekends
• Decrease in academic traffic on weekends
TRAFFIC DISTRIBUTION ON LOCATION: DOWNLOADED BYTES
• Percentage distribution of downloaded bytes follows the number of requests
• Object type distribution remains the same across days, so the majority of
users behave similarly in different locations
TRAFFIC DISTRIBUTION: SUMMARY

Category      Requests (%)   Bytes (%)
Application   11.02          30.52
Image         35.43          12.05
Text          42.76          14.94
Multimedia    0.18           38.28
Other         10.61          4.20

Category      Requests (%)   Bytes (%)
Admin         3.50           2.83
Acad          28.16          25.59
Hostel        61.90          64.73
Resnet        6.58           6.85
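Shares like those in the summary tables can be derived from log lines in the field order given earlier for the proxy log. The sketch below is a minimal illustration, not the MySQL-based pipeline used in the study: the field names, the sample lines, and grouping by the top-level object type (rather than the slide categories) are all assumptions.

```python
from collections import defaultdict

# Field order follows the log format listed in the slides; names are illustrative.
FIELDS = [
    "month", "date", "time", "proxy_server", "squid_process_id",
    "epoch_timestamp", "process_time_ms", "source_ip", "status",
    "object_size", "request_type", "url", "user_id", "hierarchy", "object_type",
]


def parse_line(line):
    """Split one log line into a record, or return None if it is malformed."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    record = dict(zip(FIELDS, parts))
    try:
        record["object_size"] = int(record["object_size"])
        record["process_time_ms"] = int(record["process_time_ms"])
    except ValueError:
        return None
    # Group by the part before the slash, e.g. "text" for "text/html".
    record["category"] = record["object_type"].split("/")[0]
    return record


def shares_by_category(records):
    """Return {category: (percent of requests, percent of bytes)}."""
    requests = defaultdict(int)
    volume = defaultdict(int)
    for record in records:
        requests[record["category"]] += 1
        volume[record["category"]] += record["object_size"]
    total_requests = sum(requests.values())
    total_bytes = sum(volume.values())
    return {
        category: (100.0 * requests[category] / total_requests,
                   100.0 * volume[category] / total_bytes)
        for category in requests
    }


if __name__ == "__main__":
    # Hypothetical lines written to match the stated field order.
    lines = [
        "May 14 10:00:01 netmon squid[1234]: 1242275401.123 480 10.12.1.5 "
        "TCP_MISS/200 1500 GET http://example.org/a.html user1 DIRECT/192.0.2.10 text/html",
        "May 14 10:00:02 netmon squid[1234]: 1242275402.456 9000 10.12.1.6 "
        "TCP_MISS/200 4500 GET http://example.org/b.mp4 user2 DIRECT/192.0.2.10 video/mp4",
    ]
    records = [r for r in map(parse_line, lines) if r is not None]
    print(shares_by_category(records))
```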
NUMBER OF ARRIVALS PER SECOND
• Lower activity from 2 a.m. to 11 a.m. (LAN curtailment)
• Higher activity at 3 p.m., 7 p.m., and 11 p.m.
• Average ~ 250, standard deviation ~ 135
NUMBER OF REQUESTS CONCURRENTLY SERVED
• Average ~ 2000, standard deviation ~ 859
• Follows the arrival curve
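The number of requests served concurrently can be estimated from the timestamp and processing time of each log entry with a sweep over start and end events. This sketch assumes that the logged timestamp marks the completion of a request, so that its start is the timestamp minus the processing time; the function name and input format are hypothetical.

```python
def peak_concurrency(requests):
    """Return the maximum number of requests in progress at the same time.

    requests is an iterable of (end_epoch_seconds, process_time_ms) pairs.
    """
    events = []
    for end_s, duration_ms in requests:
        start_s = end_s - duration_ms / 1000.0
        events.append((start_s, 1))   # request begins
        events.append((end_s, -1))    # request completes
    # At equal times, starts sort before ends, so touching intervals overlap.
    events.sort(key=lambda event: (event[0], -event[1]))
    current = 0
    peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


if __name__ == "__main__":
    # Intervals [5, 10], [8, 12] and [19, 20] in seconds.
    print(peak_concurrency([(10.0, 5000), (12.0, 4000), (20.0, 1000)]))
```

Recording the running count at each event instead of only its maximum would give the concurrency curve over the day.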
MEAN RESPONSE TIME AT TIME OF DAY
• Response time remains almost constant throughout the day
• A peak at around 4 a.m.
• Average ~ 9.8 seconds
MEDIAN RESPONSE TIME AT TIME OF DAY
• Median response time remains constant throughout the day, 480 ms for the day
• The median is a better estimate than the mean of the typical response time on a day
• Neither the median nor the mean response time follows the curve of requests
concurrently served or the arrival curve
CUMULATIVE RESPONSE TIME DISTRIBUTION
• For multimedia the curve becomes linear
• For the remaining categories it is heavy tailed
• Median response times: application ~ 472 ms, text ~ 563 ms, image
~ 172 ms, multimedia ~ 10175 ms and other ~ 672 ms
CUMULATIVE OBJECT SIZE DISTRIBUTION
• For multimedia, object sizes are more evenly distributed
• Remaining categories have 90 % of objects < 10 KB
• Median object sizes: application ~ 1.5 KB, text ~ 0.8 KB, image ~ 1.7
KB, multimedia ~ 903 KB and other ~ 0.46 KB
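A point on a cumulative distribution, such as the share of objects below 10 KB, can be read off a sorted sample as in this sketch. The function name and the sample sizes are hypothetical and are not taken from the measured logs.

```python
from bisect import bisect_left


def fraction_below(values, threshold):
    """Return the fraction of values strictly below the threshold."""
    ordered = sorted(values)
    return bisect_left(ordered, threshold) / len(ordered)


if __name__ == "__main__":
    # Hypothetical object sizes in bytes.
    sizes = [300, 500, 700, 800, 1500, 1700, 2000, 4000, 9000, 903000]
    print(fraction_below(sizes, 10_000))
```

Evaluating the function over a range of thresholds traces the full cumulative curve for one category.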
RESULTS SUMMARY
• Multimedia traffic is the major part of WAN traffic
• Percentage traffic distribution
  • Similar across object types on different days
  • Similar in different areas except on weekends
  • Thus any log file can be selected as a representative of the week
    • Larger log file for more data
    • One for the weekend and one for weekdays
FUTURE WORK




• Characterization of request processing time at the proxy
• Explore the other causes of failure, including the LDAP service
• Explore failures on the ISP side, from a point outside the network
• Study the traffic within the LAN
THESIS CONTRIBUTIONS
• Studied the tools and methodologies used for network measurement
• Surveyed and documented the campus network of IIT Bombay
  • Architecture
  • Services
  • Failures
• Developed a tool to detect some of the failures
  • Can be easily extended to detect others
• Experimental evaluation of the tool by setting up a testbed
• Measurement analysis of proxy logs
BIBLIOGRAPHY
[1] Computer Center, IIT Bombay.
http://www.cc.iitb.ac.in
[2] dnscache. http://cr.yp.to/djbdns/dnscache.html
[3] Iperf. http://dast.nlanr.net/Projects/Iperf/
[4] iptables.
http://www.netfilter.org/projects/iptables/index.html
[5] Jpcap: a Java library for capturing and sending
network packets.
http://netresearch.ics.uci.edu/kfujii/jpcap/doc/.
[6] Squid logs. http://wiki.squid-cache.org/SquidFaq/SquidLogs
[7] Traceroute.
BIBLIOGRAPHY CNTD.
[8] Ultra monkey. http://www.ultramonkey.org/
[9] Wikimedia. http://www.squid-cache.org/Library/wikimedia.dyn
[10] Kostas G. Anagnostakis, Michael Greenwald,
and Raphael Ryger. cing: Measuring network-internal delays using only existing
infrastructure. In proceedings of IEEE Infocom,
April 2003.
[11] Ranveer Chandra, Venkata N. Padmanabhan,
and Ming Zhang. WiFiProfiler: Cooperative
Diagnosis in Wireless LANs. In Proceedings of
the 4th international conference on Mobile
systems, applications and services, June 2006.
BIBLIOGRAPHY CNTD.
[12] Yu-Chung Cheng, John Bellardo, Peter Benko,
Alex C. Snoeren, Geoffrey M. Voelker, and Stefan
Savage. Jigsaw: Solving the Puzzle of Enterprise
802.11 Analysis. In Proceedings of the 2006
conference on Applications, technologies,
architectures, and protocols for computer
communications, September 2006
[13] Ramesh Govindan and Hongsuda
Tangmunarunkit. Heuristics for Internet map
discovery. In Proceedings of the Nineteenth Annual
Joint Conference of the IEEE Computer and
Communications Societies, 2000.
BIBLIOGRAPHY CNTD.
[14] Bradley Huffaker, Marina Fomenkov, David
Moore, and Ke Claffey. Macroscopic analyses of
the infrastructure: measurement and
visualization of Internet connectivity and
performance. In proceedings of Passive and
Active Measurements, 2001
[15] Van Jacobson. pathchar - a tool to infer
characteristics of Internet paths, 1997.
[16] Alex Rousskov and Valery Soloviev. A
performance study of the Squid proxy on
HTTP/1.0. World-Wide Web Journal, Special
Edition on WWW Characterization and
Performance Evaluation, 1999.
BIBLIOGRAPHY CNTD.
[17] Stefan Savage. Sting: a TCP-based Network
Measurement Tool. In Proceedings of the Second
Conference on USENIX Symposium on Internet
Technologies and Systems, 1999.
[18] Subhabrata Sen and Jia Wang. Analyzing
peer-to-peer traffic across large networks. In
Proceedings of the 2006 ACM CoNEXT
conference, 2006.
[19] Nirav S. Uchat. IIT bombay web traffic
characterization.
[20] Ameya P. Usgaonkar. Network Performance
Analysis by Mining Multi-Variate Time Series
Data, January 2001.
Extra Slides
RELATED WORK

• Passive Measurement
  • WiFiProfiler
    • Collaborative diagnosis, information from neighbors
    • Blame assignment algorithm to predict actual cause
  • Jigsaw
    • Collect and merge traces from multiple vantage points
    • Create single unified view of network
      • Large scale synchronization
      • Frame unification
    • Measures
      • Queuing delays experienced by users
      • Throughput: compare observed vs expected (using RTT, path loss)
      • Effect of mobility techniques: scanning, DHCP, initial association
RELATED WORK CNTD

• Squid Log Analysis by Rousskov et al.
  • Logs from seven proxies, 18 days of logs
  • Applied patch to Squid to measure: proxy connect time, client
    connect time, server reply time, proxy reply time, swap-in
    time and swap-out time
  • Studied traffic distribution, response time at proxy, number
    of requests at proxy, disk traffic intensity, disk utilization,
    disk response time: all against TCP_STATUS, i.e. HIT and MISS
  • Shortcomings:
    • No long-term historical analysis
    • No comparison of direct traffic with proxied traffic
• Active measurements
  • Pathchar: bandwidth, queue size, packet drop rate
  • Traceroute: RTT, topology
RELATED WORK CNTD
• Active Measurement
  • PathChar
    • Measures: bandwidth, queue size, packet drop rate
    • Uses TTL field in IP header
    • Series of probes with varying packet size
    • Neglecting queuing delay, $S_{error}/B$ and $t_{processing}$, the round-trip time reduces to
      $RTT = S_{packet}/B$
    • Packet loss: number of error messages received
    • Statistic for $n$th node = Statistic till $n$th node - Statistic till $(n-1)$th node
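The per-node subtraction and the simplified relation RTT = S_packet/B can be sketched as follows. This illustrates only the two formulas on the slide and is not pathchar's actual estimation procedure; the function names and sample numbers are hypothetical.

```python
def per_hop(cumulative):
    """Turn statistics measured up to each hop into per-hop statistics.

    cumulative[i] is the statistic measured from the source up to hop i + 1.
    """
    previous = 0.0
    result = []
    for value in cumulative:
        result.append(value - previous)
        previous = value
    return result


def bandwidth_bytes_per_second(packet_bytes, rtt_seconds):
    """Solve RTT = S_packet / B for B under the slide's simplification."""
    return packet_bytes / rtt_seconds


if __name__ == "__main__":
    # Hypothetical cumulative delays (ms) to hops 1, 2 and 3.
    print(per_hop([2.0, 5.0, 9.5]))
    print(bandwidth_bytes_per_second(1500, 0.012))
```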

APPLICATION LAYER FAILURES

• Web Access Failures
  • Service Unavailable
  • Connection timed out
  • Connection refused
  • Connection reset
  • Gateway Timeout
  • No data received
  • Connection closed at an intermediate byte
• DNS Access Failures
  • Connection timed out
  • Blank answer field
• Router Access Failures
  • No Route To Host
  • No response received
IMPLEMENTATION

• Client Module
  • Snoop on incoming packets using the jpcap library
  • Node reachability
    • If any packets received -> Subnet switch reachable
    • If IP packets from another subnet received -> Subnet router reachable
    • If IP packets from DNS server received -> DNS server reachable
    • If IP packets from netmon received -> netmon reachable
  • Traffic characteristics
    • Size of packet
    • Delay using inter-arrival time of two packets
  • Threads to synchronize measurements and information sending
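The reachability rules of the client module can be sketched over a list of observed source addresses. The original module captures packets with jpcap in Java; this Python sketch only mirrors the inference rules. The subnet prefix and the proxy address used below are hypothetical; 10.200.1.11 is the iitbombay-dns address listed in the setup slide.

```python
import ipaddress


def infer_reachability(source_ips, local_subnet, dns_ip, proxy_ip):
    """Infer which elements are reachable from the sources of received packets."""
    network = ipaddress.ip_network(local_subnet)
    dns_address = ipaddress.ip_address(dns_ip)
    proxy_address = ipaddress.ip_address(proxy_ip)
    status = {
        "subnet_switch": False,
        "subnet_router": False,
        "dns_server": False,
        "netmon": False,
    }
    for ip in source_ips:
        address = ipaddress.ip_address(ip)
        status["subnet_switch"] = True       # any packet at all arrived
        if address not in network:
            status["subnet_router"] = True   # packet crossed the subnet router
        if address == dns_address:
            status["dns_server"] = True
        if address == proxy_address:
            status["netmon"] = True
    return status


if __name__ == "__main__":
    seen = ["10.129.41.7", "10.200.1.11"]
    # The /24 prefix and the proxy address 10.0.0.99 are made up for this example.
    print(infer_reachability(seen, "10.129.41.0/24", "10.200.1.11", "10.0.0.99"))
```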
IMPLEMENTATION CNTD

• Server Module
  • Check plugged-in ethernet cable: status of interface using ifconfig
  • Status of switch/router: hop-limited IP packets, using traceroute
  • Status of DNS server: query to the DNS server using dig
  • Status of proxy: an HTTP GET request at port 80 using wget
  • Web download: using wget
  • Using the Java runtime library to run these utilities
  • Synchronize using the SNTP protocol: implemented in Java
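Running the utilities as external processes, as the server module does through the Java runtime library, can be sketched in Python with the standard subprocess module. The interface name, the choice of target hosts and the timeout are assumptions made for this illustration; the addresses come from the node table in the setup slide.

```python
import subprocess

# Illustrative command lines. "eth0" and the queried name are assumptions;
# 10.129.250.1 and 10.200.1.11 appear in the slides' node table.
CHECKS = {
    "interface": ["ifconfig", "eth0"],
    "router": ["traceroute", "10.129.250.1"],
    "dns": ["dig", "@10.200.1.11", "www.iitb.ac.in"],
    "proxy": ["wget", "-O", "/dev/null", "http://netmon.iitb.ac.in"],
}


def run_check(command, timeout_s=30):
    """Run one utility and return (exit code, standard output).

    Returns (None, "") when the utility is missing or does not finish in time.
    """
    try:
        done = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return None, ""
    return done.returncode, done.stdout


if __name__ == "__main__":
    for name, command in CHECKS.items():
        exit_code, _ = run_check(command)
        print(name, exit_code)
```

Deciding whether an element is up still requires inspecting each utility's output, which this sketch leaves out.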
IMPLEMENTATION CNTD

• Communication Module
  • Used for sending logs and querying diagnostic-nodes
  • Implemented using the Java Net package
    • Receiver listens on a port
    • Sender connects and sends the logs/query
    • Our protocol to send and receive messages
• Logging Module
  • Used by diagnostic, server and client nodes
  • Stores logs in directory hierarchy: ip/yyyy/mm/dd
  • Unsent logs stored to be sent in future
  • New threads are created to write logs -> prevents blocking
  • Implemented using Java threads and the Java IO package
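The ip/yyyy/mm/dd layout of the logging module can be sketched as a small path builder. The root directory name and the function name are hypothetical; only the ordering of the path components comes from the slide.

```python
from datetime import date
from pathlib import PurePosixPath


def log_directory(root, node_ip, day):
    """Build the directory for one node's logs on one day: root/ip/yyyy/mm/dd."""
    return PurePosixPath(root) / node_ip / f"{day.year:04d}" / f"{day.month:02d}" / f"{day.day:02d}"


if __name__ == "__main__":
    print(log_directory("logs", "10.129.1.1", date(2009, 5, 14)))
```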
IMPLEMENTATION CNTD

• Diagnostic Module
  • Uses the logs of server and client nodes
  • Continuous mode
    • Analyzes statistics every 10 minutes
    • Statistics generated: node outages, percentage status distribution,
      last uptime status of nodes, DNS service time statistics
  • Offline mode
    • User specifies the start and end time of measurement
    • Statistics generated: node outages, percentage status distribution,
      last uptime status of nodes, DNS service time statistics, node status
      at given time
  • Remote query mode
    • User can query about node status at given time
PERCENTAGE DOWN-TIME ON A DAY: DNS SERVERS
• On most days, percentage downtime is < 1 % for all servers
• No pattern in down-time across days
PERCENTAGE DOWN-TIME ON A DAY: NETMON
• On most days, percentage downtime is < 0.2 %
• No pattern in down-time across days
PERCENTAGE DOWN-TIME ON A DAY: HOSTEL 8 ROUTER
• On most days, percentage downtime is < 2 %
• No pattern in down-time across days
PERCENTAGE DOWN-TIME ON A DAY: CC ROUTER
• On most days, percentage downtime is < 1 %
• No pattern in down-time across days
OUTAGE LENGTH DISTRIBUTION: CC ROUTER
• Most of the outages are of smaller length
OUTAGE LENGTH DISTRIBUTION: DNS
• Most of the outages are of small length
• Smaller number of outages
OUTAGE LENGTH DISTRIBUTION: NETMON
• Most of the outages are of length < 3 min
OUTAGE LENGTH DISTRIBUTION: HOSTEL 8 ROUTER
• Most of the outages are of smaller length
OUTAGE LENGTHS
STATUS DISTRIBUTION
PROXY RESPONSE TIME VS USER RESPONSE TIME
EXPERIMENT: PROXY FAILURE


• Setup:
  • wget to fetch berkeley and netmon (http://netmon.iitb.ac.in)
  • Repeated at 6-minute intervals
  • From 2:42 on 22nd September to 1:06 on 25th September,
    from kresit (10.129.41.189)
  • A 400 Bad Request response, denoted by 1, indicates the proxy is up
  • -1 for connection refused error
  • -3 for 503 server error
• Result
  • netmon: 0.7 % connection refused errors
  • berkeley: 8.7 % connection refused errors, 0.28 % 503 errors
  • Intersection of failures implies
    • Machine not running, or
    • Port is closed
EXPERIMENT: PROXY FAILURE CNTD
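The outcome coding used in the proxy failure experiment (1, -1, -3) and the resulting failure percentages can be sketched as below. The function names and the handling of any other outcome (returned as None) are assumptions; the slides define only the three codes.

```python
from collections import Counter

PROXY_UP = 1          # 400 Bad Request reply: the proxy answered
REFUSED = -1          # connection refused
SERVER_ERROR = -3     # 503 server error


def encode_outcome(http_status=None, connection_refused=False):
    """Map one probe result to the experiment's code, or None if not covered."""
    if connection_refused:
        return REFUSED
    if http_status == 400:
        return PROXY_UP
    if http_status == 503:
        return SERVER_ERROR
    return None


def outcome_percentages(codes):
    """Return the percentage of probes that produced each code."""
    counts = Counter(codes)
    total = len(codes)
    return {code: 100.0 * count / total for code, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical probe series: eight successes, one refusal, one 503.
    codes = [encode_outcome(400)] * 8 + [encode_outcome(connection_refused=True), encode_outcome(503)]
    print(outcome_percentages(codes))
```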
EXPERIMENT: DNS FAILURE
• Setup
  • dig to send back-to-back probes to dns.iitb.ac.in
  • Sent periodically, once every 2 minutes
  • Conducted from 22:06 on 17th September to 13:36 on
    18th September, from kresit (10.129.41.189)
  • One query for an internal domain and the other for an
    external domain
  • Both the domains randomly generated
  • 1 -> answer field present, 0 -> answer field not present
• Result
  • External queries failed 2.36 % of the time
  • Internal queries never failed
EXPERIMENT: DNS FAILURE CNTD