IIT BOMBAY NETWORK MEASUREMENTS
MONITORING THE PERFORMANCE OF BACKHAUL CAMPUS NETWORK
Guided by: Prof. Purushottam Kulkarni
Submitted by: Manveer Singh Chawla

OVERVIEW
• Motivation
• Problem statement
• Related Work
• IIT Bombay Network Background
• Our Solution
• Architecture
• Implementation
• Experimental Evaluation
• Network measurement data
• Proxy log analysis
• Future Work
• Thesis Contribution

MOTIVATION
Consider the following scenarios:
• User writes a mail and clicks send, but sending fails
• User is talking with a friend on gtalk and it disconnects
• User is browsing the web but the browsing speed is very slow
What will a novice user do? With no structured approach, the user:
• Starts fiddling around with network settings
• Reboots the machine
Result?
• Wastes a lot of time
• May not even find the cause

MOTIVATION CNTD.
Multiple points of failure:
• User's machine
    • Incorrect network settings
    • Failure of ethernet card/cable
• LAN
    • Switch
    • Router
    • DNS
    • Proxy
• WAN
    • Web server
    • Network congestion
No user control over LAN / WAN failures

PROBLEM DEFINITION
1. Build a measurement tool which monitors the status of elements in the network backbone, such that in case of a network failure it is able to detect and diagnose the cause of the failure. These elements include the subnet routers, switches, DNS servers and network proxy.
2. A measurement study of the network proxy, covering response time variation, traffic pattern and object size variation across the day.

RELATED WORK
• Jigsaw
    • Merges traces to passively measure queuing delays and throughput
    • We summarize a trace to determine the status of nodes
• WiFiProfiler
    • Fault diagnosis in a wireless setting for the user machine
    • Performs distributed analysis
    • Ours is centralized processing of a wired network
• Network measurement tools
    • Pathchar: bandwidth, queue size, packet drop rate
    • Traceroute: RTT, topology

IIT BOMBAY NETWORK MAP

SERVICES
• Proxy: netmon
    • Web caching
    • Authentication
    • Content filtering
• Firewall
    • NATing
    • Packet filtering
• Internal and External DNS
    • DNS server for campus
    • DNS servers in a few subnets
• Monitoring
    • Traffic statistics

WORKING OF PROXY

MEASUREMENT CHALLENGES
• Permission from Computer Centre
• Large volume of data
• Unaware and amateur users
• Specific h/w required
• What to measure in such a large network
• Use existing infrastructure
• Old h/w: unpredictable failures
• WAN: firewall makes diagnosis difficult

OUR SOLUTION ARCHITECTURE

SERVER NODE
• Check ethernet cable: ifconfig utility on interface
• Check the subnet switch: traceroute utility to send packet
• Check the subnet router: traceroute utility to send packet
• Check the CC router status: traceroute utility to send packet
• Check the DNS server status: dig utility to query status
• Check the status of netmon: wget utility to check the status
• Perform external web download: wget utility to download URL
• Send logs to diagnostic-node after collection
Response labels shown in the diagram: ICMP PORT_UNREACHABLE; query reply from server; Bad request on HTTP GET request

CLIENT NODE
• Subnet switch status: any packet received
• Upload/download rate: calculate the bytes received and sent, using inter-packet time difference for a connection
• Subnet router reachability: packet from any other subnet is received
• DNS server reachability: packet from DNS server is received
• netmon reachability: packet from netmon.iitb.ac.in is received
• Send logs to diagnostic-node after
collection

DIAGNOSTIC NODE
Inputs from client node:
• Status of subnet switch
• Reachability of subnet router, DNS server and netmon
Inputs from server node:
• Status of subnet switch, subnet router, CC router, DNS server, netmon
• Web download performance
Outputs of diagnostic node:
• Outage lengths
• Frequency of different failures
• Status of different machines
• DNS service time distribution

DIAGNOSTIC NODE CNTD.
Determining the status of proxy (netmon):
• Failure seen: is it seen by all?
    • Yes: machine down with failure
    • No: machine overloaded

DIAGNOSTIC NODE CNTD.
Determining the status of DNS servers:
• Is it not reachable for all?
    • Yes: machine down
    • No: send back-to-back queries
        • Internal query answered: problem in hierarchy
        • Other cases: machine overloaded

DIAGNOSTIC NODE CNTD.
• Offline mode: statistics for specified time period
• Online mode: statistics for last 10 minutes
• Remote query mode: query status of node at specified time

EXPERIMENTAL EVALUATION

SETUP
• Server node at 8 locations around the campus
• Client node at 3 locations around the campus
• Collected data from 26th March to 15th June
• No data for 25th May to 2nd June
• Measurements for the following nodes:

IP Address      Name
192.0.50.1      h8router-interface1
10.107.1.250    ccrouter-interface1
10.12.250.1     h8router-interface2
192.0.20.2      ccrouter-interface2
10.12.250.2     h12switch
192.0.40.2      ccrouter-interface3
10.2.250.1      h3router-interface1
10.105.250.1    cserouter-interface1
192.0.50.2      ccrouter-interface4
10.129.1.1      kresit-dns
10.200.1.11     iitbombay-dns
10.129.250.1    ccrouter-interface5
10.105.1.7      cse-dns
10.129.1.250    ccrouter-interface6

DNS SERVICE TIME DISTRIBUTION

DNS SERVICE TIME DISTRIBUTION: OBSERVATIONS

DNS             Max (ms)   Min (ms)   Avg (ms)   Median (ms)   90th percentile (ms)
cse-dns         4974       0          244        2             559
iitbombay-dns   4995       0          230        10            583
kresit-dns      4992       0          332        8             1012

• Median response time is very low for all servers
• Average is significantly greater than median: the distribution is heavy tailed
• kresit-dns has a much higher average and 90th percentile

OUTAGE DISTRIBUTIONS
• Most of the outages are
of smaller length.
• Median is <= 2 minutes and 90th percentile is <= 10 minutes for almost all nodes.

PERCENTAGE DOWNTIME ACROSS DAYS
• On most days, downtime is < 2 % for most of the nodes.
• There is not much pattern across days.

COMBINED DOWNTIME

Element           At least one down (%)   All not working (%)
Hostel 8 Router   2.144                   1.980
Hostel 3 Router   0.686                   0.686
CC Router         0.657                   0.565
DNS Servers       0.414                   0.406

netmon: ~ 0.24 %
• Percentage of time at least one interface is not working is close to the percentage of time all are not working
    • Either the machine goes down
    • Or the measurements are not taking place at the same time
    • Time to check the status of a machine is variable

RESULTS SUMMARY
• Router failure > DNS failure > netmon failure
• Median node outage <= 2 min
• Small number of outages each day
• No pattern across days
• Average DNS service time ~ 300 ms
• netmon downtime is less than generally perceived
• Dependence on other services: LDAP, DNS
• A lot of machinery in the network is old

PROXY LOG ANALYSIS MOTIVATION
• Per-day logs are huge, over 6 GB
• Storing logs for long-term historical analysis is a problem: over 2 TB for a year
• What is the traffic distribution?
• What is the object size distribution?
• What is the response time distribution?
• Is there some trend across days?
• What strategy can be used to select logs for long-term historical analysis?

PROBLEM DEFINITION
1. Build a measurement tool which monitors the status of elements in the network backbone, such that in case of a network failure it is able to detect and diagnose the cause of the failure. These elements include the subnet routers, switches, DNS servers and network proxy.
2. A measurement study of the network proxy, covering response time variation, traffic pattern and object size variation across the day.

PROXY LOG ANALYSIS
• Log file has the following format:
    Month Date Time Proxy_Server squid_process_id epoch_timestamp process_time_ms source_ip tcp_status/http_status_code object_size request_type URL user_id hierarchy_code/server_ip object_type/object_sub_type
• Stored in a MySQL database
• Processed logs for a week, from May 14, 2009 to May 20, 2009
• Size of the log file ~ 6 GB
• Number of requests in a day ~ 22 million
• Bytes downloaded ~ 401.6 GB

TRAFFIC DISTRIBUTION ON OBJECT TYPE: REQUESTS
• Percentage distribution remains the same across days
• Multimedia traffic is the least ~ 0.2 %
• Text traffic is the maximum ~ 40 %

TRAFFIC DISTRIBUTION ON OBJECT TYPE: DOWNLOADED BYTES
• Percentage distribution remains the same across days
• Multimedia traffic is the maximum ~ 38 %

TRAFFIC DISTRIBUTION ON LOCATION: REQUESTS
• Percentage distribution remains the same across weekdays
• Increase in hostel traffic on weekends
• Decrease in academic traffic on weekends

TRAFFIC DISTRIBUTION ON LOCATION: DOWNLOADED BYTES
• Percentage distribution of downloaded bytes follows the number of requests
• Object type distribution remains the same across days, thus the majority of users behave similarly in different locations

TRAFFIC DISTRIBUTION: SUMMARY

Category      Requests (%)   Bytes (%)
Application   11.02          30.52
Image         35.43          12.05
Text          42.76          14.94
Multimedia    0.18           38.28
Other         10.61          4.20

Category      Requests (%)   Bytes (%)
Admin         3.50           2.83
Acad          28.16          25.59
Hostel        61.90          64.73
Resnet        6.58           6.85

NUMBER OF ARRIVALS PER SECOND
• Lower activity from 2 a.m. to 11 a.m. (LAN curtailment)
• Peaks in activity at 3 p.m., 7 p.m. and 11 p.m.
• Average ~ 250, standard deviation ~ 135

NUMBER OF REQUESTS CONCURRENTLY SERVED
• Average ~ 2000, standard deviation ~ 859
• Follows the arrival curve

MEAN RESPONSE TIME AT TIME OF DAY
• Response time remains almost constant throughout the day
• A peak at around 4 a.m.
• Average ~ 9.8 seconds

MEDIAN RESPONSE TIME AT TIME OF DAY
• Median response time remains constant throughout the day, 480 ms for the day
• Median curve is a better estimate of the average value on a day
• Neither the median nor the mean response time follows the concurrently-served-requests curve or the arrival curve

CUMULATIVE RESPONSE TIME DISTRIBUTION
• For multimedia the curve becomes linear
• For the remaining categories it is heavy tailed
• Median response times: application ~ 472 ms, text ~ 563 ms, image ~ 172 ms, multimedia ~ 10175 ms and other ~ 672 ms

CUMULATIVE OBJECT SIZE DISTRIBUTION
• For multimedia, object sizes are more evenly distributed
• Remaining categories have 90 % of objects < 10 KB
• Median object sizes: application ~ 1.5 KB, text ~ 0.8 KB, image ~ 1.7 KB, multimedia ~ 903 KB and other ~ 0.46 KB

RESULTS SUMMARY
• Multimedia traffic is the major part of WAN traffic
• Percentage traffic distribution
    • Similar across object types on different days
    • Similar in different areas except on weekends
• Thus any log file can be selected as representative of the week
    • Larger log file for more data
    • One for the weekend and one for weekdays

FUTURE WORK
• Characterization of request processing time at the proxy
• Explore the other causes of failure, including the LDAP service
• Explore failures from the side of the ISP, from a point outside the network
• Study the traffic within the LAN

THESIS CONTRIBUTIONS
• Studied the tools and methodologies used for network measurement
• Surveyed and documented the campus network of IIT Bombay
    • Architecture
    • Services
    • Failures
• Developed a tool to detect some of the failures
    • Can be easily extended to detect others
• Experimental evaluation of the tool by setting up a testbed
• Measurement analysis of proxy logs

BIBLIOGRAPHY
[1]
Computer Center, IIT Bombay. http://www.cc.iitb.ac.in
[2] dnscache. http://cr.yp.to/djbdns/dnscache.html
[3] Iperf. http://dast.nlanr.net/Projects/Iperf/
[4] iptables. http://www.netfilter.org/projects/iptables/index.html
[5] Jpcap: a Java library for capturing and sending network packets. http://netresearch.ics.uci.edu/kfujii/jpcap/doc/
[6] Squid logs. http://wiki.squid-cache.org/SquidFaq/SquidLogs
[7] Traceroute.
[8] Ultra monkey. http://www.ultramonkey.org/
[9] Wikimedia. http://www.squid-cache.org/Library/wikimedia.dyn
[10] Kostas G. Anagnostakis, Michael Greenwald, and Raphael Ryger. cing: Measuring network-internal delays using only existing infrastructure. In Proceedings of IEEE Infocom, April 2003.
[11] Ranveer Chandra, Venkata N. Padmanabhan, and Ming Zhang. WiFiProfiler: Cooperative Diagnosis in Wireless LANs. In Proceedings of the 4th International Conference on Mobile Systems, Applications and Services, June 2006.
[12] Yu-Chung Cheng, John Bellardo, Peter Benko, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage. Jigsaw: Solving the Puzzle of Enterprise 802.11 Analysis. In Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, September 2006.
[13] Ramesh Govindan and Hongsuda Tangmunarunkit. Heuristics for Internet map discovery. In Proceedings of the Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies, 2000.
[14] Bradley Huffaker, Marina Fomenkov, David Moore, and kc claffy. Macroscopic analyses of the infrastructure: measurement and visualization of Internet connectivity and performance. In Proceedings of Passive and Active Measurements, 2001.
[15] Van Jacobson. pathchar - a tool to infer characteristics of Internet paths, 1997.
[16] Alex Rousskov and Valery Soloviev. A performance study of the Squid proxy on HTTP/1.0.
World-Wide Web Journal, Special Edition on WWW Characterization and Performance Evaluation, 1999.
[17] Stefan Savage. Sting: a TCP-based Network Measurement Tool. In Proceedings of the Second Conference on USENIX Symposium on Internet Technologies and Systems, 1999.
[18] Subhabrata Sen and Jia Wang. Analyzing peer-to-peer traffic across large networks. In Proceedings of the 2006 ACM CoNEXT Conference, 2006.
[19] Nirav S. Uchat. IIT Bombay web traffic characterization.
[20] Ameya P. Usgaonkar. Network Performance Analysis by Mining Multi-Variate Time Series Data, January 2001.

Extra Slides

RELATED WORK
Passive Measurement
• WiFiProfiler
    • Collaborative diagnosis, using information from neighbors
    • Blame assignment algorithm to predict the actual cause
• Jigsaw
    • Collects and merges traces from multiple vantage points
    • Creates a single unified view of the network
        • Large scale synchronization
        • Frame unification
    • Measures
        • Queuing delays experienced by users
        • Throughput: compare observed vs expected (using RTT, path loss)
        • Effect of mobility techniques: scanning, DHCP, initial association

RELATED WORK CNTD
Squid log analysis by Rousskov et al.
• Logs from seven proxies, 18 days of logs
• Applied a patch to Squid to measure: proxy connect time, client connect time, server reply time, proxy reply time, swap-in time and swap-out time
• Studied traffic distribution, response time at proxy, number of requests at proxy, disk traffic intensity, disk utilization, disk response time: all against TCP_STATUS i.e.
HITS and MISS
• Shortcomings:
    • No long-term historical analysis
    • No comparison of direct traffic with proxied traffic
Active measurements
• Pathchar: bandwidth, queue size, packet drop rate
• Traceroute: RTT, topology

RELATED WORK CNTD
Active Measurement
• PathChar
    • Measures: bandwidth, queue size, packet drop rate
    • Uses the TTL field in the IP header
    • Sends a series of probes with varying packet size
    • Neglecting queuing delay, $S_{error}/B$ and $t_{processing}$, the round-trip time reduces to $RTT = S_{packet}/B$
    • Packet loss: number of error messages received
    • Statistic for node $n$ = (statistic till the $n$th node) $-$ (statistic till the $(n-1)$th node)

APPLICATION LAYER FAILURES
• Web access failures
    • Service unavailable
    • Connection timed out
    • Connection refused
    • Connection reset
    • Gateway timeout
    • No data received
    • Connection closed at an intermediate byte
• DNS access failures
    • Connection timed out
    • Blank answer field
• Router access failures
    • No route to host
    • No response received

IMPLEMENTATION
Client Module
• Snoops on incoming packets using the jpcap library
• Node reachability
    • If any packets received -> subnet switch reachable
    • If IP packets from another subnet received -> subnet router reachable
    • If IP packets from DNS server received -> DNS server reachable
    • If IP packets from netmon received -> netmon reachable
• Traffic characteristics
    • Size of packet
    • Delay using inter-arrival time of two packets
• Threads to synchronize measurements and information sending

IMPLEMENTATION CNTD
Server Module
• Check plugged-in ethernet cable: status of interface using ifconfig
• Status of switch/router: hop-limited IP packets, using traceroute
• Status of DNS server: query to the DNS server using dig
• Status of proxy: an HTTP GET request at port 80 using wget
• Web download: using wget
• Uses the Java runtime library to run these utilities
• Synchronizes using the SNTP protocol, implemented in Java

IMPLEMENTATION CNTD
Communication Module
• Used for sending logs and querying diagnostic-nodes
• Implemented using the Java Net package
    • Receiver listens on a port
    • Sender connects and sends the logs/query
• Our protocol to
send and receive messages
Logging Module
• Used by diagnostic, server and client nodes
• Stores logs in a directory hierarchy: ip/yyyy/mm/dd
• Unsent logs are stored to be sent in future
• New threads are created to write logs -> prevents blocking
• Implemented using Java threads and the Java IO package

IMPLEMENTATION CNTD
Diagnostic Module
• Uses the logs of server and client nodes
• Continuous mode
    • Analyzes statistics every 10 minutes
    • Statistics generated: node outages, percentage status distribution, last uptime status of nodes, DNS service time statistics
• Offline mode
    • User specifies the start and end time of measurement
    • Statistics generated: node outages, percentage status distribution, last uptime status of nodes, DNS service time statistics, node status at given time
• Remote query mode
    • User can query about node status at a given time

PERCENTAGE DOWN-TIME ON A DAY: DNS SERVERS
• On most days percentage downtime is < 1 % for all servers
• No pattern in downtime across days

PERCENTAGE DOWN-TIME ON A DAY: NETMON
• On most days percentage downtime is < 0.2 %
• No pattern in downtime across days

PERCENTAGE DOWN-TIME ON A DAY: HOSTEL 8 ROUTER
• On most days percentage downtime is < 2 %
• No pattern in downtime across days

PERCENTAGE DOWN-TIME ON A DAY: CC ROUTER
• On most days percentage downtime is < 1 %
• No pattern in downtime across days

OUTAGE LENGTH DISTRIBUTION: CC ROUTER
• Most of the outages are of smaller length

OUTAGE LENGTH DISTRIBUTION: DNS
• Most of the outages are of small length
• Smaller number of outages

OUTAGE LENGTH DISTRIBUTION: NETMON
• Most of the outages are of length < 3 min

OUTAGE LENGTH DISTRIBUTION: HOSTEL 8 ROUTER
• Most of the outages are of smaller length

OUTAGE LENGTHS

STATUS DISTRIBUTION

PROXY RESPONSE TIME VS USER RESPONSE TIME

EXPERIMENT: PROXY FAILURE
Setup:
• wget to fetch Berkeley and netmon (http://netmon.iitb.ac.in)
• Repeatedly
performed every 6 minutes
• From 2:42 on 22nd September to 1:06 on 25th September, from kresit (10.129.41.189)
• 400 Bad Request response, denoted by 1, indicates the proxy is up
• -1 for connection refused error
• -3 for 503 server error
Result:
• netmon: 0.7 % connection refused errors
• Berkeley: 8.7 % connection refused errors, 0.28 % 503 errors
• Intersection of failures implies
    • Machine not running, or
    • Port is closed

EXPERIMENT: PROXY FAILURE CNTD

EXPERIMENT: DNS FAILURE
Setup:
• dig to send back-to-back probes to dns.iitb.ac.in
• Sent periodically, once every 2 minutes
• Conducted from 22:06 on 17th September to 13:36 on 18th September, from kresit (10.129.41.189)
• One query for an internal domain and the other for an external domain
• Both domains randomly generated
• 1 -> answer field present, 0 -> answer field not present
Result:
• External queries failed 2.36 % of the time
• Internal queries never failed

EXPERIMENT: DNS FAILURE CNTD
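
The status coding used in the proxy and DNS failure experiments can be sketched in a few lines of Python. This is only an illustrative sketch, not the tool described in these slides (which was written in Java): the function names, the None fallback for outcomes the slides do not define, and the sample probe list are all assumptions. Only the numeric codes (1, -1, -3 for the proxy; 1, 0 for DNS) come from the slides.

```python
# Hypothetical helpers; only the numeric status codes come from the slides.
PROXY_UP = 1        # HTTP 400 Bad Request: the proxy answered, so it is up
PROXY_REFUSED = -1  # connection refused
PROXY_503 = -3      # HTTP 503 server error


def classify_proxy_probe(http_status=None, connection_refused=False):
    """Map the outcome of one HTTP probe to the code used in the experiment."""
    if connection_refused:
        return PROXY_REFUSED
    if http_status == 400:
        return PROXY_UP
    if http_status == 503:
        return PROXY_503
    return None  # assumption: outcomes not listed on the slide stay unclassified


def classify_dns_probe(answer_count):
    """Return 1 if the DNS reply carried an answer field, 0 otherwise."""
    return 1 if answer_count > 0 else 0


def failure_percentage(codes, failure_codes):
    """Percentage of probes whose code is one of failure_codes."""
    if not codes:
        return 0.0
    failures = sum(1 for code in codes if code in failure_codes)
    return 100.0 * failures / len(codes)


if __name__ == "__main__":
    # Made-up sample of four proxy probes, purely for illustration.
    sample = [
        classify_proxy_probe(http_status=400),
        classify_proxy_probe(connection_refused=True),
        classify_proxy_probe(http_status=503),
        classify_proxy_probe(http_status=400),
    ]
    print(failure_percentage(sample, {PROXY_REFUSED}))
```

Keeping classification separate from the percentage calculation means the same failure_percentage helper can be applied to both the proxy codes and the DNS codes.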