ECEN 689 Special Topics in Data Science for Communications Networks
Nick Duffield
Department of Electrical & Computer Engineering
Texas A&M University

Organization
• Instructor: Nick Duffield
• Contact: duffieldng AT tamu DOT edu ; (979) 845-7328
• Class notes: http://cesg.tamu.edu/?p=2667
• Class times: Mon/Wed 03:00-04:15pm, CHEN 108
• Office hours: WEB 332D, Mon/Wed 11:00am-12:00pm
• Prerequisites: graduate standing; instructor approval; working background in networking, probability, statistics
• Grading:
  – Homework 50%; Project 15%; Presentation 15%; Final Exam 20%
  – Discussion of homework assignments is encouraged, but copying is not allowed. Assignments must be handed in on time to receive full credit.

Course Materials: All available online
• Background references
  – Baron: Probability and Statistics for Computer Scientists (2nd Edition) http://proquest.safaribooksonline.com.lib-ezproxy.tamu.edu:2048/9781439875919
  – Peterson & Davie: Computer Networks (5th Edition) http://proquest.safaribooksonline.com.lib-ezproxy.tamu.edu:2048//9780123850591
• Detailed references: selections from
  – Leskovec, Rajaraman & Ullman: Mining of Massive Data Sets http://www.mmds.org
  – Kolaczyk: Statistical Analysis of Network Data: Methods and Models. http://link.springer.com.lib-ezproxy.tamu.edu:2048/book/10.1007%2F978-0-387-88146-1
• Review articles and tutorials
  – Duffield: Sampling for Passive Internet Measurement: A Review
    • http://projecteuclid.org/euclid.ss/1110999311
  – Cormode, Duffield: Sampling for Big Data
    • http://nickduffield.net/download/papers/Tutorial_KDD_2014.ppsx
• Research literature references:
  – Will be communicated in class notes

Objectives of the course
• Broad description:
  – Statistical and algorithmic methods for acquiring and analysing massive, complex, and incomplete datasets.
  – Applications to measurement and analysis of operational data in ISP communication networks, routers and protocols.
  – Understanding of design decisions and trade-offs between statistical and computational goals
• Topics on the course
  – Sampling, sketching, frequent itemset mining, network probing, network tomography, graph sampling.
  – Relevant background in probability, statistics, and networking recapped as needed, with references for further reading
• Topics NOT on this course
  – Machine learning
  – Hadoop, MapReduce

About me
• Joined TAMU August 2014 from Rutgers University
• Worked for 18 years in AT&T Labs Research in New Jersey
• Previously Asst. Professor in Europe
• Undergrad/PhD in Physics and Mathematical Physics

Data Science for Communications Networks

Data Science and Big Data
• Big Data arises in many forms:
  – Physical measurements: from science (physics, astronomy)
  – Medical data: genetic sequences, detailed time series
  – Activity data: GPS location, social network activity
  – Business data: customer behavior tracking at fine detail
• Why is “Big Data” trending up?
  – Availability of data in new fields
  – Technological advances
    • Hardware
    • Computation
    • Algorithms
  – Anticipated value of analysis

Data Science in Communications Networks
• Motivating application: Internet Service Providers (ISPs)
• Many reasons to study data science from ISP viewpoint
  – Expertise: instructor’s experience from ISP world
  – Demand: data science methods developed in response to ISP needs
  – Practice: methods widely used in ISP monitoring, built into routers
  – Prescience: ISPs were first to hit many “big data” problems
  – Variety: many different places where data science is needed

Data Science Disciplines
• Transferable methods
  – Algorithms and Data Structures
  – Probability and Statistics
  – Inference and Machine Learning
• Applications domain
  – This course: communications networking

Data Science in Communications Networks
• Internet Service Providers had big data before “Big Data”
  – Operational metadata concerning network usage and state
    1. Telephony call detail records
       – Originating and receiving telephone number, duration, …
    2. IP traffic flow records generated by routers
       – Source and destination IP address of packet flows, #packets, #bytes, …
    3. Protocol transitions
       – Handovers of mobile device between wireless basestations
• Generated continuously, 100s of Terabytes per day
• Many other operational datasets
• Used in network management over a range of timescales
  – From months (network planning) to seconds (network attack detection)

Structure of Large ISP Networks
[Figure: ISP network structure. Labels: peering with other ISPs; city-level router centers; access networks (wireless, DSL, IPTV); backbone links; downstream ISP and business customers; network management & administration; service and datacenters]

Measuring the ISP Network: Data Sources
[Figure, repeated on each slide of this sequence: ISP network map with labels peering, router centers, access, backbone, business, management, datacenters; each slide highlights one data source]
• One-way packet loss & latency: active probing between measurement devices
• Roundtrip packet loss & latency: monitoring both directions of traffic between two hosts
• Status reports: device failures and transitions
• Customer care logs
• Protocol monitoring: e.g. wireless handovers (figure labels: A, B, C, D; active set (A,B); active set (C,D))
• Link traffic rates: timeseries of traffic per router interface, 5 minute granularity (figure axis: 0:00 to 0:35 in 5 minute steps)
• IP traffic flow records: generated by routers

Three challenges for ISP data analysis
• Scale: some datasets are enormous
  – IP traffic flow records, mobile device handovers, …
• Incompleteness:
  – Not all quantities can be directly measured
    • Would like to know packet loss and latency per link
    • Typically only measure these on a path comprising multiple links
• Complexity
  – Complex statistical properties difficult to model
    • Noisy data, skewed distributions, 80-20 laws, correlations
• The methods in this course tackle these challenges

1. Traffic Flow Measurement
• IP protocol layers & packet headers
• Router based traffic measurement
• Measurement design decisions
• Traffic flows, NetFlow

Protocol layers in the Internet
[Figure labels: application packet; transport header, payload; IP header, payload; link header, payload; network packet]

IP packet header
[Figure: IP version 4 (IPv4) header layout, bits 0 to 31]
• Main focus here:
  – 32 bit Source IP address (SrcIP)
  – 32 bit Destination IP address (DstIP)
  – Usually written in dot decimal notation, e.g., 128.194.121.31
• Also: IP Protocol (Proto) signifies which IP protocol is used in the remainder of the packet
• Routers: use DstIP for packet forwarding
  – Determine router egress interface for the packet
• How?
  – Routers can’t store (DstIP, egress) for each possible DstIP (2^32 ≈ 4G)

IP Prefixes
• Prefix = first m bits of IP address for some m ≤ 32
• Represents a block of addresses
  – First m bits in common; remaining 32 – m bits take any value
• CIDR notation for address block
  – dot_decimal_address / prefix length, e.g. 192.168.100.0 / 22
  – Comprises 2^(32-22) = 2^10 = 1024 addresses from 192.168.100.0 to 192.168.103.255
• In binary notation (first 22 bits common)
  – 192.168.100.0 = 11000000.10101000.01100100.00000000
  – 192.168.103.255 = 11000000.10101000.01100111.11111111

IP Routing and Prefixes
• Routers maintain a routing table
  – Routing table = list of (DstIP_Prefix, egress) pairs; currently ~500k
  – How? Routers communicate by protocols to announce and update tables
• Forwarding packets
  – Find the (DstIP_Prefix, egress) entry in the table with the longest prefix that matches the packet’s DstIP
  – Forward packet to egress interface
• More detail: Peterson & Davie, Chap 3.2 & 3.3

IP Header and Information for ISPs
• Have seen that IP header information is used to forward packets in routers in the ISP infrastructure
• How could an ISP use this information for network management if it could be monitored, recorded and analysed?
• Two example uses:
  – Network planning: identify potential new customers based on volumes of traffic to or from their IP addresses
  – Attack detection: detect an anomalous burst of traffic destined to a customer
• Many other ISP network management tasks use IP header information over a range of timescales: from months to seconds

Transport and Other Protocols
[Figure: protocol layers in the Internet, as above]
• Most data transmission is accomplished by one of two IP transport protocols that provide the appearance of a communications channel between hosts
• TCP: Transmission Control Protocol (Proto = 6)
  – connection oriented protocol providing
    • three-way handshake to set up connection
    • reliable ordered transmission
    • congestion avoidance
• UDP: User Datagram Protocol (Proto = 17)
  – connectionless, no reliability, no congestion avoidance
• Other (non-transmission) IP protocols in common use
  – ICMP: Internet Control Message Protocol (Proto = 1)
    • used to communicate error conditions; leveraged for probing & debugging

Transport Layer Header
[Figure: UDP header, bits 0 to 31. Row 1: source port (bits 0-15), destination port (bits 16-31). Row 2: UDP length, UDP checksum]
• 16 bit source port (SrcPrt) and destination port (DstPrt)
• Used in both TCP and UDP
• Associate packets with applications at hosts (see: binding)
  – Ports 0-1023: well known, assigned by IANA (mostly)
    • E.g. HTTP server listens on port 80; DNS uses port 53
  – Ports 1024-49151: registered ports
    • E.g. Minecraft 19132
  – Ports 49152-65535: dynamic ports
• More detail: Peterson & Davie, Chap. 5.1, 5.2

TCP/UDP Header and Information for ISPs
• Have seen that UDP/TCP header information (port numbers) is used at hosts to associate packets to applications
• Many of these associations are registered by IANA
• The identity of the application that generated a packet can be inferred (to some degree) from transport header port numbers
• How could an ISP use this info for network management?
• Two example uses:
  – Network planning: detecting growth of new applications
    • E.g. various P2P applications, but some ports dynamic or unofficial
  – Attack detection: e.g. signature of exploit of application vulnerability
    • E.g. Slammer Worm: UDP port 1434 (MS SQL Server), 376 byte packets

Measuring Network Traffic
• ISPs: useful to record packet SrcIP, DstIP, SrcPrt, DstPrt
• How can routers do this?
• Finest conceivable granularity?
  – Routers record (SrcIP,…) for each packet, export result to a collector
  – Constraints: router cycles, network bandwidth for collection
    • Possible with special purpose measurement devices for limited time
• Coarse time granularity?
  – Maintain counters of packets/bytes for each (SrcIP,…) seen
  – Report at fixed time interval (e.g. every hour), then reset counters to 0.
  – Constraints:
    • storage: how many distinct combinations (SrcIP,…) seen in each interval?
    • staleness: information may lose usefulness with reporting delay

Network Traffic Flows
• Better:
  – Exploit inherent timescale of packets generated by a user application
    • Intuition: packets group into “sessions”, e.g. web download, VoIP call, …
• Abstractly, define an IP Flow:
  – Set of packets with a shared property observed over some time period
• Shared property is called the key
  – Typically a tuple of fields from the IP and transport headers
  – No unique definition of key; depends on purpose
• 5-tuple key: (SrcIP, DstIP, SrcPrt, DstPrt, Proto)
  – Application-to-application flow
• 2-tuple key: (SrcIP, DstIP)
  – Host-to-host flow
• 7-tuple key: (SrcIP, DstIP, SrcPrt, DstPrt, Proto, ToS, Ingress Intf)
  – Used for flow measurement in some routers

Flow measurement in routers
[Figure: packets with key 1, key 2, key 3, key 4 arriving over time; each packet (key, bytes, timestamp, …) is mapped by hash(key) to a flow table entry (key1, stats1), (key2, stats2), (key3, stats3), (key4, stats4)]
• Routers maintain statistics on flows in a flow table
  – Each flow: key, #packets, #bytes, first & last packet times, …
• Each packet
  – If no entry for packet key in flow table, instantiate new entry
    #packets(key) = #bytes(key) = 0; first_packet_time(key) = timestamp, …
  – Update flow entry
    #packets(key)++ ; #bytes(key) += bytes; last_packet_time(key) = timestamp, …

Flow termination
• No precise definition behind intuition of flow as a “session”
• Routers use several criteria to terminate flows (configurable)
  – Protocol signals: packet’s TCP FIN flag is set, ending TCP connection
  – Inactive timeout: time since last observed flow packet > T_inactive
  – Active timeout: time since first observed flow packet > T_active
  – Flow table occupancy: terminate some flows if table occupancy > p%

Flow records: realization & collection
• Statistics of terminated flow exported in flow record
  – release flow table memory for new flow statistics
• Realization
  – Cisco NetFlow dominates
    • Current version 9; flow definition, export format highly configurable
    • Most other router vendors offer (some version of) NetFlow
  – Embodied in Internet Engineering Task Force Standards
    • IP Flow Information eXport (IPFIX) Working Group
• Flow record collectors
  – Network management software vendors offer collector/analysers
  – Some public domain tools, e.g. cflowd
• Related approaches
  – e.g. sFlow
• The future
  – Dynamically configurable measurement in software defined networking

Background Reading: NetFlow and IETF Standards
• Cisco NetFlow White Paper:
  – http://tinyurl.com/cisco-netflow-whitepaper
• IETF IPFIX Working Group
  – WG Charter: http://datatracker.ietf.org/wg/ipfix/charter/
  – Applying IPFIX Tutorial: http://www.ietf.org/edu/tutorials/ipfix-tutorial.pdf
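
Worked sketch: IP prefixes and longest-prefix match
The CIDR arithmetic and the forwarding rule from the IP Prefixes and IP Routing slides can be checked with Python's standard ipaddress module. This is a minimal sketch, not how a router implements lookup: the three-entry ROUTING_TABLE, the interface names if0/if1/if2, and the function name lookup_egress are all made up for illustration, and the linear scan stands in for the specialised lookup structures real routers use.

```python
import ipaddress

# The /22 block used as the example on the IP Prefixes slide.
block = ipaddress.ip_network("192.168.100.0/22")
num_addresses = block.num_addresses      # 2 ** (32 - 22)
first, last = block[0], block[-1]        # lowest and highest address in the block

# Hypothetical routing table: (DstIP prefix, egress interface) pairs.
ROUTING_TABLE = [
    (ipaddress.ip_network("0.0.0.0/0"), "if0"),         # default route
    (ipaddress.ip_network("192.168.0.0/16"), "if1"),
    (ipaddress.ip_network("192.168.100.0/22"), "if2"),
]


def lookup_egress(dst_ip, table=ROUTING_TABLE):
    """Return the egress of the longest prefix in `table` that contains dst_ip."""
    addr = ipaddress.ip_address(dst_ip)
    matches = [(net.prefixlen, egress) for net, egress in table if addr in net]
    if not matches:
        return None
    # Longest prefix = most specific matching entry.
    return max(matches, key=lambda m: m[0])[1]
```

An address such as 192.168.101.7 matches all three entries; the /22 entry wins because it is the most specific.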
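
Worked sketch: port ranges
The three port ranges on the Transport Layer Header slide translate directly into a small classifier. The function name port_class and the returned labels are illustrative choices; the range boundaries and protocol numbers are the ones given on the slides.

```python
# IP protocol numbers quoted on the Transport and Other Protocols slide.
PROTO_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}


def port_class(port):
    """Classify a 16-bit TCP/UDP port number into the ranges from the slide."""
    if not 0 <= port <= 65535:
        raise ValueError("port must fit in 16 bits")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic"
```

As the slides note, a port number only suggests the application to some degree, so a classifier like this is a first-pass heuristic rather than an identification.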
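
Worked sketch: flow table update
The per-packet update rule on the Flow measurement in routers slide can be sketched with a Python dict playing the role of the hash-indexed flow table. The packet representation (a dict with src_ip, dst_ip, src_port, dst_port, proto, bytes, ts fields) and the names FlowStats, five_tuple, host_pair and update_flow_table are assumptions made for this example, not part of NetFlow or any router API.

```python
from dataclasses import dataclass


@dataclass
class FlowStats:
    packets: int
    byte_count: int
    first_time: float
    last_time: float


def five_tuple(pkt):
    """Application-to-application key: (SrcIP, DstIP, SrcPrt, DstPrt, Proto)."""
    return (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"], pkt["proto"])


def host_pair(pkt):
    """Host-to-host key: (SrcIP, DstIP)."""
    return (pkt["src_ip"], pkt["dst_ip"])


def update_flow_table(table, pkt, key_fn=five_tuple):
    """Apply one packet to the flow table, creating the entry if needed."""
    key = key_fn(pkt)
    stats = table.get(key)
    if stats is None:
        # No entry for this key: instantiate with zero counters.
        stats = table[key] = FlowStats(0, 0, pkt["ts"], pkt["ts"])
    stats.packets += 1
    stats.byte_count += pkt["bytes"]
    stats.last_time = pkt["ts"]
    return stats
```

Passing a different key_fn changes the flow definition without touching the update logic, which mirrors the point that there is no unique definition of the key.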
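
Worked sketch: flow termination and export
Three of the termination criteria on the Flow termination slide (TCP FIN, inactive timeout, active timeout) can be sketched as a periodic sweep over the flow table. The flow entry layout (a dict with first_time, last_time and fin_seen), the function names, and the default timeout values are illustrative assumptions, not settings taken from any particular router. The occupancy-based criterion is left out to keep the example short.

```python
def should_terminate(first_time, last_time, now, fin_seen=False,
                     t_inactive=15.0, t_active=1800.0):
    """Return the reason a flow should be terminated, or None to keep it."""
    if fin_seen:
        return "fin"                      # protocol signal: TCP FIN observed
    if now - last_time > t_inactive:
        return "inactive"                 # no packet seen for too long
    if now - first_time > t_active:
        return "active"                   # flow has been open for too long
    return None


def expire_flows(table, now, t_inactive=15.0, t_active=1800.0):
    """Remove terminated flows from `table` and return them as flow records."""
    exported = []
    for key in list(table):               # copy keys: table is mutated in the loop
        entry = table[key]
        reason = should_terminate(entry["first_time"], entry["last_time"], now,
                                  entry.get("fin_seen", False),
                                  t_inactive, t_active)
        if reason is not None:
            # Exporting the record frees the table slot for new flows.
            exported.append((key, table.pop(key), reason))
    return exported
```

The active timeout splits a long-lived flow into several records, so a collector sees it before the flow ends; this trades record volume against the staleness concern raised on the Measuring Network Traffic slide.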