ECEN 689
Special Topics in Data Science for
Communications Networks
Nick Duffield
Department of Electrical & Computer Engineering
Texas A&M University
Organization
•  Instructor: Nick Duffield
•  Contact: duffieldng AT tamu DOT edu ; (979) 845-7328
•  Class notes: http://cesg.tamu.edu/?p=2667
•  Class times: Mon/Wed 03:00-04:15pm, CHEN 108
•  Office hours: WEB 332D, Mon/Wed 11:00am-12:00pm
•  Prerequisites: graduate standing; instructor approval; working
background in networking, probability, statistics
•  Grading:
–  Homework 50%; Project 15%; Presentation 15%; Final Exam 20%;
–  Discussion of homework assignments is encouraged, but copying is not
allowed. Assignments must be handed in on time to receive full credit
Course Materials: All available online
•  Background references
–  Baron: Probability and Statistics for Computer Scientists (2nd Edition)
http://proquest.safaribooksonline.com.lib-ezproxy.tamu.edu:2048/9781439875919
–  Peterson & Davie: Computer Networks (5th Edition)
http://proquest.safaribooksonline.com.lib-ezproxy.tamu.edu:2048//9780123850591
•  Detailed references: selections from
–  Leskovec, Rajaraman & Ullman: Mining of Massive Data Sets
http://www.mmds.org
–  Kolaczyk: Statistical Analysis of Network Data: Methods and Models.
http://link.springer.com.lib-ezproxy.tamu.edu:2048/book/10.1007%2F978-0-387-88146-1
•  Review articles and tutorials
–  Duffield: Sampling for Passive Internet Measurement: A Review
•  http://projecteuclid.org/euclid.ss/1110999311
–  Cormode, Duffield: Sampling for Big Data
•  http://nickduffield.net/download/papers/Tutorial_KDD_2014.ppsx
•  Research literature references:
–  Will be communicated in class notes
Objectives of the course
•  Broad description:
–  Statistical and algorithmic methods for acquiring and analysing
massive, complex, and incomplete datasets.
–  Applications to measurement and analysis of operational data in ISP
communication networks, routers and protocols.
–  Understanding of design decisions and trade-offs between statistical
and computational goals
•  Topics on the course
–  Sampling, sketching, frequent itemset mining, network probing,
network tomography, graph sampling.
–  Relevant background in probability, statistics, and networking
recapped as needed, with references for further reading
•  Topics NOT on this course
–  Machine learning
–  Hadoop, MapReduce
About me
•  Joined TAMU August 2014 from Rutgers University
•  Worked for 18 years in AT&T Labs Research in New Jersey
•  Previously Asst. Professor in Europe
•  Undergrad/PhD in Physics and Mathematical Physics
Data Science for Communications Networks
Data Science and Big Data
•  Big Data arises in many forms:
–  Physical Measurements: from science (physics, astronomy)
–  Medical data: genetic sequences, detailed time series
–  Activity data: GPS location, social network activity
–  Business data: customer behavior tracking at fine detail
•  Why is “Big Data” trending up?
–  Availability of data in new fields
–  Technological advances
•  Hardware
•  Computation
•  Algorithms
–  Anticipated value of analysis
Data Science in Communications Networks
•  Motivating application: Internet Service Providers (ISPs)
•  Many reasons to study data science from ISP viewpoint
–  Expertise: instructor’s experience from ISP world
–  Demand: data science methods developed in response to ISP needs
–  Practice: methods widely used in ISP monitoring, built into routers
–  Prescience: ISPs were first to hit many “big data” problems
–  Variety: many different places where data science is needed
Data Science Disciplines
•  Transferable Methods
–  Algorithms and Data Structures
–  Probability and Statistics
–  Inference and Machine Learning
•  Applications domain
–  This course: communications networking
Data Science in Communications Networks
•  Internet Service Providers had big data before “Big Data”
–  Operational metadata concerning network usage and state
1.  Telephony call detail records
–  Originating and receiving telephone number, duration, …
2.  IP traffic flow records generated by routers
–  Source and destination IP address of packet flows, #packets, #bytes, …
3.  Protocol transitions
–  Handovers of mobile device between wireless basestations
•  Generated continuously, 100s of Terabytes per day
•  Many other operational datasets
•  Used in network management over a range of timescales
–  From months (network planning) to seconds (network attack detection)
Structure of Large ISP Networks
[Figure: ISP network structure: peering with other ISPs; city-level router centers; backbone links; access networks (wireless, DSL, IPTV); downstream ISP and business customers; network management & administration; service and datacenters]
Measuring the ISP Network: Data Sources
•  One-way packet loss & latency
–  Active probing between measurement devices
•  Roundtrip packet loss & latency
–  Monitoring both directions of traffic between two hosts
•  Status reports
–  Device failures and transitions
•  Customer care logs
•  Protocol monitoring
–  E.g. wireless handovers (figure labels: basestations A, B, C, D; active sets (A,B) and (C,D))
•  Link traffic rates
–  Timeseries of traffic per router interface, 5 minute granularity
•  IP traffic flow records
–  Generated by routers
Three challenges for ISP data analysis
•  Scale: some datasets are enormous
–  IP Traffic Flow Records, Mobile device handovers,…
•  Incompleteness:
–  Not all quantities can be directly measured
•  Would like to know packet loss and latency per link
•  Typically only measure these on a path comprising multiple links
•  Complexity
–  Complex statistical properties difficult to model
•  Noisy data, skewed distributions, 80-20 laws, correlations
•  The methods in this course tackle these challenges
1. Traffic Flow Measurement
•  IP Protocol layers & packet headers
•  Router based traffic measurement
•  Measurement design decisions
•  Traffic flows, NetFlow
Protocol layers in the Internet
[Figure: a network packet across the application, transport, IP, and link layers; the transport, IP, and link layers each place a header in front of a payload]
IP packet header
[Figure: IP version 4 (IPv4) header layout, bits 0-31 per row]
•  Main focus here:
–  32 bit Source IP address (SrcIP)
–  32 bit Destination IP address (DstIP)
–  Usually written in dot decimal notation, e.g., 128.194.121.31
•  Also: IP Protocol (Proto) signifies which IP protocol is used in the
remainder of the packet
•  Routers: use DstIP for packet forwarding
–  Determine router egress interface for the packet
•  How?
–  Routers can’t store (DstIP, egress) for each possible DstIP (2^32 ≈ 4G)
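To make the address arithmetic concrete, here is a minimal sketch using Python's standard `ipaddress` module. It shows that a dot decimal address is a rendering of a single 32 bit integer, and how large the space of possible DstIP values is. The variable names are illustrative only.

```python
import ipaddress

# Dot decimal notation is a human-readable rendering of one 32 bit integer.
addr = ipaddress.IPv4Address("128.194.121.31")
as_int = int(addr)

# The same value computed by hand: each of the four decimal fields is one byte.
fields = [128, 194, 121, 31]
by_hand = (fields[0] << 24) | (fields[1] << 16) | (fields[2] << 8) | fields[3]

print(as_int == by_hand)   # the two computations agree
print(2 ** 32)             # number of possible IPv4 addresses
```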
IP Prefixes
•  Prefix = first m bits of IP address for some m ≤ 32
•  Represents a block of addresses
–  First m bits in common; remaining 32 – m bits take any value
•  CIDR notation for address block
–  dot_decimal_address / prefix length e.g. 192.168.100.0 / 22
–  Comprises 2^(32-22) = 2^10 addresses from 192.168.100.0 to 192.168.103.255
•  In binary notation
–  192.168.100.0   = 11000000.10101000.01100100.00000000
–  192.168.103.255 = 11000000.10101000.01100111.11111111
–  First 22 bits in common
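The block arithmetic above can be checked with Python's standard `ipaddress` module. This is only a sketch to confirm the numbers on the slide; the two test addresses are made up for illustration.

```python
import ipaddress

block = ipaddress.ip_network("192.168.100.0/22")

print(block.num_addresses)       # 1024, i.e. 2**(32-22)
print(block.network_address)     # 192.168.100.0, first address in the block
print(block.broadcast_address)   # 192.168.103.255, last address in the block

# An address belongs to the block if and only if its first 22 bits match.
inside = ipaddress.ip_address("192.168.102.7") in block    # True
outside = ipaddress.ip_address("192.168.104.1") in block   # False
print(inside, outside)
```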
IP Routing and Prefixes
•  Routers maintain a routing table
–  Routing table = lists of (DstIP_Prefix, egress) pairs; currently ~500k
–  How? Routers communicate by protocols to announce and update tables
•  Forwarding Packets
–  Find the entry (DstIP_Prefix, egress) in the table with the longest prefix that matches the packet's DstIP
–  Forward packet to egress interface
•  More detail: Peterson & Davie, Chap 3.2 & 3.3
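Longest prefix matching can be sketched in a few lines of Python. The routing table, the interface names, and the function `lookup_egress` are all hypothetical, and the linear scan is for clarity only; routers use specialised data structures or hardware for this lookup.

```python
import ipaddress

# Hypothetical routing table of (DstIP_Prefix, egress) pairs; names are made up.
ROUTING_TABLE = [
    (ipaddress.ip_network("0.0.0.0/0"), "if-default"),
    (ipaddress.ip_network("192.168.0.0/16"), "if-a"),
    (ipaddress.ip_network("192.168.100.0/22"), "if-b"),
]

def lookup_egress(dst_ip, table=ROUTING_TABLE):
    """Return the egress of the longest prefix that contains dst_ip, or None."""
    addr = ipaddress.ip_address(dst_ip)
    matches = [(prefix, egress) for prefix, egress in table if addr in prefix]
    if not matches:
        return None
    # Among all matching prefixes, the one with the largest prefix length wins.
    best_prefix, best_egress = max(matches, key=lambda m: m[0].prefixlen)
    return best_egress

print(lookup_egress("192.168.101.9"))   # matches /0, /16 and /22; /22 is longest
print(lookup_egress("192.168.1.1"))     # matches /0 and /16
print(lookup_egress("8.8.8.8"))         # matches only the default route
```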
IP Header and Information for ISPs
•  Have seen that IP header information is used to forward
packets in routers in the ISP infrastructure
•  How could an ISP use this information for network
management if it could be monitored, recorded and analysed?
•  Two example uses:
–  Network planning: identify potential new customers based on volumes
of traffic to or from their IP addresses
–  Attack detection: detect an anomalous burst of traffic destined to a
customer
•  Many other ISP network management tasks use IP header
information over a range of timescales: from months to seconds
Transport and Other Protocols
•  Most data transmission accomplished by one of two IP
transport protocols that provide the appearance of a
communications channel between hosts
•  TCP: Transmission Control Protocol (Proto = 6)
–  connection oriented protocol providing
•  three-way handshake to setup connection
•  reliable ordered transmission
•  congestion-avoidance
•  UDP: User Datagram Protocol (Proto = 17)
–  connectionless, no reliability, no congestion avoidance
•  Other (non-transmission) IP protocols in common use
–  ICMP: Internet Control Message Protocol (Proto = 1)
•  used to communicate error conditions; leveraged for probing & debugging
Transport Layer Header
UDP Header
   bits 0-15        bits 16-31
   Source port      Destination port
   UDP Length       UDP checksum
•  16 bit source port (SrcPrt) and destination port (DstPrt)
•  Used in both TCP and UDP
•  Associate packets with applications at hosts (see: binding)
–  Ports 0-1023: well known, assigned by IANA (mostly)
•  E.g. HTTP server listens on port 80 ; DNS uses port 53
–  Ports 1024-49151: registered ports
•  E.g. Minecraft 19132
–  Ports 49152-65535: dynamic ports
•  More detail: Peterson & Davie, Chap. 5.1, 5.2
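The UDP header layout above (four 16 bit fields) can be unpacked with Python's standard `struct` module. This is a sketch: the function name `parse_udp_header`, the dictionary keys, and the example datagram are made up, and the checksum is left at 0 rather than computed.

```python
import struct

def parse_udp_header(segment: bytes) -> dict:
    """Unpack the 8 byte UDP header: four 16 bit fields in network byte order."""
    src_port, dst_port, length, checksum = struct.unpack("!HHHH", segment[:8])
    return {"SrcPrt": src_port, "DstPrt": dst_port,
            "length": length, "checksum": checksum}

# Made-up datagram: dynamic source port 50000 to DNS port 53, 4 byte payload.
payload = b"abcd"
# UDP Length counts the 8 header bytes plus the payload bytes.
header = struct.pack("!HHHH", 50000, 53, 8 + len(payload), 0)
fields = parse_udp_header(header + payload)
print(fields)
```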
TCP/UDP Header and Information for ISPs
•  Have seen that UDP/TCP header information (port numbers)
is used at hosts to associate packets to applications
•  Many of these associations are registered by IANA
•  The identity of the application that generated a packet can be
inferred (to some degree) from transport header port numbers
•  How could an ISP use this info for network management?
•  Two example uses:
–  Network planning: detecting growth of new applications
•  E.g. various P2P applications, but some ports dynamic or unofficial
–  Attack detection: e.g. signature of exploit of application vulnerability
•  E.g. Slammer Worm: UDP port 1434 (MS SQL Server), 376 byte packets
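As a rough illustration of both uses, the sketch below classifies ports into the three ranges, guesses an application from a small port table, and tests a packet against the Slammer values quoted above. All function names and the `PORT_HINTS` table are hypothetical; the table holds only the examples from these slides, whereas a real system would consult the full IANA registry and would still be unable to identify applications on dynamic or unofficial ports.

```python
def port_range(port: int) -> str:
    """Classify a port number into the ranges listed on the slide."""
    if not 0 <= port <= 65535:
        raise ValueError("port must fit in 16 bits")
    if port <= 1023:
        return "well known"
    if port <= 49151:
        return "registered"
    return "dynamic"

# Port-to-application hints taken from the slides only (not a full registry).
PORT_HINTS = {80: "HTTP", 53: "DNS", 1434: "MS SQL Server"}

def guess_application(src_port: int, dst_port: int) -> str:
    """Best-effort guess: try the destination port first, then the source port."""
    for port in (dst_port, src_port):
        if port in PORT_HINTS:
            return PORT_HINTS[port]
    return "unknown"

def matches_slammer_signature(proto: int, dst_port: int, packet_bytes: int) -> bool:
    """Signature as stated on the slide: UDP (Proto = 17), port 1434, 376 bytes."""
    return proto == 17 and dst_port == 1434 and packet_bytes == 376

print(port_range(80), port_range(19132), port_range(50000))
print(guess_application(50000, 53))
print(matches_slammer_signature(17, 1434, 376))
```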
Measuring Network Traffic
•  ISPs: useful to record packet SrcIP, DstIP, SrcPrt, DstPrt
•  How can routers do this?
•  Finest conceivable granularity?
–  Routers record (SrcIP,…) for each packet, export result to a collector
–  Constraints: router cycles, network bandwidth for collection
•  Possible with special purpose measurement devices for limited time
•  Coarse time granularity?
–  Maintain counters of packet/bytes for each (SrcIP,…) seen
–  Report at fixed time interval (e.g. every hour), then reset counters to 0.
–  Constraints:
•  storage: how many distinct combinations (SrcIP…) seen in each interval?
•  staleness: information may lose usefulness with reporting delay
Network Traffic Flows
•  Better:
–  Exploit inherent timescale of packets generated by a user application
•  Intuition: packets group into “sessions” e.g. web download, VOIP call, …
•  Abstractly, define an IP Flow:
–  Set of packets with a shared property observed over some time period
•  Shared property is called the key
–  Typically a tuple of fields from the IP and transport headers
–  No unique definition of key; depends on purpose
•  5-tuple key: (SrcIP, DstIP, SrcPrt, DstPrt, Proto)
–  Application-to-application flow
•  2-tuple key: (SrcIP, DstIP)
–  Host-to-host flow
•  7-tuple key: (SrcIP, DstIP, SrcPrt, DstPrt, Proto, ToS, Ingress Intf)
–  Used for flow measurement in some routers
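The effect of the key choice can be seen by aggregating the same packets under two different keys. This is a sketch: the `Packet` type, its field names, and the three example packets are made up for illustration.

```python
from collections import Counter, namedtuple

# Hypothetical per-packet record holding the header fields used in flow keys.
Packet = namedtuple("Packet", "src_ip dst_ip src_port dst_port proto nbytes")

def five_tuple(p):
    """Application-to-application key."""
    return (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.proto)

def two_tuple(p):
    """Host-to-host key."""
    return (p.src_ip, p.dst_ip)

def bytes_per_flow(packets, key):
    """Total bytes for each distinct value of the chosen key."""
    totals = Counter()
    for p in packets:
        totals[key(p)] += p.nbytes
    return totals

packets = [
    Packet("10.0.0.1", "10.0.0.2", 50000, 80, 6, 1500),
    Packet("10.0.0.1", "10.0.0.2", 50000, 80, 6, 400),
    Packet("10.0.0.1", "10.0.0.2", 50001, 53, 17, 80),
]

by_app = bytes_per_flow(packets, five_tuple)    # two flows
by_host = bytes_per_flow(packets, two_tuple)    # one flow
print(len(by_app), len(by_host))
```

The same packets form two flows under the 5-tuple key and a single flow under the 2-tuple key, which is why the key has to be chosen to suit the purpose.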
Flow measurement in routers
[Figure: packets with keys 1-4 arriving over time; each packet's (key, bytes, timestamp, …) is mapped by hash(key) to its (key, stats) entry in the flow table]
•  Routers maintain statistics on flows in a flow table
–  Each flow: key, #packets, #bytes, first & last packet times, …
•  Each packet
–  If no entry for packet key in flow table, instantiate new entry
#packets(key) = #bytes(key) = 0; first_packet_time(key) = timestamp, …
–  Update flow entry
#packets(key)++ ; #bytes(key) += bytes; last_packet_time(key) = timestamp, …
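The per-packet update can be written as a short Python sketch. The class `FlowTable` and its field names are hypothetical, and a Python dict stands in for the hash table; real routers implement this in dedicated hardware or optimized software.

```python
class FlowTable:
    """Minimal flow table: maps a flow key to its running statistics."""

    def __init__(self):
        self.entries = {}   # the dict plays the role of hash(key) -> entry

    def observe(self, key, nbytes, timestamp):
        """Process one packet: create the entry if needed, then update it."""
        entry = self.entries.get(key)
        if entry is None:
            # No entry for this key yet: instantiate one with zeroed counters.
            entry = {"packets": 0, "bytes": 0, "first": timestamp, "last": timestamp}
            self.entries[key] = entry
        # Update the flow entry with this packet.
        entry["packets"] += 1
        entry["bytes"] += nbytes
        entry["last"] = timestamp

table = FlowTable()
table.observe("key1", 1500, 0.0)
table.observe("key2", 80, 0.5)
table.observe("key1", 400, 2.0)
print(table.entries["key1"])
```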
Flow termination
•  No precise definition behind intuition of flow as a “session”
•  Routers use several criteria to terminate flows (configurable)
–  Protocol signals: packet's TCP FIN flag is set, ending the TCP connection
–  Inactive timeout: time since last observed flow packet > T_inactive
–  Active timeout: time since first observed flow packet > T_active
–  Flow table occupancy: terminate some flows if table occupancy > p%
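The first three criteria can be sketched as follows. The function names, the entry layout (a dict with "first" and "last" packet times), and the timeout values are assumptions chosen for illustration; on real routers the timeouts are configurable. The occupancy criterion is omitted because it needs a policy for choosing which flows to evict.

```python
T_INACTIVE = 15.0    # seconds; illustrative value, configurable in practice
T_ACTIVE = 1800.0    # seconds; illustrative value, configurable in practice

def termination_reason(entry, now, fin_seen=False,
                       t_inactive=T_INACTIVE, t_active=T_ACTIVE):
    """Return why a flow should be terminated, or None to keep it."""
    if fin_seen:
        return "fin"                       # protocol signal: TCP FIN observed
    if now - entry["last"] > t_inactive:
        return "inactive"                  # no packet seen for too long
    if now - entry["first"] > t_active:
        return "active"                    # flow has been open for too long
    return None

def expire(entries, now, **timeouts):
    """Remove terminated flows from the table and return them as records."""
    exported = []
    for key in list(entries):              # copy keys: we delete while looping
        reason = termination_reason(entries[key], now, **timeouts)
        if reason is not None:
            record = dict(entries.pop(key), key=key, reason=reason)
            exported.append(record)
    return exported

entries = {
    "idle": {"first": 0.0, "last": 10.0},
    "busy": {"first": 0.0, "last": 95.0},
}
records = expire(entries, now=100.0)
print(records)
print(list(entries))
```

Popping the entry when its record is created mirrors the release of flow table memory once a terminated flow has been exported.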
Flow records: realization & collection
•  Statistics of terminated flow exported in flow record
–  release flow table memory for new flow statistics
•  Realization
–  Cisco NetFlow dominates
•  Current version 9; flow definition, export format highly configurable
•  Most other router vendors offer (some version of) NetFlow
–  Embodied in Internet Engineering Task Force Standards
•  IP Flow Information eXport Working Group
•  Flow record collectors
–  Network management software vendors offer collector/analysers
–  Some public domain tools, e.g. cflowd
•  Related approaches
–  e.g. sFlow
•  The future
–  Dynamically configurable measurement in software defined networking
Background Reading
NetFlow and IETF Standards
•  Cisco NetFlow White Paper:
–  http://tinyurl.com/cisco-netflow-whitepaper
•  IETF IPFIX Working Group
–  WG Charter: http://datatracker.ietf.org/wg/ipfix/charter/
–  Applying IPFIX Tutorial: http://www.ietf.org/edu/tutorials/ipfix-tutorial.pdf