Traffic Engineering for ISP Networks
Jennifer Rexford
Computer Science Department, Princeton University
http://www.cs.princeton.edu/~jrex

A Challenge in ISP Backbone Networks
Finding a good way to route the data packets
– Given the current network topology and offered traffic
– For good performance and efficient use of resources

Why Is the Problem Hard?
IP traffic varies, and the service is best effort
– The offered traffic is not known in advance
– The resources in the network are not reserved
The routers do not adapt on their own
– Load-sensitive routing is not widely deployed
– Due to control overhead and stability challenges
Routing protocols were not designed to be managed
– At best, indirect control over the flow of traffic
Fine-grain traffic measurements are often unavailable
– E.g., only coarse-grain link load statistics

In This Talk…
TE with traditional IP routing protocols
– Shortest-path protocols with configurable link weights
Two main research challenges
– Optimization: tuning link weights to the offered traffic
– Tomography: inferring the offered traffic from link load
Deployed solutions in AT&T’s U.S. backbone
– Our experiences working with the network operators
– And how we improved the tools over time
Ongoing research on traffic management

Optimization: Tuning Link Weights

Routing Inside an Internet Service Provider
Routers flood information to learn the topology
– Routers determine the “next hop” to reach other routers…
– By computing shortest paths based on the link weights
Routers forward packets via the “next hop” link(s)
[figure: example topology with per-link weights]

Link Weights Control the Flow of Traffic
Routers compute paths
– Shortest paths as the sum of the link weights
Operators set the link weights
– To control where the traffic goes
[figure: the same topology with one link weight changed, shifting traffic]

Heuristics for Setting the Link Weights
Proportional to physical distance
– Cross-country links have higher weights than local ones
– Minimizes end-to-end propagation delay
Inversely proportional to link capacity
– Smaller weights for higher-bandwidth links
– Attracts more traffic to links with more capacity
Tuned based on the offered traffic
– Network-wide optimization of the weights based on traffic
– Directly minimizes key metrics like maximum link utilization

Why Are the Link Weights Static?
Strawman alternative: load-sensitive routing
– Link metrics based on traffic load
– Flood dynamic metrics as they change
– Adapt automatically to changes in offered load
Reasons why this is typically not done
– Delay-based routing was unsuccessful in the early days
– Oscillation as routers adapt to out-of-date information
– Most Internet transfers are very short-lived
Research and standards work continues…
– … but operators have to work with what they have

Big Picture: Measure, Model, and Control
[figure: control loop — measure the topology/configuration and offered traffic from the operational network, feed a network-wide “what if” model, and push changes back to the network]

Traffic Engineering in an ISP Backbone
Topology
– Connectivity and capacity of routers and links
Traffic matrix
– Offered load between points in the network
Link weights
– Configurable weights for shortest-path routing
Performance objective
– Balanced load, low latency, service level agreements, …
Question: given the topology and traffic matrix in an IP network, which link weights should be used?
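The “what-if” evaluation at the heart of this question — compute shortest paths for a given weight setting, split traffic evenly over ties, and read off the link utilizations — can be sketched in Python. This is a minimal illustration under my own graph representation and function names, not the operators’ actual tooling:

```python
import heapq
from collections import defaultdict

def dijkstra(graph, source):
    """Shortest-path distances from `source`; graph[u] = {v: weight}."""
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph.get(u, {}).items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return dist

def link_loads(graph, traffic):
    """Link loads u_l under shortest-path routing with even splitting
    over ties.  traffic[(i, j)] = offered load from router i to j."""
    # Reverse the graph once, so a single Dijkstra per destination j
    # yields dist(v, j) for every router v.
    rev = defaultdict(dict)
    for u, nbrs in graph.items():
        for v, w in nbrs.items():
            rev[v][u] = w
    load = defaultdict(float)
    for j in {j for (_, j) in traffic}:
        dist_to_j = dijkstra(rev, j)
        flow = defaultdict(float)  # traffic currently sitting at each router
        for (i, jj), m in traffic.items():
            if jj == j:
                flow[i] += m
        # Routers farther from j are processed first, so each router's
        # inflow is complete before it is split among its next hops.
        for u in sorted(dist_to_j, key=lambda n: -dist_to_j[n]):
            if u == j or flow[u] == 0:
                continue
            hops = [v for v, w in graph.get(u, {}).items()
                    if dist_to_j.get(v, float("inf")) + w == dist_to_j[u]]
            share = flow[u] / len(hops)  # even split over shortest next hops
            for v in hops:
                load[(u, v)] += share
                flow[v] += share
    return load

def max_utilization(load, capacity):
    """One objective from the talk: max over links of u_l / c_l."""
    return max(load[l] / capacity[l] for l in load)
```

A weight-tuning search would re-run `link_loads` for each candidate weight setting and keep the change whenever `max_utilization` improves.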
Key Ingredients of Our Approach
Measurement
– Topology: monitoring of the routing protocols
– Traffic matrix: widely deployed traffic measurement
Network-wide models
– Representations of topology and traffic
– “What-if” models of shortest-path routing
Network optimization
– Efficient algorithms to find good configurations
– Operational experience to identify key constraints

Formalizing the Optimization Problem
Input: graph G(R,L)
– R is the set of routers
– L is the set of unidirectional links
– c_l is the capacity of link l
Input: traffic matrix
– M_{i,j} is the traffic load from router i to router j
Output: setting of the link weights
– w_l is the weight on unidirectional link l
– P_{i,j,l} is the fraction of traffic from i to j traversing link l

Multiple Shortest Paths With Even Splitting
[figure: example values of P_{i,j,l} — traffic splits evenly at each tie, e.g., 1.0 at the source becomes 0.5 on each of two shortest next hops, then 0.25 further downstream]

Defining the Objective Function
Computing the link utilization
– Link load: u_l = Σ_{i,j} M_{i,j} · P_{i,j,l}
– Utilization: u_l / c_l
Objective functions
– min ( max_l (u_l / c_l) )
– min ( Σ_l f(u_l / c_l) )
[figure: a convex penalty f(x) that grows sharply as utilization x approaches 1]

Complexity of the Optimization Problem
NP-hard optimization problem
– No efficient algorithm to find the link weights
– Even for the simple convex objective functions
Why can’t we just do multi-commodity flow?
– E.g., solve the multi-commodity flow problem…
– … and the link weights pop out as the dual
– Because IP routers cannot split arbitrarily over ties
What are the implications?
– Have to resort to searching through weight settings

Optimization Based on Local Search
Start with an initial setting of the link weights
– E.g., the same integer weight on every link
– E.g., weights inversely proportional to link capacity
– E.g., the existing weights in the operational network
Compute the objective function
– Compute the all-pairs shortest paths to get P_{i,j,l}
– Apply the traffic matrix M_{i,j} to get the link loads u_l
– Evaluate the objective function from the u_l / c_l
Generate a new setting of the link weights, and repeat

Making the Search Efficient
Avoid repeating the same weight setting
– Keep track of past values of the weight setting
– … or keep a small signature (e.g., a hash) of past values
– Do not evaluate a weight setting if the signatures match
Avoid computing the shortest paths from scratch
– Explore weight settings that change just one weight
– Apply fast incremental shortest-path algorithms
Limit the number of unique values of link weights
– Do not explore all 2^16 possible values for each weight
Stop early, before exploring the whole search space

Incorporating Operational Realities
Minimize the number of changes to the network
– Changing just 1 or 2 link weights is often enough
Tolerate failure of network equipment
– Weight settings usually remain good after a failure
– … or can be fixed by changing one or two weights
Limit dependence on measurement accuracy
– Good weights remain good, despite random noise
Limit the frequency of changes to the weights
– Joint optimization for day and night traffic matrices

Application to AT&T’s Backbone Network
Performance of the optimized weights
– The search finds a good solution within a few minutes
– Much better than link capacity or physical distance
– Competitive with the multi-commodity flow solution
How AT&T changes the link weights
– Maintenance is done every night from midnight to 6am
– Predict the effects of removing link(s) from the network
– Reoptimize the link weights to avoid congestion
– Configure the new weights before disabling equipment

Example from My Visit to AT&T’s Operations Center
Amtrak repairing/moving part of the train track
– Need to move some of the fiber-optic cables
– Or, heightened risk of the cables being cut
– Amtrak notifies us of the time the work will be done
AT&T engineers model the effects
– Determine which IP links go over the affected fiber
– Pretend the network no longer has these links
– Evaluate the new shortest paths and traffic flow
– Identify whether link loads will be too high

Example, Continued
If the load will be too high
– Reoptimize the weights on the remaining links
– Schedule the time for the new weights to be configured
– Roll back to the old weight setting after Amtrak is done
Same process applied to other cases
– Assessing the network’s risk of possible failures
– Planning for maintenance of existing equipment
– Adapting the link weights to the installation of new links
– Adapting the link weights in response to traffic shifts

Conclusions on Traffic Engineering
IP networks do not adapt on their own
– Routers compute shortest paths based on static weights
Service providers need to adapt the weights
– Due to failures, congestion, or planned maintenance
Leads to an interesting optimization problem
– Optimize the link weights based on topology and traffic
The optimization problem is computationally difficult
– Forces the use of efficient local-search techniques
Results of the local search are pretty good
– Near-optimal solutions that minimize disruptions

Extensions
Robust link-weight assignments
– Link/node failures
– Range of traffic matrices
More complex routing models
– Destinations reachable via multiple “egress points”
– Interdomain routing policies
Interaction between ISPs
– Inter-ISP negotiation for joint optimization
– Grappling with scalability and trust issues

Tomography: Inferring the Traffic Matrix

Computing the Traffic Matrix M_{i,j}
Hard to measure the traffic matrix
– IP networks transmit data as individual packets
– Routers do not keep traffic statistics,
except link utilization on (say) a five-minute time scale
Need to infer the traffic matrix M_{i,j} from
– The current topology G(R,L)
– The current routing P_{i,j,l}
– The current link loads u_l
– The link capacities c_l

Inference: Network Tomography
From link counts to the traffic matrix
[figure: sources and destinations connected through the network, with observed per-link loads of 3–5 Mbps]

Tomography: Formalizing the Problem
Ingress-egress pairs
– p is an ingress-egress pair of nodes (i,j)
– x_p is the (unknown) traffic volume M_{i,j} for this pair
Routing
– P_{l,p} is the proportion of p’s traffic that traverses link l
Links in the network
– l is a unidirectional edge
– u_l is the observed traffic volume on this link
Relationship: u = Px (work backwards to get x)

Tomography: One Observation Is Not Enough
The linear system for n nodes is underdetermined
– The number of links e is around O(n)
– The number of ingress-egress pairs c is O(n^2)
– The dimension of the solution sub-space is at least c − e
Multiple observations are needed
– k independent observations (over time)
– Stochastic model with Poisson i.i.d. counts
– Maximum-likelihood estimation to infer the matrix
Doesn’t work all that well in practice…

Approach Used at AT&T: Tomo-gravity
Gravitational assumption
– Ingress point a has traffic v^i_a
– Egress point b has traffic v^e_b
– Pair (a,b) has traffic proportional to v^i_a · v^e_b

Approach Used at AT&T: Tomo-gravity
Problem with the gravity model
– The gravity model ignores the load on the inside links
– The gravity assumption isn’t always 100% correct
– The resulting traffic matrix might not satisfy the link loads
Combining the two techniques
– Gravity: find a traffic matrix using the gravity model
– Tomography: find the family of traffic matrices consistent with all link load statistics
– Tomo-gravity: find the tomography solution that is closest to the output of the gravity model
Works extremely well (and fast) in practice

Conclusions
Managing IP networks is challenging
– Routers don’t adapt on their own to congestion
– Routers don’t reveal much information about the traffic
Measurement provides a network-wide view
– Topology
– Traffic matrix
Optimization enables the network to adapt
– Inferring the traffic matrix from the link loads
– Optimizing the link weights based on the traffic matrix

New Research Direction: Design for Manage-ability
Two main parts of network management
– Control: optimization
– Measurement: tomography
Two research approaches
– Bottom up: do the best with what you have
– Top down: design systems that are easier to manage
Design for manage-ability
– “If you are both the professor and the student, you create exam questions that are easy to answer.”

Example: Changing the Path Computation
Routers split traffic over multiple paths
– More traffic on shorter paths, less on longer ones
– In proportion to the exponential of the path cost
Exciting result
– Can achieve the optimal distribution of the traffic
– With a polynomial-time algorithm for setting the weights

New Research Direction: Logically-Central Control
Traditional division of labor
– Routers: real-time, distributed protocols
– Management system: offline, centralized algorithms
Example: routing protocols and traffic engineering
– Routing: routers react automatically to link failures
– TE: the management system sets the link weights
The case for separating routing from routers
– Better decisions with network-wide visibility
– Routers only collect measurements and forward packets

Example: Routing Control Platform (RCP)
Logically-centralized server
– Collects measurement data from the network
– Pushes forwarding tables into the routers
Benefits
– Network-wide policies
– Flexible, easy to customize
– Fewer nodes to upgrade
Feasibility
– A high-end PC can compute routes for a large ISP
– Simple replication to survive failures

References
Traffic engineering using traditional protocols
– http://www.cs.princeton.edu/~jrex/papers/ieeecomm02.pdf
– http://www.cs.princeton.edu/~jrex/papers/opthand04.pdf
– http://www.cs.princeton.edu/~jrex/papers/ton-whatif.pdf
Tomo-gravity to infer the
traffic matrix
– http://www.cs.utexas.edu/~yzhang/papers/mmi-ton05.pdf
– http://www.cs.utexas.edu/~yzhang/papers/tomogravitysigm03.pdf
– http://www.cs.princeton.edu/~jrex/papers/sfi.pdf

References
Design for manage-ability
– http://www.cs.princeton.edu/~jrex/papers/pefti.pdf
– http://www.cs.princeton.edu/~jrex/papers/optimizability.pdf
– http://www.cs.princeton.edu/~jrex/papers/tie-long.pdf
Routing Control Platform
– http://www.cs.princeton.edu/~jrex/papers/rcp.pdf
– http://www.cs.princeton.edu/~jrex/papers/ccr05-4d.pdf
– http://www.cs.princeton.edu/~jrex/papers/rcp-nsdi.pdf
– http://www.research.att.com/~kobus/docs/irscp.inm.pdf