* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Single External Disruption - EECS
Computer network wikipedia , lookup
Asynchronous Transfer Mode wikipedia , lookup
Recursive InterNetwork Architecture (RINA) wikipedia , lookup
Cracking of wireless networks wikipedia , lookup
Distributed firewall wikipedia , lookup
Airborne Networking wikipedia , lookup
Network tap wikipedia , lookup
Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University of Michigan) Jennifer Rexford (Princeton University) Jia Wang (AT&T Labs Research) 1 Motivation destination Failure AS4 Disruption AS2 AS3 Congestion BR C A BR C B AS1 BR C C Mitigation A backbone network is vulnerable to routing BR C D changes that occur in other domains. source 2 Goal Identify important routing anomalies Lost reachability Persistent flapping Large traffic shifts Contributions: •Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time. •Use the tool to characterize routing disruptions in an operational network 3 Interdomain Routing: Border Gateway Protocol “I can reach 12.34.158.0/24 via AS 1” “I can reach 12.34.158.0/24” AS 1 BR BR C 12.34.158.0/24 eBGP data traffic AS 2 BR BR C iBGP BR BR C eBGP BR BR C AS 3 data traffic 12.34.158.5 Prefix-based: one route per prefix Path-vector: list of ASes in the path Incremental: every update indicates a change Policy-based: local ranking of routes 4 Capturing Routing Changes A large operational network (8/16/2004 – 10/10-2004) BR BR C BR BR C BR BR C BGP CPE Monitor BR BR C BR BR C BR BR C 5 Challenges Large volume of BGP updates Millions daily, very bursty Too much for an operator to manage Different from root-cause analysis Identify changes and their effects Focus on actionable events rather than diagnosis Diagnose causes in/near the AS 6 System Architecture BGP (106) BR E Updates Events (105) BR E BR E BGP Update Grouping Persistent Flapping Prefixes (101) “Typed” Events Event Classification Clusters Event Correlation Frequent Flapping Prefixes (101) Large Disruptions (101) (103) Traffic Impact Prediction Netflow Data BR E BR E BR E From millions of updates to a few dozen reports 7 Grouping BGP Update into Events Challenge: A single routing change leads to multiple update messages affects routing decisions at multiple routers BR E BR E BR E Approach: BGP BGP Update Updates Grouping Persistent Flapping Prefixes Events •Group together all updates for a prefix with inter-arrival < 70 seconds •Flag prefixes with changes lasting > 10 minutes. 8 Grouping Thresholds Based on our understanding of BGP and data analysis Event timeout: 70 seconds 2 * MRAI timer + 10 seconds 98% inter-arrival time < 70 seconds Convergence timeout: 10 minutes BGP usually converges within a few minutes 99.9% events < 10 minutes 9 Persistent Flapping Prefixes A surprising finding: 15.2% of updates were caused by persistent-flapping prefixes even though flap damping is enabled. Types of persistent flapping Conservative damping parameters (78.6%) Protocol oscillations due to MED (18.3%) Unstable interfaces or BGP sessions (3.0%) 10 Example: Unstable eBGP Session AE ISP DE Peer BE CE p Customer Flap damping parameters is session-based Damping not implemented for iBGP sessions 11 Event Classification Challenge: Major concerns in network management Changes in reachability Heavy load of routing messages on the routers Change of flow of the traffic through the network Events Event Classification “Typed” Events, e.g., Loss/Gain of Reachability Solution: classify events by severity of their impact 12 Event Category – “No Disruption” p AS2 AS1 DE No Traffic Shift “No Disruption”: EE AE BE ISP no border routers have any traffic shift. (50.3%) CE 13 Event Category – “Internal Disruption” p AS2 AS1 DE EE AE BE “Internal Disruption”: ISP all traffic shifts are internal. (15.6%) CE Internal Traffic Shift 14 Event Category – “Single External Disruption” p AS2 AS1 DE external Traffic Shift EE AE BE “Single External Disruption”: ISP only one of the traffic shifts is external (20.7%) CE 15 Statistics on Event Classification Events Updates No Disruption 50.3% 48.6% Internal Disruption 15.6% 3.4% Single External Disruption 20.7% 7.9% Multiple External Disruption 7.4% 18.2% Loss/Gain of Reachability 6.0% 21.9% First 3 categories have significant day-to-day variations Updates per event depends on the type of events and the number of affected routers 16 Event Correlation Challenge: A single routing change affects multiple destination prefixes “Typed” Events Event Correlation Clusters Solution: group the same-type, close-occurring events 17 EBGP Session Reset Caused most of “single external disruption” events Check if the number of prefixes using that session as the best route changes dramatically Number of prefixes session recovery session failure time Validation with Syslog router report (95%) 18 Hot-Potato Changes Hot-Potato Changes P AE 11 9 BE ISP 10 “Hot-potato routing” = route to closest egress point CE Caused “internal disruption” events Validation with OSPF measurement (95%) [Teixeira et al – SIGMETRICS’ 04] 19 Traffic Impact Prediction Challenge: Routing changes have different impacts on the network which depends on the popularity of the destinations Traffic Impact Prediction Clusters Large Disruptions Netflow Data E BR E BR E BR Solution: weigh each cluster by traffic volume 20 Traffic Impact Prediction Traffic weight Per-prefix measurement from netflow 10% prefixes accounts for 90% of traffic Traffic weight of a cluster the sum of “traffic weight” of the prefixes A small number of large clusters have large traffic weight Mostly session resets and hot-potato changes 21 Performance Evaluation Memory Static memory: “current routes”, 600 MB Dynamic memory: “clusters”, 300 MB Speed 99% of intervals of 1 second of updates can be process within 1 second Occasional execution lag Every interval of 70 seconds of updates can be processed within 70 seconds Measurements were based on 900MHz CPU 22 Conclusion BGP troubleshooting system Fast, online fashion Operators’ concerns (reachability, flapping, traffic) Significant information reduction millions of update a few dozens of large disruptions Uncovered important network behavior Hot-Potato changes Session resets Persistent-flapping prefixes 23