Network Processor Algorithms: Design and Analysis
Balaji Prabhakar
Stanford University
Stochastic Networks Conference, Montreal, July 22, 2004

Overview
• Network processors
  – What are they?
  – Why are they interesting to industry and to researchers?
• SIFT: a simple algorithm for identifying large flows
  – The algorithm and its uses
• Traffic statistics counters
  – The basic problem and algorithms
• Sharing processors and buffers
  – A cost/benefit analysis

IP Routers
• Capacity is the sum of the rates of the line cards
• Cisco GSR 12416: capacity 160 Gb/s, power 4.2 kW
• Juniper M160: capacity 80 Gb/s, power 2.6 kW

A Detailed Sketch
[figure: line cards, each with a lookup engine, a network processor, and packet buffers, connected through an interconnection fabric to output schedulers and output packet buffers]

Network Processors
• Network processors are an increasingly important component of IP routers
• They perform a number of tasks (essentially everything except switching and route lookup):
  – buffer management
  – congestion control
  – output scheduling
  – traffic statistics counters
  – security
  – …
• They are programmable, and hence add great flexibility to a router’s functionality

Network Processors
• But, because they operate under severe constraints
  – very high line rates
  – heat constraints
  the algorithms they support must be lightweight
• They have become very attractive to industry
• They give rise to interesting algorithmic and performance-analytic questions

Rest of the Talk
• SIFT: a simple algorithm for identifying large flows
  – the algorithm and its uses (with Arpita Ghosh and Costas Psounis)
• Traffic statistics counters
  – the basic problem and algorithms (with Sundar Iyer, Nick McKeown and Devavrat Shah)
• Sharing processors and buffers
  – a cost/benefit analysis (with Vivek Farias and Ciamac Moallemi)

SIFT: Motivation
• Current egress buffers on router line cards serve packets in FIFO order
• But giving the packets of short flows higher priority, e.g. using the SRPT (Shortest Remaining Processing Time) policy,
  – reduces average flow delay
  – given the heavy-tailed nature of the Internet flow-size distribution, the reduction in delay can be huge
[figure: an egress buffer served FIFO vs. served SRPT]

But …
• SRPT is unimplementable
  – the router would need to know the residual flow sizes of all enqueued flows: virtually impossible to implement
• Other pre-emptive schemes, such as SFF (shortest flow first) or LAS (least attained service), are likewise too complicated to implement
• This has led researchers to consider tagging flows at the edge, where the number of distinct flows is much smaller
  – but this requires a different design for edge and core routers
  – more importantly, it needs extra space in IP packet headers to signal flow size
• Is something simpler possible?

SIFT: A Randomized Algorithm
• Flip a coin with bias p (= 0.01, say) for heads on each arriving packet, independently from packet to packet
• A flow is “sampled” if one of its packets has a head on it

SIFT: A Randomized Algorithm
• A flow of size X has roughly a 0.01X chance of being sampled
  – flows with fewer than 15 packets are sampled with probability at most ≈ 0.15
  – flows with more than 100 packets are sampled with probability at least ≈ 0.63
  – the precise probability is 1 − (1 − 0.01)^X
• Most short flows will not be sampled; most long flows will be

The Accuracy of Classification
• Ideally, we would like to sample like the blue curve
• Sampling with probability p gives the red curve
  – there are false positives and false negatives
[figure: probability of being sampled as a function of flow size]
• Can we get the green curve?

SIFT+
• Sample with a coin of bias q = 0.1
  – say that a flow is “sampled” if it gets two heads!
  – this reduces the chance of making errors
  – but you now have to count the number of heads
• So, how can we use SIFT at a router?
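The sampling rule above is easy to state in code. The sketch below is my own illustration (not code from the talk): it computes the exact sampling probability 1 − (1 − p)^X and implements both the single-head SIFT test and the two-head SIFT+ variant.

```python
import random

def prob_sampled(flow_size, p=0.01):
    """Exact probability that SIFT samples a flow of `flow_size`
    packets: 1 - (1 - p)^X."""
    return 1.0 - (1.0 - p) ** flow_size

def sift_sampled(flow_size, p=0.01, rng=random):
    """SIFT: a flow is sampled if any of its packets draws a head."""
    return any(rng.random() < p for _ in range(flow_size))

def sift_plus_sampled(flow_size, q=0.1, rng=random):
    """SIFT+: a larger bias q, but two heads are required, which
    sharpens the separation between short and long flows."""
    heads = 0
    for _ in range(flow_size):
        if rng.random() < q:
            heads += 1
            if heads == 2:
                return True
    return False
```

With p = 0.01 the formula gives a sampling probability of about 0.14 for a 15-packet flow and about 0.63 for a 100-packet flow, matching the slide's point that most short flows escape sampling while most long flows are caught.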
SIFT at a Router
• Sample incoming packets
• Place any packet with a head (or the second such packet, for SIFT+) in the low-priority buffer
• Place all further packets from that flow in the low-priority buffer (to avoid mis-sequencing)
[figure: a buffer of size B for all flows, split into a buffer of size B/2 for short flows and a buffer of size B/2 for sampled long flows]

Simulation Results
[figure: simulation topology, traffic sources and traffic sinks]

Overall Average Delays
[figure]

Average Delay for Short Flows
[figure]

Average Delay for Long Flows
[figure]

Implementation Requirements
• SIFT needs
  – two logical queues in one physical buffer
  – to sample arriving packets
  – a table maintaining the ids of sampled flows
  – to check whether an incoming packet belongs to a sampled flow or not
  – all quite simple to implement

A Big Bonus
• The buffer of the short flows has very low occupancy
  – so, can we simply reduce it drastically without sacrificing performance?
• More precisely, suppose we
  – reduce the buffer size for the small flows,
  – increase it for the large flows,
  – keep the total the same as FIFO

SIFT Incurs Fewer Drops
• Buffer_Size(Short flows) = 10; Buffer_Size(Long flows) = 290; Buffer_Size(Single FIFO Queue) = 300
[figure: packet drops, SIFT vs. FIFO]

Reducing Total Buffer Size
• Suppose we reduce the buffer size of the long flows as well
• Questions:
  – will packet drops still be fewer?
  – will the delays still be as good?
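The per-packet logic of the router-side scheme above fits in a few lines. This is an illustrative sketch with names and data structures of my own choosing: a set holds the ids of sampled flows, and a packet goes to the low-priority buffer once its flow has drawn a head.

```python
import random

def classify(flow_id, sampled_flows, p=0.01, rng=random):
    """Return the buffer ('high' or 'low') for one arriving packet.
    Once any packet of a flow draws a head, the flow's id is recorded
    and all of its later packets also go to the low-priority buffer,
    which avoids mis-sequencing within the flow."""
    if flow_id in sampled_flows:
        return 'low'
    if rng.random() < p:          # coin flip on this packet
        sampled_flows.add(flow_id)
        return 'low'
    return 'high'
```

A SIFT+ version would additionally count heads per flow and demote a flow only on its second head.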
Drops with Less Total Buffer
• Buffer_Size(PRQ0) = 10; Buffer_Size(PRQ1) = 190; Buffer_Size(One FIFO Queue) = 300
[figure: packet drops, SIFT vs. FIFO]

Delay Histogram for Short Flows
[figure: histogram of transfer times, 0–0.91 sec, SIFT vs. FIFO]

Delay Histogram for Long Flows
[figure: histogram of transfer times, 0–0.91 sec, SIFT vs. FIFO]

Why SIFT Reduces Buffers
• The amount of buffering needed to keep links fully utilized
  – old formula: B = RTT × C = 0.25 s × 10 Gb/s = 2.5 Gb
  – corrected to: B = RTT × C / √N ≈ 250 Mb
• But this formula is for large (elephant) flows, not for short (mice) flows
  – the elephant arrival rate is 0.65 or 0.7 of C; hence smaller buffers suffice for them
  – mice buffers are almost empty due to their high priority, and mice don’t cause elephant packet drops
  – elephants use TCP to regulate their sending rate around the mice

Conclusions for SIFT
• A randomized scheme; preliminary results show that
  – it has low implementation complexity
  – it reduces delays drastically (users are happy)
  – with 30–35% smaller buffers at egress line cards (router manufacturers are happy)
• Leads to a “15 packets or fewer” express lane on the Internet; could be useful
• Further work needed
  – at the moment we have a good understanding of how to sample, and extensive (and encouraging) simulation tests
  – we need to understand the effect of reduced buffers on end-to-end congestion control algorithms

Traffic Statistics Counters: Motivation
• Switches maintain statistics, typically using counters that are incremented when packets arrive
• At high line rates, memory technology is a limiting factor in the implementation of counters; for example, in a 40 Gb/s switch, each packet must be processed in 8 ns
• To maintain a counter per flow at these line rates, we would like an architecture with the speed of SRAM and the density (size) of DRAM

Hybrid Architecture
• Shah, Iyer, Prabhakar, and McKeown (2001) proposed a hybrid
SRAM/DRAM architecture
[figure: arrivals (at most one per time slot) increment N counters held in SRAM; a counter management algorithm updates the corresponding counter in DRAM and empties the SRAM counter, once every b time slots]

Counter Management Algorithm
• Shah et al. require that the counter management algorithm (CMA) maintain all counter values accurately
• That is, given N and b, how large must each SRAM counter be so that no counts are missed?

Some CMAs
• Round robin
  – the maximum counter value is bN
• Largest Counter First (LCF)
  – optimal in terms of SRAM memory usage
  – no counter can exceed the bound given by the analysis below [formula shown on slide]

Analysis of LCF
• The upper bound is proved by bounding a potential (Lyapunov) function of the counter sizes Qi(t) [formulas shown on slide]
• From this bound, the size of the largest counter follows; the case b = 2 is worked out on the slide

An Implementable Algorithm
• LCF is difficult to implement
  – with one counter per flow, we would like to support at least 1 million counters
  – maintaining a sorted list of counters to find the largest takes too much SRAM memory
• Ramabhadran and Varghese (2003) proposed a simpler algorithm with the same memory usage as LCF

LCF with Threshold
• The algorithm keeps track of the counters whose value is at least b
• At any service time, let j be the counter with the largest value among those incremented since the previous service, and let c be its value
  – if c ≥ b, serve counter j
  – if c < b, serve any counter with value at least b; if no such counter exists, serve counter j
• Maintaining the set of counters with values at least b is a non-trivial problem; it is solved using a bitmap and an additional data structure
• Is something even simpler possible?
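As a sanity check on the architecture above, the hybrid SRAM/DRAM scheme with the LCF policy can be simulated in a few lines. This is a toy model written for illustration, not the authors' code: one arrival per time slot increments an SRAM counter, and every b slots the largest SRAM counter is flushed into DRAM.

```python
def lcf_simulate(arrivals, n_counters, b):
    """Simulate the hybrid counter architecture under LCF.
    `arrivals` is a sequence of counter indices, one per time slot.
    Returns (peak SRAM counter value seen, final DRAM contents)."""
    sram = [0] * n_counters
    dram = [0] * n_counters
    peak = 0
    for t, i in enumerate(arrivals, start=1):
        sram[i] += 1
        peak = max(peak, sram[i])
        if t % b == 0:                                    # one DRAM write every b slots
            j = max(range(n_counters), key=sram.__getitem__)  # LCF choice
            dram[j] += sram[j]                            # flush the largest counter
            sram[j] = 0
    return peak, dram
```

Replacing the LCF choice of j with a round-robin pointer reproduces the bN worst case mentioned on the slide, since an adversary can keep incrementing one counter while the pointer visits all the others.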
Some Simpler Algorithms …
• Possible approaches for a CMA that is simpler to implement:
  – arrival information (serve the largest counter among those just incremented)
  – random sampling
  – a round-robin pointer
• Trade-off between simplicity and performance: more SRAM is needed in the worst case for these schemes

An Alternative Architecture
[figure: N counters in SRAM drained through a FIFO buffer into DRAM, managed by a counter management algorithm]
• Decision problem: given a counter with a particular value and the occupancy of the buffer, when should the counter value be moved to the FIFO buffer? What size counters does this lead to?
  – an interesting question; tractable with Poisson arrivals and exponential services

The Cost of Sharing
• We have seen that there is a very limited amount of buffering and processing capability in each line card
• To fully utilize these resources, it becomes necessary to share them amongst the packets arriving at each line card
• But sharing imposes a cost
  – we may need to traverse the switch fabric more often than needed
  – each of the two processors involved in a migration needs to do some processing; e.g. ε local plus 1 remote, instead of just 1
  – or the host processor may simply be worse at the processing; e.g. 1 local versus K (> 1) remote
• We need to understand the trade-off between costs and benefits
  – we will focus on a specific queueing model
  – we are interested in simple rules
  – benefit is measured as a reduction in backlogs

The Setup
[figure: two queues with exp(1) servers, each fed by an independent Poisson(λ) stream; a job may be served locally at cost 1 or migrated to the other queue at cost K]
• Does sharing reduce backlogs?
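To get a feel for this question, here is a small simulation of the two-queue setup. The threshold rule and the way the migration cost K is charged are my assumptions for illustration; the talk's precise policies are analyzed next. Each queue sees its own Poisson(λ) arrivals with exp(1) job sizes; an arriving job migrates when its own queue holds more than T extra units of work, and a migrated job's service requirement is scaled by K.

```python
import random

def avg_backlog(lam, T=None, K=1.0, n_arrivals=100_000, seed=7):
    """Time-averaged total work in the two-queue system.
    T=None means no sharing; otherwise an arriving job moves to the
    other queue when its own queue exceeds the other's by more than T
    units of work, and its service requirement is scaled by K."""

    def drained_area(w, dt):
        # integral of one queue's residual work over dt,
        # draining at unit rate until it hits zero
        return w * dt - dt * dt / 2 if w >= dt else w * w / 2

    rng = random.Random(seed)
    q = [0.0, 0.0]
    area = clock = 0.0
    for _ in range(n_arrivals):
        dt = rng.expovariate(2 * lam)      # time to the next arrival (either queue)
        area += drained_area(q[0], dt) + drained_area(q[1], dt)
        clock += dt
        q = [max(0.0, w - dt) for w in q]
        i = rng.randrange(2)               # the queue the job arrives at
        size = rng.expovariate(1.0)
        if T is not None and q[i] > q[1 - i] + T:
            i, size = 1 - i, size * K      # migrate at cost factor K
        q[i] += size
    return area / clock
```

At λ = 0.9 and K = 1, routing to the less-loaded queue (T = 0) cuts the average backlog well below the no-sharing value, while K > 1 erodes the benefit; this is exactly the cost/benefit trade-off the model is after.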
Additive Threshold Policy
• A job arrives at queue 1
• Send the job to queue 2 if queue 1’s backlog exceeds queue 2’s by more than an additive threshold [condition shown as a slide formula]
• Otherwise, keep the job in queue 1
• An analogous policy applies to jobs arriving at queue 2

Additive Thresholds - Queue Tails
[figure: log-scale queue-tail probabilities, additive sharing vs. no sharing]

Additive Thresholds - Stability
• Theorem: the additive policy is stable when the arrival rate is small enough and unstable otherwise [precise conditions shown as slide formulas]
• For example, for particular parameter values the policy is stable for λ below a critical value and unstable above it [values shown on slide]

Inference
• The pros and cons of sharing:
  – reduction in backlogs
  – loss of throughput

Multiplicative Threshold Policy
• A job arrives at queue 1
• Send the job to queue 2 if queue 1’s backlog exceeds a constant multiple of queue 2’s [condition shown as a slide formula]
• Otherwise, keep the job in queue 1
• Theorem: the multiplicative policy is stable for all λ < 1
• Interestingly, this policy improves delays while preserving throughput!

Multiplicative Thresholds - Queue Tails
[figure: log-scale queue-tail probabilities, multiplicative sharing vs. no sharing]

Multiplicative Thresholds - Delay
[figure: average delay as a function of the threshold parameter]

Conclusions
• Network processors add useful features to a router’s functionality
• Many algorithmic questions come up
  – simple, high-performance algorithms are needed
• For the theorist, there are many new and interesting questions; we have briefly seen three examples
  – SIFT: a sampling algorithm
  – designing traffic statistics counters
  – sharing: a cost/benefit analysis