Cross-Domain and Cross-Layer Coarse Grained Quality of Service Support in IP-based Networks

Dissertation approved by the Faculty of Electrical Engineering and Information Technology of Chemnitz University of Technology for the award of the academic degree Doktoringenieur (Dr.-Ing.)

Submitted by Dipl.-Ing. Thomas Martin Knoll, born on 10 January 1973 in Reichenbach
Submitted on 27 July 2009
Reviewers: Univ.-Prof. Dr.-Ing. Thomas Bauschert, Univ.-Prof. Dr.-Ing. Jörg Eberspächer, Univ.-Prof. Dr.-Ing. habil. Klaus Franke
Date of defence: 11 November 2009
Available in MONARCH of TU Chemnitz: http://archiv.tu-chemnitz.de/pub/2009/0165
17.11.2009

Bibliographic description
Thomas Martin Knoll: Cross-Domain and Cross-Layer Coarse Grained Quality of Service Support in IP-based Networks. Dissertation (in English), 166 pages, 155 figures, 21 tables, 185 references.

Summary
With the growing popularity of the Internet, the number of users is rising, and above all the number of time- and loss-critical services such as voice over IP, video transmission and network-based games. The Internet is an interconnection of about 30,000 operator networks, which currently exchange data traffic by means of the Internet Protocol (IP) without any quality of service support. Massive over-provisioning of network capacity leads to a network utilization of only about 10% and correspondingly good transmission quality. This dissertation expects that, as traffic volume grows, cost pressure will prevent network operators from keeping pace with this excessive network expansion, so that quality degradation is to be expected. Operators already apply traffic separation inside their networks, but it is discarded at the handover point and, at best, re-established in the neighbouring network through costly analysis.
Within this work, a cross-domain and cross-layer concept for realizing coarse-grained quality of service in IP networks was therefore designed, proposed for standardization at the Internet Engineering Task Force (IETF), implemented, and partially simulated and tested. The traffic class information of several network layers is signalled in transitive message elements of the Border Gateway Protocol (BGP) and associated across layers. The dissertation essentially comprises three parts:
1. a comprehensive compilation of existing quality of service concepts, including the QoS functional elements already present in available network elements,
2. the detailed specification of the new concept, and
3. the results of the simulation and implementation activities that demonstrate the function and scalability of the design.
Two essential insights and requirements emerged from working on the topic: simplicity of the concept structure and simplicity of the targeted quality of service support. The targeted quality of service is therefore limited to primitive traffic separation into several classes, which are stored separately in the forwarding nodes and handled with different precedence.

Keywords
Quality of Service (QoS), Class of Service (CoS), Cross-Domain, Cross-Layer, Inter-AS, Marking Signalling, Ingress limitation Signalling, BGP, Extended Community Attribute

Abstract
The Internet is increasingly popular: its user base and the resulting traffic load are growing steadily, and it is used more and more for time and loss critical services such as voice over IP, video streaming and gaming. It consists of about 30,000 interconnected service provider networks. Those interconnections are based on the Internet Protocol (IP) and do not distinguish the mixed traffic types within the transported traffic load.
The currently observed and mostly sufficient service quality can only be achieved by over-provisioning network internal and inter-domain link capacity. A resource utilization of about 10% is commonly targeted to achieve stable and uncongested network operation. However, service providers are increasingly deploying Quality of Service (QoS) support mechanisms within their network domain in order to provide traffic separation and differentiated forwarding. Not only IP QoS, but also underlying link layer QoS mechanisms are applied. Such QoS support is currently removed at the interconnection link and possibly reapplied in an independent and uncoordinated fashion in the neighbouring domain.

A new cross-domain and cross-layer coarse grained Quality of Service support concept has therefore been drafted. It allows the automated inter-domain exchange of class of service (CoS) support information about the distinguished traffic classes at different networking layers. The concept is based on the standard inter-domain signalling protocol, the Border Gateway Protocol (BGP) version 4. Transitive BGP-based cross-domain signalling and cross-layer CoS mapping is a novel contribution. The cross-domain signalling of cross-layer mapped class set information has been submitted for standardization within the Internet Engineering Task Force (IETF). This includes class overload prevention signalling by means of token bucket based ingress limitations. The proposed and implemented concept targets global scale usage and omnipresent traffic class of service support. Service providers might be tempted to misuse offered service classes, hence the overload limitation.

Three major contributions are documented within this thesis:
1. A comprehensive compilation of QoS support concepts with detailed descriptions of network and node internal building blocks, which demonstrates the technical readiness of currently deployed devices for an inter-domain CoS based interconnection.
2. The drafted specification of the new inter-domain CoS concept, including the CoS marking and class overload limitation signalling.
3. Simulations and implementations of vital building blocks of the concept, which underline its functionality and technical feasibility. Resource estimates and successful field trials provide evidence for its scalable and functioning design.

The work on this thesis identified two fundamental design requirements for the concept: simplicity in design and simplicity in QoS support. QoS in this approach therefore refers to primitive traffic separation into several classes, which experience differently prioritized forwarding behaviour in relaying nodes. Enqueueing in separate queues is the intended means of separation.
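The class separation described in the abstract can be illustrated with a minimal sketch: packets are stored in one queue per traffic class, and a relaying node always serves the highest-priority non-empty queue first. Strict priority is chosen here only because it is the simplest prioritized forwarding discipline; the abstract does not prescribe a particular scheduler. The class name `StrictPriorityScheduler`, the number of classes and the queue limit are hypothetical choices for illustration.

```python
from collections import deque


class StrictPriorityScheduler:
    """Per-class FIFO queues served in strict priority order.

    Class 0 is the highest priority. The class count and queue limit are
    illustrative assumptions, not values taken from the thesis.
    """

    def __init__(self, num_classes=4, queue_limit=100):
        self.queues = [deque() for _ in range(num_classes)]
        self.queue_limit = queue_limit
        self.dropped = [0] * num_classes  # tail-drop counter per class

    def enqueue(self, traffic_class, packet):
        """Store a packet in the queue of its class; tail-drop when full."""
        queue = self.queues[traffic_class]
        if len(queue) >= self.queue_limit:
            self.dropped[traffic_class] += 1
            return False
        queue.append(packet)
        return True

    def dequeue(self):
        """Return the next packet from the highest-priority non-empty queue."""
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None  # all queues are empty


if __name__ == "__main__":
    scheduler = StrictPriorityScheduler(num_classes=2, queue_limit=2)
    scheduler.enqueue(1, "bulk-1")
    scheduler.enqueue(0, "voice-1")
    scheduler.enqueue(1, "bulk-2")
    print(scheduler.dequeue())  # voice-1 leaves first although it arrived second
```

Because each class has its own queue, overload in a low-priority class only causes drops in that class, which is the coarse-grained behaviour the abstract aims at. A real node would typically combine this with a fairness mechanism so that lower classes are not starved.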
Contents
1 Introduction
2 Fundamentals of IP routing and forwarding
  2.1 IP datagram structure and addressing
  2.2 Routing basics
    2.2.1 Routing protocols and hierarchy
    2.2.2 Inter-domain routing using BGP
  2.3 Router architecture
    2.3.1 Router control plane structure
    2.3.2 Router internal interconnection structure
    2.3.3 Router internal queuing structure
3 Basic QoS aspects
  3.1 Overview
    3.1.1 Relative vs. absolute vs. coarse-grained QoS
    3.1.2 QoS building blocks
  3.2 QoS treatment scope
    3.2.1 QoS-based forwarding
    3.2.2 QoS-based routing
    3.2.3 QoS-based tunnelling
  3.3 Architectural scope
    3.3.1 Cross-layer QoS
    3.3.2 Cross-domain QoS
4 State of the art QoS concepts
  4.1 IP QoS
    4.1.1 DiffServ
    4.1.2 IntServ
    4.1.3 IntServ / DiffServ combination
    4.1.4 ITU-T IP QoS concept
  4.2 Ethernet QoS
  4.3 MPLS QoS
  4.4 QoS in access networks
  4.5 Summary of expected Class of Service support
5 State of the art AS interconnection
  5.1 IP transit
  5.2 IP peering
  5.3 Internet Routing Registry - IRR
6 Related work
7 New (coarse grained) CoS concept
  7.1 Motivation and target
  7.2 Usage of BGP for QoS signalling
  7.3 Definitions and information processing
    7.3.1 BGP extended community attribute for CoS marking
    7.3.2 BGP class of service interconnection
8 Mapping strategies
  8.1 Problem statement
    8.1.1 Mapping between different class sets of the same layer
    8.1.2 Mapping between different class sets of different layers
  8.2 Existing recommendations
  8.3 Coarse grained CoS mapping recommendations
9 Simulation results
  9.1 Setup selection for QoS marking and forwarding
  9.2 Simulation results for QoS marking and forwarding
    9.2.1 Scenario 1: single node interconnection
    9.2.2 Scenario 2: AS interconnection – Single AS
    9.2.3 Scenario 3: AS interconnection – Multi-AS
    9.2.4 Scenario 4: AS interconnection – Multi-AS 2
    9.2.5 Scenario 5: AS interconnection – Multi-AS 3
    9.2.6 Scenario 6: AS interconnection – Multi-AS 4
    9.2.7 Scenario 7: AS interconnection – Cross-Layer
  9.3 Setup selection for token bucket ingress filtering
  9.4 Simulation results for token bucket ingress filtering
  9.5 Summary of simulation results
10 Concept implementation
  10.1 Linux implementation
  10.2 Wireshark implementation
  10.3 Online debug form
11 Implementation test
  11.1 Test setup
  11.2 Test result and observations
  11.3 Ethernet QoS support test at IXPs
  11.4 Resource usage estimates
    11.4.1 Increase in routing update information size
    11.4.2 Increase in memory consumption with routers
12 Summary and outlook
  12.1 Contributions and results
  12.2 Practical usage
  12.3 Outlook

Title
Cross-domain and cross-layer concept for realizing coarse-grained quality of service in IP networks

Introduction
The interconnection of today's IP-based data networks is a modern communication technology, yet it has some shortcomings in the way networks are coupled. The following historical analogy illustrates exactly these weak spots of the Internet, which are addressed and improved in this work.

In the 19th century, communication between the colonies of South Australia and Western Australia was carried by steamships, which could easily take weeks. At that time it was decided to switch communication to telegraphy, and construction of the telegraph line began in 1874. South Australia drove the line westwards from Port Augusta to the border, and Western Australia started building eastwards from Albany. The two line sections were joined in 1877 at the telegraph station in the small border settlement of Eucla ([145], [166]).
The station was staffed in equal parts by employees of both colonies, who sat facing each other along a long table aligned north to south. The border ran through the middle of the building and the middle of the table. Messages to be exchanged between the states were thus received by the respective staff, handed manually to the other side of the table, and sent again from there as a telegraph message. The reason was that the two sides used different character encodings: South Australia used American Morse code and Western Australia the international one.

The similarity is that today's Internet consists of about 30,000 independently operated IP networks, so-called Autonomous Systems (AS), which pursue quality of service concepts in an uncoordinated manner and are interconnected privately or publicly at the simplest level. Although these ASes often apply freely chosen traffic separation and prioritization internally, the separation is removed where they interconnect, and traffic is handed over without traffic separation or preferential treatment. Some ingress nodes of the ASes then perform costly classification based on the encapsulated received data in order to estimate the received traffic type as well as possible and to reapply the matching internal traffic separation and prioritization. This work therefore investigated, documented and implemented the signalling and direct traffic class based interconnection of Autonomous Systems.

Summary and outlook
This dissertation considers the interconnection of so-called Autonomous Systems, which currently offer no quality of service support at all. The contributions of this work are essentially divided into three parts.
The first part is a comprehensive compilation of existing quality of service concepts, including the QoS functional elements already present in available networks and network interconnection devices. These devices are demonstrably suitable for supporting cross-domain, class-based quality of service. These findings, together with oral statements from leading European and American network operators and operators from the Middle East about the acceptable complexity of such quality of service projects, led to the primary requirement for a simple, easily comprehensible and manageable quality of service concept. In the second part, the targeted cross-domain quality of service concept was specified and submitted to the IETF for standardization. In the third part, the function and technical feasibility of essential parts of the concept are demonstrated through simulation and implementation. The scalability and functionality of the concept were shown by field tests and by estimates of resource consumption.

Contributions and results
The following insights and contributions were achieved in this work:
• The interconnection of autonomous systems to form the global Internet is a critical interface between network operators from both a technical and an economic point of view. Current interconnections are based exclusively on the exchange of IP packets without quality of service support. Over-provisioning and network-internal quality of service support are applied at present. Because of the continuing growth of Internet traffic, the dissertation expects rising network expansion costs and increasing congestion on the interconnection links. A new class-based interconnection concept suitable for global application was therefore developed.
• Simplicity of design was identified as the decisive design criterion for acceptance of the concept in the Internet community. It applies both to the signalling structures and to the actual extent of the class support.
• The importance of supporting at least two, or better four, service classes was substantiated by simulations.
• In contrast to existing complex quality of service concepts that aim at guarantees for delay, delay variation and loss rates, the present concept requires only simple traffic separation, for reasons of cost and acceptance.
• The degree of simplicity achieved by omitting quality of service guarantees is a central prerequisite for global applicability.
• The decision to use BGP for signalling was based on an examination of existing and emerging signalling protocols.
• New so-called "Extended Communities" and a new path attribute were defined in BGP; they are used to signal the required cross-domain and cross-layer class information.
• The novel principle of transitively forwarding service class information by means of the "Extended Communities", together with the operator-definable association of the quality of service settings of different network layers within the signalling, is a fundamental achievement.
• The results of extensive single-node simulations and AS-level simulations are documented in excerpts in this dissertation and are available in full on request.
• The applicability of the concept and its interoperability with existing network elements were demonstrated by tests with the Linux implementation.
• Resource consumption estimates were made, which showed a negligible influence of the additional signalling of service class information on the size of BGP UPDATE messages. A moderate consumption of memory resources was also determined. Assuming realistic scenarios, the applicability of the concept was shown for large network sizes as well.
• The design of the concept does not hinder the additional, targeted use of complex quality of service mechanisms with guaranteed quality of service. In fact, the universal use of the present concept together with the selective use of higher-grade concepts on selected interconnections or transit paths is supported.
• On the basis of the concept, the transformation of today's Internet into an Internet supporting two, or better four, classes is envisaged.

Practical usage
Particular attention was paid to the practical use of the concept. The following points list important milestones for its applicability.
• By handing the concept specification over to IETF standardization, royalty-free reuse without patent restrictions was effectively made possible. Global application of the concept is the aim, and possible cost savings on the operator side contribute to the benefit achievable with the concept.
• The implementations in the Linux routing software Quagga and in the network analysis tool Wireshark are freely available. The Wireshark extension has already been accepted by the developers and integrated into the current software version. The same is planned for the Quagga extension.
• An online service was set up that accepts signalled class information in raw data format and decodes it. It can be found at: http://www.bgp-qos.org/draft-knoll/decode_attributes.php
• The number assignment authority IANA has already assigned type numbers for the "QoS Marking" and "CoS Capabilities" elements, so that they can be used officially in the operators' production networks. The concept has thus already crossed the threshold from laboratory setup to public deployment.
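The transitive forwarding of class information relies on a generic property of BGP Extended Communities: each community is an 8-octet value, and the second most significant bit of its first octet marks whether it may be passed on across AS boundaries. The following minimal sketch checks that bit as defined in RFC 4360. It does not decode the "QoS Marking" layout itself, which is not given in this summary; the function names and the sample type octets `0x04` and `0x44` are assumptions chosen purely for illustration.

```python
NON_TRANSITIVE_BIT = 0x40  # T bit of the high-order type octet (RFC 4360)
COMMUNITY_LENGTH = 8       # every extended community is 8 octets long


def split_extended_communities(attribute_value: bytes):
    """Split an Extended Communities attribute value into 8-octet items."""
    if len(attribute_value) % COMMUNITY_LENGTH != 0:
        raise ValueError("attribute length must be a multiple of 8 octets")
    return [
        attribute_value[i:i + COMMUNITY_LENGTH]
        for i in range(0, len(attribute_value), COMMUNITY_LENGTH)
    ]


def parse_extended_community(raw: bytes):
    """Return the type octet, the transitivity flag and the remaining octets."""
    if len(raw) != COMMUNITY_LENGTH:
        raise ValueError("an extended community is exactly 8 octets long")
    type_high = raw[0]
    return {
        "type_high": type_high,
        # T bit cleared means the community is transitive across ASes.
        "transitive": (type_high & NON_TRANSITIVE_BIT) == 0,
        # Interpretation of these 7 octets depends on the community type.
        "remaining_octets": raw[1:],
    }


if __name__ == "__main__":
    # Hypothetical sample: two communities with illustrative type octets.
    sample = bytes([0x04] + [0] * 7) + bytes([0x44] + [0] * 7)
    for item in split_extended_communities(sample):
        parsed = parse_extended_community(item)
        print(hex(parsed["type_high"]), parsed["transitive"])
```

A router that does not understand a transitive community can still pass it on unchanged, which is what allows class information to cross domains that have not deployed the concept.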
Outlook
At present, the application of the new cross-domain and cross-layer concept for realizing coarse-grained quality of service is limited to Linux-based network elements. Ongoing talks with network operators and router vendors, however, aim at general support of the concept in commercial routers. The technical feasibility was confirmed in these talks, and European operators expressed interest. Future practical experience and change requests will lead to refinement of the concept.

To promote the application of the concept, work is under way to supplement conventional commercial routers with an interactive Linux-based remote control. Fig. 155 shows the hidden control mechanism of the commercial router by an internal Linux PC. Because the signalling elements were defined as transitive, the router can delegate the processing and generation of service class information to the Linux system by passive bidirectional pass-through. Over a second connection, the Linux PC can then reach the control interface of the router and issue the commands needed to configure and activate the router's existing internal QoS functions. This interim solution allows network operators to offer class-based network interconnection without costly software or hardware upgrades.

Fig. 153 Control of a commercial router by a Linux PC

A current discussion about "network neutrality" influences the willingness of network operators and vendors to support cross-domain quality of service mechanisms. Its focus is neutral network operation without service limitations, content filters, or any preferential treatment of individual users.
Discussions on this subject with network operators and various national regulatory agencies indicated that the proposed quality of service concept, with its simple and generally applicable structure, might be regarded as a non-discriminatory improvement of the Internet that can be deployed across the board. Additional techno-economic studies on achievable cost savings will be necessary to support operators' decision processes regarding equipment upgrades and the introduction of class-based quality of service.

Section 5.2 already briefly described a signature process proposed by Google, which uses so-called BGP "Communities" to seal participation in new services and concepts. Depending on the success of this initiative, the proposed quality of service concept may come to be used as a contractual basis for agreeing class-based quality of service between operators.

Acronyms
ABR  Area Border Router
ABR  Available Bit Rate
AD  Administrative Distance
ADSL  Asymmetric DSL
AFI  Address Family Identifier
ARP  Address Resolution Protocol
ASBR  Autonomous System Border Router
ASN  Autonomous System Number
ATM  Asynchronous Transfer Mode
B-ISDN  Broadband ISDN
BA  Behaviour Aggregate
BGP  Border Gateway Protocol
BGRP  Border Gateway Reservation Protocol
BRAS  Broadband Remote Access Server
CAC  Call Admission Control
CAPEX  Capital Expenditure
CBR  Constant Bit Rate
CBWFQ  Class-Based Weighted Fair Queueing
CIDR  Classless Inter-Domain Routing
CIR  Committed Information Rate
CLI  Command Line Interface
CLP  Cell Loss Priority (CLP) bit
COPS  Common Open Policy Service
CR-LDP  Constraint-based Routed LDP
CS  Class Selector
DE  Discard Eligibility bit in frame relay
DFZ  Default Free Zone
DiffServ  Differentiated Services
DMA  Direct Memory Access
DNS  Domain Name System
DRR  Deficit Round-Robin
DS  Differentiated Services
DSCP  DiffServ Code Point
DSL  Digital Subscriber Line
DV  Distance Vector
E-LSP  EXP-Inferred-PSC LSP / now: Explicitly TC-encoded-PSC LSP
eBGP  external Border Gateway Protocol
ECN  Explicit Congestion Notification
EF  Expedited Forwarding
EGP  Exterior Gateway Protocol
EIGRP  Enhanced Interior Gateway Routing Protocol
FCFS  First Come First Served
FIB  Forwarding Information Base
FIFO  First In First Out
FR  Frame Relay
FSM  Finite State Machine
FTP  File Transfer Protocol
GbE  Gigabit Ethernet
GBR  Guaranteed Bit Rate
GCRA  Generic Cell Rate Algorithm
GIST  General Internet Signalling Transport
GMPLS  Generalized MPLS
GPS  Generalized Processor Sharing
GRE  Generic Routing Encapsulation
HDLC  High Level Data Link Control
HOLB  Head of Line Blocking
IANA  Internet Assigned Numbers Authority
iBGP  internal Border Gateway Protocol
ICMP  Internet Control Message Protocol
IETF  Internet Engineering Task Force
IESG  Internet Engineering Steering Group
IGP  Interior Gateway Protocol
IGRP  Interior Gateway Routing Protocol
IntServ  Integrated Services
IP  Internet Protocol
IPv4  Internet Protocol version 4
IPv6  Internet Protocol version 6
IRR  Internet Routing Registry
IS-IS  Intermediate System to Intermediate System
ISDN  Integrated Services Digital Network
ISO  International Organization for Standardization
ISP  Internet Service Provider
IXP  Internet Exchange Point
L-LSP  Label-only-Inferred-PSC LSP
LAN  Local Area Network
LDP  Label Distribution Protocol
LIB  Label Information Base
LIFO  Last In First Out
Loc-RIB  Local RIB
LQD  Longest Queue Drop
LS  Link State
LSDB  Link State Database
LSP  Label Switched Path
MAC  Media Access Control
MAC-in-MAC  Encapsulation of Ethernet frames in Ethernet frames
MED  Multi-Exit Discriminator
MESCAL  Management of End-to-end Quality of Service Across the Internet at Large
MPLS  Multi Protocol Label Switching
MSS  Maximum Segment Size
MTU  Maximum Transmission Unit
NGN  Next Generation Network
NLRI  Network Layer Reachability Information
NSIS  Next Steps In Signalling
NSLP  NSIS Signalling Layer Protocol
NTLP  NSIS Transport Layer Protocol
OPEX  Operational Expenditure
OS  Operating System
OSI  Open Systems Interconnection
OSPF  Open Shortest Path First
PBB  Provider Backbone Bridges
PBT  Provider Backbone Transport
PC  Personal Computer
PCN  Pre-Congestion Notification
PCP  Priority Code Point
PDB  Per Domain Behaviour
PDP  Policy Decision Point
PDU  Protocol Data Unit
PFC  Priority-based Flow Control
PGPS  Packet-by-packet Generalized Processor Sharing
PHB  Per Hop Behaviour
POTS  Plain Old Telephone Service
PS  Processor Sharing
PSTN  Public Switched Telephone Network
PT  Packet Type
q-BGP  QoS enhanced BGP
Q-in-Q  802.1q in 802.1q encapsulation
QoS  Quality of Service
QoE  Quality of Experience
RAM  Random Access Memory
ReaSE  Realistic Simulation Environments for IP-based Networks
RED  Random Early Detection
RFD  Route Flap Damping
RIB  Routing Information Base
RIP  Routing Information Protocol
RPSL  Routing Policy Specification Language
RPSLng  Routing Policy Specification Language next generation
RR  Round Robin
RR  Route Reflector
RS  Route Server
RSVP  Resource Reservation Protocol
RSVP-TE  RSVP-Traffic Engineering
SAFI  Subsequent Address Family Identifier
SDH  Synchronous Digital Hierarchy
SDU  Service Data Unit
SLA  Service Level Agreement
SONET  Synchronous Optical NETwork
SP  Strict Priority
SPF  Shortest Path First
SPI  System Packet Interface
TC  Traffic Class
TCA  Traffic Conditioning Agreement
TCP  Transmission Control Protocol
TOS  Type of Service
TTL  Time To Live
UBR  Unspecified Bit Rate
UDP  User Datagram Protocol
UMTS  Universal Mobile Telecommunications System
URL  Uniform Resource Locator
VBR  Variable Bit Rate
VC  Virtual channel
VLAN  Virtual LAN
VLSM  Variable Length Subnet Mask
VoIP  Voice over IP
VOQ  Virtual Output Queues
VTYSH  Virtual TeletYpe shell
WAN  Wide Area Network
WDRR  Weighted Deficit Round-Robin
WiMAX  Worldwide Interoperability for Microwave Access
WRED  Weighted Random Early Detection
WRR  Weighted Round Robin
WLAN  Wireless LAN
WLL  Wireless Local Loop

Acknowledgments
The work presented in this thesis was done at Chemnitz University of Technology in Chemnitz, Germany. My interest in the topic and the idea for the proposed concept arose through my lecturing work at the Chair of Communication Networks. I would like to express my deep thanks to the current and the former head of chair, Prof. Thomas Bauschert and Prof. Klaus Franke, respectively, for their support during the last years and for invaluable discussions and comments on my work. I am very grateful to Prof. Jörg Eberspächer for his offer to act as a co-examiner of my thesis and for the chance to present this work at his institute.

Special thanks go to David Ward, Dr. Yakov Rekhter, Robert Raszuk and Jie Dong for their support with IANA's number assignment, fruitful discussions and detailed feedback on the concept. I am very grateful to Arnold Nipper and Wolfgang Tremmel from DE-CIX as well as Jens Wengenmayr and Frank Benndorf from envia TEL GmbH for their technical feedback and support.

Furthermore, I wish to thank Simon Ehnert for the programming support with the Quagga routing suite, my co-worker Daniel Manns for his support in the work with OMNeT++, Uwe Steglich for challenging hours with NS2, and the other co-workers and students at the Chair of Communication Networks for their helpful comments and reflections. My thanks are also due to Brian Schaefer, who helped me correct my writing.

Finally, I would like to thank my family for their support, patience, and understanding during these challenging years.

Thomas Martin Knoll
Chemnitz, July 2009

1 Introduction
The internetworking of current IP-based data networks is a modern communication technology with some major interconnection drawbacks. The following historical allegory depicts the weak spot of the widely used Internet that is addressed in this work.
Back in the 19th century, the two colonies of South Australia and Western Australia decided to communicate with each other via telegraph rather than by steamship, which took weeks. In 1874 both colonies started to erect a new telegraph line to interconnect their independently operating telegraph systems. South Australia started its line from Port Augusta towards the border in the west, and Western Australia erected its line from Albany towards its eastern border. In 1877, the interconnection was established at the Eucla Telegraph Station ([158], [179]), a small settlement near the border between the colonies. The station was equally staffed and the telegraphists of both colonies sat along a north to south oriented table. In fact, the technical border divided the building and the operators’ table in half. The West Australian operators received their inter-state messages at the western half of the table and pushed each message across it towards their respective South Australian colleague. From there, the message was again telegraphed into South Australia, and vice versa. The reason for this manual repeater station was the different character encoding used on either side: South Australia used the American Morse code and Western Australia the International one.

The similarity lies in the fact that the current Internet consists of about 30,000 independently operated IP networks, called Autonomous Systems (AS), which run uncoordinated quality of service concepts and are privately or publicly interconnected in a very basic manner. Although ASes often apply some sort of independently chosen traffic separation and prioritization within their respective network cloud, their interconnection removes all such separation and traditionally handles the traffic exchange without any separation or prioritization.
Some AS ingress routers in turn apply multi-layer ingress classification methods in order to make a good guess at what traffic enters the network and should be separated and/or prioritized. The signalling and direct traffic class based interconnection of Autonomous Systems has therefore been investigated, documented and implemented.

2 Fundamentals of IP routing and forwarding

The robust and inexpensive exchange of information between end systems on a global scale is the major achievement of the current Internet. Many networking technologies exist which allow for the networking of electronic devices using different layer two technologies. However, such local area networks make use of several independently chosen technologies, which require interworking functions for internetworking between them. This barrier is removed with the introduction of the commonly used Internet Protocol (IP) as the least common denominator regarding the very basic requirements for a primitive datagram based information exchange. The Internet is therefore a patchwork of many networking clouds, which all provide the means for an end-to-end IP-based datagram transmission service.

2.1 IP datagram structure and addressing

In order to understand the capabilities of the globally available IP datagram service, it is best to review the protocol’s control information exchange, which is carried within the header structure of each single protocol data unit. Fig. 1 depicts the datagram structure of the currently predominant version four of the Internet Protocol. Its original structure was defined in RFC 791 [153].

Fig. 1 IP version 4 datagram structure

The most important elements of the header are the destination and the source IP address, which are used for a hop-by-hop relay process towards the destination and for backward error reporting in case of delivery failures, respectively.
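To make the header layout concrete, the following minimal sketch decodes the fixed 20-byte part of an IPv4 header with the Python standard library. The function name `parse_ipv4_header` and the sample bytes are illustrative assumptions and not part of this thesis; options and checksum validation are deliberately left out.

```python
import ipaddress
import struct


def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header (RFC 791 field order)."""
    if len(raw) < 20:
        raise ValueError("an IPv4 header has at least 20 bytes")
    # Network byte order: version/IHL, ToS, total length, identification,
    # flags/fragment offset, TTL, protocol, header checksum, source, destination.
    (ver_ihl, tos, total_len, _ident, _flags_frag,
     ttl, proto, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len_bytes": (ver_ihl & 0x0F) * 4,
        "tos": tos,
        "total_length": total_len,
        "ttl": ttl,
        "protocol": proto,
        "source": str(ipaddress.IPv4Address(src)),
        "destination": str(ipaddress.IPv4Address(dst)),
    }


# Hypothetical sample header using documentation addresses (not a captured packet).
sample = bytes([0x45, 0x00, 0x00, 0x54, 0x00, 0x00, 0x40, 0x00,
                0x40, 0x01, 0x00, 0x00, 192, 0, 2, 1, 198, 51, 100, 7])
header = parse_ipv4_header(sample)
```

A relaying node only needs the `destination` field for the forwarding decision; the `source` field becomes relevant when an error report has to be sent back.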
IP addresses used to be grouped into address classes (A, B, C, D, E) following the structure given in Fig. 2. Each node belonging to a network cloud was assigned an IP address containing the same network part within the 32 bit number. A router would therefore decide, based on the destination address of the datagram and the network number of its receiving interface, whether the datagram is destined for the originating cloud or needs to be relayed towards a next hop router.

Fig. 2 IPv4 address class system - [22]

The stiff address class regime, as well as the huge and small network clouds for class A and C type networks, respectively, led to a revised scheme for network/host differentiation, allowing any bit position within the 32 bit field as network address boundary. The scheme is called “Classless Inter-Domain Routing (CIDR)” [81], [82] and introduces a network mask field of 32 bit to support “variable length subnet masks (VLSM)”. Combined with the traditional address classes, it now allows the creation of subnets out of one larger network and supernets out of several consecutive smaller networks. Fig. 3 gives a subnetting example for the creation of 128 subnets out of one class B network.

Address:       10 | Network | Subnet | Host
Network mask:  11111111 11111111 11111110 00000000

Fig. 3 CIDR example network mask

Routers in CIDR networks now compare the network part of their interface address with the network part of the currently processed IP destination address using a simple AND operation with the network mask applied to both addresses.

The major advantage of CIDR in global scale routing lies in route aggregation. IP address ranges (so called prefix blocks) of Internet service providers or some large scale companies tend to have fine grained address allocations with network masks in their twenties.
However, routers in the core regions of the Internet might see a number of consecutive address blocks in their routing tables, which all resolve towards the same next hop neighbour. Summarizing those table entries into just one bigger address block with a shorter network mask saves on table storage, table lookup delay and route advertisement messages. Such prefix aggregation by means of CIDR is therefore heavily used in today’s Internet routing.

Further work on IP addressing was performed with the introduction of IP version 6 [63], [64], [5][1]. This new version extends the IP addresses to 128 bit fields and specifies a fixed size basic header structure of 40 bytes length. The new scheme of header extensions allows for a dynamic incorporation of additional header information. Fig. 4 depicts the version 6 datagram structure.

Fig. 4 IP version 6 datagram structure

The CIDR concept of address and netmask is continued, but the IP address classes have vanished. IP version 6 introduces address types instead [92]. The following types have been defined:

• Unicast Addresses,
   o Interface Identifiers,
   o The Unspecified Address,
   o The Loopback Address,
   o Global Unicast Addresses,
   o IPv6 Addresses with Embedded IPv4 Addresses,
   o Link-Local IPv6 Unicast Addresses,
   o Site-Local IPv6 Unicast Addresses (deprecated),
• Anycast Addresses and
• Multicast Addresses.

In the light of QoS support in IP-based networks, datagrams need to be differentiated during the hop-by-hop relay process. One approach could be to reserve address blocks for certain forwarding treatments and to assign such IP addresses to those end devices which have certain quality of service requirements. Relaying nodes could react to such special destination addresses within the datagram header and might even provide different routing decisions in their relay process.
However, this puts an unnecessary burden on the globally arranged IP address planning and prevents end devices from supporting several services with possibly differing QoS requirements concurrently.

The original IPv4 “Type of Service (ToS)” as well as the IPv6 “Traffic Class (TC)” header field both provide 8 bits for quality of service datagram marking information, but with different encodings. In the course of “Differentiated Services (DiffServ)”, a quality of service concept described in section 4.1.1, the redefinition of both fields by RFC 2474 [142] into a so called “Differentiated Services field (DS field)” was the decisive step forward to achieve a common encoding scheme independent of IP protocol versions. Six bit “Differentiated Services Code Points (DSCP)” were specified for DiffServ purposes. Since the DSCP occupies only 6 out of the redefined 8 bits of both fields, a second mechanism is incorporated in the remaining two bits. It is called “Explicit Congestion Notification (ECN)” and allows for forward congestion notification by relaying nodes (RFC 3168 [156]). The combination of both definitions and some clarification on the wording and meaning of the specifications is given in RFC 3260 [85]. Fig. 5 references the major redefinition RFCs as well as the four common differentiated services “Per Hop Behaviour (PHB)” encodings.

Fig. 5 Differentiated Services (DS) field in IPv4 and IPv6 datagram headers

2.2 Routing basics

2.2.1 Routing protocols and hierarchy

The relay of IP datagrams is performed in a hop-by-hop manner, which transports the packetized information solely based on the destination address field contained in the IP packet’s header information. The interworking functions only relay the datagram out of the layer two networking cloud if the IP destination address belongs to a different IP network than the one it originated from.
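The decision whether a destination lies in the local IP network can be sketched with the mask comparison from section 2.1: both addresses are ANDed with the network mask and the results are compared. The helper name `same_network` and the example addresses are assumptions chosen for illustration; the /23 mask mirrors the subnetting example of Fig. 3.

```python
import ipaddress


def same_network(interface: str, destination: str) -> bool:
    """Return True if `destination` shares the network part of `interface`.

    `interface` is given in CIDR notation, e.g. "172.16.2.1/23".
    """
    iface = ipaddress.ip_interface(interface)
    mask = int(iface.netmask)
    dest = int(ipaddress.ip_address(destination))
    # Bitwise AND with the mask keeps only the network part of each address.
    return (int(iface.ip) & mask) == (dest & mask)


local = same_network("172.16.2.1/23", "172.16.3.200")   # same /23 subnet
remote = same_network("172.16.2.1/23", "172.16.4.9")    # different subnet
```

Only in the second case would the datagram have to be handed to a relaying node.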
The interfaces of such interworking devices are each members of the respective networking cloud and relay the datagram on behalf of the original source to the neighbouring relaying node which they believe to be closer to the datagram’s destination. Those layer three interworking devices are called “routers” and can be equipped with different relaying capabilities depending on their position within the global, hierarchically organized patchwork of networking clouds.

The relay process of IP datagrams within a router consists of three major steps:

1. IP lookup of the datagram’s IP destination address in a forwarding table to find the best matching relay path towards the correct next hop router together with the respective output interface connecting to that next hop neighbour,
2. IP header field processing to decrease the time-to-live value and to update the header checksum accordingly and
3. internal transfer of the processed datagram to the output interface for transmission towards the layer two address of the next hop router.

The described action points relate to the lower half of the functionality depicted in Fig. 6, the functions of the so called “forwarding plane”. The upper half, the control plane, controls the setup of the vital routing information within the mentioned forwarding table. The control plane functionality is based on dynamic reachability information exchanges using specialized routing protocols. Each routing protocol instance of a router performs some sort of neighbour discovery and advertises the known IP prefixes to the discovered neighbours.

Fig. 6 IP routing and forwarding functionality

A router typically maintains two “routing tables” internally. The so called “Routing Information Base (RIB)” stores all valid routes the update process has learned locally or dynamically from other routers.
In a second step the best routes to each advertised prefix are selected out of the RIB and installed into the so called “Forwarding Information Base (FIB)” used by the forwarding plane. If an IP prefix has been learned in several best path granularities (prefix lengths), then all of them are stored in the FIB. During IP lookup, a so called “longest prefix match” is performed in order to find the most specific best route towards the currently processed IP destination address.

The dynamic routing process with the exchange of reachability information for IP prefixes is organized in a hierarchical fashion. In a flat routing structure, every newly established or lost connectivity to an IP prefix needs to be communicated to all participating routers. This is not feasible in a global scale Internet, but is used in small portions of the internetworked clouds. The routing hierarchy is comprised of routing areas, routing domains and autonomous systems (AS) as shown in Fig. 7.

Fig. 7 Internet routing hierarchy

Each hierarchy level summarises its internal reachability changes and communicates those changes in summary routes to the next upper level. This way of operation reduces the size and frequency of routing updates between the routing hierarchy levels and uses CIDR with VLSM aggregation on a large scale.

The lowest level of the hierarchy consists of routing areas, which are solely used to confine routing changes and errors to limited regions. This limits the reach of flooded information, reduces convergence time after route changes and dampens routing changes which have only area local significance. Routing areas are run by the same authority and operate identical routing protocols and policies. Inter-area routing information exchange of summarized routes is performed by so called “area border routers (ABR)” or, more generally, “gateways”.
The typical topology is a hub and spoke approach, which consists of one backbone area connecting all other areas within a routing domain.

Routing domains are consistently operated by a single authority and normally rely on a single internal routing protocol with the same set of metrics and routing decision policies. A network can also be regarded as a single routing domain if different internal routing protocols are used, as long as the domain provides a single and consistent routing behaviour to the outside network. An authority can operate one or more routing domains and request a so called “autonomous system number (ASN)” for registration in the global Internet. Each autonomous routing domain which was assigned an ASN turns into an “Autonomous System (AS)”. ASes are therefore characterized by a 16 bit (currently being transitioned to 32 bit [176]) AS number and a unified administrative routing policy internally. RFC 1930 [86] gives guidelines for creation, selection, and registration of an AS.

AS-internal routing protocols are generally referred to as “Interior Gateway Protocols (IGP)” and the ones interconnecting ASes are called “Exterior Gateway Protocols (EGP)”. Edge routers at AS interconnection points are referred to as “Autonomous System Border Routers (ASBR)”. Fig. 8 gives an example of a typical Internet routing architecture.

Fig. 8 Internet routing architecture

Commonly found routing protocols in today’s networks are RIP (Routing Information Protocol, version 1 [87] and 2 [130]), OSPF (Open Shortest Path First [140]), IS-IS (Intermediate System to Intermediate System [106], [42]) and BGP (Border Gateway Protocol version 4 [157]). Two proprietary protocols, IGRP (Interior Gateway Routing Protocol [48], obsolete) and EIGRP (Enhanced Interior Gateway Routing Protocol [50]), are also in use, but are limited to networks which solely deploy Cisco routers.
The first inter-domain routing protocol, EGP (Exterior Gateway Protocol [138]), was limited to a tree topology and is no longer in use.

IP routing protocols – applicability

Intra-domain routing, Interior Gateway Protocols (IGP):
- RIPv1 (obsolete)
- RIPv2
- IGRP (obsolete)
- EIGRP
- OSPF
- IS-IS
- iBGP (version 4)

Inter-domain routing, Exterior Gateway Protocols (EGP):
- EGPv3 (obsolete)
- eBGP (version 4)

Fig. 9 IP routing protocols – classified by applicability

The exchange of reachability information together with path characteristics can be classified into two major working principles: “distance vector routing (DV)” and “link state routing (LS)”. Smaller networks with less stringent convergence requirements and with routers of low processing power and low energy consumption will opt for DV protocols. Otherwise, link state protocols are required. A third principle, “path vector routing”, is a modified distance vector principle, which is currently only used with the border gateway protocol. Fig. 10 gives an overview of the principle classification and typical examples.

Distance vector routing

The advertisement of all known routes and their associated cost (hop count) is periodically sent out to all neighbours within a broadcast/multicast domain. Each router will in turn incorporate the reachable prefixes and adopted costs into its routing table and send this new table out to its topological neighbours. Update processing makes use of the Bellman-Ford algorithm ([22], [78]), which minimizes the hop count within the route selection. The DV principle incurs a low processing load but considerable network traffic, and the evolving update dissemination is vulnerable to routing loops and long convergence times. Since routers along the advertisement track cannot work out the original source of the information, this principle is also referred to as “routing by rumour”.
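The distance vector update step described above can be illustrated with a small sketch. It is a simplified, RIP-like relaxation under stated assumptions (hop count metric, a hypothetical `INFINITY` value of 16, no timers, no split horizon) and not the implementation of any particular protocol; the names `dv_update`, `R2` and `R3` are invented for the example.

```python
INFINITY = 16  # assumed "unreachable" hop count, as used by RIP-style protocols


def dv_update(own_table, neighbour, advertised, link_cost=1):
    """Merge a neighbour's advertisement into own_table (Bellman-Ford relaxation).

    own_table:  {prefix: (cost, next_hop)}, modified in place
    advertised: {prefix: cost} as announced by `neighbour`
    Returns True if own_table changed and must be re-advertised.
    """
    changed = False
    for prefix, cost in advertised.items():
        new_cost = min(cost + link_cost, INFINITY)
        current = own_table.get(prefix)
        if current is None:
            if new_cost >= INFINITY:
                continue  # do not learn unreachable prefixes
            better = True
        else:
            # Adopt cheaper routes, and follow cost changes reported by the
            # neighbour that is already the next hop for this prefix.
            better = new_cost < current[0] or (
                current[1] == neighbour and new_cost != current[0])
        if better:
            own_table[prefix] = (new_cost, neighbour)
            changed = True
    return changed


table = {"10.0.0.0/8": (0, "local")}
first = dv_update(table, "R2", {"10.0.0.0/8": 1, "192.0.2.0/24": 1,
                                "198.51.100.0/24": 3})
second = dv_update(table, "R3", {"198.51.100.0/24": 1})
third = dv_update(table, "R3", {"198.51.100.0/24": 1})
```

The router only knows the cost reported by its direct neighbour, never the origin of the information, which is exactly the “routing by rumour” property.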
Path vector routing

The major characteristic of this working principle is the recording of the advertisement relay trail in the exchanged prefix reachability update information. The trail record in the inter-domain case is the path of AS numbers the advertisement passed through. Path vector routing can be combined with distance vector or other path selection mechanisms. The prominent example for path vector routing is the border gateway protocol. The advertisement and route selection process is governed by policy (filter) rules and covers numerous criteria. Reachable IP networks are announced to carefully selected neighbours and might be filtered out selectively, per neighbour, by the above mentioned policy rules. The knowledge about the advertisement trail mitigates routing loops and enables route selection beyond the hop-by-hop scope. However, it still does not disclose precise network topology information. The processing load depends on the filter complexity. Since path vector routing is mainly used in inter-domain setups, routing stability is preferred over fast restoration times in case of resource failures. Hence, a slow convergence time is regarded as less critical.

Link state routing

Link state routers inform all other routers in the respective routing area about the status of their connected links and relay the flooded link state information of their neighbouring routers. LS routers maintain neighbouring sessions with each other and, after an initial phase in which all link states are exchanged, only changes are flooded within the area. This way, each router incrementally receives a complete status of the network’s topology and can work out the shortest path routing table from its own point of view. This increased computational effort uses the “Shortest Path First (SPF)” algorithm of Dijkstra [67]. The higher processing load saves on network traffic and leads to faster convergence times.
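As an illustration of the SPF computation, the following sketch runs Dijkstra's algorithm over a link state database held as a nested dictionary. The data layout, the function name `spf` and the four-router topology are assumptions made for this example only; real link state protocols carry considerably more state per link.

```python
import heapq


def spf(lsdb, root):
    """Shortest path first over lsdb = {router: {neighbour: link_cost}}.

    Returns {destination: (total_cost, first_hop)} as seen from `root`.
    """
    best = {}
    queue = [(0, root, None)]
    while queue:
        cost, node, first_hop = heapq.heappop(queue)
        if node in best:
            continue  # a cheaper path to this node was already settled
        best[node] = (cost, first_hop)
        for neighbour, link_cost in lsdb.get(node, {}).items():
            if neighbour not in best:
                # The first hop is fixed when leaving the root and inherited afterwards.
                hop = neighbour if node == root else first_hop
                heapq.heappush(queue, (cost + link_cost, neighbour, hop))
    return best


# Hypothetical topology: every router holds the same database after flooding.
lsdb = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1, "D": 4},
    "C": {"A": 5, "B": 1, "D": 1},
    "D": {"B": 4, "C": 1},
}
routes_from_a = spf(lsdb, "A")
```

Because every router computes on the identical database, each one arrives at a consistent view of the topology, but from its own root.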
The flooded link state updates identify the originator of the information. This routing principle is therefore referred to as “routing by propaganda”.

IP routing protocols – working principle

Distance Vector (“routing by rumour”):
- RIPv1, RIPv2
- IGRP
- EIGRP (hybrid)

Path Vector:
- eBGP / iBGP

Link State (“routing by propaganda”):
- OSPF
- IS-IS
- EIGRP (hybrid)

Fig. 10 IP routing protocols – classified by working principle

The routing in the global Internet relies on the meshed interconnection of autonomous systems. Point-to-point connections are the obvious solution for the interconnection of ASBRs. However, the vast majority of public interconnections are accomplished by means of Ethernet based “Internet Exchange Points (IXP)”. As Fig. 11 indicates, hierarchical, redundant and mostly distributed switching clusters are common realizations for those critical peering hubs with several hundred interconnected ASes.

Fig. 11 Internet Exchange Point - IXP

2.2.2 Inter-domain routing using BGP

The Border Gateway Protocol is explained in more detail due to its exclusive usage for inter-domain routing as well as its importance for the Inter-AS BGP-based signalling concept proposed in this thesis.

BGP is a so called path vector protocol and distributes the reachability information of network prefixes together with associated attributes. An outstanding characteristic is the AS_PATH attribute, which records a list of relaying ASes for the respective reachability information. This way, not only the neighbouring AS for a specific prefix is known to the recipient, but the whole AS path that needs to be traversed in order to reach the announced network(s). The AS path is therefore used as an important metric for path selection (optimized for minimal path length) and for loop detection. A router accepts only those advertised network prefixes as valid route updates whose path list does not include its own AS number.
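The two uses of the AS path, loop detection and path length comparison, can be sketched in a few lines. The function names and the AS numbers (taken from the range reserved for documentation) are illustrative assumptions; real BGP implementations evaluate further criteria before the path length.

```python
def accept_update(local_as, as_path):
    """Loop detection: reject a path that already contains the local AS number."""
    return local_as not in as_path


def relay_path(local_as, as_path):
    """Prepend the local AS number before advertising to an external neighbour."""
    return [local_as] + list(as_path)


def shortest_as_path(candidate_paths):
    """Pick the candidate with the fewest AS hops (one selection criterion only)."""
    return min(candidate_paths, key=len)


# AS 64500 receives a prefix that was originated by AS 64502 via AS 64501.
received = [64501, 64502]
accepted = accept_update(64500, received)
relayed = relay_path(64500, received)
# If the advertisement ever returns to AS 64502, it is detected as a loop.
looped = accept_update(64502, relayed)
preferred = shortest_as_path([[64501, 64502], [64503, 64504, 64502]])
```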
BGP relies on TCP for reliable message exchange and sets up so called “BGP sessions” between interconnected AS border routers. Each end point of the session is called a BGP peer, and the BGP neighbour session establishment will only be successful if both parties configure the IP address and AS number of the respective peer in their router internal BGP process. The border gateway protocol distinguishes two horizons: the internal one (iBGP), with BGP peerings between edge routers within an AS, and the external one (eBGP), with BGP peerings between edge routers of adjacent ASes. Since one AS needs to have a consistent knowledge of reachable prefixes at its edges, internal peers need to establish and maintain a full mesh of peering sessions. AS confederations and the concept of route reflectors, which circumvent the n ⋅ (n − 1) / 2 scalability problem of the full mesh in large ASes (the number of sessions required between n internal peers), are explained below.

The exchanged reachability information is flooded across all BGP sessions. However, it is filtered at each ingress and egress to apply strategic routing policy decisions. The border gateway protocol is therefore a global interconnection protocol with routing policy enforcement. Each AS decides which of its own prefixes are advertised to which peering partner, which prefixes are accepted from external peers (ingress filter) and which selected best paths are advertised to which external peer (egress filter).

The best path selection procedure given in Fig. 12 is vital to understand BGP’s selection of active paths out of the received available paths. The algorithm is applied when the same prefix is received several times. The decision points are processed in the given order and the first differing criterion yields the decision. One optional route processing extension is BGP Multipath, which allows several paths to a prefix to be installed in the local forwarding table.
The best path will still be worked out and announced to BGP peers, but multiple active paths will be installed and used for node-local packet forwarding.

1. Prefer the path with the highest WEIGHT. (Cisco proprietary)
2. Prefer the path with the highest LOCAL_PREF.
3. Prefer the path that was locally originated via a network or aggregate BGP subcommand or through redistribution from an IGP.
4. Prefer the path with the shortest AS_PATH.
5. Prefer the path with the lowest origin type.
6. Prefer the path with the lowest multi-exit discriminator (MED).
7. Prefer eBGP over iBGP paths. If bestpath is selected, go to Step 9 (multipath).
8. Prefer the path with the lowest IGP metric to the BGP next hop. Continue, even if bestpath is already selected.
9. Determine if multiple paths require installation in the routing table for BGP Multipath. Continue, if bestpath is not yet selected.
10. When both paths are external, prefer the path that was received first (the oldest one).
11. Prefer the route that comes from the BGP router with the lowest router ID.
12. If the originator or router ID is the same for multiple paths, prefer the path with the minimum cluster list length. (for Route Reflector environment)
13. Prefer the path that comes from the lowest neighbour session IP address.

Fig. 12 BGP Best Path Selection Algorithm - [49]

Five types of BGP messages are exchanged between peers, which are:

• OPEN (initial setup of the peering session and exchange of protocol timer settings),
• UPDATE (reachability and withdraw advertisement of network prefixes combined with path attributes; complete update during initialization and triggered updates later on),
• NOTIFICATION (closing message of the BGP session),
• KEEPALIVE (periodic handshake message, if no UPDATE is sent) and
• ROUTE-REFRESH [47] (request message for a complete reachability information exchange (refresh), e.g. for non-disruptive policy change enforcements).

All messages share the same structure, which is depicted in Fig. 13.

Fig. 13 BGP message structure

The UPDATE message structure (see Fig. 15) in particular consists of a fixed length message header, the variable length withdrawn route section, a variable length path attribute section, and the variable length “network layer reachability information (NLRI)” section. All advertised prefixes within the NLRI are “labelled” with the signalled attributes. Prefixes which require a different attribute association need to be sent in a separate UPDATE message. New UPDATE messages for previously advertised prefixes will override the stored NLRI and attribute information at the receiving end. All attributes are classified by type numbers; the same type must not be included several times in a message, and the type number is used as the ordering criterion. There are different types of path attributes defined, as shown in Fig. 14.

Path Attributes

well-known, mandatory:
- Origin
- AS-Path
- Next Hop

well-known, discretionary:
- Local Preference
- Atomic Aggregate

optional, transitive:
- Aggregator
- Community
- Extended Community

optional, non-transitive:
- Multi-Exit-Discriminator (MED)

Fig. 14 BGP path attribute classification [46], [161]

It is important to note that “transitive” and “non-transitive” in the attributes’ context relates to the attribute signalling. The optional attribute is either relayed across the AS towards the next neighbour or it is terminated within the peering AS. A non-transitive attribute might be sent out to a different AS, but will terminate there.

Fig. 15 BGP UPDATE message structure – after [157]

The concept of optional transitive community [46] and extended community [161] attributes has been added to BGP for the purpose of rather free-style signalling of, e.g., policy rule triggers or other mutually agreed on activities.
Fixed size community and extended community structures have been defined, which are carried within the respective attribute type. Several communities are therefore consecutively embedded within just a single (extended) community attribute. Of particular interest for this thesis are extended communities, which are further detailed as follows.

Extended communities are of fixed 8 byte size and consist of a type field followed by the remaining community bytes. Extended community types are divided into “regular types” (1 byte type field) and “extended types” (2 byte type field). The remaining community bytes are therefore either 7 or 6 bytes long. A type number registry for extended community types is administered by the Internet Assigned Numbers Authority (IANA) [95], which ensures that no assigned regular type number is the high byte part of an assigned extended type number.

Extended communities are contained in the transitive optional extended community path attribute. However, the communities themselves are classified as “transitive” and “non-transitive”. It is important to note that “transitive” and “non-transitive” in the communities’ context relates to the AS border crossing nature of the community. That is, transitive extended communities are relayed internally and externally. Non-transitive extended communities, however, must not cross any AS border. Although they are embedded in a transitive attribute, they are by definition confined within the iBGP of a single AS. Fig. 16 depicts the UPDATE message structure with the optional transitive extended community path attribute.

Fig. 16 BGP UPDATE message structure with Extended Community attribute

The above mentioned scalability problem of fully meshed internal BGP sessions between border routers is addressed in two ways: “Route Reflectors” and “Confederations”.

BGP Route Reflector

The BGP route reflector has been defined in RFC 4456 [20].
If fully meshed BGP speakers are grouped into clusters, a hierarchy of route reflectors and route reflector clients can be set up. That is, only the route reflectors still need to be fully meshed. They serve their clients as relay nodes for incoming UPDATEs and speak on behalf of the clients for UPDATEs raised within the cluster.

Fig. 17 BGP Route Reflector topology

BGP Confederation

Autonomous System Confederations for BGP have been defined in RFC 5065 [172]. The main idea of the concept is to divide an AS internally into several confederation ASes. Such internal ASes are not seen externally, so that the overall behaviour of the original AS does not change. Internally, routing convergence and scalability are greatly enhanced due to the route confinement within such artificial (private) ASes. Some special rules on advertisement and attribute handling had to be specified in order to establish the right procedures for confederation internal eBGP and confederation external AS representation. One example is the AS path handling, which records private AS numbers during the confederation internal signalling. Such private AS numbers, however, shall not cross the original AS border and need to be stripped off at the confederation boundary. Fig. 18 depicts the resulting AS internal topology of, e.g., AS 4321.

Fig. 18 Autonomous System Confederations for BGP

2.3 Router architecture

A router provides interconnection and relaying functionality between several inputs and typically the same number of outputs. Major characteristics are the number of supported ports and the achievable throughput. Fig. 19 depicts the general block diagram structure of a router.

Fig. 19 IP router block diagram

2.3.1 Router control plane structure

The control plane of an IP router is equipped with one or more IP routing protocol instances and provides the routing update generation and processing functionality accordingly.
Command line and web based interfaces allow direct access to the control instance for configuration and monitoring. Each enabled routing protocol allocates storage and processing resources according to its working principle. Fig. 20 depicts the typical block structure of the control plane part. Each protocol maintains its specific update information storage (e.g. a link state database (LSDB)) and performs protocol local route selection algorithms. The resulting routing table information is stored in protocol local routing information bases (RIBs) (e.g. displayed with the “command line interface (CLI)” command “show ospf route”).

The route redistribution manager is a central building block, which controls the route selection for the node local routing table as well as the mutual routing information exchange between routing protocol instances. For instance, external BGP learned routes can be redistributed into OSPF to announce global connectivity and vice versa for prefixes originating inside that routing domain. The routes that result from this selection are stored in the node’s local IP routing information base (displayed with CLI: “show ip route”). A protocol precedence, called “administrative distance (AD)”, has been defined, which decides on the precedence order of the protocol RIBs. The lower the AD value, the more important the information source is. In a last step, the condensed forwarding relevant information is installed in the forwarding information base (FIB) (displayed with CLI: “show ip cef”). This FIB is often replicated in the input port units for fast interface local lookups.

Fig. 20 IP router internal structure -> route processing

2.3.2 Router internal interconnection structure

This section looks more closely into the internal structure of the generalized block diagram of Fig. 19.
The internal relaying of IP packets between input and output port units can be implemented using three major interconnection concepts:
- shared memory,
- bus interconnection and
- crossbar interconnection.

In the shared memory concept, packets received at the input port are stored in the node local shared main memory and read out again by the output port unit for transmission. Input and output port units use direct memory access (DMA) methods for fast copy transactions to and from the shared memory. A central processing unit performs the IP lookup operation and the IP header update operations and informs the respective output port about the queued sending task. The intermediate copy phase with the write and read access to the shared memory limits the achievable throughput, which in turn limits the number of ports that can be served with such a structure.

The second implementation variant uses a common bus infrastructure between the port units, where the packet data transfer is arranged directly between the input and output units. Each input port unit needs to be equipped with the forwarding table and a route processing unit, which performs the lookup and header update operations. It independently determines the output port unit and initiates the bus transfer. Hybrid solutions with bus interconnection for frequently served prefixes and shared memory operation for error handling and complex routing operations are possible. The saved copy operation as well as the concurrently performed packet processing speed up the router’s throughput and increase device scalability.

The implementation variant commonly used in commercial routers is the crossbar interconnection between the input and output port units. A switching matrix, often referred to as “switching fabric”, provides dynamically arranged interconnections between the respective input and output units at any given point in time.
Such switching fabrics are classified into blocking and non-blocking cross-connect types. Given that two or more input port units independently decide to forward a packet towards different output port units, a non-blocking switching fabric ensures the concurrent transfer of all packets across the crossbar. Such non-blocking operation of the switching fabric can be achieved using Clos’ concept [57] of a multi-stage switching network.

The clocked relay of variable length packets across the switching fabric is cumbersome. High speed routers therefore split the packets into small fixed length units in the input port unit, which are reassembled in the output port unit. Such small chunks are called “cells” and have a size of e.g. 64 bytes. The steering of the cells across the fabric can either be organised centrally by the switch fabric controller or, more frequently in high speed routers, by so called “self-routed” cells. The latter solution adds a short header, the so called “routing tag”, to each cell, which carries the short fabric local address of the respective output port unit.

2.3.3 Router internal queuing structure

In routers with blocking switching fabrics as well as in routers with non-blocking switching fabrics under certain constellations of packet arrivals and common relay destinations, it is necessary to provide queuing memory in the input port units for temporary packet storage. Whenever a packet cannot traverse the fabric straight away, its servicing is delayed and the input port queue starts to build up. Given that two or more input port units independently decide to forward a packet towards one and the same output port unit at a given point in time, the path across the fabric towards this output is blocked for all but one of them. If the arrival rate exceeds the servicing rate, dropping mechanisms must be in place that handle the queue’s overflow events.
Output port units also need to provide an output queue for line rate adaptation, if the incoming rate of all traffic that is routed out of this interface exceeds the available sending rate. This is particularly important for traffic bursts within the multiplexed streams.

The situation in which a blocked and currently postponed packet is followed by a packet that could be served instead of the blocked one is called “Head of Line Blocking (HOLB)”. The vendor Cisco describes in its 12000 series Internet router architecture documentation [51] a multi-queue solution called “virtual output queues (VOQ)”. The queue within each input port unit is replicated once per output port unit (plus one for multicast). This way, each postponed packet transfer can be queued in the separate queue towards the currently blocked output destination and gives way to the servicing of an adjacent queue of the input unit, which is not blocked in that switching time slot. Such an optimized high speed router structure is depicted in Fig. 21.

Fig. 21 IP router with non-blocking fabric and virtual output queues

3 Basic QoS aspects

3.1 Overview

The abbreviation “QoS” stands for “Quality of Service” and is extensively used in the context of telecommunication systems. This wide applicability and usage leads to a lack of common understanding of what the respective QoS is. Neither the term “Quality” nor the term “Service” is standardized, and both need to be discussed in each case. In terms of packet-switched data networks, the QoS term refers to packet forwarding characteristics (transmission speed, delay, jitter, packet dropping probability and packet distortion rate). If such forwarding parameters are quantified in fixed values, this is referred to as “absolute QoS”. On the other hand, “relative QoS” distinguishes different kinds of packets or packet flows, which are treated differently in order to achieve prioritized forwarding characteristics.
No absolute parameter values are applied, but priorities are assigned to the distinguished kinds of packetized traffic. “Coarse-grained QoS” combines both QoS types in such a way that not single packets or packet flows are associated with fixed characteristic parameters, but groups/classes of traffic are mapped onto a fixed characteristic parameter set. Traffic separation and tunnelling techniques are common ways to support coarse-grained QoS.

The ability of a packet data network to deliver data packets in time to the right destination, with acceptable loss and distortion rates and small transfer delay variations, is the main technical criterion for judging the Quality of (transmission) Service. These technically measurable parameters, however, do not easily relate to the experienced application service quality as seen by the users at both end points of the networking system. The term “Quality of Experience (QoE)” addresses the end user’s quality experience of the services, which rely on the underlying interconnected networks and their QoS in terms of transmission quality. The mapping rules between measurable transfer QoS parameters and the resulting QoE are application specific and still in development. Fast evolving application services and QoS adaptive application implementations create a closed control loop, which obfuscates the mapping. Quality of Experience is therefore indirectly addressed but not focussed on in this work.

The following sections give an overview of the QoS types, their requirements, building blocks and scopes.

3.1.1 Relative vs. absolute vs. coarse-grained QoS

There are three general ways to support a distinguished and sufficiently good quality of transmission. The term “sufficiently” is used since quality requirements are application service specific and a networking system that just fulfils those requirements is “good enough” for the particular service.
Any transmission quality provided in excess of those requirements might be regarded as wastage. However, such wastage is justified if the capital and operational expenditures (“CAPEX” and “OPEX”) for the QoS control outweigh the economic gain of a more highly utilized network.

Over-provisioning

The first and easiest way to achieve a sufficiently good quality of transmission is over-provisioning, which is also referred to as “over-engineering”. In that case, the transmission capacity (often falsely referred to as “bandwidth”) far exceeds the sustainable transmission requirements of any given service. Link capacity utilizations of less than 40% are common in over-provisioned networks, which ensures lightly loaded network components, hardly filled input and output queues in relaying nodes and thus fast, low latency and low drop rate transmission for all services. As long as technical solutions and the economic trade-off between service revenue and transmission equipment capital and operational expenditures allow for an over-provisioning business case, it is the easiest way to achieve sufficiently good transmission quality. The ease of planning and network operation (configuration, debugging etc.) as well as the resulting network stability are major arguments in support of this approach. Over-provisioning functions well in both cases, the connection-oriented network operation, where flows of packets are signalled to the network beforehand, and the connection-less transmission of datagrams.

Relative QoS through Prioritization

As outlined before, the aspired sufficiently good quality of transmission is application service specific and therefore allows for a differentiated quality of transmission solution. As long as each application service receives a sufficiently good transmission quality, it is a completely equivalent QoS solution.
This differentiation, however, requires control overhead, which results in packet classification, marking and forwarding treatment rules in each node. The term “relative” relates to the fact that application service packets no longer receive the same relaying treatment, but some are preferably forwarded and others are delayed or even dropped. Networks with differentiated forwarding treatment support both connection-oriented and connection-less types of operation, as long as the association to a traffic class can be derived from the information carried along in the packets/datagrams. The packet marking approach is common for networks with relative QoS support, but it does not preclude deriving the same forwarding treatment from some association between a combination of packet header data and the required treatment. The relatively high transmission quality of the prioritized application services comes at the expense of transmission quality discrimination for low-priority services, which ideally suffice with the resulting quality. The application of relative QoS solutions is only justifiable if capacity limitations or the economic trade-off between service revenue and CAPEX and OPEX require selective service discrimination.

Absolute QoS through Reservation

Transmission quality in terms of guaranteed quantified parameter limits for throughput, loss, delay and delay variation can only be safely achieved by means of resource reservation with admission control and limited overbooking. The prerequisites for reservation approaches are as follows. Flows of packets need to be identified as belonging to a pre-established reservation state. All relaying nodes need to be signalled about resource requests and have to admit the requests and reserve the granted resources. They need to keep track of the flows’ resource usage in order to detect excess traffic and to make informed decisions about incoming new reservation requests.
Furthermore, the edge nodes or even every node need to implement admission control functions in order to mitigate excess traffic requests and to screen the granted reservations. The usage metering needs to account for average rates as well as instantaneous peak limitations. In order to prevent false alarms caused by the packet multiplex in relaying nodes, traffic shapers can smooth out bursts at the expense of increased transfer delay and delay variation. Networks with resource reservation in the forwarding path require a connection-oriented type of operation. The signalling of resource requests, the resource grant along the route and the setup of admission control, traffic metering and possibly shaping units need to be arranged during connection setup prior to the actual packet transfer phase. The guaranteed high transmission quality with absolute parameter limitations for selected packet flows of application services comes at the expense of service discrimination for low-priority (non-reserved) services, which ideally suffice with the resulting quality. The application of absolute QoS solutions is only justifiable if application services or the economic trade-off between service revenue and CAPEX and OPEX require service guarantees and make up for the high control overhead in every node.

Guaranteeing QoS parameters can even be targeted in over-provisioned networks. Such commitments are made and abided by at the operator’s risk, relying on the lightly loaded network.

All three general QoS approaches are often used in combination. Depending on the carried traffic type and the available link capacity, traffic separation with prioritization, trunk reservation and flow reservation can be applied concurrently.

Coarse-grained QoS

The level of granularity for absolute or relative QoS is addressed with the “coarse-grained” QoS approach.
Single flow reservations with absolute QoS guarantees are targeted with the Resource Reservation Protocol as described in 4.1.2. This is clearly a fine-grained approach and far too detailed for core network nodes, where thousands of traffic flows are aggregated in the stream of IP packets going through. The Differentiated Services approach with traffic being classified into traffic classes is a feasible QoS concept for large scale networking. However, up to 64 classes can be distinguished with the 6 bit DSCP marking codepoints. This still appears to be too sophisticated for global scale internetworking tasks, where currently no traffic differentiation is used. The traditional IP precedence approach provided 8 traffic markings within the 3 precedence bits of the TOS field (see Fig. 1). Although a traffic separation into two, three or four classes is expected to be a sufficient level of differentiation, the term “coarse-grained QoS” is taken in this work to comprise no more than 8 traffic classes. This coarse traffic separation is easily configured in relaying nodes, enables class based tunnelling as described in 3.2.3 and allows for bundle reservations onto the few separated (and possibly tunnelled) traffic classes.

3.1.2 QoS building blocks

The quality of service that a packet perceives during the forwarding process along a given path from one communication end point to the other depends on a number of forwarding decisions along the way. The major decision points are the path the packet takes through the networking cloud(s), the mixture of equivalently forwarded packets and the per hop forwarding behaviour in each relaying node. Since the QoS treatment scope is addressed in chapter 3.2, the following paragraphs address the node local QoS-related treatment mechanisms in detail.

The forwarding of packets in a single node is shown in Fig.
22 and is characterized by:
- (1) the classification,
- (2) the routing decision,
- (3) the input enqueuing with (4) dropping,
- (5) the input queue scheduling,
- (6) the fabric transmission,
- (7) the output enqueuing and
- (8) the output queue scheduling.

Fig. 22 Router internal forwarding path per hop behaviour

Each of the decision and treatment steps is looked at more closely in the following paragraphs.

Fabric transmission

The node internal relay of packets (packet cells) is predominantly performed in non-blocking crossbar structures in a contention free way of operation. It therefore does not negatively influence the overall QoS characteristic of the router.

Routing strategies

Differentiated forwarding of packets along differing routes is a major QoS treatment, which is addressed in chapter 3.2.2. It is the result of collaborative routing decisions and FIB setups among a set of routers and does not directly relate to node internal QoS strategies.

Enqueuing strategies (input and output)

The temporary storage of packets after arrival and before send out is the first major internal QoS treatment strategy, which is based on the number of queues supported in input and output port units as well as the sorting criteria for them. This work considers the availability of separate queues for distinguished packet types to be the very basic prerequisite for QoS support. Whether the queues are implemented as separate hardware components or virtually distinguished in software within one hardware component is irrelevant for the experienced separation behaviour. The number of queues and the possibly varying queue lengths are important QoS parameters. The enqueuing itself is a mapping operation between the distinguished types of traffic and the available set of queues. Routers will ideally match the classification of incoming traffic with the number of available queues for separate enqueuing.
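The enqueuing step can be read as a plain lookup from traffic class to queue index. The following minimal sketch assumes eight incoming classes (for example taken from the three IP precedence bits) and a hypothetical port unit with only four queues. The function names `build_queue_map` and `enqueue` as well as the even grouping rule are illustrative assumptions and not a mapping prescribed by this work.

```python
def build_queue_map(num_classes: int, num_queues: int) -> dict[int, int]:
    """Map class indices 0..num_classes-1 onto queue indices.

    Adjacent classes are grouped evenly (an assumption for illustration).
    With num_classes == num_queues this is the 1:1 case, otherwise M:N.
    """
    if num_classes < 1 or num_queues < 1:
        raise ValueError("need at least one class and one queue")
    queues = min(num_queues, num_classes)  # never use more queues than classes
    return {c: c * queues // num_classes for c in range(num_classes)}


def enqueue(queues: list[list], queue_map: dict[int, int],
            traffic_class: int, packet) -> int:
    """Append the packet to the queue selected by its class; return the queue index."""
    q = queue_map[traffic_class]
    queues[q].append(packet)
    return q
```

With eight classes and four queues, two neighbouring classes share one queue each, which corresponds to the M:N enqueuing case. With equal numbers the map degenerates to the 1:1 case.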
Output enqueuing will follow this decision and decide the mapping based on the classification information carried either in the packet header or in the routing tag.

Classification strategies

If QoS is supported in a node, there needs to be a classification block in the input port unit, which sorts the incoming traffic into distinguished traffic classes. The easiest way of operation would be based on the DSCP class markings in the IP header as depicted in Fig. 5. It will, however, require a class mapping (grouping) operation between the potentially 64 available DSCP classes and the actually provided enqueuing classes. Such class mappings are addressed in chapter 8. If DSCP markings are missing or not trusted, the classification can be based on any combination of IP packet header information. However, since traffic separation is most likely to be performed following application requirements, a more sophisticated multi-layer classification is performed. Mainly the port information of Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) packets reveals the type of traffic being transported within the packet payload. It is at the operator’s discretion to define the level of deep packet inspection and to configure the resulting classification rules. Current commercial router equipment provides input port units (line cards), which allow for hardware based line rate deep packet inspection even within the TCP and UDP payload.

Dropping strategies

Because of the burstiness of multiplexed packets, queues can quickly fill up and might overflow. Normally, arriving packets are dropped at the end of the queue as shown in Fig. 23. This is called “Drop-Tail queuing”.

Fig. 23 Drop-Tail queue dropping strategy

However, two optimization strategies are common, which aim either for early dropping of packets as congestion indication for responsive transport protocols such as TCP or for selective (preferred) dropping of less important packets, if traffic separation is in place.
TCP interprets any lost packet as congestion indication and will reduce the sending rate accordingly. The respective congestion window mechanism therefore tries to mitigate queue overflows in bottleneck nodes. However, packet loss indication with tail drop informs the sender far too late about the congestion, which leads to slow TCP reaction times. Sally Floyd invented the so called “Random Early Detection (RED)” mechanism [77], which randomly selects packets from a filling queue for dropping in order to give an early congestion indication. The randomness is, however, staggered into three sections as depicted in Fig. 24. As soon as the minimum threshold is crossed, the dropping probability of each enqueued packet rises as a function of the calculated average occupied queue size. Based on this, the congestion indication to each sender is roughly proportional to the sender’s share of the transmission capacity. There are several flavours of RED available, which are briefly mentioned here.

“Dropping from the front”: Since congestion indication is time critical, the randomized dropping decision should target packets at the queue’s head instead of the tail. This is particularly important if traffic flows are accounted for within the queue management. That is, if a packet is selected to be dropped, flow optimized congestion indication picks an earlier enqueued packet of the same flow for dropping instead.

“Congestion marking instead of dropping”: TCP’s deficient congestion detection has been tackled by the “Explicit Congestion Notification (ECN)” and “Pre-Congestion Notification (PCN)” extensions, which are optionally available in IP networks. End points signal their ECN support in one of the two ECN bits in the IP header (see Fig. 5) and relaying routers might indicate the upcoming congestion with the second bit.
This forward congestion notification saves the actual packet drop (and the resulting retransmission) and, together with backward signalling by the receiver, achieves exactly the same sending rate reduction effect as before. Enqueued packets crossing the maximum threshold in Fig. 24 are therefore all ECN marked.

Fig. 24 Random Early Detection (RED) for congestion avoidance

The Pre-Congestion Notification is currently being defined in the IETF pcn working group and reuses the ECN bits under certain conditions. PCN markings are only applied to IP packets which are marked with the so called “DSCP for Capacity-Admitted Traffic” value [17]. Those markings are used for domain internal admission and flow control between the ingress and egress routers of a PCN domain. That is, PCN enabled network domains differentiate DSCP marked traffic flows and ensure a smooth transport of admitted flows even under high load conditions. This is achieved by blocking new incoming flows traversing the domain or even by terminating some existing ones.

A plethora of other dropping strategies is available to operators to choose from for their per hop dropping behaviour configuration. A simple but efficient flow based dropping scheme, called “Longest Queue Drop (LQD)” [170], can be applied if flows are recorded, equally weighted and each virtually queued in a separate queue. The respective queue length relates to the flow’s usage of the link capacity, and the simple scheme of dropping front end packets from the longest flow queue will actually degrade each flow according to its link share.

Fig. 25 Longest Queue Drop (LQD) of virtually separated flows

Lastly, the “Weighted Random Early Detection (WRED)” strategy is widely used for differently prioritized packet dropping behaviour (e.g. in [54]). Weighted RED combines standard RED detection with prioritized dropping behaviour based on DSCP class markings.
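The RED and WRED behaviour can be sketched as a drop probability that depends on the averaged queue size and on a threshold set per traffic class. This is a simplified sketch under stated assumptions: the linear ramp between the two thresholds, the parameter names (`min_th`, `max_th`, `max_p`, `weight`) and the example profile values are illustrative and are not taken from a specific router implementation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RedProfile:
    min_th: float  # average queue size at which early dropping starts
    max_th: float  # average queue size from which every packet is dropped
    max_p: float   # drop probability reached just below max_th


def update_average(avg: float, current_len: int, weight: float = 0.002) -> float:
    """Exponentially weighted moving average of the occupied queue size."""
    return (1.0 - weight) * avg + weight * current_len


def drop_probability(avg: float, profile: RedProfile) -> float:
    """Three sections: no drop, linearly rising drop probability, forced drop."""
    if avg < profile.min_th:
        return 0.0
    if avg >= profile.max_th:
        return 1.0
    return profile.max_p * (avg - profile.min_th) / (profile.max_th - profile.min_th)


def should_drop(avg: float, profile: RedProfile, rng: random.Random) -> bool:
    """Randomized decision for one arriving packet."""
    return rng.random() < drop_probability(avg, profile)


# WRED: one profile per class (hypothetical values); low priority drops earlier.
WRED_PROFILES = {
    "low": RedProfile(min_th=10, max_th=30, max_p=0.10),
    "high": RedProfile(min_th=25, max_th=40, max_p=0.05),
}
```

At an average queue size of 20 packets, the hypothetical low priority profile already drops a share of the arriving packets, while the high priority profile still accepts all of them.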
For WRED, the vendor Cisco concentrates on the leading three DSCP bits and performs TOS conformant IP precedence prioritization (see Fig. 5). As a result, the RED minimum and maximum thresholds are replicated and differently configured for the distinguished classes of packet traffic. Low priority packets therefore experience an earlier threshold setting with a resulting higher dropping probability. Congestion indication via dropping is only fully effective for responsive transport protocols like TCP; otherwise, it only throttles the respective link share usage.

Scheduling strategies (input and output)

Stages (5) and (8) in Fig. 22 concern servicing decisions in view of several queues waiting for transmission. Output port units only provide a set of queues if they distinguish between traffic classes with associated forwarding priorities. Otherwise, there is only a single queue in place for incoming and outgoing rate adaptation with a simple “First In, First Out (FIFO)” scheduling strategy. Input port units provide several queues for the available N output paths to prevent HOLB and might additionally distinguish C traffic classes in separate queues per path, resulting in a maximum queue set of size C x N. There are several strategies available, which decide on the servicing order of the queue set. The easiest algorithm is called “Round Robin (RR)”, which services the queues in turns, one enqueued element at a time.

Fig. 26 Round Robin scheduling

If packets are prioritised and require differentiated scheduling behaviour, the most stringent algorithm is “Strict Priority (SP)” servicing. Here, queues of lower priority are only serviced if all higher priority queues are empty at that time (see Fig. 27).

Fig. 27 Strict Priority scheduling

The strict servicing paradigm can, however, starve out low priority queues completely, which is not intended.
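The difference between the two servicing paradigms can be illustrated with a small sketch. It assumes that queue index 0 has the highest priority and that exactly one packet is served per call. The names `strict_priority_next` and `RoundRobin` are hypothetical.

```python
from collections import deque
from typing import Optional


def strict_priority_next(queues: list[deque]) -> Optional[object]:
    """Serve the first non-empty queue; index 0 is the highest priority."""
    for q in queues:
        if q:
            return q.popleft()
    return None  # all queues are empty


class RoundRobin:
    """Serve the queues in turns, one enqueued element at a time."""

    def __init__(self, queues: list[deque]):
        self.queues = queues
        self.next_index = 0  # queue to be visited first in the next call

    def next(self) -> Optional[object]:
        n = len(self.queues)
        for offset in range(n):
            i = (self.next_index + offset) % n
            if self.queues[i]:
                self.next_index = (i + 1) % n
                return self.queues[i].popleft()
        return None  # all queues are empty
```

As long as the highest priority queue is refilled before it runs empty, `strict_priority_next` never reaches the lower queues, which is the starvation effect described above. `RoundRobin` visits every backlogged queue in each cycle regardless of priority.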
A plethora of altered round robin and priority scheduling strategies has been developed, which all aim for some fairness in the scheduling process. The major flavours called “Weighted Round Robin”, “Deficit Round-Robin”, “Weighted Fair Queuing” and “Self Clocked Fair Queuing” are briefly described below.

“Weighted Round Robin (WRR)” is the simplest approximation of the so called “generalized processor sharing (GPS)” model and aims to grant each scheduled queue a link share according to its priority. Since WRR can only service whole packets (more often cells), the priority guided fairness granularity of the link share depends on the served packet sizes. The major difference to strict priority scheduling is that WRR does not necessarily empty the high priority queue before advancing to the lower one, but rather calculates the number of packets of a certain queue that should be serviced in this round depending on the queue’s link share priority. The number of packets dequeued at a time is the queue’s priority divided by the lowest priority of any queue.

Fig. 28 Weighted Round Robin scheduling

“Deficit Round-Robin (DRR)” as well as “Weighted Deficit Round-Robin (WDRR)” are modifications of the two aforementioned strategies, respectively. The dequeuing granularity is statistically stretched across multiple dequeuing phases and accounts for the difference between the calculated dequeuing share in bytes and the share actually used up in dequeued packet bytes. Unused granted bytes are added to the share of the next round, which eventually dequeues the delayed packet due to the accumulated granted byte count. The accumulated difference count is called “deficit count”. The number of granted bytes served in each round is called “quantum” and is either fixed or weighted by the queue’s priority.

All fair queuing approaches follow the idea of an ideally fair mixing of traffic streams with infinitesimally fine granularity.
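The deficit bookkeeping of DRR and WDRR can be sketched as follows. This is a minimal illustration under assumptions: queues hold packet lengths in bytes, the class name `DeficitRoundRobin` and its methods are hypothetical, and the deficit count of an emptied queue is reset to zero so that idle queues cannot save up credit.

```python
from collections import deque


class DeficitRoundRobin:
    """Deficit round robin over queues holding packet lengths in bytes."""

    def __init__(self, quanta: list[int]):
        self.quanta = quanta                 # bytes granted per round and queue
        self.deficit = [0] * len(quanta)     # accumulated unused bytes per queue
        self.queues = [deque() for _ in quanta]

    def enqueue(self, queue_index: int, packet_len: int) -> None:
        self.queues[queue_index].append(packet_len)

    def serve_round(self) -> list[tuple[int, int]]:
        """Run one round; return (queue index, packet length) in service order."""
        sent = []
        for i, q in enumerate(self.queues):
            if not q:
                self.deficit[i] = 0          # idle queues do not save up credit
                continue
            self.deficit[i] += self.quanta[i]
            # Serve head packets as long as the accumulated grant covers them.
            while q and q[0] <= self.deficit[i]:
                length = q.popleft()
                self.deficit[i] -= length
                sent.append((i, length))
            if not q:
                self.deficit[i] = 0
        return sent
```

Equal quanta give plain DRR, while quanta scaled by the queue priority give the weighted variant. Over many rounds each backlogged queue is served in proportion to its quantum, so the scheme approximates, at packet granularity, the ideal of a perfectly fair mixing of the traffic streams with infinitesimally fine granularity.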
This idealized mixing scheme is called “Generalized Processor Sharing (GPS)” and can be symbolized as in Fig. 29.

Fig. 29 Symbolized fair queuing in an idealized GPS = Fluid-Flow Queuing

“Weighted Fair Queuing (WFQ)” [66], [149] is a Packet-by-packet Generalized Processor Sharing (PGPS) approach, which takes into account that an entire unit of traffic (packet) must be served in a dequeue process. The fluid-flow queuing ideal is approximated by means of an internal simulation of the theoretical service finishing time of each scheduled queue and by ordering the actually performed servicing of the packets according to this finishing time order. The difference between simulated finishing time and actual finishing time is a metric for the experienced service degradation (unfairness) of the respective queue.

Fig. 30 Fluid-flow approximated queuing in WFQ

The scheduling strategy called “Class-Based Weighted Fair Queuing (CBWFQ)” is a modified WFQ in the sense that it specifically assigns traffic classes to queues and configures different weights and queue lengths for the set of serviced classes.

The calculation of the finishing time in fluid-flow simulations is based on actual time steps and is independent of the enqueuing or dequeuing events processed in a scheduler. This absolute time scale is therefore replaced by an internally generated virtual time scale. The packet stream becomes “self-clocked”, hence the name “Self Clocked Fair Queuing (SCFQ)” [84]. The calculation of the self-clocked virtual finishing time is a function of the following parameters:

virtual finishing time = f(
  packet length,
  virtual finishing time of the previous packet of the same queue,
  system’s virtual time at packet arrival)

In practice this can be implemented by stamping an arriving packet with a service tag equal to the corresponding virtual finishing time and servicing the packets in ascending order of service tags.
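The tag stamping can be sketched as follows. The sketch assumes the commonly given form of the SCFQ tag, tag = max(tag of the previous packet of the same queue, virtual time at arrival) + packet length / queue weight, with the virtual time taken as the tag of the packet most recently selected for service. The class and method names are hypothetical, and the reset of the virtual time after an idle period is omitted for brevity.

```python
import heapq
import itertools


class SelfClockedFairQueue:
    """Serve packets in ascending order of their virtual finishing tags."""

    def __init__(self, weights: dict[str, float]):
        self.weights = weights
        # Finishing tag of the previous packet of each queue.
        self.last_tag = {name: 0.0 for name in weights}
        # Virtual time: tag of the packet last taken into service.
        self.virtual_time = 0.0
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps arrival order for equal tags

    def enqueue(self, queue: str, packet_len: int, packet=None) -> float:
        start = max(self.last_tag[queue], self.virtual_time)
        tag = start + packet_len / self.weights[queue]
        self.last_tag[queue] = tag
        heapq.heappush(self._heap, (tag, next(self._seq), queue, packet))
        return tag

    def dequeue(self):
        """Return (queue, packet, tag) of the next packet, or None if empty."""
        if not self._heap:
            return None
        tag, _, queue, packet = heapq.heappop(self._heap)
        self.virtual_time = tag  # the system clock follows the served packet
        return queue, packet, tag
```

A queue with twice the weight receives tags that grow half as fast for equal packet lengths, so its packets are served roughly twice as often while both queues are backlogged.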
Linking the virtual time to the work progress in the actual packet-based queuing system is a fast and computationally simple operation, which yields the same packet ordering result as the original WFQ approach. SCFQ is therefore the preferred implementation variant of any WFQ-inclined scheduling strategy. Given the described variety of building blocks along the router internal forwarding path, a basic QoS supporting router with non-blocking switching behaviour, virtual output queues and eight supported QoS classes is sketched in Fig. 31. Cisco’s 12000 series routers [52] apply weighted random early detection for class-based dropping (congestion avoidance) as well as deficit round robin scheduling within the input port unit and a proprietary modified deficit round robin for the output port scheduling.
Fig. 31 VoQ with 8 classes CoS support (scheduling and dropping)
As depicted in Fig. 32, the overall per hop forwarding behaviour experienced by any traffic being routed across a single router node is composed of all router internal building blocks and their configuration.
Fig. 32 Per hop forwarding behaviour composition in relaying nodes (per hop treatment as a combination of separate enqueueing: number of queues, 1:1 or M:N enqueueing; queue scheduling: Round Robin, WRR/WDRR, WFQ, CBFQ; packet dropping: Drop Tail, RED, WRED)
Additional building blocks are required if absolute QoS parameters are to be guaranteed. These encompass metering, rate based remarking, possibly shaping and admission control. All of those blocks rely on measurement algorithms at the input or output of the system, which either directly influence the relaying and sending behaviour (packet dropping, delaying or downgrading) or take the measurement thresholds as indicators for accounting or other traffic related configuration decisions. Generally speaking, a policing block is added to the input port unit and a shaping block to the output port unit.
“Policing” controls the general access of a certain traffic type to the node or, less stringently, acts as a rate limitation element. The functionality can be packet based or flow based, depending on the ability to associate single packets with flows by indication or by classification outside the policing block. Traffic measurements for the respective packets or flows are taken and compared with a configured threshold in order to distinguish rate conforming (“in”) traffic from excess (“out”) traffic. Policing of excess traffic can either result in packet drops or in downgraded markings, if traffic differentiation by packet markings is in place. A special policing operation is “admission control”. It is the strict dropping of packets either based on the excess traffic measurement for previously agreed sending rates or based on a flow indication and a list of previously granted or refused flows. The essential agreement on rates or flows can either be implemented with signalling protocols for rate or flow reservations or be part of a written agreement between interconnected parties, called “Service Level Agreement (SLA)”. “Shaping” controls the burstiness of transmitted streams of packets. A measurement based delaying of the enqueued outgoing packets assures a smoothed out packet sending rate. Traffic aggregation points are typical causes of traffic bursts due to the statistical multiplexing of the independent packet streams. The required measurements are generally based on two measurement analogies, “leaky bucket” and “token bucket”. Both algorithms consider the filling and draining of an imagined bucket with controlled throughput characteristics. It is important to mention that both algorithms produce filling level and timing values, which can in turn control the policing or shaping action.
“Leaky bucket” measurement model (see Fig.
33): This simple model emulates the filling of a bucket with elements and, concurrently, the dripping out of elements at a fixed rate. Two sources of events (and subsequent actions) can be derived from this model. The first is the occurrence of overflowing elements, if more than b elements have been queued up in the bucket at any point in time. This happens if the average filling rate of the variable filling process exceeds the constant drainage of the bucket. The size b of the bucket thereby represents the averaging time interval. Elements that arrive in excess of the bucket’s capacity are dropped. The second source of events is the sending time of dripped out elements. The model is most often used for traffic shaping due to its ability to smooth out bursty incoming traffic into constant rate outgoing traffic. Traffic peaks are eliminated by the delayed relay procedure. However, the overflow events can also be used in policing blocks, which detect excess traffic this way and predominantly drop or possibly remark the excess elements. Leaky bucket implementations are mostly used in “Asynchronous Transfer Mode (ATM)” networks, where they are named “Generic Cell Rate Algorithm (GCRA)”.
Fig. 33 Leaky bucket algorithm
“Token bucket” measurement model (see Fig. 34): This most commonly used model also fills and drains a bucket, but with a token controlled variable drainage rate. This rate variation is bounded by the bucket size b and leads to a constant average sending rate. Essential parameters of the model are the “Token Bucket Rate – r” and the “Token Bucket Size – b”. Tokens are released into the bucket at a constant rate r and are used up for each drained element. Incoming elements leave the bucket if enough tokens are available for the drainage. Otherwise, the elements are stored in the bucket until enough tokens have been released by the token rate.
If b elements are stored in the bucket when a new element arrives, this element flows over and is classified as excess traffic. Useful additional parameters for a token bucket model are “Peak Data Rate – p”, “Minimum Policed Unit – m” and “Maximum Packet Size – M” (see e.g. [182]). The peak rate is limited by either the interface line rate or the maximum data generation rate at the sender. The maximum packet size is a fixed statement about the largest packet size being processed and must be smaller than the link’s maximum transmission unit (MTU). The minimum policed unit is the minimum granularity of the decision process. Elements of smaller size are counted as having the minimum size in the calculations. Three sources of events (and subsequent actions) can be derived from this model. The first is the occurrence of overflowing elements, if more than b elements have been queued up in the bucket at any point in time. The second source of events is the sending time of dripped out elements. If a newly arrived packet finds the bucket filled with tokens, any incoming packet is sent out immediately at the full incoming rate. However, once the token reserve has been used up, the dripping follows the token rate. The third source of events is the first moment at which an incoming packet cannot be sent out instantly due to a lack of tokens for it.
Fig. 34 Token bucket algorithm
The token bucket metering is used within the integrated services architecture (IntServ) (see 4.1.2) and is addressed in detail in RFC 2210 [182]. Two widely used marking schemes make use of token buckets:
- RFC 2697 - A Single Rate Three Colour Marker [89] and
- RFC 2698 - A Two Rate Three Colour Marker [90].
They use one or two token buckets to determine the degree of conformity of traffic to several marking levels. Each token bucket can be used to compare the stream of packets against a certain token rate and bucket size parameter set.
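The metering use of a single token bucket can be sketched in a few lines of Python. This is a simplified illustration and not an implementation from any of the cited RFCs: it counts in bytes, starts with a full bucket, refills lazily on packet arrival and only classifies packets as conforming or excess; the queueing variant, which stores packets until tokens become available, is left out. All names are chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: float         # token bucket rate r, in bytes per second
    size: float         # token bucket size b, in bytes
    tokens: float = None
    last: float = 0.0   # time of the previous update, in seconds

    def __post_init__(self):
        if self.tokens is None:
            self.tokens = self.size   # assumption: the bucket starts full

    def conforms(self, now, length):
        """Return True for conforming ("in") and False for excess ("out") traffic."""
        # Tokens accumulate at rate r, but never beyond the bucket size b.
        elapsed = now - self.last
        self.tokens = min(self.size, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= length:
            self.tokens -= length     # tokens are used up by the packet
            return True
        return False
```

A policing block would drop or remark the packets for which `conforms` returns False, whereas a shaping block would delay them until enough tokens have accumulated.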
However, if the stream needs to be classified into several traffic classes, a number of token buckets with differing parameter sets can be used. The single rate marker uses two token buckets with the same token rate, but different sizes, in order to base a three colour marking on the differing burst sizes. The two rate marker also uses two token buckets, however with differing token rates and bucket sizes. The three colour markers are often used in differentiated services (DiffServ) (see 4.1.1) setups. The above mentioned congestion notification scheme PCN also makes use of token bucket metering for its signalling decisions. However, it introduces a filling threshold below the actual bucket size. Early warning signals can then be triggered if the bucket filling crosses that warning threshold. The algorithm is therefore called “threshold marking algorithm” [70]. Both metering algorithms, leaky bucket and token bucket, are clearly understood if equally sized elements are used. However, the transition to IP networks reveals a level of uncertainty or unfairness when variable length packets are measured. If the bucket calculations were based on packets as a whole, the decision taking would remain easy and clean. However, the metering outcome would be distorted by unequal packet sizes and thus not be useful. In practice, both models revert to byte or even bit counts for bucket sizes, packet sizes and tokens in order to accommodate the unequal resource shares of the varying packets. However, packets can now partially overflow, find partially available tokens and introduce varying times for sending, threshold crossing and overflow events. This phenomenon is addressed in RFC 3290 [28] with the definition of loose and strict mode of metering operation.
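As an illustration of the single rate three colour marker described above, the following Python sketch follows the colour-blind mode of RFC 2697 with byte-based buckets: a committed bucket (size CBS) and an excess bucket (size EBS) are both fed at the committed information rate, with new tokens filling the committed bucket first. The continuous refill on packet arrival and the class and attribute names are simplifications chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass
class SingleRateThreeColourMarker:
    cir: float          # committed information rate, in bytes per second
    cbs: float          # committed burst size, in bytes
    ebs: float          # excess burst size, in bytes
    last: float = 0.0   # time of the previous update, in seconds

    def __post_init__(self):
        self.tc = self.cbs   # both buckets start full
        self.te = self.ebs

    def mark(self, now, length):
        # New tokens go to the committed bucket first; whatever does not
        # fit there spills into the excess bucket, up to its own size.
        new = (now - self.last) * self.cir
        self.last = now
        room = self.cbs - self.tc
        self.tc += min(new, room)
        self.te = min(self.ebs, self.te + max(0.0, new - room))
        if self.tc >= length:
            self.tc -= length
            return "green"
        if self.te >= length:
            self.te -= length
            return "yellow"
        return "red"
```

Both buckets share one token rate and differ only in size, so the three colours reflect how large a burst the stream has produced relative to CBS and EBS.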
Strict conformance: “Packets of length L bytes are considered conforming only if there are sufficient tokens available in the bucket at the time of packet arrival for the complete packet (i.e., the current depth is greater than or equal to L): no tokens may be borrowed from future token allocations.”
Loose conformance: “Packets of length L bytes are considered conforming if any tokens are available in the bucket at the time of packet arrival: up to L bytes may then be borrowed from future token allocations.”
The strict mode of operation is commonly used, since it avoids negative parameter values.
3.2 QoS treatment scope
Quality of Service differentiation of packet streams can be applied at different levels of granularity as well as treatment scope. The QoS per hop treatment options have been described in the preceding chapters and will be referred to as “QoS-based forwarding”. Beyond the single node behaviour, the path that a packet stream takes through a meshed network can also be guided by QoS differentiation, which will be called “QoS-based routing”. Lastly, groups of packets belonging to a traffic class can be encapsulated in tunnelling technologies either within the IP layer (e.g. Generic Routing Encapsulation (GRE) [72]) or below (e.g. MPLS LSP see 4.3, Ethernet VLANs see 4.2, etc.). Such tunnelling will be referred to as “QoS-based tunnelling”. Fig. 35 depicts the differences in the control plane routing processing as well as the resulting FIB differences.
Fig. 35 QoS-based IP lookup variants
As a starting point, Fig. 36 depicts a simple inter-domain network setup that will be used as an example for the following three sections. It assumes three interconnected ASes, which internally perform traffic separation in certain granularities (AS1 with 4 classes in layer 2 and 3 / AS2 without separation / AS3 with 4 classes in layer 2 and 3). The interconnecting AS border routers remove any separation and provide no QoS-based (i.e.
“best effort” only) traffic exchange. From the point of view of AS1 and AS3, AS2 is a transit provider, which also offers best effort only transit service.
Fig. 36 Best Effort interconnection example
Customer traffic from AS1 towards a prefix originating from AS3 will find AS2BR_1 as next hop and the respective output interface for the interconnection link during the IP lookup procedure. Packet markings within the IP header of the traffic have no influence on this lookup and will be removed or ignored by AS1BR and AS2BR_1. The packet relay within the transit AS is shortest path per-hop forwarding without traffic class separation. AS2BR_2 in turn will find AS3BR as next hop and the respective output interface for the interconnection link during the IP lookup procedure. Packet markings within the IP header of the traffic have no influence on this lookup and will be removed or ignored by AS2BR_2 and AS3BR. The traffic entering AS3 might either be relayed internally as a best effort traffic class toward the destination or AS3 might perform costly multi-layer ingress classification in order to guess the most suitable traffic class out of the supported three class set. The three treatment scope variants are described below.
3.2.1 QoS-based forwarding
The simplest quality of service inferred packet transmission behaviour is achieved if conventional routing (and thus path selection) remains unchanged and the traffic differentiation is made hop-by-hop within relaying nodes. All QoS building blocks described in chapter 3.1.2 are applicable. Packets will traverse the network along the same paths as if no QoS support were enabled. However, the experienced per hop treatment of each traffic class will reveal the differentiated forwarding and dropping behaviour, which results in measurable QoS improvements for higher prioritised classes in terms of delay, loss and throughput compared to the common unclassified best effort network behaviour.
It is the operator’s choice to select and apply a combination of QoS building blocks for forwarding in every node along the chosen route. The standard shortest path first routing behaviour of IP networks remains unchanged, which normally results in unbalanced network load distributions. The exchange of reachability information within the router control plane does not necessarily signal QoS information for advertised IP prefixes and thus allows for no QoS specific best path selection. RIB and FIB contain just one routing entry per IP prefix and point to the relevant next hop and port information. Remapping information might additionally be stored if the next hop network requires a different DSCP marking in the IP header for the same traffic class. The quality of service in QoS-based forwarding relies solely on node internal QoS means based on the QoS information carried within the IP header.
Fig. 37 QoS-based forwarding interconnection example
In the example of Fig. 37, customer traffic from AS1 towards a prefix originating from AS3 will find AS2BR_1 as next hop and the respective output interface for the interconnection link during the IP lookup procedure. Packet markings within the IP header of the traffic are used in AS1BR for queue selection, scheduling and dropping decisions. Both interconnected border routers will respect the QoS packet markings. Since different class sets are supported in AS1 and AS2, either AS1BR or AS2BR_1 is responsible for class mapping and possibly packet remarking. The packet relay within the transit AS is shortest path per-hop forwarding with traffic class separation based on the QoS-based transit class set offered by AS2. AS2BR_2 in turn will find AS3BR as next hop and the respective output interface for the interconnection link during the IP lookup procedure. Packet markings within the IP header of the traffic are used in AS2BR_2 for queue selection, scheduling and dropping decisions.
Both interconnected border routers will respect the QoS packet markings. Since AS2 and AS3 both support a three class setup, there might not necessarily be any class mapping and remarking in place. Whether or not identical class sets and markings are used in both systems needs to be checked during a separate QoS signalling information exchange. There is no longer a necessity to perform multi-layer ingress classification due to the QoS-based forwarding and consistent marking procedure.
3.2.2 QoS-based routing
Shortest path first routing tends to unbalance network loads due to the simple preference of the shortest path algorithm. Where the resulting routes run along congested links, QoS-based forwarding can improve the experienced QoS for prioritized classes. However, reducing the links’ load by diverting the network traversal for differentiated streams of traffic is far more effective. Routers would ideally support multi-path routing entries in their RIBs and FIBs and direct traffic of different classes along different next hop routes. If multi-path routing is not supported, the selection of the shortest path could be augmented by QoS-based conditions. That is, paths with signalled QoS support should be preferred over others, even if the non-QoS paths would be shorter. If multiple QoS-supporting paths are discovered towards the same prefix, the selection should prefer the one with the best matching QoS class set. This would lead to a best QoS path selection rather than a shortest path selection. A similar best path selection is already used in inter-domain routing with BGP (see Fig. 12). Applying QoS-based routing to BGP would therefore either require multi-path BGP and/or a modified best path selection with added QoS match checking (see Fig. 38).
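The preference order described above (QoS-supporting paths first, then the best matching class set, then the shortest path) can be sketched as a ranking function. The following Python fragment is an illustration only and does not reproduce the actual BGP decision process; the `Path` structure, the class names and the third border router name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    next_hop: str
    as_path_len: int
    qos_classes: frozenset = frozenset()   # empty: no QoS support signalled


def select_best(paths, wanted_classes):
    """Pick the best QoS path towards one prefix (hypothetical rule set)."""
    def rank(path):
        matching = len(path.qos_classes & wanted_classes)
        # 1. prefer signalled QoS support, 2. prefer the best matching
        # class set, 3. fall back to the shorter AS path.
        return (bool(path.qos_classes), matching, -path.as_path_len)
    return max(paths, key=rank)


candidates = [
    Path("AS2BR_1", 2),                                           # no QoS
    Path("AS4BR_1", 3, frozenset({"voice", "video", "data"})),
    Path("ASxBR_1", 3, frozenset({"voice", "data"})),             # hypothetical
]
best = select_best(candidates, frozenset({"voice", "video", "data"}))
```

In this example the longer path via AS4BR_1 is selected, because it signals the class set that matches best; without any QoS-supporting candidate, the rule degenerates to the usual shortest path preference.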
Fig. 38 QoS-based path selection in BGP (two options: a best path selection modification, which adds a selection condition for the (extent of) QoS support to the best path selection process; or a multi-path traffic assignment, “QoS based load balancing”, which requires multi-path support and is available in meshed setups, where multiple (instead of “best”) paths are selected for a prefix)
The exchange of reachability information within the router control plane would ideally comprise the signalling of QoS information for advertised IP prefixes and thus allow for QoS specific best path selection. RIB and FIB either contain the best QoS matching routing entry per IP prefix or even multiple entries for the same IP prefix with associated QoS marking selections. Remapping information might additionally be stored if the next hop network requires a different DSCP marking in the IP header for the same traffic class. The quality of service in QoS-based routing relies on QoS-based route selection, possibly combined with node internal QoS means based on the QoS information carried within the IP header.
Fig. 39 QoS-based routing interconnection example
In the example of Fig. 39, customer traffic from AS1 towards a prefix originating from AS3 will find either AS2BR_1 or AS4BR_1 as next hop and the respective output interface for the interconnection link during the IP lookup procedure. This route selection would be based on the DSCP marking of the packets and would send priority traffic along a different transit route than ordinary traffic. In either case, packet markings within the IP header of the traffic are used in AS1BR for queue selection, scheduling and dropping decisions. Each interconnected border router will respect the QoS packet markings. Since different class sets are supported in AS1, AS2 and AS4, either AS1BR, AS2BR_1 or AS4BR_1 is responsible for class mapping and possibly packet remarking.
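A FIB with multiple entries per prefix, selected by DSCP and combined with optional remapping, could look as sketched below. The table layout, the prefix from the documentation address range and the remapping from DSCP 46 to DSCP 40 are assumptions made for this illustration; a real router would use a longest prefix match structure rather than a linear scan.

```python
import ipaddress

# (prefix, DSCP to match or None for any) -> (next hop, new DSCP or None to keep)
FIB = {
    ("203.0.113.0/24", 46): ("AS4BR_1", 40),      # priority traffic, remarked
    ("203.0.113.0/24", None): ("AS2BR_1", None),  # all other traffic
}


def lookup(fib, destination, dscp):
    """Return (next hop, outgoing DSCP) or None if no entry matches."""
    address = ipaddress.ip_address(destination)
    best_key, best_action = None, None
    for (prefix, match_dscp), action in fib.items():
        network = ipaddress.ip_network(prefix)
        if address not in network or match_dscp not in (dscp, None):
            continue
        # Longest prefix wins; a DSCP specific entry beats a wildcard entry.
        key = (network.prefixlen, match_dscp is not None)
        if best_key is None or key > best_key:
            best_key, best_action = key, action
    if best_action is None:
        return None
    next_hop, new_dscp = best_action
    return next_hop, (dscp if new_dscp is None else new_dscp)
```

Here `lookup(FIB, "203.0.113.7", 46)` selects the transit route via AS4BR_1 and remarks the packet, while traffic with any other marking towards the same prefix follows the entry via AS2BR_1 with its DSCP unchanged.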
The packet relay within the transit AS2 or AS4 might or might not apply QoS-based routing internally as well. That is, shortest path per-hop forwarding or best QoS path per-hop forwarding with transit QoS traffic class separation is applied. Either AS2BR_2 or AS4BR_2 will eventually be reached by the relayed packet and in both cases will find AS3BR as next hop and the respective output interface for the interconnection link during the IP lookup procedure. Packet markings within the IP header of the traffic are used in AS2BR_2 or AS4BR_2 for queue selection, scheduling and dropping decisions. Each interconnected border router will respect the QoS packet markings. Remapping and packet header remarking will be performed by either AS4BR_2 or AS3BR in order to match the available class sets. Whether or not identical class sets and markings are used in the involved autonomous systems needs to be checked during the separate QoS signalling information exchange. There is no longer a necessity to perform multi-layer ingress classification due to the QoS-based forwarding and consistent marking procedure.
3.2.3 QoS-based tunnelling
Offering network services to customers and service providers increasingly calls for tunnelled traffic transport. Two major reasons can be observed for the current trend towards encapsulated and route pinned IP packet transport. The first reason is traffic engineering and enhanced control over IP packet transmissions. Packet forwarding within tunnels enables operators to bypass shortest path first routing. Tunnels can be planned, set up with fixed traversal nodes and dynamically switched around without IP forwarding and routing interference. Maintenance operations and network resilience characteristics make use of backup tunnel setups and fast switching operations between the active and the standby tunnels. The second reason for increased tunnelling usage is transparency.
Tunnelled customer traffic does not interfere with network local addressing or QoS setup peculiarities. Virtual private networking services are one major application for tunnelled customer traffic transport. Tunnelling can either be realized within the IP layer, e.g. using GRE [72], or below the IP layer, most importantly with MPLS (see 4.3) and Ethernet VLANs (see 4.2). If customer IP traffic is encapsulated in transit provider IP-based GRE, then the DSCP of the outer IP header becomes the tunnel QoS marking. Similarly, the 3 bit QoS markings of MPLS (traffic class (TC) marking) and Ethernet VLANs (“Priority Code Point (PCP)” marking) in each respective header format become the QoS marking of those lower layer tunnelling technologies, which is limited to 8 classes. From the QoS perspective, two major distinctions should be made for QoS-based tunnelled transport. Either there is just one tunnel available with QoS markings in the tunnel header information or several tunnels are set up, each of which represents a certain QoS class. The latter is generally available in any “virtual channel (VC)” based networking technology. With the advent of “Generalized MPLS (GMPLS)”, any time slot, wavelength, fibre etc. can establish a separated channel, which can in turn stand for QoS-based traffic separation. Hence, a channel per traffic class option is expected to be increasingly used in future Internet setups. Applying the described tunnelled transport to IP forwarding is easily set up within an autonomous system. However, the AS interconnections are IP based only and any tunnelled peering needs mutual agreements of the interconnected peers. IP-based GRE encapsulation is always a valid option. However, inter-AS MPLS tunnels and inter-AS VLANs are more appealing. Point-to-point interconnections might be able to support both lower layer tunnelling options.
The predominantly used Internet Exchange Points, however, are critical intersections, where Ethernet VLAN tunnelling would make a major difference for transparent QoS-based customer traffic transport. Fig. 40 depicts the described tunnelling options. The E-LSP and L-LSP abbreviations will be explained in detail in chapter 4.3.
Fig. 40 Tunnelling scope (tunnelling options. Inter-AS tunnelling, subject to mutual agreement: inter-AS E-LSPs …, “tunnelling” through L2 marking (VLAN, PCP), unlikely: layered peering; L2 marking most likely. Intra-AS tunnelling, operator’s choice: E-LSPs, Carrier Eth.+PCP, L-LSPs, VCs, λs, fibres etc.; tunnelled forwarding strongly recommended)
The exchange of reachability information within the router control plane would ideally comprise the signalling of IP layer QoS information for advertised IP prefixes as well as the supported tunnel QoS information. Routing is then augmented into a best tunnel selection process together with the installation of the respective tunnel mapping information in the router’s RIB and FIB. Depending on the 1:1 or n:1 class-to-tunnel mapping, there is either one best QoS matching routing entry per IP prefix with tunnel selection and tunnel marking adoption or even multiple entries for the same IP prefix with class-based tunnel selection. Due to the tunnelled transport, customer traffic no longer needs to be remarked in the header; instead, the tunnel selection and tunnel marking take care of the correct forwarding path and per hop treatment. QoS-based tunnelling relies solely on the tunnel “header” information to select the appropriate QoS building blocks within relaying nodes. Tunnelled transport can therefore provide several transit QoS class sets to external customers and does not need to change customer header information. It is the preferred type of transit service in the future.
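The two class-to-tunnel mapping variants can be contrasted in a short Python sketch. It assumes, purely for illustration, that the 3 bit tunnel marking is derived from the three most significant DSCP bits; as this chapter points out, no such mapping is standardized, and the tunnel identifiers and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TunnelChoice:
    tunnel_id: str
    marking: Optional[int]   # 3 bit tunnel header marking, None if unused


def three_bit_marking(dscp):
    # Assumed mapping: keep the three most significant bits of the DSCP.
    return (dscp >> 3) & 0b111


def select_tunnel(dscp, per_class_tunnels=None, shared_tunnel=None):
    """Choose a tunnel for a packet; the customer IP header stays untouched."""
    traffic_class = three_bit_marking(dscp)
    if per_class_tunnels is not None:
        # One tunnel per class: the tunnel itself identifies the class,
        # unknown classes fall back to the tunnel configured for class 0.
        tunnel = per_class_tunnels.get(traffic_class, per_class_tunnels[0])
        return TunnelChoice(tunnel, None)
    # Single shared tunnel: the class travels in the tunnel header marking.
    return TunnelChoice(shared_tunnel, traffic_class)
```

For a packet marked with DSCP 46, `select_tunnel(46, per_class_tunnels={0: "lsp-be", 5: "lsp-prio"})` picks the tunnel "lsp-prio", whereas `select_tunnel(46, shared_tunnel="lsp-1")` keeps the single tunnel and sets its marking to 5.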
Major drawbacks of this approach are the currently missing inter-AS tunnel support as well as the missing standardized mapping between encapsulated IP QoS and outer tunnel QoS markings. Addressing these drawbacks is one of the improvements targeted by this thesis.
Fig. 41 QoS-based tunnelling interconnection example
In the example of Fig. 41, customer traffic from AS1 towards a prefix originating from AS3 will find AS2BR_1 as next hop and the respective output interface for the interconnection link during the IP lookup procedure. The packet markings within the IP header of the traffic are used in AS1BR for queue selection, scheduling and dropping decisions. Both interconnected border routers will respect the QoS packet markings. Since different class sets are supported in AS1 and AS2, either AS1BR or AS2BR_1 is responsible for class mapping and possibly packet remarking. The packet relay within the transit AS2 is chosen to be tunnelled in two edge-to-edge tunnels provided by the AS. Two classes are provided for transit QoS, one class per tunnel. That is, incoming packets at the border router AS2BR_1 will find two routing table entries for all reachable IP prefixes originating from AS3, with differing outgoing interface mappings. Depending on the packet’s DSCP marking, one of the two QoS-based transit tunnels will be selected. This includes IP packet encapsulation. In case of single tunnel transport and tunnel QoS support, an additional mapping of IP QoS markings onto the tunnel header QoS marking would be performed. The encapsulated packets are now transferred across AS2, applying per hop behaviour in each relaying node as derived from the tunnel associated QoS building block behaviour. AS2BR_2 will eventually be reached by the relayed packet. The encapsulation is removed and normal IP lookup will work out AS3BR as next hop together with the respective output interface for the interconnection link.
Packet markings within the IP header of the traffic are again used in AS2BR_2 for queue selection, scheduling and dropping decisions. Each interconnected border router will respect the QoS packet markings. Remapping and packet header remarking will be performed by either AS2BR_2 or AS3BR in order to match the available class sets. Whether or not identical class sets and markings are used in the involved autonomous systems needs to be checked during the separate QoS signalling information exchange. There is no longer a necessity to perform multi-layer ingress classification due to the QoS-based forwarding and consistent marking procedure. The described tunnelling procedure is known as “peer model”, where the tunnelling is confined within the transit AS boundaries. However, if the interconnection links are inter-AS tunnelling enabled, the so-called “overlay model” can be used. That is, customer traffic is encapsulated in the sending AS1 and decapsulated in the egress border router of AS2 or ideally the ingress border router of AS3. Such an inter-AS tunnel would use tunnel QoS markings, which need to be agreed on between at least AS1 and AS2. Multiple inter-AS tunnels are also feasible, where AS2 would offer several entry points for differentiated tunnelled transit. The QoS-based tunnelled transport of IP traffic with differentiated traffic classes is optimal for transport of unmodified packets and for consistent QoS marking support. However, it requires cross-domain reachability and QoS marking signalling combined with cross-layer QoS mapping information. Both requirements are addressed by this thesis.
3.3 Architectural scope
3.3.1 Cross-layer QoS
Quality of Service support has been targeted in many networking technologies. Circuit switched networks, such as the “Plain old telephone service (POTS)”, reserve for instance separate lines / fibres, separate time slots, separate frequencies, separate wavelengths, etc.
for the interconnection of communication endpoints. Such reserved channels inherently provide excellent quality of service, being designed and operated with exactly those reservations as required for the targeted service. Packet switched networks such as “Frame Relay (FR)”, “Asynchronous Transfer Mode (ATM)”, Ethernet with “virtual local area network (VLAN)” support, “Multi Protocol Label Switching (MPLS)” and others provide similar channel emulations by means of virtual channels. Packetized information transported in frames is separated by channel identifiers (FR->DLCI, ATM->VPI/VCI, Ethernet->VLAN-ID, MPLS->Label), which are either preconfigured or dynamically set up during connection setup. Either way, such separated channels provide excellent lower layer support for QoS-based tunnelling as described in 3.2.3. If separate QoS-inferred transport “lanes” are used for IP packet transport, such setups will be named “Layer 1 QoS support” in the remaining chapters. All of the named packet switching networks also provide some means for QoS-related frame markings. First of all, the channel identifier can be associated with certain per hop treatments and possibly reservations. Furthermore, explicit “QoS markings” are available such as:
• FR -> “Discard Eligibility (DE)” bit,
• ATM -> “Cell Loss Priority (CLP)” bit,
• Ethernet-VLAN -> 3 bit “Priority Code Point (PCP)” and
• MPLS -> 3 bit “Traffic Class” field [9].
Such quality of service support for tunnelling technologies will be referred to as “Layer 2 QoS support”. If QoS is aimed at in the IP networking layer (layer 3), the underlying QoS support should ideally be incorporated in the QoS-based routing and forwarding process. However, according to the layering concept, IP is normally not aware of the underlying technologies and their QoS support. The mapping between the upper layer QoS class set and marking and the lower layer QoS class set and marking is neither harmonized nor standardized.
The differentiated services approach mentions this drawback in RFC 2475 [32] and recommends in guideline 15 that specifications of per hop behaviour should include such a layer two mapping recommendation. However, this is not the case. For more details, see chapter 8.2. Vendors also give configuration guidelines for such mappings to their customers. Some examples of Cisco default mappings are given in chapter 8.2. The Ethernet standardization addresses VLAN-based frame priorities in its standard 802.1p, which is now incorporated in 802.1D [97]. However, IEEE does not provide mapping rules outside the Ethernet VLAN realm. It does standardize so-called “User Priority Regeneration” [97] and [98], which defines how priority classes are to be mutually mapped if different class set granularities are supported. Only the initial setting is stated, which might be changed by port specific configuration. This is looked at more closely in chapter 8.2 as well. In summary, many configuration options are offered to network operators for cross-layer QoS settings and some recommendations and default settings are provided. However, there is no standardized mapping defined and reliable configurations can only be assumed within single administrative domains. All default configurations can be overwritten and supported markings and mappings need to be dynamically signalled or manually configured based on SLAs.
3.3.2 Cross-domain QoS
The “Resource ReSerVation Protocol (RSVP)” [39] is the predominantly used QoS signalling protocol for single packet flow reservations as well as for trunk reservations together with MPLS or other tunnelling mechanisms. The protocol belongs to the integrated services architecture (see 4.1.2) and sends traffic specifications toward the traffic sink and on the way back receives the actually achieved reservation specification. Strict parameter based reservations with quality guarantees can easily be set up by this method.
RSVP is an end-to-end applicable reservation protocol, which potentially allows resource reservations across domain boundaries. This protocol has further gained importance, since MPLS has chosen RSVP augmented with traffic engineering extensions (RSVP-TE) as the predominantly used path setup protocol for traffic engineered MPLS paths [11]. RFC 5151 [73] addresses the inter-AS setup of MPLS paths using three possible methods: contiguous, nested and stitched paths. One crucial signalling requirement for neighbouring domains is the exchange of class set and marking information during the setup procedure. This can be achieved within RSVP-TE by means of the so called “DIFFSERV” object. Two types have been defined for 1:1 and n:1 class-to-tunnel mappings and can be found in RFC 3270 [75].

Besides this Inter-AS MPLS approach, there is no generally applicable protocol available that signals available traffic separation class sets and their encodings across AS boundaries and beyond. The setup of a QoS supporting IP forwarding path across several ASes with potentially different class sets and encodings is currently unsolved. It requires explicitly arranged mutual agreements between neighbouring AS peers along the way in order to establish an SLA guided forwarding path. Furthermore, neither the availability of transit QoS with a certain class set extent nor the packet marking required for appropriate QoS class selection is globally known, which complicates the search for QoS capable peering partners to be contacted for SLA negotiations.

This major drawback, the currently missing generally available simple QoS signalling mechanism across AS boundaries, is resolved by this thesis. Chapter 7 describes the proposed solution.

4 State of the art QoS Concepts

As described in chapter 3, over-provisioning is the easiest and currently used quality of service concept in the Internet.
As long as the cost for hugely over-dimensioned transfer capacities (factor 5 to 10) is lower than the cost of any QoS scheme (investment in QoS capable devices, staff training, operation monitoring, debugging, SLA negotiations etc.), going for the over-dimensioned network approach remains a sensible decision. However, QoS concepts have been developed for IP and other networking technologies, which are used within administrative domains. The following sections give a brief overview of them.

4.1 IP QoS

The Internet protocol has included a marking option in its header (see Fig. 5) from its very beginning, in order to indicate the priority and optimisation preference for the packet’s forwarding treatment. Those precedence and type of service markings were, however, not used by vendors and network users and were widely ignored for many years. The carried IP traffic used to be data traffic only, which is insensitive to packet loss or delay and can cope with bad transfer qualities through retransmissions and packet buffering. Following the classification in Table 1, this is batch transfer, where sending and receiving have bulk characteristics without any strict timing requirements. Throughput is the main transfer parameter.

Table 1 Transfer demand matrix - after [79]

             | Source: Bulk                               | Source: Stream
Sink: Bulk   | Batch transfer, e.g. file transfer         | “Recording” application, e.g. recording of a measurement signal
Sink: Stream | “Replay” application, e.g. video on demand | Signal transfer, e.g. voice over IP, video conferencing

“Recording” applications are streaming services, where the sender needs to regularly transmit its packetized information. The reception can be delayed since it is not critical for the recording. If buffering is used within the source, packet retransmissions are feasible to cope with packet losses.

“Replay” applications are streaming services, which require bounded transfer delay constraints. Some delay variations can be compensated by replay buffers and time shifted replay start points.
However, excess delay causes late reception of packets, which then become useless for the sink. Losses cannot be repaired by retransmissions.

Signal transfer applications are duplex streaming services, which require bounded transfer delay constraints in both directions. Only very limited sender and receiver buffers can be used for transfer delay variation compensation. In both directions, packet retransmissions are not available for loss compensation. Acceptable one way transfer delay times are for instance investigated in ITU-T Recommendation G.114. According to its analysis of very good user satisfaction for voice over IP, the mouth-to-ear delay ideally stays below 150 ms.

The demand for IP QoS has risen with the increased usage of IP networks for the latter three types of time and loss critical services.

4.1.1 DiffServ

The Differentiated Services (DiffServ) architecture is a prioritization concept for aggregated traffic providing relative QoS (see 3.1.1) in IP networks. Work on DiffServ started in the late 1990s and resulted in two major standard documents:
• RFC 2475 - An Architecture for Differentiated Services [32] and
• RFC 2474 - Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers [142].

Three interesting citations out of the work on DiffServ should be mentioned here, which clarify the situation and intention of its developers.

“What problem are we solving? Give “better” service to some traffic (at the expense of giving worse service to the rest). ATM marketing fantasies to the contrary, QoS is a zero-sum game:
- it does not create bandwidth.
- it does not guarantee that you get better service.”
Van Jacobson [121]

"QoS is managed unfairness."
Kathleen Nichols [141]

“DiffServ wasn't chartered to solve the end to end QoS problem. It was chartered to define coarse-grained class-of-service differentiation, which is an entirely different (and much easier) goal.”
Brian E. Carpenter [44]

The formal description of DiffServ’s intention is given in RFC 3086 [143]: “The differentiated services framework enables quality-of-service provisioning within a network domain by applying rules at the edges to create traffic aggregates and coupling each of these with a specific forwarding path treatment in the domain through use of a codepoint in the IP header”. An overview of the Differentiated Services architecture is depicted in Fig. 42.

Fig. 42 Differentiated Services regions, domains and nodes

Packets of the different traffic streams are grouped into so called “Behaviour Aggregates (BA)”, each of which implies the same per hop behaviour (PHB) in relaying nodes along the path. A BA is identified by a differentiated services codepoint (DSCP). All nodes within a DS domain therefore associate a consistently configured set of treatment policies (queuing, scheduling, dropping) with each specific BA. This applies to core as well as to edge nodes. Core nodes classify behaviour aggregates solely by inspecting the packet’s DSCP information. Edge nodes of a DS domain, however, additionally perform multi-field classification and conditioning functions. That is, the classification / grouping in the domain’s ingress node inspects a combination of possibly several header information fields and the ingress interface of that packet in order to make a policy guided decision about the packet’s aggregation into a certain BA. This in turn results in the appropriate DSCP marking (see Fig. 43).

Fig. 43 Behaviour aggregate classification and DSCP marking

The edge node’s conditioning comprises several function elements, such as meter, marker, shaper, and dropper. It is important to understand that DiffServ assumes the correct classification and in particular the conditioning functionality to be sufficiently enforced at the edge of a DS domain.
This in fact increases the scalability of the concept and takes the burden of computationally intensive multi-field classification and conditioning off the core nodes. Although this approach does not preclude or address the possible burstiness and congestion conditions that will arise in internal traffic aggregation points, it is a good compromise between strict QoS support and required control overhead. Internal over-provisioning can be used as a countermeasure.

RFC 2475 [32] depicts the classifier and traffic conditioner structure as shown in Fig. 44.

Fig. 44 Logical View of a Packet Classifier and Traffic Conditioner

The policy-based setup of PHB associated treatment combinations of queuing, scheduling, metering, marking and re-marking, shaping and dropping is made by each DS domain administrator and can make use of the plethora of mechanisms described in chapter 3.1.2. Several DS domains may be operated under the same administration, which relieves the edge node operation of neighbouring DS domains. If both domains use the same configuration, the ingress edge can simply operate with core node functionality. Otherwise, full ingress operation needs to be applied. If differently administered DS regions are interconnected, there needs to be an agreement on how to set up the ingress classification and conditioning. This is normally done with SLAs containing the so called “Traffic Conditioning Agreement (TCA)”.

As described above, the central point of operation of DS domains are behaviour aggregates with associated PHBs. No limitation has been standardized regarding the number of possibly applicable PHBs. However, the mapping into DSCPs is limited to 64 possible values, which results in operator specific PHB-DSCP mappings (see Fig. 45).

Fig. 45 PHB ↔ DSCP mapping

The stated limitation of the “global PHB space” to 4096 entries is not a strict limit and relates to the definition of so called “Per Hop Behaviour Identification Codes (PHB-ID)” [31] for PHB signalling usage outside a consistently managed DS region. A typical example is inter-provider PHB identification. The PHB ID encoding distinguishes between standard PHBs as described below and non-standardized PHBs. The latter are IANA assignable and currently limited to 12 bit length, hence the 4096 limitation.

Fig. 46 PHB encoding [31]

PHB ID codes can contain a set bit 14, which indicates PHB sets. This is best understood for the 4 classes of AF PHB described below, with 3 drop precedence encodings each.

Besides the free choice for PHB definitions, 22 per hop behaviours have been defined with recommended DSCP encoding values, which will be briefly explained below.

Default PHB - 000000

RFC 2474 [142] defines the default setting of the DSCP header field as “000000”, which is the standard IP encoding for “best effort” service. This is also the last resort choice if the DS domain ingress node cannot associate another DSCP encoding during its classification process.

Class-Selector (CS) PHB - xxx000

RFC 2474 [142] has included the 8 class selector DSCP encodings for backward compatibility with the original IP precedence encoding. The three leading bits of the 6 bit DSCP are thereby numbered with the equivalent class selector number CS0 to CS7. CS0 is identical with the default PHB. Some minimal requirements have been stated concerning packet ordering and timely delivery for those classes. It should be noted that CS-DSCPs all encode the lower 3 bits as 000. However, a DS domain which only supports class selector PHBs will classify incoming packets on the three leading bits only. This way, different DSCPs might be merged into single CS DSCPs.
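The merging effect of class selector based classification can be illustrated with a few lines of Python. This is only a minimal sketch; the function names `class_selector` and `merge_to_cs_dscp` are illustrative choices and not taken from any RFC or vendor implementation.

```python
def class_selector(dscp: int) -> int:
    """Return the class selector number (0..7) given by the three leading DSCP bits."""
    if not 0 <= dscp <= 0b111111:
        raise ValueError("DSCP is a 6 bit value")
    return dscp >> 3


def merge_to_cs_dscp(dscp: int) -> int:
    """Return the CS DSCP (xxx000) that a CS-only domain effectively sees."""
    return class_selector(dscp) << 3


# Two different codepoints that share the leading bits 001 end up in CS1.
for dscp in (0b001010, 0b001100):
    print(format(dscp, "06b"), "->", format(merge_to_cs_dscp(dscp), "06b"))
```

Both example codepoints are mapped onto 001000, which is the merging described above.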
Expedited Forwarding (EF) PHB - 101110

RFC 3246 [60] defines this expedited forwarding per hop behaviour, which is intended to support low delay, low jitter and low loss services. It is the only PHB that is normally associated with a rate limitation, which is strictly enforced in the DS ingress edge nodes. EF marked traffic is expected to experience short or even empty waiting queues in relaying nodes, which leads to low packet loss as well as low and hardly jittering transport delay times. Separate enqueuing and highly prioritized scheduling are the keys for EF forwarding behaviour.

Assured Forwarding (AF) PHB

RFC 2597 [88] standardizes a group of Assured Forwarding PHBs. 12 DS codepoints have been allocated for 4 AF classes (AF1, AF2, AF3, AF4) with 3 drop precedence values (1, 2, 3) each. An important constraint is given about the packet ordering. Within each class, packets marked with the same AF class (and possibly differing drop precedence) must not be reordered during the forwarding process. This is particularly important for multi-path forwarding (e.g. load balancing). Since PHBs are aggregates of IP traffic, such AF classes are also referred to as “ordered aggregates”. The recommended encoding is depicted in Fig. 47 and Table 2.

AFcd = XYZab0 (XYZ: binary class encoding, ab: drop precedence, last bit: always ‘0’)

Fig. 47 Encoding of Assured Forwarding PHBs

Table 2 Assured Forwarding DSCP encoding

AF class  | 1             | 2             | 3             | 4
DP low    | AF11 = 001010 | AF21 = 010010 | AF31 = 011010 | AF41 = 100010
DP medium | AF12 = 001100 | AF22 = 010100 | AF32 = 011100 | AF42 = 100100
DP high   | AF13 = 001110 | AF23 = 010110 | AF33 = 011110 | AF43 = 100110

All standardized PHBs are “per hop behaviour” descriptions, which only define the forwarding behaviour of a single node. No statement can be made about the overall forwarding behaviour experienced by a packet crossing a DS domain. Such chained forwarding behaviours, called “Per Domain Behaviour (PDB)”, are addressed in RFC 3086 [143].
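The AF encoding rule AFcd = XYZab0 can be checked against Table 2 with a short Python sketch. The helper names `af_dscp` and `af_name` are hypothetical and only serve this illustration.

```python
def af_dscp(af_class: int, drop_precedence: int) -> int:
    """Build the recommended DSCP for AF<class><drop precedence> (pattern XYZab0)."""
    if af_class not in (1, 2, 3, 4):
        raise ValueError("AF class must be 1..4")
    if drop_precedence not in (1, 2, 3):
        raise ValueError("drop precedence must be 1..3")
    # Class in the three leading bits, drop precedence in the next two, last bit 0.
    return (af_class << 3) | (drop_precedence << 1)


def af_name(dscp: int) -> str:
    """Return the AF name for a DSCP, or raise ValueError if it is not an AF codepoint."""
    af_class, drop_precedence = dscp >> 3, (dscp >> 1) & 0b11
    if dscp & 1 or af_class not in (1, 2, 3, 4) or drop_precedence == 0:
        raise ValueError("not an AF codepoint")
    return f"AF{af_class}{drop_precedence}"


# Reproduce the rows of Table 2 (low, medium, high drop precedence).
for dp in (1, 2, 3):
    print("  ".join(f"AF{c}{dp} = {af_dscp(c, dp):06b}" for c in (1, 2, 3, 4)))
```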
Per domain behaviours are requested to state specific, measurable metrics that quantify the treatment, so that they can be used in SLAs. Interestingly, there is currently just one PDB standardized, which does not include any quantifiable metrics. It is called “Lower Effort (LE)” PDB and targets traffic of lower importance than the traditional best effort type. There has been another PDB specification approach called “‘Virtual Wire’ Per-Domain Behaviour” (draft-ietf-diffserv-pdb-vw-00). However, it has never reached RFC status and expired in 2001.

Lower Effort (LE) PDB - 001000

Traffic that is of lower value than normal best effort traffic is currently marked with the default PHB. However, such low value packets are allowed to be starved out in times of congestion and can serve as lowest priority background traffic to effectively use available link capacity without detrimental interaction with other traffic classes. There is no precise DSCP encoding given, but the CS1 codepoint (001000) mentioned in RFC 3662 [34] is expected to be widely used in LE supporting DiffServ setups.

Table 3 Currently specified PHBs

No. | DSCP    | PHB
1   | 000 000 | Default PHB / CS0
2   | 001 000 | LE / CS1
3   | 001 010 | AF11
4   | 001 100 | AF12
5   | 001 110 | AF13
6   | 010 000 | CS2
7   | 010 010 | AF21
8   | 010 100 | AF22
9   | 010 110 | AF23
10  | 011 000 | CS3
11  | 011 010 | AF31
12  | 011 100 | AF32
13  | 011 110 | AF33
14  | 100 000 | CS4
15  | 100 010 | AF41
16  | 100 100 | AF42
17  | 100 110 | AF43
18  | 101 000 | CS5
19  | 101 110 | EF
20  | 110 000 | CS6
21  | 111 000 | CS7

It should be clearly stated that all PHB and PDB encodings are recommendations and network operators may choose alternative DSCP values for the same behaviour. That is why inter-domain PHB signalling should include the global PHB ID signalling together with locally applied encodings. This approach has been proposed in this work and is documented in chapter 7.3.1.

A second general statement shall be given on fragmentation.
If IP packets are fragmented along the forwarding path, there is no explicit rule on how DiffServ marking, metering, scheduling and dropping should react to the set “more fragments” bit. As RFC 2474 states: “The policy to apply to packet fragments is outside the scope of this document”.

4.1.2 IntServ

The second fundamental QoS architecture for the Internet is the so called “Integrated Services (IntServ)” architecture [38]. In contrast to DiffServ, IntServ is an approach targeting fine-grained end-to-end QoS (see 3.1.1) with guaranteed absolute traffic parameters. Guarantees are enabled through traffic flow specific reservations and connection admission control in each node along the forwarding path. This requires a connection setup procedure with resource request and resource grant messages and leads to flow-specific reservation states in each relaying node. The association of packets to those reserved flow states is not generally based on a single packet header marking, but rather requires multi-field “classification” in each node. In IPv6, the “Flow Label” header field (see Fig. 4) enables efficient IPv6 flow classification [155].

Reservations are application specific end-to-end flow states, which are signalled by means of a specialised signalling protocol called “Resource ReSerVation Protocol (RSVP)” [39]. Reservations exist in hosts and relaying nodes and are unidirectional. RSVP operates on top of IPv4 or IPv6 and follows standard IP routing. Fixed route selection (source routing) can be enforced by so called explicit route objects, which list the pinned down sequence of hops to be used. Reservations are set up as soft state, which automatically times out if not periodically refreshed. Refresh cycles of 30 seconds are common. Reservations are receiver initiated. Multicast is supported and upstream reservations are merged at multicast tree junctions towards the sender.
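The soft state principle described above can be sketched as a small reservation table in Python. The class name `SoftStateTable` and the lifetime factor of three refresh periods are assumptions made for this illustration only; they do not reproduce the timer rules of the RSVP specification.

```python
from dataclasses import dataclass, field

REFRESH_PERIOD_S = 30.0   # refresh cycle mentioned in the text
LIFETIME_FACTOR = 3.0     # assumption for this sketch, not an RSVP constant


@dataclass
class SoftStateTable:
    """Reservation states that vanish unless they are refreshed in time."""
    lifetime_s: float = REFRESH_PERIOD_S * LIFETIME_FACTOR
    expiry: dict = field(default_factory=dict)

    def refresh(self, flow_id: str, now: float) -> None:
        # Creating and refreshing a state is the same operation.
        self.expiry[flow_id] = now + self.lifetime_s

    def expire(self, now: float) -> list:
        # Remove and return all states whose lifetime has run out.
        stale = [flow for flow, t in self.expiry.items() if t <= now]
        for flow in stale:
            del self.expiry[flow]
        return stale

    def active(self, flow_id: str, now: float) -> bool:
        return self.expiry.get(flow_id, float("-inf")) > now


table = SoftStateTable()
table.refresh("flow-a", now=0.0)
table.refresh("flow-b", now=0.0)
table.refresh("flow-a", now=60.0)   # only flow-a keeps being refreshed
print(table.expire(now=100.0))      # flow-b has timed out by now
```

No explicit teardown message is needed in this model: a state that is no longer refreshed disappears on its own.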
Five fundamental questions are addressed in the IntServ architecture’s signalling:
• How to identify a flow (associate packets to reserved flows)? → FilterSpec,
• What is the sender’s traffic profile? → TSpec,
• What guarantee profile is requested by the receiver? → RSpec,
• Which reservation style is used in the FilterSpec? → fixed, shared, wildcard and
• Which service model is requested for the network element behaviour? → “Controlled-Load Network Element Service” or “guaranteed service”.

Since the resource reservation is flow-oriented, the flow descriptor (see Fig. 48) becomes a central architectural element.

Flow Descriptor
• FilterSpec (‘filter specification’): rules to associate packets to flows → classifier setup
• FlowSpec → scheduler setup
  - TSpec (‘traffic specification’): describes the expected traffic flow
  - RSpec (‘reserve specification’): describes the flow reservations

Fig. 48 QoS RSVP flow descriptor structure

The FilterSpec is generally not limited to certain header and/or payload fields of the respective IP packets. However, source IP address and source port are basically used for flow classification. Three classes of filters (and hence classifications) are specified: “fixed filter”, which leads to separate reservations for each sender, “shared filter”, which allows for resource sharing between several named senders and “wildcard filter”, which allows for resource sharing between all senders.

The working principle of RSVP is as follows. A traffic source offers an application service to potential clients and describes the offered traffic flow characteristics by means of token bucket parameters in the so called TSpec structure. This parameter set is carried unchanged from one node to the next in RSVP PATH messages. The TSpec eventually reaches the potential clients and those receivers decide individually whether they start a reservation upstream.
This is signalled using RESV messages with concrete receiver TSpec values, which are now referred to as “RSpec values”. Intermediate nodes will reserve the requested resources according to those RSpec values in their scheduler setup. The TSpec and RSpec parameter encoding is defined in RFC 2210 [182].

Two QoS service modes of operation have been standardized, which either imply strict bounds on end-to-end datagram queueing delays as “Guaranteed Quality of Service” [166] or approximate QoS by means of capacity based admission control in order to emulate a lightly loaded network to the respective “Controlled-Load” [183] flows.

Fig. 49 and Fig. 50 depict the message flow and the node internal block diagram structure. Fig. 49 also points out the merging of RSpecs at multicast tree junctions. Single receiver requests are combined, and upstream signalling and resource reservation are based on the combined reservation requests.

Fig. 49 RSVP message flow diagram

Fig. 50 RSVP support block diagram - after [39]

The policy control and the admission control function are normally harmonized across an IntServ domain. Its central management is achieved through policy servers (also referred to as “Policy Decision Point (PDP)”). The information exchange in query response manner makes use of the “Common Open Policy Service (COPS)” protocol [91].

4.1.3 IntServ / DiffServ combination

The fine-granular end-to-end reservations based on IntServ cannot be scaled to a globally extended network. However, two solutions exist to tackle this scalability problem:
a) create coarse-grained reservations by applying IntServ to tunnels and
b) set up the interworking of IntServ with DiffServ networks in order to gain some QoS advantage from non-IntServ enabled networks.

The first approach is commonly applied in MPLS-based networks, which is addressed in chapter 4.3.
The second approach is standardized in RFC 2998 “A Framework for Integrated Services Operation over DiffServ Networks” [29] and will be briefly outlined below.

End-to-end QoS in such an IntServ/DiffServ combined network path is targeted in a way where whole DiffServ domains are regarded as virtual links between the IntServ capable routers or hosts. That is, fine-grained multi-field classification is performed in the IntServ realm and behaviour aggregate (BA) classification (namely DSCP classification) is applied in the virtual links. Through the definition of supported PHBs and the rate limited admission control into the DiffServ domain, those virtual links obtain predictable forwarding behaviour.

As an example, the RSVP support in Cisco routers can be switched between the standard IntServ operation and the RFC 2998 mode of operation (see Fig. 51). The difference lies in the way of classification as well as the low latency queuing in the data plane for the IntServ/DiffServ mode.

Fig. 51 Cisco’s two RSVP operation models: IntServ and IntServ/DiffServ [53]

4.1.4 ITU-T IP QoS concept

Besides the IETF standardization of IP QoS, ITU-T has also targeted IP QoS in its recommendations Y.1221 [107] and Y.1541 [108]. Generally speaking, Y.1221 distinguishes three transfer capabilities, which are arrangements of traffic control and congestion control functions. A concept of traffic contracts between users and the network is assumed. Y.1541 adds classes of network QoS with objectives for IP network performance parameters in absolute numbers, which will be the base for the respective contracts. Table 4 lists the six specified classes, their parameter limitations as well as the suggested association to DiffServ PHBs.

Table 4 Excerpt of IP QoS class definitions and performance objectives [108]

Two further classes (6 and 7) have already been identified, but with provisional status.
That is, eight classes can be expected for ITU-T’s IP QoS recommendations in the future.

4.2 Ethernet QoS

The predominantly used layer two networking technology in today’s data networks is Ethernet. Several flavours are standardized with respect to the transmission speed and channel allocation behaviour. Commonly used Ethernet variants are Fast Ethernet (100 Mbps) with emulated shared medium operation, Gigabit Ethernet (1 Gbps), which exclusively uses dedicated transmission lines, and 10 Gigabit Ethernet (10 Gbps). 40 and 100 Gigabit Ethernet (40 and 100 Gbps) are currently under specification at the IEEE P802.3ba task force. Those Ethernet types differ substantially in their physical and operational specifications, but they all make use of one and the same frame format (see Fig. 52).

Fig. 52 Ethernet frame format

Almost all local area networks (LAN) and many metro networks rely on Ethernet for device networking. A special development has been made with Wireless LAN (WLAN) [99], which uses an Ethernet-like frame structure as well.

Ethernet is not only the major LAN and metro technology. The Internet at global scale also mostly relies on Ethernet based exchange points for the mass data transmission at the interconnection of ASes in the core. (About 130,000 terabytes of data are exchanged per month [8].) However, native Ethernet does not provide any QoS functionality and core interconnects are currently based on that standard Ethernet framing.

The “virtual LAN (VLAN)” extension to Ethernet added four additional octets to the frame structure for the purpose of grouping end devices into VLAN-ID groups. That is, an Ethernet node can now belong to one or more virtual LANs and communicate only with peers within the same virtual LAN(s). This VLAN-ID filtering in the relay nodes has been standardized in IEEE Std 802.1Q [98]. This tag field not only includes the respective VLAN-ID field, but also provides three priority bits.
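How the four tag octets carry both values can be shown with a minimal Python sketch. It assumes the usual 802.1Q layout (tag protocol identifier 0x8100 directly after the two MAC addresses, followed by 3 priority bits, 1 further bit that is ignored here, and the 12 bit VLAN-ID); the function name `parse_vlan_tag` is hypothetical.

```python
import struct

TPID_8021Q = 0x8100  # tag protocol identifier of a VLAN tagged frame


def parse_vlan_tag(frame: bytes):
    """Return (priority, vlan_id) of a tagged frame, or None if the frame is untagged."""
    if len(frame) < 16:
        raise ValueError("frame too short to carry a VLAN tag")
    # Octets 0..11 hold destination and source MAC address, the tag follows.
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_8021Q:
        return None
    priority = tci >> 13     # three user priority bits
    vlan_id = tci & 0x0FFF   # twelve VLAN-ID bits
    return priority, vlan_id


# Example: priority 5 in VLAN 100, followed by the original EtherType for IPv4.
tagged = bytes(12) + struct.pack("!HH", TPID_8021Q, (5 << 13) | 100) + b"\x08\x00"
print(parse_vlan_tag(tagged))
```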
The actual usage of those user priority bits has been defined by the IEEE P802.1p task group and is now published in IEEE Std 802.1D [97]. Fig. 53 depicts the resulting tagged Ethernet frame structure.

Fig. 53 IEEE 802.1p User Priority marking in 802.1Q (VLAN) tagged frames

VLAN enabled Ethernets are therefore able to classify frames into eight different classes and provide class-based forwarding in Ethernet interconnection devices, called “switches”. IEEE 802.1D gives detailed specifications on how marked frames are to be enqueued (strict priority queuing only) and forwarded depending on the available number of queues in the switching device. Since traffic type “2” is not used, seven so called traffic types are specified as listed in Table 5. Interestingly, user priorities “1” and “2” are treated with lower priority than the best effort (“0”) traffic type.

Table 5 Ethernet traffic types [97]

Table 6 shows the mapping and merging of traffic types according to the number of traffic separating queues.

Table 6 Mapping of traffic types to available queues [97]

Because of the widespread usage of Ethernet in modern businesses and its available traffic separation with strict priority queuing, IP QoS and Ethernet QoS are often combined to achieve maximum QoS support in company internal networks, so called “Intranets”. Chemnitz University for example applies the mapping between Ethernet priorities and IP DSCP markings as given in Table 7.

Table 7 Chemnitz University applied Ethernet-priority-to-DSCP mapping

A further development in QoS-based Ethernet deployment is currently worked on under the title “Carrier-grade Ethernet”. Several companies are pushing the standardization process with different proposals for frame encapsulation and operational twists. Without going into detail, Fig. 54, Fig. 55 and Fig. 56 show the resulting frame structures.

Fig. 54 VLAN Cross Connect / VLAN XC [123]

Fig. 55 Q-in-Q / stacked VLAN / Provider Bridges - IEEE 802.1ad [100]

Fig. 56 MAC-in-MAC / Provider Backbone Bridges (PBB) - IEEE 802.1ah [101]

The most prominent approach is called “PBB Traffic Engineering (PBB-TE)” or “Provider Backbone Transport (PBT)”, which however reuses the 802.1ah frame structure. No matter which approach will prevail, they all provide the user priority encoding once or even several times in their respective frame structure. Furthermore, the encapsulation scheme of Ethernet frames being nested in a second Ethernet frame (MAC-in-MAC approach) allows for tunnelled transport of customer frames. That is, not only standardized Ethernet-QoS forwarding is available, but also QoS-based tunnelling and QoS-based layer two “routing”. This plethora of choice is of particular interest for this thesis due to the importance of Ethernet for future Internet setups as well as the good match to the cross-layer mappings enabled by the new Inter-AS QoS concept (see chapter 7).

QoS-based Ethernet congestion control

The last paragraphs on Ethernet-QoS will touch on a further QoS-related approach currently under development. Ethernet defines a congestion control mechanism, which allows switching devices to stop the upstream neighbour’s sending by means of so called PAUSE control frames. Such a locally generated frame signals a time-limited pause request in times of congestion. Upstream neighbours can in turn be forced to pause their traffic sources, creating a chain of backpressure under high network load situations. However, sending selective PAUSE frames to upstream neighbours with differentiated pause times for the 7 traffic classes appears to be a simple and promising approach in order to confine the congestion effects in lower layers to low priority traffic types. Such “Priority-based Flow Control (PFC)” has already been addressed e.g. in [171] and [126]. IEEE started the work on PFC within the task group 802.1Qbb by the end of 2007.
It extends the original PAUSE based flow control as of IEEE 802.3x [102]. In principle, the send stop request will no longer lead to a full stop, but only to a sending stop for a signalled priority. Given that Ethernet traffic is sorted into eight lanes (priorities), Fig. 57 depicts the concept’s idea.

Fig. 57 Priority Flow Control [56]

The simulator OMNET has been extended to include priority-based PAUSE functionality. However, the research is not yet completed and will not be published in this thesis.

Generally, backpressure flow control schemes raise the hazard of congestion spreading if an (intermediate) source sends several flows to different output directions, of which only some add to a single output congestion. The resulting PAUSE backpressure, however, stops all flows without differentiation, which artificially spreads the congestion to practically un-congested forwarding paths as well.

Fig. 58 Congestion spreading [103]

The current 802.1Qbb approach therefore includes a congestion notification mechanism as described in 802.1Qau [104]. Due to this combined approach, the right source will eventually receive the congestion notice and slow down selectively. A simpler mechanism is being investigated, which sends the priority-based pause together with a destination MAC address. That is, the sending intermediate node stops only those prioritized frames which are destined to the MAC address that has been identified as causing the highest load on the congested output queue. Research results will be published outside of this thesis.

Generally speaking, the VLAN-based QoS support within Ethernet and the fast growing deployment of Ethernet technology in all networking areas underline the importance of a close cross-layer coordination of traffic separation and its marking with the ubiquitous IP networking and the proposed CoS concept.
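The difference between a full stop and a per-priority stop in this section can be made concrete with a small Python model of the sender side. This is only a conceptual sketch: the class `PriorityPauseState`, its methods and the abstract time unit are assumptions for illustration and do not reproduce the 802.1Qbb frame format or timer units.

```python
class PriorityPauseState:
    """Sender side bookkeeping of pause requests, kept separately per priority lane."""

    def __init__(self, priorities: int = 8):
        # Point in time until which each priority lane must stay silent.
        self.paused_until = [0.0] * priorities

    def on_pause(self, priority: int, duration: float, now: float) -> None:
        # A duration of 0 lifts an earlier pause for this priority.
        self.paused_until[priority] = now + duration

    def may_send(self, priority: int, now: float) -> bool:
        return now >= self.paused_until[priority]


state = PriorityPauseState()
state.on_pause(priority=1, duration=5.0, now=0.0)   # congested low priority lane
print(state.may_send(priority=1, now=2.0))          # lane 1 is held back
print(state.may_send(priority=6, now=2.0))          # other lanes keep sending
```

A classic PAUSE frame corresponds to pausing all lanes at once, which is exactly the undifferentiated behaviour that the priority-based scheme tries to avoid.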
4.3 MPLS QoS

“Multi-Protocol Label Switching (MPLS)” is the widely accepted tunnelling mechanism in IP networks, which is used to introduce some connection-oriented forwarding behaviour in the datagram network. Although MPLS is capable of encapsulating any networking protocol data units, it is currently exclusively used for IP packet forwarding. MPLS introduces in most cases an additional header structure, called “shim header”, which is effectively used to encapsulate the packet data behind it. The second important characteristic is a new 20 bit addressing scheme, called “label”, which has interconnection local significance. Labels are swapped between input and output labels during the MPLS forwarding procedure using a “label information base (LIB)”. The strict swapping operation together with a LIB establishment procedure creates a forwarding chain with fixed relaying nodes, the so called “Label Switched Path (LSP)”.

This concept of fixed length local addresses with chained relay is not new, but can be found e.g. in ATM and Frame Relay networks as well. Although MPLS can be used on top of all underlying packet transport capable networking technologies, separate mapping procedures exist for ATM and FR due to this addressing similarity.

The pre-established LSPs are excellent tunnels for traffic engineering approaches. MPLS is therefore widely used to perform traffic engineering (flow steering and fast restoration) besides the mandatory available shortest path first IP routing.

Fig. 59 depicts the shim header (label stack entry format [160]) and its usage between e.g. an Ethernet transport in layer two and IP payload in layer three. The figure also implies the label stacking option, which is explicitly shown in Fig. 60. Such a hierarchy of tunnels is a new and powerful option for scalable, fast restorable and transparent transport even of already tunnelled customer traffic.
This capability is of particular interest, since customer traffic is most elegantly transported in tunnels and might be forwarded in nested tunnels in carrier’s carrier scenarios. The latter implies inter-carrier (which normally means inter-AS) MPLS LSP scenarios, which are not commonly used nowadays, but are envisioned for the near future.
In terms of QoS support, the shim header “Traffic Class (TC)” field is the obvious class encoding. It is a three bit field, which allows for eight classes of tunnelled traffic. Historically, this field was named “EXP”, since its bits were reserved for experimental use only. However, RFC 3270 [75] explicitly targets the MPLS support for Differentiated Services and introduces the so called “EXP-Inferred-PSC LSPs (E-LSP)”. The eight available behaviour aggregates (BAs) resemble DiffServ behaviour aggregates with associated per hop treatments. However, the RFC deliberately delegates the mapping between DSCP encoding and E-LSP encoding to some unspecified signalling or preconfigured setup.
Fig. 59 MPLS shim header structure and hierarchy usage
Fig. 60 MPLS Label stack structure
Because the EXP bits are dedicated to QoS, the field has only recently been renamed to “TC”. RFC 5462 [9] updates the respective DiffServ related MPLS RFCs, but keeps the “E-LSP” abbreviation.
The second mapping option for IP traffic classes onto MPLS tunnels is given by the so called “Label-Only-Inferred-PSC LSPs (L-LSP)”. Here, each traffic class is associated with its own tunnel, represented by a different entry label. This extends the limit of eight classes to, theoretically, a million supportable classes. This L-LSP QoS support picks up the class separation that is generally available in all “tunnel or virtual channel based” transport technologies, where each class is mapped into a separate encapsulation, which might also involve separate forwarding paths per class.
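The shim header fields discussed above can be made concrete with a minimal sketch of a single label stack entry. The 20 bit label and the three bit TC field are taken from the text; the bottom-of-stack bit and the 8 bit TTL complete the 32 bit entry as defined in the MPLS label stack encoding (RFC 3032). The function names and the table `DSCP_TO_TC` are hypothetical; the table merely stands in for the preconfigured DSCP to E-LSP mapping that RFC 3270 leaves unspecified.

```python
import struct


def pack_entry(label, tc, bottom, ttl):
    """Pack one 32 bit label stack entry: label(20) | TC(3) | S(1) | TTL(8)."""
    if not (0 <= label < 2**20 and 0 <= tc < 8 and 0 <= ttl < 256):
        raise ValueError("field out of range")
    word = (label << 12) | (tc << 9) | (int(bottom) << 8) | ttl
    return struct.pack("!I", word)  # network byte order


def unpack_entry(data):
    """Decode a 4 byte label stack entry into its fields."""
    (word,) = struct.unpack("!I", data)
    return {
        "label": word >> 12,
        "tc": (word >> 9) & 0x7,
        "bottom": bool((word >> 8) & 0x1),
        "ttl": word & 0xFF,
    }


# Hypothetical, preconfigured E-LSP mapping from IP DSCP to MPLS TC.
# The values are illustrative assumptions, not taken from any standard.
DSCP_TO_TC = {46: 5, 26: 3, 0: 0}


def tc_for_dscp(dscp):
    """Return the TC value for a DSCP; unknown DSCPs fall back to class 0."""
    return DSCP_TO_TC.get(dscp, 0)
```

The three bit TC field limits an E-LSP to eight classes, while the 20 bit label space (2^20 = 1,048,576 values) is what gives L-LSPs their theoretical million of classes.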
The EXP/TC-inferred QoS forwarding behaviour, combined with Label-inferred QoS tunnelling and with LSP hierarchies, enables numerous quality of service support options. This plethora of choice is of particular interest for this thesis due to the importance of MPLS for future Internet setups as well as the good match to the cross-layer mappings enabled by the new Inter-AS QoS concept (see chapter 7).
A crucial component of the MPLS QoS support is the control part for path setup (label distribution) and TC marking distribution. The MPLS working group developed two major signalling branches for QoS-related traffic engineering. One branch extended a specifically developed “Label Distribution Protocol (LDP)” [10] into RFC 3212 “Constraint-based Routed LDP (CR-LDP)” [122]. The second branch reused the existing IntServ signalling protocol, RSVP, and specified traffic engineering extensions to it. The resulting “Resource Reservation Protocol-Traffic Engineering (RSVP-TE)” [13] defines several protocol objects, the most important use of which is to convey the label information. For this thesis, it is important to mention that RSVP-TE does not directly signal EXP/TC markings during the LSP setup procedure. A more generalized “colour” concept is used, which is implemented as resource class attribute [14]. This new attribute becomes part of a path description and can for instance be used for constraint-based routing. In 2003, the MPLS working group decided to favour RSVP-TE for MPLS LSP path setup and to “refrain from entertaining work that intends to progress RFC 3212 or related RFCs beyond proposed standard” [11].
Two major improvements to MPLS have been tackled during the past few years. Inter-Domain MPLS and GMPLS Traffic Engineering are of importance to the thesis’ class of service concept.
Inter-Domain MPLS
RFC 5151 [73] is the RSVP-TE extension to a defined Inter-domain MPLS TE framework [74].
Taking the operators’ independence in terms of MPLS support and configuration into account, three methods of LSP establishment have been identified as depicted in Fig. 61. Contiguous LSPs are set up following the procedures from RFC 3209 [13] and RFC 3473 [26]. Nested LSPs follow RFC 4206 [128] and stitched LSPs are described in RFC 5150 [15]. Interestingly, those setup procedures do not specify how inter-domain EXP/TC encoding information is to be exchanged. Instead, coloured routing is again advocated.
Fig. 61 MPLS LSP signalling: contiguous, nested, stitched
Generalized Multi-Protocol Label Switching (GMPLS)
GMPLS is the logical abstraction of the MPLS approach towards physical representations of “labels”. RFC 3471 [25] describes the functional signalling part of the extended control plane. Targeted physical label switching representations are: time-division (e.g., Synchronous Optical Network and Synchronous Digital Hierarchy, SONET/SDH), wavelength (optical lambdas) and spatial switching (e.g., incoming port or fibre to outgoing port or fibre), see Fig. 62.
Fig. 62 GMPLS label representations
Although such generalized labels cannot be marked with traffic class bits, the virtual channel concept, which those time division, lambda or spatial encapsulations represent, is well capable of QoS based tunnelling support, even more so with the decoupled and deeply hierarchical generalized encapsulation concept (Fig. 63).
Fig. 63 GMPLS LSP hierarchy
The class of service concept of this thesis therefore encompasses virtual channel based traffic class separations as layer one QoS approach and deliberately includes GMPLS based tunnelling.
4.4 QoS in access networks
Discussing wired and wireless access networks in a thesis about BGP signalled class of service support is justified by the increasing use of intra-AS QoS support up to the provider-customer edge of a network.
Fast increasing access rates on a multi-megabit scale and the fast growing acceptance of application services with heavy data load by the Internet community cause not only large volumes of traffic, but also a sensitive mixture of loss and time critical data streams in that volume. Standardized access technologies all provide means for QoS-based traffic separation and Europe seems to take a leading role in actually applying them on customer lines and channels. The following section briefly lists some typical QoS class sets, which are often even standardized with fixed parameter limits for loss, delay and delay variation.
Wireless access technologies – UMTS
The currently widely available wireless access technology “Universal Mobile Telecommunications System (UMTS)” is a third generation (3G) mobile telecommunications technology. It offers mobile data transfer services, which enable mobile devices to participate in IP communication networks. The UMTS standards (e.g. 3GPP TS 23.107 [1]) document a detailed quality of service concept and architecture, which is based on four so called UMTS bearer services. Table 8 lists the resulting QoS classes and their respective characteristics and applications. The conversational, streaming, interactive and background classes have detailed specifications for requirement and service parameters (see Table 9).
Table 8 UMTS QoS classes [1]
Table 9 UMTS Bearer Service Attributes [1]
Customer traffic entering through prioritised UMTS bearers will certainly be prioritized in the provider’s DiffServ capable backhaul network and should be passed on, e.g. in a class set of four, across AS boundaries towards the destination network.
Wireless access technologies – LTE
The highest demand for QoS concepts in the core of the Internet builds up if the speed in the core increases more slowly than the speed in the access.
This is particularly true for the 3GPP “Long Term Evolution (LTE)” standard, which is currently being developed as a set of enhancements to UMTS. The official target for achievable customer data rates is 100 Mbps downlink and 50 Mbps uplink speed [3]. This rapid speed increase will become even more demanding for the operator’s core network once LTE-Advanced is specified, which promises access rates of up to 1 Gbps for a low mobility usage case [4]. LTE addresses the different needs of such potentially high speed interactive services with the definition of nine QoS classes as shown in Table 10. The repetition of classes is due to the distinction between traffic of premium and non-privileged subscribers.
Table 10 LTE QoS class attributes [2]
Wireless access technologies – WiMAX
The “Worldwide Interoperability for Microwave Access (WiMAX)” wireless telecommunication technology has been standardized under IEEE 802.16 [105] and is regarded as an alternative to wired broadband access solutions. WiMAX defines several QoS setups (provisioned, admitted, active), which are unidirectional, flow-based and either dynamically signalled or statically configured. Generally, the supported so called “service flows” are signalled with a detailed set of QoS parameters associated with each active service flow. Large scale WiMAX deployments, however, cannot guarantee that the same service flow setups are consistently available throughout the network at all times. The solution is the so called “global service classes”. A vital part of the WiMAX QoS concept is the naming of service flows, which are then associated with parameter sets. That is, global service classes should carry standardized names, which in this case are standardized by naming rules. The resulting names can therefore be parsed and reveal pointers into standardized tables of fixed options or parameter settings. In a way, those parameter tables and their combined referencing constitute the provided class sets.
This spans a large class set range, which is in no way comparable to the aforementioned simple IP or Ethernet class sets. Simple flow prioritization is not provided in WiMAX, with one exception: when two flows have identical parameter sets but different priorities, a so called “traffic priority parameter” decides upon the precedence. The interworking of WiMAX and its detailed QoS approach with simple IP or Ethernet classes of service is not trivial. WiMAX, however, provides a means to associate e.g. Ethernet priorities with WiMAX service flows by means of classifier rules. Among others, the Ethernet user priority or the IP DSCP value can be added as classification criteria to a specific flow.
Wired access technologies – DSL / ATM
“Digital Subscriber Line (DSL)” is a family of wired access technologies, which all use the traditional two-wire access lines for higher speed packetized data access services. The most prominent family members are: “Asymmetric Digital Subscriber Line (ADSL)” [112][113], “ADSL2” [114][115], “Very-high-bitrate DSL (VDSL)” [116] and “VDSL2” [117]. The digital subscriber line standards do not address traffic separation and prioritization, but provide bit pipes for higher layer protocols. DSL uses Asynchronous Transfer Mode (ATM) as framing and networking technology for packetized data transport services. The only natively available prioritization is found within the “High-Level Data Link Control (HDLC)” encapsulated transport of overhead messages.
In order to understand the traffic separation in ATM-based DSL access, Fig. 64 depicts the structure of the fixed length ATM packets, called “cells”. For cell prioritization, there is only a single header bit available, which indicates a higher cell loss priority if set. It is a sort of punishment marking for cells and cannot be regarded as traffic class marking as seen with MPLS TC, IP DSCP or Ethernet priority.
However, ATM cells carry local scope addresses (VPI/VCI), which are swapped at the forwarding switches and need to be set up during a connection setup phase. That is, the established chain of VPI/VCI forwarding tables again creates a tunnelling scheme for the ATM cell payload by means of the resulting virtual channels. Several virtual channels can be established across a DSL access line and might be used for traffic class separation. Especially in Europe, it is common for DSL providers to configure e.g. 4 or 6 ATM VCs onto DSL lines for QoS purposes.
Fig. 64 ATM cell structure
Traditionally, the ATM transport included not only the actual access line transport, but also the aggregation network at the provider’s edge. This trend has changed towards an Ethernet-based aggregation network as specified in TR-101 [68]. Furthermore, TR-101 also includes interface options, where direct usage of Ethernet instead of ATM on the DSL line is recommended. Together with this Ethernet transition, the Ethernet priority support for QoS has explicitly been standardized. It is included as a “MAY” option. Given the above configurations and trends, separated traffic of four to eight classes is expected within DSL access networks.
Wired access technologies – GPON
Due to the continuing demand for higher access rates on the one hand and the high transmission capacity of fibre based technologies on the other hand, network operators push fibre as close to the customer as possible. This will include aggregation points in street cabinets or even fibre in customers’ houses. The technology predominantly used for this purpose today is “Gigabit-capable passive optical networks (GPON)” [111]. Within GPON, four service-bearing transmission containers are provided, which represent four differentiated classes of service. GPON specifically offers an Ethernet data service (see Fig. 65), which includes TR-101 [68] functionality.
Furthermore, GPON is often combined with VDSL2, which again can lead to Ethernet transport with optional user priority encoding. Due to the underlying four bearers, this Ethernet transport can additionally be spread across four parallel Ethernet transport channels. Separated traffic of four to eight classes is expected within GPON based access networks.
Fig. 65 Functional layering structure for the Ethernet data service [111]
4.5 Summary of expected Class of Service support
Given the above listed variants of QoS support in different networking technologies, the following table summarizes the expected number of separated traffic classes.
Table 11 Overview of available layer 2 and 3 quality of service classes
¹ ATM has not been extensively described in this thesis, due to its declining usage. However, it needs to be mentioned that ATM long ago offered one of the most detailed and most thoroughly researched QoS concepts, with parameter negotiation, detailed measurement, traffic conditioning, admission control and management functions. Many recent developments, especially in the field of MPLS, seem to have adopted and learned from ATM. The mentioned QoS categories relate to classes of service (CBR, VBR, ABR and UBR as defined by the ATM-Forum).
5 State of the art AS interconnection
The Internet is a patchwork of interconnected autonomous systems, which exchange IP traffic. By means of an Exterior Gateway Routing protocol, BGP version 4, each AS announces the IP networks, represented by IP prefixes, which are reachable through that AS. Interconnected autonomous systems establish so called BGP peering sessions for this reachability information exchange. The “Network Layer Reachability Information (NLRI)” is advertised in BGP UPDATE messages together with associated path attributes of the announced routes.
Each AS in turn processes the route advertisements and determines which routes it installs in its own routing table as well as which routes it relays to other external peers. Those policy decisions are taken based on the interconnection topology of that AS, as well as the operator’s policy rules, which process the BGP path attributes.
Fig. 66 depicts the basic options for physical AS interconnection, being direct point-to-point links and Ethernet based traffic hubs, the Internet Exchange Points (IXPs).
Fig. 66 AS interconnection options
Whether an AS decides to establish a direct interconnection or an IXP interconnection depends mainly on the link cost as well as on the business gain through interconnections at central exchange points.
The interconnection link technology is increasingly based on Ethernet. It is mandatory for IXP access due to their Ethernet switch based platform, but has also become popular for direct lines. An AS is usually solely responsible for the link towards its IXP switch port or towards the interconnected AS. In the latter case, either the two parties share the cost or the smaller one takes the lead in order to get connected to the bigger one. This can either be fixed line cost or part of the traffic cost in non-zero settlements. The link can be owned or rented. The link speed is normally in the range of 100 Mbps up to 10 Gbps (sometimes bundles of n x 10 Gbps) and the link is therefore mostly fibre based. However, dark fibres used with own transmission equipment are more common in the direct interconnection case. Otherwise, rented virtual channels are used by the vast majority. If the interconnected parties are not co-located in the same building, where direct Ethernet interconnect is possible, Ethernet over carrier (wavelength, MPLS, SDH) seems to be the most often used linking technology.
Whether or not a point-to-point interconnection is chosen depends on several factors.
Driving forces are geographic location, customer base and customer traffic destinations as well as zero or non-zero settlement based interconnection.
Advantages of point-to-point interconnections (private interconnection):
- One or only a few peering ASes are geographically close and terminate the majority of traffic loads of the own customer base, so it is worthwhile to rent or even build the direct link,
- Link speed requirements to only a few peering ASes exceed the normal IXP interconnection speed,
- Private mutual agreements (e.g. QoS requirements, MTU sizes, and other technical twists) should not be made public and cannot be provided by public platforms,
- High security and confidentiality requirements call for a private interconnection,
- Point-to-point interconnection is set up to important partner ASes for backup purposes and
- Accounting for paid traffic exchange is easy with the link interface counters.
Advantages of IXP based interconnections (public interconnection):
- The high link (rent) cost is prohibitive for setting up many single link interconnections to other ASes for large scale connectivity, if a single link to a major exchange point does the job (e.g. single line from Africa to a large European exchange point),
- The customers’ traffic demand is distributed and most of the ASes terminating high loads are present at the chosen IXP,
- The reception of several paths towards the same network increases routing robustness, if one interconnected AS fails to reach that destination,
- Interconnection at IXPs is normally a zero settlement,
- Direct interconnection across a switch towards many other networks reduces latency, which would otherwise build up with up and down transitions through higher tier ASes,
- Accounting based on source MAC address filtering is possible and
- Route servers at IXPs reduce the number of BGP sessions to a single one.
The “Route Server (RS)” argument needs to be briefly explained.
Since IXPs allow for the interconnection of several hundred ASes at one place, the BGP session load on the single AS border router at the interconnection is enormous. However, it can be cut down to just one BGP session, if the IXP offers a central route server for its customers. Such a route server acts similarly to a route reflector, but is peered with via eBGP, with all the resulting inbound and outbound filter policies in the peering ASes. Route announcements and receptions are therefore exchanged with the central route server only. The actual traffic exchange, however, takes place directly between the interconnected ASes.
In numbers, many ASes are interconnected via public exchanges, but most of the Internet traffic load is carried over point-to-point interconnections. Large companies such as Google and Amazon are present at many Internet exchanges for close user proximity. Transit providers with global networks are also commonly present at IXPs and use the platform as customer aggregation points.
Quality of service support is currently almost exclusively provided through private interconnections. Public exchanges, however, are starting to support Ethernet based traffic separation [127]. As an example, Fig. 67 depicts the internal topology of the German Internet exchange DE-CIX, which is distributed across four locations in Frankfurt and has proven to support VLAN user priority marking upon request.
Fig. 67 DE-CIX topology 2009 [61]
The native interconnection between two ASes does not reveal the associated payment structure (settlement) involved in the interconnection. Payment based wordings are:
• “transit” / “non-zero settlement” for paid traffic exchange (see 5.1) and
• “peering” / “zero settlement” for free of charge traffic exchange (see 5.2).
Both words “transit” and “peering” are, however, also used in purely technical terms.
That is, “transit traffic” / “transit network” could simply describe the forwarding architecture across a different network towards the destination, regardless of any payments. The same applies to the word “peering”, which can be used to describe any interconnection between adjacent networks (peers). Moreover, the BGP protocol always speaks about peers and peering sessions no matter what settlement is associated with that interconnection.
The level of interconnection of a single AS determines its rank in a global AS hierarchy, the so called “tier structure”. Lower level tiers pay for their traffic exchange with higher level tiers (“they buy transit”), whereas same level tiers might “peer” with each other free of charge for mutual benefit.
In theory, the highest tier group, “tier 1”, applies to ASes which only interconnect via free of charge peerings with other tier 1 ASes and thereby gain full connectivity to all globally available Internet destinations. Due to their fully meshed interconnection, the routing table of those AS border routers does not contain a default route entry, but holds specific routing entries to all routable prefixes. Routers with such a full routing table make up the so called “Default Free Zone (DFZ)”. At the lower end, “tier 3” ASes exchange all their external traffic through transit interconnections with higher tiers. The vast majority of ASes are “tier 2” ASes, which hold some free of charge peerings to other tier 2 ASes and buy transit for the remaining global connectivity. Deutsche Telekom is such a tier 2 example, which runs a rather widely interconnected network with just one upstream transit (AS1239). Fig. 68 depicts the described Internet hierarchy.
Fig. 68 Internet hierarchy
5.1 IP transit
The easiest way for an AS to achieve global connectivity for its customer base is to buy transit from the global players, either from a largely interconnected tier 2 AS or from one of the roughly 10 core ASes of type tier 1.
The selling party is the provider and the buying party the customer. There are many factors which influence this interconnection decision, cost and reliability being the major drivers. Non-zero settlements are not published and the two parties normally sign non-disclosure agreements about the interconnection specifics and rates. Two major settlement philosophies are common: classical IP interconnection with volume based charges and upcoming voice interconnections with time based charges. The latter is a new development, which arises from the current transition from time based circuit-switched voice interconnections towards “Voice over IP (VoIP)” type voice interconnections. It is assumed to be currently common to have two separate interconnections for the Internet and the voice services, but the drive towards a purely IP interconnected service platform might lead to a volume based voice interconnect as well.
The easiest volume interconnection charging model is the “difference in volume” charge. Incoming traffic volume is subtracted from the outgoing traffic volume at the interface of the upper tier. Since large traffic volumes will stream down from the upper peer to the lower one, the lower gets the positive sign and pays for the difference. This in turn reveals that traffic from the lower tier towards the higher tier limits the higher tier’s revenue. A second charging model is a base and offset model, which includes a monthly fixed transit cost amount for a “Committed Information Rate (CIR)” and a volume based excess charge for unexpected excess traffic.
In densely interconnected regions, such as the US, Europe, Japan etc., a plethora of interconnection choices exists, which enables transit competition due to the cheaply available interconnection links. The trade-off between higher transit cost and alternative interconnection link expenses makes the Internet more expensive in remote or underdeveloped regions of the world.
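The two volume based charging models described above can be written down as a minimal sketch. All function names, parameter names and prices are hypothetical; in particular, how the excess volume above the CIR is measured is not specified in the text and is simply passed in as an input here.

```python
def difference_in_volume_charge(out_volume, in_volume, price_per_unit):
    """'Difference in volume' model, measured at the upper tier's interface.

    out_volume: traffic sent by the upper tier down to the customer
    in_volume:  traffic received by the upper tier from the customer
    A positive result is what the lower tier pays.
    """
    return (out_volume - in_volume) * price_per_unit


def base_and_offset_charge(base_fee, excess_volume, excess_price_per_unit):
    """Base and offset model: fixed monthly fee for the CIR plus a volume
    based charge for traffic in excess of it (excess measurement assumed)."""
    return base_fee + max(0, excess_volume) * excess_price_per_unit
```

For example, with 900 volume units flowing down and 300 flowing up at a hypothetical price of 2 per unit, the customer pays 1200. Every unit the customer sends upstream lowers that amount, which is the revenue effect noted in the text.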
Operators tend to establish several transit interconnections for backup and strategic reasons. Contracts are constantly renegotiated, which leads to an ever changing AS interconnection topology. Separate companies have specialized in data mining the AS tree in order to provide consultancy to transit selling parties, monitor the interconnection changes for debugging or trend analysis, document the topology for fault tracking and usage statistics etc. Even some vague conclusions, which estimate customer churn and undisclosed price movements, can be drawn from the observed connectivity changes and interconnection path announcements.
The key to the selection of possibly multiple transit paths as well as the basis for the mentioned analysis work is the policy based routing protocol BGP. Route attributes are used for mutual or single sided path selection by filtering, e.g. a cheap unreliable transit for the normal traffic load and a high priced transit for some important customers or services. This is often prefix oriented and used in a multi-homing scenario for transit cost optimization. BGP as a path vector protocol creates a record of crossed ASes in the route advertisement process, which provides partial knowledge about the AS graph structure.
The introduction of the currently missing quality of service support on transit interconnections raises several issues, which need to be observed and addressed in future analysis. Expected issues are:
- transit partner selection based on available QoS support,
- transit partner selection and/or negotiation about QoS class granularity, marking/remarking and traffic handling (shaping, scheduling, dropping etc.),
- possible QoS-related charging models instead of the single class model today and
- extended reliability discussion on whether the QoS support is only interconnection local or extends across further interconnections.
The thesis’ class of service concept provides a simple marking and rate limitation mechanism, which addresses some basic QoS improvements in a transitive inter-domain manner.
5.2 IP peering
The interconnection of ASes for mutual benefit at no cost (zero settlement) is called “peering” and happens largely at Internet exchange points. Partnering ASes agree to set up interconnection links with BGP peering sessions to exchange reachability information. If the customer base of each party often exchanges traffic with the customer base of the other party, this short-cut interconnection saves both sides transit costs via higher level tiers as well as transfer delay time due to the shortened forwarding path.
Peering requests can be raised informally and the resulting agreements can be rather loose. That is, no complex service level agreement is mandatory and as long as both partners are content, even a handshake can be sufficient. Emails with contact details are exchanged and the IP addresses and AS numbers of the peering equipment are sufficient knowledge. Such a simple setup leaves the questions about interconnection quality, fault handling, service reliability etc. open. If either party becomes unhappy with the way of operation, “depeering” can occur.
Generally, peering requests will be refused in the following cases:
• requests from own customers,
• requests from potential customers,
• requests from other peer’s customers,
• requests from providers with bad track records,
• requests from providers with low infrastructure investment policy and
• requests from providers, where the mutual benefit is questioned.
The peering / depeering business is a playground for multi-dimensional optimization strategies and resulting business models. Social networking and market influence by route manipulations have a strong impact on the request and grant procedure.
In the transit case, attracting customers of competing transit providers by financial or technical incentives to use the own transit path is common practice. However, attracting customers of competing transit providers to use the own peering agreement is considered bad practice.
The peering policies of network providers are often very short and publicly available. They are mainly published to attract potential partners. Existing providers and new players use the information to optimize their peering relationships and to find the right geographical and technical base for the interconnection. Examples of publicly available peering policies are [162] and [174]. The most comprehensive resource for publicly available provider information, such as their geographical presence at exchange points, their public policy, their interconnection types and speeds etc., is the so called “PeeringDB” [151].
An interesting case study is the “aggressive, user-driven rollout” peering strategy of the company Google [58]. In order to cut down delay times for good user experience of Google services, the company welcomes direct peering interconnections with customer networks at many places around the world. Furthermore, the company is a leading force in the ongoing IPv6 transition and mitigates the currently missing global IPv6 transit availability by means of direct IPv6 peering.
Google has also proposed a scalable “signup” procedure, which is of particular interest for this thesis’ class of service support signalling approach. Google is running a large IPv6 connectivity test with so called “trusted tester” peering partners. However, the company cannot possibly sign testing agreements with each interested peering AS, considering that possibly 55000 ASes are assigned and about 30000 actively participate in the Internet [93].
Therefore, a proposal was made for sending a BGP community value of “15169:6666” in the advertisement of a specific IPv4 prefix to “sign up” for the trusted tester program [58].
5.3 Internet Routing Registry - IRR
The “Internet Routing Registry (IRR)” is a database system to store and globally access routing policy information in a structured, human-readable and automatically usable way. 32 IRRs are currently listed [137], which differ in administration and database quality (performance, consistency and availability of database tools). In theory, the distributed registry information of the different regions in the world would constitute one consistent information base about all AS routing policies. In practice, operators mainly rely on the information stored in the RIPE database, which is known to be the most advanced and consistently administered database.
The representation of stored IP routing policies has been standardized in RIPE document “ripe-81” [159], which has been republished as RFC 1786 [21]. Many registries make use of the “Routing Policy Specification Language (RPSL)” [6] or “Routing Policy Specification Language next generation (RPSLng)” [35]. The biggest advantage of this specification language is its usage for automated router configuration generation.
Network operators specify their AS number and assigned prefixes as well as their import and export policies in the registry. Direct interconnection partners can derive their filter rules from that structured information and ASes further away from the originating AS can verify the incoming route information for plausibility and sanity. If all operators relied on this registry mechanism, malicious route advertisements and the spreading of such fake information could largely be filtered out. As the “Youtube outage” [41] in February 2008 showed, however, many operators are still not making use of this valuable Internet routing registry in their update sanity checks.
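The plausibility check of incoming routes against registry data can be sketched as follows. This is only an illustration under stated assumptions: the dictionary `REGISTERED_ROUTES` is a hypothetical stand-in for registry content (real IRR data is expressed in RPSL and queried from the registry), and the prefixes and AS numbers are taken from the ranges reserved for documentation.

```python
import ipaddress

# Hypothetical registry extract: registered prefix -> registered origin AS.
REGISTERED_ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): 64500,
    ipaddress.ip_network("203.0.113.0/24"): 64501,
}


def is_plausible(prefix, origin_as, registry=REGISTERED_ROUTES):
    """Accept an announcement only if a registered route covers the prefix
    and names the same origin AS; everything else is treated as suspicious."""
    net = ipaddress.ip_network(prefix)
    return any(
        net.version == registered.version
        and net.subnet_of(registered)
        and origin_as == registered_as
        for registered, registered_as in registry.items()
    )
```

In this sketch, a more specific prefix announced by an AS other than the registered origin is rejected, which is the kind of update sanity check the text argues would largely filter out fake route advertisements.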
6 Related work

A number of QoS improvement approaches have been proposed before, but none has been standardized and actually used for QoS support in the public inter-domain case. Private inter-domain QoS setups do exist, but are not made public. In such cases, the QoS configurations and parameter settings are agreed on offline and documented in service level agreements.

Three major characteristics have been identified in the past QoS improvement approaches:
1. Quality of Service is targeted end-to-end and includes the inter-domain interconnections for the case of sending and receiving parties being in separate ASes,
2. Quality of Service is targeted in a guaranteed quality fashion, which requires detailed parameter signalling, QoS enforcement functions, QoS parameter measurements, violation detection and fining and
3. Quality of Service is targeted in a homogeneous fashion, that is, all participating ASes need to support the same QoS setup. This includes the common signalling protocol and a common setup of class sets with the respective classification, scheduling and dropping functionality.

Some important examples of those past QoS improvement approaches are addressed below.

France Telecom and Alcatel submitted an Internet draft, draft-jacquenet-bgp-qos-00 [59], in 2004, which introduced the so-called “QOS_NLRI” attribute in BGP. It is used for propagating QoS-related information associated with the NLRI (Network Layer Reachability Information) conveyed in a BGP UPDATE message. Individual so-called "QoS routes" are signalled, which fulfil certain QoS requirements. Several information types are defined for the attribute, which concentrate on rate and delay type parameters. This approach therefore addresses QoS guarantees for selected end-to-end routes.
QoS parameters, such as packet rates, loss rates, one-way delays and inter-packet delay variation, are signalled in absolute numbers and might need to be re-signalled if end-to-end requirements or network load conditions require adaptation. Parameter signalling, however, introduces two major drawbacks in global scale operation. The first is the resource accounting, which registers used and available capacity shares together with triggered signalling and admission control. This effort is justifiable for just some QoS routes, but is unmanageable if most of the routes have associated parameter sets. The second drawback is protocol stability. BGP has been designed to trade off routing dynamics for routing stability. Frequent parameter signalling is therefore counterproductive.

In support of this argument, the BGP route flap dampening behaviour (RFC 2439 [175]) is briefly described. If enabled, BGP will suppress frequent route advertisements based on a penalty scheme with hysteresis thresholds. In this dampening concept, each route flap (withdraw/announcement pair) accounts for a penalty of 1000 and each attribute change for a penalty of 500. Penalties accumulate with each advertisement event and decay by 50% within a configurable time period (the half-life). A penalty count above the suppress limit prevents the advertisement relay of a given prefix and its attributes. Once the penalty count has decayed below the reuse limit, the router returns to normal advertisement operation for that prefix. Fig. 69 depicts the dampening characteristic.

Fig. 69 Route Flap Dampening [167]

The work on this draft is embedded within a European Union funded project called “Management of End-to-end Quality of Service Across the Internet at Large (MESCAL)” [139]. This extensive project work on inter-domain QoS support goes far beyond the limited class of service approach of this thesis.
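The route flap dampening penalty scheme described above can be made concrete with a small simulation. The following is a minimal sketch, not an implementation of RFC 2439: the penalties of 1000 and 500 are taken from the text, whereas the half-life of 15 minutes, the suppress limit of 2000 and the reuse limit of 750 are assumed illustrative values, and all class and method names are hypothetical.

```python
FLAP_PENALTY = 1000         # withdraw/announcement pair (from the text)
ATTR_CHANGE_PENALTY = 500   # attribute change (from the text)

class DampenedRoute:
    """Penalty bookkeeping for one prefix with hysteresis thresholds."""

    def __init__(self, half_life_s=900.0, suppress_limit=2000.0, reuse_limit=750.0):
        # All three defaults are assumed example values.
        self.half_life_s = half_life_s
        self.suppress_limit = suppress_limit
        self.reuse_limit = reuse_limit
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Exponential decay: the penalty halves once per half-life.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update = now

    def _update_state(self):
        # Hysteresis: suppress above the upper limit, reuse below the lower one.
        if self.penalty > self.suppress_limit:
            self.suppressed = True
        elif self.penalty < self.reuse_limit:
            self.suppressed = False

    def record(self, now, penalty):
        """Account for one advertisement event at time `now` (seconds)."""
        self._decay(now)
        self.penalty += penalty
        self._update_state()

    def is_suppressed(self, now):
        self._decay(now)
        self._update_state()
        return self.suppressed

route = DampenedRoute()
for t in (0.0, 10.0, 20.0):          # three flaps within 20 seconds
    route.record(t, FLAP_PENALTY)
print(route.is_suppressed(20.0))     # suppressed after the third flap
print(route.is_suppressed(1820.0))   # reusable again two half-lives later
```

With these assumed limits, three flaps in quick succession silence a prefix for roughly half an hour. This stability mechanism is the reason why frequent parameter re-signalling, as required by the guarantee-oriented approach of the MESCAL project introduced above, fits BGP poorly.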
The MESCAL project is based on a so-called “cascaded signalling approach”, which assumes AS-internal DiffServ QoS class support, referred to as “local-QoS-class (l-QC)”, and inter-AS QoS class support with resulting “extended-QoS-classes (e-QC)”. Fig. 70 depicts the cascaded approach with the l-QC and e-QC after interconnection. The MESCAL project targets end-to-end QoS guarantees and therefore signals parameters between ASes. Such parameters are part of the l-QC QoS specification, and the e-QCs are constructed recursively from them following parameter-specific combination rules.

Fig. 70 MESCAL - Cascaded Approach [139]

As mentioned before, such QoS guarantees with parameter signalling are out of scope for the class of service concept of this thesis. However, the MESCAL project also includes an option, called the “loose guarantees solution”, that renounces end-to-end QoS guarantees. It nevertheless still performs mutual negotiations on performance parameters and bandwidth requirements and requires either globally understood class indicators (e.g. DSCPs) or SLA-based local indication agreements. Dynamic signalling of (re-)marking information and marking preservation is not provided.

France Telecom started a second Internet draft, draft-boucadair-qos-bgp-spec-01 [37], in 2005. It is based on the specified QOS_NLRI attribute and introduces some modifications to it. The notion of AS-local and extended QoS classes is used, which effectively describes the local set of QoS performance parameters or their cross-domain combined result. Two groups of QoS delivery services are distinguished, where the second group concentrates on ID-associated QoS parameter propagation between adjacent peers. The first group is of more interest for the work of this thesis, since it concentrates on "identifier propagation", for example of the DSCP value.
However, this signalling is specified for the information exchange between adjacent peers only and assumes the existence of extended QoS classes and offline traffic engineering functions. The limitations of the inherited QOS_NLRI attribute remain. Christian Jacquenet and Mohamed Boucadair of France Telecom contributed substantially to the mentioned work and explain provisioning techniques for the currently topical IP/MPLS networking case (see [120]). However, the exchange of class identification (marking) is also not addressed there.

Another approach has been raised by a group of researchers at Johns Hopkins University. It is described in [24]. The Internet draft associates a list of QoS metrics with each prefix by extending the existing AS_PATH attribute format. Hop-by-hop metric accumulation is performed as the AS_PATH gets extended in relaying ASes. Metrics are generically specified as a list of TLV-style attribute elements. Metrics such as bandwidth and delay are mentioned as examples in the draft.

One contribution specialized in the signalling of Type Of Service (TOS) values, which are in turn directly mapped to DSCP values in section 3.2 of the draft [185]. The TOS value is signalled within an Extended Community Attribute and, if understood correctly, is applied to a certain route. An additional value field is used to identify which routes belong to which signalled TOS community. Who advertises such attributes and whether they are of transitive or non-transitive type remains unspecified. Advertising multiple paths (and associated metrics) for one prefix is addressed and a new path selection algorithm has been proposed. The concept would therefore support packet classification and class-based route selection. The draft expired in December 2006.

The most comprehensive analysis (although not an IETF draft) is given in [7].
This "Interprovider Quality of Service" white paper examines the inter-domain QoS requirements and derives a comprehensive approach for the introduction of at least one QoS class with guaranteed delay parameters. The implementation aspects of metering, monitoring, parameter feedback and impairment allocations are all considered in the white paper. However, QoS guarantees and frequent parameter signalling have been identified as critical characteristics for the inter-domain global scale routing system and the BGP protocol stability. It is valuable work for the fine-grained QoS setup for an arguably large number of selected end-to-end routes. A general applicability of the concept for nearly all Internet routes is not feasible.

A more economically inclined approach has been published at the IEEE ICC 2002 conference under the title “Enabling dynamic market-managed QoS interconnection in the next generation internet by a modified BGP mechanism” [94]. It relates to this thesis for two reasons: firstly, the intention of QoS support at inter-domain interconnections and secondly, the usage of BGP for signalling. It is again based on the QOS_NLRI as described above, but includes price information as well. Although the economic characteristics might become the clinching argument for or against any QoS-based interconnection, providers are not expected to disclose such information in publicly visible routing protocol messages. Furthermore, the limitations of QoS guarantees and the associated parameter signalling have already been described above.

A further concept proposes BGP-based QoS service capability signalling for groups of NLRI. This Internet draft was launched in October 2006 under the name “draft-djernaes-simple-context-update-00” [68].
The draft does not specify the precise signalling encoding of QoS class markings and parameter signalling, but rather falls back to a more general QoS Service signalling, which might optionally involve interconnection-local marking signalling. The fundamental idea in this signalling concept is the grouping of reachability information (prefixes) in QoS Service address families. BGP always signals in the UPDATE messages which address family the contained network layer reachability information belongs to. The respective concept of “address family identifier (AFI)” and “subsequent address family identifier (SAFI)” has been defined in RFC 4760 [19]. The draft proposes to define a new AFI/SAFI for QoS Service signalling, and all NLRI contained in such an UPDATE message belong to that QoS Service context. This approach is attractive for two reasons. Firstly, the signalling overhead scales well due to the grouping effect of possibly numerous prefixes under a common AFI/SAFI-based signalling. Secondly, QoS-related UPDATE information can selectively be signalled for the separate AFI/SAFI. The same is true for selective route refreshes and soft reconfigurations. However, this thesis aims for a global scale class of service interconnection support for possibly all Internet routes. Using the capability signalling concept, this would result in double signalling of all prefixes, once for traditional reachability and a second time for the QoS service context association.

Further observations on existing QoS signalling approaches are summarized in RFC 4094 [131], a review analysis produced by the “Next Steps in Signaling (nsis)” working group. Half of the document is dedicated to RSVP analysis (see 4.1.2), RSVP being the most important QoS reservation protocol in today’s networks. Since RSVP is an end-to-end QoS signalling protocol, which has also been extended to establish MPLS traffic engineering tunnels, it has high potential to set up (possibly tunnelled) inter-domain QoS paths.
RFC 2814 [184] and RFC 2815 [163] also address the mapping of Integrated Services QoS onto Ethernet User Priorities. However, no fixed mapping is (or can be) defined; instead, a request and response negotiation between neighbouring nodes about locally available Ethernet resources is suggested. The dissemination of DSCP values has also been standardized for RSVP in RFC 2996 [27]. The usage of RSVP, however, raises concerns about scalability (due to the flow-based end-to-end nature and soft-state signalling behaviour) and lately about direct user <-> provider equipment interaction (see Fig. 73).

RFC 4094 analyses several intra-domain signalling protocols, which, similarly to RSVP, allow for resource reservations for traffic flows. The protocols are Tenet [18], ST-II [65], YESSIR (YEt another Sender Session Internet Reservations) [148], Boomerang [76] and INSIGNIA [129]. They differ in signalling complexity, sender- or receiver-initiated reservations, and multicast reservation support.

Three inter-domain reservation protocols are also analysed, which are closely related to this thesis. The first is the “Border Gateway Reservation Protocol (BGRP)” [147]. BGRP creates a sink-tree reservation structure limiting the reservation states in border nodes. DiffServ forwarding is expected, and sender-initiated PROBE/GRAFT reservation messages aggregate resource requests along the way by reusing and re-allocating existing reservations. The reservation tree structure cannot fully aggregate reservations, due to the possibly differing roots of multiple trees. Therefore, a second inter-domain protocol, called “Shared-segment Inter-domain Control Aggregation protocol (SICAP)” [168], has been defined, which optimizes the aggregation using shared-segment aggregations instead of a tree structure. Due to this change in reservation structure, the state information in border routers can be significantly reduced with SICAP.
“Dynamic Aggregation of Reservations for Internet Services (DARIS)” [33], the last protocol analysed in RFC 4094, provides a threshold-based dynamic inter-domain aggregation scheme. Individual reservations are monitored and trigger the setup of an aggregate reservation when a configured threshold is crossed. This approach also establishes shared-segment reservations along AS path routes. Intermediate ASes can in turn remove individual reservation states and rely on the aggregate instead.

All of the analysed protocols have in common that they set up flow reservations with fixed parameters. This is far too complex for an approach that targets general traffic separation for potentially all flows in the Internet without explicit resource reservations and QoS guarantees.

The same argument holds true for the Next Steps in Signaling (nsis) concept, which is standardized within an official IETF working group that has already produced five RFCs. They focus on the signalling framework, protocol design and signalling security. Fig. 71 depicts the layered structure of interconnected NSIS components in a node, which shows the general structure of the concept. The “NSIS Signaling Layer protocols (NSLP)” and the “NSIS Transport Layer protocols (NTLP)” represent the two-layer framework structure. In the upper half, the NSLP for QoS signalling [132] is of most interest here. It is still in proposed standard draft status. The last draft version (-16) expired in August last year. The lower half is dominated by the universal transport layer protocol “General Internet Signaling Transport (GIST)” [164]. It has recently been changed to “experimental” draft status. Together, both provide ways of operation and achievable functionality similar to RSVP. However, the QoS NSLP does not depend on a specific underlying QoS model and supports different reservation types (such as edge-to-edge, access-to-edge, edge-to-end).
NSIS is a universal signalling concept that appears to be well applicable to a wide range of resource reservations for flows of different granularities. Three independent implementations exist, which are all based on Linux platforms. None of the commercial router vendors has NSIS implementations in their products. It is expected that NSIS is well suited to achieve similar traffic separations as targeted in this thesis, at least within the networking layer. However, it has not been chosen as the base for the new cross-domain and cross-layer coarse grained Quality of Service support concept for several reasons. First of all, the flow-based reservation signalling is considered counterproductive, as mentioned before. Secondly, the concept is too complex for the aspired simple traffic separation. Thirdly, the lack of support in commercial routers would delay the adoption of the proposed concept of this thesis in provider networks at large scale. Lastly, the recent shift towards experimental status is a major drawback on the road to commercial deployment. Especially the reasoning of the “Internet Engineering Steering Group (IESG)” for the downgrade from proposed standard to experimental status is remarkable. Fig. 72 and Fig. 73 document the current situation of the GIST standardization process.

Fig. 71 Components of a NSIS node - [80]

Next Steps in Signaling
Internet-Draft
Intended status: Experimental
Expires: December 5, 2009
H. Schulzrinne, Columbia U.
R. Hancock, RMR
June 3, 2009
GIST: General Internet Signalling Transport
draft-ietf-nsis-ntlp-20
...

Fig. 72 GIST protocol change to “Experimental” status [164]

To: Gerald Ash <gash5107 at yahoo.com>, "iesg at ietf.org" <iesg at ietf.org>
Subject: Re: [NSIS] FW: I-D Action:draft-ietf-nsis-ntlp-20.txt
From: Ross Callon <rcallon at juniper.net>
Date: Thu, 11 Jun 2009 23:17:47 -0400

The fundamental problem with GIST is that is allows normal hosts (laptops, desktops, …) to send traffic to the control plane of routers. This opens up a new vector for hosts to be the source of DDOS attacks against the control plane of the routers. Note that such DDOS attacks are not just theory -- in fact multi-gigabit DDOS attacks against routers have occurred and do occur, and thus protecting against these is critical. It is therefore normal for service providers to prohibit "host to router" signaling packets (such as RSVP packets) from entering their network from the customer networks, for example by discarding these at the CE/PE boundary. Unfortunately the fact that such DDOS attacks are facilitated is not dependent upon the method that the router uses to recognize the packets as signaling packets. So long as a host has the ability to send traffic to the control plane of routers, then attackers will be able to harness the power of thousands of compromised hosts to attack routers.

Of course the same issue could come up with RSVP. It became a standards track protocol a very long time ago, and would probably face the same scrutiny if it were a new protocol being proposed today. The current widespread use of RSVP is generally in ISPs limited to support of MPLS within a service provider.

The same issue comes up in terms of DDOS attacks against application servers. Here one issue is that we don't have an alternative: hosts have to be able to send traffic to servers.
Also, in general at least the largest DDOS attacks against servers need to be dealt with by putting appropriate packet filters / rate limits in place in routers (assuming that the router network is operating, and wasn't taken down by a different DDOS attack). In terms of the right want to deal with such DOS attacks: The reality is that it would be quite a major undertaking to deploy sufficient protection to allow hosts in general to signal to the router's control plane while still protecting against such attacks. For example control traffic would need to be rate limited at pretty much every entry to every major service provider network, and the effect that any DDOS attack would have on legitimate control traffic would need to be understood. If the attack came from a very large number of sources, then the rate at each entry point might be quite low, implying that either the widely deployed rate limits would need to also be very low, or they would need to be adjusted in response to an attack. All of this would need to be documented. However, the amount of difficulty that would be encountered in deploying such a system suggests that this is not an appropriate thing to put into the IETF standards track unless and until there is clear and well documented motivation for whatever new signaling protocol is being proposed. It is also possible that a signaling protocol could be used in a sort of "walled garden" scenario, where the hosts that are permitted to initiate control traffic are known and are protected from compromise. The current use of RSVP within some enterprise networks could be thought of as one example of such a "walled garden". If deployment experience of NSIS is collected from the experiment and presented with a clear definition of the walled garden within which the protocol can be safely operated, then this work might be more likely to be progressed to standards track (with the description of how and why the deployment is limited to that garden). 
Ross (speaking for myself, but having discussed the issue with other IESG members)

Fig. 73 GIST protocol objections explained by Ross Callon [43]

Following the reasoning of routing area director Ross Callon, any general signalling protocol that aims for end-to-end resource reservations will no longer pass the IESG because of its denial of service potential. Either the protocol scope retreats from the end hosts (and applications) or a “walled garden” scenario is implemented, which strictly limits the user-to-network interaction to non-harmful functionality. Under those circumstances, RSVP appears to be the historical standardization flaw that will prevail.

Further work has been completed in the field of guaranteed inter-domain QoS reservations [165]. It introduces a refined version of BGP, called “BGP+”, which is optimized for fast convergence. BGP+ also includes the ability to assess a route’s QoS capabilities and to exchange this information with QoS NSLP and NTLP (GIST). This is required for the reservation adaptation in the case of route changes. However, this work is again too complex for the aspired generally applicable simple class of service concept of this thesis. Its strong ties with GIST, the guaranteed resource reservation and the requirement of a homogeneous set of supported classes of service in all participating provider networks prohibit its usage for the new concept.

Related work on QoS provisioning in a wider sense includes the concept of “Pre-Congestion Notification (PCN)”. Since resource reservations and traffic prioritization provide significant QoS enhancement in highly loaded (congested) networks, the avoidance of congestion by reducing sending rates is an effective means of QoS provisioning. This goal is targeted by PCN using token bucket metering and packet marking for early congestion warning.
This marking can either guide the intermediate nodes to select the “right” packet for dropping in the case of congestion or trigger the egress edge of a PCN domain to inform the ingress edge about the congestion. The ingress in turn is responsible for admission control to the PCN domain and possibly flow termination, if the QoS degradation of the already admitted flows does not allow extra flows to enter the domain. Fig. 74 depicts the major PCN components and their working principle.

Fig. 74 PCN working principle - [136]

From an inter-domain perspective with a general traffic separation scheme in mind, the PCN concept reveals two major drawbacks. The first is its limitation to a single domain. The more subtle second limitation results from the missing PCN marking encoding in packet headers. The IPv4 header (see Fig. 1) has no PCN marking bits, but rather six DSCP bits and two bits for explicit congestion notification (ECN) [156]. However, PCN defines three levels of congestion marking (no congestion, admission stop marking and excess traffic marking), which need to be encoded in the packet header. The compromise found is to limit PCN signalling to a single DSCP value. That is, the so-called “DSCP for Capacity-Admitted Traffic” [17] will be used for PCN, and the ECN bits are redefined accordingly. All traffic marked with other DSCPs can either not be admitted into a PCN domain, must be remarked to the PCN DSCP, or cannot be used for PCN marking. Both strong limitations clearly show that PCN will not be an equivalent replacement for the proposed general concept of cross-domain and cross-layer coarse grained Quality of Service support in IP-based networks as described in chapter 7.

7 New (coarse grained) CoS concept

7.1 Motivation and target

The current Quality of Service support within Autonomous Systems differs dramatically from the support between ASes at interconnection points.
The fast growing number of Internet users and the rapid increase in access line transmission capacity lead to a steady growth of Internet traffic. As an example, Fig. 75 shows the overall exchanged traffic statistics of the German Internet exchange in Frankfurt.

Fig. 75 DE-CIX yearly traffic graph - [62]

Furthermore, service providers are increasingly offering voice over IP or IP-TV services to their customers. In order to ensure the right transfer quality of those delay-sensitive services, most providers still choose a high degree of over-provisioning (less than 20% network load) as the easiest and still cheapest solution. However, European providers seem to take a leading role in consistently setting up Differentiated Services forwarding within their networks in order to ensure separated traffic handling and to cut down on the over-provisioning cost. Four to six traffic classes of service are common. However, this approach can only be consistently and homogeneously applied intra-domain. Public inter-domain interconnections are still run without any QoS support (“Best-Effort (BE)” interconnection) and solely rely on the over-provisioning solution. Such interconnected “quality islands” exist independently, peer with BE traffic, perform costly multi-parameter ingress classification to locally guess and match the incoming traffic class, run uncoordinated QoS concepts and might not even be known globally.

Due to the fast access speed increase and the high quality expectations of their customers, service providers are increasingly forced into frequent and costly interface speed upgrades. This new coarse grained CoS concept therefore targets the inter-domain BE interconnection and aims for a traffic-separating interconnection style without QoS guarantees. Neighbouring providers are already able to set up such CoS-enabled interconnections by means of mutual SLA-based agreements about the supported classes of service, their encoding and mapping.
However, the new concept extends this manual local interconnection CoS support by means of transitive signalling of available classes of service and their respective class markings. This enables multi-domain CoS transit paths with automated class of service transfer adaptation. The locally available CoS support is disclosed to all providers, which will adapt the forwarded traffic to the neighbour’s class set and encoding. Based on this basic functionality, CoS routing and tunnelling are expected to evolve on top.

Furthermore, the new coarse grained CoS concept extends the traffic separation in locally available classes of service across networking layers. IP CoS support is the anchor point of the class set signalling using the inter-domain PHB ID encoding (see Fig. 46). However, depending on the availability of MPLS tunnels, Ethernet QoS support or virtual channels for traffic separation, the concept’s signalling associates the classes and their markings of different layer technologies. This is required, since no definitive standards exist that define this cross-layer mapping of available class sets. However, service providers do individually define mapping rules within their domain and now have the means to signal this cross-layer mapping to interconnection partners. This again enables class set approximation, but now extended to a consistent traffic separation in all CoS forwarding enabled layers.

Lastly, the new concept introduces an optional second mechanism, which prevents the possible excessive misuse of higher priority traffic classes. Class-based ingress limitation using token bucket metering with associated dropping or remarking rules for excess traffic protects the CoS-enabled AS from overload in high priority traffic classes. Those ingress filter parameters are signalled to adjacent interconnection partners. This results in a predictable forwarding behaviour and allows for informed traffic planning and possibly shaping at the sending AS egress edge.
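The class-based ingress limitation described above can be sketched as one token bucket per traffic class with a configurable excess action. This is a minimal illustration under stated assumptions: rates are given in bytes per second, classes are identified by their DSCP, and the names TokenBucket, IngressRule and police are hypothetical and not taken from the thesis or the drafts.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class TokenBucket:
    rate: float    # refill rate in bytes per second
    burst: float   # bucket depth in bytes
    tokens: float = field(init=False)
    last: float = field(default=0.0, init=False)

    def __post_init__(self):
        self.tokens = self.burst   # start with a full bucket

    def conforms(self, now: float, size: int) -> bool:
        # Refill for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False

@dataclass
class IngressRule:
    bucket: TokenBucket
    excess_dscp: Optional[int]   # None: drop excess traffic, else remark to this DSCP

def police(rules: Dict[int, IngressRule], now: float, dscp: int, size: int) -> Optional[int]:
    """Return the DSCP to forward the packet with, or None if it is dropped."""
    rule = rules.get(dscp)
    if rule is None or rule.bucket.conforms(now, size):
        return dscp               # unlimited class or conforming packet
    return rule.excess_dscp       # excess traffic: remark or drop

# Example: limit DSCP 46 (EF) to 1000 bytes/s and remark the excess to best effort.
rules = {46: IngressRule(TokenBucket(rate=1000.0, burst=1500.0), excess_dscp=0)}
print(police(rules, 0.0, 46, 1500))   # conforming packet keeps its marking
print(police(rules, 0.0, 46, 1500))   # bucket empty, packet is remarked
```

Since the rate, burst size and excess action are exactly the parameters a neighbour would need to know, signalling them allows the sending AS to shape its egress traffic to stay within the limit.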
The new coarse grained CoS concept therefore:
• provides knowledge about the available traffic separations and markings by means of transitive cross-domain marking signalling with associated cross-layer mapping,
• enables marking adaptation (and possibly route selection) without guarantees and
• performs fair signalling of class overload limitations and excess traffic handling.

This twofold “free to join” concept of global class set marking signalling with cross-layer mapping and rate limitation signalling is optimized for simplicity. Quality of Service guarantees are waived in favour of signalling, metering, debugging and operating simplicity. QoS in this approach therefore refers to primitive traffic separation into several classes, which will experience differently prioritized forwarding behaviour in relaying nodes. Enqueueing in separate queues is aimed for. The concept targets inter-AS Class of Service, since simple traffic separation is identified as the key characteristic. If widely applied, the public Internet will evolve into a public “Betternet” in the future.

The concept has been formulated in two Internet drafts, [124] and [125], and was widely discussed and published in the networking community. The resulting feedback acknowledges the reduced complexity and expresses a preference for it over the aforementioned QoS guaranteeing approaches. The concept is expected to be applied at global scale, possibly combined with SLA-based QoS guaranteeing solutions at individual interconnections. Due to the targeted global deployment, scalability, router resource consumption and operational stability have been analysed.

7.2 Usage of BGP for QoS signalling

Signalling Class of Service sets and markings between interconnection partners can either be performed as piggyback on already deployed protocols or by means of a separate signalling protocol.
Static CoS sets and markings would be a third and in theory the best solution, if all providers agreed on a single globally available inter-domain CoS set. However, such a set neither exists nor is likely to be standardized any time soon. The definition and usage of specialized signalling protocols for the possibly frequent exchange of load statistics and flow-based quality requests and grants is an appropriate solution and likely to happen for QoS guaranteeing approaches. The concept of this thesis, however, does not require such separate handling due to its simplicity and coarse-grained global signalling scope. Reusing existing protocols with simple extensions is therefore envisaged.

An attractive reuse candidate among the existing signalling protocols is the NSIS protocol family. It has not been chosen for two reasons:
1. None of the existing AS border routers currently runs NSIS protocol entities and
2. The recently observed IETF objections against NSIS seem to dramatically delay its appearance as a proposed standard of the Internet (see Fig. 73).

The only IETF standard protocol that is readily available at interconnection points for reuse is the Border Gateway Protocol. The following two sections explain the pros and cons of this choice.

Why to use BGP for signalling

BGP is the de facto interconnection protocol and therefore globally accepted and globally available. It is a well-designed, flexible protocol that allows for simple signalling extensions. BGP exchanges reachability information and can tag this information with route-related attributes. Such attributes have IANA-assigned type values listed in the respective IANA registries and associated attribute structures. This attribute and IANA registry approach allows for the flexible extension of BGP. All attributes (existing or newly defined) are automatically associated with the network layer reachability information advertised in the respective BGP UPDATE messages.
Attributes of transitive type are even relayed globally together with the NLRI.

Why not to use BGP for signalling

BGP’s stability is achieved through dampened UPDATE message rates and the concept of failure confinement within routing areas or confederations (see chapter 2.2.2). Any fast changing signalling information is therefore not suited for BGP. BGP might also be avoided if long lived signalling information can be placed in Internet Routing Registries (see 5.3) instead of being transported in UPDATE messages. Lastly, since all AS border routers of the Internet need to store and process the large and ever growing reachability information communicated via BGP, any extension should put barely any extra burden on the routers’ resource consumption.

The new coarse grained CoS concept does make use of BGP and its already defined extended community attribute structure for the following reasons:
1. BGP is readily available for the concept’s deployment, in particular if widely implemented attribute structures can be used for the CoS signalling,
2. The concept’s CoS signalling is small in size and not rapidly changing,
3. Service providers are familiar with BGP’s community philosophy and can easily adapt to the proposed CoS extensions and
4. The Internet Routing Registry signalling approach is included in the concept’s specification for backup and security purposes, but not exclusively relied on. This is due to the fact that many service providers still do not make use of IRR information in their border router configurations.

Within BGP, Extended Community attributes have been chosen for the CoS signalling, since the container size of 8 bytes is sufficient and the automatic association of attributes with all NLRI of an UPDATE message matches the concept’s target of signalling simplicity and efficiency. Specification details are outlined in chapter 7.3.
The different approach in related work (see chapter 6) of using a separate address family for “QoS route” signalling has been rejected. The new coarse grained CoS concept’s signalling information is expected to be associated with the majority of Internet routes. The use of a separate address family would require doubled signalling for reachability and CoS support purposes, which is not an efficient signalling solution.

7.3 Definitions and information processing

The following two sections outline the design principle, attribute definitions and processing as specified in the respective two IETF draft documents [124] and [125]. They are an integral part of this thesis work.

7.3.1 BGP extended community attribute for CoS marking

Cross-domain CoS marking and cross-layer mapping signalling is specified in “draft-knoll-idr-qos-attribute-04” [124] as follows. Reachability information of IP prefixes is augmented by possibly several instances of a new BGP Extended Community. Each instance signals the availability of a certain class of service together with its technology dependent marking encoding. Several such Extended Communities are needed in order to signal more available classes as well as more associated cross-layer representations in other networking technologies. As a design principle, only the IP prefix originating AS is allowed to initially associate such a set of Extended Communities of supported classes with the advertisement of its own prefixes. Neighbouring and more distant ASes will then:
- learn about the available classes and marking encodings,
- possibly use the information for best path or multi-path decision making,
- relay the respective best path and associated transitive attribute information to their neighbours – possibly adopting the signalled locally applied marking and
- use the learned class marking for downstream packet forwarding (including possible remarking at the outgoing edge interface).
Transit ASes perform class marking approximation in order to map the class set and adopt the forwarding as closely as possible. ASes are free to ignore single classes or cross-layer mappings of the classes, but need to indicate this by means of a provided “ignore” flag. Fig. 76 depicts the resulting signalling and traffic forwarding procedure.

Fig. 76 Cross-Domain CoS marking concept

Several QoS Marking communities may be included in a single BGP UPDATE message. They are virtually linked together by means of an identical "QoS Set Number" field. Each QoS Marking community is encoded as an 8-octet tuple, as defined in [124]. Signalled QoS Class Sets are assumed to be valid for traffic crossing this AS. If different QoS strategies are used within an AS, its provider is responsible for the consistent transport of transit traffic across this inhomogeneous domain. In all transit forwarding cases, QoS based tunnelling mechanisms are the means of choice for transparent traffic transport. The availability of the "Best Effort" forwarding class is implied and defaults to a zero encoding on all signalled layers. It is therefore not necessary to include QoS Marking communities for the Best Effort Class as long as the default encoding is in place.

7.3.1.1 Extended Community Type

The new QoS Marking community is encoded in a BGP Extended Community Attribute [161]. It is therefore a transitive optional BGP attribute with Type Code 16. The actual encoding within the BGP Extended Community Attribute is as follows. The QoS Marking community is of regular type, which results in a 1 octet Type field followed by 7 octets for the QoS marking structure. The Type is IANA-assignable and marks the community as transitive across ASes. The type number 0x04 has been assigned by IANA (see Fig. 77). Optionally, a non-transitive Type value assignment of 0x44 is provided, which allows for the AS internal exchange of marking information. The community format remains untouched for this non-transitive version. Fig.
78 depicts the BGP Extended Community Attribute structure.

http://www.iana.org/assignments/bgp-extended-communities
Border Gateway Protocol (BGP) Data Collection Standard Communities
(last updated 2009-06-02)
…
Registry Name: BGP Extended Communities Type - regular, transitive
Reference: [RFC4360]

Range        Registration Procedures
-----------  ---------------------------------------
0x90-0xbf    Standards Action/Early IANA Allocation
0x00-0x3f    First Come First Served

Registry:
Type Value   Name                                  Reference  Registration Date
-----------  ------------------------------------  ---------  -----------------
0x04         QoS Marking                           [Knoll]    2008-12-30
0x05         CoS Capability                        [Knoll]    2009-05-18

Registry Name: BGP Extended Communities Type - regular, non-transitive
Reference: [RFC4360]

Range        Registration Procedures
-----------  ---------------------------------------
0xd0-0xff    Standards Action/Early IANA Allocation
0x40-0x7f    First Come First Served

Registry:
Type Value   Name                                  Reference                            Registration Date
-----------  ------------------------------------  -----------------------------------  -----------------
0x40         Link Bandwidth Extended Community     [draft-ietf-idr-link-bandwidth-00]   2009-05-18
0x44         QoS Marking                           [Knoll]                              2008-12-30
…

Fig. 77 IANA registry for BGP Extended Community type numbers

As already made clear in chapter 2.2.2, it is important to distinguish between a “transitive attribute” and a “transitive community”. The depicted attribute structure is by default of transitive type and will therefore always be relayed across ASes, regardless of whether it is actually processed. A special marking flag, the so called “partial flag”, is defined for BGP path attributes. It will be set by ASes that do not interpret such an Extended Community Attribute. The decision to use Extended Community Attributes as “container structure” for the Extended Communities for CoS marking ensures that the signalling relay will actually reach all ASes of the Internet.
However, one limitation still exists in practice: providers might decide to generally suppress the relay of Extended Community Attributes, no matter what communities are enclosed. In such a case, all ASes up to the blocking one will receive the class set information and might make use of it. ASes further upstream, however, will not receive the CoS signalling via this relay path.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0 x 0 0 0 1 0 0|                                               |
+-+-+-+-+-+-+-+-+    7 octet QoS Marking community structure    |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Fig. 78 BGP Extended Community Attribute structure with type 0x04 or 0x44

BGP UPDATE messages can by definition only include a single BGP Extended Community Attribute. However, each attribute can enclose several Extended Communities. Such Extended Communities are in turn classified as being of “transitive” or “non-transitive” community type. Here, “transitive” distinguishes whether a community can be signalled across an eBGP session or whether the community is confined to communication sessions with iBGP peers only. The CoS marking Extended Community has been assigned a transitive as well as a non-transitive type number to give providers the choice between AS external and AS internal only usage of the signalling structure. The remaining explanation will assume the usage as transitive type, since cross-domain signalling is an important target of the concept.
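The distinction between the transitive (0x04) and the non-transitive (0x44) type value can be illustrated with a few lines of Python. This is only a sketch: it assumes the type octet layout of RFC 4360, where the second most significant bit marks a community as non-transitive, and the function names are chosen for illustration only.

```python
# Sketch only: type octet handling for the QoS Marking community.
# Assumption: RFC 4360 layout, where mask 0x40 is the "non-transitive" bit.
T_BIT = 0x40

QOS_MARKING_TRANSITIVE = 0x04      # value listed in the IANA registry (Fig. 77)
QOS_MARKING_NON_TRANSITIVE = 0x44  # same type with the non-transitive bit set


def is_transitive(type_octet: int) -> bool:
    """True if the community may be relayed across AS boundaries."""
    return (type_octet & T_BIT) == 0


def is_qos_marking(type_octet: int) -> bool:
    """True for both the transitive and the non-transitive QoS Marking type."""
    return (type_octet & ~T_BIT & 0xFF) == QOS_MARKING_TRANSITIVE
```

Under this reading, a receiver can accept both variants with one check and use the transitivity bit alone to decide whether the community may leave the AS.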
7.3.1.2 QoS Marking Extended Community Structure

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0 0 P R I A 0 0| QoS Set Number|Technology Type| QoS Marking Oh|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| QoS Marking Ol| QoS Marking A |0 0 0 0 0 0 0 0|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Fig. 79 Structure of the QoS Marking Community

As shown in Fig. 79, each signalled Extended Community contains a “Flags” field, a “QoS Set Number”, a “Technology Type” and two “QoS Marking” fields. The first octet contains four flags, ‘P, R, I and A’, which are used to indicate processing status and results.

The 'P' flag indicates the preservation of incoming markings during the transit forwarding process. The IP prefix originating AS should set the flag to '1', which is otherwise implied by an AS_PATH length of 1 AS. Transit ASes must set the flag to '1' if the advertised Marking A is accepted at the ingress and is sent out unchanged at the egress. That is, no remarking occurs, neither for marking adoption with the neighbouring downstream AS nor by resetting the markings. This flag field is set and cleared by each relaying AS according to its handling of markings, irrespective of whether the particular Marking A is ignored in the internal per hop forwarding behaviour.

The "R, I and A" flags are set to '0' in the advertisement by the IP prefix originating AS. Transit ASes must change the flag value to '1' once the respective event has occurred. If the QoS marking actively used in the transit AS internal forwarding is different from the advertised original one, the 'Remarking (R)' flag is set to '1'. This must be signalled separately for each technology type community within the set of Extended Communities. The same applies to the 'Ignore (I)' flag, if the respective advertised QoS marking is ignored in the transit AS internal forwarding.
The 'Aggregation (A)' flag must be set to '1' by the transit AS relaying the UPDATE message, if the respective IP prefixes will be advertised inside an IP prefix aggregate constituted from differing Class Sets. The handling of prefix aggregation is vital for routing table size reduction and routing stability. However, this aggregation can easily result in the merging of routes to more specific prefixes with differing class of service sets. In this case, the aggregator becomes the IP prefix originating AS for the prefix aggregate and is responsible for the mapping between the upstream signalled merged class set and the downstream available differing class sets. It is the provider’s responsibility to care for a close class set approximation in terms of forwarding and marking behaviour.

If the defined "R, I and A" flags are cleared, and the cleared 'Partial' flag of the BGP attribute shows that no "QoS Class ignorant" AS is involved in the forwarding path, a consistent class based, traffic separated forwarding is available along this path.

Several single QoS Marking communities can be logically grouped into a QoS Marking community Set characterized by an identical QoS Set Number. This grouping of the single QoS Marking communities into a set provides cross-layer linking between the QoS class encodings. The number of signalled QoS Marking communities as well as QoS Marking community Sets is at the choice of the operator of the originating AS. The enumerated QoS set numbers have BGP UPDATE message local significance, starting with set number 0x00.

Since all signalled markings are networking technology specific, the Technology Type field indicates which technology the marking refers to. In the course of defining this signalling, an extensive search for existing technology type enumerations has been performed. The closest result was the “IANAifType-MIB” enumeration [96].
However, this enumeration is far too detailed and the registry maintainers have discouraged its usage because of existing consistency weaknesses. Therefore, a short and simple enumeration has been defined as shown in Table 12.

Table 12 Technology Type Enumeration

Value   Technology Type
0x00    DiffServ enabled IP (DSCP encoding)
0x01    Ethernet using 802.1q priority tag
0x02    MPLS using E-LSP
0x03    Virtual Channel (VC) encoding using separate channels for QoS forwarding / one channel per class (e.g. ATM VCs, FR VCs, MPLS L-LSPs)
0x04    GMPLS - time slot encoding
0x05    GMPLS - lambda encoding
0x06    GMPLS - fibre encoding

The two most important fields of the new QoS Marking Extended Community structure are the QoS Marking O and A fields. The interpretation of these fields depends on the selected layer and technology. ASes which process the community and support the given QoS Class by means of a QoS mechanism using bit encodings for the targeted behaviour (e.g. IP DSCP, Ethernet User Priority, MPLS TC etc.) must use a copy of the encoding in the "QoS Marking A" community field. Unused higher order bits default to '0'. Other technologies, which use separate forwarding channels for different classes (such as L-LSPs, VPI/VCI inferred ATM classes, lambda inferred priority, etc.), shall use class enumerations as encoding in this community field. The enumeration count starts with zero for the best effort traffic class and rises by one with each available higher priority class.

There are two QoS Marking fields within the QoS Marking community for the "original (O)" and the "active (A)" QoS marking. Higher order bits of those fields, which are not used for the respective behaviour encoding, default to zero. The QoS Marking O (Original QoS Marking) field is a 16 bit QoS Marking field, which consists of a high ("Oh") and a low ("Ol") octet. The IP prefix originating AS copies the internally associated QoS encoding of the given Technology Type into this field.
The field value is right-aligned depending on the number of encoded bits. For the IP technology, the encoding of Per Hop Behaviour Codes has to follow the definitions stated in [31]. The field must remain unchanged in BGP UPDATE messages of relaying nodes.

QoS Marking A (Active QoS Marking) and QoS Marking O must be identically encoded by the IP prefix originating AS, except for the case where IP technology Per Hop Behaviours are addressed. "QoS Marking A" will always contain the locally applied encoding for the targeted PHB. All other ASes use this Active QoS Marking field to advertise their locally applied internal QoS encoding of the given class and technology at the interconnection point. The field value is right-aligned depending on the number of encoded bits. A cleared Marking field (all zero) signals that this traffic class experiences default traffic treatment within the transit AS forwarding technology.

7.3.1.3 QoS Marking Extended Community Usage

Providers may choose to process the QoS Marking communities and adopt the behaviour encoding and tunnel selection according to their local policy. This may also lead to different IGP routing decisions or even affect BGP update filters.

Only the IP prefix originating AS is allowed to signal the QoS Marking communities and Sets. All advertised prefixes which originate from that AS will be sent with the same QoS Marking community Set in the respective UPDATE message. Transit ASes must not modify or extend the QoS Marking community Set, except for the update of each 'QoS Marking A' field contained in the community Set and the respective "P, R, I, A" flags. Prefixes with associated identical QoS Marking community Sets are to be advertised together in common UPDATE messages in relaying nodes.

Fig. 80 shows an AS interconnection example with different Class Sets. It shows the case in AS 5 where different Class Sets are used internally and externally.
The proposed QoS Class Set signalling will always use the external definitions within the UPDATE message QoS Marking communities. The example also shows that IP prefixes which originate in AS 5 and AS 3 can be advertised together with the same QoS Marking community Set as long as their Layer 2 encoding is identical.

Fig. 80 CoS enabled AS interconnection example topology

IP packet forwarding based on packet header QoS encoding might require remarking of packets in order to match AS internal policies and encodings of neighbouring ASes. Identical QoS class sets and encodings between neighbouring ASes do not require any remarking. Different encodings will be matched on the outgoing traffic. Outgoing traffic for a given IP prefix uses the 'QoS Marking A' information of the respective BGP UPDATE message QoS Marking community for adopted remarking of the forwarded packet. If the 'I' flag is set for a given encoding, the outgoing traffic remarking should still be applied despite the signalled lack of QoS Class forwarding support. This is particularly important if the preserve flag 'P' is signalled together with the 'I' flag.

Several IP prefixes of different IP prefix originating ASes may be aggregated to a shorter IP prefix in transit ASes. If the original Class Sets of the aggregated prefixes are identical, the aggregate will use the same Set. In all other cases, the resulting IP prefix aggregate is handled as if the transit AS were the originating AS for this aggregated prefix. The transit provider may provide AS internal mechanisms which map the signalled aggregate QoS Class Set to the different original Class Sets in the internal forwarding path. In case of IP prefix aggregation with different QoS Class Sets, the 'Aggregation (A)' flag of each QoS Marking community within the Set must be set to '1'.
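The community structure of Fig. 79 and the update rules for transit ASes can be sketched in Python as follows. The sketch is illustrative and not taken from the draft documents: the class and function names are hypothetical, the flag masks assume the usual IETF convention that bit 0 of the diagram is the most significant bit, and the sticky handling of the 'R' and 'I' flags is one reading of the rules given above. An Ethernet priority example (Technology Type 0x01) is used, because for IP the Marking O field carries a PHB ID code rather than the DSCP itself.

```python
# Sketch only: packing the 8-octet QoS Marking community and a transit update.
import struct
from dataclasses import dataclass, replace

# Flag masks, assuming bit 0 in Fig. 79 is the most significant bit.
FLAG_P, FLAG_R, FLAG_I, FLAG_A = 0x20, 0x10, 0x08, 0x04
LAYOUT = "!BBBBHBB"  # type, flags, set number, technology, Marking O, Marking A, zero


@dataclass(frozen=True)
class QosMarking:  # hypothetical helper type
    flags: int
    set_number: int
    technology: int
    marking_o: int  # 16 bit original marking (Oh, Ol)
    marking_a: int  # 8 bit active marking

    def pack(self, type_octet: int = 0x04) -> bytes:
        return struct.pack(LAYOUT, type_octet, self.flags, self.set_number,
                           self.technology, self.marking_o, self.marking_a, 0)

    @classmethod
    def unpack(cls, data: bytes) -> "QosMarking":
        _type, flags, set_no, tech, m_o, m_a, _zero = struct.unpack(LAYOUT, data)
        return cls(flags, set_no, tech, m_o, m_a)


def transit_update(c: QosMarking, local_marking: int,
                   preserved: bool = False, ignored: bool = False) -> QosMarking:
    """Return the community as a transit AS would relay it (assumed reading)."""
    flags = c.flags & ~FLAG_P & 0xFF      # 'P' is decided anew by each relaying AS
    if preserved:
        flags |= FLAG_P
    if local_marking != c.marking_o:      # internal marking differs from original
        flags |= FLAG_R
    if ignored:
        flags |= FLAG_I
    # Marking O stays untouched; only Marking A and the flags change.
    return replace(c, flags=flags, marking_a=local_marking)


origin = QosMarking(flags=FLAG_P, set_number=0, technology=0x01,
                    marking_o=5, marking_a=5)
relayed = transit_update(origin, local_marking=3)
```

In this example the originating AS advertises Ethernet priority 5, and a transit AS that forwards the class with priority 3 relays the community with the 'R' flag set, Marking A updated and Marking O unchanged.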
7.3.1.4 Confidentiality Considerations

The disclosure of confidential AS intrinsic information is of no concern, since the signalled marking for QoS class encodings can be adapted prior to the UPDATE advertisement of the IP prefix originating AS. This way, a distinction between internal and external QoS Class Sets can be achieved. AS internal cross-layer marking adaptation and policy based update filtering allow for consistent QoS class support despite made up QoS Class Set and encoding information within UPDATE advertisements. In case of such a policy hiding strategy, the required AS internal ingress and egress adaptation shall be done transparently without explicit "Active Marking" and 'R' flag signalling.

7.3.1.5 QoS Marking Example

The example AS is advertising several IP prefixes, which experience equal QoS treatment from AS internal networks. The IP packet forwarding policy within this originating AS defines e.g. 3 traffic classes for IP traffic (DSCP1, DSCP2 and DSCP3). These three classes are also consistently taken care of within a TC bit supporting MPLS tunnel forwarding. The BGP UPDATE message for the announced IP prefixes will contain the following QoS Marking community Set together with the IP prefix NLRI.

Fig. 81 QoS Marking Extended Community signalling example

7.3.2 BGP class of service interconnection

Class-overload prevention is specified in “draft-knoll-idr-cos-interconnect-03” [125] as follows. The new coarse grained CoS concept is a twofold concept and provides in its second half an optional mechanism that prevents the possible excessive misuse of higher priority traffic classes. Class-based ingress limitation using token bucket metering with associated dropping or remarking rules for excess traffic protects the CoS enabled AS from overload in high priority traffic classes. Those ingress filter parameters are signalled to adjacent interconnection partners.
This results in a predictable forwarding behaviour and allows for informed traffic planning and possibly shaping at the sending AS egress edge. This fair and square interconnection limitation signalling is specified using two BGP attributes.

Two new transitive attributes are specified, which enable adjacent peers to signal Class of Service Capabilities and token bucket Class of Service admission control Parameters. The new "CoS Capability" is deliberately kept simple and denotes the general EF, AF Group, BE and LE forwarding support across the advertising AS. The second "CoS Parameter Attribute" is of variable length and contains a more detailed description of available forwarding behaviours using the PHB ID Code encoding. Each PHB ID Code is associated with rate and size based traffic parameters, which will be applied in the ingress AS Border Router for admission control purposes to a given forwarding behaviour.

A Basic Set of supported Classes, called "Basic CoS", is defined here. It consists of the primitive "Best Effort (BE)" PHB, the "Expedited Forwarding (EF)" PHB [60], the "Assured Forwarding (AF)" PHB Group [88] and the "Lower Effort" Per-Domain Behaviour (PDB) [34]. Providers which can support this Basic CoS signal this capability to their interconnection partners by means of the new CoS Capability Extended Community defined below. Four AF PHB classes have been defined so far, which will be grouped into the generally signalled "AF Group". That is, as long as the AS provider can support at least one of the four AF classes in its externally supported CoS Set, this AS is regarded as AF capable.

A second transitive attribute is defined for signalling the parameters of the admission control applied within the ingress AS border router. The reason for this traffic limitation is the fact that certain high quality forwarding behaviours can only be achieved if the percentage of high priority traffic within the traffic mix lies below a certain threshold.
This attribute informs the interconnection partner about the applied limitation, which can in turn be used to perform traffic shaping at the neighbouring AS egress. The attribute allows this limitation to be signalled either in association with the NLRI within the same UPDATE message or with "global" scope to describe the generally applied ingress limitation. Both attributes are likely to be used together if ingress class limitation is used for the respective AS. Fig. 82 depicts the resulting class overload limitation concept and outlines how excess traffic can experience either dropping or remarking as punishment action.

Fig. 82 Class overload limitation concept

7.3.2.1 CoS Capability Extended Community Structure

The CoS Capability Extended Community is encoded as BGP Extended Community path attribute as described in section 7.3.1.1. It is deliberately kept very simple and is defined as outlined in Fig. 83. It is a transitive Extended Community of regular type with the IANA assigned type value of 0x05 (see Fig. 77). The binary encoded support of per hop behaviour classes is detailed in Table 13.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|B E A L|           Currently Unused - default to '0'           |
|E F F E|                                                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       Currently Unused - default to '0'       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Fig. 83 CoS Capability Extended Community Structure

Table 13 CoS Capability Attribute – binary class encoding

Bit      Flag     Encoding
0        BE       Default to ‘1’ to signal general “Best Effort” PHB support
1        EF       ‘1’ … “Expedited Forwarding” PHB support [60]
2        AF       ‘1’ … “Assured Forwarding” PHB group support [88]
3        LE       ‘1’ … “Lower Effort” PHB support [34]
4 .. 7   unused   Default to ‘0’

The implied Per-Hop-Behaviour Identification Codes follow the definition as standardized in RFC 3140 [31].
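As an illustration of Fig. 83 and Table 13, the following Python sketch encodes and decodes the CoS Capability community. The function names are hypothetical, and the bit masks assume that bit 0 of the diagram is the most significant bit of the first octet after the type field.

```python
# Sketch only: CoS Capability Extended Community (type 0x05), 8 octets in total.
# Assumption: bit 0 in Fig. 83 is the most significant bit of the flags octet.
COS_CAPABILITY_TYPE = 0x05
CAPABILITY_BITS = {"BE": 0x80, "EF": 0x40, "AF": 0x20, "LE": 0x10}


def encode_cos_capability(ef: bool = False, af: bool = False,
                          le: bool = False) -> bytes:
    """Build the community; BE support is always signalled (defaults to '1')."""
    flags = CAPABILITY_BITS["BE"]
    if ef:
        flags |= CAPABILITY_BITS["EF"]
    if af:
        flags |= CAPABILITY_BITS["AF"]
    if le:
        flags |= CAPABILITY_BITS["LE"]
    # The six remaining octets are currently unused and default to zero.
    return bytes([COS_CAPABILITY_TYPE, flags]) + bytes(6)


def decode_cos_capability(data: bytes) -> set:
    """Return the names of the signalled forwarding behaviours."""
    if len(data) != 8 or data[0] != COS_CAPABILITY_TYPE:
        raise ValueError("not a CoS Capability community")
    return {name for name, mask in CAPABILITY_BITS.items() if data[1] & mask}
```

An AS that supports EF and at least one AF class in addition to Best Effort would, under these assumptions, advertise `encode_cos_capability(ef=True, af=True)`.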
The AF Group needs to consist of at least one of the currently available AF1x, AF2x, AF3x and AF4x classes.

Fig. 84 Per-Hop-Behaviour Identification Codes implied by CoS Capability

7.3.2.2 CoS Parameter Attribute structure

The second attribute is a new optional transitive BGP path attribute of variable length. The attribute type number of 0xFF is currently used as specified in RFC 2042 [133]. The attribute contains one or more of the following token bucket parameter sets as shown in Fig. 85.

Fig. 85 CoS Parameter Attribute structure

The PHB ID Code field associates the respective signalled PHB support with the token bucket parameter set that follows it. This parameter set follows the specifications given in RFC 2210 [182], which is also used within the IETF Integrated Services architecture.

Only two flags (‘G’ and ‘DR’) are defined within the one octet Flags field. The 'G' flag signals whether the limitations have global scope on all incoming traffic ('1') or are associated with traffic that is destined to destinations within the NLRI of the UPDATE message ('0'). NLRI specific limitations will supersede globally signalled ones for traffic destined to those NLRI destinations. The 'DR' flag signals the applied handling of non-conforming traffic. DR='0' signals strict dropping of excess traffic. DR='1' signals that excess traffic packets are remarked to the Best Effort traffic marking.

In order to correctly identify the originator of the signalled limitations, the “ASN of sending AS” field holds the corresponding AS number. Irrespective of the 2-octet or 4-octet AS peering type, the sending AS of the attribute must encode its AS number as a right-aligned 32 bit number.

7.3.2.3 CoS Parameter Attribute Usage

The signalled parameters are used for PHB ID Code based ingress limitation.
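The ingress limitation with the two excess traffic actions of the 'DR' flag can be sketched as a simple token bucket meter. This is a simplified illustration rather than the specified behaviour: only a token rate and a bucket depth are modelled, whereas the RFC 2210 parameter set contains further values, and the class name, method name and returned action strings are hypothetical.

```python
# Sketch only: per-class ingress limiter with drop or remark action for excess.
class IngressLimiter:  # hypothetical name
    def __init__(self, rate: float, depth: float, remark_excess: bool):
        self.rate = rate                # token rate in bytes per second
        self.depth = depth              # bucket depth in bytes
        self.remark_excess = remark_excess  # models DR='1' (remark) vs DR='0' (drop)
        self.tokens = depth             # bucket starts full
        self.last = 0.0                 # time of the previous packet in seconds

    def admit(self, size: int, now: float) -> str:
        """Meter one packet of 'size' bytes arriving at time 'now'."""
        # Refill the bucket for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return "forward"
        # Non-conforming packet: remark to Best Effort or drop.
        return "remark-to-BE" if self.remark_excess else "drop"


limiter = IngressLimiter(rate=1000, depth=1500, remark_excess=True)
actions = [limiter.admit(1500, 0.0), limiter.admit(1500, 0.5),
           limiter.admit(1500, 2.0)]
```

With a rate of 1000 bytes per second and a depth of 1500 bytes, the second packet in this example arrives before the bucket has refilled and is therefore remarked, while the third one conforms again.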
Depending on which PHB ID Codes a BGP peer signals in this attribute to its neighbour, the respective PHB ID Code is said to be supported and will experience the defined limitations. Those limitations can be applied to all incoming traffic of a specific PHB ID Code (marked as 'G') or only to incoming traffic that is destined for the NLRI of the given UPDATE message. The resulting treatment of non-conforming traffic is signalled through the 'DR' flag. All limitations have AS local scope for the advertising AS, and the neighbouring AS might or might not adapt its sending behaviour to those advertised limitations.

Despite the transitive nature of the new attribute, its usage for ingress limitation is confined to neighbouring ASes. Processing of the conveyed parameters is only valid for peers that are peering with the AS specified in the ASN field of the attribute. The attribute should not be transitively relayed to non-adjacent interconnection partners. Since non-transitive BGP path attributes can be sent out into eBGP peering sessions, it would have been sufficient to define the CoS Parameter attribute as being of non-transitive type. However, current commercial routers are not aware of this new attribute and will silently discard it. Therefore, the attribute has been defined as being of transitive type in order to allow for remote router configuration control as outlined in chapter 12.

8 Mapping strategies

8.1 Problem statement

The new coarse grained CoS concept of this thesis allows for the cross-domain and cross-layer signalling of supported classes of service, their marking encoding and mapping. The number of supported classes, their networking technology dependent marking and in particular their mapping between networking technologies is chosen independently by different network providers.
The intra-domain mapping between technologies and the inter-domain mapping between class sets of each layer constitute the heterogeneity of the internetworking situation. The aim in all such mappings is that the QoS efforts taken up in one layer or domain should not be destroyed in the other.

Fig. 86 Classification of the Mapping scope

Three general criteria need to be distinguished:
1. QoS enabled networking technology (e.g. DiffServ, Ethernet QoS, MPLS E-LSPs),
2. Class encoding (marking) and
3. Associated parameter and treatment characteristics.

8.1.1 Mapping between different class sets of the same layer

This section outlines the difficulties in mapping operations within the same layer and specifically within the same networking technology.

The simplest case is the 1:1 mapping between class sets with the same number of classes and associated treatment characteristics, but differing markings. Simple mapping can be performed as long as the resulting remarking is set up consistently. Such a mapping would not change the experienced forwarding behaviour, but rather enable providers to freely choose the marking value space. In terms of this new CoS concept, the signalling would include the same technology encoding, the same Marking O, but differing Marking A.

The more likely and complex mapping occurs if n original classes are mapped into m resulting classes. This results in a mapping table with the following categorisation:
• n > m → Class aggregation,
• n = m → 1:1 mapping as described above and
• n < m → Class splitting or Class wastage.

The latter case appears to be the easier one, since no traffic merging is required and class marking can be preserved. The degree of traffic separation, as targeted in this coarse-grained concept, is preserved and no detrimental effects are expected.
This holds true if all incoming traffic flows into the finer-grained QoS enabled AS arrive with the same limited class set n as described above. However, the respective AS merges those incoming traffic flows onto its internal forwarding paths, and the merging might lead to some finer-grained traffic taking up higher priority classes as compared to the coarse-grained ones. For example, a 5:7 mapping as opposed to a 3:7 mapping might end up with a higher link share on the combined forwarding paths. This can be addressed by a priority scaled mapping, where the scaling in the 5:7 case leads to a 40% priority increase and in the 3:7 case to a 133% priority increase (scaling factors of 7/5 = 1.4 and 7/3 ≈ 2.33). The uniform distribution of class priorities in this simple case would assume a linear priority scheme in the original and resulting class sets.

Un-scaled and scaled mapping would leave the additional classes unused. The difference lies only in the decision which classes remain unused. Class splitting, however, would make use of all additional classes and differentiate the incoming traffic further. Such an approach would not rely on the class markings alone, but would perform further classification on additional header fields. Depending on the networking technology header, this could be the protocol or type field, the destination MAC or IP address etc. In general, this could also result in waiving the original classification and performing multi-layer packet inspection and re-classification.

Class aggregation is the opposite case, where existing traffic classes are merged onto a common forwarding treatment. This can either be realised through a remarking function, where several incoming class markings are merged into one resultant marking, or through unchanged marking but merged forwarding treatment. The latter will be referred to as “funnel treatment”. For the considered AS, the resulting forwarding treatment will be the same for each traffic class.
However, the original traffic separation cannot be restored at the outgoing edge of the AS domain in the remarking case, and the smallest degree of traffic separation will prevail along the remaining forwarding path. Marking preservation is therefore the preferred solution. The required funnel treatment can be achieved either through classification merging at each relaying AS-internal node or through tunnelling. In the tunnelling case, either the markings remain untouched within the tunnel and the tunnel marking determines the forwarding behaviour, or separate tunnels are used to de-aggregate the traffic flows and markings. One specific tunnelling solution for all IP networks is specified in RFC 2983 [30] and formerly in RFC 2003 [152]. This IP-in-IP encapsulation is attractive for three reasons:
1. The encapsulated IP header remains entirely unchanged,
2. The mapping result and forwarding header processing are solely based on the outer IP header and
3. All internetworked ASes are capable of providing IP tunnelling as long as their edge devices can handle the encapsulation and decapsulation procedures.

Fundamental drawbacks of this IP tunnelling are the increased processing load on edge devices and the reduced maximum transmission unit (MTU) size for transit payloads. The MTU is reduced by at least 20 bytes due to the outer IP header. The same behaviour can be achieved with the more generic IP-based encapsulation protocol called “Generic Routing Encapsulation (GRE)”, defined in RFC 2784 [72]. IP encapsulation within GRE adds at least another 4 bytes of MTU reduction for the minimal GRE header structure. Existing class aggregation recommendations for DiffServ classes and Ethernet priorities are listed in chapter 8.2. Encapsulation (tunnelling) of transit (customer) traffic is highly recommended by this new coarse-grained CoS concept.
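The marking preservation property of tunnelling and its MTU cost can be sketched as follows. This is a minimal model, not a packet implementation: the `Packet` type, the funnel table and the 1500 byte link MTU used in the usage note are hypothetical, while the 20 byte and 4 byte overheads are the minimal header sizes named above.

```python
from dataclasses import dataclass
from typing import Optional

IPV4_HEADER_BYTES = 20  # minimal outer IPv4 header without options
GRE_HEADER_BYTES = 4    # minimal GRE header


@dataclass(frozen=True)
class Packet:
    dscp: int                         # marking carried in this header
    inner: Optional["Packet"] = None  # encapsulated packet, if any


def encapsulate(packet: Packet, funnel: dict[int, int], default: int = 0) -> Packet:
    """Add an outer header whose DSCP is taken from the funnel table.

    The inner packet is not modified, so its original marking is preserved.
    Markings missing from the table fall back to the default behaviour.
    """
    return Packet(dscp=funnel.get(packet.dscp, default), inner=packet)


def decapsulate(packet: Packet) -> Packet:
    """Return the unchanged inner packet at the tunnel egress."""
    if packet.inner is None:
        raise ValueError("packet is not encapsulated")
    return packet.inner


def payload_mtu(link_mtu: int, use_gre: bool = False) -> int:
    """MTU left for the transit payload after adding the tunnel headers."""
    overhead = IPV4_HEADER_BYTES + (GRE_HEADER_BYTES if use_gre else 0)
    return link_mtu - overhead
```

For a hypothetical transit AS that only offers an EF-like and a default treatment, `encapsulate(Packet(dscp=40), {46: 46, 40: 46})` yields an outer marking of 46, while `decapsulate` restores the packet with its original marking of 40. With an assumed link MTU of 1500 bytes, `payload_mtu(1500)` gives 1480 and `payload_mtu(1500, use_gre=True)` gives 1476.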
IP-in-IP tunnelling, MPLS label switched paths or some kind of Ethernet encapsulation (refer to chapter 4.2) are strongly encouraged for traffic separation preservation and ease of CoS deployment.

8.1.2 Mapping between different class sets of different layers

Service providers who offer Differentiated Services in the IP layer are likely to underpin this approach with QoS mechanisms in lower layers. Typical QoS support mechanisms of such link layer technologies are listed in chapter 4. Ideally, the underlying QoS support exactly matches the granularity and forwarding treatment requirements of the DiffServ PHBs. In practice, however, this can hardly be achieved due to the differing number of available classes and the lack of standardized mappings between the CoS sets of each technology. Service providers will therefore independently decide on the cross-layer mappings and classification policies applied within their networks. Those configuration decisions are under the sole control of the respective providers and are therefore expected to be set up consistently and appropriately. The resulting mapping $\begin{pmatrix} n \\ x \end{pmatrix}$ is a vertical association of n classes of service and their markings to x underlying classes of service and their markings in the lower layer technology. This can be set up in multiple rows if several encapsulating technologies are in place. For the inter-domain case, the n:m mapping within the same layer explained above is therefore multiplied by the number of technology combinations in lower layers (e.g. IP in MPLS-LSP in Ethernet in FR-VC). The matrix $\begin{pmatrix} n:m \\ \vdots \\ x:y \end{pmatrix}$ is the resulting layered mapping table in a fully meshed cross-layer mapping case. Here, each original class within each encapsulation is mapped into a resulting class in each new encapsulation. Such cases need to be addressed in inter-domain cases, where the traffic exchange at an interconnection point is not just IP based, but encompasses inter-domain tunnels.
Here, either the encapsulated class set and marking will prevail, $\begin{pmatrix} n:n \\ \vdots \\ x:y \end{pmatrix}$, or the tunnel CoS will be harmonized, $\begin{pmatrix} n:m \\ \vdots \\ x:x \end{pmatrix}$, or the encapsulated and the tunnelled CoS are forced to be identical between the interconnection partners, $\begin{pmatrix} n:n \\ \vdots \\ x:x \end{pmatrix}$. As mentioned earlier, this intra-domain and possibly inter-domain tunnelled forwarding is particularly important at places where class aggregation needs to be performed. The aggregation would be applied by means of a reduced number of tunnel-based forwarding treatments, but would preserve the traffic separation granularity of the encapsulated traffic. The respective signalling of marking preservation is available through the “P” flag in the QoS Marking Extended Community.

8.2 Existing recommendations

In the course of the work on this thesis, a number of readings and talks revealed that service providers are occasionally involved in QoS related research and discussions, but are reluctant to turn QoS on for inter-domain interconnections. This explains the current gap between the many QoS proposals and the still missing deployment. However, deployment recommendations are incentives and guidelines that make configuration easier and concepts more acceptable. As outlined in chapters 4 and 6, numerous QoS specifications and approaches exist, which have not yet led to QoS enabled interconnections on a large scale. Differentiated Services, however, is widely accepted and increasingly applied within ASes. The same applies to MPLS and its TC bit based DiffServ support. Since the protocols of the IETF dominate the AS interconnection and the capabilities of the devices within the ASes on each side, the following section concentrates on the recommendations given in RFC 4594 [16] and RFC 5127 [45]. Ethernet specifications for user priority support and the respective priority setup specifications follow, based on the definitions of IEEE 802.1D [97].
It should be noted that the large variety of configurable class sets, their encodings and their intra-layer and cross-layer mappings contributes to the complexity and uncertainty of an overall class of service based packet forwarding transport. The lack of a single standardized and globally supported class set and encoding becomes obvious. However, service providers are reluctant to apply such a fixed class of service support and might not be able to define the common CoS base for such a standardization activity. This is the price paid for standing out from each other in competition and for the valuable freedom of administration of the respective AS. On the contrary, the introduction of PHB IDs revealed that service providers even requested a 16 bit encoding instead of the 6 bit DSCP encoding. Since the cost of the currently applied fallback solution of mere over-provisioning is expected to exceed the capital and operational expenditure budgets despite declining interface costs, coarse-grained CoS interconnection with class-based over-provisioning will arise. Existing recommendations will be taken up by larger Internet Service Providers, and their smaller interconnection partners will adopt them. This is claimed to happen for simple DiffServ-based class setups, and a respective recommendation is given in chapter 8.3. Cross-layer mappings will be applied AS-internally, but might evolve inter-domain as well. Especially the IXP-based CoS interconnection is expected to be used, and the new concept provides the means for a consistent automatic setup.

Configuration Guidelines for DiffServ Service Classes - RFC 4594

The fundamental guideline document for the deployment of DiffServ is RFC 4594 [16], which gives configuration guidelines to network operators for sensible service class selection, their usage and their construction out of queueing, traffic management, PHB selection and DSCP marking elements.
All definitions therein are given as recommendations, and the authors point out several times that all of them can be changed and applied differently at the provider’s choice. However, they do stress the importance of consistency for interoperability. The universal deployment of this DiffServ Service Class recommendation is, as the name suggests, based on the “Service Class” approach. That is, all applications that generate similar traffic characteristics and require similar traffic forwarding behaviour can be grouped into classes, the so-called “Service Classes”. The required characteristics of the traffic aggregate are represented by a PHB, which is in turn encoded as a DSCP value. Network traffic has been classified into two groups: “network control traffic” and “user/subscriber traffic”. The first group is divided into two service classes, “Network Control” and “OAM”. As shown in Fig. 87, ten Service Classes have been identified in the “user/subscriber traffic” group. They are grouped into four application categories and compared against the so-called “End-user multimedia QoS categories” as defined by ITU-T in G.1010 [118].

Fig. 87 User/Subscriber Service Classes Grouping - [16]

The service classes are in turn associated with descriptive characteristics as well as with coarse statements about the respective loss, delay and jitter tolerance. No absolute values are given, but the intended treatment can easily be derived from this definition. Fig. 88 depicts the table of associated characteristics.

Fig. 88 Service Class Characteristics - [16]

All service classes are given DSCP encoding names and values together with application examples. This sums up to 20 DSCP encodings for all 12 service classes, as shown in Fig. 89. Lastly, each service class is accompanied by recommendations for the applicable conditioning at the ingress of the network, the queuing and the queue management. The resulting table is depicted in Fig. 90.
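The count of 20 DSCP encodings for 12 service classes can be retraced from the way the standard codepoints are constructed. The sketch below builds the Class Selector (CSn), Assured Forwarding (AFxy) and EF codepoints from their bit layout; the assignment of codepoints to service class names is reproduced from RFC 4594 as recalled here and should be checked against Fig. 89 before it is relied upon.

```python
def cs(n: int) -> int:
    """Class Selector codepoint CSn: n occupies the upper three DSCP bits."""
    return n << 3


def af(x: int, y: int) -> int:
    """Assured Forwarding codepoint AFxy: class x, drop precedence y."""
    return (x << 3) | (y << 1)


EF = 0b101110  # Expedited Forwarding, decimal 46

# Service class names and codepoints as recalled from RFC 4594 (assumption).
SERVICE_CLASSES = {
    "Network Control": [cs(6)],
    "Telephony": [EF],
    "Signaling": [cs(5)],
    "Multimedia Conferencing": [af(4, y) for y in (1, 2, 3)],
    "Real-Time Interactive": [cs(4)],
    "Multimedia Streaming": [af(3, y) for y in (1, 2, 3)],
    "Broadcast Video": [cs(3)],
    "Low-Latency Data": [af(2, y) for y in (1, 2, 3)],
    "OAM": [cs(2)],
    "High-Throughput Data": [af(1, y) for y in (1, 2, 3)],
    "Standard": [cs(0)],
    "Low-Priority Data": [cs(1)],
}

if __name__ == "__main__":
    for name, codepoints in SERVICE_CLASSES.items():
        print(f"{name:24s}", [format(c, "06b") for c in codepoints])
    print(len(SERVICE_CLASSES), "service classes,",
          sum(len(c) for c in SERVICE_CLASSES.values()), "DSCP encodings")
```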
It is important to note that a “Low-Priority Data” Service Class has been defined here with CS1 (‘001000’) encoding. This is of particular interest, since the support of the Lower Effort PDB (RFC 3662 [34]) is expected to become important for inter-domain deployment and matches the recommended set of service classes. In practice, DiffServ support will not start off with all 12 service classes being configured at once. A start with fewer classes and a gradual increase as differing demand arises is likely and is at the same time supported by the recommendations.

Fig. 89 DSCP to Service Class Mapping - [16]

Fig. 90 QoS Mechanisms Used for Each Service Class - [16]

Given the flexible and coarse design of the service class recommendations, their readiness for application as well as the authors’ reputation and company support, such class sets are likely to be configured and deployed within provider networks. Since standard DSCP values are specified in the configuration guidelines, smooth interworking between providers’ networks is feasible, possibly with a reduced number of service classes.

Aggregation of DiffServ Service Classes – RFC 5127 [45]

As described above, service providers are likely to adopt the DiffServ recommendations given in RFC 4594. However, the possible granularity of 12 service classes with up to 20 DSCP encoding values is far too detailed for core network areas. RFC 5127 [45] therefore gives guidelines on how multiple service classes may be aggregated into a smaller set of so-called “forwarding treatment aggregates”. The usage and DSCP encoding of RFC 4594 based Service Classes is assumed, and the funnel mapping of several differently DSCP encoded traffic flows with about the same forwarding treatment requirements into one treatment aggregate is specified. Marking preservation is expected. Treatment aggregates are created by funnel mapping of Service Classes grouped by loss, delay and jitter requirements. Fig. 91 shows, as an example, the funnel mapping of the 12 Service Classes into four treatment aggregates.

Fig. 91 Treatment Aggregate / Service Class Performance Requirements - [45]

Fig. 92 in turn adds the funnel mapping of the inter-DSCP association for the Treatment Aggregate Behaviour.

Fig. 92 Treatment Aggregate Behaviour - [45]

As mentioned above, such bundling of Service Classes into coarser forwarding treatments is typically deployed in high capacity core networks, where statistical multiplexing of similar traffic streams into a few forwarding classes combined with moderate over-provisioning is manageable and sufficient. Fine-grained QoS schemes in the core do not scale, increase the processing load, hamper debugging etc. and are thus not acceptable to network providers. The example of four treatment aggregates, however, seems to be a realistic, feasible and acceptable compromise.

Fig. 93 MPLS E-LSP mapping of Treatment Aggregates - [45]

The inter-domain case is also addressed briefly in the recommendation, which states that peering parties need to agree on the exact treatment aggregate content and representation. This results in mutual agreements and limits the extent of the QoS enabled interconnection path. Furthermore, the RFC targets not just traffic separation, but rather Service Class based interconnection, possibly including the MIT QoS scheme (see chapter 6) combined with precise parameter bounds as given in Y.1541 [108]. Both drawbacks are understandable for this higher class QoS interconnection, but distract from the aim of this thesis. RFC 5127 stresses the recommendation of marking preservation despite treatment aggregate based forwarding. This is completely in line with this thesis’ strong recommendation of tunnelled customer transport. As a common example, the RFC documents a mapping suggestion of treatment aggregates into the E-LSPs of an MPLS domain. This choice is realistic, since most providers nowadays use MPLS in their networks.
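The funnel mapping with marking preservation can be sketched as a lookup that selects the forwarding treatment without touching the DSCP. The grouping of codepoints into the four treatment aggregates below follows RFC 5127 as recalled here and should be verified against Fig. 91 and Fig. 92; the fallback of unlisted codepoints to the Elastic aggregate and the function names are assumptions of this sketch.

```python
# DSCP values (decimal) per treatment aggregate, as recalled from RFC 5127.
TREATMENT_AGGREGATES = {
    "Network Control": {48},                           # CS6
    "Real-Time": {46, 40, 34, 36, 38, 32, 24},         # EF, CS5, AF4x, CS4, CS3
    "Assured Elastic": {26, 28, 30, 18, 20, 22, 16,    # AF3x, AF2x, CS2,
                        10, 12, 14},                   # AF1x
    "Elastic": {0, 8},                                 # Default, CS1
}


def treatment_aggregate(dscp: int) -> str:
    """Select the treatment aggregate for a DSCP value."""
    for name, codepoints in TREATMENT_AGGREGATES.items():
        if dscp in codepoints:
            return name
    return "Elastic"  # assumed fallback for codepoints not listed above


def forward(dscp: int) -> tuple[int, str]:
    """Funnel treatment: the marking is returned unchanged with its aggregate."""
    return dscp, treatment_aggregate(dscp)


if __name__ == "__main__":
    for dscp in (48, 46, 26, 10, 8, 0):
        print(forward(dscp))
```

Because `forward` never rewrites the DSCP, the original 20 encodings remain distinguishable at the egress of the aggregating domain.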
Fig. 93 depicts the proposed mapping table of the 20 possible DSCP values of RFC 4594 into six TC bit (former EXP bit [9]) encodings of the E-LSP. Four traffic aggregates are claimed to be supported, but the default and the assured elastic aggregates are each sub-differentiated into two dropping levels. Again, the lower effort CS1 encoding is consistently used as the forwarding treatment with the lowest priority. All stated aggregate selections, DSCP encodings and TC bit associations are explicitly defined as recommendations, and no strong commitment is requested from service providers.

IEEE 802.1D user priority mapping definitions

Ethernet QoS, in contrast to the loose mapping commitment mentioned above, precisely specifies the available user priorities, the mapping to strict priority forwarding queues and the priority regeneration rules. Those specifications are part of IEEE 802.1D [97] and have already been addressed in chapter 4.2. Ethernet priority tagged frames offer 3 bits for eight available forwarding priorities (see Table 5). A strict mapping is given for switch devices which support eight or fewer internal queues (see Table 6). Furthermore, the standard allows for priority reassignment by means of a configurable regeneration table. The default configuration simply maps the eight classes to each other without changes. However, the combination with IP and/or MPLS based treatment aggregates could well be consistently reflected by mapped user priority settings. Hence, the queue mapping table can in turn be applied for treatment aggregation in single step granularities between one and eight resulting priorities (see Table 14).

Table 14 Queue mapping reuse for priority mapping

Systematic QoS Class Mapping Framework

Although not widely deployed, the “Systematic QoS Class Mapping Framework” [150] of the Information and Communications University in Korea is worth mentioning.
It targets the QoS class mapping in two steps: “Parameter-to-Class Mapping” at the forwarding path ingress and “Class-to-Class Mapping” at the network borders along the way.

Fig. 94 QoS Class Mapping framework - [150]

For this thesis’ mapping analysis, the framework’s class-to-class mapping is of importance. Since no signalling of supported classes and their encoding is included, the framework is based on the six ITU-T defined IP QoS classes [108]. In order to extend this limited class set, the framework introduces a “Location Information (LI)” and performs a parameter-to-LI mapping at the entry and an LI-to-class mapping at each relay boundary. The LI is encoded as an integer in a redefined IP TOS field and addresses a 4x16 (loss x delay bounds) matrix. However, this interesting approach is unlikely to gain community acceptance because of the TOS field redefinition and the required LI interpretation and mapping function within all participating interconnected service provider networks.

ITU-T NGN Focus Group - Proceedings Part II

The NGN Focus Group within ITU-T has also developed a mapping for its six defined IP QoS classes, which is documented in [109]. However, the mapping concept is limited to the respective classes 0..5 and considers the mapping into four ATM, Frame Relay and UMTS QoS classes. It is not generally applicable to DiffServ based networks and will therefore not be considered in detail.

QoS class mapping in Cisco devices

The vendor Cisco gives detailed instructions on how to enable and configure QoS support in Ethernet switches (see [55]). This includes configurable mapping tables for Ethernet priority (called “CoS”) to DSCP mappings with default settings. As Table 15 and Table 16 show, Cisco relies on the upper 3 bits of the DSCP value, which correspond to the class selector (former IP precedence) values.
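The relation between the 3 bit Ethernet priority and the upper three DSCP bits can be sketched with two shift operations. This reflects only the default relation described in the text, not the configurable tables of any particular device; the function names are hypothetical.

```python
def cos_to_dscp(cos: int) -> int:
    """Map a 3 bit priority to the class selector DSCP with the same upper bits."""
    if not 0 <= cos <= 7:
        raise ValueError("CoS must be in 0..7")
    return cos << 3


def dscp_to_cos(dscp: int) -> int:
    """Funnel map a 6 bit DSCP to the priority given by its upper three bits."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    return dscp >> 3


if __name__ == "__main__":
    print([cos_to_dscp(c) for c in range(8)])
    print(dscp_to_cos(46), cos_to_dscp(dscp_to_cos(46)))
```

The round trip for EF shows the asymmetry of this default relation: DSCP 46 is funnelled to priority 5, but priority 5 maps back to the class selector value 40 rather than to 46.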
The eight priority values are exactly matched to the respective class selectors, and in the opposite mapping direction, funnel mapping based on the upper 3 bits of the DSCP value is performed. Interestingly, one of the most important DSCP encodings, the one for EF traffic (= 46), is not directly covered. However, the mapping tables are configurable, and the fully Cisco equipped Chemnitz University, for example, applied a slightly modified mapping as shown in Table 17.

Table 15 Cisco’s default CoS-to-DSCP mapping [55]

Table 16 Cisco’s default DSCP-to-CoS mapping [55]

Table 17 Chemnitz University applied CoS-to-DSCP mapping

MPLS capable devices are likewise challenged to define mappings between DSCP encodings and the 3 bit E-LSP encoding. For reasons of consistency and simplicity, service providers will configure the DSCP <-> CoS and DSCP <-> TC mappings identically. It should be noted that this is a best common practice approach and not a standard. RFC 3270 [75] describes the support of Differentiated Services within MPLS domains. This includes E-LSP and L-LSP mappings. The E-LSP mapping, as mentioned above, is freely configurable, but requires a consistent mapping strategy throughout an MPLS domain. If this is not the case for interconnected domains, consistent remapping at ingress and/or egress needs to be taken care of. By default, if no mappings are configured, all encodings map to the Default PHB treatment. Special care needs to be taken for L-LSP encodings, since the 20 bit label information and the 3 bit TC information are combined for QoS treatment selection. L-LSP paths are associated with PHB types, which influences the embedded TC bit encoding. A mandatory mapping exists for TC bit encodings, which must be followed in outgoing marking and incoming classification. Fig. 95 depicts the bidirectional mapping requirement.

Fig. 95 Mandatory L-LSP encoding rules - [75]

8.3 Coarse grained CoS mapping recommendations

The new coarse-grained CoS concept emphasizes traffic separation, simplicity and, implicitly, customer traffic tunnelling. Service Class specifications combined with treatment aggregation create the base for the selection of the simple class sets proposed within the concept. The widespread deployment and precise mapping of Ethernet priorities (see Table 14) adds a possible mapping strategy for dynamic class set granularity. A set of eight class encodings is widely available in QoS enabled networking technologies and might be considered a realistic upper bound. However, much simpler setups are envisioned for the upcoming inter-domain class of service support. The default behaviour (“best effort (BE)” PHB) will and must always be available and defaults to an encoding of zero in all technologies. Direct mappings or funnel mappings with unspecified encoding ranges must always map the unknown codepoints into the default behaviour. Secondly, the lower effort (LE) behaviour is strongly encouraged for the inter-domain use case, since the existing BE interconnection can easily be augmented with an exchange of lower value traffic. Hence, the first and simplest recommended class set consists of the BE & LE classes, encoded as ‘000000’ / ‘000’ and ‘001000’ / ‘001’ in the six or three bit representations. As for the mapping between BE & LE and other class sets, LE will be mapped into the LE class and all other classes into BE, resulting in separate enqueuing for both supported classes with lower scheduling priority and higher dropping priority for LE. In purely Ethernet related setups, however, BE and LE would both be enqueued in the same queue, but with possibly differing dropping probabilities. This single queue mapping conforms to the Service Class concept, but is less recommended for the simple BE & LE class set case.
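The BE & LE class set and its two enqueuing variants can be summarised in a short sketch. The queue names, the two-level drop precedence and the function signature are hypothetical; only the encodings and the rule that every codepoint other than LE falls back to the default behaviour are taken from the text.

```python
from typing import NamedTuple

BE = 0b000000  # best effort, three bit representation '000'
LE = 0b001000  # lower effort, three bit representation '001'


class Treatment(NamedTuple):
    queue: str
    drop_precedence: str


def be_le_treatment(dscp: int, single_queue: bool = False) -> Treatment:
    """Classify a DSCP for a BE & LE class set.

    Every codepoint other than LE, including unknown ones, is treated as BE.
    With single_queue=True (the purely Ethernet related setup) both classes
    share one queue and differ only in their drop precedence.
    """
    is_le = dscp == LE
    drop = "high" if is_le else "low"
    if single_queue:
        return Treatment("shared", drop)
    return Treatment("LE" if is_le else "BE", drop)


if __name__ == "__main__":
    for dscp in (BE, LE, 0b101110):
        print(format(dscp, "06b"), be_le_treatment(dscp),
              be_le_treatment(dscp, single_queue=True))
```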
A second class setup is envisioned, which does not make use of LE, but rather introduces EF as the high priority, delay sensitive forwarding class and AF as the medium priority forwarding class. The EF encoding of ‘101110’ would normally map to a priority of 5 (‘101’). However, for Ethernet related enqueuing and mapping, a priority of 6 (‘110’) is recommended. The AF support is not associated with a single precise DSCP encoding. The four AF classes with three dropping priorities each result in 12 possible DSCP values. For reasons of simplicity, however, all AF encodings may be funnel mapped into a single 3 bit encoding; ‘100’ is recommended for Ethernet related enqueuing and mapping. Lastly, a combination of all four basic behaviours (LE, BE, AF and EF) is recommended as inter-domain class of service support. It combines the mapping of the EF & AF & BE case with separate low priority enqueuing for LE (‘001000’ / ‘001’). The described class sets and mappings are recommendations only and can be freely adapted to specific needs. In such cases, however, consistent signalling of the supported encodings and cross-layer mappings by means of the QoS Marking Extended Community is required. If virtual channel encodings are used in the interconnection case, the providers at both ends need to abide by the mandatory marking requirements for ATM, FR and MPLS L-LSPs as defined in RFC 3270 [75].

9 Simulation results

The new cross-domain and cross-layer coarse-grained Quality of Service support concept aims at a general and global deployment. Potentially all kinds of traffic will therefore be carried along short as well as long multi-AS forwarding paths. This universal usage cannot be covered by simulation. However, the following section documents the comparison of the new concept’s improved class of service support with the current best effort only forwarding, using simple topology examples with extensively simulated configuration combinations.
The actual signalling of marking information does not need to be simulated, because the proven BGP UPDATE message based signalling is not fundamentally influenced by the addition of a few Extended Communities. Practical feasibility and scalability are of more concern here and are addressed in real world tests in chapter 11. Two simulation sections follow below, which address the resulting QoS marking and forwarding behaviour as well as the functionality of token bucket ingress limitation filters.

9.1 Setup selection for QoS marking and forwarding

The packet transmission of flows of different traffic types across differently configured single nodes as well as ASes has been simulated extensively. Besides the network topology, further parameters have been varied, such as the number of traffic sources with appropriate traffic markings, the number of supported classes of service along the forwarding path as well as the scheduling and queuing configurations applied within the relaying nodes. The recommended simple CoS setup consisting of at most the LE, BE, AF and EF coarse-grained classes has been used. Three queuing disciplines have been varied: “no priority = round robin queuing”, “strict priority queuing” and “class based weighted fair queuing”. Varying those parameters resulted in more than 3000 simulations, which will be documented using a few selected examples. The complete result set is available upon request. The simulation of the QoS marking and forwarding influence follows in scenarios 1 to 7 in different topological granularities. Starting from a single node interconnection, the topology grows into interconnected ASes and multiple interconnected ASes up to a chain of four interconnected transit ASes with multiple stub ASes at either side. Scenario 7 concludes the marking and forwarding simulation section with the comparative simulation of cross-layer interworking between IP and Ethernet QoS enabled forwarding.
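The qualitative difference between strict priority queuing and class based weighted fair queuing at a bottleneck can be illustrated with a simple fluid model. This is a back-of-the-envelope sketch and not the packet-level simulation used in this chapter: the bottleneck capacity, the greedy AF demand and the queue weights in the usage note are hypothetical values chosen for illustration.

```python
def strict_priority(capacity: float, demands: dict[str, float]) -> dict[str, float]:
    """Serve classes in dict order (highest priority first) until capacity runs out."""
    shares = {}
    remaining = float(capacity)
    for name, rate in demands.items():
        shares[name] = min(rate, remaining)
        remaining -= shares[name]
    return shares


def weighted_fair(capacity: float, demands: dict[str, float],
                  weights: dict[str, float]) -> dict[str, float]:
    """Weighted max-min fair shares; capacity unused by one class is redistributed."""
    shares = {name: 0.0 for name in demands}
    active = set(demands)
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(weights[c] for c in active)
        fair = {c: remaining * weights[c] / total_weight for c in active}
        satisfied = {c for c in active if demands[c] <= fair[c]}
        if not satisfied:
            # every remaining class is limited by its weighted share
            for c in active:
                shares[c] = fair[c]
            break
        for c in satisfied:
            shares[c] = demands[c]
            remaining -= demands[c]
        active -= satisfied
    return shares


if __name__ == "__main__":
    demands = {"EF": 79.0, "AF": 2000.0, "BE": 1323.2, "LE": 664.0}  # kbps
    weights = {"EF": 4, "AF": 3, "BE": 2, "LE": 1}                   # hypothetical
    print(strict_priority(2000, demands))
    print(weighted_fair(2000, demands, weights))
```

With these assumed numbers, strict priority leaves no capacity for BE and LE once the greedy AF class is served, whereas the weighted scheme serves EF completely and divides the rest among AF, BE and LE in proportion to their weights.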
The simulator OMNeT++ [12] has been used for the majority of the simulations. It is a modular discrete event simulation framework programmed in C++, which is freely available for academic research as open-source software. The so-called “INET Framework for OMNeT++” creates the base for Internet protocol modelling and simulation. Additionally, two further modules have been used for realistic VoIP modelling and for topology generation in Autonomous System based simulations. The Voice over IP traffic generation makes use of the “voiptool”, which is documented and available at [36]. The AS based topology generation is done by the “Realistic Simulation Environment for IP-based Networks (ReaSE)” [83]. Not readily available were the DSCP marking option for simulated sources, DSCP based enqueuing in multi-queue setups, class based weighted fair queuing, class based throughput metering and class based enqueuing and scheduling within the Ethernet switch simulation module. The respective work is documented in [134]. As mentioned above, the simulation abstracts the numerous traffic sources of real world scenarios into the recommended four basic traffic types:
• EF - Expedited Forwarding,
• AF - Assured Forwarding,
• BE - Best Effort and
• LE - Lower Effort.
The EF class is normally used for applications that are critical with respect to delay and delay variation, such as Voice over IP, video conferencing etc. To get as close as possible to realistic VoIP packet streams, the voiptool [36] has been used to feed the EF sources and to record the received packets at the respective sink. This tool actually sends out audio samples read in from a wave file and records the received wave packets at the EF sink. Afterwards, the received wave file can be compared to the original one in order to calculate the perceived audio quality, expressed in “Perceptual evaluation of speech quality (PESQ)” values [119]. If not stated differently, the simulation parameters were as given in Table 18.
Table 18 Traffic source configuration parameters

Traffic type: VoIP (EF source with differing DSCP and PESQ meter)
  Coding rate: 40000 bps
  DSCP: 101 000 (0x28), mapped to ‘110’
  Sending interval: 8 ms
  Packet size: 79 Byte
  Sending rate: 79 kbps

Traffic type: CBR0 (BE source)
  DSCP: 0
  Sending interval: 5 ms
  Packet size: 827 Byte
  Sending rate: 1.323200 Mbps

Traffic type: FTP (TCP based AF source with differing DSCP)
  Port: 21
  DSCP: 010 000 (0x10)
  Packet size: 1044 Byte
  Sending rate: variable
  TCP type: Reno
  Maximum Segment Size: 1024
  Advertised Window: 14336

Traffic type: CBR1 (LE source with drop rate meter)
  DSCP: 001 000 (0x08)
  Sending interval: 1 ms
  Packet size: 83 Byte
  Sending rate: 664 kbps

The differing DSCP encodings for the four recommended forwarding behaviours are due to the fact that the simulator classification for enqueuing is based on the upper 3 DSCP bits only, hence the truncated EF and AF encodings. Waiting queues are drop tail queues by default, and the metering resolution of the throughput meter was set to 250 ms. Depending on the number of configured waiting queues in the different simulation runs, the queue mapping strategy follows the one given in Table 14. Thus, the DSCP marking for EF traffic is mapped to priority 6 (‘110’) for smooth mapping into the IEEE VO priority class. The parameter combinations used within each topology, described later, are arranged in Table 19. The sources are started gradually to create different traffic mixes, and the columns (a) to (f) show the supported classes (including separate enqueuing) along the forwarding path. All combinations were simulated with three scheduling schemes: no priority (round robin), strict priority and class based weighted fair queuing. The latter requires the setup of queue weights, which are given in the last row for the respective supported class set in each column.
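The sending rates of the constant rate sources in Table 18 follow directly from packet size and sending interval, and the classification of the truncated encodings follows from the upper three DSCP bits. The sketch below retraces both; the `Source` type and its property names are hypothetical, and the remapping of the EF priority from 5 to 6 encodes the mapping stated in the text rather than actual simulator code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    dscp: int
    packet_bytes: int
    interval_ms: float

    @property
    def rate_kbps(self) -> float:
        """Sending rate; bits per millisecond are numerically equal to kbit/s."""
        return self.packet_bytes * 8 / self.interval_ms

    @property
    def upper_bits(self) -> int:
        """Upper three DSCP bits, on which the enqueuing classification is based."""
        return self.dscp >> 3

    @property
    def priority(self) -> int:
        """Priority after mapping the EF value 5 ('101') to 6 ('110')."""
        return 6 if self.upper_bits == 5 else self.upper_bits


SOURCES = [
    Source("VoIP (EF)", 0x28, 79, 8),
    Source("CBR0 (BE)", 0x00, 827, 5),
    Source("CBR1 (LE)", 0x08, 83, 1),
]

if __name__ == "__main__":
    for s in SOURCES:
        print(f"{s.name}: {s.rate_kbps:.1f} kbps, priority {s.priority}")
    print(f"sum of fixed rates: {sum(s.rate_kbps for s in SOURCES):.1f} kbps")
```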
Table 19 Class and traffic type variations in simulations

9.2 Simulation results for QoS marking and forwarding

The following sections document some simulation results for the setups described above. Six scenarios differ in topological complexity with the resulting traffic load and simulation results. Congestion is artificially caused by differing link capacities in order to demonstrate the influence of even this simple class of service support on the resulting forwarding quality. Scenario 6 also addresses the mapping between differently set up class support along the AS forwarding chain in the light of marking preservation as opposed to packet remarking. Scenario 7 documents the influence of CoS setup harmonization between different networking technologies. IP QoS and Ethernet QoS are combined in this cross-layer simulation, which required modified link capacities and adapted source sending rates.

9.2.1 Scenario 1: single node interconnection

The topology of the first scenario is deliberately kept simple in order to demonstrate the sources and meters used. Simulation results are obtained which allow a direct mapping between source characteristic, relaying behaviour and reception result. Fig. 96 depicts the single node interconnection of four sources, mapped 1:1 to their respective sinks, and a bottleneck link between two routers. Router 0 is the contention point, where packet losses occur.

Fig. 96 Scenario 1: single node interconnection

Fig. 97 shows, as an example, the sending and receiving characteristic of the VoIP source in the four class CoS support case. Slight variations in the sending rate are due to the audio encoding characteristics of wave files, where a larger packet with some replay overhead is followed by three smaller audio sample packets. A PESQ value of 4.334 and an LE drop rate of 83.8 % have been achieved. Fig. 98 depicts the same situation but with no CoS support.
All traffic types are roughly equally affected by the packet drops, which balances the loss rate according to the sourced traffic load. A PESQ value of 4.329 and an LE drop rate of 30 % have been obtained.

Fig. 97 S1: 9-f-cbwfq
Fig. 98 S1: 9-a-no-priority
Fig. 99 S1: 9-f-cbwfq
Fig. 100 S1: 9-a-no-priority

Fig. 99 depicts the resulting traffic mix after the contention point. It can be seen that all traffic classes are transmitted. The rather slow EF traffic passes through without discrimination. The TCP type AF traffic gets a fair capacity share according to its queue weight. Best effort traffic is limited to below 500 kbps, and LE traffic gets a minimal share. As stated above, about 80% of the LE traffic is discarded. Fig. 100 gives a different picture. The EF traffic varies as already shown in Fig. 98. However, BE traffic uses up most of the capacity, followed by LE traffic. The TCP based AF traffic is completely starved out. This is due to the fact that BE and LE combined have been configured to exceed the bottleneck link capacity. However, starvation effects can also result from CoS deployment with strict priority queueing. This is documented in Fig. 101. EF traffic gets excellent forwarding, and the TCP based AF traffic uses all the remaining capacity. BE and LE traffic die off.

Fig. 101 S1: 9-f-strict-priority

9.2.2 Scenario 2: AS interconnection – Single AS

Scenario 2 uses a topology of two stub ASes (AS00 and AS20) and one transit AS (AS10). Each AS is simulated with double sources and sinks. No crossing traffic is modelled in AS10, so that a simple router model can be applied.

Fig. 102 Scenario 2: AS Interconnection – Single AS
Fig. 103 S2: 9-f-cbwfq
Fig. 104 S2: 9-a-no-priority

Fig. 103 and Fig. 104 show the resulting throughput graphs sorted by traffic class. The qualitative result is comparable to the one in Scenario 1. However, the increase of the usage rate (ur) in the CBWFQ case needs to be explained.
The sending rate of all fixed rate sources sums up to about 4.1 Mbps. This rate is roughly achieved in the non priority case on the 10 Mbps link between AS00 and AS10. In the diagrams, the rates before the bottleneck carry index ‘1’ and the ones after AS10 carry index ‘2’. As Fig. 104 shows, TCP traffic is almost completely starved out. Packet drops trigger TCP’s congestion avoidance mechanism, which leads to an ever decreasing sending rate. In the CBWFQ case, the configured fair capacity share is given to the AF class, which leads to a sustainable TCP throughput rate. Before the bottleneck, the non-responsive CBR sources fill the 10 Mbps link to their nominal sending rate – hence the higher link usage.

Fig. 105 S2: 9-b-cbwfq
Fig. 106 S2: 9-e-cbwfq

Fig. 105 depicts a result similar to that of the four class CBWFQ setup. This time, only two traffic classes (BE & EF) were supported and the classifier would map AF into the EF queue. IEEE based classifiers would therefore require an AF group encoding of ‘100xxx’. However, EF&AF&BE would be a recommended class set, which results in the very acceptable forwarding behaviour depicted in Fig. 106.

9.2.3 Scenario 3: AS interconnection – Multi-AS

Scenario 3 uses a topology of six stub ASes (AS00, AS01, AS02 and AS20, AS21, AS22) and two transit ASes (AS10 and AS11).

Fig. 107 Scenario 3: AS interconnection – Multi-AS
Fig. 108 S3: 9-f-cbwfq
Fig. 109 S3: 9-a-no-priority

Fig. 108 and Fig. 109 clearly demonstrate the superior behaviour of simple CoS enabled forwarding. The increased number of sources, whose traffic is mixed on the transit links of AS10, smooths out the multiplexed traffic class throughput graphs.

9.2.4 Scenario 4: AS interconnection – Multi-AS 2

Scenario 4 is a slight modification of the topology of scenario 3. The transit path includes one more AS and the transit links decrease in capacity.
This results in two contention points and multiplies the class separation effect. The results for BE only support in AS10 and AS11, compared to identical four class support in both ASes, are not reproduced here, since each case shows the already familiar characteristic throughput graph. It is of more interest to vary the supported class sets along the AS chain.

Fig. 110 Scenario 4: AS interconnection – Multi-AS 2

Fig. 111 and Fig. 112 reveal that the order in which the limited class sets are passed does matter, even in a symmetrical topology. In Fig. 111, all four traffic classes are taken care of in AS10 before the non-prioritized BE only forwarding in AS11 occurs. Fig. 112 takes the opposite configuration and performs worse, particularly for the TCP based traffic.

Fig. 111 S4: CBWFQ / no priority
Fig. 112 S4: no priority / CBWFQ

The reason for this behaviour is the resulting traffic mix after AS10. In the first case, a high percentage of AF traffic remains in the mix, which is then equally discriminated in AS11. In the latter case, hardly any AF traffic is contained in the mix after AS10, and what remains gets starved out in AS11.

9.2.5 Scenario 5: AS interconnection – Multi-AS 3

Scenario 5 is again a slight modification of the topology of scenario 4. A further transit AS has been introduced, accompanied by a further link capacity reduction.

Fig. 113 Scenario 5: AS interconnection – Multi-AS 3

As Fig. 114 and Fig. 115 show, the advantage of late CoS bottlenecks along a forwarding path remains. However, each introduced transit AS has detrimental effects on the variation and the throughput level of rate adapting sources. Fig. 116 and Fig. 117, in contrast, document the case where the CoS bottlenecks at least separate two traffic classes. The resulting traffic characteristics and throughput numbers underline the thesis’ strong request for coarse-grained inter-domain CoS support.

Fig. 114 S5: 2x CBWFQ / no priority
Fig.
115 S5: no priority / 2x CBWFQ
Fig. 116 S5: 2x CBWFQ / EF&BE
Fig. 117 S5: EF&BE / 2x CBWFQ

9.2.6 Scenario 6: AS interconnection – Multi-AS 4

Scenario 6 is again a slight modification of the topology of scenario 5. A further transit AS has been introduced, accompanied by a further link capacity reduction. No further fundamental insight is expected from this extended topology, except for the comparison of remarking and non-remarking simulation results. All scenario simulations and results so far assumed that enqueued traffic keeps its class marking regardless of the respective class support at each relay node. Although this concept strongly recommends such funnel mapping and marking preservation, network operators are free to remark packets as they traverse their AS.

Fig. 118 Scenario 6: AS interconnection – Multi-AS 4

The remarking behaviour is therefore documented by means of two selected examples. It is assumed that class mappings for enqueuing and class remarking are applied identically by the respective CoS limited AS.

Fig. 119 S6: 2 classes w/o remark.
Fig. 120 S6: 2 classes with remark.

Fig. 119 and Fig. 120 depict the result when the CoS bottleneck performs either marking preservation or remarking. In the latter case, traffic classes are either upgraded (as shown with the AF traffic being remarked into EF type) or downgraded. Subsequent forwarding behaviour can therefore no longer be applied specifically per class type. The situation gets worse in the best effort only support case, where all traffic passes through a single (BE) class. As clearly shown in Fig. 121 and Fig. 122, the forwarding within AS 12 applied single class funnel classification without remarking on the left and with remarking on the right. In the remarking case, subsequent transit ASes are no longer able to distinguish the packets, which are all marked BE. This effectively results in traditional best effort forwarding along the remaining transit path segments.
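The difference between marking preservation and remarking described above can be illustrated with a minimal sketch. It is not taken from the simulation code. The function name `traverse` and the table `FUNNEL` are hypothetical, and the funnel mappings are assumptions for illustration only (the mapping of AF into the EF queue in the two class case follows the text, the mapping of LE into BE is assumed).

```python
# Assumed funnel mappings per supported class set (illustration only).
FUNNEL = {
    frozenset({"EF", "AF", "BE", "LE"}): {"EF": "EF", "AF": "AF", "BE": "BE", "LE": "LE"},
    frozenset({"EF", "BE"}): {"EF": "EF", "AF": "EF", "BE": "BE", "LE": "BE"},
    frozenset({"BE"}): {"EF": "BE", "AF": "BE", "BE": "BE", "LE": "BE"},
}


def traverse(marking, as_chain, remark):
    """Follow one packet along a chain of ASes.

    as_chain lists the class set supported by each AS in forwarding order.
    Returns the final packet marking and the queue used in each AS.
    """
    queues = []
    for supported in as_chain:
        queue = FUNNEL[frozenset(supported)][marking]
        queues.append(queue)
        if remark:
            # Remarking overwrites the class information carried by the packet.
            marking = queue
    return marking, queues


if __name__ == "__main__":
    chain = [["BE"], ["EF", "AF", "BE", "LE"]]  # BE only AS followed by a four class AS
    print(traverse("AF", chain, remark=False))
    print(traverse("AF", chain, remark=True))
```

With marking preservation, the AF packet is enqueued as BE in the first AS but regains AF treatment in the four class AS. With remarking, it stays BE for the rest of the path, which corresponds to the loss of class distinction described for Fig. 122.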
Fig. 121 S6: 1 class w/o remarking
Fig. 122 S6: 1 class with remarking

9.2.7 Scenario 7: AS interconnection – Cross-Layer

Scenario 7 addresses the cross-layer marking and mapping challenge, which arises with any underlying transport networking technology. Since AS interconnection is increasingly based on Ethernet links (either IXP platforms or point-to-point links), the example focuses on the interworking of IP QoS and Ethernet QoS. Fig. 123 depicts the selected topology, where AS00 and AS20 are interconnected across an Ethernet switch.

Fig. 123 Scenario 7: AS interconnection – Cross-Layer

The introduction of an Ethernet model requires the selection of 10 or 100 Mbps link capacities. However, this requires an increase of EF load on the network as well. An aggregated traffic load of 2.64 Mbps VoIP, 13.2 Mbps CBR0 and 8.43 Mbps CBR1 has been chosen. Furthermore, a ‘110’ mapping has been applied for the VoIP traffic in order to conform to the IEEE voice encoding.

Fig. 124 S7: with Ethernet QoS
Fig. 125 S7: without Ethernet QoS

Fig. 124 and Fig. 125 depict the resulting CoS based throughput in the traffic mix after the 4 queue CoS switch and after the BE only switch, respectively. The best effort only switch virtually destroys the prioritization for EF and AF traffic and favours the high volume CBR traffic type. Therefore, the underlying CoS support with consistent cross-layer mapping is important for the successful overall performance. It is a constituent part of the proposed CoS concept.

9.3 Setup selection for token bucket ingress filtering

Class of Service support will require class overload protection using token bucket based ingress rate limitation. Since OMNET++ does not support token bucket filtering, the “Network Simulator 2 (ns2)” [145] has been used for these simulations. The simulator’s classification scheme differs from the one used in OMNET++ and generally refers to so-called “flow IDs (fid)”.
Those source-to-destination flow identifiers allow for the mapping of packets into DiffServ DSCP encodings in DiffServ enabled nodes. Such DiffServ domains are modelled by means of three nodes: the edge node performing ingress classification, the core node performing CoS based enqueuing and dropping, and the outgoing edge node performing CoS based dequeuing and rate limited priority scheduling. Those nodes can be abstracted into a single ingress rate limited node when they are arranged in a single forwarding line. Fig. 126 depicts the resulting simulation topology.

Fig. 126 Single node structure with token bucket filtering

The rate limitation by means of the ns2 token bucket implementation is a combination of token bucket based metering, with the resulting increased dropping probability for excess traffic, and a strict rate limitation on CoS enabled enqueuing. Table 20 lists the applied parameters, whereby the queue rate is the limitation setting for each of the four traffic types.

Table 20 Simulation parameter settings (3 Mbps link)

All simulated packets were of 1000 byte size. Strict priority queuing with the highest queue being queue 1 was applied.

9.4 Simulation results for token bucket ingress filtering

The simulation of this single node example yielded the expected rate limitation result and is documented in Fig. 127, Fig. 128, Fig. 129 and Fig. 130. The sending behaviour of the four sources remains the same, but the mapping groups LE into BE (4:3), then additionally AF with LE into BE (4:2) and lastly all traffic classes into BE (4:1). In the legend of the graphs, the curves labelled “Before TB” show the throughput observed before the token bucket limitation and the curves labelled “At Dest” show the throughput received at the destination. All but Class 2 have constant sending rates. The AF source is a TCP source, which always sends as fast as possible.
The sending and queueing rates are deliberately chosen above the physical bottleneck link capacity in order to demonstrate the hard rate limitations (e.g. of 0.5 Mbps for EF) combined with priority based queuing and dropping.

Fig. 127 Single node TB 4->4
Fig. 128 Single node TB 4->3

The start time of each source is varied to demonstrate the source’s influence. EF starts first and is limited to a “committed information rate (CIR)” of 500 kbps. This slightly reduces the constant sending rate of 600 kbps. It has the highest priority and, as long as there are two queues available, hardly notices any surrounding traffic changes. It is a typical rate limited EF forwarding service with best forwarding quality. Secondly, a constant bit rate class of type LE is turned on with a nominal sending rate of 2 Mbps. However, token bucket limitation is applied for the selective LE queue in Fig. 127. This limits the LE traffic to just 400 kbps. Thirdly, a third constant bit rate class of type BE is turned on with a nominal sending rate of 2.5 Mbps. Its token bucket limitation is set to 2.5 Mbps in the 4 class case. However, token bucket limitation is applied for the selective BE queue in Fig. 127. This limits the BE traffic to 2.2 Mbps. The bottleneck link with 3 Mbps capacity is completely used up. Only 300 kbps remain for the Lower Effort class. Lastly, the TCP source starts sending with the highest achievable rate and a priority of 4. The rate limitation of 2.2 Mbps is not reached because of the congestion avoidance mechanism in TCP. LE type traffic gets starved out and BE traffic is reduced to about 800 kbps. The TCP stream uses about 1.6 Mbps on average. A similar behaviour is observed in the 3 class setup, where BE and LE share the same unlimited low priority queue. Initially, LE can transmit its full 2 Mbps link load in the uncongested phase. With the BE source in place, both streams equally share the remaining 2.5 Mbps link capacity.
Together with the TCP load, BE and LE share the formerly available 800 kbps BE link share and the TCP traffic yields the same throughput as in the 4 class support case.

Fig. 129 Single node TB 4->2
Fig. 130 Single node TB 4->1

Fig. 129 depicts the minimal CoS support setup, where a 500 kbps rate limited high priority EF class is contrasted with the common link share of the LE, BE and AF sources. It can be seen that the remaining 2.5 Mbps link capacity is taken up by the constant bit rate BE and LE sources in proportion to their sending rates. The congestion controlled TCP stream dies off due to the lost prioritization. This situation is aggravated further in the single class support case depicted in Fig. 130. The EF class gains its full transmission rate in the uncongested phase, but is drastically reduced once all sources are active. EF, BE and LE divide the 3 Mbps capacity in proportion to their sending rates. No TCP type traffic can sustain this link’s 100 % congestion phase.

9.5 Summary of simulation results

The simulation results of the QoS marking and forwarding behaviour as well as the functionality of token bucket ingress limitation filters clearly demonstrate the superior forwarding quality of class of service operation as contrasted with the currently deployed best effort only transmission capabilities. Since arbitrarily complex Internet traffic models could not be handled in this simulation effort, the concept’s coarse-grained class of service support has been applied in the modelling as well. Up to four of the most commonly found traffic classes have been distinguished in the setups, combined with extensive parameter simulations and scheduling strategy variations. Single node as well as AS interconnection setups have been modelled, which allowed for simulations of varying class support situations in interconnected ASes. Only a few examples of the results obtained are documented here.
The complete result set of all combinations is available upon request. The concept’s expectation of sensible usage shares with matching class set support has been confirmed. Both strict priority queuing and class based fair queuing are applicable and lead to satisfactory results. However, class based fair queuing is preferred due to its configurable prevention of traffic starvation in all classes. The interconnection scenario with multiple ASes revealed the general advantage even of consistent two class support in transit networks. CoS bottleneck simulations revealed that the ordering of class support granularities along a forwarding path does matter. The later a merging of traffic occurs, the better. Furthermore, the advantage of marking preservation in CoS bottlenecks as compared with remarking has been demonstrated. This backs up the concept’s strong recommendation of tunnelled customer traffic transport with matched tunnel CoS support. The consequences of missing tunnel CoS support have been simulated and exposed.

Class overload prevention will be performed by token bucket ingress filtering as specified in the second IETF draft document. Therefore, the precise limitation characteristics and some typical application scenarios have been simulated. Token bucket metering combined with prioritised queuing is a simple but powerful means for network protection and sensible rate limited class of service based forwarding of traffic.

The fundamental building blocks of the new cross-domain and cross-layer coarse grained Quality of Service support concept have been successfully simulated. Given the high level of traffic aggregation across interconnection links and the current poor best effort forwarding situation, even a simple two class of service interconnection is shown to be highly beneficial for the separated transport of prioritized traffic.
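The token bucket metering summarised above can be sketched as follows. This is a minimal illustration and not the ns2 implementation used in the simulations. The class `TokenBucket`, the helper `accepted_rate_bps` and the burst size of 2000 bytes are assumptions. The 1000 byte packet size, the 600 kbps EF sending rate and the 500 kbps CIR are taken from section 9.4.

```python
class TokenBucket:
    """Simple token bucket meter: tokens are bytes, refilled at the committed rate."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0      # refill rate in bytes per second
        self.burst = burst_bytes        # bucket depth in bytes (assumed value)
        self.tokens = float(burst_bytes)
        self.last = 0.0

    def conforms(self, now, size_bytes):
        """Return True if a packet of size_bytes arriving at time now is in profile."""
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size_bytes:
            self.tokens -= size_bytes
            return True
        return False                    # excess traffic, dropped in this sketch


def accepted_rate_bps(source_bps, cir_bps, packet_bytes=1000,
                      duration_s=10.0, burst_bytes=2000):
    """Throughput of a constant bit rate source after token bucket limitation."""
    bucket = TokenBucket(cir_bps, burst_bytes)
    interval = packet_bytes * 8.0 / source_bps
    n_packets = round(duration_s / interval)
    accepted = sum(packet_bytes for i in range(n_packets)
                   if bucket.conforms(i * interval, packet_bytes))
    return accepted * 8.0 / duration_s


if __name__ == "__main__":
    # EF source sending at 600 kbps, limited to a CIR of 500 kbps
    print(accepted_rate_bps(600_000, 500_000))
```

Under these assumptions the 600 kbps source is cut back to roughly its 500 kbps CIR, while a source sending below its CIR passes unchanged. The prioritised queuing stage that follows the meter in the simulated node is not modelled here.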
10 Concept implementation

New concepts and ideas can only become Internet standards if they are contributed to the respective working group within the “Internet Engineering Task Force (IETF)” and gain community support there. RFC 2418 [40] defines the guidelines and procedures for the working group operation. It formalizes what is often cited as an early quote by David Clark: “We reject kings, presidents and voting. We believe in rough consensus and running code”. Especially the philosophy of running code, as a way to uncover missing subtleties in specifications and at the same time to ensure that the new specification can actually be used straight away, is a fundamental building block of the IETF standardization work. Consequently, the new cross-domain and cross-layer coarse grained Quality of Service support was not only submitted to the “Inter-Domain Routing (idr)” working group of the IETF, but also implemented. Basic functionality has already been achieved and used for test runs in the University laboratories as well as with service providers.

10.1 Linux implementation

The routing suite software “Quagga” [154] is available for the open-source operating system Linux. It includes the implementation of several routing protocols as well as local routing table management. The Border Gateway Protocol is also supported; its implementation has been used and extended for the new signalling concept.

Fig. 131 Quagga Routing Suite structure

Fig. 131 depicts the structure of the routing suite software. It consists of the central software process (daemon) “Zebra” and several processes (daemons) for the depicted routing protocols. Each process has an associated command line interface called “Virtual TeletYpe shell (vtysh)”. Router administrators can therefore connect to each process and issue control commands to them.
All processes exchange their routing information with external peers as well as with the central Zebra daemon for node local routing information and table updates. The BGP daemon and its associated vtysh have been modified to implement the concept’s new routing information exchange as well as the new vtysh commands required for its configuration. Fig. 132, Fig. 133 and Fig. 134 give example setups for both new Extended Communities as well as for the token bucket rate limitation signalling. All configuration commands create the required internal data structures, initiate the respective sending and show up in the output of the “show running-config” command, which displays the current router configuration. The configuration and activation of the new communities and attributes targets the so-called “route-map” mechanism, which is used in such configurations for triggered actions on matched criteria. This powerful mechanism is now extended to selectively output the CoS signalling data to neighbours by attaching the respective route-map to the peering session.

Fig. 132 Example setup for 4 QoS Marking Ex. Communities for IP-DiffServ
Fig. 133 Example setup for a CoS Capability Ex. Community
Fig. 134 Example setup for a CoS Parameter Attribute

The modifications of the BGP daemon within the Quagga routing suite have not yet been submitted to the Quagga team for inclusion. They are still under development and testing, but have proven operational stability and functionality. Table 21 lists all newly added commands and their available parameters. A detailed description of the commands and their parameter handling will be published after the official code adoption within the Quagga software project. Chapter 11 documents the test results, which were achieved by means of this modified Linux routing software.
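As background for the Extended Communities configured above, the following sketch packs a generic 8 octet BGP Extended Community in the regular layout of RFC 4360 (one type octet, one sub-type octet, six value octets) and checks its transitivity bit. It does not reproduce the field layout of the QoS Marking or CoS Capability communities, which is specified in the concept chapters. The function names and the type, sub-type and value octets used in the example are placeholders chosen for illustration.

```python
import struct


def pack_extended_community(type_high, type_low, value):
    """Pack one 8 octet Extended Community (regular type layout of RFC 4360)."""
    if len(value) != 6:
        raise ValueError("value field must be exactly 6 octets")
    return struct.pack("!BB6s", type_high, type_low, value)


def is_transitive(community):
    """RFC 4360: the T bit (0x40 of the type octet) is 0 for transitive communities."""
    return (community[0] & 0x40) == 0


if __name__ == "__main__":
    # Placeholder octets, not the encoding defined by the concept.
    community = pack_extended_community(0x00, 0x04, bytes(6))
    print(community.hex(), len(community), is_transitive(community))
```

The fixed size of 8 octets per community is what the resource estimates in chapter 11.4 build on.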
Table 21 Extended command line syntax for CoS configurations

10.2 Wireshark implementation

Measurement tools are required to analyze test results and to aid the debugging process. The most widely used network analyzer tool for data communication networks is the freely available software “Wireshark” [181]. This successor of the former “ethereal” software holds a comprehensive set of protocol dissectors, which allows the user to analyze almost all types of captured data packets in a cleanly structured way, detailed down to every bit of the packet’s control information. Since all programming sources of the Wireshark package are available, the new Extended Communities have been added to the dissection repository. The modifications have been submitted to the Wireshark team and were accepted for inclusion. The newest official release therefore includes the decoding functionality for the data structures. Fig. 135 depicts a screenshot of the software. The enlargement shows some examples of transmitted Extended Communities within a BGP UPDATE message.

Fig. 135 Wireshark screenshot with captured Extended Communities

10.3 Online debug form

The new CoS signalling capabilities of Linux based BGP speakers are not yet available in commercial routers. Therefore, many network operators might well be able to receive those Extended Communities and attributes, but will be confronted with decimal or hexadecimal encodings of the information as shown in Fig. 136.

Fig. 136 Reception example of Extended Communities in commercial routers

In order to decode the received information, such operators would be forced to use the augmented Wireshark functionality. However, this will not happen in production style setups. Therefore, a second means of decoding is provided, which accepts the un-decoded command line output and displays the decoding result.
This service has been set up as an online debug form and can be accessed at the following URL: http://www.bgp-qos.org/draft-knoll/decode_attributes.php . Either single encodings (e.g. “0x420:11778:3422565120”) or complete command line log files can be submitted in the online form for decoding. The result is returned in structured table output style as shown in Fig. 137.

Fig. 137 Decoding result of the online form

11 Implementation test

Service providers run large network setups and almost exclusively use commercial router equipment. The interoperability of the concept’s Linux-based implementation with such routers is therefore vital and needs to be tested. The selection of BGP as signalling protocol and the reuse of Extended Communities for most of the signalling information transport allow for the interconnection of modified Linux and commercial router systems. Moreover, the Extended Community attribute is by definition a transitive attribute and the CoS Parameter attribute has been designed as such as well. Thus, all commercial routers of all vendors will receive and store the attributes and eventually relay them unprocessed. In practice, the latter only holds true if the network operator has not configured a discard filter for Extended Communities and for unknown attributes. The simple interconnection of a Linux-PC with several types of Cisco routers has been tested. The establishment of a peering session, the exchange of routing information and the attachment of some of the new path attributes were successfully realized. The more challenging testing of the attribute relay between commercial routers as well as the extensive signalling of the new path attributes under DFZ routing conditions has been successfully completed and is documented below.

11.1 Test setup

The intention of the test setup as shown in Fig.
138 is to learn a so-called “full feed” global routing table view from a public peering, to augment this information with some CoS tagged self-originated prefixes and to relay this advertisement to a commercial Router 1. This in turn relays the full routing information, including the injected CoS signalling information, towards a second commercial Router 2.

Fig. 138 Implementation test setup

Three Linux PCs and two Cisco 2811 routers have been used in the testing arrangement. The test has been performed in test labs of independent service providers for the global Internet peering connectivity. All information exchanges were to be captured for documentation and offline analysis purposes. Wireshark was used for this task and the modified Linux PC (“themis” in the figure) was able to do the packet capture for the public interconnection link. The direct link between the Linux-PC and Router 1 as well as the one between Router 1 and Router 2 needed to be tapped by means of two simple Ethernet hubs and two Wireshark equipped Linux-PCs (“leda” and “maia” in the figure).

11.2 Test result and observations

The session establishment between the BGP speakers of the described test setup was successful and routing advertisements within BGP UPDATE messages were observed. The public Internet peering for global connectivity was configured to relay advertisements from the Internet to the test setup and to filter out all locally generated or repeated advertisements. This way, all unacknowledged announcements could be suppressed. This is particularly important, since service level agreements and mutual information exchange within BGP peering sessions are of legal relevance and must be kept clear of any unwanted or uncontrolled leakage of information. Fig. 139 documents the successful reception of a full routing table feed from the public Internet via the modified Linux PC into the Router 1 BGP routing process.
The routing table contained 273109 IP prefixes and consumed about 56 MB of RIB memory space. The time between session establishment and table convergence at the Linux PC came to about 6 minutes. The same time was needed for the unchanged relay operation toward Router 1 and again to Router 2.

Fig. 139 Router1: show ip bgp sum – full feed

In a second test series, the Linux PC started to send out route-map matched reachability UPDATEs for its own networks. Some made up networks were configured and associated with four CoS Extended Communities. The signalling of the respective CoS Extended Communities was successful and could be observed in Wireshark and in the routers’ debug and statistic outputs. Fig. 140 and Fig. 141 show, as examples, the resulting memory consumption of 40 bytes for the four attached Extended Communities, which were received in one Extended Community attribute. This consumption is interesting in two ways. Firstly, the Extended Community itself is an 8 byte structure and four such structures should result in 32 bytes of memory usage. Secondly, no difference in consumption was found between 1, 10, 100 or several hundred announced prefixes with the same associated CoS Extended Communities.

Fig. 140 Router 1: show ip bgp sum – single prefix with 4 communities
Fig. 141 Router 1: show ip bgp sum – 10 prefixes with 4 communities

Fig. 142 in turn documents the maximum tested run, in which a single CoS Extended Community was deliberately attached to all prefixes announced in the incoming full feed. This forced behaviour is not conformant to the specified concept, but clearly proves the sending capability of the Linux-PC as well as the stable handling and efficient storing of this massive load of incoming CoS attributes by commercial routers.

Fig.
142 Router1: show ip bgp sum – full feed with single community attached to all

Further test runs have been performed, where hundreds of prefixes were associated with thousands of differing CoS Extended Communities in the connectivity advertisements. All tests were passed successfully; the resource usage analysis of this testing is discussed in chapter 11.4.

One further observation has been made, which revealed a still unresolved BGP signalling flaw.

Fig. 143 Completely processed Extended Community attribute example

Fig. 143 depicts the reception of four CoS Extended Communities contained in one Extended Community attribute. However, this UPDATE message passed through a Cisco router, which marked the attribute as “completely” processed. This marking flag (complete vs. partial processing) is a mandatory field associated with all BGP path attributes. The standard requires that a BGP speaker which receives a transitive attribute unknown to it must relay the attribute with the “partially processed” marking raised. In the Extended Community attribute case, the Cisco router obviously did not raise the partial flag because of its familiarity with Extended Community attributes as such. The unknown content, namely the new QoS Marking Extended Communities, is therefore silently relayed as completely processed. This signalling inconsistency has been acknowledged by the two major router vendors.

11.3 Ethernet QoS support test at IXPs

The new cross-domain and cross-layer coarse grained Quality of Service support concept places emphasis on the CoS interworking not only between networking domains, but also between networking layers. Since many potentially CoS capable interconnected Service Providers peer across public Internet Exchange Points, the underlying Ethernet QoS support needed to be tested as well.
All major Internet Exchange Points in the world are currently not QoS enabled and switch untagged Ethernet frames. However, talks with IXP operators revealed that they are willing to support their customers in high class peering setups and want to be prepared for it. Allowing customers to configure VLAN tagged peerings across an IXP platform is a prerequisite for QoS support. The support itself can be divided into QoS marking and marking preservation only support or QoS marking and QoS forwarding support. The latter is unlikely to be enabled soon.

Fig. 144 VLAN User Priority test at DE-CIX [144]

The IXP administration at the German Internet Exchange Point in Frankfurt was kind enough to perform the requested test on their cascaded platform. Two PCs, languard1 and languard4, were exchanging IEEE 802.1Q tagged Ethernet frames with configured user priority markings. It could be shown that all marks traversed the Force10 and Foundry based switching platform unchanged. This is a fundamental building block for VLAN tunnelled and priority marked AS interconnections. IXP customers are supplied with VLAN based platform access upon request. The switch hardware of the platform is even capable of performing multi-class prioritised forwarding. Other IXPs are expected to offer high class peerings as well. However, no central database currently exists which could guide potential peering partners to the QoS enabled peering platforms. Therefore, a new database of QoS enabled IXPs has been initiated. This registry of QoS-enabled IXPs can be found at: http://www.bgp-qos.org/qos-ixp/ . As Fig. 145 already shows, the second European IXP has also acknowledged support for VLAN tagged peering with user priority preservation. The differences lie in the number of priority queues supported on the Ethernet hardware platform.

Fig.
145 Major European IXPs with VLAN User Priority support

11.4 Resource usage estimates

The applicability of an inter-domain concept depends on three major criteria:
• simplicity to gain common understanding and usage,
• scalability to a large number of interconnections and routing table entries and
• modesty in resource usage.

The latter two stem from the ever increasing number of Internet routes and autonomous systems. Fig. 146 and Fig. 147 document this continuing growth trend.

Fig. 146 Active BGP entries over time [year] - [93]
Fig. 147 Unique ASes over time [year] - [93]

Network operators are strongly concerned about the growth rate. The situation is further intensified by AS multi-homing. Here, stub ASes connect via two or more Internet service providers to the Internet. Due to the BGP best path selection, only one homing path would be used at a time; frequent swapping between the paths is not desired and is in fact harmful for inter-domain routing stability. Therefore, stub ASes tend to split their currently aggregated IP address spaces into de-aggregated ranges. This leads to larger prefix lengths and two routing table entries instead of one. This in turn results in an increase in routing table size and has led to the commonly accepted prefix length limitation policy. Prefixes of more than 24 bit length are filtered out in the BGP processing, which virtually disconnects any finer-grained IP address range.

Fig. 148 shows the average rate of updated and withdrawn prefixes. Less than ten such prefix manipulations are on average to be processed by any BGP speaker in the Internet. This reveals that UPDATE processing and UPDATE message size are of minor importance during normal operation. However, they become critical in the initialisation phase of BGP sessions. Here, the full routing table is exchanged, which results in considerable UPDATE message amounts and high processing loads.

Fig.
148 Hourly Average of Updated and Withdrawn Prefix Rate - [93] Therefore BGP UPDATE messages were generated, which included just one prefix associated with 173 Extended Communities. This extreme message design yielded no operational flaws and no measurable processing increase. The new coarse grained CoS concept has proven to scale well, under the light of thousands of prefixes being associated with CoS signalling in the UPDATE procedure, as well as in the number of differing CoS Extended Communities being received, stored and relayed by BGP speakers. The resource estimation depends on a number of factors, which can vary widely depending on the number of providers adopting this CoS concept. However, due to the limited number of different classes being sensibly used and the fact that identical attributes are going to be stored as a single instance, the overall resource usage is expected to be rather small – especially in terms of additional memory consumption. 11.4.1 Increase in routing update information size The resource usage for the sole transmission of BGP UPDATE messages is best analyzed under worst case conditions. An upcoming BGP peer in a newly established BGP session will require the transmission of the full routing table with all prefixes and associated attributes. This situation will be taken as the starting point for the calculations, which follow below. However, it is to be noted, that such a dense full table exchange does provide the highest information packing density. As far as possible, the sender will group all prefixes with identical attribute sets into one UPDATE message for transmission. The addition of a new CoS attribute will therefore be applied to all advertised prefixes within the message as well, which results in 8 byte signalling overhead for several hundred CoS enabled prefixes. Under normal network operation, single prefixes with associated attributes will be exchanged. 
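The packing effect described above, where prefixes with identical attribute sets travel in one UPDATE message and share a single CoS community, can be illustrated with a minimal Python sketch. It is not taken from any BGP implementation; the function name `group_by_attributes`, the prefixes and the attribute labels are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

COMMUNITY_BYTES = 8  # size of one Extended Community, as stated in the text


def group_by_attributes(routes):
    """Map each distinct attribute set to the list of prefixes that share it.

    `routes` is an iterable of (prefix, attributes) pairs; `attributes`
    must be hashable (a tuple here) so it can serve as a dictionary key.
    """
    groups = defaultdict(list)
    for prefix, attributes in routes:
        groups[attributes].append(prefix)
    return dict(groups)


# Hypothetical routes: documentation prefixes and made-up attribute labels.
GOLD = ("AS_PATH 64500", "CoS community A")
SILVER = ("AS_PATH 64501", "CoS community B")
routes = [
    ("192.0.2.0/24", GOLD),
    ("198.51.100.0/24", GOLD),
    ("203.0.113.0/24", GOLD),
    ("198.18.0.0/15", SILVER),
]

updates = group_by_attributes(routes)
for attributes, prefixes in updates.items():
    # The 8-byte community is carried once per message,
    # however many prefixes share the attribute set.
    per_prefix = COMMUNITY_BYTES / len(prefixes)
    print(f"{len(prefixes)} prefix(es) share {attributes}: "
          f"{per_prefix:.2f} byte of CoS signalling per prefix")
```

Each dictionary entry stands for one UPDATE message, so the per-prefix share of the 8 byte community shrinks as more prefixes share the same attribute set.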
An UPDATE message with one IPv4 prefix and only the limited set of mandatory attributes (ORIGIN with 4 bytes, AS_PATH with one AS, and NEXT_HOP) accounts for a message size of 32 bytes. Adding one CoS Extended Community requires the Extended Community attribute control information (3 bytes) and the actual Extended Community (8 bytes), so 43 bytes are needed in total. This is an overhead of 11 bytes relative to the original size, or 34.375 %. Associating two CoS Extended Communities in the single prefix case requires 3 + 16 = 19 bytes, which yields 59.375 % total overhead, or 29.69 % per CoS Extended Community. Fig. 149 depicts the resulting overhead graph.

Fig. 149 CoS signalling UPDATE message overhead - single prefix case

As mentioned earlier, an UPDATE message can carry at most 173 Extended Communities. This limitation is caused by the Ethernet MTU size of 1500 bytes. Furthermore, if a prefix is announced with 173 Extended Communities in a first UPDATE message and announced again with 173 Extended Communities in a second UPDATE message, those communities do not accumulate at the receiver; the second set is regarded as new information that replaces the old one. As Fig. 150 shows, the UPDATE message with 173 Extended Communities contained 173 * 8 bytes = 1384 bytes of Extended Communities, resulting in a BGP UPDATE message of 1444 bytes and an Ethernet frame of 1498 bytes including the IP, TCP and Ethernet header information.

Fig. 150 Wireshark screenshot with 173 Extended Communities UPDATE

The signalling of CoS marking information is done transitively and globally in QoS Marking Extended Communities. The rate-limitation signalling in CoS Parameter Extended Communities, however, is of interconnection local significance (see chapter 7.3.2.3) and is therefore of limited concern for resource usage estimation.
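The single prefix overhead figures and the frame size of the 173 community test can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the 32 byte base size, the 3 byte attribute header and the 8 byte community size come from the text, whereas the split of the lower layer headers into 20 + 20 + 14 bytes (IPv4, TCP and Ethernet without options) is an assumption that happens to match the 1444 to 1498 byte step reported above. The function names are illustrative.

```python
BASE_UPDATE_BYTES = 32   # single-prefix UPDATE with mandatory attributes (from the text)
ATTR_HEADER_BYTES = 3    # Extended Communities attribute control information
COMMUNITY_BYTES = 8      # one Extended Community
# Assumption: IPv4 (20) + TCP (20) + Ethernet (14) headers, no options.
LOWER_LAYER_BYTES = 20 + 20 + 14


def cos_overhead(communities):
    """Return (extra bytes, total overhead in %, overhead in % per community)."""
    extra = ATTR_HEADER_BYTES + communities * COMMUNITY_BYTES
    total_pct = 100 * extra / BASE_UPDATE_BYTES
    return extra, total_pct, total_pct / communities


def frame_bytes(bgp_update_bytes):
    """Ethernet frame size for one BGP UPDATE under the header assumption above."""
    return bgp_update_bytes + LOWER_LAYER_BYTES


for n in (1, 2):
    extra, total_pct, per_community = cos_overhead(n)
    print(f"{n} community: +{extra} byte, {total_pct:.3f} % total, "
          f"{per_community:.2f} % per community")

print("173 communities:", 173 * COMMUNITY_BYTES, "byte of Extended Communities")
print("frame size for a 1444 byte UPDATE:", frame_bytes(1444), "byte")
```

The same helper can be evaluated for larger community counts to trace the overhead curve of the single prefix case.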
The signalling of class markings in the newly defined Extended Communities is realized by constructing a single Extended Community for each class of service and each of its technology representations. The additional signalling bytes transferred in a single update message are therefore calculated as follows:

S_update = classes * technologies * 8 byte + 3 byte

To limit the expected attribute usage to practical scenarios, one needs to consider the following facts:
• most vendors support a maximum of 8 classes (e.g. Ethernet priority, MPLS E-LSP) in their network devices,
• most router interfaces support only four hardware queues,
• CoS marking signalling for the IP technology does not distinguish between IP version 4 and 6, because of the common DSCP marking design and
• common transport technologies for IP packets are Ethernet with or without intermediate MPLS tunnelling support.

Given the above assumptions, the maximum UPDATE size increase for CoS support signalling with 8 traffic classes encoded within 3 networking technologies is S_update = 195 byte / update message.

The exchange of the full feed routing table as shown in Fig. 142 required the sending of 95160 update messages. Under the assumption that all globally reachable prefixes need to be associated with the 195 byte CoS signalling in their UPDATE advertisements, the UPDATE transfer size increases by 17.7 MB. At the commonly used interconnection speed of 1 Gbit/s, this results in an additional transfer delay of about 149 ms.

The coarse-grained CoS support aimed for in this concept targets only four classes. Because of the intensive usage of IXPs for AS interconnection, it is expected that CoS support signalling will mostly be limited to IP and Ethernet CoS markings. This four class / two technologies setup sums up to S_update = 67 byte / update message.
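The size formula and the full table estimate can be checked with a short Python sketch. The constants are the ones given in the text (8 byte per community, 3 byte attribute header, 95160 update messages, 1 Gbit/s link); the function names `s_update` and `full_table_cost` are illustrative, and the delay is a pure serialization time that ignores protocol overhead and processing.

```python
COMMUNITY_BYTES = 8       # one Extended Community
ATTR_HEADER_BYTES = 3     # Extended Communities attribute control information
UPDATE_MESSAGES = 95160   # full feed table exchange, as reported in the text
LINK_BPS = 1_000_000_000  # 1 Gbit/s interconnection


def s_update(classes, technologies):
    """Additional CoS signalling bytes per UPDATE message."""
    return classes * technologies * COMMUNITY_BYTES + ATTR_HEADER_BYTES


def full_table_cost(update_messages, extra_bytes, link_bps):
    """Return (additional bytes, serialization time in seconds) for a full table."""
    total_bytes = update_messages * extra_bytes
    return total_bytes, total_bytes * 8 / link_bps


scenarios = {
    "8 classes, 3 technologies": (8, 3),
    "4 classes, 2 technologies": (4, 2),
    "4 classes, IP only": (4, 1),
}
for label, (classes, technologies) in scenarios.items():
    extra = s_update(classes, technologies)
    total_bytes, seconds = full_table_cost(UPDATE_MESSAGES, extra, LINK_BPS)
    print(f"{label}: {extra} byte/update, "
          f"{total_bytes / 2**20:.1f} MB, {seconds * 1000:.1f} ms")
```

For the 8 class, 3 technology case this evaluates to 195 byte per message and roughly 17.7 MB (counted in units of 2^20 byte) for the full table, which takes about 0.15 s to serialize at 1 Gbit/s.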
Given the current 95160 update messages, this yields a maximum additional transfer size of 6.1 MB. At the current interconnection speed of 1 Gbit/s, this results in about 51 ms transfer delay. The UPDATE size increase for a full BGP table feed reduces further to 3.2 MB (and a respective 26.6 ms transfer delay on a gigabit interconnection link) if only IP CoS support signalling is deployed. Because this additional UPDATE message size is small, the more important factor in terms of resource usage and possible equipment upgrade requirements is the memory usage estimate.

11.4.2 Increase in memory consumption within routers

The storage of globally flooded QoS Marking Extended Communities is of highest concern. Each BGP speaker in the world will receive the CoS support announced by the AS originating the IP prefix and store this information in its local BGP memory space. Estimating the actual memory usage is non-trivial due to the observed storage concept. Given that the memory consumption for extended community attribute storage is independent of the number of prefixes being advertised, the estimate needs to focus on the number of different attribute sets being advertised. This, however, depends not only on the number of classes and technologies being addressed but also on the markings and flags conveyed in those attributes. Only attributes that are identical in every single bit can be stored as a single attribute instance. Otherwise, they need to be stored separately. This leads to two different estimation approaches:
1. Estimate the number of class sets and markings (possibly belonging to different ASes) and calculate the resulting memory space, taking into consideration that flags might change as well.
2. Analyze the attribute structure and multiply the combinations of each field that might independently vary its value.

Taking both approaches into account, the actual estimate is the minimum value of both.
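The storage behaviour assumed in this estimate, one stored instance per bit-identical attribute regardless of how many prefixes reference it, can be sketched as follows. The class `AttributeStore` is a hypothetical illustration and not the data structure of any particular router implementation; the two 8 byte values are arbitrary placeholders rather than real QoS Marking encodings.

```python
class AttributeStore:
    """Keeps one stored copy per distinct (bit-identical) attribute encoding."""

    def __init__(self):
        # encoded attribute (bytes) -> number of prefixes referencing it
        self._refcount = {}

    def attach(self, encoded: bytes) -> None:
        """Register one more prefix that uses this encoded attribute."""
        self._refcount[encoded] = self._refcount.get(encoded, 0) + 1

    def instances(self) -> int:
        """Number of distinct attribute instances held in memory."""
        return len(self._refcount)

    def stored_bytes(self) -> int:
        """Payload bytes held, ignoring any per-entry bookkeeping overhead."""
        return sum(len(encoded) for encoded in self._refcount)


# Placeholder 8-byte values, not a real QoS Marking encoding.
community_a = bytes.fromhex("0000012e002e0000")
# Same value with one flag-like bit flipped: differs in a single bit.
community_b = bytes([community_a[0] ^ 0x04]) + community_a[1:]

store = AttributeStore()
for _ in range(1000):       # 1000 prefixes carrying the identical community
    store.attach(community_a)
store.attach(community_b)   # one prefix carrying the slightly different one

print(store.instances(), "instances,", store.stored_bytes(), "byte stored")
```

A thousand prefixes with the same community cost one instance, while a single differing bit already forces a second one, which is why the estimate counts distinct attribute sets rather than prefixes.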
For convenience, the structure of the QoS Marking Community is again depicted in Fig. 151 below.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0 0 P R I A 0 0| QoS Set Number|Technology Type| QoS Marking Oh|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| QoS Marking Ol| QoS Marking A |0 0 0 0 0 0 0 0|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Fig. 151 Structure of the QoS Marking Community

Based on the possibly varying attribute fields, the maximum (and unrealistic) number of storable binary combinations is composed as follows:
• 4 varying flag bits contribute 16 combinations,
• the 8 bit QoS Set Number contributes 256 combinations,
• 6 defined technologies contribute 6 combinations,
• 6 DSCP bits and 1 group bit within Marking O contribute 128 combinations and
• 6 DSCP bits within Marking A contribute 64 combinations.

Assuming that no dependencies exist between those fields, a total of 201,326,592 combinations can theoretically exist. Since each Extended Community is 8 bytes in size, up to 1536 MB of additional memory space would be required. This theoretical resource estimate would be prohibitive for the deployment of the concept. However, the worst case assumptions made are partially interdependent and highly unlikely. The respective limitations are therefore discussed and considered for a more realistic estimate.

First, the four flags of the Flags field indicate processing states of the respective community and might well be observed in all 16 constellations. However, they will not be observed in all 16 combinations at once for a given CoS signalling. Combined with the values of the remaining structure fields, only one of those 16 combinations can be found for a given QoS Marking Community at a given router device. That is, only one instance of this Community must be stored in this device.
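The theoretical upper bound derived above is a plain product of the per-field value counts, multiplied by the 8 byte community size. The following Python sketch reproduces it; the dictionary keys are descriptive labels chosen here, and the function name `storage_bound` is illustrative.

```python
from math import prod

COMMUNITY_BYTES = 8  # each stored Extended Community occupies 8 bytes


def storage_bound(field_combinations):
    """Return (number of combinations, bytes needed to store all of them).

    `field_combinations` maps a field label to the number of distinct
    values that field may take, assuming the fields vary independently.
    """
    combinations = prod(field_combinations.values())
    return combinations, combinations * COMMUNITY_BYTES


# Worst case: every field takes every value its width or definition allows.
upper_bound_fields = {
    "flags": 2 ** 4,           # 4 varying flag bits
    "qos_set_number": 2 ** 8,  # 8 bit QoS Set Number
    "technology_type": 6,      # 6 defined technologies
    "marking_o": 2 ** 7,       # 6 DSCP bits plus 1 group bit
    "marking_a": 2 ** 6,       # 6 DSCP bits
}

combinations, size_bytes = storage_bound(upper_bound_fields)
print(f"{combinations:,} combinations, {size_bytes / 2**20:.0f} MB")
```

The same function can be called with narrower per-field counts to see how strongly each restriction reduces the bound.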
Secondly, the QoS Set Number is the linking field between multiple QoS Marking Communities of different networking technologies. One such number is needed per class for the cross-layer technology linking. It has UPDATE message local significance and starts counting from zero in each UPDATE message. Given the realistic assumption of a maximum of 8 classes for the AS interconnection, the QoS Set Number will vary between 0 and 7.

Thirdly, because of the current best practice for interconnections, the Technology Type field will commonly contain the IP, Ethernet and MPLS E-LSP enumeration values. This yields 3 combinations.

Fourthly, Table 3 lists 21 DSCP values, which might be combined with the grouping bit for AF DSCP in the QoS Marking O field. Furthermore, 3 bit priority based technologies will potentially signal up to 8 combinations in this field. The resulting number of combinations is 41.

Lastly, the mentioned 21 DSCP values as well as the 8 priority encodings are most likely to be found in the Marking A field. 29 binary combinations are therefore estimated for this field.

Taking the above assumptions into account, the following number of storable binary combinations needs to be considered:
• the 4 flag bits contribute 1 combination,
• the 8 bit QoS Set Number contributes 8 combinations,
• the Technology Type field contributes 3 combinations,
• the Marking O field contributes 41 combinations and
• the Marking A field contributes 29 combinations.

This yields a total of 28,536 combinations. Since each combination uses 8 bytes of memory, this sums up to 223 KB of additional memory. Taking one more technology (e.g. separate virtual channels per class) into account, this result would change to 57,072 combinations or 446 KB, respectively. Fig. 152 depicts the resulting memory usage estimate for up to 8 classes and four technologies.

Fig. 152 Memory usage estimates for up to 8 classes and four technologies

To conclude the memory usage analysis, the following two figures document real world measurement examples of BGP memory consumption. Fig. 153 correlates the number of Extended Communities associated with a prefix with the displayed statistics about BGP memory consumption for those communities. The measurement was made in single steps and documents a non-linear memory consumption caused by the Cisco implementation specific memory allocation. The most obvious non-linearity occurs with the storage of 7 Extended Communities, which raises the memory allocation to 250 bytes. The next increase happens with Extended Community 31.

Fig. 153 Memory usage for ext. communities sent within one UPDATE message

Fig. 154 Memory usage for large quantities of sent extended communities

Fig. 154 documents the linear increase in memory usage with thousands of stored Extended Communities. The storage of e.g. 65567 Communities consumes 533 KB of memory. The test clearly demonstrated the technical feasibility and scalability of the proposed signalling solution for global scale deployment.

12 Summary and outlook

This thesis focuses on interconnections of autonomous systems with emphasis on the introduction of the currently missing class of service support. In general, this work makes three main contributions. In the first part, a comprehensive compilation of quality of service support concepts with detailed descriptions of network and node internal building blocks has been arranged, which proves the technical readiness of currently deployed devices for an inter-domain class of service based interconnection. Combined with an oral survey among major European, American and Middle East network operators, this contribution led to the strong request for a simple, understandable and manageable concept design.
In the second part, the specification of the new inter-domain CoS concept has been drafted and submitted to the IETF for standardization. In the third part, simulations and implementations of vital building blocks of the concept have been made to underline its functionality and technical feasibility. Resource estimates and successful field trials provide evidence for its scalable and functioning design.

12.1 Contributions and results

In particular, the following contributions have been made:
• The interconnection of autonomous systems for global Internet connectivity is a critical point between network providers in technical and economic terms. Current deployments are based solely on basic public Internet Protocol interconnection without any quality of service support. Capacity over-provisioning and network internal QoS control have been found to be the state of the art operation strategies. Due to the continuing fast growth of Internet traffic, the thesis forecasts rising capacity provisioning costs combined with a raised level of congestion on the interconnection links. To address this foreseeable trend, a new class of service interconnection concept of global scale has been designed.
• Simplicity has been identified as the most important design factor for the concept’s acceptance in the Internet community. This simplicity design goal extends to the designed signalling structures and handling procedures as well as to the actual number of supported traffic classes.
• The importance of introducing at least two and preferably four classes of service at AS interconnections has been stated and underlined with simulations.
• Simple traffic separation, as opposed to existing complex quality of service support concepts with delay, loss and jitter guarantees, is strongly preferred in order to avoid complex, costly and prohibitive deployment restrictions.
• Simplicity in terms of waived quality guarantees is a prerequisite of the concept and contributes to global deployment.
• The analysis of existing and possibly newly defined signalling protocols for the dissemination of CoS support information led to the selection and reuse of BGP as the commonly available signalling protocol at interconnection points.
• New Extended Communities and a new BGP path attribute have been designed for the required signalling of cross-domain and cross-layer CoS support information.
• The design of the transitive relay functionality of CoS signalling via Extended Communities, as well as the provider controllable mapping of CoS support information between different networking technology CoS support concepts, is a novel principle and fundamental contribution.
• Elaborate simulation results on single node and AS level class of service support have been documented by example within this thesis and are freely available upon request.
• Implementation test results have been contributed, which prove the concept’s applicability and interoperability with existing networking equipment.
• Resource estimates have been worked out, which revealed a negligible influence of the new CoS signalling on routing UPDATE message exchanges and moderate memory consumption within routing devices. The analysis of realistic CoS support scenarios documents the concept’s applicability at large scale.
• The design of the simple CoS concept does not prohibit the selective application of more complex QoS guaranteeing concepts. In fact, the concurrent deployment of the generally available CoS support combined with QoS guaranteeing setups for a limited set of interconnections or transit paths is supported.
• A global class-based Internet with at least 2 and preferably 4 generally available classes of service is recommended by this new CoS concept.

12.2 Practical usage

Emphasis has been placed on the practical usage of the concept.
The following achievements address some important milestones for the deployment.
• The intellectual property right free submission of the concept’s design specification to the IETF standardization body precludes possible patent applications. Free global deployment is aimed for, and provider internal cost savings add to the benefit of deploying the concept.
• The implementation results within the Linux routing suite Quagga and the network protocol analyzer software Wireshark are freely available. The Wireshark extension has already been contributed to the official software release and the Quagga implementation will be submitted for inclusion in the official source tree.
• An online service for decoding raw CoS signalling data has been set up and can be used at the following location: http://www.bgp-qos.org/draftknoll/decode_attributes.php
• Type number assignments have been granted by IANA, which already enables the public signalling of QoS Markings and CoS Capabilities in production style network operation. The concept has thereby crossed the border from laboratory confined setups into public applicability.

12.3 Outlook

The current status of the new cross-domain and cross-layer coarse grained Quality of Service support concept limits its deployment to Linux based internetworking devices. Ongoing discussions with network operators and router vendors aim for general concept support in commercial routers. Technical feasibility has been attested by the discussion partners and deployment interest has been expressed by European providers.

Future deployment experiences and adoption requests will lead to concept and implementation refinements. To foster the concept’s deployment in production style networks, the augmentation of legacy commercial router equipment by means of an interactive Linux-based remote management mechanism is currently under development. Fig. 155 depicts the concealed CoS control of commercial border routers by an AS internal Linux-PC. The transitive design of all signalling elements ensures that the commercial border router, acting as a passive bidirectional signalling relay, actually forwards the signalling information to and from the Linux-PC. This PC is in charge of CoS signalling processing and generation and simply uses the router as a signalling relay. A second connection from the Linux-PC to the command line interface of the router will be used to issue the respective control commands for the configuration and activation of the router’s existing class of service support functionality. This intermediate solution will allow operators to enable inter-domain CoS support without costly software or hardware upgrades.

Fig. 155 Linux remote control of existing commercial AS border router

An ongoing discussion on “Network neutrality” is influencing the vendors’ support and operators’ deployment of any inter-domain quality of service enhancements. A neutral Internet operation without any service blocking, content filtering or favouring of some Internet users over others is requested. Discussions with service providers and federal network agencies revealed that the designed CoS concept with its simple and generally applicable structure is likely to be regarded as a non-discriminating and possibly omnipresent Internet enhancement. Further techno-economic studies on the cost reduction potential of the concept will need to be carried out to guide the device upgrade and CoS deployment decision process.

The BGP Community based signup procedure for new services and concepts, proposed by the company Google, is briefly described in chapter 5.2. Depending on the outcome, this CoS support concept can even be used as the contractual base for inter-provider class of service support agreements.
154 17.11.2009 Bibliography [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] 3GPP, “Quality of Service (QoS) concept and architecture (Release 5)”,3GPP TS 23.107 V5.8.0, 2003. 3GPP,” Policy and charging control architecture (Release 8)”, 3GPP TS 23.203 V8.6.0, 2009. 3GPP, ”UTRA-UTRAN Long Term Evolution (LTE) and 3GPP System Architecture Evolution (SAE)”, 3GPP, 2006, [Online]. Available: ftp://ftp.3gpp.org/Inbox/2008_web_files/LTA_Paper.pdf 3GPP, “Requirements for further advancements for Evolved Universal Terrestrial Radio Access (E-UTRA) (LTE-Advanced) (Release 8)”,3GPP TR 36.913, 2009, [Online]. Available: http://www.3gpp.org/ftp/Specs/archive/36_series/36.913/ Abley, J.; Savola, P. and Neville-Neil, G., "Deprecation of Type 0 Routing Headers in IPv6", RFC 5095, IETF, 2007. Alaettinoglu, C.; Villamizar, C.; Gerich, E.; Kessens, D.; Meyer, D.; Bates, T.; Karrenberg, D., Terpstra, M., "Routing Policy Specification Language (RPSL)", RFC 2622, IETF, 1999. Amante, S., Bitar, N., Bjorkman, N., et. al., "Inter-provider Quality of Service - White paper draft 1.1", 2006, [Online]. Available: http://cfp.mit.edu/docs/interprovider-qosnov2006.pdf AMS-IX, “AMS-IX Monthly Reporting”, 2009, [Online]. Available: http://www.amsix.net/technical/stats/CUMU/ Andersson, L., Asati, R., "Multiprotocol Label Switching (MPLS) Label Stack Entry: EXP Field Renamed to Traffic Class Field", RFC 5462, IETF, 2009. Andersson, L.; Minei, I. & Thomas, B., "LDP Specification", RFC 5036, IETF, 2007. Andersson, L., Swallow, G., "The Multiprotocol Label Switching (MPLS) Working Group decision on MPLS signaling protocols", RFC 3468, IETF, 2003. Andras, V., “OMNeT++”, OMNeT Development Team, 2009, [Online]. Available: http://www.omnetpp.org Awduche, D.; Berger, L.; Gan, D.; Li, T.; Srinivasan, V. and Swallow, G., "RSVP-TE: Extensions to RSVP for LSP Tunnels", RFC 3209, IETF, 2001. Awduche, D.; Malcolm, J.; Agogbua, J.; O'Dell, M. 
and McManus, J., "Requirements for Traffic Engineering Over MPLS", RFC 2702, IETF, 1999. Ayyangar, A.; Kompella, K.; Vasseur, J. and Farrel, A., "Label Switched Path Stitching with Generalized Multiprotocol Label Switching Traffic Engineering (GMPLS TE)", RFC 5150, IETF, 2008. Babiarz, J.; Chan, K. and Baker, F., "Configuration Guidelines for DiffServ Service Classes", RFC 4594, IETF, 2006. Baker, F.; Polk, J. and Dolly, M., "DSCP for Capacity-Admitted Traffic", InternetDraft draft-ietf-tsvwg-admitted-realtime-dscp-05, IETF, Work in progress, 2008. Banerjea, A., Ferrari, D., et. al., "The Tenet Real-Time Protocol Suite: Design, Implementation, and Experiences", IEEE/ACM Transactions on Networking, Volume 4, Issue 1, pp. 1-10, 1996. Bates, T.; Chandra, R.; Katz, D. & Rekhter, Y., "Multiprotocol Extensions for BGP4", RFC 4760, IETF, 2007. Bates, T.; Chen, E. & Chandra, R., "BGP Route Reflection: An Alternative to Full Mesh Internal BGP (IBGP)", RFC 4456, IETF, 2006. Bates, T.; Gerich, E.; Joncheray, L.; Jouanigot, J.-M.; Karrenberg, D.; Terpstra, M. and Yu, J., "Representation of IP Routing Policies in a Routing Registry (ripe81++)", RFC 1786, IETF, 1995. 155 17.11.2009 [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] Bauschert, T., “Lecture: Data Communications”, Chemnitz University of Technology, 2008. Bellman, R. E., "Dynamic Programming", Princeton University Press, Princeton, N.J., 1957. Benmohamed, L.; Liang, C.; Naber, E.; Terzis, A., "QoS Enhancements to BGP in Support of Multiple Classes of Service", draft-liang-bgp-qos-00 (work in progress), IETF, June 2006. Berger, L., "Generalized Multi-Protocol Label Switching (GMPLS) Signaling Functional Description", RFC 3471, IETF, 2003. Berger, L., "Generalized Multi-Protocol Label Switching (GMPLS) Signaling Resource ReserVation Protocol-Traffic Engineering (RSVP-TE) Extensions", RFC 3473, IETF, 2003. 
Bernet, Y., "Format of the RSVP DCLASS Object", RFC 2996, IETF, 2000. Bernet, Y.; Blake, S.; Grossman, D. and Smith, A., "An Informal Management Model for Diffserv Routers", RFC 3290, IETF, 2002. Bernet, Y.; Ford, P.; Yavatkar, R.; Baker, F.; Zhang, L.; Speer, M.; Braden, R.; Davie, B.; Wroclawski, J. and Felstaine, E., "A Framework for Integrated Services Operation over Diffserv Networks", RFC 2998, IETF, 2000. Black, D., "Differentiated Services and Tunnels", RFC 2983, IETF, 2000. Black, D.; Brim, S.; Carpenter, B. and Faucheur, F. L., "Per Hop Behavior Identification Codes", RFC 3140, IETF, 2001. Blake, S.; Black, D.; Carlson, M.; Davies, E.; Wang, Z. and Weiss, W., "An Architecture for Differentiated Service", RFC 2475, IETF, 1998. Bless, R., "Dynamic Aggregation of Reservations for Internet Services", Proceedings of the Tenth International Conference on Telecommunication Systems - Modeling and Analysis (ICTSM 10), Vol. 1, pp. 26-38, Monterey, 2002, [Online]. Available: http://www.tm.uka.de/doc/2003/ictsm-daris-journal-crc-web.pdf Bless, R.; Nichols, K. and Wehrle, K., "A Lower Effort Per-Domain Behavior (PDB) for Differentiated Services", RFC 3662, IETF, 2003. Blunk, L.; Damas, J.; Parent, F. and Robachevsky, A., "Routing Policy Specification Language next generation (RPSLng)", RFC 4012, IETF, 2005. Bohge, M., Renwanz, M., “A realisitic VoIP traffic generation and evaluation tool for OMNeT++”, First International OMNeT++ Workshop, 2008, [Online]. Available: http://www.tkn.tu-berlin.de/research/omnetVoipTool/ Boucadair, M., "QoS-Enhanced Border Gateway Protocol", draft-boucadair-qosbgp-spec-01 (work in progress), IETF, July 2005. Braden, R.; Clark, D. and Shenker, S., "Integrated Services in the Internet Architecture: an Overview", RFC 1633, IETF, 1994. Braden, R.; Zhang, L.; Berson, S.; Herzog, S. and Jamin, S., "Resource ReSerVation Protocol (RSVP) -- Version 1 Functional Specification", RFC 2205, IETF, 1997. 
Bradner, S., "IETF Working Group Guidelines and Procedures", RFC 2418, IETF, 1998. Brown, M., Underwood, T., Zmijewski, E., “The Day the YouTube Died”, Renesys Corp. at MENOG 3, [Online]. Available: http://www.renesys.com/tech/presentations/pdf/menog3-youtube.pdf Callon, R., "Use of OSI IS-IS for routing in TCP/IP and dual environments", RFC 1195, IETF, 1990. Callon, R., “Email list discussion on: [NSIS] FW: I-D Action:draft-ietf-nsis-ntlp20.txt”, IETF NSIS working group email archive, 11 June 2009, [Online]. Available: http://www.ietf.org/mail-archive/web/nsis/current/msg08563.html Carpenter, B. E., “Re: [Diffserv] A question”, email discussion on DiffServ working group list, [Online]. Available: http://www.ietf.org/mailarchive/web/diffserv/current/msg04257.html 156 17.11.2009 [45] [46] [47] [48] [49] [50] [51] [52] [53] [54] [55] [56] [57] [58] [59] [60] [61] [62] [63] [64] Chan, K.; Babiarz, J. and Baker, F., "Aggregation of DiffServ Service Classes", RFC 5127, IETF, 2008. Chandra, R.; Traina, P. and Li, T., "BGP Communities Attribute", RFC 1997, IETF, 1996. Chen, E., "Route Refresh Capability for BGP-4", RFC 2918, IETF, 2000. Cisco, “An Introduction to IGRP”, ID: 26825, 2005, [Online]. Available: http://www.cisco.com/en/US/tech/tk365/technologies_white_paper09186a00800c8a e1.shtml Cisco, “BGP Best Path Selection Algorithm”, ID: 13753, 2006, [Online]. Available: http://www.cisco.com/en/US/tech/tk365/technologies_tech_note09186a0080094431 .shtml Cisco, “Enhanced Interior Gateway Routing Protocol”,ID: 16406, 2005, [Online]. Available: http://www.cisco.com/en/US/tech/tk365/technologies_white_paper09186a0080094c b7.shtml Cisco, “Cisco 12000 Series Internet Router Architecture: Switch Fabric”, ID: 47240, 2005, [Online]. Available: https://www.cisco.com/en/US/products/hw/routers/ps167/products_tech_note09186 a00801e1da7.shtml Cisco, “Network Infrastructure for Ensuring Predictable Business Service Delivery”, ID: C11-397769-00, 2007, [Online]. 
Available: http://www.cisco.com/en/US/prod/collateral/routers/ps6342/prod_white_paper0900a ecd805f62b1.html Cisco, “Network Infrastructure – Chapter 3”, ID: OL-13817-04, [Online]. Available: http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/srnd/6x/netstruc.html Cisco, “Congestion Avoidance Overview”, ID: QC-75, [Online]. Available: http://www.cisco.com/en/US/docs/ios/12_0/qos/configuration/guide/qcconavd.html Cisco, “Configuring QoS”, ID: 78-13490-01, [Online]. Available: http://www.cisco.com/en/US/docs/switches/lan/catalyst4500/12.1/8aew/configuratio n/guide/qos.html Cisco, “Evolving Data Center Architectures: Meet the Challenge with Cisco Nexus 5000 Series Switches”, ID: C11-473501-01, [Online]. Available: http://www.cisco.com/en/US/solutions/collateral/ns340/ns517/ns224/ns783/white_p aper_c11-473501.html Clos, C., “A study of non-blocking switching networks”, Bell System Technical Journal, vol. 32 issue 2, pp. 406–424, 1953. Colitti, L., “A strategy for IPv6 adoption”, RIPE 57, October 2008, [Online]. Available: http://www.ripe.net/ripe/meetings/ripe-57/presentations/ColittiA_strategy_for_IPv6_adoption.Z8ri.pdf Cristallo, G.; Jacquenet, C., "The BGP QOS_NLRI Attribute", draft-jacquenet-bgpqos-00 (work in progress), IETF, February 2004. Davie, B.; Charny, A.; Bennet, J.; Benson, K.; Boudec, J. L.; Courtney, W.; Davari, S.; Firoiu, V. and Stiliadis, D., "An Expedited Forwarding PHB (Per-Hop Behavior)", RFC 3246, IETF, 2002. DE-CIX, “DE-CIX topology 2009”, 2009, [Online]. Available: http://www.decix.net/content/network/topology.html DE-CIX, “DE-CIX yearly traffic graph”, 2009, [Online]. Available: http://www.decix.de/content/network.html Deering, S., Hinden, R., "Internet Protocol, Version 6 (IPv6) Specification", RFC 1883, IETF, 1995. Deering, S., Hinden, R., "Internet Protocol, Version 6 (IPv6) Specification", RFC 2460, IETF, 1998. 
157 17.11.2009 [65] [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] [77] [78] [79] [80] [81] [82] [83] [84] [85] [86] [87] [88] Delgrossi, L. & Berger, L., "Internet Stream Protocol Version 2 (ST2) Protocol Specification - Version ST2+", RFC 1819, IETF, 1995. Demers, A.; Keshav, S.; Shenkar, S.; ”Analysis and simulation of a fair queuing algorithm”. Proceedings of SIGCOMM '89, pages 1-12, 1989. Dijkstra, E. W., “A note on two problems in connexion with graphs”, Numerische Mathematik, 1, pp. 269-271, 1959, [Online]. Available: http://wwwm3.ma.tum.de/twiki/pub/MN0506/WebHome/dijkstra.pdf Djernaes, M., Appanna, C., Ward, D., “Context updates in BGP”, draft-djernaessimple-context-update-00 (work in progress), IETF, 2006. DSL Forum, “Migration to Ethernet-Based DSL Aggregation“, DSL Forum Technical Report TR-101, 2006, [Online]. Available: http://www.broadbandforum.org/technical/download/TR-101.pdf Eardley, P., "Metering and marking behaviour of PCN-nodes", Internet-Draft draftietf-pcn-marking-behaviour-03, IETF, Work in progress, 2009. Evans, J., Filsfils, C., "Deploying IP and MPLS QoS for multiservice networks: Theory and practice", Morgan Kaufmann/Elsevier, Amsterdam, 2007. Farinacci, D.; Li, T.; Hanks, S.; Meyer, D. and Traina, P., "Generic Routing Encapsulation (GRE)", RFC 2784, IETF, 2000. Farrel, A.; Ayyangar, A. and Vasseur, J., "Inter-Domain MPLS and GMPLS Traffic Engineering -- Resource Reservation Protocol-Traffic Engineering (RSVP-TE) Extensions", RFC 5151, IETF, 2008. Farrel, A.; Vasseur, J.-P. and Ayyangar, A., "A Framework for Inter-Domain Multiprotocol Label Switching Traffic Engineering", RFC 4726, IETF, 2006. Faucheur, F. L.; Wu, L.; Davie, B.; Davari, S.; Vaananen, P.; Krishnan, R.; Cheval, P. and Heinanen, J., "Multi-Protocol Label Switching (MPLS) Support of Differentiated Services", RFC 3270, IETF, 2002. Feher, G., Nemeth, K., Maliosz, M., et.al., "Boomerang A Simple Protocol for Resource Reservation in IP Networks", IEEE RTAS, 1999. 
[77] Floyd, S., Jacobson, V., "Random early detection gateways for congestion avoidance", IEEE/ACM Transactions on Networking, V.1 N.4, p. 397-413, 1993.
[78] Ford, L. R. Jr., and Fulkerson, D. R., "Flows in Networks", Princeton University Press, Princeton, N.J., 1962.
[79] Franke, K., “Lecture material: Digital Communication Networks”, Chemnitz University, 2006.
[80] Fu, X., Schulzrinne, H., Bader, A., et al., "NSIS: A new extensible IP signaling protocol suite," IEEE Communications Magazine, vol. 43, pp. 133 - 141, 2005.
[81] Fuller, V.; Li, T.; Yu, J. and Varadhan, K., "Classless Inter-Domain Routing (CIDR): an Address Assignment and Aggregation Strategy", RFC 1519, IETF, 1993.
[82] Fuller, V., Li, T., "Classless Inter-domain Routing (CIDR): The Internet Address Assignment and Aggregation Plan", RFC 4632, IETF, 2006.
[83] Gamer, T., Scharf, M., “Realistic Simulation Environments for IP-based Networks”, First International OMNeT++ Workshop, 2008, [Online]. Available: http://doc.tm.uka.de/2008/omnet2008.pdf
[84] Golestani, S.: "A Self-Clocked Fair Queueing Scheme for Broadband Applications". Proceedings of IEEE Infocom '94, p. 636-646, 1994.
[85] Grossman, D., "New Terminology and Clarifications for Diffserv", RFC 3260, IETF, 2002.
[86] Hawkinson, J., Bates, T., "Guidelines for creation, selection, and registration of an Autonomous System (AS)", RFC 1930, IETF, 1996.
[87] Hedrick, C., "Routing Information Protocol", RFC 1058, IETF, 1988.
[88] Heinanen, J.; Baker, F.; Weiss, W. & Wroclawski, J., "Assured Forwarding PHB Group", RFC 2597, IETF, 1999.
[89] Heinanen, J., Guerin, R., "A Single Rate Three Color Marker", RFC 2697, IETF, 1999.
[90] Heinanen, J., Guerin, R., "A Two Rate Three Color Marker", RFC 2698, IETF, 1999.
[91] Herzog, S.; Boyle, J.; Cohen, R.; Durham, D.; Rajan, R. & Sastry, A., "COPS usage for RSVP", RFC 2749, IETF, 2000.
[92] Hinden, R., Deering, S., "IP Version 6 Addressing Architecture", RFC 4291, IETF, 2006.
[93] Huston, G., “BGP reports”, [Online]. Available: http://bgp.potaroo.net/
[94] Hwang, J.; Altmann, J.; Oliver, H.; Suarez, A., “Enabling dynamic market-managed QoS interconnection in the next generation internet by a modified BGP mechanism”, ICC 2002, IEEE International Conference on Communications, 2002, [Online]. Available: http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/7828/21517/00997325.pdf?arnumber=997325
[95] IANA, “BGP Extended Communities Types”, IANA Protocol Registries, [Online]. Available: http://www.iana.org/assignments/bgp-extended-communities
[96] IANA, “IANAifType-MIB”, [Online]. Available: http://www.iana.org/assignments/ianaiftype-mib
[97] IEEE, "IEEE Standard for Local and metropolitan area networks Media Access Control (MAC) Bridges", IEEE 802.1D, 2004.
[98] IEEE, "IEEE standard for local and metropolitan area networks virtual bridged local area networks", IEEE 802.1Q, p. 1-285, 2006.
[99] IEEE, "IEEE Standard for Information technology-Telecommunications and information exchange between systems-Local and metropolitan area networks-Specific requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications", IEEE Std 802.11-2007 (Revision of IEEE Std 802.11-1999), C1-1184, 2007.
[100] IEEE, "IEEE Std 802.1ad - 2005 IEEE Standard for Local and metropolitan area networks - virtual Bridged Local Area Networks, Amendment 4: Provider Bridges", IEEE Std 802.1ad-2005 (Amendment to IEEE Std 802.1Q-2005), p. 1-60, 2006.
[101] IEEE, "IEEE Standard for Local and metropolitan area networks - Virtual Bridged Local Area Networks Amendment 7: Provider Backbone Bridges", IEEE Std 802.1ah-2008 (Amendment to IEEE Std 802.1Q-2005), C1-109, 2008.
[102] IEEE, "IEEE Standards for Local and Metropolitan Area Networks: Supplements to Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications - Specification for 802.3 Full Duplex Operation and Physical Layer Specification for 100 Mb/s Operation on Two Pairs of Category 3 or Better Balanced Twisted Pair Cable (100BASE-T2)", IEEE Std 802.3x-1997 and IEEE Std 802.3y-1997 (Supplement to ISO/IEC 8802-3: 1996; ANSI/IEEE Std 802.3, 1996 Edition), p. 1-324, 1997.
[103] IEEE, “IEEE 802 Tutorial: Data Center Bridging”, IEEE 802, 2007, [Online]. Available: http://www.ieee802.org/802_tutorials/07-November/Data-CenterBridging-Tutorial-Nov-2007-v2.pdf
[104] IEEE, “Virtual Bridged Local Area Networks — Amendment: Congestion Notification”, IEEE P802.1Qau/D2.1, 2009.
[105] IEEE, "IEEE Standard for Local and metropolitan area networks Part 16: Air Interface for Broadband Wireless Access Systems", IEEE Std 802.16-2009 (Revision of IEEE Std 802.16-2004), C1-2004, 2009.
[106] ISO, “Information technology – Telecommunications and information exchange between systems – Intermediate System to Intermediate System intra-domain routeing information exchange protocol for use in conjunction with the protocol for providing the connectionless-mode network service (ISO 8473)”, ISO/IEC 10589:2002, second edition, 2002.
[107] ITU-T, “Traffic control and congestion control in IP based networks”, ITU-T Y.1221, 2002.
[108] ITU-T, “Network performance objectives for IP-based services”, ITU-T Y.1541, 2006.
[109] ITU-T, “NGN FG Proceedings Part II”, ITU-T NGN Focus Group, 2005.
[110] ITU-T, “One-way transmission time”, ITU-T G.114, 2003.
[111] ITU-T, “Gigabit-capable passive optical networks (GPON): General characteristics”, ITU-T G.984.1, 2008.
[112] ITU-T, “Asymmetric digital subscriber line (ADSL) transceivers”, ITU-T G.992.1, 1999.
[113] ITU-T, “Splitterless asymmetric digital subscriber line (ADSL) transceivers”, ITU-T G.992.2, 1999.
[114] ITU-T, “Asymmetric digital subscriber line transceivers 2 (ADSL2)”, ITU-T G.992.3, 2005.
[115] ITU-T, “Splitterless asymmetric digital subscriber line transceivers 2 (splitterless ADSL2)”, ITU-T G.992.4, 2002.
[116] ITU-T, “Very high speed digital subscriber line transceivers”, ITU-T G.993.1, 2004.
[117] ITU-T, “Very high speed digital subscriber line transceivers 2 (VDSL2)”, ITU-T G.993.2, 2006.
[118] ITU-T, “End-user multimedia QoS categories”, ITU-T G.1010, 2001.
[119] ITU-T, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs”, ITU-T P.862, 2001, [Online]. Available: http://www.itu.int/rec/TREC-P862/en
[120] Jacquenet, C.; Bourdon, G.; Boucadair, M., "Service Automation and Dynamic Provisioning Techniques in IP/MPLS Environments (Wiley Series on Communications Networking & Distributed Systems)", Wiley, 2008.
[121] Jacobson, V., “Differentiated Services for the Internet”, Internet2 Joint Applications/Engineering QoS Workshop, 1998, [Online]. Available: ftp://ftp.ee.lbl.gov/talks/vj-i2qos-may98.pdf
[122] Jamoussi, B.; Andersson, L.; Callon, R.; Dantu, R.; Wu, L.; Doolan, P.; Worster, T.; Feldman, N.; Fredette, A.; Girish, M.; Gray, E.; Heinanen, J.; Kilty, T. and Malis, A., "Constraint-Based LSP Setup using LDP", RFC 3212, IETF, 2002.
[123] Klein, P., Sprecher, N., “Provider Ethernet VLAN Cross Connect”, Seabridgenetworks/NSN, 2006, [Online]. Available: http://www.ieee802.org/1/files/public/docs2006/new-sprecher-vlan-xc-ieee-0106.pdf
[124] Knoll, T. M., "BGP Extended Community Attribute for QoS Marking", draft-knoll-idr-qos-attribute-04 (work in progress), IETF, 2009.
[125] Knoll, T. M., "BGP Class of Service Interconnection", draft-knoll-idr-cos-interconnect-03 (work in progress), IETF, 2009.
[126] Knoll, T. M., “Flow control + priority consideration -> PRIORITY_PAUSE”, NG Ethernet Forum post, 2006.
[127] Knoll, T. M., “QoS capable Internet Exchange Points”, 2009, [Online]. Available: http://www.bgp-qos.org/qos-ixp/list.php
[128] Kompella, K., Rekhter, Y., "Label Switched Paths (LSP) Hierarchy with Generalized Multi-Protocol Label Switching (GMPLS) Traffic Engineering (TE)", RFC 4206, IETF, 2005.
[129] Lee, S., Gahng-Seop, A., Zhang, X. and Campbell, A., "INSIGNIA: An IP-Based Quality of Service Framework for Mobile Ad Hoc Networks". Journal of Parallel and Distributed Computing (Academic Press), Special issue on Wireless and Mobile Computing and Communications, Vol. 60, Number 4, pp. 374-406, April 2000.
[130] Malkin, G., "RIP Version 2", RFC 2453, IETF, 1998.
[131] Manner, J.; Fu, X., “Analysis of Existing Quality-of-Service Signaling Protocols”, RFC 4094, IETF, 2005.
[132] Manner, J.; Karagiannis, G. & McDonald, A., "NSLP for Quality-of-Service Signaling", Internet-Draft draft-ietf-nsis-qos-nslp-16, IETF, Work in progress, 2008.
[133] Manning, B., "Registering New BGP Attribute Types", RFC 2042, IETF, 1997.
[134] Manns, D., “Simulative Untersuchung von klassenbasiertem Inter-AS IP-Forwarding mit Ethernet IXP”, Diplomarbeit – TU Chemnitz, Chemnitz, 2009.
[135] Marques, P.; Sheth, N.; Raszuk, R.; Greene, B.; McPherson, D., "Dissemination of flow specification rules", draft-ietf-idr-flow-spec-09 (work in progress), IETF, May 2009.
[136] Menth, M., Lehrieder, F., “Pre-Congestion Notification: Lightweight Admission Control and Flow Termination for the Future Internet”, ICC 2009, Dresden, 2009.
[137] Merit, “Internet Routing Registry”, Merit Network Inc., 2009, [Online]. Available: http://www.irr.net/
[138] Mills, D., "Exterior Gateway Protocol formal specification", RFC 904, IETF, 1984.
[139] Morand, P., Boucadair, M., Asgari, H., Egan, et al., “D1.4: Issues in MESCAL Inter-Domain QoS Delivery: Technologies, Bi-directionality, Inter-operability, and Financial Settlements”, MESCAL Consortium, 2004, [Online]. Available: http://www.istmescal.org/deliverables/MESCAL-D14-final-v2.pdf
[140] Moy, J., "OSPF Version 2", RFC 2328, IETF, 1998.
[141] Nichols, K., “An Opinionated View of the Current State of IP Differentiated Services”, UC Berkeley MIG Seminar, 1999, [Online]. Available: http://bmrc.berkeley.edu/courseware/cs298/fall99/nichols/kmn_ucbmm.pdf
[142] Nichols, K.; Blake, S.; Baker, F. and Black, D., "Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers", RFC 2474, IETF, 1998.
[143] Nichols, K. and Carpenter, B., "Definition of Differentiated Services Per Domain Behaviors and Rules for their Specification", RFC 3086, IETF, 2001.
[144] Nipper, A., “VLAN User Priority test on DE-CIX platform”, DE-CIX, 2009.
[145] NS2 team, “Network Simulator 2 – ns2”, NS2 webpage, 2009, [Online]. Available: http://www.isi.edu/nsnam/ns/
[146] Ould-Brahim, H.; Fedyk, D. & Rekhter, Y., "BGP Traffic Engineering Attribute", RFC 5543, IETF, 2009.
[147] Pan, P., Hahne, E. and Schulzrinne, H., "BGRP: A Tree-Based Aggregation Protocol for Inter-domain Reservations", Journal of Communications and Networks, Vol. 2, No. 2, pp. 157-167, 2000.
[148] Pan, P., Schulzrinne, H., "YESSIR: A Simple Reservation Mechanism for the Internet". Proceedings of NOSSDAV, Cambridge, UK, 1998.
[149] Parekh, A.K.; Gallager, R.G., “A generalized processor sharing approach to flow control in integrated services networks”. Proceedings of IEEE Infocom ’92, p. 915-924, 1992.
[150] Park, H., "Systematic QoS Class Mapping Framework for Application Requirement over Heterogeneous Networks", Telecommunications Network Strategy and Planning Symposium, Networks 2008, 2008.
[151] PeeringDB, “Peering Database - PeeringDB”, PeeringDB.com, 2009, [Online]. Available: http://www.peeringdb.com
[152] Perkins, C., "IP Encapsulation within IP", RFC 2003, IETF, 1996.
[153] Postel, J., "Internet Protocol", RFC 791, IETF, 1981.
[154] Quagga, “Quagga Routing Software Suite”, 2009, [Online]. Available: http://www.quagga.net
[155] Rajahalme, J.; Conta, A.; Carpenter, B. and Deering, S., "IPv6 Flow Label Specification", RFC 3697, IETF, 2004.
[156] Ramakrishnan, K.; Floyd, S. and Black, D., "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, IETF, 2001.
[157] Rekhter, Y.; Li, T. and Hares, S., "A Border Gateway Protocol 4 (BGP-4)", RFC 4271, IETF, 2006.
[158] Rice, L., “The inter-colonial telegraph station at Eucla.”, [Online]. Available: http://members.iinet.net.au/~oseagram/eucla.html
[159] RIPE, “Representation of IP Routing Policies in a Routing Registry”, Réseaux IP Européens, 1994, [Online]. Available: ftp://ftp.ripe.net/ripe/docs/ripe-181.txt
[160] Rosen, E.; Tappan, D.; Fedorkow, G.; Rekhter, Y.; Farinacci, D.; Li, T. and Conta, A., "MPLS Label Stack Encoding", RFC 3032, IETF, 2001.
[161] Sangli, S.; Tappan, D. and Rekhter, Y., "BGP Extended Communities Attribute", RFC 4360, IETF, 2006.
[162] Sara, “Peering policy for SARA (AS1126)”, SARA Computing and Networking Services, 2009, [Online]. Available: http://www.as1126.net/
[163] Seaman, M.; Smith, A.; Crawley, E. & Wroclawski, J., "Integrated Service Mappings on IEEE 802 Networks", RFC 2815, IETF, 2000.
[164] Schulzrinne, H. and Stiemerling, M., "GIST: General Internet Signalling Transport", Internet-Draft draft-ietf-nsis-ntlp-20, IETF, Work in progress, 2009.
[165] Schwabe, T., “IP-Netze mit Interdomain-BGP-Routing: Konvergenzverhalten, Dienstqualität und Dimensionierung”, Ph.D. dissertation, TU München, 2007.
[166] Shenker, S.; Partridge, C. and Guerin, R., "Specification of Guaranteed Quality of Service", RFC 2212, IETF, 1997.
[167] Smith, P., “BGP Scaling Techniques”, AfNOG workshop, 2006, [Online]. Available: http://ws.edu.isoc.org/data/2006/153397902444822943b7611/bgpscal.ppt
[168] Sofia, R., Guerin, R. and Veiga, P., “SICAP, a Shared-segment Inter-domain Control Aggregation Protocol”, High Performance Switching and Routing, HPSR, Turin, 2003.
[169] Spenneberg, R., "Linux-firewalls mit Iptables & Co.", Pearson Education, 2006.
[170] Suter, B.; Lakshman, T.V.; Stiliadis, D.; Choudhury, A.K., "Buffer Management Schemes for Supporting TCP in Gigabit Routers with Per-Flow Queueing". IEEE Journals in Selected Areas in Communications, 1999.
[171] Suzuki, M., “Per-priority Flow Control”, IEEE 802.1 meeting Portland, 2004, [Online]. Available: http://www.ieee802.org/1/files/public/docs2004/Perpriority%20Flow%20Control1.pdf
[172] Traina, P.; McPherson, D. & Scudder, J., "Autonomous System Confederations for BGP", RFC 5065, IETF, 2007.
[173] Trick, U., Weber, F., "SIP, TCP/IP und Telekommunikationsnetze: Anforderungen Protokolle - Architekturen", Oldenbourg, München, 2004.
[174] Verizon, “Verizon Business Policy for Settlement-Free Interconnection with Internet Networks”, Verizon Business, 2009, [Online]. Available: http://www.verizonbusiness.com/terms/peering/
[175] Villamizar, C.; Chandra, R. & Govindan, R., "BGP Route Flap Damping", RFC 2439, IETF, 1998.
[176] Vohra, Q., Chen, E., "BGP Support for Four-octet AS Number Space", RFC 4893, IETF, 2007.
[177] Westerlund, M., “Email list discussion on: [NSIS] GIST updated from todays IESG call”, IETF NSIS working group email archive, 9 April 2009, [Online]. Available: http://www.ietf.org/mail-archive/web/nsis/current/msg08534.html
[178] Westerlund, M., “Email list discussion on: [NSIS] GIST updated from todays IESG call”, IETF NSIS working group email archive, 21 April 2009, [Online]. Available: http://www.ietf.org/mail-archive/web/nsis/current/msg08543.html
[179] Wikipedia, “Eucla, Western Australia”, [Online]. Available: http://en.wikipedia.org/wiki/Eucla,_Western_Australia
[180] Wikipedia, “Network neutrality”, [Online]. Available: http://en.wikipedia.org/wiki/Network_neutrality
[181] Wireshark, “Wireshark network protocol analyzer”, Wireshark design team, 2009, [Online]. Available: http://www.wireshark.org
[182] Wroclawski, J., "The Use of RSVP with IETF Integrated Services", RFC 2210, IETF, 1997.
[183] Wroclawski, J., "Specification of the Controlled-Load Network Element Service", RFC 2211, IETF, 1997.
[184] Yavatkar, R.; Hoffman, D.; Bernet, Y.; Baker, F. & Speer, M., "SBM (Subnet Bandwidth Manager): A Protocol for RSVP-based Admission Control over IEEE 802-style networks", RFC 2814, IETF, 2000.
[185] Zhang, Z., "ExtCommunity map and carry TOS value of IP header", draft-zhang-idr-bgp-extcommunity-qos-00 (work in progress), IETF, November 2005.

List of Figures
Fig. 1 IP version 4 datagram structure
Fig. 2 IPv4 address class system - [22]
Fig. 3 CIDR example network mask
Fig. 4 IP version 6 datagram structure
Fig. 5 Differentiated Services (DS) field in IPv4 and IPv6 datagram headers
Fig. 6 IP routing and forwarding functionality
Fig. 7 Internet routing hierarchy
Fig. 8 Internet routing architecture
Fig. 9 IP routing protocols – classified by applicability
Fig. 10 IP routing protocols – classified by working principle
Fig. 11 Internet Exchange Point - IXP
Fig. 12 BGP Best Path Selection Algorithm - [49]
Fig. 13 BGP message structure
Fig. 14 BGP path attribute classification [46], [161]
Fig. 15 BGP UPDATE message structure – after [157]
Fig. 16 BGP UPDATE message structure with Extended Community attribute
Fig. 17 BGP Route Reflector topology
Fig. 18 Autonomous System Confederations for BGP
Fig. 19 IP router block diagram
Fig. 20 IP router internal structure -> route processing
Fig. 21 IP router with non-blocking fabric and virtual output queues
Fig. 22 Router internal forwarding path per hop behaviour
Fig. 23 Drop-Tail queue dropping strategy
Fig. 24 Random Early Detection (RED) for congestion avoidance
Fig. 25 Longest Queue Drop (LQD) of virtually separated flows
Fig. 26 Round Robin scheduling
Fig. 27 Strict Priority scheduling
Fig. 28 Weighted Round Robin scheduling
Fig. 29 Symbolized fair queuing in an idealized GPS = Fluid-Flow Queuing
Fig. 30 Fluid-flow approximated queuing in WFQ
Fig. 31 VoQ with 8 classes CoS support (scheduling and dropping)
Fig. 32 Per hop forwarding behaviour composition in relaying nodes
Fig. 33 Leaky bucket algorithm
Fig. 34 Token bucket algorithm
Fig. 35 QoS-based IP lookup variants
Fig. 36 Best Effort interconnection example
Fig. 37 QoS-based forwarding interconnection example
Fig. 38 QoS-based path selection in BGP
Fig. 39 QoS-based routing interconnection example
Fig. 40 Tunnelling scope
Fig. 41 QoS-based tunnelling interconnection example
Fig. 42 Differentiated Services regions, domains and nodes
Fig. 43 Behaviour aggregate classification and DSCP marking
Fig. 44 Logical View of a Packet Classifier and Traffic Conditioner
Fig. 45 PHB <-> DSCP mapping
Fig. 46 PHB encoding [31]
Fig. 47 Encoding of Assured Forwarding PHBs
Fig. 48 RSVP flow descriptor structure
Fig. 49 RSVP message flow diagram
Fig. 50 RSVP support block diagram – after [39]
Fig. 51 Cisco’s two RSVP operation models: IntServ and IntServ/DiffServ [53]
Fig. 52 Ethernet frame format
Fig. 53 IEEE 802.1p User Priority marking in 802.1q (VLAN) tagged frames
Fig. 54 VLAN Cross Connect / VLAN XC [123]
Fig. 55 Q-in-Q / stacked VLAN / Provider Bridges - IEEE 802.1ad [100]
Fig. 56 MAC-in-MAC / Provider Backbone Bridges (PBB) – IEEE 802.1ah [101]
Fig. 57 Priority Flow Control [56]
Fig. 58 Congestion spreading [103]
Fig. 59 MPLS shim header structure and hierarchy usage
Fig. 60 MPLS Label stack structure
Fig. 61 MPLS LSP signalling: contiguous, nested, stitched
Fig. 62 GMPLS label representations
Fig. 63 GMPLS LSP hierarchy
Fig. 64 ATM cell structure
Fig. 65 Functional layering structure for the Ethernet data service [111]
Fig. 66 AS interconnection options
Fig. 67 DE-CIX topology 2009 [61]
Fig. 68 Internet hierarchy
Fig. 69 Route Flap Dampening [167]
Fig. 70 MESCAL - Cascaded Approach [139]
Fig. 71 Components of a NSIS node - [80]
Fig. 72 GIST protocol change to “Experimental” status [164]
Fig. 73 GIST protocol objections explained by Ross Callon [43]
Fig. 74 PCN working principle - [136]
Fig. 75 DE-CIX yearly traffic graph - [62]
Fig. 76 Cross-Domain CoS marking concept
Fig. 77 IANA registry for BGP Extended Community type numbers
Fig. 78 BGP Extended Community Attribute structure with type 0x40 or 0x44
Fig. 79 Structure of the QoS Marking Community
Fig. 80 CoS enabled AS interconnection example topology
Fig. 81 QoS Marking Extended Community signalling example
Fig. 82 Class overload limitation concept
Fig. 83 CoS Capability Extended Community Structure
Fig. 84 Per-Hop-Behaviour Identification Codes implied by CoS Capability
Fig. 85 CoS Parameter Attribute structure
Fig. 86 Classification of the Mapping scope
Fig. 87 User/Subscriber Service Classes Grouping - [16]
Fig. 88 Service Class Characteristics - [16]
Fig. 89 DSCP to Service Class Mapping - [16]
Fig. 90 QoS Mechanisms Used for Each Service Class - [16]
Fig. 91 Treatment Aggregate / Service Class Performance Requirements - [45]
Fig. 92 Treatment Aggregate Behaviour - [45]
Fig. 93 MPLS E-LSP mapping of Treatment Aggregates - [45]
Fig. 94 QoS Class Mapping framework - [150]
Fig. 95 Mandatory L-LSP encoding rules - [75]
Fig. 96 Scenario 1: single node interconnection
Fig. 97 S1: 9-f-cbwfq
Fig. 98 S1: 9-a-no-priority
Fig. 99 S1: 9-f-cbwfq
Fig. 100 S1: 9-a-no-priority
Fig. 101 S1: 9-f-strict-priority
Fig. 102 Scenario 2: AS Interconnection – Single AS
Fig. 103 S2: 9-f-cbwfq
Fig. 104 S2: 9-a-no-priority
Fig. 105 S2: 9-b-cbwfq
Fig. 106 S2: 9-e-cbwfq
Fig. 107 Scenario 3: AS interconnection – Multi-AS
Fig. 108 S3: 9-f-cbwfq
Fig. 109 S3: 9-a-no-priority
Fig. 110 Scenario 4: AS interconnection – Multi-AS 2
Fig. 111 S4: CBWFQ / no priority
Fig. 112 S4: no priority / CBWFQ
Fig. 113 Scenario 5: AS interconnection – Multi-AS 3
Fig. 114 S5: 2x CBWFQ / no priority
Fig. 115 S5: no priority / 2x CBWFQ
Fig. 116 S5: 2x CBWFQ / EF&BE
Fig. 117 S5: EF&BE / 2x CBWFQ
Fig. 118 Scenario 6: AS interconnection – Multi-AS 4
Fig. 119 S6: 2 classes w/o remark
Fig. 120 S6: 2 classes with remark
Fig. 121 S6: 1 class w/o remarking
Fig. 122 S6: 1 class with remarking
Fig. 123 Scenario 7: AS interconnection – Cross-Layer
Fig. 124 S7: with Ethernet QoS
Fig. 125 S7: without Ethernet QoS
Fig. 126 Single node structure with token bucket filtering
Fig. 127 Single node TB 4->4
Fig. 128 Single node TB 4->3
Fig. 129 Single node TB 4->2
Fig. 130 Single node TB 4->1
Fig. 131 Quagga Routing Suite structure
Fig. 132 Example setup for 4 QoS Marking Ex. Communities for IP-DiffServ
Fig. 133 Example setup for a CoS Capability Ex. Community
Fig. 134 Example setup for a CoS Parameter Attribute
Fig. 135 Wireshark screenshot with captured Extended Communities
Fig. 136 Reception example of Extended Communities in commercial routers
Fig. 137 Decoding result of the online form
Fig. 138 Implementation test setup
138 Router1: show ip bgp sum – full feed.................................................................................................... 139 Router 1: show ip bgp sum – single prefix with 4 communities......................................................... 140 Router 1: show ip bgp sum – 10 prefixes with 4 communities .......................................................... 140 Router1: show ip bgp sum – full feed with single community attached to all .................................. 140 Completely processed Extended Community attribute example ...................................................... 141 VLAN User Priority test at DE-CIX [144] ............................................................................................. 142 Major European IXPs with VLAN User Priority support...................................................................... 143 Active BGP entries over time [year] - [93] ............................................................................................ 144 Unique ASes over time [year] - [93] ...................................................................................................... 144 Hourly Average of Updated and Withdrawn Prefix Rate - [93].......................................................... 145 CoS signalling UPDATE message overhead – single prefix case.................................................... 146 Wireshark screenshot with 173 Extended Communities UPDATE................................................... 147 Structure of the QoS Marking Community............................................................................................ 149 Memory usage estimates for up to 8 classes and four technologies ............................................... 150 Memory usage for ext. communities sent within one UPDATE message ....................................... 151 Memory usage for large quantities of sent extended communities .................................................. 
Fig. 155 Linux remote control of existing commercial AS border router

List of Tables

Table 1 Transfer demand matrix – after [79]
Table 2 Assured Forwarding DSCP encoding
Table 3 Currently specified PHBs
Table 4 Excerpt of IP QoS class definitions and performance objectives [108]
Table 5 Ethernet traffic types [97]
Table 6 Mapping of traffic types to available queues [97]
Table 7 Chemnitz University applied Ethernet-priority-to-DSCP mapping
Table 8 UMTS QoS classes [1]
Table 9 UMTS Bearer Service Attributes [1]
Table 10 LTE QoS class attributes [2]
Table 11 Overview of available layer 2 and 3 quality of service classes
Table 12 Technology Type Enumeration
Table 13 CoS Capability Attribute – binary class encoding
Table 14 Queue mapping reuse for priority mapping
Table 15 Cisco’s default CoS-to-DSCP mapping [55]
Table 16 Cisco’s default DSCP-to-CoS mapping [55]
Table 17 Chemnitz University applied CoS-to-DSCP mapping
Table 18 Traffic source configuration parameters
Table 19 Class and traffic type variations in simulations
Table 20 Simulation parameter settings
Table 21 Extended command line syntax for CoS configurations

Declaration

I hereby declare that I have prepared this thesis without impermissible help from third parties and without using any aids other than those stated; ideas taken directly or indirectly from other sources are marked as such. In selecting and evaluating the material and in preparing the manuscript, I received support from the following persons:

Prof. Dr.-Ing. Thomas Bauschert
Prof. Dr.-Ing. habil. Klaus Franke
Simon Ehnert
Daniel Manns
Uwe Steglich
Brian Schaefer

No other persons were involved in writing this thesis. I did not make use of the services of a doctoral consultant. No other persons have received payment or benefits in kind from me for work connected with the content of the submitted dissertation. The thesis has not previously been submitted to any other examination authority, in Germany or abroad, in the same or a similar form.

Chemnitz, 17.11.2009
Place, date, signature

Theses

1. The Internet has become increasingly popular in recent years and has a steadily growing user base. The resulting traffic load, especially due to rapidly increasing Internet access speeds, will lead to high traffic volumes in the core of the network.
2. A rising use of the Internet for time- and loss-critical services, such as voice over IP (VoIP), video streaming (IPTV) and online gaming, can be observed, together with high user expectations of service quality. This will inevitably require quality of service (QoS) handling procedures in provider networks.
3. The current Internet structure consists of about 30,000 interconnected service provider networks. Those interconnections are based on the Internet Protocol (IP) and do not distinguish the mixed traffic types within the transported traffic load.
4. Link capacity over-provisioning is the easiest and most sustainable way to provide high quality of transmission service and will always be used.
5. Over-provisioning of interconnection links results in low link capacity utilization and frequent speed upgrades due to the traffic growth rate.
6. The resulting hardware upgrade cost for faster router interfaces will become a financial burden for service providers who rely solely on over-provisioning in their network operation.
7. Today, service providers are already making use of the QoS concept of “Differentiated Services (DiffServ)” within their network domains, the Autonomous Systems (ASes). Its deployment is increasing and it is expected to become universally available within ASes.
8. The Internet’s default packet forwarding behaviour, Best Effort (BE), will not be sufficient on interconnection links in the future.
9. AS interconnections need to support at least simple traffic separation and separate traffic queues for enhanced interconnection quality support.
10. The setup and operation of AS interconnections is a fundamental element in any provider’s network. Any inter-domain QoS solution will therefore need to be simple to gain community acceptance.
11. This thesis identified two fundamental design requirements for a simple QoS concept: simplicity in design and simplicity in QoS support.
12. QoS in this approach therefore refers to primitive traffic separation into several classes, which experience differently prioritized forwarding behaviour in relaying nodes. The aim is to enqueue the classes in separate queues.
13. Combining the customary link capacity over-provisioning with a simple inter-domain QoS concept that separates traffic classes will enable class-based over-provisioned interconnections.
14. The signalling of available traffic classes is required, and bilateral solutions based on Service Level Agreements (SLAs) can be set up manually. However, the newly designed class of service (CoS) signalling procedures will inform all globally interconnected service providers about the available traffic separation support and will automate the setup of CoS-enabled AS interconnections.
15. Because CoS support is currently missing, service providers often perform multi-layer ingress classification on incoming traffic in order to make a good guess at which traffic is entering the domain. This costly classification procedure can be waived if the new CoS signalling announces the available class sets and provides inter-domain mapping information.
16. Simulations have shown that even the interconnection of differently configured CoS-enabled ASes leads to a considerable increase in successfully transferred high-priority traffic across a chain of several transit ASes, compared to BE-only interconnections.
17. Discussions with service providers have revealed strong demand for “Lower Effort (LE)” class of service support. De-prioritization support by means of a signalled LE CoS is included in the concept’s specification. The support of BE and LE traffic is the simplest recommended class set combination in the specification.
18. The simple CoS support concept claims that support for the Expedited Forwarding (EF), one commonly used Assured Forwarding (AF) group, BE and LE traffic classes will suffice at AS interconnections for most service providers. The aim is a global Internet with generally available 2-class, or the recommended 4-class, CoS support.
19. The advertised availability of higher-priority traffic class support will potentially lead to misuse. Furthermore, high-class forwarding quality can only be supplied to a limited share of the link capacity. Therefore, class-overload protection is required and is optionally provided by the new concept.
20. Class of service support allows for higher link utilization without noticeable service degradation. In this way, the concept allows interface speed upgrades to be postponed until a higher utilization threshold is crossed. The deferral is expected to deliver an easily achievable economic benefit to service providers.
21. Quality of service support is not confined to the IP layer, but is offered by several packet-based networking technologies. Multiprotocol Label Switching (MPLS) and Ethernet with virtual LAN (local area network) support are the two most common QoS-capable tunnelling technologies for IP transport.
22. The harmonization of IP QoS and lower-layer QoS is essential and is handled manually in the intra-domain case. Inter-domain signalling of cross-layer QoS support is a novel feature and is provided by the new CoS concept. Even upcoming tunnelled interconnections can thus be harmonized automatically.
23. Simulations have shown that the preservation of class of service markings is vital for successful traffic separation. ASes that remark packets on their way through CoS domains with very limited class support virtually destroy the separation along the remaining AS forwarding chain, so that finer-grained class sets can no longer be used for separation.
24. The transparent transport of customer traffic is strongly recommended by the concept. Marking preservation is achieved automatically by traffic encapsulation and tunnelled transport.
25. A current trend towards Ethernet- and MPLS-based inter-domain tunnelling is observable. The new CoS concept already provides the signalling means for harmonized CoS interconnection.
26. For practical use of the new concept, the following are available: a Linux implementation, an implementation in the official release of the network analysis tool Wireshark, and an online form for decoding raw signalling data.
27. Linux remote control of commercial routers via command line sessions is planned as an intermediate deployment solution for the concept with legacy routing equipment. The transitive design of the required signalling elements allows passive bidirectional signalling relay through existing routers without requiring hardware or software updates.
28. The concept’s integration into commercial equipment is expected because of its simplicity and ease of implementation.
29. The current discussion about network neutrality reveals a fundamental objection to any traffic separation scheme. However, because of the new concept’s universal applicability and the resulting general availability of CoS support to all network users, the concept is likely to be regarded as non-discriminating.
30. Further techno-economic studies on the cost reduction potential of the concept will need to be carried out to guide the device upgrade and CoS deployment decision process.
31. A new BGP Community based signup procedure for new services and concepts has recently been proposed by Google. Depending on the outcome, this CoS support concept could even be used as a contractual basis for inter-provider class of service support agreements.

Curriculum Vitae
