Institutionen för datavetenskap
Department of Computer and Information Science
Final thesis
Estimating Internet-scale Quality of Service
Parameters for VoIP
by
Markus Niemelä
LIU-IDA/LITH-EX-A--16/013--SE
2016-03-24
Linköpings universitet
SE-581 83 Linköping, Sweden
Supervisor: Mikael Asplund
Examiner: Simin Nadjm-Tehrani
Presentation Date: 2016-03-24
Publishing Date (Electronic version): 2016-04-22
Department and Division: Software and Systems, Department of Computer and Information Science
Language: English
Type of Publication: Degree thesis, D-level
ISRN: LIU-IDA/LITH-EX-A--16/013--SE
Number of Pages: 50
URL, Electronic Version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-127360
Publication Title: Estimating Internet-scale Quality of Service Parameters for VoIP
Author: Markus Niemelä
Keywords: Voice over IP (VoIP), Quality of Service (QoS), cost estimation, QoS-driven routing
Abstract
With the rising popularity of Voice over IP (VoIP) services, understanding the
effects of a global network on Quality of Service is critical for the providers of
VoIP applications. This thesis builds on a model that analyzes the round trip
time, packet delay jitter, and packet loss between endpoints on an Autonomous
System (AS) level, extending it by mapping AS pairs onto an Internet topology.
This model is used to produce a mean opinion score estimate. The mapping is
introduced to reduce the size of the problem in order to improve computation
times and improve accuracy of estimates. The results of testing show that
estimating mean opinion score from this model is not desirable. It also shows
that the path mapping does not affect accuracy, but does improve computation
times as the input data grows in volume.
Contents

List of Abbreviations

1 Introduction
  1.1 Problem description
  1.2 Approach
    1.2.1 Network modeling
    1.2.2 Cost-based QoS evaluation
    1.2.3 Evaluation of model
  1.3 Related work
  1.4 Limitations

2 Background
  2.1 Autonomous systems
  2.2 Routing
  2.3 Route Views
  2.4 AS relationship inference
  2.5 Quality of Service metrics
  2.6 Additivity of QoS parameters
  2.7 Machine learning algorithms
    2.7.1 Clustering
    2.7.2 Supervised learning

3 System design
  3.1 System overview
  3.2 Network model
    3.2.1 Path mapping
    3.2.2 Path component cost estimation
  3.3 Mean opinion score estimator
    3.3.1 Evaluation of alternatives
  3.4 Design summary

4 Evaluation
  4.1 Test objectives and parameters
  4.2 Data set selection
  4.3 Network model
  4.4 Mean opinion score estimator
  4.5 Full system

5 Summary and conclusions
  5.1 Summary of the outcomes
  5.2 Lessons learned
  5.3 Conclusions
  5.4 Future work
    5.4.1 Path mapping
    5.4.2 Parameter exploration
    5.4.3 Alternative models

Bibliography

Appendices

A Source code
  A.1 Constrained BFS Java implementation
  A.2 MATLAB estimate tests
List of Abbreviations
AS Autonomous System.
BGP Border Gateway Protocol.
C2P customer-to-provider.
CAIDA The Cooperative Association for Internet Data Analysis.
CI confidence interval.
ICMP Internet Control Message Protocol.
IGP Interior Gateway Protocol.
IP Internet Protocol.
ISP Internet Service Provider.
ITU The International Telecommunication Union.
LLSE Linear Least Squares Estimation.
MOS Mean Opinion Score.
MSE Mean Square Error.
P2C provider-to-customer.
P2P peer-to-peer.
QoS Quality of Service.
RMSE Root Mean Square Error.
RTT Round Trip Time.
S2S sibling-to-sibling.
VoIP Voice over IP.
Chapter 1
Introduction
Real-time media quality between two or more endpoints is critical for the success
of Voice over IP (VoIP) applications. As providers of these services control only parts of the network on which they operate, they do not today control how media packets are routed between endpoints in individual sessions. This is instead determined by the routing policies of Internet entities outside
of the product's control.
These global packet routes have a tremendous impact on real-time communication quality. To be more competitive and provide guaranteed real-time media communication quality, control must therefore be established over real-time media packet routing. The first step towards such control is understanding the global network's impact on real-time media quality.
1.1 Problem description
Today, a large amount of data is logged for millions of VoIP calls every day to monitor and give insight into call quality, among other things. With such
a large amount of data, it is possible to infer quality properties of individual
links, such as Mean Opinion Score (MOS), Round Trip Time (RTT), packet loss,
packet delay jitter, and available bandwidth. However, even on the highest level
where connections between Autonomous System (AS) pairs are considered, there
are thousands of endpoints and millions of pairs of such endpoints, numbers
that are steadily growing. Trying to infer parameters for all of these pairs and
endpoints is a large problem, which takes a long time to calculate.
Additionally, many pairs may not have up-to-date data reflecting the true
characteristics of the connection between the pairs’ endpoints. This is because
the Internet is constantly changing, and the number of calls logged in a day is
dwarfed by the number of possible AS pairs.
As a first step toward controlling real-time media packet routing, the goal
is to be able to make an informed decision on whether to relay a call through
company-controlled ASs or not, weighing costs against quality of service benefits. This means that identifying calls where the experience is expected to be poor, and being able to offer alternative routes for these, is of particular interest.
Figure 1.1: Two potential paths for a VoIP call to take between A and B.
The choice is illustrated in Figure 1.1. A call between AS A and AS B
normally takes a path through other ASs on the Internet over which the VoIP
provider has no control, represented by the solid edge in the figure. The dotted
edges represent the alternative path where the call is relayed through a network
over which the provider does have control. While the provider still has no control
over edges AC and DB, it can now influence edge CD. Data for the full dotted
path is not available however, as traffic has not yet been routed through this
path. Therefore it is desirable to infer this from the individual components of
the path.
The problem’s complexity lies in the fact that there is a large number of
combinations of endpoints, and it keeps growing. There are billions of potential
pairings, and trying to solve for all of the Quality of Service (QoS) contributions
of these is infeasible for two reasons: data will not be available for all these pairs,
and the resources needed to process the required amounts of data would be immense. How, then, can these billions of decisions be handled efficiently?
Internally at Skype, one provider of VoIP services, a model has been developed that tries to keep track of all of these pairs for which there is data. However, the company considers this model to be slow and not scalable. To be able to put it into practical use, the complexity must be reduced somehow.
This thesis proposes and evaluates an improved model that aims to be faster
at processing the large number of AS pairs required to make well-informed
decisions. The goal is to reduce it to a more manageable state in order to be
able to analyze the data quickly and reliably. The model would be applied to
the large amounts of data prior to call set-up, and could then be used to provide
real-time information about the expected quality of service of route alternatives
during call set-up.
The purpose of the thesis is to provide a basis for evaluating whether such a
model can be of use in making relay decisions. This will be done by comparing
it to the existing model in accuracy of estimates as well as computation times.
1.2 Approach
To evaluate QoS given the problem description, we approach the problem in
two parts as described in this section.
1.2.1 Network modeling
The existing model describes all possible AS pairs and the QoS metrics associated with them by keeping track of every individual pair. This model represents
the network as a complete graph of AS vertices. That is, each of the vertices
is directly connected to all other vertices, as in Figure 1.2. Recognizing that
quality metrics can be affected by network connections within an AS as well as
between a pair, inferring these costs from call data becomes an expensive problem. With n vertices, there are N = n^2 cost variables to calculate. For C data points, where C > N, using a basic Linear Least Squares Estimation (LLSE) algorithm to solve for the costs has a time complexity of O(N^2·C) = O(n^4·C). Currently, there are over 50,000 ASs present on the Internet [1], which gives the following:

n > 50,000 ⟹ n^4 > 6.25 × 10^18

It is unlikely that all of these AS pairs will be directly involved with VoIP calls, but as the number of ASs is constantly growing, this may well become the case in the future if we assume that the proportion of ASs involved in VoIP calls does not drop. Adding that the number of data points C must be greater than N = n^2, this is a daunting size. This model will henceforth be referred to as the naïve model, as the way it represents the Internet is simplified compared to the next model.
Figure 1.2: A naïve AS relationship graph.
Clearly the Internet is not a complete graph, and in practice most ASs are
directly connected to very few other ASs, routing traffic towards each other
using the Border Gateway Protocol (BGP). The naïve model would consider a path between pairs to be a direct connection, independent of all other AS
pairs, but knowing that this is not the case, we can model it differently. A path
between two ASs is in fact a combination of direct connections between ASs. An
example is shown in Figure 1.3. Here, a path between D and A is D → B → A,
including two direct connections. Note that both of these connections are also
used in paths connecting other AS pairs in the graph. Compared to Figure 1.2, there are far fewer direct connections, but all ASs still have paths to each other. With n vertices, there are still n^2 pairs, but each pair maps to a path P that is a subset of the m direct connections e between vertices on the Internet:

(AS_i, AS_j) ↦ P_{i,j} ⊂ {e_k : k ∈ [1, m]}, ∀ i, j ∈ [1, n]

Necessarily, m ≤ n^2, and looking at BGP routing tables reveals that m appears to grow closer to O(n) than to O(n^2). With the numbers from earlier, this could mean a potential improvement of 9 orders of magnitude when scaled up fully. It should be noted that while the number of possible subsets is 2^n, there is generally only one correct path for each pair.
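As a quick sanity check of that figure (an added illustration, assuming m grows roughly linearly with n, say m ≈ c·n for a small constant c):

O(n^4·C) / O(m^2·C) ≈ n^4 / (c·n)^2 = n^2 / c^2,   with n = 5 × 10^4 ⟹ n^2 = 2.5 × 10^9 ≈ 10^9.4

which, up to the constant c, matches the roughly nine orders of magnitude quoted above.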
Figure 1.3: A topological AS relationship graph.
With this approach, calculating the cost of connections between pairs turns
into two problems. The problem of inferring costs for edges and vertices of the
graph remains, but with the new time complexity O(m^2·C). To achieve this, the
problem of mapping AS pairs to paths is introduced. Consequently, finding a
good method of solving this is necessary for this proposed model to be feasible
to implement.
When evaluating the chosen model, we will measure the computation speedup of the model, as well as look at whether or not the produced results are
able to identify poor paths and good alternative paths, which will help in the
decision-making process.
1.2.2 Cost-based QoS evaluation
Given a graph that has been populated with costs, in order to be able to make
a decision about whether to relay a call or not, we need to be able to quantify
the QoS of a call between a pair of ASs somehow. It may not be immediately
obvious what impact the QoS metrics in the model have on the actual perceived
quality of a call; even if a relayed call might improve the metrics, a non-relayed
call may still provide a sufficiently high QoS.
MOS is generally used to measure the subjective QoS, but it cannot be modeled using the above approach, as it is modeled as a more complex function of other QoS parameters. There is, however, a relationship between the measurable network metrics and MOS, meaning the decision-making could be reduced to comparing single numbers. This relationship is not obvious, as noted in Section 2.5, and other explanatory variables may be necessary to obtain satisfactory results.
In order to evaluate the implemented model’s usefulness for decision-making,
MOS will be used. As such, models to map network metric measurements to
MOS will be explored and their accuracy evaluated.
1.2.3 Evaluation of model
To evaluate the network model and MOS estimation model, the following questions need to be answered:
• Is the proposed network model as reliable as the current naïve model?
• Does performance in terms of computation time improve in practice when
comparing the proposed model to the current naïve model?
• Can the model directly predict user experience by translating network
metrics into MOS?
1.3 Related work
The behavior and performance of traffic in networks has long been studied, and
advances are made constantly.
The Cooperative Association for Internet Data Analysis (CAIDA) runs a
project that studies the topology of the Internet as a whole, trying to identify
the relationships between entities in the network [10, 11]. Observing and inferring these relationships is a challenging and often imprecise endeavour, but it has served as an inspiration for the ideas in this thesis. The topology dataset [2] is
primarily concerned with classifying ASs and the connections between them.
Several other papers have looked at Internet topology and AS relationships
as well [13, 14] and the effects of the paths traffic takes as a result [14, 23, 24].
These studies make it quite clear that the Internet is organized in a way that is suboptimal for today's traffic types, but no solutions are found that would be easy to implement at the scale of the Internet. As the premise of the model proposed
in this thesis assumes knowledge of Internet topology, these studies are highly
interesting to investigate. They are discussed further in Sections 2.3 and 2.4.
Methods for real-time measurements of some QoS metrics have been studied
as well. Using DNS servers, reasonably accurate estimates of RTT and packet
loss have been obtained [15, 25]. These measurements enable further research, but the approach is not appropriate for performing real-time analysis at call set-up in VoIP applications, as it is much too time consuming.
The Online Network Traffic Characterization (ONTIC) project is an ongoing
project which seeks to explore new ways to characterize network traffic online
using large scale data analysis. In one of the project deliverables [4], they
provide a collection of the current state of the art in offline algorithms used to
this end. The document, based on 101 sources, gathers not only the algorithms,
but also the frameworks primarily used to implement them. As it stands today,
Apache Hadoop and Apache Spark are the most important frameworks used,
taking advantage of distributed computing. Apache Hadoop in particular has
been widely used for many years, and is based on the MapReduce paradigm, in
which mappers in one step perform transformations of independent data, and
reducers then aggregate the results.
The ONTIC project is very interesting for the field of study; however, the algorithms described in the deliverable are for classification only. As the
problem statement for this thesis involves the comparison of routes between
two points, classification does not seem suitable, as there could be potential
improvements even within the elements mapped to the same class.
Component-based QoS effects are studied in [26], where real-time application systems are considered. QoS effects are there attributed to components on a much lower level than in this thesis, but the original naïve model described in this thesis has been developed based on that work. This type of
model is found to be useful in identifying long-term QoS issues, shifting focus
away from short-term problems with components.
1.4 Limitations
• The parameters considered in the model are RTT, packet delay jitter, and
packet loss. Bandwidth as a cost variable is omitted from the model due
to the complexity in both measuring and modeling it. Unlike the other
parameters, it is not additive and would require a different approach. (See
Section 2.6)
• Given the time frame and amounts of data processed, it is not possible to
request any new kind of data collection to be performed. The evaluation
must therefore rely on data already available. For call data, this means
that MOS entries are available for only a small fraction of the potential
data. Additionally, all route data has been gathered externally (See Section 2.3), even though this could potentially be gathered via calls in the
future. As a result of these two data limitations, the actual number of AS
pairs that would be involved in a large-scale implementation of this model
will not be reached.
• The actual decision-making process is not in the scope of this thesis, as it is
largely a business decision. That is, no attempt to compare the potential
paths is made, as this requires information about what paths could be
used for relaying the calls, as well as when it would be profitable to relay.
The intent is solely to provide a basis to make such a decision.
Chapter 2
Background
This chapter will first introduce some concepts related to Internet routing to
understand how the Internet can be modeled appropriately (Section 2.1-2.4).
Then the QoS metrics that will be considered for the model and the methods
with which to analyze and estimate them will be presented (Section 2.5-2.7).
The terms route and path are used interchangeably throughout the thesis.
2.1 Autonomous systems
The Internet is a large network of routers connecting hosts to each other. In
order to manage the scale of the Internet and allow traffic to be routed correctly,
the Internet has been divided up into autonomous systems, all with independently managed internal routing. These ASs connect to each other through
Internet Exchange Points or direct links. Formally, an autonomous system is
defined by RFC 1930 to be “a connected group of one or more Internet Protocol (IP) prefixes run by one or more network operators which has a single and clearly defined routing policy” [17], where prefixes are contiguous blocks of IP
addresses.
At least 50,507 ASs were in use as of May 6th, 2015, a number that has seen
a consistent growth of roughly 3,000 per year over the last decade [1].
2.2 Routing
When routing traffic within an AS, any Interior Gateway Protocol (IGP) may
be used to determine the most suitable path. Some ASs may opt to consider
router hop count, while others use more complex metrics when selecting paths
[23].
When routing traffic between ASs, the protocol used is the path vector
protocol BGP. Paths are found by advertising routes to neighboring ASs, and
selecting the best route among the ones known. While ties between best routes
are broken by AS path length, the protocol first considers a local preference
value for path selection. This value depends on a local policy, the specifics
of which are decided upon at the discretion of the AS administrator. The
policy can be affected by any number of things, not all of which are related to
path performance optimality [24]. For example, network operators may have
agreements with each other that have to be honored, or they might have to
handle large amounts of traffic through load balancing [23].
Studies have shown that packets very frequently are not routed using the
shortest path, neither on the router level [24] nor on the AS level [14]. Understanding the routing decisions made is clearly quite a complex problem,
especially as routing policies are not obvious to an observer.
2.3 Route Views
University of Oregon’s Route Views [3] is a project which collects information
about the global routing system, and publishes it for public use. It has a number
of participating routers around the world that contribute with their BGP routing
tables every two hours, showing snapshots of their point of view of the Internet.
Each entry in these tables contains the following information:
• Network - The IP network reachable through this route
• Next Hop - The IP address to which traffic will be forwarded on this route
• Metric - A value used to discriminate between different points of access
to a neighboring AS
• LocPrf - The local preference value of the route, determined by BGP
policy
• Weight - A local, Cisco specific preference value, not shared through BGP
• Path - The AS path taken to reach the network through this path
Of particular interest for this thesis is the Path field, which provides information
on the connections between ASs. The Route Views data is not a complete set
of paths, as it only contains a few viewpoints, but with this information it is
possible to construct a graph representation of the Internet, which can be used
to infer additional possible paths for traffic to flow. However, as the next section
will elaborate on, due to routing policies such a graph is not sufficient to find
the paths that traffic actually takes.
2.4 AS relationship inference
ASs are typically owned by a company or an Internet Service Provider (ISP),
who pays a higher-tier ISP to transit its traffic to the rest of the Internet. Some
ASs might be connected to multiple higher level ISPs, yet they do not advertise
to these ISPs that these additional connections are possible routes [23]. That is,
the ISPs will only see the paths through the AS to lower-tier ISPs as available.
It is clear that the existence of a connection does not mean that traffic will be
routed through it for various reasons. To better understand how ASs interact,
a classification scheme for AS relationships was proposed by Gao [13].
The proposed scheme divides direct connections between ASs into the following four categories: provider-to-customer (P2C), customer-to-provider (C2P),
peer-to-peer (P2P), and sibling-to-sibling (S2S). X is considered a provider for
Y, considered the customer, if X transits traffic for Y but not the other way
around. If X and Y transit traffic for each other, they are considered siblings,
and if neither transits traffic for the other, they are considered peers.
CAIDA analyzes Route Views BGP tables in order to gain a better understanding of how ASs are related to one another. They base this on the classifications of Gao, but apply more recent algorithms to infer the relationships [10, 11]. The results of their analysis are continually published in the form
of relationship data, where they list all links between ASs, and whether they
are provider-to-customer or peer-to-peer. Customer-to-provider relationships
are implicit in the data sets. Sibling-to-sibling relationships are not reported
in the data sets, although they state that they infer such relationships in their
analysis as well. Since they are not present in the data sets however, this type
of relationship will be omitted from further discussion.
Figure 2.1: An AS graph with relationships (edges are labeled customer-to-provider or peer-to-peer).
Within this framework, some conclusions can be drawn about valid paths in
an AS graph. Valid paths must be valley-free, which is defined as: after passing
a provider-to-customer or peer-to-peer edge, a path may not pass a peer-to-peer
or customer-to-provider edge anymore. Violations of this principle would mean
that an ISP would be transiting traffic for other ISPs without getting paid by
anyone, which would not be in their best interests.
To illustrate, in Figure 2.1, an example of AS relationships is shown with ASs
organized in vertical levels by tier. Consider the case where AS F wishes to send
traffic to AS H. There are two possible valid paths: F → C → A → D → H and
F → C → D → H. These paths first travel up toward providers, then across 0
or 1 peer-to-peer edge, and finally down toward customers.
There are also four invalid paths between the two ASs, ignoring looping
paths. For example, in the path F → C → A → D → B → E → H, the
section A → D → B first travels across a provider-to-customer edge, and then
across a customer-to-provider, creating a valley. The other paths contain similar
violations, causing D to not act as a provider for anyone, yet still transit traffic.
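To make the valley-free rule concrete, the sketch below checks it for a path given as a list of edge types seen in the direction of travel (the three-valued encoding and the example paths are assumptions for illustration; this is not code from the thesis).

```java
import java.util.List;

// Minimal sketch of the valley-free rule from Section 2.4.
// Edge types are given in the direction of travel along the path.
enum EdgeType { C2P, P2P, P2C }

public class ValleyFreeCheck {
    // A path is valley-free if, once a P2P or P2C edge has been crossed,
    // no further C2P or P2P edge appears.
    static boolean isValleyFree(List<EdgeType> edges) {
        boolean descending = false; // set after the first P2P or P2C edge
        for (EdgeType e : edges) {
            if (descending && (e == EdgeType.C2P || e == EdgeType.P2P)) {
                return false; // going up or across again after going down/across: a valley
            }
            if (e == EdgeType.P2P || e == EdgeType.P2C) {
                descending = true;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Up, up, across one peer link, then down: a valid valley-free path.
        System.out.println(isValleyFree(List.of(
                EdgeType.C2P, EdgeType.C2P, EdgeType.P2P, EdgeType.P2C))); // true
        // Going down and then up again creates a valley and is rejected.
        System.out.println(isValleyFree(List.of(
                EdgeType.C2P, EdgeType.P2C, EdgeType.C2P, EdgeType.P2C))); // false
    }
}
```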
2.5 Quality of Service metrics
The following network metrics that are known to impact QoS [19] will be considered in this thesis:
• Round-trip time (RTT) - The time it takes for a packet to travel from the
sender to the receiver, and a response to return.
• Packet delay jitter - The variance of the time it takes for a packet to
travel one way between sender and receiver. Also known as Packet Delay
Variation, but will be referred to as simply jitter in this thesis.
• Packet loss - The percentage of packets that are lost in transit between
sender and receiver.
These metrics have been collected alongside a subjective evaluation of call quality made by the call participant after the call has concluded, which is known as
Mean Opinion Score (MOS). The participant is asked to rate the call on a scale
from 1 to 5, with 1 being “bad”, 2 being “poor”, 3 being “fair”, 4 being “good”, and 5 being “excellent”. MOS is the arithmetic mean of these subjective ratings.
When the setting for calls is controlled, the expected MOS can be calculated
using a formula defined in ITU-T G.107 [18]. This formula depends on additional metrics however, some of which, like codec-specific parameters, cannot
be measured or calculated easily [5] and are not part of the data available.
2.6 Additivity of QoS parameters
An additive function is a function f that satisfies the following equality:
f(x + y) = f(x) + f(y)
for any x, y in the domain of the function.
In order to apply linear regression to the parameters under observation, we
must have them in an additive form (See Section 2.7.2).
RTT can be assumed additive as is: twice the geographical distance should
result in twice the round trip time.
Jitter is a measurement of the variance of the one-way delay, and the performance of individual components in our network is assumed to be independent of the others. As a result, it too can be considered additive, as the following is true for all components X, Y:

Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X, Y),  with Cov(X, Y) = 0
The total packet loss over a path is a product of individual packet loss factors, with a fraction of packets being dropped in individual components in a path. It can be calculated as:

L_P = ∏_{i∈P} (1 − L_i)

where L_P is the total packet loss over a path P, and L_i is the packet loss over component i. We can transform this into the following additive form:

log(L_P) = ∑_{i∈P} log(1 − L_i)
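As an illustration of how these additive forms let per-component costs be combined along a path, here is a minimal sketch (not from the thesis; the component values are made up): RTT and jitter variance are summed directly, while per-component losses are combined multiplicatively, which corresponds to summing log(1 − L_i) terms.

```java
// Minimal sketch: combining per-component QoS costs along a path (Section 2.6).
public class PathCostAggregation {
    // Per-component costs: RTT in ms, jitter variance in ms^2, loss as a fraction in [0, 1).
    record ComponentCost(double rttMs, double jitterVar, double loss) {}

    static double totalRtt(ComponentCost[] path) {
        double sum = 0;
        for (ComponentCost c : path) sum += c.rttMs();      // RTT is additive
        return sum;
    }

    static double totalJitterVar(ComponentCost[] path) {
        double sum = 0;
        for (ComponentCost c : path) sum += c.jitterVar();  // variances add when components are independent
        return sum;
    }

    static double totalLoss(ComponentCost[] path) {
        // Multiplying survival probabilities is equivalent to summing log(1 - L_i).
        double delivered = 1.0;
        for (ComponentCost c : path) delivered *= (1.0 - c.loss());
        return 1.0 - delivered;                             // fraction lost end to end
    }

    public static void main(String[] args) {
        ComponentCost[] path = {
            new ComponentCost(40, 4, 0.01),
            new ComponentCost(25, 1, 0.02),
            new ComponentCost(60, 9, 0.00)
        };
        System.out.printf("RTT=%.0f ms, jitter var=%.0f ms^2, loss=%.4f%n",
                totalRtt(path), totalJitterVar(path), totalLoss(path));
    }
}
```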
As a note, the bandwidth of a path is determined by the lowest bandwidth
of the components on the path:
B_P = min_{i∈P} (B_i)

where B_P is the bandwidth available over a path P, and B_i is the bandwidth
available over component i. There is no way to reasonably transform this in
order to make it additive, which is why it is not considered in the proposed
model.
2.7 Machine learning algorithms
Machine learning as a field is concerned with learning from data and being able
to make predictions based on this, utilizing algorithms that can adapt to the data they are provided. The algorithms are often iterative, requiring more computation to reach a good result, but may require fewer assumptions to be made about the data. Algorithms that have been considered suitable for estimating MOS
are described in this section, as well as the LLSE algorithm used to compare
the two network models.
2.7.1 Clustering
Clustering algorithms are algorithms that aim to group items into sets in such
a way that the items in each set are more closely related to each other than to
those of other sets. This definition is quite broad, and as such, there are many
clustering algorithms to accommodate the many different kinds of clustering
that may be desired [12]. Clustering algorithms can be beneficial when it is not
obvious how the data points are related, and may as such be interesting for MOS
calculations, where several parameters together map into [1, 5]. Two clustering
algorithms that appeared to be good candidates for the kind of clustering desired
are described here.
k-means clustering
k-means clustering is a centroid clustering algorithm [16]. Its goal is to partition
all data points into sets, minimizing the squared Euclidean distance from each
member of the set to the mean of the set, the centroid. Initially, k centroids
are randomly generated. The algorithm then alternates between two steps until
convergence:
1. Assign each of the data points to the cluster whose centroid is closest in Euclidean distance.
2. Recalculate the position of the centroid for each cluster.
k-means clustering is considered a hard clustering algorithm, meaning each
data point is assigned to a single cluster.
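As an illustration of the two alternating steps, here is a compact k-means sketch (Lloyd's algorithm); the toy data, the choice of k, and the convergence test are assumptions made for the example, not the implementation used in the thesis.

```java
import java.util.Arrays;
import java.util.Random;

// Minimal k-means sketch: assign points to the nearest centroid,
// then recompute centroids, until assignments stop changing.
public class KMeansSketch {
    static int[] cluster(double[][] points, int k, long seed) {
        Random rnd = new Random(seed);
        int dim = points[0].length;
        // Initialize centroids as k randomly chosen data points.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[rnd.nextInt(points.length)].clone();

        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        boolean changed = true;
        while (changed) {
            changed = false;
            // Step 1: assign each point to the nearest centroid (squared Euclidean distance).
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int j = 0; j < dim; j++) {
                        double diff = points[i][j] - centroids[c][j];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            // Step 2: recompute each centroid as the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < dim; j++) sums[assignment[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int j = 0; j < dim; j++) centroids[c][j] = sums[c][j] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Toy data: (RTT, jitter, loss) triples, purely illustrative.
        double[][] points = { {50, 2, 0.00}, {55, 3, 0.01}, {200, 30, 0.05}, {220, 25, 0.07} };
        System.out.println(Arrays.toString(cluster(points, 2, 42)));
    }
}
```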
Gaussian mixture model clustering
Gaussian mixture model (GMM) clustering is a distribution-based clustering
model which uses the expectation-maximization algorithm to train the model
[16]. This algorithm will produce probabilities that a data point belongs to
a certain distribution, having iteratively maximized the log-likelihood of these
distribution predictions. Initially, k distributions are randomly generated. The
algorithm then alternates between two steps until convergence:
1. Compute the probabilities of belonging to each cluster for every data point
using current model parameters.
2. Based on the probability that a data point belongs to a cluster, recompute
the mean and variances of the distributions.
GMM clustering is considered a soft clustering algorithm, meaning each data
point may be assigned to several clusters. It can however be used as a hard
clustering algorithm by assigning each data point to the cluster it has the highest
probability of belonging to.
2.7.2 Supervised learning
Supervised learning algorithms are algorithms that seek to infer a function from
a set of training data. The training data takes the form of input values X, and
corresponding output values Y. The function produced should then be able to
map unseen input values to the correct output value.
There are many supervised learning algorithms with different strengths and
weaknesses. They can be divided into two different categories: regression, which
deals with functions mapping to values, and classification, which deals with functions mapping to different categories. Since a comparison between the values
of QoS parameters is sought, we want to map to values, so we need regression
algorithms. Here we will look at two ensemble learning algorithms, as these
are well-established methods for increasing the accuracy of machine learning algorithms [9]. The Linear Least Squares Estimation algorithm is also described,
but it is not considered appropriate for MOS estimation, as MOS is not a linear
function of other QoS parameters [18].
Ensemble learning
In ensemble learning, an ensemble can be considered a ”committee” of learning
algorithms. The individual algorithms in the ensemble weight their results and
produce a common result that in many cases is better than that of the individual
learning algorithms [16].
Bagging Bagging, or bootstrap aggregating, is an ensemble method that averages a number of bootstrapped models to reduce the variance in its predictions
[16]. Bootstrapping refers to sampling from the data set with replacement, resulting in training sets for the model that have a number of duplicate elements.
After training the individual models, their outputs are averaged to obtain the bagged model's final output.
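A minimal sketch of the bagging idea follows (illustrative only; the Regressor interface, the placeholder mean learner, and the toy data are assumptions, not the thesis code): bootstrap samples are drawn with replacement, one model is trained per sample, and predictions are averaged.

```java
import java.util.Random;
import java.util.function.BiFunction;

// Minimal bagging sketch: train B models on bootstrap samples and average their predictions.
public class BaggingSketch {
    interface Regressor { double predict(double[] x); }

    static Regressor bag(double[][] X, double[] y, int numModels,
                         BiFunction<double[][], double[], Regressor> trainer, long seed) {
        Random rnd = new Random(seed);
        Regressor[] models = new Regressor[numModels];
        int n = X.length;
        for (int b = 0; b < numModels; b++) {
            // Bootstrap sample: draw n rows with replacement (duplicates are expected).
            double[][] Xb = new double[n][];
            double[] yb = new double[n];
            for (int i = 0; i < n; i++) {
                int idx = rnd.nextInt(n);
                Xb[i] = X[idx];
                yb[i] = y[idx];
            }
            models[b] = trainer.apply(Xb, yb);
        }
        // The bagged model averages the individual models' outputs.
        return x -> {
            double sum = 0;
            for (Regressor m : models) sum += m.predict(x);
            return sum / models.length;
        };
    }

    public static void main(String[] args) {
        double[][] X = { {50, 2, 0.00}, {200, 30, 0.05}, {120, 10, 0.02} };
        double[] y = { 4.5, 2.0, 3.5 };
        // Placeholder base learner: predicts the mean of its (bootstrapped) training targets.
        BiFunction<double[][], double[], Regressor> meanLearner = (Xb, yb) -> {
            double sum = 0;
            for (double v : yb) sum += v;
            double mean = sum / yb.length;
            return x -> mean;
        };
        Regressor bagged = bag(X, y, 10, meanLearner, 7);
        System.out.println(bagged.predict(new double[]{100, 5, 0.01}));
    }
}
```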
Boosting Boosting is an ensemble method that trains itself iteratively on
the provided data set by focusing on data points that were poorly predicted
[16]. Initially, all data points are weighted equally, and the underlying model is
trained on the set. In each subsequent step, every data point is weighted by the squared error between the aggregated prediction of the models trained so far and the actual
value. This causes the next iteration to focus more on data points that were
poorly predicted previously.
Decision trees Decision trees are often used as the underlying model for
ensemble learning, as they are fast to build and to query [16]. The trees are built by analyzing
all possible binary splits of the training data, and then selecting the one that
gives the lowest mean square error. This is then done recursively for each node
in the tree until a stopping criterion is reached, such as maximum tree depth, or
the number of observations in the node being too small to continue. The model
can then be queried for predictions easily by following the correct splits down
the tree.
Linear least squares estimation (LLSE)
LLSE is a way to fit a linear model to data that isn’t fully explained by the
model in question by minimizing the square of the residual errors.
Consider the case where we have a relationship of the form
x_1·β_1 + x_2·β_2 + ... + x_N·β_N = y

where x_i and y can be measured, and we wish to determine β_i for i ∈ [1, N]. To do so, we require C equations of this type, where C > N. This can be written using matrix form as

Xβ = y

We are then interested in finding the best set of coefficients β̂ that minimizes

∑_{i=1}^{C} | y_i − ∑_{j=1}^{N} X_ij·β̂_j |^2

The solution to this minimization problem, given that the columns of X are linearly independent, is given by solving the equation

(X^T X)·β̂ = X^T·y

The time complexity of solving this when utilizing parallel computing is given in [7] as

O( N^3/P + C·N^2/P + N^2·log(P) )

with P cores. As C > N, N^3/P is dominated by C·N^2/P. In any practical application, C/P > log(P) as well, so N^2·log(P) is dominated by C·N^2/P. Thus, the time complexity of the algorithm can be considered

O( C·N^2/P )
This is of interest as we seek to compare the models on their computation
times in particular.
Chapter 3
System design
There are many ways the problem described in Section 1.1 could be approached,
with their own advantages and disadvantages. This chapter will cover the design
choices made in the implementation of the approach described in Section 1.2.
A brief summary of the design chosen is included at the end of the chapter.
3.1 System overview
The two main components of the system are the network model, which is responsible for estimating the network QoS metrics of the expected path for the
traffic, and the MOS model, which is responsible for translating these metrics
into a single measurement of QoS.
Figure 3.1: Full system query.
Figure 3.1 illustrates the expected use case of the system. It will be queried
with a list of at least 2 ASs: the start AS, the end AS, and an optional list of
relay ASs. This set is evaluated by the network model, which calculates network
metrics that are passed on to the MOS model for translation. The estimated
MOS value is then returned as the response to our initial query.
If the optimal relay points for a call are known, then this system can provide
enough information for a relay decision to be made with two queries: one with
only the start and end ASs provided, and one with the relay ASs included.
If there are certain known thresholds on QoS metrics, it is plausible that
a decision could be made without the presence of the MOS model; no such
assumption is made here however. Should such information be available though,
the two components are easily decoupled by design.
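As a sketch of that decoupling (interface and type names are assumptions for illustration, not the thesis code), the full query of Figure 3.1 can be expressed as two narrow interfaces composed in sequence:

```java
import java.util.List;

// Minimal sketch of the two-component query path of Figure 3.1.
public class SystemQuerySketch {
    // Network metrics produced by the network model (in natural units).
    record NetworkMetrics(double rttMs, double jitterVar, double loss) {}

    interface NetworkModel {
        // start, end, and optional relays, all given as AS numbers
        NetworkMetrics estimate(int startAs, int endAs, List<Integer> relayAses);
    }

    interface MosModel {
        double estimateMos(NetworkMetrics metrics);
    }

    // The full system query: network model first, MOS model second.
    static double query(NetworkModel network, MosModel mos,
                        int startAs, int endAs, List<Integer> relayAses) {
        return mos.estimateMos(network.estimate(startAs, endAs, relayAses));
    }

    public static void main(String[] args) {
        // Stub implementations, just to show the composition; values are made up.
        NetworkModel network = (s, e, relays) -> new NetworkMetrics(150, 20, 0.02);
        MosModel mos = m -> 4.0 - m.loss() * 10 - m.rttMs() / 1000.0; // placeholder mapping
        System.out.println(query(network, mos, 13456, 7442, List.of()));
    }
}
```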
3.2 Network model
The network model is the component of the system that is of particular interest
in this thesis. How can we attempt to accurately model the connections in
such a way that useful information is obtained? What are the pros and cons of
different approaches?
A straightforward way of designing the network model would be to directly
assign the network metrics to the input set of ASs. This might be done through
simply collecting data on previous calls for that connection and performing some
analysis on it. Figure 3.2 illustrates how this would be designed.
Figure 3.2: A simple network model design alternative.
While such a direct approach would likely be quite successful given enough
data, there are some concerns with it. For example, if little or no data exists
for a particular path, estimating network metrics becomes hard or impossible.
This model does not require any information about the network infrastructure,
but consequently does not give any new information about it either.
Another approach is to utilize information of the network infrastructure and
divide paths into network components, assigning network metric costs to each
component. Figure 3.3 shows what a query would look like in that system.
Figure 3.3: A path mapping network model design alternative.
There is additional complexity in this design, as it requires the addition of
the path mapping component, but in return it has the potential to return some
information about the behavior of individual components of the infrastructure.
It could also allow for the prediction of call metrics of previously unseen AS
pairs. To illustrate, consider the connections in Figure 3.4. Assume that we
have previously seen a number of calls from AS A to AS C, and from AS B to
AS D. As each node and edge in this model is considered a component with
associated costs, we could now infer costs for a call between B and C by adding
the component costs together.
Figure 3.4: Four ASs interconnected through a common AS.
While both of these approaches would be interesting to explore, the latter
has been chosen for its greater perceived potential. The following subsections
will go into further detail about the path mapping and cost estimation system
components and their design.
3.2.1 Path mapping
With the chosen approach, mapping paths to the correct set of components is
crucial. If this is not done correctly, large components might be erroneously
assessed and applied to other paths, leading to unreliable model output. Some
different options have been explored to approach the mapping problem, described below.
Least hops model
The first approach to modeling the Internet infrastructure is based on the
CAIDA dataset of AS relationships [2]. It uses the known relationships between ASs to infer paths between any given pair. While it is known that the
shortest path using direct connections between ASs is often not the path used
in practice [14], with this model we seek to investigate whether the relationships
presented in Section 2.4 can be applied to produce valid path results.
The first thing to do is construct a graph representation of the network. The
CAIDA dataset provides two files, both of which we use to construct this graph.
The first contains a list of all ASs reachable downstream from every AS. More
importantly, it lists every AS in the dataset, so we can construct the set of all
nodes from this file. The other file contains a list of every relationship that has
been inferred, and its type: peer-to-peer or provider-to-customer.
Figure 3.5: An AS graph with relationships and AS numbers.
Index  AS number  Adjacency list
0      1          6,C2P
1      4          6,C2P
2      8          3,C2P 4,C2P
3      15         2,P2C 4,P2P 7,C2P
4      32         2,P2C 3,P2P 5,C2P 6,P2P 7,C2P
5      43         4,P2C 6,P2C
6      55         0,P2C 1,P2C 4,P2P 5,C2P
7      79         3,P2C 4,P2C

Figure 3.6: An AS number list and an adjacency list corresponding to the graph in Figure 3.5. Indices are on the left.
The number of ASs and relationships is small, so an adjacency list can be
used to represent the graph in-memory. To illustrate, consider the graph in
Figure 3.5. It is the same graph as in Figure 2.1, but with ASs numbered. The
way this graph would be represented is shown in Figure 3.6. We keep a sorted
list of the AS numbers present in the graph, and have the indices of that list
correspond to the indices of the adjacency list. Each entry in the adjacency list
is then a list of the relationships that AS has, in the form of the AS index and
the relationship type.
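A small sketch of this representation follows (illustrative only; the simplified input of pre-parsed (AS, AS, relationship) triples is an assumption, not the CAIDA file format): AS numbers are kept in a sorted array whose positions double as indices into the adjacency list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Minimal sketch of the adjacency-list representation in Figure 3.6.
// Input is assumed to be pre-parsed (asA, asB, type) triples, where type is
// "P2C" (asA is provider of asB) or "P2P"; dataset parsing is omitted.
public class AsGraphSketch {
    record Relationship(int asA, int asB, String type) {}
    record Neighbor(int index, String type) {}

    final int[] asNumbers;                 // sorted AS numbers; position = node index
    final List<List<Neighbor>> adjacency;  // one neighbor list per node index

    AsGraphSketch(List<Relationship> rels) {
        TreeSet<Integer> ases = new TreeSet<>();
        for (Relationship r : rels) { ases.add(r.asA()); ases.add(r.asB()); }
        asNumbers = ases.stream().mapToInt(Integer::intValue).toArray();

        adjacency = new ArrayList<>();
        for (int i = 0; i < asNumbers.length; i++) adjacency.add(new ArrayList<>());
        for (Relationship r : rels) {
            int a = indexOf(r.asA());
            int b = indexOf(r.asB());
            if (r.type().equals("P2P")) {
                adjacency.get(a).add(new Neighbor(b, "P2P"));
                adjacency.get(b).add(new Neighbor(a, "P2P"));
            } else { // P2C: a is the provider of b, so from b's side the edge is C2P
                adjacency.get(a).add(new Neighbor(b, "P2C"));
                adjacency.get(b).add(new Neighbor(a, "C2P"));
            }
        }
    }

    int indexOf(int asNumber) {
        return Arrays.binarySearch(asNumbers, asNumber); // valid because asNumbers is sorted
    }

    public static void main(String[] args) {
        // A few edges from Figure 3.5: 43 provides transit for 32, 79 for 15, and 15 peers with 32.
        AsGraphSketch g = new AsGraphSketch(List.of(
                new Relationship(43, 32, "P2C"),
                new Relationship(79, 15, "P2C"),
                new Relationship(15, 32, "P2P")));
        System.out.println(g.adjacency.get(g.indexOf(32)));
    }
}
```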
Given this information, we can do a breadth first search on the graph to
find viable paths, subject to the constraints described in Section 2.4. A Java
implementation of this can be found in Appendix A.1. This will return two
different sets of distances and parents for each of the nodes in the graph: one
where future path options are constrained by the path taken thus far, and one
where they are not. This is of interest as the shortest path to a particular node
might not be part of the shortest path to its neighbor. Therefore, in order to
properly reconstruct paths, we must keep track of the best paths for both these
cases.
Having computed this from a source node, we can then recursively reconstruct the set of paths between the source and any other node in the graph as
shown in Algorithm 1. On the initial call, the parameter d is set to the shorter of the two distances that have been found for the node. On the recursive calls, however, it is simply decremented by 1 for each call, preventing
an invalid path from being reconstructed by picking the locally optimal distance
for the current node.
Algorithm 1 Path Reconstruction
Require: An end node e. The distance d to e on this path. The distance Df and the direct parents Af for each node on the path from a start node without traversing a P2P or P2C link. The distance Dc and the direct parents Ac for each node on the path from a start node allowing for traversing P2P or P2C links. A function rel(u, v) returning the edge connecting nodes u and v.
Ensure: The set of shortest paths to e, P.

procedure ReconstructPaths(e, d, Df, Af, Dc, Ac)
    P ← {∅}
    Let D, A be Df, Af or Dc, Ac so that D = d
    for v ∈ A do
        P′ ← ReconstructPaths(v, d − 1, Df, Af, Dc, Ac)
        for q ∈ P′ do
            q ← q ∪ {e, rel(v, e)}
            P ← P ∪ {q}
    if P = {∅} then
        P ← {{e}}
    return P
The main advantage of this model is that the topology information is readily
available and requires little memory to hold the graph. However, computation
times can be large unless the input data is sorted appropriately in order to minimize the number of times that the constrained breadth first search algorithm
must run. For an initial run it is possible to sort large amounts of call data on
source AS, but for random queries no such sorting can be relied on.
Unfortunately, the output of this model will in the general case find multiple
paths of the same length despite the added constraints imposed. With the
topology data used, there is no way to identify all policies applied, so paths
that other traffic takes can appear as a valid path for the pair we are interested
in. It may also be the case that none of the paths returned is
the correct one, as policies could further inflate path length [14]. With such
unpredictable results, this model is clearly unsuitable.
Known path model
An alternative approach is to collect data on actual paths taken by traffic and
use this for the mapping. This is equivalent to keeping routing tables for all ASs, which
can then be used to construct the paths. As stated in Section 1.4, complete
traceroute data that could give a near complete view of the relevant Internet
routes is not available to us. Thus, for this thesis, the Route Views data set
described in Section 2.3 is used. This limits the amount of paths that are known
to the ones visible through the Route Views routers, but should provide enough
known paths to evaluate the model.
To construct the routing tables, the AS Path field of the Route Views BGP
routing table entries is used. As it gives the complete path from the start AS
to the end AS, the path can simply be stepped through to create the next hop
entry for each AS in the path. Since BGP propagates paths, it is certain that
a path that is a tail of this AS path will be a correct path as well. Had there
been a different preferred path for such a tail, then that would also have been
reflected in the full path. A map data structure is used to store the (start, end)
AS pair and its next hop AS. Reconstructing the path is then a trivial task, as
seen in Algorithm 2.
Algorithm 2 Known Path Construction
Require: A start node s. An end node e. A function NextHop(s, e) that returns the next hop from s on the path to e. A function rel(u, v) returning the edge connecting nodes u and v.
Ensure: The components on the path P from s to e.

procedure ConstructPath(s, e)
    P ← {∅}
    while s ≠ e do
        P ← P ∪ {s, rel(s, e)}
        s ← NextHop(s, e)
    P ← P ∪ {e}
    return P
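A small Java sketch of this idea is shown below (an illustration under assumed types and made-up AS paths, not the thesis implementation): next-hop entries are recorded by stepping through observed AS paths, and a path is then reconstructed by repeatedly looking up the next hop, mirroring Algorithm 2.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the known path model: a next-hop table keyed by
// (current AS, destination AS), filled from observed AS paths.
public class KnownPathSketch {
    private final Map<Long, Integer> nextHop = new HashMap<>();

    private static long key(int current, int destination) {
        return (((long) current) << 32) | (destination & 0xffffffffL);
    }

    // Step through an observed AS path and record the next hop toward the final AS
    // for every AS on the path (tails of a BGP path are themselves valid paths).
    void addObservedPath(int[] asPath) {
        int destination = asPath[asPath.length - 1];
        for (int i = 0; i + 1 < asPath.length; i++) {
            nextHop.put(key(asPath[i], destination), asPath[i + 1]);
        }
    }

    // Algorithm 2, returning the sequence of ASs from start to end,
    // or null if some hop is unknown.
    List<Integer> constructPath(int start, int end) {
        List<Integer> path = new ArrayList<>();
        int current = start;
        while (current != end) {
            path.add(current);
            Integer hop = nextHop.get(key(current, end));
            if (hop == null) return null; // no routing information for this pair
            current = hop;
        }
        path.add(end);
        return path;
    }

    public static void main(String[] args) {
        KnownPathSketch sketch = new KnownPathSketch();
        sketch.addObservedPath(new int[]{13456, 701, 1239, 7442}); // hypothetical AS path
        System.out.println(sketch.constructPath(701, 7442)); // a tail of the observed path
    }
}
```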
The most important advantage of this model is that the paths are accurate.
While paths may change over time, and changes should be monitored, the model
provides one path as a response. The case of load balancing has not been
considered here, though allowing for the construction of multiple paths would
require only small modifications to the model. More interesting is how the
multiple paths would be handled by the cost inference, as well as how a decision
of path selection would be made if the paths differ in quality.
This model is also faster than the least hops model, as it does not need
to search a graph every time it is queried, but at the cost of having all path
information stored. This may require quite a bit of memory as the available
routing information grows.
The big disadvantage of course is the data collection, requiring a reliable
way of obtaining path information. Given that obtaining such data is realistic though, this model is the one that has been chosen as the path mapping
component.
3.2.2 Path component cost estimation
Going back to Figure 3.3, we now turn to the second part of the network model.
Given the path components from the mapping, network metrics are to be estimated.
Training the cost estimation component can be visualized as in Figure 3.7.
The first step is again to collect call data to be used for inferring component
costs. Before we can continue with that however, the AS pairs must be provided
to the mapping system component in order for them to be translated into path
components, which are returned to the cost inference step. Now the call
data can be analyzed and finally the inferred component costs can be provided
for the cost estimation.
Figure 3.7: Training the cost estimation component.
Cost inference
LLSE has been chosen as the method to analyze the call data. This is due
to the naïve model having used this method as well, and we wish to be able to
make a fair comparison between the models. We have assumed that our network
metrics are additive, and that components are independent of one another, so
this is a valid choice. We also know that the algorithm is scalable when our
data set grows [7].
The problem to solve is, as stated in Section 2.7.2,
Xβ = Y
where X is a matrix with rows corresponding to data points, and columns to
path components. The matrix simply consists of 1’s and 0’s, signifying whether a
component is part of a row’s path or not. Y is a matrix with rows corresponding
to those of X, and columns to the network metrics, transformed according to
Section 2.6.
Cost estimation
Having calculated β, cost estimation is trivial. With a query x, equivalent to a
row in X previously, the estimate ŷ is obtained by
xβ = ŷ
Of course, having transformed the metrics to their additive representations,
they must be transformed back before they are returned from this step.
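The sketch below illustrates how a query is turned into a 0/1 indicator row over path components and multiplied with the fitted coefficients, followed by transforming the loss metric back from its additive log form (the component indexing, the precomputed β values, and the interpretation of the back-transformed loss as the end-to-end fraction lost are assumptions for illustration; the actual LLSE fit is not shown).

```java
import java.util.List;
import java.util.Map;

// Minimal sketch of the cost estimation step: indicator row x times fitted coefficients beta.
public class CostEstimationSketch {
    // One coefficient column per metric, indexed by path component.
    // Loss coefficients are assumed to be in the transformed log(1 - L) form.
    record Beta(double[] rtt, double[] jitter, double[] logSurvival) {}
    record Estimate(double rttMs, double jitterVar, double loss) {}

    static Estimate estimate(List<String> pathComponents, Map<String, Integer> componentIndex, Beta beta) {
        // The 0/1 indicator row is applied implicitly: only components on the path contribute.
        double rtt = 0, jitter = 0, logSurvival = 0;
        for (String component : pathComponents) {
            int j = componentIndex.get(component);
            rtt += beta.rtt()[j];
            jitter += beta.jitter()[j];
            logSurvival += beta.logSurvival()[j];
        }
        // Transform the loss metric back from its additive (log) representation.
        double loss = 1.0 - Math.exp(logSurvival);
        return new Estimate(rtt, jitter, loss);
    }

    public static void main(String[] args) {
        // Hypothetical components: two AS nodes and the edge between them.
        Map<String, Integer> index = Map.of("AS13456", 0, "AS13456-AS7442", 1, "AS7442", 2);
        Beta beta = new Beta(
                new double[]{30, 90, 40},             // RTT contributions (ms)
                new double[]{2, 10, 3},               // jitter variance contributions
                new double[]{-0.005, -0.02, -0.001}); // log(1 - L) contributions
        Estimate e = estimate(List.of("AS13456", "AS13456-AS7442", "AS7442"), index, beta);
        System.out.printf("RTT=%.0f ms, jitter var=%.0f, loss=%.4f%n", e.rttMs(), e.jitterVar(), e.loss());
    }
}
```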
3.3 Mean opinion score estimator
The second main component of the system is the MOS estimator, which helps
with interpreting the values from the network model. As no models that describe
MOS using only the parameters available in our data have been found, we will
apply statistical analysis to attempt to produce a useful estimate.
Several methods are proposed and tested to identify the model with the
best results. The tests and results are detailed in Chapter 4, while the resulting
design decisions are outlined here.
Figure 3.8: A MOS estimation model with ensemble learning applied directly to call data.
The first method considered is to apply supervised learning algorithms directly to the call data. In doing so, no assumptions about the relationship
between network metrics and MOS are made.
The supervised learning algorithms considered are the ones from Section 2.7,
the ensemble learning algorithms. In Figure 3.8, a boosting or bagging ensemble
is trained with RTT, jitter, and packet loss as input parameters, and reported
MOS values as output values. The ensemble is then directly queried to obtain
an estimated MOS value.
Figure 3.9: A MOS estimation model with clustering applied to call data.
The second method considered is to cluster data points by network metric
values, effectively discretizing the space of MOS values. The motivation behind
clustering is that the MOS values in the call data are observations of a random
variable, the mean of which is what is of interest to us. The clustering algorithms
described in Section 2.7 require the number of clusters to be specified. However,
there is no intuitive way of knowing how many clusters could be appropriate for
our purposes. A starting point for thepamount of clusters is therefore chosen
through a rule of thumb [22] to be k = n/2, where n is the data set size. The
rule of thumb was chosen over other ways of selecting k mainly for its simplicity,
as computation times were already growing quite large.
Figure 3.10: A MOS estimation model with supervised learning applied to call data with MOS estimates from clustering.
Figure 3.9 shows how this method functions similarly to the previous one, though the training data is here instead clustered with either k-means or Gaussian mixture clustering. Queries are then assigned to the most appropriate cluster in accordance with the algorithm, and the mean MOS value of the cluster's data points is returned as an estimate.
The final method considered is where clustering is first used, and supervised
learning is then applied to the resulting data. Here we wish to see if the clustered
results can be improved upon by increasing the set of possible results. Figure
3.10 illustrates how this would look.
3.3.1 Evaluation of alternatives
All of these approaches were tested using the full evaluation data set that will
be described in Section 4.2. When reviewing the results, the first method of
applying supervised learning to the call data directly is found to be the best
performing. Of the two algorithms tried for this method, the boosting algorithm
produced the best result while also being faster and requiring fewer resources.
Consequently, the MOS estimator component of the system is chosen to be
that of Figure 3.8 with a boosting ensemble. The mean square errors of all
approaches are shown in Figure 3.11. The test results will be shown in more
detail in Section 4.4.
Figure 3.11: Mean square errors of all MOS estimation approaches (GMM, GMM+Bagging, GMM+Boosting, k-means, k-means+Bagging, k-means+Boosting, Bagging, Boosting).
3.4 Design summary
Figure 3.12 shows an overview of the system’s design after all choices have been
considered. Three main components make up the system: path mapping, path
component cost estimation, and MOS estimation. The network model, which is
Figure 3.12: Final system design with three main components and their inputs and outputs.
of the greatest interest, consists of the first two components, which work together
to produce the network metric estimate. The cost estimation component is
tightly coupled with path mapping, providing estimates based on the particular
mappings from the previous step. The MOS estimation component however
could be removed from the system easily, should a decision be possible from
network metrics alone.
To train the model, two sets of data are required: the AS-level paths taken
by calls, and call data with network metrics and MOS.
Chapter 4
Evaluation
In Chapter 1, we presented a number of questions to be answered in this thesis.
Is the proposed network model as reliable as the current naïve model? Does
performance in terms of computation time improve in practice when comparing
the proposed model to the current naı̈ve model? Can the model directly predict
user experience by translating network metrics into MOS? Having decided on a
design for the system, this chapter will answer these questions.
The chapter will describe how the system and its components have been
tested and evaluated, and present the outcome. First, the objectives of the
tests and what the tests will cover are defined. Then, the choice of data sets is
presented. Next, the two main components are tested separately. Finally, the
system as a whole is tested.
4.1 Test objectives and parameters
The goal of testing the model is to ascertain how well the model cost predictions
match real data. This will be done by measuring the Mean Square Error (MSE)
and Root Mean Square Error (RMSE) of the model outputs compared to the
true observed values in the test sets.
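For reference, the two error metrics reduce to the following MATLAB one-liners,
assuming est and obs are equally sized vectors of estimates and observations
(illustrative names only).

mse  = mean((est - obs).^2);   % Mean Square Error
rmse = sqrt(mse);              % Root Mean Square Error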
For the network model, what will be tested is the estimated network metric
values for AS pairs, and the computation times required. In addition to the
MSE, plotting the estimates against observations is of interest to see what the
nature of the errors is, and how the estimator behaves. What is most interesting is how the path mapped model compares to the naïve model, particularly
in computation times, as this is where improvements can be expected.
For the MOS estimator, it is naturally the estimated MOS value that is
compared to the observed MOS values for the test data, with network metrics
as input values.
For the complete system it is the error when comparing the MOS estimate
for an AS pair and the mean observed MOS for that pair that is to be measured.
Similar to the network model, the spread of MOS values is also valuable to look
at, as it shows whether the user experience is somewhat consistent for a pair as
well. The metrics used to evaluate the models have been chosen based on prior
experience using them for statistical analysis.
4.2 Data set selection
The measured metrics are not the only factors that determine a user’s experience
of a call, as noted in Section 2.5. There are factors on the user’s side that affect
the perceived experience substantially [18].
In order to reduce the number of factors that affect the data, the platforms
were constrained. Only calls between two Windows desktop clients are used in
the test data. Unfortunately, this does not rule out differences in client versions,
which can affect quality. It also does not differentiate between wired and wireless
Internet connections, which means that things like wireless access point location
planning may affect MOS values.
The data is based on calls performed throughout February 2015, during
which time a fraction of users were randomly prompted to rate the call on a
scale from 1 to 5. If a user opted to rate the call, then the call data was logged.
To match the time period, the Route Views data used is from a single point
in time at the middle of the time period. Some routes may have changed prior
to or after this midpoint, introducing some possibility of error.
Source AS   Remote AS   Jitter (ms)   RTT (ms)   Send loss (%)   Receive loss (%)   MOS
13456       7442        20            152        0               10                 4

Table 4.1: An example of the call data entries used.
Table 4.1 shows an example of the call data used as the basis for training
and evaluating the model. ASs are identified by their AS number, and jitter
and RTT are measured in milliseconds. Packet loss is measured separately for
packets sent and received.
This data was further filtered based on AS pairs in order to be able to
perform the tests. First, only AS pairs whose paths were present in the Route
Views data could be used, as path mapping could not be performed otherwise.
Second, only AS pairs that were present at least 100 times in the call data were
included when testing the model errors. This is to have some confidence in the
data for the pairs, as well as being able to cross validate the model without
having AS pairs in the test set that were not present in the training set.
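A sketch of this frequency filter, mirroring the FREQ_CUTOFF logic in Appendix
A.2, is shown below; pairs and metrics are illustrative names for an n-by-2 matrix
of AS numbers and the corresponding metric rows, not identifiers from the thesis
code.

FREQ_CUTOFF = 100;
[~, ~, ic] = unique(pairs, 'rows');      % index of each row's AS pair
counts = histc(ic, unique(ic));          % occurrences per distinct AS pair
keep = counts(ic) >= FREQ_CUTOFF;        % rows whose pair occurs often enough
pairs = pairs(keep, :);
metrics = metrics(keep, :);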
The final data set used for the evaluation of accuracy consisted of 2,549,900
calls from 2,178 unique AS pairs. For performance evaluation, the data set
contained 2,934,919 calls from 65,579 unique AS pairs.
4.3 Network model
In testing the network model, 10-fold cross validation [16] has been performed
with the data set, training both a naı̈ve model and a path mapped model. The
evaluation was performed in MATLAB, and the implementation is available in
Appendix A.2. Table 4.2 shows the resulting errors for both models. There is
no notable impact on the average error from the path mapping, with the errors
being very similar. At a glance, the errors appear quite large however.
Model         Error   Jitter (ms)   RTT (ms)   Send loss (%)   Receive loss (%)
Path Mapped   MSE     2665          2900       0.2             0.14
              RMSE    163.3         170.3      4.5             3.8
Naïve         MSE     2666          2913       0.2             0.14
              RMSE    163.3         170.7      4.5             3.8

Table 4.2: Mean square errors and root mean square errors in 10-fold cross
validation of the network models.
              Path Mapped          Naïve
              ∈ CI      ∉ CI       ∈ CI      ∉ CI
RTT           1750      428        1749      429
Jitter        432       1746       435       1743
Send loss     1096      1082       1105      1073
Receive loss  1380      798        1385      793

Table 4.3: The number of network parameter estimates that fall inside of and
outside of the 95 % confidence intervals for the mean of the observed values for
each AS pair.
Having done cross validation, the plots in Figures 4.1-4.8 show the estimates
generated by the models plotted against the means of the data points for every
AS pair in the data set.
Table 4.3 shows the number of estimates that fall within and outside of the
95 % confidence interval (CI) for the mean of the observed values, assuming
that the observations are normally distributed around the true value. Quite a
large number of estimates fall outside of the confidence interval, indicating that
the confidence of the model’s estimates is not that great. The data shown is for
one of the cross validation folds, though the others behave similarly.
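The confidence interval check can be sketched as below, following the
fitdist/paramci usage in Appendix A.2 and assuming the observations for one AS
pair are normally distributed around the true mean; obsForPair and estimate are
illustrative names.

pd = fitdist(obsForPair, 'Normal');    % fit a normal distribution to the pair's observations
ci = paramci(pd);                      % 95 % CIs; the first column is the interval for the mean
insideCI = estimate >= ci(1,1) && estimate <= ci(2,1);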
Figures 4.1 and 4.2 show the results of the RTT estimation. The two models
produce very similar results, and seem to overall follow the diagonal line that
would mean a perfect prediction of the means. It would appear that the models
underestimate large RTT values, though the number of data points above 600
ms is not large enough to be conclusive. In the more populated ranges, the
spread is more uniform, though somewhat large, as seen in Table 4.2.
Depending on the required accuracy, this may be acceptable, especially as the
most interesting data points are the ones with high RTT. 20 % of the estimates
fall outside of the 95 % confidence interval, which is the best result of the
estimated metrics, but still a large amount. The expected value from a perfect
fit would be 95 % accuracy.

Figure 4.1: Path mapping network model RTT estimates plotted against the
means of test point values.

Figure 4.2: Naïve network model RTT estimates plotted against the means of
test point values.

Figure 4.3: Path mapping network model jitter estimates plotted against the
means of test point values.

Figure 4.4: Naïve network model jitter estimates plotted against the means of
test point values.
Figures 4.3 and 4.4 show the results of the jitter estimation. Again, the two
models have generated very similar results, but the fit is much worse than that
of RTT. The regression models have a tendency to overestimate this parameter,
and have comparatively few error cases where they have underestimated it. Considering that the majority of the observed means are below 200 ms, the RMSE of 163
ms makes this parameter's estimation very untrustworthy. In fact, 80 % of the
estimated values fall outside of the confidence intervals for the observed means.

Figure 4.5: Path mapping network model sending packet loss estimates plotted
against the means of test point values.

Figure 4.6: Naïve network model sending packet loss estimates plotted against
the means of test point values.

Figure 4.7: Path mapping network model receiving packet loss estimates plotted
against the means of test point values.

Figure 4.8: Naïve network model receiving packet loss estimates plotted against
the means of test point values.
Figures 4.5 and 4.6 show the results of the send packet loss estimation. Yet
again, the models generate very similar results, though quite concentrated close
to zero. The estimator has a tendency to overestimate the packet loss, not
completely unlike the jitter estimator. This is particularly noticeable when the
mean is zero, as there are plenty of estimates along the X-axis. Other than that
slight bias though, the estimator behaves quite randomly, with widely scattered
values. Half of the estimates fall within the 95 % confidence interval, which
again is poor.
Figures 4.7 and 4.8 show the results of the receiving packet loss estimation.
Unsurprisingly, the models match closely also for the last parameter. Like the
other packet loss parameter, the values are concentrated close to zero. The
estimator seems to work slightly better for this data, though the plot is still
very scattered. The RMSE is down slightly as well, and 63 % of estimated
values fall within the confidence intervals here. The performance must still be
considered poor however.
Overall, it can be said that the path mapping step has not noticeably influenced
the linear regression step of the network model. The regression results appear
quite poor; however, it can be noted that in all of the cases the models produce
large estimates for the large observed values as well. If these cases with large,
i.e. poor, values are compared to significantly lower estimates, then there can be
some value derived from these estimates. This is because the significance of the
differences between estimates likely outweighs the probability of error in the
individual estimates.
Figure 4.9: Total number of ASs present as a function of included distinct AS
pairs.
Remaining then is to compare computation times between the two models,
which was what we hoped to improve. Figure 4.9 shows how the total number of
ASs grows as a function of the number of distinct AS pairs. This is of interest to
us as it reflects the difference in the number of variables to solve for in the naïve and the
proposed model. As expected, the growth rate decreases fairly rapidly, having
included 40,000 of the around 50,000 ASs already at slightly more than 60,000
distinct pairs. At the beginning though, the total number of ASs is naturally
higher, as many paths consist of more than two ASs. At 5000 distinct pairs, the
two are roughly equal.
Figure 4.10 shows the computation times of running the naı̈ve model and the
path mapped model as a function of distinct AS pairs. It should be noted that
the number of equations in the system was not held constant as the pair count
increased, but it was the same between the models at each measured point. The
naïve model is in the beginning slightly faster, though by very small amounts.
The path mapped model starts to perform better after a while though, at about
the same point as AS count growth visibly starts to decline in Figure 4.9. The
difference appears to be growing at a consistent rate, though the slope of the
graphs could be affected by the varying number of equations.

Figure 4.10: Time to solve the LLSE system as a function of distinct AS pairs
included.

This confirms that the path mapped model performs better than the naïve
model in the tests as the number of distinct AS pairs grows.
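The timing comparison itself amounts to solving the two sparse least squares
systems and measuring the elapsed time, roughly as sketched below; A, A_n, and
b are the matrices loaded in Appendix A.2, where A is assumed to be the path
mapped system and A_n the naïve one.

tic; x   = A   \ b; tPath  = toc;   % path mapped: one column per path component
tic; x_n = A_n \ b; tNaive = toc;   % naive: one column per distinct AS pair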
4.4 Mean opinion score estimator
To test the MOS estimator, we again use 10-fold cross validation. However,
the data set had to be reduced in order to allow for computations within the
limited memory space and time available. The MOS data set consists of 320,000
data points sampled from the previously used data set.
Beginning with the clustering algorithms, the optimal number of clusters was
found through trial and error. Using the rule of thumb mentioned in Section 3.3,
we use k = √(320,000/2) = 400 as a starting point. Table 4.4 shows the results
of these initial tests. The results of clustering, as stated in Section 3.3, are the
mean values of the resulting clusters’ training data points, and it is against this
mean value that the tested values are compared.
The Gaussian mixture algorithm does not seem affected by varying the number of clusters, whereas k-means loses accuracy as the number of clusters increases. Since the clustering algorithms are computationally intensive, greater
granularity in the cluster numbers is not feasible to test. Therefore, the best
result for each algorithm in this table is selected as the number of clusters to be
used, which was 300 in both cases.
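For the Gaussian mixture alternative, a minimal MATLAB sketch could look as
follows, assuming fitgmdist and cluster from the Statistics and Machine Learning
Toolbox; the thesis does not state the exact options used, so the regularization
value and the variable names here are only illustrative.

gm = fitgmdist(Xtrain, 300, 'RegularizationValue', 1e-5);   % fit a 300-component mixture
assignTrain = cluster(gm, Xtrain);                          % hard assignment of training points
clusterMos  = accumarray(assignTrain, mosTrain, [300 1], @mean);
mosEstimate = clusterMos(cluster(gm, Xquery));              % query assigned to its most likely component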
For these values, the results of the clustering are used to train the boosting
and bagging learner ensembles, as depicted in Figure 3.10. We will see that the
33
results would have to be significantly improved to be of interest compared to
not using clustering, which seems unlikely.
Clusters            300      350      400      450      500
Gaussian mixture    2.0023   2.0024   2.0023   2.0024   2.0024
k-means             2.0033   2.0036   2.0039   2.0046   2.0053

Table 4.4: Mean square errors for the two clustering algorithms with different
cluster sizes using 10-fold cross validation. The first row contains the cluster
sizes.
Below, Table 4.5 shows the results of running the two supervised learning
algorithms after processing the training data by either of the clustering algorithms as well as with no pre-processing. In the clustered data cases, the model
is trained using the means of the parameters belonging to a cluster. The models
are trained with 100 decision trees, limited by memory space.
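A hedged sketch of how the two ensembles could be trained in MATLAB is given
below, assuming the fitensemble function with 100 bagged or least-squares boosted
trees; the thesis does not name the exact calls used, and Xtrain, mosTrain, and
Xquery are illustrative names.

bagged  = fitensemble(Xtrain, mosTrain, 'Bag', 100, 'Tree', 'Type', 'regression');
boosted = fitensemble(Xtrain, mosTrain, 'LSBoost', 100, 'Tree');
mosHat  = predict(boosted, Xquery);   % MOS estimates from the boosted ensemble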
           Raw data   Gaussian mixture   k-means
Bagged     1.8683     2.0055             2.0182
Boosted    1.8502     2.1820             2.0165

Table 4.5: Mean square errors for the two supervised learning algorithms using
different training data.
The extra model on top of clustering seems to somewhat increase the MSE,
though only by a small amount. This is largely unimportant though, as it is clear
that the model produces the best results when the data has not been clustered
prior to training. The boosted model is slightly better than the bagged model
for the raw data in the tests, and also has the benefit of being faster to train
and requiring less memory.
Figure 4.11 shows the result of using the best model to predict MOS on the
test partition of one of the cross validation folds. Most data points have values
that are high on the scale (indicating high quality), with a large spike at the
top values. It stands to reason that a large user base is more likely where the
quality is high, so more data points would be collected for such calls.
Plotting the MOS estimates of the model against the averages for each AS
pair paints a less positive picture, however, as shown in Figure 4.12. The estimated
values are overall in a quite slim range compared to the observed values, with
plenty of values close to 5, where the estimator has none at all. Only 61 % of
predictions fall within the 95 % confidence intervals of mean of the observed
values (1324 within CI compared to 854 outside CI).
Figure 4.11: Distribution of MOS predictions using the boosted raw data model.
Figure 4.12: Boosted raw data model estimates plotted against the means of
test point values for AS pairs.
4.5 Full system
For the full system test, the MOS estimator is trained with the full data set, as
this particular model for MOS estimation had no issues with the size. 10-fold
cross validation is used here as well.
MSE     0.196
RMSE    0.443

Table 4.6: Mean square error and root mean square error for the full system's
predictions of AS pair MOS values compared to their observed means.
In Table 4.6, the error values for the tests are shown. While smaller than
those listed earlier, note that these values are for the means of AS pairs, which
would reduce the influence of outliers on the errors.
When the estimated values are plotted against the observed means for AS pairs
in Figure 4.13, the result appears quite similar to Figure 4.12. The estimates from
the network model have slightly improved the fit though, aligning the data
points better overall with the optimal diagonal line. Now 64 % of the estimates
fall within the 95 % confidence intervals for the mean of observed values (1387
within CI compared to 791 outside CI).
While few AS pairs with low observed mean MOS scores are present in
the data, the ones that do appear indicate that the system would overestimate
those values by multiples of the RMSE, with degrading performance the worse
the quality gets.
Figure 4.13: System estimates plotted against the means of test point values.
Chapter 5
Summary and conclusions
This chapter will discuss the results from the previous chapter and how the
choices made during the design and implementation of the models have affected
the outcome of the process. It will also relate back to the original problem,
evaluating whether the problem has been solved and what conclusions can be
drawn from the results produced. Finally, future directions and alternatives are
discussed.
5.1 Summary of the outcomes
As noted for the network model previously, none of the parameter estimations
were notably affected by the path mapping step in the estimation. Whether
this is because of the number of ASs used in training the model, or because no
notable amount of information stands to be gained from the path mapping step
is unclear, as a larger set of AS paths was unavailable at the time of testing.
With the used set of paths, the number of path components in the regression
model was larger than the number of unique ASs and paths, which would not
be the case with a larger set of paths. With more paths than components,
the probability that a component is traversed in multiple paths should increase,
thereby affecting the estimation of the path network metrics.
Another potential issue with the data is that some AS pairs are a lot more
common than others, causing their performance to influence the results more
than is desirable. However, the fact that the network model and the naïve model
differ only slightly in their predictions, even though the network model has many
more shared components, implies that this has a relatively small impact, if any. To
what extent this is a consequence of the properties of the test data is therefore
an open question.
Looking at the errors for the network parameter estimates, RTT estimation
performs significantly better than jitter and packet loss estimation. Unfortunately, RTT has also been found to be less influential relative to jitter and packet
loss in affecting user satisfaction [6]. RTT is directly related to the physical
distance that the packets have to travel, whereas jitter and packet loss are not,
which could be the cause of a greater degree of consistency in RTT data points.
As stated earlier however, the estimations for the observed large values are in
most cases accurate enough to be able to compare to estimates of a path that
is significantly better. This means that for the pairs with the largest (worst)
network metric measurements, that is, the most interesting ones, this model can
improve QoS.
The measured predictive power of the model is also affected by the choice
of evaluating metrics (Mean Square Error and Confidence Intervals). Other
metrics may have given additional insights into how well the model can be
expected to perform, but the current scope was set given the time constraints.
The performance of the path mapped model was found to be better than
that of the naı̈ve model, though the theoretically expected very large difference
was not visible. It is possible that the data set was not large enough to fully
illustrate the difference between the models, since Figure 4.9 shows that the
difference in growth rate of the number of variables has just begun to manifest
itself. It may also be that hardware resources affected computation times, as
computations often consumed all available resources on the test machine.
The poor results in the MOS estimation are not very surprising, seeing as
there are many factors that were not considered in the estimation that affect
user satisfaction, as mentioned in Section 2.5. There are other models internally
at Skype that are constructed similarly, but take into account a number of additional factors, amongst them bandwidth, that work well for estimating MOS.
It stands to reason then that the parameters used are insufficient for producing
a useful estimate, and that, given this, the network model must be utilized based
on the network parameters it produces.
5.2 Lessons learned
A conscious design decision that was made was to retain the simple LLSE in
order to be able to compare the effect of path mapping with the naı̈ve model
which had utilized it previously. Trying other models would have been an addition to the scope that would have required additional time. Other models may
have been more suitable for producing estimates on their own, or might have
made use of the path mapping in a better way.
It may have been more beneficial to have omitted the MOS estimation component from the scope, and instead solely focused on finding a good model for
network parameters. This would not have been sufficient to produce a full system as it was originally intended, but could potentially have laid the foundation
for building a more reliable system in the future.
When evaluating the system, only a single data point was used
for the Internet topology. Ideally, multiple data points would have been tested,
however converting the data from its source format took two days per data point
with the tools available. This was deemed too expensive considering the time
constraints. Given that the effect of path mapping was negligible in tests, the
38
use of a single data point appears acceptable.
The majority of the implementation was done in MATLAB, which provided
easy access to a multitude of algorithms and functions necessary. However, this
also limited what was possible to some degree. MATLAB’s maximum memory
limit was often reached with the amounts of data that the model processes,
leaving some algorithms unusable for large calculations. It is possible that more
efficient implementations of algorithms that are customized for our type of data,
and written for a platform with more memory, could lead to better accuracy.
Even in cases where MATLAB was able to run an algorithm on the full
data set and store the results, the available hardware was often a bottleneck. With some tests taking several days, the number of possible models
and parameters that could be tested was severely limited. The consequences of
implementation errors were also harsh, delaying progress at times and necessitating re-prioritization of tasks.
In planning for the implementation and testing, there was clearly a need to
define how the models should be evaluated with smaller and larger data sets
to determine their viability. This would have made it easier to eliminate poor
models early, and also to plan for the possible necessity of additional hardware
to run large tests more efficiently.
5.3 Conclusions
The purpose of this thesis has been to evaluate whether or not a model such
as the one constructed here can be used to understand what impact the global
networks can have on real-time media quality. With a high RMSE at best, and
seemingly random estimations at worst, this model can be said to not work well
for data in general, but does have the potential to improve the worst performing
cases. The proposed model is also a step forward compared to the previous naïve
model, as it scales better.
Adding information to the network data in the form of a network topology
has had no discernible effect on the accuracy of the results, and appears from
the results to affect only the computation times. This does however
introduce an additional cost of collecting more network data, and building and
maintaining a system that accurately maps data onto this network topology. As
the data set was limited by the available BGP route data, this behavior is not
necessarily representative of how a full topology would behave, however.
MOS estimation is a complex endeavor, and the results in this thesis show
that these supervised learning models cannot accurately model MOS with the
limited amount of information that was provided. Since MOS is a subjective measurement, it is not surprising that three objective measurements cannot model it
well in an uncontrolled user environment. For the proposed network model to be
of use, decisions must be made from the estimated network performance metrics
alone.
5.4 Future work
Going forward, there are several things that could be done that either continue
down the same path, or try a different approach. Here we will list some ideas
that we have found interesting.
5.4.1 Path mapping
While there has been no sign of any tangible effect on the accuracy of the results
in this thesis, there remains the possibility that a larger data set, or a different
regression model would produce more favorable results. To fully rule out this
possibility, these options should be explored. Procuring a larger data set is a
problem in itself however, as deriving accurate AS-level paths from traceroutes
has been an open problem for years [21, 27].
Hybrid mapping model
If large scale mapping does seem beneficial, there may still be some issues in
collecting path information. Traceroutes as a way of collecting path information
are not completely reliable, as there may be routers filtering Internet Control
Message Protocol (ICMP) requests on the path [20]. In the case where full
path information is not available, the two path mapping methods described in
Section 3.2.1 could be used together to arrive at an estimate of the path. With
a small neighborhood search in the cases where the true path is not known, the
number of paths found should be substantially less than a full network search,
and the likelihood of it being a correct path higher. The viability of this method
would be interesting to explore.
5.4.2 Parameter exploration
While RTT is directly related to the physical distance that the packets have to
travel, jitter and packet loss are not as intuitive on a macro level. For example,
jitter has been found to correlate with bandwidth [6], which can differ between
links over the path. Furthermore, several sources do not treat packet loss and
jitter as separate variables [6, 8, 18]. How jitter and packet loss are handled is
a result of how the codec is designed [8, 18], which could further affect modeling.
Researching what network parameters influence jitter and packet loss, and
gathering data on these could aid in the estimation of them, and perhaps make
the model explored in this thesis more viable.
5.4.3 Alternative models
A path mapping approach is only one possible way of approaching this problem,
and this type of inference also restricts the possible parameters to additive ones.
Given that MOS estimation likely requires more parameters, a model that could
handle non-additive parameters such as bandwidth would be interesting. The
concern then would be to ensure that data could be processed fast enough, as
more complex algorithms will require heavy computations for the sizable data
sets used.
One alternative that was considered that does not require additivity is an
A/B testing model, where a relayed path and a directly routed path are alternated. Eventually, enough samples would be gathered to distinguish the true
metric values from each other, and a decision could be made about which path
to choose going forward. An advantage of this approach is that the underlying
network metrics do not need to be considered if one so desires. As any numerical
information tracked could be used for the model, a path’s MOS value could be
estimated directly instead of estimating it from the network metrics, reducing
the errors occurring due to missing information.
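As a sketch of this idea, the decision could be made with a simple two-sample
test once enough ratings have been collected on each path; mosRelay and
mosDirect are illustrative sample vectors, and the choice of test is an assumption
rather than part of the thesis.

[h, p] = ttest2(mosRelay, mosDirect);        % do the two paths differ at the 5 % level?
if h == 1 && mean(mosRelay) > mean(mosDirect)
    choice = 'relay';                        % enough evidence that the relayed path is better
else
    choice = 'direct';
end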
Bibliography
[1] CIDR Report. http://www.cidr-report.org/as2.0/ (Retrieved on 2015-05-06).
[2] The CAIDA AS Relationships Dataset, 2015-02-01.
http://www.caida.org/data/as-relationships/ (Retrieved on 2015-03-10).
[3] University of Oregon Route Views Project. http://www.routeviews.org/
(Retrieved on 2015-03-10).
[4] Daniele Apiletti, Elena Baralis, Fabio Pulvirenti, Silvia Chiusano, Tania
Cerquitelli, Paolo Garza, Luigi Grimaudo, and Luca Venturini. D3.1 Available
algorithms identification. Public deliverable, ONTIC Project (GA
number 619633), January 2015.
[5] H. Assem, M. Adel, B. Jennings, D. Malone, J. Dunne, and P. O’Sullivan. A
generic algorithm for mid-call audio codec switching. In Integrated Network
Management (IM 2013), 2013 IFIP/IEEE International Symposium on,
pages 1276–1281, May 2013.
[6] Kuan-Ta Chen, Chun-Ying Huang, Polly Huang, and Chin-Laung Lei.
Quantifying Skype User Satisfaction. In Proceedings of the 2006 Conference
on Applications, Technologies, Architectures, and Protocols for Computer
Communications, SIGCOMM ’06, pages 399–410, New York, NY, USA,
2006. ACM.
[7] Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski,
Andrew Y. Ng, and Kunle Olukotun. Map-Reduce for Machine Learning
on Multicore. In NIPS, pages 281–288. MIT Press, 2006.
[8] R. G. Cole and J. H. Rosenbluth. Voice over IP Performance Monitoring.
SIGCOMM Comput. Commun. Rev., 31(2):9–24, April 2001.
[9] Thomas G. Dietterich. Ensemble Methods in Machine Learning. In Proceedings of the First International Workshop on Multiple Classifier Systems,
MCS ’00, pages 1–15, London, UK, UK, 2000. Springer-Verlag.
[10] X. Dimitropoulos, D. Krioukov, B. Huffaker, k. claffy, and G. Riley. Inferring AS Relationships: Dead End or Lively Beginning? In 4th Workshop on
Efficient and Experimental Algorithms (WEA), volume 3503, pages 113–
125, Santorini, Greece, May 2005. Springer Lecture Notes in Computer
Science.
[11] Xenofontas Dimitropoulos, Dmitri Krioukov, Marina Fomenkov, Bradley
Huffaker, Young Hyun, kc claffy, and George Riley. AS Relationships:
Inference and Validation. SIGCOMM Comput. Commun. Rev., 37(1):29–
40, January 2007.
[12] Vladimir Estivill-Castro. Why So Many Clustering Algorithms: A Position
Paper. SIGKDD Explor. Newsl., 4(1):65–75, June 2002.
[13] Lixin Gao. On Inferring Autonomous System Relationships in the Internet.
IEEE/ACM Trans. Netw., 9(6):733–745, December 2001.
[14] Lixin Gao and Feng Wang. The extent of AS path inflation by routing
policies. In Global Telecommunications Conference, 2002. GLOBECOM
’02. IEEE, volume 3, pages 2180–2184 vol.3, Nov 2002.
[15] Krishna P. Gummadi, Stefan Saroiu, and Steven D. Gribble. King: Estimating Latency Between Arbitrary Internet End Hosts. In Proceedings of
the 2nd ACM SIGCOMM Workshop on Internet Measurment, IMW ’02,
pages 5–18, New York, NY, USA, 2002. ACM.
[16] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of
Statistical Learning. Springer Series in Statistics. Springer New York Inc.,
New York, NY, USA, 2001.
[17] J. Hawkinson and T. Bates. Guidelines for creation, selection, and registration of an Autonomous System (AS). RFC 1930, RFC Editor, March
1996.
[18] ITU-T. Recommendation G.107 : The E-Model, a computational model
for use in transmission planning. Technical report, 2011.
[19] ITU-T. Recommendation Y.1540 : Internet protocol data communication
service – IP packet transfer and availability performance parameters. Technical report, 2011.
[20] Priya Mahadevan, Dmitri Krioukov, Marina Fomenkov, Xenofontas Dimitropoulos, k c claffy, and Amin Vahdat. The Internet AS-level Topology:
Three Data Sources and One Definitive Metric. SIGCOMM Comput. Commun. Rev., 36(1):17–26, January 2006.
[21] Zhuoqing Morley Mao, Jennifer Rexford, Jia Wang, and Randy H. Katz.
Towards an Accurate AS-level Traceroute Tool. In Proceedings of the 2003
Conference on Applications, Technologies, Architectures, and Protocols for
Computer Communications, SIGCOMM ’03, pages 365–378, New York,
NY, USA, 2003. ACM.
[22] Kanti V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis
(Probability and Mathematical Statistics). Academic Press, January 1980.
[23] Stefan Savage, Andy Collins, Eric Hoffman, John Snell, and Thomas Anderson. The End-to-end Effects of Internet Path Selection. In Proceedings
of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM ’99, pages 289–299, New
York, NY, USA, 1999. ACM.
[24] H. Tangmunarunkit, R. Govindan, S. Shenker, and D. Estrin. The Impact
of Routing Policy on Internet Paths. In INFOCOM 2001. Twentieth Annual
Joint Conference of the IEEE Computer and Communications Societies.
Proceedings. IEEE, volume 2, pages 736–742 vol.2, 2001.
[25] Y. Angela Wang, Cheng Huang, Jin Li, and Keith W. Ross. Queen: Estimating Packet Loss Rate Between Arbitrary Internet Hosts. In Proceedings
of the 10th International Conference on Passive and Active Network Measurement, PAM ’09, pages 57–66, Berlin, Heidelberg, 2009. Springer-Verlag.
[26] Ye Wang, Cheng Huang, Jin Li, Philip A. Chou, and Y. Richard Yang.
QoSaaS: Quality of Service as a Service. In Proceedings of the 11th USENIX
Conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services, Hot-ICE’11, pages 6–6, Berkeley, CA, USA,
2011. USENIX Association.
[27] Baobao Zhang, Jun Bi, Yangyang Wang, Yu Zhang, and Jianping Wu.
Revisiting IP-to-AS Mapping for AS-level Traceroute. In Proceedings of
The ACM CoNEXT Student Workshop, CoNEXT ’11 Student, pages 16:1–
16:2, New York, NY, USA, 2011. ACM.
Appendices
Appendix A
Source code
A.1 Constrained BFS Java implementation

// Returns distances and parents for all end nodes in the graph
private ArrayList<NodeInfo> constrainedBFS(int startIndex) {
    ArrayList<NodeInfo> dist = new ArrayList<NodeInfo>(s.nodeMapping.size());
    for (int i = 0; i < s.nodeMapping.size(); i++)
        dist.add(new NodeInfo());
    dist.get(startIndex).dist = 0;

    Queue<QueueEntry> queue = new LinkedList<QueueEntry>();
    queue.add(new QueueEntry(0, startIndex, false));
    QueueEntry front;
    while ((front = queue.poll()) != null) {
        int savedDist = front.constrained ?
                dist.get(front.index).cDist : dist.get(front.index).dist;
        if (savedDist > front.dist)
            continue;
        int newDist = savedDist + 1;

        for (Relation rel : s.graph.get(front.index)) {
            NodeInfo neighbor = dist.get(rel.end);
            if (rel.type == RelType.C2P) {
                if (front.constrained)
                    continue;
                if (newDist == neighbor.dist
                        && !neighbor.parents.contains(front.index)) {
                    neighbor.parents.add(front.index);
                } else if (newDist < neighbor.dist) {
                    neighbor.parents.clear();
                    neighbor.parents.add(front.index);
                    neighbor.dist = newDist;
                    queue.add(new QueueEntry(newDist, rel.end, false));
                }
            } else {
                if (front.constrained && rel.type == RelType.P2P)
                    continue;
                if (newDist == neighbor.cDist
                        && !neighbor.cParents.contains(front.index)) {
                    neighbor.cParents.add(front.index);
                } else if (newDist < neighbor.cDist) {
                    neighbor.cParents.clear();
                    neighbor.cParents.add(front.index);
                    neighbor.cDist = newDist;
                    queue.add(new QueueEntry(newDist, rel.end, true));
                }
            }
        }
    }
    return dist;
}

// Return the set of paths to a specified end node
private Set<ArrayList<Integer>> constructPaths(int index,
        ArrayList<NodeInfo> dist) {
    Set<ArrayList<Integer>> paths = new HashSet<ArrayList<Integer>>();
    NodeInfo node = dist.get(index);
    ArrayList<Integer> parents = node.dist < node.cDist ? node.parents : node.cParents;
    for (int parent : parents)
        paths.addAll(constructPaths(parent, dist));
    if (paths.size() == 0)
        paths.add(new ArrayList<Integer>());
    for (List<Integer> path : paths)
        path.add(index);
    return paths;
}
A.2 MATLAB estimate tests

%% Load data
load A.dat;
load A_n.dat;
load b.dat;

A = spconvert(A);
A_n = spconvert(A_n);

%% Filter data

FREQ_CUTOFF = 100;

[C, ia, ic] = unique(A, 'rows');
cc = histc(ic, unique(ic));

rowmask = find(cc(ic) >= FREQ_CUTOFF);
A = A(rowmask, :);
A_n = A_n(rowmask, :);
b = b(rowmask, :);

colmask = find(any(A));
colmask_n = find(any(A_n));
A = A(:, colmask);
A_n = A_n(:, colmask_n);
[C, ia, ic] = unique(A, 'rows');

%% Run cross validation

FOLD_CNT = 10;

cv = cvpartition(size(A, 1), 'KFold', FOLD_CNT);
folds = cell(FOLD_CNT, 1);
folds_n = cell(FOLD_CNT, 1);

for i = 1:FOLD_CNT
    folds{i} = cvprocess(A, b, ic, cv.training(i), cv.test(i));
    folds_n{i} = cvprocess(A_n, b, ic, cv.training(i), cv.test(i));
end

mse = zeros(1,4);
mse_n = zeros(1,4);
for i = 1:FOLD_CNT
    mse = mse + calcmse(folds{i});
    mse_n = mse_n + calcmse(folds_n{i});
end
mse = mse ./ 10;
mse_n = mse_n ./ 10;

%% Fold-specific data
% means, model predictions, and confidence intervals for means based on
% test data points

FOLD_IDX = 1;

means = zeros(4, size(folds{FOLD_IDX}, 2));
preds = zeros(4, size(folds{FOLD_IDX}, 2));
ci1 = zeros(2, size(folds{FOLD_IDX}, 2));
ci2 = zeros(2, size(folds{FOLD_IDX}, 2));
ci3 = zeros(2, size(folds{FOLD_IDX}, 2));
ci4 = zeros(2, size(folds{FOLD_IDX}, 2));
for i = 1:size(folds{FOLD_IDX}, 2)
    [p, m, ci] = parsefelement(folds{FOLD_IDX}{i});
    means(:,i) = m;
    preds(:,i) = p;
    ci1(:,i) = ci(:,1);
    ci2(:,i) = ci(:,2);
    ci3(:,i) = ci(:,3);
    ci4(:,i) = ci(:,4);
end

means_n = zeros(4, size(folds_n{FOLD_IDX}, 2));
preds_n = zeros(4, size(folds_n{FOLD_IDX}, 2));
ci1_n = zeros(2, size(folds_n{FOLD_IDX}, 2));
ci2_n = zeros(2, size(folds_n{FOLD_IDX}, 2));
ci3_n = zeros(2, size(folds_n{FOLD_IDX}, 2));
ci4_n = zeros(2, size(folds_n{FOLD_IDX}, 2));
for i = 1:size(folds_n{FOLD_IDX}, 2)
    [p, m, ci] = parsefelement(folds_n{FOLD_IDX}{i});
    means_n(:,i) = m;
    preds_n(:,i) = p;
    ci1_n(:,i) = ci(:,1);
    ci2_n(:,i) = ci(:,2);
    ci3_n(:,i) = ci(:,3);
    ci4_n(:,i) = ci(:,4);
end

function [c] = cvprocess(X, Y, i, cvtr, cvte)
    Xp = X(cvtr, :);
    Yp = Y(cvtr, :);
    idx = i(cvtr, :);
    b = Xp \ Yp;
    c = {};
    y_hat = Xp * b;
    for j = 1:size(Xp, 1)
        c{idx(j)} = y_hat(j,:);
    end
    Xt = X(cvte, :);
    Yt = Y(cvte, :);
    it = i(cvte, :);
    for j = 1:size(Xt, 1)
        c{it(j)} = [c{it(j)}; Yt(j,:)];
    end
end

function [prediction, means, ci] = parsefelement(el)
    el(:,2:3) = 1 - exp(el(:,2:3));
    el(:,1) = sqrt(el(:,1));
    prediction = el(1,:);
    ci = zeros(2, size(el, 2));
    means = zeros(1, size(el, 2));

    for i = 1:size(el, 2)
        pd = fitdist(el(2:end, i), 'Normal');
        means(i) = mean(el(2:end, i));
        cis = paramci(pd);
        ci(:,i) = cis(:,1);
    end
end

function [mse] = calcmse(cvt)
    cnt = 0;
    mse = zeros(1,4);
    for i = 1:size(cvt, 2)
        cvt{i}(:,1) = sqrt(max(0, cvt{i}(:,1)));
        cvt{i}(:,2:3) = 1 - exp(cvt{i}(:,2:3));
        pred = cvt{i}(1,:);
        test = cvt{i}(2:end,:);
        for j = 1:size(test, 1)
            mse = mse + (pred - test(j,:)).^2;
            cnt = cnt + 1;
        end
    end
    mse = mse ./ cnt;
end
In English
The publishers will keep this document online on the Internet - or its possible
replacement - for a considerable time from the date of publication barring
exceptional circumstances.
The online availability of the document implies a permanent permission for
anyone to read, to download, to print out single copies for your own use and to
use it unchanged for any non-commercial research and educational purpose.
Subsequent transfers of copyright cannot revoke this permission. All other uses
of the document are conditional on the consent of the copyright owner. The
publisher has taken technical and administrative measures to assure authenticity,
security and accessibility.
According to intellectual property law the author has the right to be
mentioned when his/her work is accessed as described above and to be protected
against infringement.
For additional information about the Linköping University Electronic Press
and its procedures for publication and for assurance of document integrity,
please refer to its WWW home page: http://www.ep.liu.se/