Using Model Trees for Computer Architecture Performance Analysis of Software Applications

ElMoustapha Ould-Ahmed-Vall, James Woodlee, Charles Yount, Kshitij A. Doshi and Seth Abraham
Intel Corporation, 5000 W Chandler Blvd, Chandler, AZ 85226
[email protected] and {jim.woodlee,chuck.yount,kshitij.a.doshi,seth.abraham}@intel.com

Abstract— The identification of performance issues on specific computer architectures has a variety of important benefits, such as tuning software to improve performance, comparing the performance of various platforms and assisting in the design of new platforms. In order to enable this analysis, most modern micro-processors provide access to hardware-based event counters. Unfortunately, features such as out-of-order execution, pre-fetching and speculation complicate the interpretation of the raw data. Thus, the traditional approach of assigning a uniform estimated penalty to each event does not accurately identify and quantify performance limiters. This paper presents a novel method employing a statistical regression-modeling approach to better achieve this goal. Specifically, a model-tree approach based on the M5' algorithm is implemented and validated that accounts for event interactions and workload characteristics. Data from a subset of the SPEC CPU2006 suite is used by the algorithm to automatically build a performance-model tree, identifying the unique performance classes (phases) found in the suite and associating with each class a unique, explanatory linear model of performance events. These models can be used to identify performance problems for a given workload and to estimate the potential gain from addressing each problem. This information can help orient performance optimization efforts, focusing available time and resources on the techniques most likely to address the performance problems with the highest potential gain. The model tree exhibits high correlation (more than 0.98) and low relative absolute error (less than 8%) between predicted and measured performance, attesting that it is a sound approach for performance analysis of modern superscalar machines.

I. INTRODUCTION

The identification of performance issues of software applications on specific computer architectures has a variety of important benefits. Primarily, such analysis is used to tune the applications to improve performance on one or more computer platforms, e.g., to reduce execution time or support more simultaneous users. It can also be used to compare the performance behaviors of various platforms or even to help design new platforms. In order to enable this analysis, most modern micro-processors provide access to hardware-based event counters. Unfortunately, the resulting event counts cannot be interpreted directly to comprehend the contribution of each event to actual performance. This paper presents a novel method employing a statistical regression-modeling approach to better achieve this goal.

A variety of efficient measurement tools facilitate the selection and collection of hardware event counts during workload execution [1], [2]. The resulting ability to associate processor event statistics with the execution of an application at a fine time-grain, with negligible processor perturbation, creates the opportunity to understand what causes an application to perform below its true potential. An ideal analysis methodology should be able to dissect the event counter data, both to identify which micro-architectural factors reduce performance and to quantify the impacts of these factors.
In practice, gathering and analyzing event data for performance characterization tends to occur largely in an ad-hoc fashion [3]. It is customary, for example, to capture frequencies of events such as cache misses and branch mispredicts and use them to estimate the per-instruction cycle penalties from these events independently in a first-order analysis. Modern machines elide many of these penalties with dynamic and speculative execution; for example, independent instructions can proceed while a load stalls, and control speculation allows execution to proceed when branch resolution stalls. Thus the amount of penalty successfully removed depends on the available instruction-level parallelism and the instantaneous interactions between micro-architectural events.

In response, the work presented here explores a statistical, machine-learning approach to performance modeling. Specifically, a novel solution based on the M5' algorithm [4], [5] is proposed for classification and performance evaluation. Model trees are a sub-class of regression trees [6]. At leaf nodes, they employ linear models, which improves compactness and prediction accuracy relative to classical regression trees. M5' uses a divide-and-conquer approach to recursively partition the input space into homogeneous subsets, so that linear fitting at the leaf nodes can explain the remaining variability. The partitioning generates ordered rules for reaching the leaf-node models, and the leaf-node linear models quantify, in a statistically rigorous way, the contribution of each micro-architectural event to the overall performance. The power of a prediction model arrived at in this way is that it is interpretable, in contrast with other machine-learning approaches such as neural networks. The approach also extends seamlessly to workloads that contain multiple execution phases [7].

The proposed model is trained using data collected from executing several SPEC CPU2006 [8] workloads on a 2.4 GHz Intel® Core™ 2 Duo processor. The execution of each workload is divided into sections of equal numbers of retired instructions. This technique helps to localize classification over distinct phases. In each section, per-instruction ratios are obtained for the execution time and for selected hardware events arising during that section's execution. The number of cycles per instruction (CPI) is used as the performance metric. The freely available open-source software package WEKA [9] is employed for model training and for comparing different machine-learning techniques. A prototype of the resulting performance model is implemented using MATLAB. The resulting performance model provides accurate estimates of performance impacts from various micro-architectural events and groups of interacting events. A 10-fold cross validation demonstrates a 0.98 correlation between predicted and measured CPI and a relative absolute error below 8%. Comparisons with other machine-learning methods establish the soundness of this approach for analyzing performance behaviors of modern superscalar machines.

The remainder of the paper is organized as follows. Section II presents related work. Section III formulates the performance analysis problems addressed in this paper. Section IV describes the solution approach. Section V presents our results and evaluates the proposed approach. Section VI concludes the paper.
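To make this pipeline concrete, the sketch below approximates the approach in Python. This is not the paper's implementation: M5' itself ships with WEKA, and here a model tree is merely approximated by growing a regression tree and fitting an ordinary linear model in each leaf; the file name sections.csv and its column layout are hypothetical placeholders.

    # Approximating a model tree: a regression tree to classify sections into
    # leaves, plus one linear model per leaf (a sketch, not the WEKA M5').
    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.linear_model import LinearRegression

    data = pd.read_csv("sections.csv")      # hypothetical: one row per workload section
    X = data.drop(columns=["CPI"]).values   # per-instruction event ratios
    y = data["CPI"].values                  # cycles per instruction

    # min_samples_leaf plays the role of M5''s minimum leaf population
    # (430 instances in this paper).
    tree = DecisionTreeRegressor(min_samples_leaf=430).fit(X, y)

    # Fit one linear model per leaf, mimicking M5''s leaf models.
    leaf_ids = tree.apply(X)                # leaf index of each training section
    leaf_models = {leaf: LinearRegression().fit(X[leaf_ids == leaf],
                                                y[leaf_ids == leaf])
                   for leaf in np.unique(leaf_ids)}

    def predict_cpi(x):
        """Route a section to its leaf, then apply that leaf's linear model."""
        leaf = tree.apply(x.reshape(1, -1))[0]
        return leaf_models[leaf].predict(x.reshape(1, -1))[0]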
II. RELATED WORK

Recent years have seen several attempts to build models for performance analysis of processors. Unfortunately, most of these models fail to include many micro-architectural events and design-space parameters, which leaves their validity unknown when a large set of events and design parameters are present. This is mainly because these models require prior knowledge about significant events and parameters, and the required knowledge is gained from expensive, simulation-based sensitivity analysis. In addition to its prohibitive cost, simulation accuracy is questionable, especially in the case of applications whose time-varying behaviors are not easily represented in traces. Our work avoids these problems by relying on run-time measurements during the execution of the entire application rather than on simulation data.

In [10], the authors propose a linear formula expressing the CPI as a function of data and instruction cache misses, branch mispredicts and the ideal steady-state CPI. The performance penalty of cache misses and branch mispredicts is estimated using trace-driven simulation. The work in [11] extends [10] by including the effects of pre-fetching and resource contention in the model, and uses a probabilistic approach to limit the required number of trace-driven simulation scenarios. These two approaches do not include other critical potential sources of CPI degradation such as DTLB and ITLB misses, various load blocks, and the effects of unbalanced instruction mixes. More importantly, the two models do not account for the inherent interaction effects between various performance events, nor for differing behaviors from application to application and often among different phases [7] of the same application. In contrast, this work establishes a classification of workloads or phases of workloads and builds a model for each using measured performance data rather than simulation data.

In [12], [13], analytical models are used to study the effect of pipeline depth on performance for in-order and out-of-order processors. These two works use simulation-based sensitivity analysis to determine important model parameters. In [12], detailed superscalar simulation is used to determine the fractions of stall cycles for different pipeline stages and the degree of superscalar processing that remains viable. In [13], the authors use detailed simulations of a baseline scenario and scenarios with increased processor front-end width to determine the effects of micro-architecture loops (e.g., branch mispredict loops) on performance. Again, these two models take into account only one aspect of the performance analysis. Our model, on the other hand, considers processor performance as a whole while including many more potential sources of performance degradation.

Several statistical techniques have been used to limit the required number of simulation runs for design-space exploration during the design phase of new processors. In [14], [15], principal component analysis is used to limit design-space exploration by identifying key design-space parameters and computing their correlations. Plackett and Burman fractional design is used in [16] to establish parameter prioritization for sensitivity analysis. The authors model high and low values of a set of N design parameters using only 2N simulations, focusing on parameters with high priority. In [17], the authors define an interaction cost to account for the interaction between two different micro-architectural performance events.
The authors design new hardware to enable sampling workload execution in sufficient detail to construct representative dependency graphs to be used for the computation of the interaction cost. Our approach also takes into account the interaction between various micro-architectural events. However, we propose handling the interaction cost in a statistical manner without requiring dedicated new hardware.

III. PROBLEM FORMULATION

This work treats performance analysis from the perspective of workload performance optimization and tuning. In particular, two sub-problems are considered here:

• The "what" question: This question tries to identify the key performance problems and potential means for improving performance. In answering this question, one orients performance analysis so that it more directly guides the optimization activity by addressing specific performance issues (e.g., reduction of cache misses).
• The "how much" question: This question focuses the analysis towards estimating the potential performance gain from mitigating a specific performance issue or set of performance issues. This question is important as it helps prioritize among several alternatives according to the importance of the different performance issues and the cost of addressing them.

The "what" and "how much" questions can be answered by expressing the performance metric as a dependent variable (Y) of a set {X_1, X_2, ..., X_n} of n different micro-architectural predictors. In this formulation, it is necessary to take into account potential interaction between different predictor events, so that the impact from one type of event (e.g., L1 cache misses) is calculated differently according to whether a significant number of a second type of event (e.g., DTLB misses) is present. For a given span of execution, let X_i represent the count of the i-th predictor event divided by the number of instructions executed in that span, and let Y be the average number of processor cycles per instruction (i.e., the CPI) over that span. Thus, the problem at hand is to compute the function:

Y = f(X_1, X_2, \ldots, X_n) + \epsilon    (1)

where \epsilon is an error term. Consistent with [7], we make the assumption that any given workload may in general embody multiple phases or classes of behavior. Accordingly, the functional mapping between the inputs {X_j} and the output Y is different for each class; that is, the function f(X_1, X_2, ..., X_n) is typically non-continuous. Estimating the potential gain requires the identification of the different workload classes and the estimation of the form of the function f in each class. Let I_j(X_1, X_2, ..., X_n) be the membership function of class j. That is:

I_j(X_1, \ldots, X_n) = \begin{cases} 1, & (X_1, \ldots, X_n) \in \mathrm{Class}_j \\ 0, & \text{otherwise} \end{cases}    (2)

And let f_j(X_1, X_2, ..., X_n) be the form taken by f in Class_j. Then the function f can be expressed as:

Y = f(X_1, \ldots, X_n) + \epsilon = \sum_{j=1}^{k} I_j(X_1, \ldots, X_n) \, f_j(X_1, \ldots, X_n) + \epsilon    (3)

where k is the number of classes. The problem at hand can then be summarized as follows:

• Identifying the different classes, i.e., estimating the number k of distinct classes and identifying the subset of the input space covered by each class (computing the I_j functions).
• Estimating the k different class functions f_1, f_2, ..., f_k.
• Decomposing each class function f_j to isolate the effect of each micro-architectural event in class j.
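As a toy illustration of Equations (2) and (3), the fragment below encodes a two-class instance of the formulation, with a single split on X_1 (an L2-miss ratio, say). The threshold and coefficients are invented for illustration only; the real classes and class models are learned from data, as described in Section IV.

    # Equation (2): membership functions for k = 2 classes split on x[0].
    # The 0.001 threshold and all coefficients are illustrative assumptions.
    def membership(x):
        return 0 if x[0] < 0.001 else 1

    # Class-specific linear models f_j from Equation (3).
    f = [
        lambda x: 0.5 + 2.0 * x[1],    # f_0: sections without L2-miss pressure
        lambda x: 0.8 + 60.0 * x[0],   # f_1: sections dominated by L2 misses
    ]

    def predict(x):
        # Equation (3): Y = sum_j I_j(x) * f_j(x); exactly one I_j equals 1.
        return f[membership(x)](x)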
IV. MODEL TREE BASED PERFORMANCE ANALYSIS

To answer the "what" and "how much" questions posed in Section III, we consider model trees as a suitable machine-learning approach that can provide both insight into workload performance problems and estimates of the potential benefits from addressing those problems. Model trees are extensions of regression trees. Both classes of algorithm (model trees and regression trees) are used to predict the value of a dependent, continuous variable (e.g., performance in our case) from a set of independent variables, termed attributes (e.g., micro-architectural events in this case). The main difference is that regression trees fit piecewise-constant functions, while model trees fit piecewise multi-linear functions. The term model trees is sometimes used in certain publications to mean regression trees. Another term that is sometimes used interchangeably with model trees is function trees. In the rest of the paper, we use model trees to mean a class of algorithms that divide the input space into a tree format and fit predictive regression models at the leaf nodes.

Model trees offer several benefits that make them a suitable regression approach for performance analysis. These benefits include:

• Good prediction accuracy: The prediction accuracy of model trees (see Section V) is comparable to that of black-box techniques such as artificial neural networks [18] and is known to be higher than the prediction accuracy of regression trees such as CART [6].
• Interpretable models: Both the derived tree structure and the regression models at the leaf nodes can be used to gain insight into the nature and severity of performance problems. This property is particularly important in order to be able to answer the "what" and "how much" questions. Regression approaches such as neural networks [18] and support vector machines [19] lack interpretability.
• Additional properties: Model trees are also known to efficiently handle large data sets with a high number of attributes and high dimensions [20].

The model tree algorithm used in this paper is M5' [5], which is a re-implementation of Quinlan's original M5 algorithm [4] in the open-source software package WEKA [9]. While most other linear and non-linear regression techniques fit a single function to predict a dependent variable from a set of independent variables, M5' partitions the input space (training set) into a number of disjoint hyperspaces, each corresponding to the instances under a specific leaf node. In the case of performance analysis, each hyperspace represents a separate class of sections or phases of workloads, for which a distinct performance model is established. Figure 1 gives a typical tree structure as produced by M5' for predicting the function Y = f(X_1, X_2, X_3, X_4). The terms LM1, LM2, etc. at the leaf nodes stand for the linear models applied in the classes represented by the corresponding leaf nodes. The rest of this section details how M5' builds tree models and how these models can be used for performance analysis.

A. Tree Construction

Model building consists of applying M5' to a representative training set, and comprises two main phases. The training set consists of sections of workloads with known (experimentally measured) performance and counts of the different micro-architectural events (predictors). In the first phase, a large tree is grown. In the second phase, the tree is pruned back to avoid overfitting.
The tree construction algorithm uses a classical divide-and-conquer, top-down approach. The relevant parameters for tree construction are the splitting rule, the termination criterion, and the leaf assignment criterion. The splitting rule in M5' is to use, at each non-leaf node, the most discriminative attribute (independent variable). In our case, all the attributes are continuous, and the most discriminative attribute is defined as the one that most reduces the variance of the dependent variable (class variable). After each split based on one of the attribute variables, a subset of the training set goes to the left or right branch derived from the split. The tree construction algorithm is then called recursively on the subset of training data in each branch.

For the termination criterion, we use a pre-pruning strategy with a minimum number of training instances in each leaf node. A node is not split further if the variance of its subset is small enough. A node is also not split if its population is at or below a threshold number; this allows control against overfitting. Post-pruning (described in the next subsection) enforces the population criterion as well, so that by limiting overfitting the model is prevented from following outliers too closely. The minimum number of instances needed to strike a balance between prediction accuracy on training data and on new data (the problem of bias versus variance) depends on many factors, such as the number of attributes and the number of outliers. Here, it was determined experimentally that a minimum of 430 instances is reasonable.

The leaf assignment criterion for M5' is to fit, at each leaf node, a linear regression model predicting the dependent variable as a function of the attributes for the instances falling under the corresponding class. This is in contrast with classical regression tree algorithms, such as CART [6], in which a constant value is predicted at each leaf node. Use of a linear model better fits our performance analysis objective of understanding the impact of different factors, whereas a constant-value model, while simpler, would not meet this purpose.

B. Pruning

Pruning mitigates potential overfitting. During the pruning phase, the tree is traversed depth-first. At each non-leaf node, two error measures are estimated and compared to determine whether or not to make the node a leaf node by pruning its sub-tree. The first error measure is an estimate of the prediction error at the non-leaf node if it were pruned to a leaf. The second error measure is an estimate of the prediction error of the sub-tree below the non-leaf node. If the former is smaller than the latter, the entire sub-tree is pruned to a leaf node; otherwise the sub-tree remains as it is. When pruning occurs, the linear model at the new leaf node is used for prediction.
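The split selection just described can be sketched as follows. The code scores every attribute/threshold pair by how much it reduces the spread of the dependent variable (CPI) and keeps the best, in the spirit of the standard-deviation-reduction heuristic of the M5 family; it is a simplified sketch, not WEKA's implementation.

    import numpy as np

    def sd_reduction(y, left_mask):
        """Reduction in standard deviation achieved by a candidate split."""
        left, right = y[left_mask], y[~left_mask]
        if len(left) == 0 or len(right) == 0:
            return 0.0
        return y.std() - (len(left) * left.std()
                          + len(right) * right.std()) / len(y)

    def best_split(X, y):
        """Return (score, attribute index, threshold) of the best split."""
        best = (0.0, None, None)
        for j in range(X.shape[1]):               # try each attribute
            for t in np.unique(X[:, j]):          # try each observed threshold
                score = sd_reduction(y, X[:, j] < t)
                if score > best[0]:
                    best = (score, j, t)
        return best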
C. Use of the Model for Performance Analysis

To analyze the performance of a given workload, data is collected for the different sections of the workload and arranged into a matrix, with rows representing sections and columns representing the events. Each section then traverses the tree from the root node until it arrives at some leaf node; i.e., it is placed in a specific class. The class is characterized by the variables used in the decision rules leading to the corresponding leaf and by the split points for each variable. Each of these split variables is considered a source of potential performance improvement when the leaf node is on the right side of the split point (high values of the corresponding count). As each leaf node specifies a distinct linear model to be used for performance prediction, it explicitly characterizes the impact of each event in the linear model for the sections that reach it. Thus the fractional contribution of a performance event to the execution time, as well as the percent improvement that may be expected from optimizing for that event, are readily available at the leaf nodes. In addition to the explicit factors that are enumerated in the regressions at the leaf nodes, the split nodes on the path from the root to a leaf node represent the implicit categorical factors that control performance at the leaf node.

[Fig. 1. Example M5' tree structure]

V. RESULTS EVALUATION

This section presents the results of applying model trees to event-counter-based processor performance analysis, and interprets the class decomposition yielded by the M5' algorithm. The interpretation comports nicely with well-known sensitivities of computer architecture in general and the Core™ 2 Duo processor micro-architecture in particular, and provides confidence that the decomposition is valid. This section also provides an evaluation of the model along several prediction accuracy metrics, and shows that the model has excellent predictive power.

The data used in this section is collected on an Intel® Core™ 2 Duo processor-based desktop platform. The test machine has a clock speed of 2.4 GHz and 1 GB of memory. Each core has a 32 KB level-one instruction cache and a separate data cache of the same size. The two cores share a level-two cache of 4 MB. For more details on the Core™ 2 Duo processor architecture, the reader is referred to the optimization guides [21], [22]. The data collection platform runs a Microsoft® Windows™ XP 64-bit operating system.

Our study uses CPI as the main performance metric, described as a function of 20 other performance counters. The data is collected for a subset of the SPEC CPU2006 workloads. These events were identified as the candidates likely to be most relevant to the performance analysis. They represent the execution time and various performance-related micro-architectural events characterizing the instruction mix, the memory sub-system, the branch prediction accuracy, the data and instruction translation lookaside buffers and other known potential sources of performance degradation, as described in Table I. Data collection was grouped into sections of equal counts of executed instructions. For a detailed description of our experimental setup, the reader is referred to [23].
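The per-section metrics of Table I are ratios of raw event counts to retired instructions. The sketch below shows how a few of them can be derived; the raw_counts.csv file is a hypothetical placeholder, and the column names are assumed to follow the event mnemonics listed in Table I.

    import pandas as pd

    raw = pd.read_csv("raw_counts.csv")     # hypothetical per-section counter totals
    inst = raw["INST_RETIRED.ANY"]

    metrics = pd.DataFrame({
        "CPI":     raw["CPU_CLK_UNHALTED.CORE"] / inst,
        "InstLd":  raw["INST_RETIRED.LOADS"] / inst,
        "InstSt":  raw["INST_RETIRED.STORES"] / inst,
        "BrMisPr": raw["BR_INST_RETIRED.MISPRED"] / inst,
        # Correctly predicted branches = all branches minus mispredicted ones.
        "BrPred": (raw["BR_INST_RETIRED.ANY"]
                   - raw["BR_INST_RETIRED.MISPRED"]) / inst,
        "L2M":     raw["MEM_LOAD_RETIRED.L2_LINE_MISS"] / inst,
        # ... the remaining Table I ratios follow the same pattern.
    })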
A. Performance Model

The performance model has two components. The first component is a tree structure used for classification purposes. When predicting on a new instance, the tree is traversed until reaching a specific leaf node, which identifies the appropriate class. The second component consists of the linear models at the leaf nodes. These models are used to predict the CPI, and to estimate the impact of each micro-architectural event on the overall performance.

TABLE I. SELECTED METRICS USED IN THIS STUDY

Metric     | Corresponding event                            | Description
CPI        | CPU_CLK_UNHALTED.CORE                          | CPU clock cycles per instruction
InstLd     | INST_RETIRED.LOADS                             | Loads per instruction
InstSt     | INST_RETIRED.STORES                            | Stores per instruction
BrMisPr    | BR_INST_RETIRED.MISPRED                        | Mispredicted branches per instruction
BrPred     | BR_INST_RETIRED.ANY - BR_INST_RETIRED.MISPRED  | Correctly predicted branches per instruction
InstOther  | INST_RETIRED.ANY - (INST_RETIRED.LOADS + INST_RETIRED.STORES + BR_INST_RETIRED.ANY) | Non-memory, non-branch instructions per instruction
L1DM       | MEM_LOAD_RETIRED.L1D_LINE_MISS                 | L1 data misses per instruction
L1IM       | L1I_MISSES                                     | L1 instruction misses per instruction
L2M        | MEM_LOAD_RETIRED.L2_LINE_MISS                  | L2 misses per instruction
DtlbL0LdM  | DTLB_MISSES.L0_MISS_LD                         | Lowest-level DTLB load misses per instruction
DtlbLdM    | DTLB_MISSES.MISS_LD                            | Last-level DTLB load misses per instruction
DtlbLdReM  | MEM_LOAD_RETIRED.DTLB_MISS                     | Last-level DTLB retired load misses per instruction
Dtlb       | DTLB_MISSES.ANY                                | Last-level DTLB misses (including loads) per instruction
ItlbM      | ITLB.MISS_RETIRED                              | ITLB misses per instruction
LdBlSta    | LOAD_BLOCK.STA                                 | Load block store address events per instruction
LdBlStd    | LOAD_BLOCK.STD                                 | Load block store data events per instruction
LdBlOvSt   | LOAD_BLOCK.OVERLAP_STORE                       | Load block overlap store events per instruction
MisalRef   | MISALIGN_MEM_REF                               | Misaligned memory references per instruction
L1DSpLd    | L1D_SPLIT.LOADS                                | L1 data split loads per instruction
L1DSpSt    | L1D_SPLIT.STORES                               | L1 data split stores per instruction
LCP        | ILD_STALL                                      | Length changing prefix stalls per instruction

1) Tree Structure: Figure 2 presents the performance analysis model tree obtained from applying M5' to the training set. Each leaf node in the tree represents a distinct class of workload sections. The number in parentheses indicates the percentage of the training set that falls into the corresponding leaf. The performance within each class is explained by a linear model.

At the root node and the next few levels of the tree, we can immediately see that the model identifies level 2 cache misses (L2M) as the single event that most strongly impacts performance. As an L2 cache miss is the single longest-latency event in this study, this result is to be expected. The right side of the root node represents workloads or sections of workloads with a significant number of L2 cache misses, while the left sub-tree represents instances without significant cache miss problems. It must be noted here that by a significant number we mean a number of occurrences per instruction that is greater than a given threshold. This threshold, or split point, is automatically derived by the algorithm. We can immediately see a certain pattern: the model decides first based on cache misses (i.e., level 2 misses), then DTLB misses, followed by branch-related events. Less frequent discriminative predictors are found in the lower levels of the tree. In Core™ 2 Duo processors the first-level caches (i.e., those closest to the processor pipeline) are divided into instruction and data caches, while the second-level cache is unified. A miss in the level 1 instruction cache (L1IM) or the level 1 data cache (L1DM) results in an access to the unified level 2 cache.
Interestingly, we see on the right side of the root node that the model tries to determine whether the significant level two cache misses result from data accesses or instruction accesses.

[Fig. 2. Performance Analysis Tree]

Sections of workload characterized by a high number of level 1 instruction cache misses (L1IM) combined with a high number of L2 misses fall into linear model 18 (LM18). LM18 is simply a constant, CPI = 2.2, indicating poor average performance under these conditions. In this case, the performance degradation resulting from this combination of events overshadows that from any other events. This result makes sense because an instruction miss prevents the introduction of new instructions into the out-of-order core. The main SPEC workload that falls into this category is 436.cactusADM, where more than 95% of the sections experience high L2 cache misses combined with a high rate of L1 instruction misses. In contrast, when a high number of L2 cache misses is combined with a high number of L1 data cache misses, the CPI is represented by LM17, a linear equation containing several predictors including L2 cache and DTLB misses. In addition to the effects measured in the linear equation, both the level 1 data cache misses and the level 2 cache misses further affect performance, since both are split variables used in the decision rules to reach this class. An example workload with a large percentage of sections falling in LM17 is the SPEC benchmark 429.mcf, where more than 70% of the sections are classified in LM17.

On the left side of the root node, where level 2 misses are not present in significant number, the model tests for the presence or absence of DTLB misses. Testing for DTLB misses in the absence of a significant number of L2 misses makes sense when one considers the capacity relationship between the DTLB and the L2 cache. The Core™ 2 Duo processor DTLB contains only enough entries to map about 1/4 of the full L2 cache, so it is not surprising that DTLB miss events become significant even when the referenced data hits the L2 cache.

After the first two levels of the tree, branch-related events become very important. In particular, the model tests on the number of branch mispredictions per instruction (BrMisPr) on several occasions on both sides of the tree. Each mispredict, in addition to causing a pipeline flush, also requires fetching to restart at the correct address, and thus causes a significant loss of execution cycles. The tree model indicates that after the L2 and DTLB misses, branch mispredict events are the most discriminative of the performance events. A close examination of Figure 2 reveals that despite their importance, branch events (BrMisPr and BrPred) impact CPI to a much smaller extent than cache misses: either branch events appear low in the tree (the sub-tree on the right side of the root node) when L2 cache misses occur, or they occur in workload sections without significant numbers of L2 misses. It is instructive to compare the importance of branch mispredicts in this architecture with their controlling role on the Pentium™ NetBurst processor, as reported in [13], where the much longer pipeline translated into a greater pipeline flush and resteering cost.
Other events besides cache and DTLB misses and branch mispredicts do not show a strong impact in general; however, for specific sections of the workloads they are very important. For instance, LM10 characterizes workload sections that are significantly affected by length changing prefix (LCP) stalls. An example workload in this category is SPEC 403.gcc (also affected by cache misses), where about 20% of the sections experience performance degradation due to LCP stalls. Workloads that have a significant number of sections falling into this model would be candidates for software optimizations or compiler code generator changes directed at removing the instructions that have a length changing prefix. Without this model-tree-based approach, events that do not have a strong impact in general would be difficult to detect, and an optimization opportunity would potentially be lost.

2) Linear Models: Each leaf node in the tree corresponds to a distinct class of performance behavior, where the performance (i.e., CPI) can be explained using a linear function of the micro-architectural events. This function can be used to estimate the impact of each event on the overall workload performance. For instance, in the case of linear model 8 (LM8), shown in Equation 4, the contribution of L1 instruction misses (L1IM) to the overall performance can be measured by the ratio 6.69 * L1IM / CPI. For a numerical illustration, let us assume that the predicted CPI is 1.0, while L1IM is 0.03 per instruction. In this case, the contribution of level 1 instruction cache misses to the performance is 6.69 * 0.03 / 1.0 = 0.20. In other words, our model predicts a potential performance improvement of about 20% if all L1 instruction misses are addressed by some code optimization technique such as code block placement or profile-guided optimization.

CPI = 0.52 + 139.91 \cdot ItlbM + 2.22 \cdot DtlbL0LdM + 28.21 \cdot DtlbLdReM + 6.69 \cdot L1IM + 1.08 \cdot InstLd    (4)

This approach allows the ranking of performance issues by their respective predicted contributions to the overall performance. This ranking can be used to answer both the "what" and "how much" questions: it shows performance analysts which micro-architectural events to target first and how much gain to expect. Another example is given in linear model 11 (LM11), shown in Equation 5:

CPI = 0.75 + 193.98 \cdot DtlbLdReM    (5)

In this case, only one event (the number of retired loads per instruction that miss the last-level DTLB) appears in the linear model. The performance of sections falling in this category is influenced only by DTLB misses and by the split variables leading to this leaf node (e.g., L2M and BrMisPr). For some leaf nodes, the performance is affected only by the split variables leading to the node; this is the case for linear model 18 (LM18), discussed in the previous subsection.

The previous examples illustrate how to measure the effects of variables that appear in the linear models. To assess the effects of the split variables (used in decision rules) that do not appear in the linear models, we can examine the differences in performance statistics between the two branches of the split. A simple approach is to use the average performance difference between the sub-tree on the left side and the sub-tree on the right side. For example, consider the split on the LdBlSta variable within the left sub-tree. The average CPI values for the two classes within the left sub-tree are 0.57 and 0.51, while that for the right sub-tree is 0.84. Thus, the net impact of this variable on the right sub-tree is approximately 0.84 - mean(0.57, 0.51), i.e., 0.30, or 35% of the CPI. A more sophisticated approach would be to use a weighted average instead of the simple mean. Another is to combine the data for instances falling under both sub-trees and fit a simple linear regression of performance (CPI) on the split variable. In this case, the regression R² can be used as an indication of the contribution of the split variable to the overall performance.
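The ranking procedure described above can be written down directly. The sketch below applies the LM8 coefficients of Equation 4 to one section and ranks each event's share of the predicted CPI; the event ratios in x are hypothetical values, not measurements from the paper.

    # LM8 coefficients from Equation 4.
    intercept = 0.52
    lm8 = {"ItlbM": 139.91, "DtlbL0LdM": 2.22, "DtlbLdReM": 28.21,
           "L1IM": 6.69, "InstLd": 1.08}

    # Hypothetical per-instruction event ratios for one workload section.
    x = {"ItlbM": 0.0001, "DtlbL0LdM": 0.01, "DtlbLdReM": 0.002,
         "L1IM": 0.03, "InstLd": 0.3}

    cpi = intercept + sum(lm8[e] * x[e] for e in lm8)
    shares = {e: lm8[e] * x[e] / cpi for e in lm8}       # "how much" per event
    for event, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{event}: {share:.1%} of predicted CPI")  # "what" to target first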
B. Model Evaluation

Model trees, like many non-linear regression techniques, can overfit the data and produce models that perform well on the training data but poorly on unseen data (data that was not used for model fitting). To check for overfitting and to evaluate the appropriateness of the model tree algorithm for the performance data, a 10-fold cross validation [24] was employed. In this technique, the total dataset is divided into 10 disjoint subsets or folds. The model is then trained using 9 of the subsets and evaluated using the tenth subset. The process is repeated 10 times, and each time a different subset is used for testing while the remaining 9 subsets are used to train the model. The fitness of the modeling algorithm is evaluated by averaging the prediction metrics from the 10 different models. Three performance metrics are used in this work: the correlation coefficient (C), the mean absolute error (MAE) and the relative absolute error (RAE), described in [23].

Our model shows a high correlation coefficient of 0.98 on test data. The mean absolute error also shows high prediction accuracy, with MAE = 0.05. The relative absolute error is 7.83%. These numbers indicate that our model combines good accuracy with model interpretability, as illustrated in the previous subsection. The accuracy of this approach is competitive even with non-linear black-box techniques that sacrifice interpretability for better prediction accuracy. For instance, artificial neural networks [18] and support vector machines [25] give correlation coefficients of 0.99 and 0.98, respectively, on the same data. For a detailed comparison, the reader should refer to [23].

To illustrate the predictive power of our approach, Figure 3 plots the predicted CPI values from the cross validation against the measured CPI values. Note that the prediction is performed on data points in the test fold. In other words, the prediction on each data point is performed using a model that was built on training data that does not include that data point. The figure shows a strong correlation between the actual and predicted CPI values. Except for a few outliers, most data points in the figure are very close to the unity line (i.e., perfect correlation).

[Fig. 3. Predicted CPI vs. Actual CPI Using M5']
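The evaluation protocol above is standard and easy to reproduce. The sketch below runs a 10-fold cross validation and computes the three metrics used in this section for any regressor with fit/predict methods; it is a generic outline, not the WEKA setup behind the reported numbers.

    import numpy as np
    from sklearn.model_selection import KFold

    def evaluate(model, X, y, folds=10):
        """10-fold CV returning correlation, MAE and relative absolute error."""
        preds = np.empty_like(y, dtype=float)
        for train, test in KFold(n_splits=folds, shuffle=True).split(X):
            model.fit(X[train], y[train])            # train on 9 folds
            preds[test] = model.predict(X[test])     # predict the held-out fold
        corr = np.corrcoef(y, preds)[0, 1]
        mae = np.mean(np.abs(y - preds))
        rae = np.sum(np.abs(y - preds)) / np.sum(np.abs(y - y.mean()))
        return corr, mae, rae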
VI. CONCLUSIONS

In summary, this paper described the use of model trees for performance analysis. As a proof of concept, M5' was used to build a model tree using processor event data collected during execution of a subset of the SPEC CPU2006 workloads on the Intel® Core™ 2 Duo processor. Data collection was grouped into spans of equal counts of executed instructions. A training subset derived randomly from this data was used to first construct the tree in a recursive top-down fashion, and then to prune the tree in a depth-first, bottom-up fashion. Pruning reduced the likelihood of overfitting, and helped strike a balance between model compactness and discriminative ability. The model built in this way is easy to reason about, and the split decisions can be examined in juxtaposition with the known behaviors and parameters of the physical machine being modeled. It is therefore easy to comprehend the context and severity of performance issues arising during the execution of a workload. A 10-fold cross validation showed that the model had a mean absolute error of less than 5% and a prediction correlation of 0.9845. These accuracy measures compare very well with other non-linear machine-learning techniques, such as artificial neural networks and support vector machines, that were also evaluated in this work, and demonstrate that model trees are a very attractive approach for conducting performance analysis of complex micro-architectures.

ACKNOWLEDGMENT

The authors would like to thank the following people for their help with this work: Antonio C. Valles, Garrett T. Drysdale, James C. Abel, Agustin Gonzalez, David A. Levinthal, Stephen P. Smith, Henry Ou, Yong-Fong Lee, Alex A. Lopez-Estrada, Kingsum Chow, Thomas M. Johnson, Michael W. Chynoweth, Annie Foong, Vish Viswanathan.

REFERENCES

[1] J. Levon, "Profiling tool for Linux, kernel profiling, etc.: OProfile," http://oprofile.sourceforge.net, 2004.
[2] R. Hundt, "HP Caliper: A framework for performance analysis tools," IEEE Concurrency, vol. 8, no. 4, 2000.
[3] R. Kufrin, "PerfSuite: An accessible, open source performance analysis environment for Linux," in Proceedings of the 6th International Conference on Linux Clusters: The HPC Revolution 2005 (LCI-05), 2005.
[4] R. Quinlan, "Learning with continuous classes," in Proceedings of the 5th Australian Joint Conference on Artificial Intelligence (AI'92), 1992.
[5] Y. Wang and I. Witten, "Inducing model trees for continuous classes," in Proceedings of the 9th European Conference on Machine Learning, Poster Papers, 1997.
[6] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Wadsworth International Group, 1984.
[7] T. Sherwood, S. Sair, and B. Calder, "Phase tracking and prediction," in Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA'03), 2003.
[8] Standard Performance Evaluation Corporation, "SPEC CPU benchmark suite," http://www.specbench.org/osg/cpu2006, 2006.
[9] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
[10] T. S. Karkhanis and J. E. Smith, "A first-order superscalar processor model," in Proceedings of the International Symposium on Computer Architecture (ISCA'04), 2004.
[11] L. Simonson and L. He, "Micro-architecture performance estimation by formula," in Proceedings of the 5th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS'05), 2005.
[12] A. Hartstein and T. Puzak, "The optimum pipeline depth for a microprocessor," in Proceedings of the International Symposium on Computer Architecture (ISCA'02), 2002.
[13] E. Sprangle and D. Carmean, "Increasing processor performance by implementing deeper pipelines," in Proceedings of the International Symposium on Computer Architecture (ISCA'02), 2002.
[14] K. Chow and J. Ding, "Multivariate analysis of Pentium Pro processor," in Proceedings of the Intel Software Developers Conference, 1997.
[15] G. Cai, K. Chow, T. Nakanishi, J. Hall, and M. Barany, "Multivariate power/performance analysis for high performance mobile microprocessor design," in Proceedings of the Power Driven Microarchitecture Workshop, 1998.
[16] J. Yi, D. Lilja, and D. Hawkins, "A statistically rigorous approach for improving simulation methodology," in Proceedings of the 9th IEEE Symposium on High Performance Computer Architecture, 2003.
[17] B. Fields, R. Bodik, M. Hill, and C. Newburn, "Interaction cost and shotgun profiling," ACM Transactions on Architecture and Code Optimization, vol. 1, no. 3, pp. 272–304, 2004.
[18] T. Mitchell, Machine Learning. McGraw Hill, 1997.
[19] J. Platt, "Using sparseness and analytic QP to speed training of support vector machines," in Proceedings of Advances in Neural Information Processing Systems (NIPS'99), 1999.
[20] D. Solomatine and K. N. Dulal, "Model tree as an alternative to neural network in rainfall-runoff modeling," Hydrological Sciences Journal, vol. 48, no. 3, 2003.
[21] Intel, "Intel 64 and IA-32 architectures optimization reference manual," http://developer.intel.com/design/Pentium4/manuals/index_new.htm, 2006.
[22] Intel, "IA-32 Intel architecture optimization: reference manual," http://www.intel.com/design/Pentium4/manuals/248966.htm.
[23] E. Ould-Ahmed-Vall, J. Woodlee, C. Yount, and K. A. Doshi, "On the comparison of regression algorithms for computer architecture performance analysis of software applications," in Workshop on Statistical and Machine learning approaches applied to ARchitectures and compilaTion (SMART'07), co-located with the International Conference on High Performance Embedded Architectures and Compilers (HiPEAC'07), 2007.
[24] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1995.
[25] S. Shevade, S. Keerthi, C. Bhattacharyya, and K. Murthy, "Improvements to the SMO algorithm for SVM regression," IEEE Transactions on Neural Networks, vol. 11, no. 5, 2000.