Using Model Trees for Computer Architecture Performance Analysis of
Software Applications
ElMoustapha Ould-Ahmed-Vall and James Woodlee and Charles Yount and Kshitij A. Doshi and Seth Abraham
Intel Corporation
5000 W Chandler Blvd
Chandler, AZ 85226
[email protected] and {jim.woodlee,chuck.yount,kshitij.a.doshi, seth.abraham}@intel.com
Abstract— The identification of performance issues on specific computer architectures has a variety of important benefits
such as tuning software to improve performance, comparing
the performance of various platforms and assisting in the
design of new platforms. In order to enable this analysis,
most modern micro-processors provide access to hardware-based event counters. Unfortunately, features such as out-of-order execution, pre-fetching and speculation complicate the
interpretation of the raw data. Thus, the traditional approach
of assigning a uniform estimated penalty to each event does
not accurately identify and quantify performance limiters.
This paper presents a novel method employing a statistical
regression-modeling approach to better achieve this goal.
Specifically, a model-tree approach based on the M5' algorithm is implemented and validated that accounts for event
interactions and workload characteristics. Data from a subset
of the SPEC CPU2006 suite is used by the algorithm to
automatically build a performance-model tree, identifying the
unique performance classes (phases) found in the suite and
associating with each class a unique, explanatory linear model
of performance events. These models can be used to identify
performance problems for a given workload and estimate the
potential gain from addressing each problem. This information
can help orient the performance optimization efforts to focus
available time and resources on techniques most likely to impact
performance problems with highest potential gain.
The model tree exhibits high correlation (more than 0.98) and low relative absolute error (less than 8%) between predicted and measured performance, establishing it as a sound approach for performance analysis of modern superscalar machines.
I. INTRODUCTION
The identification of performance issues of software applications on specific computer architectures has a variety
of important benefits. Primarily, such analysis is used to
tune the applications to improve performance on one or
more computer platforms, e.g., reduce execution time or
support more simultaneous users. It can also be used to
compare the performance behaviors of various platforms or
even to help design new platforms. In order to enable this
analysis, most modern micro-processors provide access to
hardware-based event counters. Unfortunately, the resulting
event counts cannot be interpreted directly to comprehend the
contribution of each event to actual performance. This paper
presents a novel method employing a statistical regression-modeling approach to better achieve this goal.
A variety of efficient measurement tools facilitate the
selection and collection of hardware event counts during
workload execution [1], [2]. The resulting ability to associate
processor event statistics with the execution of an application
at a fine time-grain with negligible processor perturbation
creates the opportunity to understand what causes an application to perform below its true potential. An ideal analysis
methodology should be able to dissect the event counter data,
both to identify which micro-architectural factors reduce
performance and to quantify the impacts of these factors.
In practice, gathering and analyzing event data for performance characterization tends to occur largely in an ad-hoc fashion [3]. It is customary, for example, to capture
frequencies of events such as cache misses and branch
mispredicts and use them to estimate the per-instruction cycle-penalties from these events independently in a first-order analysis. Modern machines elide many of these penalties with dynamic and speculative execution; for example,
independent instructions can proceed while a load stalls, and
control speculation allows execution to proceed when branch
resolution stalls. Thus the amount of penalty successfully
removed depends on the available instruction level parallelism and the instantaneous interactions between micro-architectural events. In response, the work presented here
explores a statistical, machine-learning approach for performance modeling.
Specifically, a novel solution based on the M5’ [4], [5]
algorithm is proposed for classification and performance
evaluation. Model trees are a sub-class of regression trees [6].
At leaf nodes, they employ linear models, which improves
compactness and prediction accuracy relative to classical
regression trees. M5’ uses a divide and conquer approach
to recursively partition the input space into homogeneous
subsets, so that linear fitting at the leaf nodes can explain
remaining variability. The partitioning generates ordered
rules for reaching the leaf node models, and the leaf node
linear models quantify, in a statistically rigorous way, the
contribution of each micro-architectural event to the overall
performance. The power of the prediction model arrived at
in this way is that it is interpretable, in contrast with other
machine learning approaches such as neural networks. The
approach also extends seamlessly to workloads that contain
multiple execution phases [7].
The proposed model is trained by using data collected
from executing several SPEC CPU2006 [8] workloads on
a 2.4 GHz Intel® Core™ 2 Duo processor. The execution of each workload is divided into sections of equal
numbers of retired instructions. This technique helps to
localize classification over distinct phases. In each section,
per instruction ratios are obtained for the execution time
and for selected hardware events arising during that section's
execution. The number of cycles per instruction (CPI) is used
as the performance metric. The freely available open-source
software package WEKA [9] is employed for model training
and for comparing different machine learning techniques. A
prototype of the resulting performance model is implemented
using MATLAB. The resulting performance model provides
accurate estimates of performance impacts from various
micro-architectural events and groups of interacting events.
A 10-fold cross validation demonstrates a 0.98 correlation
between predicted and measured CPI and a relative absolute
error below 8%. Comparisons with other machine learning
methods establish the soundness of this approach for analyzing performance behaviors of modern superscalar machines.
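As a concrete illustration of the data-preparation step just described, the following Python sketch converts one section's raw counter totals into the per-instruction ratios and the CPI value used for modeling. This is an illustrative sketch, not the authors' tooling: the section size and the sample counts are assumed values, and the counter names follow the events listed later in Table I.

# Illustrative sketch: per-instruction ratios for one workload section.
# The section size and the sample counts below are assumed, not measured.
SECTION_INSTRUCTIONS = 100_000_000  # retired instructions per section (assumed)

def section_features(raw_counts):
    """raw_counts maps an event name to its total within one section."""
    features = {event: count / SECTION_INSTRUCTIONS
                for event, count in raw_counts.items()
                if event != "CPU_CLK_UNHALTED.CORE"}
    # The dependent variable: cycles per instruction (CPI).
    cpi = raw_counts["CPU_CLK_UNHALTED.CORE"] / SECTION_INSTRUCTIONS
    return features, cpi

features, cpi = section_features({
    "CPU_CLK_UNHALTED.CORE": 150_000_000,      # -> CPI = 1.5
    "MEM_LOAD_RETIRED.L2_LINE_MISS": 400_000,  # -> L2M = 0.004 per instruction
})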
The remainder of the paper is organized as follows.
Section II presents some of the related work. Section III
formulates the performance analysis problems addressed
in this paper. Section IV describes the solution approach.
Section V presents our results and evaluates the proposed
approach. Section VI concludes the paper.
II. RELATED WORK
Recent years have seen several attempts to build models
for performance analysis of processors. Unfortunately, most
of these models fail to include many micro-architectural
events and design space parameters, which leaves validity
unknown when a large set of events and design parameters
are present. This is mainly because these models require prior
knowledge about significant events and parameters, and the
required knowledge is gained from expensive, simulation-based sensitivity analysis. In addition to its prohibitive cost,
simulation accuracy is questionable especially in the case
of applications whose time varying behaviors are not easily
represented in traces. Our work avoids these problems by
relying on run-time measurements during the execution of the entire application rather than on simulation data.
In [10], the authors propose a linear formula expressing
the CPI as a function of data and instruction cache misses,
branch mispredicts and the ideal steady-state CPI. The performance penalty of cache misses and branch mispredicts is
estimated using trace-driven simulation. The work in [11]
extends [10] by including the effects of pre-fetching and
resource contention in the model, and uses a probabilistic approach to limit the required number of trace-driven
simulation scenarios. These two approaches do not include
other critical potential sources of CPI degradation such
as DTLB and ITLB misses, various load blocks, and the
effects of unbalanced instruction mixes. More importantly,
the two models do not account for the inherent interaction
effects between various performance events and for differing
behaviors from application to application and often among
different phases [7] of the same application. In contrast,
this work establishes a classification of workloads or phases
of workloads and builds a model for each using measured
performance data rather than simulation data.
In [12], [13], analytical models are used to study the
effect of pipeline depth on performance for in-order and out-of-order processors. These two works use simulation-based
sensitivity analysis to determine important model parameters.
In [12], detailed superscalar simulation is used to determine
the fractions of stall cycles for different pipeline stages and
the degree of superscalar processing that remains viable.
In [13], the authors use detailed simulations of a baseline
scenario and scenarios with increased processor front-end
width to determine the effects of micro-architecture loops
(e.g., branch mispredict loops) on the performance. Again,
these two models take into account only one aspect of the
performance analysis. Our model, on the other hand, considers the processor performance as a whole while including
many more potential sources of performance degradation.
Several statistical techniques have been used to limit
the required number of simulation runs for design space
exploration needed during the design phase of new processors. In [14], [15], principal component analysis is used
to limit design space exploration by identifying key design
space parameters and computing their correlations. Plackett
and Burman fractional design is used in [16] to establish
parameter prioritization for sensitivity analysis. The authors
model high and low values of a set of N design parameters
using only 2N simulations focusing on parameters with high
priority.
In [17], the authors define interaction cost to account
for the interaction between two different micro-architectural
performance events. The authors design new hardware to
enable sampling workload execution in sufficient detail to
construct representative dependency graphs to be used for
the computation of the interaction cost. Our approach also
takes into account the interaction between various micro-architectural events. However, we propose the handling of
the interaction cost in a statistical manner without the requirement of dedicated new hardware.
III. PROBLEM FORMULATION
This work treats performance analysis from the perspective of workload performance optimization and tuning. In
particular, two sub-problems are considered here:
• The “what” question: This question tries to identify the key performance problems and potential means for improving performance. In answering this question, one orients performance analysis so that it more directly guides the optimization activity by addressing specific performance issues (e.g., reduction of cache misses).
• The “how much” question: This question focuses the analysis towards estimating the potential performance gain from mitigating a specific performance issue or set of performance issues. This question is important as it helps prioritize among several alternatives according to the importance of the different performance issues and the cost of addressing them.
The “what” and “how much” questions can be answered by expressing the performance metric as a dependent variable (Y) of a set {X1, X2, ..., Xn} of n different micro-architectural predictors. In this formulation, it is necessary
to take into account potential interaction between different
predictor events, so that the impact from one type of event
(e.g., L1 cache misses) is calculated differently according to
whether a significant number of a second type of event (e.g.,
DTLB misses) is present. For a given span of execution, let
Xi represent the count of the i-th predictor event divided by
the number of instructions executed in that span; and let Y
be the average number of processor cycles per instruction
(i.e., the CPI) over that span. Thus, the problem at hand is
to compute the function:
Y = f(X1, X2, ..., Xn) + ε    (1)

where ε is an error term.
Consistent with [7], we make the assumption that any
given workload in general may embody multiple phases or
classes of behavior. Accordingly, the functional mapping
between the inputs {Xj } and the output Y is different for
each class. That is, typically the function f (X1 , X2 , ..., Xn )
is non-continuous. To estimate the potential gain requires
the identification of the different workload classes and the
estimation of the form of the function f in each class.
Let Ij(X1, X2, ..., Xn) be the membership function of class j. That is,

Ij(X1, X2, ..., Xn) = 1 if (X1, X2, ..., Xn) ∈ Classj,
Ij(X1, X2, ..., Xn) = 0 otherwise.    (2)
And let fj(X1, X2, ..., Xn) be the form taken by f in Classj. The function f can then be expressed as:

Y = f(X1, X2, ..., Xn) + ε = Σ_{j=1}^{k} Ij(X1, X2, ..., Xn) fj(X1, X2, ..., Xn) + ε    (3)
where k is the number of classes.
Then, the problem at hand can be summarized as follows:
• Identifying the different classes, i.e., estimating the number k of distinct classes and identifying the subset of the input space that is covered by each class (computing the Ij functions).
• Estimating the k class functions f1, f2, ..., fk.
• Decomposing each class function fj to isolate the effect of each micro-architectural event in class j.
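To make the formulation concrete, here is a minimal Python sketch of Equation 3 with k = 2 classes. The membership rule and the coefficients are invented for illustration; they are not taken from the fitted model presented later.

# Sketch of Equation 3 with two hypothetical classes. Exactly one
# membership function I_j evaluates to 1 for any input vector.
def membership(x):
    return 1 if x["L2M"] > 0.001 else 0  # one illustrative split on L2 misses

def f0(x):  # class 0: no significant L2 misses (invented coefficients)
    return 0.5 + 10.0 * x["BrMisPr"]

def f1(x):  # class 1: significant L2 misses (invented coefficients)
    return 0.8 + 150.0 * x["L2M"]

def predict_cpi(x):
    return [f0, f1][membership(x)](x)  # Y = sum_j I_j(x) * f_j(x)

print(predict_cpi({"L2M": 0.004, "BrMisPr": 0.002}))  # 0.8 + 150.0 * 0.004 ≈ 1.4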
IV. MODEL TREE BASED PERFORMANCE ANALYSIS
To answer the “what” and “how much” questions posed
in Section III we consider model trees as a suitable machine
learning approach that can provide both insight into workload
performance problems and estimates for the potential benefits
from addressing the problems. Model trees are extensions of
regression trees. Both classes of algorithm (model trees and
regression trees) are used to predict the value of a dependent,
continuous variable (e.g., performance in our case) from a
set of independent variables, termed attributes (e.g., micro-architectural events in this case). The main difference is that
regression trees are used to fit piecewise constant functions,
while model trees are used to fit piecewise multi-linear
functions. The term model trees is sometimes used in certain
publications to mean regression trees. Another term that is
also sometimes used interchangeably with model trees is
function trees. In the rest of the paper, we use model trees to
mean a class of algorithms that divide the input space into
a tree format and fit predictive regression models at the leaf
nodes.
Model trees offer several benefits that make them a suitable regression approach for performance analysis. These
benefits include:
• Good prediction accuracy: The prediction accuracy of
model trees (see Section V) is comparable to that
of black-box techniques such as artificial neural networks [18] and is known to be higher than the prediction accuracy of regression trees such as CART [6].
• Interpretable models: Both the derived tree structure
and the regression models at the leaf nodes can be used
to gain insight into the nature and severity of performance problems. This property is particularly important
in order to be able to answer the “what” and “how
much” questions. Regression approaches such as Neural
Networks [18] and Support Vector Machines [19] lack
interpretability.
• Additional properties: Model trees are also known to
efficiently handle large data sets with a high number of
attributes and high dimensions [20].
The model tree algorithm used in this paper is M5’ [5],
which is a re-implementation of Quinlan’s original M5 algorithm [4] in the open-source software package WEKA [9].
While most other linear and non-linear regression techniques
fit a single function to predict a dependent variable from a
set of independent variables, M5’ partitions the input space
(training set) into a number of disjoint hyperspaces each
corresponding to an instance under a specific leaf node. In
the case of performance analysis, each hyperspace represents
a separate class of sections or phases of workloads, for which
a distinct performance model is established. Figure 1 gives
a typical tree structure as produced by M5' for predicting the function Y = f(X1, X2, X3, X4). The terms LM1, LM2,
etc. at the leaf node stand for the linear model applied in the
class represented by the corresponding leaf node.
The rest of this section details how M5' builds tree models
and how these models can be used for performance analysis.
A. Tree Construction
Model building consists of applying M5’ to a representative training set, and comprises two main phases. The
training set consists of sections of workloads with known
(experimentally measured) performance and counts of the
different micro-architectural events (predictors). In the first
phase, a large tree is grown. In the second phase, the tree
is pruned back to avoid overfitting. The tree construction
algorithm uses a classical divide-and-conquer, top-down approach. The relevant parameters for tree construction algorithms are the splitting rule, the termination criterion, and
the leaf assignment criterion. The splitting rule in M5’ is to
use at each non-leaf node the most discriminative attribute
(independent variable). In our case, all the attributes are
continuous and the most discriminative attribute is defined
as the one that reduces most the variance of the dependent
variable (class variable). After each split based on one of the
attribute variables, a subset of the training set goes to the left
or right branch derived from the split. The tree construction
algorithm is recursively called on the subset of training data
in each branch.
For the termination criterion, we use a pre-pruning strategy
with a minimum number of training instances in each leaf
node. A node is not split further if the variance of its subset
is small enough. A node is also not split, if its population is
at or below a threshold number; this allows control against
overfitting. Post-pruning (described in the next subsection)
enforces the population criterion as well, so that by limiting
overfitting the model is prevented from following the outliers
too closely. The minimum number of instances needed to
strike a balance between prediction accuracy on training and
new data (problem of bias versus variance) depends on many
factors such as the number of attributes and the number
of outliers. Here, it was determined experimentally that a
minimum of 430 instances is a reasonable choice.
The leaf assignment criterion for M5’ is to fit, at each
leaf node, a linear regression model predicting the dependent
variable as a function of the attributes for instances falling
under the corresponding class. This is in contrast with
classical regression tree algorithms, such as CART [6], in
which a constant value is predicted at each leaf node. Use of
a linear model better fits our performance analysis objective
of understanding the impact of different factors, whereas a
constant value model, while simpler, would not meet the
purpose.
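A simplified Python sketch of this growth phase follows. It is a stand-in for M5', not the WEKA implementation: at each node it picks the (attribute, split point) pair that most reduces the variance of the class variable, and it stops splitting at the 430-instance floor used in this study. Fitting the leaf linear models is left as a stub.

# Simplified sketch of M5'-style tree growth (not the WEKA code).
# Each row is a dict of attribute ratios plus the target "CPI".
from statistics import variance

MIN_INSTANCES = 430  # pre-pruning floor used in this study

def best_split(rows, attrs, target="CPI"):
    parent_var = variance(r[target] for r in rows)
    best = None
    for a in attrs:
        for point in sorted({r[a] for r in rows}):
            left = [r for r in rows if r[a] <= point]
            right = [r for r in rows if r[a] > point]
            if len(left) < 2 or len(right) < 2:
                continue
            # Weighted variance of the target after the candidate split.
            child_var = (len(left) * variance(r[target] for r in left) +
                         len(right) * variance(r[target] for r in right)) / len(rows)
            if best is None or parent_var - child_var > best[0]:
                best = (parent_var - child_var, a, point)
    return best

def grow(rows, attrs):
    if len(rows) <= MIN_INSTANCES:
        return ("leaf", rows)  # fit a linear model on `rows` here
    split = best_split(rows, attrs)
    if split is None:
        return ("leaf", rows)
    _, a, point = split
    return ("node", a, point,
            grow([r for r in rows if r[a] <= point], attrs),
            grow([r for r in rows if r[a] > point], attrs))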
B. Pruning
Pruning mitigates potential overfitting. During the pruning
phase, the tree is traversed depth-first. At each non-leaf node,
two error measures are estimated and compared to determine
whether or not to make the node a leaf node by pruning
its sub-tree. The first error measure is an estimate of the
prediction error at the non-leaf node if it was pruned to a leaf.
The second error measure is an estimate of the prediction
error of the sub-tree below the non-leaf node. If the former
is smaller than the latter, the entire sub-tree is pruned to a
leaf node, otherwise the sub-tree remains as it is. When the
pruning occurs, the linear model at the new leaf node is used
for prediction.
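Continuing the sketch above, the pruning decision can be expressed as follows. The error estimates here are deliberately simplified proxies; M5' additionally compensates its estimates for the number of model parameters.

# Simplified bottom-up pruning over the ("node"/"leaf", ...) structure
# produced by the growth sketch. Error estimation is a crude proxy.
def leaf_error(rows):
    # Proxy for the leaf model's error: mean absolute deviation of CPI.
    if not rows:
        return 0.0
    m = sum(r["CPI"] for r in rows) / len(rows)
    return sum(abs(r["CPI"] - m) for r in rows) / len(rows)

def subtree_error(node, rows):
    if node[0] == "leaf" or not rows:
        return leaf_error(rows)
    _, a, point, left, right = node
    lo = [r for r in rows if r[a] <= point]
    hi = [r for r in rows if r[a] > point]
    return (len(lo) * subtree_error(left, lo) +
            len(hi) * subtree_error(right, hi)) / len(rows)

def prune(node, rows):
    # Depth-first: prune children first, then decide whether this node's
    # sub-tree predicts any better than a single leaf would.
    if node[0] == "leaf":
        return node
    _, a, point, left, right = node
    lo = [r for r in rows if r[a] <= point]
    hi = [r for r in rows if r[a] > point]
    node = ("node", a, point, prune(left, lo), prune(right, hi))
    if leaf_error(rows) <= subtree_error(node, rows):
        return ("leaf", rows)  # collapse the sub-tree into a leaf
    return node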
C. Use of the Model for Performance Analysis
To analyze the performance of a given workload, data
is collected for the different sections of the workload and
arranged into a matrix with rows representing sections and
columns representing the events. Each section then traverses
the tree from the root node, until it arrives at some leaf
node; i.e., it is placed in a specific class. The class is
characterized by the variables used in decision rules leading
to the corresponding leaf and by the split points for each
variable. Each one of these split variables is considered a
source of potential performance improvement when the leaf
node is on the right side of the split point (high values of
the corresponding count).
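Reusing the node representation from the growth sketch in Section IV-A, this classification step might look as follows (illustrative only):

# Route one section's event-ratio vector to its leaf class, recording the
# split variables where the section fell on the high (right) side; these
# are flagged as potential sources of performance improvement.
def classify(node, x):
    high_side_splits = []
    while node[0] == "node":
        _, attr, point, left, right = node
        if x[attr] > point:
            high_side_splits.append((attr, point))
            node = right
        else:
            node = left
    return node, high_side_splits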
As each leaf node specifies a distinct linear model to be
used for performance prediction, it explicitly characterizes
the impact of each event in the linear model for the sections
that reach it. Thus the fractional contribution of a performance event to the execution time, as well as the percent improvement that may be expected from optimizing for that event, are readily available at the leaf nodes. In addition to the explicit factors that are enumerated in the regressions at the leaf nodes, the split nodes on the path from the root to a leaf node represent the implicit categorical factors that control performance at the leaf node.

Fig. 1. Example M5' tree structure
V. RESULTS AND EVALUATION
This section presents the results of applying model trees
to event-counter-based processor performance analysis, and
interprets the class decomposition yielded by the M5' algorithm. The interpretation comports nicely with well-known sensitivities of computer architecture in general and the Core™ 2 Duo processor micro-architecture in particular, and
provides confidence that the decomposition is valid. This
section also provides an evaluation of the model along several
prediction accuracy metrics, and shows that the model has
excellent predictive power.
The data used in this section is collected on an Intel® Core™ 2 Duo processor-based desktop platform. The test machine has a speed of 2.4 GHz and 1 GB of memory. Each core has a 32 KB level-one instruction cache and a separate data cache of the same size. The two cores share a level-two cache of 4 MB. For more details on the Core™ 2 Duo processor architecture, the reader is referred to the Core™ 2 Duo processor Optimization Guide [21], [22]. The data collection platform is running a Microsoft® Windows™ XP 64-bit operating
system. Our study uses CPI as the main performance metric
described as a function of 20 other performance counters.
The data is collected for a subset of the SPEC CPU2006 workloads. These events were identified as candidates likely to be most relevant to the performance analysis. They
represent the execution time and various performance-related
micro-architectural events characterizing the instruction mix,
the memory sub-system, the branch prediction accuracy,
the data and instruction translation lookaside buffers and
other known potential sources of performance degradation
as described in Table I. Data collection was grouped into
sections of equal counts of executed instructions. For a
detailed description of our experimental setup, the reader is
referred to [23].
A. Performance Model
The performance model has two components. The first component consists of a tree structure used for classification purposes. When predicting on a new instance, the tree is traversed until reaching a specific leaf node to find the appropriate class. The second component consists of the linear models at the leaf nodes. These models are used to predict the CPI, and to estimate the impact of each micro-architectural event on the overall performance.

TABLE I
SELECTED METRICS USED IN THIS STUDY

Metric     | Corresponding event                            | Description
CPI        | CPU_CLK_UNHALTED.CORE                          | CPU clock cycles per instruction
InstLd     | INST_RETIRED.LOADS                             | Loads per instruction
InstSt     | INST_RETIRED.STORES                            | Stores per instruction
BrMisPr    | BR_INST_RETIRED.MISPRED                        | Mispredicted branches per instruction
BrPred     | BR_INST_RETIRED.ANY - BR_INST_RETIRED.MISPRED  | Correctly predicted branches per instruction
InstOther  | INST_RETIRED.ANY - (INST_RETIRED.LOADS + INST_RETIRED.STORES + BR_INST_RETIRED.ANY) | Non-branch, non-memory instructions per instruction
L1DM       | MEM_LOAD_RETIRED.L1D_LINE_MISS                 | L1 data misses per instruction
L1IM       | L1I_MISSES                                     | L1 instruction misses per instruction
L2M        | MEM_LOAD_RETIRED.L2_LINE_MISS                  | L2 misses per instruction
DtlbL0LdM  | DTLB_MISSES.L0_MISS_LD                         | Lowest level DTLB load misses per instruction
DtlbLdM    | DTLB_MISSES.MISS_LD                            | Last level DTLB load misses per instruction
DtlbLdReM  | MEM_LOAD_RETIRED.DTLB_MISS                     | Last level DTLB retired load misses per instruction
Dtlb       | DTLB_MISSES.ANY                                | Last level DTLB misses (including loads) per instruction
ItlbM      | ITLB.MISS_RETIRED                              | ITLB misses per instruction
LdBlSta    | LOAD_BLOCK.STA                                 | Load block store address events per instruction
LdBlStd    | LOAD_BLOCK.STD                                 | Load block store data events per instruction
LdBlOvSt   | LOAD_BLOCK.OVERLAP_STORE                       | Load block overlap store events per instruction
MisalRef   | MISALIGN_MEM_REF                               | Misaligned memory references per instruction
L1DSpLd    | L1D_SPLIT.LOADS                                | L1 data split loads per instruction
L1DSpSt    | L1D_SPLIT.STORES                               | L1 data split stores per instruction
LCP        | ILD_STALL                                      | Length changing prefix stalls per instruction
1) Tree Structure: Figure 2 presents the performance analysis model tree obtained from applying M5’ to the training
set. Each leaf node in the tree represents a distinct class of workload sections. The number in parentheses
indicates the percent of the training set that falls into the
corresponding leaf. The performance within each class is
explained by a linear model. At the root node and the
next few levels of the tree, we can immediately see that
the model identifies the level 2 cache misses (L2M) as the
single event that most strongly impacts performance. As
L2 cache miss is the single longest latency event in this
study, this result is to be expected. The right side of the
root node represents workloads or sections of workloads
with a significant number of L2 cache misses, while the
left sub-tree represents instances without significant cache
miss problems. It must be noted here that by significant
number we mean a number of occurrences per instruction
that is greater than a given threshold. This threshold, or
split point, is automatically derived by the algorithm. We
can immediately see a certain pattern. The model decides
first based on cache misses (i.e., level 2 misses), then DTLB
misses followed by branch related events. Less frequent
discriminative predictors are found in the lower levels of
the tree.
In Core™ 2 Duo processors the first level caches (i.e.,
those closest to the processor pipeline) are divided into
instruction and data caches, while the second level cache is
unified. A miss in the level 1 instruction cache (L1IM) or the level 1
data cache (L1DM) results in an access to the unified level 2
cache. Interestingly, we see on the right side of the root node
that the model tries to determine whether the significant level
two cache misses result from data accesses or instruction
accesses.
Fig. 2. Performance Analysis Tree
Sections of workload characterized by a high number of
level 1 instruction cache misses (L1IM) combined with a
high number of L2 misses fall in linear model 18 (LM18). LM18 is simply a constant: CPI = 2.2, indicating poor average performance under these conditions. In this case, the performance degradation resulting from this combination of events overshadows that from any other events. This
result makes sense because an instruction miss prevents the
introduction of new instructions to the out-of-order core.
The main SPEC workload that falls into this category is
436.cactusADM, where more than 95% of the sections
experience high L2 cache misses combined with a high rate
of L1 instruction misses.
In contrast, when a high number of L2 cache misses is
combined with a high number of L1 data cache misses, the
CPI is represented by LM17, a linear equation containing
several predictors including L2 cache and DTLB misses. In
addition to the effects measured in the linear equation, both
the level 1 data cache misses and the level 2 cache misses
further affect performance, since both are split variables used
in the decision rules to reach this class. An example workload
with a large percentage of sections falling in LM17 is the
SPEC benchmark 429.mcf where more than 70% of the
sections are classified in LM17.
On the left side of the root node, where level 2 misses
are not present in significant number, the model tests for
the presence or absence of DTLB misses. Testing for DTLB
misses in absence of a significant number of L2 misses makes
sense when one considers the capacity relationship between
the DTLB and the L2 cache. The Core™ 2 Duo processor DTLB contains only enough entries to map about 1/4 of the full L2 cache. So it is not surprising that DTLB miss events
become significant even when the referenced data hits the
L2 cache.
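As a rough sanity check of that ratio (the parameters here are our assumption from public Core™ 2 documentation, not figures quoted in this paper): a 256-entry DTLB with 4 KB pages spans 256 × 4 KB = 1 MB, which is exactly 1/4 of the 4 MB L2 cache.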
After the first two levels of the tree, branch related
events seem to become very important. In particular, the
model tests on the number of branch mispredictions per
instruction (BrMisPr) on several occasions on both sides of
the tree. Each mispredict, in addition to causing a pipeline
flush, also requires fetching to restart at the correct address,
and thus causes significant loss of execution cycles. The
tree model indicates that after the L2 and DTLB misses,
branch mispredict events are the most discriminative of the
performance events.
A close examination of Figure 2 reveals that despite their
importance, branch events (BrMisPr and BrPred) impact CPI
to a much smaller extent than cache misses. Either branch
events appear low in the tree (sub-tree on the right side
of the root node) when L2 cache misses occur, or they
occur in workload sections without significant numbers of L2
misses. It is instructive to compare the importance of branch
mispredicts in this architecture with their controlling role on
the Pentium™ NetBurst processor, as reported in [13], where
the much longer pipeline translated into a greater pipeline
flush and resteering cost.
Other events beside cache and DTLB misses and branch
mispredicts do not show strong impact in general; however,
for specific sections of the workloads they are very important.
For instance, LM10 characterizes workload sections that
are significantly affected by length changing prefix (LCP)
stalls. An example workload in this category is SPEC
403.gcc (also affected by cache misses), where about 20% of
the sections experience performance degradation due to LCP
stalls. Workloads that have a significant number of sections
that fall into this model would be candidates for software
optimizations or compiler code generator changes directed
at removing these instructions that have a length changing
prefix. Without this model-tree-based approach, events that
do not have a strong impact in general would be difficult to
detect, and an optimization opportunity would be potentially
lost.
2) Linear Models: Each leaf node in the tree corresponds to a distinct class of performance behavior, where
the performance (i.e., CPI) can be explained using a linear
function of the micro-architectural events. This function
can be used to estimate the impact of each event on the
overall workload performance. For instance, in the case of linear model 8 (LM8), shown in Equation 4, the contribution of L1 instruction misses (L1IM) to the overall performance can be measured by the ratio 6.69 * L1IM / CPI. For a numerical
illustration, let’s assume that the predicted CPI is 1.0, while
L1IM is 0.03 per instruction. In this case, the contribution
of level 1 instruction cache misses to the performance is
6.69 * 0.03 / 1.0 = 0.20. In other words, our model predicts a
potential performance improvement of about 20%, if all L1
instruction misses are addressed by some code optimization
technique such as code block placement or profile-guided
optimization.
CPI = 0.52 + 139.91 * ItlbM + 2.22 * DtlbL0LdM + 28.21 * DtlbLdReM + 6.69 * L1IM + 1.08 * InstLd    (4)
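This calculation is easy to mechanize. The Python sketch below ranks the LM8 events by their predicted share of CPI; the coefficients are copied from Equation 4, while the sample per-instruction ratios are invented for illustration.

# Per-event contribution to predicted CPI under LM8 (Equation 4).
LM8 = {"ItlbM": 139.91, "DtlbL0LdM": 2.22, "DtlbLdReM": 28.21,
       "L1IM": 6.69, "InstLd": 1.08}
INTERCEPT = 0.52

def rank_contributions(x):
    cpi = INTERCEPT + sum(c * x[e] for e, c in LM8.items())
    shares = {e: c * x[e] / cpi for e, c in LM8.items()}  # c_e * X_e / CPI
    return cpi, sorted(shares.items(), key=lambda kv: -kv[1])

# Invented section: the L1IM term alone contributes 6.69 * 0.03 ≈ 0.20 cycles.
cpi, ranking = rank_contributions({"ItlbM": 0.0, "DtlbL0LdM": 0.01,
                                   "DtlbLdReM": 0.001, "L1IM": 0.03,
                                   "InstLd": 0.25})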
This approach allows the ranking of performance issues
by their respective predicted contributions to the overall
performance. This ranking can be used to answer both the
“what” and “how much” questions. It shows performance
analysts which micro-architectural events to target first and
how much gain to expect.
Another example is given in linear model 11 (LM11) in Equation 5:

CPI = 0.75 + 193.98 * DtlbLdReM    (5)

In this case, only one event (the number of retired loads that miss the last-level DTLB per instruction) appears in the linear model. The performance of sections falling in this category is influenced only by DTLB misses and by the split variables leading to this leaf node (e.g., L2M and BrMisPr). For some leaf nodes, the performance is affected only by the split variables leading to the node; this is the case for linear model 18 (LM18), discussed in the previous subsection.
The previous examples illustrate how to measure the effects of variables that appear in the linear models. To
assess the effects of the split variables (used in decision
rules) not appearing in the linear models, we can examine
the differences in performance statistics between the two
branches of the split. A simple approach is to use the average
performance difference between the sub-tree on the left side
and the sub-tree on the right side. For example, consider
the split on the LdBlSta variable within the left subtree.
The average CPI values for the two classes within the left subtree are 0.57 and 0.51, while that for the right subtree is 0.84. Thus, the net impact of this variable on the right subtree is approximately (0.84 - Mean(0.57, 0.51)), i.e., 0.30, or 35%
of the CPI. A more sophisticated approach would be to use
a weighted average instead of the simple mean. Another is
to combine the data for instances falling under both subtrees
and fit a simple linear regression of performance (CPI) on
the split variable. In this case, the regression R² can be used
as an indication of the contribution of the split variable to
the overall performance.
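A worked version of this estimate, using the averages quoted above:

# Net impact of the LdBlSta split, from the CPI averages in the text.
left_class_means = [0.57, 0.51]  # classes on the low (left) side of the split
right_mean = 0.84                # class on the high (right) side of the split
impact = right_mean - sum(left_class_means) / len(left_class_means)
print(impact)               # ≈ 0.30 cycles per instruction
print(impact / right_mean)  # ≈ 0.357, i.e., the roughly 35% of CPI quoted above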
B. Model Evaluation
Model trees, like many non-linear regression techniques,
can overfit the data and produce models that perform well
on the training data but poorly on unseen data (that was
not used for model fitting). To check for overfitting and to
evaluate the appropriateness of the model tree algorithm for
the performance data, a 10-fold cross validation [24] was
employed. In this technique, the total dataset is divided into
10 disjoint subsets or folds. The model is then trained using
9 of the subsets and evaluated using the tenth subset. The
process is repeated 10 times, and each time, a different subset
is used for testing while the remaining 9 subsets are used to
train the model. The fitness of the modeling algorithm is
evaluated by averaging the prediction metrics from the 10
different models.
Three performance metrics are used in this work: the
correlation coefficient (C), the mean absolute error (MAE)
and the Relative Absolute Error (RAE), described in [23].
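For reference, a generic Python skeleton of this protocol and of the three metrics follows; fit and predict are placeholders for the model-tree training and prediction steps, not the WEKA implementation used in this work.

# Generic k-fold cross-validation skeleton with the three metrics used
# here: correlation coefficient (C), mean absolute error (MAE), and
# relative absolute error (RAE).
from statistics import mean, stdev

def rae(actual, predicted):
    baseline = mean(actual)  # error of always predicting the mean
    return (sum(abs(a - p) for a, p in zip(actual, predicted)) /
            sum(abs(a - baseline) for a in actual))

def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

def cross_validate(rows, fit, predict, k=10):
    folds = [rows[i::k] for i in range(k)]  # k disjoint subsets
    scores = []
    for i in range(k):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        model = fit(train)
        actual = [r["CPI"] for r in folds[i]]
        pred = [predict(model, r) for r in folds[i]]
        mae = mean(abs(a - p) for a, p in zip(actual, pred))
        scores.append((correlation(actual, pred), mae, rae(actual, pred)))
    return [mean(col) for col in zip(*scores)]  # average over the k folds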
Fig. 3. Predicted CPI vs. Actual CPI Using M5'
Our model shows a high correlation coefficient of 0.98 on
test data. The mean absolute error also shows high prediction
accuracy, with MAE = 0.05. The relative absolute error
is 7.83%. This number indicates that our model combines
good accuracy with model interpretability as illustrated in
the previous subsection.
The accuracy of this approach is competitive even with
non-linear black-box techniques that sacrifice interpretability
for better prediction accuracy. For instance, Artificial Neural Networks [18] and Support Vector Machines [25] give
correlation coefficients of 0.99 and 0.98, respectively, on the
same data. For a detailed comparison, the reader should refer
to [23].
To illustrate the predictive power of our approach, Figure 3 plots the predicted CPI values on the cross validation
versus the measured CPI values. Note that the prediction is
performed on data points in the test fold. In other words, the
prediction on each data point is performed using a model
that was built on training data that does not include the
data point. The figure shows a strong correlation between
the actual and predicted CPI values. Except for a few outliers,
most data points on the figure are very close to the unity line
(i.e., perfect correlation).
VI. CONCLUSIONS
In summary, this paper described the use of model trees for
performance analysis. As proof of concept, M5 was used to
build a model tree using processor event data collected during
execution of a subset of the SPEC CPU2006 workloads on the Intel® Core™ 2 Duo processor. Data collection was grouped
into spans of equal counts of executed instructions. A training
subset derived randomly from this data was used to first construct the tree in a recursive top-down fashion, and then prune
the tree in a depth-first, bottom-up fashion. Pruning reduced
the likelihood of overfitting, and helped strike a balance
between model compactness and discriminative ability. The
model built in this way is easy to reason about, and the split
decisions can be examined in juxtaposition with the known
behaviors and parameters of the physical machine being
modeled. It is therefore easy to comprehend the context and
severity of performance issues arising during the execution of
a workload. A 10-fold cross validation showed that the model
had a mean absolute error of less than 5% and a prediction
correlation of 0.9845. These accuracy measures compare
very well with other non-linear machine learning techniques
such as artificial neural networks and support vector machines
that were also evaluated in this work, and demonstrate that
model trees are a very attractive approach for conducting
performance analysis of complex micro-architectures.
ACKNOWLEDGMENT
The authors would like to thank the following people for
their help with this work: Antonio C. Valles, Garrett T. Drysdale, James C. Abel, Agustin Gonzalez, David A. Levinthal,
Stephen P. Smith, Henry Ou, Yong-Fong Lee, Alex A. Lopez-Estrada, Kingsum Chow, Thomas M. Johnson, Michael W.
Chynoweth, Annie Foong, Vish Viswanathan.
REFERENCES
[1] J. Levon, “Profiling tool for linux, kernel profiling, etc., oprofile,” 2004. [Online]. Available: http://oprofile.sourceforge.net
[2] R. Hundt, “Hp caliper: A framework for performance analysis tools,”
IEEE Concurrency, vol. 8, no. 4, 2000.
[3] R. Kufrin, “Perfsuite: An accessible, open source performance analysis environment for linux,” in Proceedings of the 6th International
Conference on Linux Clusters: The HPC Revolution 2005 (LCI-05),
2005.
[4] R. Quinlan, “Learning with continuous classes,” in Proceedings of
the 5th Australian Joint Conference on Artificial Intelligence (AI’92),
1992.
[5] Y. Wang and I. Witten, “Inducing model trees for continuous classes,”
in Proceedings of the 9th European Conf. on Machine Learning, Poster
Papers, 1997.
[6] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and
Regression Trees. Wadsworth International Group, 1984.
[7] T. Sherwood, S. Sair, and B. Calder, “Phase tracking and prediction,”
in Proceedings of 30th Annual International Symposium on Computer
Architecture (ISCA’03), 2003.
[8] “Standard performance evaluation corporation, SPEC CPU benchmark suite,” 2006. [Online]. Available: http://www.specbench.org/osg/cpu2006
[9] I. Witten and E. Frank, Data Mining: Practical Machine Learning
Tools and Techniques with Java Implementations. Morgan Kaufmann,
2000.
[10] T. S. Karkhanis and J. E. Smith, “A first-order superscalar processor
model,” in Proceedings of the International Symposium on Computer
Architecture (ISCA’04), 2004.
[11] L. Simonson and L. He, “Micro-architecture performance estimation
by formula,” in Proceedings of the 5th International Workshop on Embedded Computer Systems: Architectures, MOdeling, and Simulation
(SAMOS’05), 2005.
[12] A. Hartstein and T. Puzak, “The optimum pipeline depth for a
microprocessor,” in Proceedings of the International Symposium on
Computer Architecture (ISCA’02), 2002.
[13] E. Sprangle and D. Carmean, “Increasing processor performance by
implementing deeper pipelines,” in Proceedings of the International
Symposium on Computer Architecture (ISCA’02), 2002.
[14] K. Chow and J. Ding, “Multivariate analysis of Pentium Pro processor,” in Proceedings of Intel Software Developers Conference, 1997.
[15] G. Cai, K. Chow, T. Nakanishi, J. Hall, and M. Barany, “Multivariate
power/performance analysis for high performance mobile microprocessor design,” in Proceedings of Power Driven Microarchitecture
Workshop, 1998.
[16] J. Yi, D. Lilja, and D. Hawkins, “A statistically-rigorous approach
for improving simulation methodology,” in Proceedings of 9th IEEE
Symposium on High Performance Computer Architecture, 2003.
[17] B. Fields, R. Bodik, M. Hill, and C. Newburn, “Interaction cost
and shotgun profiling,” ACM Transactions on Architecture and Code
Optimization, vol. 1, no. 3, pp. 272–304, 2004.
[18] T. Mitchell, Machine Learning. McGraw Hill, 1997.
[19] J. Platt, “Using sparseness and analytic qp to speed training of support
vector machines,” in Proceedings of Advances in Neural Information
Processing Systems (NIPS’99), 1999.
[20] D. Solomatine and K. N. Dulal, “Model tree as an alternative to neural
network in rainfall-runoff modeling,” Hydrological Sc. J., vol. 48,
no. 3, 2003.
[21] Intel, “Intel 64 and IA-32 architectures optimization reference manual,” http://developer.intel.com/design/Pentium4/manuals/index_new.htm, 2006.
[22] ——, “IA-32 Intel architecture optimization: reference manual,”
http://www.intel.com/design/Pentium4/manuals/248966.htm.
[23] E. Ould-Ahmed-Vall, J. Woodlee, C. Yount, and K. A. Doshi, “On
the comparison of regression algorithms for computer architecture
performance analysis of software applications,” in Workshop on Statistical and Machine learning approaches applied to ARchitectures
and compilaTion (SMART’07), co-located with International Conference on High Performance Embedded Architectures and Compilers
(HiPEAC’07), 2007.
[24] R. Kohavi, “A study of cross-validation and bootstrap for accuracy
estimation and model selection,” in Proceedings of 14th International
Joint Conference on Artificial Intelligence, 1995.
[25] S. Shevade, S. Keerthi, C. Bhattacharyya, and K. Murthy, “Improvements to the SMO algorithm for SVM regression,” IEEE Transactions on Neural Networks, vol. 11, no. 5, 2000.