CS5014 Research Methods in CS
Dr. Ayman Abdel-Hamid
Computer Science Department, Virginia Tech
Performance Evaluation

Outline
•Introduction
•Common Mistakes
Some of the material is based on Dr. Cliff Shaffer's notes for CS5014, Department of Computer Science, Virginia Tech.

Examples 1/2
•Evaluate design alternatives
•Compare two or more computers, programs, or algorithms (speed, memory, usability)
•Determine the optimum value of a parameter (tuning, optimization)
•Locate bottlenecks
•Characterize load
•Predict performance on future loads
•Determine the number and size of components required

Examples 2/2
•Which is the best sorting algorithm?
•What factors affect data structure visualizations?
•Code-tune a program
•Which interface design is better?
•What are the best parameter values for a biological model?

The Art of Performance Evaluation
•Throughput in transactions per second:

System | Workload 1 | Workload 2
A      | 20         | 10
B      | 10         | 20

How does system A compare to system B? (A sketch of this ratio game follows below.)
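A minimal sketch (Python, using only the throughputs in the table above) of why the comparison is ambiguous: the raw averages are identical, yet normalizing the workloads to either system as the base makes the other system look worse.

```python
# Throughputs (transactions/sec) from the table above; higher is better.
throughput = {"A": [20, 10], "B": [10, 20]}

# Raw arithmetic means are identical, so neither system "wins" outright.
mean = {s: sum(v) / len(v) for s, v in throughput.items()}
print("arithmetic means:", mean)  # A: 15.0, B: 15.0

# Normalizing each workload to a chosen base system changes the story:
# whichever system is used as the base makes the other look better on average.
def mean_ratio(sys, base):
    ratios = [a / b for a, b in zip(throughput[sys], throughput[base])]
    return sum(ratios) / len(ratios)

print("B relative to A:", mean_ratio("B", "A"))  # (0.5 + 2.0)/2 = 1.25
print("A relative to B:", mean_ratio("A", "B"))  # (2.0 + 0.5)/2 = 1.25
# Each system appears 25% "better" than the other, depending on the base:
# exactly the kind of inconsistency careful metric and workload choice must avoid.
```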
Evaluation Issues
•System: hardware, software, network; clear bounding of the "system" under study
•Techniques: measurement, simulation, analytical modeling
•Metrics: response time, transactions per second
•Workload: the requests a user gives to the system
•Statistical techniques
•Experimental design: maximize information, minimize the number of experiments

Common Mistakes 1/5
•No goals: each model is special purpose; performance problems are vague when first presented
•Biased goals: "OUR system is better than THEIRS"
•Unsystematic approach
•Analysis without understanding: people want guidance, not models
•Incorrect performance metrics: want correct metrics, not easy ones
•Unrepresentative workload

Common Mistakes 2/5
•Wrong evaluation technique: easy to become married to one approach
•Overlooking important parameters
•Ignoring significant factors: parameters that are varied in the study are called factors; there is no use comparing what cannot be changed
•Inappropriate experimental design
•Inappropriate level of detail

Common Mistakes 3/5
•No analysis
•Erroneous analysis
•No sensitivity analysis: measure the effect of changing a parameter
•Ignoring errors in input
•Improper treatment of outliers: outliers are values that are too high or too low compared to the majority of values; some should be ignored (cannot happen), some should be retained (key special cases)

Common Mistakes 4/5
•Assuming no change in the future
•Ignoring variability: the mean is of low significance in the face of high variability
•Too complex an analysis: complex models are "interesting" and so get published and studied, but real-world use is simpler, and decision makers prefer simpler models

Common Mistakes 5/5
•Improper presentation of results: the proper metric for measuring the performance of an analyst is the number of analyses that helped decision makers, not the number of analyses performed
•Ignoring social aspects
•Omitting assumptions and limitations

The Error of One-Sided Hypothesis
•Consider the hypothesis "X performs better than Y".
•The danger is the following chain of reasoning: Could this hypothesis be true? I have evidence that the hypothesis might be true. Therefore it is true.
•What got ignored is any evidence that the hypothesis might not be (or is not) true.

A Systematic Approach
1. State goals and define the system boundaries
2. List services and outcomes
3. Select metrics
4. List parameters (system and workload)
5. Select factors to study
6. Select evaluation technique
7. Select workload
8. Design experiments
9. Analyze and interpret data
10. Present results. Start over, if necessary!

Technique Selection
•Choices: analytical modeling, simulation, and measurement
•"Until validated, all evaluation results are suspect." Validate one of these approaches by comparing against another; measurement results are just as susceptible to experimental errors and biases as the other two techniques.
•Criteria for technique selection: stage of analysis, time required, tools, accuracy, trade-off evaluation, cost, saleability

Common Performance Metrics 1/3
•Response time: the interval between a user's request and the system response (a simplistic definition that assumes the request and response are instantaneous)
•Definition 1: time between the user finishing the request and the system starting the response
•Definition 2: time between the user finishing the request and the system completing the response

Common Performance Metrics 2/3
•Throughput: the rate (requests per unit of time) at which requests can be serviced by the system
•Throughput generally increases as the load initially increases; eventually it stops increasing, and may then decrease
•Nominal capacity: maximum achievable throughput under ideal workload conditions (bandwidth, for computer networks)
•Usable capacity: maximum throughput achievable without violating a limit on response time
•Efficiency: ratio of usable to nominal capacity

Common Performance Metrics 3/3
•Utilization: fraction of time the resource is busy servicing requests; ratio of busy time to total elapsed time over a given period
•Bottleneck: the component with the highest utilization; improving this component often gives the highest payoff
•Reliability: probability of errors, mean time between errors
•Availability: uptime and downtime, mean uptime (mean time to failure)
•Cost/performance ratio
(A small sketch computing these basic metrics follows below.)
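A minimal sketch of the metrics just listed, computed from a hypothetical log of timestamped requests. The record format and the numbers are illustrative assumptions, not from the slides.

```python
# Hypothetical records: (request finished, response started, response completed), in seconds.
requests = [(0.0, 0.2, 0.9), (1.0, 1.1, 1.6), (2.5, 2.6, 3.4), (3.0, 3.5, 4.0)]

# Response time, definition 1: until the system starts responding;
# definition 2: until the system completes the response.
rt_def1 = [start - req for req, start, end in requests]
rt_def2 = [end - req for req, start, end in requests]

observation_period = 5.0                            # seconds the system was observed
throughput = len(requests) / observation_period     # requests serviced per unit time
busy_time = sum(end - start for _, start, end in requests)
utilization = busy_time / observation_period        # fraction of time the resource was busy

print(f"mean response time (def 1): {sum(rt_def1) / len(rt_def1):.2f} s")
print(f"mean response time (def 2): {sum(rt_def2) / len(rt_def2):.2f} s")
print(f"throughput: {throughput:.2f} req/s, utilization: {utilization:.2f}")
```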
Workloads
•A workload is the requests made by users of the system under study
•A test workload is any workload used in performance studies
•A real workload is one observed on a real system; it cannot be repeated
•A synthetic workload is a reproduction of a real workload that can be applied to the tested system
•Examples (for CPU performance): addition instruction, instruction mixes, kernels, synthetic programs, application benchmarks

Selecting Workloads
1. Determine the services for the SUT (System Under Test)
 •View the system as a service provider
 •Component Under Study (CUS)?
2. Select the desired level of detail
3. Confirm that the workload is representative
4. Is the workload still valid?
 •A real-world workload is not repeatable
 •Most workloads are models of real service requests

Selecting Workloads: Level of Detail
•Most frequent request
•Frequency of request types
•Time-stamped sequence of requests (trace)
•Average resource demand
•Distribution of resource demands (it might be necessary to specify a complete probability distribution)

Selecting Workloads: Representativeness
•The test workload should be representative of the real application
•Test workload and real application should match in: arrival rate, resource demands, resource usage profile

Selecting Workloads: Other Factors
•Loading level: a workload might exercise a system to its full capacity (best case), beyond its capacity (worst case), or at the load level observed in the real workload (typical case)
•Impact of external components
•Repeatability

Some Workload Characterization Techniques 1/3
•Averaging: uses the arithmetic mean; alternatives?
•Specifying dispersion: variance
•Markov models
•Clustering

Some Workload Characterization Techniques 2/3
•Markov models: assume the next request depends only on the last request, i.e., the next system state depends only on the current system state; described by transition matrices
•Example: the typical pattern for some system is about 4 small packets followed by a single large packet
 Random chance: the probability of small is always 0.8, of large always 0.2
 Markov model: small follows small with probability 0.75, large follows small with probability 0.25; in contrast, small always follows large
 (A small simulation sketch of this chain appears after these slides.)

Some Workload Characterization Techniques 3/3
•Clustering: separate a population into groups with similar characteristics, minimizing the within-group variance while maximizing the between-group variance; then select representatives to simplify further processing

Statistics: Basic Concepts 1/2
•Independent events: knowing that one event has occurred does not in any way change our estimate of the probability of the other event
•Random variable: takes one of a specified set of values with a specified probability

Statistics: Basic Concepts 2/2
•For independent variables, the covariance is zero. Why?

Mean, Median, and Mode
•Mode: the value xi with the highest probability
•The Probability Density Function (pdf) is the derivative of the CDF: P(x1 < x <= x2) = F(x2) - F(x1)
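A minimal simulation sketch of the Markov workload model from the slide above. The transition probabilities come from the example; everything else (function names, run length) is illustrative.

```python
import random

# Transition probabilities from the slide's packet-size example.
transitions = {
    "small": {"small": 0.75, "large": 0.25},  # small follows small 0.75, large follows small 0.25
    "large": {"small": 1.00, "large": 0.00},  # a small packet always follows a large one
}

def generate(n, start="small", seed=1):
    random.seed(seed)
    state, trace = start, []
    for _ in range(n):
        trace.append(state)
        state = "small" if random.random() < transitions[state]["small"] else "large"
    return trace

trace = generate(100_000)
print("fraction small:", trace.count("small") / len(trace))  # ~0.8, same mix as the random-chance model
# The overall mix matches the 0.8/0.2 split, but the ordering differs:
# this Markov model never produces two large packets in a row.
```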
Summarizing Data by a Single Number 1/2
•A single number gives a key characteristic of the data set
•Indices of central tendency: mean, median, and mode
•Sample mean
•Sample median: sort the observations in increasing order and take the observation in the middle of the series (if the number of observations is even, take the mean of the middle two values)
•Sample mode: plot a histogram and take the midpoint of the bucket where the histogram peaks, or the category occurring most frequently
•The mean is affected more by outliers than the median or mode

Summarizing Data by a Single Number 2/2
•Mean and median always exist and are unique
•Mode may not exist, or might not be unique
•Examples: unimodal symmetric pdf, bimodal symmetric pdf, uniform density function, skewed pdf

Choosing Mean, Median, or Mode
•You cannot take the median or mean of categorical data: use the mode
•Is the total of interest? Use the mean (example: total CPU time for five database queries vs. the number of windows open on the screen for each query; probably use the mean for the times and the median for the windows)
•Are the data skewed? Use the median

Common Misuses of Means
•Using means of significantly different values: the mean of 10 and 1000 msec is 505 msec
•Using the mean without regard to the skewness of the distribution:

Set 1: 10, 9, 11, 10, 10   Sum 50, Mean 10, Typical value 10
Set 2:  5, 5,  5, 4, 31    Sum 50, Mean 10, Typical value 5

•Multiplying means to get the mean of a product: the mean of a product of two random variables equals the product of the means only IF the two random variables are independent

Summarizing Variability
•Mean, median, and mode attempt to provide a single "characteristic" value for the population, but a single value might not be meaningful
•People generally prefer systems with low variability
•There are different ways to measure variability (indices of dispersion):
 Range (max - min): poor; tends to be unbounded, unstable over a range of observations, and susceptible to outliers
 Variance or standard deviation
 10- and 90-percentiles
 Quartiles (box plots)

Variance and Standard Deviation
•Why division by n - 1? Only n - 1 of the n differences are independent
•Variance is expressed in units that are the square of the unit of the observations
•Standard deviation is in the units of the mean
•Better to use the coefficient of variation, since it takes the scale of the measurement unit out of the variability consideration

Percentiles
•Can be computed for any variable, even without bounds
•The 0.9-quantile is the 90-percentile
•Percentiles at multiples of 10% are deciles
•Quartiles divide the data into 4 parts at 25%, 50%, and 75%: 25% of observations are less than or equal to the first quartile, 50% are less than or equal to the second quartile
•The $\alpha$-quantile can be estimated by sorting the observations and taking the $[(n-1)\alpha + 1]$th element in the ordered set (a small sketch follows below)
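A minimal sketch of the quantile rule just stated: sort the observations and take the [(n - 1)α + 1]th element (1-based), interpolating between neighbours when that index is not an integer. The interpolation step and the sample data are illustrative assumptions.

```python
def quantile(observations, alpha):
    data = sorted(observations)
    n = len(data)
    pos = (n - 1) * alpha + 1          # 1-based position from the slide's formula
    lower = int(pos)                    # whole element at or below the position
    frac = pos - lower
    if lower >= n:                      # alpha = 1.0 falls on the last element
        return data[-1]
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])

samples = [31, 5, 10, 9, 11, 10, 5, 4, 10, 5]
print("median (0.5-quantile):", quantile(samples, 0.5))
print("90-percentile:", quantile(samples, 0.9))
print("quartiles:", [quantile(samples, q) for q in (0.25, 0.5, 0.75)])
```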
Selecting Index of Dispersion
•Is the variable bounded? Specify the range
•If there are no bounds, is the distribution unimodal and symmetric? Use the coefficient of variation
•Is the distribution non-symmetric? Use percentiles
•Use of average traffic for network design? Networks are usually designed to carry the 95- to 99-percentile of the observed load (range or percentile)
•Finding a percentile requires several passes through the data

Comparing Systems Using Sample Data
•Most CS research work involves samples from some "population": performance experiments on workloads for programs, systems, and networks
•A fundamental goal of CS experimentation is to determine the mean ($\mu$) and variance of some population characteristic (a parameter)
•We can only measure the mean characteristic for the sample, not the population (a statistic)
•The parameter value (population mean) is fixed; the statistic (sample mean) is a random variable whose values come from some sampling distribution

Hypothesis Testing
•One fundamental experimental question: what is the parameter value?
•Another fundamental experimental question: are two populations different? Does a subpopulation perform better? Is one algorithm better than another?
•Null hypothesis: the two populations have the same mean
•Rejection of the null hypothesis: do so if you are "confident" that the means are different
•Significant difference: the two means are different with a certain reference confidence (probability)

Confidence Intervals

Central Limit Theorem

Normal Distribution
•The sum of a large number of independent observations from any distribution has a normal distribution
•A normal variate is denoted $N(\mu, \sigma)$, where $\mu$ is the mean and $\sigma$ is the standard deviation
•A normal distribution with zero mean and unit variance is called a unit normal or standard normal distribution, denoted $N(0, 1)$
•An $\alpha$-quantile of a unit normal variate $z \sim N(0,1)$ is denoted by $z_\alpha$
•If a random variable $x$ has an $N(\mu, \sigma)$ distribution, then $(x-\mu)/\sigma$ has an $N(0, 1)$ distribution

Confidence Interval Example

Confidence Intervals for Small Samples
•The sampling distribution only approximates a normal curve when n gets big enough, over about 30
•Before that, the "tails" of the distribution are too big, so the estimated confidence interval is too small
•You can calculate the confidence intervals more precisely, but you need a separate table for every n value
•Usually there is a table to look it up in, or better yet, the statistical software will calculate it for you
•The $100(1-\alpha)\%$ confidence interval is given by $\bar{x} \pm t_{[1-\alpha/2;\, n-1]}\; s/\sqrt{n}$

Small Sample CI Example
•8 observations: -0.04, -0.19, 0.14, -0.09, -0.14, 0.19, 0.04, and 0.09
•Mean = 0
•Sample standard deviation = 0.138
•t[0.95; 7] = 1.895
•90% confidence interval: $0 \pm 1.895 \times 0.138/\sqrt{8} = (-0.0926, 0.0926)$

Testing Against Zero
•Are the differences in processor times for two algorithms significant? Test whether the mean of the differences is zero
•Check whether zero is within the confidence interval at the desired confidence
•Note that the number of samples is likely to be small, so take the correct t value from the table
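A minimal sketch of the small-sample confidence interval above, using the eight observations from the example. The use of SciPy for the t quantile is an assumption; a t-table gives the same value.

```python
import math
from statistics import mean, stdev
from scipy import stats  # assumed available; stats.t.ppf plays the role of the t-table

x = [-0.04, -0.19, 0.14, -0.09, -0.14, 0.19, 0.04, 0.09]
n = len(x)
xbar, s = mean(x), stdev(x)                 # sample mean and standard deviation (n - 1 divisor)

alpha = 0.10                                # 90% confidence interval
t = stats.t.ppf(1 - alpha / 2, df=n - 1)    # t[1 - alpha/2; n - 1], about 1.895 for n = 8
half_width = t * s / math.sqrt(n)

print(f"mean = {xbar:.4f}, s = {s:.4f}")
print(f"90% CI: ({xbar - half_width:.4f}, {xbar + half_width:.4f})")
# Because the interval includes zero, the mean cannot be claimed different
# from zero at this confidence level.
```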
Testing Against Zero Example
•The difference in processor times of two different implementations of the same algorithm was measured on seven similar workloads: {1.5, 2.6, -1.8, 1.3, -0.5, 1.7, 2.4}. Can we say with 99% confidence that one implementation is superior to the other?
•Sample size = 7
•Mean = 7.2/7 = 1.03
•Sample variance = 2.57
•Sample standard deviation = 1.6
•$100(1-\alpha) = 99$, $\alpha = 0.01$, $1-\alpha/2 = 0.995$
•t[0.995; 6] = 3.707
•Confidence interval: $1.03 \pm 3.707 \times 1.6/\sqrt{7} = (-1.21, 3.27)$
•The interval includes zero, so we cannot say one implementation is superior

Paired Observations
•Used to compare two systems on very similar workloads (if there are more than 2 systems, or the workloads are significantly different, experimental design techniques should be used)
•Conduct n experiments on each of the two systems with a one-to-one correspondence between the ith test on the two systems; the observations are paired
•Calculate the difference in performance for each pair
•Construct a confidence interval for the difference: if it includes zero, the systems are not significantly different

Paired Observations Example
•Six similar workloads used on 2 systems: {(5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7)}
•Performance differences: {-13.7, 13.1, -2.8, -1.1, -3.0, 5.6}
•Sample size = 6
•Mean = -0.32
•Sample variance = 81.62
•Sample standard deviation = 9.03
•$100(1-\alpha) = 90$, $\alpha = 0.1$, $1-\alpha/2 = 0.95$
•t[0.95; 5] = 2.015
•Confidence interval: $-0.32 \pm 2.015 \times 3.69 = (-7.75, 7.11)$
•The interval includes zero, so the two systems are not significantly different

Unpaired Observations
•Two samples of different sizes, with no correspondence between the ith observations in the two samples
•Make an estimate of the variance and the effective number of degrees of freedom (classical t-test):
 Compute the sample means
 Compute the sample standard deviations
 Compute the mean difference
 Compute the standard deviation of the mean difference
 Compute the effective number of degrees of freedom
 Compute the confidence interval for the mean difference

What Confidence Interval to Use?
•Depends on the loss if you are wrong
•Sometimes 50% confidence is enough; sometimes 99% confidence is not enough

Regression Models
•A regression model predicts a random variable as a function of other variables
•Response variable: the estimated variable
•Predictor variables (predictors, factors): the variables used to predict the response; all must be quantitative variables to do the calculations
•What is a good model? One that minimizes the distance between predicted and observed values

Linear Regression Model 1/2

Linear Regression Model 2/2
$e_i = y_i - \hat{y}_i$
$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$
$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0$

Estimation of Model Parameters 1/2

Estimation of Model Parameters 2/2
[Scatter plot: CPU time in msec versus number of disk I/Os, showing the observed data and the fitted regression line.]
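A minimal least-squares sketch for the simple linear regression model above (y ≈ b0 + b1·x), assuming the usual closed-form estimates; the data are the disk I/O and CPU time observations that appear in the multiple regression example later in the deck.

```python
def fit_simple_linear(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
    sxx = sum(xi * xi for xi in x) - n * xbar * xbar
    b1 = sxy / sxx                 # slope
    b0 = ybar - b1 * xbar          # intercept
    return b0, b1

disk_io = [14, 16, 27, 42, 39, 50, 83]   # predictor x
cpu_time = [2, 5, 7, 9, 10, 13, 20]      # response y
b0, b1 = fit_simple_linear(disk_io, cpu_time)

errors = [y - (b0 + b1 * x) for x, y in zip(disk_io, cpu_time)]
sse = sum(e * e for e in errors)
# The errors sum to (numerically) zero, as the normal equation above requires.
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse:.3f}, sum of errors = {sum(errors):.1e}")
```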
Allocation of Variation 1/2
•The purpose of a model is to predict the response with minimum variability
•No parameters: use the mean of the response variable
•One-parameter linear regression model: how good is it?
•Without regression, the error is $e_i = y_i - \bar{y}$, and the variance of the errors is
$\frac{1}{n-1}\sum_{i=1}^{n} e_i^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2 = \text{variance of } y$
•SSE without regression: $\sum_{i=1}^{n} (y_i - \bar{y})^2$

Allocation of Variation 2/2

Confidence Intervals for Regression Parameters
•b0 and b1 are estimates from a single sample of size n
•b0 and b1 are "statistics"; the computed values for b0 and b1 are their mean values
•Obtain standard deviations for b0 and b1:
$s_{b_0} = s_e \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum x^2 - n\bar{x}^2} \right]^{1/2}$
$s_{b_1} = \frac{s_e}{\left( \sum x^2 - n\bar{x}^2 \right)^{1/2}}$
•Confidence intervals: $b_0 \pm t\, s_{b_0}$ and $b_1 \pm t\, s_{b_1}$, where $t = t_{[1-\alpha/2;\, n-2]}$

CI for Regression Parameters Example

Verifying Regression Assumptions 1/2
•Assumption: a linear relationship between x and y
•If you plot your two variables, you might find several possible results:
 1. No visible relationship
 2. Linear trend line
 3. Two or more lines (piecewise)
 4. Outlier(s)
 5. Non-linear trend curve
•Only the second case can use a linear model; the third case could use multiple linear models

Verifying Regression Assumptions 2/2
•Assumption: independent errors. Make a scatter plot of errors versus response value; if you see a trend, you have a problem (dependence of errors on the predicted value)
•Assumption: the predictor variable x is nonstochastic and is measured without any error

Limits to Simple Linear Regression
•Only one predictor is used
•The predictor variable must be quantitative (not categorical)
•The relationship between response and predictor must be linear
•The errors must be normally distributed
•Visual tests: plot x vs. y (anything odd?); plot the residuals (anything odd?); the residual plot is especially helpful if there is a meaningful order to the measurements

Other Regression Models
•Multiple linear regression: more than one predictor variable is used
•Categorical predictors: predictor variables are not quantitative but represent categories (CPU type or disk type)
•Curvilinear regression: nonlinear relationship between response and predictors
•Transformations: errors are not normally distributed or the variance is not homogeneous

Multiple Linear Regression 1/2
$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} =
\begin{bmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & & & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix} +
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}$

Multiple Linear Regression 2/2
•Parameter estimation: $b = (X^T X)^{-1} X^T y$
•$SSE = y^T y - b^T X^T y$, $SST = SSY - SS0$, $R^2 = \frac{SST - SSE}{SST}$
•Standard deviation of errors: $s_e = \sqrt{SSE/(n-k-1)}$
•Standard deviation of parameters: $s_{b_j} = s_e \sqrt{c_{jj}}$, where $c_{jj}$ is the jth diagonal term of $(X^T X)^{-1}$
•SSE degrees of freedom: $n-k-1$; confidence intervals are computed using $t_{[1-\alpha/2;\, n-k-1]}$
•Prediction (mean of m future observations): standard deviation of the prediction is $s_e \left[ \frac{1}{m} + x_p^T (X^T X)^{-1} x_p \right]^{1/2}$

Multiple Linear Regression Example 1/3
Observations of disk I/Os, memory sizes, and CPU times for 7 programs:

CPU Time | Disk I/O | Memory Size
2        | 14       | 70
5        | 16       | 75
7        | 27       | 144
9        | 42       | 190
10       | 39       | 210
13       | 50       | 235
20       | 83       | 400
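A minimal sketch of the multiple-regression estimate b = (XᵀX)⁻¹Xᵀy from the slides, applied to the 7-program table above; the use of NumPy is an assumption.

```python
import numpy as np

cpu = np.array([2, 5, 7, 9, 10, 13, 20], dtype=float)           # response y
disk_io = np.array([14, 16, 27, 42, 39, 50, 83], dtype=float)   # predictor 1
mem = np.array([70, 75, 144, 190, 210, 235, 400], dtype=float)  # predictor 2

X = np.column_stack([np.ones_like(cpu), disk_io, mem])  # design matrix with intercept column
b = np.linalg.solve(X.T @ X, X.T @ cpu)                 # parameter estimates (b0, b1, b2)

e = cpu - X @ b                                          # error terms
sse = float(e @ e)
sst = float(((cpu - cpu.mean()) ** 2).sum())
r2 = (sst - sse) / sst                                   # coefficient of determination
n, k = len(cpu), 2
se = (sse / (n - k - 1)) ** 0.5                          # standard deviation of errors

print("b =", np.round(b, 4))       # should land near the slide's (-0.1614, 0.1182, 0.0265)
print(f"R^2 = {r2:.3f}, se = {se:.3f}")
```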
Multiple Linear Regression Example 2/3
•Find a linear equation to estimate CPU time: CPU time = b0 + b1(Disk I/Os) + b2(Memory Size)
•b = (-0.1614, 0.1182, 0.0265)^T
•We can compute the error terms from this (the difference between the regression formula and the actual observations)
•R^2 = SSR/SST = 0.97 (the regression explains 97% of the variation of y)
•Coefficient of multiple correlation: R = 0.97^0.5 = 0.99
•Standard deviation of errors = sqrt(SSE/(n-3)) = 1.2

Multiple Linear Regression Example 3/3
•The 90% confidence intervals for the parameters are (-2.11, 1.79), (-0.29, 0.53), and (-0.06, 0.11), respectively. What does this mean?
•What is the predicted CPU time for 100 disk I/Os and a memory size of 550?
 y = -0.1614 + 0.1182(100) + 0.0265(550) = 26.2375
 Standard deviation of the predicted observation = 3.3435
 90% confidence interval using a t-value of 2.132: (19.1096, 33.3363)
 Standard deviation for a large number of observations = 3.1385
 90% confidence interval: (19.5467, 32.9292)

Analysis of Variance (ANOVA) 1/5

Analysis of Variance (ANOVA) 2/5

Analysis of Variance (ANOVA) 3/5
•Assume the errors are independent and normally distributed, and all are identically distributed (same mean and variance)
•The y's are normally distributed since the x's are nonstochastic (measured without errors)
•A sum of squares of normal variates has a chi-square distribution
•Given two sums of squares SSi and SSj with vi and vj degrees of freedom, the ratio (SSi/vi)/(SSj/vj) has an F distribution
•The hypothesis that SSi is less than or equal to SSj is rejected at significance level $\alpha$ if the ratio is greater than the $1-\alpha$ quantile of the F variate
•Compare the ratio with $F_{[1-\alpha;\, v_i, v_j]}$

Analysis of Variance (ANOVA) 4/5
•MSR = SSR/k is the mean square of the regression (any sum of squares divided by its degrees of freedom gives the corresponding mean square)
•MSE = SSE/(n-k-1)
•MSR/MSE has an F[k, n-k-1] distribution: an F distribution with k numerator degrees of freedom and n-k-1 denominator degrees of freedom
•If MSR/MSE is greater than the value read from the F-table, the predictor variables are assumed to explain a significant fraction of the response variation (reject the null hypothesis)
•This procedure is known as the F-test

Analysis of Variance (ANOVA) 5/5
•In simple linear regression there is one predictor variable, and the F-test reduces to testing b1 = 0
•If the confidence interval for b1 does not include zero, the parameter is nonzero and the regression explains a significant part of the response variation, so the F-test is not required

ANOVA Example 1/2
•ANOVA calculations for the disk-memory-CPU example
•Computations are performed column by column from the left

ANOVA Example 2/2
•ANOVA calculations for the disk-memory-CPU example
•The regression passes the F-test: the computed ratio > the value from the F-table
•Yet none of the regression parameters is significantly different from zero?
•We have very high confidence that the regression model has predictive ability
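A minimal F-test sketch following the ANOVA slides: compare MSR/MSE against the F quantile. The SSR/SSE values below are illustrative placeholders, chosen only to be roughly in line with the disk-memory-CPU example; the use of SciPy for the F quantile is an assumption.

```python
from scipy import stats  # assumed available; stats.f.ppf plays the role of the F-table

def f_test(ssr, sse, n, k, alpha=0.05):
    msr = ssr / k                    # mean square of the regression
    mse = sse / (n - k - 1)          # mean square of the errors
    ratio = msr / mse
    critical = stats.f.ppf(1 - alpha, dfn=k, dfd=n - k - 1)  # F[1 - alpha; k, n - k - 1]
    return ratio, critical, ratio > critical

# Illustrative values for a regression with k = 2 predictors and n = 7 observations.
ratio, critical, significant = f_test(ssr=199.8, sse=5.9, n=7, k=2, alpha=0.10)
print(f"MSR/MSE = {ratio:.2f}, F-table value = {critical:.2f}, "
      f"{'reject' if significant else 'cannot reject'} the null hypothesis")
```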
Problem of Multicollinearity
•Dilemma: none of our parameters is significant, yet the model is!
•The problem is that the two predictors (memory size and disk I/Os) are correlated (R = 0.9947)
•Next we test whether the two predictors each give a significant regression on their own. We already did this for the disk I/O regression model and found that it alone accounted for about 97% of the variance; we get the same result for memory size.
•Conclusion: each predictor alone gives as much predictive power as the two together!
•Moral: adding more predictors is not necessarily better in terms of predictive ability (aside from cost considerations)

Curvilinear Regression 1/3
•Verify the validity of the linear regression model: do a scatter plot of response vs. predictor to see whether it is linear
•Often you can convert to a linear model and use standard linear regression:
 Take the log when the curve looks like y = b x^a: ln y = ln b + a ln x
 y = a + b/x
 y = x/(a + bx)

Curvilinear Regression 2/3
•Example: Amdahl's law says the I/O rate is proportional to the processor speed: I/O rate = $\alpha$(CPU rate)^b1
•Taking logs: log(I/O rate) = log $\alpha$ + b1 log(MIPS rate), with b0 = log $\alpha$
•Using standard linear regression, we find that the regression explains 84% of the variation

No | MIPS  | I/O Rate
1  | 19.63 | 288.60
2  |  5.45 | 117.30
3  |  2.63 |  64.60
4  |  8.24 | 356.40
5  | 14.00 | 373.20
6  |  9.87 | 281.10
7  | 11.27 | 149.60
8  | 10.13 | 120.60
9  |  1.01 |  31.10
10 |  1.26 |  23.70

Curvilinear Regression 3/3

Parameter | Mean  | Standard Deviation | Confidence Interval
b0        | 1.423 | 0.119              | (1.20, 1.64)
b1        | 0.888 | 0.135              | (0.64, 1.14)

(A log-log fitting sketch appears after these slides.)

Common Mistakes with Regression
•Verify that the relationship is linear: look at a scatter diagram
•Don't worry about the absolute values of the parameters; they are totally dependent on an arbitrary decision about what dimensions to use: CPU time (sec) = 0.01(disk I/Os) + 0.001(memory in KB)
•Always specify confidence intervals for the parameters and the coefficient of determination
•Test for correlation between predictor variables and eliminate redundancy; test to see which subset of the possible predictors is "best" depending on cost vs. performance
•Don't make predictions too far outside the measured range

Experimental Design
•"The goal of a proper experimental design is to obtain the maximum information with the minimum number of experiments"
•Determine the effects of various factors on performance: does a factor have a significant effect?
•An experimental design consists of specifying: the number of experiments, the factor/level combinations of each experiment, and the number of replications of each experiment

Potential Pitfalls in Experimentation
•All measured values are random variables; you must account for experimental error
•Control for important parameters (example: user experience in a user-interface study)
•You must be able to allocate variation to the different factors
•There is a limit to the number of experiments you can perform; some designs give more information per experiment
•One-factor-at-a-time studies do not capture factor interactions
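A minimal sketch of the curvilinear regression example above: fit I/O rate = α(MIPS rate)^b1 by ordinary least squares on the logarithms, using the 10 observations from the table. Base-10 logs and the helper structure are assumptions.

```python
import math

mips = [19.63, 5.45, 2.63, 8.24, 14.00, 9.87, 11.27, 10.13, 1.01, 1.26]
io_rate = [288.6, 117.3, 64.6, 356.4, 373.2, 281.1, 149.6, 120.6, 31.1, 23.7]

x = [math.log10(v) for v in mips]       # log(MIPS rate)
y = [math.log10(v) for v in io_rate]    # log(I/O rate)

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / \
     (sum(xi * xi for xi in x) - n * xbar * xbar)
b0 = ybar - b1 * xbar                    # b0 = log(alpha)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, alpha = {10 ** b0:.2f}")
# Expected to land near the slide's estimates (b0 about 1.42, b1 about 0.89).
```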
Types of Experimental Design 1/3
•Simple designs: one factor is varied at a time
•Inefficient, does not capture interactions; not recommended
•Given k factors, with the ith factor having $n_i$ levels, this needs
$n = 1 + \sum_{i=1}^{k} (n_i - 1)$ experiments

Types of Experimental Design 2/3
•Full factorial design: perform experiment(s) for every combination of every factor at every level
•Many experiments are required, but interactions are captured
•If too expensive, you can reduce the number of levels (e.g., a 2^k design: k factors used at 2 levels each), reduce the number of factors, or take another approach
•Given k factors, with the ith factor having $n_i$ levels, this needs
$n = \prod_{i=1}^{k} n_i$ experiments

Types of Experimental Design 3/3
•Fractional factorial design: only run experiments for a carefully selected subset of the full factorial combinations

2^k Factorial Designs
•A 2^k experimental design is used to determine the effect of k factors, each of which has two alternatives or levels
•Used to prune out the less important factors for further study
•Pick the highest and lowest levels for each factor
•Assumes that a factor's effect is unidirectional: performance either continuously decreases or continuously increases as the factor is increased from its minimum to its maximum

2^2 Factorial Designs 1/2

2^2 Factorial Designs 2/2

Allocation of Variation

General 2^k Factorial Designs 1/2
•k main effects
•$\binom{k}{2} = \frac{k!}{(k-2)!\,2!}$ two-factor interactions
•$\binom{k}{3} = \frac{k!}{(k-3)!\,3!}$ three-factor interactions

General 2^k Factorial Designs 2/2

2^k r Factorial Designs
•Experiments are observations of random variables; with only one observation you cannot estimate the error
•If we repeat each experiment r times, we get $2^k r$ observations
•With a 2-factor model we now have $y = q_0 + q_A x_A + q_B x_B + q_{AB} x_A x_B + e$, where e is the experimental error
•Use the sign table, and put in the y column the sample means of the r measurements at the given factor levels (a sign-table sketch appears after these slides)

2^k r Factorial Designs Example
$e_{ij} = y_{ij} - \hat{y}_i$
$SSE = \sum_{i=1}^{2^2} \sum_{j=1}^{r} e_{ij}^2$
$SST = 2^2 r\, q_A^2 + 2^2 r\, q_B^2 + 2^2 r\, q_{AB}^2 + \sum_{i,j} e_{ij}^2 = SSA + SSB + SSAB + SSE$
•Factor A explains 79% of the variation, B explains 15.4%, AB explains 4%, and error explains about 1.5%

Confidence Intervals for Effects
•Variance of errors computed from SSE: $s_e^2 = \frac{SSE}{2^2 (r-1)}$, with $2^k (r-1)$ degrees of freedom (since the r error terms at each factor-level combination must add up to zero)
•Standard deviation of each effect: $s_{q_0} = s_{q_A} = s_{q_B} = s_{q_{AB}} = \frac{s_e}{\sqrt{2^2 r}}$
•Confidence intervals for the effects: $q_i \pm t_{[1-\alpha/2;\, 2^2(r-1)]}\, s_{q_i}$
•Confidence intervals for the previous example: q0 (39.08, 42.91), qA (19.58, 23.41), qB (7.58, 11.41), qAB (3.08, 6.91)

2^(k-p) Fractional Factorial Designs

Selecting the Experiments
•The sum of the squares of each column is $2^{k-p}$
•qA = (-y1 + y2 - y3 + y4 - y5 + y6 - y7 + y8)/8
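A minimal sign-table sketch for a 2^2 design, following the slides' model y = q0 + qA·xA + qB·xB + qAB·xA·xB. The four response values are illustrative means of r replications at the (-1, +1) factor levels, not the slides' data.

```python
runs = [
    # (xA, xB, mean y over r replications)
    (-1, -1, 15.0),
    (+1, -1, 45.0),
    (-1, +1, 25.0),
    (+1, +1, 75.0),
]

# Each effect is the signed column sum divided by 2^2, exactly as in the sign table.
q0 = sum(y for _, _, y in runs) / 4
qA = sum(xa * y for xa, _, y in runs) / 4
qB = sum(xb * y for _, xb, y in runs) / 4
qAB = sum(xa * xb * y for xa, xb, y in runs) / 4

# With r replications, SSA = 2^2 * r * qA^2 (and similarly for SSB, SSAB);
# their shares of SST give the allocation of variation described above.
print(f"q0 = {q0}, qA = {qA}, qB = {qB}, qAB = {qAB}")
```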
2^(k-p) Fractional Factorial Designs Example

Assumptions
•Big assumption: these experiments only provide so much information
•The effects due to interactions are summed into the values calculated for the separate variables; the experiments are masking these "confounding" interactions
•The effects whose influence cannot be separated are said to be confounded
•If the interaction effects are small, this is OK

One-Factor Experiments 1/2

One-Factor Experiments 2/2
•r observations for each of the a alternatives; a total of ar observations arranged in an r x a matrix
•The r observations belonging to the jth alternative form a column vector
•Grand mean: $\bar{y}_{..} = \frac{1}{ar} \sum_{i,j} y_{ij}$
•Column mean: $\bar{y}_{.j}$; effect of alternative j: $\alpha_j = \bar{y}_{.j} - \bar{y}_{..}$

One-Factor Experiments Example
•Effects of 3 processors, each observed 5 times
•The first alternative requires 13.3 bytes less than an average processor
•The third alternative requires 37.7 bytes more than an average processor

Errors and Variation
•For each observation, the error is the difference between the observation and the sum of the mean and the alternative's effect
•The mean is 187.7 and R's effect is -13.3, so the first error term is 144 - (187.7 - 13.3) = -30.4
•$SSY = 144^2 + 120^2 + \dots = 633{,}639$
•$SS0 = ar\mu^2 = 3 \times 5 \times 187.7^2 = 528{,}281.7$
•$SSA = r \sum_j \alpha_j^2 = 10{,}992.1$
•$SST = SSY - SS0 = 105{,}357.3$
•$SSE = SST - SSA = 94{,}365.2$
•Percentage of variation explained by the alternatives: 10,992.1/105,357.3 = 10.4%; 89.6% is due to experimental errors

ANOVA

Two-Factor Full Factorial Design 1/2
•Two parameters are carefully controlled and varied to study their effect on performance (e.g., compare several processors using several workloads)
•Model: $y_{ij} = \mu + \alpha_j + \beta_i + e_{ij}$
 $y_{ij}$: observation with factor A at level j and factor B at level i
 $\mu$: mean response
 $\alpha_j$: effect of factor A at level j, with $\sum_j \alpha_j = 0$
 $\beta_i$: effect of factor B at level i, with $\sum_i \beta_i = 0$

Two-Factor Full Factorial Design 2/2
•Factors A and B have a and b levels, requiring ab experiments; a total of ab observations arranged in a b x a matrix
•Columns correspond to levels of A and rows correspond to levels of B
•Grand mean: $\bar{y}_{..} = \frac{1}{ab} \sum_{i,j} y_{ij}$
•Effects: $\alpha_j = \bar{y}_{.j} - \bar{y}_{..}$ and $\beta_i = \bar{y}_{i.} - \bar{y}_{..}$

Two-Factor Full Factorial Example
•Effects of 3 different cache choices on 5 different workloads
•For each row (or column), compute the mean of the observations in that row (or column)
•The difference between a row (or column) mean and the overall mean gives the row (or column) effect

Estimating Experimental Errors
•Estimated response: $\hat{y}_{ij} = \mu + \alpha_j + \beta_i$; error: $e_{ij} = y_{ij} - \hat{y}_{ij}$
•Variance of errors: $SSE = \sum_{i=1}^{b} \sum_{j=1}^{a} e_{ij}^2 = 3.5^2 + 0.2^2 + \dots + (-2.4)^2 = 236.8$

Allocation of Variance
•$SSY = SS0 + SSA + SSB + SSE$
•$SSY = \sum_{ij} y_{ij}^2 = 91{,}595$
•$SS0 = ab\mu^2 = 3 \times 5 \times 72.2^2 = 78{,}192.59$
•$SSA = b \sum_j \alpha_j^2 = 12{,}857.20$
•$SSB = a \sum_i \beta_i^2 = 308.40$
•$SST = SSY - SS0 = 13{,}402.41$
•$SSE = SST - SSA - SSB = 236.80$
•Percentage of variation explained by the caches: 12,857.20/13,402.41 = 95.9%; the workloads account for 2.3%; unexplained variation (due to errors) is 1.8%
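A minimal sketch of the two-factor allocation of variation above. The b x a observation matrix is illustrative (rows = workloads / factor B, columns = cache choices / factor A), not the slides' data; NumPy is an assumption.

```python
import numpy as np

y = np.array([
    [54., 60., 92.],
    [50., 58., 95.],
    [45., 62., 88.],
    [58., 55., 99.],
    [52., 61., 90.],
])
b, a = y.shape

mu = y.mean()                       # grand mean
alpha = y.mean(axis=0) - mu         # column (factor A) effects, sum to zero
beta = y.mean(axis=1) - mu          # row (factor B) effects, sum to zero

y_hat = mu + alpha[None, :] + beta[:, None]   # estimated responses
e = y - y_hat                                 # experimental errors

ssy = (y ** 2).sum()
ss0 = a * b * mu ** 2
ssa = b * (alpha ** 2).sum()
ssb = a * (beta ** 2).sum()
sst = ssy - ss0
sse = (e ** 2).sum()                # equals SST - SSA - SSB up to rounding

print(f"factor A explains {100 * ssa / sst:.1f}% of variation, "
      f"factor B {100 * ssb / sst:.1f}%, errors {100 * sse / sst:.1f}%")
```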
ANOVA

Precautions

Quantile-Quantile Plots
•Used to determine the distribution of the data: plot observed quantiles versus theoretical quantiles
•Assume $y_i$ is the observed $q_i$th quantile; using the theoretical distribution, the $q_i$th quantile $x_i$ is computed and a point is plotted at $(x_i, y_i)$
•To determine $x_i$, invert the CDF: $q_i = F(x_i)$, so $x_i = F^{-1}(q_i)$
•For $N(0,1)$: $x_i = 4.91[q_i^{0.14} - (1-q_i)^{0.14}]$
•For $N(\mu, \sigma)$, scale to $\mu + \sigma x_i$ before plotting

Quantile-Quantile Plots Example
•The modeling errors for eight predictions of a model were found to be {-0.04, -0.19, 0.14, -0.09, -0.14, 0.19, 0.04, 0.09} (a computation sketch appears after these slides)
[Q-Q plot: residual quantiles versus normal quantiles for the eight modeling errors.]

Q-Q Plot for Two-Factor Example
[Q-Q plot: residual quantiles versus normal quantiles for the two-factor example.]

Introduction to Simulation
•If the system is not available, a simulation model provides an easy way to predict the performance or compare several alternatives
•If the system is available, a simulation model may still be preferred over measurements (why?)
•Simulation models might fail: they may produce no useful results, or produce misleading results

Common Mistakes in Simulation
•Inappropriate level of detail
•Improper implementation language
•Unverified models
•Invalid models (may not represent the real system correctly)
•Improperly handled initial conditions
•Too-short simulations
•Poor random-number generators
•Improper selection of seeds

Terminology 1/3
•State variables: variables whose values define the state of the system (e.g., the length of the job queue in a CPU scheduling simulation)
•Event: a change in the system state
•Continuous-time and discrete-time models: in a continuous-time model the system state is defined at all times (e.g., CPU scheduling); in a discrete-time model the state is defined only at particular instants of time

Terminology 2/3
•Continuous-state and discrete-state models: determined by the nature of the state variables. The time spent by students studying a certain subject can take an infinite number of values (continuous state); the queue length in a CPU scheduling simulation can only assume integer values (discrete-state model)
•A discrete-state model is also called a discrete-event model; continuity of time does not imply continuity of state
•Deterministic and probabilistic models: if the output of a model can be predicted with certainty, it is deterministic; if it gives different results on repetition for the same set of input parameters, it is probabilistic

Terminology 3/3
•Static and dynamic models: does the system state change with time?
•Open and closed models: if the input is external to the model and independent of it, the model is open; if there is no external input, it is closed
•Stable and unstable models: if a steady state is reached that is independent of time, the model is stable
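A minimal sketch of the Q-Q procedure for the eight modeling errors above, using the given N(0,1) inverse-CDF approximation. The (i - 0.5)/n plotting-position convention is an assumption, and plotting itself is left out; the (xi, yi) pairs are what would be plotted.

```python
errors = [-0.04, -0.19, 0.14, -0.09, -0.14, 0.19, 0.04, 0.09]

n = len(errors)
observed = sorted(errors)                       # yi: observed quantiles
q = [(i - 0.5) / n for i in range(1, n + 1)]    # qi; the (i - 0.5)/n convention is assumed
theoretical = [4.91 * (qi ** 0.14 - (1 - qi) ** 0.14) for qi in q]  # xi for N(0, 1)

for xi, yi in zip(theoretical, observed):
    print(f"({xi:+.3f}, {yi:+.3f})")
# If the points fall roughly on a straight line, the errors are approximately normal;
# the line's intercept and slope estimate mu and sigma.
```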
Programming Language Selection
•Simulation language
•General-purpose language
•Extension of a general-purpose language: extensions as a collection of routines to handle tasks commonly required in simulations
•Simulation package: inflexibility

Types of Simulation 1/3
•Emulation: a simulation using hardware or firmware (e.g., a terminal emulator or a processor emulator)
•Monte Carlo simulation: a static simulation, or one without a time axis; models probabilistic phenomena that do not change characteristics with time; can evaluate non-probabilistic expressions using probabilistic models

Types of Simulation 2/3
•Trace-driven simulation: uses a trace as input; a trace is a time-ordered record of events on a real system
•Example: a trace of the page reference patterns of key programs can be used as input to compare different memory management schemes
•Advantages (among others): credibility, easy validation, and an accurate workload
•Disadvantages (among others): complexity and being a single point of validation

Types of Simulation 3/3
•Discrete-event simulation. Components:
 Event scheduler
 Simulation clock and a time-advancing mechanism (unit-time versus event-driven approaches)
 System state variables
 Event routines
 Input routines, report routines, initialization routines, and trace routines
 Dynamic memory management

Analysis of Simulation Results
•Model verification techniques
•Model validation techniques
•Transient removal
•Stopping criteria

Model Verification Techniques
•Top-down modular design
•Anti-bugging
•Structured walk-through
•Deterministic models
•Run simplified cases
•Trace: a time-ordered list of events and associated variables
•On-line graphic displays
•Continuity tests
•Degeneracy tests
•Consistency tests
•Seed independence

Model Validation Techniques
•Validate three key aspects of the model: assumptions, input parameter values and distributions, and output values and conclusions
•Each of the three aspects may be subjected to a validity test by comparing against what is obtained from possible sources: expert intuition, real-system measurements, theoretical results

Transient Removal 1/5
•Steady-state performance is of interest, so what constitutes the transient state?
•Methods for transient removal are heuristic:
 Long runs
 Proper initialization
 Truncation
 Initial data deletion
 Moving average of independent replications
 Batch means

Transient Removal 2/5
•Truncation: variability during steady state is less than that during the transient state; measure variability in terms of the range (min and max) of the observations
•Example observations: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 9
•Given a sample of n observations: ignore the first l observations, calculate the min and max of the remaining n - l observations, and repeat until the (l+1)th observation is neither the minimum nor the maximum of the remaining observations (a small sketch follows below)
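A minimal sketch of the truncation heuristic just described: keep dropping initial observations until the next observation is neither the minimum nor the maximum of what remains. It uses the slide's example sequence; the function name is illustrative.

```python
def find_transient_length(obs):
    n = len(obs)
    for l in range(1, n):
        remaining = obs[l:]        # the last n - l observations
        first = obs[l]             # the (l + 1)th observation, 1-based
        if min(remaining) < first < max(remaining):
            return l               # the first l observations form the transient interval
    return n

obs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 9]
l = find_transient_length(obs)
print(f"transient length l = {l}")
print("steady-state observations:", obs[l:])
```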
Transient Removal 3/5
•Initial data deletion: the average does not change much as observations are deleted, though randomness in the observations causes averages to change slightly even during steady state
•Average across replications (a replication is a complete run with no change in input parameter values but a different seed value)
•Get a mean trajectory across replications, then get the overall mean
•Delete the first l observations (starting with l = 1) and get the overall mean of the remaining n - l values
•Compute the relative change in the overall mean, and determine the knee where the relative-change graph stabilizes

Transient Removal 4/5
•Moving average of independent replications: similar to initial data deletion, but computes the mean over a moving time-interval window instead of the overall mean
•Get a mean trajectory across replications
•Plot a trajectory of the moving average of successive 2k + 1 values (starting with k = 1)
•Repeat with k = 2, 3, ... until the plot is sufficiently smooth
•Find the knee of the plot

Transient Removal 5/5
•Batch means: run a very long simulation and later divide it into several parts (batches) of equal duration
•Study the variance of the batch means as a function of the batch size
•A long run of N observations is divided into m batches of size n each (start with n = 2, for example)
•For each batch, compute a batch mean; compute the overall mean; compute the variance of the batch means
•Increase n and repeat; plot the variance as a function of n

Stopping Criteria: Variance Estimation 1/3
•The simulation should be run until the confidence interval for the mean response narrows to a desired width
•We can take the variance of the sample mean as (variance of the observations)/n, but this is valid only if the observations are independent, and the observations in most simulations are not independent
•To correctly compute the variance of the mean of correlated observations: independent replications, or batch means

Stopping Criteria: Variance Estimation 2/3
•Independent replications: the means of independent replications are independent, even though observations within a single replication are correlated
•Conduct m replications of size n + n0 each (n0 is the length of the transient phase)
•Compute a mean for each replication, then an overall mean for all replications
•Calculate the variance of the replicate means, then the confidence interval for the mean response
•This discards m·n0 initial observations

Stopping Criteria: Variance Estimation 3/3
•Batch means: run one long simulation, discard the initial transient interval, and divide the remaining observations into several batches
•Compute the mean of each batch, then the overall mean
•Calculate the variance of the batch means, then the confidence interval for the mean response
•Only n0 observations are discarded (less waste)
•Calculate the covariance of successive batch means to find the correct batch size n: the covariance should be small compared to the variance
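A minimal batch-means sketch following the last slide: discard the transient phase, split the remaining observations into batches, and build a confidence interval for the mean response from the variance of the batch means. The observation list, batch size, and use of SciPy for the t quantile are assumptions.

```python
import math
from statistics import mean, variance
from scipy import stats  # assumed available; stats.t.ppf plays the role of the t-table

def batch_means_ci(observations, n0, batch_size, alpha=0.10):
    steady = observations[n0:]                       # discard the n0 transient observations
    m = len(steady) // batch_size                    # number of whole batches
    batches = [steady[i * batch_size:(i + 1) * batch_size] for i in range(m)]
    means = [mean(b) for b in batches]
    overall = mean(means)
    var_of_means = variance(means)                   # sample variance of the batch means
    t = stats.t.ppf(1 - alpha / 2, df=m - 1)
    half_width = t * math.sqrt(var_of_means / m)
    return overall, (overall - half_width, overall + half_width)

obs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10]
overall, ci = batch_means_ci(obs, n0=9, batch_size=5)
print(f"mean response = {overall:.2f}, 90% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```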