0. Why Conduct an Experiment?
• Infer causation
• Ensure repeatability
Also,
• Determine the relationships between variables
• Estimate the significance of each variable

1. Components of Experimentation
• Formulate research hypotheses
  – derivations from a theory
  – deductions from empirical experience
  – speculation
Research hypotheses are the questions we hope to answer by conducting the experiment.

• Define variables and design
  – Are the independent variables capable of testing the hypotheses?
  – Are the independent variables confounded?
• For example, assume buffer size and cycle time are independent variables:
  – Condition 1: buffer size = 10, cycle time = 1 min
  – Condition 2: buffer size = 15, cycle time = 1 min
  – Condition 3: buffer size = 15, cycle time = 2 min
• No meaningful inference can be drawn from an experiment that includes only conditions 1 and 3, because the variables are confounded, that is, varied simultaneously under the control of the researcher.

Design of experiments is a technique for examining and maximizing the information gained from an experiment.

• Conduct the experiment
  – collect data
  – extract information
• Analyze the results
  – test the hypotheses
• Report the outcomes

An experiment is usually conducted to test a theory. If the outcome of the experiment is negative, the experiment may be inadequate while the theory may still be valid.

Definitions
• Factor - an input variable
• Level - a possible value of a factor
• Treatment - a combination of factors, each at a specified level, as in a simulation run
• Parameter - a measure calculated from all observations in a population
• Statistic - a measure calculated from all observations in a sample

2. Hypothesis Testing
In analyzing data from an experiment, we are interested not only in describing the performance of the subjects selected for the treatment conditions - we also want to make inferences about the behavior of the source population from which our sample of subjects was drawn.

• We start by making an assumption about the value of a population parameter.
• We can test this assumption in two ways:
  – census: foolproof, but time consuming
  – random sample: not foolproof, but faster than a census

• In hypothesis testing, we make an assumption (hypothesis) about the value of a population parameter and test it by examining evidence from a random sample taken from the population.
• Since we are not testing the entire population, we must be aware of the risk associated with our decision.

• We start by formulating two competing hypotheses.
• We test the hypothesis that is the opposite of the inference we want to make.
• The hypothesis we test is called the null hypothesis (H0); the inference we want to make is called the alternative hypothesis (H1).

Example 1
Yosemite recently acquired the Acme Disintegrating Pistol. However, after repeated attempts with the pistol, he has been unsuccessful at destroying Bugs. Yosemite suspects that the pistol is not delivering its rated output of 10 megatons/shot. He has decided to keep the pistol only if the output is over 10 megatons/shot. He takes a random sample of 100 shots and records the outputs. What null and alternative hypotheses should Yosemite use to make the decision?

Example 1 - One-Sided Alternative
• Let μ denote the mean output per shot.
  H0: μ ≤ 10    H1: μ > 10
• Practically:  H0: μ = 10    H1: μ > 10
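In practice, a one-sided decision like this is usually made with a one-sample t test. A minimal sketch in Python: the 100 recorded outputs are not given in the slides, so the data below are simulated purely for illustration, and a recent SciPy (1.6+) is assumed for the alternative argument.

```python
# One-sided, one-sample t test for Example 1 (sketch; data hypothetical).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
shots = rng.normal(loc=10.2, scale=0.8, size=100)   # hypothetical outputs (megatons/shot)

# H0: mu = 10   vs   H1: mu > 10
t_obs, p_value = stats.ttest_1samp(shots, popmean=10, alternative="greater")

alpha = 0.05
print(f"t = {t_obs:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value <= alpha else "Retain H0")
```

The same call with alternative="two-sided" (the default) would serve for the two-sided version in Example 2 below.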
Example 2 - Two-Sided Alternative
Suppose Yosemite bought a used Pistol and he suspects that the output is not 10 megatons/shot. What should the null and alternative hypotheses be?
  H0: μ = 10    H1: μ ≠ 10

Hypothesis Testing - Two Populations
• With one population, we are interested in making an inference about a parameter of that population.
• With two populations, we are interested in testing hypotheses about parameters of two populations.
• We want to compare the difference between the parameters, not their actual values.

Example 3
Han Solo has been disappointed with the performance of his X-Wing fighter lately. He usually finds himself trailing the other fighters on Death Star missions. He suspects the quality of the fuel he is getting from the neighborhood fuel portal on his home planet of Tatooine. He decides to try the fuel portal located on the nearby planet of Betelgeuse. After each fill, Han checks the fighter's logs for the time it takes to jump to hyperspace and compares it with the logs from the Tatooine fuel. The jump takes an average of 17.01 trilons on Tatooine fuel and 16.9 trilons on Betelgeuse fuel. Can Han attribute this difference to the fuel?

• Let μ1 denote the time taken to jump to hyperspace on Tatooine fuel and μ2 the time taken to jump to hyperspace on Betelgeuse fuel.
  H0: μ1 − μ2 ≥ 0    H1: μ1 − μ2 < 0
• Practically:
  H0: μ1 − μ2 = 0    H1: μ1 − μ2 < 0
  or
  H0: μ1 = μ2        H1: μ1 < μ2

Hypothesis testing
• We formulate hypotheses to assert that the treatments (independent variables) will produce an effect. We would not perform an experiment otherwise.
• We formulate two mutually exclusive hypotheses that cover all possible parameter values.

• The statistical hypothesis we test is called the null hypothesis (H0). It specifies values of a parameter, often the mean.
• If the values obtained from the treatment groups are very different from those specified by the null hypothesis, we reject H0 in favor of the alternative hypothesis (H1).

Hypothesis Testing - Multiple Populations
The null hypothesis usually assigns the same value to the treatment means:
  H0: μ1 = μ2 = μ3 = …
  H1: not all μs are equal

(Figure: treatment-group distributions under H0, where μ1 = μ2 = μ3, and under H1, where μ1, μ2, and μ3 differ.)

• The null hypothesis is an exact statement - the treatment means are equal.
• The alternative hypothesis is an inexact statement - any two treatment means may be unequal. Nothing is said about the actual differences between the means, because we would not need to experiment in that case.

• A decision to reject H0 suggests significant differences in the treatment means.
• If the treatment means are reasonably close to the ones specified in H0, we do not reject H0.
• We usually cannot accept H0; we question the experiment instead.

2.1 Experimental Error
• We can attribute a portion of the difference among the treatment means to experimental error.
• This error can result from:
  – sampling
  – errors in entering input data
  – errors in recording output data
  – inadequate run length

• Under the null hypothesis, we have two sources of experimental error - differences within treatment groups and differences between treatment groups.
• Under the alternative hypothesis, we have genuine differences among treatment means. However, a false null hypothesis does not preclude experimental error.
• A false null hypothesis implies that treatment effects are also contributing toward the differences in means.
2.2 Evaluating H0
• If we form a ratio of the two sources of experimental error under H0, we have:
  differences between groups / differences within groups = experimental error / experimental error ≈ 1
• Under H1, there is an additional component in the numerator:
  (treatment effects + experimental error) / experimental error

3. ANOVA and the F Ratio
• To evaluate the null hypothesis, it is necessary to transform the between- and within-group differences into variances.
• The statistical analysis involving the comparison of variances is called the analysis of variance (ANOVA).

• variance = MS = SS / df, where SS is the sum of squared deviations from the mean and df is the degrees of freedom.
• Degrees of freedom is approximately the number of observations carrying independent information; that is, a variance is roughly an average of the squared deviations.

3.1 The F Ratio
  F = MS_between / MS_within = MS_treatment / MS_error
• Under H0, we expect the F ratio to be approximately 1.
• Under H1, we expect the F ratio to be greater than 1.

Typical data for a one-way ANOVA:

  Treatment level:   1     2    …    l
  Observations:     y11   y21   …   yl1
                    y12   y22   …   yl2
                     …     …    …    …
                    y1n   y2n   …   yln

ANOVA table:

  Source of variation | Sum of squares | df     | Mean square | F
  Treatment           | SSt            | l − 1  | MSt         | MSt / MSe
  Error               | SSe            | N − l  | MSe         |
  Total               | SST            | N − 1  |             |

  l = number of treatment levels, N = total number of observations.

Computational formulas:
  yi. = Σj yij (sum of the n observations at level i),  ȳi. = yi./n,  i = 1, …, l
  y.. = Σi Σj yij,  ȳ.. = y../N
  SST = Σi Σj yij² − y..²/N
  SSt = Σi (yi.²/n) − y..²/N
  SSe = SST − SSt

3.2 Evaluating the F Ratio
• Assume we have a population of scores and we draw at random 3 sets of 15 scores each.
• Assume the null hypothesis is true, that is, each treatment group is drawn from the same population (μ1 = μ2 = μ3).
• Assume we repeat such an experiment a very large number of times and compute the value of F in each case.

3.3 Sampling Distribution of F
• If we group the Fs according to size, we can graph them by frequency of occurrence.
• A frequency distribution of a statistic such as F is called the sampling distribution of the statistic.

(Figure: the sampling distribution of F.)

• The graph demonstrates that the F distribution is the sampling distribution of F when infinitely many experiments are conducted.
• This distribution can be determined for any experiment, that is, any number of groups and any number of subjects in the groups.

• The F distribution allows us to make statements about how common or rare an observed F value is. For example, only 5% of the time would we expect an Fobs ≥ 3.23.
• This is the probability that an Fobs ≥ 3.23 will occur on the basis of chance factors alone.

• We have considered the sampling distribution of F under H0. However, we conduct experiments expecting to find treatment effects.
• If H0 is false, we expect that F > 1. The sampling distribution of F under H1 is called F′.

• We cannot plot the distribution of F′ as we can with F, because the distribution of F′ depends on the magnitude of the treatment effects as well as on the dfs.
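The sampling distribution described in Sections 3.2-3.3 can be approximated by simulation. A minimal sketch, with arbitrary population parameters and NumPy/SciPy assumed available:

```python
# Sketch: simulating the sampling distribution of F under H0 (3 groups of 15 scores).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
l, n, reps = 3, 15, 20_000            # 3 groups of 15, many repeated experiments
f_values = np.empty(reps)

for r in range(reps):
    # All groups come from the same population, so H0 is true by construction.
    groups = [rng.normal(loc=50, scale=10, size=n) for _ in range(l)]
    f_values[r], _ = stats.f_oneway(*groups)

df_num, df_den = l - 1, l * (n - 1)   # 2 and 42
print(f"Simulated 95th percentile:  {np.percentile(f_values, 95):.2f}")
print(f"Theoretical F(.05; 2, 42):  {stats.f.ppf(0.95, df_num, df_den):.2f}")
```

With df = (2, 42), the simulated 95th percentile should land near the tabled 5% cutoff of about 3.2, consistent with the Fobs ≥ 3.23 figure quoted above.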
3.4 Testing the Null Hypothesis
  H0: all means are equal
  H1: not all means are equal
Alternatively,
  H0: there are no treatment effects
  H1: there are some treatment effects

• When we conduct an experiment, we need to decide whether the observed F is from the F distribution or from the F′ distribution.
• Since we test the null hypothesis, we focus on the F distribution.
• Theoretically, it is possible to obtain any value of F under H0.

• Thus, we cannot be certain that an observed F is from the F or the F′ distribution; that is, we do not know whether the sample means differ only due to chance.
• We can take this attitude and render the experiment useless, or we can be willing to make mistakes in rejecting the null hypothesis when it is true.

• We select an arbitrary dividing line for any F distribution such that values of F falling above the line are unlikely and values falling below it are likely.
• If the observed F falls above the line, we conclude that it is incompatible with the null hypothesis (reject H0).
• If the observed F falls below the line, we conclude that it is compatible with the null hypothesis (retain H0).
• The line conventionally divides the F distribution so that 5% of the area under the curve (cumulative probability) is the region of incompatibility. This probability (α) is called the significance level.

• We can choose any significance level, as long as it is chosen before the experiment.
• The formal rule is stated as:
  Reject H0 when Fobs ≥ Fα(df_num, df_denom); otherwise retain H0.

• Most software reports the probability of occurrence of Fobs (the p value). This relieves us from consulting the F tables (but not from specifying α before the test). The formal rule becomes:
  If p ≤ α, reject H0; otherwise retain H0.

3.5 Errors in Hypothesis Testing
• There are two states of reality (H0 true/false) and two decisions we may make (reject/retain H0).
• Out of the four combinations, only two lead to correct decisions. The other two lead to errors.

                      Reality
  Decision       H0 true            H0 false
  Reject H0      Type I error       Correct decision
  Retain H0      Correct decision   Type II error

(Figure: F distributions under H0 and H1 with the rejection region marked, showing the Type I error (α), the Type II error (β), and the power (1 − β).)

• Type I and Type II errors are related inversely; that is, decreasing the α level increases the Type II error.
• The power of a statistical test is the probability of rejecting the null hypothesis when it is false. Thus, power is the probability of making a correct decision.
• If β denotes the probability of making a Type II error, then power = 1 − β. Thus, a smaller β indicates more power.
• Power is an index of the sensitivity of an experiment. A well designed experiment should have high power so that the results are repeatable.

4. Effect Size and Power
• The power of an experiment depends on
  – the α level
  – the sample size
  – the effect size
• While the F ratio is a measure of statistical significance, effect size is a measure of practical significance.

• Effect size indicates whether the treatments have a practical effect on the response variables.
• Unlike the F test, effect size is not biased by sample size.
• The F ratio will usually indicate significance with a large sample size even with small treatment effects, as the worked example and the simulation sketch below illustrate.
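The worked example that follows makes this point with survey data; the sketch below makes the same point with synthetic data. All numbers are hypothetical, and ω² is the effect-size measure defined in Section 4.1 below.

```python
# Sketch: a large sample makes F "significant" even when group differences are small.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
l, n = 4, 200                                 # 4 groups, 200 observations each
means = [100.0, 101.0, 102.0, 99.0]           # means differ by at most 3 units
groups = [rng.normal(m, 8, size=n) for m in means]   # within-group sd = 8

f_obs, p = stats.f_oneway(*groups)
omega_sq = (l - 1) * (f_obs - 1) / ((l - 1) * (f_obs - 1) + l * n)   # Section 4.1 formula
print(f"F = {f_obs:.2f}, p = {p:.4f}, omega^2 = {omega_sq:.3f}")
# p is usually well below .05 here, yet omega^2 remains small (roughly .01-.03).
```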
Example
A researcher compares four religious groups on their attitude toward education. Ten items are used to assess attitude. There are 800 usable responses. The Protestants are split into two groups for analysis purposes.

        Prot1   Catholic   Jewish   Prot2
  n     238     182        130      250
  x̄     32      33         34       31
  s     7.1     7.6        7.8      7.5

ANOVA gives Fobs = 5.61, which is significant at the .001 level.

• Thus, while the F ratio indicates statistical association, its size does not reflect the degree of that association; that is, a large F is not necessarily better than a small one.
• The effect size provides the degree of the statistical association.

4.1 Omega Squared (ω²)
ω² is a measure of effect size. It is the proportion of the population variance accounted for by the treatment. For single-factor experiments,

  ω² = (l − 1)(F − 1) / [(l − 1)(F − 1) + l·n]

  l = number of factor levels, n = number of observations per level.

4.2 Sample Size and Power
• We have seen that the power of an experiment depends on the α level, the effect size, and the sample size.
• The α level is conventionally fixed at .05, and the effect size is usually assumed to be large.
• This leaves the researcher with the sample size to control power.

Sample size as a function of power, ω², and α (four factor levels; sample sizes are per factor level):

  α = .05
  ω²      Power: .1   .2   .3   .4   .5   .6   .7   .8   .9
  .01            21   53   83  113  144  179  219  271  354
  .06             5   10   14   19   24   30   36   44   57
  .15             3    5    6    8   10   12   14   17   22

  α = .01
  ω²      Power: .1   .2   .3   .4   .5   .6   .7   .8   .9
  .01            70  116  156  194  232  274  323  385  478
  .06            13   20   26   32   38   45   53   62   77
  .15             6    8   11   13   15   18   20   24   29

4.3 Estimating Power
"Power reflects the degree to which we can detect the treatment differences we expect and the chances that others will be able to duplicate our findings when they attempt to repeat our experiments."

• If the power of an experiment is .5, it indicates that there is only a 50% chance of the result being duplicated.
• A well designed experiment should have a power of at least .8.

• We can estimate the power of an experiment from the Pearson-Hartley power curves by calculating ω² and another statistic, φ².
• Suppose that for an experiment, Fobs = 3.2, l = 3, and n = 5. Since F.05(2,12) = 3.89, Fobs is not significant.

  ω² = (l − 1)(Fobs − 1) / [(l − 1)(Fobs − 1) + l·n]
     = (3 − 1)(3.2 − 1) / [(3 − 1)(3.2 − 1) + 3(5)] = .227

  φ² = [ω² / (1 − ω²)]·n = (.227 / .773)(5) = 1.469,  so  φ = 1.21

• The power curves for df_num = l − 1 = 2 and df_den = l(n − 1) = 12 indicate that power is approximately .36, which is too low.
• We can use the same equation to estimate the sample size needed to detect an effect of .227 and reject H0 at α = .05 with power = .8:

  φ² = [.227 / (1 − .227)]·n = .294n,  or  φ = .542·√n

If we try n = 12, φ = 1.88. Since df_den = 33, we can use the power curve for df_den = 30. Locating φ = 1.88 on this curve gives a power ≥ .8.

5. Assumptions in ANOVA
• Suppose we have l levels of a factor that we wish to compare. In the single-factor case, different levels of the factor are also called treatments.
• The linear model underlying ANOVA states:

  yij = μ + τj + εij,   i = 1, …, n,  j = 1, …, l

  yij = observation i under treatment level j
  μ   = overall mean
  τj  = μj − μ = jth treatment effect
  εij = yij − μj = experimental error

• Independence - the observations are independent within and between treatment groups.
• Normality - the observations in the treatment groups are distributed normally.
• Homoscedasticity - the variances of the treatment groups are equal.

Definitions
• Nominal α - the α level set by the experimenter. It is the percentage of time one would reject H0 falsely when all assumptions are met.
• Actual α - the percentage of time one rejects H0 falsely when one or more assumptions are violated.
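Sections 5.1-5.3 below describe the usual diagnostics for each assumption (Durbin-Watson, Shapiro-Wilk or Anderson-Darling, and Levene or Brown-Forsythe). A minimal sketch of how such checks might be run in Python, on hypothetical data, with SciPy and statsmodels assumed available:

```python
# Sketch: common diagnostics for the three ANOVA assumptions (see Sections 5.1-5.3).
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
groups = [rng.normal(50, 10, size=20) for _ in range(3)]       # hypothetical treatment groups
residuals = np.concatenate([g - g.mean() for g in groups])     # within-group residuals, in run order

# Independence (first-order autocorrelation): Durbin-Watson statistic, d near 2 means none
print(f"Durbin-Watson d: {durbin_watson(residuals):.2f}")

# Normality: Shapiro-Wilk test on the residuals
w, p_norm = stats.shapiro(residuals)
print(f"Shapiro-Wilk p:  {p_norm:.3f}")

# Homoscedasticity: Levene test; center='median' gives the Brown-Forsythe variant
stat, p_var = stats.levene(*groups, center="median")
print(f"Brown-Forsythe p: {p_var:.3f}")
```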
5.1 Independence
The independence assumption is the most important one. A violation of this assumption can increase the actual α to 10 times the nominal α. For example, if the nominal α = .05 then the actual α = .5, which indicates a 50% probability of making a Type I error.

• Correlation in time-series data can be checked with the Durbin-Watson statistic (d). The statistic, however, checks for first-order autocorrelation only.
• d < 2 indicates positive correlation
  d ≈ 2 indicates no correlation
  d > 2 indicates negative correlation

• To control autocorrelation, we can
  – decrease the level of significance
  – use non-overlapping random number streams
  – use batch means

5.2 Normality
• Normality can be checked by
  – plotting the data in each treatment group
  – a normal probability plot
  – the Anderson-Darling test
  – the Shapiro-Wilk test
• If the distribution is skewed, we can decrease the significance level.

The F test is robust against non-normality when the sample sizes are equal; that is, actual α ≈ nominal α.

5.3 Homoscedasticity
• There are a number of tests for assessing homoscedasticity. Among the more popular are the Brown-Forsythe and the Levene tests.
• If the data are heteroscedastic, we can
  – decrease the significance level
  – use a variance-stabilizing transform

The F test is robust against heterogeneous variances when the sample sizes are equal; that is, actual α ≈ nominal α.

6. Cumulative Type I Error
• Assume we perform a 3-way ANOVA (A×B×C) and conduct all 7 tests (A, B, C, AB, AC, BC, ABC) at α = .05.
• The probability of a Type I error increases with the number of tests; that is, the overall α is no longer .05 for the set of tests.

• The overall α for a set of tests is the probability of at least one false rejection when H0 is true.
• The Bonferroni Inequality provides an upper bound for the overall α. For a 3-way ANOVA:
  overall α ≤ 7(.05) = .35

6.1 Bonferroni Inequality
• In general, if we are testing n hypotheses at levels α1, …, αn, then overall α ≤ α1 + … + αn.
• If all hypotheses are tested at the same level α′, then overall α ≤ nα′.
• If the tests are independent, then overall α = 1 − (1 − α′)ⁿ.

6.2 Bonferroni Correction
The Bonferroni Correction divides the desired overall α equally among the n tests:
  α′ = overall α / n
Thus, each test is conducted at the α′ significance level.

6.3 Disadvantages of α-Correction
• Loss of power for detecting true differences when they exist
  – an impediment to discovering new findings.
• Undue importance given to the overall α
  – The overall α calculation assumes H0 is true. This is not the case in most experiments. Thus, the calculation overestimates the probability of committing a Type I error.

• The (misleading) definition of overall α:
  – For a set of tests it is the probability of one or more false rejections when H0 is true.
  – The overall error is produced mostly by experiments in which only one Type I error has occurred.
  – The instances of two Type I errors are fairly rare and decrease with the number of errors.
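The Bonferroni arithmetic above is simple enough to do by hand; the sketch below shows it together with the equivalent p-value adjustment available in statsmodels. The seven p-values are hypothetical.

```python
# Sketch: Bonferroni bound and correction for the 7 tests of a 3-way ANOVA (Section 6).
n_tests, alpha_overall = 7, 0.05

bound = n_tests * 0.05                     # upper bound on overall alpha if each test uses .05
per_test_alpha = alpha_overall / n_tests   # Bonferroni-corrected level for each test
print(f"Upper bound on overall alpha: {bound:.2f}")   # 0.35
print(f"Per-test alpha:               {per_test_alpha:.4f}")

# Equivalently, adjust the p-values themselves (values hypothetical):
from statsmodels.stats.multitest import multipletests
p_values = [0.012, 0.200, 0.003, 0.450, 0.049, 0.700, 0.031]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha_overall, method="bonferroni")
print(reject)   # which of the 7 hypotheses are rejected at an overall alpha of .05
```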
7. Factorial Design
• The reason for conducting experiments is to identify the factors contributing to a phenomenon.
• This can be done by focusing on a single factor while keeping the other factors constant, or by focusing on multiple factors simultaneously.

• The issue in the latter case is whether a particular factor studied simultaneously with another factor will show the same effect as it would when studied in a single-factor design; that is, do the factors interact?
• A factorial experiment is used to examine the effects of two or more factors simultaneously.

Example 1
A factorial experiment with two factors, A and B, at 2 levels each:

        a1   a2
  b1    10   30
  b2    20   40

Definitions
• Simple effect - of a factor: the result of the component single-factor experiment. The rows are the simple effects of B and the columns are the simple effects of A.
• Main effect - of a factor: the difference in the averages of the component single-factor experiments. The main effect of A is 20, and the main effect of B is 10.

Example 2
A factorial experiment with two factors, A and B, at 2 levels each:

        a1   a2
  b1    10   30
  b2    20    5

7.1 Interaction
(Figure: plots of the two examples. 1 - No interaction: the b1 and b2 lines are parallel. 2 - Interaction: the lines are not parallel.)
• In Example 1, the effect of A does not depend on the level of B (parallel curves).
• In Example 2, the effect of A is not the same at the two levels of B (non-parallel curves).

• An interaction is present when the effects of one factor change at different levels of the second factor.
• In most experiments, interactions are the primary interest of the study. It is not particularly revealing to report only on the significance of main effects.

• The presence of an interaction often requires ingenuity in explaining the relationships in the data.
• It also requires that main effects not be reported in isolation, as they are meaningless without the interaction information.

7.2 Advantages of Factorial Experiments
• More efficient than single-factor experiments
• Necessary if interactions are present
• Allow the effects of a factor to be estimated at several levels of the other factors, yielding conclusions that are valid over a range of experimental conditions

7.2.1 Efficiency
A factorial experiment with two factors at 2 levels each:

         a1     a2
  b1    a1b1   a2b1
  b2    a1b2   a2b2

• Effects of changing A: a2b1 − a1b1 and a2b2 − a1b2
• Effects of changing B: a1b2 − a1b1 and a2b2 − a2b1
Thus, we have two estimates of both effects.

The equivalent single-factor experiment:

         a1     a2
  b1    a1b1   a2b1
  b2    a1b2

• Effect of changing A: a2b1 − a1b1
• Effect of changing B: a1b2 − a1b1
We have one estimate of each effect. We need 3 more observations to get two estimates of each - a total of 6 observations.

• The relative efficiency of the factorial design to the single-factor experiment is 6/4 = 1.5.
• In general, with n factors each at 2 levels (a 2ⁿ design), the relative efficiency is (n + 1)/2.
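Main effects and interactions for small factorial tables such as Examples 1 and 2 can be computed directly. A minimal NumPy sketch; the tables are the cell means given in the slides:

```python
# Sketch: main effects and a quick interaction check for 2x2 tables (Examples 1 and 2).
import numpy as np

def describe(table, name):
    # table[i, j] = response at level b_(i+1) of B and a_(j+1) of A
    main_A = table[:, 1].mean() - table[:, 0].mean()   # average effect of moving a1 -> a2
    main_B = table[1, :].mean() - table[0, :].mean()   # average effect of moving b1 -> b2
    simple_A = table[:, 1] - table[:, 0]               # simple effects of A at b1 and b2
    interaction = simple_A[1] - simple_A[0]            # nonzero -> lines are not parallel
    print(f"{name}: main effect A = {main_A}, main effect B = {main_B}, "
          f"interaction contrast = {interaction}")

describe(np.array([[10, 30], [20, 40]]), "Example 1")   # parallel: interaction contrast 0
describe(np.array([[10, 30], [20,  5]]), "Example 2")   # non-parallel: interaction present
```

The same helper can be applied to each of the eight outcomes in the exercise that follows.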
Exercise
The following are 8 different outcomes of the same 2-factor experiment.
• Calculate the main effects
• Plot the data to check for interactions
• Can interactions occur in the absence of main effects?

Outcomes 1-4:
         1          2          3          4
       a1  a2     a1  a2     a1  a2     a1  a2
  b1    5   5      4   6      7   7      6   8
  b2    5   5      4   6      3   3      2   4

Outcomes 5-8:
         5          6          7          8
       a1  a2     a1  a2     a1  a2     a1  a2
  b1    6   4      5   5      8   6      7   7
  b2    4   6      3   7      2   4      1   5

(Plots of the eight outcomes omitted.)

Typical data for a two-way ANOVA:

                         Factor B
  Factor A      1                   2                 …        b
     1      y111, …, y11n      y121, …, y12n          …   y1b1, …, y1bn
     2      y211, …, y21n      y221, …, y22n          …   y2b1, …, y2bn
     …
     a      ya11, …, ya1n      ya21, …, ya2n          …   yab1, …, yabn

7.3 Linear Model
  yijk = μ + αi + βj + (αβ)ij + εijk
where
  μ      = overall mean
  αi     = average treatment effect at level ai
  βj     = average treatment effect at level bj
  (αβ)ij = interaction effect in cell aibj
  εijk   = experimental error

  A main effect:     H0: all αi = 0        H1: not all αi = 0
  B main effect:     H0: all βj = 0        H1: not all βj = 0
  A×B interaction:   H0: all (αβ)ij = 0    H1: not all (αβ)ij = 0

7.4 Types of Factors
• Fixed - the levels are selected specifically, and inferences are confined to these levels.
• Random - the levels are selected arbitrarily from a population of levels, and inferences extend to the sampled population of levels (always qualitative).
• Quantitative - numeric and ordered (always fixed).
• Qualitative - categorical, non-numeric, or numeric but unordered.

8. Statistical Inference
• Statistical inference deals with inferring the characteristics of a population by examining a sample.
• Every sample will have an associated sampling error, because a sample is a subset of the population.
• Sampling error decreases as the sample size is increased.

• We can estimate a population parameter, such as the mean (μ), with a sample statistic, such as the sample mean (x̄), or we can make an inference about the interval in which this mean falls.

8.1 Sampling Distribution of the Mean
Suppose we have a large rabbit population, we select all possible samples of 100 rabbits, and we calculate the mean of each sample. These sample means form a distribution called the sampling distribution of the mean (x̄).

8.2 Central Limit Theorem
If x has a distribution with mean μ and standard deviation σ, then the sampling distribution of the mean approaches the normal distribution as the size of the sample is increased. Its mean is μ and its standard deviation is σ/√n.

8.3 Standard Error of the Mean
• The term σx̄ = σ/√n is called the standard error of the mean because it measures the sampling error.
• μ ± 1.96σx̄ covers approximately the middle 95% of all possible sample means.

8.4 Interval Estimation of the Mean
• In practice, we do not have the resources to select all possible samples.
• We usually have one sample from which to draw conclusions about the population.

• We use the standard deviation of the sample (s) as an estimate of σ and substitute it in the formula for the standard error. Thus, sx̄ = s/√n.
• It is assumed that the sample size is n ≥ 30.

Example
A sample of 100 rabbits is selected at random from the Vogon forest; the mean weight is computed to be 10 lb, with a standard deviation of 1 lb.
  sx̄ = s/√n = 1/√100 = .1
  x̄ ± 1.96 sx̄ = 10 ± .196
Thus, a 95% CI around the mean weight is (9.804, 10.196).

(Figure: confidence intervals from repeated samples; about 95% of the intervals contain μ.)
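The rabbit-weight interval can be reproduced in a couple of lines; a sketch with SciPy, using only the summary statistics given on the slide, with the t-based interval shown for comparison:

```python
# Sketch: the 95% CI from the rabbit example (Section 8.4), computed two ways.
import math
from scipy import stats

n, xbar, s = 100, 10.0, 1.0
se = s / math.sqrt(n)                       # standard error of the mean = 0.1

# Large-sample (normal) interval, as on the slide: 10 +/- 1.96 * 0.1
z = stats.norm.ppf(0.975)
print(f"z interval: ({xbar - z * se:.3f}, {xbar + z * se:.3f})")   # about (9.804, 10.196)

# The same interval using the t distribution, which software usually reports
t = stats.t.ppf(0.975, df=n - 1)            # t(.975, 99) is about 1.984
print(f"t interval: ({xbar - t * se:.3f}, {xbar + t * se:.3f})")   # slightly wider
```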
9. Linear Regression
• In many problems there are two or more variables that are related, and it may be useful to quantify this relationship.
• Regression analysis is a statistical technique for modeling the relationship between variables for prediction or optimization.

• In general, there is a single dependent or response variable y related to n independent or regressor variables x1, x2, …, xn under the control of the experimenter.
• The relationship between these variables is described by a mathematical model called a regression equation of the form y = f(x1, x2, …, xn), where f is unknown.

• In simple regression there is only one independent variable; in multiple regression there are many independent variables.
• "Linear" implies that the relationship between the dependent and the independent variables is linear. Since this is a restrictive condition, polynomial regression allows non-linear relationships.

9.1 Least Squares
• Assume that we have n pairs of observations, (x1, y1), …, (xn, yn), and that the relationship between y and x is a straight line.
• Therefore, each observation can be described by the model yi = β0 + β1xi + εi, where εi is a random error distributed normally with mean zero and variance σ².

• The εi are also assumed to be independent.
• The εi capture the influence of omitted variables, measurement errors, and random factors on y.
• ε is called the random error term because it disturbs what would otherwise be an exact relationship between x and y.

• The assumptions imply that:
  – yi ~ N(β0 + β1xi, σ²)
  – yi and yj are independent
• The method used to estimate β0 and β1 from the observations (xi, yi) is called least squares.

(Figures: the true relationship f(x) with the error terms ε1, …, ε4 at x1, …, x4, and the assumed normal sub-populations of y at each xi.)

Least squares minimizes Σei², where ei = yi − ŷi, and partitions the total variation as
  SSt = SSr + SSe
  Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²

Assume we estimate the model y = β0 + β1x by ŷ = β̂0 + β̂1x, where
  β̂1 = Sxy / Sxx,   β̂0 = ȳ − β̂1x̄
  Sxx = Σx² − (Σx)²/n,   Sxy = Σxy − (Σx)(Σy)/n

9.2 Linear Regression - Problem
• Given a probabilistic relationship, we cannot estimate the exact value of a dependent variable solely from the value of an independent variable.
• We would also require the values of ε, which are unobservable.

• At any given setting of the independent variable, there is a sub-population of values of the dependent variable - we do not know the actual value.
• A compromise is to determine the average value of the dependent variable for a given value of the independent variable.

• This average is called the conditional mean of y (μy|x).
• Forecasting on the basis of the conditional mean is more accurate than on the basis of the unconditional mean.

9.3 Hypothesis Testing
• We can formulate hypotheses to test the significance of regression.
  H0: β1 = 0    H1: β1 ≠ 0
• β1 represents the expected change in y for a unit change in x.

• Failing to reject H0 is equivalent to concluding that the relationship between x and y does not have a significant slope.
• This may imply either that x is of little value in explaining the variation in y, or that the relationship between x and y is not linear.
• Alternatively, if H0 is rejected, then x is of some value in explaining the variation in y.

(Figure: scatterplots illustrating a case where H0 is not rejected and a case where H0 is rejected.)

ANOVA for testing the significance of regression:

  Source of variation | Sum of squares | df     | Mean square | F
  Regression          | SSr            | 1      | MSr         | MSr / MSe
  Error               | SSe            | n − 2  | MSe         |
  Total               | SSt            | n − 1  |             |

  SSr = β̂1·Sxy,   SSe = SSt − SSr,   SSt = Σy² − (Σy)²/n
  Reject H0 if Fobs > Fα,1,n−2.
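A minimal sketch of these computations in Python, using the Sxx and Sxy formulas above; the (x, y) pairs are hypothetical:

```python
# Sketch: least-squares estimates and the regression F test (Sections 9.1-9.3).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 8.2, 8.8])
n = len(x)

Sxx = np.sum(x**2) - np.sum(x)**2 / n
Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
b1 = Sxy / Sxx                      # slope estimate
b0 = y.mean() - b1 * x.mean()       # intercept estimate

SSt = np.sum(y**2) - np.sum(y)**2 / n
SSr = b1 * Sxy
SSe = SSt - SSr
F = (SSr / 1) / (SSe / (n - 2))     # MSr / MSe
p = stats.f.sf(F, 1, n - 2)

print(f"y-hat = {b0:.3f} + {b1:.3f} x,  F = {F:.1f},  p = {p:.4f}")
```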
9.4 Interval Estimation
• We can construct a confidence interval for the average value of y at a given x. This is also called a CI about the regression line.
• A 100(1 − α)% CI about the regression line at x is given by

  ŷ ± t(α/2, n−2) · √[ MSe ( 1/n + (x − x̄)²/Sxx ) ]

• A 100(1 − α)% CI about μy|x will enclose μy|x 100(1 − α)% of the time if the experiment is conducted repeatedly. For example, a 90% CI about μy|x will enclose μy|x 9 times out of 10.

9.5 Prediction
• We can construct a prediction interval for the actual value of y at a given value of x.
• A 100(1 − α)% PI about the regression line at x is given by

  ŷ ± t(α/2, n−2) · √[ MSe ( 1 + 1/n + (x − x̄)²/Sxx ) ]

9.6 Interval Estimation and Prediction
(Figure: confidence and prediction bands about the fitted line; the PI band lies outside the CI band and both widen away from x̄.)

• A CI is for estimating the average value of y; a PI is for predicting the actual value of y.
• A CI is constructed about parameters.
• A PI is constructed about variables.
• A PI is always wider than the corresponding CI, because it concerns a quantity that varies, unlike the average value of y, which is constant.

• The confidence bands widen at the ends of the regression line, indicating that we should not extrapolate the average value of y.

9.7 Assumptions in Linear Regression
• We require the following assumptions when fitting a regression model:
  – εi ~ NID(0, σ²)
  – the relationship between x and y is linear

• These assumptions can be checked by analyzing the residuals (error terms):
  – the normality assumption can be checked by plotting the residuals on normal probability paper
  – the independence and constant-variance assumptions can be checked by plotting the residuals against the predicted values

(Figure: typical residual plots - normal, heteroscedastic, error in calculation, and curvilinear patterns.)

9.8 Transformations
• In some situations there is a need to transform the dependent or the independent variable to linearize the relationship.
• The transformation depends on the curvature of the scatterplot.

(Figure: four scatterplot curvature patterns with suggested transformations drawn from x², x³, log x, 1/x, y², y³, log y, and 1/y, according to the direction of curvature.)

• If we replace x by log x, the regression model is y = β0 + β1z + ε, where z = log x.
• If we replace y by log y, the regression model is log y = β0 + β1x + ε.

9.9 Correlation
• Correlation analysis allows us to measure the strength of the relationship between two variables.
• There are two correlation measures:
  – the coefficient of correlation (r)
  – the coefficient of determination (r²)

9.9.1 Coefficient of Correlation
  r = Sxy / √(Sxx·Syy)
• −1 ≤ r ≤ 1
• r > 0 indicates a positive linear relationship
• r < 0 indicates a negative linear relationship
• r = 0 indicates no linear relationship

(Figure: scatterplots illustrating r > 0, r < 0, and two patterns with r = 0.)

9.9.2 Coefficient of Determination
  r² = SSr / SSt = explained variation / total variation
r² is the proportion of the variation in the y values that is explained by the x variable.

9.10 Common Errors and Limitations
• Estimates from a regression equation should not be made beyond the range of the original observations.
• Correlation analysis does not indicate a cause-and-effect relationship.
  – r² indicates the proportion of explained variation if there is a causal relationship. It is not necessarily the percentage of variation in y caused by x.

• The correlation coefficient should not be interpreted as a percentage.
• It is possible to omit the intercept (β0) from the model, so that y = βx + ε. This is a strong assumption: it implies that y = 0 when x = 0. A model with the intercept usually gives a better fit.
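Most regression software reports the intervals of Sections 9.4 and 9.5 directly. A sketch with statsmodels, reusing the hypothetical data from the earlier sketch; the summary_frame column names are those of recent statsmodels versions:

```python
# Sketch: CI for the mean of y and PI for an individual y (Sections 9.4-9.5) via statsmodels OLS.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 8.2, 8.8])

X = sm.add_constant(x)                     # adds the intercept column for beta0
res = sm.OLS(y, X).fit()
print("r-squared:", round(res.rsquared, 3))

# 95% CI for the mean of y and 95% PI for an individual y at x = 4.5
x_new = sm.add_constant(np.array([4.5]), has_constant="add")
frame = res.get_prediction(x_new).summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
# The obs_ci (prediction) interval is always wider than the mean_ci (confidence) interval.
```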
10. Multiple Regression
In some situations, simple linear regression is not adequate for describing the relationship between the dependent and the independent variables:
  – the relationship may involve many independent variables (not "simple")
  – the relationship may not be a straight line (not "linear")

• However, most of the linear regression concepts apply to multiple regression.
• In multiple regression, the dependent variable is related to a set of independent variables:
  y = β0 + β1x1 + β2x2 + … + βnxn + ε

• If interactions are present in the model, then:
  y = β0 + β1x1 + β2x2 + β3x1x2 + ε
• The β1, …, βn are called the partial regression coefficients or partial slopes. βi (i > 0) represents the expected change in y for a unit change in xi, holding all other xs constant.

10.1 Issues Common to LR and MR
• Least squares
• Hypothesis testing
• Interval estimation
• Assumptions
• Multiple correlation (r)

10.2 Multicollinearity
• To maximize r in multiple regression, we are interested in finding predictors that are correlated significantly with the dependent variable and uncorrelated with each other.
• This allows each predictor to explain a different component of the variance of the dependent variable.

• In many cases, the predictor variables will be correlated with each other to some degree. Thus, we should choose variables that are correlated minimally with each other.
• Multicollinearity occurs when the predictors are correlated with each other.

10.2.1 Correlation Matrix
Which of the following will have the smallest and the largest multiple correlation?

        x1   x2   x3          x1   x2   x3          x1   x2   x3
  y     .2   .1   .3    y     .6   .5   .7    y     .6   .7   .7
  x1         .5   .4    x1         .2   .3    x1         .7   .6
  x2              .6    x2              .2    x2              .8

• Multicollinearity:
  – limits r severely, because the predictors are going after much of the same variance of y
  – may undermine the importance of a given predictor, because the effects of the predictors are confounded due to the correlations among them
  – increases the variance of the regression coefficients

• Multicollinearity can be diagnosed by examining the:
  – correlation matrix
  – variance inflation factors
• Multicollinearity can be combated by:
  – combining correlated predictors
  – dropping predictors
  – adding data

10.2.2 Variance Inflation Factor
• The VIF for a predictor indicates whether there is a strong linear association between it and the remaining predictors.
• A VIF above 10 is cause for concern.

10.3 Model Selection
There are many methods available for selecting a good set of predictors. Most are sequential; that is, they examine the contribution of a predictor toward the variance of y while holding the effects of the other predictors constant.

10.3.1 Forward Selection
• Enter first the variable with the largest simple correlation with y.
• If this predictor is significant, then consider the variable with the largest semipartial correlation with y.
• Repeat until a variable does not make a significant contribution to the prediction.

10.3.2 Backward Selection
• Compute a model with all variables and calculate the partial F for every variable as if it were the last variable to enter the model.
• Compare the smallest partial F to a given significance value and remove the corresponding variable if necessary.
• Repeat until all remaining variables are significant.

10.3.3 Stepwise Selection
• Similar to Forward Selection, except that the significance of each variable is reassessed at every stage.
• In Forward Selection a variable stays in the equation upon entering. This is not the case in Stepwise Selection.
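A sketch of the VIF diagnostic from Section 10.2.2, with two deliberately collinear predictors; all data are synthetic and statsmodels is assumed available:

```python
# Sketch: diagnosing multicollinearity with variance inflation factors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=n)   # strongly collinear with x1
x3 = rng.normal(size=n)                          # roughly independent of the others

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i, name in enumerate(["x1", "x2", "x3"], start=1):   # column 0 is the constant
    print(name, "VIF =", round(variance_inflation_factor(X, i), 1))
# x1 and x2 typically show VIFs well above 10; x3 stays near 1.
```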
10.4 Under/Overfitting
• It is important not to underfit (leave important variables out) or overfit (include variables that make marginal or no contribution) a model.
• Mallows' Cp is a criterion that aids in selecting a model with the correct number of predictors.
• For a correctly specified model, Cp ≈ p.

10.5 Model Validation
• It is important to determine how well the equation will predict on a given data sample. The following are three forms of model validation:
  – data splitting
  – adjusted R²
  – the PRESS statistic

• Data splitting - split the data in half. Derive the model from one half and validate it on the other half.
• Adjusted R² - compute an R² adjusted for the number of variables in the model.
• PRESS statistic - the prediction error for observation i is computed from the equation derived on the remaining n − 1 data points. Thus this statistic provides n validations, each of sample size n − 1.

10.6 Common Errors and Limitations
• The magnitude of a partial regression coefficient does not indicate the importance of the corresponding variable.
• r can be brought close to 1 by continually adding variables with marginal contributions.
• There should be at least 15 observations per variable.

References
Hines WW and Montgomery DC, Probability and Statistics in Engineering and Management Science, 2nd ed, John Wiley & Sons, NY, 1980.
Keppel G, Design and Analysis: A Researcher's Handbook, 3rd ed, Prentice Hall, NJ, 1991.
Montgomery DC, Design and Analysis of Experiments, 3rd ed, John Wiley & Sons, NY, 1991.
Ott L, An Introduction to Statistical Methods and Data Analysis, 2nd ed, Duxbury Press, MA, 1984.
Sanders D, Statistics: A Fresh Approach, 4th ed, McGraw-Hill, NY, 1990.
Stevens J, Applied Multivariate Statistics for the Social Sciences, 3rd ed, Lawrence Erlbaum Associates, NJ, 1996.
Vaidyanathan R and Vaidyanathan G, College Business Statistics with Canadian Applications, 2nd ed, McGraw-Hill Ryerson, ON, 1992.