BG 2200 Statistics II
Notes 1

Topic 0: Important Preliminaries

Statistics refers to the body of techniques used for collecting, organizing, analyzing, and interpreting data (information or facts). The data may be quantitative, with values expressed numerically, or qualitative, with the characteristics of observations being tabulated. Data in business can usually be classified into three types: nominal (just names), ordinal (just ordering or ranks), and interval (measurements of size). Statistics is used in business to help make better decisions by understanding the sources of variation and by uncovering patterns and relationships in business data.

Descriptive statistics include the techniques used to summarize and describe numerical data. These methods can either be graphical or involve computational analysis. Inferential statistics include those techniques by which decisions about a statistical population or process are made based only on a sample having been observed. Sample data can be obtained by means of survey, observation, interview, or experiment.

An experiment is a process that leads to the occurrence of one (and only one) of several possible outcomes. A random experiment is an experiment whose outcome cannot be predicted with certainty, but all of whose possible outcomes can be described prior to its performance, and which can be repeated under the same conditions. The collection of all possible outcomes is called the sample space. For a random experiment with a sample space, a function X that assigns to each outcome of the sample space one and only one real number is called a random variable. A discrete random variable can have observed values only at isolated points along a scale of values. In business statistics, such data typically occur through the process of counting; hence, the values generally are expressed as integers (whole numbers).
A continuous random variable can assume a value at any point along a specified interval of values. Continuous data are generated by the process of measuring.

A collection of every value of a random variable is called a population. A part of a population is called a sample. A characteristic of a population is called a parameter. A characteristic of a sample is called a statistic. The difference between a sample statistic and its corresponding population parameter is called a sampling error.

A set of pairs (X, f), where X is a value of a random variable and f is the frequency of value X, is called a frequency distribution. The set of pairs (X, f/n), where X is a value of the random variable of interest, f is the frequency of value X, and n = Σf is the total frequency of all values (the number of all data, i.e., all measurements), is called a relative frequency distribution.

Probability of an event E (classical definition): P(E) = (number of favorable outcomes) / (number of all possible outcomes).

Probability distribution: the set of pairs (X, P(X)), where X is a value of the random variable of interest and P(X) is the probability of X.

Binomial probability distribution: a particular experiment has two possible outcomes, usually called success (+) and failure (–). The experiment is repeated n times. Then the probability that x successes occur is
P(X = x) = nCx π^x (1 – π)^(n–x) = [n! / (x!(n – x)!)] π^x (1 – π)^(n–x),
where n is the number of experiments and π is the probability of a success on each trial. The expectation or mean of the binomial distribution is E(X) = nπ, and the variance is σ² = nπ(1 – π).

The sampling distribution of sample means, for samples of a fixed size, is the set of pairs of the type (X̄, P(X̄)), where X̄ is a sample mean and P(X̄) is the probability of sample mean X̄.

Total 9 topics and 11 pages. © 2010-1 Dr. Min Aung. Basics-SLR_MLR_ANOVA_INDEX
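The binomial formulas above can be sketched directly from their definitions. The trial count n = 10 and success probability π = 0.3 below are hypothetical values chosen only for illustration.

```python
from math import comb

def binomial_pmf(x, n, pi):
    # P(X = x) = nCx * pi^x * (1 - pi)^(n - x)
    return comb(n, x) * pi ** x * (1 - pi) ** (n - x)

# Hypothetical experiment: n = 10 trials, success probability pi = 0.3
n, pi = 10, 0.3
p4 = binomial_pmf(4, n, pi)       # P(X = 4 successes)
mean = n * pi                     # E(X) = n * pi
variance = n * pi * (1 - pi)      # sigma^2 = n * pi * (1 - pi)
```

Here the pmf values over x = 0, …, n sum to 1, the mean is nπ = 3, and the variance is nπ(1 – π) = 2.1.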
The probability for a continuous random variable X is usually given by means of a continuous curve, where the area under the curve between any two points indicates the probability that a value between those two points occurs by chance. The function represented by that curve is called the probability distribution of the continuous random variable X. The most popular continuous probability distribution is the normal probability distribution. The normal probability distribution is important in statistical inference for three distinct reasons: (1) The measurements obtained in many random experiments are known to follow this distribution. (2) Normal probabilities can often be used to approximate other probability distributions, such as the binomial and Poisson distributions. (3) Distributions of such statistics as the sample mean and sample proportion are approximately normal when the sample size is large, regardless of the distribution of the population. (See the Central Limit Theorem.)

Central Limit Theorem: If all samples of a particular size are selected from any population with mean μ and standard deviation σ, the sampling distribution of the sample means is approximately a normal distribution with mean μ_X̄ = μ and standard deviation σ_X̄ = σ/√n. The approximation improves with larger samples.

Point estimate for a population mean μ: the sample mean X̄.

Confidence interval estimate for a population mean: X̄ ± z·σ/√n, where X̄ is the sample mean, σ is the population standard deviation, and n is the sample size (the number of data in the sample). If σ is unknown, use s. If σ is unknown and n < 30, use s instead of σ and t instead of z. The value of t is determined by the level of confidence (1 – α) and the degrees of freedom n – 1. Note that the value of t is to be read in the two-tailed column of the t table.

One-Sample t-Test: Testing a possible value of the mean μ, using the sample mean X̄, sample standard deviation s, and sample size n; DF = n – 1, t = Formula 14 (the first one), i.e., t = (X̄ – μ)/(s/√n).
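The confidence interval and one-sample t statistic above can be sketched with a small hypothetical sample (σ unknown, n < 30, so s and t are used). The data and the hypothesized mean μ₀ = 12.0 are illustrative; the critical value 2.365 is the two-tailed 95% t value for DF = 7 from a t table.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample: n < 30 and sigma unknown, so use s and t
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
n = len(data)
xbar = mean(data)                 # sample mean
s = stdev(data)                   # sample standard deviation
t_crit = 2.365                    # t table: 95% confidence, two-tailed, DF = n - 1 = 7

# Confidence interval: xbar +/- t * s / sqrt(n)
margin = t_crit * s / sqrt(n)
lower, upper = xbar - margin, xbar + margin

# One-sample t test of H0: mu = 12.0 (t = (xbar - mu0) / (s / sqrt(n)))
mu0 = 12.0
t_stat = (xbar - mu0) / (s / sqrt(n))
```

Since |t| here is well below 2.365, H0: μ = 12.0 would not be rejected, which agrees with 12.0 lying inside the confidence interval.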
Independent-Samples t-Test: Testing the value of μ1 – μ2, DF = n1 + n2 – 2. If the two populations have unequal variances, use the formula
t = [(X̄1 – X̄2) – (μ1 – μ2)] / √(s1²/n1 + s2²/n2),
or, if the two populations have equal variances, the formula
t = [(X̄1 – X̄2) – (μ1 – μ2)] / √{ [((n1 – 1)s1² + (n2 – 1)s2²) / (n1 + n2 – 2)] · (1/n1 + 1/n2) }.

Topic 1: Simple Linear Regression and Correlation

To find out whether one interval variable depends upon another interval variable, and to construct a linear relationship (a linear equation) between them, one uses the Simple Linear Regression method. In using it, one should follow these steps.

1. Determine the independent variable and the dependent variable. The variable we want to estimate (compute) is called dependent. If u and v are variables and we want to find a formula like v = a + bu, then v is called dependent and u is independent.

2. Draw a scatter diagram to study whether there is a linear relationship between the two variables. To draw a scatter diagram, draw a horizontal real line and a vertical real line meeting at a point. Write the name of the independent variable (predictor) and the name of its unit at the horizontal line. Write the name of the dependent variable (predicted variable) and the name of its unit at the vertical line. For each pair of data, plot a dot with horizontal coordinate equal to the value of the independent variable and vertical coordinate equal to the value of the dependent variable. If all scatter points are on a straight line, we say that X and Y have a perfect linear relationship. If the scatter points are roughly around a straight line, it is quite reasonable to construct a simple linear regression formula for estimation purposes.

3. Determine the Pearson coefficient of correlation r (Formula 2) and interpret it. The scatter points are usually not on a straight line.
But there is a measurement of how close the scatter points are to a straight line. The standard measurement is called Pearson's coefficient of correlation.

Computation of r by a scientific calculator: 1) Choose LR mode. 2) Check the value of n (= the number of data already entered in the calculator). If n is 0, you can enter a new data file. 3) If n ≠ 0, there are old data in the calculator; erase them. 4) Enter the data: type a value of X (the independent variable), then the comma, then the corresponding value of Y (the dependent variable), and finally the enter button. Until you erase the data, the data file remains in the calculator, and you can check at any time such summary statistics as X̄, ΣX, ΣX², σ_Xn, σ_Xn–1, Ȳ, ΣY, ΣY², σ_Yn, σ_Yn–1, ΣXY, a, b, r, and n by pressing the corresponding buttons of the calculator. In some calculators, like the Casio fx-3800P, (1) the list of modes and (2) the functions of the keys are mentioned on the face. (3) To erase data, press Shift and AC, or Shift, AC, and =. The second functions are used by pressing Shift first; for example, Shift 1 = X̄ and Shift 9 = r. The third functions are used by pressing Kout or RCL first; for instance, Kout 3 = n. The comma button is marked by XD, YD below the key. In some calculators, like the Casio fx-350MS, (1) the list of modes and (2) the functions of the keys are mentioned in the menu inside; call them by Shift 1 or Shift 2. (3) To erase data, press Shift CLR 1 =. The enter-data button is usually denoted by DATA or DT below the key.

Interpretation of r: The value of r is always at least –1 and at most 1. If r is either 1 or –1, all scatter points are on a straight line. We can say that the scatter points are in a 100% straight-line shape. We can also say that there is a perfect linear relationship between the two variables. "Perfect" means that all points are exactly on a straight line.
If r is 0.9 or –0.9, we can say that the scatter points are nearly in a straight-line shape, and that the straight-line shape they form is 90% straight. Similarly, if r is either 0.8 or –0.8, the scatter points are quite close to a straight line, but not as close as in the case of r being 0.9 or –0.9. In this case, we can say that the scatter points are 80% in the straight-line shape. If the scatter points are in a straight-line shape to the extent of 50% or above, we say that the linear relationship between the two variables is strong. The closer r is to 1 or –1, the stronger the relationship between the two variables. "Strong" means that the scatter points are close to a straight line; in that case, if we know the value of one variable, we can estimate the value of the other variable quite accurately. Otherwise, the relationship is weak. If r is positive, the relationship is direct: the straight-line shape goes upward to the right, or has a positive slope. "Direct" means that if the value of one variable increases, the corresponding value of the other variable increases. If r is negative, the relationship is inverse: the straight-line shape falls to the right, or has a negative slope. "Inverse" means that if the value of one variable increases, the corresponding value of the other variable decreases. Thus, we have the following possible interpretations of the value of r: there is a strong/weak, positive/negative linear relationship between X and Y.

r = 0 → no linear relationship
r ≠ 0 → there is a linear relationship between X and Y (if the value of one variable changes, then the value of the other variable changes)
r = ±1 → perfect linear relationship
0 < r < 0.5 → weak, direct / positive linear relationship
–0.5 < r < 0 → weak, inverse / negative linear relationship
0.5 ≤ r < 1 → strong, direct / positive linear relationship
–1 < r ≤ –0.5 → strong, inverse / negative linear relationship
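As an illustrative sketch with hypothetical paired data, Pearson's r can be computed from its definition, together with the standard t statistic (Formula 10 in these notes) for testing its significance.

```python
import numpy as np

# Hypothetical paired data: X = advertising spend, Y = sales
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 2.8, 4.5, 3.7, 5.5, 6.0])
n = len(x)

# Pearson's r: sum of cross-deviations over the geometric mean of squared deviations
sxy = np.sum((x - x.mean()) * (y - y.mean()))
sxx = np.sum((x - x.mean()) ** 2)
syy = np.sum((y - y.mean()) ** 2)
r = sxy / np.sqrt(sxx * syy)

# Standard test statistic for H0: rho = 0, DF = n - 2
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
```

Here r is about 0.94, a strong, direct linear relationship; t exceeds the two-tailed 0.05 critical value for DF = 4 (about 2.776), so the correlation would be judged significant.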
4. Test the significance of the correlation between X and Y. "Significant" means "true in the population".

Two-tailed test: H0: ρ = 0 (There is no linear relationship between X and Y in the population.) Ha: ρ ≠ 0 (There is a linear relationship between X and Y in the population.)
Right-tailed test: H0: ρ ≤ 0 (There is no direct/positive linear relationship between X and Y in the population.) Ha: ρ > 0 (There is a direct/positive linear relationship between X and Y in the population.)
Left-tailed test: H0: ρ ≥ 0 (There is no inverse/negative linear relationship between X and Y in the population.) Ha: ρ < 0 (There is an inverse/negative linear relationship between X and Y in the population.)

Critical value: Read it in the t distribution for the given α, in the correct choice of one-tailed or two-tailed column, with degrees of freedom n – 2. Assign – for a left-tailed test, + for a right-tailed test, ± for a two-tailed test.

Test statistic: The original test statistic is r. Change r to the standard test statistic t by Formula 10.

Decision rule: In a right-tailed test, the rejection region is to the right of the critical value. In a left-tailed test, the rejection region is to the left of the critical value. In a two-tailed test, the rejection region is to the right of the positive critical value and to the left of the negative critical value. If the test statistic is in the rejection region, reject H0; otherwise, do not reject H0.

Conclusion: Conclusion and decision say the same thing. Rejecting H0 means that we decide H0 to be false and Ha to be true. Not rejecting H0 means that we do not have sufficient evidence against H0. The decision is the end of the statistical process; the conclusion is the report to the original questioner. Therefore, the conclusion is to be written in the format of the original question.

5. If ρ is significant, construct the regression or least-squares equation Ŷ = a + bX.
b = Formula 1 (the first one), a = Formula 1 (the second one). Practically, read the values of a and b from the calculator.

Interpretation of a: If X is 0 units, then the estimated Y is a units.
Interpretation of b: For one more unit in X, the estimated Y will be b units more. Or: if X increases by 1 unit, Y is estimated to increase by b units. If b is negative (b = –c), then we say: for one more unit in X, the estimated Y will be c units less. Or: if X increases by 1 unit, Y is estimated to decrease by c units.

Least-squares concept: The least-squares equation produces estimated Y values (Ŷ) which are, in total, closest to all sample values of Y. Therefore, the regression line (the graph of the regression equation) is the straight line that has the least sum of squares of vertical distances to the scatter points. It has the minimum Sum of Squared Errors, SSE = Σ_{i=1}^{n} (Yi – Ŷi)². Since it has minimum SSE, it also has the minimum Mean Squared Error among all straight lines, since MSE is obtained by dividing SSE by the fixed number n – 2. Hence, it has the minimum Se, too, since Se is obtained by taking the square root of MSE.

6. Determine the standard error of the estimate Se (Formula 7): It is the standard deviation of the Y values around the estimated Y value for any fixed value of X. It is used to construct confidence-interval estimates for the Y values for any fixed value of X.

7. Determine the standard error of the slope Sb (Formula 4): It is the standard deviation of the distribution of sample slopes. Sb is used to construct a confidence-interval estimate for the population slope β.

8. Construct a confidence interval for the slope or regression coefficient β: b ± t·Sb, DF = n – 2, two-tailed.

9. Construct the ANOVA table.
Source              SS                                DF       MS                   F
Regression          SSR (numerator in Formula 9)      1        MSR = SSR/1          F = MSR/MSE
Error or Residual   SSE = SST – SSR                   n – 2    MSE = SSE/(n – 2)
Total               SST (denominator in Formula 9)    n – 1

Concept: SST is the total squared deviation of the sample values Y from Ȳ (the mean of Y): SST = Σ_{i=1}^{n} (Yi – Ȳ)². SST measures the total difference between the Y values and Ȳ. If we use knowledge of the variable Y alone, then when we want to estimate (guess) Y we have to use Ȳ; in that case, in the sample at hand, we will make the total amount of error SST.

SSR is the total squared deviation of the Ŷ values (regression estimates) from Ȳ: SSR = Σ_{i=1}^{n} (Ŷi – Ȳ)².

SSE is the total squared deviation of the sample values Y from Ŷ (the estimated Y values): SSE = Σ_{i=1}^{n} (Yi – Ŷi)². SSE measures the total difference between the actual values of Y and the values Ŷ estimated by the regression equation. Therefore, the smaller the value of SSE, the better the regression equation. As SSR = SST – SSE by Formula 3, SSR will be large if SSE is small. Therefore, the larger the value of SSR, the better the regression equation. SSR is the total variation removed by the regression equation, or by the independent variables. MSR is the average variation removed by one independent variable; hence SSR divided by 1 is MSR, as there is only one independent variable. SSE is the total variation made at the n sample points. On average, at a point, the error of the estimated Y (the difference between Y and Ŷ) is MSE = SSE/(n – 2). In fact, SSE should be divided by n; however, probability theory suggests using n – 2 instead of n, because 2 variables are used in the samples.

Now we can see that the bigger MSR is, the better the regression estimates, and the smaller MSE is, the better the regression estimates. As F is the ratio of MSR to MSE, the larger the F value, the better the regression estimates. There are three important values in the ANOVA table: F, r² = SSR/SST, and Se = √MSE (Formula 7).
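The least-squares coefficients and the three sums of squares in the ANOVA table can be sketched directly from their definitions; the paired data below are hypothetical.

```python
import numpy as np

# Hypothetical paired data (X independent, Y dependent)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
n = len(x)

# Least-squares slope and intercept
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x                   # regression estimates

sst = np.sum((y - y.mean()) ** 2)   # total variation
sse = np.sum((y - y_hat) ** 2)      # unexplained variation
ssr = sst - sse                     # variation removed by the regression (Formula 3)

msr = ssr / 1                       # one independent variable
mse = sse / (n - 2)
f = msr / mse                       # the larger F, the better the regression
se = np.sqrt(mse)                   # standard error of the estimate
r_sq = ssr / sst                    # coefficient of determination
```

Here SST splits exactly into SSR + SSE, and r² close to 1 signals that almost all of the variation in Y is explained by the regression.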
10. Coefficient of determination r²: It is another measurement of the strength of the relationship. It is the square of r: r² = SSR/SST. It is the proportion of squared deviation removed by the regression equation. Hence, we have the following interpretation of r²: 100·r² % of the variation in Y is explained by the regression (or by the variation in X).

11. Test of significance of the regression equation with β: t = b/Sb, DF = n – 2 (the same critical value and test statistic as the ρ test). This test can be used instead of the test mentioned in paragraph 4. The two tests are exactly the same here, but not the same if we use more than one independent variable (in the next topic). The following are interpretations of the slope β in the unknown regression equation Y = α + βX of the population:
β ≠ 0: Y depends upon X. (If X changes, then Y changes.)
β > 0: Y depends upon X directly / positively. (If X increases, then Y increases.)
β < 0: Y depends upon X inversely / negatively. (If X increases, then Y decreases.)

Topic 2: Multiple Linear Regression and Correlation

Multiple linear regression is the generalization of simple linear regression to more than one independent variable. Two-value (0, 1) dummy variables can also be included. The number of independent variables is denoted by k. The total sample size (the number of paired data) is denoted by n. Therefore, the regression equation in this case is of the form Ŷ = a + b1X1 + b2X2 + … + bkXk. A procedure for using multiple linear regression is as follows.

1. Study Pearson's coefficient of correlation between the dependent variable Y and all independent variables Xi. Test the significance of the correlation between Y and each Xi. Drop the variables Xi which are insignificant for Y. After selecting the independent variables that are significantly correlated with Y, check whether any multicollinearity problem exists. If two independent variables are changing in a straight line, we say that there is multicollinearity.
A multicollinearity problem exists when there is a pair of independent variables Xi and Xj which are too strongly correlated. Usually, if the Pearson coefficient of correlation r between Xi and Xj is stronger than 0.7, we decide that a multicollinearity problem occurs between Xi and Xj. If a multicollinearity problem occurs between a pair of significant independent variables, drop the one of these two independent variables which has the weaker r with Y.

2. Construct the regression equation for Y on the selected variables X1, X2, …, Xk. Conduct the global test of significance of the regression equation: H0: β1 = β2 = … = βk = 0. Ha: Not all βi's are 0. (Or) At least one βi ≠ 0. If the p-value for the F statistic in the ANOVA table is less than α, reject H0 and write the conclusion: "The regression model is significant as a whole." If the p-value is not less than α, do not reject H0 and write the conclusion: "The regression model is not significant." The critical value is found in the F distribution table at column (degrees of freedom for the numerator) k and row (degrees of freedom for the denominator) n – k – 1. It is always a right-tailed test.

3. If the regression model is significant as a whole, conduct the individual test of significance of each independent variable Xi: H0: βi = 0, Ha: βi ≠ 0. If the p-value for the t test statistic in the coefficient table (written "Sig.", which is two-tailed) is less than α, reject H0 and write the conclusion: "Xi is a significant explanatory variable for Y." Otherwise, write: "Xi is not a significant explanatory variable for Y." Test statistic t = bi/S_bi, DF = n – k – 1.

4. Drop the insignificant independent variables.

5. Rerun SPSS. Confirm that the global test and the individual tests show significance of the regression equation and of each independent variable. Use the regression equation to estimate.
Read the point estimate for Y, the confidence interval estimate for the mean Y, and the prediction interval for an individual Y in the data viewer.

6. Interpret the values of a, bi, R, and R².
a: If all Xi are 0 units, then the estimated Y is a units.
bi: Holding the other independent variables constant, for one more unit in Xi, the estimated Y will be bi units more. (Or) Holding the other independent variables constant, if Xi increases by 1 unit, Y is estimated to increase by bi units.
R: There is a strong/weak linear relationship between X1, X2, …, Xk and Y. (No sign.) R = 0: no linear relationship. R = 1: perfect linear relationship. 0 < R < 0.5: weak linear relationship. 0.5 ≤ R < 1: strong linear relationship.
R²: 100·R² % of the variation in Y is explained by the regression (by the variations in X1, X2, …, Xk).

7. Construct a confidence interval for βi: bi ± t·S_bi, DF = n – k – 1, two-tailed.

Topic 3: ANOVA (Analysis of Variance)

1. Variance Test (comparing two population variances):
Two-tailed test: H0: σ1² = σ2², Ha: σ1² ≠ σ2². Right-tailed test: H0: σ1² ≤ σ2², Ha: σ1² > σ2². Left-tailed test: H0: σ1² ≥ σ2², Ha: σ1² < σ2².
Get the critical value from the F table at column (degrees of freedom for the numerator) n1 – 1 and row (degrees of freedom for the denominator) n2 – 1. Divide α by 2 for a two-tailed test. Compute the test statistic (Formula 15). Notice: in computing the test statistic, always put the larger variance above and the smaller variance below. The size of the sample with the larger variance minus 1 is the degrees of freedom for the numerator, and the size of the sample with the smaller variance minus 1 is the degrees of freedom for the denominator. The signs of the critical value and the test statistic are always positive; it is carried out as a right-tailed test.

2. One-Way ANOVA (comparing more than two population means): There are two variables. One variable is of interval level (numbers) and the other is of nominal level (names) with k values. It is a right-tailed test.
H0: μ1 = μ2 = … = μk. Ha: Not all k population mean "name of the variable"s are equal.

Get the critical value from the F table at column k – 1 (df for the numerator) and row N – k (df for the denominator). Construct the ANOVA table to get the test statistic F. This is a right-tailed test.

How to construct the ANOVA table: Using SD mode, fill in the following summary statistics table (one column per sample).

           Sample 1   Sample 2   Sample 3   Total
ΣXi²
Ti = ΣXi
ni                                          N = Σni

Then fill in the following ANOVA table. Use Formula 12 to compute SST (Sum of Squares Total), Formula 13 to compute SSB (Sum of Squares Between), and Formula 3 (the first one) to compute SSW (Sum of Squares Within).

Source    SS                DF       MS                   F
Between   SSB               k – 1    MSB = SSB/(k – 1)    F = MSB/MSW
Within    SSW = SST – SSB   N – k    MSW = SSW/(N – k)
Total     SST               N – 1

Idea of the test: We use the following notation: X̄i denotes the mean of sample i; X̄ denotes the grand mean, which is the mean of all N data and, at the same time, the mean of all k sample means weighted with the corresponding sample sizes.

SST is the total variation of all the data: SST = Σ_{i=1}^{k} Σ_{j=1}^{ni} (Xij – X̄)². It is the total squared difference between the N data and their mean X̄. To find the average squared difference, called the variance, we would divide SST by N – 1 (instead of N, as suggested by probability theory); but we do not need the variance here.

SSB is the total variation between the sample means and their weighted mean X̄: SSB = Σ_{i=1}^{k} ni (X̄i – X̄)². It is the total variation between the k sample means and their mean, so the degrees of freedom are k – 1. If we divide SSB by k – 1, we get MSB (Mean Squares Between).

SSW is the sum of the total variations (SSTs) within the k samples: SSW = Σ_{i=1}^{k} Σ_{j=1}^{ni} (Xij – X̄i)². The degrees of freedom are ni – 1 in each sample i, so the total is N – k. Dividing SSW by N – k, we get MSW (Mean Squares Within).
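The computation of SST, SSB, SSW, and F above can be sketched with three small hypothetical samples (all values are illustrative).

```python
import numpy as np

# Three hypothetical samples (k = 3 values of the nominal variable)
samples = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([7.0, 8.0, 6.0, 7.0]),
    np.array([10.0, 9.0, 11.0, 10.0]),
]
k = len(samples)
all_data = np.concatenate(samples)
N = len(all_data)
grand_mean = all_data.mean()

# SST: total variation of all N data about the grand mean
sst = np.sum((all_data - grand_mean) ** 2)
# SSB: variation of the sample means about the grand mean, weighted by sample size
ssb = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
# SSW: variation within each sample about its own mean
ssw = sum(np.sum((s - s.mean()) ** 2) for s in samples)

msb = ssb / (k - 1)    # Mean Squares Between
msw = ssw / (N - k)    # Mean Squares Within
f = msb / msw          # ANOVA test statistic
```

Here SST = SSB + SSW, and the F value (38.0 for these data) far exceeds the right-tailed critical value at (2, 9) degrees of freedom (about 4.26 at α = 0.05), so H0 would be rejected.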
Now it can be seen that if MSB is large, the difference among the k sample means is big, and if MSW is small, the data are close to each other within each sample. Therefore, if F = MSB/MSW is big, there is a great difference among the sample means, while the values of the test variable are not so different within each sample. Hence, a large value of F is strong evidence that the population means are different.

SPSS contains, inside ANOVA, several statistical tests which can be used to see which pairs of means are significantly different. The Scheffe test is one of them. SPSS Scheffe interpretation: two population means are unequal if the corresponding sample means are found in different columns and not in the same column. For each unequal pair of means, the population mean with the larger sample mean will be larger.

Topic 4: Index Numbers

In this topic, we study the basic idea of an economic measurement called an index, and the nine most widely used types of indexes. An index is a percent: I = (a/b) × 100. But it is customary not to write the percentage sign "%". An index compares the current value and the base-period value of the same variable. The nine indexes should be studied in three steps: (1) three basic indexes, (2) simple or unweighted, (3) weighted.

(1) Three Basic Indexes: Price, Quantity, and Value

Price index compares prices: I_p = (p_t / p_0) × 100 (0 = base period, t = current period)
Quantity index compares quantities: I_q = (q_t / q_0) × 100

Value index compares values, where Value = Price × Quantity (v = pq): I_v = (v_t / v_0) × 100 = (p_t·q_t / p_0·q_0) × 100

(2) Simple or Unweighted: Aggregates and Average
Aggregates = comparing group prices. Average = the average of the price indexes for all items.

(3) Weighted: Aggregates (Laspeyres, Paasche, and others) and Average

Unweighted or simple aggregates price index: I_p = (Σp_t / Σp_0) × 100
Weighted aggregates price index: I_p = (Σp_t·w / Σp_0·w) × 100
Special cases: 1) w = q_0 → Laspeyres; 2) w = q_t → Paasche
Unweighted or simple average of relatives price index: I_p = Σ[(p_t / p_0) × 100] / N
Weighted average of relatives price index: I_p = Σ(I·w) / Σw, where I = (p_t / p_0) × 100

How to compute: Write down the formula. Enter the values. Check the values again. Use simple step-by-step computation and write down every step. Show two decimal places in the answer. An index is a percent (don't show the % sign).

Summary: basic indexes (I_p, I_q, I_v); unweighted aggregate (Σp_t / Σp_0); unweighted average (ΣI / N); weighted aggregate (Σp_t·w / Σp_0·w, with Laspeyres: w = q_0, Paasche: w = q_t); weighted average (ΣI·w / Σw).

SPSS Commands: Topics 1 & 2: Graphs – Scatter; Analyze – Correlate – Bivariate (Pearson); Analyze – Regression – Linear (Statistics and Save). Topic 3: Analyze – Compare Means – One-Way ANOVA (Post Hoc – Scheffe).
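The index formulas above can be sketched with hypothetical base-period and current-period prices and quantities for three items (all values are illustrative).

```python
# Hypothetical base-period (0) and current-period (t) prices and quantities
p0 = [2.0, 5.0, 1.0]
pt = [2.5, 6.0, 1.2]
q0 = [10.0, 4.0, 20.0]
qt = [12.0, 3.0, 25.0]

# Unweighted aggregates price index: (sum pt / sum p0) * 100
unweighted = sum(pt) / sum(p0) * 100

# Laspeyres (w = q0) and Paasche (w = qt) weighted aggregates price indexes
laspeyres = sum(p * q for p, q in zip(pt, q0)) / sum(p * q for p, q in zip(p0, q0)) * 100
paasche = sum(p * q for p, q in zip(pt, qt)) / sum(p * q for p, q in zip(p0, qt)) * 100

# Unweighted average of relatives: mean of (pt/p0 * 100) over the N items
relatives = [t / b * 100 for t, b in zip(pt, p0)]
avg_relatives = sum(relatives) / len(relatives)

# Value index: (sum pt*qt / sum p0*q0) * 100
value_index = sum(p * q for p, q in zip(pt, qt)) / sum(p * q for p, q in zip(p0, q0)) * 100
```

Both Laspeyres and Paasche compare the current cost of a basket with its base-period cost; they differ only in whether base-period or current-period quantities weight the prices, so their values usually differ slightly.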