Business Statistics in Practice
Chapter 9: Statistical Inferences Based on Two Samples

Learning Objectives
In this chapter, you learn how to use hypothesis testing to compare the difference between:
- The means of two independent populations
- The means of two related populations
- The proportions of two independent populations
- The variances of two independent populations

Two-Sample Tests (两总体检验)
- Population means, independent samples: Population 1 vs. independent Population 2
- Population means, related samples: same population before vs. after treatment
- Population proportions: Proportion 1 vs. Proportion 2
- Population variances: Variance 1 vs. Variance 2

Difference Between Two Means
Goal: test a hypothesis or form a confidence interval for the difference between two population means, μ1 – μ2. The point estimate for the difference is x̄1 – x̄2. Three cases arise:
- σ1 and σ2 known: use a Z test statistic
- σ1 and σ2 unknown, assumed equal: use Sp to estimate the common unknown σ, with a t test statistic and pooled standard deviation
- σ1 and σ2 unknown, not assumed equal: use S1 and S2 to estimate the unknown σ1 and σ2, with a separate-variance t test

Independent Samples (独立样本)
- Different, unrelated data sources: the sample selected from one population has no effect on the sample selected from the other population
- Use the difference between the two sample means
- Use a Z test, a pooled-variance t test, or a separate-variance t test, as above

σ1 and σ2 Known
Assumptions:
- Samples are randomly and independently drawn
- Population distributions are normal, or both sample sizes are at least 30
- Population standard deviations are known

When σ1 and σ2 are known and both populations are normal or both sample sizes are at least 30, the test statistic is a Z-value. The standard error of x̄1 – x̄2 is

  σ(x̄1 – x̄2) = sqrt(σ1²/n1 + σ2²/n2)

and the test statistic for μ1 – μ2 is

  Z = [(x̄1 – x̄2) – (μ1 – μ2)] / sqrt(σ1²/n1 + σ2²/n2)

Hypothesis Tests for Two Population Means (independent samples)
- Lower-tail test: H0: μ1 – μ2 ≥ 0 vs. Ha: μ1 – μ2 < 0 (i.e., H0: μ1 ≥ μ2 vs. Ha: μ1 < μ2); reject H0 if Z < –Zα
- Upper-tail test: H0: μ1 – μ2 ≤ 0 vs. Ha: μ1 – μ2 > 0 (i.e., H0: μ1 ≤ μ2 vs. Ha: μ1 > μ2); reject H0 if Z > Zα
- Two-tail test: H0: μ1 – μ2 = 0 vs. Ha: μ1 – μ2 ≠ 0 (i.e., H0: μ1 = μ2 vs. Ha: μ1 ≠ μ2); reject H0 if Z < –Zα/2 or Z > Zα/2

Confidence Interval, σ1 and σ2 Known
The confidence interval for μ1 – μ2 is

  (x̄1 – x̄2) ± Zα/2 · sqrt(σ1²/n1 + σ2²/n2)

Example 9.1: Customer Waiting Time Case
- A random sample of 100 waiting times observed under the current system of serving customers (population 1) has a sample mean waiting time of 8.79; the variance is 4.7. Assume population 1 is normal or the sample size is large.
- A random sample of 100 waiting times observed under the new system (population 2) has a sample mean waiting time of 5.14; the variance is 1.9. Assume population 2 is normal or the sample size is large.

Then, if the samples are independent, at 95% confidence zα/2 = z0.025 = 1.96, and

  (x̄1 – x̄2) ± zα/2 · sqrt(σ1²/n1 + σ2²/n2) = (8.79 – 5.14) ± 1.96 · sqrt(4.7/100 + 1.9/100)
                                             = 3.65 ± 0.5035 = [3.15, 4.15]

According to the calculated interval, the bank manager can be 95% confident that the new system reduces the mean waiting time by between 3.15 and 4.15 minutes.

Now test the claim that the new system reduces the mean waiting time. Test the null H0: μ1 – μ2 ≤ 0 against the alternative Ha: μ1 – μ2 > 0 at the α = 0.05 significance level, with the rejection rule: reject H0 if z > zα. At the 5% significance level, zα = z0.05 = 1.645, so reject H0 if z > 1.645. Using the sample and population data of Example 9.1, the test statistic is

  z = [(x̄1 – x̄2) – D0] / sqrt(σ1²/n1 + σ2²/n2) = (8.79 – 5.14 – 0) / sqrt(4.7/100 + 1.9/100) = 3.65 / 0.2569 = 14.21

Because z = 14.21 > z0.05 = 1.645, reject H0. Conclude that μ1 – μ2 is greater than 0: the new system does reduce the mean waiting time, by an estimated 3.65 minutes on average. The p-value for this test is the area under the standard normal curve to the right of z = 14.21. This z value is off the table, so the p-value must be much less than 0.001. We therefore have extremely strong evidence that H0 is false, that Ha is true, and that the new system reduces the mean waiting time.

σ1 and σ2 Unknown, Assumed Equal
Assumptions:
- Samples are randomly and independently drawn
- Populations are normally distributed, or both sample sizes are at least 30
- Population variances are unknown but assumed equal

Forming interval estimates: because the population variances are assumed equal, use the two sample variances and pool them to estimate the common σ². The test statistic is a t value with (n1 + n2 – 2) degrees of freedom. The pooled variance (合并方差) is

  Sp² = [(n1 – 1)S1² + (n2 – 1)S2²] / [(n1 – 1) + (n2 – 1)]

and the test statistic for μ1 – μ2 is

  t = [(x̄1 – x̄2) – (μ1 – μ2)] / sqrt(Sp² · (1/n1 + 1/n2))

where t has (n1 + n2 – 2) d.f.

Confidence Interval, σ1 and σ2 Unknown (assumed equal)
The confidence interval for μ1 – μ2 is

  (x̄1 – x̄2) ± t(n1+n2–2) · sqrt(Sp² · (1/n1 + 1/n2))

Example 9.2: Catalyst Comparison Case
The difference in mean hourly yields of a chemical process. Given:

  n1 = 5, x̄1 = 811.0, s1² = 386.0
  n2 = 5, x̄2 = 750.2, s2² = 484.2

Assume that the populations of all possible hourly yields for the two catalysts are both normal with the same variance. The pooled estimate of σ² is

  sp² = [(n1 – 1)s1² + (n2 – 1)s2²] / (n1 + n2 – 2) = [(5 – 1)(386) + (5 – 1)(484.2)] / (5 + 5 – 2) = 435.1

Let μ1 be the mean hourly yield of catalyst 1 and μ2 the mean hourly yield of catalyst 2. We want the 95% confidence interval for μ1 – μ2, with df = n1 + n2 – 2 = 5 + 5 – 2 = 8. At 95% confidence, tα/2 = t0.025 = 2.306 for 8 degrees of freedom. The 95% confidence interval is

  (x̄1 – x̄2) ± t0.025 · sqrt(sp² · (1/n1 + 1/n2)) = (811 – 750.2) ± 2.306 · sqrt(435.1 · (1/5 + 1/5))
                                                  = 60.8 ± 30.4217 = [30.38, 91.22]

So we can be 95% confident that the mean hourly yield from catalyst 1 is between 30.38 and 91.22 pounds higher than that of catalyst 2; on average, the mean yields differ by 60.8 lbs.

Example 9.3: Pooled-Variance t Test
You are a financial analyst for a brokerage firm.
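As a quick numerical check, the pooled-variance interval for the Catalyst Comparison Case can be reproduced in a few lines of Python. This is a sketch using only the standard library; the t critical value 2.306 for 8 d.f. is hard-coded from a t table, since the standard library provides no t quantiles.

```python
import math

# Pooled-variance estimate and 95% CI for mu1 - mu2
# (Catalyst Comparison Case numbers from the text).
n1, xbar1, s1_sq = 5, 811.0, 386.0
n2, xbar2, s2_sq = 5, 750.2, 484.2

# Pooled variance: a weighted average of the two sample variances
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

# t critical value for 95% confidence with n1 + n2 - 2 = 8 d.f.
# (hard-coded from a t table)
t_crit = 2.306

diff = xbar1 - xbar2
margin = t_crit * math.sqrt(sp_sq * (1 / n1 + 1 / n2))
lo, hi = diff - margin, diff + margin
print(f"sp^2 = {sp_sq:.1f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

The printed interval matches the [30.38, 91.22] computed on the slide.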
Is there a difference in dividend yield between stocks listed on the NYSE and NASDAQ? You collect the following data:

                   NYSE    NASDAQ
  Number             21        25
  Sample mean       3.27      2.53
  Sample std dev    1.30      1.16

Assuming both populations are approximately normal with equal variances, is there a difference in average yield (α = 0.05)?

Calculating the test statistic: the pooled variance is

  Sp² = [(n1 – 1)S1² + (n2 – 1)S2²] / [(n1 – 1) + (n2 – 1)] = [(21 – 1)(1.30²) + (25 – 1)(1.16²)] / [(21 – 1) + (25 – 1)] = 1.5021

and the test statistic is

  t = [(x̄1 – x̄2) – (μ1 – μ2)] / sqrt(Sp² · (1/n1 + 1/n2)) = (3.27 – 2.53 – 0) / sqrt(1.5021 · (1/21 + 1/25)) = 2.040

Solution: H0: μ1 – μ2 = 0 (i.e., μ1 = μ2) vs. Ha: μ1 – μ2 ≠ 0 (i.e., μ1 ≠ μ2), α = 0.05, df = 21 + 25 – 2 = 44, critical values t = ±2.0154. Decision: since t = 2.040 > 2.0154, reject H0 at α = 0.05. Conclusion: there is evidence of a difference in means.

σ1 and σ2 Unknown, Not Assumed Equal
Assumptions:
- Samples are randomly and independently drawn
- Populations are normally distributed, or both sample sizes are at least 30
- Population variances are unknown and cannot be assumed to be equal

Forming the test statistic: because the population variances are not assumed equal, include the two sample variances separately in the computation of the t-test statistic. The test statistic is a t value with v degrees of freedom, where v is the integer portion of

  v = (S1²/n1 + S2²/n2)² / [ (S1²/n1)² / (n1 – 1) + (S2²/n2)² / (n2 – 1) ]

and the test statistic for μ1 – μ2 is

  t = [(x̄1 – x̄2) – (μ1 – μ2)] / sqrt(S1²/n1 + S2²/n2)

Exercise
A recent EPA study compared the highway fuel economy of domestic and imported passenger cars. A sample of 15 domestic cars revealed a mean of 33.7 mpg with a sample standard deviation of 2.4 mpg. A sample of 12 imported cars revealed a mean of 35.7 mpg with a sample standard deviation of 3.9 mpg. Assuming both populations are approximately normal with equal variances, at the .05 significance level can the EPA conclude that the mpg for the domestic cars is lower than for the imported cars?

Step 1: State the null and alternate hypotheses: H0: μD ≥ μI vs. Ha: μD < μI
Step 2: The .05 significance level is stated in the problem.
Step 3: Find the appropriate test statistic; we use the t distribution.
Step 4: The decision rule is to reject H0 if t < –1.708, or if the p-value < .05. There are n1 + n2 – 2 = 25 degrees of freedom.
Step 5: Compute the pooled variance and the test statistic:

  sp² = [(n1 – 1)s1² + (n2 – 1)s2²] / (n1 + n2 – 2) = [(15 – 1)(2.4²) + (12 – 1)(3.9²)] / (15 + 12 – 2) = 9.918

  t = (x̄1 – x̄2) / sqrt(sp² · (1/n1 + 1/n2)) = (33.7 – 35.7) / sqrt(9.918 · (1/15 + 1/12)) = –1.64

Since the computed t of –1.64 > the critical t of –1.708, and the p-value of .0567 > α of .05, H0 is not rejected. There is insufficient sample evidence to conclude that the imported cars get higher mpg.

Related Populations (相关总体)
Dependent samples are samples that are paired or related in some fashion:
- If you wished to buy a car, you would look at the same car at two (or more) different dealerships, say Town and Country Cadillac and Downtown Cadillac, and compare the prices.
- If you wished to measure the effectiveness of a new diet, you would weigh the dieters at the start and at the finish of the program.

More examples of related populations:
- An analyst for Educational Testing Service wants to compare the mean GMAT scores of students before and after taking a GMAT review course.
- Nike wants to see if there is a difference in durability of two sole materials: one type is placed on one shoe, the other type on the other shoe of the same pair.

Test of Related Populations
Tests the means of two related populations:
- Paired or matched samples
- Repeated measures (before/after)
Use the difference between the paired values, di = X1i – X2i; this eliminates variation among subjects.
Assumptions: both populations are normally distributed or, if not normal, large samples are used.

Mean Difference
The ith paired difference is di = X1i – X2i. The point estimate for the population mean paired difference is

  d̄ = (Σ di) / n

where n is the number of pairs in the paired sample. We estimate the unknown population standard deviation σd with the sample standard deviation of the differences:

  Sd = sqrt( Σ (di – d̄)² / (n – 1) )

Use a paired t test: the test statistic for d̄ is a t statistic with n – 1 d.f.:

  t = (d̄ – μd) / (Sd / sqrt(n))

Confidence Interval (paired samples)
The confidence interval for μd is

  d̄ ± tα/2, n–1 · Sd / sqrt(n)

Hypothesis Testing for the Mean Difference, σd Unknown (paired samples)
- Lower-tail test: H0: μd ≥ 0 vs. Ha: μd < 0; reject H0 if t < –tα
- Upper-tail test: H0: μd ≤ 0 vs. Ha: μd > 0; reject H0 if t > tα
- Two-tail test: H0: μd = 0 vs. Ha: μd ≠ 0; reject H0 if t < –tα/2 or t > tα/2
where t has n – 1 d.f.

Example 9.4: Repair Cost Comparison
Table: a sample of n = 7 paired differences of the repair cost estimates at garages 1 and 2 (cost estimates in hundreds of dollars)

  Damaged car   Garage 1 estimate   Garage 2 estimate   Paired difference
  Car 1              $7.1               $7.9             d1 = –0.8
  Car 2               9.0               10.1             d2 = –1.1
  Car 3              11.0               12.2             d3 = –1.2
  Car 4               8.9                8.8             d4 =  0.1
  Car 5               9.9               10.4             d5 = –0.5
  Car 6               9.1                9.8             d6 = –0.7
  Car 7              10.3               11.7             d7 = –1.4

Each damaged car is taken to Garage 1 for its estimated repair cost, and then to Garage 2 for its estimated repair cost. The mean estimated repair cost at Garage 1 is x̄1 = 9.329 and at Garage 2 is x̄2 = 10.129. For the sample of n = 7 paired differences,

  d̄ = x̄1 – x̄2 = 9.329 – 10.129 = –0.8,  sd² = 0.2533,  sd = 0.5033

At 95% confidence, we want tα/2 with n – 1 = 6 degrees of freedom: tα/2 = 2.447. The 95% confidence interval is

  d̄ ± tα/2 · sd / sqrt(n) = –0.8 ± 2.447 · (0.5033 / sqrt(7)) = –0.8 ± 0.4654 = [–1.2654, –0.3346]

We can be 95% confident that the mean of all possible paired differences of repair cost estimates at the two garages is between –$126.54 and –$33.46; that is, the mean of all possible repair cost estimates at Garage 1 is between $33.46 and $126.54 less than the mean of all possible repair cost estimates at Garage 2.

Now test whether repair cost at Garage 1 is less expensive than at Garage 2, that is, test whether μd = μ1 – μ2 is less than zero: H0: μd ≥ 0 vs. Ha: μd < 0. Test at the α = 0.01 significance level. Reject H0 if t < –tα, that is, if t < –t0.01. With n – 1 = 6 degrees of freedom, t0.01 = 3.143, so reject H0 if t < –3.143. Calculate the t statistic:

  t = (d̄ – 0) / (sd / sqrt(n)) = –0.8 / (0.5033 / sqrt(7)) = –4.2053

Because t = –4.2053 is less than –t0.01 = –3.143, reject H0. Conclude at the α = 0.01 significance level that the mean repair cost at Garage 1 is less than the mean repair cost at Garage 2. From a computer, for t = –4.2053 the p-value is 0.003. Because this p-value is very small, there is very strong evidence that H0 should be rejected and that μ1 is actually less than μ2.

Paired t Test (成对t检验): Example 9.5
Assume you send your salespeople to a "customer service" training workshop. Has the training made a difference in the number of complaints? You collect the following data:

  Salesperson   Complaints before (1)   Complaints after (2)   Difference di = (2) – (1)
  C.B.                   6                      4                     –2
  T.F.                  20                      6                    –14
  M.H.                   3                      2                     –1
  R.K.                   0                      0                      0
  M.O.                   4                      0                     –4

  d̄ = (Σ di) / n = –21 / 5 = –4.2,  Sd = sqrt( Σ (di – d̄)² / (n – 1) ) = 5.67

Solution: has the training made a difference in the number of complaints (at the 0.01 level)?
H0: μd = 0 vs. Ha: μd ≠ 0, α = .01, d.f. = n – 1 = 4, critical values = ±4.604.
Test statistic: t = (d̄ – μd) / (Sd / sqrt(n)) = (–4.2 – 0) / (5.67 / sqrt(5)) = –1.66
Decision: do not reject H0 (the test statistic is not in the rejection region).
Conclusion: there is not a significant change in the number of complaints.
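The paired-test computation for the complaint data can be sketched with only the Python standard library; `statistics.stdev` uses the same n – 1 divisor as the Sd formula above.

```python
import math
import statistics

# Paired t test for the mean difference (complaint counts from the
# training-workshop example).
before = [6, 20, 3, 0, 4]
after = [4, 6, 2, 0, 0]

# Paired differences d_i = after - before
d = [a - b for a, b in zip(after, before)]

d_bar = statistics.mean(d)    # point estimate of mu_d
s_d = statistics.stdev(d)     # sample std dev of the differences (n - 1)
t = (d_bar - 0) / (s_d / math.sqrt(len(d)))

print(f"d_bar = {d_bar}, s_d = {s_d:.2f}, t = {t:.3f}")
```

With df = n – 1 = 4, the computed t is compared against the ±4.604 critical values as on the slide, so H0 is not rejected.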
Two Population Proportions
Goal: test a hypothesis or form a confidence interval for the difference between two population proportions, p1 – p2.
Assumptions: n1p1 ≥ 5, n1(1 – p1) ≥ 5, n2p2 ≥ 5, n2(1 – p2) ≥ 5.
The point estimate for the difference is p̂1 – p̂2.

The population of all possible values of p̂1 – p̂2:
- has approximately a normal distribution if each of the sample sizes n1 and n2 is large; here, n1 and n2 are large enough if n1p1 ≥ 5, n1(1 – p1) ≥ 5, n2p2 ≥ 5, and n2(1 – p2) ≥ 5
- has mean μ(p̂1 – p̂2) = p1 – p2
- has standard deviation

  σ(p̂1 – p̂2) = sqrt( p1(1 – p1)/n1 + p2(1 – p2)/n2 )

Confidence Interval for Two Population Proportions
If the random samples are independent of each other, then the following is a 100(1 – α) percent confidence interval for p1 – p2:

  (p̂1 – p̂2) ± zα/2 · sqrt( p̂1(1 – p̂1)/n1 + p̂2(1 – p̂2)/n2 )

Testing the Difference of Two Population Proportions
The test statistic for p1 – p2 is a Z statistic:

  z = [(p̂1 – p̂2) – (p1 – p2)] / σ(p̂1 – p̂2)

If the hypothesized difference is p1 – p2 = 0, estimate σ(p̂1 – p̂2) by

  s(p̂1 – p̂2) = sqrt( p̂(1 – p̂) · (1/n1 + 1/n2) )

where p̂ is the pooled proportion: the total number of units in the two samples that fall into the category of interest, divided by the total number of units in the two samples. If the hypothesized difference p1 – p2 is nonzero, estimate σ(p̂1 – p̂2) by

  s(p̂1 – p̂2) = sqrt( p̂1(1 – p̂1)/n1 + p̂2(1 – p̂2)/n2 )

Hypothesis Tests for Two Population Proportions
- Lower-tail test: H0: p1 – p2 ≥ 0 vs. Ha: p1 – p2 < 0 (i.e., H0: p1 ≥ p2 vs. Ha: p1 < p2); reject H0 if Z < –Zα
- Upper-tail test: H0: p1 – p2 ≤ 0 vs. Ha: p1 – p2 > 0 (i.e., H0: p1 ≤ p2 vs. Ha: p1 > p2); reject H0 if Z > Zα
- Two-tail test: H0: p1 – p2 = 0 vs. Ha: p1 – p2 ≠ 0 (i.e., H0: p1 = p2 vs. Ha: p1 ≠ p2); reject H0 if Z < –Zα/2 or Z > Zα/2
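A minimal sketch of the pooled two-proportion z test described above. The sample counts (x1, n1, x2, n2) are made up for illustration, not taken from the text; `NormalDist` from the standard library supplies the normal CDF for the p-value.

```python
import math
from statistics import NormalDist

# Hypothetical counts: x = number in the category of interest,
# n = sample size, for each of two independent samples.
x1, n1 = 40, 100
x2, n2 = 30, 100

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)   # overall proportion under H0: p1 = p2

# Standard error using the pooled estimate (hypothesized difference is 0)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat - 0) / se

# Two-tailed p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

For a two-tail test at α = .05, the computed z would be compared against ±1.96, as in the rejection rules above.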
Example 9.6: Two Population Proportions
Is there a significant difference between the proportion of men and the proportion of women who will vote Yes on Proposition A? In a random sample, 36 of 72 men and 31 of 50 women indicated they would vote Yes. Test at the .05 level of significance.

The hypothesis test is H0: p1 – p2 = 0 (the two proportions are equal) vs. Ha: p1 – p2 ≠ 0 (there is a significant difference between the proportions). The sample proportions are p̂1 = 36/72 = .50 for men and p̂2 = 31/50 = .62 for women, and the pooled estimate of the overall proportion is

  p̂ = (X1 + X2) / (n1 + n2) = (36 + 31) / (72 + 50) = 67/122 = .549

The test statistic for p1 – p2 is

  z = [(p̂1 – p̂2) – (p1 – p2)] / sqrt( p̂(1 – p̂)(1/n1 + 1/n2) ) = (.50 – .62 – 0) / sqrt( .549(1 – .549)(1/72 + 1/50) ) = –1.31

For α = .05, the critical values are ±1.96. Decision: since –1.96 < –1.31 < 1.96, do not reject H0. Conclusion: there is not significant evidence of a difference between men and women in the proportions who will vote Yes.

Chapter Summary
Compared two independent samples:
- Performed a Z test for the difference in two means
- Performed a pooled-variance t test for the difference in two means
- Performed a separate-variance t test for the difference in two means
- Formed confidence intervals for the difference between two means
Compared two related samples (paired samples):
- Performed paired-sample Z and t tests for the mean difference
- Formed confidence intervals for the mean difference
Compared two population proportions:
- Formed confidence intervals for the difference between two population proportions
- Performed a Z test for two population proportions

Business Statistics in Practice
Chapter 11: Simple Linear Regression Analysis (线性回归分析)

Table 11.1 lists the percentage of the labour force that was unemployed during the decade 1991-2000. Plot a graph with the time (years after 1991) on the x axis and the percentage of unemployment on the y axis. Do the points follow a clear pattern? Based on these data, what would you expect the percentage of unemployment to be in the year 2005?

Table 11.1: Percentage of Civilian Unemployment
  Year   Number of Years from 1991   Percentage Unemployed
  1991              0                        6.8
  1992              1                        7.5
  1993              2                        6.9
  1994              3                        6.1
  1995              4                        5.6
  1996              5                        5.4
  1997              6                        4.9
  1998              7                        4.5
  1999              8                        4.2
  2000              9                        4.0

The pattern does suggest that we may be able to get useful information by finding a line that "best fits" the data in some meaningful way. The best-fitting line is

  y = –0.389x + 7.338

Based on this formula, we can attempt a prediction of the unemployment rate in the year 2005 (x = 14):

  y(14) = –0.389(14) + 7.338 = 1.892

Note: care must be taken when making predictions by extrapolating from known data, especially when the data set is as small as the one in this example.

Learning Objectives
In this chapter, you learn:
- How to use regression analysis to predict the value of a dependent variable based on an independent variable
- The meaning of the regression coefficients b0 and b1
- To make inferences about the slope and correlation coefficient
- To estimate mean values and predict individual values

Correlation (相关) vs. Regression (回归)
- A scatter diagram (散点图) can be used to show the relationship between two variables
- Correlation analysis is used to measure the strength of the association (linear relationship) between two variables
- Correlation is only concerned with the strength of the relationship; no causal effect (因果效应) is implied by correlation

Scatter Diagrams / Scatter Plots (散点图)
Scatter diagrams are used to examine possible relationships between two numerical variables: one variable is measured on the vertical axis and the other on the horizontal axis. Visualize the data to see patterns, especially "trends" (e.g., restaurant ratings: mean preference vs. mean taste).

Introduction to Regression Analysis
Regression analysis is used to:
- Predict the value of a dependent variable based on the value of at least one independent variable
- Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to predict or explain. Independent variable: the variable used to explain the dependent variable.

Simple Linear Regression Model
- Only one independent variable, X
- The relationship between X and Y is described by a linear function
- Changes in Y are assumed to be caused by changes in X

Types of Relationships
Relationships between X and Y may be linear or curvilinear, strong or weak, or there may be no relationship at all. [Figures: panels of scatter plots illustrating linear, curvilinear, strong, weak, and no relationships.]

The simple linear regression model is

  Yi = β0 + β1Xi + εi

where β0 is the population Y intercept, β1 is the population slope coefficient, Xi is the independent variable, Yi is the dependent variable, and εi is the random error term; β0 + β1Xi is the linear component and εi the random error component.

Simple Linear Regression Equation (Prediction Line)
The simple linear regression equation provides an estimate of the population regression line:

  Ŷi = b0 + b1Xi

where Ŷi is the estimated (or predicted) Y value for observation i, b0 is the estimate of the regression intercept, b1 is the estimate of the regression slope, and Xi is the value of X for observation i. The individual random error terms ei have a mean of zero.

Least Squares Method (最小二乘方法)
b0 and b1 are obtained by finding the values that minimize the sum of the squared differences between Y and Ŷ:

  min Σ (Yi – Ŷi)² = min Σ (Yi – (b0 + b1Xi))²

Interpretation of the Slope (斜率) and the Intercept (截距)
- b0 is the estimated average value of Y when the value of X is zero
- b1 is the estimated change in the average value of Y as a result of a one-unit change in X

Example 11.1: The House Price Case
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet). A random sample of 10 houses is selected: the dependent variable (Y) is house price in $1000s and the independent variable (X) is square feet. [Figure: scatter plot of house price ($1000s) against square feet, with the fitted regression line; intercept = 98.248, slope = 0.10977.]

  house price = 98.24833 + 0.10977 (square feet)

Interpretation of the intercept, b0: b0 is the estimated average value of Y when X = 0 (if X = 0 is in the range of observed X values). Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet.

Interpretation of the slope coefficient, b1: b1 measures the estimated change in the average value of Y as a result of a one-unit change in X. Here, b1 = 0.10977 tells us that the value of a house increases by 0.10977 ($1000) = $109.77, on average, for each additional square foot of size.

Predictions Using Regression Analysis
Predict the price for a house with 2000 square feet:

  house price = 98.25 + 0.1098 (sq. ft.) = 98.25 + 0.1098(2000) = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850.

Interpolation vs. Extrapolation
When using a regression model for prediction, only predict within the relevant range of the data (interpolation); do not try to extrapolate beyond the range of observed X's.

The Least Squares Point Estimates
Estimation/prediction equation: ŷ = b0 + b1x. The least squares point estimate of the slope is

  b1 = SSxy / SSxx

where

  SSxy = Σ (xi – x̄)(yi – ȳ) = Σ xiyi – (Σ xi)(Σ yi)/n
  SSxx = Σ (xi – x̄)² = Σ xi² – (Σ xi)²/n

and the least squares point estimate of the y-intercept is

  b0 = ȳ – b1x̄,  where ȳ = (Σ yi)/n and x̄ = (Σ xi)/n

Model Assumptions
1. Mean of zero: at any given value of x, the population of potential error term values has a mean equal to zero
2. Constant variance: at any given value of x, the population of potential error term values has a variance that does not depend on the value of x
3. Normality: at any given value of x, the population of potential error term values has a normal distribution
4. Independence: any one value of the error term ε is statistically independent of any other value of ε

Measures of Variation
Total variation is made up of two parts:

  SST = SSR + SSE

- SST = total sum of squares = Σ (Yi – Ȳ)²: measures the variation of the Yi values around their mean Ȳ
- SSR = regression sum of squares = Σ (Ŷi – Ȳ)²: explained variation attributable to the relationship between X and Y
- SSE = error sum of squares = Σ (Yi – Ŷi)²: variation attributable to factors other than the relationship between X and Y
where Ȳ = average value of the dependent variable, Yi = observed values of the dependent variable, and Ŷi = predicted value of Y for the given Xi value.

Coefficient of Determination, r²
The coefficient of determination (决定系数) is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. It is also called r-squared and is denoted r²:

  r² = SSR / SST = regression sum of squares / total sum of squares,  0 ≤ r² ≤ 1

- r² = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X
- 0 < r² < 1: weaker linear relationship; some but not all of the variation in Y is explained by variation in X
- r² = 0: no linear relationship; the value of Y does not depend on X (none of the variation in Y is explained by variation in X)

The Simple Correlation Coefficient (简单相关系数)
The simple correlation coefficient r measures the strength of the linear relationship between y and x:

  r = +sqrt(r²) if b1 is positive, and r = –sqrt(r²) if b1 is negative

where b1 is the slope of the least squares line. r can also be calculated using the formula

  r = SSxy / sqrt(SSxx · SSyy)

Inference About the Slope: t Test
The t test for a population slope asks: is there a linear relationship between X and Y?
Null and alternative hypotheses: H0: β1 = 0 (no linear relationship) vs. Ha: β1 ≠ 0 (a linear relationship does exist).
Test statistic (d.f. = n – 2):

  t = (b1 – β1) / sb1,  where sb1 = s / sqrt(SSxx)

Here b1 is the regression slope coefficient, β1 the hypothesized slope, and sb1 the standard error of the slope.

Example 11.2: The House Price Case

  House Price in $1000s (y)   Square Feet (x)
           245                     1400
           312                     1600
           279                     1700
           308                     1875
           199                     1100
           219                     1550
           405                     2350
           324                     2450
           319                     1425
           255                     1700

Simple linear regression equation: house price = 98.25 + 0.1098 (sq. ft.). The slope of this model is 0.1098. Does the square footage of the house affect its sales price?

H0: β1 = 0 vs. Ha: β1 ≠ 0, n = 10, d.f. = 10 – 2 = 8, α/2 = .025, critical values ±tα/2 = ±2.3060.

  t = (b1 – β1) / sb1 = (0.10977 – 0) / 0.03297 = 3.329

Since t = 3.329 > 2.3060, reject H0: there is sufficient evidence that square footage affects house price.

Confidence Interval Estimate for the Slope
The confidence interval estimate of the slope is b1 ± t(n–2) · sb1, with d.f. = n – 2. At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858). Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.70 and $185.80 per square foot of house size. This 95% confidence interval does not include 0. Conclusion: there is a significant relationship between house price and square feet at the .05 level of significance.

Chapter Summary
- Introduced types of regression models
- Reviewed the assumptions of regression and correlation
- Discussed determining the simple linear regression equation
- Described measures of variation
- Described inference about the slope
- Discussed correlation: measuring the strength of the association
- Addressed estimation of mean values and prediction of individual values
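The least squares formulas and the slope t test can be checked against the House Price data in a short Python sketch, using only the standard library; the data are the ten (square feet, price) pairs from the Example 11.2 table.

```python
import math

# Least squares fit and slope t test for the House Price Case
# (price in $1000s vs. square feet, data from the text).
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sums of squares SSxy, SSxx, SSyy
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares point estimates: b1 = SSxy / SSxx, b0 = y_bar - b1 * x_bar
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# Standard error of the slope and t statistic for H0: beta1 = 0
sse = ss_yy - b1 * ss_xy          # error sum of squares
s = math.sqrt(sse / (n - 2))      # standard error of the estimate
s_b1 = s / math.sqrt(ss_xx)
t = b1 / s_b1

print(f"b0 = {b0:.2f}, b1 = {b1:.4f}, t = {t:.2f}")
```

The computed values reproduce the slides' estimates b0 = 98.248, b1 = 0.10977, and t = 3.329, which is compared against the ±2.3060 critical values with 8 d.f.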