Inference for Simple Regression
Social Research Methods 2109 & 6507
Spring 2006
March 15, 16, 2006

Regression Equation
Equation of a regression line:
(y_hat) = α + βx
y = α + βx + ε
y = dependent variable
x = independent variable
β = slope = predicted change in y with a one-unit change in x
α = intercept = predicted value of y when x is 0
y_hat = predicted value of the dependent variable

Supplement: Proportional Reduction of Error (PRE)
• PRE measures compare the errors of predictions under different prediction rules; they contrast a naïve rule with a sophisticated rule
• R² is a PRE measure
• Naïve rule = predict y_bar
• Sophisticated rule = predict y_hat
• R² measures the reduction in predictive error from using the regression predictions rather than predicting the mean of y

Example: SPSS Regression Procedures and Output
• To get a scatterplot: Graphs (G) → Scatter (S) → Simple → Define (choose x and y)
• To get a correlation coefficient: Analyze (A) → Correlate (C) → Bivariate
• To perform simple regression: Analyze (A) → Regression (R) → Linear (L) (choose x and y; predicted values and residuals can optionally be saved)

SPSS Example: Infant mortality vs. Female Literacy, 1995 UN Data
[Figure: scatterplot "Infant Mortality vs. Female Literacy", 109 countries, 1995 UN Data; horizontal axis: Females who read (%)]

Example: correlation between infant mortality and female literacy

Correlations
                                                       BABYMORT    LIT_FEMA
BABYMORT Infant mortality       Pearson Correlation    1           -.843**
(deaths per 1000 live births)   Sig. (2-tailed)        .           .000
                                N                      109         85
LIT_FEMA Females who read (%)   Pearson Correlation    -.843**     1
                                Sig. (2-tailed)        .000        .
                                N                      85          85
**. Correlation is significant at the 0.01 level (2-tailed).

Regression: infant mortality vs. female literacy, 1995 UN Data

Model Summary (b)
Model    R          R Square    Adjusted R Square    Std. Error of the Estimate
1        .843 (a)   .711        .708                 20.6971
a. Predictors: (Constant), LIT_FEMA Females who read (%)
b. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)

Coefficients (a)
                                 Unstandardized           Standardized
                                 B          Std. Error    Beta      t          Sig.    95% CI for B: Lower    Upper
(Constant)                       127.203    5.764                   22.067     .000    115.738                138.668
LIT_FEMA Females who read (%)    -1.129     .079          -.843     -14.302    .000    -1.286                 -.972
a. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)

Diagnosis: a residual plot
[Figure: "Regression Residuals vs. Female Literacy", 109 countries, 1995 UN Data; horizontal axis: Females who read (%)]

Global test (F test): tests whether the regression equation has any explanatory power (β = 0)

The regression model
• Note: the slope and intercept of the regression line are statistics (i.e., computed from the sample data).
• To do inference, we have to think of the sample intercept and slope as estimates of the unknown parameters α and β.

Regression as conditional means
• Ways to think about regression:
1. Straight-line description of association
2. Prediction
3. Conditional means
Conditional mean: a mean computed conditional on the value of another variable
The regression line predicts the conditional mean of y given x

Assumptions for regression inference
Think of there being a population or “true” regression line.
Assumptions:
• For any fixed value of x, the response (y) varies according to a normal distribution. Repeated responses y are independent of each other.
• μy = α + βx (the means of y conditional on x fall on a straight line)
• The standard deviation of y (call it σ) is the same for each value of x. The value of σ is unknown.
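The assumptions above can be made concrete with a small simulation. The sketch below draws repeated samples from a hypothetical "true" line with normal, constant-spread errors and fits a least-squares line to each sample. The parameter values (ALPHA, BETA, SIGMA), the x values, and the helper names simulate_sample and least_squares are illustrative assumptions, chosen only to be on roughly the same scale as the infant mortality example; they are not taken from the UN data.

```python
import random

def simulate_sample(alpha, beta, sigma, xs, rng):
    """Draw one y per x from y = alpha + beta*x + eps, with eps ~ Normal(0, sigma)."""
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

def least_squares(xs, ys):
    """Return (a, b): intercept and slope of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "true" regression line and spread (assumed values, not estimates)
ALPHA, BETA, SIGMA = 127.0, -1.1, 20.0

# Fixed x values, reused for every simulated sample
xs = [rng.uniform(10, 100) for _ in range(85)]

intercepts, slopes = [], []
for _ in range(2000):
    ys = simulate_sample(ALPHA, BETA, SIGMA, xs, rng)
    a, b = least_squares(xs, ys)
    intercepts.append(a)
    slopes.append(b)

mean_a = sum(intercepts) / len(intercepts)
mean_b = sum(slopes) / len(slopes)
print(f"mean intercept over samples: {mean_a:.2f} (true alpha = {ALPHA})")
print(f"mean slope over samples:     {mean_b:.3f} (true beta = {BETA})")
```

Each sample gives a different fitted line, but the fitted intercepts and slopes scatter around the assumed true values, which is the sense in which the sample line estimates the population line.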
“True” regression line

Inference for regression
• Population regression line: μy = α + βx
estimated from the sample: (y_hat) = a + bx
b is an unbiased estimator of the true slope β, and a is an unbiased estimator of the true intercept α

Sampling distribution of a (intercept) and b (slope)
• Mean of the sampling distribution of a is α
• Mean of the sampling distribution of b is β
• The standard errors of a and b are related to the amount of spread about the regression line (σ)
• The sampling distributions are normal; with σ estimated, use the t distribution for inference

The standard error of the least-squares line
• Estimate σ (the spread about the regression line) using the residuals from the regression
• Recall that residual = (y − y_hat)
• Estimate the population standard deviation about the regression line (σ) using the sample estimates

Estimate σ from sample data
s = sqrt( Σ(y − y_hat)² / (n − 2) )

Standard Error of Slope (b)
• The standard error of the slope is given by:
SEb = s / sqrt( Σ(x − x_bar)² ) = s / ( Sx · sqrt(n − 1) )
• A small standard error of b means our estimate b is a precise estimate of β
• SEb is directly related to s and inversely related to sample size (n) and Sx

Confidence Interval for regression slope
A level C confidence interval for the slope β of the “true” regression line is
b ± t* SEb
where t* is the upper (1 − C)/2 critical value from the t distribution with n − 2 degrees of freedom.
To test the hypothesis H0: β = 0, compute the t statistic:
t = b / SEb
The P-value is found in terms of a random variable having the t distribution with n − 2 degrees of freedom.

Significance Tests for the slope
Test hypotheses about the slope β.
Usually:
H0: β = 0 (no linear relationship between the independent and dependent variable)
Alternatives:
HA: β > 0 or
HA: β < 0 or
HA: β ≠ 0

Statistical inference for intercept
We could also do statistical inference for the regression intercept, α.
Possible hypotheses:
H0: α = 0
HA: α ≠ 0
The t test is based on a and is very similar to the t tests we have done before.
In most substantive applications we are interested in the slope (β), not usually in α.

Regression: infant mortality vs. female literacy, 1995 UN Data

Model Summary (b)
Model    R          R Square    Adjusted R Square    Std. Error of the Estimate
1        .843 (a)   .711        .708                 20.6971
a. Predictors: (Constant), LIT_FEMA Females who read (%)
b. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)

ANOVA (b)
Model 1       Sum of Squares    df    Mean Square    F          Sig.
Regression    87617.840         1     87617.840      204.538    .000 (a)
Residual      35554.673         83    428.370
Total         123172.513        84
a. Predictors: (Constant), LIT_FEMA Females who read (%)
b. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)

Coefficients (a)
                                 Unstandardized           Standardized
                                 B          Std. Error    Beta      t          Sig.    95% CI for B: Lower    Upper
(Constant)                       127.203    5.764                   22.067     .000    115.738                138.668
LIT_FEMA Females who read (%)    -1.129     .079          -.843     -14.302    .000    -1.286                 -.972
a. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)

Hypothesis test example
Da-Hua is analyzing generational differences in educational attainment. He has collected data on the educational level of 117 father-son pairs. The father's educational level is the independent variable and the son's educational level is the dependent variable. His regression equation is: y_hat = 0.2915*x + 10.25
The standard error of the regression slope is 0.10.
1. At α = 0.05, can Da-Hua conclude that the educational levels of fathers and sons are related?
2. For all boys whose fathers are college graduates, what is the predicted mean educational level of these boys?
3. A boy's father is a college graduate. What is the predicted future educational level of this boy?
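One way to work through the example above is sketched below, using the slide formulas t = b / SEb and b ± t* SEb with n − 2 degrees of freedom. Two inputs are assumptions that do not appear in the problem statement: the critical value t* ≈ 1.98 (a table value for a two-sided 5% test with 115 degrees of freedom) and the coding of "college graduate" as x = 16 years of schooling.

```python
# Values given in the example
n = 117        # number of father-son pairs
b = 0.2915     # estimated slope
a = 10.25      # estimated intercept
se_b = 0.10    # standard error of the slope

# Question 1: test H0: beta = 0 against HA: beta != 0 at the 0.05 level
t_stat = b / se_b
df = n - 2

# Assumption: two-sided 5% critical value for t with 115 df, read from a t table
T_CRIT = 1.98
reject_h0 = abs(t_stat) > T_CRIT

# 95% confidence interval for the slope: b +/- t* SEb
ci_low = b - T_CRIT * se_b
ci_high = b + T_CRIT * se_b

# Questions 2 and 3: prediction at the father's education level
# Assumption: a college graduate is coded as 16 years of schooling
x_college = 16
y_hat = a + b * x_college

print(f"t = {t_stat:.3f} with {df} degrees of freedom; reject H0: {reject_h0}")
print(f"95% CI for the slope: ({ci_low:.4f}, {ci_high:.4f})")
print(f"predicted son's education at x = {x_college}: {y_hat:.3f}")
```

Under these assumptions the t statistic is about 2.9, which exceeds the critical value, so H0: β = 0 is rejected. Questions 2 and 3 share the same point prediction, y_hat at the chosen x. They differ in uncertainty: an interval for the mean of all such boys is narrower than a prediction interval for one boy, and neither interval can be computed from the information given in the example.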