Inference for Simple Regression
Social Research Methods 2109 & 6507
Spring 2006
March 15, 16, 2006
Regression Equation
Equation of a regression line:
y_hat = α + βx
y = α + βx + ε
y = dependent variable
x = independent variable
β = slope = predicted change in y for a one-unit change in x
α = intercept = predicted value of y when x is 0
ε = error term = difference between the observed y and the line, y - (α + βx)
y_hat = predicted value of the dependent variable
Supplement: Proportional Reduction of Error (PRE)
• PRE measures compare the prediction errors made under different prediction rules; they contrast a naive rule with a sophisticated one
• R² is a PRE measure
• Naive rule = predict y_bar
• Sophisticated rule = predict y_hat
• R² measures the reduction in prediction error from using the regression predictions rather than predicting the mean of y
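The PRE reading of R² can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS computation: the data are made up, and the names `fit_line` and `pre_r_squared` are hypothetical helpers introduced here. E1 is the squared error under the naive rule (predict y_bar) and E2 is the squared error under the regression rule (predict y_hat); the proportional reduction is (E1 - E2) / E1.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def pre_r_squared(xs, ys):
    """R^2 computed as a proportional reduction of error."""
    a, b = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    # E1: squared error of the naive rule (always predict the mean of y)
    e1 = sum((y - y_bar) ** 2 for y in ys)
    # E2: squared error of the sophisticated rule (predict y_hat)
    e2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (e1 - e2) / e1


# Made-up illustrative data (not from the slides)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2 = pre_r_squared(xs, ys)
print(round(r2, 3))
```

For these five points E1 = 6 and E2 = 2.4, so the regression rule removes 60% of the naive rule's squared error.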
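Example: SPSS Regression Procedures and Output
• To get a scatterplot:
Graphs (G) → Scatter (S) → Simple → Define (select x and y)
• To get a correlation coefficient:
Analyze (A) → Correlate (C) → Bivariate
• To perform simple regression:
Analyze (A) → Regression (R) → Linear (L) (select x and y; you can also choose to save the predicted values and residuals)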
SPSS Example: Infant mortality vs. Female Literacy, 1995 UN Data
[Figure: scatterplot titled "Infant Mortality vs. Female Literacy, 109 countries, 1995 UN Data"; x-axis: Females who read (%)]
Example: correlation between infant mortality and female literacy
Correlations
BABYMORT = Infant mortality (deaths per 1000 live births)
LIT_FEMA = Females who read (%)
                                  BABYMORT    LIT_FEMA
BABYMORT   Pearson correlation    1           -.843**
           Sig. (2-tailed)        .           .000
           N                      109         85
LIT_FEMA   Pearson correlation    -.843**     1
           Sig. (2-tailed)        .000        .
           N                      85          85
**. Correlation is significant at the 0.01 level (2-tailed).
Regression: infant mortality vs. female literacy, 1995 UN Data
Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .843(a)   .711       .708                20.6971
a. Predictors: (Constant), LIT_FEMA Females who read (%)
b. Dependent variable: BABYMORT Infant mortality (deaths per 1000 live births)
Coefficients(a)
                                Unstandardized         Standardized
Model 1                         B          Std. Error  Beta      t         Sig.   95% CI for B (lower, upper)
(Constant)                      127.203    5.764                 22.067    .000   115.738, 138.668
LIT_FEMA Females who read (%)   -1.129     .079        -.843     -14.302   .000   -1.286, -.972
a. Dependent variable: BABYMORT Infant mortality (deaths per 1000 live births)
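As a reading aid for the SPSS output for this regression, the sketch below plugs the reported intercept (127.203) and slope (-1.129) into y_hat = a + bx and checks that squaring the reported correlation (-.843) reproduces the reported R Square (.711). The function name `predict_infant_mortality` and the 50% literacy value are illustrative assumptions, not part of the slides.

```python
# Coefficients as reported in the SPSS output for this regression
a = 127.203   # intercept (Constant)
b = -1.129    # slope for LIT_FEMA (Females who read, %)
r = -0.843    # Pearson correlation between the two variables


def predict_infant_mortality(literacy_pct):
    """Predicted infant deaths per 1000 live births at a given female literacy rate."""
    return a + b * literacy_pct


# Illustrative input (assumption): a country where 50% of females can read
print(round(predict_infant_mortality(50), 3))

# In simple regression, R Square equals the squared correlation
print(round(r ** 2, 3))
```

The slope says each additional percentage point of female literacy is associated with about 1.13 fewer predicted infant deaths per 1000 live births.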
Diagnosis: a residual plot
[Figure: residual plot titled "Regression Residuals vs. Female Literacy, 109 countries, 1995 UN Data"; x-axis: Females who read (%); y-axis: regression residuals]
Global test (F test): tests whether the regression equation has any explanatory power (H0: β = 0)
The regression model
• Note: the slope and intercept of the regression line are statistics (i.e., computed from the sample data).
• To do inference, we have to think of the sample intercept and slope (a and b) as estimates of the unknown population parameters α and β.
Regression as conditional means
• Ways to think about regression:
1. Straight-line description of association
2. Prediction
3. Conditional means
• Conditional mean: a mean computed conditional on the value of another variable
• The regression line predicts the conditional mean of y given x
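The conditional-means view can be illustrated with a small sketch. The data below are made up and were chosen so that the mean of y at each x falls exactly on a straight line; with real data the fitted line only approximates the conditional means.

```python
from collections import defaultdict

# Made-up illustrative data (not from the slides): two y values at each x
xs = [1, 1, 2, 2, 3, 3]
ys = [2, 4, 4, 6, 6, 8]

# Conditional mean of y: the mean of y computed separately for each value of x
groups = defaultdict(list)
for x, y in zip(xs, ys):
    groups[x].append(y)
conditional_means = {x: sum(vals) / len(vals) for x, vals in groups.items()}

# Least-squares line y_hat = a + b*x fitted to the same data
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

# Compare the mean of y at each x with the line's prediction at that x
for x in sorted(conditional_means):
    print(x, conditional_means[x], a + b * x)
```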
Assumptions for regression inference
Think of there being a population, or “true,” regression line.
Assumptions:
• For any fixed value of x, the response (y) varies according to a normal distribution. Repeated responses y are independent of each other.
• μy = α + βx (the means of y conditional on x fall on a straight line)
• The standard deviation of y (call it σ) is the same for each value of x. The value of σ is unknown.
“True” regression line
Inference for regression
• Population regression line:
μy = α + βx
estimated from the sample:
y_hat = a + bx
• b is an unbiased estimator of the true slope β, and a is an unbiased estimator of the true intercept α
Sampling distribution of a (intercept)
and b (slope)
• Mean of the sampling distribution of a is α
• Mean of the sampling distribution of b is β
• The standard errors of a and b are related to the amount of spread about the regression line (σ)
• The sampling distributions are normal; when σ is estimated from the data, use the t distribution for inference
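The claims about the sampling distributions of a and b can be checked with a small simulation. This is a sketch under stated assumptions: the "true" values α = 10, β = 2, σ = 5 and the fixed x values are invented for illustration, and `fit_line` and `simulate` are hypothetical helper names. Each simulated sample draws y from a normal distribution centered on α + βx, as the regression assumptions require.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b


def simulate(alpha, beta, sigma, xs, n_samples, seed=0):
    """Draw repeated samples from the model and refit the line each time."""
    rng = random.Random(seed)
    intercepts, slopes = [], []
    for _ in range(n_samples):
        ys = [alpha + beta * x + rng.gauss(0, sigma) for x in xs]
        a, b = fit_line(xs, ys)
        intercepts.append(a)
        slopes.append(b)
    return intercepts, slopes


# Assumed "true" parameters and fixed x values (illustrative only)
alpha, beta, sigma = 10.0, 2.0, 5.0
xs = list(range(0, 100, 5))  # 20 x values: 0, 5, ..., 95

intercepts, slopes = simulate(alpha, beta, sigma, xs, n_samples=2000)
mean_a = sum(intercepts) / len(intercepts)
mean_b = sum(slopes) / len(slopes)
sd_b = math.sqrt(sum((b - mean_b) ** 2 for b in slopes) / (len(slopes) - 1))

# Theoretical standard deviation of b when sigma is known: sigma / sqrt(Sxx)
x_bar = sum(xs) / len(xs)
theory_sd_b = sigma / math.sqrt(sum((x - x_bar) ** 2 for x in xs))

print(mean_a, mean_b)      # should be close to alpha and beta
print(sd_b, theory_sd_b)   # spread of b grows with sigma
```

Raising `sigma` in this sketch widens the spread of the simulated slopes, which is the sense in which the standard errors depend on the spread about the line.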
The standard error of the least-squares line
• Estimate σ (the spread about the regression line) using the residuals from the regression
• Recall that residual = (y - y_hat)
• Estimate the population standard deviation about the regression line (σ) using the sample estimates
Estimate σ from sample data
s = sqrt( Σ(y - y_hat)² / (n - 2) )
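The usual estimate of σ in simple regression is s = sqrt( Σ(y - y_hat)² / (n - 2) ), the square root of the residual sum of squares divided by its degrees of freedom. As a hedged check, the sketch below applies it to the residual sum of squares (35554.673) and sample size (n = 85) reported in the SPSS output for the infant mortality regression in these slides; the result should match the "Std. Error of the Estimate" of 20.6971 in the model summary, up to rounding.

```python
import math

# Values reported in the SPSS output for the infant mortality regression
sse = 35554.673   # residual sum of squares, sum of (y - y_hat)^2
n = 85            # number of countries with data on both variables

# Estimate of sigma: divide by n - 2 because two parameters (a, b) were estimated
s = math.sqrt(sse / (n - 2))
print(round(s, 4))
```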
Standard Error of Slope (b)
• The standard error of the slope is given by:
SEb = s / sqrt( Σ(x - x_bar)² ) = s / (Sx · sqrt(n - 1))
• A small standard error of b means our estimate b is a precise estimate of β
• SEb is directly related to s; inversely related to sample size (n) and Sx
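A minimal sketch of the slope's standard error, using the standard formula SEb = s / sqrt( Σ(x - x_bar)² ), which is the same as s / (Sx · sqrt(n - 1)). The data are made up for illustration; the two forms are computed side by side to show that they agree.

```python
import math

# Made-up illustrative data (not from the slides)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)

# Least-squares slope and intercept
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

# s: estimated spread about the regression line (n - 2 degrees of freedom)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

# Standard error of the slope, written two equivalent ways
se_b = s / math.sqrt(sxx)
s_x = math.sqrt(sxx / (n - 1))          # sample standard deviation of x
se_b_alt = s / (s_x * math.sqrt(n - 1))

print(se_b, se_b_alt)
```

Both forms show why SEb shrinks when s is small, when n is large, or when the x values are widely spread.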
Confidence Interval for regression slope
A level C confidence interval for the slope β of the “true” regression line is
b ± t* SEb
where t* is the upper (1 - C)/2 critical value from the t distribution with n - 2 degrees of freedom.
To test the hypothesis H0: β = 0, compute the t statistic:
t = b / SEb
The P-value is computed in terms of a random variable T having the t distribution with n - 2 degrees of freedom (for example, P(|T| ≥ |t|) for the two-sided alternative HA: β ≠ 0).
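The interval and test statistic can be reproduced from the SPSS coefficients reported for the infant mortality regression in these slides (b = -1.129, SEb = .079, n = 85). This is a sketch, not the SPSS computation: the critical value t* = 1.989 for 83 degrees of freedom is an assumption taken from a t table rather than computed here.

```python
# Values reported in the SPSS coefficients table
b = -1.129     # estimated slope
se_b = 0.079   # standard error of the slope
n = 85         # sample size, so df = n - 2 = 83

# Assumption: upper 2.5% critical value of t with 83 df, read from a t table
t_star = 1.989

# Test statistic for H0: beta = 0
t_stat = b / se_b

# 95% confidence interval for the slope: b +/- t* x SEb
lower = b - t_star * se_b
upper = b + t_star * se_b

print(round(t_stat, 2))
print(round(lower, 3), round(upper, 3))
```

The interval agrees with the SPSS limits (-1.286, -.972). The t statistic comes out near -14.29 rather than the reported -14.302 because the inputs here are the rounded values shown in the table.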
Significance Tests for the slope
Test hypotheses about the slope β.
Usually:
H0: β = 0 (no linear relationship between the independent and dependent variables)
Alternatives:
HA: β > 0 or HA: β < 0
or HA: β ≠ 0
Statistical inference for intercept
We could also do statistical inference for the regression intercept, α.
Possible hypotheses:
H0: α = 0
HA: α ≠ 0
The t-test is based on a and is very similar to the t-tests we have done before.
In most substantive applications we are interested in the slope (β), and usually not in α.
Regression: infant mortality vs. female literacy, 1995 UN Data
Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .843(a)   .711       .708                20.6971
a. Predictors: (Constant), LIT_FEMA Females who read (%)
b. Dependent variable: BABYMORT Infant mortality (deaths per 1000 live births)
ANOVA(b)
Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   87617.840        1    87617.840     204.538   .000(a)
Residual     35554.673        83   428.370
Total        123172.513       84
a. Predictors: (Constant), LIT_FEMA Females who read (%)
b. Dependent variable: BABYMORT Infant mortality (deaths per 1000 live births)
Coefficients(a)
                                Unstandardized         Standardized
Model 1                         B          Std. Error  Beta      t         Sig.   95% CI for B (lower, upper)
(Constant)                      127.203    5.764                 22.067    .000   115.738, 138.668
LIT_FEMA Females who read (%)   -1.129     .079        -.843     -14.302   .000   -1.286, -.972
a. Dependent variable: BABYMORT Infant mortality (deaths per 1000 live births)
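The pieces of the ANOVA table fit together arithmetically, and a short sketch makes the links explicit: F is the regression mean square divided by the residual mean square, R Square is the regression sum of squares divided by the total, and in simple regression F equals the square of the slope's t statistic. All inputs are the values reported in the SPSS output for this regression; small discrepancies are due to rounding in the printed table.

```python
# Values reported in the SPSS ANOVA table
ss_regression = 87617.840
ss_residual = 35554.673
ss_total = 123172.513
df_regression = 1
df_residual = 83

# Mean squares and the global F statistic (H0: beta = 0)
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual

# R Square as the share of total variation explained by the regression
r_squared = ss_regression / ss_total

# With one predictor, F equals t squared for the slope (t reported as -14.302)
t_slope = -14.302

print(round(f_stat, 3), round(t_slope ** 2, 3))
print(round(r_squared, 3))
```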
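Hypothesis test example
Dahua is analyzing generational differences in educational attainment. He has collected data on the education levels of 117 father-son pairs. The father's education is the independent variable and the son's education is the dependent variable. His regression equation is: y_hat = 0.2915*x + 10.25
The standard error of the regression slope is 0.10.
1. At α = 0.05, can Dahua conclude that fathers' and sons' education levels are related?
2. For all boys whose fathers are college graduates, what is the predicted mean education level of these boys?
3. A boy's father is a college graduate. What is this boy's predicted eventual education level?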
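One way to work through this exercise is sketched below. Two inputs are assumptions that the problem does not state: the two-sided critical value t* = 1.981 for 115 degrees of freedom is read from a t table, and "college graduate" is coded as 16 years of schooling. The point predictions for questions 2 and 3 are the same number; they differ only in the uncertainty attached to them (a mean for a group versus a single individual), which cannot be computed from the information given.

```python
# Values given in the exercise
b = 0.2915     # slope of son's education on father's education
a = 10.25      # intercept
se_b = 0.10    # standard error of the slope
n = 117        # father-son pairs, so df = n - 2 = 115

# Assumption: two-sided 5% critical value of t with 115 df, from a t table
t_star = 1.981

# Question 1: test H0: beta = 0 against HA: beta != 0
t_stat = b / se_b
reject_h0 = abs(t_stat) > t_star
print(round(t_stat, 3), reject_h0)

# Questions 2 and 3: point prediction at x = father's education
# Assumption: a college graduate is coded as 16 years of schooling
father_years = 16
y_hat = a + b * father_years
print(round(y_hat, 3))
```

Under these assumptions t is about 2.9, which exceeds the critical value, so the data support a relationship between fathers' and sons' education at the 0.05 level.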