Handout #7: Theoretical Framework for Simple Linear Regression
Example 7.1: This handout makes use of the Nutrition dataset, which can be found on our course
website. This dataset contains nutritional information on a variety of fast-food restaurants found in
Winona. A snippet of the data is provided here.
Simple Linear Regression Setup
• Model to be fit using only nutritional information from Wendy's.
• Response Variable: Saturated Fat
• Predictor Variable: Calories
• Assume the following structure for the mean and variance functions:
  o $E(\text{SaturatedFat} \mid \text{Calories}, \text{Restaurant} = \text{Wendy's}) = \beta_0 + \beta_1 * \text{Calories}$
  o $Var(\text{SaturatedFat} \mid \text{Calories}, \text{Restaurant} = \text{Wendy's}) = \sigma^2$
The above simple linear regression model reduces our original nutritional dataset down to the following.
Simple Linear Regression Output
[Figure: scatterplot showing the conditional distribution of SaturatedFat | Calories]
[Output: Basic Regression Output]
[Output: Standard Parameter Estimate Output (with 95% confidence intervals)]
Questions
1. Write out the estimated mean function.
$\hat{E}(\text{SaturatedFat} \mid \text{Calories}, \text{Restaurant} = \text{Wendy's}) =$
2. What is our best estimate for the variance in the conditional distribution?
$\widehat{Var}(\text{SaturatedFat} \mid \text{Calories}, \text{Restaurant} = \text{Wendy's}) =$
3. Recall that the 95% confidence interval for $\beta_1$, the true slope of the mean function, is given by:
Lower Limit = Estimated Slope − c * Standard Error
= 0.03096 − 2.0555 * 0.00222
= 0.0264
Upper Limit = Estimated Slope + c * Standard Error
= 0.03096 + 2.0555 * 0.00222
= 0.0355
Interpret, in context and using layman's terms, the meaning of this 95% confidence interval
for $\beta_1$.
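The interval arithmetic can be checked with a few lines of Python. This is a minimal sketch that takes the estimated slope and the quantity c from the handout and assumes a standard error of 0.00222, the value that reproduces the upper limit of 0.0355; the variable names are illustrative only.

```python
# Values from the handout's regression output.
estimated_slope = 0.03096
c = 2.0555                 # t critical value quoted in the handout
standard_error = 0.00222   # assumption: the value consistent with the upper limit 0.0355

# Lower and upper limits: estimate -/+ c * standard error.
margin = c * standard_error
lower_limit = estimated_slope - margin
upper_limit = estimated_slope + margin

print(f"95% CI for the slope: ({lower_limit:.4f}, {upper_limit:.4f})")
# prints: 95% CI for the slope: (0.0264, 0.0355)
```

Because the whole interval lies above zero, it is consistent with saturated fat increasing, on average, as calories increase.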
Comments:
• The above confidence interval uses the t-distribution to obtain the quantity c: a t-distribution with df = 28 − 2 = 26.
• The theoretical correctness of using the t-distribution relies on the following assumptions:
  o Constant variance function
  o Response variable follows a normal distribution
  o Observations are independent of each other
In Excel
[Excel screenshot not reproduced]
Section 7.2: Mathematical Framework
Consider once again the dataset being modeled in this handout.
The mathematical representation for each observation in our dataset has the following form.
Theoretical Representation of Every Observation

\[
\begin{aligned}
Y_1 &= \beta_0 * 1 + \beta_1 * x_1 + \varepsilon_1 \\
Y_2 &= \beta_0 * 1 + \beta_1 * x_2 + \varepsilon_2 \\
Y_3 &= \beta_0 * 1 + \beta_1 * x_3 + \varepsilon_3 \\
&\;\;\vdots \\
Y_{26} &= \beta_0 * 1 + \beta_1 * x_{26} + \varepsilon_{26} \\
Y_{27} &= \beta_0 * 1 + \beta_1 * x_{27} + \varepsilon_{27} \\
Y_{28} &= \beta_0 * 1 + \beta_1 * x_{28} + \varepsilon_{28}
\end{aligned}
\]

Representation using the Observed Data

\[
\begin{aligned}
14 &= -5.8333 * 1 + 0.03096 * 580 + 1.88 \\
21 &= -5.8333 * 1 + 0.03096 * 800 + 2.06 \\
30 &= -5.8333 * 1 + 0.03096 * 1060 + 3.01 \\
&\;\;\vdots \\
12 &= -5.8333 * 1 + 0.03096 * 580 + (-0.12) \\
2 &= -5.8333 * 1 + 0.03096 * 320 + (-2.07) \\
2.5 &= -5.8333 * 1 + 0.03096 * 210 + 1.83
\end{aligned}
\]
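The decomposition of each observed value into a fitted part and a residual can be verified numerically. The sketch below uses only the six rows printed in the handout (observations 1 to 3 and 26 to 28) and the estimates −5.8333 and 0.03096; the helper name `fitted_value` is hypothetical.

```python
# (calories, observed saturated fat, residual) for the six rows shown in the handout.
rows = [
    (580, 14.0, 1.88),
    (800, 21.0, 2.06),
    (1060, 30.0, 3.01),
    (580, 12.0, -0.12),
    (320, 2.0, -2.07),
    (210, 2.5, 1.83),
]
b0, b1 = -5.8333, 0.03096  # estimated intercept and slope from the handout


def fitted_value(calories):
    """Estimated mean of saturated fat at the given calories."""
    return b0 * 1 + b1 * calories


for calories, observed, residual in rows:
    fitted = fitted_value(calories)
    rebuilt = fitted + residual  # fitted part + residual should recover the observation
    print(f"x={calories:5d}  fitted={fitted:8.4f}  residual={residual:+.2f}  "
          f"rebuilt={rebuilt:8.4f}  observed={observed}")
```

Each rebuilt value agrees with the observed value up to the rounding of the printed estimates and residuals.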
The following table provides a complete and full disclosure of the theoretical and distributional
assumptions for fitting a simple linear model.
Columns: Marginal Distribution of SaturatedFat (the Y's) => Conditional Distribution of SaturatedFat | Calories (the residuals)

• Mean: $E(\text{SaturatedFat} \mid \text{Calories}) = \beta_0 + \beta_1 * \text{Calories}$
• Variance: $Var(\text{SaturatedFat} \mid \text{Calories}) = \sigma^2$
• Independence: Y's must be independent => Residuals must be identically distributed (constant variance assumption) and independent of each other
• Distribution: Y's must follow a normal distribution => Residuals must follow a normal distribution
Notation for Distributional Assumptions
Each observation has the following form.

\[
Y_i = \underbrace{\beta_0 + \beta_1 * x_i}_{\text{Mean}} + \underbrace{\varepsilon_i}_{\text{Variance}}
\]
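The standard representation of the conditional distribution of $Y_i \mid x_i$ is provided here.

\[
\begin{aligned}
Y_i \mid x_i &\sim \text{Normal}(\beta_0 + \beta_1 * x_i,\ \sigma^2) \\
E(Y_i \mid x_i) &= \beta_0 + \beta_1 * x_i \\
Var(Y_i \mid x_i) &= Var(\varepsilon_i) = \sigma^2
\end{aligned}
\]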
Some people emphasize the fact that the variability in the conditional distribution of $Y_i \mid x_i$ comes solely
from the $\varepsilon_i$ term with the following representation for the residual or error component.

\[
\varepsilon_i \sim \text{Normal}(0, \sigma^2), \quad \text{for all } i
\]
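One way to make this representation concrete is to simulate responses from it. The sketch below is illustrative only: the function `simulate_responses` is hypothetical, the intercept and slope are the handout's estimates used as stand-ins for the true parameters, and the value sigma = 2.0 is an assumption, since the handout does not state the error standard deviation here.

```python
import random


def simulate_responses(x_values, beta0, beta1, sigma, seed=0):
    """Draw Y_i = beta0 + beta1 * x_i + eps_i with eps_i ~ Normal(0, sigma^2)."""
    rng = random.Random(seed)
    responses = []
    for x in x_values:
        epsilon = rng.gauss(0.0, sigma)  # error term: mean 0, standard deviation sigma
        responses.append(beta0 + beta1 * x + epsilon)
    return responses


calories = [580, 800, 1060, 580, 320, 210]
# sigma=2.0 is a made-up value for illustration only.
simulated = simulate_responses(calories, beta0=-5.8333, beta1=0.03096, sigma=2.0)
for x, y in zip(calories, simulated):
    print(x, round(y, 2))
```

All of the randomness enters through `epsilon`; setting sigma to 0 would return the mean function exactly.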
A Visual Depiction of these Assumptions
[Figures: conditional distributions of SaturatedFat | Calories = 580 and SaturatedFat | Calories = 900]
These assumptions about the distribution of $Y_i \mid x_i$ apply at every value of $x_i$.
[Figure: assumptions for a variety of $Y_i \mid x_i$ values]
Turning the graph to the left and laying it on its back
allows us to more easily see the normality assumption for
the conditional distributions.
Section 7.3: Verification of Assumptions
Consider once again the normal-based assumptions of the simple linear regression model.
Theoretical Normal-based Assumptions
• Mean: $E(\text{SaturatedFat} \mid \text{Calories}) = \beta_0 + \beta_1 * \text{Calories}$
• Variance: $Var(\text{SaturatedFat} \mid \text{Calories}) = \sigma^2$
• Independence: Residuals must be identically distributed (constant variance assumption) and independent of each other
• Distribution: Residuals must follow a normal distribution
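Checks of these assumptions typically start from fitted values and residuals. The following is a minimal sketch, assuming an ordinary least squares fit and using only the six observations printed in Section 7.2, so its estimates will not match the handout's output, which is based on the full Wendy's data. The function name is hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Least squares fit of y on x; returns estimates, fitted values, and residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # estimated slope
    b0 = y_bar - b1 * x_bar        # estimated intercept
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    # Estimate of the constant variance: residual sum of squares over n - 2.
    sigma2_hat = sum(e ** 2 for e in residuals) / (n - 2)
    return b0, b1, fitted, residuals, sigma2_hat


# Six (Calories, SaturatedFat) pairs shown in Section 7.2; not the full dataset.
calories = [580, 800, 1060, 580, 320, 210]
saturated_fat = [14, 21, 30, 12, 2, 2.5]

b0, b1, fitted, residuals, sigma2_hat = fit_simple_linear_regression(calories, saturated_fat)
for f, e in zip(fitted, residuals):
    print(f"fitted={f:7.3f}  residual={e:+.3f}")
```

Plotting `residuals` against `fitted` gives the kind of display used to judge the mean function and the constant variance assumption.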
Correct Form for Mean Function
[Plots not reproduced: a good pattern and two bad patterns]
Constant Variance
[Plots not reproduced: a good pattern and two bad patterns]
Independence of Observations
[Plots not reproduced: a good pattern and bad patterns]
Normality
Normal Quantile Plot
Kernel Smoothers
Checking for Outliers
The Rule
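For the normal quantile plot listed under Normality, the plotted points pair the sorted residuals with quantiles of the standard normal distribution. The sketch below computes those pairs for the six residuals printed in Section 7.2. The plotting positions (i − 0.5)/n are one common convention and are an assumption here, as is the function name; statistical software may use a slightly different formula.

```python
from statistics import NormalDist


def normal_quantile_points(residuals):
    """Pair each sorted residual with a standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (i - 0.5) / n for i = 1, ..., n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Residuals for the six observations shown in Section 7.2.
shown_residuals = [1.88, 2.06, 3.01, -0.12, -2.07, 1.83]
points = normal_quantile_points(shown_residuals)
for q, r in points:
    print(f"normal quantile={q:+.3f}  residual={r:+.2f}")
```

If the residuals follow a normal distribution, these points should fall close to a straight line.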