Handout #7: Theoretical Framework for Simple Linear Regression

Example 7.1: This handout makes use of the Nutrition dataset, which can be found on our course website. This dataset contains nutritional information for a variety of fast-food restaurants found in Winona. A snippet of the data is provided here.

Simple Linear Regression Setup

o Model to be fit using only nutritional information from Wendy's.
o Response Variable: Saturated Fat
o Predictor Variable: Calories
o Assume the following structure for the mean and variance functions:

$$E(\text{SaturatedFat} \mid \text{Calories}, \text{Restaurant} = \text{Wendy's}) = \beta_0 + \beta_1 \cdot \text{Calories}$$

$$Var(\text{SaturatedFat} \mid \text{Calories}, \text{Restaurant} = \text{Wendy's}) = \sigma^2$$

The above simple linear regression model reduces our original nutritional dataset down to the following.

Simple Linear Regression Output

[Figure: Scatterplot showing the conditional distribution of SaturatedFat | Calories]
[Figure: Basic Regression Output]
[Figure: Standard Parameter Estimate Output (with 95% confidence intervals)]

Questions

1. Write out the estimated mean function.

$$\hat{E}(\text{SaturatedFat} \mid \text{Calories}, \text{Restaurant} = \text{Wendy's}) =$$

2. What is our best estimate for the variance in the conditional distribution?

$$\widehat{Var}(\text{SaturatedFat} \mid \text{Calories}, \text{Restaurant} = \text{Wendy's}) =$$

3. Recall that the 95% confidence interval for $\beta_1$, the true slope of the mean function, is given by

$$\text{Lower Limit} = \text{Estimated Slope} - c \cdot \text{Standard Error} = 0.03096 - 2.0555 \cdot 0.00222 = 0.0264$$

$$\text{Upper Limit} = \text{Estimated Slope} + c \cdot \text{Standard Error} = 0.03096 + 2.0555 \cdot 0.00222 = 0.0355$$

Interpret, in context and using layman's language, the meaning of this 95% confidence interval for $\beta_1$.

Comments: The above confidence interval uses the t-distribution to obtain the quantity c.
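The interval arithmetic in Question 3 can be checked numerically. The following is a minimal sketch, not the handout's own method: it uses only the Python standard library, approximating the t quantile by Simpson's rule plus bisection. Two values are assumptions rather than numbers read from the regression output: df = 28 - 2 = 26 (based on the 28 Wendy's observations listed in Section 7.2) and a standard error of 0.00222 (the value consistent with the reported upper limit of 0.0355). The function names `t_pdf`, `t_cdf`, and `t_quantile` are hypothetical helpers introduced for illustration.

```python
import math


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 + total * h / 3


def t_quantile(p, df):
    """Upper-half quantile (p > 0.5) found by bisection on the CDF."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


slope = 0.03096   # estimated slope from the handout
se = 0.00222      # ASSUMED standard error (consistent with the upper limit 0.0355)
df = 28 - 2       # ASSUMED: 28 Wendy's observations, 2 estimated parameters

c = t_quantile(0.975, df)   # 95% two-sided interval leaves 2.5% in each tail
lower = slope - c * se
upper = slope + c * se

print(f"c = {c:.4f}")
print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
```

Under these assumptions the computed multiplier should agree with the handout's c = 2.0555, and the limits should round to 0.0264 and 0.0355. With a statistics library available, the quantile step would normally be a single library call instead of the hand-rolled integration shown here.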
o t-distribution with df = 28 - 2 = 26

The theoretical correctness of using the t-distribution relies on the following assumptions:

o Constant variance function
o Response variable follows a normal distribution
o Observations are independent of each other

[Figure: t-distribution calculation in Excel]

Section 7.2: Mathematical Framework

Consider once again the dataset being modeled in this handout. The mathematical representation for each observation in our dataset has the following form.

Theoretical Representation of Every Observation

$$
\begin{array}{rcl}
Y_1 & = & \beta_0 \cdot 1 + \beta_1 \cdot x_1 + \varepsilon_1 \\
Y_2 & = & \beta_0 \cdot 1 + \beta_1 \cdot x_2 + \varepsilon_2 \\
Y_3 & = & \beta_0 \cdot 1 + \beta_1 \cdot x_3 + \varepsilon_3 \\
\vdots & & \vdots \\
Y_{26} & = & \beta_0 \cdot 1 + \beta_1 \cdot x_{26} + \varepsilon_{26} \\
Y_{27} & = & \beta_0 \cdot 1 + \beta_1 \cdot x_{27} + \varepsilon_{27} \\
Y_{28} & = & \beta_0 \cdot 1 + \beta_1 \cdot x_{28} + \varepsilon_{28}
\end{array}
$$

Representation using the Observed Data

$$
\begin{array}{rcl}
14 & = & -5.8333 \cdot 1 + 0.03096 \cdot 580 + 1.88 \\
21 & = & -5.8333 \cdot 1 + 0.03096 \cdot 800 + 2.06 \\
30 & = & -5.8333 \cdot 1 + 0.03096 \cdot 1060 + 3.01 \\
\vdots & & \vdots \\
12 & = & -5.8333 \cdot 1 + 0.03096 \cdot 580 - 0.12 \\
2 & = & -5.8333 \cdot 1 + 0.03096 \cdot 320 - 2.07 \\
2.5 & = & -5.8333 \cdot 1 + 0.03096 \cdot 210 + 1.83
\end{array}
$$

The following table gives a complete statement of the theoretical and distributional assumptions for fitting a simple linear model.

[Figure panels: Marginal Distribution of SaturatedFat; Conditional Distribution of SaturatedFat | Calories]

Mean:          $E(\text{SaturatedFat} \mid \text{Calories}) = \beta_0 + \beta_1 \cdot \text{Calories}$
Variance:      $Var(\text{SaturatedFat} \mid \text{Calories}) = \sigma^2$
Independence:  Y's must be independent => Residuals must be identically distributed (constant variance assumption) and independent of each other
Distribution:  Y's must follow a normal distribution => Residuals must follow a normal distribution

Notation for Distributional Assumptions

Each observation has the following form.

$$Y_i = \underbrace{\beta_0 + \beta_1 \cdot x_i}_{\text{Mean}} + \underbrace{\varepsilon_i}_{\text{Variance}}$$

The standard representation of the conditional distribution of $Y_i \mid x_i$ is provided here.
$$Y_i \mid x_i \sim \text{Normal}(\beta_0 + \beta_1 \cdot x_i, \ \sigma^2)$$

$$E(Y_i \mid x_i) = \beta_0 + \beta_1 \cdot x_i$$

$$Var(Y_i \mid x_i) = Var(\varepsilon_i) = \sigma^2$$

Some people emphasize that the variability in the conditional distribution of $Y_i \mid x_i$ comes solely from the $\varepsilon_i$ term by using the following representation for the residual or error component.

$$\varepsilon_i \sim \text{Normal}(0, \sigma^2), \quad \text{for all } i$$

A Visual Depiction of these Assumptions

[Figure panels: SaturatedFat | Calories = 580; SaturatedFat | Calories = 900]

These assumptions regarding the distribution of $Y_i \mid x_i$ generalize to all $x_i$.

[Figure: Assumptions for a variety of $Y_i \mid x_i$ values]

Turning the graph to the left and pushing it over on its back allows us to more easily see the normality assumption for the conditional distributions.

Section 7.3: Verification of Assumptions

Consider once again the normal-based assumptions regarding the conditional distribution of SaturatedFat | Calories.

Theoretical Normal-based Assumptions

Mean:          $E(\text{SaturatedFat} \mid \text{Calories}) = \beta_0 + \beta_1 \cdot \text{Calories}$
Variance:      $Var(\text{SaturatedFat} \mid \text{Calories}) = \sigma^2$
Independence:  Residuals must be identically distributed (constant variance assumption) and independent of each other
Distribution:  Residuals must follow a normal distribution

Correct Form for Mean Function
[Figure panels: Good Pattern; Bad Patterns]

Constant Variance
[Figure panels: Good Pattern; Bad Patterns]

Independence of Observations
[Figure panels: Good Pattern; Bad Patterns]

Normality
o Normal Quantile Plot
o Kernel Smoothers

Checking for Outliers
o The Rule
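Each of the checks in Section 7.3 is carried out on the residuals, so it can help to see how a residual is obtained from the fitted model. The sketch below is an illustration under stated assumptions, not the handout's procedure: it takes the estimates -5.8333 and 0.03096 and the six observations displayed in Section 7.2, then recomputes each fitted value and residual. The variable names (`b0`, `b1`, `rows`) are hypothetical, and the remaining 22 observations are not shown in the handout, so they are not included.

```python
# Estimated intercept and slope as printed in Section 7.2
b0 = -5.8333
b1 = 0.03096

# (observation index, Calories, SaturatedFat, residual printed in the handout)
rows = [
    (1, 580, 14.0, 1.88),
    (2, 800, 21.0, 2.06),
    (3, 1060, 30.0, 3.01),
    (26, 580, 12.0, -0.12),
    (27, 320, 2.0, -2.07),
    (28, 210, 2.5, 1.83),
]

residuals = []
for i, calories, sat_fat, printed in rows:
    fitted = b0 + b1 * calories   # estimated mean of SaturatedFat at this Calories value
    resid = sat_fat - fitted      # residual = observed - fitted
    residuals.append(resid)
    print(f"obs {i:2d}: fitted = {fitted:7.3f}, residual = {resid:6.2f}, handout = {printed:6.2f}")
```

Because the coefficients are rounded to a few decimal places, the recomputed residuals may differ from the printed ones in the last digit. Plotting residuals like these against Calories, or in a normal quantile plot, is the kind of check the Good Pattern and Bad Patterns figures refer to.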