PLS206 Spring 2007
Homework 1 – Answer Key
This file contains two worksheets: these Instructions and Simulation. The goals of this exercise are to:
Learn what a parametric simulation is.
Learn the concept that the sample and the estimated model are not the same as the "real" model.
Understand that residuals (e) are not the same as true errors (epsilon).
Develop an intuitive feel for the effect of true parameters values and hypothesized model on the
estimated parameters and their estimated variances.
Describe the effects of changes in beta0, beta1, beta2 and sigma on the shape of the true model. You
can use hand-drawn pictures for the description.
beta0 adjusts the intercept, i.e., where the true model crosses the Y axis. beta1 is the
coefficient for the linear portion of the model and controls the slope of its linear contribution.
beta2 is the coefficient for the non-linear (quadratic) portion of the model and controls the
curvature of the graph, i.e., the width of the parabola. However, if beta1 is much greater than
beta2 and much greater than the values of x, then the shape of the model becomes nearly
linear over the range of x's we are interested in. See figure for further details. Sigma controls
the spread of the points around the model, but not the shape of the model.
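These effects can be sketched numerically instead of with hand-drawn pictures. The parameter values and X range below are arbitrary choices for illustration, not the worksheet's actual values:

```python
import numpy as np

def true_model(x, beta0, beta1, beta2):
    # Deterministic part of the model: E{Y} = beta0 + beta1*x + beta2*x^2
    return beta0 + beta1 * x + beta2 * x**2

x = np.arange(0.0, 21.0)  # hypothetical range of X values

# beta0 shifts the whole curve up or down by a constant (the intercept).
base = true_model(x, 0.0, 2.0, -0.05)
shifted = true_model(x, 10.0, 2.0, -0.05)

# With beta2 = 0 the quadratic term vanishes and the curve is a straight
# line, so its second differences are all zero.
linear = true_model(x, 10.0, 2.0, 0.0)

# sigma adds scatter around the curve without changing its shape.
rng = np.random.default_rng(1)
observed = shifted + rng.normal(0.0, 3.0, size=x.size)
```

Plotting `base`, `shifted`, `linear`, and `observed` against `x` reproduces the four effects: an intercept shift, a straight line, and scatter of constant shape.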
List all columns and/or cells that represent random variables and indicate what symbols are usually
used to designate them. If you have any edition of the recommended textbook, use the symbols in the
textbook. Ignore columns G and H.
All random variables are shaded in light blue in the Simulation worksheet. Any cell that changes when
you recalculate the worksheet is a random variable, because it is ultimately dependent on the
randomness of the epsilon column.
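The chain of dependence on epsilon, and the distinction between residuals (e) and true errors (epsilon), can be illustrated with a small simulation. The parameter values and design points here are hypothetical stand-ins for the worksheet's:

```python
import numpy as np

rng = np.random.default_rng(11)
beta0, beta1, sigma = 5.0, 2.0, 4.0        # hypothetical true parameters
x = np.arange(1.0, 21.0)                   # hypothetical design points

eps = rng.normal(0.0, sigma, size=x.size)  # true errors: random, unobservable
y = beta0 + beta1 * x + eps                # observed Y depends on eps, so it
                                           # is a random variable too
b1, b0 = np.polyfit(x, y, 1)               # estimated model (also random)
e = y - (b0 + b1 * x)                      # residuals: computed from the fit

# e approximates eps but never equals it: OLS forces the residuals to sum
# to zero, while the true errors carry no such constraint.
```

Recalculating (re-drawing `eps`) changes every downstream quantity, which is exactly why those cells count as random variables.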
Set beta2 to 0 and keep beta0 and beta1 constant. Perform 20 simulations by pressing the key
combination indicated. For each simulation, make predictions for the yield expected for X=9 and X=18.
Make a table with columns for predicted yield for X=9, predicted yield for X=18, first individual Y
value observed for X=9 (cell B15), and first individual Y value observed for X=18 (cell B24). This table
will have 20 rows.
The answer here should be a table with 4 columns and 20 rows. Each column should represent:
values for Y9, Yhat9, Y18, and Yhat18, for each of twenty replications of the experiment. The values
should vary randomly around some mean value. The variance of the Yhat's should be lower than the
variance of the Y's – though this will not necessarily be evident in your set of repetitions. The
variance of the Yhat's should become greater at X values farther from Xbar.
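The pattern can be reproduced with a quick simulation. The true parameter values and the design X = 1..20 are hypothetical, and many more than 20 repetitions are run so the pattern is unmistakable:

```python
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1, sigma = 5.0, 2.0, 4.0   # hypothetical true values; beta2 = 0
x = np.arange(1.0, 21.0)              # hypothetical design points
reps = 2000                           # far more than 20, to make the pattern clear

y9, yhat9, y18, yhat18 = [], [], [], []
for _ in range(reps):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    b1, b0 = np.polyfit(x, y, 1)      # refit the model for each "experiment"
    y9.append(y[8])                   # observed Y at X=9
    y18.append(y[17])                 # observed Y at X=18
    yhat9.append(b0 + b1 * 9.0)       # predicted yield at X=9
    yhat18.append(b0 + b1 * 18.0)     # predicted yield at X=18

# var(Yhat) < var(Y) at both X's, and var(Yhat18) > var(Yhat9) because
# X=18 lies farther from the mean of the X's (10.5) than X=9 does.
```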
Calculate the variance of each of the columns above. Discuss how they relate to the equations to
determine the variance of predictions of expected values of Y and individual values of Y.
The answer here depends on the actual numbers obtained in question 3. The variance can be
obtained for Yhati and Yi by calculating the sum of the squared deviations around the sample
mean (using the 20 individual values as a sample) divided by the degrees of freedom,
Σ(Yi − Ybar)² / (n − 1). The results should be similar to those obtained when the variance of
these two variables is calculated using the equations for the estimated variance of a
prediction of the expected value of yield given Xh, and of an individual yield given Xh. Those
equations calculate the variance of these variables analytically using one repetition of the
experiment.
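The two analytic equations can be sketched as a helper function for simple linear regression; the data below are simulated with hypothetical parameter values purely for illustration:

```python
import numpy as np

def prediction_variances(x, y, xh):
    """From a single sample: MSE, s^2 for the mean response at xh, and s^2
    for predicting one new individual observation at xh."""
    n = x.size
    b1, b0 = np.polyfit(x, y, 1)
    mse = np.sum((y - (b0 + b1 * x))**2) / (n - 2)        # MSE = SSE / (n - 2)
    sxx = np.sum((x - x.mean())**2)
    s2_mean = mse * (1.0 / n + (xh - x.mean())**2 / sxx)  # s^2{Yhat_h}
    s2_indiv = mse * (1.0 + 1.0 / n + (xh - x.mean())**2 / sxx)  # s^2{pred}
    return mse, s2_mean, s2_indiv

rng = np.random.default_rng(3)
x = np.arange(1.0, 21.0)
y = 5.0 + 2.0 * x + rng.normal(0.0, 4.0, size=x.size)
mse, s2_mean, s2_indiv = prediction_variances(x, y, 9.0)
# s2_indiv exceeds s2_mean by exactly MSE: the extra sigma^2 term reflects
# the scatter of an individual observation around its mean.
```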
Assume that each simulation represents a real experiment. Describe two ways to create a confidence
interval for beta1 and discuss the advantages and disadvantages of each one. Note that each
simulation run can yield estimates of beta1 and its CI.
There are at least two ways to go about calculating a confidence interval around beta1:
(1) As with the table above, a table of beta1 values can be simulated by repeating the experiment
with the same parameter values 20 or more times, using Shift-F9. The mean and variance of beta1
can then be estimated from the sample of b1's generated by the repetitions, using the simple
equations Σ(b1i)/n and Σ(b1i − b1bar)² / (n − 1), respectively, for estimating beta1 and for
estimating the variance of beta1. OR (2) the variance of b1 can be estimated using the equation
for s²{b1} given in the lectures, from a single rep of the experiment and a single sample of
Yi's, Ybar, Yhat's, and X values, together with the "slope" estimate given in cell E31.
Either way, the variance obtained can be used to calculate a confidence interval around the
estimated beta1 using a t distribution: b1 ± s{b1} * t(df, α/2). The estimate for b1 is either the
average of the b1's from method (1) or the b1 ("slope") calculated in method (2).

Symbols used in the Simulation worksheet:
Yij        Observed yield or dependent variable
Ybar       Average of observed yields
Ybar.j     Average of observed yield for level j of X
Yhat       Estimated yield
E{Yij|Xj}  Mean or expected value of yield for level j of predictor X
eij        Residual for observation i of level j of X
εij        Actual random error for observation i of level j of X
β̂0, b0     Estimated intercept
β̂1, b1     Estimated slope or regression coefficient for predictor X
σ̂², MSE    Estimated variance of the error
SSR        Sum of squares of the model or regression; explained sum of squares
SSE        Sum of squares of the error or residuals; unexplained sum of squares
SST        Total sum of squares of Y
Note that when using the first method, the dfe=n-1. In the second method dfe=n-2. Thus, the df used
to look up the t-value are different.
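Both methods can be sketched side by side. The true parameter values and the design X = 1..20 are hypothetical, and a loop stands in for pressing Shift-F9; both variance estimates should approximate the true sampling variance σ²/Sxx:

```python
import numpy as np

rng = np.random.default_rng(7)
beta0, beta1, sigma = 5.0, 2.0, 4.0   # hypothetical true parameters
x = np.arange(1.0, 21.0)              # hypothetical design points
sxx = np.sum((x - x.mean())**2)

# Method 1: repeat the experiment many times; estimate var{b1} from the
# sample of b1's (df = reps - 1).
b1_samples = []
for _ in range(500):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    b1_samples.append(np.polyfit(x, y, 1)[0])
b1_samples = np.asarray(b1_samples)
var_b1_sim = b1_samples.var(ddof=1)

# Method 2: one rep plus the analytic formula s^2{b1} = MSE / Sxx
# (df = n - 2).
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
b1_hat, b0_hat = np.polyfit(x, y, 1)
mse = np.sum((y - (b0_hat + b1_hat * x))**2) / (x.size - 2)
var_b1_analytic = mse / sxx

# Either estimate of var{b1} goes into b1 ± s{b1} * t(df, alpha/2),
# with the appropriate df for the t lookup.
```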
Answer problems 1.2, 1.5, 1.7, 1.12 and 1.16 on page 33 of the textbook.
1.2 Equation: Y=$300 + $2*X. This is characterized best as a “functional” not a “statistical”
relationship because in this example, there is no uncertainty associated with Y. Y IS
EXACTLY equal to the sum of the two terms: $300 (intercept) and $2*X (slope*X). There
would be no scatter around the line you plotted from this equation. If there were some scatter
around the line, or in other words, if there was some variation in Y not explained by the model,
then this might be better characterized as a statistical relationship.
1.5 The equation written in this question is incorrect. The value “E{Yi},” is the “expected value for
Yi” (when X is in the ith level, or when X=Xi). This value can be calculated as the mean of
the probability distribution for all Yi’s at that level of X, which is given by the equation Beta0 +
Beta1Xi (or the line – remember the picture depicting a regression line with normal
distributions centered around the line representing the possible values for Y at various levels
of X). Another way of looking at it is this: since the error terms are assumed to be normally
distributed, with mean 0 and variance σ2, then the expected value for εi at any given Xi is
E{εi}=0 (the expected value is given by the mean of the probability distribution of the random
variable ε). If E{εi} is 0, then plugging this information into the familiar equation:
Yi = Beta0 + Beta1Xi + εi,
we get…
E{Yi} = Beta0 + Beta1Xi + 0
OR…
E{Yi}=Beta0 + Beta1Xi.
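The identity E{Yi} = Beta0 + Beta1*Xi can be checked by simulation. The parameter values below are borrowed from problem 1.7 (Beta0 = 100, Beta1 = 20, σ = 5, X = 5) and are otherwise arbitrary:

```python
import random

random.seed(3)
beta0, beta1, sigma, xi = 100.0, 20.0, 5.0, 5.0
n = 200_000

# Average many draws of Yi = beta0 + beta1*xi + eps. Because E{eps} = 0,
# the average converges to beta0 + beta1*xi = 200.
mean_y = sum(beta0 + beta1 * xi + random.gauss(0.0, sigma)
             for _ in range(n)) / n
```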
1.7
(a) Regression model 1.1 is Yi = Beta0 + Beta1Xi + εi. In this example, we are given Beta0,
Beta1, and σ2 for the error term (the error term has by definition mean 0 and variance σ2),
which represent the TRUE parameters for the population. (Note: this is what makes this a
simulation experiment – in real experiments we NEVER will know the true parameters, only
better and better estimates of them the more data we collect – unless we collected data for
every single member of the population at once with perfect precision). Since we’ve been
given the actual parameter values, but not the probability distribution associated with ε, we
CANNOT calculate the exact probability that Y will fall between 195 and 205.
(b) Given a normally distributed error, we CAN state the exact probability that Y will fall
between 195 and 205 for X=5. Note first that E{Yi} = 200 (calculated from 100 + 20*5),
which is the mean of this probability distribution, so we know the mean value for Yi at X=5
exactly. But there is some variance around this mean in the true population, denoted by
σ²{Yi}; with σ = 5, this variance describes the width of the normal distribution for Y's
centered around 200.
The probability that Y is between 195 and 205 when X=5 is therefore the same as the
probability that z falls between (195−200)/5 = −1 and (205−200)/5 = +1, which is about 0.68.
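The arithmetic can be checked with the standard normal CDF, written here via the error function so no statistics library is needed:

```python
from math import erf, sqrt

def normal_cdf(z):
    # Standard normal CDF expressed via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu = 100.0 + 20.0 * 5.0            # E{Y} = 200 at X = 5
sigma_y = 5.0                      # sigma{Y} = 5
z_lo = (195.0 - mu) / sigma_y      # -1
z_hi = (205.0 - mu) / sigma_y      # +1
prob = normal_cdf(z_hi) - normal_cdf(z_lo)   # P(-1 < z < 1), about 0.6827
```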
1.12(a) This was an observational study. The researcher did not manipulate any experimental
conditions in a controlled way (except to select participants).
(b) “Correlation does not necessarily imply causation.” Thus, the correlation detected by the
study may or may not be causal. The conclusion (as a factual statement) may or may not
be correct, but the method used to reach it is flawed.
(c) This part of the question is asking you to think about “confounding” variables – those
factors that vary with both X and Y and may influence the relationship between X and Y.
Some examples include: (1) people who exercise more may lead generally more
healthful lifestyles (e.g. eat better, watch their weight, take vitamins, etc.) which may in
turn affect their health status; (2) People who are less ill may feel better in general and
may tend to want to exercise more (i.e., which is the dependent vs. independent variable?).
(d) (Note: this is a tricky question – definitively showing cause and effect may require much
more information than a single study can provide, whether observational or experimental.
However, for the purposes of the question, one can suspend disbelief and describe some
ways to minimize the possibility that the relationship between X and Y is caused by some
unmeasured variable). This part of the question is asking about how you might minimize
the probability that confounding variables are influencing the results, thus isolating as
much as possible, the effects of X on Y. This is achieved, in theory, by control and
randomization. For example, a researcher could design an experiment by selecting two
groups who all start out healthy at the beginning of the study: the experimental group
(those that have a higher level of exercise) and the control group (those with a low level
of exercise). The experimental and control groups would be selected such that they are
as equivalent as possible in all other ways besides exercise level AND/OR participants
would be randomized such that the effects of unmeasured confounding variables are
likely to be evened out among the groups. Then the researcher would follow the groups
over time and measure the level of colds among each group. (Other ways to minimize
potential bias in the results can be noted, but are beyond the scope of the question. For
example, the researcher could design the study so that (1) people are assigned blindly to
groups – not really possible here since people know how much they are exercising; and
(2) the researcher measuring the effect could be blinded, e.g. kept unaware about the
group assignment of each participant.) There are other possible and equally valid
experimental designs that could be used – the key is that control and randomization will
hopefully either control for or even out the effects of unmeasured confounding variables
among groups.
1.16 The least squares method, by itself, is robust against deviations from normality and doesn't
assume any particular distribution. However, by convention, most linear models such as
linear regression and ANOVA assume a normal distribution of errors and Y's. It is important
to understand where this assumption of normality comes into play: it is not in the least squares
regression itself, but in the interpretation and hypothesis testing that come later. For
example, when one gets a “p value” in the various outputs from a statistical program such as
JMP (or from hand calculations of p values), the value is determined by comparing a test statistic to
a distribution - usually a "t" distribution or an "F" distribution (which one is used depends on
the specific hypothesis or question). THIS STEP, selecting a probability distribution and
making inferences about the data in relation to the population being estimated, using these
distributions, is the step that is highly dependent on the assumption of normality in the errors
and Y's. Without the assumption of normality, the inference about the system using these
distributions would be invalid.